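Giskard is a platform for testing AI models. It can perform functional validation, security testing, and compliance scanning. The tooling has two main parts: the Giskard Python library and a server-side component, Giskard Hub. The Python library is open source, and its GitHub repository is at https://github.com/Giskard-AI/giskard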
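Testing with Giskard can follow these steps:
1. Load a dataset and run functional validation;
2. Configure the relevant vulnerability types and run a security vulnerability scan;
3. Generate a test report and confirm the issues it flags;
4. Generate test cases that target those issues;
5. Bring in a third-party LLM for comparison and verification.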
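Besides LLMs, Giskard also supports testing NLP and vision models. The rest of this post uses LLM testing as the example for a Giskard quick start. Writing test code with the Giskard Python library is as "simple" as putting an elephant into a refrigerator, in three steps:
- Wrap the Giskard model
- Run the scan on that model
- Generate the test report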
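Wrapping the Giskard model
An LLM cannot be tested directly; it has to be wrapped before anything else can be done with it. First install the required libraries: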
pip install "giskard[llm]" --upgrade
pip install "langchain<=0.0.301" "pypdf<=3.17.0" "faiss-cpu<=1.7.4" "openai<=0.28.1" "tiktoken<=0.5.1"
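In practice I found that a Python version below 3.11 is recommended for installing faiss, and that on Windows the CPU build of faiss is the one to install.
Set the OpenAI API key as follows: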
import os

# Set the OpenAI API Key environment variable.
os.environ["OPENAI_API_KEY"] = "sk-..."
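The key is needed because Giskard uses the OpenAI API as the model that generates test cases and test data. Other models, such as those served by ollama, can be used instead; see the Giskard documentation for details.
Next, build the LLM to be tested. Here it is put together with langchain: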
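Hardcoding the key in a script is easy to leak, so one option is to read it from the environment and fail early when it is missing. The following is a minimal sketch of that idea; `require_api_key` is a hypothetical helper, not part of Giskard or the OpenAI client.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name` in `env`.

    Hypothetical helper: raises a clear error instead of letting a later
    API call fail with a less obvious message.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the scan")
    return key


# Usage with an explicit mapping, so the sketch runs without a real key.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Calling `require_api_key()` with no arguments checks the real process environment.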
from langchain import OpenAI, FAISS, PromptTemplate
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prepare vector store (FAISS) with IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())

# Prepare QA chain
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)

# Test that everything works
climate_qa_chain.run({"query": "Is sea level rise avoidable? When will it stop?"})
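To get a feel for what `chunk_size` and `chunk_overlap` do, here is a deliberately simplified sketch that cuts text into fixed-size character windows that overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical function written only for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters. Returns a list of
    (start_index, chunk) pairs, echoing the idea behind add_start_index.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # [(0, 'abcd'), (3, 'defg'), (6, 'ghij')]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap, which helps retrieval return usable context.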
Then wrap the chain as a Giskard model:
import giskard
import pandas as pd

def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    return [climate_qa_chain.run({"query": question}) for question in df["question"]]

# Don't forget to fill the `name` and `description`: they are used by Giskard
# to generate domain-specific tests.
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)
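The prediction function has a simple contract: one output per input row, in row order. A quick way to check a wrapper against that contract without spending API calls is to swap in a stub for the chain. The sketch below assumes this approach; `make_predict` and `answer_fn` are hypothetical names, and a plain dict of columns stands in for the pandas.DataFrame so the example has no dependencies.

```python
def make_predict(answer_fn):
    """Build a prediction function around `answer_fn`.

    `answer_fn` is a hypothetical callable that maps one question string to
    one answer string. With the real chain it could be
    `lambda q: climate_qa_chain.run({"query": q})`.
    """
    def model_predict(df):
        # One output per input row, in the same order as the rows.
        return [answer_fn(question) for question in df["question"]]
    return model_predict


# Assumption: a dict of columns replaces the DataFrame here. The function only
# indexes by column name and iterates, so the access pattern is the same.
fake_df = {"question": ["Is sea level rise avoidable?", "What is the IPCC?"]}
predict = make_predict(lambda q: f"stub answer to: {q}")
outputs = predict(fake_df)
print(outputs)
```

Once the stubbed version behaves as expected, the same function shape can be handed to `giskard.Model` with the real chain behind it.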
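Running the scan
Call giskard's scan method to run the scan. The only parameter controls which vulnerability categories are scanned. To cut down the run time, the documentation sets it to hallucination alone, which scans for vulnerabilities related to LLM hallucination. If only is not set, the full range of categories is scanned.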
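# giskard_dataset is an optional giskard.Dataset of sample questions; it is not defined in this walkthrough, so omit the argument if you have none.
full_report = giskard.scan(giskard_model, giskard_dataset, only="hallucination")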
Generating the report
A single method call is enough to generate the report:
display(full_report)

# Save it to a file
full_report.to_html("scan_report.html")
The report can also be generated as markdown:
display(full_report)

# Save it to a file
full_report.to_markdown("scan_report.md")
The report can also be turned into a test suite that targets the issues it found:
test_suite = full_report.generate_test_suite(name="Test suite generated by scan")
test_suite.run()
The Giskard Python library can also be integrated with the Pytest framework to write test case scripts. See the documentation for details: https://docs.giskard.ai/en/stable/integrations/pytest/index.html
Example code:
import pytest

from giskard import Dataset, Model, Suite, demo
from giskard.testing import test_accuracy, test_f1

model_raw, df = demo.titanic()

wrapped_dataset = Dataset(
    name="Test Data Set",
    df=df,
    target="Survived",
    cat_columns=["Pclass", "Sex", "SibSp", "Parch", "Embarked"],
)

wrapped_model = Model(model=model_raw, model_type="classification", name="Classifier v1")

suite = (
    Suite(
        default_params={
            "model": wrapped_model,
            "dataset": wrapped_dataset,
        }
    )
    .add_test(test_f1(threshold=0.6))
    .add_test(test_accuracy(threshold=1))  # Certain to fail
)

@pytest.fixture
def dataset():
    return wrapped_dataset

@pytest.fixture
def model():
    return wrapped_model

# Single wrapped test
def test_only_accuracy(dataset, model):
    test_accuracy(model=model, dataset=dataset, threshold=1).assert_()

# Parametrise tests from suite
@pytest.mark.parametrize("test_partial", suite.to_unittest(), ids=lambda t: t.fullname)
def test_giskard(test_partial):
    test_partial.assert_()
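The "Certain to fail" comment on the accuracy test is easier to read once the threshold is seen as a lower bound on the metric: with a threshold of 1, a single wrong prediction is enough to fail. The sketch below illustrates that reasoning in plain Python. It is not Giskard's implementation; `accuracy` and `passes_threshold` are hypothetical helpers, and the `>=` comparison is an assumption about how the threshold is applied.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def passes_threshold(metric, threshold):
    """Assumed pass rule: the metric must be at least the threshold."""
    return metric >= threshold


predictions = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
score = accuracy(predictions, labels)  # 3 of 4 correct

print(score)                         # 0.75
print(passes_threshold(score, 0.6))  # True
print(passes_threshold(score, 1))    # False
```

A threshold of 1 only passes for a model that gets every row right, which is why a looser value such as the 0.6 used for the F1 test is the more typical choice.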