1. Benchmarks
   Reasoning, conversation, and Q&A benchmarks: HellaSwag, BIG-Bench Hard, SQuAD, IFEval, MuSR, MMLU-PRO, MT-Bench
   Domain-specific benchmarks: GPQA, MedQA, PubMedQA
   Math benchmarks: GSM8K, MATH, MathEval
   Security-related benchmarks: PyRIT, Purple Llama CyberSecEval
2. On-device large models in China and abroad
   As for the models themselves, since on-device large models are mostly