AccessEval: Benchmarking Disability Bias in Large Language Models Paper • 2509.22703 • Published Sep 22, 2025 • 20
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications Paper • 2509.23879 • Published Sep 28, 2025 • 20
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks Paper • 2509.23673 • Published Sep 28, 2025 • 20
Aligning LLMs for Multilingual Consistency in Enterprise Applications Paper • 2509.23659 • Published Sep 28, 2025 • 20
QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization Paper • 2506.22396 • Published Jun 27, 2025
ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering Paper • 2508.07321 • Published Aug 10, 2025
🐦‍⬛ LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark Article • Jan 10, 2025 • 8
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation Paper • 2506.00482 • Published May 31, 2025 • 8