Holmes | Holmes Benchmark


Holmes 🔎 is a research project that assesses the linguistic competence of language models. Alongside the benchmark itself, we provide evaluation code, a leaderboard, and an interactive explorer.

📚 Benchmark

The Holmes 🔎 benchmark features over 200 datasets covering 66 phenomena across morphology, syntax, semantics, reasoning, and discourse. Using classifier-based probing, Holmes 🔎 assesses the linguistic competence of language models directly, without entangling it with other abilities such as following the instructions given in prompting-based evaluations. Find an overview of the insights here and more details in our paper.
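To make the idea of classifier-based probing concrete, here is a minimal sketch in plain Python. It is not the Holmes implementation: the embeddings are synthetic stand-ins for frozen model representations, and every function name (`make_frozen_embeddings`, `train_linear_probe`, `probe_accuracy`) is hypothetical. It only illustrates the general recipe: keep the representations fixed, train a small linear classifier on top of them, and read its held-out accuracy as a signal of how well a property can be decoded.

```python
import math
import random


def make_frozen_embeddings(n, dim, rng):
    """Synthetic stand-in for frozen model representations (hypothetical data).

    The binary label shifts dimension 0, so the property is linearly decodable.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 3.0 if label == 1 else -3.0
        data.append((vec, label))
    return data


def predict_proba(w, b, vec):
    """Probability of label 1 under the linear probe."""
    z = sum(wi * xi for wi, xi in zip(w, vec)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() numerically safe
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(train, dim, epochs=20, lr=0.05):
    """Fit a logistic-regression probe with plain SGD; embeddings stay fixed."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vec, label in train:
            err = predict_proba(w, b, vec) - label  # gradient of log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, vec)]
            b -= lr * err
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = sum((predict_proba(w, b, vec) >= 0.5) == (label == 1) for vec, label in data)
    return correct / len(data)


rng = random.Random(0)
train = make_frozen_embeddings(400, 8, rng)
test = make_frozen_embeddings(200, 8, rng)

w, b = train_linear_probe(train, dim=8)
accuracy = probe_accuracy(w, b, test)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real setting, the synthetic vectors would be replaced by representations extracted from a language model for the inputs of one probing dataset, and the label would be the annotated linguistic property. Only the probe's parameters are trained, so the score reflects what the representations already encode rather than how well the model follows a prompt.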

🔥 Evaluation Code

Evaluate your favorite language model quickly and easily. We provide code to run Holmes 🔎 or FlashHolmes ⚡ with a single command. Find it here.

🚀 Leaderboard

The Holmes Leaderboard provides an interactive overview of results for over 50 language models on Holmes 🔎 and on FlashHolmes ⚡, its counterpart optimized for efficiency. Find it here.

🔎 Interactive Exploration

Using the Holmes Explorer, you can dig into more detailed results by comparing individual datasets, phenomena, or phenomenon types. Find it here.