Benchmarking Current LLMs on "I am a Strange Dataset"

8 min read
LLMs · Linguistics · Benchmarking

I am currently reading Gödel, Escher, Bach, so of course I had to test GPT-5, GPT-5.1, Claude Sonnet 4.5, Llama-3.1-8B, and Llama-3.1-8B-Instruct on "I am a Strange Dataset" (Thrush et al., 2024). I observed significant differences in performance between reasoning models, instruction-tuned models, and base models, and I outlined some thoughts on future directions. Read a short report here!