Benchmarking Current LLMs on 'I am a Strange Dataset'

I am currently reading Gödel, Escher, Bach, so of course I had to test GPT-5, GPT-5.1, Claude Sonnet 4.5, Llama-3.1-8B, and Llama-3.1-8B-Instruct on "I am a Strange Dataset" (Thrush et al., 2024). I observed significant differences in performance between reasoning models, instruction-tuned models, and base models, and I outlined some thoughts on future directions. Read a short report here!
