I gave 9 AI models the same coding challenge: write code to generate an animation. Same prompt, same constraints, same evaluation criteria. No cherry-picking, no re-prompts, just raw output.
| # | Model | Provider |
|---|---|---|
| 1 | GPT-5.5 | OpenAI |
| 2 | Claude Haiku | Anthropic |
| 3 | Claude Sonnet | Anthropic |
| 4 | Claude Opus | Anthropic |
| 5 | Gemini 3.1 Pro | Google |
| 6 | Qwen 3.6 27B | Alibaba |
| 7 | Kimi K2.5 | Moonshot |
| 8 | GPT-5.4 / Mini | OpenAI |
All prompts, model outputs, and my notes are available on the companion benchmark page:
kuro-llm-benchmark.pages.dev
Full video with panda commentary: youtube.com/watch?v=CeMXQGfNuXo