Cursor · CursorBench


We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.

[Figure: a scatter and line chart plotting CursorBench 3.1 score (45% to 75%) against average cost per task ($0 to $20), comparing Fable 5, Opus 4.8, Opus 4.7, GPT-5.5, Sonnet 5, Sonnet 4.6, GLM 5.2, Composer 2.5, and Composer 2. Labeled points: Fable 5 high, Composer 2.5, GPT-5.5 medium, Gemini 3.5 Flash, Opus 4.8 high, Sonnet 5 high, Kimi K2.7 Code, GLM 5.2 high.]

| Rank | Model | CursorBench 3.1 score | Avg cost / task | | |
|---|---|---|---|---|---|
| 1 | Fable 5 Max | 72.9% | $18.02 | 63,842 | 76 |
| 2 | Fable 5 Extra High | 72.0% | $13.74 | 48,754 | 63 |
| 3 | Fable 5 High | 70.6% | $10.81 | 37,173 | 54 |
| 4 | Fable 5 Medium | 69.8% | $8.27 | 28,507 | 47 |
| 5 | Opus 4.7 Max | 64.8% | $11.02 | 62,989 | 96 |
| 6 | GPT-5.5 Extra High | 64.3% | $4.37 | 17,905 | 46 |
| 7 | Fable 5 Low | 64.2% | $5.70 | 18,882 | 36 |
| 8 | Opus 4.8 Max | 63.8% | $7.59 | 77,370 | 60 |
| 9 | Composer 2.5 | 63.2% | $0.55 | 15,152 | 37 |
| 10 | GPT-5.5 High | 62.6% | $3.59 | 13,329 | 40 |
| 11 | Opus 4.8 Extra High | 62.1% | $6.14 | 55,622 | 54 |
| 12 | Opus 4.7 Extra High | 61.6% | $7.11 | 43,942 | 72 |
| 13 | Sonnet 5 Max | 61.2% | $6.87 | 93,485 | 93 |
| 14 | Opus 4.7 High | 59.4% | $5.01 | 32,227 | 59 |
| 15 | GPT-5.5 Medium | 59.2% | $2.22 | 9,065 | 35 |
| 16 | Opus 4.8 High | 58.4% | $4.41 | 36,788 | 45 |
| 17 | Sonnet 5 Extra High | 58.4% | $5.23 | 58,228 | 86 |
| 18 | Sonnet 5 High | 57.0% | $3.74 | 41,735 | 66 |
| 19 | Opus 4.8 Medium | 56.6% | $3.83 | 31,684 | 41 |
| 20 | Sonnet 5 Medium | 54.9% | $2.57 | 27,469 | 53 |
| 21 | GLM 5.2 Max | 54.6% | $3.11 | 51,312 | 83 |
| 22 | Opus 4.8 Low | 54.3% | $2.93 | 22,726 | 36 |
| 23 | Opus 4.7 Medium | 52.7% | $2.93 | 19,193 | 41 |
| 24 | Kimi K2.7 Code | 52.7% | $1.92 | 32,902 | 70 |
| 25 | Composer 2 | 52.2% | $0.56 | 14,163 | 40 |
| 26 | GLM 5.2 High | 50.7% | $2.46 | 30,621 | 76 |
| 27 | Gemini 3.5 Flash | 49.8% | $1.94 | 35,105 | 79 |
| 28 | Sonnet 4.6 Max | 49.0% | $3.09 | 40,280 | 55 |
| 29 | GPT-5.5 Low | 48.8% | $1.19 | 4,923 | 24 |
| 30 | Sonnet 4.6 High | 48.8% | $3.06 | 37,352 | 57 |
| 31 | Opus 4.7 Low | 48.3% | $1.87 | 13,164 | 29 |
| 32 | Sonnet 5 Low | 47.7% | $1.46 | 17,028 | 37 |
| 33 | Kimi 2.6 | 47.6% | $1.27 | 24,783 | 56 |
| 34 | Sonnet 4.6 Medium | 46.0% | $2.64 | 31,360 | 50 |
| 35 | Sonnet 4.6 Low | 41.5% | $1.89 | 21,211 | 50 |
| 36 | Kimi 2.5 | 31.9% | $0.87 | 9,446 | 30 |
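
One way to read the score and cost columns together is to ask which entries are not beaten on both at once, i.e. which sit on the cost/score Pareto frontier. The sketch below is illustrative only: it uses a hand-picked subset of rows copied from the leaderboard, and the names `Result`, `ROWS`, and `pareto_frontier` are hypothetical, not part of CursorBench.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    score: float  # CursorBench 3.1 score, in percent
    cost: float   # average cost per task, in USD


# A subset of leaderboard rows, copied as published (not the full table).
ROWS = [
    Result("Fable 5 Max", 72.9, 18.02),
    Result("Fable 5 Medium", 69.8, 8.27),
    Result("Opus 4.7 Max", 64.8, 11.02),
    Result("GPT-5.5 Extra High", 64.3, 4.37),
    Result("Opus 4.8 Max", 63.8, 7.59),
    Result("Composer 2.5", 63.2, 0.55),
    Result("Composer 2", 52.2, 0.56),
    Result("Kimi 2.5", 31.9, 0.87),
]


def pareto_frontier(rows: list[Result]) -> list[Result]:
    """Return rows for which no cheaper-or-equal row scores at least as well."""
    frontier: list[Result] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, the higher score goes first.
    for row in sorted(rows, key=lambda r: (r.cost, -r.score)):
        if row.score > best_score:
            frontier.append(row)
            best_score = row.score
    return frontier


for row in pareto_frontier(ROWS):
    print(f"{row.model}: {row.score}% at ${row.cost:.2f}")
```

For this subset the frontier is Composer 2.5, GPT-5.5 Extra High, Fable 5 Medium, and Fable 5 Max. Given the stated variance in scores, small gaps between neighboring entries should not be over-interpreted.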

Changelog

CursorBench 3.1

  • Introduced problems focused on codebase understanding, bug finding, planning, and code review.
  • Improved grading criteria for some edit tasks.

CursorBench 3.0

  • Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model’s published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
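
The cost calculation described above can be sketched as follows. This is a minimal illustration, not Cursor's actual implementation: the class and function names are hypothetical, and the prices and token counts in the example are made-up placeholders rather than any model's published pricing or measured usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens for each token category."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TokenUsage:
    """Tokens one model used on one task, split by category."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: TokenUsage, pricing: Pricing) -> float:
    """Apply per-million-token prices to a single task's token counts."""
    return (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    ) / 1_000_000


def average_cost_per_task(usages: list[TokenUsage], pricing: Pricing) -> float:
    """Average the per-task costs across all tasks."""
    if not usages:
        raise ValueError("need at least one task")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Placeholder prices and usage, for illustration only.
EXAMPLE_PRICING = Pricing(input=3.0, cache_read=0.3, cache_write=3.75, output=15.0)
EXAMPLE_USAGES = [
    TokenUsage(input=100_000, cache_read=1_000_000, cache_write=200_000, output=20_000),
    TokenUsage(input=50_000, cache_read=500_000, cache_write=100_000, output=10_000),
]

print(f"${average_cost_per_task(EXAMPLE_USAGES, EXAMPLE_PRICING):.2f}")
```

With these placeholder numbers the two tasks cost $1.65 and $0.825, so the average is about $1.24 per task. Note that cost is computed per task first and then averaged, so every task carries equal weight regardless of how many tokens it used.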


🕒 **Posted on**: 1782972685
