Cursor · CursorBench

We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.

	Model	Score	Avg cost / task
1	Fable 5 Max	72.9%	$18.02	63,842	76
2	Fable 5 Extra High	72.0%	$13.74	48,754	63
3	Fable 5 High	70.6%	$10.81	37,173	54
4	Fable 5 Medium	69.8%	$8.27	28,507	47
5	Opus 4.7 Max	64.8%	$11.02	62,989	96
6	GPT-5.5 Extra High	64.3%	$4.37	17,905	46
7	Fable 5 Low	64.2%	$5.70	18,882	36
8	Opus 4.8 Max	63.8%	$7.59	77,370	60
9	Composer 2.5	63.2%	$0.55	15,152	37
10	GPT-5.5 High	62.6%	$3.59	13,329	40
11	Opus 4.8 Extra High	62.1%	$6.14	55,622	54
12	Opus 4.7 Extra High	61.6%	$7.11	43,942	72
13	Sonnet 5 Max	61.2%	$6.87	93,485	93
14	Opus 4.7 High	59.4%	$5.01	32,227	59
15	GPT-5.5 Medium	59.2%	$2.22	9,065	35
16	Opus 4.8 High	58.4%	$4.41	36,788	45
17	Sonnet 5 Extra High	58.4%	$5.23	58,228	86
18	Sonnet 5 High	57.0%	$3.74	41,735	66
19	Opus 4.8 Medium	56.6%	$3.83	31,684	41
20	Sonnet 5 Medium	54.9%	$2.57	27,469	53
21	GLM 5.2 Max	54.6%	$3.11	51,312	83
22	Opus 4.8 Low	54.3%	$2.93	22,726	36
23	Opus 4.7 Medium	52.7%	$2.93	19,193	41
24	Kimi K2.7 Code	52.7%	$1.92	32,902	70
25	Composer 2	52.2%	$0.56	14,163	40
26	GLM 5.2 High	50.7%	$2.46	30,621	76
27	Gemini 3.5 Flash	49.8%	$1.94	35,105	79
28	Sonnet 4.6 Max	49.0%	$3.09	40,280	55
29	GPT-5.5 Low	48.8%	$1.19	4,923	24
30	Sonnet 4.6 High	48.8%	$3.06	37,352	57
31	Opus 4.7 Low	48.3%	$1.87	13,164	29
32	Sonnet 5 Low	47.7%	$1.46	17,028	37
33	Kimi 2.6	47.6%	$1.27	24,783	56
34	Sonnet 4.6 Medium	46.0%	$2.64	31,360	50
35	Sonnet 4.6 Low	41.5%	$1.89	21,211	50
36	Kimi 2.5	31.9%	$0.87	9,446	30

Changelog

CursorBench 3.1

Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
Improved grading criteria for some edit tasks.

CursorBench 3.0

Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model’s published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
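The cost calculation described above can be sketched roughly as follows. This is a minimal illustration, not Cursor's actual implementation: the `Pricing` and `TaskUsage` types, the function names, and every price and token count in the example are hypothetical placeholders chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens for each token category (hypothetical values)."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Token counts one model consumed on a single task (hypothetical values)."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Apply per-million-token prices to one task's token usage."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(usages: Sequence[TaskUsage], pricing: Pricing) -> float:
    """Average the per-task costs across all tasks."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up prices and usage, for illustration only.
example_pricing = Pricing(input=2.0, cache_read=0.2, cache_write=2.5, output=10.0)
example_usages = [
    TaskUsage(input=100_000, cache_read=400_000, cache_write=50_000, output=20_000),
    TaskUsage(input=200_000, cache_read=1_000_000, cache_write=0, output=40_000),
]

print(f"${avg_cost_per_task(example_usages, example_pricing):.2f}")
```

Note that the cost is computed per task first and averaged afterwards, so a few token-heavy tasks can pull a model's average well above its typical cost.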

🕒 **Posted on**: 1782972685
