A benchmark that tests how well AI coding agents can read
web content. Point your agent at the test, get a score, compare across
platforms.
What This Tests
AI coding agents (Claude Code, Cursor, GitHub Copilot, and others) read
documentation websites as part of their workflows. But most agents hit silent
failure modes: content gets truncated, CSS buries the real text, client-side
rendering delivers empty shells, and tabbed content serializes into walls of
text where only the first variant is visible.
This benchmark surfaces those failure modes. Each test page is designed
around a specific problem documented in the
Agent-Friendly Documentation Spec.
The pages embed canary tokens at strategic positions. But instead of asking
agents to hunt for tokens (which games relevance filters), the test gives the
agent realistic documentation tasks. Only after the agent completes all tasks
does it learn about the canary tokens and report which ones it encountered.
You paste the results into a scoring form.
How It Works
- Point your agent at the start page. Give your agent the URL agentreadingtest.com/start/ and tell it to follow the instructions: "Go to https://agentreadingtest.com/start/ and follow the instructions."
- The agent completes 10 documentation tasks. Each task requires reading a page that targets a specific failure mode. The agent doesn’t know about canary tokens yet.
- The agent visits the results page. Only after completing all tasks does the agent learn about canary tokens and report which ones it saw.
- Paste the results into the scoring form. The agent gives you a comma-separated list of canary tokens. Paste it into the scoring form for a detailed breakdown of what your agent’s pipeline delivered and where it lost content.
The Tests
1. Truncation
150K-char page with canary tokens at 10K, 40K, 75K, 100K, and 130K. Maps exactly where your agent’s truncation limit kicks in.
page-size-html, page-size-markdown
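The mapping this test performs can be sketched in a few lines: given how many characters an agent's fetch pipeline actually delivered, report which canary offsets survived. The offsets mirror the test page; the 50K cap in the example is a hypothetical pipeline limit, not a documented one.

```python
# Canary offsets on the truncation test page (10K, 40K, 75K, 100K, 130K).
CANARY_OFFSETS = [10_000, 40_000, 75_000, 100_000, 130_000]

def surviving_canaries(delivered_chars: int) -> list[int]:
    """Return the canary offsets that fall inside what the pipeline delivered."""
    return [off for off in CANARY_OFFSETS if off < delivered_chars]

# A hypothetical pipeline that caps fetched pages at 50K characters
# sees only the first two canaries:
print(surviving_canaries(50_000))  # [10000, 40000]
```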
2. Boilerplate Burial
80K of inline CSS before the real content. Tests whether agents distinguish CSS noise from documentation.
content-start-position
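One rough way to quantify this burial is to measure what fraction of the page precedes the first real heading. This is an illustrative heuristic, not the benchmark's own metric; the heading string is a made-up example.

```python
def content_start_fraction(page: str, first_heading: str) -> float:
    """Fraction of the page consumed before the first real heading --
    a rough proxy for how much CSS/navigation chrome precedes the docs."""
    idx = page.find(first_heading)
    return idx / len(page) if idx >= 0 else 1.0

# A page that buries the docs behind 4,000 characters of chrome:
page = "nav " * 1000 + "# Install the SDK\nreal docs..."
print(round(content_start_fraction(page, "# Install the SDK"), 2))  # 0.99
```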
3. SPA Shell
Client-side rendered page. Content only appears after JavaScript executes. Most agents see an empty shell.
rendering-strategy
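An agent can detect this case with a cheap heuristic: if the raw HTML yields almost no visible text once scripts and tags are stripped, the page probably needs JavaScript execution. The 200-character threshold below is an assumption for illustration.

```python
import re

def looks_like_spa_shell(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: a page whose body is mostly <script> with little visible
    text was likely rendered client-side and needs a JS-capable fetcher."""
    # Drop script/style bodies, then strip the remaining tags.
    stripped = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html,
                      flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars

shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(looks_like_spa_shell(shell))  # True
```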
4. Tabbed Content
8 language variants in tabs. Canary tokens in tabs 1, 4, and 8. Tests how far into serialized tab content the agent reads.
tabbed-content-serialization
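Measuring read depth here amounts to counting which serialized tab sections made it through the pipeline. The tab labels below are hypothetical stand-ins for the test's eight language variants.

```python
# Hypothetical labels for the 8 serialized tab variants.
TAB_MARKERS = ["## Python", "## Go", "## Rust", "## Java",
               "## Ruby", "## PHP", "## Swift", "## Kotlin"]

def tabs_delivered(page_text: str) -> int:
    """Count how many serialized tab sections survive in the delivered text."""
    return sum(1 for marker in TAB_MARKERS if marker in page_text)

# A pipeline that cuts off after the third tab:
partial = "\n".join(f"{m}\nexample code" for m in TAB_MARKERS[:3])
print(tabs_delivered(partial))  # 3
```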
5. Soft 404
Returns HTTP 200 with a “page not found” message. Tests whether the agent recognizes it as an error page.
http-status-codes
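A minimal soft-404 check looks at the body text when the status code claims success. The marker list here is an illustrative assumption, not the test's rubric.

```python
# Illustrative error-page phrases; real detectors use richer signals.
SOFT_404_MARKERS = ("page not found", "does not exist", "no longer available")

def is_soft_404(status: int, body_text: str) -> bool:
    """A 200 response whose body reads like an error page is a soft 404."""
    if status != 200:
        return False  # a real error status is not a *soft* 404
    lowered = body_text.lower()
    return any(marker in lowered for marker in SOFT_404_MARKERS)

print(is_soft_404(200, "Sorry, page not found."))   # True
print(is_soft_404(200, "Step 1: install the CLI"))  # False
```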
6. Broken Code Fence
Markdown with an unclosed code fence. Everything after it becomes “code.” Tests markdown parsing awareness.
markdown-code-fence-validity
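The underlying parsing rule is simple: fence delimiters must pair up, so an odd count of fence lines means the last fence never closes. A minimal checker, as a sketch:

```python
FENCE = "`" * 3  # a markdown code-fence delimiter

def has_unclosed_fence(markdown: str) -> bool:
    """An odd number of fence-delimiter lines means the last fence never
    closes, so every later line parses as code."""
    fence_lines = sum(1 for ln in markdown.splitlines()
                      if ln.lstrip().startswith(FENCE))
    return fence_lines % 2 == 1

broken = f"Intro\n{FENCE}python\nprint('hi')\nEverything after this parses as code."
closed = broken + f"\n{FENCE}"
print(has_unclosed_fence(broken), has_unclosed_fence(closed))  # True False
```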
7. Content Negotiation
Different canary tokens in HTML vs. markdown versions. Tests whether your agent requests the better format.
content-negotiation
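In HTTP terms, requesting the better format means sending an Accept header that prefers markdown. The header value and the classification below are a sketch of how an agent's fetcher might negotiate, not the test server's documented behavior.

```python
def negotiation_headers() -> dict[str, str]:
    """Ask for markdown first, fall back to HTML (assumed media types)."""
    return {"Accept": "text/markdown, text/html;q=0.8"}

def delivered_variant(content_type: str) -> str:
    """Classify which variant the server actually sent back."""
    return "markdown" if content_type.startswith("text/markdown") else "html"

print(delivered_variant("text/markdown; charset=utf-8"))  # markdown
```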
8. Cross-Host Redirect
301 redirect to a different hostname. Most agents won’t follow it (security measure). The canary is on the other side.
redirect-behavior
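The security check agents apply here boils down to comparing hostnames between the original URL and the redirect target. A sketch (the target hostname is a made-up example):

```python
from urllib.parse import urlparse

def crosses_host(original_url: str, location: str) -> bool:
    """True when a redirect's Location header points at a different
    hostname -- the case many agent fetchers refuse to follow."""
    return urlparse(original_url).hostname != urlparse(location).hostname

print(crosses_host("https://agentreadingtest.com/redirect",
                   "https://other-host.example/canary"))  # True
```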
9. Header Quality
Three cloud platforms, identical “Step 1/2/3” headers. Tests whether agents can determine which section is which.
section-header-quality
10. Content Start
Real content buried after 50% navigation chrome. Tests whether agents read past the sidebar serialization.
content-start-position
Scoring
The test has a maximum score of 20 points. Each canary token
found earns 1 point, and correct answers to qualitative questions earn 1 point
each. The answer key has the full breakdown.
A perfect score is unlikely for any current agent. The tests are calibrated
so that each failure mode will realistically affect at least some agents. A
typical score range for current agents is probably 14-18 out of 20, depending
on the platform’s web fetch pipeline.
About
Agent Reading Test is a companion project to the
Agent-Friendly Documentation Spec,
which defines 22 checks across 8 categories evaluating how well documentation
sites serve AI agent consumers. The spec is grounded in empirical observation
of real agent workflows.
This benchmark flips the perspective: instead of testing the documentation
site, it tests the agent. The same failure modes apply, but here we’re
measuring which agents handle them gracefully and which don’t.
Source code: github.com/agent-ecosystem/agent-reading-test
Posted on: April 7, 2026
