Every AI Visibility Tool Is Lying to You

I’m an experienced software engineer, and I’ve spent enough time building and debugging measurement systems to know when a dashboard is asking you to trust a number it cannot support. A new software category now promises to tell brands how visible they are inside ChatGPT, Claude, Gemini, Perplexity, and Google’s AI answers. Then it turns that messy system into tidy claims like mention rate, citation rate, share of voice, or rank.

When a tool says you are number four in your category, moved up two spots this week, or sit at 17% visibility while a competitor sits at 31%, I do not think the signal is worthless; I think the precision is made up. These systems are noisy, personalized, geographic, nondeterministic, and constantly changing, so a clean leaderboard number hides the thing an engineer would actually want to inspect: the distribution, the methodology, the variance, and the raw evidence.

Most vendors are trying to measure something important, but the mechanism is usually weaker than the dashboard admits. If a tool claims to show “what customers see” in ChatGPT or Claude, it is probably scraping the consumer app or calling an API. A scrape captures one synthetic session, and an API call uses a different surface than your customer uses. Both can produce useful directional signal, especially when they reveal invisibility on commercial prompts or gaps in a geography, but neither should be sold as a precise, stable truth without showing its work.

The frontend scrape problem

Scraping the ChatGPT or Claude frontend sounds persuasive at first. The vendor can say, truthfully, that it opened the app, asked the question, and recorded what the product returned.

This is closer to the surface a real user sees. It still measures one controlled surface.

A scrape comes from one account, or a controlled account pool. That means one history state, one memory state, one subscription tier, one geography, one browser session, and one prompt. Change any of those and the answer can change. A real buyer asking “best CRM for a seed-stage startup” and a clean browser asking “best CRM software” from a datacenter IP are different instruments.

Mass scraping adds more bias. At any meaningful volume, the work has to run from somewhere: cloud machines, proxy routes, managed browsers, headless sessions, or another automation layer. That automation layer can bleed into the measurement. Concentrated IP patterns. Repeated logins. Odd session rhythms. Rate-limit pressure. Possible anti-abuse handling from the AI product itself.

The operator has to choose. Clean accounts are repeatable but behave unlike customers. Aged accounts carry realistic history but are harder to control. A benchmark account that asks thousands of category prompts also creates its own personalization trail. After a while, the account’s whole life is benchmark traffic.

This matters most for local and commercial prompts. “Best commercial roofing company near me” changes by place. “Best AEO agency in NYC” changes by place. The answer depends on the user’s location, the retrieval system, the account, and the sources pulled at that moment.

A single frontend answer is one lab sample.

The same prompt changes across runs

The simplest defense of an AI visibility rank is this: we ask the same question every week and count whether you show up.

This only works if the same question has a stable answer. The same words often produce different answers.

Even temperature-zero LLM calls are not perfectly stable in production. Thinking Machines Lab explained one technical reason: batching and kernel behavior can vary under real production load. Their example showed identical temperature-zero requests producing multiple unique completions.

SparkToro and Gumshoe saw the marketing version of the same problem. They had volunteers run repeated commercial prompts through ChatGPT, Claude, and Google’s AI products. Their research found that brand recommendations changed a lot across repeated runs.

This is the core measurement problem. If the next draw from the same system can name a different set of brands, then “you rank number four” becomes one sample from a distribution.

An honest dashboard would show the distribution.
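
As a rough illustration of what showing the distribution could mean for a single prompt, here is a minimal sketch that reports a mention rate together with a Wilson score interval. The observation data and the function name `wilson_interval` are made up for this example; it is not any vendor’s method.

```python
import math

def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (roughly 95% when z=1.96)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical data: one prompt run 20 times, 1 = the brand was mentioned in that run.
observations = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
hits, runs = sum(observations), len(observations)
low, high = wilson_interval(hits, runs)
print(f"mention rate {hits / runs:.0%}, interval {low:.0%} to {high:.0%} over {runs} runs")
```

With only 20 runs the interval is wide, which is the point: a single headline percentage hides how little the sample can support.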

Consumer apps and APIs behave differently

Some tools skip browser scraping and call provider APIs instead. The operational case is strong. API calls are easier to repeat, easier to audit, cheaper to run at scale, and less likely to break when a web app changes.

The tradeoff: the API and the consumer app behave differently.

The consumer product may have memory, account personalization, model routing, web retrieval, location inference, shopping modules, local modules, citations, and product-specific presentation. The API gives you a programmable model call with the tools and parameters you enable. OpenAI’s API docs, for example, require you to add tools such as web search when you want grounded retrieval. Google’s Gemini API has its own grounding and search configuration.

The gap cuts both ways. A raw API call can understate what the app would know because it retrieves nothing unless tools are enabled, and even then it may search differently than the app does. A browser scrape can overstate what real users would see because it captures one personalized session and calls it representative.

The API can be the right surface for controlled measurement. Sell it that way. Avoid calling it “what the consumer app showed your buyer.”

Prompt sets manufacture the score

AI visibility tools monitor a prompt set. They sample the market instead of covering the full long tail of real buyer questions.

The prompt set is decisive.

If I track “best AEO agency in NYC,” “AI search optimization consultant,” and “answer engine optimization audit,” I get one picture of Canonry. If I track “SEO agency,” “digital marketing firm,” and “AI marketing software,” I get another. Both prompt sets can be valid. They answer different questions.

The headline number depends on the selected prompts, their weights, the run frequency, and the competitor set. Profound’s own prompt-design guide says its users generally track 100 to 1,000 prompts, with a couple hundred being typical. The dashboard is sampling the market.

The scoring formula matters just as much. One dashboard can score mention frequency. Another can weight citation position. Another can count source links. Another can blend sentiment. Digital Applied’s AI share-of-voice framework gives a clean example: the same brand, on the same data, scores 20% mention-based share of voice, 16.8% position-weighted share of voice, and 31.4% citation-based share of voice.

Same evidence. Three headline numbers. Three competitive standings.
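
To make the effect concrete, here is a small sketch that scores one set of answers three ways. The brands, answers, and formulas are hypothetical and chosen for simplicity; they are not the Digital Applied figures or any vendor’s actual scoring.

```python
from collections import Counter

# Hypothetical observations: each answer lists brands in the order mentioned,
# plus the brands whose domains were cited as sources.
answers = [
    {"mentions": ["Acme", "Bolt", "Cedar"], "citations": ["Bolt"]},
    {"mentions": ["Bolt", "Acme"], "citations": ["Acme", "Acme"]},
    {"mentions": ["Cedar", "Bolt", "Acme"], "citations": ["Cedar"]},
    {"mentions": ["Bolt"], "citations": ["Bolt", "Acme"]},
]

def share(counts: Counter, brand: str) -> float:
    """Fraction of the total weight that belongs to one brand."""
    total = sum(counts.values())
    return counts[brand] / total if total else 0.0

def mention_sov(answers: list[dict], brand: str) -> float:
    """Every mention counts equally."""
    return share(Counter(b for a in answers for b in a["mentions"]), brand)

def position_weighted_sov(answers: list[dict], brand: str) -> float:
    """A mention in position k is worth 1/k (one possible weighting)."""
    counts: Counter = Counter()
    for a in answers:
        for pos, b in enumerate(a["mentions"], start=1):
            counts[b] += 1 / pos
    return share(counts, brand)

def citation_sov(answers: list[dict], brand: str) -> float:
    """Only cited source links count."""
    return share(Counter(b for a in answers for b in a["citations"]), brand)

for name, fn in [("mention", mention_sov),
                 ("position-weighted", position_weighted_sov),
                 ("citation", citation_sov)]:
    print(f"{name}: {fn(answers, 'Acme'):.1%}")
```

On this toy data the same brand lands near 33%, 30%, and 50% depending only on which formula is applied.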

Practitioners are skeptical for good reason. In the same Digital Applied piece, Dan Taylor of SALT.agency criticizes vendors for measuring small, static prompt sets inside a contrived environment. Digiday reported the same operational problem from the buyer side. Paul Dyer, CEO of /prompt, said that if you give three tools the same prompts, you get three different answers.

Without the prompt list, runs per prompt, geography, model, account state, and scoring formula, the dashboard is showing a constructed metric.

Constructed metrics can be useful. They need a label.

Location breaks the leaderboard

Geography is the part most dashboards wave away.

For local, regional, and service-area businesses, location changes the question. A user in Brooklyn, Austin, London, or rural Michigan can get different recommendations for the same words because the answer engine infers local intent.

A single global visibility rank is often meaningless. “Visible in ChatGPT” where? From which user location? With which local retrieval context? With which city or service-area phrase?

Frontend scraping makes this especially messy. A synthetic browser run from a cloud server looks unlike a buyer in the market you care about. You can try proxies. You can try account pools. You can try browser automation. Now your “truth” depends on whether the frontend accepted the location story your scraper told.

API-based measurement has a cleaner path here: pass explicit location context where the provider supports it, and run the same prompt across the geographies you care about. You get a controlled location variable instead of an accidental scraper artifact.

Canonry takes that path.
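
The general idea of treating location as a controlled variable can be sketched as follows. This is an illustration, not Canonry’s code: `ask` is a hypothetical wrapper around whatever provider call you use, `fake_ask` is a canned stand-in so the example runs offline, and the business names are invented.

```python
from typing import Callable

# A location is an explicit, recorded input rather than something a proxy implies.
Location = dict[str, str]

def mention_rate_by_location(
    prompt: str,
    brand: str,
    locations: list[Location],
    ask: Callable[[str, Location], str],
    runs: int = 5,
) -> dict[str, float]:
    """Run the same prompt several times per location and report the mention rate."""
    rates: dict[str, float] = {}
    for loc in locations:
        hits = sum(brand.lower() in ask(prompt, loc).lower() for _ in range(runs))
        rates[loc["city"]] = hits / runs
    return rates

def fake_ask(prompt: str, location: Location) -> str:
    # Stand-in for a real provider call; returns canned text per city.
    canned = {
        "Chicago": "Try Windy City HVAC or Lakeshore Heating.",
        "Austin": "Try Hill Country Air.",
    }
    return canned.get(location["city"], "No local results.")

locations = [
    {"city": "Chicago", "region": "IL", "country": "US"},
    {"city": "Austin", "region": "TX", "country": "US"},
]
rates = mention_rate_by_location(
    "best HVAC company near me", "Lakeshore Heating", locations, fake_ask
)
print(rates)
```

In a real setup, `ask` would forward the location in whatever form the provider documents, and each run would be stored with the location that was configured for it.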

Why local execution matters for local SEO

This is where Canonry’s local-first design changes the measurement problem.

Most hosted dashboards run probes from vendor infrastructure. For a national SaaS query, that may be fine. For a local client, it is often the wrong instrument. A plumber in Queens, a dentist in Austin, or a roofing contractor in Michigan needs to understand answers that buyers see inside the service area. A scraper cluster in another region is a weak stand-in.

Canonry can run on a machine in the market. An agency can run checks from its own office, from a technician’s laptop, or from another machine closer to the target consumer. Nondeterminism still exists. API results can still differ from the consumer UI. The win is narrower and practical: remove outsourced cloud geography from the measurement.

For local SEO and local AEO, that detail matters. The closer the measurement environment is to the buyer’s environment, the less you have to trust a proxy story. You can still pass explicit location context where providers support it. When the test runs from a machine in the relevant market, the location signals you did not set deliberately, such as network origin, line up with the ones you did.

This makes Canonry more accurate for operators serving local clients. If your customer is a Chicago HVAC company, a Brooklyn hospitality group, or a Michigan roofing contractor, you can run the same prompt set from different geographies. The difference is the thing you are trying to measure.

Model drift turns trend lines into fiction

Even if you handle sampling, personalization, API-vs-app differences, prompt selection, and geography, the instrument still changes.

The model behind a familiar product name can be updated, routed, rolled back, or silently adjusted. Retrieval systems change. Citation behavior changes. Product interfaces change. A week-over-week movement in your AI visibility dashboard can mean your content improved. It can also mean the model changed, the retrieval layer changed, or the product started answering the prompt differently.

This is real enough to measure. Chen, Zaharia, and Zou’s paper “How is ChatGPT’s behavior changing over time?” compared March 2023 and June 2023 versions of GPT-3.5 and GPT-4. They found large behavior changes across tasks under the same public model names. One example: GPT-4’s prime-number accuracy moved from 84% in March to 51% in June. Treat that as evidence of drift, rather than a current estimate of today’s model quality.

The same pattern appears in product behavior. In an April 29, 2025 post, OpenAI said it had rolled back the previous week’s GPT-4o update in ChatGPT because the removed version was too flattering and agreeable. An outside visibility dashboard usually sees that kind of product change only after it has already bent the trend line.

From the outside, those effects are hard to separate. A dashboard can tell you that a number moved. It usually cannot prove why.

The number can still help. The problem starts when the tool claims to explain why it moved.
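
Before asking why a number moved, it helps to check whether the move is larger than sampling noise. A minimal sketch using a pooled two-proportion z-test, with hypothetical counts; the test says nothing about the cause of a change, only whether repeated runs alone could plausibly explain it.

```python
import math

def two_proportion_z(hits_a: int, runs_a: int, hits_b: int, runs_b: int) -> float:
    """z statistic for the difference between two mention rates (pooled variance)."""
    p_a, p_b = hits_a / runs_a, hits_b / runs_b
    pooled = (hits_a + hits_b) / (runs_a + runs_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))
    return (p_b - p_a) / se if se else 0.0

# Hypothetical counts: mentioned in 17 of 100 runs last week, 24 of 100 this week.
z = two_proportion_z(17, 100, 24, 100)
verdict = "beyond" if abs(z) > 1.96 else "within"
print(f"z = {z:.2f}: the change is {verdict} what sampling noise alone would explain")
```

Here a seven-point jump over 100 runs per week gives a z of about 1.2, which stays within noise at the usual 5% level. Even a clearly significant move would still not separate a content change from a model or retrieval change.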

What these tools can honestly tell you

The category can be useful. It needs to stop overselling precision.

AI visibility monitoring can support useful conclusions:

We are invisible for the commercial prompts buyers actually ask.
We appear often on branded prompts but rarely on category prompts.
One competitor is cited much more frequently than we are.
Claude sees us while ChatGPT misses us.
We show up in New York while Los Angeles stays blank.
A content or schema change appears to correlate with better citation frequency over repeated runs.

Those are directional, probabilistic findings. They are useful. They help teams prioritize work.

Fake precision creates the problem:

You are rank number four.
You moved up exactly two positions.
Your AI share of voice is 17%.
This week’s lift was caused by last week’s blog post.
This single screenshot is what your customers see.

Those claims collapse unless the tool shows its samples, its spread, and its method.

How Canonry measures it

Canonry avoids the idea of one canonical answer inside ChatGPT waiting to be scraped.

We treat AI visibility as a distribution.

The unit of measurement is repeated observations across prompts, providers, competitors, and locations. Canonry uses provider APIs because they give us a controlled, repeatable surface. APIs differ from the consumer app, but they are auditable. Where a provider supports it, we pass geolocation context instead of hoping a browser scrape inherits the right location from a proxy.

We record the prompt, provider, timestamp, configured location, cited domains, mentions, source evidence, and run history so the number can be audited later.
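
A record like that could be modeled roughly as below. This is a sketch only: the field names follow the list above, but the class, the JSON-lines layout, and the example values are assumptions made for illustration, not Canonry’s actual schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class Observation:
    """One run of one prompt against one provider (hypothetical schema)."""
    prompt: str
    provider: str
    timestamp: str                      # ISO 8601, UTC
    configured_location: Optional[str]  # None when no location was passed
    cited_domains: tuple[str, ...] = ()
    mentions: tuple[str, ...] = ()
    raw_answer: str = ""                # source evidence, kept verbatim

def to_json_line(obs: Observation) -> str:
    """Serialize one observation as a line for an append-only run history."""
    return json.dumps(asdict(obs), sort_keys=True)

obs = Observation(
    prompt="best AEO agency in NYC",
    provider="example-provider",
    timestamp=datetime.now(timezone.utc).isoformat(),
    configured_location="New York, NY, US",
    cited_domains=("example.com",),
    mentions=("ExampleBrand",),
    raw_answer="(full answer text would be stored here)",
)
print(to_json_line(obs))
```

Keeping the raw answer next to the derived fields is what lets someone recompute mentions or citations later under a different scoring formula.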

Does that match every real user? No.

The sample has clear limits: no years of ChatGPT history, no exact consumer UI, and no full long-tail distribution of every possible buyer question. The work is built around a narrower question: under this prompt set, in this geography, against these competitors, across these providers, how often do we appear?

A narrower question is more honest and more useful.

The downside: honest measurement costs more

There is a reason the cheap dashboard is tempting.

One scrape is cheap. One prompt run is cheap. A single API call with no repetitions and no geography is cheap. A polished line chart built from thin data can still look confident.

Canonry’s approach costs more because it does more work:

It runs more than one sample when the question matters.
It compares multiple providers instead of collapsing the market into one model.
It tracks competitors alongside your own domain.
It passes location context where supported.
It keeps evidence so the result can be inspected instead of just summarized.
It treats prompt sets as configuration.

That costs money. Grounded calls can cost more than plain completions. Repeated runs multiply cost. Location-aware coverage multiplies cost again. If you want New York, Los Angeles, Chicago, London, and Toronto across 200 prompts and four providers, you are buying a measurement program.
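
The multiplication is easy to work through. In the sketch below, the cities, prompt count, and provider count come from the example above; the runs per prompt and the per-call price are placeholder assumptions, not quoted figures.

```python
cities = ["New York", "Los Angeles", "Chicago", "London", "Toronto"]
prompts = 200
providers = 4
runs_per_prompt = 5        # assumption: repeated runs per prompt
cost_per_call_usd = 0.02   # assumption: placeholder, not a quoted price

# Each city, prompt, and provider combination is run several times.
combinations = len(cities) * prompts * providers
calls_per_sweep = combinations * runs_per_prompt
cost_per_sweep = calls_per_sweep * cost_per_call_usd

print(f"{combinations:,} combinations, {calls_per_sweep:,} calls per sweep")
print(f"about ${cost_per_sweep:,.2f} per sweep at the assumed price")
```

Five cities, 200 prompts, and four providers already give 4,000 combinations. Every extra run per prompt adds another 4,000 calls, and a weekly schedule repeats the whole sweep.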

The cheap version is cheap because it measures less.

The bar for any AI visibility dashboard

If you are buying a tool in this category, ask for the work behind the number.

Ask:

Are you scraping the consumer frontend, calling the API, or both?
If you scrape the frontend, whose account, location, memory state, and subscription tier are represented?
If you call the API, which tools are enabled, and how do you handle web retrieval?
How many runs per prompt produce the number?
Do you report variance or confidence intervals?
Is geography explicit, inferred, or ignored?
Can I see the raw answers and source evidence?
Can I see the prompt list and scoring formula?
Can I separate model drift from my own content changes?

Without answers to those questions, the number is decoration.

The honest future of AI visibility measurement is a distribution with evidence attached.

Less catchy than “you are number four.”

Closer to the truth.

🕒 **Posted on**: 1783057090
