Talk is Cheap – by Jake

This is a continuation of the discussion in How I’m thinking about the value of LLMs. I’m arguing elsewhere that LLMs will never be geniuses. This is not part 2 of The Ontology Argument.

In How I’m thinking I said I wasn’t ready to take a stance on LLM value creation. That changes in this post. Here is the stance I’m taking:

On average, how we’re using LLMs is likely destroying value.

My stance originates from stumbling on Faros.ai – a software development telemetry firm. They have products that pipe into common development tools like Jira, GitHub, and CI/CD pipelines to directly measure major operational metrics for software development teams.

Faros published a report in March that directly compares transaction level data between teams that use AI in their software development process and teams that do not, across their customer base. The sample covers 22,000 developers and 4,000 teams. This is, by far, the best data I’ve been able to locate that directly measures the operational impact of use of LLMs in the software development process.

It’s bad. Really, really bad.

The whole report is worth a read, but I’m going to cover just three major headline conclusions.

I think this supports what I said in How I’m thinking – there is clearly an individual productivity speedup that happens with LLMs. Although, I will say – it doesn’t look like it’s 10x from here. It’s a much more modest improvement than what the optimistic AI case would tell you.

The canary in the coal mine is that -11% in deployment frequency. That’s a system level metric. It directly measures how often the firms are delivering value to their customers.

I won’t even touch that code deletion ratio. You can draw your own conclusions there.

Let’s discuss a bit of nuance here.

Why am I using the term “productivity” and not “throughput” as the diagram does? I’m an operator trained in the ways of The Goal. To me throughput is reserved for total system flow – i.e. how many finished goods are flowing out of the system as a whole. When a developer gets done with a task, the feature is not shipped. This point will become very relevant later.

You’ll see that asterisk that says 10% of the dataset – only a subset of Faros’ customers pipe the telemetry product into their CI/CD pipeline. So for direct system level data, you’re actually relying on a sample of 2,200 developers and 400 teams. My stance is that this subsample is large enough to draw conclusions and that the averages calculated for it likely describe the center of the distribution for the rest. You may disagree with that. I’ll try and flag where the weaknesses are in my argument if you depart from me there.

Ok, so what do we have here? A meaningful, but modest improvement in developer productivity. It’s good, but it’s not 10x. It’s, at best, 2x.

What else?

This chart astonishes me. In my mind, it’s a brutal indictment of our use of LLMs. It’s hard for me to even put into words how bad this is.

These metrics are a direct correlate of system throughput. System throughput – from a business perspective – is the only thing that matters. You can’t sell a product until it’s on production. If the lead time of getting features onto production has increased almost 5x – we have changed the fitness of our operations by almost an order of magnitude in the wrong direction.

I will discuss this in more detail in the What is Happening section, but the worst is yet to come:

I feel like there’s nothing to even say here. There is no 10% to hide behind on this. If these stats generalize to the overall population, just imagine how much we collectively are impacting our customers. It’s like we’re all looking away from the fact that a defect becomes exponentially more costly to the whole system the farther it travels down the operational pipeline.

The Faros report is interesting for all sorts of reasons that I don’t have the space or inclination to discuss, but there’s another fact that’s very interesting. They built some statistical models trying to figure out whether certain features of the organizations predicted worse outcomes. Here’s an amazing observation:

DORA’s 2025 State of AI-Assisted Software Development report concludes that AI amplifies existing strengths and weaknesses, and that strong engineering foundations offer protection against AI’s Downsides. Our telemetry data, drawn from engineering systems across thousands of teams, does not support that as a protective factor. High-performing engineering organizations are experiencing the same downstream deterioration as everyone else.

Emphasis mine.

How bad is this really? I make the argument above that finished goods you can exchange for money are the only thing that matters, i.e. system throughput that doesn’t drive waste and rework (low quality). In software that means shipping features to production that don’t break.

On that count, how is our use of LLMs impacting our ability to do that? On the quality measures, we have a direct measurement. The Faros data also implies (with a bit of modeling explained in the appendix below – subject to the 10% caveat) an average system throughput number.

The headline results:

Faros’ customers using LLMs – on average – have delivered a 50% increase in defect rate per developer.
Faros’ customers using LLMs – on average – have system throughput down by 71-80%.

Please see the appendix for the detailed math on the system throughput calculation.

I think there are four lines of high quality information describing the impact of LLMs use on product development operations in the software industry.

1. Direct productivity studies
2. The shovelware data
3. The State of DevOps reports for 2024 and 2025
4. Faros’ 2026 customer metrics

I covered 1-3 in How I’m thinking and 4 in this article.

I think these different lines paint a consistent picture: People are experiencing individual productivity benefits of using LLMs, but at the organizational level it’s not showing up. In the best case, these effects are neutral for finished goods (completed features) throughput. In the worst case – supported by the most granular and direct data – the effects are slowing down system throughput – i.e. actively destroying value.

For quality – in my mind – there can be no best case or worst case. The data is unambiguous. Our use of LLMs is destroying product quality and hence enterprise value.

This thesis is showing up in the discourse around LLMs. You need only look at some recent articles to see people commenting on this effect.

For example, Azeem Azhar of Exponential View (a commentator I deeply respect) – just yesterday posted an essay discussing this issue. His view is that this dichotomy of individual productivity vs organizational effectiveness is a deployment problem. In his view, not enough has changed at the organizational level to fully realize value from the technology.

I think that’s wrong.

Azeem reaches for the deployment of the electrical grid as a historical comparison. His analogy is that organizations need time to learn how to use and deploy a new technology properly. In electricity’s case it took about 30 years for that effect to show up in aggregate productivity statistics. Rearrange everything in the organization around the new tech, the argument goes, and then the system level productivity comes.

Here’s the difference I see – in every case where a radical new technology has revolutionized industry – manufacturing machines, plumbing, planes, electricity, computers, the internet, etc – there is an increasing trend of reliability. Each one of those foundational technologies started from a place of unreliability and moved to a place of very high reliability. They became foundational because humans learned to trust them implicitly.

This is not a limitation that can be overcome by LLMs. Their value is in their unreliability. If you turn temperature down to zero, you get a deterministic machine – but you also break every meaningful application I know of in production.

So this is inherent to the technology. No amount of tokenmaxxing is going to change it. LLM development even breaks common and well-accepted quality norms in software development, like backwards compatibility. You literally can’t get (and wouldn’t want!) an LLM to do the same thing in the same way twice. But this means LLMs, on their own, are not a solid foundation to build a revolution on. They never can be.

I’ve been very careful in this post to use the term “how we use LLMs” rather than just “LLMs”. I think the personification of LLMs in our discourse is deeply damaging on a number of levels. Here it distracts from the fact that they are tools. They don’t think. They don’t have agency. They don’t do anything without you telling them to.

A tool is only as good as how a person wields it. And I do think a lot of us are wielding them in a way that is guaranteed to destroy value. In my experience, it’s very common for people to say things like “use the LLM to make the first draft and then edit it afterwards”.

I think that’s exactly backwards. The first draft of something is the core intellectual contribution you are making. The effort of making it shapes your thinking. It causes you to interact with the ideas and understand the structure of what you’re trying to communicate. To me, this is true in natural and formal language.

Your understanding of the structure is – in my mind – the main part of the value you’re producing by writing something. By first-drafting with an LLM I think you are essentially handing off the thinking to the person downstream of you. And, in the process, you’re handing off responsibility for the defects introduced by your work. I think you are also dramatically increasing the cost of addressing those defects. I think the Faros data says this very plainly.

After you get the first draft down – rough as it might be – the LLM is a great help in shaping it into something coherent. It can give you great feedback. And, as you delta the structure, you continue to hold the responsibility for what’s changing. If this was the dominant pattern of how we used LLMs, I think it would enormously increase the quality of our products and hence deliver value.

You may disagree with me. That’s fine. I’m just trying to tell you what I think. I’ve had this conversation with a lot of people over the past few weeks and I get a lot of reactions. My view is that people are having a hard time really looking at LLMs. And I have a thought about that too.

I think many people think that because LLMs talk, they are intelligent.

My formative professional development happened in the construction trades. A bit of wisdom from that domain may be medicinal here.

Talk is cheap.

The Faros data allows us – along with two assumptions – to compute an implied system throughput number with an application of Little’s Law.

The two assumptions:

1. That the averages of the 10% dataset are close to the averages of the whole dataset.
2. That work arrives at the software development system at the same rate it leaves. I.e. lambda describes both.

Assumption (1) we’ve already discussed a bit. If you can’t follow me here – I understand. I think it’s a very reasonable place to stand, but we are operating under uncertainty.

Assumption (2) is certainly wrong, but it is useful because it is conservative. In domains where this is not true, every new addition to the queue adds exponentially to the mean wait time of the queue. So violating this assumption makes things much worse.
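To make the direction of that effect concrete, here is a minimal sketch of a toy backlog model. Everything in it is an assumption for illustration: the function name, the daily rates, and the starting backlog are made up, and none of it comes from the Faros data. It makes no claim about the exact shape of the growth curve. It only shows that once arrivals outpace departures, the backlog and the implied wait keep growing, where the balanced case stays flat.

```python
def backlog_by_day(arrivals_per_day: float,
                   departures_per_day: float,
                   days: int,
                   initial_backlog: float = 0.0) -> list[float]:
    """Toy fluid model: track how many items are waiting at the end of each day.

    Hypothetical helper for illustration only; rates are treated as constant.
    """
    backlog = initial_backlog
    history = []
    for _ in range(days):
        # Work in minus work out; a backlog cannot go below zero.
        backlog = max(0.0, backlog + arrivals_per_day - departures_per_day)
        history.append(backlog)
    return history


# Assumed numbers: 10 items/day leave the system in both scenarios.
DEPARTURES = 10.0
balanced = backlog_by_day(10.0, DEPARTURES, days=5, initial_backlog=20.0)
overloaded = backlog_by_day(12.0, DEPARTURES, days=5, initial_backlog=20.0)

# Rough wait for an item joining the back of the queue: backlog / departure rate.
balanced_wait_days = balanced[-1] / DEPARTURES
overloaded_wait_days = overloaded[-1] / DEPARTURES

print("balanced backlog:  ", balanced, "wait (days):", balanced_wait_days)
print("overloaded backlog:", overloaded, "wait (days):", overloaded_wait_days)
```

In the balanced run the backlog never moves, which is the situation assumption (2) describes. In the overloaded run it climbs every day, so any lead time measured at a single point understates what later arrivals will experience.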

I think the above assumptions will allow us to arrive at a generous conclusion towards LLM use. A lower bound on how bad things are. Alright, let’s get into it.

What we want is a Little’s Law computation for the whole system describing the percent change. So we have:

\(L = \lambda W\)

Wikipedia has a great explanation of this law that I can’t improve on:

In plain terms, the law says that the average number of items in a system (L) depends on both the rate at which items enter the system (λ) and the average time each item remains there (W). If items arrive faster, or if each item stays longer, the average number present increases proportionally.

(Ok, I did improve on it a bit by putting the symbols in the explanation.)

What we can get is a percent estimate of the delta on going from NoLLM -> LLM. So we want:

\(\frac{\lambda_{\text{LLM}}}{\lambda_{\text{NoLLM}}} = \frac{W_{\text{NoLLM}}}{W_{\text{LLM}}} \cdot \frac{L_{\text{LLM}}}{L_{\text{NoLLM}}}\)
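For anyone who wants the intermediate step: the ratio comes from writing Little’s Law once for each regime, solving each for lambda, and dividing. This is only a sketch of the algebra, and it leans on the two assumptions stated earlier (in particular that a single lambda describes both arrivals and departures in each regime).

```latex
\begin{align*}
L_{\text{LLM}} = \lambda_{\text{LLM}} W_{\text{LLM}}
  \quad &\Rightarrow \quad
  \lambda_{\text{LLM}} = \frac{L_{\text{LLM}}}{W_{\text{LLM}}} \\
L_{\text{NoLLM}} = \lambda_{\text{NoLLM}} W_{\text{NoLLM}}
  \quad &\Rightarrow \quad
  \lambda_{\text{NoLLM}} = \frac{L_{\text{NoLLM}}}{W_{\text{NoLLM}}} \\
\frac{\lambda_{\text{LLM}}}{\lambda_{\text{NoLLM}}}
  &= \frac{L_{\text{LLM}} / W_{\text{LLM}}}{L_{\text{NoLLM}} / W_{\text{NoLLM}}}
   = \frac{W_{\text{NoLLM}}}{W_{\text{LLM}}} \cdot \frac{L_{\text{LLM}}}{L_{\text{NoLLM}}}
\end{align*}
```

The useful property is that only ratios survive, so we never need the absolute WIP or lead time for either group.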

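We have a system metric for the W ratio – that 480% number. An increase of 480% means \(W_{\text{LLM}}\) is 5.8 times \(W_{\text{NoLLM}}\), so the W ratio in the formula is 1/5.8. We need an estimate on L.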

There are two good candidates that we will use to derive a lower and upper bound:

The PR contexts per developer and the daily task contexts per developer are strong candidates to define our upper and lower bound on WIP increase. If we plug those numbers in:

For the upper:

\(\frac{\lambda_{\text{LLM}}}{\lambda_{\text{NoLLM}}} = \frac{1}{5.8} \cdot \frac{1.67}{1} = 0.29 = 29\%\)

And lower:

\(\frac{\lambda_{\text{LLM}}}{\lambda_{\text{NoLLM}}} = \frac{1}{5.8} \cdot \frac{1.17}{1} = 0.20 = 20\%\)
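If you want to check the arithmetic or plug in your own numbers, here is a minimal sketch of the same calculation. The 5.8 lead time ratio and the 1.67 and 1.17 WIP ratios are the figures used above; the function and variable names are mine and purely illustrative.

```python
def throughput_ratio(lead_time_ratio: float, wip_ratio: float) -> float:
    """Return lambda_LLM / lambda_NoLLM implied by Little's Law (L = lambda * W).

    lead_time_ratio is W_LLM / W_NoLLM and wip_ratio is L_LLM / L_NoLLM.
    """
    if lead_time_ratio <= 0 or wip_ratio <= 0:
        raise ValueError("ratios must be positive")
    return wip_ratio / lead_time_ratio


# Figures taken from the text above.
LEAD_TIME_RATIO = 5.8                        # a 480% increase means 5.8x
WIP_RATIOS = {"upper": 1.67, "lower": 1.17}  # the two WIP candidates

bounds = {
    name: throughput_ratio(LEAD_TIME_RATIO, wip)
    for name, wip in WIP_RATIOS.items()
}

for name, ratio in bounds.items():
    # ratio - 1 is the fractional change in system throughput.
    print(f"{name}: throughput ratio {ratio:.2f}, change {ratio - 1:+.0%}")
```

The two changes it prints, -71% and -80%, are where the 71-80% throughput decline in the headline results comes from.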

🕒 **Posted on**: 1780246586