The L in “LLM” Stands for Lying — Acko.net

So it’s no wonder artists would denounce generative AI as mass-plagiarism when it showed up. It’s also no wonder that a bunch of tech entrepreneurs and data janitors wouldn’t understand this at all, and would in fact embrace the plagiarism wholesale, training their models on every pirated shadow library they can get. Or indeed, every code repository out there.

If the output of this is generic, gross and suspicious, there’s a very obvious reason for it. The different training samples in the source material are themselves just slop for the machine. Whatever makes the weights go brrr during training.

This just so happens to create the plausible deniability that makes it impossible to say what’s a citation, what’s a hallucination, and what, if anything, could be considered novel or creative. This is what keeps those shadow libraries illegal, but ChatGPT “legal”.

Labeling AI content as AI-generated, or watermarking it, is thus largely an exercise in ass-covering, and not in any way responsible disclosure.

It’s also what provides the fig leaf that allows many a developer to knock off for an early lunch and early dinner every day, while keeping the meter running, without ever questioning whether the intellectual property clauses in their contract still mean anything at all.

This leaves the engineers in question in an awkward spot, however. In order for vibe-coding to be acceptable and justifiable, they have to consider their own output disposable, highly uncreative, and not worthy of credit.

* * *

If you ask me, no court should have ever rendered a judgement on whether AI output as a category is legal or copyrightable, because none of it is sourced. The judgement simply cannot be made, and AI output should be treated like a forgery unless and until proven otherwise.

The solution to the LLM conundrum is then as obvious as it is elusive: the only way to separate the gold from the slop is for LLMs to perform correct source attribution along with inference.

This wouldn’t just help with the artistic side of things. It would also reveal how much vibe code is merely copy/pasted from an existing codebase, conveniently omitting the original author, license and link.

With today’s models, real attribution is a technical impossibility. The fact that an LLM can mention and cite sources at all is an emergent property of the data that’s been ingested and the prompt being completed. It can only do so when a citation is the likely continuation at the current position in the text.

There’s no reason to think that this is generalizable; rather, it is far more likely that LLMs are merely good at citing things that are frequently and correctly cited. It’s citation role-play.
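To make the role-play point concrete, here is a deliberately tiny sketch, with made-up corpus strings and nothing resembling a real transformer: a next-token sampler whose only learned state is co-occurrence counts. Nothing in it records which document a sequence came from, so when it emits something citation-shaped, that is just the statistically likely continuation, not a lookup.

```python
import random
from collections import defaultdict, Counter

# Made-up "training data": a few snippets, two of which happen to contain a
# citation-shaped string. A real corpus would be billions of tokens.
corpus = [
    "attention is all you need ( vaswani et al 2017 )",
    "attention is all you need ( vaswani et al 2017 )",
    "attention is what the reader pays to the prose",
]

# Learn bigram counts. Provenance is discarded at this step: the counts do not
# remember which snippet, author, or license a pair of tokens came from.
counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

def sample_next(token: str) -> str:
    """Pick the next token purely by learned frequency; no source attached."""
    options = counts[token]
    if not options:
        return "<end>"
    return random.choices(list(options), weights=list(options.values()))[0]

# Complete the prompt "attention is ...".
out = ["attention", "is"]
while len(out) < 12 and out[-1] != "<end>":
    out.append(sample_next(out[-1]))
print(" ".join(out))
# Most runs print the citation-shaped "( vaswani et al 2017 )" simply because
# it is frequent in the toy data: the "citation" is a likely continuation,
# not a reference the model can point back to.
```

A real LLM is vastly more capable than this bigram toy, but the relevant property is the same: the trained weights encode frequencies, not references back to the sources.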

The implications of sourcing-as-a-requirement are vast. What does backpropagation even look like if the weights have to be attributable, and the forward pass auditable? You won’t be able to fit that in an int4, that’s for sure.

Nevertheless, I think this would be quite revealing, as this is the problem that “AI detection tools” are really trying to solve, just backwards. It’s crazy that the next big thing after the World Wide Web, and the Google-scale search engine to make use of it, was a technology that cannot tell you where the information comes from, by design. It’s… sloppy.

To stop the machines from lying, they have to cite their sources properly. And spoiler, so do the AI companies.
