The Free Software Foundation’s Licensing and Compliance Lab
concerns itself with many aspects of software licensing, Krzysztof Siewicz
said at the beginning of his 2025 GNU Tools
Cauldron session. These include supporting projects that are facing
licensing challenges, collecting copyright assignments, and addressing GPL
violations. In this session, though, there was really only one topic that
the audience wanted to know about: the interaction between free-software
licensing and large language models (LLMs).
![Krzysztof Siewicz](https://static.lwn.net/images/conf/2025/cauldron/KrzysztofSiewicz-sm.png)
Anybody hoping to exit the session with clear answers about the status of
LLM-created code was bound to be disappointed; the FSF, too, is trying to
figure out what this landscape looks like. The organization is currently
running a survey of
free-software projects with the intent of gathering information about
what position those projects are taking with regard to LLM-authored code.
From that information (and more), the FSF eventually hopes to come up with
guidance of its own.
Nick Clifton asked whether the FSF is working on a new version of the GNU
General Public License — a GPLv4 — that takes LLM-generated code into
account. No license changes are under consideration now, Siewicz answered;
instead, the FSF is considering adjustments to the Free Software
Definition first.
Siewicz continued that LLM-generated code is problematic from a
free-software point of view because, among other reasons, the models
themselves are usually non-free, as is the software used to train them.
Clifton asked why the training code mattered; Siewicz said that at this
point he was just highlighting the concern that some feel. There are
people who want to avoid proprietary software even when it is being run by
others.
Siewicz went on to say that one of the key questions is whether code that
is created by an LLM is copyrightable and, if not, whether there is some
way to make it copyrightable. It was never said explicitly, but the driving issue
seems to be whether this software can be credibly put under a copyleft
license. Equally important is whether such code infringes on the rights of
others. With regard to copyrightability, the question is still open; there
are some cases working their way through the courts now. Regardless,
though, he said that it seems possible to ensure that LLM output can be
copyrighted by applying some human effort to enhance the resulting code.
The use of a “creative prompt” might also make the code copyrightable.
Many years ago, he said, photographs were not generally seen as being
copyrightable. That changed over time as people figured out what could be
done with that technology and the creativity it enabled. Photography may
be a good analogy for LLMs, he suggested.
There is also, of course, the question of copyright infringements in
code produced by LLMs, usually in the form of training data leaking into
the model’s output. Prompting an LLM for output “in the style of” some
producer may be more likely to cause that to happen. Clifton
suggested that LLM-generated code should be submitted with the prompt used
to create it so that the potential for copyright infringement can be
evaluated by others.
Siewicz said that he does not know of any model that says explicitly
whether it incorporates licensed data. As some have suggested, it could be
possible to train a model exclusively on permissively licensed material so
that its output would have to be distributable, but even permissive
licenses require the preservation of copyright notices, which LLMs do not
do. A related concern is that some LLMs come with terms of service that
assert copyright over the model’s output; incorporating such code into a
free-software project could expose that project to copyright claims.
Siewicz concluded his talk with a few suggested precautions for
any project that accepts LLM-generated code, assuming that the project
accepts it at all. These suggestions mostly took the form of collecting
metadata about the code. Submissions should disclose which LLM was used to
create them, including version information and any available information on
the data that the model was trained on. The prompt used to create the code
should also be provided. The LLM-generated code should be clearly marked.
If there are any use restrictions on the model output, those need to be
documented as well. All of this information should be recorded and saved
when the code is accepted.
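Nothing in the talk prescribed a specific mechanism for recording this
information, but Git commit trailers are one natural place for it to live.
As a minimal, purely hypothetical sketch (the trailer names below are
invented for illustration, not an FSF or project standard), a submission
might look like:

```
Add RFC 3339 timestamp parser

# Hypothetical metadata trailers for an LLM-assisted contribution:
Generated-by: ExampleLLM 4.2 (hypothetical model name and version)
Training-data: https://example.org/model-card (if the vendor publishes one)
Prompt: "Write a C function that parses RFC 3339 timestamps"
Output-terms: vendor terms of service, quoted or linked
```

Since trailers travel with the commit, the metadata is recorded at the
moment the code is accepted and remains available in the project's history.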
A member of the audience pointed out that the line between LLMs and
assistive (accessibility) technology can be blurry, and that an outright ban
on the former could end up blocking developers who need assistive
technology, which nobody wants to do.
There were some questions about how to distinguish LLM-generated code from
human-authored code, given that some contributors may not be up-front about
their model use. Clifton said that there must always be humans in the
loop; they, in the end, are responsible for the code they submit. Jeff Law
added that the Developer Certificate of Origin (DCO), under which code is
submitted to many projects,
includes a statement that the contributor has the right to submit the code
in question. Determining whether that right is something the contributor
truly holds is not a new concern; developers could be, for example,
submitting code that is owned by their employer.
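For reference, the DCO statement Law referred to is certified in many
projects (the kernel among them) with a Signed-off-by trailer on each
commit, which `git commit -s` adds automatically; the name and address here
are placeholders:

```
Signed-off-by: Ada Developer <ada@example.org>
```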
A real concern, Siewicz said, is whether contributors are sufficiently
educated to know where the risks actually are.
Mark Wielaard said that developers are normally able to cite any
inspirations for the code they write; an LLM is clearly inspired by other
code, but is unable to make any such citations. So there is no way to
really know where LLM-generated code came from. A developer would have to
publish their entire session with the LLM to even begin to fill that in.
The session came to an end with, perhaps, participants feeling that they
had a better understanding of where some of the concerns are, but nobody
walked out convinced that they knew the answers.
A video of this session is available on YouTube.
[Thanks to the Linux Foundation, LWN’s travel sponsor, for supporting my travel to this event.]