Re: [MAINTAINERS SUMMIT] The role of AI and LLMs in the kernel process

From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Jonathan Corbet <corbet@lwn.net>,
	 "H. Peter Anvin" <hpa@zytor.com>,
	Sasha Levin <sashal@kernel.org>,
	ksummit@lists.linux.dev
Subject: Re: [MAINTAINERS SUMMIT] The role of AI and LLMs in the kernel process
Date: Mon, 08 Dec 2025 18:16:52 +0900
Message-ID: <4597dfe45c9ff2991ed5221c618602ea42993940.camel@HansenPartnership.com>
In-Reply-To: <20251208094116.6757ddeb@foz.lan>

On Mon, 2025-12-08 at 09:41 +0100, Mauro Carvalho Chehab wrote:
> On Mon, 08 Dec 2025 12:42:32 +0900
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > On Sun, 2025-12-07 at 22:15 -0500, Steven Rostedt wrote:
> > > On Sun, 07 Dec 2025 18:59:19 -0700
> > > Jonathan Corbet <corbet@lwn.net> wrote:
> > >   
> > > > > I contend there is a huge difference between *code* and
> > > > > descriptions/documentation/...  
> > >   
> > > > 
> > > > As you might imagine, I'm not fully on board with that.  Code
> > > > is assumed plagiarized, but text is not?  Subtly wrong
> > > > documentation is OK?
> > > > 
> > > > I think our documentation requires just as much care as our
> > > > code does.  
> > > 
> > > I assumed what hpa was mentioning about documentation, may be
> > > either translation of original text of the submitter, or AI
> > > looking at the code that was created and created a change log. In
> > > either case, the text was generated from the input of the author 
> > 
> > I think this is precisely the problem Jon was referring to: you're
> > saying that if AI generates *text* based on input prompts it's not
> > a copyright problem, but if AI generates *code* based on input
> > prompts, it is.  As simply a neural net operational issue *both*
> > input to output sets are generated in the same way by the AI
> > process and would have the same legal probability of being
> > copyright problems.  i.e. if the first likely isn't a copyright
> > problem, the second likely isn't as well (and vice versa).
> 
> I'd say that there are different things placed in the same box. Those
> two, for example, sound OK in my eyes:
> 
> - translations - either for documentation or for the code.
>   The original copyrights are maintained in any translation. This is
>   already proven in court: if one translates Isaac Asimov's
>   "Foundation" to Greek, his copyright remains on the translation.
>   OK, if the translation is done by a human, he can claim additional
>   copyright for the translation, but a machine doesn't have legal
>   rights to claim copyright. Plus, the translation is a derivative
>   work of the original text, so I can't see how this could ever be a
>   problem if the copyright of the original author is placed on the
>   translation;

I can explain simply how I as a translator could cause a copyright
problem with no AI involvement: let's say I translate Foundation from
English to French, but while doing so I embed a load of quotes from the
novels of Annie Ernaux in a way that nicely matches the Asimov
original.  Now I've created a work which may be derivative of
Foundation and partly owned by me, but which is also open to claims of
copyright infringement from Annie Ernaux.

The above is directly analogous to what would happen with an AI
translator if its output were decided to be a derivative of its
training data.

> - code filling - if a prompt requests automating a repetitive task,
>   like creating skeleton code, adding includes, reviewing coding
>   style and other brute force "brainless" activities, the generated
>   code won't be different from what other similar tools or the
>   developer would produce - AI is simply a tool to speed it up, just
>   like any other similar tool. No copyright issues.
> 
> Things could be in a gray area if one uses AI to write a patch from
> scratch. Still, if the training data is big enough, the weights of
> the neural network will be calibrated to repeat the most common
> patterns, so the code would probably be similar to what most
> developers would do.
>
> 
> In some experiments I did myself, that's what happened: the
> generated code wasn't much different from what a junior student with
> C knowledge would write, with about the same mistakes. The only
> difference is that, instead of taking weeks, the code materialized in
> seconds. To be something that a maintainer would pick, a senior
> developer would be required to clean up the mess.

How good (or not) AI is at coding is a separate question from whether
the copyright in its output is contaminated by the training data.

> > > . Where as AI generated code likely comes from somebody else's
> > > code. Perhaps AI was trained on somebody else's text, but the
> > > output will likely not be a derivative of it as the input is
> > > still original. 
> > 
> > That's an incorrect statement: if the output is a derivative of the
> > training (which is a big if given the current state of the legal
> > landscape) and the training set was copyrighted, then even a
> > translated text using that training data will pick up the copyright
> > violation regardless of input prompting.
> 
> If one trains it only with internal code from a specific original
> product that doesn't have any common patterns which anyone else would
> use, then this could be the case.
> 
> However, this is usually not the case: models are trained with big
> data from lots of different developers and projects. As neural
> network training is based on setting up weights based on
> inputs/outputs, if the training data is big enough, such weights will
> tend to follow the most repetitive patterns from similar code/text.
> 
> In other words, AI training will generate a model that tends to
> repeat sequences with the most common patterns from its training
> data. This is not different from what a programming student would do
> without using AI when facing a programming issue: he would likely
> search for it in a browser. The search engine algorithms from search
> providers already show the most likely answers to such a question at
> the top of the results.

Patterns are not expression in the copyright sense.  Indeed, code tends
to be much more amenable to the independent invention defence than
literature: If I give the same programming task to a set of engineers
with the same CS training, most of them would come up with pretty
identical programs even if they don't collaborate.  However, as long as
they didn't copy from each other the programs they come up with are
separate works even if they're very similar in expression.

Just because code is more likely to be independently invented than
literature doesn't make it more prone to copyright violations (although
it does give more scope to the litigious to claim this).

Regards,

James

> The AI generated code won't be much different than that, except that,
> instead of taking just the first search result, it would use
> a mix of the top search results for the same prompt to produce its
> result.
> 
> In any case (googling or using AI), the tool-produced code examples
> aren't ready for submission. They can be just the beginning of some
> code that will usually require lots of work to become something that
> could be ready for submission - or even an example of what one should
> not do. In the latter case, the developer would need to google again
> or change the prompt until it gets something that might be
> applicable to the real use case.
> 
> Thanks,
> Mauro
>
