ksummit.lists.linux.dev archive mirror
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>,
	 Steven Rostedt <rostedt@goodmis.org>,
	Jonathan Corbet <corbet@lwn.net>,
	 "H. Peter Anvin" <hpa@zytor.com>,
	Sasha Levin <sashal@kernel.org>,
	ksummit@lists.linux.dev
Subject: Re: [MAINTAINERS SUMMIT] The role of AI and LLMs in the kernel process
Date: Mon, 8 Dec 2025 11:22:02 +0100	[thread overview]
Message-ID: <rtxwa23krfv4xqi2c3eb6f2zygppuft4fesg532squ656v7jba@iniftynodbt2> (raw)
In-Reply-To: <4597dfe45c9ff2991ed5221c618602ea42993940.camel@HansenPartnership.com>

On Mon, Dec 08, 2025 at 06:16:52PM +0900, James Bottomley wrote:
> On Mon, 2025-12-08 at 09:41 +0100, Mauro Carvalho Chehab wrote:
> > Em Mon, 08 Dec 2025 12:42:32 +0900
> > James Bottomley <James.Bottomley@HansenPartnership.com> escreveu:
> > 
> > > On Sun, 2025-12-07 at 22:15 -0500, Steven Rostedt wrote:
> > > > On Sun, 07 Dec 2025 18:59:19 -0700
> > > > Jonathan Corbet <corbet@lwn.net> wrote:
> > > >   
> > > > > > I contend there is a huge difference between *code* and
> > > > > > descriptions/documentation/...  
> > > >   
> > > > > 
> > > > > As you might imagine, I'm not fully on board with that.  Code
> > > > > is assumed plagiarized, but text is not?  Subtly wrong
> > > > > documentation is OK?
> > > > > 
> > > > > I think our documentation requires just as much care as our
> > > > > code does.  
> > > > 
> > > > I assumed that what hpa was mentioning about documentation may be
> > > > either a translation of the submitter's original text, or AI
> > > > looking at the code that was written and generating a change log.
> > > > In either case, the text was generated from the input of the author
> > > 
> > > I think this is precisely the problem Jon was referring to: you're
> > > saying that if AI generates *text* based on input prompts it's not
> > > a copyright problem, but if AI generates *code* based on input
> > > prompts, it is.  As simply a neural net operational issue *both*
> > > input to output sets are generated in the same way by the AI
> > > process and would have the same legal probability of being
> > > copyright problems.  i.e. if the first likely isn't a copyright
> > > problem, the second likely isn't as well (and vice versa).
> > 
> > I'd say that there are different things placed in the same box. These
> > two, for example, sound OK to my eyes:
> > 
> > - translations - either for documentation or for the code.
> >   The original copyrights remain on any translation. This is already
> >   proven in courts: if one translates Isaac Asimov's "Foundation" to
> >   Greek, his copyright remains on the translation. OK, if the
> >   translation is done by a human, he can claim additional copyrights
> >   for the translation, but a machine doesn't have legal standing to
> >   claim copyrights. Plus, the translation is a derivative work of the
> >   original text, so I can't see how this could ever be a problem, if
> >   the copyrights of the original author are placed on the translation;
> 
> I can explain simply how I as a translator could cause a copyright
> problem with no AI involvement: let's say I translate Foundation from
> English to French, but while doing so I embed a load of quotes from the
> novels of Annie Ernaux in a way that nicely matches the Asimov
> original.  Now I've created a work which may be derivative of
> Foundation and partly owned by me, but which also faces claims of
> copyright abuse from Annie Ernaux.

A use like that would likely be fair use/fair dealing.

> The above is directly analogous to what would happen if the AI output
> were decided to be a derivative of its training for an AI translator.

Since an AI would pick the most likely translations, the risk of it
picking up such quotes is low.

Worst-case scenario for something digitally published: one can change
the translation to a different translated text/code if a valid copyright
claim applies.

> 
> > - code filling - if a prompt requests automating a repetitive task,
> >   like creating skeleton code, adding includes, reviewing coding style
> >   and other brute-force "brainless" activities, the generated code
> >   won't be different from what other similar tools, or the developer,
> >   would produce - AI is simply a tool to speed it up, just like any
> >   other similar tool. No copyright issues.
> > 
> > Things could be in a gray area if one uses AI to write a patch from
> > scratch. Still, if the training data is big enough, the weights in
> > the neural network will be calibrated to repeat the most common
> > patterns, so the code would probably be similar to what most
> > developers would do.
> >
> > In some experiments I did myself, that's what happened: the generated
> > code wasn't much different from what a junior student with C
> > knowledge would write, with about the same mistakes. The only
> > difference is that, instead of taking weeks, the code materialized in
> > seconds. For it to be something a maintainer would pick, a senior
> > developer would be required to clean up the mess.
> 
> How good (or not) AI is at coding is different from the question of
> whether the output has its copyright contaminated by the training data.

True.

> > 
> > > > . Whereas AI-generated code likely comes from somebody else's
> > > > code. Perhaps the AI was trained on somebody else's text, but the
> > > > output will likely not be a derivative of it, as the input is
> > > > still original.
> > > 
> > > That's an incorrect statement: if the output is a derivative of the
> > > training (which is a big if given the current state of the legal
> > > landscape) and the training set was copyrighted, then even a
> > > translated text using that training data will pick up the copyright
> > > violation regardless of input prompting.
> > 
> > If one trains it only with internal code from a specific original
> > product that doesn't have any common patterns that anyone else would
> > use, then this could be the case.
> > 
> > However, this is usually not the case: models are trained with big
> > data from lots of different developers and projects. As neural
> > network training is based on setting up weights based on
> > inputs/outputs, if the training data is big enough, such weights will
> > tend to follow the most repetitive patterns from similar code/text.
> > 
> > In other words, AI training will generate a model that tends to
> > repeat sequences with the most common patterns from its training
> > data. This is no different from what a programming student would do
> > without using AI when facing a programming issue: he would likely
> > search for it in a browser. The search engine algorithms from search
> > providers already show the most likely answers to such a question at
> > the top of the results.
> 
> Patterns are not expression in the copyright sense.  Indeed, code tends
> to be much more amenable to the independent invention defence than
> literature: If I give the same programming task to a set of engineers
> with the same CS training, most of them would come up with pretty
> identical programs even if they don't collaborate.

True. Also, such common patterns, repeated everywhere, would very
likely be fair use even if they originally came from copyrighted
material.

> However, as long as
> they didn't copy from each other the programs they come up with are
> separate works even if they're very similar in expression.

Those are indeed separate works. Code written by a developer, whether
based on his CS training, on AI-generated code, on textbook code, or on
code found on the Internet and used as an example, can become
copyrighted by the developer who wrote it.

To me, AI, when used as an ancillary tool, is no different from what
developers have always been doing.

Now, using AI as a replacement for humans is a whole different thing:
I don't think we are at that stage yet. I'm also not convinced that
this will happen anytime soon.

In some tests I did, even the most capable engines are not currently
able to generate proper code: it usually takes lots of iterations to
refine prompts, plus new prompts to modify the produced results into
something more palatable. The output was almost always a code skeleton
that required manual work.

In such a workflow, the prompts themselves can be considered
copyrighted material. As such, the transformation into code also
carries the developer's copyright. Since the output requires manual
changes to reach production level, those changes are also copyrighted
by the developer.

Again, this is no different from doing research in specialized
literature and/or on the Internet: one needs to do the right research,
classify the results and modify the code examples to produce the
real code.

> Just because code is more likely to be independently invented than
> literature doesn't make it more prone to copyright violations (although
> it does give more scope to the litigious to claim this).

True, but this is no different from not using AI at all.

> 
> Regards,
> 
> James
> 
> > The AI-generated code won't be much different from that, except
> > that, instead of taking just the first search result, it would use
> > a mix of the top search results for the same prompt to produce its
> > result.
> > 
> > In any case (googling or using AI), the tool-produced code examples
> > aren't ready for submission. They can be just the beginning of some
> > code that will usually require lots of work to become something
> > ready for submission - or they can even be an example of what one
> > should not do. In the latter case, the developer would need to
> > google again or change the prompt until he gets something applicable
> > to the real use case.
> > 
> > Thanks,
> > Mauro
> > 
> 

-- 
Thanks,
Mauro

