ksummit.lists.linux.dev archive mirror
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Jonathan Corbet <corbet@lwn.net>,
	"H. Peter Anvin" <hpa@zytor.com>, Sasha Levin <sashal@kernel.org>,
	ksummit@lists.linux.dev
Subject: Re: [MAINTAINERS SUMMIT] The role of AI and LLMs in the kernel process
Date: Mon, 8 Dec 2025 09:41:16 +0100	[thread overview]
Message-ID: <20251208094116.6757ddeb@foz.lan> (raw)
In-Reply-To: <88091c9ac1d8f20bade177212445a60c752ba8b5.camel@HansenPartnership.com>

On Mon, 08 Dec 2025 12:42:32 +0900,
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Sun, 2025-12-07 at 22:15 -0500, Steven Rostedt wrote:
> > On Sun, 07 Dec 2025 18:59:19 -0700
> > Jonathan Corbet <corbet@lwn.net> wrote:
> >   
> > > > I contend there is a huge difference between *code* and
> > > > descriptions/documentation/...  
> >   
> > > 
> > > As you might imagine, I'm not fully on board with that.  Code is
> > > assumed plagiarized, but text is not?  Subtly wrong documentation
> > > is OK?
> > > 
> > > I think our documentation requires just as much care as our code
> > > does.  
> > 
> > I assumed what hpa was mentioning about documentation, may be either
> > translation of original text of the submitter, or AI looking at the
> > code that was created and created a change log. In either case, the
> > text was generated from the input of the author  
> 
> I think this is precisely the problem Jon was referring to: you're
> saying that if AI generates *text* based on input prompts it's not a
> copyright problem, but if AI generates *code* based on input prompts,
> it is.  As simply a neural net operational issue *both* input to output
> sets are generated in the same way by the AI process and would have the
> same legal probability of being copyright problems.  i.e. if the first
> likely isn't a copyright problem, the second likely isn't as well (and
> vice versa).

I'd say there are different things being placed in the same box. These
two, for example, look OK to me:

- translations, either for documentation or for code.
  The original copyright remains on any translation. This has already
  been proven in court: if one translates Isaac Asimov's "Foundation"
  to Greek, his copyright remains on the translation. Granted, if the
  translation is done by a human, the translator can claim additional
  copyright for the translation, but a machine has no legal standing
  to claim copyright. Plus, the translation is a derivative work of
  the original text, so I can't see how this could ever be a problem,
  provided the original author's copyright is placed on the
  translation;

- code filling - if a prompt requests automating a repetitive task,
  like creating skeleton code, adding includes, reviewing coding style
  and other brute-force "brainless" activities, the generated code
  won't be different from what other similar tools or the developer
  would produce; AI is simply a tool to speed it up, just like any
  other similar tool. No copyright issues.

Things could be in a gray area if one uses AI to write a patch from
scratch. Still, if the training data is big enough, the weights in the
neural network will be calibrated to repeat the most common patterns,
so the code will probably be similar to what most developers would
write.

In some experiments I did myself, that's what happened: the generated
code wasn't much different from what a junior student with C knowledge
would write, with about the same mistakes. The only difference is
that, instead of taking weeks, the code materialized in seconds. For
it to be something a maintainer would pick up, a senior developer
would be required to clean up the mess.

> > . Where as AI generated code likely comes from somebody else's code.
> > Perhaps AI was trained on somebody else's text, but the output will
> > likely not be a derivative of it as the input is still original.  
> 
> That's an incorrect statement: if the output is a derivative of the
> training (which is a big if given the current state of the legal
> landscape) and the training set was copyrighted, then even a translated
> text using that training data will pick up the copyright violation
> regardless of input prompting.

If one trains it only with internal code from a specific in-house
product, code that doesn't share any of the common patterns that
anyone else would produce, then this could be the case.

However, this is usually not the case: models are trained with big
data from lots of different developers and projects. As neural network
training is based on setting weights from input/output pairs, if the
training data is big enough, such weights will tend to follow the most
repetitive patterns found in similar code/text.

In other words, AI training generates a model that tends to repeat the
sequences with the most common patterns from its training data. This
is no different from what a programming student would do without AI
when facing a programming issue: he would likely search for it in a
browser, and the search engines already rank the most likely answers
for such a question at the top.

The AI-generated code won't be much different from that, except that,
instead of taking just the first search result, it would use a mix of
the top search results for the same prompt to produce its output.

In any case (googling or using AI), the tool-produced code examples
aren't ready for submission. They can be just the beginning of some
code that will usually require lots of work before being something
that could be submitted, or they can even be an example of what one
should not do. In the latter case, the developer would need to google
again or to change the prompt, until getting something that might be
applicable to the real use case.

Thanks,
Mauro


Thread overview: 56+ messages
2025-08-05 16:03 Lorenzo Stoakes
2025-08-05 16:43 ` James Bottomley
2025-08-05 17:11   ` Mark Brown
2025-08-05 17:23     ` James Bottomley
2025-08-05 17:43       ` Sasha Levin
2025-08-05 17:58         ` Lorenzo Stoakes
2025-08-05 18:16       ` Mark Brown
2025-08-05 18:01     ` Lorenzo Stoakes
2025-08-05 18:46       ` Mark Brown
2025-08-05 19:18         ` Lorenzo Stoakes
2025-08-05 17:17   ` Stephen Hemminger
2025-08-05 17:55   ` Lorenzo Stoakes
2025-08-05 18:23     ` Lorenzo Stoakes
2025-08-12 13:44       ` Steven Rostedt
2025-08-05 18:34     ` James Bottomley
2025-08-05 18:55       ` Lorenzo Stoakes
2025-08-12 13:50       ` Steven Rostedt
2025-08-05 18:39     ` Sasha Levin
2025-08-05 19:15       ` Lorenzo Stoakes
2025-08-05 20:02         ` James Bottomley
2025-08-05 20:48           ` Al Viro
2025-08-06 19:26           ` Lorenzo Stoakes
2025-08-07 12:25             ` Mark Brown
2025-08-07 13:00               ` Lorenzo Stoakes
2025-08-11 21:26                 ` Luis Chamberlain
2025-08-12 14:19                 ` Steven Rostedt
2025-08-06  4:04       ` Alexey Dobriyan
2025-08-06 20:36         ` Sasha Levin
2025-08-05 21:58   ` Jiri Kosina
2025-08-06  6:58     ` Hannes Reinecke
2025-08-06 19:36       ` Lorenzo Stoakes
2025-08-06 19:35     ` Lorenzo Stoakes
2025-08-05 18:10 ` H. Peter Anvin
2025-08-05 18:19   ` Lorenzo Stoakes
2025-08-06  5:49   ` Julia Lawall
2025-08-06  9:25     ` Dan Carpenter
2025-08-06  9:39       ` Julia Lawall
2025-08-06 19:30       ` Lorenzo Stoakes
2025-08-12 14:37         ` Steven Rostedt
2025-08-12 15:02           ` Sasha Levin
2025-08-12 15:24             ` Paul E. McKenney
2025-08-12 15:25               ` Sasha Levin
2025-08-12 15:28                 ` Paul E. McKenney
2025-12-08  1:12 ` Sasha Levin
2025-12-08  1:25   ` H. Peter Anvin
2025-12-08  1:59     ` Jonathan Corbet
2025-12-08  3:15       ` Steven Rostedt
2025-12-08  3:42         ` James Bottomley
2025-12-08  8:41           ` Mauro Carvalho Chehab [this message]
2025-12-08  9:16             ` James Bottomley
2025-12-08 10:22               ` Mauro Carvalho Chehab
2025-12-08  4:15   ` Laurent Pinchart
2025-12-08  4:31     ` Jonathan Corbet
2025-12-08  4:36       ` Laurent Pinchart
2025-12-08  7:00   ` Jiri Kosina
2025-12-08  7:38     ` James Bottomley
