Re: Toy/demo: using ChatGPT to summarize lengthy LKML threads (b4 integration)

From: "Theodore Ts'o" <tytso@mit.edu>
To: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bvanassche@acm.org>,
	Konstantin Ryabitsev <konstantin@linuxfoundation.org>,
	users@kernel.org, tools@kernel.org, workflows@vger.kernel.org
Subject: Re: Toy/demo: using ChatGPT to summarize lengthy LKML threads (b4 integration)
Date: Thu, 29 Feb 2024 02:37:32 -0600
Message-ID: <20240229083732.GB272762@mit.edu>
In-Reply-To: <5758922f-a11a-4bbe-88a4-b724f53b2e6f@suse.de>

On Thu, Feb 29, 2024 at 08:18:43AM +0100, Hannes Reinecke wrote:
> On 2/28/24 19:55, Bart Van Assche wrote:
> > On 2/27/24 14:32, Konstantin Ryabitsev wrote:
> > Please do not publish the summaries generated by ChatGPT on the web. If
> > these summaries would be published on the world wide web, ChatGPT or
> > other LLMs probably would use these summaries as input data. If there
> > would be any mistakes in these summaries, then these mistakes would end
> > up being used as input data by multiple LLMs.
> > 
> Now there's a thought. Maybe we should do exactly the opposite, and posting
> _more_ ChatGPT generated content on the web?
> Sending them into a deadly self-enforcing feedback loop?

Well, I'll note that last July, when a number of AI companies,
including Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and
OpenAI, met with President Biden at the White House, they made a
commitment to develop watermarking standards to allow AI-generated
content to be detected[1].  Obviously, it's a lot easier to do this
with images, and Google was the first company to release a
watermarking system for AI-generated images[2].  However, there is
ongoing research on how to add watermarking to text[3].

[1] https://www.whitehouse.gov/briefing-room/statements-releases/2023/07/21/fact-sheet-biden-harris-administration-secures-voluntary-commitments-from-leading-artificial-intelligence-companies-to-manage-the-risks-posed-by-ai/
[2] https://www.technologyreview.com/2023/08/29/1078620/google-deepmind-has-launched-a-watermarking-tool-for-ai-generated-images/
[3] https://www.nytimes.com/interactive/2023/02/17/business/ai-text-detection.html

I doubt whether anything we do is going to make a huge difference; one
of the largest uses of OpenAI's ChatGPT is to generate text to enable
Search Engine Optimization spam[4].  Another major use of LLMs is to
lay off journalists by creating text explaining why a particular stock
went up by X% when the market went up or down by Y%.  After all, why
have a human making up stories explaining stock moves, when
you can have an AI model hallucinate them instead?  :-)

[4] https://www.opace.co.uk/blog/blog/how-openai-gpt-3-enhances-ai-chat-text-generation-for-seo

The bottom line is that there is a vast amount of AI-generated text
that has been put out on the web *already*.  This is going to be
poisoning future LLM training, even before we start generating
summaries of LKML traffic and making them available on the web.  It
also means that companies who are doing AI work have a large, vested
interest in developing standardized ways of watermarking AI-generated
content, not just because they made a promise to some politicians,
but because, if all the companies can use some common watermarking
standard, hopefully they can all avoid this self-poisoning feedback loop.

Cheers,

					- Ted

Thread overview: 23+ messages
2024-02-27 22:32 Konstantin Ryabitsev
2024-02-27 23:35 ` Junio C Hamano
2024-02-28  0:43 ` Linus Torvalds
2024-02-28 20:46   ` Shuah Khan
2024-02-29  0:33   ` James Bottomley
2024-02-28  5:00 ` Willy Tarreau
2024-02-28 14:03   ` Mark Brown
2024-02-28 14:39     ` Willy Tarreau
2024-02-28 15:22     ` Konstantin Ryabitsev
2024-02-28 15:29       ` Willy Tarreau
2024-02-28 17:52         ` Konstantin Ryabitsev
2024-02-28 17:58           ` Willy Tarreau
2024-02-28 19:16             ` Konstantin Ryabitsev
2024-02-28 15:04   ` Hannes Reinecke
2024-02-28 15:15     ` Willy Tarreau
2024-02-28 17:43     ` Jonathan Corbet
2024-02-28 18:52       ` Alex Elder
2024-02-28 18:55 ` Bart Van Assche
2024-02-29  7:18   ` Hannes Reinecke
2024-02-29  8:37     ` Theodore Ts'o [this message]
2024-03-01  1:13     ` Bart Van Assche
2024-02-29  9:30   ` James Bottomley
2024-02-28 19:32 ` Luis Chamberlain
