Date: Sat, 24 Feb 2024 18:13:12 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Linus Torvalds
Cc: Luis Chamberlain, lsf-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox wrote:
> >
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > > What are the limits to buffered IO and how do we test that? Who
> > > keeps track of it?
> >
> > TLDR: Why does the pagecache suck?
>
> What? No.
>
> Our page cache is so good that the question is literally "what are the
> limits of it", and "how we would measure them".
>
> That's not a sign of suckage.

For a lot of things our pagecache is amazing. And in other ways it
absolutely sucks. The trick will be fixing those less-used glass jaws
without damaging the common cases.

> When you have to have completely unrealistic loads that nobody would
> actually care about in reality just to get a number for the limit,
> it's not a sign of problems.

No, but sometimes the unrealistic loads are, alas, a good proxy for
problems that customers hit. For example, I have one where the customer
does an overnight backup with a shitty backup program that doesn't use
O_DIRECT and ends up evicting the actual working set from the page
cache. They start work the next morning with terrible performance
because everything they care about has been swapped out. The "fix" is
to limit the pagecache to one NUMA node, roughly along the lines of the
sketch below.

I suspect that if this customer could be persuaded to run a more recent
kernel, this problem would turn out to be solved already, so I'm not
sure there's a call to action from this particular case. Anyway, if
there's a way to fix an unrealistic load that doesn't affect realistic
loads, sometimes we fix a customer problem too.
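(What that workaround amounts to, very roughly: run the backup under a
bind mempolicy so the page cache it populates is allocated from a single
node. This is only a sketch of the idea, not the customer's actual
tooling; the node number and the wrapper itself are invented, and the
real deployment may well do it with cpusets rather than a per-task
policy.)

/* confine-backup.c: run a command with its memory allocations, including
 * the page cache folios it populates, bound to NUMA node 0.
 * Build with: gcc -o confine-backup confine-backup.c -lnuma
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long nodemask = 1UL << 0;	/* node 0 only */

	if (argc < 2)
		return 1;
	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask))) {
		perror("set_mempolicy");
		return 1;
	}
	/* the policy is preserved across execve(), so the backup inherits it */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}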
> Guess what? It's because the CPU in question had quite a bit of L3,
> and it was spread out, and the CPU doesn't even start the memory
> access before it has checked caches.
>
> And here's a big honking clue: only a complete nincompoop and mentally
> deficient rodent would look at that and say "caches suck".

although the problem might be that the CPU has a terrible cross-chiplet
interconnect ;-)

> > > ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount
> > > on x86_64
> > > Vs
> > > ~ 7,000 MiB/s with buffered IO
> >
> > Profile? My guess is that you're bottlenecked on the xa_lock between
> > memory reclaim removing folios from the page cache and the various
> > threads adding folios to the page cache.
>
> I doubt it's the locking.

It might not be! But there are read-only workloads that do bottleneck
on the xa_lock.
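The kind of thing I mean is a few dozen threads doing buffered random
reads of a file that doesn't come close to fitting in memory, so reclaim
is pulling folios out of the mapping at the same rate the readers are
inserting them. A minimal reproducer sketch (the path, file size and
thread count here are invented, and you need a file that big to already
exist):

/* xalock-repro.c: hammer a single mapping with many concurrent buffered
 * readers. Build with: gcc -O2 -pthread -o xalock-repro xalock-repro.c
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS	64
#define BUFSZ		(64 * 1024)
#define FILESIZE	(512ULL << 30)		/* far larger than RAM */

static int fd;

static void *reader(void *arg)
{
	unsigned int seed = (uintptr_t)arg;
	char *buf = malloc(BUFSZ);
	uint64_t off;

	for (;;) {
		/* random, BUFSZ-aligned offsets across the whole file */
		off = ((uint64_t)rand_r(&seed) * BUFSZ) % FILESIZE;
		pread(fd, buf, BUFSZ, off);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	uintptr_t i;

	fd = open("/mnt/test/bigfile", O_RDONLY);	/* invented path */
	if (fd < 0)
		return 1;
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, reader, (void *)(i + 1));
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

That's the shape of workload I'm talking about: in steady state both the
readers inserting folios and reclaim removing them have to take the same
per-mapping i_pages lock.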
> For writeout, we have a very traditional problem: we care about a
> million times more about latency than we care about throughput,
> because nobody ever actually cares all that much about performance of
> huge writes.
>
> Ask yourself when you have last *really* sat there waiting for writes,
> unless it's some dog-slow USB device that writes at 100kB/s?

You picked a bad day to send this email ;-)

$ sudo dd if=Downloads/debian-testing-amd64-netinst.iso of=/dev/sda
[sudo] password for willy:
1366016+0 records in
1366016+0 records out
699400192 bytes (699 MB, 667 MiB) copied, 296.219 s, 2.4 MB/s

ok, that was a cheap-arse USB stick, but then I had to wait for the
installer to write 800GB of random data to /dev/nvme0n1p3 as it set up
the crypto drive. The Debian installer didn't time that for me, but it
was enough time to vacuum the couch.

> Now, the benchmark that Luis highlighted is a completely different
> class of historical problems that has been around forever, namely the
> "fill up lots of memory with dirty data".
>
> And there - because the problem is easy to trigger but nobody tends to
> care deeply about throughput because they care much much *MUCH* more
> about latency, we have a rather stupid big hammer approach.
>
> It's called "vm_dirty_bytes".
>
> Well, that's the knob (not the only one). The actual logic around it
> is then quite the morass of turning that into the
> dirty_throttle_control, and the per-bdi dirty limits that try to take
> the throughput of the backing device into account etc etc.
>
> And then all those heuristics are used to actually LITERALLY PAUSE the
> writer. We literally have this code:
>
>         __set_current_state(TASK_KILLABLE);
>         bdi->last_bdp_sleep = jiffies;
>         io_schedule_timeout(pause);
>
> in balance_dirty_pages(), which is all about saying "I'm putting you
> to sleep, because I judge you to have dirtied so much memory that
> you're making things worse for others".
>
> And a lot of *that* is then because we haven't wanted everybody to
> rush in and start their own synchronous writeback, but instead want
> all writeback to be done by somebody else. So now we move from
> mm/page-writeback.c to fs/fs-writeback.c, and all the work-queues to
> do dirty writeout.
>
> Notice how the io_schedule_timeout() above doesn't even get woken up
> by IO completing. Nope. The "you have written too much" logic
> literally pauses the writer, and doesn't even want to wake it up when
> there is no more dirty data.
>
> So the "you went over the dirty limits" is a penalty box, and all of
> this comes from "you are doing something that is abnormal and that
> disturbs other people, so you get an unconditional penalty". Yes, the
> timeout is then obviously tied to how much of a problem the dirtying
> is (based on that whole "how fast is the device") but it's purely a
> heuristic.
>
> And (one) important part here is "nobody sane does that". So
> benchmarking this is a bit crazy. The code is literally meant for bad
> actors, and what you are benchmarking is the kernel telling you "don't
> do that then".

This was a really good writeup, thanks. I think this might need some
tuning (if it is what's going on). When the underlying device can do
86GB/s and we're only getting 7GB/s, we could afford to let this writer
do a bit more.
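For concreteness, the class of writer we're talking about is basically
the below. This is just an illustrative sketch, not Luis's actual
benchmark; the path and sizes are invented. A task like this quickly
blows through the dirty limits and from then on is paced by those
io_schedule_timeout() pauses (if that is indeed what's limiting Luis's
numbers):

/* dirty-filler.c: single buffered writer dirtying far more data than the
 * dirty limits allow, so it lands in the balance_dirty_pages() penalty
 * box. Build with: gcc -O2 -o dirty-filler dirty-filler.c
 */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK	(1UL << 20)		/* 1MiB buffered writes */
#define TOTAL	(400ULL << 30)		/* 400GiB, way past any dirty limit */

int main(void)
{
	unsigned long long written;
	char *buf = malloc(CHUNK);
	int fd;

	if (!buf)
		return 1;
	memset(buf, 0x5a, CHUNK);
	fd = open("/mnt/test/dirty-filler", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return 1;
	for (written = 0; written < TOTAL; written += CHUNK)
		if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
			return 1;
	return 0;
}

Nothing about it is exotic; it's just write(2) in a loop, which is
rather the point.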