From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 30 Jan 2024 16:29:35 +1100
From: Dave Chinner <david@fromorbit.com>
To: Ming Lei
Cc: Mike Snitzer, Matthew Wilcox, Andrew Morton, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Don Dutile,
 Raghavendra K T, Alexander Viro, Christian Brauner, linux-block@vger.kernel.org
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Tue, Jan 30, 2024 at 11:13:50AM +0800, Ming Lei wrote:
> On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > >
> > > > > Follows the current report:
> > > > >
> > > > > 1) userspace calls madvise(willneed, 1G)
> > > > >
> > > > > 2) only the 1st part (size is from bdi->io_pages; suppose it is 2MB)
> > > > > is read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > > >
> > > > > 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> > > > > which is set as 64KB by userspace, when userspace reads the mmaped
> > > > > buffer, so the whole application becomes slower.
> > > >
> > > > It gets limited by file->f_ra.ra_pages being initialised to
> > > > bdi->ra_pages and then never changed as the advice for access
> > > > methods to the file is changed.
> > > >
> > > > But the problem here is *not the readahead code*. The problem is
> > > > that the user has configured the device readahead window to be far
> > > > smaller than is optimal for the storage. Hence readahead is slow.
> > > > The fix for that is to either increase the device readahead window,
> > > > or to change the specific readahead window for the file that has
> > > > sequential access patterns.
> > > >
> > > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > > larger IOs for that access.
> > > >
> > > > That's what should happen here - MADV_WILLNEED does not imply a
> > > > specific access pattern, so the application should be running
> > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > > to start the readahead, and then the rest of the on-demand readahead
> > > > will get the higher readahead limits.
> > > >
> > > > > This patch changes 3) to use bdi->io_pages as the readahead unit.
> > > >
> > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > > mem.load() implementation in the application converted to use
> > > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > > readahead algorithm.
> > >
> > > Here the single .ra_pages may not work, that is why this patch stores
> > > the willneed range in a maple tree. Please see the following words from
> > > the original RH report:
> > >
> > > "
> > > Increasing read ahead is not an option as it has a mixed I/O workload
> > > of random I/O and sequential I/O, so that a large read ahead is very
> > > counterproductive to the random I/O and is unacceptable.
> > > "
> >
> > Yes, I've read the bug. There's no triage that tells us what the
> > root cause of the application performance issue might be. Just an
> > assertion that "this is how we did it 10 years ago, it's been
> > unchanged for all this time, the new kernel we are upgrading
> > to needs to behave exactly like pre-3.10 era kernels did".
> >
> > And to be totally honest, my instincts tell me this is more likely a
> > problem with a root cause in poor IO scheduling decisions than a
> > problem with the page cache readahead implementation. Readahead has
> > been turned down to stop the bandwidth it uses via background async
> > read IO from starving latency-dependent foreground random IO
> > operations, and then we're being asked to turn readahead back up
> > in specific situations because it's actually needed for performance
> > in certain access patterns. This is the sort of thing bfq is
> > intended to solve.
>
> Reading an mmaped buffer in userspace is sync IO, and the page fault
> just reads ahead 64KB. I don't understand how the block IO scheduler
> makes a difference for this single 64KB readahead in the case of a
> cache miss.

I think you've misunderstood what I said.
I was referring to the original customer problem of "too much readahead
IO causes problems for latency sensitive IO" - the issue that led to the
customer setting 64kB readahead device limits in the first place.

That is, if reducing readahead for sequential IO suddenly makes
synchronous random IO perform a whole lot better and the application
goes faster, then it indicates the problem is IO dispatch
prioritisation, not that there is too much readahead. Deprioritising
readahead will reduce its impact on other IO, without having to reduce
the readahead windows that provide decent sequential IO performance...

I really think the customer needs to retune their application from
first principles. Start with the defaults, measure where things are
slow, and address the worst issue by twiddling knobs. Repeat until
performance is either good enough or they hit actual problems that
need code changes.

> > > It is even worse for the readahead() syscall:
> > >
> > > ```
> > > DESCRIPTION
> > > readahead() initiates readahead on a file so that subsequent
> > > reads from that file will be satisfied from the cache, and not
> > > block on disk I/O (assuming the readahead was initiated early
> > > enough and that other activity on the system did not in the
> > > meantime flush pages from the cache).
> > > ```
> >
> > Yes, that's been "broken" for a long time (since the changes to cap
> > force_page_cache_readahead() to ra_pages way back when), but the
> > assumption documented about when readahead(2) will work goes to the
> > heart of why we don't let user-controlled readahead actually do much
> > in the way of direct readahead. i.e. too much readahead is typically
> > harmful to IO and system performance, and very, very few applications
> > actually need files preloaded entirely into memory.
>
> It is true for normal readahead, but not sure if it is for
> madvise(willneed) or readahead().
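For reference, the readahead(2) call whose man page is quoted above
takes an fd, an offset and a byte count. A minimal sketch (Linux-only,
needs _GNU_SOURCE; the helper name is made up for illustration):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Ask the kernel to populate the page cache for [off, off+len) of an
 * open file. As discussed above this is only advice: the amount
 * actually read ahead is capped by the readahead window, so there is
 * no guarantee the whole range ends up (or stays) cached. */
int prefetch_file_range(int fd, off_t off, size_t len)
{
	if (readahead(fd, off, len) != 0) {
		perror("readahead");
		return -1;
	}
	return 0;
}
```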
If we allowed unbound readahead via WILLNEED or readahead(2), then a
user could DOS the storage and/or the memory allocation subsystem very
easily. In a previous attempt to revert the current WILLNEED readahead
bounding behaviour, Linus said this:

"It's just that historically we've had some issues with people
over-doing readahead (because it often helps some made-up
microbenchmark), and then we end up with latency issues when somebody
does a multi-gigabyte readahead...

Iirc, we had exactly that problem with the readahead() system call at
some point (long ago)."

https://lore.kernel.org/linux-mm/CA+55aFy8kOomnL-C5GwSpHTn+g5R7dY78C9=h-J_Rb_u=iASpg@mail.gmail.com/

Elsewhere, in a different thread about a different patchset trying to
revert this readahead behaviour, Linus ranted about how it allowed
unbound, unkillable, user-controlled readahead for 64-bit data lengths.

Fundamentally, readahead is not functionality we want to expose
directly to user control.

MADV_POPULATE_* is different in that it isn't actually readahead - it
works more like normal sequential user page fault access. It is
interruptible, it can fail due to ENOMEM or OOM-kill, it can fail on
IO errors, etc. IOWs, the MADV_POPULATE functions are what the
application should be using, not trying to hack WILLNEED to do stuff
that MADV_POPULATE* already does in a better way...

> > Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
> > introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
> > has some relevant commentary on why MADV_WILLNEED could not be
> > modified to meet the pre-population requirements of the applications
> > that required this pre-population behaviour from the kernel.
> >
> > With this, I suspect that the application needs to be updated to
> > use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
> > back and do some analysis of the readahead behaviour of the
> > application and the MADV_POPULATE_READ operation.
> > We may need to tweak MADV_POPULATE_READ for large readahead IO, but
> > that's OK because it's no longer "optimistic speculation" about
> > whether the data is needed in cache - the operation being performed
> > guarantees that, or it fails with an error. IOWs, MADV_POPULATE_READ
> > is effectively user data IO at this point, not advice about future
> > access patterns...
>
> BTW, in this report, MADV_WILLNEED is used by a java library[1], and I
> guess it could be difficult to update it to use MADV_POPULATE_READ.

Yes, but that's not an upstream kernel code development problem. That's
a problem for the people paying $$$$$ to their software vendor to sort
out.

-Dave.
-- 
Dave Chinner
david@fromorbit.com