From: Kalesh Singh <kaleshsingh@google.com>
Date: Mon, 24 Feb 2025 13:55:05 -0800
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
To: Lorenzo Stoakes
Cc: Jan Kara, lsf-pc@lists.linux-foundation.org, "open list:MEMORY MANAGEMENT",
    linux-fsdevel, Suren Baghdasaryan, David Hildenbrand, "Liam R. Howlett",
    Juan Yescas, android-mm, Matthew Wilcox, Vlastimil Babka, Michal Hocko,
    "Cc: Android Kernel", david@fromorbit.com

On Mon, Feb 24, 2025 at 1:36 PM Kalesh Singh wrote:
>
> On Mon, Feb 24, 2025 at 8:52 AM Lorenzo Stoakes wrote:
> >
> > On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> > > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > > > Hello!
> > > > >
> > > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > > > Problem Statement
> > > > > > =================
> > > > > >
> > > > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > > > regions that are never accessed. Current mechanisms to disable
> > > > > > readahead lack granularity and rather operate at the file or VMA
> > > > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > > > potential solutions for optimizing page cache/readahead behavior.
> > > > > >
> > > > > > Background
> > > > > > ==========
> > > > > >
> > > > > > The read-ahead heuristics on file-backed memory mappings can
> > > > > > inadvertently populate the page cache with pages corresponding to
> > > > > > regions that user-space processes are known never to access, e.g. ELF
> > > > > > LOAD segment padding regions. While these pages are ultimately
> > > > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > > > particularly when a substantial quantity of such regions exists.
> > > > > >
> > > > > > Although the underlying file can be made sparse in these regions to
> > > > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > > > populating the page cache within these ranges. These pages, while
> > > > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > > > reclaim overhead is further exacerbated in filesystems that support
> > > > > > "fault-around" semantics, which can populate the surrounding pages'
> > > > > > PTEs if found present in the page cache.
> > > > > >
> > > > > > While the memory impact may be negligible for large files containing a
> > > > > > limited number of sparse regions, it becomes appreciable for many
> > > > > > small mappings characterized by numerous holes. This scenario can
> > > > > > arise from efforts to minimize vm_area_struct slab memory footprint.
>
> Hi Jan, Lorenzo, thanks for the comments.
>
> > > > > OK, I agree the behavior you describe exists. But do you have some
> > > > > real-world numbers showing its extent? I'm not looking for some
> > > > > artificial numbers - sure, bad cases can be constructed - but how big a
> > > > > practical problem is this? If you can show that the average Android
> > > > > phone has 10% of these useless pages in memory then that's one thing and
> > > > > we should be looking for some general solution. If it is more like 0.1%,
> > > > > then why bother?
>
> Once I revert a workaround that we currently have to avoid fault-around
> for these regions (we don't have an out-of-tree solution to prevent the
> page cache population), our CI, which checks memory usage after performing
> some common app user journeys, reports regressions as shown in the snippet
> below. Note that the increases here are only for the populated PTEs
> (bounded by the VMA), so the actual pollution is theoretically larger.
>
> Metric: perfetto_media.extractor#file-rss-avg
> Increased by 7.495 MB (32.7%)
>
> Metric: perfetto_/system/bin/audioserver#file-rss-avg
> Increased by 6.262 MB (29.8%)
>
> Metric: perfetto_/system/bin/mediaserver#file-rss-max
> Increased by 8.325 MB (28.0%)
>
> Metric: perfetto_/system/bin/mediaserver#file-rss-avg
> Increased by 8.198 MB (28.4%)
>
> Metric: perfetto_media.extractor#file-rss-max
> Increased by 7.95 MB (33.6%)
>
> Metric: perfetto_/system/bin/incidentd#file-rss-avg
> Increased by 0.896 MB (20.4%)
>
> Metric: perfetto_/system/bin/audioserver#file-rss-max
> Increased by 6.883 MB (31.9%)
>
> Metric: perfetto_media.swcodec#file-rss-max
> Increased by 7.236 MB (34.9%)
>
> Metric: perfetto_/system/bin/incidentd#file-rss-max
> Increased by 1.003 MB (22.7%)
>
> Metric: perfetto_/system/bin/cameraserver#file-rss-avg
> Increased by 6.946 MB (34.2%)
>
> Metric: perfetto_/system/bin/cameraserver#file-rss-max
> Increased by 7.205 MB (33.8%)
>
> Metric: perfetto_com.android.nfc#file-rss-max
> Increased by 8.525 MB (9.8%)
>
> Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
> Increased by 3.715 MB (3.6%)
>
> Metric: perfetto_media.swcodec#file-rss-avg
> Increased by 5.096 MB (27.1%)
>
> [...]
>
> The issue is widespread across processes because, in order to support
> larger page sizes, Android requires that ELF segments are at least 16 KB
> aligned, which leads to the padding regions (never accessed).
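For a concrete picture of where such padding comes from, here is a minimal
sketch (an illustration under assumptions, not code from this thread) that
walks a 64-bit ELF's program headers and reports, for each PT_LOAD segment,
the gap between the end of its file content and the next 16 KB boundary.
The exact layout produced by Android's toolchain may differ; the 16 KB
constant is only the alignment requirement mentioned above.

```c
/* Sketch: list per-LOAD-segment padding implied by 16 KB alignment.
 * Illustrative only; assumes a well-formed 64-bit ELF. */
#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SEG_ALIGN 0x4000UL              /* 16 KB alignment requirement */

int main(int argc, char **argv)
{
	if (argc != 2)
		return 1;
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	Elf64_Ehdr eh;
	if (pread(fd, &eh, sizeof(eh), 0) != (ssize_t)sizeof(eh))
		return 1;

	for (int i = 0; i < eh.e_phnum; i++) {
		Elf64_Phdr ph;
		off_t off = eh.e_phoff + (off_t)i * eh.e_phentsize;

		if (pread(fd, &ph, sizeof(ph), off) != (ssize_t)sizeof(ph))
			return 1;
		if (ph.p_type != PT_LOAD)
			continue;

		/* File bytes for this segment end here ... */
		unsigned long end = ph.p_offset + ph.p_filesz;
		/* ... but the mapping extends to the next 16 KB boundary. */
		unsigned long aligned_end = (end + SEG_ALIGN - 1) & ~(SEG_ALIGN - 1);

		if (aligned_end > end)
			printf("LOAD[%d]: %lu bytes of padding at file offset %lu\n",
			       i, aligned_end - end, end);
	}
	close(fd);
	return 0;
}
```

Those padding byte ranges are exactly the regions readahead and fault-around
may populate even though the process never touches them.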
>
> > > > > > Limitations of Existing Mechanisms
> > > > > > ==================================
> > > > > >
> > > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > > > entire file, rather than specific sub-regions. The offset and length
> > > > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > > >
> > > > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies to the entire
> > > > > > VMA, rather than specific sub-regions. [3]
> > > > > >
> > > > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > > > population persists. [4]
> > > > >
> > > > > Somewhere else in the thread you complain about readahead extending past
> > > > > the VMA. That's relatively easy to avoid, at least for readahead
> > > > > triggered from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > > > do_sync_mmap_readahead()). I agree we could do that and it seems like a
> > > > > relatively uncontroversial change. Note that if someone accesses the
> > > > > file through a standard read(2) or write(2) syscall or through a
> > > > > different memory mapping, the limits won't apply, but such combinations
> > > > > of access are not that common anyway.
> > > >
> > > > Hm, I'm not so sure - map ELF files with different mprotect(), or
> > > > mprotect() different portions of a file, and suddenly you lose all the
> > > > readahead for the rest even though you're reading sequentially?
> > >
> > > Well, you wouldn't lose all readahead for the rest. Readahead just won't
> > > preread data underlying the next VMA, so yes, you get a cache miss and
> > > have to wait for a page to get loaded into the cache when transitioning
> > > to the next VMA, but once you get there, you'll have readahead running at
> > > full speed again.
> >
> > I'm aware of how readahead works (I _believe_ there's currently a
> > pre-release of a book with a very extensive section on readahead written
> > by somebody :P).
> >
> > I've also been looking at it for file-backed guard regions recently, which
> > is why I've been commenting here specifically as it's been on my mind
> > lately, and also Kalesh's interest in this stems from a guard region
> > 'scenario' (hence my cc).
> >
> > Anyway, perhaps I didn't phrase this well - my concern is whether this
> > might impact performance in real-world scenarios, such as one where a VMA
> > is mapped then mprotect()'d or mmap()'d in parts, causing _separate VMAs_
> > of the same file, in sequential order.
> >
> > From Kalesh's LPC talk, unless I misinterpreted what he said, this is
> > precisely what he's doing? I mean, we'd not be talking here about mmap()
> > behaviour with readahead otherwise.
> >
> > Granted, perhaps you'd only _ever_ be reading sequentially within a
> > specific VMA's boundaries, rather than going from one to another
> > (excluding PROT_NONE guards obviously), and that's very possible, if
> > that's what you mean.
> >
> > But otherwise, surely this is a thing? And might we therefore be imposing
> > unnecessary cache misses?
> >
> > Which is why I suggest...
> >
> > > So yes, sequential read of a memory mapping of a file fragmented into
> > > many VMAs will be somewhat slower. My impression is such use is rare
> > > (sequential readers tend to use read(2) rather than mmap) but I could be
> > > wrong.
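To make the "file fragmented into many VMAs" scenario concrete, a minimal
sketch (illustrative only; the file path and sizes are placeholders) showing
that an mprotect() over part of a single file mapping splits it into separate
VMAs, which is the granularity per-VMA readahead and fault-around decisions
then apply to:

```c
/* Sketch: one mmap() of a file, then mprotect() on a sub-range, leaves
 * three VMAs behind (visible as three lines in /proc/self/maps).
 * Illustrative only; "/tmp/some-file" is a placeholder path. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/tmp/some-file", O_RDONLY);
	if (fd < 0)
		return 1;

	/* Map 16 pages of the file read-only: a single VMA. */
	char *p = mmap(NULL, 16 * psz, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Change protection on pages 4..7 only: the kernel must split the
	 * original VMA into three (before / changed / after). */
	if (mprotect(p + 4 * psz, 4 * psz, PROT_NONE))
		return 1;

	/* Dump the relevant mappings so the three VMAs are visible. */
	char cmd[64];
	snprintf(cmd, sizeof(cmd), "grep some-file /proc/%d/maps", getpid());
	system(cmd);
	return 0;
}
```

A sequential reader faulting through such a mapping crosses two VMA
boundaries, which is where the per-VMA readahead limits under discussion
would cost extra cache misses.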
> > >
> > > > What about shared libraries with r/o parts and exec parts?
> > > >
> > > > I think we'd really need to do some pretty careful checking to ensure
> > > > this wouldn't break some real-world use cases, esp. if we really do
> > > > mostly readahead data from page cache.
> > >
> > > So I'm not sure if you are not conflating two things here, because the
> > > above sentence doesn't make sense to me :). Readahead is the mechanism
> > > that brings data from the underlying filesystem into the page cache.
> > > Fault-around is the mechanism that maps into the page tables pages
> > > already present in the page cache, even though they were not necessarily
> > > requested by the page fault. By "do mostly readahead data from page
> > > cache" are you speaking about fault-around? That currently does not cross
> > > VMA boundaries anyway, as far as I'm reading do_fault_around()...
> >
> > ...that we test this and see how it behaves :) Which is literally all I
> > am saying in the above. Ideally with representative workloads.
> >
> > I mean, I think this shouldn't be a controversial point, right? Perhaps
> > again I didn't communicate this well. But this is all I mean here.
> >
> > BTW, I understand the difference between readahead and fault-around; you
> > can run git blame on do_fault_around() if you have doubts about that ;)
> >
> > And yes, fault-around is constrained to the VMA (and actually avoids
> > crossing PTE boundaries).
> > >
> > > > > Regarding controlling readahead for various portions of the file -
> > > > > I'm skeptical. In my opinion it would require too much bookkeeping on
> > > > > the kernel side for such a niche use case (but maybe your numbers will
> > > > > show it isn't such a niche as I think :)). I can imagine you could
> > > > > just completely turn off kernel readahead for the file and do your
> > > > > special readahead from userspace - I think you could use either
> > > > > userfaultfd for triggering it or the new fanotify FAN_PREACCESS
> > > > > events.
>
> Something like this would be ideal for the use case where uncompressed
> ELF files are mapped directly from zipped APKs without extracting them.
> (I don't have any real-world numbers for this case atm.) I also don't
> know if the cache miss on the subsequent VMAs has significant overhead
> in practice ... I'll try to collect some data for this.
>
> > > > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > > > says, he is too!) I don't really see how we could avoid having to do
> > > > that for this kind of case, but I may be missing something...
> > >
> > > I don't see why we would need to be increasing the number of VMAs here
> > > at all. With FAN_PREACCESS you get a notification with file & offset when
> > > it's accessed, and you can issue readahead(2) calls based on that however
> > > you like. Similarly, you can ask for userfaults for the whole mapped
> > > range and handle those. Now, thinking more about this, this approach has
> > > the downside that you cannot implement async readahead with it (once a
> > > PTE is mapped to some page it won't trigger notifications either with
> > > FAN_PREACCESS or with UFFD). But with UFFD you could at least trigger
> > > readahead on minor faults.
> >
> > Yeah, we're talking past each other on this, sorry I missed your point
> > about fanotify there!
> >
> > uffd is probably not reasonably workable given overhead, I would have
> > thought.
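The userspace-driven alternative Jan sketches above rests on two
long-standing interfaces: posix_fadvise(POSIX_FADV_RANDOM) to suppress
kernel readahead on the file, and readahead(2) to populate the page cache
for ranges the application knows it will touch. A minimal sketch under
those assumptions follows; the notification side (fanotify pre-content
events or userfaultfd) is deliberately left out, since its exact shape is
what is being debated, and the path and range are placeholders.

```c
/* Sketch: disable kernel readahead on a file and issue explicit
 * readahead(2) calls for ranges the application knows it will access.
 * Illustrative only; how "will access" ranges are discovered (fanotify
 * pre-content events, userfaultfd, app hints) is out of scope here. */
#define _GNU_SOURCE            /* for readahead(2) */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t prefetch_range(int fd, off64_t off, size_t len)
{
	/* readahead() schedules reads of [off, off+len) into the page cache
	 * in the background; it may still block on filesystem metadata. */
	return readahead(fd, off, len);
}

int main(void)
{
	int fd = open("/tmp/some-file", O_RDONLY);   /* placeholder path */
	if (fd < 0)
		return 1;

	/* Ask the kernel not to do its own readahead for this file. */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) != 0) {
		fprintf(stderr, "posix_fadvise failed\n");
		return 1;
	}

	/* Prefetch only the regions known to be used, e.g. the file content
	 * of each ELF LOAD segment but not its padding (placeholder range). */
	if (prefetch_range(fd, 0, 64 * 1024) < 0)
		perror("readahead");

	close(fd);
	return 0;
}
```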
> >
> > I am really unaware of how fanotify works, so I mean cool if you can find
> > a solution this way, awesome :)
> >
> > I'm just saying, if we need to somehow retain state about regions which
> > should have adjusted readahead behaviour at a VMA level, I can't see how
> > this could be done without VMA fragmentation, and I'd rather we didn't.
> >
> > If we can avoid that, great!
>
> Another possible way we can look at this: in the regressions shared above,
> caused by the ELF padding regions, we are able to make these regions
> sparse (for *almost* all cases) -- solving the shared zero-page problem
> for file mappings would also eliminate much of this overhead. So perhaps
> we should tackle that angle, if it's a more tangible solution?
>
> From the previous discussions that Matthew shared [7], it seems like Dave
> proposed, as an alternative to moving the extents to the VFS layer,
> inverting the IO read path operations [8]. Maybe this is a more
> approachable solution, since there is precedent for the same in the write
> path?
>
> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/

+ cc: Dave Chinner

> Thanks,
> Kalesh
>
> > >
> > >                                                                 Honza
> > > --
> > > Jan Kara
> > > SUSE Labs, CR
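As a footnote to the "make the padding sparse" angle above: hole layout in
such a file can be created with fallocate(2) FALLOC_FL_PUNCH_HOLE and
inspected with lseek(2) SEEK_DATA/SEEK_HOLE. A minimal sketch (illustrative
only; the path and offsets are placeholders, and hole punching depends on
filesystem support):

```c
/* Sketch: punch a hole over a known padding range, then walk the file's
 * data extents with SEEK_DATA/SEEK_HOLE to confirm the layout.
 * Illustrative only; offsets and path are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/some-file", O_RDWR);   /* placeholder path */
	if (fd < 0)
		return 1;

	/* Deallocate a padding range so reads of it return zeroes without I/O. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      64 * 1024, 16 * 1024))
		perror("fallocate");

	/* Walk the data extents; everything in between is a hole. */
	off_t end = lseek(fd, 0, SEEK_END);
	off_t pos = 0;
	while (pos < end) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0)
			break;                  /* only a trailing hole remains */
		off_t hole = lseek(fd, data, SEEK_HOLE);
		printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
		pos = hole;
	}
	close(fd);
	return 0;
}
```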