From: Kalesh Singh
Date: Mon, 24 Feb 2025 22:59:31 -0800
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
To: Lorenzo Stoakes
Cc: Jan Kara, lsf-pc@lists.linux-foundation.org,
 "open list:MEMORY MANAGEMENT", linux-fsdevel, Suren Baghdasaryan,
 David Hildenbrand, "Liam R. Howlett", Juan Yescas, android-mm,
 Matthew Wilcox, Vlastimil Babka, Michal Hocko, "Cc: Android Kernel"
In-Reply-To: <4f0e3d28-008f-416a-9900-75b2355f1f66@lucifer.local>
References: <3bd275ed-7951-4a55-9331-560981770d30@lucifer.local>
 <82fbe53b-98c4-4e55-9eeb-5a013596c4c6@lucifer.local>
 <4f0e3d28-008f-416a-9900-75b2355f1f66@lucifer.local>
Howlett" , Juan Yescas , android-mm , Matthew Wilcox , Vlastimil Babka , Michal Hocko , "Cc: Android Kernel" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: ED55D40004 X-Stat-Signature: i6kqw4ba5qhfkqxsrhq5sc9acqemnwi1 X-HE-Tag: 1740466785-936252 X-HE-Meta: U2FsdGVkX1+Bo78fz9QQSe9w3W9wjewcbbe3QMjoAq23KQqvrVKnx/5x9bKRg+meJ9NaGhMxsM4apt6ZD1M/Z55PKnsSiINsKJrCx608ADruRkEKECAsHztltJ+j6qU7/ZwV8oTEY4fP+eQIrG0jTa/65veLOgEv9hQQknGQZJ8ZeVWifgFhXebGHHW9iodT8d7TKULsJHSir+g8pSAv8gKogTabHbgBXT0xPoto+if4TmCaM9dTOpgSpkcH//8YJfgwdFh3bejxDt/II+my0ayQzzYaO2rNlxKVsmeBBf7Z/knGbuDA+exNaUcwHnP5JARsc7Biam8ieLAS/sKwvC7os6CdqlugHjaeLa9LkPFP4V1N2QY8baZ31IfeN5UUBAE5wkGiVAl83MA96TQOgtjUZ7+wPReg5CmPeCMiYzWiCD+Vz4fI1XGX9pw/91dwYXPgQTUVZmk2vqdhirEEZwt2d8r87cpeHB72hiOfWhu5o8IJS34NC9OA2pcisccc2Bfy4Ui6gb66nz3qwGGhIs0lpA/CqRgui8Dyo3R5o8D8d2kREvR6rsYzg2XOIQoqhfGrMgg246g2vV10kW4q1wiVpOJ481puOOiDQSbJDAdcFGxdrLlL7RbadsC7O03F0bGFHtnJryrEIUBT+k4PgG1yXdfkvuRko2v+XkfI/LKkTinzJCHMXB+kMTI2hVoaCu3C5iLmsUajBxAlrjaFUFzVWCUezwZoGPaKHhAqx0IiN2hsV6s4L1+s5FmXem/jnxSGHdzHeqaBLYv94bHKgA8WfBoMd5vzx84e3NuKziqvX5ACqTWeT5Y5dFfVUonw105sjRZ5YGJ6GSPOJRpjb+NLTivR2uv3B+Gu30gm1DmcG9PA3ddD8jurzgGD+Wg6IclRgi/aS1KPKMb0a4wshZ0j8H1uUSaDgBDEyendzNyhROMmDBeclLIflmr1KBRxcy0BinNY8obv2fOxKAf y+uXtYUD 27+u3m7xqn7h34gq7jt2Puc3oa90PKQHK7t5SQWrOQcx9o03WnkRM8SUHGrI200NgwaBHC8ImxgRLuvODtlKTc9oqALT8uUSK2cnoB5Gp5xE+/+kcBiQCwN2/qERYZwVdMmm445nsRzIgFOSAXdiV9/6kS7HDr0sZ92S4oQBrD8LSk0aZgXKCdLLg/w+Fb+BEhzEH1xNraAsBukEXlj3JkvcfSEQfiQig8RfQNNzFhBvufRTJSdDC5ZQIHM6GvPGb/eKY0dFn3pvG+KS+Kb/MA98XFYNWXIJ/S9CRcAtiZKP1VYDNN0QkAX/yguLQ81Fygh69jf/ZJ1AEJwaZVBKU25wumptazwb60ComB1rRWQiisSk= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Feb 24, 2025 at 9:45=E2=80=AFPM Lorenzo Stoakes wrote: > > On Mon, Feb 24, 2025 at 01:36:50PM -0800, Kalesh Singh wrote: > > On Mon, Feb 24, 2025 at 8:52=E2=80=AFAM Lorenzo Stoakes > > wrote: > > > > > > On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote: > > > > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote: > > > > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote: > > > > > > Hello! > > > > > > > > > > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote: > > > > > > > Problem Statement > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > > > > > > > > > Readahead can result in unnecessary page cache pollution for = mapped > > > > > > > regions that are never accessed. Current mechanisms to disabl= e > > > > > > > readahead lack granularity and rather operate at the file or = VMA > > > > > > > level. This proposal seeks to initiate discussion at LSFMM to= explore > > > > > > > potential solutions for optimizing page cache/readahead behav= ior. > > > > > > > > > > > > > > > > > > > > > Background > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > > > > > > > > > The read-ahead heuristics on file-backed memory mappings can > > > > > > > inadvertently populate the page cache with pages correspondin= g to > > > > > > > regions that user-space processes are known never to access e= .g ELF > > > > > > > LOAD segment padding regions. While these pages are ultimatel= y > > > > > > > reclaimable, their presence precipitates unnecessary I/O oper= ations, > > > > > > > particularly when a substantial quantity of such regions exis= ts. 
> > > > > > >
> > > > > > > Although the underlying file can be made sparse in these regions to
> > > > > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > > > > populating the page cache within these ranges. These pages, while
> > > > > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > > > > reclaim overhead is further exacerbated in filesystems that support
> > > > > > > "fault-around" semantics, which can populate the surrounding pages'
> > > > > > > PTEs if found present in the page cache.
> > > > > > >
> > > > > > > While the memory impact may be negligible for large files containing a
> > > > > > > limited number of sparse regions, it becomes appreciable for many
> > > > > > > small mappings characterized by numerous holes. This scenario can
> > > > > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > > > > >
> >
> > Hi Jan, Lorenzo, thanks for the comments.
> >
> > > > > > OK, I agree the behavior you describe exists. But do you have some
> > > > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > > > numbers - sure bad cases can be constructed - but how big a practical problem
> > > > > > is this? If you can show that the average Android phone has 10% of these
> > > > > > useless pages in memory then that's one thing and we should be looking for
> > > > > > some general solution. If it is more like 0.1%, then why bother?
> > > > > >
> >
> > Once I revert a workaround that we currently have to avoid
> > fault-around for these regions (we don't have an out-of-tree solution
> > to prevent the page cache population), our CI, which checks memory
> > usage after performing some common app user journeys, reports
> > regressions as shown in the snippet below. Note that the increases
> > here are only for the populated PTEs (bounded by the VMA), so the actual
> > pollution is theoretically larger.
>
> Hm fault-around populates these duplicate zero pages? I guess it would
> actually. I'd be curious to hear about this out-of-tree patch, and I wonder how
> upstreamable it might be? :)

Let's say it's a hack I'd prefer not to post on the list :) It's very
particular to our use case, so it would be great to find a generic
solution that everyone can benefit from.
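
To make the shape of these regions concrete for anyone skimming: the
padding is simply the gap between where a PT_LOAD segment's file
contents end and the next 16KB-aligned boundary (e.g. when linking
with -Wl,-z,max-page-size=16384). A rough userspace sketch of how those
gaps can be sized is below -- illustrative only, and the alignment
value is an assumption from our configuration rather than something
this thread depends on:

#include <elf.h>
#include <stdio.h>

/*
 * Illustrative only: report the never-accessed padding that readahead
 * and fault-around can end up pulling into the page cache.
 */
static void report_load_padding(const Elf64_Phdr *phdr, size_t phnum,
                                unsigned long align /* 16384 for us */)
{
        for (size_t i = 0; i < phnum; i++) {
                if (phdr[i].p_type != PT_LOAD)
                        continue;

                unsigned long file_end = phdr[i].p_offset + phdr[i].p_filesz;
                unsigned long seg_end  = (file_end + align - 1) & ~(align - 1);

                if (seg_end > file_end)
                        printf("LOAD[%zu]: %lu bytes of padding\n",
                               i, seg_end - file_end);
        }
}

The regressions below are what we see when nothing stops
readahead/fault-around from touching those gaps.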

> >
> > Metric: perfetto_media.extractor#file-rss-avg
> > Increased by 7.495 MB (32.7%)
> >
> > Metric: perfetto_/system/bin/audioserver#file-rss-avg
> > Increased by 6.262 MB (29.8%)
> >
> > Metric: perfetto_/system/bin/mediaserver#file-rss-max
> > Increased by 8.325 MB (28.0%)
> >
> > Metric: perfetto_/system/bin/mediaserver#file-rss-avg
> > Increased by 8.198 MB (28.4%)
> >
> > Metric: perfetto_media.extractor#file-rss-max
> > Increased by 7.95 MB (33.6%)
> >
> > Metric: perfetto_/system/bin/incidentd#file-rss-avg
> > Increased by 0.896 MB (20.4%)
> >
> > Metric: perfetto_/system/bin/audioserver#file-rss-max
> > Increased by 6.883 MB (31.9%)
> >
> > Metric: perfetto_media.swcodec#file-rss-max
> > Increased by 7.236 MB (34.9%)
> >
> > Metric: perfetto_/system/bin/incidentd#file-rss-max
> > Increased by 1.003 MB (22.7%)
> >
> > Metric: perfetto_/system/bin/cameraserver#file-rss-avg
> > Increased by 6.946 MB (34.2%)
> >
> > Metric: perfetto_/system/bin/cameraserver#file-rss-max
> > Increased by 7.205 MB (33.8%)
> >
> > Metric: perfetto_com.android.nfc#file-rss-max
> > Increased by 8.525 MB (9.8%)
> >
> > Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
> > Increased by 3.715 MB (3.6%)
> >
> > Metric: perfetto_media.swcodec#file-rss-avg
> > Increased by 5.096 MB (27.1%)
>
> Yikes yeah.
>
> > [...]
> >
> > The issue is widespread across processes because, in order to support
> > larger page sizes, Android has a requirement that the ELF segments are
> > at least 16KB aligned, which leads to the padding regions (never
> > accessed).
>
> Again I wonder if the _really_ important problem here is this duplicate zero
> page proliferation?

Initially I didn't want to bias the discussion to only working for
sparse files, since there could be never-accessed file-backed regions
that are not necessarily sparse (guard regions?). But for the major
issue / use case in this thread, yes, it suffices to solve the zero
page problem. Perhaps the other issues mentioned can be revisited
separately if/when we have some real-world numbers, as Jan suggested.

> As Matthew points out, fixing this might be quite involved, but this isn't
> pushing back on doing so, it's good to fix things even if it's hard :>)
>
> > > > > > > Limitations of Existing Mechanisms
> > > > > > > ==================================
> > > > > > >
> > > > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > > > > entire file, rather than specific sub-regions. The offset and length
> > > > > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > > > >
> > > > > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > > > > VMA, rather than specific sub-regions. [3]
> > > > > > >
> > > > > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > > > > population persists. [4]
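
(To spell out the granularity point above with something concrete: the
closest per-range knob we have today looks like the sketch below, and
advising just the padding sub-range means the kernel splits the
enclosing VMA around it -- exactly the vm_area_struct proliferation we
are trying to avoid. Illustrative only; the helper and its arguments
are made up.)

#include <stdio.h>
#include <sys/mman.h>

/*
 * Illustrative only: madvise(2)/MADV_RANDOM is real, but pointing it at
 * a padding sub-range trades page cache pollution for a VMA split, so
 * it is not a usable answer at the scale we care about.
 */
static int mark_padding_random(void *pad, size_t len)
{
        if (madvise(pad, len, MADV_RANDOM) != 0) {
                perror("madvise(MADV_RANDOM)");
                return -1;
        }
        return 0;
}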

> > > > > >
> > > > > > Somewhere else in the thread you complain about readahead extending past
> > > > > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > > > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > > > > do_sync_mmap_readahead()). I agree we could do that and that seems like a
> > > > > > relatively uncontroversial change. Note that if someone accesses the file
> > > > > > through the standard read(2) or write(2) syscalls or through a different
> > > > > > memory mapping, the limits won't apply, but such combinations of access
> > > > > > are not that common anyway.
> > > > >
> > > > > Hm I'm not so sure; map ELF files with different mprotect(), or mprotect()
> > > > > different portions of a file, and suddenly you lose all the readahead for the
> > > > > rest even though you're reading sequentially?
> > > >
> > > > Well, you wouldn't lose all readahead for the rest. Just readahead won't
> > > > preread data underlying the next VMA, so yes, you get a cache miss and have
> > > > to wait for a page to get loaded into the cache when transitioning to the next
> > > > VMA, but once you get there, you'll have readahead running at full speed
> > > > again.
> > >
> > > I'm aware of how readahead works (I _believe_ there's currently a
> > > pre-release of a book with a very extensive section on readahead written by
> > > somebody :P).
> > >
> > > Also been looking at it for file-backed guard regions recently, which is
> > > why I've been commenting here specifically as it's been on my mind lately,
> > > and also Kalesh's interest in this stems from a guard region 'scenario'
> > > (hence my cc).
> > >
> > > Anyway perhaps I didn't phrase this well - my concern is whether this might
> > > impact performance in real world scenarios, such as one where a VMA is
> > > mapped then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of
> > > the same file, in sequential order.
> > >
> > > From Kalesh's LPC talk, unless I misinterpreted what he said, this is
> > > precisely what he's doing? I mean we'd not be talking here about mmap()
> > > behaviour with readahead otherwise.
> > >
> > > Granted, perhaps you'd only _ever_ be reading sequentially within a
> > > specific VMA's boundaries, rather than going from one to another (excluding
> > > PROT_NONE guards obviously) and that's very possible, if that's what you
> > > mean.
> > >
> > > But otherwise, surely this is a thing? And might we therefore be imposing
> > > unnecessary cache misses?
> > >
> > > Which is why I suggest...
> > >
> > > > So yes, sequential read of a memory mapping of a file fragmented into many
> > > > VMAs will be somewhat slower. My impression is such use is rare (sequential
> > > > readers tend to use read(2) rather than mmap) but I could be wrong.
> > > >
> > > > > What about shared libraries with r/o parts and exec parts?
> > > > >
> > > > > I think we'd really need to do some pretty careful checking to ensure this
> > > > > wouldn't break some real world use cases esp. if we really do mostly
> > > > > readahead data from page cache.
> > > >
> > > > So I'm not sure if you are not conflating two things here because the above
> > > > sentence doesn't make sense to me :). Readahead is the mechanism that
> > > > brings data from the underlying filesystem into the page cache. Fault-around is
> > > > the mechanism that maps into page tables pages present in the page cache
> > > > although they were not necessarily requested by the page fault. By "do mostly
> > > > readahead data from page cache" are you speaking about fault-around? That
> > > > currently does not cross VMA boundaries anyway as far as I'm reading
> > > > do_fault_around()...
> > >
> > > ...that we test this and see how it behaves :) Which is literally all I
> > > am saying in the above. Ideally with representative workloads.
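
(As an aside, to make sure I'm picturing the same thing: the clamp Jan
describes above -- bounding fault-triggered readahead at the faulting
VMA -- would conceptually be something like the sketch below. Not a
tested patch; the helper is an assumption, and the real change would
live in the do_sync_mmap_readahead()/do_async_mmap_readahead() paths in
mm/filemap.c.)

/*
 * Illustrative only, not a tested patch: cap a fault-triggered readahead
 * window so it does not extend past the faulting VMA.
 */
static unsigned long ra_pages_clamped_to_vma(struct vm_fault *vmf,
                                             unsigned long ra_pages)
{
        struct vm_area_struct *vma = vmf->vma;
        /* File offset (in pages) where this VMA's coverage ends. */
        pgoff_t vma_end = vma->vm_pgoff + vma_pages(vma);

        if (vmf->pgoff + ra_pages > vma_end)
                ra_pages = vma_end - vmf->pgoff;

        return ra_pages;
}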

> > > I mean, I think this shouldn't be a controversial point right? Perhaps
> > > again I didn't communicate this well. But this is all I mean here.
> > >
> > > BTW, I understand the difference between readahead and fault-around, you can
> > > run git blame on do_fault_around() if you have doubts about that ;)
> > >
> > > And yes fault around is constrained to the VMA (and actually avoids
> > > crossing PTE boundaries).
> > >
> > > > > > Regarding controlling readahead for various portions of the file - I'm
> > > > > > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > > > > > side for such a niche use case (but maybe your numbers will show it isn't
> > > > > > such a niche as I think :)). I can imagine you could just completely
> > > > > > turn off kernel readahead for the file and do your special readahead from
> > > > > > userspace - I think you could use either userfaultfd for triggering it or
> > > > > > the new fanotify FAN_PREACCESS events.
> >
> > Something like this would be ideal for the use case where uncompressed
> > ELF files are mapped directly from zipped APKs without extracting
> > them. (I don't have any real-world numbers for this case atm.) I also
> > don't know if the cache miss on the subsequent VMAs has significant
> > overhead in practice ... I'll try to collect some data for this.
> >
> > > > > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > > > > says, he is too!) I don't really see how we could avoid having to do that
> > > > > for this kind of case, but I may be missing something...
> > > >
> > > > I don't see why we would need to be increasing the number of VMAs here at all.
> > > > With FAN_PREACCESS you get a notification with file & offset when it's
> > > > accessed, and you can issue readahead(2) calls based on that however you like.
> > > > Similarly you can ask for userfaults for the whole mapped range and handle
> > > > those. Now thinking more about this, this approach has the downside that
> > > > you cannot implement async readahead with it (once a PTE is mapped to some
> > > > page it won't trigger notifications either with FAN_PREACCESS or with
> > > > UFFD). But with UFFD you could at least trigger readahead on minor faults.
> > >
> > > Yeah we're talking past each other on this, sorry I missed your point about
> > > fanotify there!
> > >
> > > uffd is probably not reasonably workable given the overhead, I would have
> > > thought.
> > >
> > > I am really unaware of how fanotify works so I mean cool if you can find a
> > > solution this way, awesome :)
> > >
> > > I'm just saying, if we need to somehow retain state about regions which
> > > should have adjusted readahead behaviour at a VMA level, I can't see how
> > > this could be done without VMA fragmentation and I'd rather we didn't.
> > >
> > > If we can avoid that great!
> >
> > Another possible way we can look at this: in the regressions shared
> > above for the ELF padding regions, we are able to make these regions
> > sparse (for *almost* all cases) -- solving the shared zero page
> > problem for file mappings would also eliminate much of this overhead.
> > So perhaps we should tackle this from that angle, if it's a more
> > tangible solution?
>
> To me it seems we are converging on this as at least part of the solution.
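
(Coming back to Jan's userspace suggestion for the APK case above: the
consuming side would be roughly the sketch below -- illustrative only,
and it deliberately leaves out the notification plumbing (fanotify or
uffd minor faults), which is the part I'd still need to prototype.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative only: opt the file out of kernel readahead, then populate
 * the page cache for just the ranges we know will be touched (the real
 * ELF contents inside the APK, skipping the padding).
 */
static int readahead_wanted_range(int fd, off_t offset, size_t len)
{
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
        if (err)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        if (readahead(fd, offset, len) != 0) {
                perror("readahead");
                return -1;
        }
        return 0;
}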

> >
> > From the previous discussions that Matthew shared [7], it seems like
> > Dave proposed an alternative to moving the extents to the VFS layer:
> > inverting the IO read path operations [8]. Maybe this is a more
> > approachable solution, since there is precedent for the same in the
> > write path?
> >
> > [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
> > [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/
> >
> > Thanks,
> > Kalesh
> >
> > > > Honza
> > > > --
> > > > Jan Kara
> > > > SUSE Labs, CR
>
> Overall I think we can conclude - this is a topic of interest to people for
> LSF :)

Yes, I'd love to discuss this more with all the relevant folks in person :)

Thanks,
Kalesh