From: Kalesh Singh <kaleshsingh@google.com>
Date: Mon, 24 Feb 2025 13:36:50 -0800
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
To: Lorenzo Stoakes
Cc: Jan Kara, lsf-pc@lists.linux-foundation.org,
	"open list:MEMORY MANAGEMENT", linux-fsdevel, Suren Baghdasaryan,
	David Hildenbrand, "Liam R. Howlett", Juan Yescas, android-mm,
	Matthew Wilcox, Vlastimil Babka, Michal Hocko, "Cc: Android Kernel"
In-Reply-To: <82fbe53b-98c4-4e55-9eeb-5a013596c4c6@lucifer.local>
References: <3bd275ed-7951-4a55-9331-560981770d30@lucifer.local>
	<82fbe53b-98c4-4e55-9eeb-5a013596c4c6@lucifer.local>
Howlett" , Juan Yescas , android-mm , Matthew Wilcox , Vlastimil Babka , Michal Hocko , "Cc: Android Kernel" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 2040C40002 X-Stat-Signature: buer7iksnedhiidk4ewuydko71t5he6k X-Rspam-User: X-HE-Tag: 1740433022-637655 X-HE-Meta: U2FsdGVkX1/hUtQ+PXLP3cM0nmVUz8fa8JYbmdg3rMkI8uGx2pMXNVn3BcunHv9dkqVGfx+76PzQQRRb2zekOoQg8wQ1jATYcwZx0n5BfxjiByN0qN60gJ4fSuKLdWFKsNNyNiyk2zGmVb6vliOquRPFZX5eV3lo8xNartOzJeoaFVpy4yl73CLs3mojz3JIHd2MFZ/uxQgmTbLNz653MxsaE447L5PBpn3o6QGpC5uIPOl7ur/U3ZZAZDBP4SHcLzAPx1Iv3wIHk1p6TA+amqRwSsG2a5feS6aDcYDp8NR1vzaWzIlPnnYg2Jikm5WQ7bkWxp94vMGX7fioQ1w96mLilas6PaSJRWjg3hQ0OwWFG5fDZPtDRd55Xxtan6nvuVVyKq4tLnkSdN7BBdTtByIfyLFARuEG7cbG2vaOTLizLAnxWdUkgnroIXo+TgAVYSv+Ym45CzKpFyru0gUKFfqtZc20lTjIOD3BIgKPaVp1TJ+ET6YeFyK6oJIpK84JRdzxHNuz9Gz6U+hWzjmI9VSfXfs21D4CPKB/J1+uEFv9TMhEv3sDf4bqlBhgiqgZDY5ebzI3HnwYEENNpV7TKAwfJ7nEwHEFcvA4rtewaE3aeB/Cxp/HOV7PZeyt0cofG7jWJ2JSvmuiyf7xtOBoxiQ3LwilhsFeLLIM/jVYfeIyVbNXlPT7D7auKUE2kW5IWOm5aLyXl1LclwA6VDIRYtmNfHKppWJJ8XQdifWcQ4NegkyO/vnDUW26tDdk57UdkweYS8Ia4/++dViRQ8+dZfvMyM27AgWQbPhSzZh1HhzMTrQnafztyAH4FOLQ5ul4k+VZz9wJccswOai632jsahEPmO3igxvL83q+O2XjQ0ffP//9FT/qzE+x7HlNsgFdZQbC11WHoWHB/zXzK113zd/6hymcjMYRlbyDx6AfzDWHeY6BaNerYi0dlIpgbObQC/fcCk4SKdutx0tNyM/ jrYbLMbD F3HNsdh5kD9Q8f0vRhtjrNiVGdTt/khSd2NOJjRKKFoCl/r/XP2BnuewFRnL2rs3ZyaIGxWZL7wjmAO2c0WYAXItLd8/ZB0x08HGQTvZi13gXLzq8jgCbawbdwWeqbKBkrzRlJfUb7An4cr9ghShgqWrkA/NjyjqFkvhYrqfS3uGWAHe5ic94aDExndeCcoPwekUopwESKqnMflSzDLUgnP5veZ5TYDwGtRQ3BuNQA+/7HSYZ3cqM4xNm8NSyMQA8aonsZFwMJf6Lb0RpNgU/HgRmZ3AyOIEOaoIa3Mun1jslT7m+S1pbLZ4sxsdRDA7SoDr8tpqOyIjagHVnV/bHKE2E2nY3vbSEsyhYDyKXYDyh6jA= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Feb 24, 2025 at 8:52=E2=80=AFAM Lorenzo Stoakes wrote: > > On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote: > > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote: > > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote: > > > > Hello! > > > > > > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote: > > > > > Problem Statement > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > > > > > Readahead can result in unnecessary page cache pollution for mapp= ed > > > > > regions that are never accessed. Current mechanisms to disable > > > > > readahead lack granularity and rather operate at the file or VMA > > > > > level. This proposal seeks to initiate discussion at LSFMM to exp= lore > > > > > potential solutions for optimizing page cache/readahead behavior. > > > > > > > > > > > > > > > Background > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > > > > > The read-ahead heuristics on file-backed memory mappings can > > > > > inadvertently populate the page cache with pages corresponding to > > > > > regions that user-space processes are known never to access e.g E= LF > > > > > LOAD segment padding regions. While these pages are ultimately > > > > > reclaimable, their presence precipitates unnecessary I/O operatio= ns, > > > > > particularly when a substantial quantity of such regions exists. > > > > > > > > > > Although the underlying file can be made sparse in these regions = to > > > > > mitigate I/O, readahead will still allocate discrete zero pages w= hen > > > > > populating the page cache within these ranges. 
> > > > > These pages, while subject to reclaim, introduce additional churn
> > > > > to the LRU. This reclaim overhead is further exacerbated in
> > > > > filesystems that support "fault-around" semantics, that can
> > > > > populate the surrounding pages' PTEs if found present in the page
> > > > > cache.
> > > > >
> > > > > While the memory impact may be negligible for large files containing a
> > > > > limited number of sparse regions, it becomes appreciable for many
> > > > > small mappings characterized by numerous holes. This scenario can
> > > > > arise from efforts to minimize vm_area_struct slab memory footprint.

Hi Jan, Lorenzo, thanks for the comments.

> > > > OK, I agree the behavior you describe exists. But do you have some
> > > > real-world numbers showing its extent? I'm not looking for some artificial
> > > > numbers - sure bad cases can be constructed - but how big practical problem
> > > > is this? If you can show that average Android phone has 10% of these
> > > > useless pages in memory than that's one thing and we should be looking for
> > > > some general solution. If it is more like 0.1%, then why bother?

Once I revert a workaround that we currently carry to avoid
fault-around for these regions (we don't have an out-of-tree solution
to prevent the page cache population), our CI, which checks memory
usage after performing some common app user journeys, reports
regressions as shown in the snippet below. Note that the increases
here cover only the populated PTEs (bounded by the VMA), so the actual
page cache pollution is theoretically larger.

Metric: perfetto_media.extractor#file-rss-avg
Increased by 7.495 MB (32.7%)
Metric: perfetto_/system/bin/audioserver#file-rss-avg
Increased by 6.262 MB (29.8%)
Metric: perfetto_/system/bin/mediaserver#file-rss-max
Increased by 8.325 MB (28.0%)
Metric: perfetto_/system/bin/mediaserver#file-rss-avg
Increased by 8.198 MB (28.4%)
Metric: perfetto_media.extractor#file-rss-max
Increased by 7.95 MB (33.6%)
Metric: perfetto_/system/bin/incidentd#file-rss-avg
Increased by 0.896 MB (20.4%)
Metric: perfetto_/system/bin/audioserver#file-rss-max
Increased by 6.883 MB (31.9%)
Metric: perfetto_media.swcodec#file-rss-max
Increased by 7.236 MB (34.9%)
Metric: perfetto_/system/bin/incidentd#file-rss-max
Increased by 1.003 MB (22.7%)
Metric: perfetto_/system/bin/cameraserver#file-rss-avg
Increased by 6.946 MB (34.2%)
Metric: perfetto_/system/bin/cameraserver#file-rss-max
Increased by 7.205 MB (33.8%)
Metric: perfetto_com.android.nfc#file-rss-max
Increased by 8.525 MB (9.8%)
Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
Increased by 3.715 MB (3.6%)
Metric: perfetto_media.swcodec#file-rss-avg
Increased by 5.096 MB (27.1%)
[...]

The issue is widespread across processes because, in order to support
larger page sizes, Android requires that ELF segments be at least 16 KB
aligned, which leads to these padding regions (which are never
accessed).
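To make the scale of that padding concrete, here is a rough,
illustrative calculation only (not measured data): it assumes 4 KB
kernel pages and the 16 KB segment alignment above, and the per-ELF
gap count and library count are made up.

/*
 * Rough illustration only: with 4 KB pages and 16 KB ELF segment
 * alignment, each aligned LOAD segment boundary can leave up to 12 KB
 * (3 pages) of padding that is mapped but never accessed. The segment
 * and library counts below are hypothetical.
 */
#include <stdio.h>

int main(void)
{
	const long page_size = 4096;
	const long seg_align = 16384;
	const long max_pad_per_gap = seg_align - page_size; /* 12 KB */
	const long gaps_per_elf = 3;    /* hypothetical LOAD segment gaps */
	const long elfs_mapped = 200;   /* hypothetical libraries in a process */

	long worst_case = max_pad_per_gap * gaps_per_elf * elfs_mapped;

	printf("worst-case never-accessed page cache: %ld KB (%ld pages)\n",
	       worst_case / 1024, worst_case / page_size);
	return 0;
}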
> > > > > Limitations of Existing Mechanisms
> > > > > ==================================
> > > > >
> > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > > entire file, rather than specific sub-regions. The offset and length
> > > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > >
> > > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > > VMA, rather than specific sub-regions. [3]
> > > > >
> > > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > > population persists. [4]
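For reference, a minimal userspace sketch of the two existing knobs
quoted above (both are standard calls; the path and sizes are made up).
It shows the granularity problem: the fadvise advice applies to the
whole open file, and advising only part of a mapping with madvise()
splits the VMA, which is the proliferation concern raised later in
this thread.

#include <fcntl.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("/system/lib64/libexample.so", O_RDONLY); /* hypothetical */
	if (fd < 0)
		return 1;

	/*
	 * POSIX_FADV_RANDOM disables readahead for the whole open file;
	 * the offset/len arguments do not scope this advice.
	 */
	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

	size_t len = 16 * 4096;
	void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	/*
	 * MADV_RANDOM applies per VMA: advising only a sub-range forces a
	 * VMA split.
	 */
	madvise((char *)map + 4096, 2 * 4096, MADV_RANDOM);

	return 0;
}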
By "do mos= tly > > readahead data from page cache" are you speaking about fault-around? Th= at > > currently does not cross VMA boundaries anyway as far as I'm reading > > do_fault_around()... > > ...that we test this and see how it behaves :) Which is literally all I > am saying in the above. Ideally with representative workloads. > > I mean, I think this shouldn't be a controversial point right? Perhaps > again I didn't communicate this well. But this is all I mean here. > > BTW, I understand the difference between readahead and fault-around, you = can > run git blame on do_fault_around() if you have doubts about that ;) > > And yes fault around is constrained to the VMA (and actually avoids > crossing PTE boundaries). > > > > > > > Regarding controlling readahead for various portions of the file - = I'm > > > > skeptical. In my opinion it would require too much bookeeping on th= e kernel > > > > side for such a niche usecache (but maybe your numbers will show it= isn't > > > > such a niche as I think :)). I can imagine you could just completel= y > > > > turn off kernel readahead for the file and do your special readahea= d from > > > > userspace - I think you could use either userfaultfd for triggering= it or > > > > new fanotify FAN_PREACCESS events. > > > Something like this would be ideal for the use case where uncompressed ELF files are mapped directly from zipped APKs without extracting them. (I don't have any real world number for this case atm). I also don't know if the cache miss on the subsequent VMAs has significant overhead in practice ... I'll try to collect some data for this. > > > I'm opposed to anything that'll proliferate VMAs (and from what Kales= h > > > says, he is too!) I don't really see how we could avoid having to do = that > > > for this kind of case, but I may be missing something... > > > > I don't see why we would need to be increasing number of VMAs here at a= ll. > > With FAN_PREACCESS you get notification with file & offset when it's > > accessed, you can issue readahead(2) calls based on that however you li= ke. > > Similarly you can ask for userfaults for the whole mapped range and han= dle > > those. Now thinking more about this, this approach has the downside tha= t > > you cannot implement async readahead with it (once PTE is mapped to som= e > > page it won't trigger notifications either with FAN_PREACCESS or with > > UFFD). But with UFFD you could at least trigger readahead on minor faul= ts. > > Yeah we're talking past each other on this, sorry I missed your point abo= ut > fanotify there! > > uffd is probably not reasonably workable given overhead I would have > thought. > > I am really unaware of how fanotify works so I mean cool if you can find = a > solution this way, awesome :) > > I'm just saying, if we need to somehow retain state about regions which > should have adjusted readahead behaviour at a VMA level, I can't see how > this could be done without VMA fragmentation and I'd rather we didn't. > > If we can avoid that great! Another possible way we can look at this: in the regressions shared above by the ELF padding regions, we are able to make these regions sparse (for *almost* all cases) -- solving the shared-zero page problem for file mappings, would also eliminate much of this overhead. So perhaps we should tackle this angle? If that's a more tangible solution ? 
From the previous discussions that Matthew shared [7], it seems that
Dave proposed an alternative to moving the extents to the VFS layer:
inverting the IO read path operations [8]. Maybe this is a more
approachable solution, since there is precedent for the same in the
write path?

[7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@casper.infradead.org/
[8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@dread.disaster.area/

Thanks,
Kalesh

> >
> >								Honza
> > --
> > Jan Kara
> > SUSE Labs, CR