Date: Mon, 24 Feb 2025 17:31:16 +0100
From: Jan Kara
To: Lorenzo Stoakes
Cc: Jan Kara, Kalesh Singh, lsf-pc@lists.linux-foundation.org,
    "open list:MEMORY MANAGEMENT", linux-fsdevel, Suren Baghdasaryan,
    David Hildenbrand, "Liam R. Howlett", Juan Yescas, android-mm,
    Matthew Wilcox, Vlastimil Babka, Michal Hocko
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
In-Reply-To: <3bd275ed-7951-4a55-9331-560981770d30@lucifer.local>

On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > Hello!
> >
> > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > Problem Statement
> > > ===============
> > >
> > > Readahead can result in unnecessary page cache pollution for mapped
> > > regions that are never accessed. Current mechanisms to disable
> > > readahead lack granularity and rather operate at the file or VMA
> > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > potential solutions for optimizing page cache/readahead behavior.
> > >
> > >
> > > Background
> > > =========
> > >
> > > The read-ahead heuristics on file-backed memory mappings can
> > > inadvertently populate the page cache with pages corresponding to
> > > regions that user-space processes are known never to access e.g ELF
> > > LOAD segment padding regions. While these pages are ultimately
> > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > particularly when a substantial quantity of such regions exists.
> > >
> > > Although the underlying file can be made sparse in these regions to
> > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > populating the page cache within these ranges. These pages, while
> > > subject to reclaim, introduce additional churn to the LRU.
> > > This reclaim overhead is further exacerbated in filesystems that support
> > > "fault-around" semantics, that can populate the surrounding pages’
> > > PTEs if found present in the page cache.
> > >
> > > While the memory impact may be negligible for large files containing a
> > > limited number of sparse regions, it becomes appreciable for many
> > > small mappings characterized by numerous holes. This scenario can
> > > arise from efforts to minimize vm_area_struct slab memory footprint.
> >
> > OK, I agree the behavior you describe exists. But do you have some
> > real-world numbers showing its extent? I'm not looking for some artificial
> > numbers - sure, bad cases can be constructed - but how big a practical
> > problem is this? If you can show that the average Android phone has 10% of
> > these useless pages in memory, then that's one thing and we should be
> > looking for some general solution. If it is more like 0.1%, then why bother?
> >
> > > Limitations of Existing Mechanisms
> > > ===========================
> > >
> > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > entire file, rather than specific sub-regions. The offset and length
> > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > POSIX_FADV_DONTNEED [2] cases.
> > >
> > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > VMA, rather than specific sub-regions. [3]
> > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > population persists. [4]
> >
> > Somewhere else in the thread you complain about readahead extending past
> > the VMA. That's relatively easy to avoid at least for readahead triggered
> > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > do_sync_mmap_readahead()). I agree we could do that and that seems like a
> > relatively uncontroversial change.
> > Note that if someone accesses the file
> > through standard read(2) or write(2) syscall or through a different memory
> > mapping, the limits won't apply but such combinations of access are not
> > that common anyway.
>
> Hm I'm not sure, map elf files with different mprotect(), or mprotect()
> different portions of a file and suddenly you lose all the readahead for the
> rest even though you're reading sequentially?

Well, you wouldn't lose all readahead for the rest. Readahead just won't
preread data underlying the next VMA, so yes, you get a cache miss and have
to wait for a page to get loaded into the cache when transitioning to the
next VMA, but once you get there, you'll have readahead running at full
speed again. So yes, a sequential read of a memory mapping of a file
fragmented into many VMAs will be somewhat slower. My impression is that
such use is rare (sequential readers tend to use read(2) rather than mmap)
but I could be wrong.

> What about shared libraries with r/o parts and exec parts?
>
> I think we'd really need to do some pretty careful checking to ensure this
> wouldn't break some real world use cases esp. if we really do mostly
> readahead data from page cache.

So I'm not sure if you are not conflating two things here because the above
sentence doesn't make sense to me :). Readahead is the mechanism that
brings data from the underlying filesystem into the page cache. Fault-around
is the mechanism that maps into page tables pages that are present in the
page cache even though they were possibly not requested by the page fault.
By "do mostly readahead data from page cache" are you speaking about
fault-around? That currently does not cross VMA boundaries anyway as far as
I'm reading do_fault_around()...

> > Regarding controlling readahead for various portions of the file - I'm
> > skeptical. In my opinion it would require too much bookkeeping on the
> > kernel side for such a niche use case (but maybe your numbers will show it
> > isn't such a niche as I think :)).
> > I can imagine you could just completely
> > turn off kernel readahead for the file and do your special readahead from
> > userspace - I think you could use either userfaultfd for triggering it or
> > new fanotify FAN_PREACCESS events.
>
> I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> says, he is too!) I don't really see how we could avoid having to do that
> for this kind of case, but I may be missing something...

I don't see why we would need to increase the number of VMAs here at all.
With FAN_PREACCESS you get a notification with file & offset when it's
accessed, and you can issue readahead(2) calls based on that however you
like. Similarly you can ask for userfaults for the whole mapped range and
handle those. Now thinking more about this, this approach has the downside
that you cannot implement async readahead with it (once a PTE is mapped to
some page it won't trigger notifications either with FAN_PREACCESS or with
UFFD). But with UFFD you could at least trigger readahead on minor faults.

								Honza
-- 
Jan Kara
SUSE Labs, CR
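
A minimal sketch of the userspace-readahead idea above, for illustration
only. It assumes kernel readahead has already been turned off for the file,
and that some notification source (FAN_PREACCESS or userfaultfd, not shown
here) hands userspace the offset being accessed. The names plan_readahead(),
issue_readahead(), skip_ranges and the 4096-byte PAGE are hypothetical, and
os.posix_fadvise() with POSIX_FADV_WILLNEED is used as a stand-in for
readahead(2). The point is only that the policy (which ranges to skip, e.g.
padding regions) can live entirely in userspace without creating extra VMAs.

```python
import os
import tempfile

PAGE = 4096  # assumed page size for this sketch


def plan_readahead(offset, window, skip_ranges, file_size):
    """Return (start, length) byte ranges to prefetch for an access at offset.

    The window starts at the page containing `offset`, spans `window` bytes,
    is clamped to `file_size`, and has every range in `skip_ranges`
    (a list of (start, end) byte pairs, end exclusive) cut out of it.
    """
    start = (offset // PAGE) * PAGE
    end = min(start + window, file_size)
    pieces = []
    cursor = start
    for skip_start, skip_end in sorted(skip_ranges):
        if skip_end <= cursor or skip_start >= end:
            continue  # this hole does not overlap the rest of the window
        if skip_start > cursor:
            pieces.append((cursor, skip_start - cursor))
        cursor = max(cursor, skip_end)
    if cursor < end:
        pieces.append((cursor, end - cursor))
    return pieces


def issue_readahead(fd, pieces):
    """Ask the kernel to prefetch each planned range; return the call count."""
    for start, length in pieces:
        # Stand-in for readahead(2): hint that this range will be needed soon.
        os.posix_fadvise(fd, start, length, os.POSIX_FADV_WILLNEED)
    return len(pieces)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * (16 * PAGE))
        f.flush()
        # Pretend a notification reported an access at byte 5000 and that
        # bytes [8192, 12288) are padding nobody will ever touch.
        plan = plan_readahead(5000, 4 * PAGE, [(8192, 12288)], 16 * PAGE)
        print(plan, issue_readahead(f.fileno(), plan))
```

As noted above, a scheme like this only reacts to accesses that still
generate a notification, so it cannot reproduce the kernel's asynchronous
readahead once the pages are mapped.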