From: Kalesh Singh <kaleshsingh@google.com>
Date: Sat, 22 Feb 2025 21:36:48 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
To: Kent Overstreet
Cc: lsf-pc@lists.linux-foundation.org, "open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
 linux-fsdevel, Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes,
 "Liam R. Howlett", Juan Yescas, android-mm, Matthew Wilcox,
 Vlastimil Babka, Michal Hocko, Johannes Weiner, Nhat Pham

On Sat, Feb 22, 2025 at 10:03 AM Kent Overstreet wrote:
>
> On Fri, Feb 21, 2025 at 01:13:15PM -0800, Kalesh Singh wrote:
> > Hi organizers of LSF/MM,
> >
> > I realize this is a late submission, but I was hoping there might
> > still be a chance to have this topic considered for discussion.
> >
> > Problem Statement
> > =================
> >
> > Readahead can result in unnecessary page cache pollution for mapped
> > regions that are never accessed. The current mechanisms for disabling
> > readahead lack granularity: they operate on the entire file or VMA
> > rather than on specific sub-regions. This proposal seeks to initiate
> > discussion at LSF/MM to explore potential solutions for optimizing
> > page cache/readahead behavior.
> >
> > Background
> > ==========
> >
> > The readahead heuristics on file-backed memory mappings can
> > inadvertently populate the page cache with pages corresponding to
> > regions that user-space processes are known never to access, e.g.
> > ELF LOAD segment padding regions. While these pages are ultimately
> > reclaimable, their presence precipitates unnecessary I/O operations,
> > particularly when a substantial number of such regions exists.
> >
> > Although the underlying file can be made sparse in these regions to
> > mitigate I/O, readahead will still allocate discrete zero pages when
> > populating the page cache within these ranges. These pages, while
> > subject to reclaim, introduce additional churn to the LRU. This
> > reclaim overhead is further exacerbated in filesystems that support
> > "fault-around" semantics, which can populate the surrounding pages'
> > PTEs if those pages are found present in the page cache.
> >
> > While the memory impact may be negligible for large files containing a
> > limited number of sparse regions, it becomes appreciable for many
> > small mappings characterized by numerous holes. This scenario can
> > arise from efforts to minimize vm_area_struct slab memory footprint.
> >
> > Limitations of Existing Mechanisms
> > ==================================
> >
> > fadvise(..., POSIX_FADV_RANDOM, ...): disables readahead for the
> > entire file, rather than for specific sub-regions. The offset and
> > length parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > POSIX_FADV_DONTNEED [2] cases.
> >
> > madvise(..., MADV_RANDOM, ...): similarly, this applies to the entire
> > VMA, rather than to specific sub-regions. [3]
> >
> > Guard Regions: while guard regions for file-backed VMAs circumvent
> > fault-around concerns, the fundamental issue of unnecessary page cache
> > population persists. [4]

Hi Kent. Thanks for taking a look at this.

> What if we introduced something like
>
> madvise(..., MADV_READAHEAD_BOUNDARY, offset)
>
> Would that be sufficient? And would a single readahead boundary offset
> suffice?

I like the idea of having boundaries. In this particular example a
single boundary suffices, though I think we'll need to support
multiple (see below).

One requirement we'd like to meet is that the solution doesn't cause
VMA splits, to avoid additional slab usage, so perhaps fadvise() is
better suited to this?
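To make the VMA-split concern concrete: madvise() advice such as
MADV_RANDOM is recorded in vm_flags (VM_RAND_READ), so advising a
sub-range of an existing mapping forces the VMA to be split. A rough
sketch for illustration (the file name and sizes are arbitrary):

/*
 * Illustration only: marking a sub-range of a file mapping with
 * MADV_RANDOM splits the VMA, since VM_RAND_READ applies only to the
 * advised range.  File name and sizes are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 16 * page;
    int fd = open("myfile.txt", O_RDONLY);
    char cmd[64];
    char *p;

    if (fd < 0) {
        perror("open");
        return 1;
    }

    p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* One VMA covers the whole mapping at this point. */
    snprintf(cmd, sizeof(cmd), "grep -c myfile /proc/%d/maps", getpid());
    system(cmd);                            /* prints 1 */

    /* Advise only pages [4, 12): the VMA is split into three. */
    if (madvise(p + 4 * page, 8 * page, MADV_RANDOM))
        perror("madvise");

    system(cmd);                            /* prints 3 */
    return 0;
}

fadvise(), by contrast, records the hint against the open file rather
than the mapping, which is why it avoids the extra vm_area_structs.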
Another behavior of "mmap readahead" is that it doesn't really respect
VMA (start, end) boundaries. The below demonstrates readahead past the
end of the mapped region of the file:

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && ./pollute_page_cache.sh

Creating sparse file of size 25 pages
Apparent Size: 100K
Real Size: 0
Number cached pages: 0
Reading first 5 pages via mmap...
Mapping and reading pages: [0, 6) of file 'myfile.txt'
Number cached pages: 25

Similarly, readahead can bring in pages before the start of the mapped
region. I believe this is due to mmap "read-around" [6]:

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && ./pollute_page_cache.sh

Creating sparse file of size 25 pages
Apparent Size: 100K
Real Size: 0
Number cached pages: 0
Reading last 5 pages via mmap...
Mapping and reading pages: [20, 25) of file 'myfile.txt'
Number cached pages: 25

I'm not sure what the historical use cases for readahead past the VMA
boundaries are, but at least in some scenarios this behavior is not
desirable. For instance, many apps mmap uncompressed ELF files directly
from a page-aligned offset within a zipped APK as a space-saving and
security feature. The readahead and read-around behaviors lead to
unrelated resources from the zipped APK being populated in the page
cache. I think in this case we'll need more than a single boundary per
file.
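For reference, a check like the "Number cached pages" lines above can
be reproduced with mmap() plus mincore(). The sketch below is a rough
stand-in for illustration, not the actual pollute_page_cache.sh; the
file name and page counts are assumptions:

/*
 * Rough reconstruction for illustration (not the actual helper behind
 * pollute_page_cache.sh): map and read only pages [0, 6) of a file,
 * then count how many of the file's pages are resident in the page cache.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("myfile.txt", O_RDONLY);
    struct stat st;

    if (fd < 0 || fstat(fd, &st)) {
        perror("open/fstat");
        return 1;
    }

    size_t npages = (st.st_size + page - 1) / page;

    /* Map only the first 6 pages and fault each one in. */
    char *part = mmap(NULL, 6 * page, PROT_READ, MAP_PRIVATE, fd, 0);
    if (part == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    for (long i = 0; i < 6 * page; i += page)
        (void)*(volatile char *)(part + i);

    /*
     * A second mapping of the whole file is used only to query
     * residency; mincore() itself does not fault pages in.
     */
    char *whole = mmap(NULL, npages * page, PROT_READ, MAP_SHARED, fd, 0);
    unsigned char *vec = malloc(npages);
    if (whole == MAP_FAILED || !vec || mincore(whole, npages * page, vec)) {
        perror("mmap/mincore");
        return 1;
    }

    size_t cached = 0;
    for (size_t i = 0; i < npages; i++)
        cached += vec[i] & 1;

    printf("Number cached pages: %zu of %zu\n", cached, npages);
    return 0;
}

Against the 25-page sparse file above, this should report all 25 pages
resident even though only the first few pages were touched.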
A somewhat related but separate issue is that distinct pages are
currently allocated in the page cache when reading sparse file holes.
I think, at least in the read case, this should be avoidable.

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && ./pollute_page_cache.sh

Creating sparse file of size 1GB
Apparent Size: 977M
Real Size: 0
Number cached pages: 0
Meminfo Cached: 9078768 kB
Reading 1GB of holes...
Number cached pages: 250000
Meminfo Cached: 10117324 kB

(10117324 - 9078768) / 4 = 259639 = ~250000 pages  # (global counter = some noise)

--Kalesh