From: Kalesh Singh
Date: Sat, 22 Feb 2025 21:42:57 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
To: Kent Overstreet
Cc: lsf-pc@lists.linux-foundation.org, "open list:MEMORY MANAGEMENT",
	linux-fsdevel, Suren Baghdasaryan, David Hildenbrand,
	Lorenzo Stoakes, "Liam R. Howlett", Juan Yescas, android-mm,
	Matthew Wilcox, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Nhat Pham
Howlett" , Juan Yescas , android-mm , Matthew Wilcox , Vlastimil Babka , Michal Hocko , Johannes Weiner , Nhat Pham Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: 7B7BBA0007 X-Stat-Signature: cqm7q487dx8mh7eaar3xdeg1fjbjnbrx X-HE-Tag: 1740289390-343723 X-HE-Meta: U2FsdGVkX1+t0dOcCMP/5NUNA+fO/U+H4Ril8xmPO/MzIawtiTuG10q+MeqCmEGmR6pmcwzUmuCTtUNMV3YDXxUQKfjzKRKReZo5rEJORAWjg/8iKU0/P2srESMXIklmIInPtSE9WYTw0zsUHqhakyY+w25rd9qoZ/VWdVJJU4eFR6wudYIQpjkhMXNmpGyArsuYNWaSNTnT5xmAV04p2xtmEGgDFLaze6qU3QY2U6jOnXCBPYUjLI+YGproEvljdno9vCca4NoNFPgTQD0XrJQRtnMWrJczy8uoa7qYhzm3mVkr4GEt2ef3FyD8zamLc6BkVPpcPhkox9TBkBUCshPtsQ/7MUr+9inKAMEiAPf+bdqXYdHar2gPT0XRVL4yGOmHCRa3j1X3L0JrqVnMxbtDHDdO/6fI4u4Aq+m+67YHbDy70mRUxWoVjfCciaPS2/EQBG5SMctNC70dNYjdmu6ke8uJZUEAks7481p4UjCKfEo2EBuKseuFJe4jy7sj/bA1QisSp0kxAHqhnyYr6UBj6AoYqgqIMRIzRrDAhV9WQwSfHVo3oG1hZoLWlhfRz8yiO2OWVjhui2vSv3fTbrp6chyqcidUGb/wM/qM6zRm/qUUogKEs0w9LZFse9YEPrSjyAntDJJV2y51VE7ox01Jg3FaIQHthN0UAPzfIDIE5us2IKkW0mQwDvTh+2LlBb+Nv6/zAG2Sl2W+ubFe3N7M0ZxY8zeCEqr5tsGmzLxmNvRjOB7CWM1WSqaE9bAo6NbjwW10t3S8xhytbHB7gl3h4T/lu62jtxjWj4V+EUDTfG9I++BygT/WUZpKhg833CyDF1OjAs0iGHipDuCcpwxU2kEnoioEriuNoLYj6Jdsa8nv2BhwWZJGswGLEjJMUV6gmol28lJvorq43kPDczlG4DjX6VK8VmOQcGyugfk3ZJ8hQjFQNEQBK7rzVnaG7gMsL+bEs5cgz1wxWdA bYuJRgk/ WDc2AfQsbb7UlM3rKIoiW/WTEw9vdE7hWnAN/odj/PoF2jloADtoUcLp/n2ygk0N3zi2nlRwgl6Uqw5tvm2UUqrfKrd7s50gnOz7Yag7JkEwsqoY3EI9k3UkiFEEGwdLrl+fhz0nIB0UVOPPUBdY9At0Ie/OVEHKKnzNYZjF/1OELygySE3O/uRquTSd3/5qmyAT7TW/uQiDCAqxD+WsiUP1OEM5TUwON13a1FfYqVG5qJK6xI27y9mObHA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.118337, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sat, Feb 22, 2025 at 9:36=E2=80=AFPM Kalesh Singh wrote: > > On Sat, Feb 22, 2025 at 10:03=E2=80=AFAM Kent Overstreet > wrote: > > > > On Fri, Feb 21, 2025 at 01:13:15PM -0800, Kalesh Singh wrote: > > > Hi organizers of LSF/MM, > > > > > > I realize this is a late submission, but I was hoping there might > > > still be a chance to have this topic considered for discussion. > > > > > > Problem Statement > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > Readahead can result in unnecessary page cache pollution for mapped > > > regions that are never accessed. Current mechanisms to disable > > > readahead lack granularity and rather operate at the file or VMA > > > level. This proposal seeks to initiate discussion at LSFMM to explore > > > potential solutions for optimizing page cache/readahead behavior. > > > > > > > > > Background > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > The read-ahead heuristics on file-backed memory mappings can > > > inadvertently populate the page cache with pages corresponding to > > > regions that user-space processes are known never to access e.g ELF > > > LOAD segment padding regions. While these pages are ultimately > > > reclaimable, their presence precipitates unnecessary I/O operations, > > > particularly when a substantial quantity of such regions exists. > > > > > > Although the underlying file can be made sparse in these regions to > > > mitigate I/O, readahead will still allocate discrete zero pages when > > > populating the page cache within these ranges. These pages, while > > > subject to reclaim, introduce additional churn to the LRU. 
> > >
> > > While the memory impact may be negligible for large files containing
> > > a limited number of sparse regions, it becomes appreciable for many
> > > small mappings characterized by numerous holes. This scenario can
> > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > >
> > > Limitations of Existing Mechanisms
> > > ==================================
> > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > entire file, rather than specific sub-regions. The offset and length
> > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > POSIX_FADV_DONTNEED [2] cases.
> > > madvise(..., MADV_RANDOM, ...): similarly, this applies to the entire
> > > VMA, rather than specific sub-regions. [3]
> > > Guard Regions: while guard regions for file-backed VMAs circumvent
> > > fault-around concerns, the fundamental issue of unnecessary page
> > > cache population persists. [4]
>
> Hi Kent. Thanks for taking a look at this.
>
> > What if we introduced something like
> >
> > madvise(..., MADV_READAHEAD_BOUNDARY, offset)
> >
> > Would that be sufficient? And would a single readahead boundary offset
> > suffice?
>
> I like the idea of having boundaries. In this particular example the
> single boundary suffices, though I think we'll need to support
> multiple (see below).
>
> One requirement that we'd like to meet is that the solution doesn't
> cause VMA splits, to avoid additional slab usage, so perhaps fadvise()
> is better suited to this?
>
> Another behavior of "mmap readahead" is that it doesn't really respect
> VMA (start, end) boundaries:
>
> The below demonstrates readahead past the end of the mapped region of the file:
>
> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 25 pages
> Apparent Size: 100K
> Real Size: 0
> Number cached pages: 0
> Reading first 5 pages via mmap...
> Mapping and reading pages: [0, 6) of file 'myfile.txt'
> Number cached pages: 25
>
> Similarly, readahead can bring in pages before the start of the
> mapped region. I believe this is due to mmap "read-around" [6]:

I missed the reference to read-around in my previous response:
[6] https://github.com/torvalds/linux/blob/v6.13-rc3/mm/filemap.c#L3195-L3204

> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 25 pages
> Apparent Size: 100K
> Real Size: 0
> Number cached pages: 0
> Reading last 5 pages via mmap...
> Mapping and reading pages: [20, 25) of file 'myfile.txt'
> Number cached pages: 25
>
> I'm not sure what the historical use cases for readahead past the VMA
> boundaries are, but at least in some scenarios this behavior is not
> desirable. For instance, many apps mmap uncompressed ELF files
> directly from a page-aligned offset within a zipped APK as a
> space-saving and security feature. The readahead and read-around
> behaviors lead to unrelated resources from the zipped APK being
> populated in the page cache. I think in this case we'll need to have
> more than a single boundary per file.
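Since pollute_page_cache.sh itself isn't included here, the following is
a rough sketch of the kind of check it presumably performs for the second
case above (file name and sizes follow the demo output; a cold cache and
default readahead settings are assumed): map and read only the last 5
pages of a 25-page sparse file, then ask mincore() how much of the whole
file ended up in the page cache.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages = 25, len = npages * page;
	size_t win_pg = 20, win_len = (npages - win_pg) * page;
	unsigned char *vec = calloc(npages, 1);
	int fd = open("myfile.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || !vec || ftruncate(fd, len))
		return 1;

	/* Map and read only pages [20, 25) of the file. */
	char *win = mmap(NULL, win_len, PROT_READ, MAP_SHARED, fd,
			 (off_t)win_pg * page);
	if (win == MAP_FAILED)
		return 1;
	for (size_t off = 0; off < win_len; off += page) {
		volatile char c = win[off];
		(void)c;
	}

	/* Separate full-file mapping, only queried via mincore(). */
	char *whole = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (whole == MAP_FAILED || mincore(whole, len, vec))
		return 1;

	size_t resident = 0;
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1;

	/* Read-around can make pages *before* the mapped window resident too. */
	printf("cached pages: %zu of %zu (window was pages [%zu, %zu))\n",
	       resident, npages, win_pg, npages);
	return 0;
}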
>
> A somewhat related but separate issue is that currently distinct pages
> are allocated in the page cache when reading sparse file holes. I
> think that, at least for reads, this should be avoidable.
>
> sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
> ./pollute_page_cache.sh
>
> Creating sparse file of size 1GB
> Apparent Size: 977M
> Real Size: 0
> Number cached pages: 0
> Meminfo Cached: 9078768 kB
> Reading 1GB of holes...
> Number cached pages: 250000
> Meminfo Cached: 10117324 kB
>
> (10117324 - 9078768) / 4 = 259639 = ~250000 pages  # (global counter, so some noise)
>
> --Kalesh
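For reference, a scaled-down sketch of the hole-read case quoted above
(sizes reduced from the 1GB example, file name illustrative): pread()
through a sparse file's holes, then count, via mincore() on an untouched
mapping, how many distinct page cache pages those reads allocated.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages = 256, len = npages * page;
	unsigned char *vec = calloc(npages, 1);
	char *buf = malloc(page);
	int fd = open("holes.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || !vec || !buf || ftruncate(fd, len))	/* all holes */
		return 1;

	/* Buffered reads of the holes. */
	for (size_t i = 0; i < npages; i++)
		if (pread(fd, buf, page, (off_t)i * page) != page)
			return 1;

	/* Mapping is only used so mincore() can report residency. */
	char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED || mincore(map, len, vec))
		return 1;

	size_t resident = 0;
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1;

	/* Today each hole read gets its own zero-filled page cache page. */
	printf("cached pages after reading holes: %zu of %zu\n", resident, npages);
	return 0;
}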