From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Sun, 30 Mar 2025 12:46:00 +0800
Subject: Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
To: Ryan Roberts
Cc: lsf-pc@lists.linux-foundation.org, Linux-MM, Matthew Wilcox, Dave Chinner
In-Reply-To: <3e3a2d12-efcb-44e7-bd03-e8211161f3a4@arm.com>
References: <6201267f-6d3a-4942-9a61-371bd41d633d@arm.com> <3e3a2d12-efcb-44e7-bd03-e8211161f3a4@arm.com>
Content-Type: text/plain; charset="UTF-8"
On Thu, Mar 20, 2025 at 10:57 PM Ryan Roberts wrote:
>
> On 19/03/2025 20:47, Barry Song wrote:
> > On Thu, Mar 20, 2025 at 4:38 AM Ryan Roberts wrote:
> >>
> >> Hi All,
> >>
> >> I know this is very last minute, but I was hoping that it might be
> >> possible to squeeze in a session to discuss the following?
> >>
> >> Summary/Background:
> >>
> >> On arm64, physically contiguous and naturally aligned regions can take
> >> advantage of contpte mappings (e.g. 64KB) to reduce iTLB pressure.
> >> However, for file regions containing text, current readahead behaviour
> >> often yields small, misaligned folios, preventing this optimization.
> >> This proposal introduces a special-case path for executable mappings,
> >> performing synchronous reads of an architecture-chosen size into large
> >> folios (64KB on arm64). Early performance tests on real-world workloads
> >> (e.g. nginx, redis, kernel compilation) show ~2-9% gains.
> >>
> >> I've previously posted attempts to enable this performance improvement
> >> ([1], [2]), but there were objections and the conversation fizzled out.
> >> Now that I have more compelling performance data, I'm hoping there is
> >> stronger justification, and we can find a path forwards.
> >>
> >> What I'd Like to Cover:
> >>
> >> - Describe how text memory should ideally be mapped and why it benefits
> >>   performance.
> >>
> >> - Brief review of performance data.
> >>
> >> - Discuss options for the best way to encourage text into large folios:
> >>   - Let the architecture request a preferred size
> >>   - Extend VMA attributes to include a preferred THP size hint
> >
> > We might need this for a couple of other cases.
> >
> > 1. The native heap - for example, a native heap like jemalloc - can
> > configure the base "granularity" and then use MADV_DONTNEED/FREE at that
> > granularity to manage memory. Currently, the default granularity is
> > PAGE_SIZE, which can lead to excessive folio splitting. For instance, if
> > we set jemalloc's granularity to 16KB while sysfs supports 16KB, 32KB,
> > 64KB, etc., splitting can still occur. Therefore, in some cases, I
> > believe the kernel should be aware of how userspace is managing memory.
> >
> > 2. Java heap GC compaction - userfaultfd_move() things.
> > I am considering adding support for batched PTE/folio moves in
> > userfaultfd_move(). If sysfs enables 16KB, 32KB, 64KB, 128KB, etc., but
> > the userspace Java heap moves memory at a 16KB granularity, it could
> > lead to excessive folio splitting.

> Would these heaps ever use a 64K granule or is that too big? If they can
> use 64K, then one simple solution would be to only enable mTHP sizes up to
> 64K (which is the magic size for arm64).

I'm uncertain how Lokesh plans to implement userfaultfd_move() mTHP support,
or what granularity he'll use in the Java heap GC. However, regarding
jemalloc, I've found that 64KB is actually too large - it ends up increasing
memory usage. The issue is that we need at least 64KB of freed small objects
before we can effectively use MADV_DONTNEED. Perhaps we could try 16KB
instead.

The key requirement is that the kernel's maximum large folio size cannot
exceed the memory management granularity used by userspace heap
implementations.

Before implementing madvise-based per-VMA large folios for the Java heap, I
plan to first propose a large-folio-aware userfaultfd_move() and discuss
this approach with Lokesh.

> Alternatively they could use MADV_NOHUGEPAGE today and be guaranteed that
> memory would remain mapped as small folios.

Right. I'm using MADV_NOHUGEPAGE specifically for small size classes in
jemalloc now, since large folios would soon be split due to unaligned
userspace heap management.

> But I see the potential problem if you want to benefit from HPA with a 16K
> granule there but still enable 64K globally. We have briefly discussed the
> idea of supporting MADV_HUGEPAGE via process_madvise() in the past; that
> has an extra param that could encode the size hint(s).

I'm not sure what granularity Lokesh plans to support for moving large
folios in the Java GC. But first, we need kernel support for
userfaultfd_move() with mTHP.
Maybe this could serve as a use case to justify the size hint in
MADV_HUGEPAGE.

> >
> > For exec, it seems we need a userspace-transparent approach. Asking each
> > application to modify its code to madvise the kernel on its preferred
> > exec folio size seems cumbersome.
>
> I would much prefer a transparent approach. If we did take the approach of
> using a per-VMA size hint, I was thinking that could be handled by the
> dynamic linker. Then it's only one place to update.

The dynamic linker (ld.so) primarily manages the runtime linking of shared
libraries for executables. However, isn't the initial memory mapping of the
executable itself (the binary file, e.g. a.out) performed by the kernel
during program execution?

> >
> > I mean, we could whitelist all execs by default unless an application
> > explicitly requests to disable it?
>
> I guess the explicit disable would be MADV_NOHUGEPAGE. But I don't believe
> the pagecache honours this right now; presumably because the memory is
> shared. What would you do if one process disabled it and another didn't?

Correct. My previous concern was that memory-constrained devices could
experience increased memory pressure due to mandatory 64KB read operations.
A particular worry is that a 64KiB folio remains on the LRU queue while any
single subpage is active, whereas smaller folios would have been reclaimable
when inactive. However, this appears unrelated to your patch [1]. Perhaps
such systems should disable file large folios entirely?

[1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/

> Thanks,
> Ryan
>
> >
> >>   - Provide a sysfs knob
> >>   - Plug into the "mapping min folio order" infrastructure
> >>   - Other approaches?
> >>
> >> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
> >> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
> >>
> >> Thanks,
> >> Ryan

Thanks
Barry