From: Barry Song <21cnbao@gmail.com>
Date: Wed, 22 Jan 2025 19:52:18 +1300
Subject: Re: mm: CMA reservations require 32MiB alignment in 16KiB page size kernels instead of 8MiB in 4KiB page size kernel.
To: Juan Yescas
Cc: Zi Yan, David Hildenbrand, linux-mm@kvack.org, muchun.song@linux.dev, rppt@kernel.org, osalvador@suse.de, akpm@linux-foundation.org, lorenzo.stoakes@oracle.com, Jann Horn, Liam.Howlett@oracle.com, minchan@kernel.org, jaewon31.kim@samsung.com, Suren Baghdasaryan, Kalesh Singh, "T.J. Mercier", Isaac Manjarres, iamjoonsoo.kim@lge.com, quic_charante@quicinc.com
References: <463eb421-ac16-435c-b0a0-51a6a92168f6@redhat.com> <8f36d3ca-3a31-4fc4-9eaa-c53ee84bf6e7@redhat.com>

On Wed, Jan 22, 2025 at 5:06 PM Juan Yescas wrote:
>
> On Tue, Jan 21, 2025 at
6:24 PM Zi Yan wrote:
> >
> > On Tue Jan 21, 2025 at 9:08 PM EST, Juan Yescas wrote:
> > > On Mon, Jan 20, 2025 at 9:59 AM David Hildenbrand wrote:
> > > >
> > > > On 20.01.25 16:29, Zi Yan wrote:
> > > > > On Mon Jan 20, 2025 at 3:14 AM EST, David Hildenbrand wrote:
> > > > >> On 20.01.25 01:39, Zi Yan wrote:
> > > > >>> On Sun Jan 19, 2025 at 6:55 PM EST, Barry Song wrote:
> > > > >>>
> > > > >>>>>>>> However, with this workaround, we can't use transparent huge pages.
> > > > >>>>>>>>
> > > > >>>>>>>> Is the CMA_MIN_ALIGNMENT_BYTES requirement alignment only to support huge pages?
> > > > >>>>>
> > > > >>>>> No. CMA_MIN_ALIGNMENT_BYTES is limited by CMA_MIN_ALIGNMENT_PAGES, which
> > > > >>>>> is equal to the pageblock size. Enabling THP just bumps the pageblock size.
> > >
> > > Thanks, I can see the initialization in include/linux/pageblock-flags.h:
> > >
> > > #define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
> > >
> > > > >>>> Currently, THP might be mTHP, which can have a significantly smaller
> > > > >>>> size than 32MB. For example, on arm64 systems with a 16KiB page size,
> > > > >>>> a 2MB CONT-PTE mTHP is possible. Additionally, mTHP relies on the
> > > > >>>> CONFIG_TRANSPARENT_HUGEPAGE configuration.
> > > > >>>>
> > > > >>>> I wonder if it's possible to enable CONFIG_TRANSPARENT_HUGEPAGE
> > > > >>>> without necessarily using 32MiB THP. If we use other sizes, such as
> > > > >>>> 64KiB, perhaps a large pageblock size wouldn't be necessary?
> > >
> > > Do you mean with mTHP? We haven't explored that option.
> >
> > Yes. Unless your applications have special demands for PMD THPs, 2MB
> > mTHP should work.
> >
> > > > >>>
> > > > >>> I think this should work by reducing MAX_PAGE_ORDER like Juan did for
> > > > >>> the experiment.
But MAX_PAGE_ORDER is a macro right now,
> > > > >>> Kconfig needs to be changed and the kernel needs to be recompiled. Not
> > > > >>> sure if that is OK for Juan's use case.
> > >
> > > The main goal is to reserve only the necessary CMA memory for the
> > > drivers, which is usually the same for 4KiB and 16KiB page size kernels.
> >
> > Got it. Based on your experiment, you changed MAX_PAGE_ORDER to get the
> > minimal CMA alignment size. Can you deploy that kernel to production?
>
> We can't deploy that because many Android partners are using PMD THP instead
> of mTHP.
>
> > If yes, you can use mTHP instead of PMD THP and still get the CMA
> > alignment you want.
> >
> > > > >>
> > > > >> IIRC, we set pageblock size == THP size because this is the granularity
> > > > >> we want to optimize defragmentation for. ("try to keep pageblock
> > > > >> granularity of the same memory type: movable vs. unmovable")
> > > > >
> > > > > Right. In the past, it was optimized for PMD THP. Now we have mTHP. If the
> > > > > user does not care about PMD THP (32MB in the ARM64 16KB base page case)
> > > > > and mTHP (2MB mTHP here) is good enough, reducing the pageblock size works.
> > > > >
> > > > >> However, the buddy already supports having different pagetypes for large
> > > > >> allocations.
> > > > >
> > > > > Right. To be clear, only MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, and
> > > > > MIGRATE_MOVABLE can be merged.
> > > >
> > > > Yes! And a THP cannot span partial MIGRATE_CMA, which would be fine.
> > > >
> > > > >> So we could leave MAX_ORDER alone and try adjusting the pageblock size
> > > > >> in these setups. pageblock size is already variable on some
> > > > >> architectures IIRC.
> > >
> > > Which values would work for the CMA_MIN_ALIGNMENT_BYTES macro?
In the
> > > 16KiB page size kernel, I tried these two configurations:
> > >
> > > #define CMA_MIN_ALIGNMENT_BYTES (2048 * CMA_MIN_ALIGNMENT_PAGES)
> > >
> > > and
> > >
> > > #define CMA_MIN_ALIGNMENT_BYTES (4096 * CMA_MIN_ALIGNMENT_PAGES)
> > >
> > > With both of them, the kernel failed to boot.
> >
> > CMA_MIN_ALIGNMENT_BYTES needs to be PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES.
> > So you need to adjust CMA_MIN_ALIGNMENT_PAGES, which is set by the pageblock
> > size. The pageblock size is determined by the pageblock order, which is
> > affected by MAX_PAGE_ORDER.
> >
> > > > > Making pageblock size a boot-time variable? We might want to warn the
> > > > > sysadmin/user that >pageblock_order THP/mTHP creation will suffer.
> > > >
> > > > Yes, some way to configure it.
> > > >
> > > > >> We'd only have to check if all of the THP logic can deal with pageblock
> > > > >> size < THP size.
> > >
> > > The reason that THP was disabled in my experiment is that this
> > > assertion failed:
> > >
> > > mm/huge_memory.c
> > > /*
> > >  * hugepages can't be allocated by the buddy allocator
> > >  */
> > > MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER > MAX_PAGE_ORDER);
> > >
> > > when
> > >
> > > config ARCH_FORCE_MAX_ORDER
> > >         int
> > >         .....
> > >         default "8" if ARM64_16K_PAGES
> >
> > You can remove that BUILD_BUG_ON and turn on mTHP and see if mTHP works.
>
> We'll do that and post the results.
>
> > > > > Probably yes, pageblock should be independent of THP logic, although
> > > > > compaction (used to create THPs) logic is based on pageblock.
> > > >
> > > > Right. As raised in the past, we need a higher-level mechanism that
> > > > tries to group pageblocks together during compaction/conversion to limit
> > > > fragmentation on a higher level.
> > > >
> > > > I assume that many use cases would be fine with not using 32MB/512MB
> > > > THPs at all for now -- and instead using 2MB ones.
Of course, for very
> > > > large installations it might be different.
> > > >
> > > > >> This issue is even more severe on arm64 with 64k (pageblock = 512MiB).
> > >
> > > I agree, and if ARCH_FORCE_MAX_ORDER is configured to the max value we get:
> > >
> > > PAGE_SIZE | max MAX_PAGE_ORDER | CMA_MIN_ALIGNMENT_BYTES
> > > 4KiB      | 15                 | 4KiB * 32Ki pages = 128MiB
> > > 16KiB     | 13                 | 16KiB * 8Ki pages = 128MiB
> > > 64KiB     | 13                 | 64KiB * 8Ki pages = 512MiB
> > >
> > > > > This is also good for virtio-mem, since the offline memory block size
> > > > > can also be reduced. I remember you complained about it before.
> > > >
> > > > Yes, yes, yes! :)
> >
> > David's proposal should work in general, but might take a non-trivial
> > amount of work:
> >
> > 1. keep the pageblock size always at 4MB for all arches.
> > 2. adjust existing pageblock users, like compaction, to work on a
> >    different range, independent of pageblock.
> >    a. for the anti-fragmentation mechanism, multiple pageblocks might have
> >       different migratetypes but would be compacted to generate huge
> >       pages; how to align their migratetypes is TBD.
> > 3. other corner-case handling.
> >
> > The final question is that Barry mentioned that over-reserved CMA areas
> > can be used for movable page allocations. Why does that not work for you?
>
> I need to run more experiments to see what type of page allocation in
> the system is the dominant one (unmovable or movable). If it is movable,
> over-reserved CMA areas should be fine.

My understanding is that over-reserving 28MiB is unlikely to cause a
noticeable regression, given that we frequently handle allocations such
as GFP_HIGHUSER_MOVABLE, which are significantly larger than 28MiB.
However, David also mentioned a reservation of 512MiB for a 64KiB page
size. In that case, 512MiB might be large enough to impact the balance
between movable and unmovable allocations.
For instance, if we still have 512MiB reserved in CMA but are
allocating unmovable folios (for example, dma-buf), we could fail an
allocation even when there is actually capacity. So, in any case,
there is still work to be done here.

By the way, is 512MiB truly a reasonable size for THP? It seems that
2MiB is a more suitable default size for THP. 4KiB, 16KiB, and 64KiB
page sizes all support 2MB large folios: for 4KiB it is PMD-mapped;
for 16KiB and 64KiB it is cont-pte.

> >
> > --
> > Best Regards,
> > Yan, Zi

Thanks
Barry