From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B8832C3065C for ; Thu, 4 Jul 2024 09:43:25 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CEDE96B00AE; Thu, 4 Jul 2024 05:43:24 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C9CBD6B00B0; Thu, 4 Jul 2024 05:43:24 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B64796B00B1; Thu, 4 Jul 2024 05:43:24 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 9902D6B00AE for ; Thu, 4 Jul 2024 05:43:24 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 3FE3F1C2887 for ; Thu, 4 Jul 2024 09:43:24 +0000 (UTC) X-FDA: 82301582328.28.4F932A5 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf18.hostedemail.com (Postfix) with ESMTP id 0CCA91C0019 for ; Thu, 4 Jul 2024 09:43:20 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=none; spf=pass (imf18.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com; dmarc=pass (policy=none) header.from=arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1720086189; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Ff9PBeY4OMQ4uf9JONohZaJ6iIQWoz3hRHA6I7dZ7RY=; b=V2OZopB0Lkwc8SH2U0pSXytk5ogqwktgiRE0ByXqtCbpLLKf+8YvwPO8HKOTmgYod6rMf1 gLq484zLBbXSiegBoKbKACl4WSgyMc9GA7uFbuEdOx0uEXMDSiYCLpOqwTPMNRn7+KJqts OkCsr0JtKImhYvJx8Ltyhf455Gjqg8I= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=none; spf=pass (imf18.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com; dmarc=pass (policy=none) header.from=arm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1720086189; a=rsa-sha256; cv=none; b=s/QTDquc/cqnLmGCpCzkMsI7/w+UpQIApFt758DAxaJQuGxIwwBLH+w4lzrAQROxTrWm22 gnCgClkX72QsPmPRWqGfQhDVadEHaCZUurlFlQgQaBBIBJkz0moH7kXNpNizqxO9qVBCH2 QbrojRa4yxafsQSH01KSpgH8bDO40hc= Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 2B7DC367; Thu, 4 Jul 2024 02:43:45 -0700 (PDT) Received: from [10.1.29.168] (XHFQ2J9959.cambridge.arm.com [10.1.29.168]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 943333F766; Thu, 4 Jul 2024 02:43:18 -0700 (PDT) Message-ID: <4de05ed8-6ec7-4b90-942c-a170a26be384@arm.com> Date: Thu, 4 Jul 2024 10:43:17 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] support "THPeligible" semantics for mTHP with anonymous shmem Content-Language: en-GB To: Yang Shi Cc: David Hildenbrand , Baolin Wang , Bang Li , hughd@google.com, akpm@linux-foundation.org, wangkefeng.wang@huawei.com, ziy@nvidia.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20240628104926.34209-1-libang.li@antgroup.com> <4b38db15-0716-4ffb-a38b-bd6250eb93da@arm.com> <4d54880e-03f4-460a-94b9-e21b8ad13119@linux.alibaba.com> <516aa6b3-617c-4642-b12b-0c5f5b33d1c9@arm.com> <597ac51e-3f27-4606-8647-395bb4e60df4@redhat.com> <6f68fb9d-3039-4e38-bc08-44948a1dae4d@arm.com> <992cdbf9-80df-4a91-aea6-f16789c5afd7@redhat.com> <2e0a1554-d24f-4d0d-860b-0c2cf05eb8da@arm.com> <06c74db8-4d10-4a41-9a05-776f8dca7189@redhat.com> <429f2873-8532-4cc8-b0e1-1c3de9f224d9@arm.com> <7a0bbe69-1e3d-4263-b206-da007791a5c4@redhat.com> <2450e4f8-236f-49ce-8bd3-b30a6d8c5e57@arm.com> From: Ryan Roberts In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 0CCA91C0019 X-Stat-Signature: kznrewdieofqxt3arborat5utfy6eagq X-HE-Tag: 1720086200-835207 X-HE-Meta: U2FsdGVkX18qVjW3j/t0/Ycrn2eu7Ohft9m0rpV93lt1aOMu83rypSDKRCBXdRW9n2rqtVY77arHev/igvTzS4Z50pY6eG2qQpa1AVfxv3FZ5Yf3SvjAwlpbXrWp09s/oDnUfeGv/zGsQq/grbHpNmAdgI+Oinf3Jk6Mi31UTZRsfYpQYln3nGh6PUB0ZKkuk6inJplI8f3tDQwVu8TOrUJKfnJfRyiIAea0GO3mX3K+E31ZLGPnNoAb9GBpYEDg28wcWmOmiba+TXnSIZ0mN4bpIfHZPxrq2ZgAtBVEBPVBqMDRn4z6TXqHIminqcHP4vJw+DlxbmwMGwPAKP5Smxyf0MmC6+AkaxE3MTWZpT25OVt6bjkBq/BO7+BrQwJ1fpeUAGaAVmAv1QVTyuyo5oHv/iGYb6tMJD355avBcAalE1Vs3oG/bAYzkgvnvNd7OZgWs/ITsMR0yDBClvNmsro+blwSZhQYzZQqUO2ODalmB5GxULW4TypMAK1so3dT9xokj5/Zqe4fOGqG09Pmk0OsT1Xz49uRpsoNBDHzOUb/GZYmNcZI3cuvmX37b66CIoybAqntuhFpVGG9X8SDoJcvOLjcpi7cVAj8KxOggCZCi108Ep6NROgGK0T/9gnHwplRCxZTiXFKco1s6n5mP0Z2Nxf5Mh2i4bb9MNSaUx7pBAStCad/M6NxysLS5xmy5VIg3rjA8iy3ZNjB/E/xXuINzHKRg5byvX51upqlhNuotrofWnQe3gsxupIoILD9xKEsqJA1AQ5YrQCFcw2ZDTs4M/Buvt49vjqSUQug0+BkPjnevnYMy7NMh3Pkss+ufmVVulUAXHBBeGgZJluPXYqkcZdlOwX5dQCRuCE05l85ElBHBt8pxU0JCJcQWOkcRR24g0e2umvHKyFMI4Y37sa0Ucw3JzkGSRVO0G7oPn5kV++jsBneI6ulrJuyTTQMSCUcAIM6VAAgdA/Ai6m uNR16ezQ oOz4XCCEQPgsFWgFtYsnNhZEBa87Dd1LsUIgG5lZG/DTuutsd7U9iVIu5yh2vYp4B8aAo+Hx0N6DkqtluotFzECWGv/JqGrYgMogwMKjFf23ZkuqDflYvRz4p/4Egw0svanCTqX9SlcmGSZG2omUNw0OI53n0g+I4BZZnvlP6bF+ElNzFMiTisLEE6rzLvz+AaILh78OFNY1epupq9ttLWgPkjRWZp5Br3T1SGLM+flEVWmI1A92zpkkS/vNqgm6taq32AtH9OmBGFLiUnzf1z2Ohz/oWhS9mCK4SgSB6gY8N2f1sgucP/79c1w== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 03/07/2024 17:08, Yang Shi wrote: > On Tue, Jul 2, 2024 at 1:24 AM Ryan Roberts wrote: >> >> On 01/07/2024 19:20, Yang Shi wrote: >>> On Mon, Jul 1, 2024 at 3:23 AM David Hildenbrand wrote: >>>> >>>> On 01.07.24 12:16, Ryan Roberts wrote: >>>>> On 01/07/2024 10:17, David Hildenbrand wrote: >>>>>> On 01.07.24 11:14, Ryan Roberts wrote: >>>>>>> On 01/07/2024 09:57, David Hildenbrand wrote: >>>>>>>> On 01.07.24 10:50, Ryan Roberts wrote: >>>>>>>>> On 01/07/2024 09:48, David Hildenbrand wrote: >>>>>>>>>> On 01.07.24 10:40, Ryan Roberts wrote: >>>>>>>>>>> On 01/07/2024 09:33, Baolin Wang wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 2024/7/1 15:55, Ryan Roberts wrote: >>>>>>>>>>>>> On 28/06/2024 11:49, Bang Li wrote: >>>>>>>>>>>>>> After the commit 7fb1b252afb5 ("mm: shmem: add mTHP support for >>>>>>>>>>>>>> anonymous shmem"), we can configure different policies through >>>>>>>>>>>>>> the multi-size THP sysfs interface for anonymous shmem. But >>>>>>>>>>>>>> currently "THPeligible" indicates only whether the mapping is >>>>>>>>>>>>>> eligible for allocating THP-pages as well as the THP is PMD >>>>>>>>>>>>>> mappable or not for anonymous shmem, we need to support semantics >>>>>>>>>>>>>> for mTHP with anonymous shmem similar to those for mTHP with >>>>>>>>>>>>>> anonymous memory. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Signed-off-by: Bang Li >>>>>>>>>>>>>> --- >>>>>>>>>>>>>> fs/proc/task_mmu.c | 10 +++++++--- >>>>>>>>>>>>>> include/linux/huge_mm.h | 11 +++++++++++ >>>>>>>>>>>>>> mm/shmem.c | 9 +-------- >>>>>>>>>>>>>> 3 files changed, 19 insertions(+), 11 deletions(-) >>>>>>>>>>>>>> >>>>>>>>>>>>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >>>>>>>>>>>>>> index 93fb2c61b154..09b5db356886 100644 >>>>>>>>>>>>>> --- a/fs/proc/task_mmu.c >>>>>>>>>>>>>> +++ b/fs/proc/task_mmu.c >>>>>>>>>>>>>> @@ -870,6 +870,7 @@ static int show_smap(struct seq_file *m, void *v) >>>>>>>>>>>>>> { >>>>>>>>>>>>>> struct vm_area_struct *vma = v; >>>>>>>>>>>>>> struct mem_size_stats mss = {}; >>>>>>>>>>>>>> + bool thp_eligible; >>>>>>>>>>>>>> smap_gather_stats(vma, &mss, 0); >>>>>>>>>>>>>> @@ -882,9 +883,12 @@ static int show_smap(struct seq_file *m, void >>>>>>>>>>>>>> *v) >>>>>>>>>>>>>> __show_smap(m, &mss, false); >>>>>>>>>>>>>> - seq_printf(m, "THPeligible: %8u\n", >>>>>>>>>>>>>> - !!thp_vma_allowable_orders(vma, vma->vm_flags, >>>>>>>>>>>>>> - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL)); >>>>>>>>>>>>>> + thp_eligible = !!thp_vma_allowable_orders(vma, vma->vm_flags, >>>>>>>>>>>>>> + TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL); >>>>>>>>>>>>>> + if (vma_is_anon_shmem(vma)) >>>>>>>>>>>>>> + thp_eligible = >>>>>>>>>>>>>> !!shmem_allowable_huge_orders(file_inode(vma->vm_file), >>>>>>>>>>>>>> + vma, vma->vm_pgoff, thp_eligible); >>>>>>>>>>>>> >>>>>>>>>>>>> Afraid I haven't been following the shmem mTHP support work as much as I >>>>>>>>>>>>> would >>>>>>>>>>>>> have liked, but is there a reason why we need a separate function for >>>>>>>>>>>>> shmem? >>>>>>>>>>>> >>>>>>>>>>>> Since shmem_allowable_huge_orders() only uses shmem specific logic to >>>>>>>>>>>> determine >>>>>>>>>>>> if huge orders are allowable, there is no need to complicate the >>>>>>>>>>>> thp_vma_allowable_orders() function by adding more shmem related logic, >>>>>>>>>>>> making >>>>>>>>>>>> it more bloated. In my view, providing a dedicated helper >>>>>>>>>>>> shmem_allowable_huge_orders(), specifically for shmem, simplifies the logic. >>>>>>>>>>> >>>>>>>>>>> My point was really that a single interface (thp_vma_allowable_orders) >>>>>>>>>>> should be >>>>>>>>>>> used to get this information. I have no strong opinon on how the >>>>>>>>>>> implementation >>>>>>>>>>> of that interface looks. What you suggest below seems perfectly reasonable >>>>>>>>>>> to me. >>>>>>>>>> >>>>>>>>>> Right. thp_vma_allowable_orders() might require some care as discussed in >>>>>>>>>> other >>>>>>>>>> context (cleanly separate dax and shmem handling/orders). But that would be >>>>>>>>>> follow-up cleanups. >>>>>>>>> >>>>>>>>> Are you planning to do that, or do you want me to send a patch? >>>>>>>> >>>>>>>> I'm planning on looking into some details, especially the interaction with large >>>>>>>> folios in the pagecache. I'll let you know once I have a better idea what >>>>>>>> actually should be done :) >>>>>>> >>>>>>> OK great - I'll scrub it from my todo list... really getting things done today :) >>>>>> >>>>>> Resolved the khugepaged thiny already? :P >>>>>> >>>>>> [khugepaged not active when only enabling the sub-size via the 2M folder IIRC] >>>>> >>>>> Hmm... baby brain? >>>> >>>> :) >>>> >>>> I think I only mentioned it in a private mail at some point. >>>> >>>>> >>>>> Sorry about that. I've been a bit useless lately. For some reason it wasn't on >>>>> my list, but its there now. Will prioritise it, because I agree it's not good. >>>> >>>> >>>> IIRC, if you do >>>> >>>> echo never > /sys/kernel/mm/transparent_hugepage/enabled >>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled >>>> >>>> khugepaged will not get activated. >>> >>> khugepaged is controlled by the top level knob. >> >> What do you mean by "top level knob"? I assume >> /sys/kernel/mm/transparent_hugepage/enabled ? > > Yes. > >> >> If so, that's not really a thing in its own right; its just the legacy PMD-size >> THP control, and we only take any notice of it if a per-size control is set to >> "inherit". So if we have: >> >> # echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled >> >> Then by design, /sys/kernel/mm/transparent_hugepage/enabled should be ignored. >> >>> But the above setting >>> sounds confusing, can we disable the top level knob, but enable it on >>> a per-order basis? TBH, it sounds weird and doesn't make too much >>> sense to me. >> >> Well that's the design and that's how its documented. It's done this way for >> back-compat. All controls are now per-size. But at boot, we default all per-size >> controls to "never" except for the PMD-sized control, which is defaulted to >> "inherit". That way, an unenlightened user-space can still control PMD-sized THP >> via the legacy (top-level) control. But enlightened apps can directly control >> per-size. > > OK, good to know. > >> >> I'm not sure how your way would work, because you would have 2 controls >> competing to do the same thing? > > I don't see how they compete if they are 2-level knobs. I'm not sure I understand exactly how your 2-level proposal works. Could you explain in more detail? The problem as I see it, is that the control can take multiple values; "never", "always" or "madvise". In a two-level scheme, what do we do when top level says "always" but per-size control says "madvise", or vice-versa? The scheme we adopted has clear and obvious (to me at least) semantics in this case. The other problem is that the top-level control is still used to control file memory collapse (when CONFIG_READ_ONLY_THP_FOR_FS is configured). If you're advocating for a scheme where the top-level is set to the most permissive you want to allow, then the per-size controls are only able to further restrict, that would make it impossible to, for instance set all 2M THP (inc file-backed) to madvise, but set all 64K THP to always. > And I failed > to see how it achieved back-compat. For example, memcached reads > /sys/kernel/mm/transparent_hugepage/enabled to determine whether it > should manage memory in huge page (2M) granularity. If the setting is > set to : > > # echo never > /sys/kernel/mm/transparent_hugepage/enabled > # echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > > memcached will manage memory in 4K granularity, but 2M THP is actually > enabled unless memcached checks the per-order knobs. > > If we use 2-level mode, memcached doesn't need check per-order setting > at all in order to know whether THP is enabled or not. And it actually > doesn't care about what orders are enabled, it assumes THP size is 2M > (or PMD size). Even though 2M is not enabled but lower orders are > enabled, memcached still can fully utilize the mTHP since the memory > chunk managed by memcached is still 2M aligned in this setting. So > unenlightened applications still can work well. Jemalloc should do the > similar thing if I remember correctly. I wonder why we didn't decide to just make /sys/kernel/mm/transparent_hugepage/enabled an alias for /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled ? That may have solved this problem more cleanly? But that would have made it difficult to introduce "auto" in future (the idea was to set all per-size to 'inherit' and then set top-level to 'auto'). > >> >>> >>>> >>>> -- >>>> Cheers, >>>> >>>> David / dhildenb >>>> >>>> >>