Message-ID: <92937776-1e16-47e5-bef9-4c1a04bc98c0@arm.com>
Date: Mon, 25 Sep 2023 09:51:52 +0100
Subject: Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
To: John Hubbard, Matthew Wilcox, Yang Shi, "Yin, Fengwei", Yu Zhao, Zi Yan, David Hildenbrand, David Rientjes, Andrew Morton, Vlastimil Babka, "Kirill A. Shutemov"
Cc: Linux-MM
References: <4966f496-9f71-460c-b2ab-8661384ce626@arm.com> <4830fb3e-4a35-4842-98f4-9e7baa0e692a@arm.com> <7301771f-d654-4e5a-a197-3a3d8750440c@nvidia.com>
From: Ryan Roberts
In-Reply-To: <7301771f-d654-4e5a-a197-3a3d8750440c@nvidia.com>

On 23/09/2023 01:33, John Hubbard wrote:
> On 9/22/23 08:48, Ryan Roberts wrote:
> ...
>> I never had any feedback on the below; I'm not sure if that means everyone is
>> happy or that nobody read it??
>
> One can never really know: zero or more people read it, and of those, no
> one hated it enough to send out a quick NAK. So that's a *possible*,
> lukewarm endorsement of sorts. Success! :)

You really know how to fill a guy with confidence! ;-)

>
> ...
>
>> BUT I've had yet another idea on the controls front, which would enable exposing
>> this to user space as an extension to transparent_hugepage, while continuing to
>> support THP as is and also be able to control THP and ALF (anon large folio)
>
> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
> is a nice touch.

Well, if we go the route of the newest proposal, then I guess the naming is less
important, because it all attaches to transparent_hugepage.

>
> ...
>
>> Add 2 controls to sysfs:
>>
>> /sys/kernel/mm/transparent_hugepage/anon_orders
>>    - bitfield where set bits are orders that will be tried during allocation
>>    - defaults to 1<<PMD_ORDER
>>    - For now, 1<<PMD_ORDER [...]
>>    - To enable ALF, set the appropriate lower bits
>>    - To disable THP, clear 1<<PMD_ORDER
>>    - (In future we could add an "auto" option too)
>>
>> /sys/kernel/mm/transparent_hugepage/anon_always_mask
>>    - orders in (anon_orders & anon_always_mask) are not subject to madvise
>>    - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
>>      as if enabled=always
>>    - defaults to 0 (all subject to madvise)
>>
>
> I *think* I like this a lot,

On the weight of this lukewarm endorsement, I'm going to code it up and aim to
post something for discussion end of this week. ;-)

> although I have some clarifying question
> below. It seems to address the key things that have been complicating
> the discussions: the API is now looking more flexible, and yet still
> easy to understand and reason about. Nice.
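To make the intended semantics of the two controls a bit more concrete, here is
a rough Python model of how they might combine with the existing THP knobs. It
is only a sketch for discussion, not the planned kernel code: the function name,
argument names and string values are hypothetical, PMD_ORDER = 9 assumes 4K base
pages, and the handling of "never", MADV_NOHUGEPAGE and the prctl is my reading
of the policy tables further down in this mail.

```python
# Illustrative model only; these names are hypothetical, not kernel symbols.
PMD_ORDER = 9  # assumption: 4K base pages, 2M PMD


def eligible_orders(anon_orders, anon_always_mask, enabled,
                    hint=None, prctl_disabled=False):
    """Return the bitfield of large orders that may be tried for a fault.

    enabled: "never", "madvise" or "always" (the existing THP sysfs knob).
    hint:    None, "hugepage" (MADV_HUGEPAGE) or "nohugepage" (MADV_NOHUGEPAGE).
    A result of 0 means only a single page (order-0) will be allocated.
    """
    if prctl_disabled or enabled == "never" or hint == "nohugepage":
        return 0
    if enabled == "always" or hint == "hugepage":
        return anon_orders
    # enabled == "madvise" without a hint: only the "always" subset applies.
    return anon_orders & anon_always_mask
```

With the defaults (anon_orders = 1<<PMD_ORDER, anon_always_mask = 0) this model
gives the legacy THP behaviour; setting lower bits in both fields would let
those orders be tried even when enabled=madvise and no hint was given.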
>
> A couple of questions about how this works:
>
>>
>> The defaults for those controls give you "legacy THP". But you can modify the
>> controls to generate policies like this:
>>
>>
>
> For these tables, a small key or legend would help. I've forgotten already
> what "S" means, and am also vague about exactly what "THP>ALF>S" behavior
> means, too.

THP: transparent hugepage allocation; specifically PMD sized/aligned/mapped.
ALF: anonymous large folio allocation; specifically some order between
     [PMD_ORDER-1, 1]. Always PTE-mapped.
S:   single page allocation; order-0, always PTE-mapped.

I've found these discrete logical buckets useful for thinking about the problem,
although the implementation doesn't always treat them completely separately (S
is just a final fallback order in ALF's list of orders to try) and the new
proposal exposes both THP and ALF through a unified THP interface.

The '>' indicates 'fallback'. Fallback happens for a few different reasons: the
VMA is too small to contain the proposed folio order, or some PTEs that would be
covered by the new folio are already populated, etc. ALF usually isn't just a
single order either - it has a list of orders that it will try.

Possibly all a bit confusing, but this is the nomenclature I've been using in
the context of all the discussions so far and I wanted to try to keep things
comparable.

>
>> THP only - existing behaviour (default):
>> ----------------------------------------
>>
>> anon_orders = 1<<PMD_ORDER
>> anon_always_mask = 0
>>
>> thp prctl:            | dis       | ena       | ena       | ena
>
> All I see in the prctl(2) man page is PR_SET_THP_DISABLE, I don't
> see any _ENABLE. What does the above refer to?

dis: PR_SET_THP_DISABLE with arg2=1 (thp disabled via prctl)
ena: PR_SET_THP_DISABLE with arg2=0 (thp not disabled via prctl)

I was trying to illustrate that ALF is now also affected by this prctl. With the
previous proposal it was independent of THP and therefore independent of this prctl.
Of course it would still be _possible_ to ignore this control for the ALF
orders, but I think that risks being very confusing for users.

>
>
>> thp sysfs:            | X         | never     | madvise   | always
>> ----------------------|-----------|-----------|-----------|-------------
>> no hint               | S         | S         | S         | THP>S
>> MADV_HUGEPAGE         | S         | S         | THP>S     | THP>S
>> MADV_NOHUGEPAGE       | S         | S         | S         | S
>>
>>
> ...
>>
>> It does have the disadvantage that ALF is tied to MADV_HUGEPAGE, whereas the
>
> Right, that is a little awkward. But maybe less so now, with this new proposal,
> which leaves THP a little closer to ALF.

Indeed, this approach makes it clearer/easier for users to understand, because
conceptually we are just introducing a wider set of folio sizes that THP can use
and all the existing THP controls continue to mean what they always meant.

The only risk I see is if there are workloads that want to use both (PMD) THP
and ALF, but in different VMAs, and they absolutely do not want the possibility
of ALF in the (PMD) THP area if THP fails, and instead always fall back to
Single allocations for that VMA. But that sounds very niche to me, and it would
be better solved by the additional (future) introduction of a set of allowed
orders that can be attached to a specific VMA.

There are a couple of other wrinkles that I didn't highlight in my first mail:

 - khugepaged will continue to work only on PMD-sized THP. It will ignore the
   new ALF orders. This was always the plan, but if we expose the ALF
   functionality to user space through the THP interface, does that make it
   confusing? I don't think it's a big issue personally. And we can always
   enhance khugepaged to work on
>
> thanks,
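For anyone who finds the '>' fallback notation in the tables easier to follow as
code, here is a small Python sketch of the idea: try the allowed orders from
largest to smallest and fall back to a single page. It is a toy model under
stated assumptions, not the implementation: the function and parameter names are
hypothetical, pages are 4K by default, and it only models the two fallback
reasons named above (VMA too small, some PTEs already populated).

```python
def pick_order(orders, addr, vma_start, vma_end, populated, page_shift=12):
    """Pick the largest allowed order whose naturally aligned range fits.

    orders:    bitfield of candidate orders (bit n set => order n allowed).
    populated: set of page-aligned addresses whose PTEs are already present.
    Falls back to order 0 (a single page) when nothing larger fits.
    """
    page = 1 << page_shift
    for order in range(orders.bit_length() - 1, 0, -1):
        if not orders & (1 << order):
            continue
        size = page << order
        start = addr & ~(size - 1)  # align down to the folio size
        if start < vma_start or start + size > vma_end:
            continue  # VMA too small to contain this order
        if any(a in populated for a in range(start, start + size, page)):
            continue  # some PTEs in the range are already populated
        return order
    return 0
```

For example, with orders = (1 << 9) | (1 << 4), a fault in an empty, 2M-aligned
2M VMA would pick order 9 in this model, while the same fault with one page of
that range already populated elsewhere would fall back to order 4.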