Message-ID: <4830fb3e-4a35-4842-98f4-9e7baa0e692a@arm.com>
Date: Fri, 22 Sep 2023 16:48:44 +0100
Subject: Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal
From: Ryan Roberts
To: Matthew Wilcox, Yang Shi, "Yin, Fengwei", Yu Zhao, Zi Yan, David Hildenbrand, David Rientjes, Andrew Morton, Vlastimil Babka, John Hubbard, "Kirill A. Shutemov"
Cc: Linux-MM
References: <4966f496-9f71-460c-b2ab-8661384ce626@arm.com>
In-Reply-To: <4966f496-9f71-460c-b2ab-8661384ce626@arm.com>

On 14/09/2023 09:16, Ryan Roberts wrote:
> Hi All,
>
> Thanks for participating in the discussion yesterday - it finally feels like we
> are converging on the MVP feature set.
> Below are my notes from the call and a
> modified proposal for controls and stats. It would be great if we can continue
> to review and refine over email. I'm also planning to post an implementation
> within the next couple of weeks, which I hope will also accelerate convergence.

I never had any feedback on the below; I'm not sure if that means everyone is
happy or that nobody read it?

I've got an implementation for all of the below ready to go, with a few tweaks
to the details (the main change is that anon_orders is now a bitfield where each
set bit represents an order in the set, rather than the originally proposed
comma-separated list of orders).

BUT I've had yet another idea on the controls front, which would enable exposing
this to user space as an extension to transparent_hugepage, while continuing to
support THP as is and still allowing THP and ALF (anon large folio) usage to be
controlled independently. On reflection, I think it is cleaner to do it this way
for a couple of reasons:

- We don't have to introduce a whole new feature (ALF) to the user. Most of the
  concepts and controls overlap a lot with THP anyway, so if we can make it look
  like an extension, I think it would be easier to communicate.

- The approach I have in mind would make it easy to extend to orders greater
  than PMD_ORDER in future if that's a direction we want to eventually go.
  Because >PMD_ORDER implies multiple PMD entries, it would half belong to THP
  and half belong to ALF in the current proposal, which is nasty.

I'll lay out the new proposal now, but I suspect this will ultimately warrant
another mm alignment meeting...
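
For readers unfamiliar with the bitfield form of anon_orders mentioned above,
here is a minimal sketch of how such a value could be expanded into the list of
orders to try, highest first. It is not taken from the implementation: the
helper name orders_from_bitfield() and the MAX_ORDERS limit are hypothetical,
and appending order-0 as the final fallback is an assumption carried over from
the earlier comma-separated proposal quoted further down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on the number of orders; not from the series. */
#define MAX_ORDERS 32

/*
 * Expand an anon_orders-style bitfield into a descending list of orders.
 * Bit N set means "order N is in the set". Order-0 is always written last
 * as the implicit fallback (assumption, see lead-in). Entries that do not
 * fit in 'out' are dropped. Returns the number of entries written.
 */
static size_t orders_from_bitfield(unsigned int bits,
                                   unsigned int *out, size_t out_len)
{
    size_t n = 0;

    /* Walk from the highest order down to order-1. */
    for (int order = MAX_ORDERS - 1; order > 0; order--) {
        if ((bits & (1u << order)) && n < out_len)
            out[n++] = (unsigned int)order;
    }

    /* Order-0 (a single base page) is the last resort. */
    if (n < out_len)
        out[n++] = 0;

    return n;
}
```

With bits 9 and 4 set (arbitrary example orders), the helper yields the
sequence 9, 4, 0, which mirrors the "THP>ALF>S" style of fallback chain used in
the tables of this mail.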
Add 2 controls to sysfs:

/sys/kernel/mm/transparent_hugepage/anon_orders
  - bitfield where set bits are orders that will be tried during allocation
  - defaults to 1<S
MADV_HUGEPAGE         | S         | S         | THP>S     | THP>S
MADV_NOHUGEPAGE       | S         | S         | S         | S

ALF only:
---------

anon_orders = 1<<3 (order-3 - example)
anon_always_mask = 0

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | S         | ALF>S
MADV_HUGEPAGE         | S         | S         | ALF>S     | ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S

THP and ALF:
------------

anon_orders = 1<ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S

THP and ALF, with THP=always, ALF=advise:
-----------------------------------------

anon_orders = 1<S | THP>ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S

THP and ALF, with THP=madvise, ALF=always:
------------------------------------------

anon_orders = 1<S | THP>ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S

It does have the disadvantage that ALF is tied to MADV_HUGEPAGE, whereas the
below approach introduces a new, independent MADV_LARGEFOLIO. But personally I
don't see that as a major issue. And we could solve it in future by extending
MADV_HUGEPAGE to add a vma-specific set of orders, via the process_madvise
flags.

Thoughts? I'll hold off posting the implementation of the below for now, while
we decide if it's better to head in this direction.

Thanks,
Ryan

>
>
> Roadmap
> -------
>
> Stage 1: (MVP) Propose to add minimal runtime controls and stats (as outlined
> below). There were no disagreements on the call about this feature set being
> either too little or too big for the initial submission.
>
> Stage 2: Focus on decreasing memory wastage. Plan A will attempt to do this
> automatically within the kernel (I highlighted some ideas in the slide pack
> which we didn't get time to cover).
> Plan B is to add more fine-grained controls
> to fine tune things at memcg/process/vma level (TBD). I'm not covering this
> stage in this email.
>
>
> Naming
> ------
>
> We may add large folio support to shmem in future, which may need some separate
> controls (TBD). As a result, consensus was to have generic name "large folio",
> which is specialized for anon memory. Then in future it could also be
> specialized for shmem.
>
> I'm going to reflect this in the kernel naming by changing LARGE_ANON_FOLIO to
> ANON_LARGE_FOLIO, that way it makes "LARGE_FOLIO" grepable.
>
> I'm also reflecting this in the sysfs controls. I'll create a directory
> '/sys/kernel/mm/large_folio' as the root. Within that there are 2 main options:
>
>   - Put shared controls directly in this directory. Add a sub-directory 'anon' for
>     anon-specific controls (and in future 'shmem'...)
>   - Put all controls in the root directory and prefix the filename for
>     anon-specific controls with 'anon' (e.g. anon_enabled).
>
> Given I don't think there will be many anon-specific controls (1 for now), and
> THP already uses the latter scheme, I'm proposing to go with the latter.
>
>
> Controls
> --------
>
> Modified proposal, after discussion yesterday:
>
>   - boot_param: anon_large_folio
>     - =always|never|madvise
>     - sets boot-up default for large_folio/anon_enabled
>   - sysfs: /sys/kernel/mm/large_folio/anon_enabled
>     - =always|never|madvise
>   - sysfs: /sys/kernel/mm/large_folio/defrag
>     - =always|defer|defer+madvise|madvise|never
>     - Anticipate would be shared between anon and shmem if shmem added
>       - this is already true for THP
>     - Kirill suggested to drop and hardcode to "never" (GFP_TRANSHUGE_LIGHT)
>       - Yu previously commented GFP_TRANSHUGE_LIGHT isn't always ideal [1]
>       - So current series is hooking THP's defrag setting
>       - Given we want to separate THP and LAF, I'm proposing to keep it
>   - debugfs: /sys/kernel/debug/mm/large_folio/anon_orders
>     - Comma-separated, descending list of orders to try
>     - Default: arch_wants_pte_order(),PAGE_ALLOC_COSTLY_ORDER
>     - 0 always implicitly appended to end
>     - Max allowed is PMD_ORDER-1
>     - intended for developers to experiment
>     - debugfs means we can change/remove it or promote it to sysfs later
>   - MADV_NOHUGEPAGE is honored; LAF disabled for these VMAs
>     - Required for correctness of existing use cases (live migration post copy)
>   - New MADV_LARGEFOLIO madvise opcode
>     - Like MADV_HUGEPAGE but for large folio
>
> Optional:
>
> DavidR suggested adding ability to set a VMA-specific LAF order, using
> process_madvise():
>   - Optionally accept LAF order through flags param of
>     process_madvise(MADV_LARGEFOLIO)
>   - When no LAF order passed, (or called with madvise()) use global LAF order
>
> Personally, I would prefer to avoid vma-specific laf order for an initial
> submission and instead defer the addition until clear need is identified.
> Thoughts?
>
>
> Stats
> -----
>
> meminfo:AnonHugePages, smaps:AnonHugePages and memory.stat:anon_thp will
> continue to account THP only.
>
> I plan to add meminfo:AnonLargeFolio, smaps:AnonLargeFolio and
> memory.stat:anon_large_folio to account LAFs.
>
> Do I need to add counters to vmstat also? (e.g. large_folio_fault_alloc,
> large_folio_fault_fallback, etc) - would need to think about which counters and
> what they mean if so.
>
>
> Thanks,
> Ryan
>
>
> [1] https://lore.kernel.org/linux-mm/CAOUHufYWtsAU4PvKpVhzJUeQb9cd+BifY9KzgceBXHp2F2dDRg@mail.gmail.com/