From: Yang Shi <shy828301@gmail.com>
Date: Wed, 24 Apr 2024 09:28:13 -0700
Subject: Re: [PATCH v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org, Andrew Morton, Jonathan Corbet, "Kirill A. Shutemov", Zi Yan, Yang Shi, John Hubbard, Ryan Roberts
In-Reply-To: <20240424122630.495788-1-david@redhat.com>
On Wed, Apr 24, 2024 at 5:26 AM David Hildenbrand wrote:
>
> We want to limit the use of page_mapcount() to places where absolutely
> required, to prepare for kernel configs where we won't keep track of
> per-page mapcounts in large folios.
>
> khugepaged is one of the remaining "more challenging" page_mapcount()
> users, but we might be able to move away from page_mapcount() without
> resulting in a significant behavior change that would warrant
> special-casing based on kernel configs.
>
> In 2020, we first added support to khugepaged for collapsing COW-shared
> pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared
> across fork"), followed by support for collapsing PTE-mapped THP in commit
> 5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages")
> and limiting the memory waste via the "page_count() > 1" check in commit
> 71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable").
>
> As a default, khugepaged will allow up to half of the PTEs to map shared
> pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged
> setting.
>
> khugepaged does currently not care about swapcache page references, and
> does not check under folio lock: so in some corner cases the "shared vs.
> exclusive" detection might be a bit off, making us detect "exclusive" when
> it's actually "shared".
>
> Most of our anonymous folios in the system are usually exclusive. We
> frequently see sharing of anonymous folios for a short period of time,
> after which our short-lived subprocesses either quit or exec().
>
> There are some famous examples, though, where child processes exist for a
> long time, and where memory is COW-shared with a lot of processes
> (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for
> reducing the memory footprint. We don't want to suddenly change the
> behavior to result in a significant increase in memory waste.
>
> Interestingly, khugepaged will only collapse an anonymous THP if at least
> one PTE is writable. After fork(), that means that something (usually a
> page fault) populated at least a single exclusive anonymous THP in that PMD
> range.
>
> So ... what happens when we switch to "is this folio mapped shared"
> instead of "is this page mapped shared" by using
> folio_likely_mapped_shared()?
>
> For "not-COW-shared" folios, small folios and for THPs (large
> folios) that are completely mapped into at least one process,
> switching to folio_likely_mapped_shared() will not result in a change.
>
> We'll only see a change for COW-shared PTE-mapped THPs that are
> partially mapped into all involved processes.
>
> There are two cases to consider:
>
> (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP
>
>    If the folio is detected as exclusive, and it actually is exclusive,
>    there is no change: page_mapcount() == 1.
>    This is the common case
>    without fork() or with short-lived child processes.
>
>    folio_likely_mapped_shared() might currently still detect a folio as
>    exclusive although it is shared (false negatives): if the first page is
>    not mapped multiple times and if the average per-page mapcount is smaller
>    than 1, implying that (1) the folio is partially mapped and (2) if we are
>    responsible for many mapcounts by mapping many pages others can't
>    ("mostly exclusive") (3) if we are not responsible for many mapcounts by
>    mapping little pages ("mostly shared") it won't make a big impact on the
>    end result.
>
>    So while we might now detect a page as "exclusive" although it isn't,
>    it's not expected to make a big difference in common cases.
>
> (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP
>
>    folio_likely_mapped_shared() will never detect a large anonymous folio
>    as shared although it is exclusive: there are no false positives.
>
>    If we detect a THP as shared, at least one page of the THP is mapped by
>    another process. It could well be that some pages are actually exclusive.
>    For example, our child processes could have unmapped/COW'ed some pages
>    such that they would now be exclusive to our process, which we now
>    would treat as still-shared.

IIUC, case (A) may under-count shared PTEs, while on the opposite side
case (B) may over-count them, right? So the impact may depend on the
value of the max_ptes_shared tunable. It may have a more noticeable
impact with a very conservative setting (i.e. max_ptes_shared == 1) if
we under-count, or with a very aggressive setting (i.e.
max_ptes_shared == 510) if we over-count. So I agree it should not
matter much for common cases. AFAIK, the use case for an aggressive
setting should be very rare, but a conservative setting may be more
usual, so improving the under-count for conservative settings may be
worth it.
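To make the under-/over-count concrete, here is a hypothetical userspace model (plain C, not kernel code; scan_ok_old()/scan_ok_new() and the array-based mapcounts are made-up illustrations) of how the max_ptes_shared decision shifts between the two checks:

```c
#include <stdbool.h>

#define HPAGE_PMD_NR 512	/* PTEs per PMD-sized THP on x86-64 */

/* Old behavior: a PTE counts as shared when its page's mapcount > 1. */
static bool scan_ok_old(const int page_mapcount[HPAGE_PMD_NR],
			int max_ptes_shared)
{
	int shared = 0;

	for (int i = 0; i < HPAGE_PMD_NR; i++)
		if (page_mapcount[i] > 1 && ++shared > max_ptes_shared)
			return false;
	return true;
}

/*
 * New behavior: every PTE of a folio gets the same per-folio answer,
 * so a folio detected as shared contributes all of its mapped PTEs
 * to the "shared" count at once.
 */
static bool scan_ok_new(bool folio_mapped_shared, int nr_mapped_ptes,
			int max_ptes_shared)
{
	int shared = folio_mapped_shared ? nr_mapped_ptes : 0;

	return shared <= max_ptes_shared;
}
```

With the default of 256 in /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared, a THP whose child still maps only 8 of 512 pages passes the old check (8 shared PTEs) but fails the new one (all 512 PTEs treated as shared): the case (B) over-count. A case (A) false negative flips the decision the other way for conservative settings.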
>
> Examples:
> (1) Parent maps all pages of a THP, child maps some pages. We detect
>     all pages in the parent as shared although some are actually
>     exclusive.
> (2) Parent maps all but some page of a THP, child maps the remainder.
>     We detect all pages of the THP that the parent maps as shared
>     although they are all exclusive.
>
> In (1) we wouldn't collapse a THP right now already: no PTE
> is writable, because a write fault would have resulted in COW of a
> single page and the parent would no longer map all pages of that THP.
>
> For (2) we would have collapsed a THP in the parent so far, now we
> wouldn't as long as the child process is still alive: unless the child
> process unmaps the remaining THP pages or we decide to split that THP.
>
> Possibly, the child COW'ed many pages, meaning that it's likely that
> we can populate a THP for our child first, and then for our parent.
>
> For (2), we are making really bad use of the THP in the first
> place (not even mapped completely in at least one process). If the
> THP were only partially mapped everywhere, it would be on the deferred
> split queue where we would split it lazily later.
>
> For short-running child processes, we don't particularly care. For
> long-running processes, the expectation is that such scenarios are
> rather rare: further, a THP might be best placed if most data in the
> PMD range is actually written, implying that we'll have to COW more
> pages first before khugepaged would collapse it.
>
> To summarize, in the common case, this change is not expected to matter
> much. The more common application of khugepaged operates on
> exclusive pages, either before fork() or after a child quit.
>
> Can we improve (A)? Yes, if we implement more precise tracking of "mapped
> shared" vs. "mapped exclusively", we could get rid of the false
> negatives completely.
>
> Can we improve (B)?
> We could count how many pages of a large folio we map
> inside the current page table and detect that we are responsible for most
> of the folio mapcount and conclude "as good as exclusive", which might help
> in some cases. ... but likely, some other mechanism should detect that
> the THP is not a good use in the scenario (not even mapped completely in
> a single process) and try splitting that folio lazily etc.
>
> We'll move the folio_test_anon() check before our "shared" check, so we
> might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order
> of checks now matches the one in __collapse_huge_page_isolate(). Extend
> documentation.
>
> Cc: Andrew Morton
> Cc: Jonathan Corbet
> Cc: Kirill A. Shutemov
> Cc: Zi Yan
> Cc: Yang Shi
> Cc: John Hubbard
> Cc: Ryan Roberts
> Signed-off-by: David Hildenbrand
> ---
>
> How much time can one spend writing a patch description? Unbelievable. But
> it was likely time well spent to have a clear picture of the impact.
>
> This really needs the folio_likely_mapped_shared() optimization [1] that
> resides in mm-unstable, I think, to reduce "false negatives".
>
> The khugepaged MM selftests keep working as expected, including:
>
>   Run test: collapse_max_ptes_shared (khugepaged:anon)
>   Allocate huge page... OK
>   Share huge page over fork()... OK
>   Trigger CoW on page 255 of 512... OK
>   Maybe collapse with max_ptes_shared exceeded.... OK
>   Trigger CoW on page 256 of 512... OK
>   Collapse with max_ptes_shared PTEs shared.... OK
>   Check if parent still has huge page... OK
>
> Where we check that collapsing in the parent behaves as expected after
> COWing a lot of pages in the parent: a sane scenario that is essentially
> unchanged and which does not depend on any action in the child process
> (compared to the cases discussed in (B) above).
>
> [1] https://lkml.kernel.org/r/20240409192301.907377-6-david@redhat.com
>
> ---
>  Documentation/admin-guide/mm/transhuge.rst |  3 ++-
>  mm/khugepaged.c                            | 22 +++++++++++++++-------
>  2 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index f82300b9193fe..076443cc10a6c 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -278,7 +278,8 @@ collapsed, resulting fewer pages being collapsed into
>  THPs, and lower memory access performance.
>
>  ``max_ptes_shared`` specifies how many pages can be shared across multiple
> -processes. Exceeding the number would block the collapse::
> +processes. khugepaged might treat pages of THPs as shared if any page of
> +that THP is shared. Exceeding the number would block the collapse::
>
>          /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2f73d2aa9ae84..cf518fc440982 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -583,7 +583,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 folio = page_folio(page);
>                 VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> -               if (page_mapcount(page) > 1) {
> +               /* See hpage_collapse_scan_pmd(). */
> +               if (folio_likely_mapped_shared(folio)) {
>                         ++shared;
>                         if (cc->is_khugepaged &&
>                             shared > khugepaged_max_ptes_shared) {
> @@ -1317,8 +1318,20 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                         result = SCAN_PAGE_NULL;
>                         goto out_unmap;
>                 }
> +               folio = page_folio(page);
>
> -               if (page_mapcount(page) > 1) {
> +               if (!folio_test_anon(folio)) {
> +                       result = SCAN_PAGE_ANON;
> +                       goto out_unmap;
> +               }
> +
> +               /*
> +                * We treat a single page as shared if any part of the THP
> +                * is shared. "False negatives" from
> +                * folio_likely_mapped_shared() are not expected to matter
> +                * much in practice.
> +                */
> +               if (folio_likely_mapped_shared(folio)) {
>                         ++shared;
>                         if (cc->is_khugepaged &&
>                             shared > khugepaged_max_ptes_shared) {
> @@ -1328,7 +1341,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                         }
>                 }
>
> -               folio = page_folio(page);
>                 /*
>                  * Record which node the original page is from and save this
>                  * information to cc->node_load[].
> @@ -1349,10 +1361,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                         result = SCAN_PAGE_LOCK;
>                         goto out_unmap;
>                 }
> -               if (!folio_test_anon(folio)) {
> -                       result = SCAN_PAGE_ANON;
> -                       goto out_unmap;
> -               }
>
>                 /*
>                  * Check if the page has any GUP (or other external) pins.
> --
> 2.44.0
>
>
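As a footnote on the false negatives discussed in (A): below is a rough userspace model of the per-folio guess as described in this mail. It is NOT the kernel implementation (the real folio_likely_mapped_shared() in mm-unstable also handles small folios, entire mappings, hugetlb, etc.); likely_mapped_shared() and its parameters are made up for illustration.

```c
#include <stdbool.h>

/*
 * Rough userspace model of the guess for a large, PTE-mapped folio:
 * guess "mapped shared" when there are more mappings than pages (then
 * some page must be mapped at least twice), or when the first page
 * alone is mapped more than once. Otherwise guess "mapped exclusively",
 * which can be a false negative for partially mapped, shared folios.
 */
static bool likely_mapped_shared(int total_mapcount, int nr_pages,
				 int first_page_mapcount)
{
	if (total_mapcount > nr_pages)
		return true;
	return first_page_mapcount > 1;
}
```

For example, a 512-page THP where the parent maps 256 pages and the child maps a different 4 pages has a total mapcount of 260 < 512 and a first-page mapcount of 1, so it is guessed exclusive although it is shared: exactly the case (A) false negative.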