From: "Zach O'Keefe"
Date: Tue, 7 Jun 2022 17:27:23 -0700
Subject: Re: [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
To: Yang Shi
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
 Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park,
 Song Liu, Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
 Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
 Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
 "James E.J. Bottomley", Jens Axboe, "Kirill A. Shutemov", Matt Turner,
 Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov,
 Thomas Bogendoerfer
References: <20220604004004.954674-1-zokeefe@google.com> <20220604004004.954674-3-zokeefe@google.com>
Content-Type: text/plain; charset="UTF-8"

On Tue, Jun 7, 2022 at 2:27 PM Yang Shi wrote:
>
> On Tue, Jun 7, 2022 at 12:33 PM Zach O'Keefe wrote:
> >
> > On Tue, Jun 7, 2022 at 9:01 AM Zach O'Keefe wrote:
> > >
> > > On Mon, Jun 6, 2022 at 1:46 PM Yang Shi wrote:
> > > >
> > > > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe wrote:
> > > > >
> > > > > When scanning an anon pmd to see if it's eligible for collapse, return
> > > > > SCAN_PMD_MAPPED if the pmd already maps a THP. Note that
> > > > > SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> > > > > file-collapse path, since the latter might identify pte-mapped compound
> > > > > pages. This is required by MADV_COLLAPSE which necessarily needs to
> > > > > know what hugepage-aligned/sized regions are already pmd-mapped.
> > > > >
> > > > > Signed-off-by: Zach O'Keefe
> > > > > ---
> > > > >  include/trace/events/huge_memory.h |  1 +
> > > > >  mm/internal.h                      |  1 +
> > > > >  mm/khugepaged.c                    | 32 ++++++++++++++++++++++++++----
> > > > >  mm/rmap.c                          | 15 ++++++++++++--
> > > > >  4 files changed, 43 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > > index d651f3437367..55392bf30a03 100644
> > > > > --- a/include/trace/events/huge_memory.h
> > > > > +++ b/include/trace/events/huge_memory.h
> > > > > @@ -11,6 +11,7 @@
> > > > >         EM( SCAN_FAIL,                  "failed")                       \
> > > > >         EM( SCAN_SUCCEED,               "succeeded")                    \
> > > > >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > > > > +       EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> > > > >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> > > > >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> > > > >         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> > > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > > index 6e14749ad1e5..f768c7fae668 100644
> > > > > --- a/mm/internal.h
> > > > > +++ b/mm/internal.h
> > > > > @@ -188,6 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
> > > > >  /*
> > > > >   * in mm/rmap.c:
> > > > >   */
> > > > > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
> > > > >  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> > > > >
> > > > >  /*
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index cc3d6fb446d5..7a914ca19e96 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -28,6 +28,7 @@ enum scan_result {
> > > > >         SCAN_FAIL,
> > > > >         SCAN_SUCCEED,
> > > > >         SCAN_PMD_NULL,
> > > > > +       SCAN_PMD_MAPPED,
> > > > >         SCAN_EXCEED_NONE_PTE,
> > > > >         SCAN_EXCEED_SWAP_PTE,
> > > > >         SCAN_EXCEED_SHARED_PTE,
> > > > > @@ -901,6 +902,31 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > >         return 0;
> > > > > }
> > > > >
> > > > > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > > > +                                  unsigned long address,
> > > > > +                                  pmd_t **pmd)
> > > > > +{
> > > > > +       pmd_t pmde;
> > > > > +
> > > > > +       *pmd = mm_find_pmd_raw(mm, address);
> > > > > +       if (!*pmd)
> > > > > +               return SCAN_PMD_NULL;
> > > > > +
> > > > > +       pmde = pmd_read_atomic(*pmd);
> > > > > +
> > > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > > > +       barrier();
> > > > > +#endif
> > > > > +       if (!pmd_present(pmde))
> > > > > +               return SCAN_PMD_NULL;
> > > > > +       if (pmd_trans_huge(pmde))
> > > > > +               return SCAN_PMD_MAPPED;
> > > > > +       if (pmd_bad(pmde))
> > > > > +               return SCAN_FAIL;
> > > >
> > > > khugepaged doesn't handle pmd_bad before, IIRC it may just return
> > > > SCAN_SUCCEED if everything else is good? It is fine to add it, but it
> > > > may be better to return SCAN_PMD_NULL?
> > >
> > > Correct, pmd_bad() wasn't handled before. I actually don't know how a
> > > bad pmd might arise in the wild (would love to actually know this),
> > > but I don't see the check hurting (might be overly conservative
> > > though). Conversely, I'm not sure where things go astray currently if
> > > the pmd is bad. Guess it depends in what way the flags are mutated.
> > > Returning SCAN_PMD_NULL SGTM.
> > > > >
> > > > > +       return SCAN_SUCCEED;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Bring missing pages in from swap, to complete THP collapse.
> > > > >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > > > > @@ -1146,11 +1172,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > > > >
> > > > >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > > >
> > > > > -       pmd = mm_find_pmd(mm, address);
> > > > > -       if (!pmd) {
> > > > > -               result = SCAN_PMD_NULL;
> > > > > +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > > +       if (result != SCAN_SUCCEED)
> > > >
> > > > There are a couple of other callsites for mm_find_pmd(), you may need
> > > > to change all of them to find_pmd_or_thp_or_none() for MADV_COLLAPSE
> > > > since khugepaged may collapse the area before MADV_COLLAPSE
> > > > reacquiring mmap_lock IIUC, and MADV_COLLAPSE does care about this
> > > > case. It is fine w/o MADV_COLLAPSE since khugepaged doesn't care if
> > > > it is PMD mapped or not.
> > >
> > > Ya, I was just questioning the same thing after responding above - at
> > > least w.r.t. whether the pmd_bad() also needs to be in these callsites
> > > (the check for pmd mapping, as you mention, I think is definitely
> > > necessary). Thanks for catching this!
> > >
> > > > So it may be better to move this patch right before MADV_COLLAPSE is introduced?
> > >
> > > I think this should be ok - I'll give it a try at least.
> > >
> > > Again, thank you for taking the time to thoroughly review this.
> > >
> > > Best,
> > > Zach
> > >
> > > > >                 goto out;
> > > > > -       }
> > > > >
> > > > >         memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
> > > > >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index 04fac1af870b..c9979c6ad7a1 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -767,13 +767,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> > > > >         return vma_address(page, vma);
> > > > >  }
> > > > >
> > > > > -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > > > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
> > > >
> > > > May be better to have some notes for mm_find_pmd_raw() and mm_find_pmd().
> >
> > Agreed. Looking over this code again, there are only 3 users of mm_find_pmd():
> >
> > 1) khugepaged
> > 2) ksm (replace_page())
> > 3) split_huge_pmd_address()
> >
> > Once khugepaged codepaths care about THP-pmds, ksm is the only
> > remaining user that really wants a pte-mapping pmd.
> >
> > I've gone and consolidated the open-coded walk in
> > split_huge_pmd_address() to use mm_find_pmd_raw().
> >
> > I've also done a name switch:
> >
> > mm_find_pmd() -> mm_find_pte_pmd()
> > mm_find_pmd_raw() -> mm_find_pmd()
>
> If ksm is the only user of *current* mm_find_pmd(), I think you should
> be able to open code it w/o introducing mm_find_pte_pmd() and revert
> mm_find_pmd() to its *old* behavior.

SGTM. Tried it out and it looks fine. Thanks for the suggestion.

> >
> > This basically reverts mm_find_pmd() to its pre commit f72e7dcdd252
> > ("mm: let mm_find_pmd fix buggy race with THP fault")
> > behavior, and special cases (what will be, after MADV_COLLAPSE file
> > support) the only remaining callsite which *doesn't* care about
> > THP-pmds (ksm). The naming here is a little more meaningful than
> > "*raw", and IMHO more readable.
> >
> > > > >  {
> > > > >         pgd_t *pgd;
> > > > >         p4d_t *p4d;
> > > > >         pud_t *pud;
> > > > >         pmd_t *pmd = NULL;
> > > > > -       pmd_t pmde;
> > > > >
> > > > >         pgd = pgd_offset(mm, address);
> > > > >         if (!pgd_present(*pgd))
> > > > >                 goto out;
> > > > > @@ -788,6 +787,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > > >                 goto out;
> > > > >
> > > > >         pmd = pmd_offset(pud, address);
> > > > > +out:
> > > > > +       return pmd;
> > > > > +}
> > > > > +
> > > > > +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > > > +{
> > > > > +       pmd_t pmde;
> > > > > +       pmd_t *pmd;
> > > > > +
> > > > > +       pmd = mm_find_pmd_raw(mm, address);
> > > > > +       if (!pmd)
> > > > > +               goto out;
> > > > >  /*
> > > > >   * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> > > > >   * without holding anon_vma lock for write. So when looking for a
> > > > > --
> > > > > 2.36.1.255.ge46751e96f-goog
> > > > >