Date: Tue, 12 Jul 2022 09:50:48 -0700
From: Zach O'Keefe
To: Yang Shi
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
 Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park,
 Song Liu, Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
 Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
 Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
 "James E.J. Bottomley", Jens Axboe, "Kirill A. Shutemov", Matt Turner,
 Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov,
 Thomas Bogendoerfer
Subject: Re: [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED
 when scan_pmd() finds hugepage
References: <20220706235936.2197195-1-zokeefe@google.com>
 <20220706235936.2197195-9-zokeefe@google.com>

On Jul 11 14:03, Yang Shi wrote:
> On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe wrote:
> >
> > When scanning an anon pmd to see if it's eligible for collapse, return
> > SCAN_PMD_MAPPED if the pmd already maps a hugepage. Note that
> > SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> > file-collapse path, since the latter might identify pte-mapped compound
> > pages. This is required by MADV_COLLAPSE, which necessarily needs to
> > know what hugepage-aligned/sized regions are already pmd-mapped.
> >
> > In order to determine if a pmd already maps a hugepage, refactor
> > mm_find_pmd():
> >
> > Return mm_find_pmd() to its pre-commit f72e7dcdd252 ("mm: let mm_find_pmd
> > fix buggy race with THP fault") behavior. ksm was the only caller that
> > explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
> > there (pmd_present() and pmd_trans_huge() checks).
> >
> > Undo the change in commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy
> > race with THP fault") that open-coded split_huge_pmd_address()'s pmd
> > lookup, and use mm_find_pmd() instead.
> >
> > Signed-off-by: Zach O'Keefe
>
> Reviewed-by: Yang Shi
>

Thanks for taking the time to review!
Zach
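(Aside, for anyone tracking the series: the new SCAN_PMD_MAPPED result
ultimately serves userspace MADV_COLLAPSE callers, so here's a minimal,
illustrative sketch of that side. The MADV_COLLAPSE fallback define and
the 2MiB hugepage size are my assumptions here, not part of this patch.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from this series' uapi headers */
#endif

#define HPAGE_SIZE (2UL << 20)	/* assumes 2MiB THPs */

int main(void)
{
	char *buf;

	/* hugepage-aligned, hugepage-sized anonymous region */
	if (posix_memalign((void **)&buf, HPAGE_SIZE, HPAGE_SIZE))
		return 1;

	/* fault in every page so there is something to collapse */
	for (size_t i = 0; i < HPAGE_SIZE; i += 4096)
		buf[i] = 1;

	/*
	 * With this patch, a region that is already pmd-mapped is
	 * reported as SCAN_PMD_MAPPED rather than treated as a scan
	 * failure, which is what lets MADV_COLLAPSE handle it cleanly.
	 */
	if (madvise(buf, HPAGE_SIZE, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	return 0;
}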
> > ---
> >  include/trace/events/huge_memory.h |  1 +
> >  mm/huge_memory.c                   | 18 +--------
> >  mm/internal.h                      |  2 +-
> >  mm/khugepaged.c                    | 60 ++++++++++++++++++++++++------
> >  mm/ksm.c                           | 10 +++++
> >  mm/rmap.c                          | 15 +++-----
> >  6 files changed, 67 insertions(+), 39 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index d651f3437367..55392bf30a03 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -11,6 +11,7 @@
> >  	EM( SCAN_FAIL,			"failed")		\
> >  	EM( SCAN_SUCCEED,		"succeeded")		\
> >  	EM( SCAN_PMD_NULL,		"pmd_null")		\
> > +	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")	\
> >  	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")	\
> >  	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")	\
> >  	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")	\
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 4fbe43dc1568..fb76db6c703e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2363,25 +2363,11 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
> >  		bool freeze, struct folio *folio)
> >  {
> > -	pgd_t *pgd;
> > -	p4d_t *p4d;
> > -	pud_t *pud;
> > -	pmd_t *pmd;
> > +	pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
> >
> > -	pgd = pgd_offset(vma->vm_mm, address);
> > -	if (!pgd_present(*pgd))
> > +	if (!pmd)
> >  		return;
> >
> > -	p4d = p4d_offset(pgd, address);
> > -	if (!p4d_present(*p4d))
> > -		return;
> > -
> > -	pud = pud_offset(p4d, address);
> > -	if (!pud_present(*pud))
> > -		return;
> > -
> > -	pmd = pmd_offset(pud, address);
> > -
> >  	__split_huge_pmd(vma, pmd, address, freeze, folio);
> >  }
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 6e14749ad1e5..ef8c23fb678f 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -188,7 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
> >  /*
> >   * in mm/rmap.c:
> >   */
> > -extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> > +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> >
> >  /*
> >   * in mm/page_alloc.c
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b0e20db3f805..c7a09cc9a0e8 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -28,6 +28,7 @@ enum scan_result {
> >  	SCAN_FAIL,
> >  	SCAN_SUCCEED,
> >  	SCAN_PMD_NULL,
> > +	SCAN_PMD_MAPPED,
> >  	SCAN_EXCEED_NONE_PTE,
> >  	SCAN_EXCEED_SWAP_PTE,
> >  	SCAN_EXCEED_SHARED_PTE,
> > @@ -871,6 +872,45 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >  	return SCAN_SUCCEED;
> >  }
> >
> > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > +				   unsigned long address,
> > +				   pmd_t **pmd)
> > +{
> > +	pmd_t pmde;
> > +
> > +	*pmd = mm_find_pmd(mm, address);
> > +	if (!*pmd)
> > +		return SCAN_PMD_NULL;
> > +
> > +	pmde = pmd_read_atomic(*pmd);
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > +	barrier();
> > +#endif
> > +	if (!pmd_present(pmde))
> > +		return SCAN_PMD_NULL;
> > +	if (pmd_trans_huge(pmde))
> > +		return SCAN_PMD_MAPPED;
> > +	if (pmd_bad(pmde))
> > +		return SCAN_PMD_NULL;
> > +	return SCAN_SUCCEED;
> > +}
> > +
> > +static int check_pmd_still_valid(struct mm_struct *mm,
> > +				 unsigned long address,
> > +				 pmd_t *pmd)
> > +{
> > +	pmd_t *new_pmd;
> > +	int result = find_pmd_or_thp_or_none(mm, address, &new_pmd);
> > +
> > +	if (result != SCAN_SUCCEED)
> > +		return result;
> > +	if (new_pmd != pmd)
> > +		return SCAN_FAIL;
> > +	return SCAN_SUCCEED;
> > +}
> > +
> >  /*
> >   * Bring missing pages in from swap, to complete THP collapse.
> >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > @@ -982,9 +1022,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  		goto out_nolock;
> >  	}
> >
> > -	pmd = mm_find_pmd(mm, address);
> > -	if (!pmd) {
> > -		result = SCAN_PMD_NULL;
> > +	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +	if (result != SCAN_SUCCEED) {
> >  		mmap_read_unlock(mm);
> >  		goto out_nolock;
> >  	}
> > @@ -1012,7 +1051,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  	if (result != SCAN_SUCCEED)
> >  		goto out_up_write;
> >  	/* check if the pmd is still valid */
> > -	if (mm_find_pmd(mm, address) != pmd)
> > +	result = check_pmd_still_valid(mm, address, pmd);
> > +	if (result != SCAN_SUCCEED)
> >  		goto out_up_write;
> >
> >  	anon_vma_lock_write(vma->anon_vma);
> > @@ -1115,11 +1155,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> >  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > -	pmd = mm_find_pmd(mm, address);
> > -	if (!pmd) {
> > -		result = SCAN_PMD_NULL;
> > +	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +	if (result != SCAN_SUCCEED)
> >  		goto out;
> > -	}
> >
> >  	memset(cc->node_load, 0, sizeof(cc->node_load));
> >  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > @@ -1373,8 +1411,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >  	if (!PageHead(hpage))
> >  		goto drop_hpage;
> >
> > -	pmd = mm_find_pmd(mm, haddr);
> > -	if (!pmd)
> > +	if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
> >  		goto drop_hpage;
> >
> >  	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > @@ -1492,8 +1529,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >  		if (vma->vm_end < addr + HPAGE_PMD_SIZE)
> >  			continue;
> >  		mm = vma->vm_mm;
> > -		pmd = mm_find_pmd(mm, addr);
> > -		if (!pmd)
> > +		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> >  			continue;
> >  		/*
> >  		 * We need exclusive mmap_lock to retract page table.
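(Another aside: a hypothetical caller -- not code from this series --
showing how the new three-way contract of find_pmd_or_thp_or_none() is
meant to be consumed:)

/* Hypothetical, for illustration only. */
static int try_collapse_at(struct mm_struct *mm, unsigned long addr)
{
	pmd_t *pmd;

	switch (find_pmd_or_thp_or_none(mm, addr, &pmd)) {
	case SCAN_SUCCEED:
		break;			/* pte-mapping pmd; keep going */
	case SCAN_PMD_MAPPED:
		return 0;		/* already backed by a hugepage */
	default:
		return -EINVAL;		/* SCAN_PMD_NULL: no usable pmd */
	}
	/* ... scan ptes under the lock and attempt the collapse ... */
	return 0;
}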
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 075123602bd0..3e0a0a42fa1f 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1136,6 +1136,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	pmd_t *pmd;
> > +	pmd_t pmde;
> >  	pte_t *ptep;
> >  	pte_t newpte;
> >  	spinlock_t *ptl;
> > @@ -1150,6 +1151,15 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  	pmd = mm_find_pmd(mm, addr);
> >  	if (!pmd)
> >  		goto out;
> > +	/*
> > +	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> > +	 * without holding anon_vma lock for write. So when looking for a
> > +	 * genuine pmde (in which to find pte), test present and !THP together.
> > +	 */
> > +	pmde = *pmd;
> > +	barrier();
> > +	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> > +		goto out;
> >
> >  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
> >  				addr + PAGE_SIZE);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index edc06c52bc82..af775855e58f 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -767,13 +767,17 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> >  	return vma_address(page, vma);
> >  }
> >
> > +/*
> > + * Returns the actual pmd_t* where we expect 'address' to be mapped from, or
> > + * NULL if it doesn't exist. No guarantees / checks on what the pmd_t*
> > + * represents.
> > + */
> >  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >  {
> >  	pgd_t *pgd;
> >  	p4d_t *p4d;
> >  	pud_t *pud;
> >  	pmd_t *pmd = NULL;
> > -	pmd_t pmde;
> >
> >  	pgd = pgd_offset(mm, address);
> >  	if (!pgd_present(*pgd))
> > @@ -788,15 +792,6 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >  		goto out;
> >
> >  	pmd = pmd_offset(pud, address);
> > -	/*
> > -	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> > -	 * without holding anon_vma lock for write. So when looking for a
> > -	 * genuine pmde (in which to find pte), test present and !THP together.
> > -	 */
> > -	pmde = *pmd;
> > -	barrier();
> > -	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> > -		pmd = NULL;
> >  out:
> >  	return pmd;
> >  }
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
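(One last illustrative note: with the relaxed mm_find_pmd() contract,
any caller that needs a pte-mapping pmd must now do the present/!THP
validation itself, exactly as the ksm.c hunk above does. A minimal
sketch of that caller-side pattern -- the helper name is made up:)

/* Hypothetical helper showing the caller-side validation pattern. */
static pte_t *map_and_lock_pte(struct mm_struct *mm, unsigned long addr,
			       spinlock_t **ptlp)
{
	pmd_t *pmd = mm_find_pmd(mm, addr);
	pmd_t pmde;

	if (!pmd)
		return NULL;

	pmde = *pmd;
	barrier();	/* vs. THP's pmdp_huge_clear_flush()/set_pmd_at() */
	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
		return NULL;	/* not a genuine pte-mapping pmd */

	return pte_offset_map_lock(mm, pmd, addr, ptlp);
}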