From: James Houghton <jthoughton@google.com>
Date: Thu, 15 Dec 2022 12:49:18 -0500
Subject: Re: [RFC PATCH v2 24/47] hugetlb: update page_vma_mapped to do high-granularity walks
To: Mike Kravetz, Muchun Song, Peter Xu
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
 "Zach O'Keefe", Manish Mishra, Naoya Horiguchi,
 "Dr. David Alan Gilbert", "Matthew Wilcox (Oracle)", Vlastimil Babka,
 Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <20221021163703.3218176-25-jthoughton@google.com>
References: <20221021163703.3218176-1-jthoughton@google.com>
 <20221021163703.3218176-25-jthoughton@google.com>
Content-Type: text/plain; charset="UTF-8"

On Fri, Oct 21, 2022 at 12:37 PM James Houghton <jthoughton@google.com> wrote:
>
> This updates the HugeTLB logic to look a lot more like the PTE-mapped
> THP logic. When a user calls us in a loop, we will update pvmw->address
> to walk to each page table entry that could possibly map the hugepage
> containing pvmw->pfn.
>
> This makes use of the new pte_order so callers know what size PTE
> they're getting.
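
For context, the calling convention this gives us looks roughly like the
sketch below (a minimal illustration only, not code from the series; real
callers such as try_to_unmap_one() do much more per entry, and pte_order
is the pvmw field added earlier in this series):

static void example_rmap_walk(struct folio *folio,
			      struct vm_area_struct *vma,
			      unsigned long address)
{
	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

	while (page_vma_mapped_walk(&pvmw)) {
		/*
		 * pvmw.pte points at the entry mapping part of the folio;
		 * pvmw.pte_order says how much that entry covers (0 for a
		 * PAGE_SIZE leaf, larger for PMD-level leaves, and so on).
		 */
		unsigned long span = PAGE_SIZE << pvmw.pte_order;

		/* ... operate on [pvmw.address, pvmw.address + span) ... */
	}
}
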
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/rmap.h |  4 +++
>  mm/page_vma_mapped.c | 59 ++++++++++++++++++++++++++++++++++++--------
>  mm/rmap.c            | 48 +++++++++++++++++++++--------------
>  3 files changed, 83 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index e0557ede2951..d7d2d9f65a01 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -13,6 +13,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /*
>   * The anon_vma heads a list of private "related" vmas, to scan if
> @@ -409,6 +410,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>  		pte_unmap(pvmw->pte);
>  	if (pvmw->ptl)
>  		spin_unlock(pvmw->ptl);
> +	if (pvmw->pte && is_vm_hugetlb_page(pvmw->vma) &&
> +			hugetlb_hgm_enabled(pvmw->vma))
> +		hugetlb_vma_unlock_read(pvmw->vma);
>  }
>
>  bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 395ca4e21c56..1994b3f9a4c2 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -133,7 +133,8 @@ static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>   *
>   * Returns true if the page is mapped in the vma. @pvmw->pmd and @pvmw->pte point
>   * to relevant page table entries. @pvmw->ptl is locked. @pvmw->address is
> - * adjusted if needed (for PTE-mapped THPs).
> + * adjusted if needed (for PTE-mapped THPs and high-granularity--mapped HugeTLB
> + * pages).
>   *
>   * If @pvmw->pmd is set but @pvmw->pte is not, you have found PMD-mapped page
>   * (usually THP). For PTE-mapped THP, you should run page_vma_mapped_walk() in
> @@ -166,19 +167,57 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	if (unlikely(is_vm_hugetlb_page(vma))) {
>  		struct hstate *hstate = hstate_vma(vma);
>  		unsigned long size = huge_page_size(hstate);
> -		/* The only possible mapping was handled on last iteration */
> -		if (pvmw->pte)
> -			return not_found(pvmw);
> +		struct hugetlb_pte hpte;
> +		pte_t *pte;
> +		pte_t pteval;
> +
> +		end = (pvmw->address & huge_page_mask(hstate)) +
> +			huge_page_size(hstate);
>
>  		/* when pud is not present, pte will be NULL */
> -		pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
> -		if (!pvmw->pte)
> +		pte = huge_pte_offset(mm, pvmw->address, size);
> +		if (!pte)
>  			return false;
>
> -		pvmw->pte_order = huge_page_order(hstate);
> -		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
> -		if (!check_pte(pvmw))
> -			return not_found(pvmw);
> +		do {
> +			hugetlb_pte_populate(&hpte, pte, huge_page_shift(hstate),
> +					hpage_size_to_level(size));
> +
> +			/*
> +			 * Do a high granularity page table walk. The vma lock
> +			 * is grabbed to prevent the page table from being
> +			 * collapsed mid-walk. It is dropped in
> +			 * page_vma_mapped_walk_done().
> +			 */
> +			if (pvmw->pte) {
> +				if (pvmw->ptl)
> +					spin_unlock(pvmw->ptl);
> +				pvmw->ptl = NULL;
> +				pvmw->address += PAGE_SIZE << pvmw->pte_order;
> +				if (pvmw->address >= end)
> +					return not_found(pvmw);
> +			} else if (hugetlb_hgm_enabled(vma))
> +				/* Only grab the lock once. */
> +				hugetlb_vma_lock_read(vma);

I realize that I can't do this -- we're already holding the
i_mmap_rwsem, and we have to take the VMA lock first. It seems like
we're always holding it for writing in this case, so if I make
hugetlb_collapse take the i_mmap_rwsem for reading, this will be safe.

Peter, you looked at this recently [1] -- do you know if we're always
holding i_mmap_rwsem *for writing* here?
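
Concretely, that would look something like the following rough sketch
(hypothetical; hugetlb_collapse() is the collapse routine from this
series, and everything except the locking is elided):

static void hugetlb_collapse_locking_sketch(struct vm_area_struct *vma)
{
	struct address_space *mapping = vma->vm_file->f_mapping;

	/* Documented order: hugetlb vma_lock BEFORE i_mmap_rwsem. */
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_read(mapping);

	/*
	 * ... collapse the high-granularity page tables for this VMA ...
	 *
	 * With the collapse side taking i_mmap_rwsem shared, an rmap
	 * walker that already holds i_mmap_rwsem for writing cannot race
	 * with a collapse, so the walk would not need the VMA lock here.
	 */

	i_mmap_unlock_read(mapping);
	hugetlb_vma_unlock_write(vma);
}
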
[1] https://lore.kernel.org/linux-mm/20221209170100.973970-10-peterx@redhat.com/

Thanks!

- James

> +
> +retry_walk:
> +			hugetlb_hgm_walk(mm, vma, &hpte, pvmw->address,
> +					PAGE_SIZE, /*stop_at_none=*/true);
> +
> +			pvmw->pte = hpte.ptep;
> +			pvmw->pte_order = hpte.shift - PAGE_SHIFT;
> +			pvmw->ptl = hugetlb_pte_lock(mm, &hpte);
> +			pteval = huge_ptep_get(hpte.ptep);
> +			if (pte_present(pteval) && !hugetlb_pte_present_leaf(
> +						&hpte, pteval)) {
> +				/*
> +				 * Someone split from under us, so keep
> +				 * walking.
> +				 */
> +				spin_unlock(pvmw->ptl);
> +				goto retry_walk;
> +			}
> +		} while (!check_pte(pvmw));
>  		return true;
>  	}
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 527463c1e936..a8359584467e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1552,17 +1552,23 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			flush_cache_range(vma, range.start, range.end);
>
>  			/*
> -			 * To call huge_pmd_unshare, i_mmap_rwsem must be
> -			 * held in write mode. Caller needs to explicitly
> -			 * do this outside rmap routines.
> -			 *
> -			 * We also must hold hugetlb vma_lock in write mode.
> -			 * Lock order dictates acquiring vma_lock BEFORE
> -			 * i_mmap_rwsem. We can only try lock here and fail
> -			 * if unsuccessful.
> +			 * If HGM is enabled, we have already grabbed the VMA
> +			 * lock for reading, and we cannot safely release it.
> +			 * Because HGM-enabled VMAs have already unshared all
> +			 * PMDs, we can safely ignore PMD unsharing here.
>  			 */
> -			if (!anon) {
> +			if (!anon && !hugetlb_hgm_enabled(vma)) {
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +				/*
> +				 * To call huge_pmd_unshare, i_mmap_rwsem must
> +				 * be held in write mode. Caller needs to
> +				 * explicitly do this outside rmap routines.
> +				 *
> +				 * We also must hold hugetlb vma_lock in write
> +				 * mode. Lock order dictates acquiring vma_lock
> +				 * BEFORE i_mmap_rwsem. We can only try lock
> +				 * here and fail if unsuccessful.
> +				 */
>  				if (!hugetlb_vma_trylock_write(vma)) {
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
> @@ -1946,17 +1952,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			flush_cache_range(vma, range.start, range.end);
>
>  			/*
> -			 * To call huge_pmd_unshare, i_mmap_rwsem must be
> -			 * held in write mode. Caller needs to explicitly
> -			 * do this outside rmap routines.
> -			 *
> -			 * We also must hold hugetlb vma_lock in write mode.
> -			 * Lock order dictates acquiring vma_lock BEFORE
> -			 * i_mmap_rwsem. We can only try lock here and
> -			 * fail if unsuccessful.
> +			 * If HGM is enabled, we have already grabbed the VMA
> +			 * lock for reading, and we cannot safely release it.
> +			 * Because HGM-enabled VMAs have already unshared all
> +			 * PMDs, we can safely ignore PMD unsharing here.
>  			 */
> -			if (!anon) {
> +			if (!anon && !hugetlb_hgm_enabled(vma)) {
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +				/*
> +				 * To call huge_pmd_unshare, i_mmap_rwsem must
> +				 * be held in write mode. Caller needs to
> +				 * explicitly do this outside rmap routines.
> +				 *
> +				 * We also must hold hugetlb vma_lock in write
> +				 * mode. Lock order dictates acquiring vma_lock
> +				 * BEFORE i_mmap_rwsem. We can only try lock
> +				 * here and fail if unsuccessful.
> +				 */
>  				if (!hugetlb_vma_trylock_write(vma)) {
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
> --
> 2.38.0.135.g90850a2211-goog
>