From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 6 Dec 2022 12:43:12 -0500
From: Peter Xu
To: Mike Kravetz
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton,
	Jann Horn, Andrew Morton, Andrea Arcangeli, Rik van Riel,
	Nadav Amit, Miaohe Lin, Muchun Song, David Hildenbrand
Subject: Re: [PATCH 09/10] mm/hugetlb: Make page_vma_mapped_walk() safe to pmd unshare
References: <20221129193526.3588187-1-peterx@redhat.com>
	<20221129193526.3588187-10-peterx@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="e5yzZCdrBMEPRle2"
Content-Disposition: inline

--e5yzZCdrBMEPRle2
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Tue, Dec 06, 2022 at 12:39:53PM -0500, Peter Xu wrote:
> On Tue, Dec 06, 2022 at 09:10:00AM -0800, Mike Kravetz wrote:
> > On 12/05/22 15:52, Mike Kravetz wrote:
> > > On 11/29/22 14:35, Peter Xu wrote:
> > > > Since page_vma_mapped_walk() walks the pgtable, it needs the vma lock
> > > > to make sure the pgtable page will not be freed concurrently.
> > > >
> > > > Signed-off-by: Peter Xu
> > > > ---
> > > >  include/linux/rmap.h | 4 ++++
> > > >  mm/page_vma_mapped.c | 5 ++++-
> > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > > index bd3504d11b15..a50d18bb86aa 100644
> > > > --- a/include/linux/rmap.h
> > > > +++ b/include/linux/rmap.h
> > > > @@ -13,6 +13,7 @@
> > > >  #include
> > > >  #include
> > > >  #include
> > > > +#include
> > > >
> > > >  /*
> > > >   * The anon_vma heads a list of private "related" vmas, to scan if
> > > > @@ -408,6 +409,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> > > >  		pte_unmap(pvmw->pte);
> > > >  	if (pvmw->ptl)
> > > >  		spin_unlock(pvmw->ptl);
> > > > +	/* This needs to be after unlock of the spinlock */
> > > > +	if (is_vm_hugetlb_page(pvmw->vma))
> > > > +		hugetlb_vma_unlock_read(pvmw->vma);
> > > >  }
> > > >
> > > >  bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 93e13fc17d3c..f94ec78b54ff 100644
> > > > --- a/mm/page_vma_mapped.c
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -169,10 +169,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > >  		if (pvmw->pte)
> > > >  			return not_found(pvmw);
> > > >
> > > > +		hugetlb_vma_lock_read(vma);
> > > >  		/* when pud is not present, pte will be NULL */
> > > >  		pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
> > > > -		if (!pvmw->pte)
> > > > +		if (!pvmw->pte) {
> > > > +			hugetlb_vma_unlock_read(vma);
> > > >  			return false;
> > > > +		}
> > > >
> > > >  		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
> > > >  		if (!check_pte(pvmw))
> > >
> > > I think this is going to cause try_to_unmap() to always fail for hugetlb
> > > shared pages.  See try_to_unmap_one:
> > >
> > > 	while (page_vma_mapped_walk(&pvmw)) {
> > > 		...
> > > 		if (folio_test_hugetlb(folio)) {
> > > 			...
> > > 			/*
> > > 			 * To call huge_pmd_unshare, i_mmap_rwsem must be
> > > 			 * held in write mode.  Caller needs to explicitly
> > > 			 * do this outside rmap routines.
> > > 			 *
> > > 			 * We also must hold hugetlb vma_lock in write mode.
> > > 			 * Lock order dictates acquiring vma_lock BEFORE
> > > 			 * i_mmap_rwsem.  We can only try lock here and fail
> > > 			 * if unsuccessful.
> > > 			 */
> > > 			if (!anon) {
> > > 				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> > > 				if (!hugetlb_vma_trylock_write(vma)) {
> > > 					page_vma_mapped_walk_done(&pvmw);
> > > 					ret = false;
> > > 				}
> > >
> > >
> > > Can not think of a great solution right now.
> >
> > Thought of this last night ...
> >
> > Perhaps we do not need vma_lock in this code path (not sure about all
> > page_vma_mapped_walk calls).  Why?  We already hold i_mmap_rwsem.
>
> Exactly.  The only concern is when it's not in a rmap.
>
> I'm actually preparing something that adds a new flag to PVMW, like:
>
> #define PVMW_HUGETLB_NEEDS_LOCK	(1 << 2)
>
> But maybe we don't need that at all, since I had a closer look the only
> outliers of not using a rmap is:
>
> __replace_page
> write_protect_page
>
> I'm pretty sure ksm doesn't have hugetlb involved, then the other one is
> uprobe (uprobe_write_opcode).  I think it's the same.  If it's true, we can
> simply drop this patch.  Then we also have hugetlb_walk and the lock checks
> there guarantee that we're safe anyways.
>
> Potentially we can document this fact, which I also attached a comment
> patch just for it to be appended to the end of the patchset.
>
> Mike, let me know what do you think.
>
> Andrew, if this patch to be dropped then the last patch may not cleanly
> apply.  Let me know if you want a full repost of the things.

The document patch that can be appended to the end of this series attached.
I referenced hugetlb_walk() so it needs to be the last patch.

-- 
Peter Xu

--e5yzZCdrBMEPRle2
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment;
	filename="0001-mm-hugetlb-Document-why-page_vma_mapped_walk-is-safe.patch"

>From 754c2180804e9e86accf131573cbd956b8c62829 Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Tue, 6 Dec 2022 12:36:04 -0500
Subject: [PATCH] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk
Content-type: text/plain

Taking vma lock here is not needed for now because all potential hugetlb
walkers here should have i_mmap_rwsem held.  Document the fact.

Signed-off-by: Peter Xu
---
 mm/page_vma_mapped.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e97b2e23bd28..2e59a0419d22 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		/* The only possible mapping was handled on last iteration */
 		if (pvmw->pte)
 			return not_found(pvmw);
-
-		/* when pud is not present, pte will be NULL */
+		/*
+		 * NOTE: we don't need explicit lock here to walk the
+		 * hugetlb pgtable because either (1) potential callers of
+		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
+		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
+		 * When one day this rule breaks, one will get a warning
+		 * in hugetlb_walk(), and then we'll figure out what to do.
+		 */
 		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
 		if (!pvmw->pte)
 			return false;
-- 
2.37.3


--e5yzZCdrBMEPRle2--