Date: Fri, 4 Nov 2022 11:02:46 -0400
From: Peter Xu
To: Mike Kravetz
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
    James Houghton, Miaohe Lin, David Hildenbrand, Muchun Song,
    Andrea Arcangeli, Nadav Amit, Rik van Riel
Subject: Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
References: <20221030212929.335473-1-peterx@redhat.com>

Hi, Mike,

On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> On 10/30/22 17:29, Peter Xu wrote:
> > Resolution
> > ==========
> >
> > What this patch proposed is, besides using the vma lock, we can also use
> > RCU to protect the pgtable page from being freed from under us when
> > huge_pte_offset() is used.  The idea is kind of similar to RCU fast-gup.
> > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > including another mm's.
> >
> > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > always safe to access and de-reference, along with the pgtable lock that
> > was bound to the pgtable page.
> >
> > Patch Layout
> > ============
> >
> > Patch 1 is a trivial cleanup that I noticed when working on this.  Please
> > shoot if anyone think I should just post it separately, or hopefully I can
> > still just carry it over.
> >
> > Patch 2 is the gut of the patchset, describing how we should use the helper
> > huge_pte_offset() correctly.  Only a comment patch but should be the most
> > important one, as the follow up patches are just trying to follow the rule
> > it setup here.
> >
> > The rest patches resolve all the call sites of huge_pte_offset() to make
> > sure either it's with the vma lock (which is perfectly good enough for
> > safety in this case; the last patch commented on all those callers to make
> > sure we won't miss a single case, and why they're safe).  Besides, each of
> > the patch will add rcu protection to one caller of huge_pte_offset().
> >
> > Tests
> > =====
> >
> > Only lightly tested on hugetlb kselftests including uffd, no more errors
> > triggered than current mm-unstable (hugetlb-madvise fails before/after
> > here, with error "Unexpected number of free huge pages line 207"; haven't
> > really got time to look into it).
>
> Do not worry about the madvise test failure, that is caused by a recent
> change.
>
> Unless I am missing something, the basic strategy in this series is to
> wrap calls to huge_pte_offset and subsequent ptep access with
> rcu_read_lock/unlock calls.  I must embarrassingly admit that it has
> been a loooong time since I had to look at rcu usage and may not know
> what I am talking about.  However, I seem to recall that one needs to
> somehow flag the data items being protected from update/freeing.  I
> do not see anything like that in the huge_pmd_unshare routine where
> pmd page pointer is updated.  Or, is it where the pmd page pointer is
> referenced in huge_pte_offset?

Right.  The RCU proposed here is trying to protect the pmd pgtable page,
which is normally freed in an RCU pattern.  Please refer to
tlb_remove_table_free() (which can be called from tlb_finish_mmu()), where
it's released with the RCU API:

	call_rcu(&batch->rcu, tlb_remove_table_rcu);

I mentioned fast-gup just as a reference for the same usage, since fast-gup
has the same risk without RCU or a similar IPI-based protection, but I
definitely can be even clearer, and I will enrich the cover letter in the
next post.

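To illustrate why a deferred free keeps the walker safe, here is a minimal
single-threaded userspace sketch.  It is a toy model only, not the kernel's
RCU: every name in it (struct table_page, toy_read_lock(), toy_defer_free(),
walk_during_free(), ...) is hypothetical, toy_defer_free() merely stands in
for the role call_rcu() plays above, and a reader counter replaces real
grace-period detection.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: all names here are hypothetical, none are kernel APIs. */
struct table_page {
	int pte[4];			/* stand-in for page table entries */
	struct table_page *next;	/* link on the deferred-free list */
};

static int readers;			/* walkers inside a read-side section */
static struct table_page *pending;	/* pages queued for a deferred free */
static int freed_count;			/* pages actually freed so far */

/* Free queued pages, but only once no walker is inside a read section. */
static void toy_grace_period(void)
{
	if (readers > 0)
		return;
	while (pending) {
		struct table_page *p = pending;

		pending = p->next;
		free(p);
		freed_count++;
	}
}

static void toy_read_lock(void)
{
	readers++;
}

static void toy_read_unlock(void)
{
	readers--;
	toy_grace_period();
}

/* Plays the role of call_rcu(): queue the page instead of freeing it now. */
static void toy_defer_free(struct table_page *p)
{
	p->next = pending;
	pending = p;
	toy_grace_period();
}

/* A walker keeps using its entry pointer while the page is being "freed". */
static int walk_during_free(void)
{
	struct table_page *page = calloc(1, sizeof(*page));
	int *ptep;
	int seen;

	if (!page)
		return -1;
	page->pte[2] = 42;

	toy_read_lock();
	ptep = &page->pte[2];		/* models the huge_pte_offset() result */
	toy_defer_free(page);		/* models the unshare and table free */
	seen = (freed_count == 0) ? *ptep : -1;	/* page must still be alive */
	toy_read_unlock();		/* last reader gone: page is freed now */
	return seen;
}
```

In this model, calling walk_during_free() reads the entry successfully even
though the free was requested in the middle of the walk; the page is only
released when the read-side section ends.
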
In short, my understanding is that pgtable pages (including the shared PUD
page for hugetlb) need to be freed with caution, because there can be
software walking the pages with no locks.  In our case, even though
huge_pte_offset() is called with the mmap lock, due to the pmd sharing it's
not always the same mmap lock as the one held when the pgtable needs to be
freed, so it's similar to having no lock here, imo.  Then huge_pte_offset()
needs to be protected just like what we do with fast-gup.

Please also feel free to refer to the comment chunk at the start of
asm-generic/tlb.h for more information on the mmu gather API.

>
> Please ignore if you are certain of this rcu usage, otherwise I will
> spend some time reeducating myself.

I'm not certain, and I'd like to get any form of comment. :)

Sorry if this RFC version is confusing, but if it can at least explain the
problem we have, and if we can agree on the problem first, then that'll
already be a step forward to me.  So far that's more important than how we
resolve it, using RCU or vma lock or anything else.

For a non-rfc series, I think I need to be more careful on some details,
e.g., the RCU protection for the pgtable page is only usable when the arch
supports MMU_GATHER_RCU_TABLE_FREE.
I thought that's always supported, at least for archs with pmd sharing
enabled, but I'm actually wrong:

arch/arm64/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
arch/riscv/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
arch/x86/Kconfig:       select ARCH_WANT_HUGE_PMD_SHARE

arch/arm/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
arch/arm64/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE
arch/powerpc/Kconfig:   select MMU_GATHER_RCU_TABLE_FREE
arch/s390/Kconfig:      select MMU_GATHER_RCU_TABLE_FREE
arch/sparc/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE if SMP
arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
arch/x86/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT

I think it means that at least on RISCV, RCU_TABLE_FREE is not enabled and
we'll need to rely on the IPIs (e.g. I think we need to replace
rcu_read_lock() with local_irq_disable() on RISCV only, for what this
patchset wanted to do).  In the next version, I plan to add a helper, let's
name it huge_pte_walker_lock() for now, and it should be one of the three
options:

  - if !ARCH_WANT_HUGE_PMD_SHARE:      it's a no-op
  - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
  - else:                              it should be local_irq_disable()

With that, I think we'll strictly follow what we have with fast-gup, and in
the meantime it should add zero overhead on archs that do not have pmd
sharing.

Hope the above helps a bit in filling in the missing pieces of the cover
letter.  Or again, if anything is missing, I'd be more than glad to know.

Thanks,

-- 
Peter Xu
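
The three-way choice described above for huge_pte_walker_lock() can be
sketched in plain userspace C as follows.  This is only an illustration of
the decision logic under stated assumptions: enum walker_guard, struct
arch_config and select_guard() are hypothetical names, and a real kernel
helper would presumably make the choice at build time from the Kconfig
symbols rather than at run time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; this only models the selection rule. */
enum walker_guard {
	GUARD_NONE,	/* helper is a no-op */
	GUARD_RCU,	/* helper would take rcu_read_lock() */
	GUARD_IRQ_OFF,	/* helper would call local_irq_disable() */
};

/* Run-time stand-ins for the two Kconfig symbols discussed above. */
struct arch_config {
	bool want_huge_pmd_share;	/* ARCH_WANT_HUGE_PMD_SHARE */
	bool rcu_table_free;		/* MMU_GATHER_RCU_TABLE_FREE */
};

static enum walker_guard select_guard(const struct arch_config *cfg)
{
	/* No pmd sharing: the pgtable page cannot be freed by another mm. */
	if (!cfg->want_huge_pmd_share)
		return GUARD_NONE;
	/* Table pages are freed via RCU: an RCU read section is enough. */
	if (cfg->rcu_table_free)
		return GUARD_RCU;
	/* Otherwise rely on IPIs: keep interrupts off while walking. */
	return GUARD_IRQ_OFF;
}
```

Going by the Kconfig lines quoted above, a configuration with pmd sharing
and RCU table free (like arm64) would map to GUARD_RCU, while one with pmd
sharing but without RCU table free (like 64-bit RISCV) would map to
GUARD_IRQ_OFF.
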