Date: Mon, 5 Sep 2022 09:59:47 +0200
From: David Hildenbrand
Organization: Red Hat
To: John Hubbard, Yang Shi, peterx@redhat.com,
 kirill.shutemov@linux.intel.com, jgg@nvidia.com, hughd@google.com,
 akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: gup: fix the fast GUP race against THP collapse
References: <20220901222707.477402-1-shy828301@gmail.com>

On 05.09.22 00:29, John Hubbard wrote:
> On 9/1/22 15:27, Yang Shi wrote:
>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no
>> longer sufficient to handle concurrent GUP-fast in all cases; it only
>> handles traditional IPI-based GUP-fast correctly. On architectures that
>> send an IPI broadcast on TLB flush, it works as expected. But on
>> architectures that do not use an IPI to broadcast the TLB flush, the
>> following race is possible:
>>
>>     CPU A                                  CPU B
>>     THP collapse                           fast GUP
>>                                            gup_pmd_range() <-- see valid pmd
>>                                            gup_pte_range() <-- work on pte
>>     pmdp_collapse_flush() <-- clear pmd and flush
>>     __collapse_huge_page_isolate()
>>         check page pinned <-- before GUP bump refcount
>>                                            pin the page
>>                                            check PTE <-- no change
>>     __collapse_huge_page_copy()
>>         copy data to huge page
>>         ptep_clear()
>>     install huge pmd for the huge page
>>                                            return the stale page
>>     discard the stale page
>
> Hi Yang,
>
> Thanks for taking the trouble to write down these notes. I always
> forget which race we are dealing with, and this is a great help. :)
>
> More...
>
>>
>> The race can be fixed by checking whether the PMD has changed after
>> taking the page pin in fast GUP, just like what is already done for the
>> PTE. If the PMD has changed, there may be a parallel THP collapse in
>> progress, so GUP should back off.
>>
>> Also update the stale comment about serializing against fast GUP in
>> khugepaged.
>>
>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
>> Signed-off-by: Yang Shi
>> ---
>>   mm/gup.c        | 30 ++++++++++++++++++++++++------
>>   mm/khugepaged.c | 10 ++++++----
>>   2 files changed, 30 insertions(+), 10 deletions(-)
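For context on why the traditional scheme was safe: fast GUP performs
the whole lockless walk with interrupts disabled, so on architectures
that flush TLBs by IPI broadcast, the collapsing CPU cannot finish
pmdp_collapse_flush() until every walker has re-enabled interrupts.
Roughly (a simplified sketch along the lines of
lockless_pages_from_mm(), not the exact mm/gup.c code):

	unsigned long flags;
	int nr_pinned = 0;

	/*
	 * With IRQs off, this CPU will not service a TLB-flush IPI,
	 * so a concurrent pmdp_collapse_flush() blocks until the
	 * walk below has finished -- on IPI-flushing arches only.
	 */
	local_irq_save(flags);
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);

On architectures that broadcast TLB invalidations in hardware instead
of sending IPIs, disabling IRQs serializes nothing, which is what opens
the window for the race above.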
>> diff --git a/mm/gup.c b/mm/gup.c
>> index f3fc1f08d90c..4365b2811269 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>>   }
>>
>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>> -			 unsigned int flags, struct page **pages, int *nr)
>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>> +			 unsigned long end, unsigned int flags,
>> +			 struct page **pages, int *nr)
>>   {
>>   	struct dev_pagemap *pgmap = NULL;
>>   	int nr_start = *nr, ret = 0;
>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>   			goto pte_unmap;
>>   		}
>>
>> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>> +		/*
>> +		 * THP collapse conceptually does:
>> +		 *   1. Clear and flush PMD
>> +		 *   2. Check the base page refcount
>> +		 *   3. Copy data to huge page
>> +		 *   4. Clear PTE
>> +		 *   5. Discard the base page
>> +		 *
>> +		 * So fast GUP may race with THP collapse, then pin and
>> +		 * return an old page, since a TLB flush is no longer
>> +		 * sufficient to serialize against fast GUP.
>> +		 *
>> +		 * Check the PMD; if it has changed, just back off, since
>> +		 * that may mean a parallel THP collapse is running.
>> +		 */
>
> As I mentioned in the other thread, it would be a nice touch to move
> such discussion into the comment header.
>
>> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {
>
>
> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
> whole lockless house of cards may fall apart if we try reading the
> page table values without READ_ONCE().

I came to the conclusion that the implicit memory barrier when grabbing
a reference on the page is sufficient, such that we don't need
READ_ONCE() here.

If we still intend to change that code, we should fix up all GUP-fast
functions in a similar way. But again, I don't think we need a change
here.
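To spell out the ordering this relies on: grabbing the reference is a
value-returning atomic RMW, which the kernel's atomic rules make fully
ordered, so the subsequent re-reads of *pmdp and *ptep cannot be
speculated before the pin. A simplified sketch of the gup_pte_range()
fast path with the new check (the barrier annotations are my reading,
not text from the patch, and the surrounding checks are elided):

	pte_t pte = ptep_get_lockless(ptep);	/* (1) snapshot the PTE */
	struct page *page = pte_page(pte);
	struct folio *folio;

	/* (2) refcount RMW: value-returning atomic -> full barrier */
	folio = try_grab_folio(page, 1, flags);
	if (!folio)
		goto pte_unmap;

	/*
	 * (3) Thanks to the barrier in (2), these re-reads cannot be
	 * hoisted above the pin, so plain dereferences are enough.
	 */
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(*ptep))) {
		gup_put_folio(folio, 1, flags);	/* lost the race: back off */
		goto pte_unmap;
	}

The collapse side clears the PMD before it re-checks the page refcount,
so either __collapse_huge_page_isolate() observes the pin taken in (2)
and bails out, or the re-read in (3) observes the cleared PMD and fast
GUP backs off.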
>> -	 * After this gup_fast can't run anymore. This also removes
>> -	 * any huge TLB entry from the CPU so we won't allow
>> -	 * huge and small TLB entries for the same virtual address
>> -	 * to avoid the risk of CPU bugs in that area.
>> +	 * This removes any huge TLB entry from the CPU so we won't allow
>> +	 * huge and small TLB entries for the same virtual address to
>> +	 * avoid the risk of CPU bugs in that area.
>> +	 *
>> +	 * Parallel fast GUP is fine since fast GUP will back off when
>> +	 * it detects the PMD has changed.
>>   	 */
>>   	_pmd = pmdp_collapse_flush(vma, address, pmd);
>
> To follow up on David Hildenbrand's note about this in the nearby
> thread... I'm also not sure whether pmdp_collapse_flush() implies a
> memory barrier on all arches. It definitely does an atomic op with a
> return value on x86, but that's just one arch.
>

I think a ptep/pmdp clear + TLB flush really has to imply a memory
barrier; otherwise, TLB-flushing code could easily get reordered against
the surrounding code. But we had better double-check.

s390x executes an IDTE instruction, which performs serialization
(-> memory barrier).

arm64 seems to use DSB instructions to enforce memory ordering.

-- 
Thanks,

David / dhildenb