Subject: Re: [PATCH] mm: gup: fix the fast GUP race against THP collapse
From: David Hildenbrand <david@redhat.com>
To: Baolin Wang, John Hubbard, Yang Shi, peterx@redhat.com,
 kirill.shutemov@linux.intel.com, jgg@nvidia.com, hughd@google.com,
 akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Mon, 5 Sep 2022 12:24:34 +0200
Message-ID: <383fec21-9801-9b60-7570-856da2133ea9@redhat.com>
In-Reply-To: <0c9d9774-77dd-fd93-b5b6-fc63f3d01b7f@linux.alibaba.com>
References: <20220901222707.477402-1-shy828301@gmail.com>
 <0c9d9774-77dd-fd93-b5b6-fc63f3d01b7f@linux.alibaba.com>
Organization: Red Hat

On 05.09.22 12:16, Baolin Wang wrote:
> On 9/5/2022 3:59 PM, David Hildenbrand wrote:
>> On 05.09.22 00:29, John Hubbard wrote:
>>> On 9/1/22 15:27, Yang Shi wrote:
>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no
>>>> longer sufficient to handle concurrent GUP-fast in all cases; it only
>>>> handles traditional IPI-based GUP-fast correctly. On architectures
>>>> that send an IPI broadcast on TLB flush, it works as expected. But on
>>>> architectures that do not use an IPI to broadcast the TLB flush, the
>>>> below race is possible:
>>>>
>>>>    CPU A                                  CPU B
>>>>    THP collapse                           fast GUP
>>>>                                           gup_pmd_range() <-- see valid pmd
>>>>                                           gup_pte_range() <-- work on pte
>>>>    pmdp_collapse_flush() <-- clear pmd and flush
>>>>    __collapse_huge_page_isolate()
>>>>        check page pinned <-- before GUP bump refcount
>>>>                                           pin the page
>>>>                                           check PTE <-- no change
>>>>    __collapse_huge_page_copy()
>>>>        copy data to huge page
>>>>        ptep_clear()
>>>>    install huge pmd for the huge page
>>>>                                           return the stale page
>>>>    discard the stale page
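For context: "traditional IPI-based GUP-fast" is immune to this race
because the lockless walk runs with local interrupts disabled, so a
TLB-flush IPI cannot be acknowledged until the walk has finished. A
minimal sketch of that pattern, loosely modeled on
lockless_pages_from_mm() in mm/gup.c (start, end, gup_flags and pages
as set up by the caller; details and error handling omitted):

	unsigned long flags;
	int nr_pinned = 0;

	/*
	 * With interrupts off, a CPU doing pmdp_collapse_flush() that
	 * broadcasts its TLB flush via IPI must wait for this CPU; on
	 * architectures that flush without IPIs (e.g. arm64 broadcast
	 * TLBI), this implicit serialization does not exist.
	 */
	local_irq_save(flags);
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);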
>>>
>>> Hi Yang,
>>>
>>> Thanks for taking the trouble to write down these notes. I always
>>> forget which race we are dealing with, and this is a great help. :)
>>>
>>> More...
>>>
>>>> The race can be fixed by checking whether the PMD has changed after
>>>> taking the page pin in fast GUP, just as is already done for the PTE.
>>>> If the PMD has changed, there may be a parallel THP collapse in
>>>> flight, so GUP should back off.
>>>>
>>>> Also update the stale comment about serializing against fast GUP in
>>>> khugepaged.
>>>>
>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
>>>> Signed-off-by: Yang Shi
>>>> ---
>>>>   mm/gup.c        | 30 ++++++++++++++++++++++++------
>>>>   mm/khugepaged.c | 10 ++++++----
>>>>   2 files changed, 30 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index f3fc1f08d90c..4365b2811269 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>>>>   }
>>>>
>>>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>>> -             unsigned int flags, struct page **pages, int *nr)
>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>>>> +             unsigned long end, unsigned int flags,
>>>> +             struct page **pages, int *nr)
>>>>   {
>>>>       struct dev_pagemap *pgmap = NULL;
>>>>       int nr_start = *nr, ret = 0;
>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>>>               goto pte_unmap;
>>>>           }
>>>>
>>>> -        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>> +        /*
>>>> +         * THP collapse conceptually does:
>>>> +         *   1. Clear and flush PMD
>>>> +         *   2. Check the base page refcount
>>>> +         *   3. Copy data to huge page
>>>> +         *   4. Clear PTE
>>>> +         *   5. Discard the base page
>>>> +         *
>>>> +         * So fast GUP may race with THP collapse, then pin and
>>>> +         * return an old page since the TLB flush is no longer
>>>> +         * sufficient to serialize against fast GUP.
>>>> +         *
>>>> +         * Check the PMD; if it has changed, just back off, since
>>>> +         * that means there may be a parallel THP collapse.
>>>> +         */
>>>
>>> As I mentioned in the other thread, it would be a nice touch to move
>>> such discussion into the comment header.
>>>
>>>> +        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>>> +            unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>
>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>>> whole lockless house of cards may fall apart if we try reading the
>>> page table values without READ_ONCE().
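Spelled out, John's suggestion would make the recheck read roughly like
the following sketch against the patch as posted (not necessarily what
gets merged; the cleanup path mirrors the existing pte_val() mismatch
handling in gup_pte_range()):

	/*
	 * READ_ONCE() forces the compiler to re-read the page table
	 * entries from memory rather than reuse previously loaded
	 * values.
	 */
	if (unlikely(pmd_val(pmd) != pmd_val(READ_ONCE(*pmdp))) ||
	    unlikely(pte_val(pte) != pte_val(READ_ONCE(*ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}

Whether those READ_ONCE()s are strictly needed is exactly what the rest
of this exchange debates.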
>>
>> I came to the conclusion that the implicit memory barrier when grabbing
>> a reference on the page is sufficient such that we don't need READ_ONCE
>> here.
>
> IMHO, the compiler may optimize the code so that 'pte_val(*ptep)' is
> always read from a register, and then we can get an old value if
> another thread did set_pte(). I am not sure how the implicit memory
> barrier can prevent that compiler optimization. Please correct me if I
> missed something.

IIUC, a memory barrier always implies a compiler barrier.

-- 
Thanks,

David / dhildenb