From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BEF16C54EE9 for ; Tue, 6 Sep 2022 19:01:51 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 200ED6B0073; Tue, 6 Sep 2022 15:01:51 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1894C6B0074; Tue, 6 Sep 2022 15:01:51 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 003196B0075; Tue, 6 Sep 2022 15:01:50 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id DDD036B0073 for ; Tue, 6 Sep 2022 15:01:50 -0400 (EDT) Received: from smtpin30.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id B8AE212065C for ; Tue, 6 Sep 2022 19:01:50 +0000 (UTC) X-FDA: 79882579980.30.D065E1A Received: from mail-pl1-f176.google.com (mail-pl1-f176.google.com [209.85.214.176]) by imf22.hostedemail.com (Postfix) with ESMTP id 664B5C0084 for ; Tue, 6 Sep 2022 19:01:50 +0000 (UTC) Received: by mail-pl1-f176.google.com with SMTP id s14so2612870plr.4 for ; Tue, 06 Sep 2022 12:01:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date; bh=AZu8Usw5qGBv2bCC+wGyzG/1jgo5UGZVkgClI5nAtmY=; b=jldjQaSzVBJgrETu/whlKsfKw96xbVi+rWmVCTjfuP86WuGBAmjqV5OzhTwp9Mt22e a7NK/CEXb+za4jthYO2k1p0jVyue+3A8zL5eNyUfzaaywchWmGHn8xT/3lHE1ESPwHJV zcRchhRgpzpu2K2k0Ssvv7TcGEa6VKB2Qk840pzhdc6VP+VqzMgDi4EGzwK5GIrNcolj gaBCDGB9BQLo1NLPZDic+Q2UgECrSlwCUlU3a8KUKn2Ft8p7zupQLT3uSBQ3wGF/Zpfm VaYJNFaxAA47sLm5wIKQwnbuNZJopF7t7JVlyFKruY8UREgX/A7piwIW1d4bH+Tt5Xo/ 3t0g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date; bh=AZu8Usw5qGBv2bCC+wGyzG/1jgo5UGZVkgClI5nAtmY=; b=2mGEWMXU+5jTzGMkO/iIYRek5hlRYJ8L2b3J5enUseBkio6jLABnGONanRZjuedQKz by2+URsvFSgdHMZcrHzNKRf2F+0MyU2cqIGq2SE5HmGmu4sTy7QQp6qqe6OwgaILAIIj /02AG7Bv0JP+gqeQKxLI3iJLeUhG4s5XyZDeGe4X3ShpC0hD14jY4grh+DawoRSSYceN 85KLdOCQy68KrYKTyJqjkORhnplbcTF7VxVJTSmkHvTEkU36zQtn3RZHa+D34dv7fUbN ClQUrf0vXrE/wSjtWmMO3piji55aR10x1LorVDwYBn+KYPQN6uQjy1gY44LENYrW4BUZ pC8A== X-Gm-Message-State: ACgBeo0Q72i807L14lvxfM09aU8zZNqS1iCbP/5qtkBNtcZs4u05oSaF q3s0cXiE59gKYEnx4WWfHqffd28y0FoM/UBP5NQ= X-Google-Smtp-Source: AA6agR5O5m4bkA66nDVs0wlihn8Tn+UVeUFnkjlLNTilBW3AnZeEG7oVg0s3bWwf0/1IQQvPDr4JzZ3c1xTQx2duBuc= X-Received: by 2002:a17:902:be03:b0:175:6397:9425 with SMTP id r3-20020a170902be0300b0017563979425mr25636541pls.26.1662490909335; Tue, 06 Sep 2022 12:01:49 -0700 (PDT) MIME-Version: 1.0 References: <20220901222707.477402-1-shy828301@gmail.com> In-Reply-To: From: Yang Shi Date: Tue, 6 Sep 2022 12:01:36 -0700 Message-ID: Subject: Re: [PATCH] mm: gup: fix the fast GUP race against THP collapse To: David Hildenbrand Cc: John Hubbard , peterx@redhat.com, kirill.shutemov@linux.intel.com, jgg@nvidia.com, hughd@google.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1662490910; a=rsa-sha256; cv=none; b=zgXOiWQLQCLxepNcBfUQ7AJc5LQ0/9lzwH+upDNmFyyNvmoQM5SztareDkwUKf8jJtWbpr idm1I/Rl/4HXGX/fHcy7fw6BIqBRpJZNP4CQZ/huDkb9HF7986SQ1JExq0ScnC3oaP7hxe DNlHyuHXzhW/GdhOf0dgRTtlk6nrOiY= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=jldjQaSz; spf=pass (imf22.hostedemail.com: domain of shy828301@gmail.com designates 209.85.214.176 as permitted sender) smtp.mailfrom=shy828301@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1662490910; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=AZu8Usw5qGBv2bCC+wGyzG/1jgo5UGZVkgClI5nAtmY=; b=rQLM/fMhxChrRsg8EcxzhSiivMKF+PY9+3l0NpYlLWPfPpJTk9Li/9Y8hY5MZJyfIWywcM WnTQXnu16p81PN+nojxnrAXYv281Sr/8e9t+ghn2hrSChZGmIY3kKqN22cqzhyC4Qx5VBj Z50RH3evS9baWdxwKYXsqj3AeHEiGCI= Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=jldjQaSz; spf=pass (imf22.hostedemail.com: domain of shy828301@gmail.com designates 209.85.214.176 as permitted sender) smtp.mailfrom=shy828301@gmail.com; dmarc=pass (policy=none) header.from=gmail.com X-Rspam-User: X-Rspamd-Server: rspam06 X-Stat-Signature: qbsh8gfay4oy9qy9jj3ms4zqs16do1uz X-Rspamd-Queue-Id: 664B5C0084 X-HE-Tag: 1662490910-647213 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Sep 5, 2022 at 12:59 AM David Hildenbrand wrote: > > On 05.09.22 00:29, John Hubbard wrote: > > On 9/1/22 15:27, Yang Shi wrote: > >> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: > >> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer > >> sufficient to handle concurrent GUP-fast in all cases, it only handles > >> traditional IPI-based GUP-fast correctly. On architectures that send > >> an IPI broadcast on TLB flush, it works as expected. But on the > >> architectures that do not use IPI to broadcast TLB flush, it may have > >> the below race: > >> > >> CPU A CPU B > >> THP collapse fast GUP > >> gup_pmd_range() <-- see valid pmd > >> gup_pte_range() <-- work on pte > >> pmdp_collapse_flush() <-- clear pmd and flush > >> __collapse_huge_page_isolate() > >> check page pinned <-- before GUP bump refcount > >> pin the page > >> check PTE <-- no change > >> __collapse_huge_page_copy() > >> copy data to huge page > >> ptep_clear() > >> install huge pmd for the huge page > >> return the stale page > >> discard the stale page > > > > Hi Yang, > > > > Thanks for taking the trouble to write down these notes. I always > > forget which race we are dealing with, and this is a great help. :) > > > > More... > > > >> > >> The race could be fixed by checking whether PMD is changed or not after > >> taking the page pin in fast GUP, just like what it does for PTE. If the > >> PMD is changed it means there may be parallel THP collapse, so GUP > >> should back off. > >> > >> Also update the stale comment about serializing against fast GUP in > >> khugepaged. > >> > >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") > >> Signed-off-by: Yang Shi > >> --- > >> mm/gup.c | 30 ++++++++++++++++++++++++------ > >> mm/khugepaged.c | 10 ++++++---- > >> 2 files changed, 30 insertions(+), 10 deletions(-) > >> > >> diff --git a/mm/gup.c b/mm/gup.c > >> index f3fc1f08d90c..4365b2811269 100644 > >> --- a/mm/gup.c > >> +++ b/mm/gup.c > >> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, > >> } > >> > >> #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL > >> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > >> - unsigned int flags, struct page **pages, int *nr) > >> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > >> + unsigned long end, unsigned int flags, > >> + struct page **pages, int *nr) > >> { > >> struct dev_pagemap *pgmap = NULL; > >> int nr_start = *nr, ret = 0; > >> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, > >> goto pte_unmap; > >> } > >> > >> - if (unlikely(pte_val(pte) != pte_val(*ptep))) { > >> + /* > >> + * THP collapse conceptually does: > >> + * 1. Clear and flush PMD > >> + * 2. Check the base page refcount > >> + * 3. Copy data to huge page > >> + * 4. Clear PTE > >> + * 5. Discard the base page > >> + * > >> + * So fast GUP may race with THP collapse then pin and > >> + * return an old page since TLB flush is no longer sufficient > >> + * to serialize against fast GUP. > >> + * > >> + * Check PMD, if it is changed just back off since it > >> + * means there may be parallel THP collapse. > >> + */ > > > > As I mentioned in the other thread, it would be a nice touch to move > > such discussion into the comment header. > > > >> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || > >> + unlikely(pte_val(pte) != pte_val(*ptep))) { > > > > > > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this > > whole lockless house of cards may fall apart if we try reading the > > page table values without READ_ONCE(). > > I came to the conclusion that the implicit memory barrier when grabbing > a reference on the page is sufficient such that we don't need READ_ONCE > here. > > If we still intend to change that code, we should fixup all GUP-fast > functions in a similar way. But again, I don't think we need a change here. > > > >> - * After this gup_fast can't run anymore. This also removes > >> - * any huge TLB entry from the CPU so we won't allow > >> - * huge and small TLB entries for the same virtual address > >> - * to avoid the risk of CPU bugs in that area. > >> + * This removes any huge TLB entry from the CPU so we won't allow > >> + * huge and small TLB entries for the same virtual address to > >> + * avoid the risk of CPU bugs in that area. > >> + * > >> + * Parallel fast GUP is fine since fast GUP will back off when > >> + * it detects PMD is changed. > >> */ > >> _pmd = pmdp_collapse_flush(vma, address, pmd); > > > > To follow up on David Hildenbrand's note about this in the nearby thread... > > I'm also not sure if pmdp_collapse_flush() implies a memory barrier on > > all arches. It definitely does do an atomic op with a return value on x86, > > but that's just one arch. > > > > I think a ptep/pmdp clear + TLB flush really has to imply a memory > barrier, otherwise TLB flushing code might easily mess up with > surrounding code. But we should better double-check. > > s390x executes an IDTE instruction, which performs serialization (-> > memory barrier). arm64 seems to use DSB instructions to enforce memory > ordering. IIUC we just need to care about the architectures which support both THP and fast-GUP, right? If so I got the below list: Loongarch - I didn't see explicit memory barrier, but it does IPI MIPS - as same as Loongarch PowerPC - has memory barrier x86 - atomic operations imply memory barrier Arm - I didn't see explicit memory barrier, not sure if ARM's tlb flush instructions imply memory barrier or not, but it does IPI Arm64 - uses DSB as David mentioned s390 - executes IDTE as David mentioned So I think the questionable architectures are Loongarch, MIPS and ARM. And IPI is not called if the PMD entry is not present on a remote CPU. So they may have race against fast GUP IIUC. But we'd better double check with the maintainers. > > -- > Thanks, > > David / dhildenb >