Date: Tue, 2 May 2023 13:54:41 +0100
From: Lorenzo Stoakes
To: Christian Borntraeger
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton,
 Jason Gunthorpe, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
 Leon Romanovsky, Christian Benvenuti, Nelson Escobar, Bernard Metzler,
 Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
 Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
 Bjorn Topel, Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
 David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 Christian Brauner, Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
 Jesper Dangaard Brouer, John Fastabend, linux-fsdevel@vger.kernel.org,
 linux-perf-users@vger.kernel.org, netdev@vger.kernel.org,
 bpf@vger.kernel.org, Oleg Nesterov, John Hubbard, Jan Kara,
 Kirill A. Shutemov, Pavel Begunkov, Mika Penttila, David Hildenbrand,
 Dave Chinner, Theodore Ts'o, Peter Xu, Matthew Rosato
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings
Message-ID: <7d56b424-ba79-4b21-b02c-c89705533852@lucifer.local>

On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
> On 02.05.23 at 01:11, Lorenzo Stoakes wrote:
> > Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> > as we cannot rule out folios being cleaned and then a GUP user writing to
> > them again and possibly marking them dirty unexpectedly.
> >
> > This is especially egregious for long-term mappings (as indicated by the
> > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> > we have already done in the slow path.
>
> Hmm, does this interfere with KVM on s390 and PCI interpretation of
> interrupt delivery? It would no longer work with file-backed memory,
> correct?
>
> See kvm_s390_pci_aif_enable() in arch/s390/kvm/pci.c, which does have
> FOLL_WRITE | FOLL_LONGTERM.

Does this memory map a dirty-tracked file? It's kind of hard to dig into
where the address originates from without going through a ton of code. In
the worst case, if the fast path doesn't find a whitelisted mapping, it
falls back to the slow path, which explicitly checks for a dirty-tracked
filesystem (see the caller-side sketch at the end of this mail).

We can reintroduce a flag to permit exceptions if this is really broken.
Are you able to test? I don't have an s390 sitting around :)

> > We have access to less information in the fast path as we cannot examine
> > the VMA containing the mapping; however, we can determine whether the
> > folio is anonymous and then whitelist known-good mappings - specifically
> > hugetlb and shmem mappings.
> >
> > While we obtain a stable folio for this check, the mapping might not be,
> > as a truncate could nullify it at any time. Since doing so requires
> > mappings to be zapped, we can synchronise against a TLB shootdown
> > operation.
> >
> > For some architectures TLB shootdown is synchronised by IPI, against
> > which we are protected as the GUP-fast operation is performed with
> > interrupts disabled.
> > However, other architectures which specify
> > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
> >
> > In these instances, we acquire an RCU lock while performing our checks.
> > If we cannot get a stable mapping, we fall back to the slow path, as
> > otherwise we'd have to walk the page tables again and it's simpler and
> > more effective to just fall back.
> >
> > It's important to note that there are no APIs allowing users to specify
> > FOLL_FAST_ONLY for a PUP-fast, let alone with FOLL_LONGTERM, so we can
> > always rely on the fact that if we fail to pin on the fast path, the
> > code will fall back to the slow path which can perform the more thorough
> > check.
> >
> > Suggested-by: David Hildenbrand
> > Suggested-by: Kirill A. Shutemov
> > Signed-off-by: Lorenzo Stoakes
> > ---
> >  mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 85 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 0f09dec0906c..431618048a03 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/migrate.h>
> >  #include <linux/mm_inline.h>
> >  #include <linux/sched/mm.h>
> > +#include <linux/shmem_fs.h>
> >
> >  #include <asm/mmu_context.h>
> >  #include <asm/tlbflush.h>
> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >  	return folio;
> >  }
> >
> > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > +
> > +	rcu_read_lock();
> > +
> > +	return mapping == READ_ONCE(folio->mapping);
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +	rcu_read_unlock();
> > +}
> > +#else
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > +	return true;
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +}
> > +#endif
> > +
> > +/*
> > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > + * FOLL_WRITE pin is permitted for a specific folio.
> > + *
> > + * This assumes the folio is stable and pinned.
> > + *
> > + * Writing to pinned file-backed dirty-tracked folios is inherently
> > + * problematic (see the comment describing the
> > + * writeable_file_mapping_allowed() function). We therefore try to avoid the
> > + * most egregious case of a long-term mapping doing so.
> > + *
> > + * This function cannot be as thorough as that one as the VMA is not available
> > + * in the fast path, so instead we whitelist known good cases.
> > + *
> > + * The folio is stable, but the mapping might not be. When truncating, for
> > + * instance, a zap is performed which triggers TLB shootdown. IRQs are
> > + * disabled so we are safe from an IPI, but some architectures use an RCU
> > + * lock for this operation, so we acquire an RCU lock to ensure the mapping
> > + * is stable.
> > + */
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > +	bool ret;
> > +
> > +	/* hugetlb mappings do not require dirty tracking. */
> > +	if (folio_test_hugetlb(folio))
> > +		return true;
> > +
> > +	if (stabilise_mapping_rcu(folio)) {
> > +		struct address_space *mapping = folio_mapping(folio);
> > +
> > +		/*
> > +		 * Neither anonymous nor shmem-backed folios require
> > +		 * dirty tracking.
> > +		 */
> > +		ret = folio_test_anon(folio) ||
> > +			(mapping && shmem_mapping(mapping));
> > +	} else {
> > +		/* If the mapping is unstable, fall back to the slow path. */
> > +		ret = false;
> > +	}
> > +
> > +	unlock_rcu();
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * try_grab_folio() - Attempt to get or pin a folio.
> >   * @page: pointer to page to be grabbed
> > @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >   */
> >  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  {
> > +	bool is_longterm = flags & FOLL_LONGTERM;
> > +
> >  	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> >  		return NULL;
> >
> > @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  	 * right zone, so fail and let the caller fall back to the slow
> >  	 * path.
> >  	 */
> > -	if (unlikely((flags & FOLL_LONGTERM) &&
> > -		     !is_longterm_pinnable_page(page)))
> > +	if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> >  		return NULL;
> >
> >  	/*
> > @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  	if (!folio)
> >  		return NULL;
> >
> > +	/*
> > +	 * Can this folio be safely pinned? We need to perform this
> > +	 * check after the folio is stabilised.
> > +	 */
> > +	if ((flags & FOLL_WRITE) && is_longterm &&
> > +	    !folio_longterm_write_pin_allowed(folio)) {
> > +		folio_put_refs(folio, refs);
> > +		return NULL;
> > +	}
> > +
> >  	/*
> >  	 * When pinning a large folio, use an exact count to track it.
> >  	 *
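
To make the fallback behaviour mentioned above concrete, here is a minimal,
hypothetical caller-side sketch of the kind of long-term write pin the s390
AIF code performs. pin_aift_page() is an invented name for illustration, not
the actual kvm_s390_pci_aif_enable() code; only pin_user_pages_fast() and the
FOLL_* flags are real kernel APIs.

#include <linux/mm.h>

/*
 * Hypothetical sketch: long-term pin a single page for write. If @hva
 * lies in a dirty-tracked file-backed mapping, GUP-fast now refuses the
 * pin and pin_user_pages_fast() internally retries via the slow path,
 * whose VMA-based check then fails the pin instead of permitting an
 * unsafe long-term write pin.
 */
static int pin_aift_page(unsigned long hva, struct page **page)
{
	int npages = pin_user_pages_fast(hva, 1,
					 FOLL_WRITE | FOLL_LONGTERM, page);

	if (npages < 0)
		return npages;

	return npages == 1 ? 0 : -EFAULT;
}

The point being: the fast-to-slow fallback happens inside
pin_user_pages_fast() itself, so a caller like this needs no changes to pick
up the new check - it simply sees the pin fail if the mapping turns out to be
file-backed and dirty-tracked.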