From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.1 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3D941C433DB for ; Mon, 4 Jan 2021 19:24:52 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C546020784 for ; Mon, 4 Jan 2021 19:24:51 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C546020784 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 207218D0021; Mon, 4 Jan 2021 14:24:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 1B6628D001C; Mon, 4 Jan 2021 14:24:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0CD198D0021; Mon, 4 Jan 2021 14:24:51 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0128.hostedemail.com [216.40.44.128]) by kanga.kvack.org (Postfix) with ESMTP id ECBCC8D001C for ; Mon, 4 Jan 2021 14:24:50 -0500 (EST) Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id A85248248047 for ; Mon, 4 Jan 2021 19:24:50 +0000 (UTC) X-FDA: 77669069940.26.cent09_0301a20274d2 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin26.hostedemail.com (Postfix) with ESMTP id 6BEF11804B669 for ; Mon, 4 Jan 2021 19:24:50 +0000 (UTC) X-HE-Tag: cent09_0301a20274d2 X-Filterd-Recvd-Size: 6473 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by imf30.hostedemail.com (Postfix) with ESMTP for ; Mon, 4 Jan 2021 19:24:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1609788289; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=3WFUsWXvGZlvYFGW5zLUZV3SAVOw5bpW8rc0isSU/ko=; b=CLR+oaydS04gYgQioDafI6rsfT8MT+HXWyBA9lDrJJmReoAu5bcDq0JnJRWu7ZiQuqiCGY gQRSmIzhm8YTi+tW7RHuJFKvy/BQjAhA8y4kCeXCBsHw0xXxLK1YHiup2pj47vUUrG9Zfg wcKPxdtCwT6CPXzu6EIr4ywkUvV+Nx8= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-563-yYcZ4XfAMSmZCEUQAMjNdA-1; Mon, 04 Jan 2021 14:24:45 -0500 X-MC-Unique: yYcZ4XfAMSmZCEUQAMjNdA-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 3E5E5801AC0; Mon, 4 Jan 2021 19:24:43 +0000 (UTC) Received: from mail (ovpn-112-76.rdu2.redhat.com [10.10.112.76]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 55E2B5D756; Mon, 4 Jan 2021 19:24:39 +0000 (UTC) Date: Mon, 4 Jan 2021 14:24:38 -0500 From: Andrea Arcangeli To: Peter Zijlstra Cc: Nadav Amit , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadav Amit , Yu Zhao , Andy Lutomirski , Peter Xu , Pavel Emelyanov , Mike Kravetz , Mike Rapoport , Minchan Kim , Will Deacon , Mel Gorman Subject: Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Message-ID: References: <20201225092529.3228466-1-namit@vmware.com> <20201225092529.3228466-2-namit@vmware.com> <20210104122227.GL3021@hirez.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20210104122227.GL3021@hirez.programming.kicks-ass.net> User-Agent: Mutt/2.0.4 (2020-12-30) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hello, On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >=20 > > The scenario that happens in selftests/vm/userfaultfd is as follows: > >=20 > > cpu0 cpu1 cpu2 > > ---- ---- ---- > > [ Writable PTE > > cached in TLB ] > > userfaultfd_writeprotect() > > [ write-*unprotect* ] > > mwriteprotect_range() > > mmap_read_lock() > > change_protection() > >=20 > > change_protection_range() > > ... > > change_pte_range() > > [ *clear* =E2=80=9Cwrite=E2=80=9D-bit ] > > [ defer TLB flushes ] > > [ page-fault ] > > ... > > wp_page_copy() > > cow_user_page() > > [ copy page ] > > [ write to old > > page ] > > ... > > set_pte_at_notify() >=20 > Yuck! >=20 Note, the above was posted before we figured out the details so it wasn't showing the real deferred tlb flush that caused problems (the one showed on the left causes zero issues). The problematic one not pictured is the one of the wrprotect that has to be running in another CPU which is also isn't picture above. More accurate traces are posted later in the thread. > Isn't this all rather similar to the problem that resulted in the > tlb_flush_pending mess? >=20 > I still think that's all fundamentally buggered, the much saner solutio= n > (IMO) would've been to make things wait for the pending flush, instead How do intend you wait in PT lock while the writer also has to take PT lock repeatedly before it can do wake_up_var? If you release the PT lock before calling wait_tlb_flush_pending it all falls apart again. This I guess explains why a local pte/hugepmd smp local invlpg is the only working solution for this issue, similarly to how it's done in rmap. > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 07d9acb5b19c..0210547ac424 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_= struct *mm) > * > * Therefore we must rely on tlb_flush_*() to guarantee order. > */ > - atomic_dec(&mm->tlb_flush_pending); > + if (atomic_dec_and_test(&mm->tlb_flush_pending)) > + wake_up_var(&mm->tlb_flush_pending); > } > =20 > static inline bool mm_tlb_flush_pending(struct mm_struct *mm) > @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_s= truct *mm) > return atomic_read(&mm->tlb_flush_pending) > 1; > } > =20 > +static inline void wait_tlb_flush_pending(struct mm_struct *mm) > +{ > + wait_var_event(&mm->tlb_flush_pending, > + atomic_read(&mm->tlb_flush_pending) =3D=3D 0); > +} I appreciate the effort in not regressing soft dirty and uffd-wp writeprotect to disk-I/O spindle bandwidth and not using mmap_sem for writing. At the same time what was posted so far wasn't clean enough but it wasn't even tested... if we abstract it in some clean way and we mark all connected points (soft dirty, uffd-wp, the wrprotect page fault), then I can be optimistic it will remain understandable when we look at it again a few years down the road. Or at the very least it can't get worse than the "tlb_flush_pending mess" you mentioned above. flush_tlb_batched_pending() has to be orthogonally re-reviewed for those things Nadav pointed out. But I'd rather keep that review in a separate thread since any bug in that code has zero connection to this issue. The basic idea is similar but the methods and logic are different and our flush here will be granular and it's going to be only run if VM_SOFTDIRTY isn't set and soft dirty is compiled in, or if VM_UFFD_WP is set. The flush_tlb_batched_pending is mm wide, unconditional etc.. Pretty much all different. Thanks, Andrea