From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 9 Dec 2019 09:17:54 -0500
From: Jerome Glisse
To: Matthew Wilcox
Cc: linux-mm@kvack.org, Laurent Dufour, David Rientjes, Vlastimil Babka, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso
Subject: Re: Splitting the mmap_sem
Message-ID: <20191209141754.GA6841@redhat.com>
References: <20191203222147.GV20752@bombadil.infradead.org> <20191205172150.GD5819@redhat.com> <20191206051322.GA21007@bombadil.infradead.org> <20191206173030.GA3648@redhat.com> <20191209033335.GC32169@bombadil.infradead.org>
In-Reply-To: <20191209033335.GC32169@bombadil.infradead.org>
User-Agent: Mutt/1.12.1 (2019-06-15)
On Sun, Dec 08, 2019 at 07:33:35PM -0800, Matthew Wilcox wrote:
> On Fri, Dec 06, 2019 at 12:30:30PM -0500, Jerome Glisse wrote:
> > On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > > > accessing the same address may see different data from each other and
> > > > > have different page translations in their respective CPU caches until
> > > > > the thread calling mmap() returns.  I believe this is OK, but would
> > > > > greatly appreciate hearing from people who know better.
> > > >
> > > > I do not believe this is OK, i believe this is wrong (not even considering
> > > > possible hardware issues that can arise from such aliasing).
> > >
> > > Well, OK, but why do you believe it is wrong?  If thread A is executing
> > > a load instruction at the same time that thread B is calling mmap(),
> > > it really is indeterminate what value A loads.  It might be from before
> > > the call to mmap() and it might be from after.  And if thread C is also
> > > executing a load instruction at the same time, then it might already get
> > > a different result from thread A.  And can threads A and C really tell
> > > which of them executed the load instruction 'first'?  I think this is
> > > all so indeterminate already that the (lack of) guarantees I outlined
> > > above are acceptable.
> > >
> > > But we should all agree on this, so _please_ continue to argue your case
> > > for why you believe it to be wrong.
> >
> > I agree that such an application might look like it is doing something
> > indeterminate, but there might be applications that catch SEGFAULT and use
> > it as synchronization. I did something similar for reverse engineering a
> > long time ago with a library called libsegfault ...
> >
> > In any case, i agree that an application that is not catching SEGFAULT, and
> > which is doing the above (access patterns), is doing something indeterminate.
> >
> > Nonetheless i believe it is important that at any point in time, for all the
> > threads in a given process, on all the CPUs, a given virtual address should
> > always point to the same physical memory (or to nothing), ie we should never
> > have one CPU that sees different physical memory from another CPU for the
> > same virtual address.
> >
> > Well i feel like you are also not discussing the munmap() case; the above
> > seemed to be about MAP_FIXED (replacing an existing mapping). For munmap
> > too i believe we should agree on what the expected behavior should be, and
> > from my POV again we should not allow a new mapping to appear until a
> > "running" munmap is fully done (ie all CPU caches and TLBs flushed). For
> > the same reason as above, ie all CPUs always see the same physical memory
> > (or nothing) for a given virtual address.
>
> I see MAP_FIXED as being the harder case, but sure, let's talk about
> munmap!  I agree that a munmap() + mmap() call should not permit thread
> B to see the old value after thread A has seen the new value.  But,
> as long as no new mmap can occupy that range, then it's OK if thread A
> takes a segfault while thread B can still load the old value.  At least
> for a short window.
>
> We can replicate that behaviour by ensuring that new lookups see a NULL
> entry, but new attempts to allocate will not reuse that range until the
> munmap has finished and all TLB entries are flushed.  The maple tree
> actually supports a "ZERO" entry (just like the XArray does) which has
> this behaviour -- lookups see NULL, but attempts to allocate do not see
> it as free.  We already use that property to prevent allocating above
> the end of the process address space.
>
> > This is what we have today with the big rwsem and i think we need to keep
> > that behavior even with concurrency. I do not believe this will impact the
> > performance, and it is easy enough to solve, so i feel safer doing so given
> > it does not cost anything.
> >
> > So i would rather argue on why we should change the current behavior if we
> > can fix the concurrency without changing it (hence why discussing solutions
> > might also be relevant here).
>
> It seems like you want to force a thread which sees an ongoing munmap
> to spin or sleep until the munmap is done, rather than immediately take
> a segfault, and I don't know that's a useful behaviour.

No, for munmap() i want today's behavior modulo concurrency, ie:
  - any concurrent fault while munmap is ongoing should SEGFAULT
  - any new mmap might install itself in the unmapped area but can not
    service faults until the munmap is fully done (ie all caches and TLBs
    are flushed)

So it seems we agree on that, given your above presentation of the maple
tree ZERO entry.

I want the spin only for faults in a MAP_FIXED range that replaces an
existing range, ie which is what we have today. Today if you do a mmap
MAP_FIXED you will block concurrent faults until the rwsem is released.
I argue we should keep that behavior, as any program that keeps accessing
such an area while mmap is in progress is doing something peculiar (it can
be a bug or it can be something totally intentional and used in conjunction
with signal handlers).
This will only affect such applications, and it will keep TLB coherency
across all CPUs. The implementation cost can be kept very low and it does
not impede concurrency on other virtual ranges.

Keeping the existing behavior just feels a lot safer to me, and it avoids
us pondering on whether or not a CPU can freak out if it sees different TLB
entries for the same virtual address.

> > > > Just to make sure i understand, you propose that ->map_pages() becomes
> > > > a new ->fault() handler that gets called before ->fault() without refcount,
> > > > so that we can update fs/drivers slowly to perform better in the new scheme
> > > > (ie avoid the overhead of refcounting if possible at all) ?
> > > >
> > > > The ->fault() callback would then be the "slow" path which will require
> > > > a refcount on the vma (taken by core mm code before dropping the rcu lock).
> > >
> > > I would actually propose never updating most drivers.  There's just no
> > > need for them to handle such an unstable and tricky situation as this.
> > > Let's not make driver writers' lives harder.
> > >
> > > For the ones which need this kind of scalability (and let's be clear, they
> > > would already have *better* scalability than today due to the rwsem being
> > > split into a per-VMA refcount), then yes, implementing ->map_pages would
> > > be the way to go.  Indeed, they would probably benefit from implementing
> > > it today, since it will reduce the number of page faults.
> >
> > Yes they will get better scalability, but i see some of those drivers might
> > want the extra few mini-percent :) In any case, i believe that it might be
> > better to give it a new name, ie fix current map_pages() users and rename that
> > callback to something more explicit (atomic_map_pages() or something similar,
> > i am not good at naming). But otherwise this looks like a good plan to avoid
> > excessive refcount overhead.
>
> OK, great.
> I don't think the current name is bad, but if someone comes
> up with a better one, I don't have a problem with renaming it.

I just like it when new behavior is reflected in the name; what about
rcu_map_pages? Making it clear it is under rcu.

Cheers,
Jérôme