Date: Thu, 5 Dec 2019 21:13:22 -0800
From: Matthew Wilcox
To: Jerome Glisse
Cc: linux-mm@kvack.org, Laurent Dufour, David Rientjes, Vlastimil Babka,
    Hugh Dickins, Michel Lespinasse, Davidlohr Bueso
Subject: Re: Splitting the mmap_sem
Message-ID: <20191206051322.GA21007@bombadil.infradead.org>
References: <20191203222147.GV20752@bombadil.infradead.org> <20191205172150.GD5819@redhat.com>
In-Reply-To: <20191205172150.GD5819@redhat.com>

On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> Adding few interested people in cc

I figured they all read
linux-mm already ;-)

> On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > accessing the same address may see different data from each other and
> > have different page translations in their respective CPU caches until
> > the thread calling mmap() returns.  I believe this is OK, but would
> > greatly appreciate hearing from people who know better.
>
> I do not believe this is OK, I believe this is wrong (not even considering
> possible hardware issues that can arise from such aliasing).

Well, OK, but why do you believe it is wrong?  If thread A is executing a
load instruction at the same time that thread B is calling mmap(),
it really is indeterminate what value A loads.  It might be from before
the call to mmap() and it might be from after.  And if thread C is also
executing a load instruction at the same time, then it might already get
a different result from thread A.  And can threads A and C really tell
which of them executed the load instruction 'first'?  I think this is
all so indeterminate already that the (lack of) guarantees I outlined
above are acceptable.  But we should all agree on this, so _please_
continue to argue your case for why you believe it to be wrong.

[snip proposed solution -- if the problem needs solving, we can argue
about how to solve it later]

> > Some people are concerned that a reference count on the VMA will lead to
> > contention moving from the mmap_sem to the refcount on a very large VMA
> > for workloads which have one giant VMA covering the entire working set.
> > For those workloads, I propose we use the existing ->map_pages() callback
> > (changed to return a vm_fault_t from the current void).
> >
> > It will be called with the RCU lock held and no reference count on
> > the vma.  If it needs to sleep, it should bump the refcount, drop the
> > RCU lock, prepare enough so that the next call will not need to sleep,
> > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > VMA is no longer good, and it needs to walk the VMA tree from the start.
>
> Just to make sure I understand, you propose that ->map_pages() becomes
> a new ->fault() handler that gets called before ->fault() without refcount
> so that we can update fs/drivers slowly to perform better in the new scheme
> (i.e. avoid the overhead of refcounting if possible at all)?
>
> The ->fault() callback would then be the "slow" path which will require
> a refcount on the vma (taken by core mm code before dropping rcu lock).

I would actually propose never updating most drivers.  There's just no
need for them to handle such an unstable and tricky situation as this.
Let's not make driver writers' lives harder.

For the ones which need this kind of scalability (and let's be clear, they
would already have *better* scalability than today due to the rwsem being
split into a per-VMA refcount), then yes, implementing ->map_pages would
be the way to go.  Indeed, they would probably benefit from implementing
it today, since it will reduce the number of page faults.
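
The ->map_pages() retry protocol quoted above can be sketched in user space
roughly as follows.  This is a toy model under stated assumptions, not kernel
code: toy_vma, toy_ctx, toy_map_pages() and toy_handle_fault() are
hypothetical names, the RCU lock is modelled as a plain flag, and the work
that might sleep is reduced to setting data_ready.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are the kernel's real types. */
enum fault_result { FAULT_DONE = 0, FAULT_RETRY = 1 };

struct toy_vma {
	int refcount;     /* models the proposed per-VMA refcount */
	bool data_ready;  /* true once the preparation that may sleep is done */
};

struct toy_ctx {
	bool rcu_held;    /* models rcu_read_lock() / rcu_read_unlock() */
	int vma_lookups;  /* number of times the VMA tree was walked */
};

/* Models the proposed ->map_pages(): entered with RCU held, no refcount. */
enum fault_result toy_map_pages(struct toy_ctx *ctx, struct toy_vma *vma)
{
	assert(ctx->rcu_held);

	if (vma->data_ready)
		return FAULT_DONE;  /* fast path: no sleep, refcount untouched */

	vma->refcount++;            /* pin the VMA before leaving the RCU section */
	ctx->rcu_held = false;      /* RCU dropped, sleeping is now allowed */
	vma->data_ready = true;     /* stand-in for I/O or allocation that may sleep */
	vma->refcount--;            /* unpin */
	return FAULT_RETRY;         /* the caller's VMA pointer is no longer good */
}

/* Models the core fault path: walk the tree again after every retry. */
int toy_handle_fault(struct toy_ctx *ctx, struct toy_vma *vma)
{
	for (;;) {
		enum fault_result r;

		ctx->rcu_held = true;   /* rcu_read_lock() */
		ctx->vma_lookups++;     /* walk the VMA tree from the start */
		r = toy_map_pages(ctx, vma);
		if (r == FAULT_DONE) {
			ctx->rcu_held = false;  /* rcu_read_unlock() */
			return ctx->vma_lookups;
		}
		/* FAULT_RETRY: the callback already dropped RCU; look up again. */
	}
}
```

In this model, the first fault on a VMA whose data is not ready walks the
tree twice (one retry), while a later fault on the same VMA takes the fast
path with a single walk and never touches the refcount.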