Date: Thu, 5 Dec 2019 12:21:50 -0500
From: Jerome Glisse
To: Matthew Wilcox
Cc: linux-mm@kvack.org, Laurent Dufour, David Rientjes, Vlastimil Babka, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso
Subject: Re: Splitting the mmap_sem
Message-ID: <20191205172150.GD5819@redhat.com>
In-Reply-To: <20191203222147.GV20752@bombadil.infradead.org>
User-Agent: Mutt/1.12.1 (2019-06-15)
Adding a few interested people in cc.

On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
>
> [My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for
> their feedback on an earlier version of this. I think the solution
> we discussed doesn't quite work, so here's one which I think does.
> See the last two paragraphs in particular.]
>
> My preferred solution to the mmap_sem scalability problem is to allow
> VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> I've been focusing on the first half of this problem (looking up VMAs
> in an RCU-safe data structure) and ignoring the second half (taking a
> lock while holding the RCU lock).
>
> We can't take a semaphore while holding the RCU lock in case we have to
> sleep -- the VMA might not exist any more when we wake up. Making the
> per-VMA lock a spinlock would be a massive change -- fault handlers are
> currently called with the mmap_sem held and may sleep. So I think we
> need a per-VMA refcount. That lets us sleep while handling a fault.
> There are over 100 fault handlers in the kernel, and I don't want to
> change the locking in all of them.
>
> That makes modifications to the tree a little tricky. At the moment,
> we take the rwsem for write, which waits for all readers to finish; then
> we modify the VMAs; then we allow readers back in. With RCU, there is
> no way to block readers, so different threads may (at the same time)
> see both an old and a new VMA for the same virtual address.
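Just so we are talking about the same thing, the lookup-then-refcount scheme can be sketched as a userspace C model (purely hypothetical names; `vma_tryget()` plays the role the kernel's `atomic_inc_not_zero()` pattern would play, failing once teardown has dropped the count to zero):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model, not kernel code: a VMA found under
 * the RCU read lock is only usable once a reference has been taken,
 * and taking one can fail if the VMA is already being torn down. */
struct vma {
    atomic_int refcount;   /* 0 means the VMA is on its way out */
};

/* Take a reference only if the VMA is still live (inc-not-zero). */
static bool vma_tryget(struct vma *vma)
{
    int old = atomic_load(&vma->refcount);
    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&vma->refcount, &old, old + 1))
            return true;
    }
    return false;          /* refcount hit zero: caller must retry lookup */
}

static void vma_put(struct vma *vma)
{
    atomic_fetch_sub(&vma->refcount, 1);
}
```

A fault would then look like: rcu_read_lock(); look up the VMA; if vma_tryget() fails, drop the RCU lock and retry the walk; otherwise drop the RCU lock, handle the fault (sleeping is now safe), and vma_put() at the end.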
>
> So calling mmap() looks like this:
>
> 1 allocate a new VMA
> 2 update pointer(s) in maple tree
> 3 sleep until old VMAs have a zero refcount
> 4 synchronize_rcu()
> 5 free old VMAs
> 6 flush caches for affected range
> 7 return to userspace
>
> While one thread is calling mmap(MAP_FIXED), two other threads which are
> accessing the same address may see different data from each other and
> have different page translations in their respective CPU caches until
> the thread calling mmap() returns. I believe this is OK, but would
> greatly appreciate hearing from people who know better.

I do not believe this is OK; I believe this is wrong (not even considering
possible hardware issues that can arise from such aliasing).

That being said, I believe this can be solved "easily": when the new VMA
is added, you mark it as a newborn (VMA_BABY :)) and page faults will have
to wait on it, ie until the previous VMA is fully gone and flushed. So
after step 6 (flush caches) you remove the VMA_BABY flag before returning
to userspace, and page faults can resume. I would also mark the old VMA
with a ZOMBIE flag so that any reader has a chance to back off and retry.
To check for that we should add a new check to vmf_insert_page() (and
similar) to avoid inserting a pfn in a ZOMBIE VMA.

Note that I am not sure what we want to do here: can an application rely
on the rwsem serialization unknowingly, ie could it have one thread doing
a page fault on a range that is about to be unmapped by another thread?
I am not sure this can happen today without a SEGFAULT, thanks to
serialization through the rwsem. Anyway, with BABY and ZOMBIE it should
behave mostly as it does today (modulo concurrency).

> Some people are concerned that a reference count on the VMA will lead to
> contention moving from the mmap_sem to the refcount on a very large VMA
> for workloads which have one giant VMA covering the entire working set.
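To make the BABY/ZOMBIE idea above concrete, here is a minimal sketch of what the fault path would check after looking a VMA up (both flags and the helper are hypothetical; nothing like this exists in the kernel today):

```c
/* Hypothetical flags sketching the BABY/ZOMBIE proposal above;
 * the names and values are placeholders, not real kernel flags. */
#define VMA_BABY   0x1UL   /* newborn VMA: old mapping not yet flushed */
#define VMA_ZOMBIE 0x2UL   /* old VMA: about to be torn down */

enum fault_action { FAULT_PROCEED, FAULT_WAIT, FAULT_RETRY };

/* What should a page fault do with the VMA it just looked up? */
static enum fault_action vma_fault_action(unsigned long vma_flags)
{
    if (vma_flags & VMA_ZOMBIE)
        return FAULT_RETRY;    /* back off and walk the tree again */
    if (vma_flags & VMA_BABY)
        return FAULT_WAIT;     /* wait until the old VMA is gone and flushed */
    return FAULT_PROCEED;      /* normal case: take the fault */
}
```

The point is simply that a fault never proceeds against either half of an in-flight MAP_FIXED replacement: it retries past the dying VMA and waits on the newborn one.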
> For those workloads, I propose we use the existing ->map_pages() callback
> (changed to return a vm_fault_t from the current void).
>
> It will be called with the RCU lock held and no reference count on
> the vma. If it needs to sleep, it should bump the refcount, drop the
> RCU lock, prepare enough so that the next call will not need to sleep,
> then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> VMA is no longer good, and it needs to walk the VMA tree from the start.

Just to make sure I understand: you propose that ->map_pages() becomes a
new ->fault() handler that gets called before ->fault(), without a
refcount, so that we can update fs/drivers slowly to perform better in
the new scheme (ie avoid the overhead of refcounting if possible at all)?
The ->fault() callback would then be the "slow" path, which would require
a refcount on the VMA (taken by core mm code before dropping the RCU
lock).

Cheers,
Jérôme