Date: Tue, 7 Jan 2020 15:34:15 +0300
From: "Kirill A. Shutemov"
To: Matthew Wilcox
Cc: linux-mm@kvack.org
Subject: Re: Splitting the mmap_sem
Message-ID: <20200107123415.gqklwca4qilva2yr@box>
References: <20191203222147.GV20752@bombadil.infradead.org>
 <20191212142457.zqp4mawjz7frpyvk@box>
 <20191212154002.GR32169@bombadil.infradead.org>
 <20200106220910.GK6788@bombadil.infradead.org>
In-Reply-To: <20200106220910.GK6788@bombadil.infradead.org>

On Mon, Jan 06, 2020 at 02:09:10PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > > We currently only have one ->map_pages() callback, and it's
> > > > filemap_map_pages(). It only needs to sleep in one place -- to allocate
> > > > a PTE table. I think that can be allocated ahead of time if needed.
> > >
> > > No, filemap_map_pages() doesn't sleep. It cannot. The whole body of the
> > > function is under rcu_read_lock(). It uses a pre-allocated page table.
> > > See do_fault_around().
> >
> > Oh, thank you! That makes the ->map_pages() optimisation already workable
> > with no changes.
>
> I've been thinking about this some more, and we have a bit of a tough time
> allocating page table entries while holding the RCU read lock. There are
> no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT.
>
> Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
> with prealloc_pte). Allocate them before taking the RCU lock to walk
> the VMA tree.
> This will be a bit of reordering as we currently take
> the mmap_sem, walk the VMA tree, then walk the page tables once we know
> we have a good VMA. I don't see a problem with doing that, but others
> may differ.

I expect that preallocating all of these page tables just-in-case would
have a measurable performance impact. The current code only preallocates
a PTE page table if it sees pmd_none(). We could first check whether this
branch of the page-table tree is already present, but I'm not sure how
efficient that can be made. And we would still need to protect these page
tables from being freed under us.

> Option 2: Add a memalloc_nowait_save/restore API to go along
> with nofs and noio. That way, we can take the RCU read lock, call
> memalloc_nowait_save(), and walk the VMA tree and the page tables in
> the current order. There's an increased chance of memory allocation of
> page tables failing, so we'll have to risk that and do a retry with the
> reference count held on the VMA if we need to sleep to allocate memory.
>
> Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
> functions.

I think this is the most reasonable way. If we are low on memory, latency
is not at the top of our priorities.

> Option 4: Variant of 2 where we make taking the RCU read lock magically
> set the nowait bit, or we have the page allocator check the RCU preempt
> depth. I don't particularly like this one, particularly since the
> preempt depth is not knowable in most kernel configurations.
>
> Other thoughts on this?

-- 
 Kirill A. Shutemov