From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AAE6AC35274 for ; Thu, 21 Dec 2023 14:06:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3FD9F8D0003; Thu, 21 Dec 2023 09:06:11 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 3ACB78D0002; Thu, 21 Dec 2023 09:06:11 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 274218D0003; Thu, 21 Dec 2023 09:06:11 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 1288E8D0002 for ; Thu, 21 Dec 2023 09:06:11 -0500 (EST) Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id D12B440DCF for ; Thu, 21 Dec 2023 14:06:10 +0000 (UTC) X-FDA: 81590999700.18.01B8648 Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) by imf24.hostedemail.com (Postfix) with ESMTP id 766EB180026 for ; Thu, 21 Dec 2023 14:06:07 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=bZIVVC7v; spf=none (imf24.hostedemail.com: domain of willy@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=willy@infradead.org; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1703167568; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=INzY/QT6atUS/BlSFmLSlk6lzsjHKIwMNfkkrynBCJg=; b=MtLzy527U0Vkd1UQ8k+C9ghYc/Jp6EyuG/X7mnY8EO2DFkkCiIzg2XorkfQtgzLw6waE6N 8l/LqfwSWErD+hJxTn4AcCsIKFKG4KGK5ZJle+fKiVz+GYX3CFseOVb9j9IRGAH4On7nwi FXMnnlLfsvmKsIRCzEdvhqBpQIjuS8U= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1703167568; a=rsa-sha256; cv=none; b=btwi0RApbOcUKQxx0X4ITzCL3TGA2Mih5hH0/lRK7/IdX39ponhYs1biPWfyA2YOMvvraf 0Jv3DaB0UryCr7P9AcUHotjEFRa+cva/a30jxJzFpaqLzFDwEdboAYzj/f+9roZhXSxdmO 9GtucpNY3efr4x6OPe8KCjGgvBOTRjg= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=bZIVVC7v; spf=none (imf24.hostedemail.com: domain of willy@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=willy@infradead.org; dmarc=none DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Transfer-Encoding: Content-Type:MIME-Version:References:Message-ID:Subject:Cc:To:From:Date: Sender:Reply-To:Content-ID:Content-Description; bh=INzY/QT6atUS/BlSFmLSlk6lzsjHKIwMNfkkrynBCJg=; b=bZIVVC7v1vf6vYj48TvYaHjrbx ZkZZ4WD32xzOnBaOLhtDpgrIhFWUsaTT+yYhwGdrhniEN0EBV1TB7+PbMDUOHOMNic+6wQFxBcXbA b454QACHPdfjB3wJs48q6ftA3V/eZjqFsttl3DR2LpYWuXc74Ys9/HZrf3D0xobLYZFvNTW23us3E apA2mPckdE+KDqQ1tKauphuRa405ibsf3zpCcXJJE9qbeF2Sk93zQ3r/c7S4qUye6v/njyrrtUPXy YKdv9lfurmjPV7exUYRdt5CuB1PeAbpWLyaubFkGLtdeRf1NjiyiyKzPPUrvSrKDoGQM1NikCz/WU ldEZXVCw==; Received: from willy by casper.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1rGJgY-005NPh-Fm; Thu, 21 Dec 2023 14:06:02 +0000 Date: Thu, 21 Dec 2023 14:06:02 +0000 From: Matthew Wilcox To: Pasha Tatashin Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, rientjes@google.com, dwmw2@infradead.org, baolu.lu@linux.intel.com, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, iommu@lists.linux.dev Subject: Re: [RFC 0/3] iommu/intel: Free empty page tables on unmaps Message-ID: References: <20231221031915.619337-1-pasha.tatashin@soleen.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Rspamd-Queue-Id: 766EB180026 X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: uqx64ap5abo8fa8gjifxj4wt93gzdugs X-HE-Tag: 1703167567-723020 X-HE-Meta: U2FsdGVkX19xSoQ+FIyU+Ci/Tp+eZJgnyeQoP5bkQ8oAOeQKDtljY/KtNrZENUkIgnmqGbQp/q/5HlbNSCxFqkMSdMRAMbgrqOMSEMYfahuXP+9QkPeRWBhmcOTBDN0cdRDMKuvwuvEIEhCjBPNL/gmA+wwoY2RBe7yhP5UUPHvuynfg8rvDHh0qZEg35DClN1JVjlw3L2LoCyq8BlALj7+DavqBtGOsE0p8OTtphJCdl1bEigrNvlAC/K5OrSfl4Ed/dqMObrUzbBO8OqoAVdBKZfs2OCWtrbi7BvjDn8uBn0LGDnz0/lCiptNBEh4tdsYSoUojsKJ6yP2LO/KlTLUcFVhN4y+RZZgKNptqGfJ8qmAIbpKEUMi6LMUODseTmdQUu0RjaM4UjKN94hbka/MMN8UGriMZREjDmq8BopZX9uZueUlXPudNo/2l9T/PpXOgHTuNLIIsgfO9g75LGf5bxNNe7MFUskqUrb5vx7L7UqjJwIoIz86NIRcWcNNbIXWbjm9HsGO+t3rEdO1ECLVazNGjLuWsILfrUcOSfgG7SH3Q1MeSWxB0gzE+nDN4Dqydk+t0BlxqrBor7HVhhkHCOdRckglWJUUGeIfztefSbABy/orZSeX9EQxOmz4XlmvhuBPZcd19bP/yYkY1S6tQrwVEbRA77AeHNUkU9ypvIrJ1Ge+Ap6TaBPvFAz60svId7Nb2XAE/6lUtPdhuVjzjL+4hOUpdgz1BwfsNjA6PRdcn8znABVMZ77s8q1ercr17CWsi4hwObyrKwDnDnYD0WW7+L4A79MRc6B+0KamNUhSQNtnl9VxFSWvrGbfKVOC+SgYfIFWCRkM+rmh6lV5G33aERMg32K2UWBV+2M1/USSqfp7P+v4Q7fbQoi7/8vsqr1nvhBBa8CePXRknSLSVRNCFreY9decuXrI7uYfTSKMU90IYz58Fx167SX0yxHfjlwl484ASYpVqFFy 77lDPXS/ jiONAUldTotVYxZdAvijf6FpW5brLWgtquKQaRxHrKg3iI/0QU8j5uysPs9tp/D51OZmGP/IBvEK/wgflYaCIFH67IvplPv/Sl3CLv5k/4OZLXeMoUhMCBeBqE1SHnTkdi/x7ChhBgyYXuOLAwQJ+gW3mhMQpnU5o5rG1GBq7BnDXFD+0P+vATDRcMnj8r2xY5veqx4wWhNxvyuzUKkXjt+qbf0YTnd7fkFhfReUo72OFJMMMDBWdoO+S4o0VtBFVea64TbNPoXTM+gwPfuPWd1ZVV9n5ossHDksw X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Dec 21, 2023 at 12:42:41AM -0500, Pasha Tatashin wrote: > On Thu, Dec 21, 2023 at 12:13 AM Pasha Tatashin > wrote: > > > > On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox wrote: > > > > > > On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote: > > > > This series frees empty page tables on unmaps. It intends to be a > > > > low overhead feature. > > > > > > > > The read-writer lock is used to synchronize page table, but most of > > > > time the lock is held is reader. It is held as a writer for short > > > > period of time when unmapping a page that is bigger than the current > > > > iova request. For all other cases this lock is read-only. > > > > > > > > page->refcount is used in order to track number of entries at each page > > > > table. > > > > > > Have I not put enough DANGER signs up around the page refcount? > > > > > > * If you want to use the refcount field, it must be used in such a way > > > * that other CPUs temporarily incrementing and then decrementing the > > > * refcount does not cause problems. On receiving the page from > > > * alloc_pages(), the refcount will be positive. > > > > > > You can't use refcount for your purpose, and honestly I'm shocked you > > > haven't seen any of your WARNings trigger. > > > > Hi Matthew, > > > > Thank you for looking at this. > > > > Could you please explain exactly why refcount can't be used like this? > > > > After alloc_page() refcount is set to 1, we never reduce it to 0, > > every new entry in a page table adds 1, so we get up-to 513, that is > > why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to > > I guess, what you mean is that other CPUs could temporarily > increase/decrease refcount outside of IOMMU management, do you have an > example of why that would happen? I could remove the above warning, > and in the worst case we would miss an opportunity to free a page > table during unmap, not a big deal, it can be freed during another > map/unmap event. Still better than today, where we never free them > during unmaps. Both GUP-fast and the page cache will find a page under RCU protection, inc it's refcount if not zero, check the page is still the one they were looking for, and if not will dec the refcount again. That means if a page has been in the page cache or process page tables and you can't guarantee that all CPUs have been through the requisite grace periods, you might see the refcount increased. I'm not prepared to make a guarantee that these are the only circumstances under which you'll see a temporarily higher refcount than you expect. Either currently or in the future. If you use the refcount as anything other than a refcount, you're living dangerously. And if you think that you'll be the one to do the last refcount put, you're not necessarily correct (see the saga around __free_pages() which ended up as commit e320d3012d25 fixed by 462a8e08e0e6 (which indicates the rare race does actually happen)). Now, it seems like from your further explanation that the consequence of getting this wrong is simply that you fail to free the page early. That seems OK, but I insist that you insert some comments explaining what is going on and why it's safe so somebody auditing uses of refcount doesn't have to reanalyse the whole thing for themself. Or worse that somebody working on the iommu sees this and thinks they can "improve" on it.