Date: Mon, 2 Dec 2024 15:24:18 +0530
From: Dev Jain
To: David Hildenbrand, ryan.roberts@arm.com, kirill.shutemov@linux.intel.com, willy@infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, ziy@nvidia.com
Subject: Re: [QUESTION] mmap lock in khugepaged
References: <20241128062641.59096-1-dev.jain@arm.com> <4cb26a06-d982-4ca3-a5f7-7c6a6c63428c@arm.com> <3d4c57dd-0821-4684-9018-db8274c170ec@redhat.com> <66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com>

On 29/11/24 5:00 pm, David Hildenbrand wrote:
>>>
>>> Thinking about it, I am also not sure if most other code can deal with
>>> temporary pmd_none(). These don't necessarily take the PMD lock,
>>> because "why would they" right now.
>>>
>>> See walk_pmd_range() as one example, if it spots pmd_none() it assumes
>>> "there really is nothing" without checking the PMD lock.
>>>
>>> As a more concrete example, assume someone calls MADV_DONTNEED and we
>>> end up in zap_pmd_range(). If we assume "pmd_none() == really nothing"
>>> we'll skip that entry without getting the PMD lock involved.
>>> That would mean that you would be breaking MADV_DONTNEED whether you
>>> managed to collapse or not -- memory would not get discarded.
>>>
>>> This is a real problem with anonymous memory.
>>>
>> Yes, I thought of different locking fashions but the problem seems to be
>> that any pagetable walker will choose an action path according to the
>> value it sees.
>>
>>>
>>> Unless I am missing something it's all very tricky and there might be
>>> a lot of such code that assumes "if I hold a mmap lock / VMA lock in
>>> read mode, pmd_none() means there is nothing even without holding the
>>> PMD lock when checking".
>>
>> Yes, I was betting on the fact that, if the code assumes that pmd_none()
>> means there is nothing, eventually when it takes the PMD lock to
>> write to it, it will check whether the PMD changed, and back off.
>> I wasn't aware of the MADV_DONTNEED thingy, thanks.
>
> Note that this is just the tip of the iceberg. Most page table walkers
> that deal with anonymous memory have the same requirement.
>
>>>>>
>>>>>
>>>>> I recall that for shmem that's "easier", because we don't have to
>>>>> reinstall a PMD immediately, we can be sure that the page table is
>>>>> kept empty/unmodified, ...
>>>>>
>>>>
>>
>> All in all, the general solution seems to be that, if I can take all
>> pagetable walkers into an invalid state and make them back off, then I am
>> safe.
>> For example, we do not zero out the PMD, we take the pte PTL, we do
>> stuff, then instead of setting the PTEs to zero, we set them to a universal
>> invalid state upon which no pagetable walker can take an action; an
>> instance of that can be to set the PTE to a swap entry such that the
>> fault handler ends up in do_swap_page() -> print_bad_pte(). So now we can
>> take the PMD lock (in fact we don't need it since any pagetable walking
>> is rendered useless) and populate the PMD to resume the new pagetable
>> walking.
>> Another *ridiculous* idea may be to remember the PGD we
>> came from and nuke it (but I guess there is an alternate path for that
>> in walk_pgd_range() and so on?)
>
> What might work is introducing a PMD marker (note: we don't have PMD
> markers yet) for that purpose. Likely, the PMD marker may only be set
> while the PMD lock is being held, and we must not drop the PMD lock
> temporarily (otherwise people will spin on the value by repeatedly
> (un)locking the PMD lock, which is stupid).
>
> Then you have to make sure that each and every page table walker
> handles that properly.
>
> Then, there is the problem with holding the PMD lock for too long as I
> mentioned.
>
> Last but not least, there are more issues that I haven't even
> described before (putting aside the other issue):
>
> Our page table walkers can handle the transitions:
>
> PMD leaf -> PMD none
>  * concurrent zap / MADV_DONTNEED / !anon PMD split
>
> PMD leaf -> PTE table
>  * concurrent anon PMD split
>  * concurrent zap / MADV_DONTNEED / !anon PMD split + PTE table
>    allocation
>
> PTE table (empty) -> PMD none
>  * concurrent page table reclaim, part of collapsing file THPs
>   * collapse_pte_mapped_thp()
>   * retract_page_tables()
>
> I think they *cannot* tolerate the transition *properly*:
>
> PTE table (non-empty) -> PMD leaf
>  * what you want to do ;)
>
> -> Observe how we skip to the next PMD in all page table walkers /
> give up if pte_offset_map_lock() and friends fail! I even think there
> are more issues hiding there with pte_offset_map_ro_nolock().
>
> Of course, on top of all of this, you should not significantly degrade
> the ordinary page table walker performance ...
>
> I don't want to discourage you, but it's all extremely complicated.
>

Thanks for the detailed reply!
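
For reference, the MADV_DONTNEED concern quoted above can be sketched as a
small single-threaded userspace model. This is only a toy under stated
assumptions: the toy_* types and helpers are hypothetical, are not kernel
APIs, and replace real concurrency with one fixed interleaving. It shows a
zap-style walker that treats a lockless "PMD is none" observation as "nothing
mapped" and therefore discards nothing when it runs inside the window where a
hypothetical collapse has temporarily cleared the PMD.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are illustrative, not kernel APIs. */
enum toy_pmd { TOY_PMD_NONE, TOY_PMD_TABLE, TOY_PMD_LEAF };

struct toy_mm {
	enum toy_pmd pmd;
	bool data_present;	/* true while the range still holds user data */
};

/* Zap-style walker: a lockless "none" observation means "skip". */
static bool toy_zap(struct toy_mm *mm)
{
	if (mm->pmd == TOY_PMD_NONE)
		return false;	/* assumed empty, nothing discarded */
	mm->pmd = TOY_PMD_NONE;
	mm->data_present = false;
	return true;
}

/* Hypothetical collapse, step 1: temporarily clear the PMD. */
static bool toy_collapse_begin(struct toy_mm *mm)
{
	if (mm->pmd != TOY_PMD_TABLE || !mm->data_present)
		return false;	/* nothing to collapse */
	mm->pmd = TOY_PMD_NONE;
	return true;
}

/* Hypothetical collapse, step 2: install the PMD leaf. */
static void toy_collapse_end(struct toy_mm *mm)
{
	assert(mm->pmd == TOY_PMD_NONE);
	mm->pmd = TOY_PMD_LEAF;
}

/*
 * Returns true if the data is still present after the zap, i.e. the
 * MADV_DONTNEED-like request was silently lost.
 */
static bool toy_data_survives_zap(bool zap_inside_window)
{
	struct toy_mm mm = { .pmd = TOY_PMD_TABLE, .data_present = true };

	if (!zap_inside_window)
		toy_zap(&mm);		/* zap runs before the collapse */
	if (toy_collapse_begin(&mm)) {
		if (zap_inside_window)
			toy_zap(&mm);	/* zap sees the temporary "none" */
		toy_collapse_end(&mm);
	}
	return mm.data_present;
}
```

In this model, toy_data_survives_zap(false) returns false (the zap ran first
and discarded the data), while toy_data_survives_zap(true) returns true: the
zap observed the temporary "none" state, skipped the entry, and the memory
stayed mapped behind the new leaf.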