Date: Thu, 13 Jun 2024 19:59:32 +0800
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages
To: David Hildenbrand
Cc: hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <02f8cbd0-8b2b-4c2d-ad96-f854d25bf3c2@redhat.com> <2cda0af6-8fde-4093-b615-7979744d6898@redhat.com>
From: Qi Zheng
In-Reply-To: <2cda0af6-8fde-4093-b615-7979744d6898@redhat.com>
Hi,

On 2024/6/13 18:25, David Hildenbrand wrote:
> On 13.06.24 11:32, Qi Zheng wrote:
>> Hi David,
>>
>> Thanks for such a quick reply!
>
> I appreciate you working on this :)
>
>>
>> On 2024/6/13 17:04, David Hildenbrand wrote:
>>> On 13.06.24 10:38, Qi Zheng wrote:
>>>> Hi all,
>>
>> [...]
>>
>>>
>>>> 3. Implementation
>>>> =================
>>>>
>>>> For empty user PTE pages, we don't actually need to free them
>>>> immediately, nor do we need to free all of them.
>>>>
>>>> Therefore, in this patchset, we register a task_work for the user
>>>> tasks to asynchronously scan and free empty PTE pages when they
>>>> return to user space. (The scanning time interval and address space
>>>> size can be adjusted.)
>>>
>>> The question is, if we really have to scan asynchronously, or if it
>>> would be reasonable for most use cases to trigger a
>>> madvise(MADV_PT_RECLAIM) every now and then. For virtio-mem, and
>>> likely most memory allocators, that might be feasible, and valuable
>>> independent of system-wide automatic scanning.
>>
>> Agree, I also think it is possible to add always && madvise modes
>> similar to THP.
>
> My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
> synchronously try to reclaim page tables without any asynchronous work.
>
> Similar to MADV_COLLAPSE that only does synchronous work. Of course,

This is feasible, but I worry that some user-mode programs may not be
able to determine when to call it.

My previous idea was to do something similar to madvise(MADV_HUGEPAGE):
just mark the vma as being able to reclaim the pgtable, and then hand it
over to the background thread for asynchronous reclaim.

> if we don't need any heavy locking for reclaim, we might also just
> try reclaiming during MADV_DONTNEED when spanning a complete page

I think the lock held by the current solution is not too heavy and
should be acceptable.

But the MADV_FREE case still needs to be handled by
madvise(MADV_PT_RECLAIM) or asynchronous work.

> table. That won't sort out all cases where reclaim is possible, but
> with both approaches we could cover quite a lot that were discovered
> to really result in a lot of empty page tables.

Yes, agree.

>
> On top, we might implement some asynchronous scanning later. This is,
> of course, TBD. Maybe we could wire up other page table scanners
> (khugepaged ?) to simply reclaim empty page tables it finds as well?

This is also an idea. Another option may be to hook into other pgtable
scanning paths, such as MGLRU.

>
>>
>>>
>>>>
>>>> When scanning, we can filter out some unsuitable vmas:
>>>>
>>>>       - VM_HUGETLB vma
>>>>       - VM_UFFD_WP vma
>>>
>>> Why is UFFD_WP unsuitable? It should be suitable as long as you make
>>> sure to really only remove page tables that are all pte_none().
>>
>> Got it, I mistakenly thought pte_none() covered the pte marker case
>> until I saw pte_none_mostly().
>
> I *think* there is one nasty detail, and we might need an arch callback
> to test if a pte *really* can be reclaimed: for example, s390x might
> require us keeping some !pte_none() page tables.
>
> While a PTE might be none, the s390x PGSTE (think of it as another
> 8 bytes per PTE entry stored right next to the actual page table
> entries) might hold data we might have to preserve for our KVM guest.

Oh, thanks for adding this background information!

>
> But that should be easy to wire up.

That's good!

>
>>
>>>
>>>>       - etc
>>>> And for some PTE pages that span multiple vmas, we can also skip.
>>>>
>>>> For locking:
>>>>
>>>>       - use the mmap read lock to traverse the vma tree and pgtable
>>>>       - use pmd lock for clearing pmd entry
>>>>       - use pte lock for checking empty PTE page, and release it
>>>>         after clearing pmd entry, then we can capture the changed
>>>>         pmd in pte_offset_map_lock() etc after holding this pte
>>>>         lock. Thanks to this, we don't need to hold the
>>>>         rmap-related locks.
>>>>       - users of pte_offset_map_lock() etc all expect the PTE page
>>>>         to be stable by using rcu lock, so use pte_free_defer() to
>>>>         free PTE pages.
>>>
>>> I once had a prototype that would scan similar to GUP-fast, using the
>>> mmap lock in read mode and disabling local IRQs and then walking the
>>> page table locklessly (no PTLs). Only when identifying an empty page and
>>> ripping out the page table, it would have to do more heavy locking (back
>>> when we required the mmap lock in write mode and other things).
>>
>> Maybe mmap write lock is not necessary, we can protect it using pmd lock
>> && pte lock as above.
>
> Yes, I'm hoping we can do that, that will solve a lot of possible issues.

Yes, I think the protection provided by the locks above is enough. Of
course, it would be better if more people could double-check it.

>
>>
>>>
>>> I can try digging up that patch if you're interested.
>>
>> Yes, that would be better, maybe it can provide more inspiration!
>
> I pushed it to
>     https://github.com/davidhildenbrand/linux/tree/page_table_reclaim
>
> I suspect it's a non-working version (and I assume the locking is
> broken, there are no VMA checks, etc), it's an old prototype. Just to
> give you an idea about the lockless scanning and how I started by
> triggering reclaim only when kicked-off by user space.

Many thanks! But I'm worried that on some platforms, such as arm64,
disabling IRQs might be more expensive than holding the lock. Not sure.

>
>>
>>>
>>> We'll have to double check whether all anon memory cases can *properly*
>>> handle pte_offset_map_lock() failing (not just handling it, but doing
>>> the right thing; most of that anon-only code didn't ever run into that
>>> issue so far, so these code paths were likely never triggered).
>>
>> Yeah, I'll keep checking this out too.
>>
>>>
>>>
>>>> For the path that will also free PTE pages in THP, we need to recheck
>>>> whether the content of pmd entry is valid after holding pmd lock or
>>>> pte lock.
>>>>
>>>> 4. TODO
>>>> =======
>>>>
>>>> Some applications may be concerned about the overhead of scanning and
>>>> rebuilding page tables, so the following features are considered for
>>>> implementation in the future:
>>>>
>>>>       - add per-process switch (via prctl)
>>>>       - add a madvise option (like THP)
>>>>       - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>>>>         procfs file)
>>>> Perhaps we can add the refcount to PTE pages in the future as well,
>>>> which would help improve the scanning speed.
>>>
>>> I didn't like the added complexity last time, and the problem of
>>> handling situations where we squeeze multiple page tables into a single
>>> "struct page".
>>
>> OK, except for refcount, do you think the other three todos above are
>> still worth doing?
>
> I think the question is from where we start: for example, only synchronous
> reclaim vs. asynchronous reclaim. Synchronous reclaim won't really affect
> workloads that do not actively trigger it, so it raises a lot less
> eyebrows. ...
> and some user space might have a good idea where it makes sense to try to
> reclaim, and when.
>
> So the other things you note here rather affect asynchronous reclaim, and
> might be reasonable in that context. But not sure if we should start
> with doing things asynchronously.

I think synchronous and asynchronous have their own advantages and
disadvantages, and are complementary. Perhaps they can be implemented at
the same time?

Thanks,
Qi
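
P.S. To make the "only reclaim page tables that are all pte_none(), plus
an arch callback" point from the discussion concrete, here is a minimal
userspace sketch. It is a toy model only, not kernel code and not taken
from the patchset: every name in it (the toy_* types and functions, the
pgste side array, the callback signature) is a hypothetical stand-in
chosen for illustration. A zero entry models pte_none(); any non-zero
entry, including something like a pte marker, makes the page
non-reclaimable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 512 entries matches a 4K PTE page with 8-byte entries (e.g. x86-64). */
#define TOY_PTRS_PER_PTE 512

/* Hypothetical userspace model of one PTE page; not a kernel structure. */
struct toy_pte_page {
	uint64_t pte[TOY_PTRS_PER_PTE];   /* 0 models pte_none() */
	uint64_t pgste[TOY_PTRS_PER_PTE]; /* models PGSTE-like per-PTE side data */
};

/* Hypothetical arch hook: may an entry that is already none be dropped? */
typedef bool (*toy_arch_can_reclaim_fn)(const struct toy_pte_page *pt,
					size_t idx);

/* Generic default: a none entry carries no extra state to preserve. */
static bool toy_generic_can_reclaim(const struct toy_pte_page *pt, size_t idx)
{
	(void)pt;
	(void)idx;
	return true;
}

/* PGSTE-like variant: refuse if side data next to the entry is non-zero. */
static bool toy_pgste_can_reclaim(const struct toy_pte_page *pt, size_t idx)
{
	return pt->pgste[idx] == 0;
}

/*
 * Return true only if every entry is none AND the arch hook agrees for
 * every entry. A single populated entry, or a single veto from the hook,
 * keeps the whole PTE page.
 */
static bool toy_pte_page_reclaimable(const struct toy_pte_page *pt,
				     toy_arch_can_reclaim_fn arch_ok)
{
	for (size_t i = 0; i < TOY_PTRS_PER_PTE; i++) {
		if (pt->pte[i] != 0)
			return false;
		if (!arch_ok(pt, i))
			return false;
	}
	return true;
}
```

A caller would pass toy_generic_can_reclaim on architectures without
side data and toy_pgste_can_reclaim where per-PTE state may have to
survive. In a real implementation this check would of course have to run
under the pte lock described in the locking list above; the sketch
deliberately leaves locking out.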