Message-ID: <0ca36b2e-463e-493f-aede-aff9aec3c7fa@bytedance.com>
Date: Thu, 5 Dec 2024 11:56:02 +0800
Subject: Re: [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages
To: Andrew Morton
Cc: david@redhat.com, jannh@google.com, hughd@google.com, willy@infradead.org,
    muchun.song@linux.dev, vbabka@kernel.org, peterx@redhat.com, mgorman@suse.de,
    catalin.marinas@arm.com, will@kernel.org, dave.hansen@linux.intel.com,
    luto@kernel.org, peterz@infradead.org, x86@kernel.org,
    lorenzo.stoakes@oracle.com, zokeefe@google.com, rientjes@google.com,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20241204144918.b08dbdd99903d3e18a27eb44@linux-foundation.org>
From: Qi Zheng
In-Reply-To: <20241204144918.b08dbdd99903d3e18a27eb44@linux-foundation.org>

On 2024/12/5 06:49, Andrew Morton wrote:
> On Wed, 4 Dec 2024 19:09:40 +0800 Qi Zheng wrote:
>
>>
>> ...
>>
>> Previously, we tried to use a completely asynchronous method to reclaim empty
>> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
>> implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
>> first step.
>
> Please help us understand what the other steps are. Because we don't
> want to commit to a particular partial implementation only to later
> discover that completing that implementation causes us problems.

Although it is the first step, it is relatively independent because it
solves the problem (huge PTE memory usage) in the madvise(MADV_DONTNEED)
case, while the other steps address the same problem in other cases. I can
briefly describe all the plans I have in mind here:

First step
==========

I plan to implement synchronous reclamation of empty user PTE pages in the
madvise(MADV_DONTNEED) case for the following reasons:

1. It covers most of the known cases. (On ByteDance servers, all the
   problems of huge PTE memory usage fall into this case.)
2. It helps verify the lock protection scheme and other infrastructure.

This is what this patch series does (x86 only for now). Once this is done,
support for more architectures will be added.

Second step
===========

I plan to implement asynchronous reclamation for madvise(MADV_FREE) and
other cases. The initial idea is to mark the vma first, then add the
corresponding mm to a global linked list, and then perform asynchronous
scanning and reclamation in the memory reclamation path.

Third step
==========

Based on the above infrastructure, we may try to reclaim all full-zero PTE
pages (PTE pages in which every entry maps the zero page), which would
benefit the memory balloon case mentioned by David Hildenbrand.
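
To make the first and third steps a bit more concrete, here is a minimal
userspace sketch of the two checks they rely on: "every entry is none"
(an empty PTE page) and "every entry maps the zero page" (a full-zero PTE
page). This is only a model and not code from the series: struct pte_page,
PTE_NONE, pte_page_is_empty() and pte_page_is_all_zero() are hypothetical
names, and the zero_pte encoding passed in by the caller is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PTE 512            /* x86-64 with 4 KiB pages */
#define PTE_NONE     ((uint64_t)0)  /* model only: a cleared entry */

/* Userspace model of one PTE page; not a kernel structure. */
struct pte_page {
	uint64_t entry[PTRS_PER_PTE];
};

/* First step: a PTE page is a reclaim candidate when every entry is none. */
static bool pte_page_is_empty(const struct pte_page *pt)
{
	for (size_t i = 0; i < PTRS_PER_PTE; i++)
		if (pt->entry[i] != PTE_NONE)
			return false;
	return true;
}

/*
 * Third step: a PTE page is "full-zero" when every entry maps the shared
 * zero page. zero_pte is the (assumed) encoding of such an entry.
 */
static bool pte_page_is_all_zero(const struct pte_page *pt, uint64_t zero_pte)
{
	for (size_t i = 0; i < PTRS_PER_PTE; i++)
		if (pt->entry[i] != zero_pte)
			return false;
	return true;
}
```

In the real kernel both checks would of course have to run under the page
table lock, with the PTE page then freed via RCU; the model leaves locking
out entirely.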

Another plan
============

Currently, page table modifications are protected by page table locks
(page_table_lock or split pmd/pte lock), but the life cycle of page table
pages is protected by mmap_lock (and vma lock). For more details, please
refer to the recently added Documentation/mm/process_addrs.rst file.

Currently we try to free the PTE pages through RCU when CONFIG_PT_RECLAIM
is turned on. In this case, we will no longer need to hold mmap_lock for
read/write operations on the PTE pages. So maybe we can remove the page
tables from the protection of the mmap lock (which is too big), like this:

1. Free all levels of page table pages by RCU: not just PTE pages, but
   also pmd, pud, etc.
2. Similar to pte_offset_map/pte_unmap, add
   [pmd|pud]_offset_map/[pmd|pud]_unmap, make them all contain
   rcu_read_lock/rcu_read_unlock, and make them accept failure.

In this way, we no longer need the mmap lock. Readers, such as page table
walkers, are already in an RCU critical section. Writers only need to hold
the page table lock.

But there is a difficulty here: an RCU critical section is not allowed to
sleep, but it is possible to sleep in the .pmd_entry callback, for example
in mmu_notifier_invalidate_range_start(). Use SRCU instead? Or use an
RCU + refcount method? Not sure. But I think it's an interesting thing to
try.

Thanks!