From: Qi Zheng
Date: Tue, 10 Dec 2024 16:57:04 +0800
Subject: Re: [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages
To: akpm@linux-foundation.org
Cc: david@redhat.com, jannh@google.com, hughd@google.com, willy@infradead.org,
 muchun.song@linux.dev, vbabka@kernel.org, peterx@redhat.com, mgorman@suse.de,
 catalin.marinas@arm.com, will@kernel.org, dave.hansen@linux.intel.com,
 luto@kernel.org, peterz@infradead.org, x86@kernel.org,
 lorenzo.stoakes@oracle.com, zokeefe@google.com, rientjes@google.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Message-ID: <53fb3b26-4a28-48a2-8403-a9b8d2fe6c24@bytedance.com>
Hi Andrew,

I have sent patches [1][2][3] to fix the recently reported issues:

[1]. https://lore.kernel.org/lkml/20241210084156.89877-1-zhengqi.arch@bytedance.com/
     (fixes a warning; needs to be folded into [PATCH v4 02/11])
[2]. https://lore.kernel.org/lkml/20241206112348.51570-1-zhengqi.arch@bytedance.com/
     (fixes an uninitialized symbol; needs to be folded into [PATCH v4 09/11])
[3]. https://lore.kernel.org/lkml/20241210084431.91414-1-zhengqi.arch@bytedance.com/
     (fixes a UAF; needs to be placed before [PATCH v4 11/11])

If you need me to re-post a complete v5, please let me know.

Thanks,
Qi

On 2024/12/4 19:09, Qi Zheng wrote:
> Changes in v4:
> - update process_addrs.rst in [PATCH v4 01/11]
>   (suggested by Lorenzo Stoakes)
> - fix [PATCH v3 4/9] and move it after [PATCH v3 5/9]
>   (pointed out by David Hildenbrand)
> - use any_skipped instead of rechecking pte_none() to detect empty
>   user PTE pages (suggested by David Hildenbrand)
> - rebase onto next-20241203
>
> Changes in v3:
> - recheck pmd state instead of pmd_same() in retract_page_tables()
>   (suggested by Jann Horn)
> - recheck dst_pmd entry in move_pages_pte() (pointed out by Jann Horn)
> - introduce new skip_none_ptes() (suggested by David Hildenbrand)
> - minor changes in [PATCH v2 5/7]
> - remove tlb_remove_table_sync_one() if CONFIG_PT_RECLAIM is enabled.
> - use put_page() instead of free_page_and_swap_cache() in
>   __tlb_remove_table_one_rcu() (pointed out by Jann Horn)
> - collect the Reviewed-bys and Acked-bys
> - rebase onto next-20241112
>
> Changes in v2:
> - fix [PATCH v1 1/7] (Jann Horn)
> - reset force_flush and force_break to false in [PATCH v1 2/7] (Jann Horn)
> - introduce zap_nonpresent_ptes() and do_zap_pte_range()
> - check pte_none() instead of can_reclaim_pt after the processing of PTEs
>   (remove [PATCH v1 3/7] and [PATCH v1 4/7])
> - reorder patches
> - rebase onto next-20241031
>
> Changes in v1:
> - replace [RFC PATCH 1/7] with a separate series (already merged into mm-unstable):
>   https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
>   (suggested by David Hildenbrand)
> - squash [RFC PATCH 2/7] into [RFC PATCH 4/7]
>   (suggested by David Hildenbrand)
> - scan and reclaim empty user PTE pages in zap_pte_range()
>   (suggested by David Hildenbrand)
> - send a separate RFC patch to track the TLB flushing issue, and remove
>   that part from this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]).
>   link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
> - add [PATCH v1 1/7] into this series
> - drop RFC tag
> - rebase onto next-20241011
>
> Changes in RFC v2:
> - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
>   kernel test robot
> - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
>   in retract_page_tables() (in [RFC PATCH 4/7])
> - rebase onto next-20240805
>
> Hi all,
>
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclamation in the madvise(MADV_DONTNEED) case as the
> first step.
>
> So this series aims to synchronously free the empty PTE pages in the
> madvise(MADV_DONTNEED) case.
> We will detect and free empty PTE pages in zap_pte_range(), and will add
> zap_details.reclaim_pt to exclude cases other than madvise(MADV_DONTNEED).
>
> In zap_pte_range(), mmu_gather is used to perform batched TLB flushing and page
> freeing. Therefore, if we want to free the empty PTE page in this path, the
> most natural way is to add it to mmu_gather as well. Currently, if
> CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather frees page table
> pages by semi RCU:
>
> - batch table freeing: asynchronous free by RCU
> - single table freeing: IPI + synchronous free
>
> But this is not enough to free empty PTE page table pages in paths other
> than the munmap and exit_mmap paths, because the IPI cannot synchronize with
> rcu_read_lock() in pte_offset_map{_lock}(). So single table freeing should
> also be done by RCU, like batch table freeing.
>
> As a first step, we support this feature on x86_64, which now selects the
> newly introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>
> For other cases, such as madvise(MADV_FREE), we will consider scanning and
> freeing empty PTE pages asynchronously in the future.
>
> This series is based on next-20241112 (which contains the series [2]).
>
> Note: issues related to TLB flushing are not new to this series and are tracked
> in the separate RFC patch [3]. For more context, please refer to this
> thread [4].
>
> Comments and suggestions are welcome!
>
> Thanks,
> Qi
>
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
> [2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
> [3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
> [4]. https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/
>
> Qi Zheng (11):
>   mm: khugepaged: recheck pmd state in retract_page_tables()
>   mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
>   mm: introduce zap_nonpresent_ptes()
>   mm: introduce do_zap_pte_range()
>   mm: skip over all consecutive none ptes in do_zap_pte_range()
>   mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been
>     re-installed
>   mm: do_zap_pte_range: return any_skipped information to the caller
>   mm: make zap_pte_range() handle full within-PMD range
>   mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
>   x86: mm: free page table pages by RCU instead of semi RCU
>   x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
>
>  Documentation/mm/process_addrs.rst |   4 +
>  arch/x86/Kconfig                   |   1 +
>  arch/x86/include/asm/tlb.h         |  20 +++
>  arch/x86/kernel/paravirt.c         |   7 +
>  arch/x86/mm/pgtable.c              |  10 +-
>  include/linux/mm.h                 |   1 +
>  include/linux/mm_inline.h          |  11 +-
>  include/linux/mm_types.h           |   4 +-
>  mm/Kconfig                         |  15 ++
>  mm/Makefile                        |   1 +
>  mm/internal.h                      |  19 +++
>  mm/khugepaged.c                    |  45 +++--
>  mm/madvise.c                       |   7 +-
>  mm/memory.c                        | 253 ++++++++++++++++++-----------
>  mm/mmu_gather.c                    |   9 +-
>  mm/pt_reclaim.c                    |  71 ++++++++
>  mm/userfaultfd.c                   |  51 ++++--
>  17 files changed, 397 insertions(+), 132 deletions(-)
>  create mode 100644 mm/pt_reclaim.c
>
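For readers following the detection logic described in the cover letter (skip runs of none PTEs, note whether any entry had to be left in place via any_skipped, and only then treat the PTE page as empty), here is a small userspace model of that idea. It is a simplified sketch under stated assumptions, not the kernel code: the names zap_pte_range_model(), count_none_ptes() and pte_page_reclaimable(), the PTE_PRESENT/PTE_MARKER encoding, and the rule that only a zap covering all 512 entries qualifies are all hypothetical. Real PTE layouts, locking, TLB flushing and the RCU-deferred freeing through mmu_gather are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a 4 KiB PTE page with 8-byte entries, as on x86_64. */
#define PTRS_PER_PTE 512

/* Hypothetical encoding for this model only (not the x86 PTE layout). */
typedef uint64_t pte_t;        /* 0 means "none" */
#define PTE_PRESENT 0x1ULL     /* an ordinary mapping: zapping clears it */
#define PTE_MARKER  0x2ULL     /* must be kept, e.g. a re-installed uffd-wp marker */

/* Length of the run of none entries starting at idx (idea of skip_none_ptes()). */
static size_t count_none_ptes(const pte_t *pte, size_t idx, size_t end)
{
	size_t n = 0;

	while (idx + n < end && pte[idx + n] == 0)
		n++;
	return n;
}

/*
 * Zap entries [start, end): present entries are cleared, marker entries are
 * left in place and reported through *any_skipped. Returns how many entries
 * were cleared.
 */
static size_t zap_pte_range_model(pte_t *pte, size_t start, size_t end,
				  bool *any_skipped)
{
	size_t idx = start, cleared = 0;

	assert(start <= end && end <= PTRS_PER_PTE);
	*any_skipped = false;
	while (idx < end) {
		size_t nr = count_none_ptes(pte, idx, end);

		if (nr) {
			idx += nr;	/* jump over the whole none run at once */
			continue;
		}
		if (pte[idx] & PTE_MARKER) {
			*any_skipped = true;	/* entry kept: page is not empty */
		} else {
			pte[idx] = 0;
			cleared++;
		}
		idx++;
	}
	return cleared;
}

/*
 * Model rule: the page is empty (and could be freed) only if the zap covered
 * every entry and skipped none, so no second pte_none() pass is needed.
 */
static bool pte_page_reclaimable(size_t start, size_t end, bool any_skipped)
{
	return start == 0 && end == PTRS_PER_PTE && !any_skipped;
}
```

Zapping a table that holds only ordinary mappings leaves every entry none with any_skipped false, so it qualifies; if one marker entry has to be kept, the page does not qualify even though every other entry was cleared. Deciding from a flag collected during the walk, rather than re-reading all 512 entries afterwards, is how this sketch interprets the v4 change from rechecking pte_none() to any_skipped.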