From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <288c35d9-47d5-405d-b921-4f4c59eb3920@bytedance.com>
Date: Mon, 29 Jul 2024 14:46:59 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC PATCH 0/7] synchronously scan and reclaim empty user PTE pages
Content-Language: en-US
To: "Vlastimil Babka (SUSE)"
Cc: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de,
 muchun.song@linux.dev, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
From: Qi Zheng <zhengqi.arch@bytedance.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Hi Vlastimil,

On 2024/7/26 17:07, Vlastimil Babka (SUSE) wrote:
> On 7/1/24 10:46 AM, Qi Zheng wrote:
>> Hi all,
>>
>> Previously, we tried to use a completely asynchronous method to
>> reclaim empty user PTE pages [1]. After discussing with David
>> Hildenbrand, we decided to implement synchronous reclamation in the
>> case of madvise(MADV_DONTNEED) as the first step.
>>
>> So this series aims to synchronously scan and reclaim empty user PTE
>> pages in zap_page_range_single() (madvise(MADV_DONTNEED) etc. will
>> invoke this). In zap_page_range_single(), mmu_gather is used to
>> perform batch TLB flushing and page freeing operations. Therefore, if
>> we want to free the empty PTE page in this path, the most natural way
>> is to add it to mmu_gather as well. There are two problems that need
>> to be solved here:
>
> Hello,

Thank you for your attention to this patch series!

> I would be curious to know to what extent you are planning to pursue
> this area, whether it's reclaim of empty page tables synchronously for
> now and maybe again asynchronously later, or you also plan on exploring
> reclaim of non-empty page tables.

As I discussed with David Hildenbrand, I am currently planning to
implement synchronous reclamation of empty user page tables first, for
the following reasons:

1. It covers most of the known cases.

2. As a first step, it helps verify the lock protection scheme, TLB
   flushing, and other infrastructure.

Later, I plan to implement asynchronous reclamation for MADV_FREE and
other situations. The initial idea is to mark the vma first, then add
the corresponding mm to a global linked list, and then perform
asynchronous scanning and reclamation during the memory reclamation
process.

For exploring reclaim of non-empty page tables, there is no plan yet.

But I have another plan, which is to remove the page tables from the
protection of the mmap lock:

1. Free all levels of page table pages by RCU, not just PTE pages but
   also PMD, PUD, etc.

2. Similar to pte_offset_map/pte_unmap, add
   [pmd|pud]_offset_map/[pmd|pud]_unmap, make them all contain
   rcu_read_lock/rcu_read_unlock, and make them accept failure (a rough
   sketch follows below).

In this way, we would no longer need the mmap lock. For readers, such
as page table walkers, we are already inside an RCU critical section.
For writers, we only need to hold the page table lock. But there is a
difficulty here: the RCU critical section is not allowed to sleep, yet
sleeping is possible in the callback function of .pmd_entry, such as
mmu_notifier_invalidate_range_start(). Use SRCU instead? Not sure.
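To make idea 2 concrete, a minimal sketch of what a failure-tolerant
pmd-level helper pair might look like. Everything here (names, checks
and all) is only an illustrative assumption, not code from this series:

```c
/*
 * Illustrative sketch only. Assumes PMD tables are freed via RCU, so a
 * walker inside an RCU read-side critical section can validate the PUD
 * entry and use the PMD table without holding the mmap lock, as long
 * as it is prepared for the lookup to fail.
 */
static inline pmd_t *pmd_offset_map(pud_t *pud, unsigned long addr)
{
	pud_t pudval;

	rcu_read_lock();
	pudval = READ_ONCE(*pud);
	if (pud_none(pudval) || unlikely(pud_bad(pudval))) {
		rcu_read_unlock();
		return NULL;	/* caller must handle failure and retry */
	}
	return pmd_offset(&pudval, addr);
}

static inline void pmd_unmap(pmd_t *pmd)
{
	rcu_read_unlock();
}
```

A walker would then bail out (or retry) when pmd_offset_map() returns
NULL, and call pmd_unmap() when done, just like the existing
pte_offset_map()/pte_unmap() pattern.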
> The reason is I have a master student interested in this topic, so it
> would be good to know for the planning.

This is great, comments and suggestions are welcome!

Thanks,
Qi

>
> Thanks a lot,
> Vlastimil
>
>> 1. Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather
>>    will free page table pages by semi RCU:
>>
>>    - batch table freeing: asynchronous free by RCU
>>    - single table freeing: IPI + synchronous free
>>
>>    But this is not enough to free the empty PTE page table pages in
>>    paths other than the munmap and exit_mmap paths, because IPI cannot
>>    be synchronized with rcu_read_lock() in pte_offset_map{_lock}(). So
>>    we should let single table freeing also be done by RCU, like batch
>>    table freeing.
>>
>> 2. When we use mmu_gather to batch flush TLBs and free PTE pages, the
>>    TLB is not flushed before the pmd lock is unlocked. This may result
>>    in the following two situations:
>>
>>    1) Userland can trigger a page fault and fill a huge page, which
>>       will cause a small-size TLB entry and a huge TLB entry to exist
>>       for the same address.
>>
>>    2) Userland can also trigger a page fault and fill a PTE page,
>>       which will cause two small-size TLB entries to exist, but the
>>       PTE pages they map are different.
>>
>> For case 1), according to Intel's TLB Application note (317080), some
>> x86 CPUs do not allow it:
>>
>> ```
>> If software modifies the paging structures so that the page size used
>> for a 4-KByte range of linear addresses changes, the TLBs may
>> subsequently contain both ordinary and large-page translations for the
>> address range. A reference to a linear address in the address range
>> may use either translation. Which of the two translations is used may
>> vary from one execution to another and the choice may be
>> implementation-specific.
>>
>> Software wishing to prevent this uncertainty should not write to a
>> paging-structure entry in a way that would change, for any linear
>> address, both the page size and either the page frame or attributes.
>> It can instead use the following algorithm: first mark the relevant
>> paging-structure entry (e.g., PDE) not present; then invalidate any
>> translations for the affected linear addresses (see Section 5.2); and
>> then modify the relevant paging-structure entry to mark it present and
>> establish translation(s) for the new page size.
>> ```
>>
>> We can also learn more information from the comments above
>> pmdp_invalidate() in __split_huge_pmd_locked().
>>
>> For case 2), we can see from the comments above ptep_clear_flush() in
>> wp_page_copy() that this situation is also not allowed. Even without
>> this patch series, madvise(MADV_DONTNEED) can also cause this
>> situation:
>>
>>        CPU 0                               CPU 1
>>
>>   madvise(MADV_DONTNEED)
>>   --> clear pte entry
>>       pte_unmap_unlock
>>                                      touch and tlb miss
>>                                      --> set pte entry
>>   mmu_gather flush tlb
>>
>> But strangely, I didn't see any relevant fix code; maybe I missed
>> something, or is this guaranteed by userland?
>>
>> Anyway, this series defines the following two functions to be
>> implemented by the architecture. If the architecture does not allow
>> the above two situations, it should define these two functions to
>> flush the TLB before set_pmd_at():
>>
>> - arch_flush_tlb_before_set_huge_page
>> - arch_flush_tlb_before_set_pte_page
>>
>> As a first step, we support this feature on x86_64 and select the
>> newly introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
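For illustration, roughly what the first hook could look like on x86.
This is a purely hypothetical sketch built on the existing
flush_tlb_mm_range(); the signature and the actual implementation in
patch 6 may well differ:

```c
/*
 * Hypothetical sketch, not the code of patch 6: before a huge-page
 * entry is installed over a range that may still have 4K translations
 * cached, flush the TLB for the whole PMD range so that ordinary and
 * large-page translations never coexist for the same address.
 */
static inline void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
						       unsigned long addr)
{
	unsigned long start = addr & HPAGE_PMD_MASK;

	flush_tlb_mm_range(mm, start, start + HPAGE_PMD_SIZE,
			   PAGE_SHIFT, false);
}
```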
>>
>> In order to reduce overhead, we only handle the cases with a high
>> probability of generating empty PTE pages, and other cases are
>> filtered out, such as:
>>
>> - hugetlb vma (unsuitable)
>> - userfaultfd_wp vma (may reinstall the pte entry)
>> - writable private file mapping case (COW-ed anon page is not zapped)
>> - etc.
>>
>> For the userfaultfd_wp and writable private file mapping cases (and
>> the MADV_FREE case, of course), we will consider scanning and freeing
>> empty PTE pages asynchronously in the future.
>>
>> This series is based on next-20240627.
>>
>> Comments and suggestions are welcome!
>>
>> Thanks,
>> Qi
>>
>> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
>>
>> Qi Zheng (7):
>>   mm: pgtable: make pte_offset_map_nolock() return pmdval
>>   mm: introduce CONFIG_PT_RECLAIM
>>   mm: pass address information to pmd_install()
>>   mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
>>   x86: mm: free page table pages by RCU instead of semi RCU
>>   x86: mm: define arch_flush_tlb_before_set_huge_page
>>   x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
>>
>>  Documentation/mm/split_page_table_lock.rst |   3 +-
>>  arch/arm/mm/fault-armv.c                   |   2 +-
>>  arch/powerpc/mm/pgtable.c                  |   2 +-
>>  arch/x86/Kconfig                           |   1 +
>>  arch/x86/include/asm/pgtable.h             |   6 +
>>  arch/x86/include/asm/tlb.h                 |  23 ++++
>>  arch/x86/kernel/paravirt.c                 |   7 ++
>>  arch/x86/mm/pgtable.c                      |  15 ++-
>>  include/linux/hugetlb.h                    |   2 +-
>>  include/linux/mm.h                         |  13 +-
>>  include/linux/pgtable.h                    |  14 +++
>>  mm/Kconfig                                 |  14 +++
>>  mm/Makefile                                |   1 +
>>  mm/debug_vm_pgtable.c                      |   2 +-
>>  mm/filemap.c                               |   4 +-
>>  mm/gup.c                                   |   2 +-
>>  mm/huge_memory.c                           |   3 +
>>  mm/internal.h                              |  17 ++-
>>  mm/khugepaged.c                            |  24 +++-
>>  mm/memory.c                                |  21 ++--
>>  mm/migrate_device.c                        |   2 +-
>>  mm/mmu_gather.c                            |   2 +-
>>  mm/mprotect.c                              |   8 +-
>>  mm/mremap.c                                |   4 +-
>>  mm/page_vma_mapped.c                       |   2 +-
>>  mm/pgtable-generic.c                       |  21 ++--
>>  mm/pt_reclaim.c                            | 131 +++++++++++++++++++++
>>  mm/userfaultfd.c                           |  10 +-
>>  mm/vmscan.c                                |   2 +-
>>  29 files changed, 307 insertions(+), 51 deletions(-)
>>  create mode 100644 mm/pt_reclaim.c
>