From: Qi Zheng <zhengqi.arch@bytedance.com>
Date: Mon, 5 Aug 2024 21:14:21 +0800
Message-ID: <5505cdd3-b716-4ba5-98b4-9b2a4f06a432@bytedance.com>
Subject: Re: [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
To: the arch/x86 maintainers
Cc: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de,
    muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org,
    zokeefe@google.com, rientjes@google.com, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Add the x86 mailing list.

On 2024/8/5 20:55, Qi Zheng wrote:
> Changes in RFC v2:
> - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
>   the kernel test robot
> - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
>   in retract_page_tables() (in [RFC PATCH 4/7])
> - rebase onto next-20240805
>
> Hi all,
>
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as
> the first step.
>
> So this series aims to synchronously scan and reclaim empty user PTE pages in
> zap_page_range_single() (madvise(MADV_DONTNEED) etc. will invoke this). In
> zap_page_range_single(), mmu_gather is used to perform batch TLB flushing and
> page freeing operations. Therefore, if we want to free the empty PTE page in
> this path, the most natural way is to add it to mmu_gather as well. There are
> two problems that need to be solved here:
>
> 1. Currently, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather
>    frees page table pages by semi-RCU:
>
>    - batch table freeing: asynchronous free by RCU
>    - single table freeing: IPI + synchronous free
>
>    But this is not enough to free the empty PTE page table pages in paths
>    other than the munmap and exit_mmap paths, because the IPI cannot be
>    synchronized with rcu_read_lock() in pte_offset_map{_lock}(). So we should
>    let single table freeing also be done by RCU, like batch table freeing.
>
> 2. When we use mmu_gather to batch flush TLBs and free PTE pages, the TLB is
>    not flushed before the pmd lock is released. This may result in the
>    following two situations:
>
>    1) Userland can trigger a page fault and fill in a huge page, which will
>       cause a small-size TLB entry and a huge TLB entry to coexist for the
>       same address.
>
>    2) Userland can also trigger a page fault and fill in a PTE page, which
>       will cause two small-size TLB entries to coexist, but the PTE pages
>       they map are different.
>
> For case 1), according to Intel's TLB Application note (317080), some x86
> CPUs do not allow it:
>
> ```
> If software modifies the paging structures so that the page size used for a
> 4-KByte range of linear addresses changes, the TLBs may subsequently contain
> both ordinary and large-page translations for the address range. A reference
> to a linear address in the address range may use either translation. Which of
> the two translations is used may vary from one execution to another and the
> choice may be implementation-specific.
>
> Software wishing to prevent this uncertainty should not write to a paging-
> structure entry in a way that would change, for any linear address, both the
> page size and either the page frame or attributes.
> It can instead use the following algorithm: first mark the relevant
> paging-structure entry (e.g., PDE) not present; then invalidate any
> translations for the affected linear addresses (see Section 5.2); and then
> modify the relevant paging-structure entry to mark it present and establish
> translation(s) for the new page size.
> ```
>
> We can also learn more from the comments above pmdp_invalidate() in
> __split_huge_pmd_locked().
>
> For case 2), we can see from the comments above ptep_clear_flush() in
> wp_page_copy() that this situation is also not allowed. Even without this
> patch series, madvise(MADV_DONTNEED) can already cause this situation:
>
>            CPU 0                           CPU 1
>
>    madvise(MADV_DONTNEED)
>    --> clear pte entry
>        pte_unmap_unlock
>                                    touch and tlb miss
>                                    --> set pte entry
>    mmu_gather flush tlb
>
> But strangely, I didn't see any relevant fix code; maybe I missed something,
> or is this guaranteed by userland?
>
> Anyway, this series defines the following two functions to be implemented by
> the architecture. If the architecture does not allow the above two
> situations, then it should define these two functions to flush the TLB
> before set_pmd_at():
>
>    - arch_flush_tlb_before_set_huge_page
>    - arch_flush_tlb_before_set_pte_page
>
> As a first step, we support this feature on x86_64 and select the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>
> In order to reduce overhead, we only handle the cases with a high probability
> of generating empty PTE pages; other cases are filtered out, such as:
>
>    - hugetlb vma (unsuitable)
>    - userfaultfd_wp vma (may reinstall the pte entry)
>    - writable private file mapping case (COW-ed anon page is not zapped)
>    - etc.
>
> For the userfaultfd_wp and writable private file mapping cases (and the
> MADV_FREE case, of course), we may consider scanning and freeing empty PTE
> pages asynchronously in the future.
>
> This series is based on next-20240805.
>
> Comments and suggestions are welcome!
>
> Thanks,
> Qi
>
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
>
> Qi Zheng (7):
>   mm: pgtable: make pte_offset_map_nolock() return pmdval
>   mm: introduce CONFIG_PT_RECLAIM
>   mm: pass address information to pmd_install()
>   mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
>   x86: mm: free page table pages by RCU instead of semi RCU
>   x86: mm: define arch_flush_tlb_before_set_huge_page
>   x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
>
>  Documentation/mm/split_page_table_lock.rst |   3 +-
>  arch/arm/mm/fault-armv.c                   |   2 +-
>  arch/powerpc/mm/pgtable.c                  |   2 +-
>  arch/x86/Kconfig                           |   1 +
>  arch/x86/include/asm/pgtable.h             |   6 +
>  arch/x86/include/asm/tlb.h                 |  19 +++
>  arch/x86/kernel/paravirt.c                 |   7 ++
>  arch/x86/mm/pgtable.c                      |  23 +++-
>  include/linux/hugetlb.h                    |   2 +-
>  include/linux/mm.h                         |  13 +-
>  include/linux/pgtable.h                    |  14 +++
>  mm/Kconfig                                 |  14 +++
>  mm/Makefile                                |   1 +
>  mm/debug_vm_pgtable.c                      |   2 +-
>  mm/filemap.c                               |   4 +-
>  mm/gup.c                                   |   2 +-
>  mm/huge_memory.c                           |   3 +
>  mm/internal.h                              |  17 ++-
>  mm/khugepaged.c                            |  32 +++--
>  mm/memory.c                                |  21 ++--
>  mm/migrate_device.c                        |   2 +-
>  mm/mmu_gather.c                            |   9 +-
>  mm/mprotect.c                              |   8 +-
>  mm/mremap.c                                |   4 +-
>  mm/page_vma_mapped.c                       |   2 +-
>  mm/pgtable-generic.c                       |  21 ++--
>  mm/pt_reclaim.c                            | 131 +++++++++++++++++++++
>  mm/userfaultfd.c                           |  10 +-
>  mm/vmscan.c                                |   2 +-
>  29 files changed, 321 insertions(+), 56 deletions(-)
>  create mode 100644 mm/pt_reclaim.c
>
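To make the intent of the two proposed hooks more concrete, below is a minimal
illustrative sketch of what an x86 arch_flush_tlb_before_set_huge_page() could
look like. Only the hook name comes from the cover letter above; the body (the
mm_tlb_flush_pending() check and the flush_tlb_mm_range() call) is an assumed
implementation for illustration, not the actual patch code:

```
/*
 * Illustrative sketch only -- not the actual patch code. The idea: before a
 * huge mapping is installed at the PMD level, flush any TLB entries for the
 * PMD-sized range that a still-pending mmu_gather flush has not yet removed,
 * so a stale 4K translation cannot coexist with the new 2M translation.
 */
#include <linux/mm.h>
#include <asm/tlbflush.h>

static inline void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
							unsigned long addr)
{
	/* Only flush if a batched (mmu_gather) TLB flush may still be pending. */
	if (mm_tlb_flush_pending(mm)) {
		unsigned long start = addr & PMD_MASK;

		flush_tlb_mm_range(mm, start, start + PMD_SIZE,
				   PAGE_SHIFT, false);
	}
}
```

An arch_flush_tlb_before_set_pte_page() counterpart would presumably do the
same for the PTE-page case (situation 2 above), flushing the range covered by
the old PTE page before pmd_install() makes the new one visible.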