From: Qi Zheng <zhengqi.arch@bytedance.com>
To: the arch/x86 maintainers
Cc: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/7] synchronously scan and reclaim empty user PTE pages
Date: Thu, 4 Jul 2024 15:16:29 +0800
Add the x86 mailing list that I forgot to CC before.

On 2024/7/1 16:46, Qi Zheng wrote:
> Hi all,
> 
> Previously, we tried to use a completely asynchronous method to reclaim
> empty user PTE pages [1]. After discussing with David Hildenbrand, we
> decided to implement synchronous reclamation in the madvise(MADV_DONTNEED)
> case as the first step.
> 
> So this series aims to synchronously scan and reclaim empty user PTE pages
> in zap_page_range_single() (which madvise(MADV_DONTNEED) etc. will invoke).
> In zap_page_range_single(), mmu_gather is used to perform batched TLB
> flushing and page freeing. Therefore, if we want to free empty PTE pages in
> this path, the most natural way is to add them to mmu_gather as well. Two
> problems need to be solved here:
> 
> 1. Currently, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather
>    frees page table pages by semi-RCU:
> 
>    - batch table freeing: asynchronous free by RCU
>    - single table freeing: IPI + synchronous free
> 
>    But this is not enough to free empty PTE pages in paths other than the
>    munmap and exit_mmap paths, because the IPI cannot synchronize with
>    rcu_read_lock() in pte_offset_map{_lock}(). So single tables should also
>    be freed by RCU, like batch tables.
> 
> 2. When we use mmu_gather to batch-flush TLBs and free PTE pages, the TLB
>    is not flushed before the pmd lock is released. This may result in the
>    following two situations:
> 
>    1) Userland can trigger a page fault and fill in a huge page, which
>       causes a small-size TLB entry and a huge TLB entry to exist for the
>       same address.
> 
>    2) Userland can also trigger a page fault and fill in a PTE page, which
>       causes two small-size TLB entries to exist, but the PTE pages they
>       map are different.
> 
> For case 1), according to Intel's TLB Application note (317080), some x86
> CPUs do not allow it:
> 
> ```
> If software modifies the paging structures so that the page size used for
> a 4-KByte range of linear addresses changes, the TLBs may subsequently
> contain both ordinary and large-page translations for the address range.
> A reference to a linear address in the address range may use either
> translation. Which of the two translations is used may vary from one
> execution to another and the choice may be implementation-specific.
> 
> Software wishing to prevent this uncertainty should not write to a paging-
> structure entry in a way that would change, for any linear address, both
> the page size and either the page frame or attributes. It can instead use
> the following algorithm: first mark the relevant paging-structure entry
> (e.g., PDE) not present; then invalidate any translations for the affected
> linear addresses (see Section 5.2); and then modify the relevant
> paging-structure entry to mark it present and establish translation(s) for
> the new page size.
> ```
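Concretely, the note's "mark not present, invalidate, then re-establish" algorithm maps fairly directly onto a flush-before-set_pmd_at() hook. Below is a minimal sketch of what such a hook might look like on x86_64; the mm_tlb_flush_pending() check and the exact flush call are illustrative assumptions, not necessarily what the actual patch does:

```
/*
 * Illustrative sketch only: flush any stale 4KB TLB entries for a PMD
 * range before a huge-page entry is installed there, so that small and
 * huge translations for the same addresses never coexist. It assumes a
 * deferred mmu_gather flush is signalled via mm_tlb_flush_pending().
 */
#include <linux/mm.h>
#include <asm/tlbflush.h>

static inline void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
							unsigned long addr)
{
	if (mm_tlb_flush_pending(mm)) {
		unsigned long start = addr & PMD_MASK;
		unsigned long end = start + PMD_SIZE;

		/* 4KB stride: drop only the small-page translations. */
		flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, false);
	}
}
```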
> 
> We can also learn more information from the comments above
> pmdp_invalidate() in __split_huge_pmd_locked().
> 
> For case 2), we can see from the comments above ptep_clear_flush() in
> wp_page_copy() that this situation is not allowed either. Even without
> this patch series, madvise(MADV_DONTNEED) can already cause it:
> 
> 	CPU 0				CPU 1
> 
> 	madvise(MADV_DONTNEED)
> 	--> clear pte entry
> 	    pte_unmap_unlock
> 					touch and TLB miss
> 					--> set pte entry
> 	mmu_gather flushes the TLB
> 
> But strangely, I didn't find any relevant fix for this; maybe I missed
> something, or is it guaranteed by userland?
> 
> Anyway, this series defines the following two functions to be implemented
> by the architecture. If an architecture does not allow the above two
> situations, it should define these functions to flush the TLB before
> set_pmd_at():
> 
> - arch_flush_tlb_before_set_huge_page
> - arch_flush_tlb_before_set_pte_page
> 
> As a first step, we support this feature on x86_64 and select the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM there.
> 
> To reduce overhead, we only handle the cases with a high probability of
> generating empty PTE pages; other cases are filtered out, such as:
> 
> - hugetlb VMAs (unsuitable)
> - userfaultfd_wp VMAs (may reinstall the pte entry)
> - writable private file mappings (the COW-ed anon page is not zapped)
> - etc.
> 
> For the userfaultfd_wp and writable private file mapping cases (and the
> MADV_FREE case, of course), we may consider scanning and freeing empty PTE
> pages asynchronously in the future.
> 
> This series is based on next-20240627.
> 
> Comments and suggestions are welcome!
> 
> Thanks,
> Qi
> 
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
> 
> Qi Zheng (7):
>   mm: pgtable: make pte_offset_map_nolock() return pmdval
>   mm: introduce CONFIG_PT_RECLAIM
>   mm: pass address information to pmd_install()
>   mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
>   x86: mm: free page table pages by RCU instead of semi RCU
>   x86: mm: define arch_flush_tlb_before_set_huge_page
>   x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
> 
>  Documentation/mm/split_page_table_lock.rst |   3 +-
>  arch/arm/mm/fault-armv.c                   |   2 +-
>  arch/powerpc/mm/pgtable.c                  |   2 +-
>  arch/x86/Kconfig                           |   1 +
>  arch/x86/include/asm/pgtable.h             |   6 +
>  arch/x86/include/asm/tlb.h                 |  23 ++++
>  arch/x86/kernel/paravirt.c                 |   7 ++
>  arch/x86/mm/pgtable.c                      |  15 ++-
>  include/linux/hugetlb.h                    |   2 +-
>  include/linux/mm.h                         |  13 +-
>  include/linux/pgtable.h                    |  14 +++
>  mm/Kconfig                                 |  14 +++
>  mm/Makefile                                |   1 +
>  mm/debug_vm_pgtable.c                      |   2 +-
>  mm/filemap.c                               |   4 +-
>  mm/gup.c                                   |   2 +-
>  mm/huge_memory.c                           |   3 +
>  mm/internal.h                              |  17 ++-
>  mm/khugepaged.c                            |  24 +++-
>  mm/memory.c                                |  21 ++--
>  mm/migrate_device.c                        |   2 +-
>  mm/mmu_gather.c                            |   2 +-
>  mm/mprotect.c                              |   8 +-
>  mm/mremap.c                                |   4 +-
>  mm/page_vma_mapped.c                       |   2 +-
>  mm/pgtable-generic.c                       |  21 ++--
>  mm/pt_reclaim.c                            | 131 +++++++++++++++++++++
>  mm/userfaultfd.c                           |  10 +-
>  mm/vmscan.c                                |   2 +-
>  29 files changed, 307 insertions(+), 51 deletions(-)
>  create mode 100644 mm/pt_reclaim.c
> 
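As an appendix to problem 1 above, freeing a single table through RCU rather than the IPI-based synchronization could look roughly like the following sketch. The function names are illustrative, and it assumes the table is a plain page whose rcu_head is free for reuse at this point (page-table dtor/accounting is omitted):

```
/*
 * Illustrative sketch: defer the free of a single page table page via
 * call_rcu() instead of IPI + synchronous free, so that it also waits
 * for lockless walkers inside rcu_read_lock() sections, as in
 * pte_offset_map{_lock}(). Not the actual patch.
 */
#include <linux/mm.h>
#include <linux/rcupdate.h>

static void pt_free_rcu(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);

	__free_page(page);
}

static void tlb_remove_table_one_rcu(void *table)
{
	struct page *page = table;

	/* No IPI: wait for an RCU grace period before freeing. */
	call_rcu(&page->rcu_head, pt_free_rcu);
}
```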
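And the reclaim step itself, in heavily simplified form: after zapping, check under the pmd lock whether the PTE page has become completely empty and, if so, detach it and queue it on the mmu_gather. The helper below is hypothetical and glosses over the races with concurrent faults and lockless walkers that the real mm/pt_reclaim.c has to handle:

```
#include <linux/mm.h>
#include <asm/tlb.h>

/*
 * Hypothetical helper: if every entry in the PTE page mapping @addr is
 * pte_none(), clear the pmd entry and hand the page to the mmu_gather
 * for deferred freeing. Locking is simplified for illustration; the
 * PTE lock and tlb_flush_pending handling are omitted.
 */
static void try_to_reclaim_pte_page(struct mmu_gather *tlb,
				    struct mm_struct *mm, pmd_t *pmd,
				    unsigned long addr)
{
	spinlock_t *pml = pmd_lock(mm, pmd);
	pgtable_t token;
	pte_t *pte;
	int i;

	if (!pmd_present(*pmd) || pmd_leaf(*pmd))
		goto out;

	pte = pte_offset_map(pmd, addr & PMD_MASK);
	if (!pte)
		goto out;

	for (i = 0; i < PTRS_PER_PTE; i++) {
		if (!pte_none(ptep_get(pte + i))) {
			pte_unmap(pte);
			goto out;
		}
	}
	pte_unmap(pte);

	/* All PTRS_PER_PTE slots are empty: detach and free the page. */
	token = pmd_pgtable(*pmd);
	pmd_clear(pmd);
	pte_free_tlb(tlb, token, addr);
out:
	spin_unlock(pml);
}
```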