From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org,
	mhocko@kernel.org, vdavydov.dev@gmail.com,
	kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
	david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH v2 0/9] Free user PTE page table pages
Date: Thu, 19 Aug 2021 11:18:49 +0800
Message-Id: <20210819031858.98043-1-zhengqi.arch@bytedance.com>

Hi,

This patch series aims to free user PTE page table pages when all PTE entries
are empty.

The background is that some malloc libraries (e.g. jemalloc or tcmalloc)
usually allocate large ranges of virtual address space with mmap() and do not
unmap them. When they want to release physical memory, they call
madvise(MADV_DONTNEED). But madvise() does not free the page tables, so a
process that touches an enormous virtual address space can accumulate a large
number of page tables.

The following figures are a memory usage snapshot of one process, taken on
one of our servers:

	VIRT:  55t
	RES:   590g
	VmPTE: 110g

As we can see, the PTE page tables take up 110g while RES is 590g. In
theory, the process only needs about 1.2g of PTE page tables to map that much
physical memory.
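
For reference, the 1.2g estimate above can be reproduced with a little
arithmetic. The sketch below is only an illustration and rests on assumptions
the cover letter does not state: 4 KiB base pages, 8-byte PTEs (so one PTE
page table holds 512 entries and maps 2 MiB), "g" read as GiB, and perfectly
dense mappings. The helper name min_pte_table_bytes() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (not stated in the cover letter): 4 KiB pages, 8-byte PTEs. */
#define PAGE_SIZE_BYTES 4096ULL
#define PTE_ENTRY_BYTES 8ULL
#define PTES_PER_TABLE  (PAGE_SIZE_BYTES / PTE_ENTRY_BYTES) /* 512 */

/*
 * Hypothetical helper: minimum number of bytes of PTE page tables needed to
 * map resident_bytes of memory, assuming every PTE page table is fully used.
 */
static uint64_t min_pte_table_bytes(uint64_t resident_bytes)
{
	/* Guard the round-up below against overflow. */
	assert(resident_bytes <= UINT64_MAX - PAGE_SIZE_BYTES);

	uint64_t pages = (resident_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
	uint64_t tables = (pages + PTES_PER_TABLE - 1) / PTES_PER_TABLE;

	/* Each PTE page table occupies exactly one page. */
	return tables * PAGE_SIZE_BYTES;
}
```

Under these assumptions, min_pte_table_bytes(590ULL << 30) comes to 302,080
page table pages, i.e. roughly 1.15 GiB, which is in line with the 1.2g figure.
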
The reason why PTE page tables occupy so much memory is that
madvise(MADV_DONTNEED) only clears the PTEs and frees the physical memory; it
does not free the PTE page table pages. So we can free those empty PTE page
tables to save memory. In the above case, we can save about 108g (best case).
And the larger the difference between VIRT and RES, the more memory we save.

In this patch series, we add a pte_refcount field to the struct page of a PTE
page table to track how many users the PTE page table has. Similar to the page
refcount mechanism, a user of a PTE page table must hold a reference to it
before accessing it. The PTE page table page is freed when the last reference
is dropped.

Testing:

The following code snippet shows the effect of the optimization:

	mmap 50G
	while (1) {
		for (; i < 1024 * 25; i++) {
			touch 2M memory
			madvise MADV_DONTNEED 2M
		}
	}

As we can see, the memory usage of VmPTE is reduced:

			before		after
	VIRT	       50.0 GB	      50.0 GB
	RES	        3.1 MB	       3.6 MB
	VmPTE	     102640 kB	       248 kB

I have also tested stability with LTP[1] for several weeks and have not seen
any crash so far.

Page fault performance can be affected by the allocation and freeing of PTE
page table pages. The following are the results from a micro
benchmark[2]:

	root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

	threads	    before (pf/min)	     after (pf/min)
	    1		 32,085,255	 31,880,833 (-0.64%)
	    8		101,674,967	100,588,311 (-1.17%)
	   16		113,207,000	112,801,832 (-0.36%)

(Here "pf/min" means the number of page faults in one minute.)

Page fault throughput is about 1% lower than before.

This series is based on next-20210812.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/patchwork/comment/296794/

Changelog in v1 -> v2:
 - Change pte_install() to pmd_install().
 - Fix some typos and code style problems.
 - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9], [PATCH v2 6/9]
   and [PATCH v2 7/9].

Qi Zheng (9):
  mm: introduce pmd_install() helper
  mm: remove redundant smp_wmb()
  mm: rework the parameter of lock_page_or_retry()
  mm: move pte_alloc{,_map,_map_lock}() to a separate file
  mm: pte_refcount infrastructure
  mm: free user PTE page table pages
  mm: add THP support for pte_ref
  mm: free PTE page table by using rcu mechanism
  mm: use mmu_gather to free PTE page table

 arch/arm/mm/pgd.c             |   1 +
 arch/arm64/mm/hugetlbpage.c   |   1 +
 arch/ia64/mm/hugetlbpage.c    |   1 +
 arch/parisc/mm/hugetlbpage.c  |   1 +
 arch/powerpc/mm/hugetlbpage.c |   1 +
 arch/s390/mm/gmap.c           |   1 +
 arch/s390/mm/pgtable.c        |   1 +
 arch/sh/mm/hugetlbpage.c      |   1 +
 arch/sparc/mm/hugetlbpage.c   |   1 +
 arch/x86/Kconfig              |   2 +-
 fs/proc/task_mmu.c            |  22 +++-
 fs/userfaultfd.c              |   2 +
 include/linux/mm.h            |  12 +-
 include/linux/mm_types.h      |   8 +-
 include/linux/pagemap.h       |   8 +-
 include/linux/pgtable.h       |   3 +-
 include/linux/pte_ref.h       | 300 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/rmap.h          |   4 +-
 kernel/events/uprobes.c       |   3 +
 mm/Kconfig                    |   4 +
 mm/Makefile                   |   3 +-
 mm/filemap.c                  |  57 ++++----
 mm/gup.c                      |   7 +
 mm/hmm.c                      |   4 +
 mm/internal.h                 |   2 +
 mm/khugepaged.c               |   9 ++
 mm/ksm.c                      |   4 +
 mm/madvise.c                  |  18 ++-
 mm/memcontrol.c               |  11 +-
 mm/memory.c                   | 276 ++++++++++++++++++++++++--------------
 mm/mempolicy.c                |   4 +-
 mm/migrate.c                  |  18 +--
 mm/mincore.c                  |   5 +-
 mm/mlock.c                    |   1 +
 mm/mmu_gather.c               |  40 +++---
 mm/mprotect.c                 |  10 +-
 mm/mremap.c                   |  12 +-
 mm/page_vma_mapped.c          |   4 +
 mm/pagewalk.c                 |  16 ++-
 mm/pgtable-generic.c          |   2 +
 mm/pte_ref.c                  | 143 ++++++++++++++++++++
 mm/rmap.c                     |  13 ++
 mm/sparse-vmemmap.c           |   2 +-
 mm/swapfile.c                 |   3 +-
 mm/userfaultfd.c              |  18 ++-
 45 files changed, 849 insertions(+), 210 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

-- 
2.11.0
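
For anyone who wants to try the touch/madvise loop described under "Testing",
it could be written in C roughly as follows. This is a minimal sketch, not
the program used to produce the numbers above: the helper names map_region()
and touch_and_discard() are hypothetical, MAP_NORESERVE is an added
assumption so that a 50G mapping does not depend on overcommit settings, and
the page size is queried at run time rather than assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: reserve `total` bytes of private anonymous VA. */
static char *map_region(size_t total)
{
	void *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: walk the region in `chunk`-sized steps, touch every
 * page of a step, then hand the step back with MADV_DONTNEED. Returns the
 * number of chunks processed, or -1 on failure. The mapping itself is kept,
 * so any PTE page tables that were populated stay visible in VmPTE.
 */
static long touch_and_discard(char *base, size_t total, size_t chunk)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	long done = 0;

	if (base == NULL)
		return -1;
	assert(chunk > 0 && chunk % page == 0 && total % chunk == 0);

	for (size_t off = 0; off < total; off += chunk) {
		/* Writing one byte per page is enough to fault it in. */
		for (size_t p = 0; p < chunk; p += page)
			base[off + p] = 1;
		if (madvise(base + off, chunk, MADV_DONTNEED) != 0)
			return -1;
		done++;
	}
	return done;
}
```

A caller would map 50G (50UL << 30), run touch_and_discard() once with a 2M
chunk (2UL << 20), and then keep the process alive, for example with pause(),
so that VmPTE can be read from /proc/<pid>/status.
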