From: Qi Zheng <zhengqi.arch@bytedance.com>
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [PATCH v3 00/15] Free user PTE page table pages
Date: Wed, 10 Nov 2021 16:40:42 +0800
Message-Id: <20211110084057.27676-1-zhengqi.arch@bytedance.com>

Hi,

This patch series aims to free user PTE page table pages when all of their
PTE entries are empty.

The beginning of this story is that some malloc libraries (e.g. jemalloc or
tcmalloc) usually reserve a large range of virtual addresses with mmap() and
never unmap them. When they want to free physical memory, they use
madvise(MADV_DONTNEED). But the page tables are not freed by madvise(), so a
process that touches an enormous virtual address space can accumulate a large
number of page table pages.

The following figures are a memory usage snapshot of one process, which
actually happened on our server:

	VIRT:  55t
	RES:   590g
	VmPTE: 110g

As we can see, the PTE page tables occupy 110g, while the RES is 590g. In
theory, the process only needs about 1.2g of PTE page tables to map that
physical memory. The reason the PTE page tables occupy so much memory is that
madvise(MADV_DONTNEED) only clears the PTEs and frees the physical pages, but
does not free the PTE page table pages themselves. So we can free those empty
PTE page tables to save memory. In the case above, we can save about 108g
(best case).
And the larger the difference between the size of VIRT and RES, the more
memory we can save.

In this patch series, we add a pte_refcount field to the struct page of a
page table to track how many users the PTE page table has. Similar to the
page refcount mechanism, a user of a PTE page table must hold a refcount on
it before accessing it. The PTE page table page is freed when the last
refcount is dropped.

Testing:

The following code snippet shows the effect of the optimization:

	mmap 50G
	while (1) {
		for (; i < 1024 * 25; i++) {
			touch 2M memory
			madvise MADV_DONTNEED 2M
		}
	}

As we can see, the memory usage of VmPTE is reduced:

			before		after
	VIRT:		50.0 GB		50.0 GB
	RES:		 3.1 MB		 3.6 MB
	VmPTE:		102640 kB	 248 kB

I have also tested the stability with LTP[1] for several weeks and have not
seen any crash so far.

Page fault performance can be affected by the allocation/freeing of PTE page
table pages. The following is the test result obtained with a micro
benchmark[2]:

	root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads

	threads		before (pf/min)		after (pf/min)
	1		     32,085,255		    31,880,833 (-0.64%)
	8		    101,674,967		   100,588,311 (-1.07%)
	16		    113,207,000		   112,801,832 (-0.36%)

(The "pf/min" means how many page faults are handled in one minute.)

Page fault performance is ~1% slower than before.
And there are no obvious changes in perf hot spots:

before:
  19.29%  [kernel]  [k] clear_page_rep
  16.12%  [kernel]  [k] do_user_addr_fault
   9.57%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.16%  [kernel]  [k] get_page_from_freelist
   5.03%  [kernel]  [k] __handle_mm_fault
   3.53%  [kernel]  [k] __rcu_read_unlock
   3.45%  [kernel]  [k] handle_mm_fault
   3.38%  [kernel]  [k] down_read_trylock
   2.74%  [kernel]  [k] free_unref_page_list
   2.17%  [kernel]  [k] up_read
   1.93%  [kernel]  [k] charge_memcg
   1.73%  [kernel]  [k] try_charge_memcg
   1.71%  [kernel]  [k] __alloc_pages
   1.69%  [kernel]  [k] ___perf_sw_event
   1.44%  [kernel]  [k] get_mem_cgroup_from_mm

after:
  18.19%  [kernel]  [k] clear_page_rep
  16.28%  [kernel]  [k] do_user_addr_fault
   8.39%  [kernel]  [k] _raw_spin_unlock_irqrestore
   5.12%  [kernel]  [k] get_page_from_freelist
   4.81%  [kernel]  [k] __handle_mm_fault
   4.68%  [kernel]  [k] down_read_trylock
   3.80%  [kernel]  [k] handle_mm_fault
   3.59%  [kernel]  [k] get_mem_cgroup_from_mm
   2.49%  [kernel]  [k] free_unref_page_list
   2.41%  [kernel]  [k] up_read
   2.16%  [kernel]  [k] charge_memcg
   1.92%  [kernel]  [k] __rcu_read_unlock
   1.88%  [kernel]  [k] ___perf_sw_event
   1.70%  [kernel]  [k] pte_get_unless_zero

This series is based on next-20211108.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c

Changelog in v2 -> v3:
 - Refactored this patch series:
    - [PATCH v3 6/15]: Introduce the new dummy helpers first
    - [PATCH v3 7-12/15]: Convert each subsystem individually
    - [PATCH v3 13/15]: Implement the actual logic in the dummy helpers
   And thanks for the advice from David and Jason.
 - Add a document.

Changelog in v1 -> v2:
 - Change pte_install() to pmd_install().
 - Fix some typos and code style problems.
 - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9], [PATCH v2 6/9]
   and [PATCH v2 7/9].
Qi Zheng (15):
  mm: do code cleanups to filemap_map_pmd()
  mm: introduce is_huge_pmd() helper
  mm: move pte_offset_map_lock() to pgtable.h
  mm: rework the parameter of lock_page_or_retry()
  mm: add pmd_installed_type return for __pte_alloc() and other friends
  mm: introduce refcount for user PTE page table page
  mm/pte_ref: add support for user PTE page table page allocation
  mm/pte_ref: initialize the refcount of the withdrawn PTE page table page
  mm/pte_ref: add support for the map/unmap of user PTE page table page
  mm/pte_ref: add support for page fault path
  mm/pte_ref: take a refcount before accessing the PTE page table page
  mm/pte_ref: update the pmd entry in move_normal_pmd()
  mm/pte_ref: free user PTE page table pages
  Documentation: add document for pte_ref
  mm/pte_ref: use mmu_gather to free PTE page table pages

 Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
 arch/x86/Kconfig             |   2 +-
 fs/proc/task_mmu.c           |  24 +++-
 fs/userfaultfd.c             |   9 +-
 include/linux/huge_mm.h      |  10 +-
 include/linux/mm.h           | 170 ++++------------------------
 include/linux/mm_types.h     |   6 +-
 include/linux/pagemap.h      |   8 +-
 include/linux/pgtable.h      | 152 +++++++++++++++++++++++++-
 include/linux/pte_ref.h      | 146 +++++++++++++++++++++++++
 include/linux/rmap.h         |   2 +
 kernel/events/uprobes.c      |   2 +
 mm/Kconfig                   |   4 +
 mm/Makefile                  |   4 +-
 mm/damon/vaddr.c             |  12 +-
 mm/debug_vm_pgtable.c        |   5 +-
 mm/filemap.c                 |  45 +++++---
 mm/gup.c                     |  25 ++++-
 mm/hmm.c                     |   5 +-
 mm/huge_memory.c             |   3 +-
 mm/internal.h                |   4 +-
 mm/khugepaged.c              |  21 +++-
 mm/ksm.c                     |   6 +-
 mm/madvise.c                 |  21 +++-
 mm/memcontrol.c              |  12 +-
 mm/memory-failure.c          |  11 +-
 mm/memory.c                  | 254 ++++++++++++++++++++++++++++++++-----------
 mm/mempolicy.c               |   6 +-
 mm/migrate.c                 |  54 ++++-----
 mm/mincore.c                 |   7 +-
 mm/mlock.c                   |   1 +
 mm/mmu_gather.c              |  40 +++----
 mm/mprotect.c                |  11 +-
 mm/mremap.c                  |  14 ++-
 mm/page_vma_mapped.c         |   4 +
 mm/pagewalk.c                |  15 ++-
 mm/pgtable-generic.c         |   1 +
 mm/pte_ref.c                 | 141 ++++++++++++++++++++++++
 mm/rmap.c                    |  10 ++
 mm/swapfile.c                |   3 +
 mm/userfaultfd.c             |  40 +++++--
 41 files changed, 1186 insertions(+), 340 deletions(-)
 create mode 100644 Documentation/vm/pte_ref.rst
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

--
2.11.0