From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 4 Dec 2024 14:36:25 -0800
From: Andrew Morton
To: Qi Zheng
Cc: david@redhat.com, jannh@google.com, hughd@google.com,
 willy@infradead.org, muchun.song@linux.dev, vbabka@kernel.org,
 peterx@redhat.com, mgorman@suse.de, catalin.marinas@arm.com,
 will@kernel.org, dave.hansen@linux.intel.com, luto@kernel.org,
 peterz@infradead.org, x86@kernel.org, lorenzo.stoakes@oracle.com,
 zokeefe@google.com, rientjes@google.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
Message-Id: <20241204143625.a09c2b8376b2415b985cf50a@linux-foundation.org>
In-Reply-To: <92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com>
References: <92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com>
X-Mailer: Sylpheed 3.7.0 (GTK+ 2.24.33; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Precedence:
bulk
X-Loop: owner-majordomo@kvack.org

On Wed, 4 Dec 2024 19:09:49 +0800 Qi Zheng wrote:

> Now in order to pursue high performance, applications mostly use some
> high-performance user-mode memory allocators, such as jemalloc or
> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
> release page table memory, which may cause huge page table memory usage.
>
> The following are a memory usage snapshot of one process which actually
> happened on our server:
>
>         VIRT:  55t
>         RES:   590g
>         VmPTE: 110g
>
> In this case, most of the page table entries are empty. For such a PTE
> page where all entries are empty, we can actually free it back to the
> system for others to use.
>
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
>
> Once an empty PTE is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).

"wait until the pte lock is released" sounds nasty.  I'm not
immediately seeing the code which does this.  Please provide more
description?

> For other cases such as madvise(MADV_FREE), consider scanning and freeing
> empty PTE pages asynchronously in the future.
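
For scale, the snapshot quoted above looks consistent with every PTE page
for the 55t mapping still being allocated.  A rough check, assuming x86-64
with 4 KiB pages and 512 eight-byte entries per PTE page (so one PTE page
maps 2 MiB), and reading "t"/"g" as TiB/GiB.  The helper names are made up
for illustration and are not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: x86-64, 4 KiB base pages, 8-byte PTEs. */
#define PAGE_SIZE_BYTES   4096ULL
#define PTRS_PER_PTE_X86  512ULL
#define SPAN_PER_PTE_PAGE (PAGE_SIZE_BYTES * PTRS_PER_PTE_X86)	/* 2 MiB */

/* Number of PTE pages needed to map 'span' bytes, rounded up. */
static inline uint64_t pte_pages_for_span(uint64_t span)
{
	/* Guard the round-up addition against overflow. */
	assert(span <= UINT64_MAX - (SPAN_PER_PTE_PAGE - 1));
	return (span + SPAN_PER_PTE_PAGE - 1) / SPAN_PER_PTE_PAGE;
}

/* Memory consumed by those PTE pages, in bytes. */
static inline uint64_t pte_bytes_for_span(uint64_t span)
{
	return pte_pages_for_span(span) * PAGE_SIZE_BYTES;
}
```

Under those assumptions pte_bytes_for_span(55ULL << 40) comes to 110 GiB,
which matches the reported VmPTE, whereas the 590g that is actually
resident would need only about 1.15 GiB of PTE pages if it were densely
packed.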
>
> The following code snippet can show the effect of optimization:
>
>         mmap 50G
>         while (1) {
>                 for (; i < 1024 * 25; i++) {
>                         touch 2M memory
>                         madvise MADV_DONTNEED 2M
>                 }
>         }
>
> As we can see, the memory usage of VmPTE is reduced:
>
>                         before          after
>         VIRT            50.0 GB         50.0 GB
>         RES             3.1 MB          3.1 MB
>         VmPTE           102640 KB       240 KB
>
> ...
>
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>  	  The architecture has hardware support for userspace shadow call
>  	  stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>
> +config ARCH_SUPPORTS_PT_RECLAIM
> +	def_bool n
> +
> +config PT_RECLAIM
> +	bool "reclaim empty user page table pages"
> +	default y
> +	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> +	select MMU_GATHER_RCU_TABLE_FREE
> +	help
> +	  Try to reclaim empty user page table pages in paths other than munmap
> +	  and exit_mmap path.
> +
> +	  Note: now only empty user PTE page table pages will be reclaimed.
> +

Why is this optional?  What is the case for permitting PT_RECLAIM to
be disabled?

>  source "mm/damon/Kconfig"
>
>  endmenu
>
> ...
>
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +		     struct mmu_gather *tlb)
> +{
> +	pmd_t pmdval;
> +	spinlock_t *pml, *ptl;
> +	pte_t *start_pte, *pte;
> +	int i;
> +
> +	pml = pmd_lock(mm, pmd);
> +	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> +	if (!start_pte)
> +		goto out_ptl;
> +	if (ptl != pml)
> +		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +
> +	/* Check if it is empty PTE page */
> +	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> +		if (!pte_none(ptep_get(pte)))
> +			goto out_ptl;
> +	}

Are there any worst-case situations in which we'll spend unacceptable
amounts of time running this loop?
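
On the cost of a single pass: the loop is bounded by PTRS_PER_PTE, so one
call reads at most one page worth of entries.  A userspace model of that
bound, assuming 512 entries per PTE page (x86-64, 4 KiB pages) and
treating a zero word as pte_none().  model_pte_page_empty() is a
hypothetical name, not a kernel function, and the model says nothing
about how often the scan is entered per madvise() call:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PTRS_PER_PTE 512	/* assumption: x86-64, 4 KiB pages */

/*
 * Return 1 if every entry is zero (standing in for pte_none()), else 0.
 * If 'reads' is non-NULL it receives the number of entries examined.
 */
static inline int model_pte_page_empty(const uint64_t *table, size_t *reads)
{
	size_t i;

	assert(table != NULL);

	for (i = 0; i < MODEL_PTRS_PER_PTE; i++) {
		if (table[i] != 0) {
			if (reads)
				*reads = i + 1;	/* stopped at first populated entry */
			return 0;
		}
	}
	if (reads)
		*reads = MODEL_PTRS_PER_PTE;	/* full scan, page is empty */
	return 1;
}
```

In this model the scan only runs to completion when the page really is
empty or the first populated entry sits in the last slot; either way it
is 512 loads from a single 4 KiB page.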
> +	pte_unmap(start_pte);
> +
> +	pmd_clear(pmd);
> +
> +	if (ptl != pml)
> +		spin_unlock(ptl);
> +	spin_unlock(pml);
> +
> +	free_pte(mm, addr, tlb, pmdval);
> +
> +	return;
> +out_ptl:
> +	if (start_pte)
> +		pte_unmap_unlock(start_pte, ptl);
> +	if (ptl != pml)
> +		spin_unlock(pml);
> +}
> -- 
> 2.20.1
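
For anyone wanting to reproduce the VmPTE numbers, the pseudo-code quoted
in the changelog might look roughly like this in C.  This is a sketch and
not the test program behind the changelog: churn_region() and its
parameters are made up, "touch 2M memory" is read as writing one byte per
4 KiB page, and MAP_NORESERVE is an addition here so that the large
mapping does not depend on the overcommit settings:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map 'total' bytes of anonymous memory, then walk it in 'chunk'-sized
 * steps: write one byte per 4 KiB page of the chunk, then drop the chunk
 * with MADV_DONTNEED.  Returns the number of chunks processed, or -1 on
 * bad arguments or a failed system call.
 */
static long churn_region(size_t total, size_t chunk)
{
	const size_t page = 4096;	/* assumption: 4 KiB base pages */
	long done = 0;
	size_t off, p;
	char *base;

	if (chunk == 0 || chunk % page || total % chunk)
		return -1;

	base = mmap(NULL, total, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	for (off = 0; off < total; off += chunk) {
		for (p = 0; p < chunk; p += page)
			base[off + p] = 1;	/* fault the page in */
		if (madvise(base + off, chunk, MADV_DONTNEED)) {
			munmap(base, total);
			return -1;
		}
		done++;
	}

	assert(done == (long)(total / chunk));
	munmap(base, total);
	return done;
}
```

churn_region(50UL << 30, 2UL << 20) would correspond to one pass of the
quoted inner loop.  To compare VmPTE, /proc/self/status has to be read
before the final munmap(), since munmap frees the page tables regardless
of PT_RECLAIM.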