From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <860f45d7-4d75-4d67-bf2a-51f6000cd185@bytedance.com>
Date: Fri, 16 Aug 2024 18:01:25 +0800
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand
Cc: hughd@google.com, willy@infradead.org, mgorman@suse.de,
 muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org,
 zokeefe@google.com, rientjes@google.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages
 in zap_page_range_single()
References: <9fb3dc75cb7f023750da2b4645fd098429deaad5.1722861064.git.zhengqi.arch@bytedance.com>
 <2659a0bc-b5a7-43e0-b565-fcb93e4ea2b7@redhat.com>
 <42942b4d-153e-43e2-bfb1-43db49f87e50@bytedance.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 2024/8/16 17:22, David Hildenbrand wrote:
> On 07.08.24 05:58, Qi Zheng wrote:
>> Hi David,
>>
>
> Really sorry for the slow replies, I'm struggling with a mixture of
> public holidays, vacation and too many different discussions (well,
> and some stuff I have to finish myself).
>
>> On 2024/8/6 22:40, David Hildenbrand wrote:
>>> On 05.08.24 14:55, Qi Zheng wrote:
>>>> Nowadays, in order to pursue high performance, applications mostly
>>>> use high-performance user-mode memory allocators such as jemalloc
>>>> or tcmalloc. These memory allocators use madvise(MADV_DONTNEED or
>>>> MADV_FREE) to release physical memory, but neither MADV_DONTNEED
>>>> nor MADV_FREE releases page table memory, which may cause huge
>>>> page table memory usage.
>>>>
>>>> The following is a memory usage snapshot of one process, which
>>>> actually happened on our server:
>>>>
>>>>         VIRT:  55t
>>>>         RES:   590g
>>>>         VmPTE: 110g
>>>>
>>>> In this case, most of the page table entries are empty. For such a
>>>> PTE page where all entries are empty, we can actually free it back
>>>> to the system for others to use.
>>>>
>>>> As a first step, this commit attempts to synchronously free the
>>>> empty PTE pages in zap_page_range_single() (MADV_DONTNEED etc. will
>>>> invoke this). In order to reduce overhead, we only handle the cases
>>>> with a high probability of generating empty PTE pages, and other
>>>> cases will be filtered out, such as:
>>>
>>> It doesn't make particular sense during munmap(), where we will just
>>> remove the page tables manually directly afterwards. We should limit
>>> it to the !munmap case -- in particular MADV_DONTNEED.
>>
>> munmap directly calls unmap_single_vma() instead of
>> zap_page_range_single(), so the munmap case has already been excluded
>> here. On the other hand, if we try to reclaim in zap_pte_range(), we
>> need to identify the munmap case.
>>
>> Of course, we could just modify the MADV_DONTNEED case instead of all
>> the callers of zap_page_range_single(); perhaps we could add a new
>> parameter to identify the MADV_DONTNEED case?
>
> See below, zap_details might come in handy.
>
>>
>>> To minimize the added overhead, I further suggest to only try
>>> reclaim asynchronously if we know that likely all ptes will be
>>> none, that is,
>>
>> asynchronously? What you probably mean to say is synchronously,
>> right?
>>
>>> when we just zapped *all* ptes of a PTE page table -- our range
>>> spans the complete PTE page table.
>>>
>>> Just imagine someone zaps a single PTE, we really don't want to
>>> start scanning page tables and involve a (rather expensive)
>>> walk_page_range just to find out that there is still something
>>> mapped.
>>
>> In the munmap path, we first execute unmap and then reclaim the page
>> tables:
>>
>> unmap_vmas
>> free_pgtables
>>
>> Therefore, I think doing something similar in zap_page_range_single()
>> would be more consistent:
>>
>> unmap_single_vma
>> try_to_reclaim_pgtables
>>
>> And I think that the main overhead should be in flushing the TLB and
>> freeing the pages. Of course, I will do some performance testing to
>> see the actual impact.
>>
>>> Last but not least, would there be a way to avoid the
>>> walk_page_range() and simply trigger it from zap_pte_range(),
>>> possibly still while holding the PTE table lock?
>>
>> I've tried doing it that way before, but ultimately I did not choose
>> to do it that way because of the following reasons:
>
> I think we really should avoid another page table walk if possible.
>
>> 1. need to identify the munmap case
>
> We already have "struct zap_details". Maybe we can extend that to
> specify what our intentions are (either where we come from or whether
> we want to try ripping out page tables directly).
>
>> 2. trying to record the count of pte_none() within the original
>>    zap_pte_range() loop is not very convenient. The most convenient
>>    approach is still to loop 512 times to scan the PTE page.
>
> Right, the code might need some reshuffling. As we might temporarily
> drop the PTL (break case), fully relying on everything being
> pte_none() doesn't always work.
>
> We could either handle it in zap_pmd_range(), after we processed a
> full PMD range. zap_pmd_range() knows for sure whether the full PMD
> range was covered, even if multiple zap_pte_range() calls were
> required.
>
> Or we could indicate to zap_pte_range() the original range. Or we
> could make zap_pte_range() simply handle the retrying itself, and not
> get called multiple times for a single PMD range.
>
> So the key points are:
>
> (a) zap_pmd_range() should know for sure whether the full range is
>     covered by the zap.
> (b) zap_pte_range() knows whether it left any entries behind (IOW, it
>     never ran into the "!should_zap_folio" case)
> (c) we know whether we temporarily had to drop the PTL and someone
>     might have converted pte_none() to something else.
>
> Teaching zap_pte_range() to handle a full within-PMD range itself
> sounds cleanest.

Agree.

>
> Then we can handle it fully in zap_pte_range():
>
> (a) if we had to leave entries behind (!pte_none()), no need to try
>     ripping out the page table.

Yes.

>
> (b) if we didn't have to drop the PTL, we can remove the page table
>     without even re-verifying whether the entries are pte_none(). We

If we want to remove the PTE page, we must hold the pmd lock (for
clearing the pmd entry). To prevent ABBA deadlock, we must first
release the pte lock and then re-acquire the pmd lock + pte lock.
Right? If so, then rechecking pte_none() is unavoidable, unless we
hold the pmd lock + pte lock in advance while executing the original
code loop.

>     know they are. If we had to drop the PTL, we have to re-verify at
>     least the PTEs that were not zapped in the last iteration.
>
>
> So there is the chance to avoid pte_none() checks completely, or
> minimize them if we had to drop the PTL.
>
> Anything I am missing? Please let me know if anything is unclear.
>
> Reworking the retry logic for zap_pte_range(), to be called for a
> single PMD only once, is likely the first step.

Agree, will do.

>
>> 3. still need to release the pte lock, and then re-acquire the pmd
>>    lock and pte lock.
>
> Yes, if try-locking the PMD fails.
>