References: <20241212162508.GA4712@cmpxchg.org> <20241213014744.45296-1-21cnbao@gmail.com>
In-Reply-To: <20241213014744.45296-1-21cnbao@gmail.com>
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 13 Dec 2024 15:27:31 +1300
Subject: Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
To: hannes@cmpxchg.org
Cc: akpm@linux-foundation.org, axboe@kernel.dk,
 bala.seshasayee@linux.intel.com, baolin.wang@linux.alibaba.com,
 chrisl@kernel.org, david@redhat.com, hch@infradead.org,
 kanchana.p.sridhar@intel.com, kasong@tencent.com, linux-mm@kvack.org,
 nphamcs@gmail.com, ryan.roberts@arm.com, senozhatsky@chromium.org,
 terrelln@fb.com, usamaarif642@gmail.com, v-songbaohua@oppo.com,
 wajdi.k.feghali@intel.com, willy@infradead.org,
 ying.huang@linux.alibaba.com, yosryahmed@google.com

On Fri, Dec 13, 2024 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Dec 13, 2024 at 5:25 AM Johannes Weiner wrote:
> >
> > On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> > > On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand wrote:
> > > >
> > > > On 12.12.24 09:46, Barry Song wrote:
> > > > > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig wrote:
> > > > >>
> > > > >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > > >>> From: Barry Song
> > > > >>>
> > > > >>> While developing the zeromap series, Usama observed that certain
> > > > >>> workloads may contain over 10% zero-filled pages. This may present
> > > > >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > > >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> > > > >>> allocate a new page using the Copy-on-Write mechanism.
> > > > >>
> > > > >> Shouldn't this be done during, or rather instead of swap out instead?
> > > > >> Swapping all zero pages out just to optimize the in-memory
> > > > >> representation on seems rather backwards.
> > > > >
> > > > > I'm having trouble understanding your point - it seems like you might
> > > > > not have fully read the code. :-)
> > > > >
> > > > > The situation is as follows: for a zero-filled page, we are currently
> > > > > allocating a new page unconditionally. By mapping this zero-filled
> > > > > page to zero_pfn, we could save the memory used by this page.
> > > > >
> > > > > We don't need to allocate the memory until the page is written(which
> > > > > may never happen).
> > > >
> > > > I think what Christoph means is that you would determine that at PTE
> > > > unmap time, and directly place the zero page in there. So there would be
> > > > no need to have the page fault at all.
> > > >
> > > > I suspect at PTE unmap time might be problematic, because we might still
> > > > have other (i.e., GUP) references modifying that page, and we can only
> > > > rely on the page content being stable after we flushed the TLB as well.
> > > > (I recall some deferred flushing optimizations)
> > >
> > > Yes, we need to follow a strict sequence:
> > >
> > > 1. try_to_unmap - unmap PTEs in all processes;
> > > 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
> > > 3. pageout - zeromap will set 1 in bitmap if page is zero-filled
> > >
> > > At the moment of pageout(), we can be confident that the page is zero-filled.
> > >
> > > mapping to zeropage during unmap seems quite risky.
> >
> > You have to unmap and flush to stop modifications, but I think not in
> > all processes before it's safe to decide. Shared anon pages have COW
> > semantics; when you enter try_to_unmap() with a page and rmap gives
> > you a pte, it's one of these:
> >
> >   a) never forked, no sibling ptes
> >   b) cow broken into private copy, no sibling ptes
> >   c) cow/WP; any writes to this or another pte will go to a new page.
> >
> > In cases a and b you need to unmap and flush the current pte, but then
> > it's safe to check contents and set the zero pte right away, even
> > before finishing the rmap walk.
> >
> > In case c, modifications to the page are impossible due to WP, so you
> > don't even need to unmap and flush before checking the contents. The
> > pte lock holds up COW breaking to a new page until you're done.
> >
> > It's definitely more complicated than the current implementation, but
> > if it can be made to work, we could get rid of the bitmap.
> >
> > You might also reduce faults, but I'm a bit skeptical. Presumably
> > zerofilled regions are mostly considered invalid by the application,
> > not useful data, so a populating write that will cowbreak seems more
> > likely to happen next than a faultless read from the zeropage.
>
> Yes. That is right.
>
> I created the following debug patch to count the proportional distribution
> of zero_swpin reads, as well as the comparison between zero_swpin and
> zero_swpout:
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f70d0958095c..ed9d1a6cc565 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -136,6 +136,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 SWAP_RA_HIT,
>                 SWPIN_ZERO,
>                 SWPOUT_ZERO,
> +               SWPIN_ZERO_READ,
>  #ifdef CONFIG_KSM
>                 KSM_SWPIN_COPY,
>  #endif
> diff --git a/mm/memory.c b/mm/memory.c
> index f3040c69f648..3aacfbe7bd77 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4400,6 +4400,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                 /* Count SWPIN_ZERO since page_io was skipped */
>                                 objcg = get_obj_cgroup_from_swap(entry);
>                                 count_vm_events(SWPIN_ZERO, 1);
> +                               count_vm_events(SWPIN_ZERO_READ, 1);
>                                 if (objcg) {
>                                         count_objcg_events(objcg, SWPIN_ZERO, 1);
>                                         obj_cgroup_put(objcg);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4d016314a56c..9465fe9bda9e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1420,6 +1420,7 @@ const char * const vmstat_text[] = {
>         "swap_ra_hit",
>         "swpin_zero",
>         "swpout_zero",
> +       "swpin_zero_read",
>  #ifdef CONFIG_KSM
>         "ksm_swpin_copy",
>  #endif
>
>
> For a kernel-build workload in a single memcg with only 1GB of memory, use
> the script below:
>
> #!/bin/bash
>
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>
> vmstat_path="/proc/vmstat"
> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>
> read_values() {
>     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
>     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
>     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
>     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
>     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
>     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
>     swpin_zero_read=$(grep "swpin_zero_read" $vmstat_path | awk '{print $2}')
>
>     echo "$pswpin $pswpout $pgpgin $pgpgout $swpout_zero $swpin_zero $swpin_zero_read"
> }
>
> for ((i=1; i<=5; i++))
> do
>   echo
>   echo "*** Executing round $i ***"
>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>   sync; echo 3 > /proc/sys/vm/drop_caches; sleep 1
>   #kernel build
>   initial_values=($(read_values))
>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
>   final_values=($(read_values))
>
>   echo "pswpin: $((final_values[0] - initial_values[0]))"
>   echo "pswpout: $((final_values[1] - initial_values[1]))"
>   echo "pgpgin: $((final_values[2] - initial_values[2]))"
>   echo "pgpgout: $((final_values[3] - initial_values[3]))"
>   echo "swpout_zero: $((final_values[4] - initial_values[4]))"
>   echo "swpin_zero: $((final_values[5] - initial_values[5]))"
>   echo "swpin_zero_read: $((final_values[6] - initial_values[6]))"
> done
>
>
> The results I am seeing are as follows:
>
> real    6m43.998s
> user    47m3.800s
> sys     5m7.169s
> pswpin: 342041
> pswpout: 1470846
> pgpgin: 11744932
> pgpgout: 14466564
> swpout_zero: 318030
> swpin_zero: 93621
> swpin_zero_read: 13118
>
> The proportion of zero_swpout is quite large (> 10%): 318,030 vs. 1,470,846.
> The percentage is 17.8% = 318,030 / (318,030 + 1,470,846).
>
> About 29.4% (93,621 / 318,030) of these will be swapped in, and 14% of those
> zero_swpin pages are read (13,118 / 93,621).
>
> Therefore, a total of 17.8% * 29.4% * 14% = 0.73% of all swapped-out pages
> will be re-mapped to zero_pfn, potentially saving up to 0.73% RSS in this
> kernel-build workload.
> Thus, the total build time of my final results falls

Apologies for the mistake in my math. I shouldn't have used swpout as the
denominator and swpin as the numerator. Instead, both the numerator and
denominator should be based on swpin.

Potentially, 13,118 swpin_zero_read / (342,041 pswpin + 13,118 swpin_zero_read)
could be saved for swap-in, meaning 3.7% of all swap-ins can be saved using
zero_pfn.

> within the testing jitter range, showing no noticeable difference while
> the conceptual model code with lots of zero-filled pages and read swap-in
> shows significant differences.

Although 3.7% of swap-ins can be saved, my X86 PC is too weak to
demonstrate the differences, as numerous factors - such as temperature and
unstable I/O latency - can impact the final build time.

Hopefully, others can share test results conducted on stable hardware and
more workloads.

>
> I'm not sure if we can identify another real workload with more read swpin
> to observe noticeable improvements. Perhaps Usama has some?
>

Thanks
Barry
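
The ratios discussed in this thread can be recomputed from the counter
deltas of the reported kernel-build run. What follows is only a minimal
sketch: the percent() helper is a hypothetical name introduced for
illustration, the numbers are pasted in by hand rather than read from
/proc/vmstat, and the denominator for the last figure follows the corrected
formula in this message (pswpin + swpin_zero_read).

```shell
#!/bin/bash
# Counter deltas copied by hand from the kernel-build run in this thread.
pswpin=342041
pswpout=1470846
swpout_zero=318030
swpin_zero=93621
swpin_zero_read=13118

# percent NUM DEN: print 100 * NUM / DEN with one decimal place.
# (Hypothetical helper, not part of the original test script.)
percent() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

# Share of swap-outs that were zero-filled.
zero_out_share=$(percent "$swpout_zero" $((swpout_zero + pswpout)))
# Share of zero-filled swap-ins that were triggered by read faults.
read_share=$(percent "$swpin_zero_read" "$swpin_zero")
# Corrected figure: zero-page read swap-ins relative to swap-ins.
saved_share=$(percent "$swpin_zero_read" $((pswpin + swpin_zero_read)))

echo "zero-filled share of swap-outs: ${zero_out_share}%"
echo "read share of swpin_zero:       ${read_share}%"
echo "swap-ins served by zero_pfn:    ${saved_share}%"
```

Both numerator and denominator of the last figure are swap-in counters,
which is the point of the correction; whether swpin_zero should also be part
of the denominator depends on how the counters are accounted and is left
open here.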