From: Barry Song <21cnbao@gmail.com>
Date: Tue, 29 Oct 2024 06:11:27 +0800
Subject: Re: [PATCH RFC] mm: count zeromap read and set for swapout and swapin
To: Usama Arif
Cc: Yosry Ahmed, Nhat Pham, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Chengming Zhou, Johannes Weiner, David Hildenbrand, Hugh Dickins, Matthew Wilcox, Shakeel Butt, Andi Kleen, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, joshua.hahnjy@gmail.com

On Tue, Oct 29, 2024 at 5:49 AM Usama Arif wrote:
>
>
>
> On 28/10/2024 21:40, Barry Song wrote:
> > On Tue, Oct 29, 2024 at 5:24 AM Usama Arif wrote:
> >>
> >>
> >>
> >> On 28/10/2024 21:15, Barry Song wrote:
> >>> On Tue, Oct 29,
> >>> 2024 at 4:51 AM Usama Arif wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 28/10/2024 20:42, Barry Song wrote:
> >>>>> On Tue, Oct 29, 2024 at 4:00 AM Usama Arif wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 28/10/2024 19:54, Barry Song wrote:
> >>>>>>> On Tue, Oct 29, 2024 at 1:20 AM Usama Arif wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 28/10/2024 17:08, Yosry Ahmed wrote:
> >>>>>>>>> On Mon, Oct 28, 2024 at 10:00 AM Usama Arif wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 28/10/2024 16:33, Nhat Pham wrote:
> >>>>>>>>>>> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I wonder if instead of having counters, it might be better to keep track
> >>>>>>>>>>>> of the number of zeropages currently stored in zeromap, similar to how
> >>>>>>>>>>>> zswap_same_filled_pages did it. It will be more complicated than this
> >>>>>>>>>>>> patch, but would give more insight into the current state of the system.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Joshua (in CC) was going to have a look at that.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't think one can substitute for the other.
> >>>>>>>>>>
> >>>>>>>>>> Yes agreed, they have separate uses and provide different information, but
> >>>>>>>>>> it is maybe wasteful to have both types of counters? They are counters so maybe
> >>>>>>>>>> they don't consume too many resources, but I think we should still think about
> >>>>>>>>>> it.
> >>>>>>>>>
> >>>>>>>>> Not for or against here, but I would say that statement is debatable
> >>>>>>>>> at best for memcg stats :)
> >>>>>>>>>
> >>>>>>>>> Each new counter consumes 2 longs per-memcg per-CPU (see
> >>>>>>>>> memcg_vmstats_percpu), about 16 bytes, which is not a lot but it can
> >>>>>>>>> quickly add up with a large number of CPUs/memcgs/stats.
> >>>>>>>>>
> >>>>>>>>> Also, when flushing the stats we iterate all of them to propagate
> >>>>>>>>> updates from per-CPU counters.
> >>>>>>>>> This is already a slowpath so adding
> >>>>>>>>> one stat is not a big deal, but again because we iterate all stats on
> >>>>>>>>> multiple CPUs (and sometimes on each node as well), the overall flush
> >>>>>>>>> latency becomes a concern sometimes.
> >>>>>>>>>
> >>>>>>>>> All of that is not to say we shouldn't add more memcg stats, but we
> >>>>>>>>> have to be mindful of the resources.
> >>>>>>>>
> >>>>>>>> Yes agreed! Plus the cost of incrementing similar counters (which of course is
> >>>>>>>> also not much).
> >>>>>>>>
> >>>>>>>> Not trying to block this patch in any way. I just think it's a good point
> >>>>>>>> to discuss here whether we are ok with both types of counters and, if it's too
> >>>>>>>> wasteful, which one we should have.
> >>>>>>>
> >>>>>>> Hi Usama,
> >>>>>>> my point is that with all three of the counters below:
> >>>>>>> 1. PSWPIN/PSWPOUT
> >>>>>>> 2. ZSWPIN/ZSWPOUT
> >>>>>>> 3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)
> >>>>>>>
> >>>>>>> Shouldn't we have been able to determine the portion of zeromap
> >>>>>>> swap indirectly?
> >>>>>>>
> >>>>>>
> >>>>>> Hmm, I might be wrong, but I would have thought no?
> >>>>>>
> >>>>>> What if you swap out a zero folio, but then discard it?
> >>>>>> zeromap_swpout would be incremented, but zeromap_swpin would not.
> >>>>>
> >>>>> I understand. It looks like we have two issues to tackle:
> >>>>> 1. We shouldn't let zeromap swap in or out anything that vanishes into
> >>>>> a black hole
> >>>>> 2. We want to find out how much I/O/memory has been saved due to zeromap so far
> >>>>>
> >>>>> From my perspective, issue 1 requires a "fix", while issue 2 is more
> >>>>> of an optimization.
> >>>>
> >>>> Hmm I don't understand why point 1 would be an issue.
> >>>>
> >>>> If it's discarded that's fine as far as I can see.
> >>>
> >>> It is fine for you, and probably for me, since we know zeromap well :-) but
> >>> any userspace code like the one below might be entirely confused:
> >>>
> >>> p = malloc(1G);
> >>> write zeros to p (or to part of p);
> >>> madvise(p, 1G, MADV_PAGEOUT);
> >>> read p to swap it back in.
> >>>
> >>> The entire procedure used to involve 1GB of swap out and 1GB of swap in by any
> >>> means. Now, 0 swaps are recorded.
> >>>
> >>> I don't expect userspace is as smart as you :-)
> >>>
> >> Ah I completely agree, we need to account for it in some metric. I probably
> >> misunderstood when you said "We shouldn't let zeromap swap in or out anything that
> >> vanishes into a black hole", taking it to mean we should not have the zeromap
> >> optimization for those cases. What I guess you meant is we need to account for
> >> it in some metric.
> >>
> >>>>
> >>>> As a reference, memory.stat.zswapped != memory.stat.zswpout - memory.stat.zswpin.
> >>>> That is because zswapped takes into account swapped-out anon memory being freed,
> >>>> MADV_FREE, shmem truncate, etc., as Yosry said about zeromap, but zswpout and
> >>>> zswpin don't.
> >>>
> >>> I understand. However, I believe what we really need to focus on is
> >>> this: if we’ve
> >>> swapped out, for instance, 100GB in the past hour, how much of that 100GB is
> >>> zero? This information can help us assess the proportion of zero data in the
> >>> workload, along with the potential benefits that zeromap can provide for memory,
> >>> I/O space, or read/write operations. Additionally, having the second count
> >>> can enhance accuracy when considering MADV_DONTNEED, FREE, TRUNCATE,
> >>> and so on.
> >>>
> >> Yes completely agree!
> >>
> >> I think we can look into adding all three metrics, zeromap_swapped, zeromap_swpout,
> >> zeromap_swpin (or whatever name works).
> >
> > It's great to reach an agreement. Let me work on some patches for it.
> Thanks!
>
> >
> > By the way, I recently had an idea: if we can conduct the zeromap check
> > earlier - for example - before allocating swap slots and pageout(), could
> > we completely eliminate swap slot occupation and allocation/release
> > for zeromap data? For example, we could use a special swap
> > entry value in the PTE to indicate zero content and directly fill it with
> > zeros when swapping back. We've observed that swap slot allocation and
> > freeing can consume a lot of CPU and slow down functions like
> > zap_pte_range and swap-in. If we can entirely skip these steps, it
> > could improve performance. However, I'm uncertain about the benefits we
> > would gain if we only have 1-2% zeromap data.
>
> If I remember correctly this was one of the ideas floated around in the
> initial version of the zeromap series, but it was evaluated as a lot more
> complicated to do than what the current zeromap code looks like. But I
> think it's definitely worth looking into!

Sorry for the noise. I didn't review the initial discussion. But my feeling
is that it might be valuable considering the report from Zhiguo:

https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/

In fact, our recent benchmark also indicates that swap free could account
for a significant portion of do_swap_page().

>
> >
> > I'm just putting this idea out there to see if you're interested in moving
> > forward with it. :-)
> >
> >>
> >>>>
> >>>>
> >>>>>
> >>>>> I consider issue 1 to be more critical because, after observing a phone
> >>>>> running for some time, I've been able to roughly estimate the portion
> >>>>> zeromap can
> >>>>> help save using only PSWPOUT, ZSWPOUT, and SWAPOUT_SKIP, even without a
> >>>>> SWPIN counter. However, I agree that issue 2 still holds significant value
> >>>>> as a separate patch.
> >>>>>
> >>>
> >

Thanks
Barry
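
For reference, the rough estimate mentioned in the quoted text above (the
share of zero-filled pages derived from PSWPOUT, ZSWPOUT and SWAPOUT_SKIP)
could be sketched as below. This is only a minimal sketch and not the
calculation that was actually run on the phone: the function name
zero_share and the sample values are made up for illustration, and it
assumes the three counters are disjoint, i.e. every swapped-out page is
counted by exactly one of them.

```python
def zero_share(pswpout: int, zswpout: int, swapout_skip: int) -> float:
    """Return the fraction of swapped-out pages that were zero-filled.

    Assumption (not confirmed by the thread): each swapped-out page is
    counted by exactly one of the three counters, so their sum is the
    total number of swap-outs.
    """
    total = pswpout + zswpout + swapout_skip
    if total == 0:
        # Nothing has been swapped out yet, so there is no share to report.
        return 0.0
    return swapout_skip / total


# Hypothetical counter values, for illustration only.
share = zero_share(pswpout=900_000, zswpout=80_000, swapout_skip=20_000)
print(f"{share:.1%} of swapped-out pages were zero-filled")
```

If a page can be counted by more than one counter (for example, if zswap
writeback also bumps PSWPOUT), the result is only an approximation. It
also says nothing about how many of those pages are still swapped out,
which is what a separate "currently in zeromap" metric would cover.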