From: Usama Arif
Date: Mon, 28 Oct 2024 21:49:31 +0000
Subject: Re: [PATCH RFC] mm: count zeromap read and set for swapout and swapin
To: Barry Song <21cnbao@gmail.com>
Cc: Yosry Ahmed, Nhat Pham, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Chengming Zhou, Johannes Weiner, David Hildenbrand, Hugh Dickins, Matthew Wilcox, Shakeel Butt, Andi Kleen, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, joshua.hahnjy@gmail.com
Message-ID: <03d4c776-4b2e-4f3d-94f0-9b716bfd74d2@gmail.com>
References: <20241027011959.9226-1-21cnbao@gmail.com> <678a1e30-4962-48de-b5cb-03a1b4b9db1b@gmail.com> <6303e3c9-85d5-40f5-b265-70ecdb02d5ba@gmail.com> <64f12abd-dde3-41a4-b694-cc42784217fb@gmail.com> <882008b6-13e0-41d8-91fa-f26c585120d8@gmail.com> <228c428d-d116-4be1-9d0d-0591667b7ccb@gmail.com>

On 28/10/2024 21:40, Barry Song wrote:
> On Tue, Oct 29, 2024 at 5:24 AM Usama Arif wrote:
>>
>> On 28/10/2024 21:15, Barry Song wrote:
>>> On Tue, Oct 29, 2024 at 4:51 AM Usama Arif wrote:
>>>>
>>>> On 28/10/2024 20:42, Barry Song wrote:
>>>>> On Tue, Oct 29, 2024 at 4:00 AM Usama Arif wrote:
>>>>>>
>>>>>> On 28/10/2024 19:54, Barry Song wrote:
>>>>>>> On Tue, Oct 29, 2024 at 1:20 AM Usama Arif wrote:
>>>>>>>>
>>>>>>>> On 28/10/2024 17:08, Yosry Ahmed wrote:
>>>>>>>>> On Mon, Oct 28, 2024 at 10:00 AM Usama Arif wrote:
>>>>>>>>>>
>>>>>>>>>> On 28/10/2024 16:33, Nhat Pham wrote:
>>>>>>>>>>> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I wonder if instead of having counters, it might be better to keep track
>>>>>>>>>>>> of the number of zeropages currently stored in zeromap, similar to how
>>>>>>>>>>>> zswap_same_filled_pages did it. It will be more complicated than this
>>>>>>>>>>>> patch, but would give more insight into the current state of the system.
>>>>>>>>>>>>
>>>>>>>>>>>> Joshua (in CC) was going to have a look at that.
>>>>>>>>>>>
>>>>>>>>>>> I don't think one can substitute for the other.
>>>>>>>>>>
>>>>>>>>>> Yes agreed, they have separate uses and provide different information, but
>>>>>>>>>> maybe wasteful to have both types of counters? They are counters so maybe
>>>>>>>>>> don't consume too many resources but I think we should still think about
>>>>>>>>>> it.
>>>>>>>>>
>>>>>>>>> Not for or against here, but I would say that statement is debatable
>>>>>>>>> at best for memcg stats :)
>>>>>>>>>
>>>>>>>>> Each new counter consumes 2 longs per-memcg per-CPU (see
>>>>>>>>> memcg_vmstats_percpu), about 16 bytes, which is not a lot but it can
>>>>>>>>> quickly add up with a large number of CPUs/memcgs/stats.
>>>>>>>>>
>>>>>>>>> Also, when flushing the stats we iterate all of them to propagate
>>>>>>>>> updates from per-CPU counters. This is already a slowpath so adding
>>>>>>>>> one stat is not a big deal, but again because we iterate all stats on
>>>>>>>>> multiple CPUs (and sometimes on each node as well), the overall flush
>>>>>>>>> latency becomes a concern sometimes.
>>>>>>>>>
>>>>>>>>> All of that is not to say we shouldn't add more memcg stats, but we
>>>>>>>>> have to be mindful of the resources.
>>>>>>>>
>>>>>>>> Yes agreed! Plus the cost of incrementing similar counters (which of course is
>>>>>>>> also not much).
>>>>>>>>
>>>>>>>> Not trying to block this patch in any way. Just think it's a good point
>>>>>>>> to discuss here if we are ok with both types of counters. If it's too wasteful
>>>>>>>> then which one we should have.
>>>>>>>
>>>>>>> Hi Usama,
>>>>>>> my point is that with all the below three counters:
>>>>>>> 1. PSWPIN/PSWPOUT
>>>>>>> 2. ZSWPIN/ZSWPOUT
>>>>>>> 3. SWAPIN_SKIP/SWAPOUT_SKIP or (ZEROSWPIN, ZEROSWPOUT, whatever)
>>>>>>>
>>>>>>> Shouldn't we have been able to determine the portion of zeromap
>>>>>>> swap indirectly?
>>>>>>>
>>>>>>
>>>>>> Hmm, I might be wrong, but I would have thought no?
>>>>>>
>>>>>> What if you swap out a zero folio, but then discard it?
>>>>>> zeromap_swpout would be incremented, but zeromap_swpin would not.
>>>>>
>>>>> I understand. It looks like we have two issues to tackle:
>>>>> 1. We shouldn't let zeromap swap in or out anything that vanishes into
>>>>> a black hole
>>>>> 2. We want to find out how much I/O/memory has been saved due to zeromap so far
>>>>>
>>>>> From my perspective, issue 1 requires a "fix", while issue 2 is more
>>>>> of an optimization.
>>>>
>>>> Hmm I don't understand why point 1 would be an issue.
>>>>
>>>> If it's discarded that's fine as far as I can see.
>>>
>>> it is fine to you and probably me who knows zeromap as well :-) but
>>> any userspace code as below might be entirely confused:
>>>
>>> p = malloc(1G);
>>> write p to 0; or write part of p to 0
>>> madv_pageout(p, 1g)
>>> read p to swapin.
>>>
>>> The entire procedure used to involve 1GB of swap out and 1GB of swap in by any
>>> means. Now, 0 swaps are recorded.
>>>
>>> I don't expect userspace is as smart as you :-)
>>>
>> Ah I completely agree, we need to account for it in some metric. I probably
>> misunderstood when you said "We shouldn't let zeromap swap in or out anything that
>> vanishes into a black hole" to mean we should not have the zeromap optimization for
>> those cases. What I guess you meant is we need to account for it in some metric.
>>
>>>>
>>>> As a reference, memory.stat.zswapped != memory.stat.zswpout - memory.stat.zswpin.
>>>> Because zswapped would take into account swapped out anon memory freed, MADV_FREE,
>>>> shmem truncate, etc. as Yosry said about zeromap, but zswpout and zswpin don't.
>>>
>>> I understand. However, I believe what we really need to focus on is
>>> this: if we’ve swapped out, for instance, 100GB in the past hour, how
>>> much of that 100GB is
>>> zero? This information can help us assess the proportion of zero data in the
>>> workload, along with the potential benefits that zeromap can provide for memory,
>>> I/O space, or read/write operations. Additionally, having the second count
>>> can enhance accuracy when considering MADV_DONTNEED, FREE, TRUNCATE,
>>> and so on.
>>>
>> Yes completely agree!
>>
>> I think we can look into adding all three metrics, zeromap_swapped, zeromap_swpout,
>> zeromap_swpin (or whatever name works).
>
> It's great to reach an agreement. Let me work on some patches for it.

Thanks!

>
> By the way, I recently had an idea: if we can conduct the zeromap check
> earlier - for example - before allocating swap slots and pageout(), could
> we completely eliminate swap slot occupation and allocation/release
> for zeromap data? For example, we could use a special swap
> entry value in the PTE to indicate zero content and directly fill it with
> zeros when swapping back. We've observed that swap slot allocation and
> freeing can consume a lot of CPU and slow down functions like
> zap_pte_range and swap-in. If we can entirely skip these steps, it
> could improve performance. However, I'm uncertain about the benefits we
> would gain if we only have 1-2% zeromap data.

If I remember correctly, this was one of the ideas floated in the initial
version of the zeromap series, but it was judged to be a lot more
complicated than the current zeromap code.

But I think it's definitely worth looking into!

>
> I'm just putting this idea out there to see if you're interested in moving
> forward with it. :-)
>
>>
>>>>
>>>>
>>>>>
>>>>> I consider issue 1 to be more critical because, after observing a phone
>>>>> running for some time, I've been able to roughly estimate the portion
>>>>> zeromap can
>>>>> help save using only PSWPOUT, ZSWPOUT, and SWAPOUT_SKIP, even without a
>>>>> SWPIN counter. However, I agree that issue 2 still holds significant value
>>>>> as a separate patch.
>>>>>
>>>
>
> Thanks
> Barry
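
For illustration only (a toy userspace model in Python, not kernel code and
not part of any patch): a minimal sketch of why a "currently stored" value
cannot be derived from the cumulative swap-out and swap-in counters once
swapped-out zero pages can be discarded. The class and its methods are
hypothetical, the metric names are the placeholders used in this thread, and
the model assumes that a swap-in releases the zeromap entry.

```python
from dataclasses import dataclass


@dataclass
class ZeromapStats:
    """Toy model of the three metrics discussed in this thread.

    All names are placeholders; nothing here mirrors real kernel code.
    """
    zeromap_swpout: int = 0   # cumulative: zero-filled pages swapped out
    zeromap_swpin: int = 0    # cumulative: zero-filled pages swapped back in
    zeromap_swapped: int = 0  # gauge: zero-filled pages currently in zeromap

    def swap_out(self, pages: int) -> None:
        # A zero-filled page is recorded in zeromap instead of being written.
        self.zeromap_swpout += pages
        self.zeromap_swapped += pages

    def swap_in(self, pages: int) -> None:
        # Assumption: the entry is released as soon as the page is read back.
        self.zeromap_swpin += pages
        self.zeromap_swapped -= pages

    def discard(self, pages: int) -> None:
        # MADV_DONTNEED, MADV_FREE, truncate, exit: the entry goes away
        # without a swap-in, so only the gauge changes.
        self.zeromap_swapped -= pages


stats = ZeromapStats()
stats.swap_out(100)
stats.swap_in(30)
stats.discard(50)

print(stats.zeromap_swapped)                       # 20
print(stats.zeromap_swpout - stats.zeromap_swpin)  # 70
```

In this model the difference of the two cumulative counters overstates what
is currently stored by exactly the number of discarded pages, which is the
same reason zswapped cannot be computed from zswpout and zswpin.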