From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nhat Pham <nphamcs@gmail.com>
Date: Tue, 2 Jan 2024 15:27:03 -0800
Subject: Re: [External] Re: [PATCH] mm: zswap: fix the lack of page lru flag in zswap_writeback_entry
To: Zhongkun He
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, yosryahmed@google.com, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Chris Li
References: <20231024142706.195517-1-hezhongkun.hzk@bytedance.com>
On Tue, Jan 2, 2024 at 3:39 AM Zhongkun He wrote:
>
> On Sat, Dec 30, 2023 at 10:09 AM Nhat Pham wrote:
> >
> > On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He wrote:
> >
> > My apologies for the delayed response. I have a couple of questions.
> >
> > > The zswap_writeback_entry() will add a page to the swap cache, decompress
> > > the entry data into the page, and issue a bio write to write the page back
> > > to the swap device. Move the page to the tail of the lru list through
> > > SetPageReclaim(page) and folio_rotate_reclaimable().
> > >
> > > Currently, about half of the pages will fail to move to the tail of the lru
> >
> > May I ask what's the downstream effect of this? I.e., so what if it
> > fails? And yes, as Andrew pointed out, it'd be nice if the patch
> > changelog spells out any observable or measurable change from the
> > user's POV.
> >
>
> The swap cache page used to decompress the zswap_entry should be
> moved to the tail of the inactive list after end_writeback, so we can
> release it in time, just like the following code in zswap_writeback_entry():
>
> /* move it to the tail of the inactive list after end_writeback */
> SetPageReclaim(page);
>
> After the writeback is over, folio_rotate_reclaimable() will fail,
> because the page is not in the LRU list but in one of the cpu folio
> batches. Therefore we did not achieve the goal of setting
> SetPageReclaim(page), and the pages cannot be freed in time.
>
> > > list because there is no LRU flag in the page, which is not in the LRU
> > > list but in the cpu_fbatches. So fix it.
> >
> > This sentence is a bit confusing to me. Does this mean the page
> > currently being processed for writeback is not in the LRU list
> > (!PageLRU(page)), but IN one of the cpu folio batches? Which makes
> > folio_rotate_reclaimable() fail on this page later on in the
> > _swap_writepage() path? (Hence the necessity of lru_add_drain()?)
> >
>
> Yes, exactly.
>
> > Let me know if I'm misunderstanding the intention of this patch. I
> > know it's a bit pedantic, but spelling things out (ideally in the
> > changelog itself) will help the reviewers, as well as future
> > contributors who want to study the codebase and make changes to it.
> >
>
> Sorry, my bad.
>
> > > Signed-off-by: Zhongkun He
> >
> > Thanks and look forward to your response,
> > Nhat
> >
> > P/S: Have a nice holiday season and happy new year!
>
> Here are the steps and results of the performance test:
> 1: zswap + zram (simplified model with no IO)
> 2: disable zswap/parameters/same_filled_pages_enabled (stress has same pages)
> 3: stress --vm 1 --vm-bytes 2g --vm-hang 0 (2Gi anon pages)
> 4: In order to quickly release the zswap_entry, I used the previous
> patch (I will send it again later):
> https://lore.kernel.org/all/20231025095248.458789-1-hezhongkun.hzk@bytedance.com/
>
> Performance result:
> reclaim 1Gi of zswap_entry
>
> time echo 1 > writeback_time_threshold
> (will release the zswap_entry not accessed for more than 1 second)
>
>         Base             With this patch
> real    0m1.015s         real  0m1.043s
> user    0m0.000s         user  0m0.001s
> sys     0m1.013s         sys   0m1.040s
>
> So no obvious performance regression was found.

That's around a 2.7% increase in real time, no? Admittedly, this
micro-benchmark is too small to conclude either way, but the data
doesn't seem to be in your favor.

I'm a bit concerned about the overhead here, given that with this patch
we will drain the per-cpu batch on every written-back entry. That's
quite a high frequency, especially since we're moving towards more
writeback (either with the new zswap shrinker, or your time
threshold-based writeback mechanism). For instance, there seems to be
some (local/per-cpu) locking involved, no? Could there be some form of
lock contention there (especially since with the new shrinker, you can
have a lot of concurrent writeback contexts)?
Furthermore, note that a writeback from zswap to swap saves less memory
than a writeback from memory to swap, so the effect of the extra
overhead will be even more pronounced. That is, the amount of extra work
done (from this change) to save one unit of memory would be even larger
than if we called lru_add_drain() every time we swap out a page (from
memory -> swap). And I'm pretty sure we don't call lru_add_drain() every
time we swap out a page - I believe we call lru_add_drain() every time
we perform a shrink action, e.g. in shrink_inactive_list(). That's much
coarser in granularity than here.

Also, IIUC, the more often we perform lru_add_drain(), the less batching
effect we will obtain. IOW, the overhead of maintaining the batch
becomes higher relative to the performance gains from batching.

Maybe I'm missing something - could you walk me through how
lru_add_drain() is fine here, from this POV? Thanks!

>
> After writeback, we perform the following steps to release the memory again:
> echo 1g > memory.reclaim
>
> Base:
>          total    used   reclaim    total    used
> Mem:     38Gi     2.5Gi   ---->     38Gi     1.5Gi
> Swap:    5.0Gi    1.0Gi   ---->     5Gi      1.5Gi
>
> Used memory -1Gi, swap +0.5Gi. It means that half of the pages failed
> to move to the tail of the lru list, so we need to release an
> additional 0.5Gi of anon pages to swap space.
>
> With this patch:
>          total    used   reclaim    total    used
> Mem:     38Gi     2.6Gi   ---->     38Gi     1.6Gi
> Swap:    5.0Gi    1.0Gi   ---->     5Gi      1Gi
>
> Used memory -1Gi, swap +0Gi. It means that we release all the pages
> which have been added to the tail of the lru list in
> zswap_writeback_entry() and folio_rotate_reclaimable().
>

OTOH, this suggests that we're onto something. Swap usage seems to
decrease quite a bit. Sounds like a real problem that this patch is
tackling. (Please add this benchmark result to future changelog. It'll
help demonstrate the problem.)
I'm inclined to ack this patch, but it'd be nice if you could assuage my
concerns above (with some justification and/or a larger benchmark).

(Or perhaps we have to drain, but less frequently/higher up the stack?)

Thanks,
Nhat

> Thanks for your time Nhat and Andrew. Happy New Year!