From: Barry Song <21cnbao@gmail.com>
Date: Thu, 11 Apr 2024 19:49:19 +1200
Subject: Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
To: Sergey Senozhatsky
Cc: akpm@linux-foundation.org, minchan@kernel.org, linux-block@vger.kernel.org,
	axboe@kernel.dk, linux-mm@kvack.org, terrelln@fb.com, chrisl@kernel.org,
	david@redhat.com, kasong@tencent.com, yuzhao@google.com, yosryahmed@google.com,
	nphamcs@gmail.com, willy@infradead.org, hannes@cmpxchg.org, ying.huang@intel.com,
	surenb@google.com, wajdi.k.feghali@intel.com, kanchana.p.sridhar@intel.com,
	corbet@lwn.net, zhouchengming@bytedance.com, Tangquan Zheng, Barry Song
In-Reply-To: <20240411041429.GC8743@google.com>
References: <20240327214816.31191-1-21cnbao@gmail.com> <20240327214816.31191-3-21cnbao@gmail.com>
	<20240411014237.GB8743@google.com> <20240411041429.GC8743@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Thu, Apr 11, 2024 at 4:14 PM Sergey Senozhatsky wrote:
>
> On (24/04/11 14:03), Barry Song wrote:
> > > [..]
> > > >
> > > > +static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> > > > +                                               u32 index, int offset, struct bio *bio)
> > > > +{
> > > > +       struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> > > > +       int ret;
> > > > +       void *src, *dst;
> > > > +
> > > > +       if (!page)
> > > > +               return -ENOMEM;
> > > > +
> > > > +       ret = zram_read_multi_pages(zram, page, index, bio);
> > > > +       if (!ret) {
> > > > +               src = kmap_local_page(bvec->bv_page);
> > > > +               dst = kmap_local_page(page);
> > > > +               memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
> > > > +               kunmap_local(dst);
> > > > +               kunmap_local(src);
> > > > +
> > > > +               atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
> > > > +               ret = zram_write_page(zram, page, index);
> > > > +       }
> > > > +       __free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> > > > +       return ret;
> > > > +}
> > >
> > > What type of testing did you run on it? How often do you see partial
> > > reads and writes? Because this looks concerning - zsmalloc memory
> > > usage reduction is one metric, but it can also be achieved via
> > > recompression, writeback, or even a different compression algorithm,
> > > while higher CPU/power usage and higher requirements for physically
> > > contiguous pages cannot be offset easily. (Another corner case: assume
> > > we have partial read requests on every CPU simultaneously.)
> >
> > This question brings up an interesting observation. In our actual product,
> > we've noticed a success rate of over 90% when allocating large folios in
> > do_swap_page, but occasionally we encounter failures. In such cases,
> > instead of resorting to partial reads, we opt to allocate 16 small folios
> > and request zram to fill them all. This strategy reduces partial reads to
> > nearly zero. However, integrating this into the upstream codebase seems
> > like a considerable task, and for now it remains part of our out-of-tree
> > code[1], which is also open-source. We're gradually sending patches for
> > the swap-in process, systematically cleaning up the product's code.
>
> I see, thanks for the explanation.
> Does this sound like this series is ahead of its time?

I feel it is necessary to present the whole picture together with the large
folios swap-in series[1]. On the other hand, there is a possibility this can
land earlier, before everything else is ready, with the feature disabled by
default; platforms which have finely tuned partial read/write can then
enable it.

[1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
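
For illustration, a rough sketch of that fallback - hypothetical helper and
parameter names, not our actual out-of-tree code - could look like this:
when mTHP allocation fails, decompress the multi-page object once into a
temporary high-order buffer and scatter it into the nr_pages small folios:

/*
 * Sketch only, hypothetical names: on mTHP allocation failure, pay for
 * a single decompression of the whole multi-page object, then copy each
 * PAGE_SIZE chunk into one of the small folios allocated as fallback.
 */
static int zram_read_to_small_pages(struct zram *zram, u32 index,
				    struct page **pages, int nr_pages,
				    struct bio *bio)
{
	struct page *tmp = alloc_pages(GFP_NOIO | __GFP_COMP,
				       ZCOMP_MULTI_PAGES_ORDER);
	void *src, *dst;
	int i, ret;

	if (!tmp)
		return -ENOMEM;

	/* one decompression for the whole multi-page object */
	ret = zram_read_multi_pages(zram, tmp, index, bio);
	if (!ret) {
		/* same mapping pattern as the partial-write path above */
		src = kmap_local_page(tmp);
		for (i = 0; i < nr_pages; i++) {
			dst = kmap_local_page(pages[i]);
			memcpy(dst, src + i * PAGE_SIZE, PAGE_SIZE);
			kunmap_local(dst);
		}
		kunmap_local(src);
	}
	__free_pages(tmp, ZCOMP_MULTI_PAGES_ORDER);
	return ret;
}

The point is that zram_read_multi_pages() runs once regardless of whether
the destination is one mTHP or nr_pages small folios, so the decompression
cost stays the same as in the mTHP path.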
> > To enhance the success rate of large folio allocation, we've reserved some
> > page blocks for mTHP. This approach is currently absent from the mainline
> > codebase as well (Yu Zhao is trying to provide TAO [2]). Consequently, we
> > anticipate that partial reads may reach 50% or more until this method is
> > incorporated upstream.
>
> These partial reads/writes are difficult to justify - instead of doing
> comp_op(PAGE_SIZE) we, in the worst case, can now do ZCOMP_MULTI_PAGES_NR
> of comp_op(ZCOMP_MULTI_PAGES_ORDER) (assuming an access pattern that
> touches each of the multi-pages individually). That is a potentially huge
> increase in CPU/power usage, which cannot be easily sacrificed. In fact,
> I'd probably say that power usage is more important here than zspool
> memory usage (which we have means to deal with).

Once Ryan's mTHP swapout without splitting [2] is integrated into the
mainline, this patchset certainly gains an advantage for SWPOUT. For SWPIN,
however, the situation is more nuanced. There's a risk of failing to
allocate an mTHP, which results in the allocation of a small folio instead.
In that case, decompressing a large folio but copying only one subpage is
inefficient. In real-world products, we've addressed this challenge in two
ways:

1. We've enhanced the reserved page blocks for mTHP to boost allocation
   success rates.
2. Where we fail to allocate a large folio, we fall back to allocating
   nr_pages small folios instead of just one, so we still decompress the
   multi-page only once.

With these measures in place, we consistently achieve wins in both power
consumption and memory savings. However, these optimizations are specific
to our product, and there is still much work needed to upstream them all.

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/

>
> Have you evaluated power usage?
>
> I also wonder if it brings down the number of ZRAM_SAME pages. Suppose
> several pages out of ZCOMP_MULTI_PAGES_ORDER are filled with zeroes
> (or some other recognizable pattern) which previously would have been
> stored using just an unsigned long. Makes me even wonder if the ZRAM_SAME
> test makes sense on multi-pages at all, for that matter.

I don't think we need to worry about ZRAM_SAME. ARM64 supports 4KB, 16KB,
and 64KB base pages, and even if we configure the base page to 16KB or
64KB, we can still miss SAME pages that are identical at the 4KB level but
not at 16/64KB granularity. In our product, we continue to observe many
SAME pages with the multi-page mechanism. Even where we miss an opportunity
to identify same pages at the 4KB level, the compressed data remains
relatively small, though not as compact as a SAME_PAGE entry. Overall, on
typical 12GiB/16GiB phones, we still achieve a memory saving of around
800MiB with this patchset.

mTHP offers a means to emulate a 16KiB/64KiB base page while maintaining
software compatibility with a 4KiB base page. The primary concern here
lies in partial read/write operations. In our product, we've successfully
addressed these issues. However, convincing people in the mainline
community may take considerable time and effort :-)

Thanks
Barry