From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 29 Mar 2024 13:37:59 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed
Cc: Nhat Pham, Andrew Morton, Chengming Zhou, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling
Message-ID: <20240329173759.GI7597@cmpxchg.org>
References: <20240325235018.2028408-1-yosryahmed@google.com>
 <20240325235018.2028408-7-yosryahmed@google.com>
 <20240328193149.GF7597@cmpxchg.org>
 <20240328210709.GH7597@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

On Thu, Mar 28, 2024 at 09:27:17PM -0700, Yosry Ahmed wrote:
> On Thu, Mar 28, 2024 at 7:05 PM Yosry
> Ahmed wrote:
> >
> > On Thu, Mar 28, 2024 at 4:19 PM Nhat Pham wrote:
> > >
> > > On Thu, Mar 28, 2024 at 2:07 PM Johannes Weiner wrote:
> > > >
> > > > On Thu, Mar 28, 2024 at 01:23:42PM -0700, Yosry Ahmed wrote:
> > > > > On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner wrote:
> > > > > >
> > > > > > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > > > > > The current same-filled pages handling supports pages filled
> > > > > > > with any repeated word-sized pattern. However, in practice,
> > > > > > > most of these should be zero pages anyway. Other patterns
> > > > > > > should be much less common.
> > > > > > >
> > > > > > > Drop the support for non-zero same-filled pages, but keep the
> > > > > > > names of knobs exposed to userspace as "same_filled", which
> > > > > > > isn't entirely inaccurate.
> > > > > > >
> > > > > > > This yields some nice code simplification and enables a
> > > > > > > following patch that eliminates the need to allocate struct
> > > > > > > zswap_entry for those pages completely.
> > > > > > >
> > > > > > > There is also a very small performance improvement observed
> > > > > > > over 50 runs of the kernel build test (kernbench) comparing
> > > > > > > the mean build time on a Skylake machine when building the
> > > > > > > kernel in a cgroup v1 container with a 3G limit:
> > > > > > >
> > > > > > >         base      patched   % diff
> > > > > > > real    70.167    69.915    -0.359%
> > > > > > > user    2953.068  2956.147  +0.104%
> > > > > > > sys     2612.811  2594.718  -0.692%
> > > > > > >
> > > > > > > This probably comes from more optimized operations like
> > > > > > > memchr_inv() and clear_highpage(). Note that the percentage of
> > > > > > > zero-filled pages during this test was only around 1.5% on
> > > > > > > average, and was not affected by this patch. Practical
> > > > > > > workloads could have a larger proportion of such pages (e.g.
> > > > > > > Johannes observed around 10% [1]), so the performance
> > > > > > > improvement should be larger.
> > > > > > >
> > > > > > > [1]https://lore.kernel.org/linux-mm/20240320210716.GH294822@cmpxchg.org/
> > > > > > >
> > > > > > > Signed-off-by: Yosry Ahmed
> > > > > >
> > > > > > This is an interesting direction to pursue, but I actually think
> > > > > > it doesn't go far enough. Either way, I think it needs more data.
> > > > > >
> > > > > > 1) How frequent are non-zero-same-filled pages? Difficult to
> > > > > >    generalize, but if you could gather some from your fleet,
> > > > > >    that would be useful. If you can devise a portable strategy,
> > > > > >    I'd also be more than happy to gather this on ours (although
> > > > > >    I think you have more widespread zswap use, whereas we have
> > > > > >    more disk swap.)
> > > > >
> > > > > I am trying to collect the data, but there are... hurdles. It
> > > > > would take some time, so I was hoping the data could be collected
> > > > > elsewhere if possible.
> > > > >
> > > > > The idea I had was to hook a BPF program to the entry of
> > > > > zswap_fill_page() and create a histogram of the "value" argument.
> > > > > We would get more coverage by hooking it to the return of
> > > > > zswap_is_page_same_filled() and only updating the histogram if the
> > > > > return value is true, as it includes pages in zswap that haven't
> > > > > been swapped in.
> > > > >
> > > > > However, with zswap_is_page_same_filled() the BPF program will run
> > > > > in all zswap stores, whereas for zswap_fill_page() it will only
> > > > > run when needed. Not sure if this makes a practical difference
> > > > > tbh.
> > > > >
> > > > > > 2) The fact that we're doing any of this pattern analysis in
> > > > > >    zswap at all strikes me as a bit misguided. Being efficient
> > > > > >    about repetitive patterns is squarely in the domain of a
> > > > > >    compression algorithm. Do we not trust e.g. zstd to handle
> > > > > >    this properly?
> > > > >
> > > > > I thought about this briefly, but I didn't follow through.
> > > > > I could try to collect some data by swapping out different
> > > > > patterns and observing how different compression algorithms
> > > > > react. That would be interesting for sure.
> > > > >
> > > > > > I'm guessing this goes back to inefficient packing from
> > > > > > something like zbud, which would waste half a page on one
> > > > > > repeating byte.
> > > > > >
> > > > > > But zsmalloc can do 32-byte objects. It's also a batching slab
> > > > > > allocator, where storing a series of small, same-sized objects
> > > > > > is quite fast.
> > > > > >
> > > > > > Add to that the additional branches, the additional kmap, the
> > > > > > extra scanning of every single page for patterns - all in the
> > > > > > fast path of zswap, when we already know that the vast majority
> > > > > > of incoming pages will need to be properly compressed anyway.
> > > > > >
> > > > > > Maybe it's time to get rid of the special handling entirely?
> > > > >
> > > > > We would still be wasting some memory (~96 bytes between
> > > > > zswap_entry and the zsmalloc object), and wasting cycles
> > > > > allocating them. This could be made up for by the cycles saved by
> > > > > removing the handling. We will be saving some branches for sure.
> > > > > I am not worried about kmap as I think it's a noop in most cases.
> > > >
> > > > Yes, true.
> > > >
> > > > > I am interested to see how much we could save by removing the
> > > > > scanning for patterns. We may not save much if we abort after
> > > > > reading a few words in most cases, but I guess we could also be
> > > > > scanning a considerable amount before aborting. On the other
> > > > > hand, we would be reading the page contents into cache anyway for
> > > > > compression, so maybe it doesn't really matter?
> > > > >
> > > > > I will try to collect some data about this. I will start by
> > > > > trying to find out how the compression algorithms handle
> > > > > same-filled pages.
> > > > > If they can compress it efficiently, then I will try to get more
> > > > > data on the tradeoff from removing the handling.
> > > >
> > > > I do wonder if this could be overthinking it, too.
> > > >
> > > > Double checking the numbers on our fleet, 96 additional bytes for
> > > > each same-filled entry would result in a
> > > >
> > > > 1) p50 waste of 0.008% of total memory, and a
> > > >
> > > > 2) p99 waste of 0.06% of total memory.
> >
> > Right. Assuming the compressors do not surprise us and store
> > same-filled pages in an absurd way, it's not worth it in terms of
> > memory savings.
> >
> > > > And this is without us having even thought about trying to make
> > > > zsmalloc more efficient for this particular usecase - which might
> > > > be the better point of attack, if we think it's actually worth it.
> > > >
> > > > So my take is that unless removing it would be outright horrible
> > > > from a %sys POV (which seems pretty unlikely), IMO it would be fine
> > > > to just delete it entirely with a "not worth the maintenance cost"
> > > > argument.
> > > >
> > > > If you turn the argument around, and somebody were to submit the
> > > > code as it is today, with the numbers being what they are above,
> > > > I'm not sure we would even accept it!
> > >
> > > The context guy is here :)
> > >
> > > Not arguing for one way or another, but I did find the original
> > > patch that introduced same-filled page handling:
> > >
> > > https://github.com/torvalds/linux/commit/a85f878b443f8d2b91ba76f09da21ac0af22e07f
> > >
> > > https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/T/#u
> >
> > Thanks for digging this up.
> > I don't know why I didn't start there :)
> >
> > Following in your footsteps, and given that zram has the same
> > feature, I found the patch that added support for non-zero
> > same-filled pages in zram:
> >
> > https://lore.kernel.org/all/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com/#t
> >
> > Both of them confirm that most same-filled pages are zero pages, but
> > they show a more significant portion of same-filled pages being
> > non-zero (17% to 40%). I suspect this will be less in data centers
> > compared to consumer apps.
> >
> > The zswap patch also reports significant performance improvements
> > from the same-filled handling, but this is with 17-22% same-filled
> > pages. Johannes mentioned around 10% in your data centers, so the
> > performance improvement would be less. In the kernel build tests I
> > ran with only around 1.5% same-filled pages, I observed a 1.4%
> > improvement just by optimizing them (only zero-filled, skipping
> > allocations).
> >
> > So I think removing the same-filled pages handling completely may be
> > too aggressive, because it doesn't only affect memory efficiency,
> > but also the cycles spent handling those pages. Just avoiding going
> > through the allocator and compressor has to account for something :)
>
> Here is another data point. I tried removing the same-filled handling
> code completely with the diff Johannes sent upthread. I saw a 1.3%
> improvement in the kernel build test, very similar to the improvement
> from this patch series. _However_, the kernel build test only produces
> ~1.5% zero-filled pages in my runs. More realistic workloads have
> significantly higher percentages, as demonstrated upthread.
>
> In other words, the kernel build test (at least in my runs) seems to
> be the worst case scenario for same-filled/zero-filled pages.
> Since the improvement from removing same-filled handling is quite
> small in this case, I suspect there will be no improvement, but
> possibly a regression, on real workloads.
>
> As the zero-filled pages ratio increases:
> - The performance with this series will improve.
> - The performance with removing same-filled handling completely will
>   become worse.

Sorry, this thread is still really lacking practical perspective. As
do the numbers that initially justified the patch.

Sure, the stores of same-filled pages are faster. What's the cost of
prechecking 90% of the other pages that need compression?

Also, this is the swap path we're talking about. There is vmscan, swap
slot allocations, page table walks, TLB flushes, zswap tree inserts;
then a page fault and everything in reverse.

I perf'd zswapping out data that is 10% same-filled and 90% data that
always needs compression. It does nothing but madvise(MADV_PAGEOUT),
and the zswap_store() stack is already only ~60% of the cycles.

Using zsmalloc + zstd, this is the diff between vanilla and my patch:

# Baseline  Delta Abs  Shared Object      Symbol
# ........  .........  .................  ..............................
#
     4.34%     -3.02%  [kernel.kallsyms]  [k] zswap_store
    11.07%     +1.41%  [kernel.kallsyms]  [k] ZSTD_compressBlock_doubleFast
    15.55%     +0.91%  [kernel.kallsyms]  [k] FSE_buildCTable_wksp

As expected, we have to compress a bit more; on the other hand we're
removing the content scan for same-filled for the 90% of pages that
don't benefit from it. They almost amortize each other. Let's round it
up and the remaining difference is ~1%.

It's difficult to make the case that this matters to any real workload
with actual think time in between paging.

But let's say you do make the case that zero-filled pages are worth
optimizing for. Why is this in zswap? Why not do it in vmscan with a
generic zero-swp_entry_t, and avoid the swap backend altogether?
No swap slot allocation, no zswap tree, no *IO on disk swap*.

However you slice it, I fail to see how this has a place in zswap.
It's trying to optimize the slow path of a slow path, at the wrong
layer of the reclaim stack.