From: Nhat Pham <nphamcs@gmail.com>
Date: Fri, 30 Jan 2026 16:33:28 -0800
Subject: Re: [PATCH v14 25/26] mm: zswap: Store large folios in batches.
In-Reply-To: <20260125033537.334628-26-kanchana.p.sridhar@intel.com>
References: <20260125033537.334628-1-kanchana.p.sridhar@intel.com>
 <20260125033537.334628-26-kanchana.p.sridhar@intel.com>
To: Kanchana P Sridhar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
 yosry.ahmed@linux.dev, chengming.zhou@linux.dev, usamaarif642@gmail.com,
 ryan.roberts@arm.com, 21cnbao@gmail.com, ying.huang@linux.alibaba.com,
 akpm@linux-foundation.org, senozhatsky@chromium.org, sj@kernel.org,
 kasong@tencent.com, linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
 davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
 surenb@google.com, kristen.c.accardi@intel.com, vinicius.gomes@intel.com,
 giovanni.cabiddu@intel.com, wajdi.k.feghali@intel.com

On Sat, Jan 24, 2026 at 7:36 PM Kanchana P Sridhar wrote:
>
> Support batching when storing large folios in zswap. If the underlying
> compressor supports batching (e.g. hardware parallel compression),
> allocate multiple compression buffers, otherwise allocate one. The
> number of buffers is bounded by a new constant, ZSWAP_MAX_BATCH_SIZE, to
> limit the memory overhead. For existing software compressors, the only
> extra overhead is the extra 'buffers' pointer, so 8 bytes per-CPU on
> x86_64.
>
> Only the first buffer is currently used, but subsequent changes will use
> the remaining buffers for hardware compression batching.
>
> Regardless of compression batching, always process large folios in
> batches. For hardware compressors, the batch size is the compressor
> batch size, otherwise ZSWAP_MAX_BATCH_SIZE is used.
>
> zswap_store_page() is replaced with zswap_store_pages(), which processes
> a batch of pages and allows for batching optimizations. For now, only
> optimize allocating entries by using batch allocations from the slab
> cache.
>
> Since batch allocations do not support specifying a node id, store the
> node id in the zswap entry instead of relying on the zswap_entry being
> allocated on the same node. The size of the zswap_entry remains
> unchanged as 'referenced' is lumped in with the 'length' (as it doesn't
> need a full unsigned int anyway).
>
> Avoid repeatedly calling mem_cgroup_zswap_writeback_enabled() for every
> page and only call it once for the folio, since the entire folio is
> charged to a single memcg.
>
> Suggested-by: Nhat Pham
> Suggested-by: Yosry Ahmed
> Signed-off-by: Kanchana P Sridhar
> ---
>  mm/zswap.c | 351 +++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 248 insertions(+), 103 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0d56390342b7..6a22add63220 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -82,6 +82,11 @@ static bool zswap_pool_reached_full;
>
>  #define ZSWAP_PARAM_UNSET ""
>
> +/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +#define ZSWAP_ENTRY_SPARE_4BYTES 32U
> +#define ZSWAP_ENTRY_REF_BIT 1U
> +
>  static int zswap_setup(void);
>
>  /* Enable/disable zswap */
> @@ -139,7 +144,7 @@ struct crypto_acomp_ctx {
>         struct crypto_acomp *acomp;
>         struct acomp_req *req;
>         struct crypto_wait wait;
> -       u8 *buffer;
> +       u8 **buffers;
>         struct mutex mutex;
>  };
>
> @@ -148,6 +153,9 @@ struct crypto_acomp_ctx {
>   * The only case where lru_lock is not acquired while holding tree.lock is
>   * when a zswap_entry is taken off the lru for writeback, in that case it
>   * needs to be verified that it's still valid in the tree.
> + *
> + * @compr_batch_size: The max batch size of the compression algorithm,
> + * bounded by ZSWAP_MAX_BATCH_SIZE.
>   */
>  struct zswap_pool {
>         struct zs_pool *zs_pool;
> @@ -157,6 +165,7 @@ struct zswap_pool {
>         struct work_struct release_work;
>         struct hlist_node node;
>         char tfm_name[CRYPTO_MAX_ALG_NAME];
> +       u8 compr_batch_size;
>  };
>
>  /* Global LRU lists shared by all zswap pools. */
> @@ -181,6 +190,7 @@ static struct shrinker *zswap_shrinker;
>   * writeback logic. The entry is only reclaimed by the writeback
>   * logic if referenced is unset. See comments in the shrinker
>   * section for context.
> + * nid - NUMA node id of the page for which this is the zswap entry.
>   * pool - the zswap_pool the entry's data is in
>   * handle - zsmalloc allocation handle that stores the compressed page data
>   * objcg - the obj_cgroup that the compressed memory is charged to
> @@ -188,8 +198,11 @@ static struct shrinker *zswap_shrinker;
>   */
>  struct zswap_entry {
>         swp_entry_t swpentry;
> -       unsigned int length;
> -       bool referenced;
> +       struct {
> +               unsigned int length:(ZSWAP_ENTRY_SPARE_4BYTES - ZSWAP_ENTRY_REF_BIT);
> +               bool referenced:ZSWAP_ENTRY_REF_BIT;

Hmm I thought Yosry confirmed that using values directly rather than
macros (i.e 32 and 1 instead of ZSWAP_ENTRY_SPARE_4BYTES and
ZSWAP_ENTRY_REF_BIT) was the convention? :)

https://lore.kernel.org/linux-mm/gnm6hcqlzna4p3unrad2sur7pnyovr7f2sfuiufzweu2zbfb2r@ia422moyti7v/

I was just copying zsmalloc's format ;) Anyway, either way a fixlet
should be sufficient. No big deal...
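(FWIW, assuming I'm reading that suggestion right, the plain-values
version would just be something like:

        struct {
                unsigned int length:31; /* 32 bits minus the referenced bit */
                bool referenced:1;
        };

with no extra #defines. Again, not a blocker either way.)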
> +       };
> +       int nid;
>         struct zswap_pool *pool;
>         unsigned long handle;
>         struct obj_cgroup *objcg;
> @@ -241,8 +254,10 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>  **********************************/
>  static void __zswap_pool_empty(struct percpu_ref *ref);
>
> -static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
>  {
> +       u8 i;
> +
>         if (IS_ERR_OR_NULL(acomp_ctx))
>                 return;
>
> @@ -252,7 +267,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
>         if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>                 crypto_free_acomp(acomp_ctx->acomp);
>
> -       kfree(acomp_ctx->buffer);
> +       if (acomp_ctx->buffers) {
> +               for (i = 0; i < nr_buffers; ++i)
> +                       kfree(acomp_ctx->buffers[i]);
> +               kfree(acomp_ctx->buffers);
> +       }
>  }
>
>  static struct zswap_pool *zswap_pool_create(char *compressor)
> @@ -264,6 +283,7 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
>         if (!zswap_has_pool && !strcmp(compressor, ZSWAP_PARAM_UNSET))
>                 return NULL;
>
> +       /* Many things rely on the zero-initialization. */
>         pool = kzalloc(sizeof(*pool), GFP_KERNEL);
>         if (!pool)
>                 return NULL;
> @@ -316,7 +336,9 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
>
>  cpuhp_add_fail:
>         for_each_possible_cpu(cpu)
> -               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> +               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
> +                                 pool->compr_batch_size);
> +
>  error:
>         if (pool->acomp_ctx)
>                 free_percpu(pool->acomp_ctx);
> @@ -354,7 +376,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
>         cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
>
>         for_each_possible_cpu(cpu)
> -               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> +               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
> +                                 pool->compr_batch_size);
>
>         free_percpu(pool->acomp_ctx);
>
> @@ -645,14 +668,8 @@ static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry
>  }
>  #endif
>
> -static inline int entry_to_nid(struct zswap_entry *entry)
> -{
> -       return page_to_nid(virt_to_page(entry));
> -}
> -
>  static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>  {
> -       int nid = entry_to_nid(entry);
>         struct mem_cgroup *memcg;
>
>         /*
> @@ -669,19 +686,18 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>         rcu_read_lock();
>         memcg = mem_cgroup_from_entry(entry);
>         /* will always succeed */
> -       list_lru_add(list_lru, &entry->lru, nid, memcg);
> +       list_lru_add(list_lru, &entry->lru, entry->nid, memcg);
>         rcu_read_unlock();
>  }
>
>  static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>  {
> -       int nid = entry_to_nid(entry);
>         struct mem_cgroup *memcg;
>
>         rcu_read_lock();
>         memcg = mem_cgroup_from_entry(entry);
>         /* will always succeed */
> -       list_lru_del(list_lru, &entry->lru, nid, memcg);
> +       list_lru_del(list_lru, &entry->lru, entry->nid, memcg);
>         rcu_read_unlock();
>  }
>
> @@ -741,6 +757,56 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>         kmem_cache_free(zswap_entry_cache, entry);
>  }
>
> +static __always_inline void zswap_entries_cache_free_batch(
> +                               struct zswap_entry **entries,
> +                               u8 nr_entries)
> +{
> +       /*
> +        * It is okay to use this to free entries allocated separately
> +        * by zswap_entry_cache_alloc().
> +        */
> +       kmem_cache_free_bulk(zswap_entry_cache, nr_entries, (void **)entries);
> +}
> +
> +static __always_inline bool zswap_entries_cache_alloc_batch(
> +                               struct zswap_entry **entries,
> +                               u8 nr_entries,
> +                               gfp_t gfp,
> +                               int nid)
> +{
> +       int nr_alloc = kmem_cache_alloc_bulk(zswap_entry_cache, gfp,
> +                                            nr_entries, (void **)entries);
> +
> +       /*
> +        * kmem_cache_alloc_bulk() should return @nr_entries on success
> +        * and 0 on failure.
> +        */
> +       if (likely(nr_alloc == nr_entries))
> +               return true;
> +
> +       if (WARN_ON_ONCE(unlikely(nr_alloc && (nr_alloc != nr_entries)))) {
> +               zswap_reject_kmemcache_fail++;
> +               zswap_entries_cache_free_batch(entries, nr_alloc);
> +               nr_alloc = 0;
> +       }

Can partial allocation happen? I checked a couple callers of
kmem_cache_alloc_bulk() and none of them check the case nr_alloc &&
nr_alloc != nr_entries. In fact, one caller (__io_alloc_req_refill()
in io_uring/io_uring.c) even explicitly documents:

        ret = kmem_cache_alloc_bulk(req_cachep, gfp, ARRAY_SIZE(reqs), reqs);

        /*
         * Bulk alloc is all-or-nothing. If we fail to get a batch,
         * retry single alloc to be on the safe side.
         */
        if (unlikely(ret <= 0)) {
                reqs[0] = kmem_cache_alloc(req_cachep, gfp);
                if (!reqs[0])
                        return false;
                ret = 1;
        }

Other callers don't even bother checking the negative case (i.e. ret < 0)
- only the 0 case.

I'm not terribly familiar with bulk allocation though. Please fact
check me :)

> +
> +       if (unlikely(!nr_alloc)) {
> +               unsigned int i;
> +
> +               for (i = 0; i < nr_entries; ++i) {
> +                       entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, nid);
> +
> +                       if (unlikely(!entries[i])) {
> +                               zswap_reject_kmemcache_fail++;
> +                               zswap_entries_cache_free_batch(entries, i);
> +                               return false;
> +                       }
> +               }
> +       }
> +
> +       return true;
> +}
> +
>  /*
>   * Carries out the common pattern of freeing an entry's zsmalloc allocation,
>   * freeing the entry itself, and decrementing the number of stored pages.
> @@ -767,7 +833,9 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  {
>         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> +       int nid = cpu_to_node(cpu);
>         int ret = -ENOMEM;
> +       u8 i;
>
>         /*
>          * To handle cases where the CPU goes through online-offline-online
> @@ -778,11 +846,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>                 return 0;
>         }
>
> -       acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> -       if (!acomp_ctx->buffer)
> -               return ret;
> -
> -       acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> +       acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, nid);
>         if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
>                 pr_err("could not alloc crypto acomp %s : %ld\n",
>                        pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
> @@ -790,20 +854,39 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>                 goto fail;
>         }
>
> +       /*
> +        * Allocate up to ZSWAP_MAX_BATCH_SIZE dst buffers if the
> +        * compressor supports batching.
> +        */
> +       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> +                                    crypto_acomp_batch_size(acomp_ctx->acomp));
> +

I assume this is going to be 0 for zstd?
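If it really can be 0 there, we probably want to make sure at least one
dst buffer still gets allocated, since buffers[0] is used unconditionally
later on. Something like this, perhaps (untested sketch):

        pool->compr_batch_size = clamp(crypto_acomp_batch_size(acomp_ctx->acomp),
                                       1U, ZSWAP_MAX_BATCH_SIZE);

But I might be misreading what crypto_acomp_batch_size() returns for
plain software compressors, so take this with a grain of salt.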
>         acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> +
>         if (IS_ERR_OR_NULL(acomp_ctx->req)) {
>                 pr_err("could not alloc crypto acomp_request %s\n",
>                        pool->tfm_name);
>                 goto fail;
>         }
>
> -       crypto_init_wait(&acomp_ctx->wait);
> +       acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
> +                                         GFP_KERNEL, nid);
> +       if (!acomp_ctx->buffers)
> +               goto fail;
> +
> +       for (i = 0; i < pool->compr_batch_size; ++i) {
> +               acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE, GFP_KERNEL, nid);
> +               if (!acomp_ctx->buffers[i])
> +                       goto fail;
> +       }
>
>         /*
>          * if the backend of acomp is async zip, crypto_req_done() will wakeup
>          * crypto_wait_req(); if the backend of acomp is scomp, the callback
>          * won't be called, crypto_wait_req() will return without blocking.
>          */
> +       crypto_init_wait(&acomp_ctx->wait);
> +
>         acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                    crypto_req_done, &acomp_ctx->wait);
>
> @@ -813,12 +896,12 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         return 0;
>
>  fail:
> -       acomp_ctx_dealloc(acomp_ctx);
> +       acomp_ctx_dealloc(acomp_ctx, pool->compr_batch_size);
>         return ret;
>  }
>
>  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> -                          struct zswap_pool *pool)
> +                          struct zswap_pool *pool, bool wb_enabled)
>  {
>         struct crypto_acomp_ctx *acomp_ctx;
>         struct scatterlist input, output;
> @@ -832,7 +915,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>         mutex_lock(&acomp_ctx->mutex);
>
> -       dst = acomp_ctx->buffer;
> +       dst = acomp_ctx->buffers[0];
>         sg_init_table(&input, 1);
>         sg_set_page(&input, page, PAGE_SIZE, 0);
>
> @@ -862,8 +945,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          * to the active LRU list in the case.
>          */
>         if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> -               if (!mem_cgroup_zswap_writeback_enabled(
> -                                       folio_memcg(page_folio(page)))) {
> +               if (!wb_enabled) {
>                         comp_ret = comp_ret ? comp_ret : -EINVAL;
>                         goto unlock;
>                 }
> @@ -909,7 +991,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>         mutex_lock(&acomp_ctx->mutex);
>         obj = zs_obj_read_begin(pool->zs_pool, entry->handle, entry->length,
> -                               acomp_ctx->buffer);
> +                               acomp_ctx->buffers[0]);
>
>         /* zswap entries of length PAGE_SIZE are not compressed. */
>         if (entry->length == PAGE_SIZE) {
> @@ -919,15 +1001,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>
>         /*
>          * zs_obj_read_begin() might return a kmap address of highmem when
> -        * acomp_ctx->buffer is not used.  However, sg_init_one() does not
> -        * handle highmem addresses, so copy the object to acomp_ctx->buffer.
> +        * acomp_ctx->buffers[0] is not used.  However, sg_init_one() does not
> +        * handle highmem addresses, so copy the object to acomp_ctx->buffers[0].
>          */
>         if (virt_addr_valid(obj)) {
>                 src = obj;
>         } else {
> -               WARN_ON_ONCE(obj == acomp_ctx->buffer);
> -               memcpy(acomp_ctx->buffer, obj, entry->length);
> -               src = acomp_ctx->buffer;
> +               WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
> +               memcpy(acomp_ctx->buffers[0], obj, entry->length);
> +               src = acomp_ctx->buffers[0];
>         }
>
>         sg_init_one(&input, src, entry->length);
> @@ -1381,95 +1463,136 @@ static void shrink_worker(struct work_struct *w)
>   * main API
>   **********************************/
>
> -static bool zswap_store_page(struct page *page,
> -                            struct obj_cgroup *objcg,
> -                            struct zswap_pool *pool)
> +/*
> + * Store multiple pages in @folio, starting from the page at index @start up to
> + * the page at index @end-1.
> + */
> +static bool zswap_store_pages(struct folio *folio,
> +                             long start,
> +                             long end,
> +                             struct zswap_pool *pool,
> +                             struct crypto_acomp_ctx *acomp_ctx,
> +                             int nid,
> +                             bool wb_enabled,
> +                             struct obj_cgroup *objcg)
>  {
> -       swp_entry_t page_swpentry = page_swap_entry(page);
> -       struct zswap_entry *entry, *old;
> +       struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> +       u8 i, store_fail_idx = 0, nr_pages = end - start;
>
> -       /* allocate entry */
> -       entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> -       if (!entry) {
> -               zswap_reject_kmemcache_fail++;
> +       VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE);
> +
> +       if (unlikely(!zswap_entries_cache_alloc_batch(entries, nr_pages,
> +                                                     GFP_KERNEL, nid)))
>                 return false;
> -       }
>
> -       if (!zswap_compress(page, entry, pool))
> -               goto compress_failed;
> +       /*
> +        * We co-locate entry initialization as much as possible here to
> +        * minimize potential cache misses.
> +        */
> +       for (i = 0; i < nr_pages; ++i) {
> +               entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> +               entries[i]->pool = pool;
> +               entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
> +               entries[i]->objcg = objcg;
> +               entries[i]->referenced = true;
> +               entries[i]->nid = nid;
> +               INIT_LIST_HEAD(&entries[i]->lru);
> +       }
>
> -       old = xa_store(swap_zswap_tree(page_swpentry),
> -                      swp_offset(page_swpentry),
> -                      entry, GFP_KERNEL);
> -       if (xa_is_err(old)) {
> -               int err = xa_err(old);
> +       for (i = 0; i < nr_pages; ++i) {
> +               struct page *page = folio_page(folio, start + i);
>
> -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> -               zswap_reject_alloc_fail++;
> -               goto store_failed;
> +               if (!zswap_compress(page, entries[i], pool, wb_enabled))
> +                       goto store_pages_failed;
>         }
>
> -       /*
> -        * We may have had an existing entry that became stale when
> -        * the folio was redirtied and now the new version is being
> -        * swapped out. Get rid of the old.
> -        */
> -       if (old)
> -               zswap_entry_free(old);
> +       for (i = 0; i < nr_pages; ++i) {
> +               struct zswap_entry *old, *entry = entries[i];
>
> -       /*
> -        * The entry is successfully compressed and stored in the tree, there is
> -        * no further possibility of failure. Grab refs to the pool and objcg,
> -        * charge zswap memory, and increment zswap_stored_pages.
> -        * The opposite actions will be performed by zswap_entry_free()
> -        * when the entry is removed from the tree.
> -        */
> -       zswap_pool_get(pool);
> -       if (objcg) {
> -               obj_cgroup_get(objcg);
> -               obj_cgroup_charge_zswap(objcg, entry->length);
> -       }
> -       atomic_long_inc(&zswap_stored_pages);
> -       if (entry->length == PAGE_SIZE)
> -               atomic_long_inc(&zswap_stored_incompressible_pages);
> +               old = xa_store(swap_zswap_tree(entry->swpentry),
> +                              swp_offset(entry->swpentry),
> +                              entry, GFP_KERNEL);

Future follow-up: perhaps we can use advanced xarray API (xas_*) to
take the lock only once.
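Something along these lines, maybe (very rough, untested sketch; it
ignores the xas_nomem() retry and the error/'old' handling that the
real code needs, and it relies on the batch's swap offsets being
contiguous, which they are for a single folio):

        XA_STATE(xas, swap_zswap_tree(entries[0]->swpentry),
                 swp_offset(entries[0]->swpentry));

        xas_lock(&xas);
        for (i = 0; i < nr_pages; ++i) {
                /* publish each entry at its consecutive offset under one lock */
                old = xas_store(&xas, entries[i]);
                /* ... stash 'old' and check xas_error() here, as today ... */
                xas_next(&xas);
        }
        xas_unlock(&xas);

Just an idea for a later patch, not something to hold this series up on.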
> +               if (unlikely(xa_is_err(old))) {
> +                       int err = xa_err(old);
>
> -       /*
> -        * We finish initializing the entry while it's already in xarray.
> -        * This is safe because:
> -        *
> -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> -        *
> -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> -        *    The publishing order matters to prevent writeback from seeing
> -        *    an incoherent entry.
> -        */
> -       entry->pool = pool;
> -       entry->swpentry = page_swpentry;
> -       entry->objcg = objcg;
> -       entry->referenced = true;
> -       if (entry->length) {
> -               INIT_LIST_HEAD(&entry->lru);
> -               zswap_lru_add(&zswap_list_lru, entry);
> +                       WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +                       zswap_reject_alloc_fail++;
> +                       /*
> +                        * Entries up to this point have been stored in the
> +                        * xarray. zswap_store() will erase them from the xarray
> +                        * and call zswap_entry_free(). Local cleanup in
> +                        * 'store_pages_failed' only needs to happen for
> +                        * entries from [@i to @nr_pages).
> +                        */
> +                       store_fail_idx = i;
> +                       goto store_pages_failed;
> +               }
> +
> +               /*
> +                * We may have had an existing entry that became stale when
> +                * the folio was redirtied and now the new version is being
> +                * swapped out. Get rid of the old.
> +                */
> +               if (unlikely(old))
> +                       zswap_entry_free(old);
> +
> +               /*
> +                * The entry is successfully compressed and stored in the tree,
> +                * and further failures will be cleaned up in zswap_store().
> +                * Grab refs to the pool and objcg, charge zswap memory, and
> +                * increment zswap_stored_pages. The opposite actions will be
> +                * performed by zswap_entry_free() when the entry is removed
> +                * from the tree.
> +                */
> +               zswap_pool_get(pool);
> +               if (objcg) {
> +                       obj_cgroup_get(objcg);
> +                       obj_cgroup_charge_zswap(objcg, entry->length);
> +               }
> +               atomic_long_inc(&zswap_stored_pages);
> +               if (entry->length == PAGE_SIZE)
> +                       atomic_long_inc(&zswap_stored_incompressible_pages);
> +
> +               /*
> +                * We finish by adding the entry to the LRU while it's already
> +                * in xarray. This is safe because:
> +                *
> +                * 1. Concurrent stores and invalidations are excluded by folio lock.
> +                *
> +                * 2. Writeback is excluded by the entry not being on the LRU yet.
> +                *    The publishing order matters to prevent writeback from seeing
> +                *    an incoherent entry.
> +                */
> +               if (likely(entry->length))
> +                       zswap_lru_add(&zswap_list_lru, entry);

Hang on - how can entry->length == 0? This is probably a holdover from
back when zero-filled pages were still managed in zswap? Future
follow-up work: remove this check if that's the case...

The rest looks solid to me - I'll defer to Yosry and Johannes.