Date: Thu, 7 Nov 2024 13:16:26 -0500
From: Johannes Weiner
To: Kanchana P Sridhar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, yosryahmed@google.com,
	nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, linux-crypto@vger.kernel.org,
	herbert@gondor.apana.org.au, davem@davemloft.net, clabbe@baylibre.com,
	ardb@kernel.org, ebiggers@google.com, surenb@google.com,
	kristen.c.accardi@intel.com, zanussi@kernel.org,
	wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
Message-ID: <20241107181626.GF1172372@cmpxchg.org>
References: <20241106192105.6731-1-kanchana.p.sridhar@intel.com>
	<20241106192105.6731-14-kanchana.p.sridhar@intel.com>
In-Reply-To: <20241106192105.6731-14-kanchana.p.sridhar@intel.com>

On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> If the system has Intel IAA, and if sysctl vm.compress-batching is set to
> "1", zswap_store() will call crypto_acomp_batch_compress() to compress up
> to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> the multiple compress engines available in IAA hardware.
>
> On platforms with multiple IAA devices per socket, compress jobs from all
> cores in a socket will be distributed among all IAA devices on the socket
> by the iaa_crypto driver.
>
> With deflate-iaa configured as the zswap compressor, and sysctl
> vm.compress-batching enabled, the first time zswap_store() has to
> swapout a large folio on any given cpu, it will allocate the
> pool->acomp_batch_ctx resources on that cpu, and set pool->can_batch_comp
> to BATCH_COMP_ENABLED. It will then proceed to call the main
> __zswap_store_batch_core() compress batching function. Subsequent calls to
> zswap_store() on the same cpu will go ahead and use the acomp_batch_ctx by
> checking the pool->can_batch_comp status.
>
> Hence, we allocate the per-cpu pool->acomp_batch_ctx resources only on an
> as-needed basis, to reduce memory footprint cost. The cost is not incurred
> on cores that never get to swapout a large folio.
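
In other words, the per-cpu setup described above boils down to something
like this on the store path (sketch only -- zswap_pool_setup_batching() is a
placeholder name used here, not a function from the patch, and error/race
handling is omitted):

	if (READ_ONCE(pool->can_batch_comp) != BATCH_COMP_ENABLED) {
		/*
		 * First large-folio store on this cpu: allocate the
		 * per-cpu pool->acomp_batch_ctx resources, then mark
		 * batching usable for subsequent zswap_store() calls.
		 */
		if (!zswap_pool_setup_batching(pool))	/* hypothetical */
			goto sequential_store;
		WRITE_ONCE(pool->can_batch_comp, BATCH_COMP_ENABLED);
	}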
>
> This patch introduces the main __zswap_store_batch_core() function for
> compress batching. This interface represents the extensible compress
> batching architecture that can potentially be called with a batch of
> any-order folios from shrink_folio_list(). In other words, although
> zswap_store() calls __zswap_store_batch_core() with exactly one large folio
> in this patch, we can reuse this interface to reclaim a batch of folios, to
> significantly improve the reclaim path efficiency due to IAA's parallel
> compression capability.
>
> The newly added functions that implement batched stores follow the
> general structure of zswap_store() of a large folio. Some amount of
> restructuring and optimization is done to minimize failure points
> for a batch, fail early and maximize the zswap store pipeline occupancy
> with SWAP_CRYPTO_BATCH_SIZE pages, potentially from multiple
> folios. This is intended to maximize reclaim throughput with the IAA
> hardware parallel compressions.
>
> Signed-off-by: Kanchana P Sridhar
> ---
>  include/linux/zswap.h |  84 ++++++
>  mm/zswap.c            | 625 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 709 insertions(+)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 9ad27ab3d222..6d3ef4780c69 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -31,6 +31,88 @@ struct zswap_lruvec_state {
>  	atomic_long_t nr_disk_swapins;
>  };
>
> +/*
> + * struct zswap_store_sub_batch_page:
> + *
> + * This represents one "zswap batching element", namely, the
> + * attributes associated with a page in a large folio that will
> + * be compressed and stored in zswap. The term "batch" is reserved
> + * for a conceptual "batch" of folios that can be sent to
> + * zswap_store() by reclaim. The term "sub-batch" is used to describe
> + * a collection of "zswap batching elements", i.e., an array of
> + * "struct zswap_store_sub_batch_page *".
> + *
> + * The zswap compress sub-batch size is specified by
> + * SWAP_CRYPTO_BATCH_SIZE, currently set as 8UL if the
> + * platform has Intel IAA. This means zswap can store a large folio
> + * by creating sub-batches of up to 8 pages and compressing this
> + * batch using IAA to parallelize the 8 compress jobs in hardware.
> + * For e.g., a 64KB folio can be compressed as 2 sub-batches of
> + * 8 pages each. This can significantly improve the zswap_store()
> + * performance for large folios.
> + *
> + * Although the page itself is represented directly, the structure
> + * adds a "u8 batch_idx" to represent an index for the folio in a
> + * conceptual "batch of folios" that can be passed to zswap_store().
> + * Conceptually, this allows for up to 256 folios that can be passed
> + * to zswap_store(). If this conceptual number of folios sent to
> + * zswap_store() exceeds 256, the "batch_idx" needs to become u16.
> + */
> +struct zswap_store_sub_batch_page {
> +	u8 batch_idx;
> +	swp_entry_t swpentry;
> +	struct obj_cgroup *objcg;
> +	struct zswap_entry *entry;
> +	int error; /* folio error status. */
> +};
> +
> +/*
> + * struct zswap_store_pipeline_state:
> + *
> + * This stores state during IAA compress batching of (conceptually, a batch of)
> + * folios. The term pipelining in this context, refers to breaking down
> + * the batch of folios being reclaimed into sub-batches of
> + * SWAP_CRYPTO_BATCH_SIZE pages, batch compressing and storing the
> + * sub-batch. This concept could be further evolved to use overlap of CPU
> + * computes with IAA computes. For instance, we could stage the post-compress
> + * computes for sub-batch "N-1" to happen in parallel with IAA batch
> + * compression of sub-batch "N".
> + *
> + * We begin by developing the concept of compress batching. Pipelining with
> + * overlap can be future work.
> + *
> + * @errors: The errors status for the batch of reclaim folios passed in from
> + *	a higher mm layer such as swap_writepage().
> + * @pool: A valid zswap_pool.
> + * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for the @pool.
> + * @sub_batch: This is an array that represents the sub-batch of up to
> + *	SWAP_CRYPTO_BATCH_SIZE pages that are being stored
> + *	in zswap.
> + * @comp_dsts: The destination buffers for crypto_acomp_compress() for each
> + *	page being compressed.
> + * @comp_dlens: The destination buffers' lengths from crypto_acomp_compress()
> + *	obtained after crypto_acomp_poll() returns completion status,
> + *	for each page being compressed.
> + * @comp_errors: Compression errors for each page being compressed.
> + * @nr_comp_pages: Total number of pages in @sub_batch.
> + *
> + * Note:
> + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, some of the
> + * u8 members (except @comp_dsts) need to become u16.
> + */
> +struct zswap_store_pipeline_state {
> +	int *errors;
> +	struct zswap_pool *pool;
> +	struct crypto_acomp_ctx *acomp_ctx;
> +	struct zswap_store_sub_batch_page *sub_batch;
> +	struct page **comp_pages;
> +	u8 **comp_dsts;
> +	unsigned int *comp_dlens;
> +	int *comp_errors;
> +	u8 nr_comp_pages;
> +};

Why are these in the public header?
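
(Not from the patch -- just to make the point concrete, these could stay
file-local, e.g.:

	/* mm/zswap.c */
	struct zswap_store_sub_batch_page {
		u8 batch_idx;
		swp_entry_t swpentry;
		struct obj_cgroup *objcg;
		struct zswap_entry *entry;
		int error;	/* folio error status. */
	};

and likewise for the pipeline state, since the diff doesn't show anything
outside mm/zswap.c using them.)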
>  unsigned long zswap_total_pages(void);
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
> @@ -45,6 +127,8 @@ bool zswap_never_enabled(void);
>  #else
>
>  struct zswap_lruvec_state {};
> +struct zswap_store_sub_batch_page {};
> +struct zswap_store_pipeline_state {};
>
>  static inline bool zswap_store(struct folio *folio)
>  {
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2af736e38213..538aac3fb552 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -255,6 +255,12 @@ static int zswap_create_acomp_ctx(unsigned int cpu,
>  				  char *tfm_name,
>  				  unsigned int nr_reqs);
>
> +static bool __zswap_store_batch_core(
> +	int node_id,
> +	struct folio **folios,
> +	int *errors,
> +	unsigned int nr_folios);
> +

Please reorder the functions to avoid forward decls.

>  /*********************************
>  * pool functions
>  **********************************/
> @@ -1626,6 +1632,12 @@ static ssize_t zswap_store_page(struct page *page,
>  	return -EINVAL;
>  }
>
> +/*
> + * Modified to use the IAA compress batching framework implemented in
> + * __zswap_store_batch_core() if sysctl vm.compress-batching is 1.
> + * The batching code is intended to significantly improve folio store
> + * performance over the sequential code.

This isn't helpful, please delete.

>  bool zswap_store(struct folio *folio)
>  {
>  	long nr_pages = folio_nr_pages(folio);
> @@ -1638,6 +1650,38 @@ bool zswap_store(struct folio *folio)
>  	bool ret = false;
>  	long index;
>
> +	/*
> +	 * Improve large folio zswap_store() latency with IAA compress batching,
> +	 * if this is enabled by setting sysctl vm.compress-batching to "1".
> +	 * If enabled, the large folio's pages are compressed in parallel in
> +	 * batches of SWAP_CRYPTO_BATCH_SIZE pages. If disabled, every page in
> +	 * the large folio is compressed sequentially.
> +	 */

Same here. Reduce to "Try to batch compress large folios, fall back to
processing individual subpages if that fails."
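
In place that would look something like this (just the suggested wording
dropped into the patch's context, nothing new):

	/*
	 * Try to batch compress large folios, fall back to processing
	 * individual subpages if that fails.
	 */
	if (folio_test_large(folio) && READ_ONCE(compress_batching)) {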
Reduce to "Try to batch compress large folios, fall back to processing individual subpages if that fails." > + if (folio_test_large(folio) && READ_ONCE(compress_batching)) { > + pool = zswap_pool_current_get(); There is an existing zswap_pool_current_get() in zswap_store(), please reorder the sequence so you don't need to add an extra one. > + if (!pool) { > + pr_err("Cannot setup acomp_batch_ctx for compress batching: no current pool found\n"); This is unnecessary. > + goto sequential_store; > + } > + > + if (zswap_pool_can_batch(pool)) { This function is introduced in another patch, where it isn't used. Please add functions and callers in the same patch. > + int error = -1; > + bool store_batch = __zswap_store_batch_core( > + folio_nid(folio), > + &folio, &error, 1); > + > + if (store_batch) { > + zswap_pool_put(pool); > + if (!error) > + ret = true; > + return ret; > + } > + } Please don't future proof code like this, only implement what is strictly necessary for the functionality in this patch. You're only adding a single caller with nr_folios=1, so it shouldn't be a parameter, and the function shouldn't have a that batch_idx loop. > + zswap_pool_put(pool); > + } > + > +sequential_store: > + > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > > @@ -1724,6 +1768,587 @@ bool zswap_store(struct folio *folio) > return ret; > } > > +/* > + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the > + * u8 stack variables in the next several functions, to u16. > + */ > + > +/* > + * Propagate the "sbp" error condition to other batch elements belonging to > + * the same folio as "sbp". > + */ > +static __always_inline void zswap_store_propagate_errors( > + struct zswap_store_pipeline_state *zst, > + u8 error_batch_idx) > +{ Please observe surrounding coding style on how to wrap >80 col function signatures. Don't use __always_inline unless there is a clear, spelled out performance reason. Since it's an error path, that's doubtful. Please use a consistent namespace for all this: CONFIG_ZSWAP_BATCH zswap_batch_store() zswap_batch_alloc_entries() zswap_batch_add_folios() zswap_batch_compress() etc. Again, order to avoid forward decls. Try to keep the overall sequence of events between zswap_store() and zswap_batch_store() similar as much as possible for readability.