Date: Tue, 22 Apr 2025 04:27:23 -0700
From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: Joshua Hahn
Cc: Yosry Ahmed, Nhat Pham, akpm@linux-foundation.org, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, sjenning@redhat.com, ddstreet@ieee.org,
	vitaly.wool@konsulko.com, hughd@google.com, corbet@lwn.net,
	konrad.wilk@oracle.com, senozhatsky@chromium.org, rppt@kernel.org,
	linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
	david@ixit.cz
Subject: Re: [PATCH 0/2] minimize swapping on zswap store failure
References: <20250402200651.1224617-1-joshua.hahnjy@gmail.com>
In-Reply-To: <20250402200651.1224617-1-joshua.hahnjy@gmail.com>

On Wed, Apr 02, 2025 at 01:06:49PM -0700, Joshua Hahn wrote:
> On Mon, 16 Oct 2023 17:57:31 -0700 Yosry Ahmed wrote:
>
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham wrote:
> >
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> Hi Yosry,
>
> Apologies for reviving an old thread, but I wasn't sure whether opening an
> entirely new thread was a better choice :-)

This seems like the right choice to me :)

> So I've implemented your idea, using the lower 2 bits of the list_head's
> prev pointer (the last bit indicates whether the list_head belongs to a
> page or a zswap_entry, and the second-to-last bit was repurposed for the
> second chance algorithm).

Thanks a lot for spending time looking into this, and sorry for the delayed
response (I am technically on leave right now).
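For concreteness, here is a minimal, untested userspace sketch of the
low-bit tagging scheme being discussed. The helper and macro names are
illustrative, not taken from the patch; the only assumption is that nodes
are at least 4-byte aligned, so the bottom two bits of a valid pointer are
always zero and free to carry tags:

	/*
	 * Sketch: steal the low 2 bits of list_head.prev as tags.
	 * Names are made up for illustration, not from the patch.
	 */
	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	#define TAG_IS_PAGE	0x1UL	/* bit 0: node is a page, not a zswap_entry */
	#define TAG_REFERENCED	0x2UL	/* bit 1: second-chance bit */
	#define TAG_MASK	0x3UL

	struct list_head { struct list_head *next, *prev; };

	static inline void node_set_tag(struct list_head *node, unsigned long tag)
	{
		node->prev = (struct list_head *)((uintptr_t)node->prev | tag);
	}

	static inline unsigned long node_tags(const struct list_head *node)
	{
		return (uintptr_t)node->prev & TAG_MASK;
	}

	static inline struct list_head *node_prev(const struct list_head *node)
	{
		/* Strip the tag bits before using the pointer. */
		return (struct list_head *)((uintptr_t)node->prev & ~TAG_MASK);
	}

	int main(void)
	{
		struct list_head a = { &a, &a };

		node_set_tag(&a, TAG_IS_PAGE);
		assert(node_tags(&a) & TAG_IS_PAGE);
		assert(node_prev(&a) == &a);	/* real pointer is recoverable */
		printf("tags: %lx\n", node_tags(&a));
		return 0;
	}

Every list walker then has to strip the tags via something like node_prev()
instead of dereferencing ->prev directly, which is the "teach the list
iterators" work mentioned below.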
> For a very high level overview of what I did in the patch:
> - When a page fails to compress, I remove the page mapping and tag both the
>   xarray entry (tag == set lowest bit to 1) and the page's list_head prev
>   ptr, then store the page directly into the zswap LRU.

What do you mean by 'remove the page mapping'? Do you mean
__remove_mapping()? This is already called by reclaim, so I assume vmscan
code hands over ownership of the page to zswap and doesn't call
__remove_mapping(), so you end up doing that in zswap instead.

> - In zswap_load, we take the entry out of the xarray and check if it's
>   tagged.
> - If it is tagged, then instead of decompressing, we just copy the page's
>   contents to the newly allocated page.
> - (More details about how to teach vmscan / page_io / list iterators how to
>   handle this, but we can gloss over those details for now)
>
> I have a working version, but have been holding off because I have only
> been seeing regressions. I wasn't really sure where they were coming from,
> but after going through some perf traces with Nhat, found out that the
> regressions come from the associated page faults that come from initially
> unmapping the page, and then re-allocating it for every load. This causes
> (1) more memcg flushing, and (2) extra allocations ==> more pressure ==>
> more reclaim, even though we only temporarily keep the extra page.

Hmm, how is this worse than the status quo though? IIUC, currently
incompressible pages will skip zswap and go to the backing swapfile. Surely
reading them from disk is slower than copying them?

Unless of course, writeback is disabled, in which case these pages are not
being reclaimed at all today. In this case, it makes sense that reclaiming
them makes accessing them slower, even if we don't actually need to
decompress them.

I have a few thoughts in mind:

- As Nhat said, if we can keep the pages in the swapcache, we can avoid
  making a new allocation and copying the page. We'd need to move it back
  from the zswap LRUs to the reclaim LRUs though.

- One advantage of keeping incompressible pages in zswap is preserving LRU
  ordering. IOW, if some compressible pages go to zswap first (old), then
  some incompressible pages (new), then the old compressible pages should
  go to disk via writeback first. Otherwise, accessing the hotter
  incompressible pages will be slower than accessing the colder
  compressible pages. This happens today because incompressible pages go
  straight to disk.

  The above will only materialize for a workload that has writeback enabled
  and a working set that mixes incompressible and compressible pages.

  The other advantage, as you mention below, is preventing repeatedly
  sending incompressible pages to zswap when writeback is disabled, but
  that could be offset by the extra cost of allocations/copying.

- The above being said, we should not regress workloads that have writeback
  disabled, so we either need to keep the pages in the swapcache to avoid
  the extra allocations/copies -- or avoid storing the pages in zswap
  completely if writeback is disabled. If writeback is disabled and the
  page is incompressible, we could probably just put it in the unevictable
  LRU, because that's what it really is. We'd need to make sure we remove
  it when it becomes compressible again. The first approach is probably
  simpler.
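As an aside, the load-path branch described above is conceptually small. A
self-contained userspace sketch is below; the one-element 'tree' array and
all names are stand-ins for the real xarray and zswap code, and the
"decompression" is stubbed out:

	/*
	 * Sketch of the described zswap_load() change: if the tree slot
	 * holds a tagged (incompressible) page instead of a zswap_entry,
	 * copy the stored page rather than decompressing.
	 */
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define PAGE_SIZE 4096
	#define ENTRY_TAG 0x1UL		/* lowest bit set: slot holds a raw page */

	struct page { unsigned char data[PAGE_SIZE]; };
	struct zswap_entry { unsigned char *blob; size_t len; };

	static void *tree[1];		/* stand-in for the xarray, one slot */

	static void fake_decompress(struct zswap_entry *e, struct page *dst)
	{
		/* Placeholder for the real decompression step. */
		memcpy(dst->data, e->blob, e->len);
	}

	static void load_for_fault(unsigned long offset, struct page *newpage)
	{
		uintptr_t slot = (uintptr_t)tree[offset];

		if (slot & ENTRY_TAG) {
			/* Incompressible: slot is a tagged pointer to the page itself. */
			struct page *stored = (struct page *)(slot & ~ENTRY_TAG);

			memcpy(newpage->data, stored->data, PAGE_SIZE);
			/* ...the stored page would then be freed by the caller. */
		} else {
			/* Normal path: slot holds a zswap_entry, decompress it. */
			fake_decompress((struct zswap_entry *)slot, newpage);
		}
	}

	int main(void)
	{
		struct page *stored = malloc(sizeof(*stored));
		struct page faulted;

		memset(stored->data, 0xab, PAGE_SIZE);
		tree[0] = (void *)((uintptr_t)stored | ENTRY_TAG);	/* tagged store */
		load_for_fault(0, &faulted);
		printf("copied without decompress: %d\n", faulted.data[0] == 0xab);
		free(stored);
		return 0;
	}

The cost that matters is not this branch, of course, but the allocation of
'newpage' and the fault that triggers it, which is exactly where the
regressions above come from.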
> Just wanted to put this here in case you were still thinking about this
> idea. What do you think? Ideally, there would be a way to keep the page
> around in the zswap LRU but not have to re-allocate a new page on a fault,
> but this seems like a bigger task.
>
> Ultimately the goal is to prevent an incompressible page from hoarding the
> compression algorithm on multiple reclaim attempts, but if we are spending
> more time by allocating new pages... maybe this isn't the correct
> approach :(
>
> Please let me know if you have any thoughts on this :-)
> Have a great day!
> Joshua
>
> Sent using hkml (https://github.com/sjp38/hackermail)