Message-ID: <2cfcfdde-0d87-4195-b5a9-73636a94e89a@konsulko.se>
Date: Mon, 7 Apr 2025 20:26:44 +0200
Subject: Re: [PATCH v2] mm: add zblock allocator
To: Johannes Weiner
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
    linux-kernel@vger.kernel.org, Nhat Pham, Shakeel Butt, Igor Belousov
References: <20250404192813.925835-1-vitaly.wool@konsulko.se>
 <20250407155409.GB827@cmpxchg.org>
From: Vitaly Wool
In-Reply-To: <20250407155409.GB827@cmpxchg.org>

Hi Johannes,

thanks for the review, some comments follow below.

On 4/7/25 17:54, Johannes Weiner wrote:
> On Fri, Apr 04, 2025 at 09:28:13PM +0200, Vitaly Wool wrote:
>> @@ -0,0 +1,24 @@
>> +. SPDX-License-Identifier: GPL-2.0
>> +
>> +======
>> +zblock
>> +======
>> +
>> +zblock is a special purpose allocator for storing compressed pages.
>> +It stores integer number of compressed objects per its block. These
>> +blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
>> +
>> +With zblock, it is possible to densely arrange objects of various sizes
>> +resulting in low internal fragmentation.
>> +Also this allocator tries to
>> +fill incomplete blocks instead of adding new ones, in many cases
>> +providing a compression ratio substantially higher than z3fold and zbud
>> +(though lower than zmalloc's).
>
> This last part seems to be incorrect based on Igor's measurements.

Well, I surely don't mind deleting that.

>> +zblock does not require MMU to operate and also is superior to zsmalloc
>> +with regard to average performance and worst execution times, thus
>> +allowing for better response time and real-time characteristics of the
>> +whole system.
>
> Please delete the MMU comment here and in the changelog. As Nhat said,
> this is a pointless advantage of a swap backend allocator, as swapping
> itself requires an MMU.

That's correct indeed, but I still believe zblock may be a good match
for ZRAM, even better than zsmalloc, and that is where MMU independence
comes into play.

> Please also add that highmem is not supported.
>
> Please also add that migration is not supported, which has a negative
> impact on the kernel's ability to produce higher-order pages. This is
> particularly unfortunate because zblock itself wants higher-order pages.

Right. We plan to add migration support in the near future though.

>> +E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
>> +5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.
>> +
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index d5dfb9186962..19283e6a08eb 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -26556,6 +26556,13 @@ F: Documentation/networking/device_drivers/hamradio/z8530drv.rst
>> F: drivers/net/hamradio/*scc.c
>> F: drivers/net/hamradio/z8530.h
>>
>> +ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
>> +M: Vitaly Wool
>> +L: linux-mm@kvack.org
>> +S: Maintained
>> +F: Documentation/mm/zblock.rst
>> +F: mm/zblock.[ch]
>> +
>> ZBUD COMPRESSED PAGE ALLOCATOR
>> M: Seth Jennings
>> M: Dan Streetman
>
> It looks like you're using an old tree. Can you please rebase on top
> of the mm tree? There have been some significant changes to how zswap
> is calling into zpool since the other allocators have been deleted.
>
>> @@ -0,0 +1,418 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * zblock.c
>> + *
>> + * Author: Vitaly Wool
>> + * Based on the work from Ananda Badmaev
>> + * Copyright (C) 2022-2025, Konsulko AB.
>> + *
>> + * Zblock is a small object allocator with the intention to serve as a
>> + * zpool backend. It operates on page blocks which consist of number
>> + * of physical pages being a power of 2 and store integer number of
>> + * compressed pages per block which results in determinism and simplicity.
>> + *
>> + * zblock doesn't export any API and is meant to be used via zpool API.
>> + */
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include "zblock.h"
>> +
>> +static struct rb_root block_desc_tree = RB_ROOT;
>> +
>> +/* add a new block to the list */
>> +static inline void add_block(struct zblock_block *block,
>> +                             struct block_list *block_list)
>> +{
>> +        list_add(&block->link, &block_list->list);
>> +}
>> +
>> +/*
>> + * Find a block with at least one free slot and claim it.
>> + * We make sure that the first block, if exists, will always work.
>> + */
>> +static inline struct zblock_block *find_block(struct block_list *block_list)
>> +{
>> +        struct list_head *l = &block_list->list;
>> +
>> +        if (!list_empty(l)) {
>> +                struct zblock_block *z = list_first_entry(l, typeof(*z), link);
>> +
>> +                if (z->free_slots > 0) {
>> +                        if (--z->free_slots == 0)
>> +                                list_move_tail(&z->link, l);
>> +                        return z;
>> +                }
>> +        }
>> +        return NULL;
>> +}
>> +
>> +/* remove block from the list */
>> +static inline void remove_block(struct zblock_block *block)
>> +{
>> +        list_del_init(&block->link);
>> +}
>> +
>> +/* Encodes the handle of a particular slot in the pool using metadata */
>> +static inline unsigned long metadata_to_handle(struct zblock_block *block,
>> +                                unsigned int block_type, unsigned int slot)
>> +{
>> +        return (unsigned long)(block) | (block_type << SLOT_BITS) | slot;
>> +}
>> +
>> +/* Returns block, block type and slot in the pool corresponding to handle */
>> +static inline struct zblock_block *handle_to_metadata(unsigned long handle,
>> +                                unsigned int *block_type, unsigned int *slot)
>> +{
>> +        *block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
>> +        *slot = handle & SLOT_MASK;
>> +        return (struct zblock_block *)(handle & PAGE_MASK);
>> +}
>> +
>> +
>> +/*
>> + * allocate new block and add it to corresponding block list
>> + */
>> +static struct zblock_block *alloc_block(struct zblock_pool *pool,
>> +                                        int block_type, gfp_t gfp,
>> +                                        unsigned long *handle)
>> +{
>> +        struct zblock_block *block;
>> +        struct block_list *block_list;
>> +
>> +        block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
>
> This gfp comes from zswap, which currently does:
>
> gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
>
> So you might get highmem here that cannot be dereferenced and crash.
>
> __GFP_MOVABLE is also wrong, as you're not implementing migration.

Ack.

>> +        if (!block)
>> +                return NULL;
>> +
>> +        block_list = &pool->block_lists[block_type];
>> +
>> +        /* init block data */
>> +        block->free_slots = block_desc[block_type].slots_per_block - 1;
>> +        memset(&block->slot_info, 0, sizeof(block->slot_info));
>> +        set_bit(0, block->slot_info);
>> +        *handle = metadata_to_handle(block, block_type, 0);
>> +
>> +        spin_lock(&block_list->lock);
>> +        add_block(block, block_list);
>
> Use list_add() directly here.

This is probably too small a thing to argue about, but I would rather
keep add_block() and put the block_count increment there. Would you
agree?
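I.e. roughly this, just to illustrate what I mean (untested sketch, not
a final version):

        /* caller holds block_list->lock, as alloc_block() already does */
        static inline void add_block(struct zblock_block *block,
                                     struct block_list *block_list)
        {
                list_add(&block->link, &block_list->list);
                /* keep the counter in sync with the list itself */
                block_list->block_count++;
        }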
>> +        block_list->block_count++;
>> +        spin_unlock(&block_list->lock);
>> +        return block;
>> +}
>> +
>> +/*****************
>> + * API Functions
>> + *****************/
>> +/**
>> + * zblock_create_pool() - create a new zblock pool
>> + * @gfp: gfp flags when allocating the zblock pool structure
>> + * @ops: user-defined operations for the zblock pool
>> + *
>> + * Return: pointer to the new zblock pool or NULL if the metadata allocation
>> + * failed.
>> + */
>> +static struct zblock_pool *zblock_create_pool(gfp_t gfp)
>> +{
>> +        struct zblock_pool *pool;
>> +        struct block_list *block_list;
>> +        int i;
>> +
>> +        pool = kmalloc(sizeof(struct zblock_pool), gfp);
>> +        if (!pool)
>> +                return NULL;
>> +
>> +        /* init each block list */
>> +        for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
>> +                block_list = &pool->block_lists[i];
>> +                spin_lock_init(&block_list->lock);
>> +                INIT_LIST_HEAD(&block_list->list);
>> +                block_list->block_count = 0;
>> +        }
>> +        return pool;
>> +}
>> +
>> +/**
>> + * zblock_destroy_pool() - destroys an existing zblock pool
>> + * @pool: the zblock pool to be destroyed
>> + *
>> + */
>> +static void zblock_destroy_pool(struct zblock_pool *pool)
>> +{
>> +        kfree(pool);
>> +}
>> +
>> +
>> +/**
>> + * zblock_alloc() - allocates a slot of appropriate size
>> + * @pool: zblock pool from which to allocate
>> + * @size: size in bytes of the desired allocation
>> + * @gfp: gfp flags used if the pool needs to grow
>> + * @handle: handle of the new allocation
>> + *
>> + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
>> + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
>> + * a new slot.
>> + */
>> +static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
>> +                        unsigned long *handle)
>> +{
>> +        int block_type = -1;
>> +        unsigned int slot;
>> +        struct zblock_block *block;
>> +        struct block_list *block_list;
>> +
>> +        if (!size)
>> +                return -EINVAL;
>> +
>> +        if (size > PAGE_SIZE)
>> +                return -ENOSPC;
>> +
>> +        /* find basic block type with suitable slot size */
>> +        if (size < block_desc[0].slot_size)
>> +                block_type = 0;
>> +        else {
>> +                struct block_desc_node *block_node;
>> +                struct rb_node *node = block_desc_tree.rb_node;
>> +
>> +                while (node) {
>> +                        block_node = container_of(node, typeof(*block_node), node);
>> +                        if (size < block_node->this_slot_size)
>> +                                node = node->rb_left;
>> +                        else if (size >= block_node->next_slot_size)
>> +                                node = node->rb_right;
>> +                        else {
>> +                                block_type = block_node->block_idx + 1;
>> +                                break;
>> +                        }
>> +                }
>> +        }
>> +        if (WARN_ON(block_type < 0 || block_type >= ARRAY_SIZE(block_desc)))
>> +                return -EINVAL;
>> +
>> +        block_list = &pool->block_lists[block_type];
>> +
>> +        spin_lock(&block_list->lock);
>> +        block = find_block(block_list);
>> +        spin_unlock(&block_list->lock);
>> +        if (block)
>> +                goto found;
>> +        /* not found block with free slots try to allocate new empty block */
>> +        block = alloc_block(pool, block_type, gfp, handle);
>> +        return block ? 0 : -ENOMEM;
>> +
>> +found:
>> +        /* find the first free slot in block */
>> +        for (slot = find_first_zero_bit(block->slot_info,
>> +                                        block_desc[block_type].slots_per_block);
>> +             slot < block_desc[block_type].slots_per_block;
>> +             slot = find_next_zero_bit(block->slot_info,
>> +                                       block_desc[block_type].slots_per_block,
>> +                                       slot)) {
>> +                if (!test_and_set_bit(slot, block->slot_info))
>> +                        break;
>> +                barrier();
>
> Ah, so find_block() reserves a slot in block->free_slots. You can race
> with another allocation for the exact bit, but there is one bit
> guaranteed to remain. Clever. This could use a comment.

Will do.
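Something along these lines, perhaps (first draft, wording to be
polished):

        /*
         * find_block() has already reserved a slot for us by decrementing
         * ->free_slots under the list lock. We may still race with another
         * allocation for a particular bit, but at least one zero bit is
         * guaranteed to remain, so this loop always terminates.
         */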
>> +        }
>> +        BUG_ON(slot >= block_desc[block_type].slots_per_block);
>> +        *handle = metadata_to_handle(block, block_type, slot);
>> +        return 0;
>> +}
>> +
>> +/**
>> + * zblock_free() - frees the allocation associated with the given handle
>> + * @pool: pool in which the allocation resided
>> + * @handle: handle associated with the allocation returned by zblock_alloc()
>> + *
>> + */
>> +static void zblock_free(struct zblock_pool *pool, unsigned long handle)
>> +{
>> +        unsigned int slot, block_type;
>> +        struct zblock_block *block;
>> +        struct block_list *block_list;
>> +
>> +        block = handle_to_metadata(handle, &block_type, &slot);
>> +        block_list = &pool->block_lists[block_type];
>> +
>> +        spin_lock(&block_list->lock);
>> +        /* if all slots in block are empty delete whole block */
>> +        if (++block->free_slots == block_desc[block_type].slots_per_block) {
>> +                block_list->block_count--;
>> +                remove_block(block);
>> +                spin_unlock(&block_list->lock);
>> +                free_pages((unsigned long)block, block_desc[block_type].order);
>> +                return;
>> +        }
>> +        spin_unlock(&block_list->lock);
>> +
>> +        clear_bit(slot, block->slot_info);
>
> The list handling here is unusual. find_block() only checks the first
> block in the list. Filling a block puts it at the tail, but partially
> freeing a block doesn't rotate it back to the head. AFAICS this can
> lead to an ugly edge case where you have an unbounded amount of free
> memory trapped in partial blocks, but you keep allocating new blocks:
>
> You start with [new block]. Once full, list_move_tail() is a nop.
>
> You allocate a new one, [partial]->[full].
>
> Keep going, you can have [partial]->[full]->[full]->[full]->[full]
> until it's hundreds of gigabytes.
>
> The workload now unmaps randomly and thus partially draining many
> blocks. So you have: [partial]->[full]->[partial]->... with many
> gigabytes of free slots in partially used blocks.
>
> Once the first partial block fills up, it rotates to the tail:
> [full]->[partial]->...[full]
>
> find_block() will fail. You allocate a new block, fill it, move it to
> the tail. The old [full] is back at the head. Rinse, repeat.
>
> You could make find_block() keep going, but it might have to go
> through a million [full] blocks before finding a partial one.
>
> This is the reason why other allocator schemes, like zsmalloc e.g.,
> have separate partial and full lists.

The initial variant of zblock_free() had list_move() for that very
purpose, but we hadn't hit this corner case in our testing so far, so I
decided to remove it. By an interesting coincidence, I seem to have hit
it just minutes before I received your response, so I've restored it
and am now checking whether that is enough.

~Vitaly
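P.S. For reference, the restored rotation amounts to roughly the
following tail of zblock_free() (untested sketch against the v2 code
quoted above, the final form may differ):

        spin_lock(&block_list->lock);
        /* if all slots in block are empty delete whole block */
        if (++block->free_slots == block_desc[block_type].slots_per_block) {
                block_list->block_count--;
                remove_block(block);
                spin_unlock(&block_list->lock);
                free_pages((unsigned long)block, block_desc[block_type].order);
                return;
        }
        /*
         * The block now has at least one free slot: rotate it back to
         * the head of the list so that find_block() sees it before any
         * full blocks.
         */
        list_move(&block->link, &block_list->list);
        spin_unlock(&block_list->lock);

        clear_bit(slot, block->slot_info);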