Date: Thu, 7 Mar 2024 15:03:44 +0100
From: Jan Kara
To: Chuanhua Han
Cc: Chris Li, linux-mm, lsf-pc@lists.linux-foundation.org,
	ryan.roberts@arm.com, 21cnbao@gmail.com, david@redhat.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
Message-ID: <20240307140344.4wlumk6zxustylh6@quack3>
In-Reply-To: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com>

On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
>
> On 2024/3/1 17:24, Chris Li wrote:
> > In last year's LSF/MM I talked about a VFS-like swap system. That is
> > the pony that was chosen.
> > However, I did not have much chance to go into details.
> >
> > This year, I would like to discuss what it takes to re-architect the
> > whole swap back end from scratch.
> >
> > Let’s start from the requirements for the swap back end.
> >
> > 1) support the existing swap usage (not the implementation).
> >
> > Some other design goals:
> >
> > 2) low per swap entry memory usage.
> >
> > 3) low io latency.
> >
> > What are the functions the swap system needs to support?
> >
> > At the device level, swap systems need to support a list of swap files
> > with a priority order. Swap devices of the same priority are written
> > round robin. The swap device types include zswap, zram, SSD, spinning
> > hard disk, and swap file in a file system.
> >
> > At the swap entry level, here is the list of existing swap entry usage:
> >
> > * Swap entry allocation and free. Each swap entry needs to be
> >   associated with a location of the disk space in the swapfile. (offset
> >   of swap entry).
> > * Each swap entry needs to track the map count of the entry. (swap_map)
> > * Each swap entry needs to be able to find the associated memory
> >   cgroup.
> >   (swap_cgroup_ctrl->map)
> > * Swap cache. Lookup folio/shadow from swap entry
> > * Swap page writes through a swapfile in a file system other than a
> >   block device. (swap_extent)
> > * Shadow entry. (store in swap cache)
> >
> > Any new swap back end might have a different internal implementation,
> > but needs to support the above usage. For example, using the existing
> > file system as swap backend, per vma or per swap entry map to a file
> > would mean it needs an additional data structure to track the
> > swap_cgroup_ctrl, combined with the size of the file inode. It would
> > be challenging to meet the design goals 2) and 3) using another file
> > system as it is.
> >
> > I am considering grouping different swap entry data into one single
> > struct and dynamically allocating it, so there is no upfront
> > allocation of swap_map.
> >
> > For the swap entry allocation: the current kernel supports swapping
> > out order 0 or pmd order pages.
> >
> > There are some discussions and patches that add swap out for folio
> > sizes in between (mTHP)
> >
> > https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> >
> > and swap in for mTHP:
> >
> > https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/
> >
> > The introduction of swapping different orders of pages will further
> > complicate the swap entry fragmentation issue. The swap back end has
> > no way to predict the life cycle of the swap entries. Repeatedly
> > allocating and freeing swap entries of different sizes will fragment
> > the swap entries array. If we can’t allocate contiguous swap entries
> > for a mTHP, it will have to split the mTHP to a smaller size to
> > perform the swap in and out.
> >
> > Current swap only supports 4K pages or pmd size pages. Adding the
> > other in between sizes greatly increases the chance of fragmenting
> > the swap entry space. When there are no more contiguous swap entries
> > for a mTHP, it will force the mTHP to be split into 4K pages.
> > If we don’t solve the fragmentation issue, it will be a constant
> > source of splitting the mTHP.
> >
> > Another limitation I would like to address is that swap_writepage can
> > only write out IO in one contiguous chunk and is not able to perform
> > non-contiguous IO. When the swapfile is close to full, it is likely
> > the unused entries will be spread across different locations. It would
> > be nice to be able to read and write a large folio using discontiguous
> > disk IO locations.
> >
> > Some possible ideas for the fragmentation issue.
> >
> > a) buddy allocator for swap entries. Similar to the buddy allocator
> > in memory. We can use a buddy allocator system for the swap entries to
> > keep low order swap entries from fragmenting too much of the high
> > order swap entries. It should greatly reduce the fragmentation caused
> > by allocating and freeing swap entries of different sizes. However the
> > buddy allocator has its own limits as well. With system memory, we
> > can move and compact the memory. There is no rmap for swap entries, so
> > it is much harder to move a swap entry to another disk location. So
> > the buddy allocator for swap will help, but not solve all the
> > fragmentation issues.
> I have an idea here😁
>
> Each swap device is divided into multiple chunks, and each chunk is
> allocated to meet each order allocation
> (order indicates the order of swapout's folio, and each chunk is used
> for only one order).
> This can solve the fragmentation problem, which is much simpler than
> buddy, easier to implement,
> and can be compatible with multiple sizes, similar to small slab allocator.
>
> 1) Add structure members
> In the swap_info_struct structure, we only need to add the offset array
> representing the offset of each order search.
> eg:
>
> #define MTHP_NR_ORDER 9
>
> struct swap_info_struct {
>     ...
>     long order_off[MTHP_NR_ORDER];
>     ...
> };
>
> Note: order_off = -1 indicates that this order is not supported.
>
> 2) Initialize
> Set the proportion of swap device occupied by each order.
> For the sake of simplicity, there are 8 kinds of orders.
> Number of slots occupied by each order: chunk_size = 1/8 * maxpages
> (maxpages indicates the maximum number of available slots in the current
> swap device)

Well, but then if you fill in space of a particular order and need to swap
out a page of that order what do you do? Return ENOSPC prematurely?

Frankly as I'm reading the discussions here, it seems to me you are trying
to reinvent a lot of things from the filesystem space :) Like block
allocation with reasonably efficient fragmentation prevention, transparent
data compression (zswap), hierarchical storage management (i.e., moving
data between different backing stores), efficient way to get from
VMA+offset to the place on disk where the content is stored. Sure you still
don't need a lot of things modern filesystems do like permissions,
directory structure (or even more complex namespacing stuff), all the stuff
achieving fs consistency after a crash, etc. But still what you need is a
notable portion of what filesystems do.

So maybe it would be time to implement swap as a proper filesystem? Or even
better we could think about factoring these bits out of some existing
filesystem to share code?

								Honza
-- 
Jan Kara
SUSE Labs, CR
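
To make the ENOSPC concern above concrete, here is a minimal userspace
sketch of the static per-order split, assuming 8 orders and
chunk_size = maxpages / 8 as in the quoted proposal. It is not kernel
code: struct swap_model, model_init(), model_alloc() and the chunk_end
array are hypothetical names made up for illustration, and the model only
bump-allocates (no free, no reuse), which is a simplification.

```c
#include <assert.h>
#include <errno.h>

#define NR_ORDERS 8	/* "8 kinds of orders" from the quoted proposal */

/* Hypothetical userspace model of one swap device, not a kernel struct. */
struct swap_model {
	long chunk_size;		/* slots reserved per order */
	long order_off[NR_ORDERS];	/* next free slot, -1 = order unsupported */
	long chunk_end[NR_ORDERS];	/* one past the last slot of the chunk */
};

/* Split the device into NR_ORDERS equally sized chunks, one per order. */
static void model_init(struct swap_model *m, long maxpages)
{
	assert(maxpages > 0);
	m->chunk_size = maxpages / NR_ORDERS;
	for (int o = 0; o < NR_ORDERS; o++) {
		m->chunk_end[o] = (o + 1) * m->chunk_size;
		/* An order whose folio cannot fit in its chunk is unsupported. */
		if (m->chunk_size < (1L << o))
			m->order_off[o] = -1;
		else
			m->order_off[o] = o * m->chunk_size;
	}
}

/*
 * Allocate 2^order contiguous slots from the chunk reserved for that
 * order. Returns the first slot offset, or a negative errno value.
 */
static long model_alloc(struct swap_model *m, int order)
{
	long nr, off;

	if (order < 0 || order >= NR_ORDERS)
		return -EINVAL;
	if (m->order_off[order] < 0)
		return -EINVAL;		/* order not supported on this device */

	nr = 1L << order;
	off = m->order_off[order];
	if (off + nr > m->chunk_end[order])
		return -ENOSPC;		/* this chunk is full, others may not be */

	m->order_off[order] = off + nr;
	return off;
}
```

With maxpages = 1024, each order gets 128 slots. After 128 successful
order 0 allocations, the next model_alloc(&m, 0) returns -ENOSPC even
though the other seven chunks (896 slots) are still completely free. A
real design would need some fallback for that case, for example borrowing
from another order's chunk, which is where fragmentation comes back in.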