From: Chuanhua Han
Date: Thu, 16 May 2024 15:16:33 +0800
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Chris Li
Cc: Jan Kara, Chuanhua Han, linux-mm, lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com, 21cnbao@gmail.com, david@redhat.com, Matthew Wilcox
References: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com> <20240307140344.4wlumk6zxustylh6@quack3> <8da6a093-346b-35cd-818a-a82abfa6a930@oppo.com> <20240314082651.ckfpp2tyslq2hl2c@quack3>

Chris Li wrote on Thu, May 16, 2024 at 07:07:
>
> Hi,
>
> Here is my slide for today's swap abstraction discussion.
>
> https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
Great, thank you!
>
> Chris
>
> On Thu, Mar 14, 2024 at 4:20 AM Chuanhua Han wrote:
> >
> > Jan Kara wrote on Thu, Mar 14, 2024 at 16:28:
> > >
> > > On Fri 08-03-24 10:02:20, Chuanhua Han wrote:
> > > >
> > > > On 2024/3/7 22:03, Jan Kara wrote:
> > > > > On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
> > > > >> On 2024/3/1 17:24, Chris Li wrote:
> > > > >>> In last year's LSF/MM I talked about a VFS-like swap system. That is
> > > > >>> the pony that was chosen.
> > > > >>> However, I did not have much chance to go into details.
> > > > >>>
> > > > >>> This year, I would like to discuss what it takes to re-architect the
> > > > >>> whole swap back end from scratch.
> > > > >>>
> > > > >>> Let's start from the requirements for the swap back end.
> > > > >>>
> > > > >>> 1) support the existing swap usage (not the implementation).
> > > > >>>
> > > > >>> Some other design goals:
> > > > >>>
> > > > >>> 2) low per swap entry memory usage.
> > > > >>>
> > > > >>> 3) low io latency.
> > > > >>>
> > > > >>> What are the functions the swap system needs to support?
> > > > >>>
> > > > >>> At the device level, swap systems need to support a list of swap files
> > > > >>> with a priority order. Swap devices of the same priority are written
> > > > >>> in round-robin fashion. The swap device types include zswap,
> > > > >>> zram, SSD, spinning hard disk, swap file in a file system.
> > > > >>>
> > > > >>> At the swap entry level, here is the list of existing swap entry usage:
> > > > >>>
> > > > >>> * Swap entry allocation and free. Each swap entry needs to be
> > > > >>> associated with a location of the disk space in the swapfile. (offset
> > > > >>> of swap entry).
> > > > >>> * Each swap entry needs to track the map count of the entry.
> > > > >>>   (swap_map)
> > > > >>> * Each swap entry needs to be able to find the associated memory
> > > > >>> cgroup. (swap_cgroup_ctrl->map)
> > > > >>> * Swap cache. Lookup folio/shadow from swap entry
> > > > >>> * Swap page writes through a swapfile in a file system other than a
> > > > >>> block device. (swap_extent)
> > > > >>> * Shadow entry. (store in swap cache)
> > > > >>>
> > > > >>> Any new swap back end might have a different internal implementation,
> > > > >>> but needs to support the above usage. For example, using the existing
> > > > >>> file system as swap backend, per vma or per swap entry map to a file
> > > > >>> would mean it needs additional data structure to track the
> > > > >>> swap_cgroup_ctrl, combined with the size of the file inode. It would
> > > > >>> be challenging to meet the design goal 2) and 3) using another file
> > > > >>> system as it is.
> > > > >>>
> > > > >>> I am considering grouping different swap entry data into one single
> > > > >>> struct and dynamically allocating it, so there is no upfront
> > > > >>> allocation of swap_map.
> > > > >>>
> > > > >>> For the swap entry allocation. The current kernel supports swap out
> > > > >>> of 0 order or pmd order pages.
> > > > >>>
> > > > >>> There are some discussions and patches that add swap out for folio
> > > > >>> size in between (mTHP)
> > > > >>>
> > > > >>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> > > > >>>
> > > > >>> and swap in for mTHP:
> > > > >>>
> > > > >>> https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/
> > > > >>>
> > > > >>> The introduction of swapping different order of pages will further
> > > > >>> complicate the swap entry fragmentation issue. The swap back end has
> > > > >>> no way to predict the life cycle of the swap entries. Repeatedly
> > > > >>> allocating and freeing swap entries of different sizes will fragment
> > > > >>> the swap entries array.
> > > > >>> If we can't allocate the contiguous swap entry for a mTHP, it
> > > > >>> will have to split the mTHP to a smaller size to perform the swap in
> > > > >>> and out.
> > > > >>>
> > > > >>> Current swap only supports 4K pages or pmd size pages. When adding the
> > > > >>> other in between sizes, it greatly increases the chance of fragmenting
> > > > >>> the swap entry space. When there is no more contiguous swap entry for
> > > > >>> mTHP, it will force the mTHP split into 4K pages. If we don't solve
> > > > >>> the fragmentation issue, it will be a constant source of splitting the
> > > > >>> mTHP.
> > > > >>>
> > > > >>> Another limitation I would like to address is that swap_writepage can
> > > > >>> only write out IO in one contiguous chunk, not able to perform
> > > > >>> non-contiguous IO. When the swapfile is close to full, it is likely
> > > > >>> the unused entries will spread across different locations. It would be
> > > > >>> nice to be able to read and write large folio using discontiguous disk
> > > > >>> IO locations.
> > > > >>>
> > > > >>> Some possible ideas for the fragmentation issue.
> > > > >>>
> > > > >>> a) buddy allocator for swap entries. Similar to the buddy allocator
> > > > >>> in memory. We can use a buddy allocator system for the swap entry to
> > > > >>> keep low order swap entries from fragmenting too much of the high
> > > > >>> order swap entry space. It should greatly reduce the fragmentation
> > > > >>> caused by allocate and free of the swap entry of different sizes.
> > > > >>> However the buddy allocator has its own limit as well. With system
> > > > >>> memory, we can move and compact the memory. There is no rmap for swap
> > > > >>> entry, so it is much harder to move a swap entry to another disk
> > > > >>> location. So the buddy allocator for swap will help, but not solve
> > > > >>> all the fragmentation issues.
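For illustration, here is a minimal user-space sketch of idea a) above: a toy buddy allocator over swap slot offsets. It is not kernel code and not taken from any posted patch. The names swap_buddy, swap_buddy_alloc, swap_buddy_free and the 16-slot device size are all made up for this example.

```c
#include <assert.h>

#define SWAP_MAX_ORDER 4                      /* hypothetical: 16-slot toy device */
#define SWAP_NR_SLOTS  (1L << SWAP_MAX_ORDER)

/* free_order[off] == k: a free block of 2^k slots starts at off; -1 otherwise. */
struct swap_buddy {
    signed char free_order[SWAP_NR_SLOTS];
};

static void swap_buddy_init(struct swap_buddy *b)
{
    for (long i = 0; i < SWAP_NR_SLOTS; i++)
        b->free_order[i] = -1;
    b->free_order[0] = SWAP_MAX_ORDER;        /* one maximal free block */
}

/* Return the first slot of a free 2^order block, or -1 if none exists. */
static long swap_buddy_alloc(struct swap_buddy *b, int order)
{
    if (order < 0 || order > SWAP_MAX_ORDER)
        return -1;
    for (int k = order; k <= SWAP_MAX_ORDER; k++) {
        for (long off = 0; off < SWAP_NR_SLOTS; off += 1L << k) {
            if (b->free_order[off] != k)
                continue;
            b->free_order[off] = -1;
            /* Split down; each upper half goes back to the free pool. */
            while (k > order) {
                k--;
                b->free_order[off + (1L << k)] = (signed char)k;
            }
            return off;
        }
    }
    return -1;
}

/* Free a block and merge it with its buddy while the buddy is also free. */
static void swap_buddy_free(struct swap_buddy *b, long off, int order)
{
    assert(order >= 0 && order <= SWAP_MAX_ORDER);
    assert(off >= 0 && off < SWAP_NR_SLOTS);
    assert((off & ((1L << order) - 1)) == 0); /* block must be order-aligned */

    while (order < SWAP_MAX_ORDER) {
        long buddy = off ^ (1L << order);

        if (b->free_order[buddy] != order)
            break;                            /* buddy in use or split */
        b->free_order[buddy] = -1;
        if (buddy < off)
            off = buddy;
        order++;
    }
    b->free_order[off] = (signed char)order;
}
```

In this toy, a single long-lived order-0 entry is enough to keep the enclosing higher-order block from being re-formed, and without rmap there is nothing to move it out of the way. That is the limit described in the paragraph above.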
> > > > >> I have an idea here😁
> > > > >>
> > > > >> Each swap device is divided into multiple chunks, and each chunk
> > > > >> serves allocations of exactly one order
> > > > >> (order indicates the order of the folio being swapped out, and each
> > > > >> chunk is used for only one order).
> > > > >> This can solve the fragmentation problem, is much simpler than
> > > > >> buddy, easier to implement,
> > > > >> and can be compatible with multiple sizes, similar to a small slab
> > > > >> allocator.
> > > > >>
> > > > >> 1) Add structure members
> > > > >> In the swap_info_struct structure, we only need to add the offset array
> > > > >> representing the offset where the search for each order starts.
> > > > >> eg:
> > > > >>
> > > > >> #define MTHP_NR_ORDER 9
> > > > >>
> > > > >> struct swap_info_struct {
> > > > >>     ...
> > > > >>     long order_off[MTHP_NR_ORDER];
> > > > >>     ...
> > > > >> };
> > > > >>
> > > > >> Note: order_off = -1 indicates that this order is not supported.
> > > > >>
> > > > >> 2) Initialize
> > > > >> Set the proportion of swap device occupied by each order.
> > > > >> For the sake of simplicity, there are 8 kinds of orders.
> > > > >> Number of slots occupied by each order: chunk_size = 1/8 * maxpages
> > > > >> (maxpages indicates the maximum number of available slots in the current
> > > > >> swap device)
> > > > > Well, but then if you fill in space of a particular order and need to swap
> > > > > out a page of that order what do you do? Return ENOSPC prematurely?
> > > > If we swap out a subpage of a large folio (due to a split of the large
> > > > folio), simply search for a free swap entry from order_off[0].
> > >
> > > I meant what are you going to do if you want to swapout 2MB huge page but
> > > you don't have any free swap entry of the appropriate order?
> > > History shows
> > > that these schemes where you partition available space into buckets of
> > > pages of different order tend to fragment rather quickly so you need to
> > > also implement some defragmentation / compaction scheme and once you do
> > > that you are at the complexity of a standard filesystem block allocator.
> > > That is all I wanted to point at :)
> > OK, got it! It's true that my approach doesn't eliminate
> > fragmentation, but fragmentation can be mitigated to some extent,
> > and the method itself doesn't currently involve complex
> > file system operations.
> >
> > >
> > > Honza
> > > --
> > > Jan Kara
> > > SUSE Labs, CR
> > >
> > Thanks,
> > Chuanhua

-- 
Thanks,
Chuanhua
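For reference, the per-order chunk layout quoted above can be sketched roughly as below. This is a user-space toy under stated assumptions, not the real swap_info_struct. The names toy_swap_info, toy_swap_alloc, toy_swap_free, the used[] map and the 4096-slot device size are hypothetical. It follows the "8 kinds of orders" simplification, so chunk_size = maxpages / 8.

```c
#include <assert.h>
#include <string.h>

#define TOY_NR_ORDER 8          /* hypothetical: the "8 kinds of orders" case */
#define TOY_MAXPAGES 4096L      /* hypothetical device size, in slots */

struct toy_swap_info {
    long maxpages;
    long chunk_size;                  /* slots per order: maxpages / TOY_NR_ORDER */
    long order_off[TOY_NR_ORDER];     /* chunk start per order; -1 = unsupported */
    unsigned char used[TOY_MAXPAGES]; /* 1 = slot in use */
};

static void toy_swap_init(struct toy_swap_info *si)
{
    memset(si, 0, sizeof(*si));
    si->maxpages = TOY_MAXPAGES;
    si->chunk_size = si->maxpages / TOY_NR_ORDER;
    for (int order = 0; order < TOY_NR_ORDER; order++)
        si->order_off[order] = order * si->chunk_size;
}

/*
 * Allocate 2^order contiguous slots, searching only that order's chunk.
 * Returns the first slot, or -1 when the chunk has no free block.
 */
static long toy_swap_alloc(struct toy_swap_info *si, int order)
{
    if (order < 0 || order >= TOY_NR_ORDER || si->order_off[order] < 0)
        return -1;

    long nr = 1L << order;
    long start = si->order_off[order];
    long end = start + si->chunk_size;

    /* Blocks in a chunk are all the same size, so checking the first slot is enough. */
    for (long off = start; off + nr <= end; off += nr) {
        if (!si->used[off]) {
            memset(&si->used[off], 1, (size_t)nr);
            return off;
        }
    }
    return -1;
}

static void toy_swap_free(struct toy_swap_info *si, long off, int order)
{
    assert(order >= 0 && order < TOY_NR_ORDER);
    assert(off >= 0 && off + (1L << order) <= si->maxpages);
    memset(&si->used[off], 0, (size_t)1 << order);
}
```

With these numbers each chunk holds 512 slots, so the order-7 chunk fits only four 128-slot blocks. A fifth order-7 request returns -1 while the other chunks are still mostly empty, which is the premature ENOSPC case Jan asks about.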