From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8da6a093-346b-35cd-818a-a82abfa6a930@oppo.com>
Date: Fri, 8 Mar 2024 10:02:20 +0800
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Jan Kara
References: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com> <20240307140344.4wlumk6zxustylh6@quack3>
Cc: Chris Li, linux-mm, lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com, 21cnbao@gmail.com, david@redhat.com
From: Chuanhua Han
In-Reply-To: <20240307140344.4wlumk6zxustylh6@quack3>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 2024/3/7 22:03, Jan Kara wrote:
> On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
>> On 2024/3/1 17:24, Chris Li wrote:
>>> In last year's LSF/MM I talked about a VFS-like swap system. That is
>>> the pony that was chosen.
>>> However, I did not have much chance to go into details.
>>>
>>> This year, I would like to discuss what it takes to re-architect the
>>> whole swap back end from scratch.
>>>
>>> Let’s start from the requirements for the swap back end.
>>>
>>> 1) support the existing swap usage (not the implementation).
>>>
>>> Some other design goals:
>>>
>>> 2) low per swap entry memory usage.
>>>
>>> 3) low io latency.
>>>
>>> What are the functions the swap system needs to support?
>>>
>>> At the device level, swap systems need to support a list of swap files
>>> with a priority order. Swap devices of the same priority are written
>>> to in round-robin fashion. The swap device types include zswap,
>>> zram, SSD, spinning hard disk, and swap file in a file system.
>>>
>>> At the swap entry level, here is the list of existing swap entry usage:
>>>
>>> * Swap entry allocation and free. Each swap entry needs to be
>>> associated with a location of the disk space in the swapfile. (offset
>>> of swap entry).
>>> * Each swap entry needs to track the map count of the entry. (swap_map)
>>> * Each swap entry needs to be able to find the associated memory
>>> cgroup. (swap_cgroup_ctrl->map)
>>> * Swap cache. Lookup folio/shadow from swap entry
>>> * Swap page writes through a swapfile in a file system other than a
>>> block device. (swap_extent)
>>> * Shadow entry. (store in swap cache)
>>>
>>> Any new swap back end might have different internal implementation,
>>> but needs to support the above usage.
>>> For example, using the existing
>>> file system as swap backend, per vma or per swap entry map to a file
>>> would mean it needs additional data structure to track the
>>> swap_cgroup_ctrl, combined with the size of the file inode. It would
>>> be challenging to meet the design goal 2) and 3) using another file
>>> system as it is.
>>>
>>> I am considering grouping different swap entry data into one single
>>> struct and dynamically allocating it, so there is no upfront
>>> allocation of swap_map.
>>>
>>> For swap entry allocation, the current kernel supports swapping out
>>> order-0 or pmd-order pages.
>>>
>>> There are some discussions and patches that add swap out for folio
>>> size in between (mTHP)
>>>
>>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
>>>
>>> and swap in for mTHP:
>>>
>>> https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/
>>>
>>> The introduction of swapping different order of pages will further
>>> complicate the swap entry fragmentation issue. The swap back end has
>>> no way to predict the life cycle of the swap entries. Repeatedly
>>> allocating and freeing swap entries of different sizes will fragment
>>> the swap entries array. If we can’t allocate contiguous swap entries
>>> for an mTHP, it will have to split the mTHP to a smaller size to
>>> perform the swap in and out.
>>>
>>> Current swap only supports 4K pages or pmd size pages. When adding the
>>> other in between sizes, it greatly increases the chance of fragmenting
>>> the swap entry space. When there are no more contiguous swap entries
>>> for an mTHP, it will force the mTHP to be split into 4K pages. If we
>>> don’t solve the fragmentation issue, it will be a constant source of
>>> splitting the mTHP.
>>>
>>> Another limitation I would like to address is that swap_writepage can
>>> only write out IO in one contiguous chunk, not able to perform
>>> non-continuous IO.
>>> When the swapfile is close to full, it is likely
>>> the unused entries will be spread across different locations. It would
>>> be nice to be able to read and write large folio using discontiguous
>>> disk IO locations.
>>>
>>> Some possible ideas for the fragmentation issue.
>>>
>>> a) buddy allocator for swap entries. Similar to the buddy allocator
>>> in memory. We can use a buddy allocator system for the swap entry to
>>> keep low-order swap entries from fragmenting too much of the
>>> high-order swap entries. It should greatly reduce the fragmentation
>>> caused by allocating and freeing swap entries of different sizes.
>>> However the buddy allocator has its own limit as well. Unlike system
>>> memory, where we can move and compact the memory, there is no rmap
>>> for swap entries, so it is much harder to move a swap entry to another
>>> disk location. So the buddy allocator for swap will help, but not
>>> solve all the fragmentation issues.
>> I have an idea here😁
>>
>> Each swap device is divided into multiple chunks, and each chunk serves
>> allocations of a single order
>> (the order is that of the folio being swapped out, and each chunk is
>> used for only one order).
>> This can solve the fragmentation problem. It is much simpler than
>> buddy, easier to implement,
>> and can be compatible with multiple sizes, similar to a small slab
>> allocator.
>>
>> 1) Add structure members
>> In the swap_info_struct structure, we only need to add an offset array
>> holding the search offset for each order.
>> eg:
>>
>> #define MTHP_NR_ORDER 9
>>
>> struct swap_info_struct {
>>     ...
>>     long order_off[MTHP_NR_ORDER];
>>     ...
>> };
>>
>> Note: order_off = -1 indicates that this order is not supported.
>>
>> 2) Initialize
>> Set the proportion of swap device occupied by each order.
>> For the sake of simplicity, there are 8 kinds of orders.
>> Number of slots occupied by each order: chunk_size = 1/8 * maxpages
>> (maxpages indicates the maximum number of available slots in the current
>> swap device)
> Well, but then if you fill in space of a particular order and need to swap
> out a page of that order what do you do? Return ENOSPC prematurely?

If we swap out a subpage of a large folio (because the large folio was
split), we simply search for a free swap entry starting from order_off[0].

> Frankly as I'm reading the discussions here, it seems to me you are trying
> to reinvent a lot of things from the filesystem space :) Like block
> allocation with reasonably efficient fragmentation prevention, transparent
> data compression (zswap), hierarchical storage management (i.e., moving
> data between different backing stores), efficient way to get from
> VMA+offset to the place on disk where the content is stored. Sure you still
> don't need a lot of things modern filesystems do like permissions,
> directory structure (or even more complex namespacing stuff), all the stuff
> achieving fs consistency after a crash, etc. But still what you need is a
> notable portion of what filesystems do.
>
> So maybe it would be time to implement swap as a proper filesystem? Or even
> better we could think about factoring out these bits out of some existing
> filesystem to share code?

In fact, my current idea does not involve complicated file-system-related
layers (Chris' idea b) might involve file system modifications). It is
simple to implement: we just need to record the search location of each
order in the swap device.

>
> 								Honza

Thanks,
Chuanhua
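
To make the per-order chunk idea above a bit more concrete, here is a
minimal user-space sketch. It is not kernel code: struct swap_sketch, the
used[] byte map and the swap_sketch_*() helpers are hypothetical names made
up for illustration, and only MTHP_NR_ORDER and order_off[] come from the
proposal. It also assumes the device is split evenly across MTHP_NR_ORDER
orders, with each chunk rounded down to a multiple of the largest entry size
so that entries stay naturally aligned.

```c
#include <assert.h>
#include <stdlib.h>

#define MTHP_NR_ORDER 9                      /* from the proposal: orders 0..8 */
#define MAX_NR (1L << (MTHP_NR_ORDER - 1))   /* slots in the largest entry */

/* Hypothetical stand-in for the relevant part of swap_info_struct. */
struct swap_sketch {
	long maxpages;                  /* total slots in the device */
	long chunk_size;                /* slots reserved for each order */
	long order_off[MTHP_NR_ORDER];  /* next search offset, -1 = unsupported */
	unsigned char *used;            /* one byte per slot: 0 free, 1 in use */
};

/* Split the device into one chunk per order (assumption: equal shares). */
static int swap_sketch_init(struct swap_sketch *si, long maxpages)
{
	assert(maxpages > 0);
	si->maxpages = maxpages;
	/* Round down so every chunk starts on a MAX_NR boundary. */
	si->chunk_size = (maxpages / MTHP_NR_ORDER) & ~(MAX_NR - 1);
	si->used = calloc((size_t)maxpages, 1);
	if (!si->used)
		return -1;
	for (int order = 0; order < MTHP_NR_ORDER; order++) {
		long nr = 1L << order;
		si->order_off[order] =
			si->chunk_size >= nr ? order * si->chunk_size : -1;
	}
	return 0;
}

/*
 * Allocate 2^order contiguous slots from that order's chunk only.
 * Returns the first slot offset, or -1 if the chunk is full or the
 * order is unsupported. A caller could then split the folio and retry
 * with order 0, as described in the reply above.
 */
static long swap_sketch_alloc(struct swap_sketch *si, int order)
{
	if (order < 0 || order >= MTHP_NR_ORDER || si->order_off[order] < 0)
		return -1;

	long nr = 1L << order;
	long start = order * si->chunk_size;
	long end = start + si->chunk_size;
	long off = si->order_off[order];

	/* Every entry in this chunk has the same size and alignment, so
	 * checking the first slot of each entry is enough. */
	for (long i = 0; i < si->chunk_size / nr; i++) {
		if (!si->used[off]) {
			for (long j = 0; j < nr; j++)
				si->used[off + j] = 1;
			si->order_off[order] = off + nr < end ? off + nr : start;
			return off;
		}
		off += nr;
		if (off >= end)
			off = start;    /* wrap around inside the chunk */
	}
	return -1;
}

/* Release an entry previously returned by swap_sketch_alloc(). */
static void swap_sketch_free(struct swap_sketch *si, long off, int order)
{
	for (long j = 0; j < (1L << order); j++)
		si->used[off + j] = 0;
}
```

Because a chunk only ever holds entries of one size, freeing an entry always
leaves a hole that the next allocation of the same order can reuse, which is
the property that keeps small entries from fragmenting the space needed by
large ones. The cost is the one Jan points out: an order's chunk can fill up
while other chunks still have free slots.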