From: Chris Li
Date: Fri, 1 Mar 2024 01:24:02 -0800
Subject: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: lsf-pc@lists.linux-foundation.org, linux-mm, ryan.roberts@arm.com, David Hildenbrand, Barry Song <21cnbao@gmail.com>, Chuanhua Han

In last
year's LSF/MM I talked about a VFS-like swap system. That is the pony that was chosen. However, I did not have much chance to go into details.

This year, I would like to discuss what it takes to re-architect the whole swap back end from scratch.

Let's start from the requirements for the swap back end.

1) Support the existing swap usage (not the implementation).

Some other design goals:

2) Low per swap entry memory usage.

3) Low IO latency.

What are the functions the swap system needs to support?

At the device level, the swap system needs to support a list of swap files in priority order. Swap devices with the same priority are written to in round-robin order. The swap device types include zswap, zram, SSD, spinning hard disk, and a swap file in a file system.

At the swap entry level, here is the list of existing swap entry usage:

* Swap entry allocation and free. Each swap entry needs to be associated with a location of the disk space in the swapfile. (offset of swap entry)
* Each swap entry needs to track the map count of the entry. (swap_map)
* Each swap entry needs to be able to find the associated memory cgroup. (swap_cgroup_ctrl->map)
* Swap cache. Look up the folio/shadow from the swap entry.
* Swap page writes that go through a swapfile in a file system rather than a block device. (swap_extent)
* Shadow entry. (stored in the swap cache)

Any new swap back end might have a different internal implementation, but it needs to support the above usage. For example, using an existing file system as the swap back end, with a file mapped per VMA or per swap entry, would need an additional data structure to track the swap_cgroup_ctrl information, on top of the size of the file inode. It would be challenging to meet design goals 2) and 3) using another file system as it is.

I am considering grouping the different swap entry data into one single struct and allocating it dynamically, so there is no upfront allocation of swap_map.
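
A minimal userspace sketch of what such a combined, dynamically allocated per-entry record could look like. Every struct, field, and function name below is hypothetical and chosen for illustration; none of them is an existing kernel interface, and the 1:1 mapping from slot index to disk offset is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record grouping the per-entry data listed above. */
struct swap_desc {
	uint64_t disk_offset;	/* location in the swapfile */
	uint32_t map_count;	/* tracked by swap_map today */
	uint16_t memcg_id;	/* tracked by swap_cgroup_ctrl->map today */
	void *cache;		/* folio or shadow entry (swap cache) */
};

/* Hypothetical table; a flat pointer array keeps the sketch short. */
struct swap_table {
	struct swap_desc **slots;
	size_t nr_slots;
};

/* Allocate only the empty slot array; no descriptors exist yet. */
int swap_table_init(struct swap_table *t, size_t nr_slots)
{
	t->slots = calloc(nr_slots, sizeof(*t->slots));
	if (!t->slots)
		return -1;
	t->nr_slots = nr_slots;
	return 0;
}

/* Take a reference on entry idx, allocating its descriptor on first use. */
struct swap_desc *swap_desc_get(struct swap_table *t, size_t idx,
				uint16_t memcg_id)
{
	struct swap_desc *d;

	if (idx >= t->nr_slots)
		return NULL;
	d = t->slots[idx];
	if (!d) {
		d = calloc(1, sizeof(*d));
		if (!d)
			return NULL;
		d->disk_offset = idx;	/* assumption: 1:1 slot-to-offset */
		d->memcg_id = memcg_id;
		t->slots[idx] = d;
	}
	d->map_count++;
	return d;
}

/* Drop a reference; the descriptor is freed when the last user is gone. */
void swap_desc_put(struct swap_table *t, size_t idx)
{
	struct swap_desc *d;

	assert(idx < t->nr_slots);
	d = t->slots[idx];
	assert(d && d->map_count > 0);
	if (--d->map_count == 0) {
		free(d);
		t->slots[idx] = NULL;
	}
}
```

Note that the flat pointer array in this sketch is itself an upfront cost of one pointer per slot, which is larger than the one byte per entry that swap_map uses today. Meeting design goal 2) would presumably need a sparse index instead, so the sketch only illustrates the grouping and the allocate-on-first-use lifetime.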

For the swap entry allocation: the current kernel supports swapping out order-0 or PMD-order pages. There are discussions and patches that add swap out for folio sizes in between (mTHP):

https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/

and swap in for mTHP:

https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/

The introduction of swapping pages of different orders will further complicate the swap entry fragmentation issue. The swap back end has no way to predict the life cycle of the swap entries. Repeatedly allocating and freeing swap entries of different sizes will fragment the swap entry array. If we can't allocate contiguous swap entries for an mTHP, we will have to split the mTHP into a smaller size to perform the swap in and out.

Current swap only supports 4K pages or PMD-size pages. Adding the sizes in between greatly increases the chance of fragmenting the swap entry space. When there are no more contiguous swap entries for an mTHP, the mTHP is forced to split into 4K pages. If we don't solve the fragmentation issue, it will be a constant source of mTHP splitting.

Another limitation I would like to address is that swap_writepage can only write out IO in one contiguous chunk; it is not able to perform discontiguous IO. When the swapfile is close to full, the unused entries are likely to be spread across different locations. It would be nice to be able to read and write a large folio using discontiguous disk IO locations.

Some possible ideas for the fragmentation issue:

a) Buddy allocator for swap entries. Similar to the buddy allocator for memory, we can use a buddy allocator for swap entries to keep low-order swap entries from fragmenting the high-order swap entries too much. It should greatly reduce the fragmentation caused by allocating and freeing swap entries of different sizes. However, the buddy allocator has its own limits as well. With system memory, we can move and compact the memory.
There is no rmap for swap entries, so it is much harder to move a swap entry to another disk location. So the buddy allocator for swap will help, but it will not solve all the fragmentation issues.

b) Large swap entries. Take a file as an example: a file on a file system can be written to discontiguous disk locations. The file system is responsible for tracking how to map the file offset to a disk location. A large swap entry can have a similar indirection array that maps out the disk locations for the different subpages within a folio. This allows a large folio to be written out to discontiguous swap entries on the swap file. The array will need to be stored somewhere, as part of the overhead. When allocating swap entries for the folio, we can allocate a batch of smaller 4K swap entries into an array, then use this array to read/write the large folio. There will be a lot of plumbing work to get it to work.

Solutions a) and b) can work together as well: only use b) if we are not able to allocate swap entries from a).

Chris