From: "Rafael J. Wysocki" <rafael@kernel.org>
Date: Tue, 25 Nov 2025 19:11:56 +0100
Subject: Re: [PATCH v3 14/19] mm, swap: cleanup swap entry management workflow
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Baoquan He, Barry Song, Chris Li,
 Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
 Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes,
 "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song,
 linux-pm@vger.kernel.org
In-Reply-To: <20251125-swap-table-p2-v3-14-33f54f707a5c@tencent.com>
References: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com>
 <20251125-swap-table-p2-v3-14-33f54f707a5c@tencent.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Mon, Nov 24, 2025 at 8:18 PM Kairui Song wrote:
>
> From: Kairui Song
>
> The current swap entry allocation/freeing workflow has never had a
> clear definition. This makes it hard to debug or to add new
> optimizations.
>
> This commit introduces a proper definition of how swap entries are
> allocated and freed. Now, most operations are folio based, so they
> will never exceed one swap cluster, and we now have a cleaner border
> between swap and the rest of mm, making it much easier to follow and
> debug, especially with the newly added sanity checks. It also makes
> further optimizations possible.
>
> Swap entries will mostly be allocated and freed bound to a folio.
> The folio lock is useful for resolving many swap-related races.
>
> Now swap allocation (except for hibernation) always starts with a
> folio in the swap cache, and entries get duped/freed under the
> protection of the folio lock:
>
> - folio_alloc_swap() - The only allocation entry point now.
>   Context: The folio must be locked.
>   This allocates one or a set of continuous swap slots for a folio
>   and binds them to the folio by adding the folio to the swap cache.
>   The swap slots' swap count starts at zero.
>
> - folio_dup_swap() - Increase the swap count of one or more entries.
>   Context: The folio must be locked and in the swap cache. For now,
>   the caller still has to lock the new swap entry owner (e.g., PTL).
>   This increases the ref count of swap entries allocated to a folio.
>   Newly allocated swap slots have their count increased by this
>   helper as the folio gets unmapped (and swap entries get installed).
>
> - folio_put_swap() - Decrease the swap count of one or more entries.
>   Context: The folio must be locked and in the swap cache. For now,
>   the caller still has to lock the swap entry owner (e.g., PTL).
>   This decreases the ref count of swap entries allocated to a folio.
>   Typically, swapin decreases the swap count as the folio gets
>   installed back and the swap entry gets uninstalled.
>
>   This won't remove the folio from the swap cache and free the slot.
>   Lazy freeing of swap cache is helpful for reducing IO. There is
>   already a folio_free_swap() for immediate cache reclaim. This part
>   could be further optimized later.
>
> The above locking constraints could be further relaxed once the swap
> table is fully implemented. Currently, dup still needs the caller to
> lock the swap entry container (e.g., PTL), or a concurrent zap may
> underflow the swap count.
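
To restate the intended calling convention of the folio-bound helpers,
here is a minimal sketch (illustrative only, not code from this patch;
locking and error handling are abbreviated, and labels such as
restore_pte are placeholders):

    /* Swap-out side (e.g. vmscan), with the folio locked: */
    if (folio_alloc_swap(folio))
            return -ENOMEM;             /* folio stays in memory */
    /* later, as rmap unmaps each page under the PTL: */
    if (folio_dup_swap(folio, subpage)) /* slot count 0 -> 1 */
            goto restore_pte;

    /* Swap-in side (e.g. do_swap_page()), folio locked, in cache: */
    folio_put_swap(folio, NULL);        /* entries leave the PTEs */
    /* slots stay pinned by the swap cache until folio_free_swap() */
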
>
> Some swap users need to interact with the swap count without
> involving a folio (e.g., forking/zapping the page table, or mapping
> truncation without swapin). In such cases, the caller has to ensure
> there is no race condition on whatever owns the swap count and use
> the helpers below:
>
> - swap_put_entries_direct() - Decrease the swap count directly.
>   Context: The caller must lock whatever is referencing the slots to
>   avoid a race.
>
>   Typically, page table zapping or shmem mapping truncation needs to
>   free swap slots directly. If a slot is cached (has a folio bound),
>   this will also try to release the swap cache.
>
> - swap_dup_entry_direct() - Increase the swap count directly.
>   Context: The caller must lock whatever is referencing the entries
>   to avoid a race, and the entries must already have a swap
>   count > 1.
>
>   Typically, forking needs to copy the page table and hence increase
>   the swap count of the entries in the table. The page table is
>   locked while referencing the swap entries, so the entries all have
>   a swap count > 1 and can't be freed.
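
The direct (no-folio) pairing then looks roughly like this (again just
an illustrative sketch; the real callers are copy_nonpresent_pte() and
zap_nonpresent_ptes() in the diff below):

    /* fork(): the source page table is locked, entries are live: */
    if (swap_dup_entry_direct(entry) < 0)
            return -EIO;        /* count continuation alloc failed */

    /* zap/truncate: drop nr references; slots that reach zero get
     * their swap cache reclaimed as well: */
    swap_put_entries_direct(entry, nr);
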
>
> The hibernation subsystem is a bit different, so two special
> wrappers are provided:
>
> - swap_alloc_hibernation_slot() - Allocate one entry from one
>   device.
> - swap_free_hibernation_slot() - Free one entry allocated by the
>   above helper.
>
> All hibernation entries are exclusive to the hibernation subsystem
> and should not interact with ordinary swap routines.
>
> By separating the workflows, it will be possible to bind folios more
> tightly to the swap cache and get rid of SWAP_HAS_CACHE as a
> temporary pin.
>
> This commit should not introduce any behavior change.
>
> Cc: linux-pm@vger.kernel.org
> Signed-off-by: Kairui Song

For the hibernation changes:

Acked-by: Rafael J. Wysocki (Intel)
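
For reference, the hibernation pairing is (sketch only; this mirrors
the alloc_swapdev_block() change below):

    /* Hibernation owns these slots exclusively: no folio, no swap
     * cache, no interaction with ordinary swap routines. */
    swp_entry_t slot = swap_alloc_hibernation_slot(type);

    if (swp_offset(slot)) {
            /* write an image page to swapdev_block(type, offset) */
            swap_free_hibernation_slot(slot);   /* on error/cleanup */
    }
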
> ---
>  arch/s390/mm/gmap_helpers.c |   2 +-
>  arch/s390/mm/pgtable.c      |   2 +-
>  include/linux/swap.h        |  58 ++++++++--------
>  kernel/power/swap.c         |  10 +--
>  mm/madvise.c                |   2 +-
>  mm/memory.c                 |  15 +++--
>  mm/rmap.c                   |   7 ++-
>  mm/shmem.c                  |  10 +--
>  mm/swap.h                   |  37 +++++++++
>  mm/swapfile.c               | 148 +++++++++++++++++++++++++++-------------
>  10 files changed, 193 insertions(+), 98 deletions(-)
>
> diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c
> index 549f14ad08af..c3f56a096e8c 100644
> --- a/arch/s390/mm/gmap_helpers.c
> +++ b/arch/s390/mm/gmap_helpers.c
> @@ -32,7 +32,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
>                 dec_mm_counter(mm, MM_SWAPENTS);
>         else if (softleaf_is_migration(entry))
>                 dec_mm_counter(mm, mm_counter(softleaf_to_folio(entry)));
> -       free_swap_and_cache(entry);
> +       swap_put_entries_direct(entry, 1);
>  }
>
>  /**
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index d670bfb47d9b..c3fa94a6ec15 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -692,7 +692,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
>
>                 dec_mm_counter(mm, mm_counter(folio));
>         }
> -       free_swap_and_cache(entry);
> +       swap_put_entries_direct(entry, 1);
>  }
>
>  void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 69025b473472..ac3caa4c6999 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void)
>  }
>
>  extern void si_swapinfo(struct sysinfo *);
> -int folio_alloc_swap(struct folio *folio);
> -bool folio_free_swap(struct folio *folio);
>  void put_swap_folio(struct folio *folio, swp_entry_t entry);
> -extern swp_entry_t get_swap_page_of_type(int);
>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> -extern int swap_duplicate_nr(swp_entry_t entry, int nr);
> -extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> -extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
>  int find_first_swap(dev_t *device);
>  extern unsigned int count_swap_pages(int, int);
> @@ -472,6 +466,29 @@ struct backing_dev_info;
>  extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
>  sector_t swap_folio_sector(struct folio *folio);
>
> +/*
> + * If there is an existing swap slot reference (swap entry) and the caller
> + * guarantees that there is no race modification of it (e.g., PTL
> + * protecting the swap entry in page table; shmem's cmpxchg protects the
> + * swap entry in shmem mapping), these two helpers below can be used
> + * to put/dup the entries directly.
> + *
> + * All entries must be allocated by folio_alloc_swap(). And they must have
> + * a swap count > 1. See comments of folio_*_swap helpers for more info.
> + */
> +int swap_dup_entry_direct(swp_entry_t entry);
> +void swap_put_entries_direct(swp_entry_t entry, int nr);
> +
> +/*
> + * folio_free_swap tries to free the swap entries pinned by a swap cache
> + * folio; it has to be here to be called by other components.
> + */
> +bool folio_free_swap(struct folio *folio);
> +
> +/* Allocate / free (hibernation) exclusive entries */
> +swp_entry_t swap_alloc_hibernation_slot(int type);
> +void swap_free_hibernation_slot(swp_entry_t entry);
> +
>  static inline void put_swap_device(struct swap_info_struct *si)
>  {
>         percpu_ref_put(&si->users);
> @@ -499,10 +516,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
>  #define free_pages_and_swap_cache(pages, nr) \
>         release_pages((pages), (nr));
>
> -static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> -{
> -}
> -
>  static inline void free_swap_cache(struct folio *folio)
>  {
>  }
> @@ -512,12 +525,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
>         return 0;
>  }
>
> -static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
> +static inline int swap_dup_entry_direct(swp_entry_t ent)
>  {
>         return 0;
>  }
>
> -static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
> +static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
>  {
>  }
>
> @@ -541,11 +554,6 @@ static inline int swp_swapcount(swp_entry_t entry)
>         return 0;
>  }
>
> -static inline int folio_alloc_swap(struct folio *folio)
> -{
> -       return -EINVAL;
> -}
> -
>  static inline bool folio_free_swap(struct folio *folio)
>  {
>         return false;
> @@ -558,22 +566,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
>         return -EINVAL;
>  }
>  #endif /* CONFIG_SWAP */
> -
> -static inline int swap_duplicate(swp_entry_t entry)
> -{
> -       return swap_duplicate_nr(entry, 1);
> -}
> -
> -static inline void free_swap_and_cache(swp_entry_t entry)
> -{
> -       free_swap_and_cache_nr(entry, 1);
> -}
> -
> -static inline void swap_free(swp_entry_t entry)
> -{
> -       swap_free_nr(entry, 1);
> -}
> -
>  #ifdef CONFIG_MEMCG
>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 0beff7eeaaba..546a0c701970 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -179,10 +179,10 @@ sector_t alloc_swapdev_block(int swap)
>  {
>         unsigned long offset;
>
> -       offset = swp_offset(get_swap_page_of_type(swap));
> +       offset = swp_offset(swap_alloc_hibernation_slot(swap));
>         if (offset) {
>                 if (swsusp_extents_insert(offset))
> -                       swap_free(swp_entry(swap, offset));
> +                       swap_free_hibernation_slot(swp_entry(swap, offset));
>                 else
>                         return swapdev_block(swap, offset);
>         }
> @@ -197,6 +197,7 @@ sector_t alloc_swapdev_block(int swap)
>
>  void free_all_swap_pages(int swap)
>  {
> +       unsigned long offset;
>         struct rb_node *node;
>
>         while ((node = swsusp_extents.rb_node)) {
> @@ -204,8 +205,9 @@ void free_all_swap_pages(int swap)
>
>                 ext = rb_entry(node, struct swsusp_extent, node);
>                 rb_erase(node, &swsusp_extents);
> -               swap_free_nr(swp_entry(swap, ext->start),
> -                            ext->end - ext->start + 1);
> +
> +               for (offset = ext->start; offset < ext->end; offset++)
> +                       swap_free_hibernation_slot(swp_entry(swap, offset));
>
>                 kfree(ext);
>         }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index b617b1be0f53..7cd69a02ce84 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -694,7 +694,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                         max_nr = (end - addr) / PAGE_SIZE;
>                         nr = swap_pte_batch(pte, max_nr, ptent);
>                         nr_swap -= nr;
> -                       free_swap_and_cache_nr(entry, nr);
> +                       swap_put_entries_direct(entry, nr);
>                         clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>                 } else if (softleaf_is_hwpoison(entry) ||
>                            softleaf_is_poison_marker(entry)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index ce9f56f77ae5..d89946ad63ec 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -934,7 +934,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>         struct page *page;
>
>         if (likely(softleaf_is_swap(entry))) {
> -               if (swap_duplicate(entry) < 0)
> +               if (swap_dup_entry_direct(entry) < 0)
>                         return -EIO;
>
>                 /* make sure dst_mm is on swapoff's mmlist. */
> @@ -1744,7 +1744,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
>
>                 nr = swap_pte_batch(pte, max_nr, ptent);
>                 rss[MM_SWAPENTS] -= nr;
> -               free_swap_and_cache_nr(entry, nr);
> +               swap_put_entries_direct(entry, nr);
>         } else if (softleaf_is_migration(entry)) {
>                 struct folio *folio = softleaf_to_folio(entry);
>
> @@ -4933,7 +4933,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         /*
>          * Some architectures may have to restore extra metadata to the page
>          * when reading from swap. This metadata may be indexed by swap entry
> -        * so this must be called before swap_free().
> +        * so this must be called before folio_put_swap().
>          */
>         arch_swap_restore(folio_swap(entry, folio), folio);
>
> @@ -4971,6 +4971,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (unlikely(folio != swapcache)) {
>                 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>                 folio_add_lru_vma(folio, vma);
> +               folio_put_swap(swapcache, NULL);
>         } else if (!folio_test_anon(folio)) {
>                 /*
>                  * We currently only expect !anon folios that are fully
> @@ -4979,9 +4980,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
>                 VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
>                 folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
> +               folio_put_swap(folio, NULL);
>         } else {
> +               VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
>                 folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> -                               rmap_flags);
> +                                        rmap_flags);
> +               folio_put_swap(folio, nr_pages == 1 ? page : NULL);
>         }
>
>         VM_BUG_ON(!folio_test_anon(folio) ||
> @@ -4995,7 +4999,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * Do it after mapping, so raced page faults will likely see the folio
>          * in swap cache and wait on the folio lock.
>          */
> -       swap_free_nr(entry, nr_pages);
>         if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
>                 folio_free_swap(folio);
>
> @@ -5005,7 +5008,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * Hold the lock to avoid the swap entry to be reused
>          * until we take the PT lock for the pte_same() check
>          * (to avoid false positives from pte_same). For
> -        * further safety release the lock after the swap_free
> +        * further safety release the lock after the folio_put_swap
>          * so that the swap count won't change under a
>          * parallel locked swapcache.
>          */
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f955f02d570e..f92c94954049 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -82,6 +82,7 @@
>  #include
>
>  #include "internal.h"
> +#include "swap.h"
>
>  static struct kmem_cache *anon_vma_cachep;
>  static struct kmem_cache *anon_vma_chain_cachep;
> @@ -2148,7 +2149,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                                 goto discard;
>                         }
>
> -                       if (swap_duplicate(entry) < 0) {
> +                       if (folio_dup_swap(folio, subpage) < 0) {
>                                 set_pte_at(mm, address, pvmw.pte, pteval);
>                                 goto walk_abort;
>                         }
> @@ -2159,7 +2160,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                          * so we'll not check/care.
>                          */
>                         if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> -                               swap_free(entry);
> +                               folio_put_swap(folio, subpage);
>                                 set_pte_at(mm, address, pvmw.pte, pteval);
>                                 goto walk_abort;
>                         }
> @@ -2167,7 +2168,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                         /* See folio_try_share_anon_rmap(): clear PTE first. */
>                         if (anon_exclusive &&
>                             folio_try_share_anon_rmap_pte(folio, subpage)) {
> -                               swap_free(entry);
> +                               folio_put_swap(folio, subpage);
>                                 set_pte_at(mm, address, pvmw.pte, pteval);
>                                 goto walk_abort;
>                         }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index eb9bd9241f99..56a690e93cc2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -971,7 +971,7 @@ static long shmem_free_swap(struct address_space *mapping,
>         old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
>         if (old != radswap)
>                 return 0;
> -       free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
> +       swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
>
>         return 1 << order;
>  }
> @@ -1654,7 +1654,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>                 spin_unlock(&shmem_swaplist_lock);
>         }
>
> -       swap_duplicate_nr(folio->swap, nr_pages);
> +       folio_dup_swap(folio, NULL);
>         shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
>
>         BUG_ON(folio_mapped(folio));
> @@ -1675,7 +1675,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>         /* Swap entry might be erased by racing shmem_free_swap() */
>         if (!error) {
>                 shmem_recalc_inode(inode, 0, -nr_pages);
> -               swap_free_nr(folio->swap, nr_pages);
> +               folio_put_swap(folio, NULL);
>         }
>
>         /*
> @@ -2161,6 +2161,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>
>         nr_pages = folio_nr_pages(folio);
>         folio_wait_writeback(folio);
> +       folio_put_swap(folio, NULL);
>         swap_cache_del_folio(folio);
>         /*
>          * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
> @@ -2168,7 +2169,6 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>          * in shmem_evict_inode().
>          */
>         shmem_recalc_inode(inode, -nr_pages, -nr_pages);
> -       swap_free_nr(swap, nr_pages);
>  }
>
>  static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> @@ -2391,9 +2391,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>         if (sgp == SGP_WRITE)
>                 folio_mark_accessed(folio);
>
> +       folio_put_swap(folio, NULL);
>         swap_cache_del_folio(folio);
>         folio_mark_dirty(folio);
> -       swap_free_nr(swap, nr_pages);
>         put_swap_device(si);
>
>         *foliop = folio;
> diff --git a/mm/swap.h b/mm/swap.h
> index 6777b2ab9d92..9ed12936b889 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
>         spin_unlock_irq(&ci->lock);
>  }
>
> +/*
> + * Below are the core routines for doing swap for a folio.
> + * All helpers require the folio to be locked, and a locked folio
> + * in the swap cache pins the swap entries / slots allocated to the
> + * folio; swap relies heavily on the swap cache and folio lock for
> + * synchronization.
> + *
> + * folio_alloc_swap(): the entry point for a folio to be swapped
> + * out. It allocates swap slots and pins the slots with swap cache.
> + * The slots start with a swap count of zero.
> + *
> + * folio_dup_swap(): increases the swap count of a folio, usually
> + * as it gets unmapped and a swap entry is installed to replace
> + * it (e.g., swap entry in page table). A swap slot with swap
> + * count == 0 should only be increased by this helper.
> + *
> + * folio_put_swap(): does the opposite of folio_dup_swap().
> + */
> +int folio_alloc_swap(struct folio *folio);
> +int folio_dup_swap(struct folio *folio, struct page *subpage);
> +void folio_put_swap(struct folio *folio, struct page *subpage);
> +
>  /* linux/mm/page_io.c */
>  int sio_pool_init(void);
>  struct swap_iocb;
> @@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
>         return NULL;
>  }
>
> +static inline int folio_alloc_swap(struct folio *folio)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int folio_dup_swap(struct folio *folio, struct page *page)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline void folio_put_swap(struct folio *folio, struct page *page)
> +{
> +}
> +
>  static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>  {
>  }
> +
>  static inline void swap_write_unplug(struct swap_iocb *sio)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 567aea6f1cd4..7890039d2f65 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si,
>                 swp_entry_t entry, unsigned int nr_pages);
>  static void swap_range_alloc(struct swap_info_struct *si,
>                 unsigned int nr_entries);
> +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
> +static bool swap_entries_put_map(struct swap_info_struct *si,
> +               swp_entry_t entry, int nr);
>  static bool folio_swapcache_freeable(struct folio *folio);
>  static void move_cluster(struct swap_info_struct *si,
>                 struct swap_cluster_info *ci, struct list_head *list,
> @@ -1478,6 +1481,12 @@ int folio_alloc_swap(struct folio *folio)
>          */
>         WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
>
> +       /*
> +        * The allocator should always allocate aligned entries so folio-based
> +        * operations never cross more than one cluster.
> + */ > + VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio); > + > return 0; > > out_free: > @@ -1485,6 +1494,62 @@ int folio_alloc_swap(struct folio *folio) > return -ENOMEM; > } > > +/** > + * folio_dup_swap() - Increase swap count of swap entries of a folio. > + * @folio: folio with swap entries bounded. > + * @subpage: if not NULL, only increase the swap count of this subpage. > + * > + * Context: Caller must ensure the folio is locked and in the swap cache= . > + * The caller also has to ensure there is no raced call to > + * swap_put_entries_direct before this helper returns, or the swap > + * map may underflow (TODO: maybe we should allow or avoid underflow to > + * make swap refcount lockless). > + */ > +int folio_dup_swap(struct folio *folio, struct page *subpage) > +{ > + int err =3D 0; > + swp_entry_t entry =3D folio->swap; > + unsigned long nr_pages =3D folio_nr_pages(folio); > + > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); > + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); > + > + if (subpage) { > + entry.val +=3D folio_page_idx(folio, subpage); > + nr_pages =3D 1; > + } > + > + while (!err && __swap_duplicate(entry, 1, nr_pages) =3D=3D -ENOME= M) > + err =3D add_swap_count_continuation(entry, GFP_ATOMIC); > + > + return err; > +} > + > +/** > + * folio_put_swap() - Decrease swap count of swap entries of a folio. > + * @folio: folio with swap entries bounded, must be in swap cache and lo= cked. > + * @subpage: if not NULL, only decrease the swap count of this subpage. > + * > + * This won't free the swap slots even if swap count drops to zero, they= are > + * still pinned by the swap cache. User may call folio_free_swap to free= them. > + * Context: Caller must ensure the folio is locked and in the swap cache= . > + */ > +void folio_put_swap(struct folio *folio, struct page *subpage) > +{ > + swp_entry_t entry =3D folio->swap; > + unsigned long nr_pages =3D folio_nr_pages(folio); > + > + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); > + VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio); > + > + if (subpage) { > + entry.val +=3D folio_page_idx(folio, subpage); > + nr_pages =3D 1; > + } > + > + swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages= ); > +} > + > static struct swap_info_struct *_swap_info_get(swp_entry_t entry) > { > struct swap_info_struct *si; > @@ -1725,28 +1790,6 @@ static void swap_entries_free(struct swap_info_str= uct *si, > partial_free_cluster(si, ci); > } > > -/* > - * Caller has made sure that the swap device corresponding to entry > - * is still around or has not been recycled. > - */ > -void swap_free_nr(swp_entry_t entry, int nr_pages) > -{ > - int nr; > - struct swap_info_struct *sis; > - unsigned long offset =3D swp_offset(entry); > - > - sis =3D _swap_info_get(entry); > - if (!sis) > - return; > - > - while (nr_pages) { > - nr =3D min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % S= WAPFILE_CLUSTER); > - swap_entries_put_map(sis, swp_entry(sis->type, offset), n= r); > - offset +=3D nr; > - nr_pages -=3D nr; > - } > -} > - > /* > * Called after dropping swapcache to decrease refcnt to swap entries. > */ > @@ -1935,16 +1978,19 @@ bool folio_free_swap(struct folio *folio) > } > > /** > - * free_swap_and_cache_nr() - Release reference on range of swap entries= and > - * reclaim their cache if no more references = remain. > + * swap_put_entries_direct() - Release reference on range of swap entrie= s and > + * reclaim their cache if no more references= remain. > * @entry: First entry of range. 
>   * @nr: Number of entries in range.
>   *
>   * For each swap entry in the contiguous range, release a reference. If any swap
>   * entries become free, try to reclaim their underlying folios, if present. The
>   * offset range is defined by [entry.offset, entry.offset + nr).
> + *
> + * Context: Caller must ensure there is no race condition on the reference
> + * owner, e.g., locking the PTL of a PTE containing the entry being released.
>   */
> -void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> +void swap_put_entries_direct(swp_entry_t entry, int nr)
>  {
>         const unsigned long start_offset = swp_offset(entry);
>         const unsigned long end_offset = start_offset + nr;
> @@ -1953,10 +1999,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>         unsigned long offset;
>
>         si = get_swap_device(entry);
> -       if (!si)
> +       if (WARN_ON_ONCE(!si))
>                 return;
> -
> -       if (WARN_ON(end_offset > si->max))
> +       if (WARN_ON_ONCE(end_offset > si->max))
>                 goto out;
>
>         /*
> @@ -2000,8 +2045,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>  }
>
>  #ifdef CONFIG_HIBERNATION
> -
> -swp_entry_t get_swap_page_of_type(int type)
> +/* Allocate a slot for hibernation */
> +swp_entry_t swap_alloc_hibernation_slot(int type)
>  {
>         struct swap_info_struct *si = swap_type_to_info(type);
>         unsigned long offset;
> @@ -2029,6 +2074,27 @@ swp_entry_t get_swap_page_of_type(int type)
>         return entry;
>  }
>
> +/* Free a slot allocated by swap_alloc_hibernation_slot() */
> +void swap_free_hibernation_slot(swp_entry_t entry)
> +{
> +       struct swap_info_struct *si;
> +       struct swap_cluster_info *ci;
> +       pgoff_t offset = swp_offset(entry);
> +
> +       si = get_swap_device(entry);
> +       if (WARN_ON(!si))
> +               return;
> +
> +       ci = swap_cluster_lock(si, offset);
> +       swap_entry_put_locked(si, ci, entry, 1);
> +       WARN_ON(swap_entry_swapped(si, offset));
> +       swap_cluster_unlock(ci);
> +
> +       /* In theory readahead might add it to the swap cache by accident */
> +       __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> +       put_swap_device(si);
> +}
> +
>  /*
>   * Find the swap type that corresponds to given device (if any).
>   *
> @@ -2190,7 +2256,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>         /*
>          * Some architectures may have to restore extra metadata to the page
>          * when reading from swap. This metadata may be indexed by swap entry
> -        * so this must be called before swap_free().
> +        * so this must be called before folio_put_swap().
>          */
>         arch_swap_restore(folio_swap(entry, folio), folio);
>
> @@ -2231,7 +2297,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>                 new_pte = pte_mkuffd_wp(new_pte);
>  setpte:
>         set_pte_at(vma->vm_mm, addr, pte, new_pte);
> -       swap_free(entry);
> +       folio_put_swap(folio, page);
>  out:
>         if (pte)
>                 pte_unmap_unlock(pte, ptl);
> @@ -3741,28 +3807,22 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
>         return err;
>  }
>
> -/**
> - * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
> - *                       by 1.
> - *
> +/*
> + * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
>   * @entry: first swap entry from which we want to increase the refcount.
> - * @nr: Number of entries in range.
>   *
>   * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
>   * but could not be atomically allocated. Returns 0, just as if it succeeded,
>   * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
>   * might occur if a page table entry has got corrupted.
>   *
> - * Note that we are currently not handling the case where nr > 1 and we need to
> - * add swap count continuation. This is OK, because no such user exists - shmem
> - * is the only user that can pass nr > 1, and it never re-duplicates any swap
> - * entry it owns.
> + * Context: Caller must ensure there is no race condition on the reference
> + * owner, e.g., locking the PTL of a PTE containing the entry being increased.
>   */
> -int swap_duplicate_nr(swp_entry_t entry, int nr)
> +int swap_dup_entry_direct(swp_entry_t entry)
>  {
>         int err = 0;
> -
> -       while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
> +       while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
>                 err = add_swap_count_continuation(entry, GFP_ATOMIC);
>         return err;
>  }
>
> --
> 2.52.0
>