Message-ID: <73adae65-4429-41d7-bbb6-4c58156060d3@arm.com>
Date: Mon, 8 Apr 2024 10:22:51 +0100
Subject: Re: [PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
From: Ryan Roberts
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying, Gao Xiang,
 Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>,
 Chris Li, Lance Yang
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240403114032.1162100-1-ryan.roberts@arm.com>
 <20240403114032.1162100-3-ryan.roberts@arm.com>
 <051052af-3b56-4290-98d3-fd5a1eb11ce1@redhat.com>
In-Reply-To: <051052af-3b56-4290-98d3-fd5a1eb11ce1@redhat.com>

Andrew,

Looks like there are a few niggles for me to address below, so please don't
let v6 (currently in mm-unstable) progress to mm-stable. I'll aim to get a new
version out (or a patch for this, depending on how the discussion lands) in
the next couple of days.

On 05/04/2024 11:13, David Hildenbrand wrote:
> On 03.04.24 13:40, Ryan Roberts wrote:
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway, so we could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts
>> ---
>>   include/linux/pgtable.h | 28 ++++++++++++++
>>   include/linux/swap.h    | 12 ++++--
>>   mm/internal.h           | 48 +++++++++++++++++++++++
>>   mm/madvise.c            | 12 ++++--
>>   mm/memory.c             | 13 ++++---
>>   mm/swapfile.c           | 86 ++++++++++++++++++++++++++++++++---------
>>   6 files changed, 167 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a3fc8150b047..0278259f7078 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct
>> mm_struct *mm,
>>   }
>>   #endif
>>
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>
>
> Consecutive only in the page table or also in some other sense?
>
> I suspect: just unrelated non-present entries of any kind (swp, nonswp) and any
> offset/pfn.

Yes.

>
> Consider documenting that.

How about: "Clear multiple not present PTEs which are consecutive in the
pgtable"?

>
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> +        unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +    for (;;) {
>> +        pte_clear_not_present_full(mm, addr, ptep, full);
>> +        if (--nr == 0)
>> +            break;
>> +        ptep++;
>> +        addr += PAGE_SIZE;
>> +    }
>> +}
>> +#endif
>> +
>>   #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>   extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>                     unsigned long address,
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f6f78198f000..5737236dc3ce 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>>   extern int swapcache_prepare(swp_entry_t);
>>   extern void swap_free(swp_entry_t);
>>   extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> -extern int free_swap_and_cache(swp_entry_t);
>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>   int swap_type_of(dev_t device, sector_t offset);
>>   int find_first_swap(dev_t *device);
>>   extern unsigned int count_swap_pages(int, int);
>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct
>> *si)
>>   #define free_pages_and_swap_cache(pages, nr) \
>>       release_pages((pages), (nr));
>>
>
>
> [...]
>
>> +
>> +/**
>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @entry: Swap entry recovered from the first table entry.
>> + *
>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>> + * containing swap entries all with consecutive offsets and targeting the same
>> + * swap type.
>> + *
>
> Likely you should document that any swp pte bits are ignored?
Sorry, I don't understand this comment. I thought any non-none, non-present
PTE was always considered to contain only a "swap entry", and a swap entry
consists of a "type" and an "offset" only. (And it's a special "non-swap" swap
entry if type > SOME_CONSTANT.) Are you saying there are additional fields in
the PTE that are not part of the swap entry?

>
>> + * max_nr must be at least one and must be limited by the caller so scanning
>> + * cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> + */
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>> +                 swp_entry_t entry)
>> +{
>> +    const pte_t *end_ptep = start_ptep + max_nr;
>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>> +    unsigned int expected_type = swp_type(entry);
>> +    pte_t *ptep = start_ptep + 1;
>> +
>> +    VM_WARN_ON(max_nr < 1);
>> +    VM_WARN_ON(non_swap_entry(entry));
>> +
>> +    while (ptep < end_ptep) {
>> +        pte_t pte = ptep_get(ptep);
>> +
>> +        if (pte_none(pte) || pte_present(pte))
>> +            break;
>> +
>> +        entry = pte_to_swp_entry(pte);
>> +
>> +        if (non_swap_entry(entry) ||
>> +            swp_type(entry) != expected_type ||
>> +            swp_offset(entry) != expected_offset)
>> +            break;
>> +
>> +        expected_offset++;
>> +        ptep++;
>> +    }
>> +
>> +    return ptep - start_ptep;
>> +}
>
> Looks very clean :)
>
> I was wondering whether we could similarly construct the expected swp PTE and
> only check pte_same.
>
> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>
> ... or have a variant to increase only the swp offset for an existing pte. But
> non-trivial due to the arch-dependent format.
>
> But then, we'd fail on mismatch of other swp pte bits.

Hmm, perhaps I have a misunderstanding regarding "swp pte bits"...

>
> On swapin, when reusing this function (likely!), we'll likely need to make
> sure that the PTE bits match as well.
>
> See below regarding uffd-wp.
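To check I understand the pte_same() idea, it would be something roughly like
this? (Completely untested sketch; the _same suffix is just for illustration,
and it only uses generic helpers that already exist - swp_entry(),
swp_entry_to_pte(), pte_same(), ptep_get(). As you say, any other swp pte
bits, e.g. uffd-wp or soft-dirty, set on the real PTEs would make pte_same()
fail and end the batch early.)

static inline int swap_pte_batch_same(pte_t *start_ptep, int max_nr,
                 swp_entry_t entry)
{
    unsigned long expected_offset = swp_offset(entry) + 1;
    unsigned int expected_type = swp_type(entry);
    const pte_t *end_ptep = start_ptep + max_nr;
    pte_t *ptep = start_ptep + 1;

    while (ptep < end_ptep) {
        /* Rebuild the full expected swap PTE for the next slot... */
        pte_t expected_pte = swp_entry_to_pte(
                swp_entry(expected_type, expected_offset));

        /*
         * ...and compare whole PTEs, so a mismatch in any other swp
         * pte bits also terminates the batch.
         */
        if (!pte_same(ptep_get(ptep), expected_pte))
            break;

        expected_offset++;
        ptep++;
    }

    return ptep - start_ptep;
}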
>
>
>>   #endif /* CONFIG_MMU */
>>
>>   void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 1f77a51baaac..070bedb4996e 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>       struct folio *folio;
>>       int nr_swap = 0;
>>       unsigned long next;
>> +    int nr, max_nr;
>>
>>       next = pmd_addr_end(addr, end);
>>       if (pmd_trans_huge(*pmd))
>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>           return 0;
>>       flush_tlb_batched_pending(mm);
>>       arch_enter_lazy_mmu_mode();
>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>> +        nr = 1;
>>           ptent = ptep_get(pte);
>>
>>           if (pte_none(ptent))
>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>
>>               entry = pte_to_swp_entry(ptent);
>>               if (!non_swap_entry(entry)) {
>> -                nr_swap--;
>> -                free_swap_and_cache(entry);
>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +                max_nr = (end - addr) / PAGE_SIZE;
>> +                nr = swap_pte_batch(pte, max_nr, entry);
>> +                nr_swap -= nr;
>> +                free_swap_and_cache_nr(entry, nr);
>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>               } else if (is_hwpoison_entry(entry) ||
>>                      is_poisoned_swp_entry(entry)) {
>>                   pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7dc6c3d9fa83..ef2968894718 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>> *tlb,
>>                   folio_remove_rmap_pte(folio, page, vma);
>>               folio_put(folio);
>>           } else if (!non_swap_entry(entry)) {
>> -            /* Genuine swap entry, hence a private anon page */
>> +            max_nr = (end - addr) / PAGE_SIZE;
>> +            nr = swap_pte_batch(pte, max_nr, entry);
>> +            /* Genuine swap entries, hence a private anon pages */
>>               if (!should_zap_cows(details))
>>                   continue;
>> -            rss[MM_SWAPENTS]--;
>> -            if (unlikely(!free_swap_and_cache(entry)))
>> -                print_bad_pte(vma, addr, ptent, NULL);
>> +            rss[MM_SWAPENTS] -= nr;
>> +            free_swap_and_cache_nr(entry, nr);
>>           } else if (is_migration_entry(entry)) {
>>               folio = pfn_swap_entry_folio(entry);
>>               if (!should_zap_folio(details, folio))
>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>               pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>               WARN_ON_ONCE(1);
>>           }
>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>
> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>
> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>
> On mixture, you either lose some or place too many markers.

What path are you concerned about here? I don't see how what you describe can
happen. swap_pte_batch() will only give me a batch of actual swap entries, and
actual swap entries don't contain uffd-wp info, IIUC. If the function gets to
a "non-swap" swap entry, it bails. I thought the uffd-wp info was populated
based on the VMA state at swap-in? I think you are telling me that it's
persisted across the swap, per-pte?

>
> A simple workaround would be to disable any such batching if the VMA does have
> uffd-wp enabled.
>
>> +        zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>       } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>>       add_mm_rss_vec(mm, rss);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0d44ee2b4f9c..d059de6896c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>>   /* Reclaim the swap entry if swap is getting full*/
>>   #define TTRS_FULL        0x4
>>
>> -/* returns 1 if swap entry is freed */
>> +/*
>> + * returns number of pages in the folio that backs the swap entry. If positive,
>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>> + * folio was associated with the swap entry.
>> + */
>>   static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>                    unsigned long offset, unsigned long flags)
>>   {
>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>               ret = folio_free_swap(folio);
>>           folio_unlock(folio);
>>       }
>> +    ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>>       folio_put(folio);
>>       return ret;
>>   }
>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>           swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>           spin_lock(&si->lock);
>>           /* entry was freed successfully, try to use this again */
>> -        if (swap_was_freed)
>> +        if (swap_was_freed > 0)
>>               goto checks;
>>           goto scan; /* check next one */
>>       }
>> @@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio)
>>       return true;
>>   }
>>
>> -/*
>> - * Free the swap entry like above, but also try to
>> - * free the page cache entry if it is the last user.
>> - */
>> -int free_swap_and_cache(swp_entry_t entry)
>
> Can we have some documentation on what this function expects? How does nr
> relate to entry?
>
> i.e., offset range defined by [entry.offset, entry.offset + nr).

Yep, will add something along these lines.

>
>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>   {
>> -    struct swap_info_struct *p;
>
> It might be easier to get if you do
>
> const unsigned long start_offset = swp_offset(entry);
> const unsigned long end_offset = start_offset + nr;

OK, will do this.
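For the documentation, I'm thinking of something along these lines (rough
draft only, exact wording still to be decided):

/**
 * free_swap_and_cache_nr() - Release reference on range of swap entries and
 *                            reclaim their cache if no more references remain.
 * @entry: First entry of the range.
 * @nr: Number of entries in the range.
 *
 * The caller provides a contiguous range of entries, all with the same swap
 * type, covering offsets [entry.offset, entry.offset + nr). For each entry a
 * reference is dropped; where that was the last reference, an attempt is made
 * to reclaim the folio backing it from the swap cache.
 */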
>
>> +    unsigned long end = swp_offset(entry) + nr;
>> +    unsigned int type = swp_type(entry);
>> +    struct swap_info_struct *si;
>> +    bool any_only_cache = false;
>> +    unsigned long offset;
>>       unsigned char count;
>>
>>       if (non_swap_entry(entry))
>> -        return 1;
>> +        return;
>>
>> -    p = get_swap_device(entry);
>> -    if (p) {
>> -        if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>> -            put_swap_device(p);
>> -            return 0;
>> +    si = get_swap_device(entry);
>> +    if (!si)
>> +        return;
>> +
>> +    if (WARN_ON(end > si->max))
>> +        goto out;
>> +
>> +    /*
>> +     * First free all entries in the range.
>> +     */
>> +    for (offset = swp_offset(entry); offset < end; offset++) {
>> +        if (!WARN_ON(data_race(!si->swap_map[offset]))) {
>
> Ouch, "!(!)". Confusing.
>
> I'm sure there is a better way to write that, maybe using more lines
>
> if (data_race(si->swap_map[offset])) {
>     ...
> } else {
>     WARN_ON_ONCE(1);
> }

OK, I thought it was kinda neat myself. I'll change it.

>
>
>> +            count = __swap_entry_free(si, swp_entry(type, offset));
>> +            if (count == SWAP_HAS_CACHE)
>> +                any_only_cache = true;
>>           }
>> +    }
>> +
>> +    /*
>> +     * Short-circuit the below loop if none of the entries had their
>> +     * reference drop to zero.
>> +     */
>> +    if (!any_only_cache)
>> +        goto out;
>>
>> -        count = __swap_entry_free(p, entry);
>> -        if (count == SWAP_HAS_CACHE)
>> -            __try_to_reclaim_swap(p, swp_offset(entry),
>> +    /*
>> +     * Now go back over the range trying to reclaim the swap cache. This is
>> +     * more efficient for large folios because we will only try to reclaim
>> +     * the swap once per folio in the common case. If we do
>> +     * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>> +     * latter will get a reference and lock the folio for every individual
>> +     * page but will only succeed once the swap slot for every subpage is
>> +     * zero.
>> +     */
>> +    for (offset = swp_offset(entry); offset < end; offset += nr) {
>> +        nr = 1;
>> +        if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>
> Here we use READ_ONCE() only, above data_race(). Hmmm.

Yes, I think this is correct. READ_ONCE() is a "marked access" which KCSAN
understands, so it won't complain about it; data_race() isn't required when
READ_ONCE() (or WRITE_ONCE()) is used.

I believe READ_ONCE() is required here because we don't hold a lock and we
want to make sure the value is read in a non-tearing manner. We don't need
READ_ONCE() above since we don't care about the exact value - only that it's
not 0 (because we should be holding a reference). So there we do a plain
access to give the compiler a bit more freedom, but we need to mark it with
data_race() to stop KCSAN from complaining.

>
>> +            /*
>> +             * Folios are always naturally aligned in swap so
>> +             * advance forward to the next boundary. Zero means no
>> +             * folio was found for the swap entry, so advance by 1
>> +             * in this case. Negative value means folio was found
>> +             * but could not be reclaimed. Here we can still advance
>> +             * to the next boundary.
>> +             */
>> +            nr = __try_to_reclaim_swap(si, offset,
>>                             TTRS_UNMAPPED | TTRS_FULL);
>> -        put_swap_device(p);
>> +            if (nr == 0)
>> +                nr = 1;
>> +            else if (nr < 0)
>> +                nr = -nr;
>> +            nr = ALIGN(offset + 1, nr) - offset;
>> +        }
>
> Apart from that nothing jumped at me.

Thanks for the review!