Message-ID: <73adae65-4429-41d7-bbb6-4c58156060d3@arm.com>
Date: Mon, 8 Apr 2024 10:22:51 +0100
Subject: Re: [PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
From: Ryan Roberts
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying, Gao Xiang,
 Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>,
 Chris Li, Lance Yang
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240403114032.1162100-1-ryan.roberts@arm.com>
 <20240403114032.1162100-3-ryan.roberts@arm.com>
 <051052af-3b56-4290-98d3-fd5a1eb11ce1@redhat.com>
In-Reply-To: <051052af-3b56-4290-98d3-fd5a1eb11ce1@redhat.com>

Andrew,

Looks like there are a few niggles for me to address below, so please don't
let v6 (currently in mm-unstable) progress to mm-stable. I'll aim to get a new
version out (or a patch for this, depending on how the discussion lands) in
the next couple of days.

On 05/04/2024 11:13, David Hildenbrand wrote:
> On 03.04.24 13:40, Ryan Roberts wrote:
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway, so we could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts
>> ---
>>   include/linux/pgtable.h | 28 ++++++++++++++
>>   include/linux/swap.h    | 12 ++++--
>>   mm/internal.h           | 48 +++++++++++++++++++++++
>>   mm/madvise.c            | 12 ++++--
>>   mm/memory.c             | 13 ++++---
>>   mm/swapfile.c           | 86 ++++++++++++++++++++++++++++++++---------
>>   6 files changed, 167 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a3fc8150b047..0278259f7078 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct
>> mm_struct *mm,
>>   }
>>   #endif
>>
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>
>
> Consecutive only in the page table or also in some other sense?
>
> I suspect: just unrelated non-present entries of any kind (swp, nonswp) and any
> offset/pfn.

Yes.

>
> Consider documenting that.

How about: "Clear multiple not present PTEs which are consecutive in the
pgtable"?

>
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> +        unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +    for (;;) {
>> +        pte_clear_not_present_full(mm, addr, ptep, full);
>> +        if (--nr == 0)
>> +            break;
>> +        ptep++;
>> +        addr += PAGE_SIZE;
>> +    }
>> +}
>> +#endif
>> +
>>   #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>   extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>                     unsigned long address,
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f6f78198f000..5737236dc3ce 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>>   extern int swapcache_prepare(swp_entry_t);
>>   extern void swap_free(swp_entry_t);
>>   extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> -extern int free_swap_and_cache(swp_entry_t);
>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>   int swap_type_of(dev_t device, sector_t offset);
>>   int find_first_swap(dev_t *device);
>>   extern unsigned int count_swap_pages(int, int);
>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct
>> *si)
>>   #define free_pages_and_swap_cache(pages, nr) \
>>       release_pages((pages), (nr));
>>
>
>
> [...]
>
>> +
>> +/**
>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @entry: Swap entry recovered from the first table entry.
>> + *
>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>> + * containing swap entries all with consecutive offsets and targeting the same
>> + * swap type.
>> + *
>
> Likely you should document that any swp pte bits are ignored?
Sorry, I don't understand this comment. I thought any non-none, non-present
PTE was always considered to contain only a "swap entry", and a swap entry
consists of a "type" and an "offset" only. (And it's a special "non-swap" swap
entry if type > SOME_CONSTANT.) Are you saying there are additional fields in
the PTE that are not part of the swap entry?

>
>> + * max_nr must be at least one and must be limited by the caller so scanning
>> + * cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> + */
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>> +                 swp_entry_t entry)
>> +{
>> +    const pte_t *end_ptep = start_ptep + max_nr;
>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>> +    unsigned int expected_type = swp_type(entry);
>> +    pte_t *ptep = start_ptep + 1;
>> +
>> +    VM_WARN_ON(max_nr < 1);
>> +    VM_WARN_ON(non_swap_entry(entry));
>> +
>> +    while (ptep < end_ptep) {
>> +        pte_t pte = ptep_get(ptep);
>> +
>> +        if (pte_none(pte) || pte_present(pte))
>> +            break;
>> +
>> +        entry = pte_to_swp_entry(pte);
>> +
>> +        if (non_swap_entry(entry) ||
>> +            swp_type(entry) != expected_type ||
>> +            swp_offset(entry) != expected_offset)
>> +            break;
>> +
>> +        expected_offset++;
>> +        ptep++;
>> +    }
>> +
>> +    return ptep - start_ptep;
>> +}
>
> Looks very clean :)
>
> I was wondering whether we could similarly construct the expected swp PTE and
> only check pte_same.
>
> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>
> ... or have a variant to increase only the swp offset for an existing pte. But
> non-trivial due to the arch-dependent format.
>
> But then, we'd fail on mismatch of other swp pte bits.

Hmm, perhaps I have a misunderstanding regarding "swp pte bits"...

>
> On swapin, when reusing this function (likely!), we'll likely need to make
> sure that the PTE bits match as well.
>
> See below regarding uffd-wp.
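To check I understand the pte_same() idea, it would be something roughly like
this? (Completely untested sketch; the _same suffix is just for illustration,
and it only uses generic helpers that already exist - swp_entry(),
swp_entry_to_pte(), pte_same(), ptep_get(). As you say, any other swp pte
bits, e.g. uffd-wp or soft-dirty, set on the real PTEs would make pte_same()
fail and end the batch early.)

static inline int swap_pte_batch_same(pte_t *start_ptep, int max_nr,
                 swp_entry_t entry)
{
    unsigned long expected_offset = swp_offset(entry) + 1;
    unsigned int expected_type = swp_type(entry);
    const pte_t *end_ptep = start_ptep + max_nr;
    pte_t *ptep = start_ptep + 1;

    while (ptep < end_ptep) {
        /* Rebuild the full expected swap PTE for the next slot... */
        pte_t expected_pte = swp_entry_to_pte(
                swp_entry(expected_type, expected_offset));

        /*
         * ...and compare whole PTEs, so a mismatch in any other swp
         * pte bits also terminates the batch.
         */
        if (!pte_same(ptep_get(ptep), expected_pte))
            break;

        expected_offset++;
        ptep++;
    }

    return ptep - start_ptep;
}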
>
>
>>   #endif /* CONFIG_MMU */
>>
>>   void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 1f77a51baaac..070bedb4996e 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>       struct folio *folio;
>>       int nr_swap = 0;
>>       unsigned long next;
>> +    int nr, max_nr;
>>
>>       next = pmd_addr_end(addr, end);
>>       if (pmd_trans_huge(*pmd))
>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>           return 0;
>>       flush_tlb_batched_pending(mm);
>>       arch_enter_lazy_mmu_mode();
>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>> +        nr = 1;
>>           ptent = ptep_get(pte);
>>
>>           if (pte_none(ptent))
>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>
>>               entry = pte_to_swp_entry(ptent);
>>               if (!non_swap_entry(entry)) {
>> -                nr_swap--;
>> -                free_swap_and_cache(entry);
>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +                max_nr = (end - addr) / PAGE_SIZE;
>> +                nr = swap_pte_batch(pte, max_nr, entry);
>> +                nr_swap -= nr;
>> +                free_swap_and_cache_nr(entry, nr);
>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>               } else if (is_hwpoison_entry(entry) ||
>>                      is_poisoned_swp_entry(entry)) {
>>                   pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7dc6c3d9fa83..ef2968894718 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>> *tlb,
>>                   folio_remove_rmap_pte(folio, page, vma);
>>               folio_put(folio);
>>           } else if (!non_swap_entry(entry)) {
>> -            /* Genuine swap entry, hence a private anon page */
>> +            max_nr = (end - addr) / PAGE_SIZE;
>> +            nr = swap_pte_batch(pte, max_nr, entry);
>> +            /* Genuine swap entries, hence a private anon pages */
>>               if (!should_zap_cows(details))
>>                   continue;
>> -            rss[MM_SWAPENTS]--;
>> -            if (unlikely(!free_swap_and_cache(entry)))
>> -                print_bad_pte(vma, addr, ptent, NULL);
>> +            rss[MM_SWAPENTS] -= nr;
>> +            free_swap_and_cache_nr(entry, nr);
>>           } else if (is_migration_entry(entry)) {
>>               folio = pfn_swap_entry_folio(entry);
>>               if (!should_zap_folio(details, folio))
>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>               pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>               WARN_ON_ONCE(1);
>>           }
>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>
> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>
> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>
> On mixture, you either lose some or place too many markers.

What path are you concerned about here? I don't see how what you describe can
happen. swap_pte_batch() will only give me a batch of actual swap entries, and
actual swap entries don't contain uffd-wp info, IIUC. If the function gets to
a "non-swap" swap entry, it bails. I thought the uffd-wp info was populated
based on the VMA state at swap-in? I think you are telling me that it's
persisted across the swap, per-pte?

>
> A simple workaround would be to disable any such batching if the VMA does have
> uffd-wp enabled.
>
>> +        zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>       } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>>       add_mm_rss_vec(mm, rss);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0d44ee2b4f9c..d059de6896c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>>   /* Reclaim the swap entry if swap is getting full*/
>>   #define TTRS_FULL        0x4
>>
>> -/* returns 1 if swap entry is freed */
>> +/*
>> + * returns number of pages in the folio that backs the swap entry. If positive,
>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>> + * folio was associated with the swap entry.
>> + */
>>   static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>                    unsigned long offset, unsigned long flags)
>>   {
>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>               ret = folio_free_swap(folio);
>>           folio_unlock(folio);
>>       }
>> +    ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>>       folio_put(folio);
>>       return ret;
>>   }
>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>           swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>           spin_lock(&si->lock);
>>           /* entry was freed successfully, try to use this again */
>> -        if (swap_was_freed)
>> +        if (swap_was_freed > 0)
>>               goto checks;
>>           goto scan; /* check next one */
>>       }
>> @@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio)
>>       return true;
>>   }
>>
>> -/*
>> - * Free the swap entry like above, but also try to
>> - * free the page cache entry if it is the last user.
>> - */
>> -int free_swap_and_cache(swp_entry_t entry)
>
> Can we have some documentation on what this function expects? How does nr
> relate to entry?
>
> i.e., offset range defined by [entry.offset, entry.offset + nr).

Yep, will add something along these lines.

>
>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>   {
>> -    struct swap_info_struct *p;
>
> It might be easier to get if you do
>
> const unsigned long start_offset = swp_offset(entry);
> const unsigned long end_offset = start_offset + nr;

OK, will do this.
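For the documentation, I'm thinking of something along these lines (rough
draft only, exact wording still to be decided):

/**
 * free_swap_and_cache_nr() - Release reference on range of swap entries and
 *                            reclaim their cache if no more references remain.
 * @entry: First entry of the range.
 * @nr: Number of entries in the range.
 *
 * The caller provides a contiguous range of entries, all with the same swap
 * type, covering offsets [entry.offset, entry.offset + nr). For each entry a
 * reference is dropped; where that was the last reference, an attempt is made
 * to reclaim the folio backing it from the swap cache.
 */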
>
>> +    unsigned long end = swp_offset(entry) + nr;
>> +    unsigned int type = swp_type(entry);
>> +    struct swap_info_struct *si;
>> +    bool any_only_cache = false;
>> +    unsigned long offset;
>>       unsigned char count;
>>
>>       if (non_swap_entry(entry))
>> -        return 1;
>> +        return;
>>
>> -    p = get_swap_device(entry);
>> -    if (p) {
>> -        if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>> -            put_swap_device(p);
>> -            return 0;
>> +    si = get_swap_device(entry);
>> +    if (!si)
>> +        return;
>> +
>> +    if (WARN_ON(end > si->max))
>> +        goto out;
>> +
>> +    /*
>> +     * First free all entries in the range.
>> +     */
>> +    for (offset = swp_offset(entry); offset < end; offset++) {
>> +        if (!WARN_ON(data_race(!si->swap_map[offset]))) {
>
> Ouch, "!(!)". Confusing.
>
> I'm sure there is a better way to write that, maybe using more lines
>
> if (data_race(si->swap_map[offset])) {
>     ...
> } else {
>     WARN_ON_ONCE(1);
> }

OK, I thought it was kinda neat myself. I'll change it.

>
>
>> +            count = __swap_entry_free(si, swp_entry(type, offset));
>> +            if (count == SWAP_HAS_CACHE)
>> +                any_only_cache = true;
>>           }
>> +    }
>> +
>> +    /*
>> +     * Short-circuit the below loop if none of the entries had their
>> +     * reference drop to zero.
>> +     */
>> +    if (!any_only_cache)
>> +        goto out;
>>
>> -        count = __swap_entry_free(p, entry);
>> -        if (count == SWAP_HAS_CACHE)
>> -            __try_to_reclaim_swap(p, swp_offset(entry),
>> +    /*
>> +     * Now go back over the range trying to reclaim the swap cache. This is
>> +     * more efficient for large folios because we will only try to reclaim
>> +     * the swap once per folio in the common case. If we do
>> +     * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>> +     * latter will get a reference and lock the folio for every individual
>> +     * page but will only succeed once the swap slot for every subpage is
>> +     * zero.
>> +     */
>> +    for (offset = swp_offset(entry); offset < end; offset += nr) {
>> +        nr = 1;
>> +        if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>
> Here we use READ_ONCE() only, above data_race(). Hmmm.

Yes, I think this is correct. READ_ONCE() is a "marked access" which KCSAN
understands, so it won't complain about it; data_race() isn't required when
READ_ONCE() (or WRITE_ONCE()) is used.

I believe READ_ONCE() is required here because we don't hold a lock and we
want to make sure the value is read in a non-tearing manner. We don't need
READ_ONCE() above since we don't care about the exact value - only that it's
not 0 (because we should be holding a reference). So there we do a plain
access to give the compiler a bit more freedom, but we need to mark it with
data_race() to stop KCSAN from complaining.

>
>> +            /*
>> +             * Folios are always naturally aligned in swap so
>> +             * advance forward to the next boundary. Zero means no
>> +             * folio was found for the swap entry, so advance by 1
>> +             * in this case. Negative value means folio was found
>> +             * but could not be reclaimed. Here we can still advance
>> +             * to the next boundary.
>> +             */
>> +            nr = __try_to_reclaim_swap(si, offset,
>>                             TTRS_UNMAPPED | TTRS_FULL);
>> -        put_swap_device(p);
>> +            if (nr == 0)
>> +                nr = 1;
>> +            else if (nr < 0)
>> +                nr = -nr;
>> +            nr = ALIGN(offset + 1, nr) - offset;
>> +        }
>
> Apart from that nothing jumped at me.

Thanks for the review!