Message-ID: <62454122-d036-41e2-bd84-8a5839515ba4@arm.com>
Date: Tue, 5 Mar 2024 08:46:35 +0000
Subject: Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
To: David Hildenbrand , "Huang, Ying"
Cc: Andrew Morton , Matthew Wilcox , Gao Xiang , Yu Zhao , Yang Shi ,
 Michal Hocko , Kefeng Wang , Hugh Dickins ,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20231025144546.577640-1-ryan.roberts@arm.com>
 <20231025144546.577640-2-ryan.roberts@arm.com>
 <6541e29b-f25a-48b8-a553-fd8febe85e5a@redhat.com>
 <2934125a-f2e2-417c-a9f9-3cb1e074a44f@redhat.com>
 <049818ca-e656-44e4-b336-934992c16028@arm.com>
 <949b6c22-d737-4060-9ca1-a69d8e986d90@redhat.com>
 <9ed743a7-0c5d-49d9-b8b2-d58364df1f5f@arm.com>
 <65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com>
 <6cfc022a-0c7a-4fe6-aaa4-3d28aeacc982@arm.com>
 <3d47ae7d-297a-441e-941c-5b2e34ba8759@redhat.com>
 <3ae2da13-c33a-402f-9091-2c7328aea66a@arm.com>
 <87plw99phz.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <9a5cb081-c4f1-4abe-bb86-02aaca4e5433@redhat.com>
From: Ryan Roberts
In-Reply-To: <9a5cb081-c4f1-4abe-bb86-02aaca4e5433@redhat.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 05/03/2024 08:35, David Hildenbrand wrote:
> On 05.03.24 07:11, Huang, Ying wrote:
>> Ryan Roberts writes:
>>
>>> + Hugh
>>>
>>> On 04/03/2024 22:02, David Hildenbrand wrote:
>>>> On 04.03.24 22:55, Ryan Roberts wrote:
>>>>> On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get()
>>>>>>>>> would break if this could race with swapoff(), and __swap_entry_free()
>>>>>>>>> looks up the cluster from an array, which would also be freed by
>>>>>>>>> swapoff if racing:
>>>>>>>>>
>>>>>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>>>>>> {
>>>>>>>>>        struct swap_info_struct *p;
>>>>>>>>>        unsigned char count;
>>>>>>>>>
>>>>>>>>>        if (non_swap_entry(entry))
>>>>>>>>>            return 1;
>>>>>>>>>
>>>>>>>>>        p = _swap_info_get(entry);
>>>>>>>>>        if (p) {
>>>>>>>>>            count = __swap_entry_free(p, entry);
>>>>>>>>
>>>>>>>> If count dropped to 0 and
>>>>>>>>
>>>>>>>>>            if (count == SWAP_HAS_CACHE)
>>>>>>>>
>>>>>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We
>>>>>>>> removed it. That one would have to be reclaimed asynchronously.
>>>>>>>>
>>>>>>>> With the existing code, we would call swap_page_trans_huge_swapped()
>>>>>>>> with the SI it obtained via _swap_info_get().
>>>>>>>>
>>>>>>>> I also don't see what should be left protecting the SI. It's not locked
>>>>>>>> anymore, the swapcounts are at 0. We don't hold the folio lock.
>>>>>>>>
>>>>>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>>>>>>
>>>>>>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
>>>>>>> think this all works out ok? While free_swap_and_cache() is running,
>>>>>>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
>>>>>>> free_swap_and_cache() will never be called, because the swap entry will
>>>>>>> have been removed from the PTE?
>>>>>>
>>>>>> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother
>>>>>> about scanning any further page tables?
>>>>>>
>>>>>> But my head hurts from digging through that code.
>>>>>
>>>>> Yep, glad I'm not the only one that gets headaches from swapfile.c.
>>>>>
>>>>>>
>>>>>> Let me try again:
>>>>>>
>>>>>> __swap_entry_free() might be the last user and result in "count ==
>>>>>> SWAP_HAS_CACHE".
>>>>>>
>>>>>> swapoff->try_to_unuse() will stop as soon as si->inuse_pages == 0.
>>>>>>
>>>>>>
>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>> si->inuse_pages == 0 before we have completed
>>>>>> swap_page_trans_huge_swapped()?
>>>>>>
>>>>>> Imagine the following: a 2 MiB folio in the swapcache. Only 2 subpages
>>>>>> are still referenced by swap entries.
>>>>>>
>>>>>> Process 1 still references subpage 0 via a swap entry.
>>>>>> Process 2 still references subpage 1 via a swap entry.
>>>>>>
>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>
>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>
>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>> __try_to_reclaim_swap().
>>>>>>
>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
>>>>>> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
>>>>>> ...
>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>
>>>>>>
>>>>>> What stops swapoff from succeeding after process 2 has reclaimed the swap
>>>>>> cache, but before process 1 has finished its call to
>>>>>> swap_page_trans_huge_swapped()?
>>>>>
>>>>> Assuming you are talking about anonymous memory, process 1 holds the PTL
>>>>> while it's executing free_swap_and_cache(). try_to_unuse() iterates over
>>>>> every vma in every mm, and it swaps in a page for every PTE that holds a
>>>>> swap entry for the device being swapoff'ed. It takes the PTL while
>>>>> converting the swap entry to a present PTE - see unuse_pte(). Process 1
>>>>> must have beaten try_to_unuse() to the particular pte, because if
>>>>> try_to_unuse() got there first, it would have converted it from a swap
>>>>> entry to a present pte and process 1 would never even have called
>>>>> free_swap_and_cache(). So try_to_unuse() will eventually wait on the PTL
>>>>> until process 1 has released it after free_swap_and_cache() completes.
>>>>> Am I missing something? Because that part feels pretty clear to me.
>>>>
>>>> Why should try_to_unuse() do *anything* if it already finds
>>>> si->inuse_pages == 0, because we (p1 + p2) just freed the swap entries and
>>>> process 2 managed to free the last remaining swapcache entry?
>>>
>>> Yeah, ok. For some reason I thought unuse_mm() was iterating over all mms,
>>> and so the `while (READ_ONCE(si->inuse_pages))` was only evaluated after
>>> iterating over every mm. Oops.
>>>
>>> So yes, I agree with you; I think this is broken.
>>> And I'm a bit worried this could be a can of worms; by the same logic, I
>>> think folio_free_swap(), swp_swapcount() and probably others are broken in
>>> the same way.
>>
>> Don't worry too much :-), we have get_swap_device() at least. We can insert
>> it anywhere we want because it's quite lightweight. And, because swapoff()
>> is so rare, the race is theoretical only.

Thanks for the response!

>> For this specific case, I had thought that the PTL is enough. But after
>> looking at this more, I found a race here too. Until __swap_entry_free()
>> returns, we are OK: nobody can reduce the swap count because we hold the PTL.

Even that is not true for the shmem case: as far as I can see, shmem doesn't
have the PTL or any other synchronizing lock when it calls
free_swap_and_cache(). I don't think that changes anything solution-wise,
though.

>> But, after that, even if its return value is SWAP_HAS_CACHE (that is, the
>> entry is in the swap cache), a parallel try_to_unuse() or
>> __try_to_reclaim_swap() may remove the folio from the swap cache and so free
>> the swap entry. So, swapoff() can proceed to free the data structures in
>> parallel.
>>
>> To fix the race, we can add get/put_swap_device() in free_swap_and_cache().
>>
>> For other places, we can check whether get/put_swap_device() has been called
>> in the callers, and whether the swap reference we hold has been decreased
>> (e.g., a swap count protected by the PTL, or SWAP_HAS_CACHE protected by the
>> folio lock).
>
> Yes, sounds reasonable. We should likely update the documentation of
> get_swap_device() to say that after decrementing the refcount, the SI might
> become stale and should not be touched without a prior get_swap_device().

Yep, agreed. If nobody else is planning to do it, I'll try to create a test
case that provokes the problem, then put a patch together to fix it.
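
For reference, here is roughly the shape of the fix I have in mind for
free_swap_and_cache(): swap _swap_info_get() for get_swap_device() and add the
matching put_swap_device(). The body below is the snippet quoted at the top of
this thread with its tail (the swap_page_trans_huge_swapped() /
__try_to_reclaim_swap() call and the TTRS_* flags) filled in from memory of
swapfile.c, so please treat it as a rough sketch rather than the actual patch:

    int free_swap_and_cache(swp_entry_t entry)
    {
            struct swap_info_struct *p;
            unsigned char count;

            if (non_swap_entry(entry))
                    return 1;

            /*
             * Pin the swap device so a concurrent swapoff() cannot free the
             * swap_info_struct (or its cluster array) while we are still
             * dereferencing it below.
             */
            p = get_swap_device(entry);
            if (p) {
                    count = __swap_entry_free(p, entry);
                    if (count == SWAP_HAS_CACHE &&
                        !swap_page_trans_huge_swapped(p, entry))
                            __try_to_reclaim_swap(p, swp_offset(entry),
                                                  TTRS_UNMAPPED | TTRS_FULL);
                    put_swap_device(p);
            }
            return p != NULL;
    }

As I understand it, swapoff() won't tear down the swap_info_struct until all
get_swap_device() critical sections have drained, which should close the
window described above. I'll also look at the other callers (folio_free_swap(),
swp_swapcount(), etc.) as part of the same series.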