From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9D32DC54798 for ; Tue, 5 Mar 2024 16:33:20 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 35FD66B007D; Tue, 5 Mar 2024 11:33:20 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 30F0A6B0081; Tue, 5 Mar 2024 11:33:20 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 1D6A26B0082; Tue, 5 Mar 2024 11:33:20 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 0C7866B007D for ; Tue, 5 Mar 2024 11:33:20 -0500 (EST) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id AE37C160F37 for ; Tue, 5 Mar 2024 16:33:19 +0000 (UTC) X-FDA: 81863530518.26.17B9B77 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf21.hostedemail.com (Postfix) with ESMTP id 407241C0009 for ; Tue, 5 Mar 2024 16:33:16 +0000 (UTC) Authentication-Results: imf21.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf21.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709656396; a=rsa-sha256; cv=none; b=duYFqaV+84XNwd1hgrzS8Uf0/ifagbymAk+DW260nehZ7gclqgeeIo3+py30KA81nvjF9P VObUWOoViZxGAXdQE4MY/9MZLLjNTX9l5SyTC/GQ0L5SXi0IMQ9Mgl7yxm49mnr934tywI LZUXqSUmO7vHGISWeEdCVRFAEIV8kO4= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf21.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709656396; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=KG4V2X8WwNmHCblQ0CBLQ6rGYM8rcX/d4QrgrmTLQuo=; b=8o+UzRe0oxvpl0+ZamKe9k/BKK3366sTyZBUxkcvBeSjaPHmQLfz2efgQl/D3KN+wduk7f lKp7qTp64fmE662KAxYXcGjLDr+ValpPmzhyHANXMAvnPTA1rp0K9z6J8IuTDTPF1bF2sg uIFtVLT76quVR5M87PcbB0/ZG0zLkx4= Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id D3D5D1FB; Tue, 5 Mar 2024 08:33:51 -0800 (PST) Received: from [10.1.39.151] (XHFQ2J9959.cambridge.arm.com [10.1.39.151]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 3AA163F738; Tue, 5 Mar 2024 08:33:14 -0800 (PST) Message-ID: <8989df79-84f5-488c-bd74-c11d2241eff1@arm.com> Date: Tue, 5 Mar 2024 16:33:12 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v1] mm: swap: Fix race between free_swap_and_cache() and swapoff() Content-Language: en-GB To: David Hildenbrand , Andrew Morton , "Huang, Ying" Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org References: <20240305151349.3781428-1-ryan.roberts@arm.com> From: Ryan Roberts In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 407241C0009 X-Stat-Signature: 67d87bw8qxyf5bcf4f7d1goetpmjhp9e X-Rspam-User: X-HE-Tag: 1709656396-926981 X-HE-Meta: U2FsdGVkX1/+IkmBBZg5IY8zFJK/25u0a0etn5sS2wp0oaBfoneTzn1MwGnfGY36m5tW+A+GlDjDilAeChwyprPGy8zrLALMIK9L6w4oUQ6TYmWOehHhch2dPFfSHdbZmH7lzNjlcOVTDUI44X+2JnAIXjz16stvM4ro3CnC4ngAr9UXewqI2oETcKryOlBbNd7RLv7qZmjyLn/nq1//20fCAV8WgwzjSmyc9hkYSgCCxRQ1+z4v04L+IE0NZdsRfmGnPYilSoHUjVwHD/jygTWfEjjf88/cRtJACUXiJdrX1CdPyCnIhOeju6qecYzdJJtZDJhD+h20d2s9dFCSvMIAvYmLWyuSx5DWHeE8EHhdOwqtRx7otI8hQh5Wqmx6mtj3DFG1NJJ44Nfa7S9at4/efDxJ1GM+C5McLp1YFUJ5mIa5izB3K0oNG7BCtGJ/Iqq5eNwJW5mp9b0roRBEp3GnS2ep9VkZUD+fnF+nKfu9jjoIr7suDZvQZVh5EmTcG/l2XtO/DhDhnxzFUBFy6VZwb21YNUsncBHjTC9YhupETPLQIQsbLlKQ9AGZudkSQqaFSP57YZuK6gQ3Ulvsfyavz3OKr44Z4E/3osFkHjFEb+VgOSOMIF5Z73xtVPpvN6W/6wa1VAjKxWtAbDrg3liqk9lZWXBzDwuJAA4jDoQuFQkzybOlyAAU6txK83pCmROBA4pGL/kns51qkIONRMivSilGI4AnIABnBLAYWAP4qQzYpBKG8erg7XDxelw5GbDINyWk9sjf/EpZqsDG+ya2ehHIcAHSTXtzW2jgeLhf/PhVr9IU9iSKkGpETwnxIeWgl1MEFRfG1Jtniz1Gru1+fLkHAdbQScCGYjHphWYS0+vuC3MQeXE1eC0PARZGVqKEXGvOMOm9XmstxIm+DLzZomPNKpEK11axcv+KbG+ldpcvqOCw8DLClQy1fsnvlhlCFGKaT/A+eyo1XMg cWx4/yPc tXKA2L3lgZ35m8wsqjkRY+prymQOTKJyoTWcBND2MSsXtuVOMkEzBD9vhszPfaR5JRBLv4VRJxWxDqPotV1g6jaqpJ+31vF3X40Y3OiG7JNhMtVA+/zDMvpR0BTlqQnqSQtT5RqK+yvZ1QdNJ9XjEd7cYlgsIDWhDCX6gjVYGGJwnK6vTKyElHXxF0wBnYZqGPpVGhtiCZ0lUn5FpVt7pPdVq9w4YDX6ggL3EI2Zl8doa/sMXl7pQNeduL4ohMzdB7OPPSVWBtzfK8oih9h2jU2oof5vJDkF3AqPurtBS1EQdXodcZWFvaZLQLeX6UT1M66L6HyCvpKX3OKgAnG7z/+jCFXkzg0oYToXI4EFi6eIWLfQ= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 05/03/2024 15:50, David Hildenbrand wrote: > On 05.03.24 16:13, Ryan Roberts wrote: >> There was previously a theoretical window where swapoff() could run and >> teardown a swap_info_struct while a call to free_swap_and_cache() was >> running in another thread. This could cause, amongst other bad >> possibilities, swap_page_trans_huge_swapped() (called by >> free_swap_and_cache()) to access the freed memory for swap_map. >> >> This is a theoretical problem and I haven't been able to provoke it from >> a test case. But there has been agreement based on code review that this >> is possible (see link below). >> >> Fix it by using get_swap_device()/put_swap_device(), which will stall >> swapoff(). There was an extra check in _swap_info_get() to confirm that >> the swap entry was valid. This wasn't present in get_swap_device() so >> I've added it. I couldn't find any existing get_swap_device() call sites >> where this extra check would cause any false alarms. >> >> Details of how to provoke one possible issue (thanks to David Hilenbrand >> for deriving this): > > Almost > > "s/Hilenbrand/Hildenbrand/" :) Ahh sorry... I even explicitly checked it against your name on emails... fat fingers... > >> >> --8<----- >> >> __swap_entry_free() might be the last user and result in >> "count == SWAP_HAS_CACHE". >> >> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0. >> >> So the question is: could someone reclaim the folio and turn >> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped(). >> >> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are >> still references by swap entries. >> >> Process 1 still references subpage 0 via swap entry. >> Process 2 still references subpage 1 via swap entry. >> >> Process 1 quits. Calls free_swap_and_cache(). >> -> count == SWAP_HAS_CACHE >> [then, preempted in the hypervisor etc.] >> >> Process 2 quits. Calls free_swap_and_cache(). >> -> count == SWAP_HAS_CACHE >> >> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls >> __try_to_reclaim_swap(). >> >> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> >> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> >> swap_entry_free()->swap_range_free()-> >> ... >> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); >> >> What stops swapoff to succeed after process 2 reclaimed the swap cache >> but before process1 finished its call to swap_page_trans_huge_swapped()? >> >> --8<----- >> >> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch") >> Closes: >> https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/ >> Cc: stable@vger.kernel.org >> Signed-off-by: Ryan Roberts >> --- >> >> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4). >> >> Thanks, >> Ryan >> >>   mm/swapfile.c | 14 +++++++++++--- >>   1 file changed, 11 insertions(+), 3 deletions(-) >> >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index 2b3a2d85e350..f580e6abc674 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry) >>       smp_rmb(); >>       offset = swp_offset(entry); >>       if (offset >= si->max) >> -        goto put_out; >> +        goto bad_offset; >> +    if (data_race(!si->swap_map[swp_offset(entry)])) >> +        goto bad_free; >> >>       return si; >>   bad_nofile: >> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t >> entry) >>   out: >>       return NULL; >>   put_out: >> -    pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); >>       percpu_ref_put(&si->users); >>       return NULL; >> +bad_offset: >> +    pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); >> +    goto put_out; >> +bad_free: >> +    pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); >> +    goto put_out; >>   } >> >>   static unsigned char __swap_entry_free(struct swap_info_struct *p, >> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry) >>       if (non_swap_entry(entry)) >>           return 1; >> >> -    p = _swap_info_get(entry); >> +    p = get_swap_device(entry); >>       if (p) { >>           count = __swap_entry_free(p, entry); >>           if (count == SWAP_HAS_CACHE && >>               !swap_page_trans_huge_swapped(p, entry)) >>               __try_to_reclaim_swap(p, swp_offset(entry), >>                             TTRS_UNMAPPED | TTRS_FULL); >> +        put_swap_device(p); >>       } >>       return p != NULL; >>   } >> -- >> 2.25.1 >> > > LGTM > > Are you planning on sending a doc extension for get_swap_device()? I saw your comment: We should likely update the documentation of get_swap_device(), that after decrementing the refcount, the SI might become stale and should not be touched without a prior get_swap_device(). But when I went to make the changes, I saw the documentation already said: ...we need to enclose all swap related functions with get_swap_device() and put_swap_device()... Notice that swapoff ... can still happen before the percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in put_swap_device()... The caller must be prepared for that. I thought that already covered it? I'm sure as usual, I've misunderstood your point. Happy to respin if you have something in mind?