From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 044A2C77B75 for ; Wed, 17 May 2023 01:57:41 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8024C900004; Tue, 16 May 2023 21:57:41 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7B1B1900003; Tue, 16 May 2023 21:57:41 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6A0F0900004; Tue, 16 May 2023 21:57:41 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 59F42900003 for ; Tue, 16 May 2023 21:57:41 -0400 (EDT) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 1F6D21A03EF for ; Wed, 17 May 2023 01:57:41 +0000 (UTC) X-FDA: 80798085522.13.D3EEB31 Received: from mail-pl1-f178.google.com (mail-pl1-f178.google.com [209.85.214.178]) by imf29.hostedemail.com (Postfix) with ESMTP id 562D2120009 for ; Wed, 17 May 2023 01:57:39 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=G2g3cTJE; spf=pass (imf29.hostedemail.com: domain of pcc@google.com designates 209.85.214.178 as permitted sender) smtp.mailfrom=pcc@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1684288659; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=m17guDOQ12FqooxLGjI6njqGM315YV7VFvugPBs5F1U=; b=LxDxsrqjm9Bs339YUs9B7yT4mZ/Njj67d3aNG1zRpTZvY1CoBtVOUXmcp+1WHIb5nPm+KM CRvXwxA+QY6itQFd0TiB5I8fVpp0gckVl5s9yNY/qV6ryBWS8c8jMTuDl5pVPNH4d0mtWE NvJO2mS068fQ5LFEBdJljeeheNDnIR0= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1684288659; a=rsa-sha256; cv=none; b=1DH/dsfZY49B4PblKfQdTNCZ48IoDvRi94iXyA6MbzqT/wUoC1W9Hk6Ur0WG1w7NbXUrO0 tg92HB2HYA/cxcgt7uyfBkqvxLNWLd1B1GyHA9gJLTAGRWX1ATpcuUdBmS43GSDv9+89Jm v5/ZRxmBCYyVqfNRWo4Nl5h2awvT/5Y= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=G2g3cTJE; spf=pass (imf29.hostedemail.com: domain of pcc@google.com designates 209.85.214.178 as permitted sender) smtp.mailfrom=pcc@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-pl1-f178.google.com with SMTP id d9443c01a7336-1a950b982d4so70185ad.0 for ; Tue, 16 May 2023 18:57:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1684288658; x=1686880658; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=m17guDOQ12FqooxLGjI6njqGM315YV7VFvugPBs5F1U=; b=G2g3cTJEzvXtAGyeMhLS3Tp9CGxqFe6bTGp67/QaCOul7DNRgJGVysoeOKGC6MU8Nb hcrR9i5M+H1GAD6TiKPgXwlBK2V4d6vyMIHTJ0OLyBi1jK65vk+LYjCEeImS/ubCtS8I Xk6hgCWAsni/4am19MdEQGg1M/jj6BqX/ao6Sjgx/2/jffC2BBbPodTR+rwFNr1yLpr9 gzee4ygUv8v7j5MaMN3CO+z3tx3czXQR8M7yLggRMS8L15H8eOJefxg3AxRD9XVtqRew uzL7IMLaRr+A2T694VFyzbxIx/Bzg3Z8L9Ex5UWi+KcvOKpiPj0GhcXopupfoJUwKyLG J0pQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684288658; x=1686880658; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=m17guDOQ12FqooxLGjI6njqGM315YV7VFvugPBs5F1U=; b=kO/6tkmK8re9jZnm78qWuO/b2VXnYhkuSvUdyBx1mq7YKkoMzqUh8zW6c2UaMyfzu8 oecKaB58aEs+4gage5cYWoahBPBVLM++aRViTMRvLqw3B4fYB5hoiJLQ99YN8xFkmM/L qsAtxg9ClGGT7MBIcMqcHlrrB8N4qK3az+3vRe2FDJXDihef+1HpVstZ8FXCCt/+6Zxw R46XZMavmLCXAkamxrqgamdqpr/s494MFUwy7v05XpliUe2opCePSMn5C0lvvwCSQFGN ynEmsk+UUirePdlSxr/eNaOICO7Cnst5oq08AUoThXspHWZLGQBAxtKv2/QH3oYALkpy T68Q== X-Gm-Message-State: AC+VfDyhJEeDG8Jny+xjzngOhqJB1W4XJBf61Wjqcw02oW7n1CSpdYgh tiz/bNwivw9IO6IL3BiFvZn7ylPEWgRNQEGSnvBm8A== X-Google-Smtp-Source: ACHHUZ5TQRoxJBA+VrCOoj6SjVxOZ9tveqCruPZC8f50f90jN5fWSAtn66U8gRT/dRMCPRrIuOzIJoSdzUpr6zs+1O0= X-Received: by 2002:a17:902:ea09:b0:1ae:513a:944d with SMTP id s9-20020a170902ea0900b001ae513a944dmr71548plg.8.1684288657779; Tue, 16 May 2023 18:57:37 -0700 (PDT) MIME-Version: 1.0 References: <20230512235755.1589034-1-pcc@google.com> <20230512235755.1589034-2-pcc@google.com> <7471013e-4afb-e445-5985-2441155fc82c@redhat.com> <851940cd-64f1-9e59-3de9-b50701a99281@redhat.com> In-Reply-To: <851940cd-64f1-9e59-3de9-b50701a99281@redhat.com> From: Peter Collingbourne Date: Tue, 16 May 2023 18:57:26 -0700 Message-ID: Subject: Re: [PATCH 1/3] mm: Move arch_do_swap_page() call to before swap_free() To: David Hildenbrand Cc: Catalin Marinas , =?UTF-8?B?UXVuLXdlaSBMaW4gKOael+e+pOW0tCk=?= , linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, "surenb@google.com" , =?UTF-8?B?Q2hpbndlbiBDaGFuZyAo5by16Yym5paHKQ==?= , "kasan-dev@googlegroups.com" , =?UTF-8?B?S3Vhbi1ZaW5nIExlZSAo5p2O5Yag56mOKQ==?= , =?UTF-8?B?Q2FzcGVyIExpICjmnY7kuK3mpq4p?= , "gregkh@linuxfoundation.org" , vincenzo.frascino@arm.com, Alexandru Elisei , will@kernel.org, eugenis@google.com, Steven Price , stable@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: h6ebahqee9u5pkiqrc8ztuopyk4fwe9y X-Rspamd-Server: rspam03 X-Rspam-User: X-Rspamd-Queue-Id: 562D2120009 X-HE-Tag: 1684288659-287965 X-HE-Meta: U2FsdGVkX1875oWFTAW4CzokddL4XRVIu7FviL51zz83fp9r4WNJYW+Nt5VK/KkTCpy1wDYsVaYNaSuEbZ9CFgH4jkzHRDZrN+BGqf+e2uHYnXVvqthXVVEItdRwy/uTjwu7G14DIFQ+K8rK7cTBOkh8XPjk1bJ7eckp9u79HxcfQmd2qpG2YT4WzqnRe/XnclqsJjbAu3sydWfilY2W+Uwew+jK/O5zMtf6eKDTR4BfNMe4ZiDP+NVBNNNAKaWg0uqmxQWxtYeqecF4rD/OixzCbvT8wez1PsfNmT54iwo/89MR9WjMqPqEFl4qqyKbPMhIXOXv8f37sR2+fFxDX4sWYEMHFJittgGaHET7Rn4FmFASDIFpiwG+0ctJjP+IL8UgnGKSvWXgxvCTJv0Q5haiQ4vzxTfvFlmT8pDWUhyQFw3lz7ZE0QGN/RleRXmqVDJyVRPRqsa4vW+JKSqb91ps3nNjp1ekq5xH7Gp8z5+vpsEwUlIamKl1dDtweGacTX2PSx8zIaf0BP8htwqccyP7HyDjGPYAsjq3t3RNDQsy3+/lsFqPXdSAUjwMbEJq5BVZ7H3FzvMjA6ww/pra7fGLe3MVPoXRklmnug9BHcpCat/7jfr+NCsL+7Qmv5KLbDlThoCm2DREbSmKFcWwNrkVoYsVXW7ZmXW42jV/0+9NerzvVGj4XSR3z3CoWvkyguRBjr8hv9KNhwVqMr3HoRLyzosBN15tOBJ94URD78/AEPVOTrb5u3yTNeaDnG+Pcvgc3z5gStsDIuNgvLmmZWJ44/U9TiQWwdLf/TO4muB616RIdquTzerT9K8U05s1F2tylqT29g7eI+0oulDw/Sp8zSAOIbxDBvb4zuqLkv6LxazetHGLSUNGfLjoLouEWJkpptyfU2ewPrieAdzillzy2Mwb5i7sCGRxcB6Wc/uZMS9Ze7IlvO6lVhQZd0355kOqb1cE2YTawItjx2b FGuTUw/8 WZyaVXsgHECiH5vyDOht+3FzV0XTRKMr1u/CaOTvfl/iRvHioU3ORGZ5ltcppwaE0nf3Uu5KkIi2AHkzG7mVr1vZPk5BiVE1dZ+WH1yRxApu/OmY3E0r8Md59SjdXjFQUneYQhYrNqWVFmjix7EU+bDoTXvVjo4LdNtpvLKi5xx5JtceXDZKaneGONNIZSflo2CYFeDnC/Ju5PoGGdAAIXBuWR063MXukZcVwutyBpwUQ9A7jZyBml3S5sGMgu2DoxakQdMhJ2hSGZhbSqT1FVQ88zeQprTJwbG7AeNEgtlEYEne9QqkSfCx/EPysfTSl+sJDsPyu5e4CvMcSlUjwv8fjAvf2f6Wvm4sRhyhOiEIrgLh+uZdh3g+fc+EvehWvGiMNRpyo42S5Gq5PVz7NwW11k+XZh3xU+f5KPPwEHJk2Xgj5VDU4/9xS6Hsn9wbd+hsIjd0TSa4HH0JGXh0z7VpkhoQVtUWDu63YWizlZqpY+HY= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, May 16, 2023 at 5:35=E2=80=AFAM David Hildenbrand wrote: > > On 16.05.23 01:40, Peter Collingbourne wrote: > > On Mon, May 15, 2023 at 06:34:30PM +0100, Catalin Marinas wrote: > >> On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote: > >>> On 13.05.23 01:57, Peter Collingbourne wrote: > >>>> diff --git a/mm/memory.c b/mm/memory.c > >>>> index 01a23ad48a04..83268d287ff1 100644 > >>>> --- a/mm/memory.c > >>>> +++ b/mm/memory.c > >>>> @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>> } > >>>> } > >>>> - /* > >>>> - * Remove the swap entry and conditionally try to free up the swa= pcache. > >>>> - * We're already holding a reference on the page but haven't mapp= ed it > >>>> - * yet. > >>>> - */ > >>>> - swap_free(entry); > >>>> - if (should_try_to_free_swap(folio, vma, vmf->flags)) > >>>> - folio_free_swap(folio); > >>>> - > >>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > >>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > >>>> pte =3D mk_pte(page, vma->vm_page_prot); > >>>> - > >>>> /* > >>>> * Same logic as in do_wp_page(); however, optimize for p= ages that are > >>>> * certainly not shared either because we just allocated = them without > >>>> @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>> pte =3D pte_mksoft_dirty(pte); > >>>> if (pte_swp_uffd_wp(vmf->orig_pte)) > >>>> pte =3D pte_mkuffd_wp(pte); > >>>> + arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_p= te); > >>>> vmf->orig_pte =3D pte; > >>>> + /* > >>>> + * Remove the swap entry and conditionally try to free up the swa= pcache. > >>>> + * We're already holding a reference on the page but haven't mapp= ed it > >>>> + * yet. > >>>> + */ > >>>> + swap_free(entry); > >>>> + if (should_try_to_free_swap(folio, vma, vmf->flags)) > >>>> + folio_free_swap(folio); > >>>> + > >>>> + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > >>>> + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > >>>> + > >>>> /* ksm created a completely new copy */ > >>>> if (unlikely(folio !=3D swapcache && swapcache)) { > >>>> page_add_new_anon_rmap(page, vma, vmf->address); > >>>> @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>> VM_BUG_ON(!folio_test_anon(folio) || > >>>> (pte_write(pte) && !PageAnonExclusive(pag= e))); > >>>> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > >>>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_p= te); > >>>> folio_unlock(folio); > >>>> if (folio !=3D swapcache && swapcache) { > >>> > >>> > >>> You are moving the folio_free_swap() call after the folio_ref_count(f= olio) > >>> =3D=3D 1 check, which means that such (previously) swapped pages that= are > >>> exclusive cannot be detected as exclusive. > >>> > >>> There must be a better way to handle MTE here. > >>> > >>> Where are the tags stored, how is the location identified, and when a= re they > >>> effectively restored right now? > >> > >> I haven't gone through Peter's patches yet but a pretty good descripti= on > >> of the problem is here: > >> https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.c= amel@mediatek.com/. > >> I couldn't reproduce it with my swap setup but both Qun-wei and Peter > >> triggered it. > > > > In order to reproduce this bug it is necessary for the swap slot cache > > to be disabled, which is unlikely to occur during normal operation. I > > was only able to reproduce the bug by disabling it forcefully with the > > following patch: > > > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > > index 0bec1f705f8e0..25afba16980c7 100644 > > --- a/mm/swap_slots.c > > +++ b/mm/swap_slots.c > > @@ -79,7 +79,7 @@ void disable_swap_slots_cache_lock(void) > > > > static void __reenable_swap_slots_cache(void) > > { > > - swap_slot_cache_enabled =3D has_usable_swap(); > > + swap_slot_cache_enabled =3D false; > > } > > > > void reenable_swap_slots_cache_unlock(void) > > > > With that I can trigger the bug on an MTE-utilizing process by running > > a program that enumerates the process's private anonymous mappings and > > calls process_madvise(MADV_PAGEOUT) on all of them. > > > >> When a tagged page is swapped out, the arm64 code stores the metadata > >> (tags) in a local xarray indexed by the swap pte. When restoring from > >> swap, the arm64 set_pte_at() checks this xarray using the old swap pte > >> and spills the tags onto the new page. Apparently something changed in > >> the kernel recently that causes swap_range_free() to be called before > >> set_pte_at(). The arm64 arch_swap_invalidate_page() frees the metadata > >> from the xarray and the subsequent set_pte_at() won't find it. > >> > >> If we have the page, the metadata can be restored before set_pte_at() > >> and I guess that's what Peter is trying to do (again, I haven't looked > >> at the details yet; leaving it for tomorrow). > >> > >> Is there any other way of handling this? E.g. not release the metadata > >> in arch_swap_invalidate_page() but later in set_pte_at() once it was > >> restored. But then we may leak this metadata if there's no set_pte_at(= ) > >> (the process mapping the swap entry died). > > > > Another problem that I can see with this approach is that it does not > > respect reference counts for swap entries, and it's unclear whether tha= t > > can be done in a non-racy fashion. > > > > Another approach that I considered was to move the hook to swap_readpag= e() > > as in the patch below (sorry, it only applies to an older version > > of Android's android14-6.1 branch and not mainline, but you get the > > idea). But during a stress test (running the aforementioned program tha= t > > calls process_madvise(MADV_PAGEOUT) in a loop during an Android "monkey= " > > test) I discovered the following racy use-after-free that can occur whe= n > > two tasks T1 and T2 concurrently restore the same page: > > > > T1: | T2: > > arch_swap_readpage() | > > | arch_swap_readpage() -> mte_restore_tags() -> x= e_load() > > swap_free() | > > | arch_swap_readpage() -> mte_restore_tags() -> m= te_restore_page_tags() > > > > We can avoid it by taking the swap_info_struct::lock spinlock in > > mte_restore_tags(), but it seems like it would lead to lock contention. > > > > Would the idea be to fail swap_readpage() on the one that comes last, > simply retrying to lookup the page? The idea would be that T2's arch_swap_readpage() could potentially not find tags if it ran after swap_free(), so T2 would produce a page without restored tags. But that wouldn't matter, because T1 reaching swap_free() means that T2 will follow the goto at [1] after waiting for T1 to unlock at [2], and T2's page will be discarded. > This might be a naive question, but how does MTE play along with shared > anonymous pages? It should work fine. shmem_writepage() calls swap_writepage() which calls arch_prepare_to_swap() to write the tags. And shmem_swapin_folio() has a call to arch_swap_restore() to restore them. Peter [1] https://github.com/torvalds/linux/blob/f1fcbaa18b28dec10281551dfe6ed3a3= ed80e3d6/mm/memory.c#L3881 [2] https://github.com/torvalds/linux/blob/f1fcbaa18b28dec10281551dfe6ed3a3= ed80e3d6/mm/memory.c#L4006