From: Barry Song <21cnbao@gmail.com>
Date: Sat, 1 Jun 2024 12:48:34 +1200
Subject: Re: [RFC PATCH] mm: swap: reuse exclusive folio directly instead of wp page faults
To: David Hildenbrand
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, chrisl@kernel.org, surenb@google.com, kasong@tencent.com, minchan@kernel.org, willy@infradead.org, ryan.roberts@arm.com, linux-kernel@vger.kernel.org, Barry Song
In-Reply-To: <821d29b9-cb06-49db-9fe8-6c054c8787fb@redhat.com>
References: <20240531104819.140218-1-21cnbao@gmail.com> <87ac9610-5650-451f-aa54-e634a6310af4@redhat.com> <821d29b9-cb06-49db-9fe8-6c054c8787fb@redhat.com>

On Sat, Jun 1, 2024 at 12:35 AM David Hildenbrand wrote:
>
> On 31.05.24 14:30, Barry Song wrote:
> > On Sat, Jun 1, 2024 at 12:20 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Sat, Jun 1, 2024 at 12:10 AM David Hildenbrand wrote:
> >>>
> >>> On 31.05.24 13:55, Barry Song wrote:
> >>>> On Fri, May 31, 2024 at 11:08 PM David Hildenbrand wrote:
> >>>>>
> >>>>> On 31.05.24 12:48, Barry Song wrote:
> >>>>>> From: Barry Song
> >>>>>>
> >>>>>> After swapping out, we perform a swap-in operation. If we first read
> >>>>>> and then write, we encounter a major fault in do_swap_page for reading,
> >>>>>> along with additional minor faults in do_wp_page for writing. However,
> >>>>>> the latter appears to be unnecessary and inefficient. Instead, we can
> >>>>>> directly reuse in do_swap_page and completely eliminate the need for
> >>>>>> do_wp_page.
> >>>>>>
> >>>>>> This patch achieves that optimization specifically for exclusive folios.
> >>>>>> The following microbenchmark demonstrates the significant reduction in
> >>>>>> minor faults.
> >>>>>>
> >>>>>> #define DATA_SIZE (2UL * 1024 * 1024)
> >>>>>> #define PAGE_SIZE (4UL * 1024)
> >>>>>>
> >>>>>> static void *read_write_data(char *addr)
> >>>>>> {
> >>>>>>         char tmp;
> >>>>>>
> >>>>>>         for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
> >>>>>>                 tmp = *(volatile char *)(addr + i);
> >>>>>>                 *(volatile char *)(addr + i) = tmp;
> >>>>>>         }
> >>>>>> }
> >>>>>>
> >>>>>> int main(int argc, char **argv)
> >>>>>> {
> >>>>>>         struct rusage ru;
> >>>>>>
> >>>>>>         char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> >>>>>>                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>>>>>         memset(addr, 0x11, DATA_SIZE);
> >>>>>>
> >>>>>>         do {
> >>>>>>                 long old_ru_minflt, old_ru_majflt;
> >>>>>>                 long new_ru_minflt, new_ru_majflt;
> >>>>>>
> >>>>>>                 madvise(addr, DATA_SIZE, MADV_PAGEOUT);
> >>>>>>
> >>>>>>                 getrusage(RUSAGE_SELF, &ru);
> >>>>>>                 old_ru_minflt = ru.ru_minflt;
> >>>>>>                 old_ru_majflt = ru.ru_majflt;
> >>>>>>
> >>>>>>                 read_write_data(addr);
> >>>>>>                 getrusage(RUSAGE_SELF, &ru);
> >>>>>>                 new_ru_minflt = ru.ru_minflt;
> >>>>>>                 new_ru_majflt = ru.ru_majflt;
> >>>>>>
> >>>>>>                 printf("minor faults:%ld major faults:%ld\n",
> >>>>>>                         new_ru_minflt - old_ru_minflt,
> >>>>>>                         new_ru_majflt - old_ru_majflt);
> >>>>>>         } while(0);
> >>>>>>
> >>>>>>         return 0;
> >>>>>> }
> >>>>>>
> >>>>>> w/o patch,
> >>>>>> / # ~/a.out
> >>>>>> minor faults:512 major faults:512
> >>>>>>
> >>>>>> w/ patch,
> >>>>>> / # ~/a.out
> >>>>>> minor faults:0 major faults:512
> >>>>>>
> >>>>>> Minor faults decrease to 0!
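
For anyone who wants to reproduce the fault counts above, here is a
buildable copy of the changelog's microbenchmark. The includes, the
MAP_FAILED check, the STEP macro (renamed from PAGE_SIZE to avoid
clashing with system headers) and the compile line in the comment are
additions made for this sketch, not part of the original patch; the
access pattern and the getrusage() accounting are unchanged.
MADV_PAGEOUT needs a v5.4+ kernel and configured swap (or zram) for the
pages to actually leave memory.

/* gcc -O2 -o swapin-bench swapin-bench.c */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define DATA_SIZE (2UL * 1024 * 1024)
#define STEP      (4UL * 1024)  /* touch one 4KiB page per iteration */

static void read_write_data(char *addr)
{
        char tmp;

        /* Read each page first (swap-in fault), then write it back. */
        for (unsigned long i = 0; i < DATA_SIZE; i += STEP) {
                tmp = *(volatile char *)(addr + i);
                *(volatile char *)(addr + i) = tmp;
        }
}

int main(void)
{
        struct rusage ru;
        long old_minflt, old_majflt;
        char *addr;

        addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (addr == MAP_FAILED)
                return 1;
        memset(addr, 0x11, DATA_SIZE);

        /* Push the whole range out to swap. */
        madvise(addr, DATA_SIZE, MADV_PAGEOUT);

        getrusage(RUSAGE_SELF, &ru);
        old_minflt = ru.ru_minflt;
        old_majflt = ru.ru_majflt;

        read_write_data(addr);

        getrusage(RUSAGE_SELF, &ru);
        printf("minor faults:%ld major faults:%ld\n",
               ru.ru_minflt - old_minflt, ru.ru_majflt - old_majflt);
        return 0;
}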
> >>>>>>
> >>>>>> Signed-off-by: Barry Song
> >>>>>> ---
> >>>>>>  mm/memory.c | 7 ++++---
> >>>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index eef4e482c0c2..e1d2e339958e 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>            */
> >>>>>>           if (!folio_test_ksm(folio) &&
> >>>>>>               (exclusive || folio_ref_count(folio) == 1)) {
> >>>>>> -                 if (vmf->flags & FAULT_FLAG_WRITE) {
> >>>>>> -                         pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >>>>>> -                         vmf->flags &= ~FAULT_FLAG_WRITE;
> >>>>>> +                 if (vma->vm_flags & VM_WRITE) {
> >>>>>> +                         pte = pte_mkwrite(pte_mkdirty(pte), vma);
> >>>>>> +                         if (vmf->flags & FAULT_FLAG_WRITE)
> >>>>>> +                                 vmf->flags &= ~FAULT_FLAG_WRITE;
> >>>>>
> >>>>> This implies, that even on a read fault, you would mark the pte dirty
> >>>>> and it would have to be written back to swap if still in the swap cache
> >>>>> and only read.
> >>>>>
> >>>>> That is controversial.
> >>>>>
> >>>>> What is less controversial is doing what mprotect() via
> >>>>> change_pte_range()/can_change_pte_writable() would do: mark the PTE
> >>>>> writable but not dirty.
> >>>>>
> >>>>> I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set.
> >>>>
> >>>> Thanks!
> >>>>
> >>>> I assume you mean something as below?
> >>>
> >>> It raises an important point: uffd-wp must be handled accordingly.
> >>>
> >>>>
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index eef4e482c0c2..dbf1ba8ccfd6 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>           add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>>           pte = mk_pte(page, vma->vm_page_prot);
> >>>>
> >>>> +        if (pte_swp_soft_dirty(vmf->orig_pte))
> >>>> +                pte = pte_mksoft_dirty(pte);
> >>>> +        if (pte_swp_uffd_wp(vmf->orig_pte))
> >>>> +                pte = pte_mkuffd_wp(pte);
> >>>>           /*
> >>>>            * Same logic as in do_wp_page(); however, optimize for pages that are
> >>>>            * certainly not shared either because we just allocated them without
> >>>> @@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>            */
> >>>>           if (!folio_test_ksm(folio) &&
> >>>>               (exclusive || folio_ref_count(folio) == 1)) {
> >>>> -                if (vmf->flags & FAULT_FLAG_WRITE) {
> >>>> -                        pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >>>> -                        vmf->flags &= ~FAULT_FLAG_WRITE;
> >>>> +                if (vma->vm_flags & VM_WRITE) {
> >>>> +                        if (vmf->flags & FAULT_FLAG_WRITE) {
> >>>> +                                pte = pte_mkwrite(pte_mkdirty(pte), vma);
> >>>> +                                vmf->flags &= ~FAULT_FLAG_WRITE;
> >>>> +                        } else if ((!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte))
> >>>> +                                   && !userfaultfd_pte_wp(vma, pte)) {
> >>>> +                                pte = pte_mkwrite(pte, vma);
> >>>
> >>> Even with FAULT_FLAG_WRITE we must respect uffd-wp and *not* do a
> >>> pte_mkwrite(pte). So we have to catch and handle that earlier (I could
> >>> have sworn we handle that somehow).
> >>>
> >>> Note that the existing
> >>>     pte = pte_mkuffd_wp(pte);
> >>>
> >>> Will fix that up because it does an implicit pte_wrprotect().
> >>
> >> This is exactly what I have missed as I am struggling with why WRITE_FAULT
> >> blindly does mkwrite without checking userfaultfd_pte_wp().
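
For context on the implicit wrprotect mentioned above: on x86,
pte_mkuffd_wp() is defined along the lines of the sketch below
(simplified from the arch code and possibly different on other
architectures or kernel versions), so propagating the swap pte's uffd-wp
bit also clears write permission again, undoing the earlier
pte_mkwrite()/pte_mkdirty().

/*
 * Simplified sketch of the x86 helper David refers to: marking a PTE
 * uffd-wp also write-protects it, so a pte made writable a few lines
 * earlier ends up read-only once the swap pte's uffd-wp bit is copied
 * over.  Not the literal kernel source; exact definition varies.
 */
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
        return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD_WP));
}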
> >>
> >>>
> >>>
> >>> So maybe what would work is
> >>>
> >>>
> >>> if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
> >>>      !vma_soft_dirty_enabled(vma)) {
> >>>         pte = pte_mkwrite(pte);
> >>>
> >>>         /* Only set the PTE dirty on write fault. */
> >>>         if (vmf->flags & FAULT_FLAG_WRITE) {
> >>>                 pte = pte_mkdirty(pte);
> >>>                 vmf->flags &= ~FAULT_FLAG_WRITE;
> >>>         }
> >
> > WRITE_FAULT has a pte_mkdirty, so it doesn't need to check
> > "!vma_soft_dirty_enabled(vma)"?
> > Maybe I thought too much, just the simple code below should work?
>
> That would likely not handle softdirty correctly in case we end up in
> pte_mkwrite(pte, vma); note that pte_mksoft_dirty() will not wrprotect ...

If soft-dirty has already been set, we shouldn't wrprotect? Until the
soft-dirty bit is cleared, we don't rely on a further write fault to set
it, right? So we should still do pte_mkwrite() if pte_soft_dirty(pte) is
true? Something like:

if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
     (!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte)))

> > (note that we shouldn't optimize for softdirty handling)
> >
> >
> >           if (!folio_test_ksm(folio) &&
> >               (exclusive || folio_ref_count(folio) == 1)) {
> >                   if (vma->vm_flags & VM_WRITE) {
> >                           if (vmf->flags & FAULT_FLAG_WRITE) {
> >                                   pte = pte_mkwrite(pte_mkdirty(pte), vma);
> >                                   vmf->flags &= ~FAULT_FLAG_WRITE;
> >                           } else {
> >                                   pte = pte_mkwrite(pte, vma);
> >                           }
> >                   }
> >                   rmap_flags |= RMAP_EXCLUSIVE;
> >           }
> >
> >           if (pte_swp_soft_dirty(vmf->orig_pte))
> >                   pte = pte_mksoft_dirty(pte);
> >           if (pte_swp_uffd_wp(vmf->orig_pte))
> >                   pte = pte_mkuffd_wp(pte);
> >
>
> This still uses the implicit wrprotect of pte_mkuffd_wp.
>
> But the wrprotected->writable->wrprotected path really is confusing. I'd
> prefer to set these bits ahead of time instead, so we can properly rely
> on them -- like we do in other code.

I agree. We are setting the write bit and then reverting it, which is
confusing. So, in conclusion, we end up with the below?

        if (pte_swp_soft_dirty(vmf->orig_pte))
                pte = pte_mksoft_dirty(pte);
        if (pte_swp_uffd_wp(vmf->orig_pte))
                pte = pte_mkuffd_wp(pte);

        /*
         * Same logic as in do_wp_page(); however, optimize for pages that are
         * certainly not shared either because we just allocated them without
         * exposing them to the swapcache or because the swap entry indicates
         * exclusivity.
         */
        if (!folio_test_ksm(folio) &&
            (exclusive || folio_ref_count(folio) == 1)) {
                if (vma->vm_flags & VM_WRITE && !userfaultfd_pte_wp(vma, pte) &&
                    (!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte))) {
                        if (vmf->flags & FAULT_FLAG_WRITE) {
                                pte = pte_mkwrite(pte_mkdirty(pte), vma);
                                vmf->flags &= ~FAULT_FLAG_WRITE;
                        } else {
                                pte = pte_mkwrite(pte, vma);
                        }
                }
                rmap_flags |= RMAP_EXCLUSIVE;
        }

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
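
As a cross-reference for the gate used in the snippet above: the
conditions mirror what can_change_pte_writable() in mm/mprotect.c checks
before mprotect() maps a private anonymous PTE writable, which is the
precedent David pointed to earlier in the thread. The helper below is a
paraphrased, hypothetical condensation of those checks for the swap-in
case; the name swapin_can_map_writable() and its signature are made up
for illustration and are not kernel code.

/*
 * Paraphrase of the checks discussed in this thread; "exclusive" stands
 * for the !folio_test_ksm() && (exclusive || refcount == 1) test that
 * do_swap_page() has already performed.
 */
static bool swapin_can_map_writable(struct vm_area_struct *vma,
                                    pte_t pte, bool exclusive)
{
        /* The folio must be exclusively ours (no KSM, no other refs). */
        if (!exclusive)
                return false;

        /* The mapping itself must allow writes. */
        if (!(vma->vm_flags & VM_WRITE))
                return false;

        /* uffd-wp wants to see a write fault before allowing writes. */
        if (userfaultfd_pte_wp(vma, pte))
                return false;

        /*
         * Soft-dirty tracking relies on a write fault unless the
         * soft-dirty bit is already set on this pte.
         */
        if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
                return false;

        return true;
}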