From: Barry Song <21cnbao@gmail.com>
Date: Fri, 31 May 2024 23:55:51 +1200
Subject: Re: [RFC PATCH] mm: swap: reuse exclusive folio directly instead of wp page faults
To: David Hildenbrand
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, chrisl@kernel.org,
	surenb@google.com, kasong@tencent.com, minchan@kernel.org,
	willy@infradead.org, ryan.roberts@arm.com, linux-kernel@vger.kernel.org,
	Barry Song
In-Reply-To: <87ac9610-5650-451f-aa54-e634a6310af4@redhat.com>
References: <20240531104819.140218-1-21cnbao@gmail.com> <87ac9610-5650-451f-aa54-e634a6310af4@redhat.com>
On Fri, May 31, 2024 at 11:08 PM David Hildenbrand wrote:
>
> On 31.05.24 12:48, Barry Song wrote:
> > From: Barry Song
> >
> > After swapping out, we perform a swap-in operation. If we first read
> > and then write, we encounter a major fault in do_swap_page for reading,
> > along with additional minor faults in do_wp_page for writing. However,
> > the latter appears to be unnecessary and inefficient. Instead, we can
> > reuse the folio directly in do_swap_page and completely eliminate the
> > need for do_wp_page.
> >
> > This patch achieves that optimization specifically for exclusive folios.
> > The following microbenchmark demonstrates the significant reduction in
> > minor faults.
> >
> >  #define DATA_SIZE (2UL * 1024 * 1024)
> >  #define PAGE_SIZE (4UL * 1024)
> >
> >  static void read_write_data(char *addr)
> >  {
> >      char tmp;
> >
> >      for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
> >          tmp = *(volatile char *)(addr + i);
> >          *(volatile char *)(addr + i) = tmp;
> >      }
> >  }
> >
> >  int main(int argc, char **argv)
> >  {
> >      struct rusage ru;
> >
> >      char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> >              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >      memset(addr, 0x11, DATA_SIZE);
> >
> >      do {
> >          long old_ru_minflt, old_ru_majflt;
> >          long new_ru_minflt, new_ru_majflt;
> >
> >          madvise(addr, DATA_SIZE, MADV_PAGEOUT);
> >
> >          getrusage(RUSAGE_SELF, &ru);
> >          old_ru_minflt = ru.ru_minflt;
> >          old_ru_majflt = ru.ru_majflt;
> >
> >          read_write_data(addr);
> >          getrusage(RUSAGE_SELF, &ru);
> >          new_ru_minflt = ru.ru_minflt;
> >          new_ru_majflt = ru.ru_majflt;
> >
> >          printf("minor faults:%ld major faults:%ld\n",
> >                 new_ru_minflt - old_ru_minflt,
> >                 new_ru_majflt - old_ru_majflt);
> >      } while(0);
> >
> >      return 0;
> >  }
> >
> > w/o patch,
> > / # ~/a.out
> > minor faults:512 major faults:512
> >
> > w/ patch,
> > / # ~/a.out
> > minor faults:0 major faults:512
> >
> > Minor faults decrease to 0!
> >
> > Signed-off-by: Barry Song
> > ---
> >   mm/memory.c | 7 ++++---
> >   1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index eef4e482c0c2..e1d2e339958e 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	 */
> >  	if (!folio_test_ksm(folio) &&
> >  	    (exclusive || folio_ref_count(folio) == 1)) {
> > -		if (vmf->flags & FAULT_FLAG_WRITE) {
> > -			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > -			vmf->flags &= ~FAULT_FLAG_WRITE;
> > +		if (vma->vm_flags & VM_WRITE) {
> > +			pte = pte_mkwrite(pte_mkdirty(pte), vma);
> > +			if (vmf->flags & FAULT_FLAG_WRITE)
> > +				vmf->flags &= ~FAULT_FLAG_WRITE;
>
> This implies that even on a read fault, you would mark the pte dirty
> and it would have to be written back to swap if still in the swap cache
> and only read.
>
> That is controversial.
>
> What is less controversial is doing what mprotect() via
> change_pte_range()/can_change_pte_writable() would do: mark the PTE
> writable but not dirty.
>
> I suggest setting the pte dirty only if FAULT_FLAG_WRITE is set.

Thanks! I assume you mean something like the below?

diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..dbf1ba8ccfd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
 	pte = mk_pte(page, vma->vm_page_prot);
+	if (pte_swp_soft_dirty(vmf->orig_pte))
+		pte = pte_mksoft_dirty(pte);
+	if (pte_swp_uffd_wp(vmf->orig_pte))
+		pte = pte_mkuffd_wp(pte);
 	/*
 	 * Same logic as in do_wp_page(); however, optimize for pages that are
 	 * certainly not shared either because we just allocated them without
@@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	if (!folio_test_ksm(folio) &&
 	    (exclusive || folio_ref_count(folio) == 1)) {
-		if (vmf->flags & FAULT_FLAG_WRITE) {
-			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
-			vmf->flags &= ~FAULT_FLAG_WRITE;
+		if (vma->vm_flags & VM_WRITE) {
+			if (vmf->flags & FAULT_FLAG_WRITE) {
+				pte = pte_mkwrite(pte_mkdirty(pte), vma);
+				vmf->flags &= ~FAULT_FLAG_WRITE;
+			} else if ((!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte))
+				   && !userfaultfd_pte_wp(vma, pte)) {
+				pte = pte_mkwrite(pte, vma);
+			}
 		}
 		rmap_flags |= RMAP_EXCLUSIVE;
 	}
 	folio_ref_add(folio, nr_pages - 1);
 	flush_icache_pages(vma, page, nr_pages);
-	if (pte_swp_soft_dirty(vmf->orig_pte))
-		pte = pte_mksoft_dirty(pte);
-	if (pte_swp_uffd_wp(vmf->orig_pte))
-		pte = pte_mkuffd_wp(pte);
 	vmf->orig_pte = pte_advance_pfn(pte, page_idx);

 	/* ksm created a completely new copy */

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
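
For reference, below is a self-contained build of the microbenchmark quoted in
the commit message. The includes, the _GNU_SOURCE define, the error checks, and
the dropped do { } while (0) wrapper are additions for standalone use; the
measurement logic is unchanged. MADV_PAGEOUT needs Linux 5.4 or later, and the
region must actually be reclaimed to swap for the 512 major faults
(2 MiB / 4 KiB pages) to show up.

	/*
	 * Standalone version of the microbenchmark from the commit message.
	 * Only the scaffolding (includes, error checks) is new.
	 */
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/resource.h>

	#define DATA_SIZE (2UL * 1024 * 1024)
	#define PAGE_SIZE (4UL * 1024)

	static void read_write_data(char *addr)
	{
		char tmp;

		for (unsigned long i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
			/* The read faults the page back in: one major fault. */
			tmp = *(volatile char *)(addr + i);
			/* Without the patch, the write then takes a wp minor fault. */
			*(volatile char *)(addr + i) = tmp;
		}
	}

	int main(void)
	{
		struct rusage ru;
		long old_minflt, old_majflt;
		char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(addr, 0x11, DATA_SIZE);	/* populate all 512 pages */

		/* Ask the kernel to reclaim the range, pushing it out to swap. */
		if (madvise(addr, DATA_SIZE, MADV_PAGEOUT))
			perror("madvise(MADV_PAGEOUT)");

		getrusage(RUSAGE_SELF, &ru);
		old_minflt = ru.ru_minflt;
		old_majflt = ru.ru_majflt;

		read_write_data(addr);

		getrusage(RUSAGE_SELF, &ru);
		printf("minor faults:%ld major faults:%ld\n",
		       ru.ru_minflt - old_minflt, ru.ru_majflt - old_majflt);
		return 0;
	}

Built with something like "gcc -O2 swapin-bench.c -o swapin-bench" (the file
name is arbitrary) and run on a swap-enabled kernel, it should reproduce the
"minor faults:512" versus "minor faults:0" pattern shown above without and
with the patch.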