From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 64288D62060 for ; Tue, 19 Nov 2024 10:03:49 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DD4E76B009B; Tue, 19 Nov 2024 05:03:48 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id D85656B009C; Tue, 19 Nov 2024 05:03:48 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C25C76B009D; Tue, 19 Nov 2024 05:03:48 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 9DFFA6B009B for ; Tue, 19 Nov 2024 05:03:48 -0500 (EST) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 450BE1602B8 for ; Tue, 19 Nov 2024 10:03:48 +0000 (UTC) X-FDA: 82802406666.22.161C924 Received: from mail-pl1-f170.google.com (mail-pl1-f170.google.com [209.85.214.170]) by imf04.hostedemail.com (Postfix) with ESMTP id 0734C40015 for ; Tue, 19 Nov 2024 10:02:41 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=PbEDotfF; dmarc=pass (policy=quarantine) header.from=bytedance.com; spf=pass (imf04.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.214.170 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1732010491; a=rsa-sha256; cv=none; b=c/sUiS2jKQbMYIX/zgiAApo0OhM0OPqK0Nb3zk/jz+hM66jrMhr3An9pSfQ8Xr8kxWAK8+ 1C8g8T3WlgtMPtFt0OIPxwpffyrx7V7ZTokLKuchEjrIwt5J6b05g9YJs1jTfHwtsvqUrO 5LTyLkkpWVEfGB9M/Qi869lHLVucfz4= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=PbEDotfF; dmarc=pass (policy=quarantine) header.from=bytedance.com; spf=pass (imf04.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.214.170 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1732010491; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Q/gckH2swE3iy2J2B+UCv0TBAurGqUwksNzT/L3/BVc=; b=dViJjtDtKLLWJnwUGCP9Uue6UhuUvavHJOsI3GczFzrYpbO2tKJjcfI++Yamy9Yqyumi9i fvS39V17jwtpWIlXHokDxZkACI7KCkRlyI4Ws8oHCfQBP1ohIUglR/UG9/OzVaKWn7aXPF c4F0DyNBVx/tap5Bm+R08pqEBy4BMXk= Received: by mail-pl1-f170.google.com with SMTP id d9443c01a7336-20c9978a221so6618055ad.1 for ; Tue, 19 Nov 2024 02:03:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1732010624; x=1732615424; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=Q/gckH2swE3iy2J2B+UCv0TBAurGqUwksNzT/L3/BVc=; b=PbEDotfFdONnT/AP8v+QIx8rZsSK00U4pKdgDmqObxc7uSSxZbcYqJIzJn/IkgJ9b5 jOMpDAa85yefjMBCJD4z6eud6j563wWvNr7GDN0vG+PGXFbdpwlMYo4DWVz49MTLjE0A hFFCYmtpTkuzYuSy1TnzmC2nI7MV6/jh+2zQjECIDwEq351goNRYyGKV674IVIwrUaFl 7bAhFqVZ/fmj/Z/9KN1I+oggS9a9m5BoaWBuYlpr1BB/5E0TnUjlh6nQkbkztDFVBOhX r+o1XbIX43I6C0W40bp215557uxuhdRZ6OG/SZeBr6ecd5/ILuntGbHq0/2nbDm0Cbh/ H8Ow== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1732010624; x=1732615424; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Q/gckH2swE3iy2J2B+UCv0TBAurGqUwksNzT/L3/BVc=; b=Qs6KJoF6SISxVQw13eXN62O1nL9jGEjurVi6rmq3oOtpw/K5vu8HLuIjB5LmygOeEe yCr7LhibusbJCGcSTdOk6US4AZZ5lXKhO89yUsN3qG4Tc98DRyMVerVegGYEl43AMGJE BEu4wIGrWMH51H7XuHsRu/7/H/kE/W1sP3G67F07lEP9vCGMw2WR+f9re+ZtxJwYQcsl l3peG+334e9u8QUp8mtjtkbeFcTp5Ymcj6MA/OquDPF4cekCB5HcZb59kGHM1D4U9f/2 ktdK20CJihk7+olcrvTpLCcJn1G7ICkElUrRDj22k8u7E90ezm6kkC7fjlEFV2c52oj5 Ovaw== X-Forwarded-Encrypted: i=1; AJvYcCUYDmSJAbtJZc3UJkgHi3b7eeqmnbakjsIID1fnWLV6blGDQXM0zBE8TOoqtQfXa6l5+prrZsV64g==@kvack.org X-Gm-Message-State: AOJu0YxPgcMESsIcDJYnO5eJu4lFAnLn9qdKAqB4AwircjRR7jOaYuqQ Xps9QYRETK4EE7xbXWOvlIBGCC5gWpp3rXZIbBR1wH4FsKGSiMB/xD01vSkL7Sc= X-Google-Smtp-Source: AGHT+IFw7NMT+8aWA+giCJdZ35/mDE+PF7iPZm76phWNwrVsNKK9E2t5REXsrZ76O6Weqb1xVqV5sw== X-Received: by 2002:a17:903:183:b0:20c:5b80:930f with SMTP id d9443c01a7336-211d0d4bda3mr207147225ad.12.1732010623811; Tue, 19 Nov 2024 02:03:43 -0800 (PST) Received: from [10.84.149.95] ([203.208.167.151]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2120e4373c5sm40862595ad.57.2024.11.19.02.03.37 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 19 Nov 2024 02:03:43 -0800 (PST) Message-ID: <99d3229a-17ce-4a41-ae11-ebfab138cf76@bytedance.com> Date: Tue, 19 Nov 2024 18:03:34 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v3 4/9] mm: introduce skip_none_ptes() Content-Language: en-US To: David Hildenbrand Cc: jannh@google.com, hughd@google.com, willy@infradead.org, muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org, peterx@redhat.com, mgorman@suse.de, catalin.marinas@arm.com, will@kernel.org, dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org, x86@kernel.org, lorenzo.stoakes@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, zokeefe@google.com, rientjes@google.com References: <617a063e-bd84-4da5-acf4-6ff516512055@bytedance.com> <253e5fd0-7e98-43fd-b0d7-8a5b739ae4aa@bytedance.com> <77b1eddf-7c1b-43e9-9352-229998ce3fc7@redhat.com> <5a3428bd-743a-4d51-8b75-163ab560bca7@bytedance.com> <4edccc1a-2761-4a5a-89a6-7869c1b6b08a@redhat.com> <2b48d313-4f66-47c8-98d8-8aa78db62b1b@bytedance.com> <995804f4-b658-44b2-bb40-c84b8a322616@redhat.com> <4ee60b7b-a81e-4b94-96c9-52b1bd9d5061@redhat.com> <332cbacb-cad3-4522-a74b-b5ad5efee4af@redhat.com> From: Qi Zheng In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 0734C40015 X-Stat-Signature: 81xakamfhn3crtw5s4uc7bgwt3ofayjs X-Rspam-User: X-Rspamd-Server: rspam05 X-HE-Tag: 1732010561-691732 X-HE-Meta: U2FsdGVkX19EKnifk0+PUed7Gw+uAVp/C4ItyPbuhMi8Zc+xtCUQzYmEPYhvNCskRUr/y2wVb1+M/MNQCer2olL5mDxXetk4/uzIgrElSTkkJCRbBVAlJo7hm8aSc9DQvq4cwUNStD03LS4MhfJS3dvKCxUxgPmWSmjkT9oEUeiDXJCWW4EMCOXzBTHtwq27+syRnslN4AiSfz1Oscneeu5B1MHLmVoUU+OWHJztXw1ek/3HIxI7lozHeRR2PihRirPt82qL4SAfEGklRN02yOdEGa62DECx2a4+8b8ksiwU7T7zLVs+U9z14XQWQ3plqozv1GhuNAgAG6R0HIY9scc6OS22gvwjEi+POoRI97u056Cz/S18fwamDecB4bZYTmlxwdZAXoflFU9iZ5ln+2jrBn6J0igDqWg+LKxfWSDyLtsElUYi+utgZ6YRwmfjXljTJeRuPiUt/sDP5AKGyoXRFkvK0zwDlo45XpHXaJLq+cvkaAVep1ZCZ+o+4AolUYoUVc/+d9bV8e25xezn6PFecPdHcuc3PliRdB4HvE9llw2/60bpKFvDWQs+PZbrN1GP4BtQnK/4ygkRr5i8uP2rXuJ/R/ekw2BN/7itWeVM4eHP0czk0xnKstK6WEXzeYUkDHM6qlmPop2LeftGIfbWj4En/3RGHOB6WPVEjsw7d+N4BAxcbJmP/kzOl7xcyFiSZZoG0jFWDfo0fSVNT7pSzNWZjSxjAt45Q8DK03SJSIIjXc8DLEiRyHh1MwqcAGDYV+w5V4TnyNu4BTLwfXiPe9nwNesa1O8VRDjxSjNioRv9fHej0gyoG4CfwVUTBiXNm7kzq6fYvUDior8i3EIcaQ95Xt3Z7x1C2Jt9PQRAzZjVPnyN1DzIirSqalvaXsHWOqzroV4zyJqjfgiUOf/PhMyp9O0uzdNf849G3BuD7gwVb3iyq0R8AxYGP61EXeXH2I2cneSwqehrNU+ +WigmbPV 8hVO9LH7kUMZdTtd9KGsNNkaNaUPWHOfpGAdX50Yol9r+9Cb9lp0Hp5kal94KfPP8SzC9fshutqByrGs/ohr4+co1BmQzY6ZbHiM3BfJ/Uaw2miTt8krxPwYTf7fMS8iQH/A2u9WL1CPSvGJhZ4SREqhuwDRgDVKxItSF4Y3A8oxiKSzdZT7fscHgDTNouHvTql5RmVxQpfD6pfpIhWlKT/JnY/4MSoTXAS74L0SFHxe9uMJhNtlcfbL8IU3UBhOZrqXn6C4usJGzKBUpEuAMkBdXK+ipuoQvBmYxmG8ywzk/yEu/CDviRieZtpibD9bGYIektCovXJwR0m3+HWdoDPn9xA0E6s8r6KGxExrydw1vnnyzxq9COheQJhzwScvDewCgd4zMrTE0//BW1B7TDpsdJSjTf5PpKtiT X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024/11/19 17:55, David Hildenbrand wrote: > On 18.11.24 12:13, Qi Zheng wrote: >> >> >> On 2024/11/18 18:59, David Hildenbrand wrote: >>> On 18.11.24 11:56, Qi Zheng wrote: >>>> >>>> >>>> On 2024/11/18 18:41, David Hildenbrand wrote: >>>>> On 18.11.24 11:34, Qi Zheng wrote: >>>>>> >>>>>> >>>>>> On 2024/11/18 17:29, David Hildenbrand wrote: >>>>>>> On 18.11.24 04:35, Qi Zheng wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2024/11/15 22:59, David Hildenbrand wrote: >>>>>>>>> On 15.11.24 15:41, Qi Zheng wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote: >>>>>>>>>>>>>> *nr_skip = nr; >>>>>>>>>>>>>> >>>>>>>>>>>>>> and then: >>>>>>>>>>>>>> >>>>>>>>>>>>>> zap_pte_range >>>>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, >>>>>>>>>>>>>> &skip_nr, >>>>>>>>>>>>>>                                rss, &force_flush, >>>>>>>>>>>>>> &force_break); >>>>>>>>>>>>>>             if (can_reclaim_pt) { >>>>>>>>>>>>>>                 none_nr += count_pte_none(pte, nr); >>>>>>>>>>>>>>                 none_nr += nr_skip; >>>>>>>>>>>>>>             } >>>>>>>>>>>>>> >>>>>>>>>>>>>> Right? >>>>>>>>>>>>> >>>>>>>>>>>>> Yes. I did not look closely at the patch that adds the >>>>>>>>>>>>> counting of >>>>>>>>>>>> >>>>>>>>>>>> Got it. >>>>>>>>>>>> >>>>>>>>>>>>> pte_none though (to digest why it is required :) ). >>>>>>>>>>>> >>>>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect >>>>>>>>>>>> empty PTE page. >>>>>>>>>>> >>>>>>>>>>> Okay, so the problem is that "nr" would be "all processed >>>>>>>>>>> entries" but >>>>>>>>>>> there are cases where we "process an entry but not zap it". >>>>>>>>>>> >>>>>>>>>>> What you really only want to know is "was any entry not zapped", >>>>>>>>>>> which >>>>>>>>>>> could be a simple input boolean variable passed into >>>>>>>>>>> do_zap_pte_range? >>>>>>>>>>> >>>>>>>>>>> Because as soon as any entry was processed but  no zapped, >>>>>>>>>>> you can >>>>>>>>>>> immediately give up on reclaiming that table. >>>>>>>>>> >>>>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() >>>>>>>>>> entry is >>>>>>>>>> found in count_pte_none(). >>>>>>>>> >>>>>>>>> I'm not sure if well need cont_pte_none(), but I'll have to take a >>>>>>>>> look >>>>>>>>> at your new patch to see how this fits together with doing the >>>>>>>>> pte_none >>>>>>>>> detection+skipping in do_zap_pte_range(). >>>>>>>>> >>>>>>>>> I was wondering if you cannot simply avoid the additional >>>>>>>>> scanning and >>>>>>>>> simply set "can_reclaim_pt" if you skip a zap. >>>>>>>> >>>>>>>> Maybe we can return the information whether the zap was skipped >>>>>>>> from >>>>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters >>>>>>>> like I >>>>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7]. >>>>>>>> >>>>>>>> In theory, we can detect empty PTE pages in the following two ways: >>>>>>>> >>>>>>>> 1) If no zap is skipped, it means that all pte entries have been >>>>>>>>         zap, and the PTE page must be empty. >>>>>>>> 2) If all pte entries are detected to be none, then the PTE page is >>>>>>>>         empty. >>>>>>>> >>>>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed >>>>>>>> (which is unacceptable), while the 2) will at most cause empty PTE >>>>>>>> pages >>>>>>>> to not be reclaimed. >>>>>>>> >>>>>>>> So the most reliable and efficient method may be: >>>>>>>> >>>>>>>> a. If there is a zap that is skipped, stop scanning and do not >>>>>>>> reclaim >>>>>>>>         the PTE page; >>>>>>>> b. Otherwise, as now, detect the empty PTE page through >>>>>>>> count_pte_none() >>>>>>> >>>>>>> Is there a need for count_pte_none() that I am missing? >>>>>> >>>>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none >>>>>> ptes. >>>>>> >>>>>>> >>>>>>> Assume we have >>>>>>> >>>>>>> nr = do_zap_pte_range(&any_skipped) >>>>>>> >>>>>>> >>>>>>> If "nr" is the number of processed entries (including >>>>>>> pte_none()), and >>>>>>> "any_skipped" is set whenever we skipped to zap a !pte_none >>>>>>> entry, we >>>>>>> can detect what we need, no? >>>>>>> >>>>>>> If any_skipped == false after the call, we now have "nr" pte_none() >>>>>>> entries. -> We can continue trying to reclaim >>>>>> >>>>>> I prefer that "nr" should not include pte_none(). >>>>>> >>>>> >>>>> Why? do_zap_pte_range() should tell you how far to advance, nothing >>>>> less, nothing more. >>>>> >>>>> Let's just keep it simple and avoid count_pte_none(). >>>>> >>>>> I'm probably missing something important? >>>> >>>> As we discussed before, we should skip all consecutive none ptes, >>>   > pte and addr are already incremented before returning. >>> >>> It's probably best to send the resulting patch so I can either >>> understand why count_pte_none() is required or comment on how to get rid >>> of it. >> >> Something like this: >> >> diff --git a/mm/memory.c b/mm/memory.c >> index bd9ebe0f4471f..e9bec3cd49d44 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct >> mmu_gather *tlb, >>           return nr; >>    } >> >> +static inline int do_zap_pte_range(struct mmu_gather *tlb, >> +                                  struct vm_area_struct *vma, pte_t >> *pte, >> +                                  unsigned long addr, unsigned long end, >> +                                  struct zap_details *details, int *rss, >> +                                  bool *force_flush, bool *force_break, >> +                                  bool *any_skipped) >> +{ >> +       pte_t ptent = ptep_get(pte); >> +       int max_nr = (end - addr) / PAGE_SIZE; > > I'd do here: > >       int nr = 0; > >> + >> +       /* Skip all consecutive pte_none(). */ >> +       if (pte_none(ptent)) { >> + > > instead of the "int nr" here > >> +               for (nr = 1; nr < max_nr; nr++) { >> +                       ptent = ptep_get(pte + nr); >> +                       if (!pte_none(ptent)) >> +                               break; >> +               } >> +               max_nr -= nr; >> +               if (!max_nr) >> +                       return 0; >> +               pte += nr; >> +               addr += nr * PAGE_SIZE; >> +       } >> + >> +       if (pte_present(ptent)) >> +               return zap_present_ptes(tlb, vma, pte, ptent, max_nr, >> +                                       addr, details, rss, force_flush, >> +                                       force_break, any_skipped); > > and here: > > if (pte_present(ptent)) >     nr += zap_present_ptes(...) > else >     nr += zap_nonpresent_ptes(); > return nr; > > So you really return the number of processed items -- how much the > caller should advance. Got it, please ignore my previous stupid considerations. ;) > >> + >> +       return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr, >> +                                  details, rss, any_skipped); >> +} >> + >> +static inline int count_pte_none(pte_t *pte, int nr) >> +{ >> +       int none_nr = 0; >> + >> +       /* >> +        * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be >> +        * re-installed, so we need to check pte_none() one by one. >> +        * Otherwise, checking a single PTE in a batch is sufficient. >> +        */ >> +#ifdef CONFIG_PTE_MARKER_UFFD_WP >> +       for (;;) { >> +               if (pte_none(ptep_get(pte))) >> +                       none_nr++; >> +               if (--nr == 0) >> +                       break; >> +               pte++; >> +       } >> +#else >> +       if (pte_none(ptep_get(pte))) >> +               none_nr = nr; >> +#endif >> +       return none_nr; >> +} >> + >> + >>    static unsigned long zap_pte_range(struct mmu_gather *tlb, >>                                   struct vm_area_struct *vma, pmd_t *pmd, >>                                   unsigned long addr, unsigned long end, >> @@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct >> mmu_gather *tlb, >>           int rss[NR_MM_COUNTERS]; >>           spinlock_t *ptl; >>           pte_t *start_pte; >> +       bool can_reclaim_pt; >>           pte_t *pte; >>           int nr; >> >> @@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct >> mmu_gather *tlb, >>           flush_tlb_batched_pending(mm); >>           arch_enter_lazy_mmu_mode(); >>           do { >> -               pte_t ptent = ptep_get(pte); >> -               int max_nr; >> - >> -               nr = 1; >> -               if (pte_none(ptent)) >> -                       continue; >> +               bool any_skipped; >> >>                   if (need_resched()) >>                           break; >> >> -               max_nr = (end - addr) / PAGE_SIZE; >> -               if (pte_present(ptent)) { >> -                       nr = zap_present_ptes(tlb, vma, pte, ptent, >> max_nr, >> -                                             addr, details, rss, >> &force_flush, >> -                                             &force_break); >> -                       if (unlikely(force_break)) { >> -                               addr += nr * PAGE_SIZE; >> -                               break; >> -                       } >> -               } else { >> -                       nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, >> max_nr, >> -                                                addr, details, rss); >> +               nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, >> +                                     rss, &force_flush, &force_break, >> +                                     &any_skipped); >> +               if (can_reclaim_pt) { >> +                       VM_BUG_ON(!any_skipped && count_pte_none(pte, >> nr) == nr); > > Okay, so this is really only for debugging then? So we can safely drop > that for now. > > I assume it would make sense to check when reclaiming a page table with > CONFIG_DEBUG_VM, that the table is actually all-pte_none instead? Ah, make sense. Will change to it in the next version. > > (as a side note: no VM_BUG_ON, please :) ) Got it. Thanks! >