From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DBB0AD4920C for ; Mon, 18 Nov 2024 11:14:08 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 62E5A6B00B3; Mon, 18 Nov 2024 06:14:08 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 5B3106B00B4; Mon, 18 Nov 2024 06:14:08 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2A2CE6B00B5; Mon, 18 Nov 2024 06:14:08 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id F39296B00B3 for ; Mon, 18 Nov 2024 06:14:07 -0500 (EST) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 9054AA0153 for ; Mon, 18 Nov 2024 11:14:07 +0000 (UTC) X-FDA: 82798954014.01.BECD377 Received: from mail-pg1-f177.google.com (mail-pg1-f177.google.com [209.85.215.177]) by imf22.hostedemail.com (Postfix) with ESMTP id 4CA94C0006 for ; Mon, 18 Nov 2024 11:13:03 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=CIMwnevH; dmarc=pass (policy=quarantine) header.from=bytedance.com; spf=pass (imf22.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.215.177 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1731928263; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=ORrVs+s0V5otpD68MhZ7rAWmPvfENIKBy1EC12UuyQs=; b=ygOiuoSTAFHztOD5/LFg07ZgPqltH1viXa6W+6l8N4g85+oWDQagHyRkmaQTQhkboYKSXS 95/XsugHzI8MQfQrk8rqTqc4pukAxoiMHJBrGTDi06Ej9lMykgROtP7zpYA8Kn+3VCycc5 xOfhgvKgD9Ip+XSRnbMGpGFgDHw5Mdg= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=CIMwnevH; dmarc=pass (policy=quarantine) header.from=bytedance.com; spf=pass (imf22.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.215.177 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1731928263; a=rsa-sha256; cv=none; b=vIj8jLapc+SyuFUEUtN53bXza7TodxQg64Z1OutyE001dxCJEq/LVjfEUZx/TLOOi3iswW 1YCTM4eeOCFInFh+IKlyVgR6AwzkqbqHbWQgaGIpuhZrab+fPd7OQxSdGolp7YXxVpcchx Aj0boyMdTVdsWcaZ2kykqhgUJPAkAQE= Received: by mail-pg1-f177.google.com with SMTP id 41be03b00d2f7-7ee020ec76dso3141833a12.3 for ; Mon, 18 Nov 2024 03:14:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1731928443; x=1732533243; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=ORrVs+s0V5otpD68MhZ7rAWmPvfENIKBy1EC12UuyQs=; b=CIMwnevHgVMoaYqgbU9Z8ROn2AS2rVMvtuBNGm1o3SFHWLGp+XzrDR0wa0/9U6x8Xa IL0LTv0noK+hfsYkemv1foLr/2dWvNVXP79C+0GxQMXEABr6xPtoa1Fcy2eAeWKB+tr5 fg24szDe97ex/MzythuKsbnw7tJxjWsI2gwvc3/Lu9zJTjZaZTVc2kw6mXpmjRu0PKF5 IVY8J5KYI9BW+Miu4Ccnnfz0pLpbWd9WZ8bua0oTpIdMfJmTbaevm07cS5oSgqo8sjrI EwZRkXt5meGg5THLWtdVUshvPQ+vOMaBlZ37R3V4sAEd+dqwJSav0sfdxoccscqwi/00 WL9g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731928443; x=1732533243; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ORrVs+s0V5otpD68MhZ7rAWmPvfENIKBy1EC12UuyQs=; b=fWOzNwNhWeQV7Ss5n/iQ7OpceCAGlpdvtQ9ScI2DQD+85BTecD2QBYFagTM/RRT6wi KqLCu21CauzaTTKDXNDRsoNGZ7mAYKjg4ujv1KV3JjM8k7gZrr2ShDVGIAAFKw+T1s1L VrlSixbAYFPa0AGZv0tzpNiBWCXgq9PDFuMpts//yEe0VnUbt1gv4ewwBQ6YjQ/WDjjY cRM4xzoOOnNO1+Y94UMPq8HODV3AMkJBAwlM5CWa/54AP724juAgyVXVOwKGIqI1d1EY qQn+Qd/3r1rx1WrAs7/BciMcZVD5qjpXjBbC3cpHFGYM0adT6iHSIPaZ7D3oOqo+TBVE ft7g== X-Forwarded-Encrypted: i=1; AJvYcCWwKhbnuhLbHwYoLT8OGjk2k7bT7kzmx+vqKMEoO3G6rtnvmNXmh+HimmfuAYaqbzXzyLG/cP/hLg==@kvack.org X-Gm-Message-State: AOJu0YxjZGFKcCuzcQYkxdqKIFcJ+t3qwbNOe8BW2+Rsr025Q4N+4v3w KVBd2PFpQp6x1CVG+S8DsQDp9zqB8YNQaRsynLkCYXjlOkwVM8KcTvdFrN1L1S0= X-Google-Smtp-Source: AGHT+IEH6pemOWN/YAryx+DlY/pBV3dgpcgNKO/QEpdyCBZYCTXDYKz3IjnBXx8ovLRGEl/XltUULA== X-Received: by 2002:a05:6a20:394d:b0:1d9:c78f:4207 with SMTP id adf61e73a8af0-1dc90b1c34amr16642798637.11.1731928442746; Mon, 18 Nov 2024 03:14:02 -0800 (PST) Received: from [10.84.149.95] ([203.208.167.151]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-724771394ffsm5899231b3a.90.2024.11.18.03.13.56 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 18 Nov 2024 03:14:02 -0800 (PST) Message-ID: Date: Mon, 18 Nov 2024 19:13:53 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v3 4/9] mm: introduce skip_none_ptes() Content-Language: en-US To: David Hildenbrand Cc: jannh@google.com, hughd@google.com, willy@infradead.org, muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org, peterx@redhat.com, mgorman@suse.de, catalin.marinas@arm.com, will@kernel.org, dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org, x86@kernel.org, lorenzo.stoakes@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, zokeefe@google.com, rientjes@google.com References: <574bc9b646c87d878a5048edb63698a1f8483e10.1731566457.git.zhengqi.arch@bytedance.com> <617a063e-bd84-4da5-acf4-6ff516512055@bytedance.com> <253e5fd0-7e98-43fd-b0d7-8a5b739ae4aa@bytedance.com> <77b1eddf-7c1b-43e9-9352-229998ce3fc7@redhat.com> <5a3428bd-743a-4d51-8b75-163ab560bca7@bytedance.com> <4edccc1a-2761-4a5a-89a6-7869c1b6b08a@redhat.com> <2b48d313-4f66-47c8-98d8-8aa78db62b1b@bytedance.com> <995804f4-b658-44b2-bb40-c84b8a322616@redhat.com> <4ee60b7b-a81e-4b94-96c9-52b1bd9d5061@redhat.com> <332cbacb-cad3-4522-a74b-b5ad5efee4af@redhat.com> From: Qi Zheng In-Reply-To: <332cbacb-cad3-4522-a74b-b5ad5efee4af@redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: 4CA94C0006 X-Stat-Signature: t71iii9wp4n4zgekuxcfu896p67zb8ns X-Rspam-User: X-HE-Tag: 1731928383-666359 X-HE-Meta: U2FsdGVkX1/1sQ56Mi4d1IAktGudgL/oJ+gL1uClG3H77/r5JE5SfWSf/PoLV/H0mo+NcMbYbvbEJjqj5IbWSeDqUhVovxznYreQOc3c0nS3qm+4H2Db95iA5kQMykE+R/WPJpDCgYYTJFlp5OwthZ2WWvozO9ESGnbAo42TWLYg816M6Hu6KetVxseyf+YUeXaALCm3cHmHApVSQs7aeXzC9YJstBQPnFwmusxRDxPn6O47hAtFeHTtHB3r7dPZJ+RvqC0kMzCHf+wNiQdgrQuI9hztW80Q5WL7NkG8shUrDFylEN3y6MEbdBF197Dyxe639NzVJx6INSNO4GhEpOtIvGokK2PtZxprT2sRXDRMY6JZus0pJoHC1tMITPeR5XZZmQngFU9lJwUfcgBMJpE13AV33IWF+fgIcGqby1t1ny8dfhRz37GCL/y5i/jm+tLYKdwrW9RYRbueeJ0XBvr8mDOPdplKo77pDeg9p2GnVs7g9k3Eg4dNPTrMLOu0H91N+5uaPOWe/XtbiHbld3plBR5LQe9OTJu46qPuYvxSUnucghu78wts4pVMiE1wAzcZ1zIJvM0bULOfkP/3t4TcN+bpxbeAXnzOpfLqlQXYyJdV/hAW8ns+90oguaYbWGS4rkS0YioIeIKbPwh5OpuofVL+dKajAm4sDSV0FKttnbV1faZZdjB5FAnJ3EU5P38qa1SEWr16lkzVSkP7L+g9xfDYxU/CRzCEZpgNCyLjeKKFPWnAlmJR0CzmHWTsrizevVnOV24tLp8XNS0ParrpaXmf3tzXLFk03oUM9KuEdzov6m2bquuO3ZhpQ3jql/h3WzJanDxM1tRjv+PG/RwhXmOIrDK93QQN8C9woHfKvey27ro6qrMy2JOgKwS7cG9DHaL4mQS1c1kbeo9ZbB7R+JBLejy2Fd59OJk3eZIhKhEFXYE6tmGb3ep4oggaT5A+L7TWn0Nni74K3WY eOUb6suA kbO6tmUqnSCRPXHe+o9PbCt7PKTAyD4cVrLoYqybgX4fRyFY7Ctc8mq73s0szhH49n3vhF1jkZyteY4twXBggx/+ovzR7ACEeWl2mKrOXN6Qv6ROT1AomUKeR3M4j+fdNcVKvVmxf5ARfwGCIwp/r+6FELQBTkT1+2NR7NSyk939ay2iHN6bcfTKyJw/0DHmMQwt3ESccu7cJmZkO0RLogECD1V40GB/BkPrD6VskV71JzVlkgHZ/u1tHsfyHfIODtPoEMNLhN5ntGLZ3F9HyHV7uvBGe/usqg3wNqVa8zH07/QoTFYboUYvZr436hr09ogXzhmKxXGMe1Fl1PTvZ89hJCG07MxNSxwvdGdpimJvfWMoLpGXJZpgLViUOZztaYTwivXxgKqm4PVdwDHdWkWJV5kZQZBAMV6K/ X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024/11/18 18:59, David Hildenbrand wrote: > On 18.11.24 11:56, Qi Zheng wrote: >> >> >> On 2024/11/18 18:41, David Hildenbrand wrote: >>> On 18.11.24 11:34, Qi Zheng wrote: >>>> >>>> >>>> On 2024/11/18 17:29, David Hildenbrand wrote: >>>>> On 18.11.24 04:35, Qi Zheng wrote: >>>>>> >>>>>> >>>>>> On 2024/11/15 22:59, David Hildenbrand wrote: >>>>>>> On 15.11.24 15:41, Qi Zheng wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote: >>>>>>>>>>>> *nr_skip = nr; >>>>>>>>>>>> >>>>>>>>>>>> and then: >>>>>>>>>>>> >>>>>>>>>>>> zap_pte_range >>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, >>>>>>>>>>>> &skip_nr, >>>>>>>>>>>>                               rss, &force_flush, &force_break); >>>>>>>>>>>>            if (can_reclaim_pt) { >>>>>>>>>>>>                none_nr += count_pte_none(pte, nr); >>>>>>>>>>>>                none_nr += nr_skip; >>>>>>>>>>>>            } >>>>>>>>>>>> >>>>>>>>>>>> Right? >>>>>>>>>>> >>>>>>>>>>> Yes. I did not look closely at the patch that adds the >>>>>>>>>>> counting of >>>>>>>>>> >>>>>>>>>> Got it. >>>>>>>>>> >>>>>>>>>>> pte_none though (to digest why it is required :) ). >>>>>>>>>> >>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect >>>>>>>>>> empty PTE page. >>>>>>>>> >>>>>>>>> Okay, so the problem is that "nr" would be "all processed >>>>>>>>> entries" but >>>>>>>>> there are cases where we "process an entry but not zap it". >>>>>>>>> >>>>>>>>> What you really only want to know is "was any entry not zapped", >>>>>>>>> which >>>>>>>>> could be a simple input boolean variable passed into >>>>>>>>> do_zap_pte_range? >>>>>>>>> >>>>>>>>> Because as soon as any entry was processed but  no zapped, you can >>>>>>>>> immediately give up on reclaiming that table. >>>>>>>> >>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is >>>>>>>> found in count_pte_none(). >>>>>>> >>>>>>> I'm not sure if well need cont_pte_none(), but I'll have to take a >>>>>>> look >>>>>>> at your new patch to see how this fits together with doing the >>>>>>> pte_none >>>>>>> detection+skipping in do_zap_pte_range(). >>>>>>> >>>>>>> I was wondering if you cannot simply avoid the additional >>>>>>> scanning and >>>>>>> simply set "can_reclaim_pt" if you skip a zap. >>>>>> >>>>>> Maybe we can return the information whether the zap was skipped from >>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters >>>>>> like I >>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7]. >>>>>> >>>>>> In theory, we can detect empty PTE pages in the following two ways: >>>>>> >>>>>> 1) If no zap is skipped, it means that all pte entries have been >>>>>>        zap, and the PTE page must be empty. >>>>>> 2) If all pte entries are detected to be none, then the PTE page is >>>>>>        empty. >>>>>> >>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed >>>>>> (which is unacceptable), while the 2) will at most cause empty PTE >>>>>> pages >>>>>> to not be reclaimed. >>>>>> >>>>>> So the most reliable and efficient method may be: >>>>>> >>>>>> a. If there is a zap that is skipped, stop scanning and do not >>>>>> reclaim >>>>>>        the PTE page; >>>>>> b. Otherwise, as now, detect the empty PTE page through >>>>>> count_pte_none() >>>>> >>>>> Is there a need for count_pte_none() that I am missing? >>>> >>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none >>>> ptes. >>>> >>>>> >>>>> Assume we have >>>>> >>>>> nr = do_zap_pte_range(&any_skipped) >>>>> >>>>> >>>>> If "nr" is the number of processed entries (including pte_none()), and >>>>> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we >>>>> can detect what we need, no? >>>>> >>>>> If any_skipped == false after the call, we now have "nr" pte_none() >>>>> entries. -> We can continue trying to reclaim >>>> >>>> I prefer that "nr" should not include pte_none(). >>>> >>> >>> Why? do_zap_pte_range() should tell you how far to advance, nothing >>> less, nothing more. >>> >>> Let's just keep it simple and avoid count_pte_none(). >>> >>> I'm probably missing something important? >> >> As we discussed before, we should skip all consecutive none ptes, > > pte and addr are already incremented before returning. > > It's probably best to send the resulting patch so I can either > understand why count_pte_none() is required or comment on how to get rid > of it. Something like this: diff --git a/mm/memory.c b/mm/memory.c index bd9ebe0f4471f..e9bec3cd49d44 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb, return nr; } +static inline int do_zap_pte_range(struct mmu_gather *tlb, + struct vm_area_struct *vma, pte_t *pte, + unsigned long addr, unsigned long end, + struct zap_details *details, int *rss, + bool *force_flush, bool *force_break, + bool *any_skipped) +{ + pte_t ptent = ptep_get(pte); + int max_nr = (end - addr) / PAGE_SIZE; + + /* Skip all consecutive pte_none(). */ + if (pte_none(ptent)) { + int nr; + + for (nr = 1; nr < max_nr; nr++) { + ptent = ptep_get(pte + nr); + if (!pte_none(ptent)) + break; + } + max_nr -= nr; + if (!max_nr) + return 0; + pte += nr; + addr += nr * PAGE_SIZE; + } + + if (pte_present(ptent)) + return zap_present_ptes(tlb, vma, pte, ptent, max_nr, + addr, details, rss, force_flush, + force_break, any_skipped); + + return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr, + details, rss, any_skipped); +} + +static inline int count_pte_none(pte_t *pte, int nr) +{ + int none_nr = 0; + + /* + * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be + * re-installed, so we need to check pte_none() one by one. + * Otherwise, checking a single PTE in a batch is sufficient. + */ +#ifdef CONFIG_PTE_MARKER_UFFD_WP + for (;;) { + if (pte_none(ptep_get(pte))) + none_nr++; + if (--nr == 0) + break; + pte++; + } +#else + if (pte_none(ptep_get(pte))) + none_nr = nr; +#endif + return none_nr; +} + + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, @@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; + bool can_reclaim_pt; pte_t *pte; int nr; @@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { - pte_t ptent = ptep_get(pte); - int max_nr; - - nr = 1; - if (pte_none(ptent)) - continue; + bool any_skipped; if (need_resched()) break; - max_nr = (end - addr) / PAGE_SIZE; - if (pte_present(ptent)) { - nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr, - addr, details, rss, &force_flush, - &force_break); - if (unlikely(force_break)) { - addr += nr * PAGE_SIZE; - break; - } - } else { - nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, - addr, details, rss); + nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, + rss, &force_flush, &force_break, + &any_skipped); + if (can_reclaim_pt) { + VM_BUG_ON(!any_skipped && count_pte_none(pte, nr) == nr); + if (any_skipped) + can_reclaim_pt = false; + } + if (unlikely(force_break)) { + addr += nr * PAGE_SIZE; + break; } } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); >