From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 10648C54E58 for ; Fri, 22 Mar 2024 00:56:18 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 84D516B0083; Thu, 21 Mar 2024 20:56:17 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 805B36B0089; Thu, 21 Mar 2024 20:56:17 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 676276B008A; Thu, 21 Mar 2024 20:56:17 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 5102E6B0083 for ; Thu, 21 Mar 2024 20:56:17 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 1B7791202AD for ; Fri, 22 Mar 2024 00:56:17 +0000 (UTC) X-FDA: 81922858794.21.4C20F3E Received: from mail-lj1-f169.google.com (mail-lj1-f169.google.com [209.85.208.169]) by imf08.hostedemail.com (Postfix) with ESMTP id 21323160009 for ; Fri, 22 Mar 2024 00:56:14 +0000 (UTC) Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=lV5HN2xo; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf08.hostedemail.com: domain of ioworker0@gmail.com designates 209.85.208.169 as permitted sender) smtp.mailfrom=ioworker0@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1711068975; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=vcVrCip4Ps9anS18xqoWS2f/zJ/l51Fqge92mpqaEh8=; b=38azPXVc7GMRWI1rP7QWkXiyJuWAFoAlpiGxbDaaM3HulAc9EjYHOotSAXe/RBFlPNl3x1 rvVSDJpxo8z6+2++mQoTchTbyjrDkfylsx2m+RpPrHGBrBFYr3ZOM0/wrMi1rsppZnzXGW 6dgaX3sWITeobVOY1sOYpbJGj3hl600= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=lV5HN2xo; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf08.hostedemail.com: domain of ioworker0@gmail.com designates 209.85.208.169 as permitted sender) smtp.mailfrom=ioworker0@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1711068975; a=rsa-sha256; cv=none; b=lHmYzbNkoCH2yAIpz4p4RVLIpmaN2z/iZStMCRHIGhU4qIov7//9I/l+W4Gwb+cWB5ONqZ pZrQpnAPJsPWNJIpWtbengHV7ICg7SuWc4zG4Q5GoDsfhxHQB0ztW15oDzEYXfzLLtMpBJ 3Ygvag2NapmYY7Lmxry81ffULwyfJS4= Received: by mail-lj1-f169.google.com with SMTP id 38308e7fff4ca-2d4541bf57eso21764571fa.2 for ; Thu, 21 Mar 2024 17:56:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1711068973; x=1711673773; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=vcVrCip4Ps9anS18xqoWS2f/zJ/l51Fqge92mpqaEh8=; b=lV5HN2xoz+nIdQxnFyHcxoZC05BlGOovj6U0Ffr45Wrm4BFfkCukiD5fS1UBxD2rhz umEqpV6HrvWzKyMHDQyrsX7K/t7eIqG4G/YU0niKmk8ExxA3vkWizn9Me/A/WneWBP4U ojl4GDqrhMNKsvNqv0szIyGQiorudHEUKO84VSr5L9yLBpuSp+dHSMnbIaBhqC4gBwYD m5ce4r2rh+IysimXfLbkhOelatK2GlsMpCPGslv4aW6SO1UXoYODQ0Su8XTNHDzFS87d TVALE+vI4ee53SbSzGK0F2KAeXWxRfvqI/PwDV6j+e09XQca+2dWCX7H4QEvu81F9+97 EaXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1711068973; x=1711673773; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=vcVrCip4Ps9anS18xqoWS2f/zJ/l51Fqge92mpqaEh8=; b=GX77k2TPq28TH4sMHtM0FBeA9zjWTPDu7gM0tqu1UM25W7PjCssimbME1G4/x0DCGB Lkvq+RgtKvXpnlgt1jmbbccVDF6oS3R0bubpgRu/OHcHrCuCw6VUm06GvfkjJxB2M/XY Jn0jYUI3JvA54tLkthC+MtuXUy4R0nLus8D6qOJcSu24hPeb+0/ZRakECFzQ8vQuESug 0TOcj60e1ajbc8H/gXk6QBMIabPwrqo/hptWoATOKWl4vzaJeDWJvCyK+bppEBETR70r NkPOs4kd/voNbHZ2l/GvzIb+a6aM9zoDgx7ZCxKOVlWiNwTchDePbpIfJODdcs4Yi9Jz nx8g== X-Forwarded-Encrypted: i=1; AJvYcCXkZt6NIYDUASJRnHInjGxAzHWVBEQ46Wh6QN95m25FbiyB5+eRpOVM7BhUow/ObwdfMEv7QMoSHkezbx77a+DafSc= X-Gm-Message-State: AOJu0YzkNUNvsxUuPXC3h6+hEHIJazNqOMKKa+GeX2CP9n7z1JRgJJAR +iLyfiVmiU1Y5l3lob1ShQhUH8g6ZPFfZxrxjpJBgCuwO1jGsgHKPVO3AcdtyUU7xkUhpKsANJ6 vEqpySW6p/j78TUy0K7lYndqDDsQ= X-Google-Smtp-Source: AGHT+IH6raTTot+XyOBM82iAckU1J1fggMed0iKNgAS0FhUrBwZqv0ssKTP8PWpyjDQKrR17BkF3oKlqxJEfphY11Ho= X-Received: by 2002:a2e:a4b7:0:b0:2d6:a31f:b66b with SMTP id g23-20020a2ea4b7000000b002d6a31fb66bmr702997ljm.21.1711068973108; Thu, 21 Mar 2024 17:56:13 -0700 (PDT) MIME-Version: 1.0 References: <20240311150058.1122862-1-ryan.roberts@arm.com> <20240311150058.1122862-7-ryan.roberts@arm.com> <7ba06704-2090-4eb2-9534-c4d467cc085a@arm.com> <269375a4-78a3-4c22-8e6e-570368a2c053@arm.com> <9930c86a-c0c8-4112-9122-0e4faca475f5@arm.com> In-Reply-To: <9930c86a-c0c8-4112-9122-0e4faca475f5@arm.com> From: Lance Yang Date: Fri, 22 Mar 2024 08:56:01 +0800 Message-ID: Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD To: Ryan Roberts Cc: Barry Song <21cnbao@gmail.com>, Andrew Morton , David Hildenbrand , Matthew Wilcox , Huang Ying , Gao Xiang , Yu Zhao , Yang Shi , Michal Hocko , Kefeng Wang , Chris Li , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 21323160009 X-Stat-Signature: c51r8zo7chbqf6wc6gs51ytiynu6jn13 X-Rspam-User: X-HE-Tag: 1711068974-663913 X-HE-Meta: U2FsdGVkX1/iJorOxtna5+X6mifRx5N0+bF+Pd5NToe22bWgv1B9KUtpX8HjHB3MgoI1VcIZBGxgceteNprDuHWEgcoCljeP9l3QWzwDINb73kGdxcfgJSB2ro3ieL6ztIttWi5J+MZQOtUkBwiL2PvwjSV72faJ1BpByU9nHzZ2qvANJ3qSXeTydFHRaPtTm2re6ajb4Xw00zdzYKuR0E+15MmtUwMdUh749fHy2zE907Je5ysdkXYDCZO3FrRd7dRdhNQO/02x37Qo9sgZETnlqpRdl4YCEjBiaT0PYlR8pCn2uA+//7dEZPPZ+k/U3TuMFvJWVDgAU2mKN0Qj0ArKlO+mrfmpMPW9sdrbfTAejQ9BmxdzS8YDUcd8Dq9X2RuLdoImDvgLvAiRGjSNYN4mi5PJWEhZ3ZEuBoKtjwAMTjOaVgKYISxgqH9c4DK1XkPdOW17tFQ7Stv28/SEy4iCBWpiIa5XqVqSEG6RTERwPr0nZMppnpAHmwqM+7Uu9NCHargpCNgyiO5iP2+qCeiHGc2gaN4H2DoB1ndoF5JOdaP97BLfGYOl/kGH9WjsggU9ZhlTWLi/yHcgyNE2/oaMPBppAXTBi6uqx/Na+/ZNWHFodNaHhTmw/pd+qo44t+v4aaXQxYO7xgd6HosHMP2glETeaY/0SsqlXv2LT8mexFLSlgDca2PQWYwJNJOFdxVPFTodxYUJ5cHvSo800tDp8iIF9lEZCUsupC84TGzLuiwRPcVxz6gi6sx8Md0PODacaIhttIfiU0gJfd4yuJb6gvp+Thtn6wE8p+DoELbHbcn+ZtMLjZqjSGP0NK3uCFSF0uVlnTuAF0P49YPC5ujoblC6+t+kl9AcuRPiEAKMRGd3Q8tQAK3CX7r/CJ0IDzWfrg6i1aWYtWKfiJiLy9SqwJktLPK7vUkg0fASox/r896UNfvFa4wlhuCfBQc3a4osPRn4gYERR5ML9e+ b6u+9Ezz 96mvLNLkao05AtmHfiDIPjoYJ1weDxiipfHQVHZ53HNCLsxba9rT1EqLO4o1VkxxptLrx2kpXrI7Zo4LgIzPWbiVztmsws2Cug7uFTisSCvwmHz27yItWU1+U6FxwDv5Tsq+YrpLTbKhX2z+nn0KBM33xrtwLsxx5/jJyj6JhAFHARFqHoi6ElVoCEdxQsVDFdo+MSCKhoAy5pUNoZC3CwM4mNRSLTnnfFlPx+wSNkPv0JeXLVOZ8uvCQ6y0K+jzTBum9IXcoPz++m4DXxlfZ7MpI2lCE0Lqtw3PRYG73g2ibeS4/lX2wY3vuJVwbZ2RUGO/D3ElnOWBAiCGEqzDAjHbOZDH0fPnveLz9MtfNV2R7ytmD0bOvHL12KJ22no6APhbGCHPsJLKLsmMhX+88LcHk0m2hK2z3NON4cXb/kziurhtGBIMzoRXr9heBmrzQTqXaDlL74U3V6hhvK7VtqwyadegZ/D6nnOcR5vFIoC3m1wQ5od9sfS1c1/OCWQ9NB+e0LNXBaY6AO6qGIdk68NRsc8rQi6huFGaaa1b+OscGChzZyISWqImZjwiNdZbfAhWKsNbIk22RKeMrZf72Kz1GyiMxtCD66x0h X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Mar 21, 2024 at 11:24=E2=80=AFPM Ryan Roberts wrote: > > On 21/03/2024 14:55, Lance Yang wrote: > > On Thu, Mar 21, 2024 at 9:38=E2=80=AFPM Ryan Roberts wrote: > >> > >>>>>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio= ); > >>>>>>>>>> - > >>>>>>>>>> - if (!pageout && pte_young(ptent)) { > >>>>>>>>>> - ptent =3D ptep_get_and_clear_full(mm, = addr, pte, > >>>>>>>>>> - tlb->f= ullmm); > >>>>>>>>>> - ptent =3D pte_mkold(ptent); > >>>>>>>>>> - set_pte_at(mm, addr, pte, ptent); > >>>>>>>>>> - tlb_remove_tlb_entry(tlb, pte, addr); > >>>>>>>>>> + if (!pageout) { > >>>>>>>>>> + for (; nr !=3D 0; nr--, pte++, addr += =3D PAGE_SIZE) { > >>>>>>>>>> + if (ptep_test_and_clear_young(= vma, addr, pte)) > >>>>>>>>>> + tlb_remove_tlb_entry(t= lb, pte, addr); > >>>>>>> > >>>>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set= _pte_at and > >>>>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE wi= th old after > >>>>>>> pte clearing? > >>>>>> > >>>>>> Sorry Lance, I don't understand this question, can you rephrase? A= re you saying > >>>>>> there is a good reason to do the original clear-mkold-set for some= arches? > >>>>> > >>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with > >>>>> ptep_test_and_clear_young() > >>>>> and tlb_remove_tlb_entry(). > >> > >> Afraid I'm still struggling with this comment. Do you mean to say that= powerpc > >> invalidates the TLB entry as part of the call to ptep_test_and_clear_y= oung()? So > >> tlb_remove_tlb_entry() would be redundant here, and likely cause perfo= rmance > >> degradation on that architecture? > > > > I just thought that using ptep_test_and_clear_young() instead of > > ptep_get_and_clear_full() + pte_mkold() might not be correct. > > However, it's most likely that I was mistaken :( > > OK, I'm pretty confident that my usage is correct. > > > > > I also have a question. Why aren't we using ptep_test_and_clear_young()= in > > madvise_cold_or_pageout_pte_range(), but instead > > ptep_get_and_clear_full() + pte_mkold() as we did previously. > > > > /* > > * Some of architecture(ex, PPC) don't update TLB > > * with set_pte_at and tlb_remove_tlb_entry so for > > * the portability, remap the pte with old|clean > > * after pte clearing. > > */ > > Ahh, I see; this is a comment from madvise_free_pte_range() I don't quite > understand that comment. I suspect it might be out of date, or saying tha= t doing > set_pte_at(pte_mkold(ptep_get(ptent))) is not correct because it is not a= tomic > and the HW could set the dirty bit between the get and the set. Doing the= atomic > ptep_get_and_clear_full() means you go via a pte_none() state, so if the = TLB is > racing it will see the entry isn't valid and fault. Thanks for your analysis and explanations! > > Note that madvise_free_pte_range() is trying to clear both the access and= dirty > bits, whereas madvise_cold_or_pageout_pte_range() is only trying to clear= the > access bit. There is a special helper to clear the access bit atomically = - > ptep_test_and_clear_young() - but there is no helper to clear the access = *and* > dirty bit, I don't believe. There is ptep_set_access_flags(), but that se= ts > flags to a "more permissive setting" (i.e. allows setting the flags, not > clearing them). Perhaps this constraint can be relaxed given we will foll= ow up > with an explicit TLBI - it would require auditing all the implementations= . Thanks for bringing this! I'll take a closer look at it. Thanks again for your time! Lance > > > > > According to this comment from madvise_free_pte_range. IIUC, we need to > > call ptep_get_and_clear_full() to clear the PTE, and then remap the > > PTE with old|clean. > > > > Thanks, > > Lance > > > >> > >> IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the= TLB > >> entry, that's what ptep_clear_flush_young() is for. > >> > >> But I do see that for some cases of the 32-bit ppc, there appears to b= e a flush: > >> > >> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG > >> static inline int __ptep_test_and_clear_young(struct mm_struct *mm, > >> unsigned long addr, pte_= t *ptep) > >> { > >> unsigned long old; > >> old =3D pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0); > >> if (old & _PAGE_HASHPTE) > >> flush_hash_entry(mm, ptep, addr); <<<<<<<< > >> > >> return (old & _PAGE_ACCESSED) !=3D 0; > >> } > >> #define ptep_test_and_clear_young(__vma, __addr, __ptep) \ > >> __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep) > >> > >> Is that what you are describing? Does any anyone know why flush_hash_e= ntry() is > >> called? I'd say that's a bug in ppc and not a reason not to use > >> ptep_test_and_clear_young() in the common code! > >> > >> Thanks, > >> Ryan > >> > >> > >>>> > >>>> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entr= y for this > >>>> address please" - albeit its deferred and batched. I'll look into th= is. > >>>> > >>>>> > >>>>> In my new patch[1], I use refresh_full_ptes() and > >>>>> tlb_remove_tlb_entries() to batch-update the > >>>>> access and dirty bits. > >>>> > >>>> I want to avoid the per-pte clear-modify-set approach, because this = doesn't > >>>> perform well on arm64 when using contpte mappings; it will cause the= contpe > >>>> mapping to be unfolded by the first clear that touches the contpte b= lock, then > >>>> refolded by the last set to touch the block. That's expensive. > >>>> ptep_test_and_clear_young() doesn't suffer that problem. > >>> > >>> Thanks for explaining. I got it. > >>> > >>> I think that other architectures will benefit from the per-pte clear-= modify-set > >>> approach. IMO, refresh_full_ptes() can be overridden by arm64. > >>> > >>> Thanks, > >>> Lance > >>>> > >>>>> > >>>>> [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworke= r0@gmail.com > >>>>> > >>>>> Thanks, > >>>>> Lance > >>>>> > >>>>>> > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Lance > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>>>> + } > >>>>>>>>> > >>>>>>>>> This looks so smart. if it is not pageout, we have increased pt= e > >>>>>>>>> and addr here; so nr is 0 and we don't need to increase again i= n > >>>>>>>>> for (; addr < end; pte +=3D nr, addr +=3D nr * PAGE_SIZE) > >>>>>>>>> > >>>>>>>>> otherwise, nr won't be 0. so we will increase addr and > >>>>>>>>> pte by nr. > >>>>>>>> > >>>>>>>> Indeed. I'm hoping that Lance is able to follow a similar patter= n for > >>>>>>>> madvise_free_pte_range(). > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> /* > >>>>>>>>>> -- > >>>>>>>>>> 2.25.1 > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> Overall, LGTM, > >>>>>>>>> > >>>>>>>>> Reviewed-by: Barry Song > >>>>>>>> > >>>>>>>> Thanks! > >>>>>>>> > >>>>>>>> > >>>>>> > >>>> > >> >