From: Ankur Arora <ankur.a.arora@oracle.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
torvalds@linux-foundation.org, akpm@linux-foundation.org,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
mingo@redhat.com, luto@kernel.org, peterz@infradead.org,
paulmck@kernel.org, rostedt@goodmis.org, tglx@linutronix.de,
willy@infradead.org, jon.grimm@amd.com, bharata@amd.com,
raghavendra.kt@amd.com, boris.ostrovsky@oracle.com,
konrad.wilk@oracle.com
Subject: Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
Date: Mon, 14 Apr 2025 14:21:45 -0700 [thread overview]
Message-ID: <874iyq1lsm.fsf@oracle.com> (raw)
In-Reply-To: <Z_ywzkEEqUOMHcO0@gmail.com>
Ingo Molnar <mingo@kernel.org> writes:
> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> clear_pages_rep(), clear_pages_erms() use string instructions to zero
>> memory. When operating on more than a single page, we can use these
>> more effectively by explicitly advertising the region-size to the
>> processor, which can use that as a hint to optimize the clearing
>> (ex. by eliding cacheline allocation.)
>
>> +#ifndef CONFIG_HIGHMEM
>> +/*
>> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>> + *
>> + * Taking inspiration from the common code variant, we split the zeroing in
>> + * three parts: left of the fault, right of the fault, and up to 5 pages
>> + * in the immediate neighbourhood of the target page.
>> + *
>> + * Cleared in that order to keep cache lines of the target region hot.
>> + *
>> + * For gigantic pages, there is no expectation of cache locality so just do a
>> + * straight zero.
>> + */
>> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
>> +{
>> + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> + int width = 2; /* pages cleared last on either side */
>> + struct range r[3];
>> + int i;
>> +
>> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> + clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>
>> + clear_pages(page_address(folio_page(folio, r[i].start)), len);
>
> So the _user postfix naming is super confusing here and elsewhere in
> this series.
The problem is that the _user naming comes from the MM interface name and
is meant to address architectures where you might need to do more than
just zero the kernel address range for the page.
> clear_page(), and by extension the clear_pages() interface you extended
> it to, fundamentally only works on kernel addresses:
Agreed.
> /*
> * Zero a page.
> * %rdi - page
> */
> SYM_TYPED_FUNC_START(clear_page_rep)
> movl $4096/8,%ecx
> xorl %eax,%eax
> rep stosq
> RET
>
> Note the absolute lack of fault & exception handling.
Yeah. And, as you are implying that is safe because the folio_zero_user()
(and this path) is only called after this range has been validated.
> But folio_zero_user*() uses the kernel-space variants of page clearing
> AFAICT (contrary to the naming):
>
> void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> {
> unsigned int nr_pages = folio_nr_pages(folio);
>
> if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
> clear_gigantic_page(folio, addr_hint, nr_pages);
> else
> process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
> }
>
>
> static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint, > unsigned int nr_pages)
> {
> unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> int i;
>
> might_sleep();
> for (i = 0; i < nr_pages; i++) {
> cond_resched();
> clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
> }
> }
>
> Which on x86 is simply mapped into a kernel-memory interface:
>
> static inline void clear_user_page(void *page, unsigned long vaddr,
> struct page *pg)
> {
> clear_page(page);
> }
>
> So at minimum this is a misnomer and a confusing mixture of user/kernel
> interface names on an epic scale that TBH should be cleaned up first
> before extended...
I think a comment to avoid this confusion is definitely warranted. About
the mixture of names, I'm not sure how to avoid that. For instance see
arch/arc/mm/cache.c::clear_user_page()::
void clear_user_page(void *to, unsigned long u_vaddr, struct page *page)
{
struct folio *folio = page_folio(page);
clear_page(to);
clear_bit(PG_dc_clean, &folio->flags);
}
arch/arm also does a bunch of stuff which made my head hurt but the arc
version is clearly different enough.
>> +out:
>> + /* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> + cond_resched();
>
> What again?
Yeah, I can see how this looks out of place :). The idea was that even
though we don't need explicit invocations of cond_resched() (because
this path is only called when preemptible), we still need some because
cond_resched() is overloaded to help with live patching.
Anyway, this comment can go away based on your suggestion elsewhere
(extensions for cooperative preemption models.)
Thanks for the detailed review.
--
ankur
next prev parent reply other threads:[~2025-04-14 21:22 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-14 3:46 [PATCH v3 0/4] mm/folio_zero_user: add " Ankur Arora
2025-04-14 3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
2025-04-14 6:32 ` Ingo Molnar
2025-04-14 11:02 ` Peter Zijlstra
2025-04-14 11:14 ` Ingo Molnar
2025-04-14 19:46 ` Ankur Arora
2025-04-14 22:26 ` Mateusz Guzik
2025-04-15 6:14 ` Ankur Arora
2025-04-15 8:22 ` Mateusz Guzik
2025-04-15 20:01 ` Ankur Arora
2025-04-15 20:32 ` Mateusz Guzik
2025-04-14 19:52 ` Ankur Arora
2025-04-14 20:09 ` Matthew Wilcox
2025-04-15 21:59 ` Ankur Arora
2025-04-14 3:46 ` [PATCH v3 2/4] x86/clear_page: add clear_pages() Ankur Arora
2025-04-14 3:46 ` [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user() Ankur Arora
2025-04-14 3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
2025-04-14 6:53 ` Ingo Molnar
2025-04-14 21:21 ` Ankur Arora [this message]
2025-04-14 7:05 ` Ingo Molnar
2025-04-15 6:36 ` Ankur Arora
2025-04-22 6:36 ` Raghavendra K T
2025-04-22 19:14 ` Ankur Arora
2025-04-15 10:16 ` Mateusz Guzik
2025-04-15 21:46 ` Ankur Arora
2025-04-15 22:01 ` Mateusz Guzik
2025-04-16 4:46 ` Ankur Arora
2025-04-17 14:06 ` Mateusz Guzik
2025-04-14 5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
2025-04-14 19:30 ` Ankur Arora
2025-04-14 6:36 ` Ingo Molnar
2025-04-14 19:19 ` Ankur Arora
2025-04-15 19:10 ` Zi Yan
2025-04-22 19:32 ` Ankur Arora
2025-04-22 6:23 ` Raghavendra K T
2025-04-22 19:22 ` Ankur Arora
2025-04-23 8:12 ` Raghavendra K T
2025-04-23 9:18 ` Raghavendra K T
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=874iyq1lsm.fsf@oracle.com \
--to=ankur.a.arora@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=bharata@amd.com \
--cc=boris.ostrovsky@oracle.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=hpa@zytor.com \
--cc=jon.grimm@amd.com \
--cc=konrad.wilk@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mingo@kernel.org \
--cc=mingo@redhat.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=raghavendra.kt@amd.com \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=willy@infradead.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox