From: Wei Yang <richard.weiyang@gmail.com>
To: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>,
linux-mm@kvack.org, pavel.tatashin@microsoft.com,
akpm@linux-foundation.org, mhocko@suse.com
Subject: Re: [PATCH] mm, page_alloc: calculate first_deferred_pfn directly
Date: Sat, 22 Dec 2018 00:22:35 +0000 [thread overview]
Message-ID: <20181222002235.imzsqh6p7ryt3cgh@master> (raw)
In-Reply-To: <fcbf10c73e2b2ce7b8580f0f91c447571a506ea4.camel@linux.intel.com>
On Fri, Dec 21, 2018 at 03:45:40PM -0800, Alexander Duyck wrote:
>On Fri, 2018-12-21 at 22:44 +0000, Wei Yang wrote:
>> On Thu, Dec 20, 2018 at 03:47:53PM -0800, Alexander Duyck wrote:
>> > On Fri, 2018-12-07 at 18:08 +0800, Wei Yang wrote:
>> > > After commit c9e97a1997fb ("mm: initialize pages on demand during
>> > > boot"), the behavior of DEFERRED_STRUCT_PAGE_INIT is changed to
>> > > initialize first section for highest zone on each node.
>> > >
>> > > Instead of test each pfn during iteration, we could calculate the
>> > > first_deferred_pfn directly with necessary information.
>> > >
>> > > By doing so, we also get some performance benefit during bootup:
>> > >
>> > > +----------+-----------+-----------+--------+
>> > > | |Base |Patched |Gain |
>> > > +----------+-----------+-----------+--------+
>> > > | 1 Node |0.011993 |0.011459 |-4.45% |
>> > > +----------+-----------+-----------+--------+
>> > > | 4 Nodes |0.006466 |0.006255 |-3.26% |
>> > > +----------+-----------+-----------+--------+
>> > >
>> > > Test result is retrieved from dmesg time stamp by add printk around
>> > > free_area_init_nodes().
>> > >
>> > > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>
>> Hi, Alexander
>>
>> Thanks for your comment!
>>
>> > I'm pretty sure the fundamental assumption made in this patch is wrong.
>> >
>> > It is assuming that the first deferred PFN will just be your start PFN
>> > + PAGES_PER_SECTION aligned to the nearest PAGES_PER_SECTION, do I have
>> > that correct?
>>
>> You are right.
>>
>> >
>> > If I am not mistaken that can result in scenarios where you actually
>> > start out with 0 pages allocated if your first section is in a span
>> > belonging to another node, or is reserved memory for things like MMIO.
>>
>> Yeah, sounds it is possible.
>>
>> >
>> > Ideally we don't want to do that as we have to immediately jump into
>> > growing the zone with the code as it currently stands.
>>
>> You are right.
>>
>> >
>> > > ---
>> > > mm/page_alloc.c | 57 +++++++++++++++++++++++++++------------------------------
>> > > 1 file changed, 27 insertions(+), 30 deletions(-)
>> > >
>> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > > index baf473f80800..5f077bf07f3e 100644
>> > > --- a/mm/page_alloc.c
>> > > +++ b/mm/page_alloc.c
>> > > @@ -306,38 +306,33 @@ static inline bool __meminit early_page_uninitialised(unsigned long pfn)
>> > > }
>> > >
>> > > /*
>> > > - * Returns true when the remaining initialisation should be deferred until
>> > > - * later in the boot cycle when it can be parallelised.
>> > > + * Calculate first_deferred_pfn in case:
>> > > + * - in MEMMAP_EARLY context
>> > > + * - this is the last zone
>> > > + *
>> > > + * If the first aligned section doesn't exceed the end_pfn, set it to
>> > > + * first_deferred_pfn and return it.
>> > > */
>> > > -static bool __meminit
>> > > -defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
>> > > +unsigned long __meminit
>> > > +defer_pfn(int nid, unsigned long start_pfn, unsigned long end_pfn,
>> > > + enum memmap_context context)
>> > > {
>> > > - static unsigned long prev_end_pfn, nr_initialised;
>> > > + struct pglist_data *pgdat = NODE_DATA(nid);
>> > > + unsigned long pfn;
>> > >
>> > > - /*
>> > > - * prev_end_pfn static that contains the end of previous zone
>> > > - * No need to protect because called very early in boot before smp_init.
>> > > - */
>> > > - if (prev_end_pfn != end_pfn) {
>> > > - prev_end_pfn = end_pfn;
>> > > - nr_initialised = 0;
>> > > - }
>> > > + if (context != MEMMAP_EARLY)
>> > > + return end_pfn;
>> > >
>> > > - /* Always populate low zones for address-constrained allocations */
>> > > - if (end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
>> > > - return false;
>> > > + /* Always populate low zones */
>> > > + if (end_pfn < pgdat_end_pfn(pgdat))
>> > > + return end_pfn;
>> > >
>> > > - /*
>> > > - * We start only with one section of pages, more pages are added as
>> > > - * needed until the rest of deferred pages are initialized.
>> > > - */
>> > > - nr_initialised++;
>> > > - if ((nr_initialised > PAGES_PER_SECTION) &&
>> > > - (pfn & (PAGES_PER_SECTION - 1)) == 0) {
>> > > - NODE_DATA(nid)->first_deferred_pfn = pfn;
>> > > - return true;
>> > > + pfn = roundup(start_pfn + PAGES_PER_SECTION - 1, PAGES_PER_SECTION);
>> > > + if (end_pfn > pfn) {
>> > > + pgdat->first_deferred_pfn = pfn;
>> > > + end_pfn = pfn;
>> > > }
>> > > - return false;
>> > > + return end_pfn;
>> >
>> > Okay so I stand corrected. It looks like you are rounding up by
>> > (PAGES_PER_SECTION - 1) * 2 since if I am not mistaken roundup should
>> > do the same math you already did in side the function.
>> >
>> > > }
>> > > #else
>> > > static inline bool early_page_uninitialised(unsigned long pfn)
>> > > @@ -345,9 +340,11 @@ static inline bool early_page_uninitialised(unsigned long pfn)
>> > > return false;
>> > > }
>> > >
>> > > -static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
>> > > +unsigned long __meminit
>> > > +defer_pfn(int nid, unsigned long start_pfn, unsigned long end_pfn,
>> > > + enum memmap_context context)
>> > > {
>> > > - return false;
>> > > + return end_pfn;
>> > > }
>> > > #endif
>> > >
>> > > @@ -5514,6 +5511,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> > > }
>> > > #endif
>> > >
>> > > + end_pfn = defer_pfn(nid, start_pfn, end_pfn, context);
>> > > +
>> >
>> > A better approach for this might be to look at placing the loop within
>> > a loop similar to how I handled this for the deferred init. You only
>> > really need to be performing all of these checks once per section
>> > aligned point anyway.
>>
>> I didn't really get your idea here. Do you have the commit id you handle
>> deferred init?
>
>The deferred_grow_zone function actually had some logic like this
>before I had rewritten it, you can still find it on lxr:
>https://elixir.bootlin.com/linux/latest/source/mm/page_alloc.c#L1668
>
>>
>> >
>> > Basically if you added another loop and limited the loop below so that
>> > you only fed it one section at a time then you could just pull the
>> > defer_init check out of this section and place it in the outer loop
>> > after you have already tried initializing at least one section worth of
>> > pages.
>> >
>> > You could probably also look at pulling in the logic that is currently
>> > sitting at the end of the current function that is initializing things
>> > until the end_pfn is aligned with PAGES_PER_SECTION.
>> >
>> > > for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> > > /*
>> > > * There can be holes in boot-time mem_map[]s handed to this
>> > > @@ -5526,8 +5525,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> > > continue;
>> > > if (overlap_memmap_init(zone, &pfn))
>> > > continue;
>> > > - if (defer_init(nid, pfn, end_pfn))
>> > > - break;
>> > > }
>> >
>> > So the whole reason for the "defer_init" call being placed here is
>> > because there are checks to see if the prior PFN is valid, in our NUMA
>> > node, or is an overlapping region. If your first section or in this
>> > case 2 sections contain pages that fall into these categories you
>> > aren't going to initialize any pages.
>>
>> Ok, I get your point. Let me do a summary, the approach in this patch
>> has one flaw: in case all pages in the first section fall into these two
>> categories, we will end up with no page initialized for this zone.
>>
>> So my suggestion is:
>>
>> Find the first valid page and roundup it to PAGES_PER_SECTION. This
>> would ensure we won't end up with zero initialized page.
>
>Using the first valid PFN will not work either. The problem is you want
>to ideally have PAGES_PER_SECTION number of pages allocated before we
>begin deferring allocation. The pages will have a number of regions
>that are reserved and/or full of holes so you cannot rely on the first
>PFN to be the start of a contiguous section of pages.
>
Hmm... my original idea is we don't need to initialize at least
PAGES_PER_SECTION pages before defer init. In my mind, we just need to
initialize *some* pages for this zone. The worst case is there is only
one page initialized at bootup.
So here is the version with a little change to cope with the situation
when the whole section is not available.
for (pfn = start_pfn; pfn < enf_pfn; pfn++) {
if (!early_pfn_valid(pfn))
continue;
if (!early_pfn_in_nid(pfn, nid))
continue;
if (overlap_memmap_init(zone, &pfn))
continue;
break;
}
pfn = round_up(pfn + 1, PAGES_PER_SECTION);
if (end_pfn > pfn) {
pgdat->first_deferred_pfn = pfn;
end_pfn = pfn;
}
And here is the version if we want to count the number of valid pages.
for (pfn = start_pfn; pfn < enf_pfn; pfn++) {
if (!early_pfn_valid(pfn))
continue;
if (!early_pfn_in_nid(pfn, nid))
continue;
if (overlap_memmap_init(zone, &pfn))
continue;
if (++valid_pfns == PAGES_PER_SECTION)
break;
}
pfn = round_up(pfn + 1, PAGES_PER_SECTION);
if (end_pfn > pfn) {
pgdat->first_deferred_pfn = pfn;
end_pfn = pfn;
}
>> Generally, my purpose in this patch is:
>>
>> 1. Don't affect the initialisation for non defer init zones.
>> Current code will call defer_init() for each pfn, no matter this pfn
>> should be defer_init or not. By taking this out, we try to minimize
>> the effect on the initialisation process.
>
>So one problem with trying to pull out defer_init is that it contains
>the increment nr_initialized. At a minimum that logic should probably
>be pulled out and placed back where it was.
>
>> 2. Iterate on less pfn for defer zone
>> Current code will count on each pfn in defer zone. By roundup pfn
>> directly, less calculation would be necessary. Defer init will handle
>> the rest. Or if we really want at least PAGES_PER_SECTION pfn be
>> initialized for defer zone, we can do the same math in defer_pfn().
>
>So the general idea I was referring to above would be something like:
>for (pfn = start_pfn; pfn < end_pfn;) {
> t = ALIGN_DOWN(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
> first_deferred_pfn = min(t, end_pfn);
> section_initialized = 0;
>
> for (; pfn < first_deferred_pfn; pfn++) {
> struct page *page;
>
> /* All or original checks here w/ continue */
>
> /* all of the original page initialization stuff */
>
> section_initialized++;
> }
>
> /* Place all the original checks for deferred init here */
>
> nr_initialized += section_initialized;
>
> /* remaining checks for deferred init to see if we exit/break here */
>}
>
>The general idea is the same as what you have stated. However with this
>approach we should be able to count the number of pages initialized and
>once per section we will either just drop the results stored in
>section_initialized, or we will add them to the initialized count and
>if it exceeds the needed value we could then break out of the loop.
>
Yep, it looks we share similar idea. While I take the initialization
part out and just count the pfn ahead. And use this pfn for the loop.
Do you think I understand you correctly?
>>
>> Glad to talk with you and look forward your comments:-)
>>
>> >
>> > >
>> > > page = pfn_to_page(pfn);
>>
>>
--
Wei Yang
Help you, Help me
next prev parent reply other threads:[~2018-12-22 0:22 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-12-07 10:08 Wei Yang
2018-12-20 23:04 ` Andrew Morton
2018-12-20 23:47 ` Alexander Duyck
2018-12-21 22:44 ` Wei Yang
2018-12-21 23:45 ` Alexander Duyck
2018-12-21 23:45 ` Alexander Duyck
2018-12-22 0:22 ` Wei Yang [this message]
2018-12-22 1:41 ` Alexander Duyck
2018-12-23 6:58 ` Wei Yang
2018-12-24 17:50 ` Alexander Duyck
2018-12-24 17:50 ` Alexander Duyck
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181222002235.imzsqh6p7ryt3cgh@master \
--to=richard.weiyang@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=alexander.h.duyck@linux.intel.com \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=pavel.tatashin@microsoft.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox