Re: [PATCH] mm, page_alloc: calculate first_deferred_pfn directly

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Wei Yang <richard.weiyang@gmail.com>
To: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>,
	linux-mm@kvack.org, pavel.tatashin@microsoft.com,
	akpm@linux-foundation.org, mhocko@suse.com
Subject: Re: [PATCH] mm, page_alloc: calculate first_deferred_pfn directly
Date: Sun, 23 Dec 2018 06:58:27 +0000	[thread overview]
Message-ID: <20181223065827.ng2xqck7jdllt7b7@master> (raw)
In-Reply-To: <32d061d6-39a2-97f1-6609-d27ad74f8404@linux.intel.com>

On Fri, Dec 21, 2018 at 05:41:30PM -0800, Alexander Duyck wrote:
>On 12/21/2018 4:22 PM, Wei Yang wrote:
>> On Fri, Dec 21, 2018 at 03:45:40PM -0800, Alexander Duyck wrote:
>> > On Fri, 2018-12-21 at 22:44 +0000, Wei Yang wrote:
>> > > On Thu, Dec 20, 2018 at 03:47:53PM -0800, Alexander Duyck wrote:
>> > > > On Fri, 2018-12-07 at 18:08 +0800, Wei Yang wrote:
>> > > > > After commit c9e97a1997fb ("mm: initialize pages on demand during
>> > > > > boot"), the behavior of DEFERRED_STRUCT_PAGE_INIT is changed to
>> > > > > initialize first section for highest zone on each node.
>> > > > > 
>> > > > > Instead of test each pfn during iteration, we could calculate the
>> > > > > first_deferred_pfn directly with necessary information.
>> > > > > 
>> > > > > By doing so, we also get some performance benefit during bootup:
>> > > > > 
>> > > > >      +----------+-----------+-----------+--------+
>> > > > >      |          |Base       |Patched    |Gain    |
>> > > > >      +----------+-----------+-----------+--------+
>> > > > >      | 1 Node   |0.011993   |0.011459   |-4.45%  |
>> > > > >      +----------+-----------+-----------+--------+
>> > > > >      | 4 Nodes  |0.006466   |0.006255   |-3.26%  |
>> > > > >      +----------+-----------+-----------+--------+
>> > > > > 
>> > > > > Test result is retrieved from dmesg time stamp by add printk around
>> > > > > free_area_init_nodes().
>> > > > > 
>> > > > > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> > > 
>> > > Hi, Alexander
>> > > 
>> > > Thanks for your comment!
>> > > 
>> > > > I'm pretty sure the fundamental assumption made in this patch is wrong.
>> > > > 
>> > > > It is assuming that the first deferred PFN will just be your start PFN
>> > > > + PAGES_PER_SECTION aligned to the nearest PAGES_PER_SECTION, do I have
>> > > > that correct?
>> > > 
>> > > You are right.
>> > > 
>> > > > 
>> > > > If I am not mistaken that can result in scenarios where you actually
>> > > > start out with 0 pages allocated if your first section is in a span
>> > > > belonging to another node, or is reserved memory for things like MMIO.
>> > > 
>> > > Yeah, sounds it is possible.
>> > > 
>> > > > 
>> > > > Ideally we don't want to do that as we have to immediately jump into
>> > > > growing the zone with the code as it currently stands.
>> > > 
>> > > You are right.
>> > > 
>> > > > 
>> > > > > ---
>> > > > >   mm/page_alloc.c | 57 +++++++++++++++++++++++++++------------------------------
>> > > > >   1 file changed, 27 insertions(+), 30 deletions(-)
>> > > > > 
>> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > > > > index baf473f80800..5f077bf07f3e 100644
>> > > > > --- a/mm/page_alloc.c
>> > > > > +++ b/mm/page_alloc.c
>> > > > > @@ -306,38 +306,33 @@ static inline bool __meminit early_page_uninitialised(unsigned long pfn)
>> > > > >   }
>> > > > >   /*
>> > > > > - * Returns true when the remaining initialisation should be deferred until
>> > > > > - * later in the boot cycle when it can be parallelised.
>> > > > > + * Calculate first_deferred_pfn in case:
>> > > > > + * - in MEMMAP_EARLY context
>> > > > > + * - this is the last zone
>> > > > > + *
>> > > > > + * If the first aligned section doesn't exceed the end_pfn, set it to
>> > > > > + * first_deferred_pfn and return it.
>> > > > >    */
>> > > > > -static bool __meminit
>> > > > > -defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
>> > > > > +unsigned long __meminit
>> > > > > +defer_pfn(int nid, unsigned long start_pfn, unsigned long end_pfn,
>> > > > > +	  enum memmap_context context)
>> > > > >   {
>> > > > > -	static unsigned long prev_end_pfn, nr_initialised;
>> > > > > +	struct pglist_data *pgdat = NODE_DATA(nid);
>> > > > > +	unsigned long pfn;
>> > > > > -	/*
>> > > > > -	 * prev_end_pfn static that contains the end of previous zone
>> > > > > -	 * No need to protect because called very early in boot before smp_init.
>> > > > > -	 */
>> > > > > -	if (prev_end_pfn != end_pfn) {
>> > > > > -		prev_end_pfn = end_pfn;
>> > > > > -		nr_initialised = 0;
>> > > > > -	}
>> > > > > +	if (context != MEMMAP_EARLY)
>> > > > > +		return end_pfn;
>> > > > > -	/* Always populate low zones for address-constrained allocations */
>> > > > > -	if (end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
>> > > > > -		return false;
>> > > > > +	/* Always populate low zones */
>> > > > > +	if (end_pfn < pgdat_end_pfn(pgdat))
>> > > > > +		return end_pfn;
>> > > > > -	/*
>> > > > > -	 * We start only with one section of pages, more pages are added as
>> > > > > -	 * needed until the rest of deferred pages are initialized.
>> > > > > -	 */
>> > > > > -	nr_initialised++;
>> > > > > -	if ((nr_initialised > PAGES_PER_SECTION) &&
>> > > > > -	    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
>> > > > > -		NODE_DATA(nid)->first_deferred_pfn = pfn;
>> > > > > -		return true;
>> > > > > +	pfn = roundup(start_pfn + PAGES_PER_SECTION - 1, PAGES_PER_SECTION);
>> > > > > +	if (end_pfn > pfn) {
>> > > > > +		pgdat->first_deferred_pfn = pfn;
>> > > > > +		end_pfn = pfn;
>> > > > >   	}
>> > > > > -	return false;
>> > > > > +	return end_pfn;
>> > > > 
>> > > > Okay so I stand corrected. It looks like you are rounding up by
>> > > > (PAGES_PER_SECTION - 1) * 2 since if I am not mistaken roundup should
>> > > > do the same math you already did in side the function.
>> > > > 
>> > > > >   }
>> > > > >   #else
>> > > > >   static inline bool early_page_uninitialised(unsigned long pfn)
>> > > > > @@ -345,9 +340,11 @@ static inline bool early_page_uninitialised(unsigned long pfn)
>> > > > >   	return false;
>> > > > >   }
>> > > > > -static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
>> > > > > +unsigned long __meminit
>> > > > > +defer_pfn(int nid, unsigned long start_pfn, unsigned long end_pfn,
>> > > > > +	  enum memmap_context context)
>> > > > >   {
>> > > > > -	return false;
>> > > > > +	return end_pfn;
>> > > > >   }
>> > > > >   #endif
>> > > > > @@ -5514,6 +5511,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> > > > >   	}
>> > > > >   #endif
>> > > > > +	end_pfn = defer_pfn(nid, start_pfn, end_pfn, context);
>> > > > > +
>> > > > 
>> > > > A better approach for this might be to look at placing the loop within
>> > > > a loop similar to how I handled this for the deferred init. You only
>> > > > really need to be performing all of these checks once per section
>> > > > aligned point anyway.
>> > > 
>> > > I didn't really get your idea here. Do you have the commit id you handle
>> > > deferred init?
>> > 
>> > The deferred_grow_zone function actually had some logic like this
>> > before I had rewritten it, you can still find it on lxr:
>> > https://elixir.bootlin.com/linux/latest/source/mm/page_alloc.c#L1668
>> > 
>> > > 
>> > > > 
>> > > > Basically if you added another loop and limited the loop below so that
>> > > > you only fed it one section at a time then you could just pull the
>> > > > defer_init check out of this section and place it in the outer loop
>> > > > after you have already tried initializing at least one section worth of
>> > > > pages.
>> > > > 
>> > > > You could probably also look at pulling in the logic that is currently
>> > > > sitting at the end of the current function that is initializing things
>> > > > until the end_pfn is aligned with PAGES_PER_SECTION.
>> > > > 
>> > > > >   	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> > > > >   		/*
>> > > > >   		 * There can be holes in boot-time mem_map[]s handed to this
>> > > > > @@ -5526,8 +5525,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> > > > >   				continue;
>> > > > >   			if (overlap_memmap_init(zone, &pfn))
>> > > > >   				continue;
>> > > > > -			if (defer_init(nid, pfn, end_pfn))
>> > > > > -				break;
>> > > > >   		}
>> > > > 
>> > > > So the whole reason for the "defer_init" call being placed here is
>> > > > because there are checks to see if the prior PFN is valid, in our NUMA
>> > > > node, or is an overlapping region. If your first section or in this
>> > > > case 2 sections contain pages that fall into these categories you
>> > > > aren't going to initialize any pages.
>> > > 
>> > > Ok, I get your point. Let me do a summary, the approach in this patch
>> > > has one flaw: in case all pages in the first section fall into these two
>> > > categories, we will end up with no page initialized for this zone.
>> > > 
>> > > So my suggestion is:
>> > > 
>> > >    Find the first valid page and roundup it to PAGES_PER_SECTION. This
>> > >    would ensure we won't end up with zero initialized page.
>> > 
>> > Using the first valid PFN will not work either. The problem is you want
>> > to ideally have PAGES_PER_SECTION number of pages allocated before we
>> > begin deferring allocation. The pages will have a number of regions
>> > that are reserved and/or full of holes so you cannot rely on the first
>> > PFN to be the start of a contiguous section of pages.
>> > 
>> 
>> Hmm... my original idea is we don't need to initialize at least
>> PAGES_PER_SECTION pages before defer init. In my mind, we just need to
>> initialize *some* pages for this zone. The worst case is there is only
>> one page initialized at bootup.
>> 
>> So here is the version with a little change to cope with the situation
>> when the whole section is not available.
>> 
>> 	for (pfn = start_pfn; pfn < enf_pfn; pfn++) {
>> 		if (!early_pfn_valid(pfn))
>> 			continue;
>> 		if (!early_pfn_in_nid(pfn, nid))
>> 			continue;
>> 		if (overlap_memmap_init(zone, &pfn))
>> 			continue;
>> 
>> 		break;
>> 	}
>> 
>> 	pfn = round_up(pfn + 1, PAGES_PER_SECTION);
>> 
>> 	if (end_pfn > pfn) {
>> 		pgdat->first_deferred_pfn = pfn;
>> 		end_pfn = pfn;
>> 	}
>> 
>
>This would have you updating your first_deferred_pfn for every valid pfn. I
>don't think that is what you want.

It break out at the first valid pfn, so it update first_deferred_pfn
once on the first valid pfn. Is my logic in code incorrect?

>
>> And here is the version if we want to count the number of valid pages.
>> 
>> 	for (pfn = start_pfn; pfn < enf_pfn; pfn++) {
>> 		if (!early_pfn_valid(pfn))
>> 			continue;
>> 		if (!early_pfn_in_nid(pfn, nid))
>> 			continue;
>> 		if (overlap_memmap_init(zone, &pfn))
>> 			continue;
>> 
>> 		if (++valid_pfns == PAGES_PER_SECTION)
>> 			break;
>> 	}
>> 
>
>Your break condition here is wrong. You don't want to break out if the number
>of valid_pfns is equal to PAGES_PER_SECTION, you want to break out if pfn is
>aligned to PAGES_PER_SECTION.
>
>> 	pfn = round_up(pfn + 1, PAGES_PER_SECTION);

Hmm... here it align pfn to PAGES_PER_SECTION. The loop above makes sure
there are one section pages initialized during bootup.

>> 
>> 	if (end_pfn > pfn) {
>> 		pgdat->first_deferred_pfn = pfn;
>> 		end_pfn = pfn;
>> 	}
>> 
>
>This is even more incorrect then the original patch.
>
>> > > Generally, my purpose in this patch is:
>> > > 
>> > > 1. Don't affect the initialisation for non defer init zones.
>> > >     Current code will call defer_init() for each pfn, no matter this pfn
>> > >     should be defer_init or not. By taking this out, we try to minimize
>> > >     the effect on the initialisation process.
>> > 
>> > So one problem with trying to pull out defer_init is that it contains
>> > the increment nr_initialized. At a minimum that logic should probably
>> > be pulled out and placed back where it was.
>> > 
>> > > 2. Iterate on less pfn for defer zone
>> > >     Current code will count on each pfn in defer zone. By roundup pfn
>> > >     directly, less calculation would be necessary. Defer init will handle
>> > >     the rest. Or if we really want at least PAGES_PER_SECTION pfn be
>> > >     initialized for defer zone, we can do the same math in defer_pfn().
>> > 
>> > So the general idea I was referring to above would be something like:
>> > for (pfn = start_pfn; pfn < end_pfn;) {
>> > 	t = ALIGN_DOWN(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
>> > 	first_deferred_pfn = min(t, end_pfn);
>> > 	section_initialized = 0;
>> > 
>> > 	for (; pfn < first_deferred_pfn; pfn++) {
>> > 		struct page *page;
>> > 
>> > 		/* All or original checks here w/ continue */
>> > 
>> > 		/* all of the original page initialization stuff */
>> > 
>> > 		section_initialized++;
>> > 	}
>> > 
>> > 	/* Place all the original checks for deferred init here */
>> > 
>> > 	nr_initialized += section_initialized;
>> > 
>> > 	/* remaining checks for deferred init to see if we exit/break here */
>> > }
>> > 
>> > The general idea is the same as what you have stated. However with this
>> > approach we should be able to count the number of pages initialized and
>> > once per section we will either just drop the results stored in
>> > section_initialized, or we will add them to the initialized count and
>> > if it exceeds the needed value we could then break out of the loop.
>> > 
>> 
>> Yep, it looks we share similar idea. While I take the initialization
>> part out and just count the pfn ahead. And use this pfn for the loop.
>> 
>> Do you think I understand you correctly?
>> 
>
>There are two pieces to this. One is tracking the number of PFNs that you
>have initialized in a given section. We need to do that and the overhead
>should be pretty light since it is just an increment. It would probably be
>only one or two instructions manipulating a local variable, possibly even a
>register.
>

Yes, the effort to count for defer zone is not huge, while the function
defer_ini() will be called for each pfn in non-defer zone.

>The other bit is to see if we even care about the section/zone and if we
>should do something with the count of pages initialized in it. That check
>should be done once at the end of every section we initialize. I would say we
>should aim for trying to initialize at least a section number worth of pages.

If we have to initialize PAGES_PER_SECTION pages, my approach is not
better than current one. :)

>I admit it is kind of arbitrary, but if we don't do that then the work falls
>back to deferred_grow_zone and that is resorting to allocating a section
>number worth of pages so one way or another we will end up having to do that
>anyway.
>
>We should be section aligned in order to keep the later freeing of the pages
>from blowing up due to accessing an uninitialized page.
>
>> > > 
>> > > Glad to talk with you and look forward your comments:-)
>> > > 
>> > > > 
>> > > > >   		page = pfn_to_page(pfn);
>> > > 
>> > > 
>> 

-- 
Wei Yang
Help you, Help me

next prev parent reply	other threads:[~2018-12-23  6:58 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-07 10:08 Wei Yang
2018-12-20 23:04 ` Andrew Morton
2018-12-20 23:47 ` Alexander Duyck
2018-12-21 22:44   ` Wei Yang
2018-12-21 23:45     ` Alexander Duyck
2018-12-21 23:45       ` Alexander Duyck
2018-12-22  0:22       ` Wei Yang
2018-12-22  1:41         ` Alexander Duyck
2018-12-23  6:58           ` Wei Yang [this message]
2018-12-24 17:50             ` Alexander Duyck
2018-12-24 17:50               ` Alexander Duyck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181223065827.ng2xqck7jdllt7b7@master \
    --to=richard.weiyang@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.h.duyck@linux.intel.com \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=pavel.tatashin@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox