Re: + mm-introduce-reported-pages.patch added to -mm tree

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
To: David Hildenbrand <david@redhat.com>, Michal Hocko <mhocko@kernel.org>
Cc: akpm@linux-foundation.org, aarcange@redhat.com,
	dan.j.williams@intel.com,  dave.hansen@intel.com,
	konrad.wilk@oracle.com, lcapitulino@redhat.com,
	 mgorman@techsingularity.net, mm-commits@vger.kernel.org,
	mst@redhat.com,  osalvador@suse.de, pagupta@redhat.com,
	pbonzini@redhat.com, riel@surriel.com,  vbabka@suse.cz,
	wei.w.wang@intel.com, willy@infradead.org,
	yang.zhang.wz@gmail.com,  linux-mm@kvack.org
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
Date: Tue, 12 Nov 2019 16:31:23 -0800
Message-ID: <324a154d050d44d3b8a85a3e08a64fe4bd75b72c.camel@linux.intel.com>
In-Reply-To: <bfcabcdd-2824-f092-d546-8a9ce4325225@redhat.com>

On Wed, 2019-11-13 at 00:10 +0100, David Hildenbrand wrote:
> > > > > start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
> > > > > along with a lockless check if the page is free.
> > > > 
> > > > Okay, that part I think I get. However doesn't all that logic more or less
> > > > ignore the watermarks? It seems like you could cause an OOM if you don't
> > > > have the necessary checks in place for that.
> > > 
> > > Any approach that temporarily blocks some free pages from getting
> > > allocated will essentially have this issue, no? I think one main design
> > > point to minimize false OOMs was to limit the number of pages we report
> > > at a time. Or what do you propose here in addition to that?
> > 
> > If you take a look at __isolate_free_page it was performing a check to see
> > if pulling the page would place us below the minimum watermark for pages.
> > Odds are you should probably look at somehow incorporating that into the
> > solution before you pull the page. I have updated my approach to check for
> 
> Ah, now I see what you mean. Makes sense!
> 
> > the low watermark with the full capacity of MAX_ORDER - 1 pages before I
> > start reporting, and then I am using __isolate_free_page which will check
> > the minimum watermark to make sure I don't cross that.
> 
> Yeah, you probably want to check the watermark before doing any 
> reporting - I assume.
> 
> > > > > I think it should be something like this (ignoring different
> > > > > migratetypes and such for now)
> > > > > 
> > > > > 1. Test lockless if page is free: Not free? Done.
> > > > 
> > > > So this should help to reduce the likelihood of races in the steps below.
> > > > However it might also be useful if the code had some other check to see if
> > > > it was done other than just making a pass through the bitmap.
> > > 
> > > Yes.
> > > 
> > > > One thing I had brought up with Nitesh was the idea of maybe doing some
> > > > sort of RCU bitmap type approach. Basically while we hold the zone lock we
> > > > could swap out the old bitmap for a new one. We could probably even keep a
> > > > counter at the start of the structure so that we could track how many bits
> > > > are actually set there. That makes it less likely to hit a race where
> > > > you free a page and set the bit, and the hinting thread tests and clears
> > > > the bit but doesn't see the freed page since it is not synchronized.
> > > > Otherwise your notification setup and reporting thread may need a few smp
> > > > barriers added where necessary.
> > > 
> > > Yes, swapping out the bitmap via RCU would also be a way to make memory
> > > hotplug work.
> > > 
> > > I was also thinking about a different bitmap approach. Store for each
> > > section a bitmap. Use a meta bitmap with a bit for each section that
> > > contains pages to report. Sparse zones and memory hot(un)plug would not
> > > be a real issue anymore.
> > 
> > I had thought about that too. The only problem is that the section has to
> > be power of 2 sized and I don't know if we want to be increasing the size
> 
> ... are there sections that are not a power of 2? x86_64: 128MB, s390x: 
> 256MB, ...

No, what I meant was the mem_section structure. It has a hard requirement
about being power of 2 aligned and is already 16 bytes, or 32 with page
extensions enabled. There is room for a pad in the page extension case so
maybe you could squeeze in something there.
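
To illustrate the size constraint, here is a minimal userspace sketch. The
struct below is a hypothetical stand-in, not the real struct mem_section;
the field names and layout are assumptions chosen to mimic the "32 bytes
with page extensions, including one pad word" case on a 64-bit build. It
only shows why adding one more pointer-sized field (for example a
per-section bitmap pointer) would break a power-of-2 size unless it reuses
the existing pad.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a section descriptor; NOT the kernel struct. */
struct mock_mem_section {
	unsigned long section_mem_map;	/* encoded mem_map pointer */
	void *usage;			/* per-section usage data */
	void *page_ext;			/* only with page extensions */
	unsigned long pad;		/* spare word that could be reused */
};

/* True when n is a non-zero power of two. */
static bool is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Compile-time check in the spirit of a build-time size assertion. */
static_assert((sizeof(struct mock_mem_section) &
	       (sizeof(struct mock_mem_section) - 1)) == 0,
	      "mock_mem_section size must be a power of 2");
```

Replacing the pad with a new field keeps the size unchanged, whereas
appending a field would push the struct to the next power of 2 once it is
padded out again.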

> It does not really make sense to have sections that are not a power of 
> two, thinking about page tables ... I would really be interested where 
> something like that is possible.

Sorry for the confusion on that.

> > 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
> > > > 
> > > > Doesn't this have the side effect of draining all the percpu caches in
> > > > order to make certain to flush the pages we isolated from there?
> > > 
> > > While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
> > > think isolation will. Where did you spot something like this in
> > > mm/page_isolation.c?
> > 
> > On the end of set_migratetype_isolate(). The last thing it does is call
> > drain_all_pages.
> 
> Ahh, missed that, thanks. Yeah, one could probably make that
> configurable, because for that use case, where we already expect a free
> page, we don't need that.

I suppose, but that gets us back to adding complexity, as we now have to
special-case isolation to work with page reporting.

<snip>

> > pages in the region are isolated since as you pointed out you get an EBUSY
> > when you attempt to isolate a page that is already isolated and as such
> > removal will fail won't it?
> 
> Right now, yes.
> 
> (we should rework that code either way to return -EAGAIN in that case 
> and let memory offlining try again automatically. But we have to rework 
> the -EAGAIN vs. -EBUSY handling in memory offlining code at one point 
> either way, I discussed that partially with Michal recently. There is a 
> lot of cleaning up to do.)

So it sounds like a cleanup/rewrite of some of the isolation code will be
needed to really get it doing what you want.

Actually, I wonder if we could take something like the free_reported_page
function I wrote and split it up so that it could be used as the inverse
of __isolate_free_page. Maybe something like a __free_isolated_page.
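
As a rough illustration of the pairing, here is a toy userspace model. It
is a sketch under stated assumptions, not kernel code: struct toy_zone and
all toy_* functions are hypothetical, the zone is reduced to a free-page
counter, and the watermark tests are simplified versions of the checks
discussed earlier in this thread (low watermark plus full reporting
capacity before starting, minimum watermark on each isolation).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified zone: just counters, no free lists. */
struct toy_zone {
	unsigned long free_pages;
	unsigned long min_watermark;
	unsigned long low_watermark;
};

/*
 * Gate for starting a reporting pass: require room for `capacity` blocks
 * of 2^order pages on top of the low watermark.
 */
static bool toy_can_start_reporting(const struct toy_zone *z,
				    unsigned int order, unsigned int capacity)
{
	unsigned long need = (unsigned long)capacity << order;

	return z->free_pages >= z->low_watermark + need;
}

/* Pull a 2^order block, refusing if that would cross the min watermark. */
static bool toy_isolate_free_page(struct toy_zone *z, unsigned int order)
{
	unsigned long nr = 1UL << order;

	if (z->free_pages < nr || z->free_pages - nr < z->min_watermark)
		return false;
	z->free_pages -= nr;
	return true;
}

/* Inverse operation: hand a previously isolated block back to the zone. */
static void toy_free_isolated_page(struct toy_zone *z, unsigned int order)
{
	z->free_pages += 1UL << order;
}
```

The point of the sketch is only the symmetry: every successful
toy_isolate_free_page() is undone by exactly one toy_free_isolated_page()
of the same order, so the reporting code needs no extra isolation state.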

> > > > still having to use the scatterlist in order to hold the pages and track
> > > > what you will need to undo the isolation later.
> > > 
> > > I think it is very neat and not complex at all. Page isolation is a nice
> > > feature we have in the kernel. :) It deserves some cleanups, though.
> > 
> > We can agree to disagree. At this point you are talking about adding bits
> > for sections and pages, and in the meantime I am working with zones and
> > pages. I believe finding free space in the section may be much more tricky
> > than finding it in the zone or page has been. Now that I am rid of the
> > list manipulators my approach may soon surpass the bitmap one in terms of
> > being less intrusive/complex.. :-)
> 
> I am definitely interested to see that approach :) Good to see that the 
> whole discussion in this big thread turned out to be productive.

Yeah, when I started working on the patch split Mel wanted I kind of
realized I was optimizing for the shuffle case which really shouldn't be
an optimization target. I think I just got focused on it as it was in the
way of some of the initial changes I needed to make to handle the
notifier.
