From: Michal Hocko <mhocko@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: linux-mm@kvack.org
Subject: Re: [PATCH RFC 0/8] mm: online/offline 4MB chunks controlled by device driver
Date: Mon, 16 Apr 2018 16:08:10 +0200
Message-ID: <20180416140810.GR17484@dhcp22.suse.cz>
In-Reply-To: <b51ca7a1-c5ae-fbbb-8edf-e71f383da07e@redhat.com>

On Fri 13-04-18 18:31:02, David Hildenbrand wrote:
> On 13.04.2018 17:59, Michal Hocko wrote:
> > On Fri 13-04-18 15:16:24, David Hildenbrand wrote:
> > [...]
> >> In contrast to existing balloon solutions:
> >> - The device is responsible for its own memory only.
> > 
> > Please be more specific. Any ballooning driver is responsible for its
> > own memory. So what exactly does that mean?
> 
> Simple example (virtio-balloon):
> 
> You boot Linux with 8GB. You hotplug two DIMMs with 4GB each.
> You use virtio-balloon to "unplug" (although it is not a real unplug)
> 4GB. That memory will be taken from *any* memory the allocator is
> willing to give away.
> 
> Now imagine you want to reboot and keep the 4GB unplugged by e.g.
> resizing memory/DIMMs. What do you resize, and by how much? A DIMM? Two
> DIMMs? Delete one DIMM and keep only one? Drop all DIMMs and resize the
> initial memory? This makes implementation in the hypervisor extremely
> hard (especially thinking about migration).

I do not follow. Why does this even matter in the virtualized env.?

> So a ballooning driver does not inflate on its device memory but simply
> on *any* memory. That's why you only have one virtio-balloon device per
> VM. It is basically "per machine", while paravirtualized memory devices
> manage their assigned memory. Like a resizable DIMM, so to say.
> 
> > 
> >> - Works on a coarser granularity (e.g. 4MB because that's what we can
> >>   online/offline in Linux). We are not using the buddy allocator when unplugging
> >>   but really search for chunks of memory we can offline.
> > 
> > Again, more details please. The virtio driver already tries to scan
> > for suitable pages to balloon AFAIK.
> 
> virtio-balloon simply uses alloc_page().
> 
> That's it. Also, if we wanted to alloc bigger chunks at a time, we would
> be limited to MAX_ORDER - 1. But that's a different story.

Not really, we do have means to do a pfn walk and then try to isolate
pages one-by-one. Have a look at http://lkml.kernel.org/r/1523017045-18315-1-git-send-email-wei.w.wang@intel.com
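
Roughly something like this, purely illustrative (the real code behind
that link has to hold the zone lock and handle races;
try_to_isolate_page() is a made-up helper):

	/*
	 * Sketch of a pfn walk over a range, looking for free (buddy)
	 * pages and handing each candidate to an isolation helper.
	 */
	static void walk_range_for_isolation(unsigned long start_pfn,
					     unsigned long end_pfn)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page;

			if (!pfn_valid(pfn))
				continue;
			page = pfn_to_page(pfn);
			if (!PageBuddy(page))
				continue;
			try_to_isolate_page(page);
		}
	}

Note that such a walk is not limited to MAX_ORDER - 1 chunks the way
repeated alloc_page() calls are.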

> One goal of paravirtualized memory is to be able to unplug more memory
> than with simple DIMM devices (e.g. 2GB) without having to online it
> as MOVABLE.

I am not sure I understand, but any hotplug-based solution that does
not keep that memory in ZONE_MOVABLE is a lost battle. It is then
usable only to hotadd memory. Shrinking it back will most likely fail
on many workloads.
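
For reference, onlining into ZONE_MOVABLE is a plain sysfs write from
userspace; a minimal sketch (the block number is made up):

	#include <stdio.h>

	/* Online memory block 32 into ZONE_MOVABLE via sysfs. */
	int main(void)
	{
		FILE *f = fopen("/sys/devices/system/memory/memory32/state", "w");

		if (!f)
			return 1;
		fputs("online_movable", f);
		return fclose(f) ? 1 : 0;
	}

Memory onlined that way is what offlining can later take back reliably.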

> We don't want to squeeze the last little piece of memory out
> of the system like a balloon driver does. So we *don't* want to go down
> to a page level. If we can't unplug any bigger chunks anymore, we tried
> our best and can't do anything about it.
> 
> That's why I don't like to call it a balloon. It is not meant to be
> used for the purposes an ordinary balloon is used for (besides memory
> unplug), like reducing the size of the dcache or cooperative memory
> management (meaning a high frequency of inflates and deflates).

OK, so what is the typical use case? And what does the usual setup
look like?

> >> - A device can belong to exactly one NUMA node. This way we can online/offline
> >>   memory at a fine granularity in a NUMA-aware fashion.
> > 
> > What does prevent existing balloon solutions to be NUMA aware?
> 
> They usually (e.g. virtio-balloon) only have a single "target pages"
> goal defined, not one per NUMA node. We could implement such a feature
> (with
> quite some obstacles - e.g. what happens if NUMA nodes are merged in the
> guest but the hypervisor wants to verify that memory is ballooned on the
> right NUMA node?), but with paravirtualized memory devices it comes
> "naturally" and "for free". You just request to resize a certain memory
> device, that's it.

I do not follow. I expected that the guest NUMA topology matches the
host one, or is a subset of it. So both ends should know which memory
to {in,de}flate. Maybe I am just too naive because I am not very
familiar with any ballooning implementation.
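
If I read the proposal correctly, the per-device model would look
roughly like this (a made-up sketch, not an actual proposed interface):

	#include <linux/types.h>

	/*
	 * Each paravirtualized memory device is tied to a single NUMA
	 * node, so resizing a device is implicitly NUMA aware.
	 */
	struct pv_mem_device {
		int nid;		/* the one node this device feeds */
		u64 region_size;	/* memory assigned to the device */
		u64 target_size;	/* requested plugged size */
	};

Compare that with virtio-balloon's single global "target pages" value,
which carries no node information.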

> >> - Architectures that don't have proper memory hotplug interfaces (e.g. s390x)
> >>   get memory hotplug support. I have a prototype for s390x.
> > 
> > I am pretty sure that s390 does support memory hotplug. Or what do you
> > mean?
> 
> There is no interface on s390x for the hypervisor to tell the guest
> "hey, I just gave you another DIMM/memory chunk, please online it"
> like we have on other architectures. That's also the reason why
> remove_memory() is not yet supported. All that exists is standby
> memory that Linux will simply add and try to online when booting up.
> Old mainframe interfaces are different :)
> 
> You can read more about that here:
> 
> http://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04951.html

Thanks for the pointer.

> >> - Once all 4MB chunks of a memory block are offline, we can remove the
> >>   memory block and therefore the struct pages (seems to work in my prototype),
> >>   which is nice.
> > 
> > OK, so our existing ballooning solutions indeed do not free up
> > memmaps, which is suboptimal.
> 
> And we would have to hack deep into the current offlining code to make
> it work (at least that's my understanding).
> 
> > 
> >> Todo:
> >> - We might have to add a parameter to offline_pages(), telling it to not
> >>   try forever but abort in case it takes too long.
> > 
> > Offlining fails when it sees non-migratable pages, but other than
> > that it should always succeed in finite time. If not, then there is
> > a bug to be fixed.
> 
> I just found the -EINTR in the offlining code and thought this might
> be problematic (e.g. if somebody pins a page that is still to be
> migrated - or is that avoided by the isolation?). I haven't managed
> to trigger this scenario yet. It was just a thought; that's why I
> mentioned it but didn't implement anything.

Offlining is a three-stage process: check for unmovable pages and fail
with -EBUSY if any are found, isolate the free memory, and migrate the
rest. If the first two stages succeed, we expect the migration to
finish in finite time.
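
In pseudocode (the helper names are made up; the real flow lives in
__offline_pages() in mm/memory_hotplug.c):

	/* Made-up helpers standing in for the three stages above. */
	static int offline_range(unsigned long start_pfn, unsigned long end_pfn)
	{
		/* 1) bail out with -EBUSY if the range holds unmovable pages */
		if (range_has_unmovable_pages(start_pfn, end_pfn))
			return -EBUSY;

		/* 2) isolate free pages so the allocator cannot hand them
		 *    out again while we work on the range */
		isolate_free_pages_in_range(start_pfn, end_pfn);

		/* 3) migrate whatever is still in use; once 1) and 2) have
		 *    succeeded this is expected to terminate in finite time */
		return migrate_range(start_pfn, end_pfn);
	}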
-- 
Michal Hocko
SUSE Labs
