From: Michal Hocko <mhocko@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: linux-mm@kvack.org
Subject: Re: [PATCH RFC 0/8] mm: online/offline 4MB chunks controlled by device driver
Date: Mon, 16 Apr 2018 16:08:10 +0200 [thread overview]
Message-ID: <20180416140810.GR17484@dhcp22.suse.cz> (raw)
In-Reply-To: <b51ca7a1-c5ae-fbbb-8edf-e71f383da07e@redhat.com>
On Fri 13-04-18 18:31:02, David Hildenbrand wrote:
> On 13.04.2018 17:59, Michal Hocko wrote:
> > On Fri 13-04-18 15:16:24, David Hildenbrand wrote:
> > [...]
> >> In contrast to existing balloon solutions:
> >> - The device is responsible for its own memory only.
> >
> > Please be more specific. Any ballooning driver is responsible for its
> > own memory. So what exactly does that mean?
>
> Simple example (virtio-balloon):
>
> You boot Linux with 8GB. You hotplug two DIMMs with 4GB each.
> You use virtio-balloon to "unplug" 4GB (although it is not a real
> unplug). The ballooned memory will be taken from *any* memory the
> allocator is willing to give away.
>
> Now imagine you want to reboot and keep the 4GB unplugged by e.g.
> resizing the memory/DIMMs. What do you resize, and by how much? One
> DIMM? Two DIMMs? Delete one DIMM and keep only one? Drop all DIMMs and
> resize the initial memory? This makes the implementation in the
> hypervisor extremely hard (especially when thinking about migration).
I do not follow. Why does this even matter in a virtualized environment?
> So a ballooning driver does not inflate on its device memory but simply
> on *any* memory. That's why you only have one virtio-balloon device per
> VM. It is basically "per machine", while paravirtualized memory devices
> manage their assigned memory. Like a resizable DIMM, so to say.
>
> >
> >> - Works on a coarser granularity (e.g. 4MB because that's what we can
> >> online/offline in Linux). We are not using the buddy allocator when unplugging
> >> but really search for chunks of memory we can offline.
> >
> > Again, more details please. Virtio driver already tries to scan suitable
> > pages to balloon AFAIK.
> Virtio balloon simply uses alloc_page().
>
> That's it. Also, if we wanted to alloc bigger chunks at a time, we would
> be limited to MAX_ORDER - 1. But that's a different story.
Not really, we do have means to do a pfn walk and then try to isolate
pages one-by-one. Have a look at http://lkml.kernel.org/r/1523017045-18315-1-git-send-email-wei.w.wang@intel.com
> One goal of paravirtualized memory is to be able to unplug more memory
> than with simple DIMM devices (e.g. 2GB) without having to online it
> as MOVABLE.
I am not sure I understand but any hotplug based solution without
handling that memory on ZONE_MOVABLE is a lost battle. It is simply
usable only to hotadd memory. Shrinking it back will most likely fail on
many workloads.
> We don't want to squeeze the last little piece of memory out
> of the system like a balloon driver does. So we *don't* want to go down
> to a page level. If we can't unplug any bigger chunks anymore, we tried
> our best and can't do anything about it.
>
> That's why I don't like to call it a balloon. It is not meant to be used
> for purposes an ordinary balloon is used (besides memory unplug) - like
> reducing the size of the dcache or cooperative memory management
> (meaning a high frequency of inflate and deflate).
OK, so what is the typical usecase? And what does the usual setup look
like?
> >> - A device can belong to exactly one NUMA node. This way we can online/offline
> >> memory in a fine granularity NUMA aware.
> >
> > What does prevent existing balloon solutions to be NUMA aware?
>
> They usually (e.g. virtio-balloon) only have one "target pages" goal
> defined. Not per NUMA node. We could implement such a feature (with
> quite some obstacles - e.g. what happens if NUMA nodes are merged in the
> guest but the hypervisor wants to verify that memory is ballooned on the
> right NUMA node?), but with paravirtualized memory devices it comes
> "naturally" and "for free". You just request to resize a certain memory
> device, that's it.
I do not follow. I expected that the guest NUMA topology matches the
host one or is a subset of it. So both ends should know which memory to
{in,de}flate. Maybe I am just too naive because I am not very familiar
with any ballooning implementation.
> >> - Architectures that don't have proper memory hotplug interfaces (e.g. s390x)
> >> get memory hotplug support. I have a prototype for s390x.
> >
> > I am pretty sure that s390 does support memory hotplug. Or what do you
> > mean?
>
> There is no interface for s390x to tell it "hey, I just gave you another
> DIMM/memory chunk, please online it" like we have on other
> architectures. Such an interface does not exist. That's also the reason
> why remove_memory() is not yet supported. All that exists is standby memory
> that Linux will simply add and try to online when booting up. Old
> mainframe interfaces are different :)
>
> You can read more about that here:
>
> http://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04951.html
Thanks for the pointer
> >> - Once all 4MB chunks of a memory block are offline, we can remove the
> >> memory block and therefore the struct pages (seems to work in my prototype),
> >> which is nice.
> >
> > OK, so our existing ballooning solutions indeed do not free up memmaps
> > which is suboptimal.
>
> And we would have to hack deep into the current offlining code to make
> it work (at least that's my understanding).
>
> >
> >> Todo:
> >> - We might have to add a parameter to offline_pages(), telling it to not
> >> try forever but abort in case it takes too long.
> >
> > Offlining fails when it sees non-migratable pages, but other than
> > that it should always succeed in finite time. If not, then there is
> > a bug to be fixed.
>
> I just found the -EINTR in the offlining code and thought this might be
> problematic. (e.g. if somebody pins a page that is still to be migrated
> - or is that avoided by isolating?) I haven't managed to trigger this
> scenario yet. Was just a thought, that's why I mentioned it but didn't
> implement it.
Offlining is a 3-stage process: check for unmovable pages (failing with
-EBUSY), isolate free memory, and migrate the rest. If the first two
stages succeed, we expect the migration to finish in finite time.
--
Michal Hocko
SUSE Labs