From: Alexey Makhalov <amakhalov@vmware.com>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>,
Laurent Dufour <ldufour@linux.ibm.com>
Cc: Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
Christoph Lameter <cl@linux.com>, Roman Gushchin <guro@fb.com>,
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Subject: Re: Percpu allocator: CPU hotplug support
Date: Thu, 22 Apr 2021 01:22:15 -0700 [thread overview]
Message-ID: <5614778F-AA79-40FD-BB62-A543A9C49CE2@vmware.com> (raw)
In-Reply-To: <3320a36c-9270-a7f7-88da-0a9bfa13c774@linux.ibm.com>
Hello,
> On Apr 22, 2021, at 12:45 AM, Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> Le 22/04/2021 à 03:33, Dennis Zhou a écrit :
>> Hello,
>> On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
>>> Current implementation of percpu allocator uses total possible number of CPUs (nr_cpu_ids) to
>>> get number of units to allocate per chunk. Every alloc_percpu() request of N bytes will allocate
>>> N*nr_cpu_ids bytes even if the number of present CPUs is much less. Percpu allocator grows by
>>> number of chunks keeping number of units per chunk constant. This is done in that way to
>>> simplify CPU hotplug/remove to have per-cpu area preallocated.
>>>
>>> Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
>>> where nr_cpu_ids is huge.
>>>
>>> Example from my experiment:
>>> 2 vCPU VM with hotplug support (up to 128):
>>> [ 0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
>>> By creating huge amount of active or/and dying memory cgroups, I can generate active percpu
>>> allocations of 100 MB (per single CPU) including fragmentation overhead. But in that case total
>>> percpu memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, chunks are
>>> filled by ~75% in my experiment, so fragmentation is not a concern.
>>> Out of 12.8 GB:
>>> - 0.2 GB are actually used by present vCPUs, and
>>> - 12.6 GB are "wasted"!
>>>
>>> I've seen production VMs consuming 16-20 GB of memory by Percpu. Roman reported 100 GB.
>>> There are solutions to reduce "wasted" memory overhead such as: disabling CPU hotplug; reducing
>>> number of maximum CPUs reported by hypervisor or/and firmware; using possible_cpus= kernel
>>> parameter. But it won't eliminate fundamental issue with "wasted" memory.
>>>
>>> Suggestion: To support percpu chunks scaling by number of units there. To allocate/deallocate new
>>> units for existing chunks on CPU hotplug/remove event.
>>>
>> Idk. In theory it sounds doable. In practice I'm not so sure. The two
>> problems off the top of my head:
>> 1) What happens if we can't allocate new pages when a cpu is onlined?
Simply: the CPU online callback can return an error on allocation failure. Alternatively, the onlining could be retried later once memory becomes available, if that is the case.
>> 2) It's possible users set particular conditions in percpu variables
>> that are not tied to just statistics summing (such as the cpu
>> runqueues). Users would have to provide online init and exit functions
>> which could get weird.
I do not think online init/exit functions are the right approach.
There are many places in the kernel where percpu data is initialized right after being allocated:

	ptr = alloc_percpu();
	for_each_possible_cpu(cpu) {
		initialize(per_cpu_ptr(ptr, cpu));
	}

Let's keep all such call sites untouched. Hopefully initialize() only touches the contents of the percpu area without allocating substructures; if it does allocate, it would need to be redesigned.
BTW, this loop does extra work (runtime overhead) initializing areas for possible CPUs that may never arrive.
The proposal:
- if possible_cpus > online_cpus, add an additional unit (call it A) to each chunk, holding the initialized image of the percpu data for not-yet-online CPUs.
- for_each_possible_cpu(cpu) in the snippet above would then iterate over all online CPUs + 1 (for unit A).
- on arrival of a new CPU #N, the percpu allocator allocates the corresponding unit N and initializes its contents from unit A. Repeat for all chunks.
- on departure of CPU #D, release unit D from all chunks, keeping unit A intact.
- if possible_cpus > online_cpus, the overhead is +1 unit (for unit A), while the current overhead is +(possible_cpus - online_cpus) units.
- if possible_cpus == online_cpus (no CPU hotplug), do not allocate unit A and keep the percpu allocator as it is now: no overhead.
Does this fully address the 2nd concern?
>> As Roman mentioned, I think it would be much better to not have the
>> large discrepancy between the cpu_online_mask and the cpu_possible_mask.
>
> Indeed it is quite common on PowerPC to set up a VM with a high possible number of CPUs but a reasonable number of online CPUs. This allows the user to scale up the VM when needed.
>
> For instance we may see up to 1024 possible CPUs while the online number is *only* 128.
Agree. In VMs, vCPUs are just threads/processes on the host and can easily be added or removed on demand.
Thanks,
—Alexey