linux-mm.kvack.org archive mirror
From: Alexey Makhalov <amakhalov@vmware.com>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	Laurent Dufour <ldufour@linux.ibm.com>
Cc: Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux.com>, Roman Gushchin <guro@fb.com>,
	Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Subject: Re: Percpu allocator: CPU hotplug support
Date: Thu, 22 Apr 2021 01:22:15 -0700	[thread overview]
Message-ID: <5614778F-AA79-40FD-BB62-A543A9C49CE2@vmware.com> (raw)
In-Reply-To: <3320a36c-9270-a7f7-88da-0a9bfa13c774@linux.ibm.com>

Hello,

> On Apr 22, 2021, at 12:45 AM, Laurent Dufour <ldufour@linux.ibm.com> wrote:
> 
> Le 22/04/2021 à 03:33, Dennis Zhou a écrit :
>> Hello,
>> On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
>>> The current implementation of the percpu allocator uses the total number of possible CPUs (nr_cpu_ids) to
>>> determine the number of units to allocate per chunk. Every alloc_percpu() request of N bytes will allocate
>>> N*nr_cpu_ids bytes even if the number of present CPUs is much smaller. The percpu allocator grows by
>>> adding chunks, keeping the number of units per chunk constant. It is done this way to
>>> simplify CPU hotplug/remove by having the per-cpu area preallocated.
>>> 
>>> Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
>>> where nr_cpu_ids is huge.
>>> 
>>> Example from my experiment:
>>> 2 vCPU VM with hotplug support (up to 128):
>>> [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
>>> By creating a huge amount of active and/or dying memory cgroups, I can generate active percpu
>>> allocations of 100 MB (per single CPU), including fragmentation overhead. But in that case the total
>>> percpu memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, chunks are
>>> ~75% full in my experiment, so fragmentation is not a concern.
>>> Out of 12.8 GB:
>>>  - 0.2 GB are actually used by present vCPUs, and
>>>  - 12.6 GB are "wasted"!
>>> 
>>> I've seen production VMs consuming 16-20 GB of memory by Percpu. Roman reported 100 GB.
>>> There are ways to reduce the "wasted" memory overhead, such as: disabling CPU hotplug; reducing
>>> the maximum number of CPUs reported by the hypervisor and/or firmware; using the possible_cpus= kernel
>>> parameter. But they won't eliminate the fundamental issue of "wasted" memory.
>>> 
>>> Suggestion: support scaling the number of units per percpu chunk, allocating/deallocating
>>> units in existing chunks on CPU hotplug/remove events.
>>> 
>> Idk. In theory it sounds doable. In practice I'm not so sure. The two
>> problems off the top of my head:
>> 1) What happens if we can't allocate new pages when a cpu is onlined?
Simply put, onlining a CPU can return an error on allocation failure. Alternatively, the onlining could be retried later when memory becomes available.

>> 2) It's possible users set particular conditions in percpu variables
>> that are not tied to just statistics summing (such as the cpu
>> runqueues). Users would have to provide online init and exit functions
>> which could get weird.
I do not think online init/exit functions are the right approach.
There are many places in Linux where percpu data is initialized right after allocation:
ptr = alloc_percpu();
for_each_possible_cpu(cpu) {
        initialize(per_cpu_ptr(ptr, cpu));
}
Let's keep all such call sites untouched. Hopefully initialize() only touches the contents of the percpu area without allocating substructures; if it does allocate, it should be redesigned.
BTW, this loop does extra work (runtime overhead) initializing areas for possible CPUs which may never arrive.

The proposal:
 - if possible_cpus > online_cpus, add an extra unit (call it A) to each chunk, holding the initialized image of the percpu data for not-yet-online CPUs.
 - for_each_possible_cpu(cpu) in the snippet above should then iterate over all online CPUs + 1 (for unit A).
 - on arrival of a new CPU #N, the percpu allocator should allocate the corresponding unit N and initialize its contents from unit A. Repeat for all chunks.
 - on departure of CPU #D, release unit D from the chunks, keeping unit A intact.
 - if possible_cpus > online_cpus, the overhead becomes +1 unit (for unit A), while the current overhead is +(possible_cpus - online_cpus) units.
 - if possible_cpus == online_cpus (no CPU hotplug), do not allocate unit A and keep the percpu allocator as it is now: no overhead.

Does this fully cover the 2nd concern?

>> As Roman mentioned, I think it would be much better to not have the
>> large discrepancy between the cpu_online_mask and the cpu_possible_mask.
> 
> Indeed, it is quite common on PowerPC to set up a VM with a high number of possible CPUs but a reasonable number of online CPUs. This allows the user to scale up the VM when needed.
> 
> For instance we may see up to 1024 possible CPUs while the online number is *only* 128.
Agree. In VMs, vCPUs are just threads/processes on the host and can easily be added or removed on demand.

Thanks,
—Alexey





Thread overview: 7+ messages
2021-04-22  0:44 Alexey Makhalov
2021-04-22  1:10 ` Roman Gushchin
2021-04-22  1:33 ` Dennis Zhou
2021-04-22  7:45   ` Laurent Dufour
2021-04-22  8:22     ` Alexey Makhalov [this message]
2021-04-22 17:52       ` Vlastimil Babka
2021-04-29 11:39 ` Pratik Sampat
