Subject: Re: Percpu allocator: CPU hotplug support
From: Alexey Makhalov
To: "linux-mm@kvack.org", Laurent Dufour
Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Roman Gushchin, Aneesh Kumar K.V, Srikar Dronamraju
Date: Thu, 22 Apr 2021 01:22:15 -0700
Message-ID: <5614778F-AA79-40FD-BB62-A543A9C49CE2@vmware.com>
In-Reply-To: <3320a36c-9270-a7f7-88da-0a9bfa13c774@linux.ibm.com>
References: <8E7F3D98-CB68-4418-8E0E-7287E8273DA9@vmware.com> <3320a36c-9270-a7f7-88da-0a9bfa13c774@linux.ibm.com>
Hello,

> On Apr 22, 2021, at 12:45 AM, Laurent Dufour wrote:
>
> On 22/04/2021 at 03:33, Dennis Zhou wrote:
>> Hello,
>>
>> On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
>>> The current implementation of the percpu allocator uses the total
>>> possible number of CPUs (nr_cpu_ids) to derive the number of units to
>>> allocate per chunk. Every alloc_percpu() request of N bytes allocates
>>> N*nr_cpu_ids bytes, even if the number of present CPUs is much
>>> smaller. The percpu allocator grows by number of chunks, keeping the
>>> number of units per chunk constant. It is done that way to simplify
>>> CPU hotplug/remove, by having the per-cpu area preallocated.
>>>
>>> Problem: this behavior can lead to inefficient memory usage on big
>>> server machines and VMs, where nr_cpu_ids is huge.
>>>
>>> Example from my experiment, a 2 vCPU VM with hotplug support (up to 128):
>>>
>>>   [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
>>>
>>> By creating a huge number of active and/or dying memory cgroups, I can
>>> generate active percpu allocations of 100 MB (per single CPU),
>>> including fragmentation overhead. But in that case the total percpu
>>> memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW,
>>> chunks are ~75% full in my experiment, so fragmentation is not a
>>> concern. Out of 12.8 GB:
>>>  - 0.2 GB is actually used by the present vCPUs, and
>>>  - 12.6 GB is "wasted"!
>>>
>>> I've seen production VMs where percpu consumes 16-20 GB of memory;
>>> Roman reported 100 GB. There are ways to reduce the "wasted" memory
>>> overhead, such as disabling CPU hotplug, reducing the maximum number
>>> of CPUs reported by the hypervisor and/or firmware, or using the
>>> possible_cpus= kernel parameter. But they do not eliminate the
>>> fundamental issue of "wasted" memory.
>>>
>>> Suggestion: support scaling percpu chunks by number of units, i.e.
>>> allocate/deallocate units of existing chunks on CPU hotplug/remove
>>> events.
>>>
>> Idk. In theory it sounds doable. In practice I'm not so sure. The two
>> problems off the top of my head:
>>
>> 1) What happens if we can't allocate new pages when a cpu is onlined?

Simple: onlining the CPU can return an error on allocation failure. Or,
potentially, the units can be instantiated later, once memory becomes
available, if that is the case.

>> 2) It's possible users set particular conditions in percpu variables
>> that are not tied to just statistics summing (such as the cpu
>> runqueues). Users would have to provide online init and exit functions
>> which could get weird.

I do not think online init/exit functions are the right approach. There
are many places in Linux where percpu data gets initialized right after
it is allocated:

	ptr = alloc_percpu(struct foo);
	for_each_possible_cpu(cpu)
		initialize(per_cpu_ptr(ptr, cpu));

Let's keep all such instances untouched. Hopefully initialize() only
touches the contents of the percpu area without allocating
substructures; if it does allocate, it should be redesigned anyway.
BTW, this loop does extra work (runtime overhead) to initialize areas
for possible CPUs that might never arrive.

The proposal (a rough sketch follows the list):
- In case of possible_cpus > online_cpus, add an additional unit (call
  it A) to each chunk, containing the initialized image of the percpu
  data for possible CPUs.
- The for_each_possible_cpu(cpu) loop in the snippet above should go
  through all online CPUs + 1 (for unit A).
- On arrival of a new CPU #N, percpu should allocate the corresponding
  unit N and initialize its contents from the data in unit A. Repeat
  for all chunks.
- On departure of CPU D, release unit D from the chunks, keeping unit A
  intact.
- In case of possible_cpus > online_cpus, the overhead is +1 unit (for
  unit A), while the current overhead is +(possible_cpus - online_cpus)
  units.
- In case of possible_cpus == online_cpus (no CPU hotplug), do not
  allocate unit A and keep the percpu allocator as it is now: no
  overhead.
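To make this concrete, here is a rough sketch of what the hotplug
callbacks could look like, using the cpuhp state machine. To be clear,
this only illustrates the idea: struct pcpu_chunk below is simplified,
and pcpu_chunk_list, template_unit, pcpu_alloc_unit_pages(),
pcpu_map_unit() and pcpu_free_unit() are made-up names, not existing
mm/percpu.c internals.

/*
 * Hypothetical sketch of the proposed hot-add/hot-remove paths.
 * None of these types or helpers exist in mm/percpu.c today; all
 * names below are invented for illustration.
 */
#include <linux/cpuhotplug.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/string.h>

struct pcpu_chunk {			/* hypothetical, simplified */
	struct list_head list;
	void *template_unit;		/* "unit A": pristine init image */
};

extern struct list_head pcpu_chunk_list;	/* all chunks (hypothetical) */
extern size_t pcpu_unit_size;

/* Hypothetical helpers the proposal would need: */
void *pcpu_alloc_unit_pages(struct pcpu_chunk *chunk, unsigned int cpu);
void pcpu_map_unit(struct pcpu_chunk *chunk, unsigned int cpu, void *unit);
void pcpu_free_unit(struct pcpu_chunk *chunk, unsigned int cpu);

static int pcpu_cpu_online(unsigned int cpu)
{
	struct pcpu_chunk *chunk;

	list_for_each_entry(chunk, &pcpu_chunk_list, list) {
		/* Back the new CPU's unit with real pages. */
		void *unit = pcpu_alloc_unit_pages(chunk, cpu);

		/*
		 * Concern 1: fail the hotplug event on OOM. (A real
		 * implementation would also unwind the chunks already
		 * populated for this cpu.)
		 */
		if (!unit)
			return -ENOMEM;

		/*
		 * Concern 2: start the new unit from the same state the
		 * for_each_possible_cpu() init loops would have produced,
		 * by copying the pre-initialized image from unit A.
		 */
		memcpy(unit, chunk->template_unit, pcpu_unit_size);
		pcpu_map_unit(chunk, cpu, unit);
	}
	return 0;
}

static int pcpu_cpu_offline(unsigned int cpu)
{
	struct pcpu_chunk *chunk;

	/* Release the departing CPU's unit; unit A stays intact. */
	list_for_each_entry(chunk, &pcpu_chunk_list, list)
		pcpu_free_unit(chunk, cpu);
	return 0;
}

static int __init pcpu_hotplug_init(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "percpu:online",
				pcpu_cpu_online, pcpu_cpu_offline);
	return ret < 0 ? ret : 0;
}

The error path in pcpu_cpu_online() is what addresses the 1st concern:
onlining the CPU simply fails if its units cannot be backed by pages.
The memcpy() from unit A is what addresses initialization.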
Does this fully cover the 2nd concern?

>> As Roman mentioned, I think it would be much better to not have the
>> large discrepancy between the cpu_online_mask and the
>> cpu_possible_mask.
>
> Indeed, it is quite common on PowerPC to set up a VM with a high
> possible number of CPUs but a reasonable number of online CPUs. This
> allows the user to scale up the VM when needed.
>
> For instance, we may see up to 1024 possible CPUs while the online
> number is *only* 128.

Agree. In VMs, vCPUs are just threads/processes on the host and can
easily be added/removed on demand.

Thanks,
—Alexey