Subject: Re: Percpu allocator: CPU hotplug support
From: Alexey Makhalov
To: "linux-mm@kvack.org", Laurent Dufour
Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Roman Gushchin, Aneesh Kumar K.V, Srikar Dronamraju
Date: Thu, 22 Apr 2021 01:22:15 -0700
Message-ID: <5614778F-AA79-40FD-BB62-A543A9C49CE2@vmware.com>
In-Reply-To: <3320a36c-9270-a7f7-88da-0a9bfa13c774@linux.ibm.com>
References: <8E7F3D98-CB68-4418-8E0E-7287E8273DA9@vmware.com> <3320a36c-9270-a7f7-88da-0a9bfa13c774@linux.ibm.com>
Hello,

> On Apr 22, 2021, at 12:45 AM, Laurent Dufour wrote:
>
> On 22/04/2021 at 03:33, Dennis Zhou wrote:
>> Hello,
>>
>> On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
>>> The current implementation of the percpu allocator uses the total
>>> possible number of CPUs (nr_cpu_ids) to derive the number of units to
>>> allocate per chunk. Every alloc_percpu() request of N bytes allocates
>>> N*nr_cpu_ids bytes, even if the number of present CPUs is much
>>> smaller. The percpu allocator grows by number of chunks, keeping the
>>> number of units per chunk constant. It is done that way to simplify
>>> CPU hotplug/remove, by having the per-cpu area preallocated.
>>>
>>> Problem: this behavior can lead to inefficient memory usage on big
>>> server machines and VMs, where nr_cpu_ids is huge.
>>>
>>> Example from my experiment, a 2 vCPU VM with hotplug support (up to 128):
>>>
>>>   [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
>>>
>>> By creating a huge number of active and/or dying memory cgroups, I can
>>> generate active percpu allocations of 100 MB (per single CPU),
>>> including fragmentation overhead. But in that case the total percpu
>>> memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW,
>>> chunks are ~75% full in my experiment, so fragmentation is not a
>>> concern. Out of 12.8 GB:
>>>  - 0.2 GB is actually used by the present vCPUs, and
>>>  - 12.6 GB is "wasted"!
>>>
>>> I've seen production VMs where percpu consumes 16-20 GB of memory;
>>> Roman reported 100 GB. There are ways to reduce the "wasted" memory
>>> overhead, such as disabling CPU hotplug, reducing the maximum number
>>> of CPUs reported by the hypervisor and/or firmware, or using the
>>> possible_cpus= kernel parameter. But they do not eliminate the
>>> fundamental issue of "wasted" memory.
>>>
>>> Suggestion: support scaling percpu chunks by number of units, i.e.
>>> allocate/deallocate units of existing chunks on CPU hotplug/remove
>>> events.
>>>
>> Idk. In theory it sounds doable. In practice I'm not so sure. The two
>> problems off the top of my head:
>>
>> 1) What happens if we can't allocate new pages when a cpu is onlined?

Simple: onlining the CPU can return an error on allocation failure. Or,
potentially, the units can be instantiated later, once memory becomes
available, if that is the case.

>> 2) It's possible users set particular conditions in percpu variables
>> that are not tied to just statistics summing (such as the cpu
>> runqueues). Users would have to provide online init and exit functions
>> which could get weird.

I do not think online init/exit functions are the right approach. There
are many places in Linux where percpu data gets initialized right after
it is allocated:

	ptr = alloc_percpu(struct foo);
	for_each_possible_cpu(cpu)
		initialize(per_cpu_ptr(ptr, cpu));

Let's keep all such instances untouched. Hopefully initialize() only
touches the contents of the percpu area without allocating
substructures; if it does allocate, it should be redesigned anyway.
BTW, this loop does extra work (runtime overhead) to initialize areas
for possible CPUs that might never arrive.

The proposal (a rough sketch follows the list):
- In case of possible_cpus > online_cpus, add an additional unit (call
  it A) to each chunk, containing the initialized image of the percpu
  data for possible CPUs.
- The for_each_possible_cpu(cpu) loop in the snippet above should go
  through all online CPUs + 1 (for unit A).
- On arrival of a new CPU #N, percpu should allocate the corresponding
  unit N and initialize its contents from the data in unit A. Repeat
  for all chunks.
- On departure of CPU D, release unit D from the chunks, keeping unit A
  intact.
- In case of possible_cpus > online_cpus, the overhead is +1 unit (for
  unit A), while the current overhead is +(possible_cpus - online_cpus)
  units.
- In case of possible_cpus == online_cpus (no CPU hotplug), do not
  allocate unit A and keep the percpu allocator as it is now: no
  overhead.
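To make this concrete, here is a rough sketch of what the hotplug
callbacks could look like, using the cpuhp state machine. To be clear,
this only illustrates the idea: struct pcpu_chunk below is simplified,
and pcpu_chunk_list, template_unit, pcpu_alloc_unit_pages(),
pcpu_map_unit() and pcpu_free_unit() are made-up names, not existing
mm/percpu.c internals.

/*
 * Hypothetical sketch of the proposed hot-add/hot-remove paths.
 * None of these types or helpers exist in mm/percpu.c today; all
 * names below are invented for illustration.
 */
#include <linux/cpuhotplug.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/string.h>

struct pcpu_chunk {			/* hypothetical, simplified */
	struct list_head list;
	void *template_unit;		/* "unit A": pristine init image */
};

extern struct list_head pcpu_chunk_list;	/* all chunks (hypothetical) */
extern size_t pcpu_unit_size;

/* Hypothetical helpers the proposal would need: */
void *pcpu_alloc_unit_pages(struct pcpu_chunk *chunk, unsigned int cpu);
void pcpu_map_unit(struct pcpu_chunk *chunk, unsigned int cpu, void *unit);
void pcpu_free_unit(struct pcpu_chunk *chunk, unsigned int cpu);

static int pcpu_cpu_online(unsigned int cpu)
{
	struct pcpu_chunk *chunk;

	list_for_each_entry(chunk, &pcpu_chunk_list, list) {
		/* Back the new CPU's unit with real pages. */
		void *unit = pcpu_alloc_unit_pages(chunk, cpu);

		/*
		 * Concern 1: fail the hotplug event on OOM. (A real
		 * implementation would also unwind the chunks already
		 * populated for this cpu.)
		 */
		if (!unit)
			return -ENOMEM;

		/*
		 * Concern 2: start the new unit from the same state the
		 * for_each_possible_cpu() init loops would have produced,
		 * by copying the pre-initialized image from unit A.
		 */
		memcpy(unit, chunk->template_unit, pcpu_unit_size);
		pcpu_map_unit(chunk, cpu, unit);
	}
	return 0;
}

static int pcpu_cpu_offline(unsigned int cpu)
{
	struct pcpu_chunk *chunk;

	/* Release the departing CPU's unit; unit A stays intact. */
	list_for_each_entry(chunk, &pcpu_chunk_list, list)
		pcpu_free_unit(chunk, cpu);
	return 0;
}

static int __init pcpu_hotplug_init(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "percpu:online",
				pcpu_cpu_online, pcpu_cpu_offline);
	return ret < 0 ? ret : 0;
}

The error path in pcpu_cpu_online() is what addresses the 1st concern:
onlining the CPU simply fails if its units cannot be backed by pages.
The memcpy() from unit A is what addresses initialization.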
Does this fully cover the 2nd concern?

>> As Roman mentioned, I think it would be much better to not have the
>> large discrepancy between the cpu_online_mask and the
>> cpu_possible_mask.
>
> Indeed, it is quite common on PowerPC to set up a VM with a high
> possible number of CPUs but a reasonable number of online CPUs. This
> allows the user to scale up the VM when needed.
>
> For instance, we may see up to 1024 possible CPUs while the online
> number is *only* 128.

Agree. In VMs, vCPUs are just threads/processes on the host and can
easily be added/removed on demand.

Thanks,
—Alexey