From: Eric Dumazet <edumazet@google.com>
Date: Fri, 24 Jun 2022 08:07:53 +0200
Subject: Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
In-Reply-To: <20220624060053.GD79500@shbuild999.sh.intel.com>
References: <20220619150456.GB34471@xsang-OptiPlex-9020>
 <20220622172857.37db0d29@kernel.org>
 <20220623185730.25b88096@kernel.org>
 <20220624051351.GA72171@shbuild999.sh.intel.com>
 <20220624060053.GD79500@shbuild999.sh.intel.com>
To: Feng Tang
Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot,
 Shakeel Butt, Soheil Hassas Yeganeh, LKML,
 Linux Memory Management List, network dev, linux-s390@vger.kernel.org,
 MPTCP Upstream, linux-sctp@vger.kernel.org, lkp@lists.01.org,
 kbuild test robot, Huang Ying, zhengjun.xing@linux.intel.com,
 fengwei.yin@intel.com, Ying Xu

On Fri, Jun 24, 2022 at 8:01 AM Feng Tang wrote:
>
> On Fri, Jun 24, 2022 at 07:45:00AM +0200, Eric Dumazet wrote:
> > On Fri, Jun 24, 2022 at 7:14 AM Feng Tang wrote:
> > >
> > > Hi Eric,
> > >
> > > On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote:
> > > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski wrote:
> > > > >
> > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > > > > From the perf data, we can see __sk_mem_reduce_allocated() is the
> > > > > > function using the most CPU, more than before, and mem_cgroup APIs
> > > > > > are also called in this function. It means the mem cgroup must be
> > > > > > enabled in the test env, which may explain why I couldn't reproduce it.
> > > > > >
> > > > > > Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > > > > possible") uses sk_mem_reclaim() (checking reclaimable >= PAGE_SIZE)
> > > > > > to reclaim the memory, which calls __sk_mem_reduce_allocated() *more
> > > > > > frequently* than before (checking reclaimable >= SK_RECLAIM_THRESHOLD).
> > > > > > It might be cheap when mem_cgroup_sockets_enabled is false, but I'm
> > > > > > not sure if it's still cheap when mem_cgroup_sockets_enabled is true.
> > > > > >
> > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > > > > for SCTP netperf testing, which is more sensitive to the extra
> > > > > > function calls than TCP.
> > > > > >
> > > > > > Can we re-run this testing without mem cgroup enabled?
> > > > >
> > > > > FWIW I defer to Eric, thanks a lot for double checking the report
> > > > > and digging in!
> > > >
> > > > I did tests with TCP + memcg and noticed a very small additional cost
> > > > in memcg functions, because of suboptimal layout:
> > > >
> > > > Extract of an internal Google bug, update from June 9th:
> > > >
> > > > --------------------------------
> > > > I have noticed a minor false sharing to fetch (struct
> > > > mem_cgroup)->css.parent, at offset 0xc0, because it shares the cache
> > > > line containing struct mem_cgroup.memory, at offset 0xd0.
> > > >
> > > > Ideally, memcg->socket_pressure and memcg->parent should sit in a
> > > > read-mostly cache line.
> > > > --------------------------------
> > > >
> > > > But nothing that could explain a "-69.4% regression".
> > >
> > > We can double check that.
> > >
> > > > memcg has a very similar strategy of per-cpu reserves, with
> > > > MEMCG_CHARGE_BATCH being 32 pages per cpu.
> > >
> > > We have proposed a patch to increase the batch number for stats
> > > updates, which was not accepted as it hurts the accuracy, and the
> > > data is used by many tools.
> > >
> > > > It is not clear why SCTP with 10K writes would overflow this reserve
> > > > constantly.
> > > >
> > > > Presumably memcg experts will have to rework structure alignments to
> > > > make sure they can cope better with more charge/uncharge operations,
> > > > because we are not going back to gigantic per-socket reserves; this
> > > > simply does not scale.
> > >
> > > Yes, the memcg statistics and charge/uncharge updates are very sensitive
> > > to the data alignment layout, and can easily trigger performance
> > > changes, as we've seen in quite a few similar cases in the past several
> > > years.
> > >
> > > One pattern we've seen is that even if a memcg stats-updating or charge
> > > function only takes about 2%~3% of the CPU cycles in perf-profile data,
> > > once it gets affected, the performance change can be amplified to up to
> > > 60% or more.
> > >
> >
> > Reorganizing "struct mem_cgroup" to put "struct page_counter memory"
> > in a separate cache line would be beneficial.
>
> That may help.
>
> And I also want to say the benchmarks (especially micro ones) are very
> sensitive to the layout of struct mem_cgroup. As 'page_counter' is 112
> bytes in size, I recently made a patch to make it cacheline aligned
> (taking 2 cachelines), which improved some hackbench/netperf test
> cases, but caused a huge (49%) drop for some vm-scalability tests.
>
> > Many low hanging fruits, assuming nobody will use __randomize_layout on it ;)
> >
> > Also some fields are written even if their value is not changed.
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index abec50f31fe64100f4be5b029c7161b3a6077a74..53d9c1e581e78303ef73942e2b34338567987b74 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7037,10 +7037,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> >          struct page_counter *fail;
> >
> >          if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> > -                memcg->tcpmem_pressure = 0;
> > +                if (READ_ONCE(memcg->tcpmem_pressure))
> > +                        WRITE_ONCE(memcg->tcpmem_pressure, 0);
> >                  return true;
> >          }
> > -        memcg->tcpmem_pressure = 1;
> > +        if (!READ_ONCE(memcg->tcpmem_pressure))
> > +                WRITE_ONCE(memcg->tcpmem_pressure, 1);
> >          if (gfp_mask & __GFP_NOFAIL) {
> >                  page_counter_charge(&memcg->tcpmem, nr_pages);
> >                  return true;
>
> I will also try this patch, which may take some time.

Note that applications can opt in to reserving memory for one socket,
using SO_RESERVE_MEM.

This can be used for jobs with a controlled number of sockets, as this
will avoid many charge/uncharge operations.
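For anyone wanting to experiment, an untested, minimal sketch of what the
SO_RESERVE_MEM opt-in looks like from user space is below. The 1 MB
reservation size and the fallback #define are illustrative assumptions, not
values from this thread; the option needs a 5.16+ kernel and, as far as I
know, only takes effect when memcg socket accounting is active.

/*
 * Minimal sketch (untested): opt one TCP socket in to SO_RESERVE_MEM so
 * socket memory is charged to the memcg once up front, instead of being
 * charged/uncharged repeatedly on the data path.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_RESERVE_MEM
#define SO_RESERVE_MEM 73        /* asm-generic/socket.h value on most arches */
#endif

int main(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int reserve = 1 << 20;   /* reserve ~1 MB per socket (example figure) */

        if (fd < 0) {
                perror("socket");
                return 1;
        }

        /* The kernel rounds the request up to pages and charges memcg once. */
        if (setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM,
                       &reserve, sizeof(reserve)) < 0)
                perror("setsockopt(SO_RESERVE_MEM)");

        return 0;
}

With the reservation in place, the forward-alloc reclaim introduced by
4890b686f4 should hit the reserved pool instead of going back to the memcg
charge path on every reclaim, which is why it mainly makes sense for jobs
with a small, controlled number of sockets.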