From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 91F9ECCA473
	for <linux-mm@archiver.kernel.org>; Fri, 24 Jun 2022 06:01:18 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 2D48E6B02AE; Fri, 24 Jun 2022 02:01:18 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 284D28E01E7; Fri, 24 Jun 2022 02:01:18 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 14D478E01E3; Fri, 24 Jun 2022 02:01:18 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id 080A46B02AE
	for <linux-mm@kvack.org>; Fri, 24 Jun 2022 02:01:18 -0400 (EDT)
Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay06.hostedemail.com (Postfix) with ESMTP id D9EB53587E
	for <linux-mm@kvack.org>; Fri, 24 Jun 2022 06:01:17 +0000 (UTC)
X-FDA: 79612081794.02.5E2CFAF
Received: from mga05.intel.com (mga05.intel.com [192.55.52.43])
	by imf12.hostedemail.com (Postfix) with ESMTP id DAD59400B3
	for <linux-mm@kvack.org>; Fri, 24 Jun 2022 06:01:04 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1656050464; x=1687586464;
  h=date:from:to:cc:subject:message-id:references:
   mime-version:in-reply-to;
  bh=6o4KFuPBreV56Xtx9KWgdlQDzfvBbkhilorMy6YcUL8=;
  b=NwTFWOV8Zg7RTGvApe/7khHcYsiV7ulviJtxDk8P2j/L/r5P6YhAqDUp
   UiSXcgtzunKgSEvgeUybW5qXOj9wyOFOoktk60szPTEyHiBnLI9vPFU4i
   IlqPJ3sHtN0t9aD1pPyQgYpwV5x8MMe+SHql8IMSt21PVnh2+KxDp/rQp
   o/nd3F6RVaLo0jlLpEZ4ZGDEHfz6ScYMAlL+NX9g/n3lu+8Xv56N9gVHb
   3cNvauanvkoJMEsdwri5dhciZqiubDa68N/Iup8BO0CDxbE2OwEnsBzF8
   Z0QlNw1xF56Kqw2qT6UtUfXUblmStEJtvba/yWtGIKnb+YOTlJNb7xOun
   g==;
X-IronPort-AV: E=McAfee;i="6400,9594,10387"; a="367249582"
X-IronPort-AV: E=Sophos;i="5.92,218,1650956400"; 
   d="scan'208";a="367249582"
Received: from fmsmga007.fm.intel.com ([10.253.24.52])
  by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 23 Jun 2022 23:00:58 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.92,218,1650956400"; 
   d="scan'208";a="593076463"
Received: from shbuild999.sh.intel.com (HELO localhost) ([10.239.146.138])
  by fmsmga007.fm.intel.com with ESMTP; 23 Jun 2022 23:00:54 -0700
Date: Fri, 24 Jun 2022 14:00:53 +0800
From: Feng Tang <feng.tang@intel.com>
To: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>, Xin Long <lucien.xin@gmail.com>,
	Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>,
	kernel test robot <oliver.sang@intel.com>,
	Shakeel Butt <shakeelb@google.com>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	network dev <netdev@vger.kernel.org>, linux-s390@vger.kernel.org,
	MPTCP Upstream <mptcp@lists.linux.dev>,
	"linux-sctp @ vger . kernel . org" <linux-sctp@vger.kernel.org>,
	lkp@lists.01.org, kbuild test robot <lkp@intel.com>,
	Huang Ying <ying.huang@intel.com>, zhengjun.xing@linux.intel.com,
	fengwei.yin@intel.com, Ying Xu <yinxu@redhat.com>
Subject: Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
Message-ID: <20220624060053.GD79500@shbuild999.sh.intel.com>
References: <20220619150456.GB34471@xsang-OptiPlex-9020>
 <20220622172857.37db0d29@kernel.org>
 <CADvbK_csvmkKe46hT9792=+Qcjor2EvkkAnr--CJK3NGX-N9BQ@mail.gmail.com>
 <CADvbK_eQUmb942vC+bG+NRzM1ki1LiCydEDR1AezZ35Jvsdfnw@mail.gmail.com>
 <20220623185730.25b88096@kernel.org>
 <CANn89iLidqjiiV8vxr7KnUg0JvfoS9+TRGg=8ANZ8NBRjeQxsQ@mail.gmail.com>
 <20220624051351.GA72171@shbuild999.sh.intel.com>
 <CANn89iLwwN7hRsJD_skbcRNY9sBtPh1fhULKco5wosx_i4x6gg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CANn89iLwwN7hRsJD_skbcRNY9sBtPh1fhULKco5wosx_i4x6gg@mail.gmail.com>
X-Rspam-User: 
Authentication-Results: imf12.hostedemail.com;
	dkim=temperror ("DNS error when getting key") header.d=intel.com header.s=Intel header.b=NwTFWOV8;
	spf=temperror (imf12.hostedemail.com: error in processing during lookup of feng.tang@intel.com: DNS error) smtp.mailfrom=feng.tang@intel.com;
	dmarc=temperror reason="query timed out" header.from=intel.com (policy=temperror)
X-Rspamd-Server: rspam05
X-Rspamd-Queue-Id: DAD59400B3
X-Stat-Signature: r9wzhzapsq9to4e9p194f3xrs6jyar9s
X-HE-Tag: 1656050464-10462
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Jun 24, 2022 at 07:45:00AM +0200, Eric Dumazet wrote:
> On Fri, Jun 24, 2022 at 7:14 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > Hi Eric,
> >
> > On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote:
> > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > > > > > using the most CPU, more than before, and mem_cgroup APIs are also
> > > > > called in this function. It means the mem cgroup must be enabled in
> > > > > the test env, which may explain why I couldn't reproduce it.
> > > > >
> > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > > > > reclaim the memory, which is *more frequent* to call
> > > > > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > > > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > > > cheap when mem_cgroup_sockets_enabled is true.
> > > > >
> > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > > > for SCTP netperf testing, which is more sensitive to the extra
> > > > > function calls than TCP.
> > > > >
> > > > > Can we re-run this testing without mem cgroup enabled?
> > > >
> > > > FWIW I defer to Eric, thanks a lot for double checking the report
> > > > and digging in!
> > >
> > > I did tests with TCP + memcg and noticed a very small additional cost
> > > in memcg functions,
> > > because of suboptimal layout:
> > >
> > > Extract of an internal Google bug, update from June 9th:
> > >
> > > --------------------------------
> > > I have noticed a minor false sharing to fetch (struct
> > > mem_cgroup)->css.parent, at offset 0xc0,
> > > because it shares the cache line containing struct mem_cgroup.memory,
> > > at offset 0xd0
> > >
> > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> > > mostly cache line.
> > > -----------------------
> > >
> > > But nothing that could explain a "-69.4% regression"
> >
> > We can double check that.
> >
> > > memcg has a very similar strategy of per-cpu reserves, with
> > > MEMCG_CHARGE_BATCH being 32 pages per cpu.
> >
> > We proposed a patch to increase the batch number for stats
> > updates, which was not accepted as it hurts accuracy and
> > the data is used by many tools.
> >
> > > It is not clear why SCTP with 10K writes would overflow this reserve constantly.
> > >
> > > Presumably memcg experts will have to rework structure alignments to
> > > make sure they can cope better
> > > with more charge/uncharge operations, because we are not going back to
> > > gigantic per-socket reserves,
> > > this simply does not scale.
> >
> > Yes, the memcg statistics and charge/uncharge updates are very sensitive
> > to the data alignment layout, and can easily trigger performance
> > changes, as we've seen quite a few similar cases in the past several
> > years.
> >
> > One pattern we've seen is, even if a memcg stats updating or charge
> > function only takes about 2%~3% of the CPU cycles in perf-profile data,
> > once it is affected, the performance change can be amplified to
> > 60% or more.
> >
> 
> Reorganizing "struct mem_cgroup" to put "struct page_counter memory"
> in a separate cache line would be beneficial.
 
That may help.

I also want to note that benchmarks (especially micro-benchmarks) are
very sensitive to the layout of mem_cgroup. As 'page_counter' is 112
bytes in size, I recently made a patch to make it cacheline aligned
(so it takes 2 cachelines), which improved some hackbench/netperf test
cases, but caused a huge (49%) drop for some vm-scalability tests.
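
To make the size arithmetic concrete, here is a minimal userspace
sketch, not kernel code. It assumes 64-byte cache lines, and the names
counter_plain, counter_aligned and cachelines_spanned() are made up for
illustration; only the 112-byte size and the 0xd0 offset come from this
thread.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, as on common x86-64 parts. */
#define CACHELINE_BYTES 64

/* Hypothetical 112-byte stand-in; not the kernel's struct page_counter. */
struct counter_plain {
	unsigned char payload[112];
};

/* Same payload, cache-line aligned: sizeof() is padded up to 128 bytes. */
struct counter_aligned {
	alignas(CACHELINE_BYTES) unsigned char payload[112];
};

/* Number of cache lines touched by `size` bytes starting at byte `offset`. */
static size_t cachelines_spanned(size_t offset, size_t size)
{
	size_t first = offset / CACHELINE_BYTES;
	size_t last = (offset + size - 1) / CACHELINE_BYTES;

	return last - first + 1;
}

static_assert(sizeof(struct counter_plain) == 112, "unaligned size");
static_assert(sizeof(struct counter_aligned) == 128, "two full cache lines");
```

Under these assumptions, cachelines_spanned(0xd0, 112) gives 2, while
an object starting 48 bytes into a line touches 3. Aligning the struct
pads it to 128 bytes, so every field placed after it moves as well.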

> Many low hanging fruits, assuming nobody will use __randomize_layout on it ;)
> 
> Also some fields are written even if their value is not changed.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index abec50f31fe64100f4be5b029c7161b3a6077a74..53d9c1e581e78303ef73942e2b34338567987b74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7037,10 +7037,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
>                 struct page_counter *fail;
> 
>                 if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> -                       memcg->tcpmem_pressure = 0;
> +                       if (READ_ONCE(memcg->tcpmem_pressure))
> +                               WRITE_ONCE(memcg->tcpmem_pressure, 0);
>                         return true;
>                 }
> -               memcg->tcpmem_pressure = 1;
> +               if (!READ_ONCE(memcg->tcpmem_pressure))
> +                       WRITE_ONCE(memcg->tcpmem_pressure, 1);
>                 if (gfp_mask & __GFP_NOFAIL) {
>                         page_counter_charge(&memcg->tcpmem, nr_pages);
>                         return true;

I will also try this patch, which may take some time.
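
For reference, the check-before-write idea in the patch above can be
sketched in userspace C11. This is only an illustration under my own
assumptions: pressure_flag and set_flag_if_changed() are hypothetical
names, and relaxed atomics stand in for READ_ONCE()/WRITE_ONCE(), which
is an approximation rather than an exact equivalent.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag standing in for memcg->tcpmem_pressure. */
static atomic_int pressure_flag;

/*
 * Store `val` only when it differs from the current value.
 * Returns true if a store was issued, false if it was skipped.
 * Skipping the store avoids dirtying the cache line in the
 * common case where the flag already holds the wanted value.
 */
static bool set_flag_if_changed(atomic_int *flag, int val)
{
	if (atomic_load_explicit(flag, memory_order_relaxed) == val)
		return false;

	atomic_store_explicit(flag, val, memory_order_relaxed);
	return true;
}
```

The load and the store are not one atomic step, so two CPUs may both
issue the store. That is harmless here because both would write the
same value.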

Thanks,
Feng