From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 22 May 2024 21:35:57 -0700
Subject: Re: [PATCH] memcg: rearrange fields of mem_cgroup_per_node
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
 Muchun Song, ying.huang@intel.com, feng.tang@intel.com,
 fengwei.yin@intel.com, oliver.sang@intel.com, kernel-team@meta.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <20240523034824.1255719-1-shakeel.butt@linux.dev>
On Wed, May 22, 2024 at 8:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> The kernel test robot reported [1] a performance regression in the
> will-it-scale test suite's page_fault2 test case for commit
> 70a64b7919cb ("memcg: dynamically allocate lruvec_stats"). On
> inspection, it seems the commit unintentionally introduced false
> cacheline sharing.
>
> After the commit, the fields of mem_cgroup_per_node which are read on
> the performance-critical path share a cacheline with the fields which
> are updated often on LRU page allocations or deallocations. This
> causes contention on that cacheline, and workloads which manipulate a
> lot of LRU pages regressed, as the test report shows.
>
> The solution is to rearrange the fields of mem_cgroup_per_node so that
> the false sharing is eliminated: move all the read-only pointers to
> the start of the struct, followed by the memcg-v1-only fields, with
> the frequently updated fields at the end.
>
> Experiment setup: ran fallocate1, fallocate2, page_fault1, page_fault2
> and page_fault3 from the will-it-scale test suite inside a three-level
> memcg with /tmp mounted as tmpfs, on two different machines: one with
> a single NUMA node and one with two NUMA nodes.
>
> $ ./[testcase]_processes -t $NR_CPUS -s 50
>
> Results for the single node, 52 CPU machine:
>
> Testcase       base        with-patch
>
> fallocate1     1031081     1431291  (38.80 %)
> fallocate2     1029993     1421421  (38.00 %)
> page_fault1    2269440     3405788  (50.07 %)
> page_fault2    2375799     3572868  (50.30 %)
> page_fault3    28641143    28673950 ( 0.11 %)
>
> Results for the dual node, 80 CPU machine:
>
> Testcase       base        with-patch
>
> fallocate1     2976288     3641185  (22.33 %)
> fallocate2     2979366     3638181  (22.11 %)
> page_fault1    6221790     7748245  (24.53 %)
> page_fault2    6482854     7847698  (21.05 %)
> page_fault3    28804324    28991870 ( 0.65 %)

Great analysis :)
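For anyone who wants to reproduce the effect in isolation, here is a
minimal userspace sketch of the same pattern (hypothetical demo code,
not the kernel struct; the names pernode, read_mostly and counter are
invented for illustration). One thread hammers a counter while another
reads a neighboring read-mostly pointer; building with -DPADDED=0 vs
-DPADDED=1 (cc -O2 -pthread false_sharing.c) and comparing runtimes
shows the cost of sharing the cacheline:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#ifndef PADDED
#define PADDED 1
#endif

#define ITERS 100000000UL

static struct pernode {
        void *read_mostly;              /* stands in for ->memcg etc. */
#if PADDED
        char pad[64];                   /* keep the writer off the reader's line */
#endif
        atomic_ulong counter;           /* stands in for the hot LRU counters */
} node __attribute__((aligned(64)));

static void *writer(void *arg)
{
        (void)arg;
        for (unsigned long i = 0; i < ITERS; i++)
                atomic_fetch_add_explicit(&node.counter, 1,
                                          memory_order_relaxed);
        return NULL;
}

static void *reader(void *arg)
{
        void *sink;

        (void)arg;
        for (unsigned long i = 0; i < ITERS; i++)
                /* volatile access forces a real load each iteration;
                 * it misses whenever the writer dirties the line */
                sink = *(void * volatile *)&node.read_mostly;
        (void)sink;
        return NULL;
}

int main(void)
{
        pthread_t w, r;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("PADDED=%d: %.2f seconds\n", PADDED,
               (double)(t1.tv_sec - t0.tv_sec) +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
}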
>
> Fixes: 70a64b7919cb ("memcg: dynamically allocate lruvec_stats")
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202405171353.b56b845-oliver.sang@intel.com
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  include/linux/memcontrol.h | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 030d34e9d117..16efd9737be9 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -96,23 +96,25 @@ struct mem_cgroup_reclaim_iter {
>   * per-node information in memory controller.
>   */
>  struct mem_cgroup_per_node {
> -        struct lruvec           lruvec;
> +        /* Keep the read-only fields at the start */
> +        struct mem_cgroup       *memcg;         /* Back pointer, we cannot */
> +                                                /* use container_of        */
>
>          struct lruvec_stats_percpu __percpu     *lruvec_stats_percpu;
>          struct lruvec_stats                     *lruvec_stats;
> -
> -        unsigned long           lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> -
> -        struct mem_cgroup_reclaim_iter  iter;
> -
>          struct shrinker_info __rcu      *shrinker_info;
>
> +        /* memcg-v1 only stuff in middle */
> +
>          struct rb_node          tree_node;      /* RB tree node */
>          unsigned long           usage_in_excess;/* Set to the value by which */
>                                                  /* the soft limit is exceeded*/
>          bool                    on_tree;
> -        struct mem_cgroup       *memcg;         /* Back pointer, we cannot */
> -                                                /* use container_of        */

Do we need CACHELINE_PADDING() here (or maybe make struct lruvec
cache-aligned) to make sure the false cacheline sharing doesn't happen
again with the fields below? Or is the idea that the fields read in hot
paths (memcg, lruvec_stats_percpu, lruvec_stats) are far away at the
top, and the memcg-v1 elements in the middle act as a buffer (see the
padding sketch after the quoted patch)?

IOW, is sharing between the fields below and the memcg-v1 fields okay
because the latter are not read in the hot path? If yes, I believe it's
worth a comment; it could easily be missed if, for example, the
memcg-v1 soft limit is removed later.

> +
> +        /* Fields which get updated often at the end. */
> +        struct lruvec           lruvec;
> +        unsigned long           lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> +        struct mem_cgroup_reclaim_iter  iter;
>  };
>
>  struct mem_cgroup_threshold {
> --
> 2.43.0
>
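To make the alternative concrete, this is roughly what I have in mind;
an untested sketch, not a proposed patch, using the existing
CACHELINE_PADDING() helper from include/linux/cache.h (the _pad1_ name
is made up, mirroring its use in struct zone):

struct mem_cgroup_per_node {
        /* Keep the read-only, hot-path fields at the start */
        struct mem_cgroup       *memcg;
        struct lruvec_stats_percpu __percpu     *lruvec_stats_percpu;
        struct lruvec_stats                     *lruvec_stats;
        struct shrinker_info __rcu      *shrinker_info;

        /* memcg-v1 only stuff in middle */
        struct rb_node          tree_node;
        unsigned long           usage_in_excess;
        bool                    on_tree;

        CACHELINE_PADDING(_pad1_);      /* hard boundary: don't rely on
                                         * the v1 fields as a buffer */

        /* Fields which get updated often at the end. */
        struct lruvec           lruvec;
        unsigned long           lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
        struct mem_cgroup_reclaim_iter  iter;
};

The padding costs up to a cacheline per mem_cgroup_per_node, but it
keeps the layout safe even if the v1 fields shrink or go away later.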