From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 25 Jul 2023 17:36:45 -0700
Subject: Re: [PATCH v2] mm: memcg: use rstat for non-hierarchical stats
To: Johannes Weiner
Cc: linux-kernel@vger.kernel.org, Michal Hocko, cgroups@vger.kernel.org,
 linux-mm@kvack.org, Muchun Song, Shakeel Butt, Roman Gushchin,
 Andrew Morton
In-Reply-To: <20230726002904.655377-2-yosryahmed@google.com>
References: <20230726002904.655377-1-yosryahmed@google.com>
 <20230726002904.655377-2-yosryahmed@google.com>
On Tue, Jul 25, 2023 at 5:29 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Currently, memcg uses rstat to maintain hierarchical stats. Counters are
> maintained for hierarchical stats at each memcg. Rstat tracks which
> cgroups have updates on which cpus to keep those counters fresh on the
> read-side.
>
> For non-hierarchical stats, we do not maintain counters. Instead, the
> percpu counters for a given stat need to be summed to get the
> non-hierarchical stat value.
> The original implementation did the same.
> At some point before rstat, non-hierarchical counters were introduced by
> commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
> memory.stat reporting"). However, those counters were updated on the
> performance critical write-side, which caused regressions, so they were
> later removed by commit 815744d75152 ("mm: memcontrol: don't batch
> updates of local VM stats and events"). See [1] for a more detailed
> history.
>
> Kernel versions in between a983b5ebee57 & 815744d75152 (a year and a
> half) enjoyed cheap reads of non-hierarchical stats, specifically on
> cgroup v1. When moving to more recent kernels, a performance regression
> for reading non-hierarchical stats is observed.
>
> Now that we have rstat, we know exactly which percpu counters have
> updates for each stat. We can maintain non-hierarchical counters again,
> making reads much more efficient, without affecting the performance
> critical write-side. Hence, add non-hierarchical (i.e. local) counters
> for the stats, and extend rstat flushing to keep those up-to-date.
>
> A caveat is that we now need a stats flush before reading
> local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> or memcg_events_local(), where we previously only needed a flush to
> read hierarchical stats. Most contexts reading non-hierarchical stats
> are already doing a flush; add a flush to the only missing context in
> count_shadow_nodes().
>
> With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> machine with 256 cpus on cgroup v1:
>  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
>  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
>  real	 0m0.125s
>  user	 0m0.005s
>  sys	 0m0.120s
>
> After:
>  real	 0m0.032s
>  user	 0m0.005s
>  sys	 0m0.027s
>
> [1]https://lore.kernel.org/lkml/20230725201811.GA1231514@cmpxchg.org/
>
> Signed-off-by: Yosry Ahmed
> Acked-by: Johannes Weiner
>
> ---
>
> v1 -> v2:
> - Rewrite the changelog based on the history context provided by
>   Johannes (Thanks!).
> - Fix a subtle bug where updating a local counter would be missed if it
>   was cancelled out by a pending update from child memcgs.

Johannes, I fixed a subtle bug here and kept your Ack; I wasn't sure what
the Ack retention policy should be here. A quick look at the fix would be
great. Thanks!

>
> ---
>  include/linux/memcontrol.h |  7 ++--
>  mm/memcontrol.c            | 67 +++++++++++++++++++++-----------------
>  mm/workingset.c            |  1 +
>  3 files changed, 43 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5818af8eca5a..a9f2861a57a5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -112,6 +112,9 @@ struct lruvec_stats {
> 	/* Aggregated (CPU and subtree) state */
> 	long state[NR_VM_NODE_STAT_ITEMS];
>
> +	/* Non-hierarchical (CPU aggregated) state */
> +	long state_local[NR_VM_NODE_STAT_ITEMS];
> +
> 	/* Pending child counts during tree propagation */
> 	long state_pending[NR_VM_NODE_STAT_ITEMS];
>  };
> @@ -1020,14 +1023,12 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  {
> 	struct mem_cgroup_per_node *pn;
> 	long x = 0;
> -	int cpu;
>
> 	if (mem_cgroup_disabled())
> 		return node_page_state(lruvec_pgdat(lruvec), idx);
>
> 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> -	for_each_possible_cpu(cpu)
> -		x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu);
> +	x = READ_ONCE(pn->lruvec_stats.state_local[idx]);
>  #ifdef CONFIG_SMP
> 	if (x < 0)
> 		x = 0;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8ca4bdcb03c..50f8035e998a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -742,6 +742,10 @@ struct memcg_vmstats {
> 	long state[MEMCG_NR_STAT];
> 	unsigned long events[NR_MEMCG_EVENTS];
>
> +	/* Non-hierarchical (CPU aggregated) page state & events */
> +	long state_local[MEMCG_NR_STAT];
> +	unsigned long events_local[NR_MEMCG_EVENTS];
> +
> 	/* Pending child counts during tree propagation */
> 	long state_pending[MEMCG_NR_STAT];
> 	unsigned long events_pending[NR_MEMCG_EVENTS];
> @@ -775,11 +779,8 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  /* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
>  {
> -	long x = 0;
> -	int cpu;
> +	long x = READ_ONCE(memcg->vmstats->state_local[idx]);
>
> -	for_each_possible_cpu(cpu)
> -		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
>  #ifdef CONFIG_SMP
> 	if (x < 0)
> 		x = 0;
> @@ -926,16 +927,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>
>  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>  {
> -	long x = 0;
> -	int cpu;
> 	int index = memcg_events_index(event);
>
> 	if (index < 0)
> 		return 0;
>
> -	for_each_possible_cpu(cpu)
> -		x += per_cpu(memcg->vmstats_percpu->events[index], cpu);
> -	return x;
> +	return READ_ONCE(memcg->vmstats->events_local[index]);
>  }
>
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> @@ -5526,7 +5523,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> 	struct memcg_vmstats_percpu *statc;
> -	long delta, v;
> +	long delta, delta_cpu, v;
> 	int i, nid;
>
> 	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> @@ -5542,19 +5539,23 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> 			memcg->vmstats->state_pending[i] = 0;
>
> 		/* Add CPU changes on this level since the last flush */
> +		delta_cpu = 0;
> 		v = READ_ONCE(statc->state[i]);
> 		if (v != statc->state_prev[i]) {
> -			delta += v - statc->state_prev[i];
> +			delta_cpu = v - statc->state_prev[i];
> +			delta += delta_cpu;
> 			statc->state_prev[i] = v;
> 		}
>
> -		if (!delta)
> -			continue;
> -
> 		/* Aggregate counts on this level and propagate upwards */
> -		memcg->vmstats->state[i] += delta;
> -		if (parent)
> -			parent->vmstats->state_pending[i] += delta;
> +		if (delta_cpu)
> +			memcg->vmstats->state_local[i] += delta_cpu;
> +
> +		if (delta) {
> +			memcg->vmstats->state[i] += delta;
> +			if (parent)
> +				parent->vmstats->state_pending[i] += delta;
> +		}
> 	}
>
> 	for (i = 0; i < NR_MEMCG_EVENTS; i++) {
> @@ -5562,18 +5563,22 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> 		if (delta)
> 			memcg->vmstats->events_pending[i] = 0;
>
> +		delta_cpu = 0;
> 		v = READ_ONCE(statc->events[i]);
> 		if (v != statc->events_prev[i]) {
> -			delta += v - statc->events_prev[i];
> +			delta_cpu = v - statc->events_prev[i];
> +			delta += delta_cpu;
> 			statc->events_prev[i] = v;
> 		}
>
> -		if (!delta)
> -			continue;
> +		if (delta_cpu)
> +			memcg->vmstats->events_local[i] += delta_cpu;
>
> -		memcg->vmstats->events[i] += delta;
> -		if (parent)
> -			parent->vmstats->events_pending[i] += delta;
> +		if (delta) {
> +			memcg->vmstats->events[i] += delta;
> +			if (parent)
> +				parent->vmstats->events_pending[i] += delta;
> +		}
> 	}
>
> 	for_each_node_state(nid, N_MEMORY) {
> @@ -5591,18 +5596,22 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> 			if (delta)
> 				pn->lruvec_stats.state_pending[i] = 0;
>
> +			delta_cpu = 0;
> 			v = READ_ONCE(lstatc->state[i]);
> 			if (v != lstatc->state_prev[i]) {
> -				delta += v - lstatc->state_prev[i];
> +				delta_cpu = v - lstatc->state_prev[i];
> +				delta += delta_cpu;
> 				lstatc->state_prev[i] = v;
> 			}
>
> -			if (!delta)
> -				continue;
> +			if (delta_cpu)
> +				pn->lruvec_stats.state_local[i] += delta_cpu;
>
> -			pn->lruvec_stats.state[i] += delta;
> -			if (ppn)
> -				ppn->lruvec_stats.state_pending[i] += delta;
> +			if (delta) {
> +				pn->lruvec_stats.state[i] += delta;
> +				if (ppn)
> +					ppn->lruvec_stats.state_pending[i] += delta;
> +			}
> 		}
> 	}
> }
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 4686ae363000..da58a26d0d4d 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -664,6 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> 	struct lruvec *lruvec;
> 	int i;
>
> +	mem_cgroup_flush_stats();
> 	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> 	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
> 		pages += lruvec_page_state_local(lruvec,
> --
> 2.41.0.487.g6d72f3e995-goog
>