From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D21EC83F33 for ; Mon, 4 Sep 2023 15:15:36 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C98368E0005; Mon, 4 Sep 2023 11:15:35 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C478D8D0001; Mon, 4 Sep 2023 11:15:35 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B0F228E0005; Mon, 4 Sep 2023 11:15:35 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 9DC948D0001 for ; Mon, 4 Sep 2023 11:15:35 -0400 (EDT) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 5D3E8140743 for ; Mon, 4 Sep 2023 15:15:35 +0000 (UTC) X-FDA: 81199264230.23.791E8B0 Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by imf15.hostedemail.com (Postfix) with ESMTP id 70AA5A0035 for ; Mon, 4 Sep 2023 15:15:31 +0000 (UTC) Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=suse.com header.s=susede1 header.b=K5HN6oie; dmarc=pass (policy=quarantine) header.from=suse.com; spf=pass (imf15.hostedemail.com: domain of mhocko@suse.com designates 195.135.220.29 as permitted sender) smtp.mailfrom=mhocko@suse.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1693840531; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=vEHfNTrFDIvveWwLAge8I94yVYAf11G0ez80Gn/sAIo=; b=ANcPCA/MxwV4S/ot21GkoW++xx06AMc4JMC8oKBXqPY6cfZ/yDgXhMgG3lNn7H+/d03afW RyoAoua47ncqNOsfGTj3NRhJXa+6ttNstNjR1Opdey8W7XMDUrkPRoJL/JAYMELHiVi8Cf +PEQv6FcQZH4OJYb5dB/iJDORkTmWZk= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=suse.com header.s=susede1 header.b=K5HN6oie; dmarc=pass (policy=quarantine) header.from=suse.com; spf=pass (imf15.hostedemail.com: domain of mhocko@suse.com designates 195.135.220.29 as permitted sender) smtp.mailfrom=mhocko@suse.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1693840531; a=rsa-sha256; cv=none; b=C4Vg/5H4cCExzmzbC+zdFAaK22Se+nHrj6ZqJNb8DRYBg/VMeqQKmkk7RdTLCus3aLhyI0 S8MybWiefyfIF6oizmVn35oSPuOVE8+s2WgbeMQFwq7qyXoIZU/OY4NsgoB/uaIjhMSRI5 ocUhMPC9EzUIQviX6UDzfuYV1VrQ4zc= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id B47181F74B; Mon, 4 Sep 2023 15:15:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1693840529; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=vEHfNTrFDIvveWwLAge8I94yVYAf11G0ez80Gn/sAIo=; b=K5HN6oieyg4dnpZYlEDXFpTdfBj5Lr1bunWUt8XGoux9xDLmTrSRB5c3k3HLwNfINGCB/Y wwiBm3du/EYziqElZ07Rovo83WNnXaKsQqIXpImNBQxLtxIMRiLVvunE98h2wCUWXFFQT2 WdfW2pstfR7KHV5YcUuY1gONJyPNWtQ= Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 9788313585; Mon, 4 Sep 2023 15:15:29 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id beTVIZH09WSeRwAAMHmgww (envelope-from ); Mon, 04 Sep 2023 15:15:29 +0000 Date: Mon, 4 Sep 2023 17:15:28 +0200 From: Michal Hocko To: Yosry Ahmed Cc: Andrew Morton , Johannes Weiner , Roman Gushchin , Shakeel Butt , Muchun Song , Ivan Babrou , Tejun Heo , Michal =?iso-8859-1?Q?Koutn=FD?= , Waiman Long , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Message-ID: References: <20230831165611.2610118-1-yosryahmed@google.com> <20230831165611.2610118-5-yosryahmed@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20230831165611.2610118-5-yosryahmed@google.com> X-Rspamd-Queue-Id: 70AA5A0035 X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: 4crwfu9b9dag7k5qhjm789sdwexj14e1 X-HE-Tag: 1693840531-409761 X-HE-Meta: U2FsdGVkX18j8arfUAtdONRVGBLOMXKx5tlTTo2xL3vOUJ6g8j2mH7u6j76Pvb1HsiXFgzt8bexKDmPT3Q7j3zF9SqZy0HM3nE9wUJhdGkoU2YqoTWOWCwB1tCDF4D+VrsiMQyoso+CyT+ZhUO05BfN1bVBOVmX+35lzz8OKffNMCjuWu8O86RggjsIRJDmKnAq3rjDAqSb9aawXKMYYOVm6fevIjk/S0Sm7a3UNeLtWIbDDRxjXYDmZO3aeI35qk6LO4tWaIUp/IW/w7wkDBg0GQqBLF3aV8iBKAopz5JNwjEqBDKR68iyOFS/5pUn93/v77xzrbwrW3gwuWfYMlA/Iv9heP6Q19TI++Kgjj69Y1skTevMnEB5Crej2acAuLJuCHdVltaSWxj5gwGMcB6i7dD1sI2ncTf3DPb3cDGwxo+TyAs34Jw2QbDc7edczVAtiTfzRgulB+BLofexaHzkFxKnBbGSFtjJ+8WtpxQ3G/0tqrmTNt6KBINcCtAndM0SJU8Ps/tzzM2hFK0QqdO48LiaIwKkPFLPc2HC9BUtf9UyHH9X71FwkQA09HCiOxYEIq332bH4IbkKTxhC/E5nr7ukr3puaj5kLp7dvSjdcAR4esZlrwGEQKEBK6yBNe/pPgAr/D3YIoK6SpS2rNP5PbcQDYPl0yBZ/e/Jf13kxIyOwBIuy3nfgug9XRJXyK0F8hiWDKERLUA3FrVP52s+bTACSr+MTG1bYrWzQziDIpeb2WgYnsp9KKd28/cdeGPfEtuoxVQsuayEUaYoGMWo4gSYB495cPqDaU7t8pcE3qVBmNvds0Lvw+A8sFZX3aFpKmYLYuRBphePvb8q4hP7hlc7BJOK+K3Agog1LXWdflnHHmU2LafdbXEQW/VTZ5RTe2o0H8e+tkNgsTjKDnizNHYg7vrKPC+8b2lffTZkbll/lL+3DrYmVUJqUlc1Si+Tcx+dkZEX2khLB6z4 cUO2NybD pXZsAPT8+EGbk+KFv/k4BQlbX7U7Wur0nfNcHnO5KLxSYtMRGZXOPlvzIzZgDOCfLDOmCBGMCyHlO0aC8d14GwcPi9Sm3Pj08achZIF0dtYUcYWnr1R9g4qUkKnF5deqkxVdNxpZwNa/7BT5bx853nG/NwhoEbCUBWwcBVKYVlPnTYMKUquQ3svzk5gmZevvhjojucZAXx5UgQHI2P9awtCBNQfJrRksZbEx/GKsG792Zh7BMbEYc+8inf94alknaha5zXr8EC3fW9Vi08qr5/WJ5+ncJ7BfRBFD5+BjAHnPYtL5igr0AtjMIeC4k0OV0iBqn X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu 31-08-23 16:56:11, Yosry Ahmed wrote: > Unified flushing allows for great concurrency for paths that attempt to > flush the stats, at the expense of potential staleness and a single > flusher paying the extra cost of flushing the full tree. > > This tradeoff makes sense for in-kernel flushers that may observe high > concurrency (e.g. reclaim, refault). For userspace readers, stale stats > may be unexpected and problematic, especially when such stats are used > for critical paths such as userspace OOM handling. Additionally, a > userspace reader will occasionally pay the cost of flushing the entire > hierarchy, which also causes problems in some cases [1]. > > Opt userspace reads out of unified flushing. This makes the cost of > reading the stats more predictable (proportional to the size of the > subtree), as well as the freshness of the stats. Userspace readers are > not expected to have similar concurrency to in-kernel flushers, > serializing them among themselves and among in-kernel flushers should be > okay. Nonetheless, for extra safety, introduce a mutex when flushing for > userspace readers to make sure only a single userspace reader can compete > with in-kernel flushers at a time. This takes away userspace ability to > directly influence or hurt in-kernel lock contention. I think it would be helpful to note that the primary reason this is a concern is that the spinlock is dropped during flushing under contention. > An alternative is to remove flushing from the stats reading path > completely, and rely on the periodic flusher. This should be accompanied > by making the periodic flushing period tunable, and providing an > interface for userspace to force a flush, following a similar model to > /proc/vmstat. However, such a change will be hard to reverse if the > implementation needs to be changed because: > - The cost of reading stats will be very cheap and we won't be able to > take that back easily. > - There are user-visible interfaces involved. > > Hence, let's go with the change that's most reversible first and revisit > as needed. > > This was tested on a machine with 256 cpus by running a synthetic test > script [2] that creates 50 top-level cgroups, each with 5 children (250 > leaf cgroups). Each leaf cgroup has 10 processes running that allocate > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel > unified flusher). Concurrently, one thread is spawned per-cgroup to read > the stats every second (including root, top-level, and leaf cgroups -- > so total 251 threads). No significant regressions were observed in the > total run time, which means that userspace readers are not significantly > affecting in-kernel flushers: > > Base (mm-unstable): > > real 0m22.500s > user 0m9.399s > sys 73m41.381s > > real 0m22.749s > user 0m15.648s > sys 73m13.113s > > real 0m22.466s > user 0m10.000s > sys 73m11.933s > > With this patch: > > real 0m23.092s > user 0m10.110s > sys 75m42.774s > > real 0m22.277s > user 0m10.443s > sys 72m7.182s > > real 0m24.127s > user 0m12.617s > sys 78m52.765s > > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/ > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/ > > Signed-off-by: Yosry Ahmed OK, I can live with that but I still believe that locking involved in the user interface only begs for issues later on as there is no control over that lock contention other than the number of processes involved. As it seems that we cannot make a consensus on this concern now and this should be already helping existing workloads then let's just buy some more time ;) Acked-by: Michal Hocko Thanks! > --- > mm/memcontrol.c | 24 ++++++++++++++++++++---- > 1 file changed, 20 insertions(+), 4 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 94d5a6751a9e..46a7abf71c73 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) > static void flush_memcg_stats_dwork(struct work_struct *w); > static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); > static DEFINE_PER_CPU(unsigned int, stats_updates); > +static DEFINE_MUTEX(stats_user_flush_mutex); > static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0); > static atomic_t stats_flush_threshold = ATOMIC_INIT(0); > static u64 flush_next_time; > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg) > cgroup_rstat_flush(memcg->css.cgroup); > } > > +/* > + * mem_cgroup_user_flush_stats - do a stats flush for a user read > + * @memcg: memory cgroup to flush > + * > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate > + * the global rstat spinlock. This protects in-kernel flushers from userspace > + * readers hogging the lock. readers hogging the lock as do_stats_flush drops the spinlock under contention. > + */ > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg) > +{ > + mutex_lock(&stats_user_flush_mutex); > + do_stats_flush(memcg); > + mutex_unlock(&stats_user_flush_mutex); > +} > + > /* > * do_unified_stats_flush - do a unified flush of memory cgroup statistics > * -- Michal Hocko SUSE Labs