Date: Thu, 14 Sep 2023 13:19:52 -0400
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
From: Waiman Long <longman@redhat.com>
To: Yosry Ahmed, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
 kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20230913073846.1528938-4-yosryahmed@google.com>
References: <20230913073846.1528938-1-yosryahmed@google.com>
 <20230913073846.1528938-4-yosryahmed@google.com>
On 9/13/23 03:38, Yosry Ahmed wrote:
> Stats flushing for memcg currently follows the following rules:
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>   concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
>
> The reason this approach is followed is because all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
>
> This approach has the following problems:
> - Occasionally a userspace read of the stats of a non-root cgroup will
>   be too expensive as it has to flush the entire hierarchy [1].
> - Sometimes stats accuracy is compromised if there is an ongoing
>   flush, and we skip and return before the subtree of interest is
>   actually flushed. This is more visible when reading stats from
>   userspace, but can also affect in-kernel flushers.
>
> This patch aims to solve both problems by reworking how flushing
> currently works as follows:
> - Without contention, there is no need to flush the entire tree. In
>   this case, only flush the subtree of interest. This avoids the
>   latency of a full root flush if unnecessary.
> - With contention, fall back to a coalesced (aka unified) flush of the
>   entire hierarchy: a root flush. In this case, instead of returning
>   immediately if a root flush is ongoing, wait for it to finish
>   *without* attempting to acquire the lock or flush. This is done using
>   a completion. Compared to competing directly on the underlying lock,
>   this approach makes concurrent flushing a synchronization point
>   instead of a serialization point. Once a root flush finishes, *all*
>   waiters can wake up and continue at once.
> - Finally, with very high contention, bound the number of waiters to
>   the number of online cpus. This keeps the flush latency bounded at
>   the tail (very high concurrency). We fall back to sacrificing stats
>   freshness only in such cases in favor of performance.
>
> This was tested in two ways on a machine with 384 cpus:
> - A synthetic test with 5000 concurrent workers doing allocations and
>   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>   No significant regressions were noticed in the total runtime.
>   Note that if concurrent flushers compete directly on the spinlock
>   instead of waiting for a completion, this test shows 2x-3x slowdowns.
>   Even though subsequent flushers would have nothing to flush, just the
>   serialization and lock contention is a major problem. Using a
>   completion for synchronization instead seems to overcome this problem.
>
> - A synthetic stress test for concurrently reading memcg stats
>   provided by Wei Xu.
>   With 10k threads reading the stats every 100ms:
>   - 98.8% of reads take <100us
>   - 1.09% of reads take 100us to 1ms.
>   - 0.11% of reads take 1ms to 10ms.
>   - Almost no reads take more than 10ms.
>   With 10k threads reading the stats every 10ms:
>   - 82.3% of reads take <100us.
>   - 4.2% of reads take 100us to 1ms.
>   - 4.7% of reads take 1ms to 10ms.
>   - 8.8% of reads take 10ms to 100ms.
>   - Almost no reads take more than 100ms.
>
> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
>
> [weixugc@google.com: suggested the fallback logic and bounding the
>  number of waiters]
>
> Signed-off-by: Yosry Ahmed
> ---
>  include/linux/memcontrol.h |   4 +-
>  mm/memcontrol.c            | 100 ++++++++++++++++++++++++++++---------
>  mm/vmscan.c                |   2 +-
>  mm/workingset.c            |   8 ++-
>  4 files changed, 85 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 11810a2cfd2d..4453cd3fc4b8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1034,7 +1034,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  	return x;
>  }
>  
> -void mem_cgroup_flush_stats(void);
> +void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
>  void mem_cgroup_flush_stats_ratelimited(void);
>  
>  void __mod_memcg_lruvec_state(struct
> lruvec *lruvec, enum node_stat_item idx,
> @@ -1519,7 +1519,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  	return node_page_state(lruvec_pgdat(lruvec), idx);
>  }
>  
> -static inline void mem_cgroup_flush_stats(void)
> +static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d729870505f1..edff41e4b4e7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -588,7 +588,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
> -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
>  /* stats_updates_order is in multiples of MEMCG_CHARGE_BATCH */
>  static atomic_t stats_updates_order = ATOMIC_INIT(0);
>  static u64 flush_last_time;
> @@ -639,36 +638,87 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  	}
>  }
>  
> -static void do_flush_stats(void)
> +/*
> + * do_flush_stats - flush the statistics of a memory cgroup and its tree
> + * @memcg: the memory cgroup to flush
> + * @wait: wait for an ongoing root flush to complete before returning
> + *
> + * All flushes are serialized by the underlying rstat global lock. If there is
> + * no contention, we try to only flush the subtree of the passed @memcg to
> + * minimize the work. Otherwise, we coalesce multiple flushing requests into a
> + * single flush of the root memcg. When there is an ongoing root flush, we wait
> + * for its completion (unless otherwise requested), to get fresh stats. If the
> + * number of waiters exceeds the number of cpus just skip the flush to bound
> + * the flush latency at the tail with very high concurrency.
> + *
> + * This is a trade-off between stats accuracy and flush latency.
> + */
> +static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
>  {
> +	static DECLARE_COMPLETION(root_flush_done);
> +	static DEFINE_SPINLOCK(root_flusher_lock);
> +	static DEFINE_MUTEX(subtree_flush_mutex);
> +	static atomic_t waiters = ATOMIC_INIT(0);
> +	static bool root_flush_ongoing;
> +	bool root_flusher = false;
> +
> +	/* Ongoing root flush, just wait for it (unless otherwise requested) */
> +	if (READ_ONCE(root_flush_ongoing))
> +		goto root_flush_or_wait;
> +
>  	/*
> -	 * We always flush the entire tree, so concurrent flushers can just
> -	 * skip. This avoids a thundering herd problem on the rstat global lock
> -	 * from memcg flushers (e.g. reclaim, refault, etc).
> +	 * Opportunistically try to only flush the requested subtree. Otherwise
> +	 * fallback to a coalesced flush below.
>  	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> +		cgroup_rstat_flush(memcg->css.cgroup);
> +		mutex_unlock(&subtree_flush_mutex);
>  		return;
> +	}

If mutex_trylock() is the only way to acquire subtree_flush_mutex, you
don't really need a mutex. A simple integer flag with an xchg() call
should be enough.

Cheers,
Longman
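To illustrate the suggestion, here is a minimal userspace sketch using
C11 atomics (the helper names are hypothetical; in the kernel this would
be atomic_xchg()/atomic_set() on a static atomic_t, matching how the old
stats_flush_ongoing flag worked):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical replacement for subtree_flush_mutex: a plain flag
 * acquired by exchange. Zero-initialized, i.e. initially available. */
static atomic_bool subtree_flush_busy;

/* Try to claim the flag; returns true only for the caller that wins.
 * atomic_exchange() returns the previous value, so seeing 'false'
 * means we were the one to flip it to 'true'. */
static bool subtree_flush_trylock(void)
{
	return !atomic_exchange(&subtree_flush_busy, true);
}

/* Release the flag so the next trylock can succeed. */
static void subtree_flush_unlock(void)
{
	atomic_store(&subtree_flush_busy, false);
}
```

Since the mutex is only ever taken with trylock and never slept on, the
exchange-based flag provides the same mutual exclusion without the
ownership tracking and unlock bookkeeping a mutex carries.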