Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
From: Waiman Long <longman@redhat.com>
Date: Thu, 14 Sep 2023 13:36:44 -0400
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
 Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
 kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20230913073846.1528938-1-yosryahmed@google.com>
 <20230913073846.1528938-4-yosryahmed@google.com>

On 9/14/23 13:23, Yosry Ahmed wrote:
> On Thu, Sep 14, 2023 at 10:19 AM Waiman Long wrote:
>>
>> On 9/13/23 03:38, Yosry Ahmed wrote:
>>> Stats flushing for memcg currently follows these rules:
>>> - Always flush the entire memcg hierarchy (i.e. flush the root).
>>> - Only one flusher is allowed at a time. If someone else tries to
>>>   flush concurrently, they skip and return immediately (this gate is
>>>   sketched just below).
>>> - A periodic flusher flushes all the stats every 2 seconds.
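For illustration, the "one flusher, everyone else skips" rule above comes
down to an atomic claim-or-bail gate. A minimal sketch of that pattern
follows; flush_whole_tree() is a hypothetical stand-in, and the real
pre-patch gate (stats_flush_ongoing) appears as removed lines in the diff
further down.

#include <linux/atomic.h>

void flush_whole_tree(void);	/* hypothetical stand-in for the full flush */

static atomic_t flush_ongoing = ATOMIC_INIT(0);

static void flush_stats_or_skip(void)
{
	/*
	 * The plain read avoids a cacheline bounce when a flush is already
	 * running; otherwise the xchg() atomically claims the flag.
	 */
	if (atomic_read(&flush_ongoing) || atomic_xchg(&flush_ongoing, 1))
		return;		/* another flusher is running: skip */

	flush_whole_tree();

	atomic_set(&flush_ongoing, 0);
}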
>>>
>>> This approach is followed because all flushes are serialized by a
>>> global rstat spinlock. On the memcg side, flushing is invoked from
>>> userspace reads as well as in-kernel flushers (e.g. reclaim, refault,
>>> etc.). This approach aims to avoid serializing all flushers on the
>>> global lock, which can cause a significant performance hit under high
>>> concurrency.
>>>
>>> This approach has the following problems:
>>> - Occasionally a userspace read of the stats of a non-root cgroup will
>>>   be too expensive as it has to flush the entire hierarchy [1].
>>> - Sometimes stats accuracy is compromised if there is an ongoing
>>>   flush, and we skip and return before the subtree of interest is
>>>   actually flushed. This is more visible when reading stats from
>>>   userspace, but can also affect in-kernel flushers.
>>>
>>> This patch aims to solve both problems by reworking how flushing
>>> currently works as follows:
>>> - Without contention, there is no need to flush the entire tree. In
>>>   this case, only flush the subtree of interest. This avoids the
>>>   latency of a full root flush if unnecessary.
>>> - With contention, fall back to a coalesced (aka unified) flush of
>>>   the entire hierarchy: a root flush. In this case, instead of
>>>   returning immediately if a root flush is ongoing, wait for it to
>>>   finish *without* attempting to acquire the lock or flush. This is
>>>   done using a completion. Compared to competing directly on the
>>>   underlying lock, this approach makes concurrent flushing a
>>>   synchronization point instead of a serialization point. Once a root
>>>   flush finishes, *all* waiters can wake up and continue at once.
>>> - Finally, with very high contention, bound the number of waiters to
>>>   the number of online cpus. This keeps the flush latency bounded at
>>>   the tail (very high concurrency). We fall back to sacrificing stats
>>>   freshness only in such cases in favor of performance (the scheme is
>>>   sketched after the test results below).
>>>
>>> This was tested in two ways on a machine with 384 cpus:
>>> - A synthetic test with 5000 concurrent workers doing allocations and
>>>   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>>>   No significant regressions were noticed in the total runtime. Note
>>>   that if concurrent flushers compete directly on the spinlock instead
>>>   of waiting for a completion, this test shows 2x-3x slowdowns. Even
>>>   though subsequent flushers would have nothing to flush, just the
>>>   serialization and lock contention is a major problem. Using a
>>>   completion for synchronization instead seems to overcome this
>>>   problem.
>>> - A synthetic stress test for concurrently reading memcg stats
>>>   provided by Wei Xu.
>>>   With 10k threads reading the stats every 100ms:
>>>   - 98.8% of reads take <100us.
>>>   - 1.09% of reads take 100us to 1ms.
>>>   - 0.11% of reads take 1ms to 10ms.
>>>   - Almost no reads take more than 10ms.
>>>   With 10k threads reading the stats every 10ms:
>>>   - 82.3% of reads take <100us.
>>>   - 4.2% of reads take 100us to 1ms.
>>>   - 4.7% of reads take 1ms to 10ms.
>>>   - 8.8% of reads take 10ms to 100ms.
>>>   - Almost no reads take more than 100ms.
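Stripped of the memcg specifics, the reworked scheme above reduces to the
pattern below. This is a sketch only, not the patch itself (the actual
do_flush_stats() appears in the diff that follows); coalesced_flush() and
flush_root_tree() are hypothetical names.

#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
#include <linux/spinlock.h>

void flush_root_tree(void);	/* hypothetical stand-in for the full rstat flush */

static DECLARE_COMPLETION(flush_done);
static DEFINE_SPINLOCK(flusher_lock);
static atomic_t nr_waiters = ATOMIC_INIT(0);
static bool flush_ongoing;

static void coalesced_flush(void)
{
retry:
	if (READ_ONCE(flush_ongoing)) {
		/* Bound tail latency: too many waiters, accept stale stats */
		if (atomic_read(&nr_waiters) >= num_online_cpus())
			return;
		/*
		 * Synchronization point, not serialization point: sleep
		 * until the ongoing flush completes, without touching the
		 * rstat lock, then return with fresh-enough stats.
		 */
		atomic_inc(&nr_waiters);
		wait_for_completion(&flush_done);
		atomic_dec(&nr_waiters);
		return;
	}

	spin_lock(&flusher_lock);
	if (flush_ongoing) {		/* lost the race to be the flusher */
		spin_unlock(&flusher_lock);
		goto retry;
	}
	reinit_completion(&flush_done);
	WRITE_ONCE(flush_ongoing, true);
	spin_unlock(&flusher_lock);

	flush_root_tree();		/* one flush serves every waiter */

	WRITE_ONCE(flush_ongoing, false);
	complete_all(&flush_done);	/* wake *all* waiters at once */
}

The design point is complete_all(): when the root flusher finishes, every
waiter wakes simultaneously, so high concurrency costs one flush plus N
cheap wakeups rather than N serialized flushes on the rstat lock.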
>>>
>>> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
>>> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
>>> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
>>>
>>> [weixugc@google.com: suggested the fallback logic and bounding the
>>> number of waiters]
>>>
>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>> ---
>>>   include/linux/memcontrol.h |   4 +-
>>>   mm/memcontrol.c            | 100 ++++++++++++++++++++++++++++---------
>>>   mm/vmscan.c                |   2 +-
>>>   mm/workingset.c            |   8 ++-
>>>   4 files changed, 85 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 11810a2cfd2d..4453cd3fc4b8 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -1034,7 +1034,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>>>   	return x;
>>>   }
>>>
>>> -void mem_cgroup_flush_stats(void);
>>> +void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
>>>   void mem_cgroup_flush_stats_ratelimited(void);
>>>
>>>   void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>>> @@ -1519,7 +1519,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>>>   	return node_page_state(lruvec_pgdat(lruvec), idx);
>>>   }
>>>
>>> -static inline void mem_cgroup_flush_stats(void)
>>> +static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>>>   {
>>>   }
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index d729870505f1..edff41e4b4e7 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -588,7 +588,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>>>   static void flush_memcg_stats_dwork(struct work_struct *w);
>>>   static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>>>   static DEFINE_PER_CPU(unsigned int, stats_updates);
>>> -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
>>>   /* stats_updates_order is in multiples of MEMCG_CHARGE_BATCH */
>>>   static atomic_t stats_updates_order = ATOMIC_INIT(0);
>>>   static u64 flush_last_time;
>>> @@ -639,36 +638,87 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>>>   	}
>>>   }
>>>
>>> -static void do_flush_stats(void)
>>> +/*
>>> + * do_flush_stats - flush the statistics of a memory cgroup and its tree
>>> + * @memcg: the memory cgroup to flush
>>> + * @wait: wait for an ongoing root flush to complete before returning
>>> + *
>>> + * All flushes are serialized by the underlying rstat global lock. If there is
>>> + * no contention, we try to only flush the subtree of the passed @memcg to
>>> + * minimize the work. Otherwise, we coalesce multiple flushing requests into a
>>> + * single flush of the root memcg. When there is an ongoing root flush, we wait
>>> + * for its completion (unless otherwise requested), to get fresh stats. If the
>>> + * number of waiters exceeds the number of cpus just skip the flush to bound the
>>> + * flush latency at the tail with very high concurrency.
>>> + *
>>> + * This is a trade-off between stats accuracy and flush latency.
>>> + */
>>> +static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
>>>   {
>>> +	static DECLARE_COMPLETION(root_flush_done);
>>> +	static DEFINE_SPINLOCK(root_flusher_lock);
>>> +	static DEFINE_MUTEX(subtree_flush_mutex);
>>> +	static atomic_t waiters = ATOMIC_INIT(0);
>>> +	static bool root_flush_ongoing;
>>> +	bool root_flusher = false;
>>> +
>>> +	/* Ongoing root flush, just wait for it (unless otherwise requested) */
>>> +	if (READ_ONCE(root_flush_ongoing))
>>> +		goto root_flush_or_wait;
>>> +
>>>   	/*
>>> -	 * We always flush the entire tree, so concurrent flushers can just
>>> -	 * skip. This avoids a thundering herd problem on the rstat global lock
>>> -	 * from memcg flushers (e.g. reclaim, refault, etc).
>>> +	 * Opportunistically try to only flush the requested subtree. Otherwise
>>> +	 * fallback to a coalesced flush below.
>>>   	 */
>>> -	if (atomic_read(&stats_flush_ongoing) ||
>>> -	    atomic_xchg(&stats_flush_ongoing, 1))
>>> +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
>>> +		cgroup_rstat_flush(memcg->css.cgroup);
>>> +		mutex_unlock(&subtree_flush_mutex);
>>>   		return;
>>> +	}
>>
>> If mutex_trylock() is the only way to acquire subtree_flush_mutex, you
>> don't really need a mutex. Just a simple integer flag with an xchg()
>> call should be enough. Equivalently, test_and_set_bit() will work too.
>>
>> Cheers,
>> Longman
>
> Thanks for pointing this out. Agreed.
>
> If we keep this approach I will drop that mutex.
>
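For illustration, the flag-based replacement suggested above could look
like the sketch below; try_subtree_flush() is a hypothetical name, and
this is an untested illustration of the suggestion, not code from the
patch.

#include <linux/bitops.h>
#include <linux/memcontrol.h>

static unsigned long subtree_flush_busy;  /* bit 0 set: subtree flush in flight */

/* Hypothetical rework of the subtree fast path in do_flush_stats() */
static bool try_subtree_flush(struct mem_cgroup *memcg)
{
	if (mem_cgroup_is_root(memcg) ||
	    test_and_set_bit(0, &subtree_flush_busy))
		return false;	/* contended: fall back to the coalesced root flush */

	cgroup_rstat_flush(memcg->css.cgroup);
	clear_bit(0, &subtree_flush_busy);
	return true;
}

test_and_set_bit() is fully ordered, so it is a drop-in for the trylock
here; if explicit acquire/release semantics are preferred, the stricter
test_and_set_bit_lock()/clear_bit_unlock() pair can be used instead.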