From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 13 Sep 2023 11:37:58 -0400
From: Johannes Weiner
To: Yosry Ahmed
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
	kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
Message-ID: <20230913153758.GB45543@cmpxchg.org>
References: <20230913073846.1528938-1-yosryahmed@google.com>
	<20230913073846.1528938-4-yosryahmed@google.com>
In-Reply-To: <20230913073846.1528938-4-yosryahmed@google.com>

On Wed, Sep 13, 2023 at 07:38:46AM +0000, Yosry Ahmed wrote:
> Stats flushing for
> memcg currently follows the following rules:
>
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>   concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
>
> The reason this approach is followed is that all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
>
> This approach has the following problems:
>
> - Occasionally a userspace read of the stats of a non-root cgroup will
>   be too expensive as it has to flush the entire hierarchy [1].
> - Sometimes stats accuracy is compromised if there is an ongoing
>   flush, and we skip and return before the subtree of interest is
>   actually flushed. This is more visible when reading stats from
>   userspace, but can also affect in-kernel flushers.
>
> This patch aims to solve both problems by reworking how flushing
> currently works as follows:
>
> - Without contention, there is no need to flush the entire tree. In
>   this case, only flush the subtree of interest. This avoids the
>   latency of a full root flush if unnecessary.
> - With contention, fall back to a coalesced (aka unified) flush of the
>   entire hierarchy, a root flush. In this case, instead of returning
>   immediately if a root flush is ongoing, wait for it to finish
>   *without* attempting to acquire the lock or flush. This is done
>   using a completion. Compared to competing directly on the underlying
>   lock, this approach makes concurrent flushing a synchronization
>   point instead of a serialization point. Once a root flush finishes,
>   *all* waiters can wake up and continue at once.
> - Finally, with very high contention, bound the number of waiters to
>   the number of online cpus. This keeps the flush latency bounded at
>   the tail (very high concurrency). We fall back to sacrificing stats
>   freshness only in such cases in favor of performance.
>
> This was tested in two ways on a machine with 384 cpus:
>
> - A synthetic test with 5000 concurrent workers doing allocations and
>   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>   No significant regressions were noticed in the total runtime.
>   Note that if concurrent flushers compete directly on the spinlock
>   instead of waiting for a completion, this test shows 2x-3x
>   slowdowns. Even though subsequent flushers would have nothing to
>   flush, just the serialization and lock contention is a major
>   problem. Using a completion for synchronization instead seems to
>   overcome this problem.
>
> - A synthetic stress test for concurrently reading memcg stats
>   provided by Wei Xu.
>   With 10k threads reading the stats every 100ms:
>   - 98.8% of reads take <100us.
>   - 1.09% of reads take 100us to 1ms.
>   - 0.11% of reads take 1ms to 10ms.
>   - Almost no reads take more than 10ms.
>   With 10k threads reading the stats every 10ms:
>   - 82.3% of reads take <100us.
>   - 4.2% of reads take 100us to 1ms.
>   - 4.7% of reads take 1ms to 10ms.
>   - 8.8% of reads take 10ms to 100ms.
>   - Almost no reads take more than 100ms.
>
> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
>
> [weixugc@google.com: suggested the fallback logic and bounding the
>  number of waiters]
>
> Signed-off-by: Yosry Ahmed

> 	/*
> +	 * Opportunistically try to only flush the requested subtree. Otherwise
> +	 * fallback to a coalesced flush below.
> 	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> +		cgroup_rstat_flush(memcg->css.cgroup);
> +		mutex_unlock(&subtree_flush_mutex);
>  		return;
> +	}
>
> -	WRITE_ONCE(flush_last_time, jiffies_64);
> -
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +	/* A coalesced root flush is in order. Are we the designated flusher? */
> +	spin_lock(&root_flusher_lock);

I can't say I'm crazy about this.

Let's say you have a fairly sprawling and active cgroup tree with lots
of updates in lots of groups in the system.

Add a periodic memory.stat reader to one of the subgroups, and that
reader will do very fast, localized aggregation.

Now add a periodic memory.stat reader to some other subgroup. They
might still both do very fast, localized aggregation. Or they might
collide; and now, despite having nothing in common and sharing no data
besides the rstat lock, they have to flush the entire massive tree.
The rate at which this happens is purely dependent on the timing of
what should be independent actors. This is really not great for the
p50 vs p99 gap.

I think this whole thing is getting away from us.

When we first switched to rstat, every reader would take the global
rstat lock and then flush its local subtree of interest. This was
changed in the following commit:

commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
Author: Shakeel Butt
Date:   Fri Nov 5 13:37:34 2021 -0700

    memcg: unify memcg stat flushing

    The memcg stats can be flushed in multiple context and potentially in
    parallel too. For example multiple parallel user space readers for
    memcg stats will contend on the rstat locks with each other. There is
    no need for that. We just need one flusher and everyone else can
    benefit.
    In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
    stats") the kernel periodically flush the memcg stats from the root,
    so, the other flushers will potentially have much less work to do.

    Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Michal Koutný"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

The idea was that we can avoid lock contention if somebody is already
doing the flushing. However, you're now bringing global serialization.
Clearly that idea didn't work out. What would be an obstacle to go
back to the original way of doing it?

With one reader, this will work the same as in your proposal. With two
readers, just like in your proposal, flushing must be serialized
against the root level. But at least the two flushes only aggregate
the local data they actually care about - not the entire tree data
that doesn't even have readers! This is much better for lock
contention and performance.

One concern is the thresholding code. The cost of flushing on every
read is too high: even when there is no flush work, checking for it is
kind of expensive due to the locks and the for_each_possible_cpu().

Could we do something like the following?

	mem_cgroup_flush(memcg)
	{
		mutex_lock(&memcg_flush_mutex);
		if (atomic_read(&memcg->stat_delta) > THRESHOLD)
			cgroup_rstat_flush(memcg);
		mutex_unlock(&memcg_flush_mutex);
	}

	mem_cgroup_css_rstat_flush(css, cpu)
	{
		...
		/*
		 * Reset here instead of mem_cgroup_flush()
		 * so that each flushed subgroup is reset
		 * recursively. A recent parent flush will
		 * allow a child flush to skip.
		 */
		atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
	}

	memcg_rstat_updated(memcg, val)
	{
		atomic_add(&memcg->stat_delta, val);
	}