From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 25 Jun 2024 17:32:53 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
To: Yosry Ahmed, Shakeel Butt
Cc: tj@kernel.org, cgroups@vger.kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <171923011608.1500238.3591002573732683639.stgit@firesoul>
Content-Language: en-US
From: Jesper Dangaard Brouer
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On 25/06/2024 11.28, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 5:24 PM Shakeel Butt wrote:
>>
>> On Mon, Jun 24, 2024
at 03:21:22PM GMT, Yosry Ahmed wrote:
>>> On Mon, Jun 24, 2024 at 3:17 PM Shakeel Butt wrote:
>>>>
>>>> On Mon, Jun 24, 2024 at 02:43:02PM GMT, Yosry Ahmed wrote:
>>>> [...]
>>>>>>
>>>>>>> There is also
>>>>>>> a heuristic in zswap that may writeback more (or less) pages that it
>>>>>>> should to the swap device if the stats are significantly stale.
>>>>>>>
>>>>>>
>>>>>> Is this the ratio of MEMCG_ZSWAP_B and MEMCG_ZSWAPPED in
>>>>>> zswap_shrinker_count()? There is already a target memcg flush in that
>>>>>> function and I don't expect root memcg flush from there.
>>>>>
>>>>> I was thinking of the generic approach I suggested, where we can avoid
>>>>> contending on the lock if the cgroup is a descendant of the cgroup
>>>>> being flushed, regardless of whether or not it's the root memcg. I
>>>>> think this would be more beneficial than just focusing on root
>>>>> flushes.
>>>>
>>>> Yes I agree with this but what about skipping the flush in this case?
>>>> Are you ok with that?
>>>
>>> Sorry if I am confused, but IIUC this patch affects all root flushes,
>>> even for userspace reads, right? In this case I think it's not okay to
>>> skip the flush without waiting for the ongoing flush.
>>
>> So, we differentiate between userspace and in-kernel users. For
>> userspace, we should not skip flush and for in-kernel users, we can skip
>> if flushing memcg is the ancestor of the given memcg. Is that what you
>> are saying?
>
> Basically, I prefer that we don't skip flushing at all and keep
> userspace and in-kernel users the same. We can use completions to make
> other overlapping flushers sleep instead of spin on the lock.
>

I think there are good reasons for skipping flushes for userspace when
reading these stats. More below.

I'm looking at kernel code to spot cases where the flush MUST be
completed before returning.
There are clearly cases where we don't need 100% accurate stats, as
evidenced by mem_cgroup_flush_stats_ratelimited() and
mem_cgroup_flush_stats(), which use memcg_vmstats_needs_flush().

The cgroup_rstat_exit() call seems to depend on cgroup_rstat_flush()
being strict/accurate, because it needs to free the percpu resources.

The obj_cgroup_may_zswap() function has a comment that says it needs to
get accurate stats for charging.

These are the two cases I found; do you know of others?

> A proof of concept is basically something like:
>
> void cgroup_rstat_flush(cgroup)
> {
>     if (cgroup_is_descendant(cgroup, READ_ONCE(cgroup_under_flush))) {
>         wait_for_completion_interruptible(&cgroup_under_flush->completion);
>         return;
>     }

This feels like what we would achieve by changing this to a mutex.

>
>     __cgroup_rstat_lock(cgrp, -1);
>     reinit_completion(&cgroup->completion);
>     /* Any overlapping flush requests after this write will not spin
>        on the lock */
>     WRITE_ONCE(cgroup_under_flush, cgroup);
>
>     cgroup_rstat_flush_locked(cgrp);
>     complete_all(&cgroup->completion);
>     __cgroup_rstat_unlock(cgrp, -1);
> }
>
> There may be missing barriers or chances to reduce the window between
> __cgroup_rstat_lock and WRITE_ONCE(), but that's what I have in mind.
> I think it's not too complicated, but we need to check if it fixes the
> problem.
>
> If this is not preferable, then yeah, let's at least keep the
> userspace behavior intact. This makes sure we don't affect userspace
> negatively, and we can change it later as we please.

I don't think userspace reading these stats needs to be 100% accurate.
We are only reading io.stat, memory.stat and cpu.stat every 53
seconds. Reading cpu.stat prints stats divided by NSEC_PER_USEC (1000).

If userspace is reading these very often, then they will be killing the
system, because the flush disables IRQs.

On my prod system the flush of the root cgroup can take 35 ms, which is
not good, but an inaccuracy of that size should not matter for userspace.
Please educate me on why we need accurate userspace stats?

--Jesper
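
For reference, a minimal userspace sketch of the completion-style idea in the quoted proof of concept: a flusher whose target is already covered by an ongoing flush sleeps until that flush finishes instead of contending on the lock. It assumes pthreads (a mutex plus a condition variable) as a stand-in for the kernel completion API. All names here (struct node, is_descendant(), rstat_flush_sketch(), do_flush()) are hypothetical and are not kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup: only the parent link matters here. */
struct node {
	struct node *parent;
};

/* True if n is anc itself or sits somewhere below anc in the tree. */
static bool is_descendant(const struct node *n, const struct node *anc)
{
	for (; n; n = n->parent)
		if (n == anc)
			return true;
	return false;
}

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_done = PTHREAD_COND_INITIALIZER;
static struct node *under_flush;  /* subtree being flushed, or NULL */
static unsigned long flush_gen;   /* bumped each time a flush completes */
static int flushes_run, flushes_joined;

/* Placeholder for the expensive work (cgroup_rstat_flush_locked() in the mail). */
static void do_flush(struct node *n)
{
	(void)n;
}

static void rstat_flush_sketch(struct node *n)
{
	pthread_mutex_lock(&state_lock);
	while (under_flush) {
		/* Decide before sleeping whether the ongoing flush covers us. */
		bool covered = is_descendant(n, under_flush);
		unsigned long gen = flush_gen;

		/* Sleep (no spinning) until the ongoing flush completes. */
		while (flush_gen == gen)
			pthread_cond_wait(&flush_done, &state_lock);

		if (covered) {
			/* The finished flush included our subtree: reuse it. */
			flushes_joined++;
			pthread_mutex_unlock(&state_lock);
			return;
		}
		/* Not covered: re-check, another waiter may have started a flush. */
	}
	under_flush = n;
	pthread_mutex_unlock(&state_lock);

	do_flush(n);

	pthread_mutex_lock(&state_lock);
	under_flush = NULL;
	flush_gen++;
	flushes_run++;
	pthread_cond_broadcast(&flush_done);  /* rough analogue of complete_all() */
	pthread_mutex_unlock(&state_lock);
}
```

A caller that joins an ongoing flush returns without flushing itself, so updates made after the ongoing flush already visited a given CPU can be missed. That is the accuracy trade-off being debated in this thread.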