From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 25 Jun 2024 09:00:03 -0700
To: Jesper Dangaard Brouer
Cc: Shakeel Butt, tj@kernel.org, cgroups@vger.kernel.org, hannes@cmpxchg.org,
 lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <171923011608.1500238.3591002573732683639.stgit@firesoul>
Content-Type: text/plain; charset="UTF-8"
[..]
> >
> > Basically, I prefer that we don't skip flushing at all and keep
> > userspace and in-kernel users the same. We can use completions to make
> > other overlapping flushers sleep instead of spin on the lock.
> >
>
> I think there are good reasons for skipping flushes for userspace when
> reading these stats. More below.
>
> I'm looking at kernel code to spot cases where the flush MUST be
> completed before returning.
> There are clearly cases where we don't need 100% accurate stats,
> evident by mem_cgroup_flush_stats_ratelimited() and
> mem_cgroup_flush_stats(), which use memcg_vmstats_needs_flush().
>
> The cgroup_rstat_exit() call seems to depend on cgroup_rstat_flush()
> being strict/accurate, because it needs to free the percpu resources.

Yeah, I think this one cannot be skipped.

>
> The obj_cgroup_may_zswap() function has a comment saying that it needs
> accurate stats for charging.

This one needs to be somewhat accurate to respect memcg limits. I am
not sure how much inaccuracy we can tolerate.

>
> These were the two cases I found. Do you know of others?

Nothing that screamed at me, but as I mentioned, the non-deterministic
nature of this makes me uncomfortable and feels to me like a potential
way to get subtle bugs.

> >
> > A proof of concept is basically something like:
> >
> > void cgroup_rstat_flush(cgroup)
> > {
> >         if (cgroup_is_descendant(cgroup, READ_ONCE(cgroup_under_flush))) {
> >                 wait_for_completion_interruptible(&cgroup_under_flush->completion);
> >                 return;
> >         }
>
> This feels like what we would achieve by changing this to a mutex.

The main difference is that whoever is holding the lock still cannot
sleep, while waiters can (and, more importantly, they don't disable
interrupts). This is essentially a middle ground between a mutex and a
spinlock. I think this dodges the priority inversion problem Shakeel
described, because a low-priority job holding the lock cannot sleep.

Is there an existing locking primitive that can achieve this?
> >
> >         __cgroup_rstat_lock(cgrp, -1);
> >         reinit_completion(&cgroup->completion);
> >         /* Any overlapping flush requests after this write will not
> >          * spin on the lock */
> >         WRITE_ONCE(cgroup_under_flush, cgroup);
> >
> >         cgroup_rstat_flush_locked(cgrp);
> >         complete_all(&cgroup->completion);
> >         __cgroup_rstat_unlock(cgrp, -1);
> > }
> >
> > There may be missing barriers or chances to reduce the window between
> > __cgroup_rstat_lock() and WRITE_ONCE(), but that's what I have in
> > mind. I think it's not too complicated, but we need to check if it
> > fixes the problem.
> >
> > If this is not preferable, then yeah, let's at least keep the
> > userspace behavior intact. This makes sure we don't affect userspace
> > negatively, and we can change it later as we please.
>
> I don't think userspace reading these stats needs to be 100% accurate.
> We are only reading io.stat, memory.stat and cpu.stat every 53
> seconds. Reading cpu.stat prints stats divided by NSEC_PER_USEC (1000).
>
> If userspace is reading these very often, then it will be killing the
> system, as the flush disables IRQs.
>
> On my prod system the flush of the root cgroup can take 35 ms, which
> is not good, but this inaccuracy should not matter for userspace.
>
> Please educate me on why we need accurate userspace stats?

My point is not about accuracy, although I think that is a reasonable
argument on its own (a lot of things could change in a short amount of
time, which is why I prefer magnitude-based ratelimiting). My point is
about logical ordering. If a userspace program reads the stats *after*
an event occurs, it expects to get a snapshot of the system state after
that event. Two examples are:

- A proactive reclaimer reading the stats after a reclaim attempt to
  check if it needs to reclaim more memory or fall back.
- A userspace OOM killer reading the stats after a usage spike to
  decide which workload to kill.
I listed such examples with more detail in [1], when I removed
stats_flush_ongoing from the memcg code.

[1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@google.com/