From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 25 Jun 2024 09:00:03 -0700
To: Jesper Dangaard Brouer
Cc: Shakeel Butt, tj@kernel.org, cgroups@vger.kernel.org, hannes@cmpxchg.org,
 lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <171923011608.1500238.3591002573732683639.stgit@firesoul>
Content-Type: text/plain; charset="UTF-8"
[..]
> >
> > Basically, I prefer that we don't skip flushing at all and keep
> > userspace and in-kernel users the same. We can use completions to make
> > other overlapping flushers sleep instead of spin on the lock.
> >
>
> I think there are good reasons for skipping flushes for userspace when
> reading these stats. More below.
>
> I'm looking at kernel code to spot cases where the flush MUST be
> completed before returning.
> There are clearly cases where we don't need 100% accurate stats,
> evident by mem_cgroup_flush_stats_ratelimited() and
> mem_cgroup_flush_stats(), which use memcg_vmstats_needs_flush().
>
> The cgroup_rstat_exit() call seems to depend on cgroup_rstat_flush()
> being strict/accurate, because it needs to free the percpu resources.

Yeah, I think this one cannot be skipped.

>
> The obj_cgroup_may_zswap() function has a comment saying that it needs
> accurate stats for charging.

This one needs to be somewhat accurate to respect memcg limits. I am
not sure how much inaccuracy we can tolerate.

>
> These were the two cases I found. Do you know of others?

Nothing that screamed at me, but as I mentioned, the non-deterministic
nature of this makes me uncomfortable and feels to me like a potential
way to get subtle bugs.

> >
> > A proof of concept is basically something like:
> >
> > void cgroup_rstat_flush(cgroup)
> > {
> >         if (cgroup_is_descendant(cgroup, READ_ONCE(cgroup_under_flush))) {
> >                 wait_for_completion_interruptible(&cgroup_under_flush->completion);
> >                 return;
> >         }
>
> This feels like what we would achieve by changing this to a mutex.

The main difference is that whoever is holding the lock still cannot
sleep, while waiters can (and, more importantly, they don't disable
interrupts). This is essentially a middle ground between a mutex and a
spinlock. I think this dodges the priority inversion problem Shakeel
described, because a low-priority job holding the lock cannot sleep.

Is there an existing locking primitive that can achieve this?
> >
> >         __cgroup_rstat_lock(cgrp, -1);
> >         reinit_completion(&cgroup->completion);
> >         /* Any overlapping flush requests after this write will not
> >          * spin on the lock */
> >         WRITE_ONCE(cgroup_under_flush, cgroup);
> >
> >         cgroup_rstat_flush_locked(cgrp);
> >         complete_all(&cgroup->completion);
> >         __cgroup_rstat_unlock(cgrp, -1);
> > }
> >
> > There may be missing barriers or chances to reduce the window between
> > __cgroup_rstat_lock() and WRITE_ONCE(), but that's what I have in
> > mind. I think it's not too complicated, but we need to check if it
> > fixes the problem.
> >
> > If this is not preferable, then yeah, let's at least keep the
> > userspace behavior intact. This makes sure we don't affect userspace
> > negatively, and we can change it later as we please.
>
> I don't think userspace reading these stats needs to be 100% accurate.
> We are only reading io.stat, memory.stat and cpu.stat every 53
> seconds. Reading cpu.stat prints stats divided by NSEC_PER_USEC (1000).
>
> If userspace is reading these very often, then it will be killing the
> system, as the flush disables IRQs.
>
> On my prod system the flush of the root cgroup can take 35 ms, which
> is not good, but this inaccuracy should not matter for userspace.
>
> Please educate me on why we need accurate userspace stats?

My point is not about accuracy, although I think that is a reasonable
argument on its own (a lot of things could change in a short amount of
time, which is why I prefer magnitude-based ratelimiting). My point is
about logical ordering. If a userspace program reads the stats *after*
an event occurs, it expects to get a snapshot of the system state after
that event. Two examples are:

- A proactive reclaimer reading the stats after a reclaim attempt to
  check if it needs to reclaim more memory or fall back.
- A userspace OOM killer reading the stats after a usage spike to
  decide which workload to kill.
I listed such examples with more detail in [1], when I removed
stats_flush_ongoing from the memcg code.

[1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@google.com/