From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 19 Apr 2024 12:21:30 -0700
Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
To: Shakeel Butt
Cc: Jesper Dangaard Brouer, tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, cgroups@vger.kernel.org, longman@redhat.com, netdev@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@cloudflare.com, Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko@kernel.org
[..]
> > > Perhaps we could experiment with always dropping the lock at CPU
> > > boundaries instead?
> > >
> >
> > I don't think this will be enough (always dropping the lock at CPU
> > boundaries). My measured "lock-hold" times show that IRQs (and
> > softirqs) are blocked for too long, when looking at prod with my new
> > cgroup tracepoint script[2]. When contention occurs, I see many
> > Yields happening, with the same magnitude as Contended. But I still
> > see events with long "lock-hold" times, even though yields are high.
> >
> > [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
> >
> > Example output:
> >
> >  12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
> >  12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
> >  12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100 comm:kworker/u261:12
> >
> >  12:46:56 time elapsed: 36 sec (interval = 1 sec)
> >   Flushes(2051) 15/interval (avg 56/sec)
> >   Locks(44464) 1340/interval (avg 1235/sec)
> >   Yields(42413) 1325/interval (avg 1178/sec)
> >   Contended(42112) 1322/interval (avg 1169/sec)
> >
> > There are 15 reported flushes/sec, but locks are yielded quickly.
> >
> > More problematically (for softirq latency), we see a long lock-hold
> > time reaching 18 ms. For network RX softirq I need latency lower than
> > 0.5 ms to avoid RX-ring HW queue overflows.

Here we are measuring yields against contention, but the main problem
is IRQ serving latency, which doesn't have to correlate with
contention, right? Perhaps contention is causing us to yield the lock
every nth CPU boundary, but apparently this is not enough for IRQ
serving latency.

Dropping the lock on each CPU boundary should improve IRQ serving
latency, regardless of the presence of contention (a rough sketch of
what this could look like is included further below). Let's focus on
one problem at a time ;)

> >
> >
> > --Jesper
> > p.s. I'm seeing a pattern with kswapdN contending on this lock.
> >
> > @stack[697, kswapd3]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
> > @stack[698, kswapd4]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
> > @stack[699, kswapd5]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
>
> Can you simply replace mem_cgroup_flush_stats() in
> prepare_scan_control() with the ratelimited version and see if the
> issue still persists for your production traffic?

With thresholding, the fact that we reach cgroup_rstat_flush() means
that there is a high magnitude of pending updates. I think Jesper
mentioned 128 CPUs before; that means 128 * 64 (MEMCG_CHARGE_BATCH)
page-sized updates, which could be over 33 MB with a 4K page size. I am
not sure it is fine to ignore such updates in shrink_node(), especially
since it is sometimes called in a loop, so I imagine we may want to see
what changed after the last iteration.
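
To make the "drop the lock at each CPU boundary" idea a bit more
concrete, here is a rough, untested sketch, modeled loosely on the
current cgroup_rstat_flush_locked() loop. The helper names and locking
details are assumptions for illustration only, not a proposed patch:

/*
 * Sketch only: unconditionally yield cgroup_rstat_lock at every CPU
 * boundary, instead of only when need_resched() or spin_needbreak()
 * fire, so pending IRQ/softirq work never waits behind more than one
 * CPU's worth of flushing.
 */
static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
{
	int cpu;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

		for (; pos; pos = pos->rstat_flush_next) {
			cgroup_base_stat_flush(pos, cpu);
			/* per-subsystem css_rstat_flush() calls omitted */
		}

		/* Always yield here, even without detected contention. */
		spin_unlock_irq(&cgroup_rstat_lock);
		if (!cond_resched())
			cpu_relax();
		spin_lock_irq(&cgroup_rstat_lock);
	}
}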
>
> Also were you able to get which specific stats are getting the most
> updates?

This, on the other hand, would be very interesting. I think it is very
possible that we don't actually have 33 MB of updates, but rather we
keep adding and subtracting from the same stat until we reach the
threshold. This could especially be true for hot stats like slab
allocations.
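
For the magnitude math above: 128 CPUs * 64 (MEMCG_CHARGE_BATCH)
page-sized updates * 4K pages is 33,554,432 bytes, i.e. ~33.5 MB
(32 MiB). And as a minimal, simplified illustration of the
add/subtract churn point (a userspace-style sketch with made-up names,
not the actual memcg accounting code):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define UPDATE_BATCH 64			/* stand-in for MEMCG_CHARGE_BATCH */

struct pcpu_stats {
	unsigned int stats_updates;	/* accumulated |delta| on this CPU */
};

/* The threshold tracks update *magnitude*, so +1/-1 churn on a single
 * hot stat (e.g. slab alloc/free) reaches it even though the net
 * change is zero. */
static bool stat_update_wants_flush(struct pcpu_stats *statc, int delta)
{
	statc->stats_updates += abs(delta);
	if (statc->stats_updates < UPDATE_BATCH)
		return false;
	statc->stats_updates = 0;
	return true;
}

int main(void)
{
	struct pcpu_stats statc = { 0 };
	int i, net = 0, flushes = 0;

	/* 1000 alloc/free pairs: net change stays 0, flushes still trigger. */
	for (i = 0; i < 1000; i++) {
		net += 1;
		flushes += stat_update_wants_flush(&statc, +1);
		net -= 1;
		flushes += stat_update_wants_flush(&statc, -1);
	}
	printf("net change: %d, flushes requested: %d\n", net, flushes);
	return 0;
}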