From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 19 Apr 2024 12:21:30 -0700
Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
To: Shakeel Butt
Cc: Jesper Dangaard Brouer, tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, cgroups@vger.kernel.org, longman@redhat.com, netdev@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@cloudflare.com, Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko@kernel.org
[..]
> > > Perhaps we could experiment with always dropping the lock at CPU
> > > boundaries instead?
> > >
> >
> > I don't think this will be enough (always dropping the lock at CPU
> > boundaries). My measured "lock-hold" times show that IRQs (and
> > softirqs) are blocked for too long, when looking at prod with my new
> > cgroup tracepoint script[2]. When contention occurs, I see many
> > Yields happening, with the same magnitude as Contended. But I still
> > see events with long "lock-hold" times, even though yields are high.
> >
> > [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
> >
> > Example output:
> >
> >  12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
> >  12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
> >  12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100 comm:kworker/u261:12
> >
> >  12:46:56 time elapsed: 36 sec (interval = 1 sec)
> >   Flushes(2051) 15/interval (avg 56/sec)
> >   Locks(44464) 1340/interval (avg 1235/sec)
> >   Yields(42413) 1325/interval (avg 1178/sec)
> >   Contended(42112) 1322/interval (avg 1169/sec)
> >
> > There are 15 reported flushes/sec, but locks are yielded quickly.
> >
> > More problematically (for softirq latency), we see a long lock-hold
> > time reaching 18 ms. For network RX softirq I need latency lower than
> > 0.5 ms to avoid RX-ring HW queue overflows.

Here we are measuring yields against contention, but the main problem
is IRQ serving latency, which doesn't have to correlate with
contention, right? Perhaps contention is causing us to yield the lock
every nth CPU boundary, but apparently this is not enough for IRQ
serving latency.

Dropping the lock on each CPU boundary should improve IRQ serving
latency, regardless of the presence of contention (a rough sketch of
what this could look like is included further below). Let's focus on
one problem at a time ;)

> >
> >
> > --Jesper
> > p.s. I'm seeing a pattern with kswapdN contending on this lock.
> >
> > @stack[697, kswapd3]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
> > @stack[698, kswapd4]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
> > @stack[699, kswapd5]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
>
> Can you simply replace mem_cgroup_flush_stats() in
> prepare_scan_control() with the ratelimited version and see if the
> issue still persists for your production traffic?

With thresholding, the fact that we reach cgroup_rstat_flush() means
that there is a high magnitude of pending updates. I think Jesper
mentioned 128 CPUs before; that means 128 * 64 (MEMCG_CHARGE_BATCH)
page-sized updates, which could be over 33 MB with a 4K page size. I am
not sure it is fine to ignore such updates in shrink_node(), especially
since it is sometimes called in a loop, so I imagine we may want to see
what changed after the last iteration.
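
To make the "drop the lock at each CPU boundary" idea a bit more
concrete, here is a rough, untested sketch, modeled loosely on the
current cgroup_rstat_flush_locked() loop. The helper names and locking
details are assumptions for illustration only, not a proposed patch:

/*
 * Sketch only: unconditionally yield cgroup_rstat_lock at every CPU
 * boundary, instead of only when need_resched() or spin_needbreak()
 * fire, so pending IRQ/softirq work never waits behind more than one
 * CPU's worth of flushing.
 */
static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
{
	int cpu;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

		for (; pos; pos = pos->rstat_flush_next) {
			cgroup_base_stat_flush(pos, cpu);
			/* per-subsystem css_rstat_flush() calls omitted */
		}

		/* Always yield here, even without detected contention. */
		spin_unlock_irq(&cgroup_rstat_lock);
		if (!cond_resched())
			cpu_relax();
		spin_lock_irq(&cgroup_rstat_lock);
	}
}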
>
> Also were you able to get which specific stats are getting the most
> updates?

This, on the other hand, would be very interesting. I think it is very
possible that we don't actually have 33 MB of updates, but rather we
keep adding and subtracting from the same stat until we reach the
threshold. This could especially be true for hot stats like slab
allocations.
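
For the magnitude math above: 128 CPUs * 64 (MEMCG_CHARGE_BATCH)
page-sized updates * 4K pages is 33,554,432 bytes, i.e. ~33.5 MB
(32 MiB). And as a minimal, simplified illustration of the
add/subtract churn point (a userspace-style sketch with made-up names,
not the actual memcg accounting code):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define UPDATE_BATCH 64			/* stand-in for MEMCG_CHARGE_BATCH */

struct pcpu_stats {
	unsigned int stats_updates;	/* accumulated |delta| on this CPU */
};

/* The threshold tracks update *magnitude*, so +1/-1 churn on a single
 * hot stat (e.g. slab alloc/free) reaches it even though the net
 * change is zero. */
static bool stat_update_wants_flush(struct pcpu_stats *statc, int delta)
{
	statc->stats_updates += abs(delta);
	if (statc->stats_updates < UPDATE_BATCH)
		return false;
	statc->stats_updates = 0;
	return true;
}

int main(void)
{
	struct pcpu_stats statc = { 0 };
	int i, net = 0, flushes = 0;

	/* 1000 alloc/free pairs: net change stays 0, flushes still trigger. */
	for (i = 0; i < 1000; i++) {
		net += 1;
		flushes += stat_update_wants_flush(&statc, +1);
		net -= 1;
		flushes += stat_update_wants_flush(&statc, -1);
	}
	printf("net change: %d, flushes requested: %d\n", net, flushes);
	return 0;
}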