From: Yosry Ahmed
Date: Wed, 17 Jul 2024 09:49:54 -0700
Subject: Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
To: Jesper Dangaard Brouer
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <172070450139.2992819.13210624094367257881.stgit@firesoul> <100caebf-c11c-45c9-b864-d8562e2a5ac5@kernel.org>
In-Reply-To: <100caebf-c11c-45c9-b864-d8562e2a5ac5@kernel.org>

On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer wrote:
>
>
> On 17/07/2024 02.35, Yosry Ahmed wrote:
> > [..]
> >>
> >>
> >> This is a clean (meaning no cadvisor interference) example of kswapd
> >> starting simultaneously on many NUMA nodes, that in 27 out of 98 cases
> >> hit the race (which is handled in V6 and V7).
> >>
> >> The BPF "cnt" maps are getting cleared every second, so this
> >> approximates per sec numbers.
> >> This patch reduces pressure on the lock,
> >> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> >> flushes approx 37 per sec (every 27 ms). On the positive side
> >> ongoing_flusher mitigation stopped 98 per sec of these.
> >>
> >> In this clean kswapd case the patch removes the lock contention issue
> >> for kswapd. The 27 lock_contended cases seem to all be related to the
> >> 27 handled_race cases.
> >>
> >> The remaining high flush rate should also be addressed, and we should
> >> also work on approaches to limit this like my earlier proposal[1].
> >
> > I honestly don't think a high number of flushes is a problem on its
> > own as long as we are not spending too much time flushing, especially
> > when we have magnitude-based thresholding so we know there is
> > something to flush (although it may not be relevant to what we are
> > doing).
> >
>
> We are "spending too much time flushing" see below.
>
> > If we keep observing a lot of lock contention, one thing that I
> > thought about is to have a variant of spin_lock with a timeout. This
> > limits the flushing latency, instead of limiting the number of flushes
> > (which I believe is the wrong metric to optimize).
> >
> > It also seems to me that we are doing a flush each 27ms, and your
> > proposed threshold was once per 50ms. It doesn't seem like a
> > fundamental difference.
> >
>
>
> Looking at the production numbers for the time the lock is held for level 0:
>
> @locked_time_level[0]:
> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
>
> The time is in nanosec, so M corresponds to ms (milliseconds).
>
> With 36 flushes per second (as shown earlier) this is a flush every
> 27.7ms. It is not unreasonable (from above data) that a flush also
> takes 27ms, which means that we spend a full CPU second flushing.
> That is spending too much time flushing.
>
> This around 1 sec CPU usage for kswapd is also quite clear in the
> attached grafana graph for when the server was rebooted into this V7
> kernel.
>
> I chose 50ms because at the time I saw flush taking around 30ms, and I
> view the flush time as queue service-time. When arrival-rate is faster
> than service-time, then a queue will form. So, choosing 50ms as
> arrival-rate gave me some headroom. As I mentioned earlier, optimally
> this threshold should be dynamically measured.

Thanks for the data. Yeah this doesn't look good.

Does it make sense to just throttle flushers at some point to increase
the chances of coalescing multiple flushers?

Otherwise I think it makes sense in this case to ratelimit flushing in
general. Although instead of just checking how much time elapsed since
the last flush, can we use something like __ratelimit()?

This will make sure that we skip flushes when we actually have a high
rate of flushing over a period of time, not because two flushes
happened to be requested in close succession and the flushing rate is
generally low.

>
> --Jesper
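To make the difference between the two policies above concrete, here is
a minimal userspace sketch. It is not kernel code and does not use the
real __ratelimit() API; struct flush_ratelimit, flush_ratelimit_allow()
and flush_min_gap_allow() are hypothetical names, and the caller-supplied
millisecond timestamps are an assumption made only for illustration. The
first helper allows up to "burst" flushes per window, the second only
checks the time elapsed since the last flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a windowed limiter; not struct ratelimit_state. */
struct flush_ratelimit {
	uint64_t interval_ms;		/* length of one accounting window */
	unsigned int burst;		/* flushes allowed per window */
	uint64_t window_start_ms;	/* when the current window began */
	unsigned int allowed;		/* flushes allowed in this window */
	unsigned int skipped;		/* flushes skipped in this window */
};

void flush_ratelimit_init(struct flush_ratelimit *rl, uint64_t interval_ms,
			  unsigned int burst, uint64_t now_ms)
{
	assert(rl && interval_ms > 0 && burst > 0);
	rl->interval_ms = interval_ms;
	rl->burst = burst;
	rl->window_start_ms = now_ms;
	rl->allowed = 0;
	rl->skipped = 0;
}

/*
 * Windowed policy: allow up to `burst` flushes per `interval_ms`.
 * Two requests in close succession both pass as long as the overall
 * rate within the window stays under the budget.
 */
bool flush_ratelimit_allow(struct flush_ratelimit *rl, uint64_t now_ms)
{
	/* Start a fresh window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->allowed = 0;
		rl->skipped = 0;
	}

	if (rl->allowed < rl->burst) {
		rl->allowed++;
		return true;
	}

	rl->skipped++;
	return false;
}

/*
 * Elapsed-time policy, for comparison: allow a flush only if at least
 * `min_gap_ms` passed since the last allowed one. The caller sets
 * *last_flush_ms to UINT64_MAX to mean "no flush yet".
 */
bool flush_min_gap_allow(uint64_t *last_flush_ms, uint64_t now_ms,
			 uint64_t min_gap_ms)
{
	if (*last_flush_ms != UINT64_MAX &&
	    now_ms - *last_flush_ms < min_gap_ms)
		return false;

	*last_flush_ms = now_ms;
	return true;
}
```

With a 1000 ms window and a burst of 20 (the same average as one flush
per 50 ms), two requests 5 ms apart are both allowed by the windowed
helper, while the 50 ms minimum-gap helper rejects the second one. Under
a sustained request every 27 ms, the windowed helper caps the flushes at
20 per window and skips the rest.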