From: Yosry Ahmed
Date: Thu, 14 Sep 2023 10:56:52 -0700
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20230913073846.1528938-1-yosryahmed@google.com> <20230913073846.1528938-4-yosryahmed@google.com>

On Thu, Sep 14, 2023 at 10:36 AM Shakeel Butt wrote:
>
> On Wed, Sep 13, 2023 at 12:38 AM Yosry Ahmed wrote:
> >
> > Stats flushing for memcg currently follows the following rules:
> > - Always flush the entire memcg hierarchy (i.e. flush the root).
> > - Only one flusher is allowed at a time. If someone else tries to flush
> >   concurrently, they skip and return immediately.
> > - A periodic flusher flushes all the stats every 2 seconds.
> >
> > The reason this approach is followed is because all flushes are
> > serialized by a global rstat spinlock.
> > On the memcg side, flushing is
> > invoked from userspace reads as well as in-kernel flushers (e.g.
> > reclaim, refault, etc). This approach aims to avoid serializing all
> > flushers on the global lock, which can cause a significant performance
> > hit under high concurrency.
> >
> > This approach has the following problems:
> > - Occasionally a userspace read of the stats of a non-root cgroup will
> >   be too expensive as it has to flush the entire hierarchy [1].
>
> This is a real world workload exhibiting the issue which is good.
>
> > - Sometimes the stats accuracy are compromised if there is an ongoing
> >   flush, and we skip and return before the subtree of interest is
> >   actually flushed. This is more visible when reading stats from
> >   userspace, but can also affect in-kernel flushers.
>
> Please provide similar data/justification for the above. In addition:
>
> 1. How much delayed/stale stats have you observed on real world workload?

I am not really sure. We don't have a wide deployment of kernels with
rstat yet. These are problems observed in testing and/or concerns
expressed by our userspace team.

I am trying to solve this now because any problems that result from
this staleness will be very hard to debug and link back to stale
stats.

>
> 2. What is acceptable staleness in the stats for your use-case?

Again, unfortunately I am not sure, but right now it can be O(seconds),
which is not acceptable as we have workloads querying the stats every
1s (and sometimes more frequently).

>
> 3. What is your use-case?

A few use cases we have that may be affected by this:
- System overhead: calculations using memory.usage and some stats from
  memory.stat. If one of them is fresh and the other one isn't, we have
  an inconsistent view of the system.
- Userspace OOM killing: we use some stats in memory.stat to gauge the
  amount of memory that will be freed by killing a task, as sometimes
  memory.usage includes shared resources that wouldn't be freed anyway.
- Proactive reclaim: we read memory.stat in a proactive reclaim feedback
  loop; stale stats may cause us to mistakenly think reclaim is
  ineffective and stop prematurely.

>
> 4. Does your use-case care about staleness of all the stats in
> memory.stat or some specific stats?

We have multiple use cases that can be affected by this, so I don't
think it is limited to some specific stats. I am also not aware of all
the possibly affected use cases.

>
> 5. If some specific stats in memory.stat, does it make sense to
> decouple them from rstat and just pay the price up front to maintain
> them accurately?
>
> Most importantly please please please be concise in your responses.

I try, but sometimes I am not sure how much detail is needed. Sorry
about that :)

>
> I know I am going back on some of the previous agreements but this
> whole locking back and forth has made in question the original
> motivation.

That's okay. Taking a step back, having flushing be nondeterministic
in this way is a time bomb in my opinion. Note that this also affects
in-kernel flushers like reclaim or dirty isolation, which I suspect
will be more affected by staleness. No one has complained yet AFAICT,
but I think it's a time bomb. The worst part is that if/when a problem
happens, we won't be able to easily tell what went wrong.
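
To make the staleness concern concrete, here is a minimal userspace sketch in Python of the "only one flusher at a time, others skip" rule quoted above. It is a toy model, not kernel code: the class FlushModel, its fields, and demo() are hypothetical names made up for illustration, the global rstat lock is reduced to a boolean, and the 2-second periodic flusher is left out. The "ongoing flush" is simulated by setting the flag by hand, standing in for another flusher that has not yet reached the subtree of interest.

```python
class FlushModel:
    """Toy userspace model of the quoted flushing rules; not kernel code."""

    def __init__(self):
        self.pending = {}           # cgroup name -> updates not yet flushed
        self.visible = {}           # cgroup name -> value readers can see
        self.flush_ongoing = False  # stands in for "another flusher is active"

    def update(self, cgroup, delta):
        # Updates only accumulate until the next flush.
        self.pending[cgroup] = self.pending.get(cgroup, 0) + delta

    def flush_root(self):
        # Rule: only one flusher at a time; a concurrent caller skips.
        if self.flush_ongoing:
            return False
        self.flush_ongoing = True
        try:
            # Rule: always flush the entire hierarchy.
            for cgroup, delta in self.pending.items():
                self.visible[cgroup] = self.visible.get(cgroup, 0) + delta
            self.pending.clear()
        finally:
            self.flush_ongoing = False
        return True

    def read(self, cgroup):
        # A reader requests a flush, then reports whatever is visible,
        # even if the flush was skipped.
        flushed = self.flush_root()
        return self.visible.get(cgroup, 0), flushed


def demo():
    model = FlushModel()
    model.update("workload", 4096)

    model.flush_ongoing = True      # pretend another flusher is mid-flush
    stale = model.read("workload")  # flush is skipped, pending update unseen

    model.flush_ongoing = False     # the other flusher has finished
    fresh = model.read("workload")  # flush runs, update becomes visible
    return stale, fresh
```

In this model the first read returns without the pending update and without any indication to the caller beyond the skipped flush, which is the kind of hard-to-attribute staleness described above.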