From: Yosry Ahmed
Date: Thu, 14 Sep 2023 16:30:56 -0700
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20230913073846.1528938-1-yosryahmed@google.com> <20230913073846.1528938-4-yosryahmed@google.com> <20230914225844.woz7mke6vnmwijh7@google.com>
In-Reply-To: <20230914225844.woz7mke6vnmwijh7@google.com>

On Thu, Sep 14, 2023 at 3:58 PM Shakeel Butt wrote:
>
> On Thu, Sep 14, 2023 at 10:56:52AM -0700, Yosry Ahmed wrote:
> [...]
> > >
> > > 1. How much delayed/stale stats have you observed on real world workload?
> >
> > I am not really sure. We don't have a wide deployment of kernels with
> > rstat yet. These are problems observed in testing and/or concerns
> > expressed by our userspace team.
> >
>
> Why sleep(2) not good enough for the tests?

The problem is not making the tests pass. The tests are just a signal.

>
> > I am trying to solve this now because any problems that result from
> > this staleness will be very hard to debug and link back to stale
> > stats.
> >
>
> I think first you need to show if this (2 sec stale stats) is really a
> problem.

That's the thing, my main concern is that if this causes a problem, we
probably won't be able to tell it was because of stale stats. It's
very hard to make that connection.

Pre-rstat, reading stats would always yield fresh stats (as much as
possible). Now the stats can be up to 2s stale, and we don't really
know how this will affect our existing workloads.

>
> > >
> > > 2. What is acceptable staleness in the stats for your use-case?
> >
> > Again, unfortunately I am not sure, but right now it can be O(seconds)
> > which is not acceptable as we have workloads querying the stats every
> > 1s (and sometimes more frequently).
> >
>
> It is 2 seconds in most cases and if it is higher, the system is already
> in bad shape. O(seconds) seems more dramatic. So, why 2 seconds
> staleness is not acceptable? Is 1 second acceptable? or 500 msec? Let's
> look at the use-cases below.
>
> > >
> > > 3. What is your use-case?
> >
> > A few use cases we have that may be affected by this:
> > - System overhead: calculations using memory.usage and some stats from
> > memory.stat. If one of them is fresh and the other one isn't we have
> > an inconsistent view of the system.
> > - Userspace OOM killing: We use some stats in memory.stat to gauge the
> > amount of memory that will be freed by killing a task as sometimes
> > memory.usage includes shared resources that wouldn't be freed anyway.
> > - Proactive reclaim: we read memory.stat in a proactive reclaim
> > feedback loop, stale stats may cause us to mistakenly think reclaim is
> > ineffective and prematurely stop.
> >
>
> I don't see why userspace OOM killing and proactive reclaim need
> subsecond accuracy. Please explain.

For proactive reclaim it is not about sub-second accuracy. It is about
doing the reclaim then reading the stats immediately to see the
effect. Naturally one would expect that a stat read after reclaim
would show the system state after reclaim.

For userspace OOM killing I am not really sure. It depends on how
dynamic the workload is. If a task recently had a spike in memory
usage causing a threshold to be hit, userspace can kill a different
task if the stats are stale.

I think the whole point is *not* about the amount of staleness. It is
more that you expect a stats read after an event to reflect the
system state after the event, whether this event is proactive reclaim
or a spike in memory usage or something else.

As Tejun mentioned previously [1]: "The only guarantee you need is
that there has been at least one flush since the read attempt
started".

[1] https://lore.kernel.org/lkml/ZP92xP5rdKdeps7Z@mtj.duckdns.org/

> Same for system overhead but I can
> see the complication of two different sources for stats. Can you provide
> the formula of system overhead? I am wondering why do you need to read
> stats from memory.stat files. Why not the memory.current of top level
> cgroups and /proc/meminfo be enough. Something like:
>
> Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)

We use the amount of compressed memory in zswap from memory.stat,
which is not accounted as memory usage in cgroup v1.

>
> > >
> > > I know I am going back on some of the previous agreements but this
> > > whole locking back and forth has made in question the original
> > > motivation.
> >
> > That's okay. Taking a step back, having flushing being indeterministic
>
> I would say atmost 2 second stale instead of indeterministic.

Ack.

>
> > in this way is a time bomb in my opinion. Note that this also affects
> > in-kernel flushers like reclaim or dirty isolation
>
> Fix the in-kernel flushers separately.

The in-kernel flushers are basically facing the same problem.
For instance, reclaim would expect a stats read after a reclaim
iteration to reflect the system state after the reclaim iteration.

> Also the problem Cloudflare is facing does not need to be tied with this.

When we try to wait for flushing to complete we run into the same
latency problem of the root flush.
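
To make the guarantee quoted from Tejun above concrete ("at least one
flush since the read attempt started"), here is a minimal userspace
sketch in Python. It is only a model of the semantics, not kernel code
and not what the patch does. The names FlushGate, flush_for_read and
do_flush are hypothetical and exist only for this illustration. The
idea is that a reader notes how many flushes have started when it
arrives, and is satisfied only by a flush that starts after that point.

```python
import threading


class FlushGate:
    """Hypothetical model of 'one flush since the read attempt started'."""

    def __init__(self, do_flush):
        self._do_flush = do_flush      # callable that performs the flush
        self._cond = threading.Condition()
        self._started = 0              # number of flushes started so far
        self._completed = 0            # highest flush number that finished
        self._in_progress = False

    def flush_for_read(self):
        with self._cond:
            # Only a flush that starts after this point is guaranteed to
            # cover updates made before the read attempt began.
            target = self._started + 1
            while True:
                if self._completed >= target:
                    return             # someone else's flush covered us
                if not self._in_progress:
                    break              # nobody is flushing: do it ourselves
                self._cond.wait()
            self._in_progress = True
            self._started += 1
            mine = self._started

        ok = False
        try:
            self._do_flush()           # run the flush outside the lock
            ok = True
        finally:
            with self._cond:
                if ok:
                    self._completed = mine
                self._in_progress = False
                self._cond.notify_all()
```

In this model a reader that arrives while a flush is already running
does not accept that flush, because it may have started before the
event the reader cares about. It waits, then either starts the next
flush itself or returns as soon as a later flush has completed. The
sketch says nothing about how long the wait takes, which is the
latency problem of the root flush mentioned above.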