From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CD31DC001B0 for ; Fri, 11 Aug 2023 20:40:05 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 45F976B0074; Fri, 11 Aug 2023 16:40:05 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 40FBC6B0078; Fri, 11 Aug 2023 16:40:05 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2D8966B007B; Fri, 11 Aug 2023 16:40:05 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 1F3716B0074 for ; Fri, 11 Aug 2023 16:40:05 -0400 (EDT) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 6239AA1220 for ; Fri, 11 Aug 2023 20:39:57 +0000 (UTC) X-FDA: 81112990728.12.EF222DE Received: from mail-ej1-f46.google.com (mail-ej1-f46.google.com [209.85.218.46]) by imf21.hostedemail.com (Postfix) with ESMTP id 1FAAC1C001D for ; Fri, 11 Aug 2023 20:40:02 +0000 (UTC) Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=d1WOKri1; spf=pass (imf21.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.46 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1691786403; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=cclQRiJrdAJJ3U9Z+VU5L9YVSiq+e/M7Uw6SXPSKxE4=; b=I2oDQRpMM+5G/6Ngh5tTcOq9c6RZOO/aSJXM/dui7s/ySK2sEeynednmuaB9XRux+i/hZI aTdqUVE/Jms30IQpq8ajSH8MfsatYhCgAOhjgVwMCKIVhYmjR9EmpwSrM+XZbZzKpd+kKq Tyj05sU38R5KGxe6o71y1rrQVYiX2i4= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1691786403; a=rsa-sha256; cv=none; b=O3EAWfjhg/YjWp+SNJP+3jSWFZ9XYvCHDviFjSPY5mr/hjyHZqZydF/blAmy0VCx7LLYZP br3WpWknF6ycQX+PnzukyHc/0PI+N2z1xwKzP5tUAYHI9Q+qzUl5uz9RXbIvA/mdX7xCgB YGOQpB+i1HgetTnATJ7q0cjB/dZPkvg= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=d1WOKri1; spf=pass (imf21.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.46 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-ej1-f46.google.com with SMTP id a640c23a62f3a-991c786369cso329736966b.1 for ; Fri, 11 Aug 2023 13:40:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691786401; x=1692391201; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=cclQRiJrdAJJ3U9Z+VU5L9YVSiq+e/M7Uw6SXPSKxE4=; b=d1WOKri1V0cTECraZJebIF7uPjJ/cWptkll4toOVmEou35hcRaCIpI4nsi1FBz0bDt H1OpR6GZp3dHH0M8NOiecO4Mb+PrP1GxQgY85QJrdznaUGNBzKM6xEMAaBzJPwa2Hn1+ kct6YTFt2DFXdM2WFTU6AkAI5vlvIgW2ee/inEomprJSOrSMNwrqnRw2kETkarbf7ymA DFGoP+5Czy9/SpVvrnSY7zSVKVOBSNy7F8kMRpp1jXroskhpWwValEs9ZSldfCrwKo0/ sE0sWwfPNybS1RJOeEdW8MSxg/Obxb7cWaUPVxNk786GCfjgRPMzTSVPm/PTgKYrVKgI sKXQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691786401; x=1692391201; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=cclQRiJrdAJJ3U9Z+VU5L9YVSiq+e/M7Uw6SXPSKxE4=; b=JvxbmpoAEHQ9WKBH034c44oWY9MFN87ujm7RYFGoFlEzDTiAwQ9cfCXNxbOBNfkMEu jty2TKG+2qR5x+FYTJl5ySefGb4UMp6imrdL93bJjl7GBi8/QQiyNN/ZysENb/q5S1en IVbZVJ6lKdSAoIJpy0aA4+h4mfchQX9CvCG1+XFkYXygmwdFrY5hSDMP9OZKOir4m1R9 //rCYlSUD1Y8WZcxn1pW/f/B2dLgwmXI0GtlXlJX8dIntmvBfOTcFOFUYMzdmyEVYwlT AL4rQdMrcHr9ewYsAZjD4wNUjXVmZc2Vk/YDsYoOsc0I70Is9TV23umAvkU9Nobn8yJW u4ig== X-Gm-Message-State: AOJu0YzF8gtjaq/qez/GGXDQPoSP4czda7UgM6eAGriHAW0NkaogsR3P yRSnBleSz0jiBr6n+e7EWDr26ZbdyOa31qnIpYYARg== X-Google-Smtp-Source: AGHT+IE84C1SDaB8TA8b0a5JlyUXp86rQFnINyrSLaRDISNceXMXqq1jzE+JwlE+gZ7IzdEnsTnB019B298P2/WsqX0= X-Received: by 2002:a17:906:74de:b0:99c:9e98:70bc with SMTP id z30-20020a17090674de00b0099c9e9870bcmr2650126ejl.29.1691786401309; Fri, 11 Aug 2023 13:40:01 -0700 (PDT) MIME-Version: 1.0 References: <20230809045810.1659356-1-yosryahmed@google.com> In-Reply-To: From: Yosry Ahmed Date: Fri, 11 Aug 2023 13:39:24 -0700 Message-ID: Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads To: Michal Hocko Cc: Johannes Weiner , Roman Gushchin , Shakeel Butt , Andrew Morton , Muchun Song , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" X-Stat-Signature: n1cb7kqti1frf1t3przo71urgkxz6knx X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 1FAAC1C001D X-Rspam-User: X-HE-Tag: 1691786402-352634 X-HE-Meta: U2FsdGVkX1+glRXPs2g561QzLOTqSVNs73SJjOwfLfS9qdHLkfboMkhy/hIzddSYOPX2Hsm/orJtkRd4JNnBuHLWE2sclZswXl11ebmTyrFg89RMdh497qvg95m08kRE0I9DVKTPE+HRcFSLBFwhIWDoyAZq6g/3+VVddnrE+OTvTfOepaDKt2Zed/RTvpm6Wd1wbPsC4Akm07QjuNDpTMFhJXQnUAIHwJ0J6V6wVaB3qzmljp6n38AQeT8+lSKGUqINzftMDcWXZ3kTe+QKTD1Rdg68/6DwcP1bel/3EvNQawrVTPs756ZU/renC2thxOE20IO0Tu9WY7BH+tOkJprvYiDpDSNA13cKa9tKYo51sg+ig4QvwO1wGRlPjJVrwkYcuurrhtLYdDeLVPsarciFHd4N0DHLKYy/m5o2ZMqZbPUEUb/Ci7CedDLw+xN1rRtuzyODviu3tBkWQS4rnkyYSONC7T+pmPeWongfrR4Ka1kujmCSYE68/omEPBnIxbII6PbvU9nNAnHJQffXmTLIUp5WJV0Jp4YENW5mxqxMRMROOnw/BPCEfwqte/tpyM8wXC+RMkgBGP7BFfOwSM9I5iF1jJI1R7qjgoOpdv68iOV6I6yLMXhsMaRVfjmTJxgDMIk9Wdm3FyhToc5xyTnOZvb7J0bY3zkP3urzGyG0b5+SJOXOQGe+CN3CbykMnVteQj2q5wvPSnuSl6CubRYEez7wkLdDgmNXi1CxZZdhzpywC0Z53Uw6KvgldUQ/+0h8/CkiQr0eIvEK7YevHn1f0IEgdrjuRIIs6pO+qh6hAYQDYaiUwEEgxhU2JQ4pPPVyGR4w6g0I96tObUZW+0q7C+lvM4wV0Uwo/Eb2W1Vn6HZzqpqlSifUyF23nhugX7rzyHjrGIASQokV4KuWUgd3vYciZUzXgryLAa1ziBtc+AIIH/5Po1eRHeKRKytMuQjUi5iLcJo1RW+Oxad Tg5act24 fU5qhMG9qegeH7SqWtnWrK2iC9E0tNZlezoGJTxJghvdgLj+zBeeU/gSa+lzpawzxl11wi8c6qf70huh8UG5FWSSS8X8UgLdu0DQtT0TDdS4tuLs4ogd6nBPuYpDjxFeecKHPm3gzyYA5hyoc4PjD/n1xJdPSYkVrujpHaSVlOgbj/SyVQR2hKsP9xQweF+uB+rX6g9Fz+ayyw4+vuIEIMc+TEBEYOf+cKI9u X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: > > > > > I hate to repeat myself but please be more specific. This all sounds > > > > > just too wavy to me. > > > > > > > > Sorry I didn't have the full story in mind, I had to do my homework. > > > > One example is userspace OOM killing. Our userspace OOM killer makes > > > > decisions based on some stats from memory.stat, and stale stats (a few > > > > seconds in this case) can result in an unrightful OOM kill, which can > > > > easily cascade. > > > > > > OK, but how is this any different from having outdated data because you > > > have to wait for memory.stat to read (being blocked inside the rstat > > > code)? Either your oom killer is reading the stats directly and then you > > > depend on that flushing which is something that could be really harmful > > > itself or you rely on another thread doing the blocking and you do not > > > have up-to-date numbers anyway. So how does blocking actually help? > > > > I am not sure I understand. > > > > The problem is that when you skip when someone else is flushing, there > > is a chance that the stats we care about haven't been flushed since > > the last time the periodic flusher ran. Which is supposed to be ~2 > > seconds ago, but maybe more depending on how busy the workqueue is. > > Yes, this is clear. You simply get _some_ snapshot of the past. > > > When you block until the flusher finishes, the stats are being > > refreshed as you wait. So the stats are not getting more outdated as > > you wait in the general case (unless your cgroup was flushed first and > > you're waiting for others to be flushed). > > [Let's call this approach A] > > Yes, but the amount of waiting is also undeterministic and even after > you waited your stats might be outdated already depending on how quickly > somebody allocates. That was my point. Right, we are just trying to minimize the staleness window. > > > Furthermore, with the implementation you suggested using a mutex, we > > will wait until the ongoing flush is completed, then we will grab the > > mutex and do a flush ourselves. > > Flushing would be mostly unnecessary as somebody has just flushed > everything. The only point of mutex is to remove the super ugly busy > wait for sleepable context construct. Right, but it also has the (arguably) nice double flush effect, also minimizes the staleness window. > > [...] > > When running a test that is proactively reclaiming some memory and > > expecting to see the memory swapped, without this patch, we see > > significant inaccuracy. In some failure instances we expect ~2000 > > pages to be swapped but we only find ~1200. > > That difference is 3MB of memory. What is the precision you are > operating on? I am not concerned with MBs, I am concerned with ratio. On a large system with hundreds of cpus there are larger chances of missing updates on a bunch of cpus, which might be a lot. > > > This is observed on > > machines with hundreds of cpus, where the problem is most noticeable. > > This is a huge difference. Keep in mind that the inaccuracy would > > probably be even worse in a production environment if the system is > > under enough pressure (e.g. the periodic flusher is late). > > > > For both approach A (wait until flusher finishes and exit, i.e this > > patch) and approach B (wait until flusher finishes then flush, i.e the > > mutex approach), I stop seeing this failure in the proactive reclaim > > test and the stats are accurate. > > > > I have v2 ready that implements approach B with the mutex ready to > > fire, just say the word :) > > > > > > > > In any case I do get the argument about consistency within a subtree > > > (children data largely not matching parents'). Examples like that would > > > be really helpful as well. If that is indeed the case then I would > > > consider it much more serious than accuracy which is always problematic > > > (100ms of an actively allocating context can ruin your just read numbers > > > and there is no way around that wihtout stopping the world). > > > > 100% agreed. It's more difficult to get testing results for this, but > > that can easily be the case when we have no idea how much is flushed > > when we return from mem_cgroup_flush_stats(). > > > > > > > > Last note, for /proc/vmstat we have /proc/sys/vm/stat_refresh to trigger > > > an explicit refresh. For those users who really need more accurate > > > numbers we might consider interface like that. Or allow to write to stat > > > file and do that in the write handler. > > > > This wouldn't be my first option, but if that's the only way to get > > accurate stats I'll take it. > > To be honest, this would be my preferable option because of 2 reasons. > a) we do not want to guarantee to much on the precision front because > that would just makes maintainability much more harder with different > people having a different opinion of how much precision is enough and b) > it makes the more rare (need precise) case the special case rather than > the default. How about we go with the proposed approach in this patch (or the mutex approach as it's much cleaner), and if someone complains about slow reads we revert the change and introduce the refresh API? We might just get away with making all reads accurate and avoid the hassle of updating some userspace readers to do write-then-read. We don't know for sure that something will regress. What do you think?