From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 18 Aug 2023 14:40:28 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Tejun Heo
Cc: Shakeel Butt, Michal Hocko, Johannes Weiner, Roman Gushchin, Andrew Morton, Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ivan Babrou
On Wed, Aug 16, 2023 at 3:35 PM Yosry Ahmed wrote:
>
> On Wed, Aug 16, 2023 at 12:08 PM Tejun Heo wrote:
> >
> > Hello,
> >
> > On Wed, Aug 16, 2023 at 10:11:20AM -0700, Shakeel Butt wrote:
> > > These options are not white and black and there can be something in
> > > between but let me be very clear on what I don't want and would NACK.
> >
> > I'm not a big fan of interfaces with hidden states. What you're proposing
> > isn't strictly that, but it's still a bit nasty.
> > So, if we can get by without doing that, that'd be great.
>
> Agreed. I will try to send patches soon implementing option (2) above,
> basically removing unified flushing. I will try to document any
> potential regressions that may happen and how we may fix them. Ideally
> we see no regressions.
>
> > > I don't want a global sleepable lock which can be taken by potentially
> > > any application running on the system. We have seen similar global
> > > locks causing isolation and priority inversion issues in production.
> > > So, not another lock which needs to be taken under extreme conditions
> > > (reading stats under OOM) by a high-priority task (node controller)
> > > and might be held by a low-priority task.
> >
> > Yeah, this is a real concern. Those priority inversions do occur and can be
> > serious, but causing serious problems under memory pressure usually requires
> > involving memory allocations and IOs. Here, it's all just CPU. So, at least
> > in OOM conditions, this shouldn't be in the way (the system wouldn't have
> > anything else to do anyway).
> >
> > It is true that this can still lead to priority inversions through CPU
> > competition, though. However, that problem isn't necessarily solved by what
> > you're suggesting either, unless you want to restrict explicit flushing
> > based on permissions, which is another can of worms.
>
> Right. Also, in the case of a mutex, if we disable preemption while
> holding the mutex, this makes sure that whoever holds the mutex does
> not starve waiters. Essentially, the difference would be that waiters
> will sleep on the mutex instead of spinning, but the mutex holder
> itself wouldn't sleep.
>
> I will make this a separate patch, just in case it's too
> controversial. Switching the spinlock to a mutex should not block
> removing unified flushing.
>
> > My preference is not exposing this in the user interface.
> > This is mostly arising
> > from internal implementation details and isn't what users necessarily care
> > about. There are many things we can do on the kernel side to make trade-offs
> > among overhead, staleness, and priority inversions. If we make this an
> > explicit userland interface behavior, we get locked into those semantics,
> > which we'll likely regret in some future.
>
> Yeah, that's what I am trying to do here as well. I will try to follow
> up on this discussion with patches soon.
>
> Thanks everyone!

So, status update. I tried implementing the removal of unified flushing.
I can send the patches if it helps visualize things, but essentially
mem_cgroup_flush_stats{_ratelimited} takes a memcg argument that it
passes to cgroup_flush_stats(). There is no skipping when someone else
is already flushing (stats_flush_ongoing) and no threshold at which we
start flushing (stats_flush_threshold).

I tested this with a synthetic reclaim stress test that I wrote, because
reclaim is the easiest way to stress in-kernel flushing. I basically
create tens or hundreds of cgroups and run workers in them that keep
allocating memory beyond the limit. I also run workers that periodically
read the stats.

Removing unified flushing makes such a synthetic benchmark 2-3 times
slower. This is not surprising, as all these concurrent reclaimers used
to just skip flushing, regardless of whether or not they got accurate
stats. Now, I don't know how realistic such a benchmark is, but if
there's a workload somewhere that runs into such conditions, it will
surely regress.

Sorry if the above is difficult to visualize. I can send the patches and
the test script if it makes things easier; I just didn't want to
overwhelm the thread.

I think there are 2 options going forward:

(a) If we believe removing unified flushing is the best way forward for
most use cases, then we need to find a better way of testing this
approach practically. Any suggestions here are useful.
(b) Decide that it's too expensive to remove unified flushing
completely, at least for in-kernel flushers that can see a lot of
concurrency. We can remove unified flushing only for userspace reads:
only flush the memcg being read, and never skip (essentially what Ivan
tried). There are already some flushing contexts that do this (e.g. the
zswap memcg charging code). I believe as long as there isn't a lot of
concurrency in these paths, it should be fine to skip unified flushing
for them and keep it only for in-kernel flushers.

Any thoughts on this?
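To make the split in option (b) concrete, here is a minimal userspace C sketch. This is a simulation, not kernel code: the names echo the thread (stats_flush_ongoing, mem_cgroup_flush_stats, cgroup_flush_stats), but the struct and the function bodies are simplified stand-ins I made up for illustration.

```c
/*
 * Userspace simulation of the flushing split in option (b).
 * The names echo the discussion (stats_flush_ongoing,
 * mem_cgroup_flush_stats, cgroup_flush_stats), but everything
 * here is a simplified stand-in, not actual kernel code.
 */
#include <stdatomic.h>
#include <stdbool.h>

struct memcg {
	long pending_updates;	/* stat deltas not yet folded in */
};

/* Set while some flusher is walking the tree; others may skip. */
static atomic_flag stats_flush_ongoing = ATOMIC_FLAG_INIT;

/* Fold pending deltas for one memcg subtree into the totals. */
static void cgroup_flush_stats(struct memcg *memcg)
{
	memcg->pending_updates = 0;
}

/*
 * In-kernel path (e.g. reclaim): unified flush that is skipped
 * when another flusher is already running, trading accuracy for
 * less contention. Returns true if this caller did the flush.
 */
static bool mem_cgroup_flush_stats_ratelimited(struct memcg *root)
{
	if (atomic_flag_test_and_set(&stats_flush_ongoing))
		return false;	/* someone else is flushing: skip */
	cgroup_flush_stats(root);
	atomic_flag_clear(&stats_flush_ongoing);
	return true;
}

/* Userspace-read path: always flush the memcg being read, never skip. */
static void mem_cgroup_flush_stats(struct memcg *memcg)
{
	cgroup_flush_stats(memcg);
}
```

The point of the sketch is just the asymmetry: a reader of memory.stat pays the cost of flushing its own subtree and always gets fresh numbers, while highly concurrent in-kernel callers keep the skippable unified path.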