From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 21 Aug 2023 13:58:22 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Tejun Heo
Cc: Shakeel Butt, Michal Hocko, Johannes Weiner, Roman Gushchin, Andrew Morton, Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ivan Babrou
On Fri, Aug 18, 2023 at 2:40 PM Yosry Ahmed wrote:
>
> On Wed, Aug 16, 2023 at 3:35 PM Yosry Ahmed wrote:
> >
> > On Wed, Aug 16, 2023 at 12:08 PM Tejun Heo wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Aug 16, 2023 at 10:11:20AM -0700, Shakeel Butt wrote:
> > > > These options are not white and black and there can be something in
> > > > between but let me be very clear on what I don't want and would NACK.
> > >
> > > I'm not a big fan of interfaces with hidden states.
> > > What you're proposing isn't strictly that, but it's still a bit nasty. So, if we can get by without doing that, that'd be great.
> >
> > Agreed. I will try to send patches soon implementing option (2) above, basically removing unified flushing. I will try to document any potential regressions that may happen and how we may fix them. Ideally we see no regressions.
> >
> > > > I don't want a global sleepable lock which can be taken by potentially any application running on the system. We have seen similar global locks causing isolation and priority inversion issues in production. So, not another lock which needs to be taken under extreme conditions (reading stats under OOM) by a high-priority task (node controller) and might be held by a low-priority task.
> > >
> > > Yeah, this is a real concern. Those priority inversions do occur and can be serious, but causing serious problems under memory pressure usually requires involving memory allocations and IOs. Here, it's just all CPU. So, at least in OOM conditions, this shouldn't be in the way (the system wouldn't have anything else to do anyway).
> > >
> > > It is true that this can still lead to priority inversions through CPU competition, though. However, that problem isn't necessarily solved by what you're suggesting either, unless you want to restrict explicit flushing based on permissions, which is another can of worms.
> >
> > Right. Also, in the case of a mutex, if we disable preemption while holding the mutex, this makes sure that whoever holds the mutex does not starve waiters. Essentially, the difference would be that waiters will sleep on the mutex instead of spinning, but the mutex holder itself wouldn't sleep.
> >
> > I will make this a separate patch, just in case it's too controversial. Switching the spinlock to a mutex should not block removing unified flushing.
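For illustration, the mutex-with-preemption-disabled scheme quoted above could look roughly like the following. This is a hypothetical sketch, not the actual patch: the lock name and the flush helper are illustrative, and the flush done under disabled preemption would have to be a non-sleeping variant.

```c
/*
 * Hypothetical sketch of the idea quoted above, not the actual patch:
 * replace the global stats spinlock with a mutex, but disable
 * preemption while the flush runs so the holder cannot be scheduled
 * out and starve waiters. Waiters sleep on the mutex instead of
 * spinning; the holder itself does not sleep while flushing.
 */
static DEFINE_MUTEX(stats_flush_mutex);		/* illustrative name */

static void do_stats_flush(struct cgroup *cgrp)
{
	mutex_lock(&stats_flush_mutex);		/* waiters sleep here */
	preempt_disable();			/* holder won't be preempted */
	flush_stats_atomic(cgrp);		/* hypothetical non-sleeping flush */
	preempt_enable();
	mutex_unlock(&stats_flush_mutex);
}
```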
> > >
> > > My preference is not exposing this in the user interface. This mostly arises from internal implementation details and isn't what users necessarily care about. There are many things we can do on the kernel side to make trade-offs among overhead, staleness, and priority inversions. If we make this an explicit userland interface behavior, we get locked into those semantics, which we'll likely regret in some future.
> >
> > Yeah, that's what I am trying to do here as well. I will try to follow up on this discussion with patches soon.
> >
> > Thanks everyone!
>
> So, status update. I tried implementing the removal of unified flushing. I can send the patches if it helps visualize things, but essentially mem_cgroup_flush_stats{_ratelimited} takes a memcg argument that it passes to cgroup_flush_stats(). There is no skipping if someone else is flushing (stats_flush_ongoing) and no threshold at which we start flushing (stats_flush_threshold).
>
> I tested this with a synthetic reclaim stress test that I wrote, because reclaim is the easiest way to stress in-kernel flushing. I basically create tens or hundreds of cgroups and run workers in them that keep allocating memory beyond the limit. I also run workers that periodically read the stats.
>
> Removing unified flushing makes such a synthetic benchmark 2-3 times slower. This is not surprising, as all these concurrent reclaimers used to just skip flushing, regardless of whether or not they got accurate stats. Now, I don't know how realistic such a benchmark is, but if there's a workload somewhere that runs into such conditions, it will surely regress.
>
> Sorry if the above is difficult to visualize. I can send the patches and the test script if it makes things easier; I just didn't want to overwhelm the thread.
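The test script itself was not posted in the thread; a rough sketch of the kind of workload described above (entirely hypothetical, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the memory controller enabled, and root privileges) could look like:

```shell
#!/bin/bash
# Hypothetical sketch of the described stress test, not the actual script.
# Creates N cgroups with a low memory.max, runs an allocator in each that
# keeps exceeding the limit (forcing reclaim, i.e. in-kernel flushing),
# plus readers that periodically read memory.stat.

N=50
ROOT=/sys/fs/cgroup/stress-test
mkdir -p "$ROOT"

for i in $(seq 1 "$N"); do
    cg="$ROOT/cg$i"
    mkdir -p "$cg"
    echo $((16 * 1024 * 1024)) > "$cg/memory.max"   # 16M limit

    # Allocator: each dd invocation allocates a 32M buffer, well past
    # the 16M limit, so the cgroup is constantly under reclaim pressure.
    (
        echo $BASHPID > "$cg/cgroup.procs"
        while true; do
            dd if=/dev/zero of=/dev/null bs=32M count=1 2>/dev/null
        done
    ) &

    # Reader: periodically read this cgroup's stats.
    (
        while true; do
            cat "$cg/memory.stat" > /dev/null
            sleep 0.1
        done
    ) &
done

sleep 60    # let the stress run, then tear everything down
kill $(jobs -p) 2>/dev/null
wait 2>/dev/null
for i in $(seq 1 "$N"); do rmdir "$ROOT/cg$i" 2>/dev/null; done
rmdir "$ROOT"
```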
>
> I think there are 2 options going forward:
>
> (a) If we believe removing unified flushing is the best way forward for most use cases, then we need to find a better way of testing this approach practically. Any suggestions here are useful.
>
> (b) Decide that it's too expensive to remove unified flushing completely, at least for in-kernel flushers that can see a lot of concurrency. We can remove unified flushing only for userspace reads: flush only the memcg being read and never skip (essentially what Ivan tried). There are already some flushing contexts that do this (e.g. the zswap memcg charging code). I believe as long as there isn't a lot of concurrency in these paths, it should be fine to skip unified flushing for them and keep it only for in-kernel flushers.

I sent a short series implementing (b) above:
https://lore.kernel.org/lkml/20230821205458.1764662-1-yosryahmed@google.com/

Basic testing shows that it works well.