From: Shakeel Butt
Date: Wed, 16 Aug 2023 10:11:20 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Yosry Ahmed
Cc: Tejun Heo, Michal Hocko, Johannes Weiner, Roman Gushchin, Andrew Morton, Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ivan Babrou

On Tue, Aug 15, 2023 at 7:20 PM Yosry Ahmed wrote:
> [...]
>
> The problem in (1) is that first of all it's a behavioral change, we
> start having explicit staleness in the stats, and userspace needs to
> adapt by explicitly requesting a flush. A node controller can be
> enlightened to do so, but on a system with a lot of cgroups, if you
> flush once explicitly and iterate through all cgroups, the flush will
> be stale by the time you reach the last cgroup. Keep in mind there are
> also users that read their own stats, figuring out which users need to
> flush explicitly vs.
> read cached stats is a problem.

I thought we covered the invalidity of the staleness argument. Similar
staleness can happen today, so this is not strictly a behavioral change.
We can change the time window and condition of the periodic flush to
reduce the chance of staleness. Option 2 can face staleness as well.

>
> Taking a step back, the total work that needs to be done does not
> change with (2). A node controller iterating cgroups and reading their
> stats will do the same amount of flushing, it will just be distributed
> across multiple read syscalls, so shorter intervals in kernel space.

You seem to be worried about the very fine-grained staleness of the
stats. So, for scenarios where stats of multi-level cgroups need to be
read and the workload is continuously updating the stats, the total
work can be much more. For example, if we are reading stats of the root
and top-level memcgs, then option 2 can potentially flush the stats for
the whole tree twice.

>
> There are also in-kernel flushers (e.g. reclaim and dirty throttling)
> that will benefit from (2) by reading more accurate stats without
> having to flush the entire tree. The behavior is currently
> indeterministic, you may get fresh or stale stats, you may flush one
> cgroup or 100 cgroups.
>
> I think with (2) we make less compromises in terms of accuracy and
> determinism, and it's a less disruptive change to userspace.

These options are not black and white, and there can be something in
between, but let me be very clear on what I don't want and would NACK.

I don't want a global sleepable lock which can be taken by potentially
any application running on the system. We have seen similar global
locks causing isolation and priority inversion issues in production.
So, not another lock which needs to be taken under extreme conditions
(reading stats under OOM) by a high-priority task (node controller) and
might be held by a low-priority task.
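
To make the "whole tree twice" example concrete, here is a rough toy
model, not kernel code. It assumes one "dirty" flag per memcg as a
stand-in for pending stat updates, and a workload that re-dirties every
memcg between reads. All names (Memcg, flush, build_tree, ...) are
hypothetical and only for illustration; the real rstat machinery is
more involved.

```python
class Memcg:
    """Toy memcg node; 'dirty' stands in for pending stat updates."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.dirty = True


def subtree(node):
    """Yield node and all of its descendants."""
    yield node
    for child in node.children:
        yield from subtree(child)


def flush(node):
    """Flush node's subtree; return how many memcgs had pending updates."""
    work = 0
    for memcg in subtree(node):
        if memcg.dirty:
            memcg.dirty = False
            work += 1
    return work


def redirty(root):
    """Model a workload that keeps updating stats everywhere."""
    for memcg in subtree(root):
        memcg.dirty = True


def build_tree(top_level, leaves_per_top):
    """root -> top_level memcgs -> leaves_per_top leaf memcgs each."""
    tops = [
        Memcg(f"top{i}", [Memcg(f"top{i}/leaf{j}") for j in range(leaves_per_top)])
        for i in range(top_level)
    ]
    return Memcg("root", tops)


def flush_once_then_read(root):
    """Option (1)-like: one explicit whole-tree flush, then cached reads."""
    return flush(root)


def flush_on_every_read(root):
    """Option (2)-like: every read flushes the subtree of the memcg read."""
    work = 0
    for memcg in [root] + root.children:
        redirty(root)  # the workload dirtied stats again before this read
        work += flush(memcg)
    return work


if __name__ == "__main__":
    print("flush once:   ", flush_once_then_read(build_tree(4, 10)))
    print("flush on read:", flush_on_every_read(build_tree(4, 10)))
```

With 4 top-level memcgs of 10 leaves each (45 memcgs in total), the
single flush touches 45 memcgs, while reading the root and then each
top-level memcg touches 45 + 4 * 11 = 89, i.e. close to two passes over
the tree. The model says nothing about lock hold times or per-read
latency, which is the other side of the tradeoff discussed above.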