From: Greg Thelen
Date: Fri, 18 Apr 2025 13:18:53 -0700
Subject: Re: [PATCH] memcg: introduce non-blocking limit setting interfaces
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed, Tejun Heo, Michal Koutný, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Meta kernel team
References: <20250418195956.64824-1-shakeel.butt@linux.dev>
In-Reply-To: <20250418195956.64824-1-shakeel.butt@linux.dev>

On Fri, Apr 18, 2025 at 1:00 PM Shakeel Butt wrote:
>
> Setting the max and high limits can trigger synchronous reclaim and/or
> oom-kill if the usage is higher than the given limit. This behavior is
> fine for newly created cgroups but it can cause issues for the node
> controller while setting limits for existing cgroups.
>
> In our production multi-tenant and overcommitted environment, we are
> seeing priority inversion when the node controller dynamically adjusts
> the limits of running jobs of different priorities.
> Based on the system situation, the node controller may reduce the
> limits of lower priority jobs and increase the limits of higher
> priority jobs. However we are seeing the node controller getting stuck
> for long periods of time while reclaiming from lower priority jobs
> while setting their limits, and it also spends a lot of its own CPU.
>
> One of the workarounds we are trying is to fork a new process which sets
> the limit of the lower priority job along with setting an alarm to get
> itself killed if it gets stuck in the reclaim for the lower priority job.
> However we are finding it very unreliable and costly. Either we need a
> good enough time buffer for the alarm to be delivered after setting
> the limit and potentially spend a lot of CPU in the reclaim, or be
> unreliable in setting the limit for much shorter but cheaper (less
> reclaim) alarms.
>
> Let's introduce new limit setting interfaces which do not trigger
> reclaim and/or oom-kill and let the processes in the target cgroup
> trigger reclaim and/or throttling and/or oom-kill in their next charge
> request. This will make the node controller in a multi-tenant
> overcommitted environment much more reliable.

Would opening the typical synchronous files (e.g. memory.max) with
O_NONBLOCK be a more general way to tell the kernel that the user space
controller doesn't want to wait? It's not quite consistent with the
traditional use of O_NONBLOCK, which makes an operation either fully
succeed or fail, rather than altering the operation being requested. But
O_NONBLOCK would allow for semantics of non-blocking reclaim, if that's
fast enough for your controller.

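As a rough sketch of what the O_NONBLOCK idea could look like from the
controller side: the helper below opens an existing limit file with
O_NONBLOCK and writes the new value. The function name
memcg_set_limit_nonblock() is made up for illustration, and the
assumption that the kernel would treat O_NONBLOCK on memory.max or
memory.high as "skip synchronous reclaim" is only the suggestion above,
not something the quoted patch implements.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a limit string (e.g. "1073741824" or "max")
 * to an existing cgroup v2 control file opened with O_NONBLOCK.
 *
 * ASSUMPTION: the kernel interprets O_NONBLOCK on this file as "do not
 * reclaim synchronously on behalf of the writer". That is the semantics
 * being proposed in this thread, not a documented guarantee.
 *
 * Returns 0 on success or a negative errno value on failure.
 */
int memcg_set_limit_nonblock(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t n;
	int ret = 0;
	int fd;

	/* No O_CREAT: cgroup control files already exist. */
	fd = open(path, O_WRONLY | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	n = write(fd, value, len);
	if (n < 0)
		ret = -errno;		/* save errno before close() can change it */
	else if ((size_t)n != len)
		ret = -EIO;		/* short write: treat as a failure */

	close(fd);
	return ret;
}
```

A controller would call this with the path of the target cgroup's
memory.max or memory.high file. The same binary keeps working on a
kernel that ignores the flag; it just blocks in reclaim as it does today.
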
> Signed-off-by: Shakeel Butt
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 16 +++++++++
>  mm/memcontrol.c                         | 46 +++++++++++++++++++++++++
>  2 files changed, 62 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 8fb14ffab7d1..7b459c821afa 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1299,6 +1299,14 @@ PAGE_SIZE multiple when read back.
>  	monitors the limited cgroup to alleviate heavy reclaim
>  	pressure.
>
> +  memory.high.nonblock
> +	This is the same limit as memory.high but have different
> +	behaviour for the writer of this interface. The program setting
> +	the limit will not trigger reclaim synchronously if the
> +	usage is higher than the limit and let the processes in the
> +	target cgroup to trigger reclaim and/or get throttled on
> +	hitting the high limit.
> +
>    memory.max
>  	A read-write single value file which exists on non-root
>  	cgroups.  The default is "max".
> @@ -1316,6 +1324,14 @@ PAGE_SIZE multiple when read back.
>  	Caller could retry them differently, return into userspace
>  	as -ENOMEM or silently ignore in cases like disk readahead.
>
> +  memory.max.nonblock
> +	This is the same limit as memory.max but have different
> +	behaviour for the writer of this interface. The program setting
> +	the limit will not trigger reclaim synchronously and/or trigger
> +	the oom-kill if the usage is higher than the limit and let the
> +	processes in the target cgroup to trigger reclaim and/or get
> +	oom-killed on hitting their max limit.
> +
>    memory.reclaim
>  	A write-only nested-keyed file which exists for all cgroups.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5e2ea8b8a898..6ad1464b621a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4279,6 +4279,23 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>
> +static ssize_t memory_high_nonblock_write(struct kernfs_open_file *of,
> +					  char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long high;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "max", &high);
> +	if (err)
> +		return err;
> +
> +	page_counter_set_high(&memcg->memory, high);
> +	memcg_wb_domain_size_changed(memcg);
> +	return nbytes;
> +}
> +
>  static int memory_max_show(struct seq_file *m, void *v)
>  {
>  	return seq_puts_memcg_tunable(m,
> @@ -4333,6 +4350,23 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>
> +static ssize_t memory_max_nonblock_write(struct kernfs_open_file *of,
> +					 char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long max;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "max", &max);
> +	if (err)
> +		return err;
> +
> +	xchg(&memcg->memory.max, max);
> +	memcg_wb_domain_size_changed(memcg);
> +	return nbytes;
> +}
> +
>  /*
>   * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
>   * if any new events become available.
> @@ -4557,12 +4591,24 @@ static struct cftype memory_files[] = {
>  		.seq_show = memory_high_show,
>  		.write = memory_high_write,
>  	},
> +	{
> +		.name = "high.nonblock",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_high_show,
> +		.write = memory_high_nonblock_write,
> +	},
>  	{
>  		.name = "max",
>  		.flags = CFTYPE_NOT_ON_ROOT,
>  		.seq_show = memory_max_show,
>  		.write = memory_max_write,
>  	},
> +	{
> +		.name = "max.nonblock",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_max_show,
> +		.write = memory_max_nonblock_write,
> +	},
>  	{
>  		.name = "events",
>  		.flags = CFTYPE_NOT_ON_ROOT,
> --
> 2.47.1
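
For comparison, the fork-plus-alarm workaround described in the commit
message might look roughly like the sketch below. This is only an
illustration of the approach as described there, not the actual
controller code: the helper name set_limit_with_deadline() and its
return-value convention are assumptions made for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: set a cgroup limit from a throwaway child process
 * that arms an alarm, so the child is killed by SIGALRM if the write
 * stays blocked (e.g. in reclaim) for longer than timeout_sec seconds.
 * A timeout_sec of 0 arms no alarm at all.
 *
 * Returns 0 on success, -ETIMEDOUT if the child was killed by SIGALRM,
 * -EIO if the child failed to open or write the file, or another
 * negative errno value if fork()/waitpid() fail.
 */
int set_limit_with_deadline(const char *path, const char *value,
			    unsigned int timeout_sec)
{
	int status;
	pid_t pid;

	pid = fork();
	if (pid < 0)
		return -errno;

	if (pid == 0) {
		size_t len = strlen(value);
		ssize_t n;
		int fd;

		/* Child: the default SIGALRM action terminates the process. */
		signal(SIGALRM, SIG_DFL);
		alarm(timeout_sec);

		fd = open(path, O_WRONLY);
		if (fd < 0)
			_exit(1);
		n = write(fd, value, len);
		_exit(n == (ssize_t)len ? 0 : 1);
	}

	/* Parent: still has to wait for the child to finish or be killed. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -errno;
	}

	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGALRM)
		return -ETIMEDOUT;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return 0;
	return -EIO;
}
```

The sketch makes the trade-off from the commit message visible: every
limit change costs a fork, and the timeout has to be picked between
"long enough for the write to complete" and "short enough to bound the
CPU spent in reclaim".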