Date: Tue, 22 Apr 2025 14:12:17 -0400
From: Johannes Weiner
To: Shakeel Butt
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed, Tejun Heo, Michal Koutný, Greg Thelen, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Meta kernel team
Subject: Re: [PATCH v2] memcg: introduce non-blocking limit setting option
Message-ID: <20250422181217.GE1853@cmpxchg.org>
References: <20250419183545.1982187-1-shakeel.butt@linux.dev>
In-Reply-To: <20250419183545.1982187-1-shakeel.butt@linux.dev>

On Sat, Apr 19, 2025 at
11:35:45AM -0700, Shakeel Butt wrote:
> Setting the max and high limits can trigger synchronous reclaim and/or
> oom-kill if the usage is higher than the given limit. This behavior is
> fine for newly created cgroups but it can cause issues for the node
> controller while setting limits for existing cgroups.
>
> In our production multi-tenant and overcommitted environment, we are
> seeing priority inversion when the node controller dynamically adjusts
> the limits of running jobs of different priorities. Based on the system
> situation, the node controller may reduce the limits of lower priority
> jobs and increase the limits of higher priority jobs. However we are
> seeing node controller getting stuck for long period of time while
> reclaiming from lower priority jobs while setting their limits and also
> spends a lot of its own CPU.
>
> One of the workaround we are trying is to fork a new process which sets
> the limit of the lower priority job along with setting an alarm to get
> itself killed if it get stuck in the reclaim for lower priority job.
> However we are finding it very unreliable and costly. Either we need a
> good enough time buffer for the alarm to be delivered after setting
> limit and potentialy spend a lot of CPU in the reclaim or be unreliable
> in setting the limit for much shorter but cheaper (less reclaim) alarms.
>
> Let's introduce new limit setting option which does not trigger
> reclaim and/or oom-kill and let the processes in the target cgroup to
> trigger reclaim and/or throttling and/or oom-kill in their next charge
> request. This will make the node controller on multi-tenant
> overcommitted environment much more reliable.
>
> Signed-off-by: Shakeel Butt

It's usually the allocating tasks inside the group bearing the cost of
limit enforcement and reclaim. This allows a (privileged) updater from
outside the group to keep that cost in there - instead of having to
help, from a context that doesn't necessarily make sense.
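For illustration only, a minimal userspace sketch of such an outside
updater. It assumes the non-blocking option is selected by opening the
limit file with O_NONBLOCK on a kernel that carries this patch; the
function name and the cgroup path in the usage note are hypothetical.

```python
import os


def set_limit_nonblocking(limit_file: str, value: str) -> None:
    """Write a new limit without volunteering to reclaim for the group.

    Assumption: the running kernel has this patch, so O_NONBLOCK on
    memory.max / memory.high skips synchronous reclaim and oom-kill.
    On a kernel without it, the same write may still block in reclaim.
    """
    fd = os.open(limit_file, os.O_WRONLY | os.O_NONBLOCK)
    try:
        # cgroup limit files take the value as ASCII text (bytes count).
        os.write(fd, value.encode("ascii"))
    finally:
        os.close(fd)
```

A node controller might call it as
set_limit_nonblocking("/sys/fs/cgroup/lowprio-job/memory.high",
str(512 * 1024 * 1024)), leaving enforcement to the tasks inside that
group on their next charge.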
I suppose the tradeoff with that - and the reason why this was doing
sync reclaim in the first place - is that, if the group is idle and
not trying to allocate more, it can take indefinitely long for the new
limit to actually be met.

It should be okay in most scenarios in practice. As the capacity is
reallocated from group A to B, B will exert pressure on A once it
tries to claim it and thereby shrink it down. If A is idle, that
shouldn't be hard. If A is running, it's likely to fault/allocate
soon-ish and then join the effort.

It does leave a (malicious) corner case where A is just busy-hitting
its memory to interfere with the clawback. This is comparable to
reclaiming memory.low overage from the outside, though, which is an
acceptable risk. Users of O_NONBLOCK just need to be aware.

Maybe this and what Christian brought up deserve a mention in the
changelog / docs though?

Acked-by: Johannes Weiner
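Because an idle group may stay above its new limit for a long time, an
updater that cares can check afterwards whether the limit was actually
met. A rough sketch, assuming cgroup v2's memory.current (usage in
bytes); the function name, timeout and poll interval are illustrative
and not part of the patch.

```python
import time


def wait_until_below(current_file: str, limit_bytes: int,
                     timeout_s: float, poll_s: float = 0.5) -> bool:
    """Return True once usage <= limit_bytes, False if timeout_s passes first."""
    deadline = time.monotonic() + timeout_s
    while True:
        # memory.current holds a single decimal number of bytes.
        with open(current_file) as f:
            usage = int(f.read().strip())
        if usage <= limit_bytes:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A False return only means the group has not shrunk yet; what to do
then (wait longer, fall back to a blocking write) is a policy decision
for the controller.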