Date: Tue, 27 Apr 2021 10:08:13 +0200
From: Michal Hocko
To: Alexander Sosna
Cc: Chris Down, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Prevent OOM casualties by enforcing memcg limits
In-Reply-To: <410a58ba-d746-4ed6-a660-98b5f99258c3@sosna.de>

On Tue 27-04-21 08:37:30, Alexander Sosna wrote:
> Hi Chris,
> 
> On 27.04.21 at 02:09, Chris Down wrote:
> > Hi Alexander,
> > 
> > Alexander Sosna writes:
> >> Before this commit memory cgroup limits were not enforced during
> >> allocation.  If a process within a cgroup tries to allocate more
> >> memory than allowed, the kernel will not prevent the allocation even if
> >> OVERCOMMIT_NEVER is set.  Then the OOM killer is activated to kill
> >> processes in the corresponding cgroup.
> > 
> > Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
> > since exceeding memory.max is not overcommitment, it's just a natural
> > consequence of the fact that allocation and reclaim are not atomic
> > processes. Overcommitment, on the other hand, is about the bounds of
> > available memory at the global resource level.
> > 
> >> This behavior is not to be expected
> >> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
> >> problem for applications assuming that the kernel will deny an allocation
> >> if not enough memory is available, like PostgreSQL.  To prevent this a
> >> check is implemented to not allow a process to allocate more memory than
> >> limited by its cgroup.  This means a process will not be killed while
> >> accessing pages but will receive errors on memory allocation as
> >> appropriate.  This gives programs a chance to handle memory allocation
> >> failures gracefully instead of being reaped.
> > 
> > We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
> > can still happen for a bunch of reasons, so I really hope PostgreSQL
> > isn't relying on that.
> > 
> > Could you please be more clear about the "huge problem" being solved
> > here? I'm not seeing it.
> 
> Let me explain the problem I encountered and why I fell down the mm rabbit
> hole. It is not a PostgreSQL-specific problem, but that's where I ran
> into it. PostgreSQL forks a backend for each client connection. All
> backends have shared memory as well as local work memory. When a
> backend needs more dynamic work_mem to execute a query, new memory
> is allocated. It is normal that such an allocation can fail. If the
> backend gets an ENOMEM, the current query is rolled back and all dynamic
> work_mem is freed. The RDBMS stays operational and no other query is
> disturbed.
I am afraid the kernel MM implementation has never been really
compatible with such a memory allocation model. Linux has always
preferred to pretend there is always memory available and to reclaim
memory - including by killing some processes - rather than fail the
allocation with ENOMEM. The overcommit configuration (especially
OVERCOMMIT_NEVER) is an attempt to somehow mitigate this ambitious
memory allocation approach, but in reality this has turned out to be
a) unreliable and b) unusable with modern userspace, which relies on
considerable virtual memory overcommit.

> When running in a memory cgroup - for example via systemd or on k8s -
> the kernel will not return ENOMEM even if the cgroup's memory limit is
> exceeded.

Yes, memcg doesn't change the overall approach. It just restricts the
existing semantic with a smaller memory limit. Also, the overcommit
heuristic has never been implemented for memory controllers.

> Instead the OOM killer is awakened and kills processes in the
> violating cgroup. If any backend is killed with SIGKILL the shared
> memory of the whole cluster is deemed potentially corrupted and
> PostgreSQL needs to do an emergency restart. This cancels all operation
> on all backends and it entails a potentially lengthy recovery process.
> Therefore the behavior is quite "costly".

One way around that would be to use the high limit rather than the hard
limit, pro-actively watch memory utilization, and communicate that back
to the application to throttle its workers. I can see how that

> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
> killer". IMHO it should mean "no OOM killer if we can avoid it" and I

I do not see how it can ever promise anything like that. Memory
consumption by kernel subsystems cannot be predicted at the time virtual
memory is allocated from userspace.
Not only can it not be predicted, but it is also highly impractical to
force kernel allocations - necessary for the OS operation - to fail just
because userspace has reserved virtual memory. So this all is just a
heuristic to help in some extreme cases, but overall I consider
OVERCOMMIT_NEVER impractical, to say the least.

> would highly appreciate it if the kernel would use a less invasive means
> whenever possible. I guess this might also be the expectation of many
> other users. In my described case - which is a real pain for me - it is
> quite easy to tweak the kernel behavior in order to handle this and
> other similar situations with fewer casualties. This is why I sent a
> patch instead of starting a theoretical discussion.

I am pretty sure that many users would agree with you on that, but the
matter of fact is that a different approach has been chosen
historically. We can argue whether this has been a good or bad design
decision, but I do not see that changing without a lot of fallout.

Btw. a strong memory reservation approach can be found with hugetlb
pages, and this one has turned out to be very tricky both from an
implementation and a userspace usage POV. Needless to say, it operates
on a single-purpose preallocated memory pool, and it would be quite
reasonable to expect the complexity to grow with more users of the
pool, which is the general case for a general purpose memory allocator.

> What do you think is necessary to get this to an approvable quality?

See my other reply.

-- 
Michal Hocko
SUSE Labs