Date: Tue, 31 Oct 2023 14:43:44 +0100
From: Michal Hocko
To: Charan Teja Kalla
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
	david@redhat.com, vbabka@suse.cz, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: page_alloc: unreserve highatomic page blocks before oom
In-Reply-To: <2a0d2dd8-562c-fec7-e3ac-0bd955643e16@quicinc.com>
References: <1698669590-3193-1-git-send-email-quic_charante@quicinc.com>
 <2a0d2dd8-562c-fec7-e3ac-0bd955643e16@quicinc.com>

On Tue 31-10-23 18:43:55, Charan Teja Kalla wrote:
> Thanks Michal/Pavan!!
>
> On 10/31/2023 1:44 PM, Michal Hocko wrote:
> > On Mon 30-10-23 18:09:50, Charan Teja Kalla wrote:
> >> __alloc_pages_direct_reclaim() is called from the slowpath allocation,
> >> where high atomic reserves can be unreserved after there is progress
> >> in reclaim and yet no suitable page is found. Later,
> >> should_reclaim_retry() gets called from the slow path allocation to
> >> decide if reclaim needs to be retried before the OOM kill path is
> >> taken.
> >>
> >> should_reclaim_retry() checks the available (reclaimable + free
> >> pages) memory against the min wmark levels of a zone and returns:
> >> a) true, if it is above the min wmark, so that the slow path
> >>    allocation will do the reclaim retries.
> >> b) false, thus the slowpath allocation takes the OOM kill path.
> >>
> >> should_reclaim_retry() can also unreserve the high atomic reserves,
> >> **but only after all the reclaim retries are exhausted.**
> >>
> >> In a case where there is almost no reclaimable memory and the free
> >> pages consist mostly of high atomic reserves which the allocation
> >> context cannot use, the available memory stays below the min wmark
> >> levels, hence false is returned from should_reclaim_retry() and the
> >> allocation request takes the OOM kill path. This is an early OOM kill
> >> because the high atomic reserves are holding a lot of free memory and
> >> unreserving them is not attempted.
> >
> > OK, I see. So we do not release those reserved pages because OOM hits
> > too early.
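
FWIW, the decision being described boils down to roughly the following
(a simplified sketch of should_reclaim_retry(), not the exact
mm/page_alloc.c code; the _sketch name is illustrative and several
details are trimmed):

	static bool should_reclaim_retry_sketch(struct alloc_context *ac,
						unsigned int order,
						int alloc_flags,
						int *no_progress_loops)
	{
		struct zoneref *z;
		struct zone *zone;

		/* all retries used up: last resort, release the highatomic reserves */
		if (*no_progress_loops > MAX_RECLAIM_RETRIES)
			return unreserve_highatomic_pageblock(ac, true);

		for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
						ac->highest_zoneidx, ac->nodemask) {
			unsigned long available;

			available = zone_reclaimable_pages(zone) +
				    zone_page_state_snapshot(zone, NR_FREE_PAGES);

			/* a) some zone is still above its min wmark -> retry reclaim */
			if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
						ac->highest_zoneidx, alloc_flags,
						available))
				return true;
		}

		/* b) no zone qualifies -> the caller proceeds towards OOM */
		return false;
	}

So the reserves are only given back on the path where all
MAX_RECLAIM_RETRIES no-progress iterations are used up; if the watermark
check already fails before that, we head towards OOM without ever
touching them.
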
> >> (early) OOM is encountered on a machine in the below state (excerpt
> >> from the oom kill logs):
> >> [  295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
> >> high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
> >> active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
> >> present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
> >> local_pcp:492kB free_cma:0kB
> >> [  295.998656] lowmem_reserve[]: 0 32
> >> [  295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
> >> 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> >> 0*4096kB = 7752kB
> >
> > OK, this is quite interesting as well. The system is really tiny and
> > 8MB of reserved memory is indeed really high. How come those
> > reservations have grown that high?
>
> Actually it is a VM running on the Linux kernel.
>
> Regarding the reservations, I think it is because of the 'max_managed'
> calculation in the below:
>
> static void reserve_highatomic_pageblock(struct page *page, ....)
> {
> 	....
> 	/*
> 	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
> 	 * Check is race-prone but harmless.
> 	 */
> 	max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
>
> 	if (zone->nr_reserved_highatomic >= max_managed)
> 		goto out;
>
> 	zone->nr_reserved_highatomic += pageblock_nr_pages;
> 	set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> 	move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
> out:
> }
>
> Since we are always adding the 1% of zone managed pages count to
> pageblock_nr_pages, the minimum it turns into is 2 pageblocks, as
> 'nr_reserved_highatomic' is incremented/decremented in pageblock-size
> granules.
>
> And in my case the 8M out of ~50M turned out to be 16%, which is high.
>
> If the below looks fine to you, I can raise this as a separate change:

Yes, please. Having a full page block (4MB) sounds still too much for
such a tiny system. Maybe there shouldn't be any reservation. But
definitely worth a separate patch.
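
Just to spell out the arithmetic with the numbers from your report
(managed:49224kB; assuming 4kB pages and 1024-page/4MB pageblocks):

	zone_managed_pages        = 49224kB / 4kB  = ~12306 pages
	zone_managed_pages / 100                   =   ~123 pages
	pageblock_nr_pages                         =   1024 pages
	max_managed = 123 + 1024                   =   1147 pages

	1st reservation:    0 <  1147 -> nr_reserved_highatomic = 1024 (4MB)
	2nd reservation: 1024 <  1147 -> nr_reserved_highatomic = 2048 (8MB)
	3rd reservation: 2048 >= 1147 -> limit finally applies

so the cap only kicks in after two pageblocks, i.e. the 8MB (~16% of the
zone) you are seeing. With the ALIGN() change below, max_managed becomes
max(ALIGN(123, 1024), 1024) = 1024 pages, so such a small zone would be
capped at a single 4MB pageblock.
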
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2a2536d..41441ced 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1886,7 +1886,9 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
>  	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
>  	 * Check is race-prone but harmless.
>  	 */
> -	max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
> +	max_managed = max_t(unsigned long,
> +			ALIGN(zone_managed_pages(zone) / 100, pageblock_nr_pages),
> +			pageblock_nr_pages);
>  	if (zone->nr_reserved_highatomic >= max_managed)
>  		return;
>
> >>
> >> Per the above log, the ~7MB of free memory that exists in the high
> >> atomic reserves is not freed up before falling back to the OOM kill
> >> path.
> >>
> >> This fix includes unreserving these atomic reserves in the OOM path
> >> before going for a kill. The side effect of unreserving in the OOM
> >> kill path is that these free pages are checked against the high
> >> wmark. If unreserved from
> >> should_reclaim_retry()/__alloc_pages_direct_reclaim(), they are
> >> checked against the min wmark levels.
> >
> > I do not like the fix much TBH. I think the logic should live in
>
> yeah, this code looks way cleaner to me. Let me know if I can raise
> V2 with the below, suggested-by you.

Sure, go ahead.

> I think another thing the system is missing here is draining the pcp
> lists.
> min:804kB low:1004kB high:1204kB free_pcp:688kB

Yes, but this seems negligible even on a small system like that. Does
it actually help to keep the system in balance? I would expect that the
OOM is just imminent no matter the draining. Anyway, if this makes any
difference then please make it a separate patch.

-- 
Michal Hocko
SUSE Labs