From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <2a0d2dd8-562c-fec7-e3ac-0bd955643e16@quicinc.com>
Date: Tue, 31 Oct 2023 18:43:55 +0530
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.13.0
Subject: Re: [PATCH] mm: page_alloc: unreserve highatomic page blocks before
 oom
To: Michal Hocko <mhocko@suse.com>
CC: <akpm@linux-foundation.org>, <mgorman@techsingularity.net>,
        <david@redhat.com>, <vbabka@suse.cz>, <linux-mm@kvack.org>,
        <linux-kernel@vger.kernel.org>
References: <1698669590-3193-1-git-send-email-quic_charante@quicinc.com>
 <gtya2g2pdbsonelny6vpfwj5vsxdrzhi6wzkpcrke33mr3q2hf@j4ramnjmfx52>
Content-Language: en-US
From: Charan Teja Kalla <quic_charante@quicinc.com>
In-Reply-To: <gtya2g2pdbsonelny6vpfwj5vsxdrzhi6wzkpcrke33mr3q2hf@j4ramnjmfx52>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Thanks Michal/Pavan!!

On 10/31/2023 1:44 PM, Michal Hocko wrote:
> On Mon 30-10-23 18:09:50, Charan Teja Kalla wrote:
>> __alloc_pages_direct_reclaim() is called from slowpath allocation where
>> high atomic reserves can be unreserved after there is a progress in
>> reclaim and yet no suitable page is found. Later should_reclaim_retry()
>> gets called from slow path allocation to decide if the reclaim needs to
>> be retried before OOM kill path is taken.
>>
>> should_reclaim_retry() checks the available(reclaimable + free pages)
>> memory against the min wmark levels of a zone and returns:
>> a)  true, if it is above the min wmark so that slow path allocation will
>> do the reclaim retries.
>> b) false, thus slowpath allocation takes oom kill path.
>>
>> should_reclaim_retry() can also unreserves the high atomic reserves
>> **but only after all the reclaim retries are exhausted.**
>>
>> In a case where there are almost none reclaimable memory and free pages
>> contains mostly the high atomic reserves but allocation context can't
>> use these high atomic reserves, makes the available memory below min
>> wmark levels hence false is returned from should_reclaim_retry() leading
>> the allocation request to take OOM kill path. This is an early oom kill
>> because high atomic reserves are holding lot of free memory and 
>> unreserving of them is not attempted.
> 
> OK, I see. So we do not release those reserved pages because OOM hits
> too early. 
> 
>> (early)OOM is encountered on a machine in the below state(excerpt from
>> the oom kill logs):
>> [  295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
>> high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
>> active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
>> present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
>> local_pcp:492kB free_cma:0kB
>> [  295.998656] lowmem_reserve[]: 0 32
>> [  295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
>> 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
>> 0*4096kB = 7752kB
> 
> OK, this is quite interesting as well. The system is really tiny and 8MB
> of reserved memory is indeed really high. How come those reservations
> have grown that high?

Actually it is a VM running on the Linux kernel.

Regarding the reservations, I think it is because of the 'max_managed'
calculation below:
static void reserve_highatomic_pageblock(struct page *page, ....) {
    ....
    /*
     * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
     * Check is race-prone but harmless.
     */
    max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;

    if (zone->nr_reserved_highatomic >= max_managed)
            goto out;

    zone->nr_reserved_highatomic += pageblock_nr_pages;
    set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
    move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
out:
}

Since we always add pageblock_nr_pages on top of 1% of the zone's managed
pages, the effective limit becomes at least 2 pageblocks, because
'nr_reserved_highatomic' is incremented/decremented in pageblock-size
granules.

In my case, 8M out of ~50M works out to ~16%, which is high.
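
To make the arithmetic concrete, here is a minimal user-space sketch (not
kernel code) of the limit check. The numbers are assumptions taken from the
log above: 4kB pages, managed:49224kB (12306 pages), and a 4MB pageblock
(1024 pages), inferred from reserved_highatomic:8192kB being 2 pageblocks.
The helper names are made up for illustration.

```c
#include <assert.h>

/* Assumed values derived from the OOM log (4kB pages, 4MB pageblocks). */
#define MANAGED_PAGES       12306UL  /* managed:49224kB / 4kB */
#define PAGEBLOCK_NR_PAGES  1024UL   /* assumption: 4MB / 4kB */

/* Current limit: 1% of the zone's managed pages plus one pageblock. */
unsigned long old_max_managed(unsigned long managed, unsigned long pageblock)
{
	return managed / 100 + pageblock;
}

/*
 * Model repeated reserve attempts against the current limit.
 * Each successful attempt reserves exactly one pageblock.
 * Returns the number of pages reserved in the end.
 */
unsigned long simulate_reserve(unsigned long managed, unsigned long pageblock,
			       unsigned int attempts)
{
	unsigned long limit = old_max_managed(managed, pageblock);
	unsigned long reserved = 0;

	assert(pageblock > 0);
	while (attempts--) {
		if (reserved >= limit)
			break;          /* same check as in the excerpt above */
		reserved += pageblock;
	}
	return reserved;
}
```

With these assumed values the limit is 123 + 1024 = 1147 pages, so a second
pageblock still passes the check and the reserve stops at 2048 pages
(8192kB).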

If the below looks fine to you, I can raise this as a separate change:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2a2536d..41441ced 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1886,7 +1886,9 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
         * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
         * Check is race-prone but harmless.
         */
-       max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
+       max_managed = max_t(unsigned long,
+                       ALIGN(zone_managed_pages(zone) / 100, pageblock_nr_pages),
+                       pageblock_nr_pages);
        if (zone->nr_reserved_highatomic >= max_managed)
                return;

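For comparison, a small user-space sketch of the current and the proposed
limit calculations. align_up() stands in for the kernel's ALIGN() and, like
it, assumes a power-of-two alignment. The function names and the sample
values (from the log, assuming 4kB pages and 1024-page pageblocks) are
illustrative only.

```c
#include <assert.h>

/* Round x up to a multiple of a; a must be a power of two. */
unsigned long align_up(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Current calculation: 1% of the zone plus one pageblock. */
unsigned long limit_current(unsigned long managed, unsigned long pageblock)
{
	return managed / 100 + pageblock;
}

/*
 * Proposed calculation: 1% of the zone rounded up to a pageblock
 * multiple, but never less than one pageblock.
 */
unsigned long limit_proposed(unsigned long managed, unsigned long pageblock)
{
	unsigned long one_percent = align_up(managed / 100, pageblock);

	return one_percent > pageblock ? one_percent : pageblock;
}
```

For the 12306-page zone the current limit is 1147 pages while the proposed
one is 1024 pages, so only one pageblock could be reserved.
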
>>
>> Per above log, the free memory of ~7MB exist in the high atomic
>> reserves is not freed up before falling back to oom kill path.
>>
>> This fix includes unreserving these atomic reserves in the OOM path
>> before going for a kill. The side effect of unreserving in oom kill path
>> is that these free pages are checked against the high wmark. If
>> unreserved from should_reclaim_retry()/__alloc_pages_direct_reclaim(),
>> they are checked against the min wmark levels.
> 
> I do not like the fix much TBH. I think the logic should live in

Yeah, this code looks much cleaner to me. Let me know if I can raise a
V2 with the below, with your Suggested-by.

I think another thing the system is missing here is draining the pcp lists:
min:804kB low:1004kB high:1204kB free_pcp:688kB

IIUC, drain_all_pages() is called in the reclaim path as below. In this
case, when did_some_progress = 0, the pcp drain is skipped as well.
struct page *__alloc_pages_direct_reclaim() {
    .....
    *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
    if (unlikely(!(*did_some_progress)))
        goto out;
retry:
    page = get_page_from_freelist();
    if (!page && !drained) {
        drain_all_pages(NULL);
        drained = true;
        goto retry;
    }
out:
}
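
A toy user-space model of that flow, just to illustrate the point. The
struct and the toy_* helpers are hypothetical stand-ins, not kernel APIs;
'reclaimed' plays the role of did_some_progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state: pages on the free lists vs. parked on pcp lists. */
struct toy_zone {
	unsigned long free_pages;
	unsigned long pcp_pages;
};

/* Stand-in for drain_all_pages(): return pcp pages to the free lists. */
void toy_drain_all_pages(struct toy_zone *z)
{
	z->free_pages += z->pcp_pages;
	z->pcp_pages = 0;
}

/* Stand-in for get_page_from_freelist(): take nr pages if available. */
bool toy_get_page(struct toy_zone *z, unsigned long nr)
{
	if (z->free_pages < nr)
		return false;
	z->free_pages -= nr;
	return true;
}

/* Mirrors the quoted flow: no reclaim progress exits before the drain. */
bool toy_direct_reclaim(struct toy_zone *z, unsigned long reclaimed,
			unsigned long nr)
{
	bool drained = false;

	assert(z != NULL);
	z->free_pages += reclaimed;
	if (reclaimed == 0)
		return false;   /* "goto out": pcp lists are never drained */
retry:
	if (toy_get_page(z, nr))
		return true;
	if (!drained) {
		toy_drain_all_pages(z);
		drained = true;
		goto retry;
	}
	return false;
}
```

With free_pcp:688kB (172 pages at 4kB) and no reclaim progress, the model
returns before ever draining, so those pages stay on the pcp lists.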

So, how about extending your code below for this case? Assuming that
did_some_progress > 0 means the draining has perhaps already been done in
__alloc_pages_direct_reclaim(), thus:
out:
   if (!ret) {
       ret = unreserve_highatomic_pageblock(ac, true);
       drain_all_pages(NULL);
   }
   return ret;

Please let me know if the above doesn't make sense. If it looks good, I
will raise a separate patch for this condition.
> should_reclaim_retry. One way to approach it is to unreserve at the end
> of the function, something like this:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 95546f376302..d04e14adf2c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3813,10 +3813,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  	 * Make sure we converge to OOM if we cannot make any progress
>  	 * several times in the row.
>  	 */
> -	if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
> -		/* Before OOM, exhaust highatomic_reserve */
> -		return unreserve_highatomic_pageblock(ac, true);
> -	}
> +	if (*no_progress_loops > MAX_RECLAIM_RETRIES)
> +		goto out;
>  
>  	/*
>  	 * Keep reclaiming pages while there is a chance this will lead
> @@ -3859,6 +3857,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		schedule_timeout_uninterruptible(1);
>  	else
>  		cond_resched();
> +
> +out:
> +	/* Before OOM, exhaust highatomic_reserve */
> +	if (!ret)
> +		return unreserve_highatomic_pageblock(ac, true);
> +
>  	return ret;
>  }
>