From: Yu Zhao <yuzhao@google.com>
Date: Thu, 24 Oct 2024 15:15:39 -0600
Subject: Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
To: Vlastimil Babka
Cc: Mel Gorman, Michal Hocko, Andrew Morton, David Rientjes, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Link Lin, Matt Fleming

On Thu, Oct 24, 2024 at 2:16 AM Vlastimil Babka wrote:
>
> On 10/24/24 06:35, Yu Zhao wrote:
> > On Wed, Oct 23, 2024 at 1:35 AM Vlastimil Babka wrote:
> >>
> >> On 10/23/24 08:36, Yu Zhao wrote:
> >> > On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka wrote:
> >> >>
> >> >> +Cc Mel and Matt
> >> >>
> >> >> On 10/21/24 19:25, Michal Hocko wrote:
> >> >>
> >> >> Hm I don't think it's completely WAI. The intention is that we should be
> >> >> able to unreserve the highatomic pageblocks before going OOM, and there
> >> >> seems to be an unintended corner case that if the pageblocks are fully
> >> >> exhausted, they are not reachable for unreserving.
> >> >
> >> > I still think unreserving should only apply to highatomic PBs that
> >> > contain free pages. Otherwise, it seems to me that it'd be
> >> > self-defeating because:
> >> > 1. Unreserving fully used highatomic PBs can't fulfill the alloc
> >> > demand immediately.
> >>
> >> I thought the alloc demand is only blocked on the pessimistic watermark
> >> calculation. Usable free pages exist, but the allocation is not allowed to
> >> use them.
> >
> > I think we are talking about two different problems here:
> > 1. The estimation problem.
> > 2. The unreserving policy problem.
> >
> > What you said here is correct w.r.t. the first problem, and I was
> > talking about the second problem.
>
> OK but the problem with unreserving currently makes the problem of
> estimation worse and unfixable.
>
> >> > 2. More importantly, it only takes one alloc failure in
> >> > __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to 2MB,
> >> > from as high as 1% of a zone (in this case 1GB). IOW, it makes more
> >> > sense to me that highatomic only unreserves what it doesn't fully use
> >> > each time unreserve_highatomic_pageblock() is called, not everything
> >> > it got (except the last PB).
> >>
> >> But if the highatomic pageblocks are already full, we are not really
> >> removing any actual highatomic reserves just by changing the migratetype and
> >> decreasing nr_reserved_highatomic?
> >
> > If we change the MT, they can be fragmented a lot faster, i.e., from
> > the next near OOM condition to upon becoming free. Trying to persist
> > over time is what actually makes those PBs more fragmentation
> > resistant.
>
> If we assume the allocations there have similar sizes and lifetimes, then I
> guess yeah.
>
> >> In fact that would allow the reserves
> >> grow with some actual free pages in the future.
> >
> > Good point. I think I can explain it better along this line.
> >
> > If highatomic is under the limit, both your proposal and the current
> > implementation would try to grow, making not much difference. However,
> > the current implementation can also reuse previously full PBs when
> > they become available. So there is a clear winner here: the current
> > implementation.
>
> I'd say it depends on the user of the highatomic blocks (the workload),
> which way ends up better.
>
> > If highatomic has reached the limit, with your proposal, the growth
> > can only happen after unreserve, and unreserve only happens under
> > memory pressure. This means it's likely that it tries to grow under
> > memory pressure, which is more difficult than the condition where
> > there is plenty of memory. For the current implementation, it doesn't
> > try to grow, rather, it keeps what it already has, betting those full
> > PBs becoming available for reuse.
> > So I don't see a clear winner
> > between trying to grow under memory pressure and betting on becoming
> > available for reuse.
>
> Understood. But also note there are many conditions where the current
> implementation and my proposal behave the same. If highatomic pageblocks
> become full and then only one or few pages from each is freed, it suddenly
> becomes possible to unreserve them due to memory pressure, and there is no
> reuse for those highatomic allocations anymore. This very different outcome
> only depends on whether a single page is free for the unreserve to work, but
> from the efficiency of pageblock reusal you describe above a single page is
> only a minor difference. My proposal would at least remove the sudden change
> of behavior when going from a single free page to no free page.
>
> >> Hm that assumes we're adding some checks in free fastpath, and for that to
> >> work also that there will be a freed page in highatomic PC in near enough
> >> future from the decision we need to unreserve something. Which is not so
> >> much different from the current assumption we'll find such a free page
> >> already in the free list immediately.
> >>
> >> > To summarize, I think this is an estimation problem, which I would
> >> > categorize as a lesser problem than accounting problems. But it sounds
> >> > to me that you think it's a policy problem, i.e., the highatomic
> >> > unreserving policy is wrong or not properly implemented?
> >>
> >> Yeah I'd say not properly implemented, but that sounds like a mechanism, not
> >> policy problem to me :)
> >
> > What about adding a new counter to keep track of the size of free
> > pages reserved for highatomic?
>
> That's doable but not so trivial and means starting to handle the highatomic
> pageblocks much more carefully, like we do with CMA pageblocks and
> NR_FREE_CMA_PAGES counter, otherwise we risk drifting the counter unrecoverably.

The counter would be protected by the zone lock:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..86c63d48c08e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -824,6 +824,7 @@ struct zone {
 	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
+	unsigned long nr_free_highatomic;
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afab64814dc..4d8031817c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -644,6 +644,17 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
 		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
 }
 
+static void account_highatomic_freepages(struct zone *zone, unsigned int order, int old_mt, int new_mt)
+{
+	int nr_pages = 1 << order;
+
+	if (is_migrate_highatomic(old_mt))
+		zone->nr_free_highatomic -= nr_pages;
+
+	if (is_migrate_highatomic(new_mt))
+		zone->nr_free_highatomic += nr_pages;
+}
+
 /* Used for pages not on another list */
 static inline void __add_to_free_list(struct page *page, struct zone *zone,
 				      unsigned int order, int migratetype,
@@ -660,6 +671,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	account_highatomic_freepages(zone, order, -1, migratetype);
 }
 
 /*
@@ -681,6 +694,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 
 	account_freepages(zone, -(1 << order), old_mt);
 	account_freepages(zone, 1 << order, new_mt);
+
+	account_highatomic_freepages(zone, order, old_mt, new_mt);
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
@@ -698,6 +713,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	account_highatomic_freepages(zone, order, migratetype, -1);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -3085,7 +3102,7 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	 * over-estimate the size of the atomic reserve but it avoids a search.
 	 */
 	if (likely(!(alloc_flags & ALLOC_RESERVES)))
-		unusable_free += z->nr_reserved_highatomic;
+		unusable_free += z->nr_free_highatomic;
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
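
To make the estimation gap concrete, here is a small self-contained user-space sketch (not part of the patch above; the struct, helper and numbers are hypothetical stand-ins for the zone fields discussed in this thread). It shows how subtracting the whole reservation can make a zone look short of usable memory even though most of the reserved pageblocks are already allocated and hence not on the free lists at all:

```c
/*
 * Illustrative user-space sketch (not kernel code): why subtracting
 * nr_reserved_highatomic overestimates the unusable part of the free
 * list, while a counter of actually-free highatomic pages does not.
 * Field names mirror the discussion above; the numbers are made up.
 */
#include <stdio.h>

struct zone_estimate {
	unsigned long free_pages;             /* all free pages in the zone */
	unsigned long nr_reserved_highatomic; /* reserved highatomic, free or not */
	unsigned long nr_free_highatomic;     /* actually free highatomic pages */
};

/* Rough analogue of the __zone_watermark_unusable_free() adjustment. */
static unsigned long usable_free(const struct zone_estimate *z, int use_free_counter)
{
	unsigned long unusable = use_free_counter ? z->nr_free_highatomic
						  : z->nr_reserved_highatomic;

	return z->free_pages > unusable ? z->free_pages - unusable : 0;
}

int main(void)
{
	/* Hypothetical zone: 1 GB reserved as highatomic, ~90% of it in use. */
	struct zone_estimate z = {
		.free_pages             = 300000,
		.nr_reserved_highatomic = 262144, /* 1 GB in 4 KB pages */
		.nr_free_highatomic     = 26214,  /* only ~10% still free */
	};

	printf("usable free, old estimate: %lu pages\n", usable_free(&z, 0)); /* 37856 */
	printf("usable free, new estimate: %lu pages\n", usable_free(&z, 1)); /* 273786 */
	return 0;
}
```

With these made-up numbers the old estimate reports roughly 38 thousand usable pages while the new one reports about 274 thousand, which is the difference between tripping the watermark check (and potentially declaring OOM prematurely) and passing it comfortably.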
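
On the drift concern, a toy user-space model may help (again hypothetical names, ignoring per-cpu lists, locking and every other real-world complication): it mimics the three hook points in the patch above, add, move and delete on the free list, and checks that the incrementally maintained counter matches a full recount as long as every transition funnels through the same helper.

```c
/*
 * Toy model (not kernel code). Every free-list mutation goes through
 * account(), so nr_free_highatomic stays equal to a full recount.
 */
#include <assert.h>
#include <stdio.h>

#define MIGRATE_MOVABLE    0
#define MIGRATE_HIGHATOMIC 1
#define NBLOCKS            8

static int free_order[NBLOCKS];  /* -1 = not on the free list */
static int block_mt[NBLOCKS];    /* migratetype while on the free list */
static long nr_free_highatomic;  /* the counter being maintained */

/* Same shape as account_highatomic_freepages(): -1 means "no list". */
static void account(int order, int old_mt, int new_mt)
{
	long nr_pages = 1L << order;

	if (old_mt == MIGRATE_HIGHATOMIC)
		nr_free_highatomic -= nr_pages;
	if (new_mt == MIGRATE_HIGHATOMIC)
		nr_free_highatomic += nr_pages;
}

static void add_to_free(int i, int order, int mt)
{
	free_order[i] = order;
	block_mt[i] = mt;
	account(order, -1, mt);
}

static void move_free(int i, int new_mt)
{
	account(free_order[i], block_mt[i], new_mt);
	block_mt[i] = new_mt;
}

static void del_from_free(int i)
{
	account(free_order[i], block_mt[i], -1);
	free_order[i] = -1;
}

/* The expensive alternative the counter is meant to avoid. */
static long recount(void)
{
	long total = 0;

	for (int i = 0; i < NBLOCKS; i++)
		if (free_order[i] >= 0 && block_mt[i] == MIGRATE_HIGHATOMIC)
			total += 1L << free_order[i];
	return total;
}

int main(void)
{
	for (int i = 0; i < NBLOCKS; i++)
		free_order[i] = -1;

	add_to_free(0, 9, MIGRATE_MOVABLE);     /* order-9 block freed */
	add_to_free(1, 9, MIGRATE_HIGHATOMIC);
	move_free(0, MIGRATE_HIGHATOMIC);       /* reserve it */
	del_from_free(1);                       /* highatomic allocation */
	move_free(0, MIGRATE_MOVABLE);          /* unreserve */

	assert(nr_free_highatomic == recount());
	printf("counter=%ld recount=%ld\n", nr_free_highatomic, recount());
	return 0;
}
```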