From: Nitin Gupta
Date: Fri, 15 May 2020 17:50:30 -0700
Subject: Re: [PATCH v4] mm: Proactive compaction
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta, Mel Gorman, Michal Hocko, Matthew Wilcox, Andrew Morton,
 Mike Kravetz, Joonsoo Kim, David Rientjes, linux-kernel, linux-mm, Linux API
In-Reply-To: <28993c4d-adc6-b83e-66a6-abb0a753f481@suse.cz>

On Fri, May 15, 2020 at 11:02 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 4/29/20 12:10 AM, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. The Linux kernel currently does
> > on-demand compaction as we request more hugepages, but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that the
> > kernel can restore a highly fragmented memory state to a fairly
> > compacted state in under 1 second on a 32G system. Such data suggests
> > that a more proactive compaction can help us allocate a large fraction
> > of memory as hugepages while keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > a new tunable called 'proactiveness', which dictates bounds for the
> > external fragmentation wrt the HUGETLB_PAGE_ORDER order that kcompactd
> > tries to maintain.
> >
> > The tunable is exposed through sysfs:
> >   /sys/kernel/mm/compaction/proactiveness
>
> I would prefer sysctl. Why?
>
> During the mm evolution we seem to have ended up with stuff scattered
> over several places:
>
> /proc/sys aka sysctl:
> /proc/sys/vm/compact_unevictable_allowed
> /proc/sys/vm/compact_memory - write-only one-time action trigger!
>
> /sys/kernel/mm:
> e.g. /sys/kernel/mm/transparent_hugepage/
>
> This is unfortunate enough, and (influenced by my recent dive into sysctl
> perhaps :), I would have preferred sysctl only. In this case it is
> consistent, since we already have sysctls for compaction, while this patch
> introduces a whole new compaction directory in the /sys/kernel/mm/ space.
>

I have now replaced this sysfs node with the vm.compaction_proactiveness
sysctl.
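The sysctl wiring I have in mind for v5 looks roughly like this (untested
sketch; the exact variable name and placement in kernel/sysctl.c may differ
in the posted patch):

/* mm/compaction.c: backing variable for the new tunable */
int sysctl_compaction_proactiveness = 20;

/* kernel/sysctl.c: entry in vm_table; proc_dointvec_minmax enforces the
 * [0, 100] range via extra1/extra2, so no hand-rolled bounds check is
 * needed in a custom store handler.
 */
{
	.procname	= "compaction_proactiveness",
	.data		= &sysctl_compaction_proactiveness,
	.maxlen		= sizeof(sysctl_compaction_proactiveness),
	.mode		= 0644,
	.proc_handler	= proc_dointvec_minmax,
	.extra1		= SYSCTL_ZERO,
	.extra2		= &one_hundred,
},

Userspace then sets it with "sysctl -w vm.compaction_proactiveness=20" (or
by writing to /proc/sys/vm/compaction_proactiveness) instead of the sysfs
path.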
> > It takes a value in the range [0, 100], with a default of 20.
> >
> > Note that a previous version of this patch [1] was found to introduce
> > too many tunables (per-order extfrag{low, high}), but this one reduces
> > them to just one (proactiveness). Also, the new tunable is an opaque
> > value instead of asking for specific bounds of "external fragmentation",
> > which would have been difficult to estimate. The internal interpretation
> > of this opaque value allows for future fine-tuning.
> >
> > Currently, we use a simple translation from this tunable to [low, high]
> > "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
> > The score for a node is defined as the weighted mean of per-zone
> > external fragmentation wrt the HUGETLB_PAGE_ORDER order. A zone's
> > present_pages determines its weight.
> >
> > To periodically check per-node scores, we reuse the per-node kcompactd
> > threads, which are woken up every 500 milliseconds for this check. If a
> > node's score exceeds its high threshold (as derived from the
> > user-provided proactiveness value), proactive compaction is started
> > until its score reaches its low threshold value. By default,
> > proactiveness is set to 20, which implies threshold values of low=80
> > and high=90.
> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> >
> > Performance data
> > ================
> >
> > System: x86_64, 1T RAM, 80 CPU threads.
> > Kernel: 5.6.0-rc3 + this patch
> >
> > echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
> >
> > Before starting the driver, the system was fragmented from a userspace
> > program that allocates all memory and then, for each 2M aligned section,
> > frees 3/4 of the base pages using munmap. The workload is mainly
> > anonymous userspace pages, which are easy to move around. I
> > intentionally avoided unmovable pages in this test to see how much
> > latency we incur when hugepage allocations hit direct compaction.
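(As an aside, the fragmentation step boils down to a loop like the one
below -- a simplified, hypothetical sketch rather than the exact tool used
for these runs; the mapping size here is just a placeholder:)

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SECTION_SIZE	(2UL << 20)	/* 2M */

int main(void)
{
	unsigned long len = 64UL << 30;	/* size it to (nearly) fill free RAM */
	unsigned long addr, end;
	char *p;

	/* Populate one large anonymous mapping up front */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* First 2M-aligned boundary inside the mapping */
	addr = ((unsigned long)p + SECTION_SIZE - 1) & ~(SECTION_SIZE - 1);
	end = (unsigned long)p + len;

	/*
	 * Unmap the first 3/4 of each 2M-aligned section, leaving every
	 * section only 1/4 populated => heavily fragmented free memory.
	 */
	for (; addr + SECTION_SIZE <= end; addr += SECTION_SIZE)
		munmap((void *)addr, 3 * (SECTION_SIZE / 4));

	pause();	/* hold the fragmented state while the test runs */
	return 0;
}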
> > 1. Kernel hugepage allocation latencies
> >
> > With the system in such a fragmented state, a kernel driver then
> > allocates as many hugepages as possible and measures allocation latency:
> >
> > (all latency values are in microseconds)
> >
> > - With vanilla 5.6.0-rc3
> >
> > echo 0 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
> >
> >   percentile latency
> >   –––––––––– –––––––
> >            5    7894
> >           10    9496
> >           25   12561
> >           30   15295
> >           40   18244
> >           50   21229
> >           60   27556
> >           75   30147
> >           80   31047
> >           90   32859
> >           95   33799
> >
> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
> >
> >   percentile latency
> >   –––––––––– –––––––
> >            5       2
> >           10       2
> >           25       3
> >           30       3
> >           40       3
> >           50       4
> >           60       4
> >           75       4
> >           80       4
> >           90       5
> >           95     429
> >
> > Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as hugepages)
> >
> > 2. Java heap allocation
> >
> > In this test, we first fragment memory using the same method as in (1).
> >
> > Then, we start a Java process with a heap size set to 700G and request
> > the heap to be allocated with THP hugepages. We also set THP to madvise
> > to allow hugepage backing of this heap.
> >
> > /usr/bin/time
> >  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
> >
> > The above command allocates 700G of Java heap using hugepages.
> >
> > - With vanilla 5.6.0-rc3
> >
> > 17.39user 1666.48system 27:37.89elapsed
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > 8.35user 194.58system 3:19.62elapsed
>
> I still wonder how the single additional CPU during compaction resulted in
> such an improvement. Isn't this against Amdahl's law? :)
>

The speedup comes from avoiding the direct compaction path most of the
time, so in effect we are speeding up the "serial" part of user
applications (back-to-back memory allocations).

> > Elapsed time remains around 3:15 as proactiveness is increased further.
> >
> > Note that proactive compaction happens throughout the runtime of these
> > workloads. The situation of a one-time compaction, sufficient to supply
> > hugepages for the following allocation stream, can probably happen for
> > more extreme proactiveness values, like 80 or 90.
> >
> > In the above Java workload, proactiveness is set to 20. The test starts
> > with a node's score of 80 or higher, depending on the delay between the
> > fragmentation step and starting the benchmark, which gives more or less
> > time for the initial round of compaction. As the benchmark consumes
> > hugepages, the node's score quickly rises above the high threshold (90)
> > and proactive compaction starts again, which brings the score back down
> > to the low threshold level (80). Repeat.
> >
> > bpftrace also confirms proactive compaction running 20+ times during
> > the runtime of this Java benchmark. A kcompactd thread consumes 100% of
> > one of the CPUs while it tries to bring a node's score within thresholds.
> >
> > Backoff behavior
> > ================
> >
> > The above workloads produce a memory state which is easy to compact.
> > However, if memory is filled with unmovable pages, proactive compaction
> > should essentially back off. To test this aspect:
> >
> > - Created a kernel driver that allocates almost all memory as hugepages
> >   followed by freeing the first 3/4 of each hugepage.
> > - Set proactiveness=40
> > - Note that proactive_compact_node() is deferred the maximum number of
> >   times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
> >   (=> ~30 seconds between retries).
> >
> > [1] https://patchwork.kernel.org/patch/11098289/
> >
> > Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
> > To: Mel Gorman <mgorman@techsingularity.net>
>
> I hope Mel can also comment on this, but in general I agree.
>
> ...
>
> > +
> > +/*
> > + * A zone's fragmentation score is the external fragmentation wrt the
> > + * HUGETLB_PAGE_ORDER scaled by the zone's size. It returns a value in
> > + * the range [0, 100].
> > + *
> > + * The scaling factor ensures that proactive compaction focuses on
> > + * larger zones like ZONE_NORMAL, rather than smaller, specialized
> > + * zones like ZONE_DMA32. For smaller zones, the score value remains
> > + * close to zero, and thus never exceeds the high threshold for
> > + * proactive compaction.
> > + */
> > +static int fragmentation_score_zone(struct zone *zone)
> > +{
> > +	unsigned long score;
> > +
> > +	score = zone->present_pages *
> > +			extfrag_for_order(zone, HUGETLB_PAGE_ORDER);
>
> HPAGE_PMD_ORDER would be a better match than HUGETLB_PAGE_ORDER, even if
> it might be the same number. hugetlb pages are pre-reserved, unlike THP.
>

Ok, I will change to HPAGE_PMD_ORDER.
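With that change, and also folding in your next comment below about
node_present_pages, the helper would end up looking roughly like this
(untested sketch of what I plan for v5):

static int fragmentation_score_zone(struct zone *zone)
{
	unsigned long score;

	/* Weight the zone's external fragmentation by its size */
	score = zone->present_pages *
			extfrag_for_order(zone, HPAGE_PMD_ORDER);
	/* Scale down by the node size; +1 avoids division by zero */
	return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
}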
> > +	score = div64_ul(score,
> > +			node_present_pages(zone->zone_pgdat->node_id) + 1);
>
> zone->zone_pgdat->node_present_pages is more direct
>

Ok.

> > +	return score;
> > +}
> > +
> > +/*
>
> > @@ -2309,6 +2411,7 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> >  		.alloc_flags = alloc_flags,
> >  		.classzone_idx = classzone_idx,
> >  		.direct_compaction = true,
> > +		.proactive_compaction = false,
>
> false, 0, NULL etc. are implicitly initialized with this kind of
> initialization (also in other places of the patch)
>

Hmm, I will remove these redundant initializations.

> >  		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
> >  		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> >  		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> > @@ -2412,6 +2515,42 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> >  	return rc;
> >  }
> >
> > @@ -2500,6 +2640,63 @@ void compaction_unregister_node(struct node *node)
> >  }
> >  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
> >
> > +#ifdef CONFIG_SYSFS
> > +
> > +#define COMPACTION_ATTR_RO(_name) \
> > +	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> > +
> > +#define COMPACTION_ATTR(_name) \
> > +	static struct kobj_attribute _name##_attr = \
> > +		__ATTR(_name, 0644, _name##_show, _name##_store)
> > +
> > +static struct kobject *compaction_kobj;
> > +
> > +static ssize_t proactiveness_store(struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count)
> > +{
> > +	int err;
> > +	unsigned long input;
> > +
> > +	err = kstrtoul(buf, 10, &input);
> > +	if (err)
> > +		return err;
> > +	if (input > 100)
> > +		return -EINVAL;
>
> The sysctl way also allows you to specify min/max in the descriptor and
> use the generic handler.
>

Thanks for pointing me to sysctl; it deletes ~50 lines from the patch :)

I will post v5 soon with the above changes.

Nitin