From: Nitin Gupta
Date: Fri, 15 May 2020 17:50:30 -0700
Subject: Re: [PATCH v4] mm: Proactive compaction
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta, Mel Gorman, Michal Hocko, Matthew Wilcox, Andrew Morton,
 Mike Kravetz, Joonsoo Kim, David Rientjes, linux-kernel, linux-mm, Linux API
In-Reply-To: <28993c4d-adc6-b83e-66a6-abb0a753f481@suse.cz>

On Fri, May 15, 2020 at 11:02 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 4/29/20 12:10 AM, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. The Linux kernel currently does
> > on-demand compaction as we request more hugepages, but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that the
> > kernel can restore a highly fragmented memory state to a fairly
> > compacted state in under 1 second on a 32G system. Such data suggests
> > that a more proactive compaction can help us allocate a large fraction
> > of memory as hugepages while keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > a new tunable called 'proactiveness', which dictates bounds for the
> > external fragmentation wrt the HUGETLB_PAGE_ORDER order that kcompactd
> > tries to maintain.
> >
> > The tunable is exposed through sysfs:
> >   /sys/kernel/mm/compaction/proactiveness
>
> I would prefer sysctl. Why?
>
> During the mm evolution we seem to have ended up with stuff scattered
> over several places:
>
> /proc/sys aka sysctl:
> /proc/sys/vm/compact_unevictable_allowed
> /proc/sys/vm/compact_memory - write-only one-time action trigger!
>
> /sys/kernel/mm:
> e.g. /sys/kernel/mm/transparent_hugepage/
>
> This is unfortunate enough, and (influenced by my recent dive into sysctl
> perhaps :), I would have preferred sysctl only. In this case it is
> consistent, since we already have sysctls for compaction, while this patch
> introduces a whole new compaction directory in the /sys/kernel/mm/ space.
>

I have now replaced this sysfs node with the vm.compaction_proactiveness
sysctl.
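The sysctl wiring I have in mind for v5 looks roughly like this (untested
sketch; the exact variable name and placement in kernel/sysctl.c may differ
in the posted patch):

/* mm/compaction.c: backing variable for the new tunable */
int sysctl_compaction_proactiveness = 20;

/* kernel/sysctl.c: entry in vm_table; proc_dointvec_minmax enforces the
 * [0, 100] range via extra1/extra2, so no hand-rolled bounds check is
 * needed in a custom store handler.
 */
{
	.procname	= "compaction_proactiveness",
	.data		= &sysctl_compaction_proactiveness,
	.maxlen		= sizeof(sysctl_compaction_proactiveness),
	.mode		= 0644,
	.proc_handler	= proc_dointvec_minmax,
	.extra1		= SYSCTL_ZERO,
	.extra2		= &one_hundred,
},

Userspace then sets it with "sysctl -w vm.compaction_proactiveness=20" (or
by writing to /proc/sys/vm/compaction_proactiveness) instead of the sysfs
path.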
> > It takes a value in the range [0, 100], with a default of 20.
> >
> > Note that a previous version of this patch [1] was found to introduce
> > too many tunables (per-order extfrag{low, high}), but this one reduces
> > them to just one (proactiveness). Also, the new tunable is an opaque
> > value instead of asking for specific bounds of "external fragmentation",
> > which would have been difficult to estimate. The internal interpretation
> > of this opaque value allows for future fine-tuning.
> >
> > Currently, we use a simple translation from this tunable to [low, high]
> > "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
> > The score for a node is defined as the weighted mean of per-zone
> > external fragmentation wrt the HUGETLB_PAGE_ORDER order. A zone's
> > present_pages determines its weight.
> >
> > To periodically check per-node scores, we reuse the per-node kcompactd
> > threads, which are woken up every 500 milliseconds for this check. If a
> > node's score exceeds its high threshold (as derived from the
> > user-provided proactiveness value), proactive compaction is started
> > until its score reaches its low threshold value. By default,
> > proactiveness is set to 20, which implies threshold values of low=80
> > and high=90.
> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> >
> > Performance data
> > ================
> >
> > System: x86_64, 1T RAM, 80 CPU threads.
> > Kernel: 5.6.0-rc3 + this patch
> >
> > echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
> >
> > Before starting the driver, the system was fragmented from a userspace
> > program that allocates all memory and then, for each 2M aligned section,
> > frees 3/4 of the base pages using munmap. The workload is mainly
> > anonymous userspace pages, which are easy to move around. I
> > intentionally avoided unmovable pages in this test to see how much
> > latency we incur when hugepage allocations hit direct compaction.
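(As an aside, the fragmentation step boils down to a loop like the one
below -- a simplified, hypothetical sketch rather than the exact tool used
for these runs; the mapping size here is just a placeholder:)

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SECTION_SIZE	(2UL << 20)	/* 2M */

int main(void)
{
	unsigned long len = 64UL << 30;	/* size it to (nearly) fill free RAM */
	unsigned long addr, end;
	char *p;

	/* Populate one large anonymous mapping up front */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* First 2M-aligned boundary inside the mapping */
	addr = ((unsigned long)p + SECTION_SIZE - 1) & ~(SECTION_SIZE - 1);
	end = (unsigned long)p + len;

	/*
	 * Unmap the first 3/4 of each 2M-aligned section, leaving every
	 * section only 1/4 populated => heavily fragmented free memory.
	 */
	for (; addr + SECTION_SIZE <= end; addr += SECTION_SIZE)
		munmap((void *)addr, 3 * (SECTION_SIZE / 4));

	pause();	/* hold the fragmented state while the test runs */
	return 0;
}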
> > 1. Kernel hugepage allocation latencies
> >
> > With the system in such a fragmented state, a kernel driver then
> > allocates as many hugepages as possible and measures allocation latency:
> >
> > (all latency values are in microseconds)
> >
> > - With vanilla 5.6.0-rc3
> >
> > echo 0 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
> >
> >   percentile latency
> >   –––––––––– –––––––
> >            5    7894
> >           10    9496
> >           25   12561
> >           30   15295
> >           40   18244
> >           50   21229
> >           60   27556
> >           75   30147
> >           80   31047
> >           90   32859
> >           95   33799
> >
> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
> >
> >   percentile latency
> >   –––––––––– –––––––
> >            5       2
> >           10       2
> >           25       3
> >           30       3
> >           40       3
> >           50       4
> >           60       4
> >           75       4
> >           80       4
> >           90       5
> >           95     429
> >
> > Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as hugepages)
> >
> > 2. Java heap allocation
> >
> > In this test, we first fragment memory using the same method as in (1).
> >
> > Then, we start a Java process with a heap size set to 700G and request
> > the heap to be allocated with THP hugepages. We also set THP to madvise
> > to allow hugepage backing of this heap.
> >
> > /usr/bin/time
> >  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
> >
> > The above command allocates 700G of Java heap using hugepages.
> >
> > - With vanilla 5.6.0-rc3
> >
> > 17.39user 1666.48system 27:37.89elapsed
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > 8.35user 194.58system 3:19.62elapsed
>
> I still wonder how the single additional CPU during compaction resulted in
> such an improvement. Isn't this against Amdahl's law? :)
>

The speedup comes from avoiding the direct compaction path most of the
time, so in effect we are speeding up the "serial" part of user
applications (back-to-back memory allocations).

> > Elapsed time remains around 3:15 as proactiveness is increased further.
> >
> > Note that proactive compaction happens throughout the runtime of these
> > workloads. The situation of a one-time compaction, sufficient to supply
> > hugepages for the following allocation stream, can probably happen for
> > more extreme proactiveness values, like 80 or 90.
> >
> > In the above Java workload, proactiveness is set to 20. The test starts
> > with a node's score of 80 or higher, depending on the delay between the
> > fragmentation step and starting the benchmark, which gives more or less
> > time for the initial round of compaction. As the benchmark consumes
> > hugepages, the node's score quickly rises above the high threshold (90)
> > and proactive compaction starts again, which brings the score back down
> > to the low threshold level (80). Repeat.
> >
> > bpftrace also confirms proactive compaction running 20+ times during
> > the runtime of this Java benchmark. A kcompactd thread consumes 100% of
> > one of the CPUs while it tries to bring a node's score within thresholds.
> >
> > Backoff behavior
> > ================
> >
> > The above workloads produce a memory state which is easy to compact.
> > However, if memory is filled with unmovable pages, proactive compaction
> > should essentially back off. To test this aspect:
> >
> > - Created a kernel driver that allocates almost all memory as hugepages
> >   followed by freeing the first 3/4 of each hugepage.
> > - Set proactiveness=40
> > - Note that proactive_compact_node() is deferred the maximum number of
> >   times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
> >   (=> ~30 seconds between retries).
> >
> > [1] https://patchwork.kernel.org/patch/11098289/
> >
> > Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
> > To: Mel Gorman <mgorman@techsingularity.net>
>
> I hope Mel can also comment on this, but in general I agree.
>
> ...
>
> > +
> > +/*
> > + * A zone's fragmentation score is the external fragmentation wrt the
> > + * HUGETLB_PAGE_ORDER scaled by the zone's size. It returns a value in
> > + * the range [0, 100].
> > + *
> > + * The scaling factor ensures that proactive compaction focuses on
> > + * larger zones like ZONE_NORMAL, rather than smaller, specialized
> > + * zones like ZONE_DMA32. For smaller zones, the score value remains
> > + * close to zero, and thus never exceeds the high threshold for
> > + * proactive compaction.
> > + */
> > +static int fragmentation_score_zone(struct zone *zone)
> > +{
> > +	unsigned long score;
> > +
> > +	score = zone->present_pages *
> > +			extfrag_for_order(zone, HUGETLB_PAGE_ORDER);
>
> HPAGE_PMD_ORDER would be a better match than HUGETLB_PAGE_ORDER, even if
> it might be the same number. hugetlb pages are pre-reserved, unlike THP.
>

Ok, I will change to HPAGE_PMD_ORDER.
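With that change, and also folding in your next comment below about
node_present_pages, the helper would end up looking roughly like this
(untested sketch of what I plan for v5):

static int fragmentation_score_zone(struct zone *zone)
{
	unsigned long score;

	/* Weight the zone's external fragmentation by its size */
	score = zone->present_pages *
			extfrag_for_order(zone, HPAGE_PMD_ORDER);
	/* Scale down by the node size; +1 avoids division by zero */
	return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
}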
> > +	score = div64_ul(score,
> > +			node_present_pages(zone->zone_pgdat->node_id) + 1);
>
> zone->zone_pgdat->node_present_pages is more direct
>

Ok.

> > +	return score;
> > +}
> > +
> > +/*
>
> > @@ -2309,6 +2411,7 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> >  		.alloc_flags = alloc_flags,
> >  		.classzone_idx = classzone_idx,
> >  		.direct_compaction = true,
> > +		.proactive_compaction = false,
>
> false, 0, NULL etc. are implicitly initialized with this kind of
> initialization (also in other places of the patch)
>

Hmm, I will remove these redundant initializations.

> >  		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
> >  		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> >  		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> > @@ -2412,6 +2515,42 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> >  	return rc;
> >  }
> >
> > @@ -2500,6 +2640,63 @@ void compaction_unregister_node(struct node *node)
> >  }
> >  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
> >
> > +#ifdef CONFIG_SYSFS
> > +
> > +#define COMPACTION_ATTR_RO(_name) \
> > +	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> > +
> > +#define COMPACTION_ATTR(_name) \
> > +	static struct kobj_attribute _name##_attr = \
> > +		__ATTR(_name, 0644, _name##_show, _name##_store)
> > +
> > +static struct kobject *compaction_kobj;
> > +
> > +static ssize_t proactiveness_store(struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count)
> > +{
> > +	int err;
> > +	unsigned long input;
> > +
> > +	err = kstrtoul(buf, 10, &input);
> > +	if (err)
> > +		return err;
> > +	if (input > 100)
> > +		return -EINVAL;
>
> The sysctl way also allows you to specify min/max in the descriptor and
> use the generic handler.
>

Thanks for pointing me to sysctl; it deletes ~50 lines from the patch :)

I will post v5 soon with the above changes.

Nitin