Subject: Re: [PATCH v4] mm: Proactive compaction
To: Nitin Gupta, Mel Gorman, Michal Hocko
Cc: Matthew Wilcox, Andrew Morton, Mike Kravetz, Joonsoo Kim,
 David Rientjes, Nitin Gupta, linux-kernel, linux-mm, Linux API
References: <20200428221055.598-1-nigupta@nvidia.com>
From: Vlastimil Babka
Message-ID: <28993c4d-adc6-b83e-66a6-abb0a753f481@suse.cz>
Date: Fri, 15 May 2020 20:01:57 +0200
In-Reply-To: <20200428221055.598-1-nigupta@nvidia.com>

On 4/29/20 12:10 AM, Nitin Gupta wrote:
> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. The Linux kernel currently does
> on-demand compaction as we request more hugepages, but this style of
> compaction incurs very high latency.
> Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that the kernel is
> able to restore a highly fragmented memory state to a fairly compacted
> memory state within <1 sec for a 32G system. Such data suggests that a
> more proactive compaction can help us allocate a large fraction of
> memory as hugepages while keeping allocation latencies low.
>
> For a more proactive compaction, the approach taken here is to define
> a new tunable called 'proactiveness' which dictates bounds for external
> fragmentation wrt the HUGETLB_PAGE_ORDER order which kcompactd tries to
> maintain.
>
> The tunable is exposed through sysfs:
> /sys/kernel/mm/compaction/proactiveness

I would prefer sysctl. Why? During the mm evolution we seem to have
ended up with stuff scattered over several places:

/proc/sys aka sysctl:
/proc/sys/vm/compact_unevictable_allowed
/proc/sys/vm/compact_memory - write-only one-time action trigger!

/sys/kernel/mm: e.g. /sys/kernel/mm/transparent_hugepage/

This is unfortunate enough, and (influenced by my recent dive into
sysctl, perhaps :) I would have preferred sysctl only. In this case it
would also be consistent, as we have sysctls for compaction already,
while this introduces a whole new compaction directory in the
/sys/kernel/mm/ space.

> It takes a value in the range [0, 100], with a default of 20.
>
> Note that a previous version of this patch [1] was found to introduce
> too many tunables (per-order extfrag{low, high}), but this one reduces
> them to just one (proactiveness). Also, the new tunable is an opaque
> value instead of asking for specific bounds of "external
> fragmentation", which would have been difficult to estimate. The
> internal interpretation of this opaque value allows for future
> fine-tuning.
>
> Currently, we use a simple translation from this tunable to [low, high]
> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
> The score for a node is defined as the weighted mean of per-zone
> external fragmentation wrt the HUGETLB_PAGE_ORDER order. A zone's
> present_pages determines its weight.
>
> To periodically check per-node scores, we reuse the per-node kcompactd
> threads, which are woken up every 500 milliseconds to check the same.
> If a node's score exceeds its high threshold (as derived from the
> user-provided proactiveness value), proactive compaction is started
> until its score reaches its low threshold value. By default,
> proactiveness is set to 20, which implies threshold values of low=80
> and high=90.
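
Just to check that I read the mechanism right, here is how I understand
the score/threshold logic, as a rough sketch of my own (frag_score_node()
and the loop are made-up illustration; the per-zone weighting mirrors
fragmentation_score_zone() quoted further down):

static unsigned int frag_score_node(pg_data_t *pgdat)
{
	unsigned int score = 0;
	int zoneid;

	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
		struct zone *zone = &pgdat->node_zones[zoneid];

		if (!populated_zone(zone))
			continue;
		/* per-zone extfrag weighted by the zone's share of the node */
		score += div64_ul((u64)zone->present_pages *
				  extfrag_for_order(zone, HUGETLB_PAGE_ORDER),
				  pgdat->node_present_pages + 1);
	}
	return score;	/* 0..100; large zones dominate the sum */
}

/*
 * In kcompactd, every 500 ms:
 *   low  = 100 - proactiveness;	// 80 for the default of 20
 *   high = low + 10;			// 90
 *   if (frag_score_node(pgdat) > high)
 *           compact until frag_score_node(pgdat) <= low;
 */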

> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
>
> Performance data
> ================
>
> System: x86_64, 1T RAM, 80 CPU threads.
> Kernel: 5.6.0-rc3 + this patch
>
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>
> Before starting the driver, the system was fragmented from a userspace
> program that allocates all memory and then, for each 2M aligned
> section, frees 3/4 of the base pages using munmap. The workload is
> mainly anonymous userspace pages, which are easy to move around. I
> intentionally avoided unmovable pages in this test to see how much
> latency we incur when hugepage allocations hit direct compaction.
>
> 1. Kernel hugepage allocation latencies
>
> With the system in such a fragmented state, a kernel driver then
> allocates as many hugepages as possible and measures allocation
> latency:
>
> (all latency values are in microseconds)
>
> - With vanilla 5.6.0-rc3
>
> echo 0 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> percentile latency
> –––––––––– –––––––
>          5    7894
>         10    9496
>         25   12561
>         30   15295
>         40   18244
>         50   21229
>         60   27556
>         75   30147
>         80   31047
>         90   32859
>         95   33799
>
> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
> echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> percentile latency
> –––––––––– –––––––
>          5       2
>         10       2
>         25       3
>         30       3
>         40       3
>         50       4
>         60       4
>         75       4
>         80       4
>         90       5
>         95     429
>
> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
>
> 2. JAVA heap allocation
>
> In this test, we first fragment memory using the same method as in (1).
>
> Then, we start a Java process with a heap size set to 700G and request
> the heap to be allocated with THP hugepages. We also set THP to madvise
> to allow hugepage backing of this heap.
>
> /usr/bin/time
>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>
> The above command allocates 700G of Java heap using hugepages.
>
> - With vanilla 5.6.0-rc3
>
> 17.39user 1666.48system 27:37.89elapsed
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
> 8.35user 194.58system 3:19.62elapsed

I still wonder how the single additional CPU used during compaction
resulted in such an improvement. Isn't this against Amdahl's law? :)

> Elapsed time remains around 3:15 as proactiveness is further increased.
>
> Note that proactive compaction happens throughout the runtime of these
> workloads. The situation of one-time compaction, sufficient to supply
> hugepages for the following allocation stream, can probably happen for
> more extreme proactiveness values, like 80 or 90.
>
> In the above Java workload, proactiveness is set to 20. The test starts
> with a node's score of 80 or higher, depending on the delay between the
> fragmentation step and starting the benchmark, which gives more or less
> time for the initial round of compaction. As the benchmark consumes
> hugepages, the node's score quickly rises above the high threshold (90)
> and proactive compaction starts again, which brings the score down to
> the low threshold level (80). Repeat.
>
> bpftrace also confirms proactive compaction running 20+ times during
> the runtime of this Java benchmark. kcompactd threads consume 100% of
> one of the CPUs while trying to bring a node's score within thresholds.
>
> Backoff behavior
> ================
>
> The above workloads produce a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off.
> To test this aspect:
>
> - Created a kernel driver that allocates almost all memory as hugepages
>   followed by freeing the first 3/4 of each hugepage.
> - Set proactiveness=40
> - Note that proactive_compact_node() is deferred the maximum number of
>   times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>   (=> ~30 seconds between retries).
>
> [1] https://patchwork.kernel.org/patch/11098289/
>
> Signed-off-by: Nitin Gupta
> To: Mel Gorman

I hope Mel can also comment on this, but in general I agree.

...

> +
> +/*
> + * A zone's fragmentation score is the external fragmentation wrt the
> + * HUGETLB_PAGE_ORDER scaled by the zone's size. It returns a value in
> + * the range [0, 100].
> + *
> + * The scaling factor ensures that proactive compaction focuses on
> + * larger zones like ZONE_NORMAL, rather than smaller, specialized
> + * zones like ZONE_DMA32. For smaller zones, the score value remains
> + * close to zero, and thus never exceeds the high threshold for
> + * proactive compaction.
> + */
> +static int fragmentation_score_zone(struct zone *zone)
> +{
> +	unsigned long score;
> +
> +	score = zone->present_pages *
> +			extfrag_for_order(zone, HUGETLB_PAGE_ORDER);

HPAGE_PMD_ORDER would be a better match than HUGETLB_PAGE_ORDER, even if
it might be the same number. hugetlb pages are pre-reserved, unlike THP.

> +	score = div64_ul(score,
> +			node_present_pages(zone->zone_pgdat->node_id) + 1);

zone->zone_pgdat->node_present_pages is more direct.

> +	return score;
> +}
> +
> +/*

> @@ -2309,6 +2411,7 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> 		.alloc_flags = alloc_flags,
> 		.classzone_idx = classzone_idx,
> 		.direct_compaction = true,
> +		.proactive_compaction = false,

false, 0, NULL etc. are implicitly initialized with this kind of
initialization (also in other places of the patch).

> 		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
> 		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> 		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)

> @@ -2412,6 +2515,42 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> 	return rc;
> }
>

> @@ -2500,6 +2640,63 @@ void compaction_unregister_node(struct node *node)
> }
> #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>

> +#ifdef CONFIG_SYSFS
> +
> +#define COMPACTION_ATTR_RO(_name) \
> +	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> +
> +#define COMPACTION_ATTR(_name) \
> +	static struct kobj_attribute _name##_attr = \
> +		__ATTR(_name, 0644, _name##_show, _name##_store)
> +
> +static struct kobject *compaction_kobj;
> +
> +static ssize_t proactiveness_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long input;
> +
> +	err = kstrtoul(buf, 10, &input);
> +	if (err)
> +		return err;
> +	if (input > 100)
> +		return -EINVAL;

The sysctl way also allows specifying min/max in the descriptor and
using the generic handler.
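
Something along these lines would do it (an untested sketch on my side;
the variable name, table name and procname are made up for illustration,
only struct ctl_table, proc_dointvec_minmax and SYSCTL_ZERO are existing
kernel interfaces):

#include <linux/sysctl.h>

/* default of 20 taken from the patch */
static int sysctl_compaction_proactiveness = 20;
static int one_hundred = 100;

static struct ctl_table compaction_table[] = {
	{
		.procname	= "compaction_proactiveness",
		.data		= &sysctl_compaction_proactiveness,
		.maxlen		= sizeof(sysctl_compaction_proactiveness),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		/* generic handler rejects writes outside [0, 100] */
		.extra1		= SYSCTL_ZERO,
		.extra2		= &one_hundred,
	},
	{ }
};

That could be registered with register_sysctl("vm", ...) or the entry
folded into the existing vm_table in kernel/sysctl.c, so the knob would
show up next to the other compaction sysctls and the dedicated
store/show functions plus the new kobject directory could go away.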