Subject: Re: [PATCH v4] mm: Proactive compaction
To: Nitin Gupta, Mel Gorman, Michal Hocko
Cc: Matthew Wilcox, Andrew Morton, Mike Kravetz, Joonsoo Kim,
 David Rientjes, Nitin Gupta, linux-kernel, linux-mm, Linux API
References: <20200428221055.598-1-nigupta@nvidia.com>
From: Vlastimil Babka
Message-ID: <28993c4d-adc6-b83e-66a6-abb0a753f481@suse.cz>
Date: Fri, 15 May 2020 20:01:57 +0200
In-Reply-To: <20200428221055.598-1-nigupta@nvidia.com>

On 4/29/20 12:10 AM, Nitin Gupta wrote:
> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. The Linux kernel currently does
> on-demand compaction as we request more hugepages, but this style of
> compaction incurs very high latency.
> Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that the kernel is
> able to restore a highly fragmented memory state to a fairly compacted
> memory state within <1 sec for a 32G system. Such data suggests that a
> more proactive compaction can help us allocate a large fraction of
> memory as hugepages while keeping allocation latencies low.
>
> For a more proactive compaction, the approach taken here is to define
> a new tunable called 'proactiveness' which dictates bounds for external
> fragmentation wrt the HUGETLB_PAGE_ORDER order which kcompactd tries to
> maintain.
>
> The tunable is exposed through sysfs:
> /sys/kernel/mm/compaction/proactiveness

I would prefer sysctl. Why? During the mm evolution we seem to have
ended up with stuff scattered over several places:

/proc/sys aka sysctl:
/proc/sys/vm/compact_unevictable_allowed
/proc/sys/vm/compact_memory - write-only one-time action trigger!

/sys/kernel/mm: e.g. /sys/kernel/mm/transparent_hugepage/

This is unfortunate enough, and (influenced by my recent dive into
sysctl, perhaps :) I would have preferred sysctl only. In this case it
would also be consistent, as we have sysctls for compaction already,
while this introduces a whole new compaction directory in the
/sys/kernel/mm/ space.

> It takes a value in the range [0, 100], with a default of 20.
>
> Note that a previous version of this patch [1] was found to introduce
> too many tunables (per-order extfrag{low, high}), but this one reduces
> them to just one (proactiveness). Also, the new tunable is an opaque
> value instead of asking for specific bounds of "external
> fragmentation", which would have been difficult to estimate. The
> internal interpretation of this opaque value allows for future
> fine-tuning.
>
> Currently, we use a simple translation from this tunable to [low, high]
> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
> The score for a node is defined as the weighted mean of per-zone
> external fragmentation wrt the HUGETLB_PAGE_ORDER order. A zone's
> present_pages determines its weight.
>
> To periodically check per-node scores, we reuse the per-node kcompactd
> threads, which are woken up every 500 milliseconds to check the same.
> If a node's score exceeds its high threshold (as derived from the
> user-provided proactiveness value), proactive compaction is started
> until its score reaches its low threshold value. By default,
> proactiveness is set to 20, which implies threshold values of low=80
> and high=90.
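
Just to check that I read the mechanism right, here is how I understand
the score/threshold logic, as a rough sketch of my own (frag_score_node()
and the loop are made-up illustration; the per-zone weighting mirrors
fragmentation_score_zone() quoted further down):

static unsigned int frag_score_node(pg_data_t *pgdat)
{
	unsigned int score = 0;
	int zoneid;

	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
		struct zone *zone = &pgdat->node_zones[zoneid];

		if (!populated_zone(zone))
			continue;
		/* per-zone extfrag weighted by the zone's share of the node */
		score += div64_ul((u64)zone->present_pages *
				  extfrag_for_order(zone, HUGETLB_PAGE_ORDER),
				  pgdat->node_present_pages + 1);
	}
	return score;	/* 0..100; large zones dominate the sum */
}

/*
 * In kcompactd, every 500 ms:
 *   low  = 100 - proactiveness;	// 80 for the default of 20
 *   high = low + 10;			// 90
 *   if (frag_score_node(pgdat) > high)
 *           compact until frag_score_node(pgdat) <= low;
 */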

> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
>
> Performance data
> ================
>
> System: x86_64, 1T RAM, 80 CPU threads.
> Kernel: 5.6.0-rc3 + this patch
>
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>
> Before starting the driver, the system was fragmented from a userspace
> program that allocates all memory and then, for each 2M aligned
> section, frees 3/4 of the base pages using munmap. The workload is
> mainly anonymous userspace pages, which are easy to move around. I
> intentionally avoided unmovable pages in this test to see how much
> latency we incur when hugepage allocations hit direct compaction.
>
> 1. Kernel hugepage allocation latencies
>
> With the system in such a fragmented state, a kernel driver then
> allocates as many hugepages as possible and measures allocation
> latency:
>
> (all latency values are in microseconds)
>
> - With vanilla 5.6.0-rc3
>
> echo 0 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> percentile latency
> –––––––––– –––––––
>          5    7894
>         10    9496
>         25   12561
>         30   15295
>         40   18244
>         50   21229
>         60   27556
>         75   30147
>         80   31047
>         90   32859
>         95   33799
>
> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
> echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> percentile latency
> –––––––––– –––––––
>          5       2
>         10       2
>         25       3
>         30       3
>         40       3
>         50       4
>         60       4
>         75       4
>         80       4
>         90       5
>         95     429
>
> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
>
> 2. JAVA heap allocation
>
> In this test, we first fragment memory using the same method as in (1).
>
> Then, we start a Java process with a heap size set to 700G and request
> the heap to be allocated with THP hugepages. We also set THP to madvise
> to allow hugepage backing of this heap.
>
> /usr/bin/time
>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>
> The above command allocates 700G of Java heap using hugepages.
>
> - With vanilla 5.6.0-rc3
>
> 17.39user 1666.48system 27:37.89elapsed
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
> 8.35user 194.58system 3:19.62elapsed

I still wonder how the single additional CPU used during compaction
resulted in such an improvement. Isn't this against Amdahl's law? :)

> Elapsed time remains around 3:15 as proactiveness is further increased.
>
> Note that proactive compaction happens throughout the runtime of these
> workloads. The situation of one-time compaction, sufficient to supply
> hugepages for the following allocation stream, can probably happen for
> more extreme proactiveness values, like 80 or 90.
>
> In the above Java workload, proactiveness is set to 20. The test starts
> with a node's score of 80 or higher, depending on the delay between the
> fragmentation step and starting the benchmark, which gives more or less
> time for the initial round of compaction. As the benchmark consumes
> hugepages, the node's score quickly rises above the high threshold (90)
> and proactive compaction starts again, which brings the score down to
> the low threshold level (80). Repeat.
>
> bpftrace also confirms proactive compaction running 20+ times during
> the runtime of this Java benchmark. kcompactd threads consume 100% of
> one of the CPUs while trying to bring a node's score within thresholds.
>
> Backoff behavior
> ================
>
> The above workloads produce a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off.
> To test this aspect:
>
> - Created a kernel driver that allocates almost all memory as hugepages
>   followed by freeing the first 3/4 of each hugepage.
> - Set proactiveness=40
> - Note that proactive_compact_node() is deferred the maximum number of
>   times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>   (=> ~30 seconds between retries).
>
> [1] https://patchwork.kernel.org/patch/11098289/
>
> Signed-off-by: Nitin Gupta
> To: Mel Gorman

I hope Mel can also comment on this, but in general I agree.

...

> +
> +/*
> + * A zone's fragmentation score is the external fragmentation wrt the
> + * HUGETLB_PAGE_ORDER scaled by the zone's size. It returns a value in
> + * the range [0, 100].
> + *
> + * The scaling factor ensures that proactive compaction focuses on
> + * larger zones like ZONE_NORMAL, rather than smaller, specialized
> + * zones like ZONE_DMA32. For smaller zones, the score value remains
> + * close to zero, and thus never exceeds the high threshold for
> + * proactive compaction.
> + */
> +static int fragmentation_score_zone(struct zone *zone)
> +{
> +	unsigned long score;
> +
> +	score = zone->present_pages *
> +			extfrag_for_order(zone, HUGETLB_PAGE_ORDER);

HPAGE_PMD_ORDER would be a better match than HUGETLB_PAGE_ORDER, even if
it might be the same number. hugetlb pages are pre-reserved, unlike THP.

> +	score = div64_ul(score,
> +			node_present_pages(zone->zone_pgdat->node_id) + 1);

zone->zone_pgdat->node_present_pages is more direct.

> +	return score;
> +}
> +
> +/*

> @@ -2309,6 +2411,7 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> 		.alloc_flags = alloc_flags,
> 		.classzone_idx = classzone_idx,
> 		.direct_compaction = true,
> +		.proactive_compaction = false,

false, 0, NULL etc. are implicitly initialized with this kind of
initialization (also in other places of the patch).

> 		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
> 		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> 		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)

> @@ -2412,6 +2515,42 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> 	return rc;
> }
>

> @@ -2500,6 +2640,63 @@ void compaction_unregister_node(struct node *node)
> }
> #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>

> +#ifdef CONFIG_SYSFS
> +
> +#define COMPACTION_ATTR_RO(_name) \
> +	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> +
> +#define COMPACTION_ATTR(_name) \
> +	static struct kobj_attribute _name##_attr = \
> +		__ATTR(_name, 0644, _name##_show, _name##_store)
> +
> +static struct kobject *compaction_kobj;
> +
> +static ssize_t proactiveness_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long input;
> +
> +	err = kstrtoul(buf, 10, &input);
> +	if (err)
> +		return err;
> +	if (input > 100)
> +		return -EINVAL;

The sysctl way also allows specifying min/max in the descriptor and
using the generic handler.
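
Something along these lines would do it (an untested sketch on my side;
the variable name, table name and procname are made up for illustration,
only struct ctl_table, proc_dointvec_minmax and SYSCTL_ZERO are existing
kernel interfaces):

#include <linux/sysctl.h>

/* default of 20 taken from the patch */
static int sysctl_compaction_proactiveness = 20;
static int one_hundred = 100;

static struct ctl_table compaction_table[] = {
	{
		.procname	= "compaction_proactiveness",
		.data		= &sysctl_compaction_proactiveness,
		.maxlen		= sizeof(sysctl_compaction_proactiveness),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		/* generic handler rejects writes outside [0, 100] */
		.extra1		= SYSCTL_ZERO,
		.extra2		= &one_hundred,
	},
	{ }
};

That could be registered with register_sysctl("vm", ...) or the entry
folded into the existing vm_table in kernel/sysctl.c, so the knob would
show up next to the other compaction sysctls and the dedicated
store/show functions plus the new kobject directory could go away.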