From: Nitin Gupta <ngupta@nitingupta.dev>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta <nigupta@nvidia.com>,
Mel Gorman <mgorman@techsingularity.net>,
Michal Hocko <mhocko@suse.com>,
Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Mike Kravetz <mike.kravetz@oracle.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
David Rientjes <rientjes@google.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH v4] mm: Proactive compaction
Date: Fri, 15 May 2020 17:50:30 -0700 [thread overview]
Message-ID: <CAB6CXpDCfoU0mJvDJra4MN6omD1cnNgSLH2Qyy=f7e7uQhA4Cg@mail.gmail.com> (raw)
In-Reply-To: <28993c4d-adc6-b83e-66a6-abb0a753f481@suse.cz>
On Fri, May 15, 2020 at 11:02 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 4/29/20 12:10 AM, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. The Linux kernel currently does on-demand
> > compaction as we request more hugepages, but this style of compaction
> > incurs very high latency. Experiments with one-time full memory
> > compaction (followed by hugepage allocations) show that the kernel is able
> > to restore a highly fragmented memory state to a fairly compacted memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory as
> > hugepages while keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > a new tunable called 'proactiveness', which dictates bounds on external
> > fragmentation wrt the HUGETLB_PAGE_ORDER order that kcompactd tries to
> > maintain.
> >
> > The tunable is exposed through sysfs:
> > /sys/kernel/mm/compaction/proactiveness
>
> I would prefer sysctl. Why?
>
> During the mm evolution we seem to have ended up with stuff scattered over
> several places:
>
> /proc/sys aka sysctl:
> /proc/sys/vm/compact_unevictable_allowed
> /proc/sys/vm/compact_memory - write-only one-time action trigger!
>
> /sys/kernel/mm:
> e.g. /sys/kernel/mm/transparent_hugepage/
>
> This is unfortunate enough, and (influenced by my recent dive into sysctl
> perhaps :), I would have preferred sysctl only. In this case it's consistent
> that we have sysctls for compaction already, while this introduces a whole
> new compaction directory in the /sys/kernel/mm/ space.
>
>
I have now replaced this sysfs node with a vm.compaction_proactiveness sysctl.
> > It takes a value in the range [0, 100], with a default of 20.
> >
> > Note that a previous version of this patch [1] was found to introduce too
> > many tunables (per-order extfrag{low, high}), but this one reduces them
> > to just one (proactiveness). Also, the new tunable is an opaque value
> > instead of asking for specific bounds of "external fragmentation", which
> > would have been difficult to estimate. The internal interpretation of
> > this opaque value allows for future fine-tuning.
> >
> > Currently, we use a simple translation from this tunable to [low, high]
> > "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
> > The score for a node is defined as the weighted mean of per-zone external
> > fragmentation wrt HUGETLB_PAGE_ORDER order. A zone's present_pages
> > determines its weight.
> >
> > To periodically check each node's score, we reuse the per-node kcompactd
> > threads, which are woken up every 500 milliseconds to do this check. If
> > a node's score exceeds its high threshold (as derived from user-provided
> > proactiveness value), proactive compaction is started until its score
> > reaches its low threshold value. By default, proactiveness is set to 20,
> > which implies threshold values of low=80 and high=90.
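
To make the translation concrete, it boils down to something like the sketch
below. This is illustrative only: the helper names are made up, and
sysctl_compaction_proactiveness is just my working name for whatever ends up
backing the vm.compaction_proactiveness sysctl in v5.

/* Illustrative sketch of the proactiveness -> [low, high] translation. */
static unsigned int fragmentation_score_wmark_low(void)
{
	/* proactiveness=20 (default) gives low=80 */
	return 100U - sysctl_compaction_proactiveness;
}

static unsigned int fragmentation_score_wmark_high(void)
{
	/* high is 10 points above low, capped at 100 (20 -> high=90) */
	return min(fragmentation_score_wmark_low() + 10U, 100U);
}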
> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> >
> > Performance data
> > ================
> >
> > System: x64_64, 1T RAM, 80 CPU threads.
> > Kernel: 5.6.0-rc3 + this patch
> >
> > echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
> >
> > Before starting the driver, the system was fragmented by a userspace
> > program that allocates all memory and then, for each 2M aligned section,
> > frees 3/4 of its base pages using munmap. The workload is mainly anonymous
> > userspace pages, which are easy to move around. I intentionally avoided
> > unmovable pages in this test to see how much latency we incur when
> > hugepage allocations hit direct compaction.
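
The fragmenter is essentially the minimal sketch below, not the exact program
used: the real run sized the mapping to nearly all free memory, and exactly
which 3/4 of each 2M section gets unmapped is not important.

#define _GNU_SOURCE
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

#define CHUNK (2UL << 20)		/* one 2M section */

int main(void)
{
	size_t len = 32UL << 30;	/* illustrative; the test used ~all free memory */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	for (size_t off = 0; off + CHUNK <= len; off += CHUNK)
		munmap(p + off, 3 * (CHUNK / 4));	/* drop 3/4 of the section's base pages */
	pause();			/* keep the remaining pages mapped */
	return 0;
}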
> >
> > 1. Kernel hugepage allocation latencies
> >
> > With the system in such a fragmented state, a kernel driver then
> > allocates as many hugepages as possible and measures allocation latency:
> >
> > (all latency values are in microseconds)
> >
> > - With vanilla 5.6.0-rc3
> >
> > echo 0 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
> >
> > percentile latency
> > –––––––––– –––––––
> > 5 7894
> > 10 9496
> > 25 12561
> > 30 15295
> > 40 18244
> > 50 21229
> > 60 27556
> > 75 30147
> > 80 31047
> > 90 32859
> > 95 33799
> >
> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
> >
> > percentile latency
> > –––––––––– –––––––
> > 5 2
> > 10 2
> > 25 3
> > 30 3
> > 40 3
> > 50 4
> > 60 4
> > 75 4
> > 80 4
> > 90 5
> > 95 429
> >
> > Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as hugepages)
> >
> > 2. Java heap allocation
> >
> > In this test, we first fragment memory using the same method as for (1).
> >
> > Then, we start a Java process with a heap size set to 700G and request
> > the heap to be allocated with THP hugepages. We also set THP to madvise
> > to allow hugepage backing of this heap.
> >
> > /usr/bin/time java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
> >
> > The above command allocates 700G of Java heap using hugepages.
> >
> > - With vanilla 5.6.0-rc3
> >
> > 17.39user 1666.48system 27:37.89elapsed
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > 8.35user 194.58system 3:19.62elapsed
>
> I still wonder how the single additional CPU during compaction resulted in
> such an improvement. Isn't this against Amdahl's law? :)
>
>
The speedup comes from avoiding the direct compaction path most of the time,
so in effect we are speeding up the "serial" part of user applications
(back-to-back memory allocations).
> > Elapsed time remains around 3:15 as proactiveness is further increased.
> >
> > Note that proactive compaction happens throughout the runtime of these
> > workloads. A one-time compaction, sufficient to supply hugepages for the
> > following allocation stream, can probably only happen for more extreme
> > proactiveness values, like 80 or 90.
> >
> > In the above Java workload, proactiveness is set to 20. The test starts
> > with a node's score of 80 or higher, depending on the delay between the
> > fragmentation step and starting the benchmark, which allows more or less
> > time for the initial round of compaction. As the benchmark consumes
> > hugepages, the node's score quickly rises above the high threshold (90) and
> > proactive compaction starts again, which brings down the score to the
> > low threshold level (80). Repeat.
> >
> > bpftrace also confirms proactive compaction running 20+ times during the
> > runtime of this Java benchmark. kcompactd threads consume 100% of one of
> > the CPUs while they try to bring a node's score within thresholds.
> >
> > Backoff behavior
> > ================
> >
> > Above workloads produce a memory state which is easy to compact.
> > However, if memory is filled with unmovable pages, proactive compaction
> > should essentially back off. To test this aspect:
> >
> > - Created a kernel driver that allocates almost all memory as hugepages,
> > followed by freeing the first 3/4 of each hugepage.
> > - Set proactiveness=40
> > - Note that proactive_compact_node() is deferred the maximum number of
> > times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
> > (=> ~30 seconds between retries).
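
In kcompactd terms, the backoff is roughly the sketch below. The helper and
variable names (proactive_defer, fragmentation_score_node, wmark_low/high)
are illustrative, but the arithmetic matches the description:
1 << COMPACT_MAX_DEFER_SHIFT = 64 deferred checks * 500 ms ~= 30 seconds.

/* Inside the kcompactd loop, woken every HPAGE_FRAG_CHECK_INTERVAL_MSEC: */
if (proactive_defer) {
	proactive_defer--;	/* still backing off, skip this check */
} else if (fragmentation_score_node(pgdat) > wmark_high) {
	proactive_compact_node(pgdat);
	/*
	 * If the score is still above the low threshold, the memory is
	 * likely dominated by unmovable pages: defer the next
	 * 1 << COMPACT_MAX_DEFER_SHIFT (= 64) checks before retrying.
	 */
	if (fragmentation_score_node(pgdat) > wmark_low)
		proactive_defer = 1 << COMPACT_MAX_DEFER_SHIFT;
}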
> >
> > [1] https://patchwork.kernel.org/patch/11098289/
> >
> > Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
> > To: Mel Gorman <mgorman@techsingularity.net>
>
> I hope Mel can also comment on this, but in general I agree.
>
> ...
>
> > +
> > +/*
> > + * A zone's fragmentation score is the external fragmentation wrt
> > + * HUGETLB_PAGE_ORDER, scaled by the zone's size. It returns a value in
> > + * the range [0, 100].
> > + *
> > + * The scaling factor ensures that proactive compaction focuses on larger
> > + * zones like ZONE_NORMAL, rather than smaller, specialized zones like
> > + * ZONE_DMA32. For smaller zones, the score value remains close to zero,
> > + * and thus never exceeds the high threshold for proactive compaction.
> > + */
> > +static int fragmentation_score_zone(struct zone *zone)
> > +{
> > +	unsigned long score;
> > +
> > +	score = zone->present_pages *
> > +			extfrag_for_order(zone, HUGETLB_PAGE_ORDER);
>
> HPAGE_PMD_ORDER would be a better match than HUGETLB_PAGE_ORDER, even if it
> might be the same number. hugetlb pages are pre-reserved, unlike THP.
>
>
Ok, I will change to HPAGE_PMD_ORDER.
> > +	score = div64_ul(score,
> > +			node_present_pages(zone->zone_pgdat->node_id) + 1);
>
> zone->zone_pgdat->node_present_pages is more direct
>
>
Ok.
> > +	return score;
> > +}
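
With both changes folded in, the helper would look roughly like this (just a
sketch of what I plan for v5):

static int fragmentation_score_zone(struct zone *zone)
{
	unsigned long score;

	/* Weight the zone's extfrag by its share of the node's pages. */
	score = zone->present_pages *
			extfrag_for_order(zone, HPAGE_PMD_ORDER);
	return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
}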
> > +
> > +/*
>
> > @@ -2309,6 +2411,7 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> > .alloc_flags = alloc_flags,
> > .classzone_idx = classzone_idx,
> > .direct_compaction = true,
> > + .proactive_compaction = false,
>
> false, 0, NULL etc are implicitly initialized with this kind of
> initialization
> (also in other places of the patch)
>
>
hmm.. will remove these redundant initializations.
> > .whole_zone = (prio == MIN_COMPACT_PRIORITY),
> > .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> > .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> > @@ -2412,6 +2515,42 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> > return rc;
> > }
> >
>
> > @@ -2500,6 +2640,63 @@ void compaction_unregister_node(struct node *node)
> > }
> > #endif /* CONFIG_SYSFS && CONFIG_NUMA */
> >
> > +#ifdef CONFIG_SYSFS
> > +
> > +#define COMPACTION_ATTR_RO(_name) \
> > + static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> > +
> > +#define COMPACTION_ATTR(_name) \
> > + static struct kobj_attribute _name##_attr = \
> > + __ATTR(_name, 0644, _name##_show, _name##_store)
> > +
> > +static struct kobject *compaction_kobj;
> > +
> > +static ssize_t proactiveness_store(struct kobject *kobj,
> > + struct kobj_attribute *attr, const char *buf, size_t count)
> > +{
> > + int err;
> > + unsigned long input;
> > +
> > + err = kstrtoul(buf, 10, &input);
> > + if (err)
> > + return err;
> > + if (input > 100)
> > + return -EINVAL;
>
> The sysctl way also allows you to specify min/max in the descriptor and use
> the generic handler.
>
Thanks for pointing me to sysctl, it deletes ~50 lines from the patch :)
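
The plan is roughly the entry below in the vm_table of kernel/sysctl.c, using
the generic min/max handler. Sketch only: sysctl_compaction_proactiveness is
my working name for the variable backing vm.compaction_proactiveness, and I
am assuming the existing one_hundred limit value in kernel/sysctl.c.

{
	.procname	= "compaction_proactiveness",
	.data		= &sysctl_compaction_proactiveness,
	.maxlen		= sizeof(sysctl_compaction_proactiveness),
	.mode		= 0644,
	.proc_handler	= proc_dointvec_minmax,	/* generic handler with bounds */
	.extra1		= SYSCTL_ZERO,
	.extra2		= &one_hundred,
},
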
I will post v5 soon with the above changes.
Nitin