Subject: Re: [PATCH v7] mm: Proactive compaction
From: Nitin Gupta
To: Oleksandr Natalenko, Nitin Gupta
Cc: Andrew Morton, Vlastimil Babka, Khalid Aziz, Michal Hocko, Mel Gorman,
    Matthew Wilcox, Mike Kravetz, Joonsoo Kim, David Rientjes, linux-kernel,
    linux-mm, Linux API
Date: Tue, 16 Jun 2020 09:51:19 -0700
References: <20200615143614.15267-1-nigupta@nvidia.com>
            <20200616094616.ddwmtqczmd3qfcl6@butterfly.localdomain>
In-Reply-To: <20200616094616.ddwmtqczmd3qfcl6@butterfly.localdomain>

On 6/16/20 2:46 AM, Oleksandr Natalenko wrote:
> Hello.
>
> Please see the notes inline.
>
> On Mon, Jun 15, 2020 at 07:36:14AM -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. The Linux kernel currently does
>> on-demand compaction as we request more hugepages, but this style of
>> compaction incurs very high latency. Experiments with one-time full
>> memory compaction (followed by hugepage allocations) show that the
>> kernel is able to restore a highly fragmented memory state to a fairly
>> compacted state in under one second for a 32G system. Such data
>> suggests that a more proactive compaction can help us allocate a large
>> fraction of memory as hugepages while keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in the range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation",
>> which would have been difficult to estimate. The internal
>> interpretation of this opaque value allows for future fine-tuning.
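As an aside: the next quoted paragraph describes how this opaque value is
turned into concrete score thresholds. A minimal sketch of that mapping,
using made-up helper names rather than the patch's actual code:

    /* Sketch only: mirrors the tunable -> [low, high] score translation
     * described in the following paragraph. */
    static int score_wmark_low(int proactiveness)
    {
            return 100 - proactiveness;      /* default 20 -> low = 80 */
    }

    static int score_wmark_high(int proactiveness)
    {
            return score_wmark_low(proactiveness) + 10;  /* -> high = 90 */
    }

The actual kernel code additionally clamps the extremes; see
fragmentation_score_wmark() further down in the patch.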
>>
>> Currently, we use a simple translation from this tunable to [low, high]
>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>> The score for a node is defined as the weighted mean of per-zone
>> external fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same.
>> If a node's score exceeds its high threshold (as derived from the
>> user-provided proactiveness value), proactive compaction is started
>> until its score reaches its low threshold value. By default,
>> proactiveness is set to 20, which implies threshold values of low=80
>> and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also
>> the LWN article [3].
>>
>> Performance data
>> ================
>>
>> System: x86_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a userspace
>> program that allocates all memory and then, for each 2M aligned
>> section, frees 3/4 of the base pages using munmap. The workload is
>> mainly anonymous userspace pages, which are easy to move around. I
>> intentionally avoided unmovable pages in this test to see how much
>> latency we incur when hugepage allocations hit direct compaction.

(A rough userspace sketch of this fragmentation step is included after the
Java test description below.)

>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile  latency
>>   ----------  -------
>>            5     7894
>>           10     9496
>>           25    12561
>>           30    15295
>>           40    18244
>>           50    21229
>>           60    27556
>>           75    30147
>>           80    31047
>>           90    32859
>>           95    33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>>   sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile  latency
>>   ----------  -------
>>            5        2
>>           10        2
>>           25        3
>>           30        3
>>           40        3
>>           50        4
>>           60        4
>>           75        4
>>           80        4
>>           90        5
>>           95      429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> 2. Java heap allocation
>>
>> In this test, we first fragment memory using the same method as in (1).
>>
>> Then, we start a Java process with a heap size set to 700G and request
>> the heap to be allocated with THP hugepages. We also set THP to madvise
>> to allow hugepage backing of this heap.
>>
>> /usr/bin/time \
>>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>>
>> The above command allocates 700G of Java heap using hugepages.
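As noted earlier, here is a rough userspace sketch of the fragmentation
step used before both tests. It is not part of the patch; the 4K base page
size, the "total" size, and the helper name are assumptions (the real test
program sized the mapping to nearly all free memory):

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SECTION_SIZE    (2UL << 20)     /* 2M */

    /* Touch a large anonymous mapping, then unmap the first 3/4 of every
     * 2M-aligned section so that only scattered base pages remain. */
    static void fragment_memory(size_t total)
    {
            char *p, *base;
            size_t off;

            p = mmap(NULL, total + SECTION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return;

            /* Round up to a 2M boundary so sections line up with hugepages. */
            base = (char *)(((uintptr_t)p + SECTION_SIZE - 1) &
                            ~(SECTION_SIZE - 1));
            memset(base, 0, total);         /* fault everything in */

            for (off = 0; off < total; off += SECTION_SIZE)
                    munmap(base + off, 3 * (SECTION_SIZE / 4));
    }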
>>
>> - With vanilla 5.6.0-rc3
>>
>>   17.39user 1666.48system 27:37.89elapsed
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>>   8.35user 194.58system 3:19.62elapsed
>>
>> Elapsed time remains around 3:15 as proactiveness is further increased.
>>
>> Note that proactive compaction happens throughout the runtime of these
>> workloads. The situation of one-time compaction, sufficient to supply
>> hugepages for the following allocation stream, can probably happen for
>> more extreme proactiveness values, like 80 or 90.
>>
>> In the above Java workload, proactiveness is set to 20. The test starts
>> with a node's score of 80 or higher, depending on the delay between the
>> fragmentation step and starting the benchmark, which gives more or less
>> time for the initial round of compaction. As the benchmark consumes
>> hugepages, the node's score quickly rises above the high threshold (90)
>> and proactive compaction starts again, which brings the score back down
>> to the low threshold level (80). Repeat.
>>
>> bpftrace also confirms proactive compaction running 20+ times during
>> the runtime of this Java benchmark. kcompactd threads consume 100% of
>> one of the CPUs while they try to bring a node's score within the
>> thresholds.
>>
>> Backoff behavior
>> ================
>>
>> The above workloads produce a memory state which is easy to compact.
>> However, if memory is filled with unmovable pages, proactive compaction
>> should essentially back off. To test this aspect:
>>
>> - Created a kernel driver that allocates almost all memory as hugepages
>>   followed by freeing the first 3/4 of each hugepage.
>> - Set proactiveness=40
>> - Note that proactive_compact_node() is deferred the maximum number of
>>   times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>>   (=> ~30 seconds between retries).
>>
>> [1] https://patchwork.kernel.org/patch/11098289/
>> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
>> [3] https://lwn.net/Articles/817905/
>>
>> Signed-off-by: Nitin Gupta
>> Reviewed-by: Vlastimil Babka
>> Reviewed-by: Khalid Aziz
>> To: Andrew Morton
>> CC: Vlastimil Babka
>> CC: Khalid Aziz
>> CC: Michal Hocko
>> CC: Mel Gorman
>> CC: Matthew Wilcox
>> CC: Mike Kravetz
>> CC: Joonsoo Kim
>> CC: David Rientjes
>> CC: Nitin Gupta
>> CC: Oleksandr Natalenko
>> CC: linux-kernel
>> CC: linux-mm
>> CC: Linux API
>>
>> ---
>> Changelog v7 vs v6:
>>  - Fix compile error while THP is disabled (Oleksandr)
>
> Thank you for taking this.
>
>>
>> Changelog v6 vs v5:
>>  - Fall back to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined,
>>    and some cleanups (Vlastimil)
>>  - Cap the min threshold to avoid excess compaction load in case a user
>>    sets extreme values like 100 for the `vm.compaction_proactiveness`
>>    sysctl (Khalid)
>>  - Add some more explanation about the effect of the tunable on
>>    compaction behavior in the user guide (Khalid)
>>
>> Changelog v5 vs v4:
>>  - Change tunable from sysfs to sysctl (Vlastimil)
>>  - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
>>  - Minor cleanups (remove redundant initializations, ...)
>>
>> Changelog v4 vs v3:
>>  - Document various functions.
>>  - Added admin-guide entry for the new tunable `proactiveness`.
>>  - Rename proactive_compaction_score to fragmentation_score for
>>    clarity.
>>
>> Changelog v3 vs v2:
>>  - Make proactiveness a global tunable and not per-node. Also updated
>>    the patch description to reflect the same (Vlastimil Babka).
>>  - Don't start proactive compaction if kswapd is running (Vlastimil
>>    Babka).
>>  - Clarified in the description that compaction runs in parallel with
>>    the workload, instead of a one-time compaction followed by a stream
>>    of hugepage allocations.
>>
>> Changelog v2 vs v1:
>>  - Introduce per-node and per-zone "proactive compaction score". This
>>    score is compared against watermarks which are set according to the
>>    user-provided proactiveness value.
>>  - Separate code-paths for proactive compaction from targeted
>>    compaction, i.e. where pgdat->kcompactd_max_order is non-zero.
>>  - Renamed hpage_compaction_effort -> proactiveness. In the future we
>>    may use more than extfrag wrt hugepage size to determine the
>>    proactive compaction score.
>> ---
>>  Documentation/admin-guide/sysctl/vm.rst |  15 ++
>>  include/linux/compaction.h              |   2 +
>>  kernel/sysctl.c                         |   9 ++
>>  mm/compaction.c                         | 183 +++++++++++++++++++++++-
>>  mm/internal.h                           |   1 +
>>  mm/vmstat.c                             |  18 +++
>>  6 files changed, 223 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
>> index 0329a4d3fa9e..360914b4f346 100644
>> --- a/Documentation/admin-guide/sysctl/vm.rst
>> +++ b/Documentation/admin-guide/sysctl/vm.rst
>> @@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous
>>  blocks where possible. This can be important for example in the allocation of
>>  huge pages although processes will also directly compact memory as required.
>>
>> +compaction_proactiveness
>> +========================
>> +
>> +This tunable takes a value in the range [0, 100] with a default value of
>> +20. This tunable determines how aggressively compaction is done in the
>> +background. Setting it to 0 disables proactive compaction.
>> +
>> +Note that compaction has a non-trivial system-wide impact as pages
>> +belonging to different processes are moved around, which could also lead
>> +to latency spikes in unsuspecting applications. The kernel employs
>> +various heuristics to avoid wasting CPU cycles if it detects that
>> +proactive compaction is not being effective.
>> +
>> +Be careful when setting it to extreme values like 100, as that may
>> +cause excessive background compaction activity.
>>
>>  compact_unevictable_allowed
>>  ===========================
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 4b898cdbdf05..ccd28978b296 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -85,11 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)
>>
>>  #ifdef CONFIG_COMPACTION
>>  extern int sysctl_compact_memory;
>> +extern int sysctl_compaction_proactiveness;
>>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>>                          void __user *buffer, size_t *length, loff_t *ppos);
>
> Based on the __user notation here, I guess the patch is based on v5.7,
> not on something newer, right?
>

I somehow missed rebasing the v7 patch.
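Returning to the documentation hunk above: once the patch is applied, the
tunable is exposed as /proc/sys/vm/compaction_proactiveness. A minimal
sketch of driving it from a test program (the value 30 is arbitrary and
this helper is not part of the patch):

    #include <stdio.h>

    /* Write a new proactiveness value; returns 0 on success. */
    static int set_compaction_proactiveness(int value)
    {
            FILE *f = fopen("/proc/sys/vm/compaction_proactiveness", "w");

            if (!f)
                    return -1;
            fprintf(f, "%d\n", value);
            return fclose(f);
    }

The same effect is achieved with "sysctl -w vm.compaction_proactiveness=30".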
>>  extern int sysctl_extfrag_threshold;
>>  extern int sysctl_compact_unevictable_allowed;
>>
>> +extern int extfrag_for_order(struct zone *zone, unsigned int order);
>>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>>  extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>>                  unsigned int order, unsigned int alloc_flags,
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 8a176d8727a3..51c90906efbc 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1458,6 +1458,15 @@ static struct ctl_table vm_table[] = {
>>          .mode           = 0200,
>>          .proc_handler   = sysctl_compaction_handler,
>>      },
>> +    {
>> +        .procname       = "compaction_proactiveness",
>> +        .data           = &sysctl_compaction_proactiveness,
>> +        .maxlen         = sizeof(int),
>> +        .mode           = 0644,
>> +        .proc_handler   = proc_dointvec_minmax,
>> +        .extra1         = SYSCTL_ZERO,
>> +        .extra2         = &one_hundred,
>> +    },
>
> Again, as a highlight, in v5.8 the table was shuffled around, so you may
> want to rebase the patch on top of something newer in order for people
> to not get conflicts when doing `git am`.

I have rebased it now for v8.

>
>>      {
>>          .procname       = "extfrag_threshold",
>>          .data           = &sysctl_extfrag_threshold,
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 46f0fcc93081..99579a1fa582 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -50,6 +50,24 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
>>  #define pageblock_start_pfn(pfn)    block_start_pfn(pfn, pageblock_order)
>>  #define pageblock_end_pfn(pfn)      block_end_pfn(pfn, pageblock_order)
>>
>> +/*
>> + * Fragmentation score check interval for proactive compaction purposes.
>> + */
>> +static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
>> +
>> +/*
>> + * Page order with-respect-to which proactive compaction
>> + * calculates external fragmentation, which is used as
>> + * the "fragmentation score" of a node/zone.
>> + */
>> +#if defined CONFIG_TRANSPARENT_HUGEPAGE
>> +#define COMPACTION_HPAGE_ORDER  HPAGE_PMD_ORDER
>> +#elif defined HUGETLB_PAGE_ORDER
>> +#define COMPACTION_HPAGE_ORDER  HUGETLB_PAGE_ORDER
>> +#else
>> +#define COMPACTION_HPAGE_ORDER  (PMD_SHIFT - PAGE_SHIFT)
>> +#endif
>> +
>>  static unsigned long release_freepages(struct list_head *freelist)
>>  {
>>      struct page *page, *next;
>> @@ -1855,6 +1873,76 @@ static inline bool is_via_compact_memory(int order)
>>      return order == -1;
>>  }
>>
>> +static bool kswapd_is_running(pg_data_t *pgdat)
>> +{
>> +    return pgdat->kswapd && (pgdat->kswapd->state == TASK_RUNNING);
>> +}
>> +
>> +/*
>> + * A zone's fragmentation score is the external fragmentation wrt the
>> + * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value
>> + * in the range [0, 100].
>> + *
>> + * The scaling factor ensures that proactive compaction focuses on larger
>> + * zones like ZONE_NORMAL, rather than smaller, specialized zones like
>> + * ZONE_DMA32. For smaller zones, the score value remains close to zero,
>> + * and thus never exceeds the high threshold for proactive compaction.
>> + */
>> +static int fragmentation_score_zone(struct zone *zone)
>> +{
>> +    unsigned long score;
>> +
>> +    score = zone->present_pages *
>> +            extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>> +    return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
>> +}
>> +
>> +/*
>> + * The per-node proactive (background) compaction process is started by its
>> + * corresponding kcompactd thread when the node's fragmentation score
>> + * exceeds the high threshold. The compaction process remains active till
>> + * the node's score falls below the low threshold, or one of the back-off
>> + * conditions is met.
>> + */
>> +static int fragmentation_score_node(pg_data_t *pgdat)
>> +{
>> +    unsigned long score = 0;
>> +    int zoneid;
>> +
>> +    for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>> +        struct zone *zone;
>> +
>> +        zone = &pgdat->node_zones[zoneid];
>> +        score += fragmentation_score_zone(zone);
>> +    }
>> +
>> +    return score;
>> +}
>> +
>> +static int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
>> +{
>> +    int wmark_low;
>> +
>> +    /*
>> +     * Cap the low watermark to avoid excessive compaction
>> +     * activity in case a user sets the proactiveness tunable
>> +     * close to 100 (maximum).
>> +     */
>> +    wmark_low = max(100 - sysctl_compaction_proactiveness, 5);
>> +    return low ? wmark_low : min(wmark_low + 10, 100);
>> +}
>> +
>> +static bool should_proactive_compact_node(pg_data_t *pgdat)
>> +{
>> +    int wmark_high;
>> +
>> +    if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat))
>> +        return false;
>> +
>> +    wmark_high = fragmentation_score_wmark(pgdat, false);
>> +    return fragmentation_score_node(pgdat) > wmark_high;
>> +}
>> +
>>  static enum compact_result __compact_finished(struct compact_control *cc)
>>  {
>>      unsigned int order;
>> @@ -1881,6 +1969,25 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>>          return COMPACT_PARTIAL_SKIPPED;
>>      }
>>
>> +    if (cc->proactive_compaction) {
>> +        int score, wmark_low;
>> +        pg_data_t *pgdat;
>> +
>> +        pgdat = cc->zone->zone_pgdat;
>> +        if (kswapd_is_running(pgdat))
>> +            return COMPACT_PARTIAL_SKIPPED;
>> +
>> +        score = fragmentation_score_zone(cc->zone);
>> +        wmark_low = fragmentation_score_wmark(pgdat, true);
>> +
>> +        if (score > wmark_low)
>> +            ret = COMPACT_CONTINUE;
>> +        else
>> +            ret = COMPACT_SUCCESS;
>> +
>> +        goto out;
>> +    }
>> +
>>      if (is_via_compact_memory(cc->order))
>>          return COMPACT_CONTINUE;
>>
>> @@ -1939,6 +2046,7 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>>          }
>>      }
>>
>> +out:
>>      if (cc->contended || fatal_signal_pending(current))
>>          ret = COMPACT_CONTENDED;
>>
>> @@ -2412,6 +2520,41 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>>      return rc;
>>  }
>>
>> +/*
>> + * Compact all zones within a node till each zone's fragmentation score
>> + * reaches within proactive compaction thresholds (as determined by the
>> + * proactiveness tunable).
>> + *
>> + * It is possible that the function returns before reaching score targets
>> + * due to various back-off conditions, such as contention on per-node or
>> + * per-zone locks.
>> + */
>> +static void proactive_compact_node(pg_data_t *pgdat)
>> +{
>> +    int zoneid;
>> +    struct zone *zone;
>> +    struct compact_control cc = {
>> +        .order = -1,
>> +        .mode = MIGRATE_SYNC_LIGHT,
>> +        .ignore_skip_hint = true,
>> +        .whole_zone = true,
>> +        .gfp_mask = GFP_KERNEL,
>> +        .proactive_compaction = true,
>> +    };
>> +
>> +    for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>> +        zone = &pgdat->node_zones[zoneid];
>> +        if (!populated_zone(zone))
>> +            continue;
>> +
>> +        cc.zone = zone;
>> +
>> +        compact_zone(&cc, NULL);
>> +
>> +        VM_BUG_ON(!list_empty(&cc.freepages));
>> +        VM_BUG_ON(!list_empty(&cc.migratepages));
>
> Can this actually happen here? I'd expect some comment in the code
> regarding being overcautious here. IIUC, you follow what
> kcompactd_do_work() does, but even there it is not explained.
>

In theory, no, it can't happen: we do release_pages(cc->freepages) and
putback_movable_pages(cc->migratepages) for any residuals which could not
be migrated. This is just being cautious to detect any future memory leak
bugs here.

>> +    }
>> +}
>>
>>  /* Compact all zones within a node */
>>  static void compact_node(int nid)
>> @@ -2458,6 +2601,13 @@ static void compact_nodes(void)
>>  /* The written value is actually unused, all memory is compacted */
>>  int sysctl_compact_memory;
>>
>> +/*
>> + * Tunable for proactive compaction. It determines how
>> + * aggressively the kernel should compact memory in the
>> + * background. It takes values in the range [0, 100].
>> + */
>> +int __read_mostly sysctl_compaction_proactiveness = 20;
>
> Excuse me if I missed previous discussion and this question was already
> addressed, but given possible latency spikes as described in the commit
> message, shall this value be amended to conserve current kernel
> behaviour (IOW, sysctl_compaction_proactiveness = 0)?
>

For the v2 patch, Vlastimil suggested [1] setting it to a non-zero, yet
conservative, default after some more testing. My testing suggests that a
conservative value of 20 does not seem to impact the overall system
negatively while consistently giving good improvements in hugepage
allocation latencies.

[1] https://lkml.org/lkml/2020/3/4/812

>> +
>>  /*
>>   * This is the entry point for compacting all nodes via
>>   * /proc/sys/vm/compact_memory
>> @@ -2637,6 +2787,7 @@ static int kcompactd(void *p)
>>  {
>>      pg_data_t *pgdat = (pg_data_t*)p;
>>      struct task_struct *tsk = current;
>> +    unsigned int proactive_defer = 0;
>>
>>      const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>>
>> @@ -2652,12 +2803,34 @@ static int kcompactd(void *p)
>>          unsigned long pflags;
>>
>>          trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
>> -        wait_event_freezable(pgdat->kcompactd_wait,
>> -                kcompactd_work_requested(pgdat));
>> +        if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
>> +            kcompactd_work_requested(pgdat),
>> +            msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
>> +
>> +            psi_memstall_enter(&pflags);
>> +            kcompactd_do_work(pgdat);
>> +            psi_memstall_leave(&pflags);
>
> I wonder if wrapping kcompactd_do_work() into
> psi_memstall_{enter,leave} is too big a hammer that may cause memory PSI
> to look bigger than it really is, but this question seems to be out of
> scope of the current patch, so feel free to ignore it.
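An aside on the timing of the proactive path quoted next: when a proactive
compaction round makes no progress, proactive_defer is set to
1 << COMPACT_MAX_DEFER_SHIFT, i.e. 64 skipped wakeups assuming
COMPACT_MAX_DEFER_SHIFT is 6, as in current kernels. With one wakeup every
HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500 ms, that is roughly 64 * 0.5 s = 32 s
between retries, matching the "~30 seconds between retries" noted in the
changelog above.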
>
>> +            continue;
>> +        }
>>
>> -        psi_memstall_enter(&pflags);
>> -        kcompactd_do_work(pgdat);
>> -        psi_memstall_leave(&pflags);
>> +        /* kcompactd wait timeout */
>> +        if (should_proactive_compact_node(pgdat)) {
>> +            unsigned int prev_score, score;
>> +
>> +            if (proactive_defer) {
>> +                proactive_defer--;
>> +                continue;
>> +            }
>> +            prev_score = fragmentation_score_node(pgdat);
>> +            proactive_compact_node(pgdat);
>> +            score = fragmentation_score_node(pgdat);
>> +            /*
>> +             * Defer proactive compaction if the fragmentation
>> +             * score did not go down i.e. no progress made.
>> +             */
>> +            proactive_defer = score < prev_score ?
>> +                    0 : 1 << COMPACT_MAX_DEFER_SHIFT;
>> +        }
>>      }
>>
>>      return 0;
>> diff --git a/mm/internal.h b/mm/internal.h
>> index b5634e78f01d..9671bccd97d5 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -228,6 +228,7 @@ struct compact_control {
>>      bool no_set_skip_hint;          /* Don't mark blocks for skipping */
>>      bool ignore_block_suitable;     /* Scan blocks considered unsuitable */
>>      bool direct_compaction;         /* False from kcompactd or /proc/... */
>> +    bool proactive_compaction;      /* kcompactd proactive compaction */
>>      bool whole_zone;                /* Whole zone should/has been scanned */
>>      bool contended;                 /* Signal lock or sched contention */
>>      bool rescan;                    /* Rescanning the same pageblock */
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 96d21a792b57..cc88f7533b8d 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1074,6 +1074,24 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *info)
>>      return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
>>  }
>>
>> +/*
>> + * Calculates external fragmentation within a zone wrt the given order.
>> + * It is defined as the percentage of pages found in blocks of size
>> + * less than 1 << order. It returns values in range [0, 100].
>> + */
>> +int extfrag_for_order(struct zone *zone, unsigned int order)
>> +{
>> +    struct contig_page_info info;
>> +
>> +    fill_contig_page_info(zone, order, &info);
>> +    if (info.free_pages == 0)
>> +        return 0;
>> +
>> +    return div_u64((info.free_pages -
>> +            (info.free_blocks_suitable << order)) * 100,
>> +            info.free_pages);
>> +}
>> +
>>  /* Same as __fragmentation_index but allocs contig_page_info on stack */
>>  int fragmentation_index(struct zone *zone, unsigned int order)
>>  {
>> --
>> 2.27.0
>>
>
> Modulo the minor nits above, and given that I have been running this
> submission for quite some time on various machines:
>
> Reviewed-by: Oleksandr Natalenko
> Tested-by: Oleksandr Natalenko
>

Thank you.

-Nitin