Subject: Re: [PATCH v7] mm: Proactive compaction
From: Nitin Gupta
To: Oleksandr Natalenko, Nitin Gupta
Cc: Andrew Morton, Vlastimil Babka, Khalid Aziz, Michal Hocko, Mel Gorman,
    Matthew Wilcox, Mike Kravetz, Joonsoo Kim, David Rientjes, linux-kernel,
    linux-mm, Linux API
Date: Tue, 16 Jun 2020 09:51:19 -0700
References: <20200615143614.15267-1-nigupta@nvidia.com>
            <20200616094616.ddwmtqczmd3qfcl6@butterfly.localdomain>
In-Reply-To: <20200616094616.ddwmtqczmd3qfcl6@butterfly.localdomain>

On 6/16/20 2:46 AM, Oleksandr Natalenko wrote:
> Hello.
>
> Please see the notes inline.
>
> On Mon, Jun 15, 2020 at 07:36:14AM -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. The Linux kernel currently does
>> on-demand compaction as we request more hugepages, but this style of
>> compaction incurs very high latency. Experiments with one-time full
>> memory compaction (followed by hugepage allocations) show that the
>> kernel is able to restore a highly fragmented memory state to a fairly
>> compacted state in under one second for a 32G system. Such data
>> suggests that a more proactive compaction can help us allocate a large
>> fraction of memory as hugepages while keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in the range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation",
>> which would have been difficult to estimate. The internal
>> interpretation of this opaque value allows for future fine-tuning.
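As an aside: the next quoted paragraph describes how this opaque value is
turned into concrete score thresholds. A minimal sketch of that mapping,
using made-up helper names rather than the patch's actual code:

    /* Sketch only: mirrors the tunable -> [low, high] score translation
     * described in the following paragraph. */
    static int score_wmark_low(int proactiveness)
    {
            return 100 - proactiveness;      /* default 20 -> low = 80 */
    }

    static int score_wmark_high(int proactiveness)
    {
            return score_wmark_low(proactiveness) + 10;  /* -> high = 90 */
    }

The actual kernel code additionally clamps the extremes; see
fragmentation_score_wmark() further down in the patch.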
>>
>> Currently, we use a simple translation from this tunable to [low, high]
>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>> The score for a node is defined as the weighted mean of per-zone
>> external fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same.
>> If a node's score exceeds its high threshold (as derived from the
>> user-provided proactiveness value), proactive compaction is started
>> until its score reaches its low threshold value. By default,
>> proactiveness is set to 20, which implies threshold values of low=80
>> and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also
>> the LWN article [3].
>>
>> Performance data
>> ================
>>
>> System: x86_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a userspace
>> program that allocates all memory and then, for each 2M aligned
>> section, frees 3/4 of the base pages using munmap. The workload is
>> mainly anonymous userspace pages, which are easy to move around. I
>> intentionally avoided unmovable pages in this test to see how much
>> latency we incur when hugepage allocations hit direct compaction.

(A rough userspace sketch of this fragmentation step is included after the
Java test description below.)

>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile  latency
>>   ----------  -------
>>            5     7894
>>           10     9496
>>           25    12561
>>           30    15295
>>           40    18244
>>           50    21229
>>           60    27556
>>           75    30147
>>           80    31047
>>           90    32859
>>           95    33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>>   sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile  latency
>>   ----------  -------
>>            5        2
>>           10        2
>>           25        3
>>           30        3
>>           40        3
>>           50        4
>>           60        4
>>           75        4
>>           80        4
>>           90        5
>>           95      429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> 2. Java heap allocation
>>
>> In this test, we first fragment memory using the same method as in (1).
>>
>> Then, we start a Java process with a heap size set to 700G and request
>> the heap to be allocated with THP hugepages. We also set THP to madvise
>> to allow hugepage backing of this heap.
>>
>> /usr/bin/time \
>>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>>
>> The above command allocates 700G of Java heap using hugepages.
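As noted earlier, here is a rough userspace sketch of the fragmentation
step used before both tests. It is not part of the patch; the 4K base page
size, the "total" size, and the helper name are assumptions (the real test
program sized the mapping to nearly all free memory):

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SECTION_SIZE    (2UL << 20)     /* 2M */

    /* Touch a large anonymous mapping, then unmap the first 3/4 of every
     * 2M-aligned section so that only scattered base pages remain. */
    static void fragment_memory(size_t total)
    {
            char *p, *base;
            size_t off;

            p = mmap(NULL, total + SECTION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return;

            /* Round up to a 2M boundary so sections line up with hugepages. */
            base = (char *)(((uintptr_t)p + SECTION_SIZE - 1) &
                            ~(SECTION_SIZE - 1));
            memset(base, 0, total);         /* fault everything in */

            for (off = 0; off < total; off += SECTION_SIZE)
                    munmap(base + off, 3 * (SECTION_SIZE / 4));
    }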
>>
>> - With vanilla 5.6.0-rc3
>>
>>   17.39user 1666.48system 27:37.89elapsed
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>>   8.35user 194.58system 3:19.62elapsed
>>
>> Elapsed time remains around 3:15 as proactiveness is further increased.
>>
>> Note that proactive compaction happens throughout the runtime of these
>> workloads. The situation of one-time compaction, sufficient to supply
>> hugepages for the following allocation stream, can probably happen for
>> more extreme proactiveness values, like 80 or 90.
>>
>> In the above Java workload, proactiveness is set to 20. The test starts
>> with a node's score of 80 or higher, depending on the delay between the
>> fragmentation step and starting the benchmark, which gives more or less
>> time for the initial round of compaction. As the benchmark consumes
>> hugepages, the node's score quickly rises above the high threshold (90)
>> and proactive compaction starts again, which brings the score back down
>> to the low threshold level (80). Repeat.
>>
>> bpftrace also confirms proactive compaction running 20+ times during
>> the runtime of this Java benchmark. kcompactd threads consume 100% of
>> one of the CPUs while they try to bring a node's score within the
>> thresholds.
>>
>> Backoff behavior
>> ================
>>
>> The above workloads produce a memory state which is easy to compact.
>> However, if memory is filled with unmovable pages, proactive compaction
>> should essentially back off. To test this aspect:
>>
>> - Created a kernel driver that allocates almost all memory as hugepages
>>   followed by freeing the first 3/4 of each hugepage.
>> - Set proactiveness=40
>> - Note that proactive_compact_node() is deferred the maximum number of
>>   times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>>   (=> ~30 seconds between retries).
>>
>> [1] https://patchwork.kernel.org/patch/11098289/
>> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
>> [3] https://lwn.net/Articles/817905/
>>
>> Signed-off-by: Nitin Gupta
>> Reviewed-by: Vlastimil Babka
>> Reviewed-by: Khalid Aziz
>> To: Andrew Morton
>> CC: Vlastimil Babka
>> CC: Khalid Aziz
>> CC: Michal Hocko
>> CC: Mel Gorman
>> CC: Matthew Wilcox
>> CC: Mike Kravetz
>> CC: Joonsoo Kim
>> CC: David Rientjes
>> CC: Nitin Gupta
>> CC: Oleksandr Natalenko
>> CC: linux-kernel
>> CC: linux-mm
>> CC: Linux API
>>
>> ---
>> Changelog v7 vs v6:
>>  - Fix compile error while THP is disabled (Oleksandr)
>
> Thank you for taking this.
>
>>
>> Changelog v6 vs v5:
>>  - Fall back to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined,
>>    and some cleanups (Vlastimil)
>>  - Cap the min threshold to avoid excess compaction load in case a user
>>    sets extreme values like 100 for the `vm.compaction_proactiveness`
>>    sysctl (Khalid)
>>  - Add some more explanation about the effect of the tunable on
>>    compaction behavior in the user guide (Khalid)
>>
>> Changelog v5 vs v4:
>>  - Change tunable from sysfs to sysctl (Vlastimil)
>>  - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
>>  - Minor cleanups (remove redundant initializations, ...)
>>
>> Changelog v4 vs v3:
>>  - Document various functions.
>>  - Added admin-guide entry for the new tunable `proactiveness`.
>>  - Rename proactive_compaction_score to fragmentation_score for
>>    clarity.
>>
>> Changelog v3 vs v2:
>>  - Make proactiveness a global tunable and not per-node. Also updated
>>    the patch description to reflect the same (Vlastimil Babka).
>>  - Don't start proactive compaction if kswapd is running (Vlastimil
>>    Babka).
>>  - Clarified in the description that compaction runs in parallel with
>>    the workload, instead of a one-time compaction followed by a stream
>>    of hugepage allocations.
>>
>> Changelog v2 vs v1:
>>  - Introduce per-node and per-zone "proactive compaction score". This
>>    score is compared against watermarks which are set according to the
>>    user-provided proactiveness value.
>>  - Separate code-paths for proactive compaction from targeted
>>    compaction, i.e. where pgdat->kcompactd_max_order is non-zero.
>>  - Renamed hpage_compaction_effort -> proactiveness. In the future we
>>    may use more than extfrag wrt hugepage size to determine the
>>    proactive compaction score.
>> ---
>>  Documentation/admin-guide/sysctl/vm.rst |  15 ++
>>  include/linux/compaction.h              |   2 +
>>  kernel/sysctl.c                         |   9 ++
>>  mm/compaction.c                         | 183 +++++++++++++++++++++++-
>>  mm/internal.h                           |   1 +
>>  mm/vmstat.c                             |  18 +++
>>  6 files changed, 223 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
>> index 0329a4d3fa9e..360914b4f346 100644
>> --- a/Documentation/admin-guide/sysctl/vm.rst
>> +++ b/Documentation/admin-guide/sysctl/vm.rst
>> @@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous
>>  blocks where possible. This can be important for example in the allocation of
>>  huge pages although processes will also directly compact memory as required.
>>
>> +compaction_proactiveness
>> +========================
>> +
>> +This tunable takes a value in the range [0, 100] with a default value of
>> +20. This tunable determines how aggressively compaction is done in the
>> +background. Setting it to 0 disables proactive compaction.
>> +
>> +Note that compaction has a non-trivial system-wide impact as pages
>> +belonging to different processes are moved around, which could also lead
>> +to latency spikes in unsuspecting applications. The kernel employs
>> +various heuristics to avoid wasting CPU cycles if it detects that
>> +proactive compaction is not being effective.
>> +
>> +Be careful when setting it to extreme values like 100, as that may
>> +cause excessive background compaction activity.
>>
>>  compact_unevictable_allowed
>>  ===========================
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 4b898cdbdf05..ccd28978b296 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -85,11 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)
>>
>>  #ifdef CONFIG_COMPACTION
>>  extern int sysctl_compact_memory;
>> +extern int sysctl_compaction_proactiveness;
>>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>>                          void __user *buffer, size_t *length, loff_t *ppos);
>
> Based on the __user notation here, I guess the patch is based on v5.7,
> not on something newer, right?
>

I somehow missed rebasing the v7 patch.
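Returning to the documentation hunk above: once the patch is applied, the
tunable is exposed as /proc/sys/vm/compaction_proactiveness. A minimal
sketch of driving it from a test program (the value 30 is arbitrary and
this helper is not part of the patch):

    #include <stdio.h>

    /* Write a new proactiveness value; returns 0 on success. */
    static int set_compaction_proactiveness(int value)
    {
            FILE *f = fopen("/proc/sys/vm/compaction_proactiveness", "w");

            if (!f)
                    return -1;
            fprintf(f, "%d\n", value);
            return fclose(f);
    }

The same effect is achieved with "sysctl -w vm.compaction_proactiveness=30".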
>>  extern int sysctl_extfrag_threshold;
>>  extern int sysctl_compact_unevictable_allowed;
>>
>> +extern int extfrag_for_order(struct zone *zone, unsigned int order);
>>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>>  extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>>                  unsigned int order, unsigned int alloc_flags,
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 8a176d8727a3..51c90906efbc 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1458,6 +1458,15 @@ static struct ctl_table vm_table[] = {
>>          .mode           = 0200,
>>          .proc_handler   = sysctl_compaction_handler,
>>      },
>> +    {
>> +        .procname       = "compaction_proactiveness",
>> +        .data           = &sysctl_compaction_proactiveness,
>> +        .maxlen         = sizeof(int),
>> +        .mode           = 0644,
>> +        .proc_handler   = proc_dointvec_minmax,
>> +        .extra1         = SYSCTL_ZERO,
>> +        .extra2         = &one_hundred,
>> +    },
>
> Again, as a highlight, in v5.8 the table was shuffled around, so you may
> want to rebase the patch on top of something newer in order for people
> to not get conflicts when doing `git am`.

I have rebased it now for v8.

>
>>      {
>>          .procname       = "extfrag_threshold",
>>          .data           = &sysctl_extfrag_threshold,
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 46f0fcc93081..99579a1fa582 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -50,6 +50,24 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
>>  #define pageblock_start_pfn(pfn)    block_start_pfn(pfn, pageblock_order)
>>  #define pageblock_end_pfn(pfn)      block_end_pfn(pfn, pageblock_order)
>>
>> +/*
>> + * Fragmentation score check interval for proactive compaction purposes.
>> + */
>> +static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
>> +
>> +/*
>> + * Page order with-respect-to which proactive compaction
>> + * calculates external fragmentation, which is used as
>> + * the "fragmentation score" of a node/zone.
>> + */
>> +#if defined CONFIG_TRANSPARENT_HUGEPAGE
>> +#define COMPACTION_HPAGE_ORDER  HPAGE_PMD_ORDER
>> +#elif defined HUGETLB_PAGE_ORDER
>> +#define COMPACTION_HPAGE_ORDER  HUGETLB_PAGE_ORDER
>> +#else
>> +#define COMPACTION_HPAGE_ORDER  (PMD_SHIFT - PAGE_SHIFT)
>> +#endif
>> +
>>  static unsigned long release_freepages(struct list_head *freelist)
>>  {
>>      struct page *page, *next;
>> @@ -1855,6 +1873,76 @@ static inline bool is_via_compact_memory(int order)
>>      return order == -1;
>>  }
>>
>> +static bool kswapd_is_running(pg_data_t *pgdat)
>> +{
>> +    return pgdat->kswapd && (pgdat->kswapd->state == TASK_RUNNING);
>> +}
>> +
>> +/*
>> + * A zone's fragmentation score is the external fragmentation wrt the
>> + * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value
>> + * in the range [0, 100].
>> + *
>> + * The scaling factor ensures that proactive compaction focuses on larger
>> + * zones like ZONE_NORMAL, rather than smaller, specialized zones like
>> + * ZONE_DMA32. For smaller zones, the score value remains close to zero,
>> + * and thus never exceeds the high threshold for proactive compaction.
>> + */
>> +static int fragmentation_score_zone(struct zone *zone)
>> +{
>> +    unsigned long score;
>> +
>> +    score = zone->present_pages *
>> +            extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>> +    return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
>> +}
>> +
>> +/*
>> + * The per-node proactive (background) compaction process is started by its
>> + * corresponding kcompactd thread when the node's fragmentation score
>> + * exceeds the high threshold. The compaction process remains active till
>> + * the node's score falls below the low threshold, or one of the back-off
>> + * conditions is met.
>> + */
>> +static int fragmentation_score_node(pg_data_t *pgdat)
>> +{
>> +    unsigned long score = 0;
>> +    int zoneid;
>> +
>> +    for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>> +        struct zone *zone;
>> +
>> +        zone = &pgdat->node_zones[zoneid];
>> +        score += fragmentation_score_zone(zone);
>> +    }
>> +
>> +    return score;
>> +}
>> +
>> +static int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
>> +{
>> +    int wmark_low;
>> +
>> +    /*
>> +     * Cap the low watermark to avoid excessive compaction
>> +     * activity in case a user sets the proactiveness tunable
>> +     * close to 100 (maximum).
>> +     */
>> +    wmark_low = max(100 - sysctl_compaction_proactiveness, 5);
>> +    return low ? wmark_low : min(wmark_low + 10, 100);
>> +}
>> +
>> +static bool should_proactive_compact_node(pg_data_t *pgdat)
>> +{
>> +    int wmark_high;
>> +
>> +    if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat))
>> +        return false;
>> +
>> +    wmark_high = fragmentation_score_wmark(pgdat, false);
>> +    return fragmentation_score_node(pgdat) > wmark_high;
>> +}
>> +
>>  static enum compact_result __compact_finished(struct compact_control *cc)
>>  {
>>      unsigned int order;
>> @@ -1881,6 +1969,25 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>>          return COMPACT_PARTIAL_SKIPPED;
>>      }
>>
>> +    if (cc->proactive_compaction) {
>> +        int score, wmark_low;
>> +        pg_data_t *pgdat;
>> +
>> +        pgdat = cc->zone->zone_pgdat;
>> +        if (kswapd_is_running(pgdat))
>> +            return COMPACT_PARTIAL_SKIPPED;
>> +
>> +        score = fragmentation_score_zone(cc->zone);
>> +        wmark_low = fragmentation_score_wmark(pgdat, true);
>> +
>> +        if (score > wmark_low)
>> +            ret = COMPACT_CONTINUE;
>> +        else
>> +            ret = COMPACT_SUCCESS;
>> +
>> +        goto out;
>> +    }
>> +
>>      if (is_via_compact_memory(cc->order))
>>          return COMPACT_CONTINUE;
>>
>> @@ -1939,6 +2046,7 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>>          }
>>      }
>>
>> +out:
>>      if (cc->contended || fatal_signal_pending(current))
>>          ret = COMPACT_CONTENDED;
>>
>> @@ -2412,6 +2520,41 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>>      return rc;
>>  }
>>
>> +/*
>> + * Compact all zones within a node till each zone's fragmentation score
>> + * reaches within proactive compaction thresholds (as determined by the
>> + * proactiveness tunable).
>> + *
>> + * It is possible that the function returns before reaching score targets
>> + * due to various back-off conditions, such as contention on per-node or
>> + * per-zone locks.
>> + */
>> +static void proactive_compact_node(pg_data_t *pgdat)
>> +{
>> +    int zoneid;
>> +    struct zone *zone;
>> +    struct compact_control cc = {
>> +        .order = -1,
>> +        .mode = MIGRATE_SYNC_LIGHT,
>> +        .ignore_skip_hint = true,
>> +        .whole_zone = true,
>> +        .gfp_mask = GFP_KERNEL,
>> +        .proactive_compaction = true,
>> +    };
>> +
>> +    for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>> +        zone = &pgdat->node_zones[zoneid];
>> +        if (!populated_zone(zone))
>> +            continue;
>> +
>> +        cc.zone = zone;
>> +
>> +        compact_zone(&cc, NULL);
>> +
>> +        VM_BUG_ON(!list_empty(&cc.freepages));
>> +        VM_BUG_ON(!list_empty(&cc.migratepages));
>
> Can this actually happen here? I'd expect some comment in the code
> regarding being overcautious here. IIUC, you follow what
> kcompactd_do_work() does, but even there it is not explained.
>

In theory, no, it can't happen: we do release_pages(cc->freepages) and
putback_movable_pages(cc->migratepages) for any residuals which could not
be migrated. This is just being cautious to detect any future memory leak
bugs here.

>> +    }
>> +}
>>
>>  /* Compact all zones within a node */
>>  static void compact_node(int nid)
>> @@ -2458,6 +2601,13 @@ static void compact_nodes(void)
>>  /* The written value is actually unused, all memory is compacted */
>>  int sysctl_compact_memory;
>>
>> +/*
>> + * Tunable for proactive compaction. It determines how
>> + * aggressively the kernel should compact memory in the
>> + * background. It takes values in the range [0, 100].
>> + */
>> +int __read_mostly sysctl_compaction_proactiveness = 20;
>
> Excuse me if I missed previous discussion and this question was already
> addressed, but given possible latency spikes as described in the commit
> message, shall this value be amended to conserve current kernel
> behaviour (IOW, sysctl_compaction_proactiveness = 0)?
>

For the v2 patch, Vlastimil suggested [1] setting it to a non-zero, yet
conservative, default after some more testing. My testing suggests that a
conservative value of 20 does not seem to impact the overall system
negatively while consistently giving good improvements in hugepage
allocation latencies.

[1] https://lkml.org/lkml/2020/3/4/812

>> +
>>  /*
>>   * This is the entry point for compacting all nodes via
>>   * /proc/sys/vm/compact_memory
>> @@ -2637,6 +2787,7 @@ static int kcompactd(void *p)
>>  {
>>      pg_data_t *pgdat = (pg_data_t*)p;
>>      struct task_struct *tsk = current;
>> +    unsigned int proactive_defer = 0;
>>
>>      const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>>
>> @@ -2652,12 +2803,34 @@ static int kcompactd(void *p)
>>          unsigned long pflags;
>>
>>          trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
>> -        wait_event_freezable(pgdat->kcompactd_wait,
>> -                kcompactd_work_requested(pgdat));
>> +        if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
>> +            kcompactd_work_requested(pgdat),
>> +            msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
>> +
>> +            psi_memstall_enter(&pflags);
>> +            kcompactd_do_work(pgdat);
>> +            psi_memstall_leave(&pflags);
>
> I wonder if wrapping kcompactd_do_work() into
> psi_memstall_{enter,leave} is too big a hammer that may cause memory PSI
> to look bigger than it really is, but this question seems to be out of
> scope of the current patch, so feel free to ignore it.
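An aside on the timing of the proactive path quoted next: when a proactive
compaction round makes no progress, proactive_defer is set to
1 << COMPACT_MAX_DEFER_SHIFT, i.e. 64 skipped wakeups assuming
COMPACT_MAX_DEFER_SHIFT is 6, as in current kernels. With one wakeup every
HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500 ms, that is roughly 64 * 0.5 s = 32 s
between retries, matching the "~30 seconds between retries" noted in the
changelog above.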
>
>> +            continue;
>> +        }
>>
>> -        psi_memstall_enter(&pflags);
>> -        kcompactd_do_work(pgdat);
>> -        psi_memstall_leave(&pflags);
>> +        /* kcompactd wait timeout */
>> +        if (should_proactive_compact_node(pgdat)) {
>> +            unsigned int prev_score, score;
>> +
>> +            if (proactive_defer) {
>> +                proactive_defer--;
>> +                continue;
>> +            }
>> +            prev_score = fragmentation_score_node(pgdat);
>> +            proactive_compact_node(pgdat);
>> +            score = fragmentation_score_node(pgdat);
>> +            /*
>> +             * Defer proactive compaction if the fragmentation
>> +             * score did not go down i.e. no progress made.
>> +             */
>> +            proactive_defer = score < prev_score ?
>> +                    0 : 1 << COMPACT_MAX_DEFER_SHIFT;
>> +        }
>>      }
>>
>>      return 0;
>> diff --git a/mm/internal.h b/mm/internal.h
>> index b5634e78f01d..9671bccd97d5 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -228,6 +228,7 @@ struct compact_control {
>>      bool no_set_skip_hint;          /* Don't mark blocks for skipping */
>>      bool ignore_block_suitable;     /* Scan blocks considered unsuitable */
>>      bool direct_compaction;         /* False from kcompactd or /proc/... */
>> +    bool proactive_compaction;      /* kcompactd proactive compaction */
>>      bool whole_zone;                /* Whole zone should/has been scanned */
>>      bool contended;                 /* Signal lock or sched contention */
>>      bool rescan;                    /* Rescanning the same pageblock */
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 96d21a792b57..cc88f7533b8d 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1074,6 +1074,24 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *info)
>>      return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
>>  }
>>
>> +/*
>> + * Calculates external fragmentation within a zone wrt the given order.
>> + * It is defined as the percentage of pages found in blocks of size
>> + * less than 1 << order. It returns values in range [0, 100].
>> + */
>> +int extfrag_for_order(struct zone *zone, unsigned int order)
>> +{
>> +    struct contig_page_info info;
>> +
>> +    fill_contig_page_info(zone, order, &info);
>> +    if (info.free_pages == 0)
>> +        return 0;
>> +
>> +    return div_u64((info.free_pages -
>> +            (info.free_blocks_suitable << order)) * 100,
>> +            info.free_pages);
>> +}
>> +
>>  /* Same as __fragmentation_index but allocs contig_page_info on stack */
>>  int fragmentation_index(struct zone *zone, unsigned int order)
>>  {
>> --
>> 2.27.0
>>
>
> Modulo the minor nits above, and given that I have been running this
> submission for quite some time on various machines:
>
> Reviewed-by: Oleksandr Natalenko
> Tested-by: Oleksandr Natalenko
>

Thank you.

-Nitin