Subject: Re: [PATCH v8] mm: Proactive compaction
To: Nathan Chancellor, Nitin Gupta
References: <20200616204527.19185-1-nigupta@nvidia.com> <20200623022636.GA1051134@ubuntu-n2-xlarge-x86>
Cc: Andrew Morton, Vlastimil Babka, Khalid Aziz, Oleksandr Natalenko, Michal Hocko, Mel Gorman, Matthew Wilcox, Mike Kravetz, Joonsoo Kim, David Rientjes, Nitin Gupta, linux-kernel, linux-mm, Linux API, linux-mips@vger.kernel.org
From: maobibo
Date: Tue, 23 Jun 2020 12:21:28 +0800
In-Reply-To: <20200623022636.GA1051134@ubuntu-n2-xlarge-x86>

On 06/23/2020 10:26 AM, Nathan Chancellor wrote:
> On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does on-demand
>> compaction as we request more hugepages, but this style of compaction
>> incurs very high latency. Experiments with one-time full memory
>> compaction (followed by hugepage allocations) show that kernel is able
>> to restore a highly fragmented memory state to a fairly compacted memory
>> state within <1 sec for a 32G system. Such data suggests that a more
>> proactive compaction can help us allocate a large fraction of memory as
>> hugepages keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one reduces
>> them to just one sysctl.
>> Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation", which
>> would have been difficult to estimate. The internal interpretation of
>> this opaque value allows for future fine-tuning.
>>
>> Currently, we use a simple translation from this tunable to [low, high]
>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>> The score for a node is defined as weighted mean of per-zone external
>> fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same. If
>> a node's score exceeds its high threshold (as derived from user-provided
>> proactiveness value), proactive compaction is started until its score
>> reaches its low threshold value. By default, proactiveness is set to 20,
>> which implies threshold values of low=80 and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also the
>> LWN article [3].
>>
>> Performance data
>> ================
>>
>> System: x86_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a userspace
>> program that allocates all memory and then for each 2M aligned section,
>> frees 3/4 of base pages using munmap. The workload is mainly anonymous
>> userspace pages, which are easy to move around. I intentionally avoided
>> unmovable pages in this test to see how much latency we incur when
>> hugepage allocations hit direct compaction.
>>
>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile latency
>>   ---------- -------
>>            5    7894
>>           10    9496
>>           25   12561
>>           30   15295
>>           40   18244
>>           50   21229
>>           60   27556
>>           75   30147
>>           80   31047
>>           90   32859
>>           95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile latency
>>   ---------- -------
>>            5       2
>>           10       2
>>           25       3
>>           30       3
>>           40       3
>>           50       4
>>           60       4
>>           75       4
>>           80       4
>>           90       5
>>           95     429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> 2. JAVA heap allocation
>>
>> In this test, we first fragment memory using the same method as for (1).
>>
>> Then, we start a Java process with a heap size set to 700G and request
>> the heap to be allocated with THP hugepages. We also set THP to madvise
>> to allow hugepage backing of this heap.
>>
>> /usr/bin/time
>>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>>
>> The above command allocates 700G of Java heap using hugepages.
>>
>> - With vanilla 5.6.0-rc3
>>
>>   17.39user 1666.48system 27:37.89elapsed
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>>   8.35user 194.58system 3:19.62elapsed
>>
>> Elapsed time remains around 3:15, as proactiveness is further increased.
>>
>> Note that proactive compaction happens throughout the runtime of these
>> workloads. The situation of one-time compaction, sufficient to supply
>> hugepages for following allocation stream, can probably happen for more
>> extreme proactiveness values, like 80 or 90.
>>
>> In the above Java workload, proactiveness is set to 20. The test starts
>> with a node's score of 80 or higher, depending on the delay between the
>> fragmentation step and starting the benchmark, which gives more-or-less
>> time for the initial round of compaction. As the benchmark consumes
>> hugepages, node's score quickly rises above the high threshold (90) and
>> proactive compaction starts again, which brings down the score to the
>> low threshold level (80). Repeat.
>>
>> bpftrace also confirms proactive compaction running 20+ times during the
>> runtime of this Java benchmark. kcompactd threads consume 100% of one of
>> the CPUs while it tries to bring a node's score within thresholds.
>>
>> Backoff behavior
>> ================
>>
>> Above workloads produce a memory state which is easy to compact.
>> However, if memory is filled with unmovable pages, proactive compaction
>> should essentially back off. To test this aspect:
>>
>> - Created a kernel driver that allocates almost all memory as hugepages
>>   followed by freeing first 3/4 of each hugepage.
>> - Set proactiveness=40
>> - Note that proactive_compact_node() is deferred maximum number of times
>>   with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>>   (=> ~30 seconds between retries).
>>
>> [1] https://patchwork.kernel.org/patch/11098289/
>> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
>> [3] https://lwn.net/Articles/817905/
>>
>> Signed-off-by: Nitin Gupta
>> Reviewed-by: Vlastimil Babka
>> Reviewed-by: Khalid Aziz
>> Reviewed-by: Oleksandr Natalenko
>> Tested-by: Oleksandr Natalenko
>> To: Andrew Morton
>> CC: Vlastimil Babka
>> CC: Khalid Aziz
>> CC: Michal Hocko
>> CC: Mel Gorman
>> CC: Matthew Wilcox
>> CC: Mike Kravetz
>> CC: Joonsoo Kim
>> CC: David Rientjes
>> CC: Nitin Gupta
>> CC: Oleksandr Natalenko
>> CC: linux-kernel
>> CC: linux-mm
>> CC: Linux API
>
> This is now in -next and causes the following build failure:
>
> $ make -skj"$(nproc)" ARCH=mips CROSS_COMPILE=mipsel-linux- O=out/mipsel distclean malta_kvm_guest_defconfig mm/compaction.o
> In file included from include/linux/dev_printk.h:14,
>                  from include/linux/device.h:15,
>                  from include/linux/node.h:18,
>                  from include/linux/cpu.h:17,
>                  from mm/compaction.c:11:
> In function 'fragmentation_score_zone',
>     inlined from '__compact_finished' at mm/compaction.c:1982:11,
>     inlined from 'compact_zone' at mm/compaction.c:2062:8:
> include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
>   339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>       |                                      ^
> include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
>   320 |    prefix ## suffix(); \
>       |    ^~~~~~
> include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
>   339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>       |  ^~~~~~~~~~~~~~~~~~~
> include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>    39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>       |                                     ^~~~~~~~~~~~~~~~~~
> include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>    59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>       |                     ^~~~~~~~~~~~~~~~
> arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
>    70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
>       |                              ^~~~~~~~~
> mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
>    66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
>       |                                ^~~~~~~~~~~~~~~~~~
> mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
>  1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>       |                            ^~~~~~~~~~~~~~~~~~~~~~
> In function 'fragmentation_score_zone',
>     inlined from 'kcompactd' at mm/compaction.c:1918:12:
> include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
>   339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>       |                                      ^
> include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
>   320 |    prefix ## suffix(); \
>       |    ^~~~~~
> include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
>   339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>       |  ^~~~~~~~~~~~~~~~~~~
> include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>    39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>       |                                     ^~~~~~~~~~~~~~~~~~
> include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>    59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>       |                     ^~~~~~~~~~~~~~~~
> arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
>    70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
>       |                              ^~~~~~~~~
> mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
>    66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
>       |                                ^~~~~~~~~~~~~~~~~~
> mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
>  1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>       |                            ^~~~~~~~~~~~~~~~~~~~~~
> In function 'fragmentation_score_zone',
>     inlined from 'kcompactd' at mm/compaction.c:1918:12:
> include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
>   339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>       |                                      ^
> include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
>   320 |    prefix ## suffix(); \
>       |    ^~~~~~
> include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
>   339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>       |  ^~~~~~~~~~~~~~~~~~~
> include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>    39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>       |                                     ^~~~~~~~~~~~~~~~~~
> include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>    59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>       |                     ^~~~~~~~~~~~~~~~
> arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
>    70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
>       |                              ^~~~~~~~~
> mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
>    66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
>       |                                ^~~~~~~~~~~~~~~~~~
> mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
>  1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>       |                            ^~~~~~~~~~~~~~~~~~~~~~
> In function 'fragmentation_score_zone',
>     inlined from 'kcompactd' at mm/compaction.c:1918:12:
> include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
>   339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>       |                                      ^
> include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
>   320 |    prefix ## suffix(); \
>       |    ^~~~~~
> include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
>   339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>       |  ^~~~~~~~~~~~~~~~~~~
> include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>    39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>       |                                     ^~~~~~~~~~~~~~~~~~
> include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>    59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>       |                     ^~~~~~~~~~~~~~~~
> arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
>    70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
>       |                              ^~~~~~~~~
> mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
>    66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
>       |                                ^~~~~~~~~~~~~~~~~~
> mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
>  1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>       |                            ^~~~~~~~~~~~~~~~~~~~~~
> make[3]: *** [scripts/Makefile.build:281: mm/compaction.o] Error 1
> make[3]: Target '__build' not remade because of errors.
> make[2]: *** [Makefile:1765: mm] Error 2
> make[2]: Target 'mm/compaction.o' not remade because of errors.
> make[1]: *** [Makefile:336: __build_one_by_one] Error 2
> make[1]: Target 'distclean' not remade because of errors.
> make[1]: Target 'malta_kvm_guest_defconfig' not remade because of errors.
> make[1]: Target 'mm/compaction.o' not remade because of errors.
> make: *** [Makefile:185: __sub-make] Error 2
> make: Target 'distclean' not remade because of errors.
> make: Target 'malta_kvm_guest_defconfig' not remade because of errors.
> make: Target 'mm/compaction.o' not remade because of errors.
>
> I am not sure why MIPS is special with its handling of hugepage support
> but I am far from a MIPS expert :)

It seems that both HUGETLB_PAGE and TRANSPARENT_HUGEPAGE are disabled with
malta_kvm_guest_defconfig, so HUGETLB_PAGE_ORDER expands to the BUILD_BUG()
stub at arch/mips/include/asm/page.h:70 shown in the log above.

>
> Cheers,
> Nathan
>
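
For reference, here is a minimal Python sketch of the threshold translation described in the quoted patch description (low = 100 - proactiveness, high = low + 10, node score as the present_pages-weighted mean of per-zone scores). It is only an illustration of the arithmetic, not the kernel code: the function names are hypothetical, and clamping the high threshold to 100 is my own assumption.

```python
from typing import Iterable, Tuple


def score_thresholds(proactiveness: int) -> Tuple[int, int]:
    """Translate the sysctl value into (low, high) score thresholds.

    Follows the formula quoted in the patch description. The clamp of
    `high` to 100 is an assumption made for this sketch.
    """
    if not 0 <= proactiveness <= 100:
        raise ValueError("proactiveness must be in [0, 100]")
    low = 100 - proactiveness
    high = min(low + 10, 100)
    return low, high


def node_score(zones: Iterable[Tuple[float, int]]) -> float:
    """Weighted mean of per-zone scores, weighted by present_pages.

    `zones` is a hypothetical list of (zone_score, present_pages) pairs.
    """
    zones = list(zones)
    total_pages = sum(pages for _, pages in zones)
    if total_pages == 0:
        return 0.0
    return sum(score * pages for score, pages in zones) / total_pages


def should_start_compaction(score: float, proactiveness: int) -> bool:
    """Proactive compaction starts once the score exceeds the high threshold."""
    _, high = score_thresholds(proactiveness)
    return score > high


if __name__ == "__main__":
    print(score_thresholds(20))
    print(node_score([(90, 300), (50, 100)]))
```

With the default proactiveness of 20 this gives the low=80 and high=90 values mentioned in the description.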