Date: Mon, 22 Jun 2020 19:26:36 -0700
From: Nathan Chancellor
To: Nitin Gupta
Cc: Andrew Morton, Vlastimil Babka, Khalid Aziz, Oleksandr Natalenko, Michal Hocko, Mel Gorman, Matthew Wilcox, Mike Kravetz, Joonsoo Kim, David Rientjes, Nitin Gupta, linux-kernel, linux-mm, Linux API, linux-mips@vger.kernel.org
Subject: Re: [PATCH v8] mm: Proactive compaction
Message-ID: <20200623022636.GA1051134@ubuntu-n2-xlarge-x86>
References: <20200616204527.19185-1-nigupta@nvidia.com>
In-Reply-To: <20200616204527.19185-1-nigupta@nvidia.com>

On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-demand
> compaction as we request more hugepages, but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that kernel is able
> to restore a highly fragmented memory state to a fairly compacted memory
> state within <1 sec for a 32G system.
> Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory as
> hugepages keeping allocation latencies low.
>
> For a more proactive compaction, the approach taken here is to define a
> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
> for external fragmentation which kcompactd tries to maintain.
>
> The tunable takes a value in range [0, 100], with a default of 20.
>
> Note that a previous version of this patch [1] was found to introduce
> too many tunables (per-order extfrag{low, high}), but this one reduces
> them to just one sysctl. Also, the new tunable is an opaque value
> instead of asking for specific bounds of "external fragmentation", which
> would have been difficult to estimate. The internal interpretation of
> this opaque value allows for future fine-tuning.
>
> Currently, we use a simple translation from this tunable to [low, high]
> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
> The score for a node is defined as weighted mean of per-zone external
> fragmentation. A zone's present_pages determines its weight.
>
> To periodically check per-node score, we reuse per-node kcompactd
> threads, which are woken up every 500 milliseconds to check the same. If
> a node's score exceeds its high threshold (as derived from user-provided
> proactiveness value), proactive compaction is started until its score
> reaches its low threshold value. By default, proactiveness is set to 20,
> which implies threshold values of low=80 and high=90.
>
> This patch is largely based on ideas from Michal Hocko [2]. See also the
> LWN article [3].
>
> Performance data
> ================
>
> System: x86_64, 1T RAM, 80 CPU threads.
> Kernel: 5.6.0-rc3 + this patch
>
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>
> Before starting the driver, the system was fragmented from a userspace
> program that allocates all memory and then for each 2M aligned section,
> frees 3/4 of base pages using munmap. The workload is mainly anonymous
> userspace pages, which are easy to move around. I intentionally avoided
> unmovable pages in this test to see how much latency we incur when
> hugepage allocations hit direct compaction.
>
> 1. Kernel hugepage allocation latencies
>
> With the system in such a fragmented state, a kernel driver then
> allocates as many hugepages as possible and measures allocation
> latency:
>
> (all latency values are in microseconds)
>
> - With vanilla 5.6.0-rc3
>
>   percentile latency
>   ---------- -------
>            5    7894
>           10    9496
>           25   12561
>           30   15295
>           40   18244
>           50   21229
>           60   27556
>           75   30147
>           80   31047
>           90   32859
>           95   33799
>
> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
> sysctl -w vm.compaction_proactiveness=20
>
>   percentile latency
>   ---------- -------
>            5       2
>           10       2
>           25       3
>           30       3
>           40       3
>           50       4
>           60       4
>           75       4
>           80       4
>           90       5
>           95     429
>
> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
>
> 2. JAVA heap allocation
>
> In this test, we first fragment memory using the same method as for (1).
>
> Then, we start a Java process with a heap size set to 700G and request
> the heap to be allocated with THP hugepages. We also set THP to madvise
> to allow hugepage backing of this heap.
>
> /usr/bin/time
>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>
> The above command allocates 700G of Java heap using hugepages.
>
> - With vanilla 5.6.0-rc3
>
> 17.39user 1666.48system 27:37.89elapsed
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
> 8.35user 194.58system 3:19.62elapsed
>
> Elapsed time remains around 3:15, as proactiveness is further increased.
>
> Note that proactive compaction happens throughout the runtime of these
> workloads. The situation of one-time compaction, sufficient to supply
> hugepages for following allocation stream, can probably happen for more
> extreme proactiveness values, like 80 or 90.
>
> In the above Java workload, proactiveness is set to 20. The test starts
> with a node's score of 80 or higher, depending on the delay between the
> fragmentation step and starting the benchmark, which gives more-or-less
> time for the initial round of compaction. As the benchmark consumes
> hugepages, node's score quickly rises above the high threshold (90) and
> proactive compaction starts again, which brings down the score to the
> low threshold level (80). Repeat.
>
> bpftrace also confirms proactive compaction running 20+ times during the
> runtime of this Java benchmark. kcompactd threads consume 100% of one of
> the CPUs while it tries to bring a node's score within thresholds.
>
> Backoff behavior
> ================
>
> Above workloads produce a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off.
> To test this aspect:
>
> - Created a kernel driver that allocates almost all memory as hugepages
>   followed by freeing first 3/4 of each hugepage.
> - Set proactiveness=40
> - Note that proactive_compact_node() is deferred maximum number of times
>   with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>   (=> ~30 seconds between retries).
>
> [1] https://patchwork.kernel.org/patch/11098289/
> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> [3] https://lwn.net/Articles/817905/
>
> Signed-off-by: Nitin Gupta
> Reviewed-by: Vlastimil Babka
> Reviewed-by: Khalid Aziz
> Reviewed-by: Oleksandr Natalenko
> Tested-by: Oleksandr Natalenko
> To: Andrew Morton
> CC: Vlastimil Babka
> CC: Khalid Aziz
> CC: Michal Hocko
> CC: Mel Gorman
> CC: Matthew Wilcox
> CC: Mike Kravetz
> CC: Joonsoo Kim
> CC: David Rientjes
> CC: Nitin Gupta
> CC: Oleksandr Natalenko
> CC: linux-kernel
> CC: linux-mm
> CC: Linux API

This is now in -next and causes the following build failure:

$ make -skj"$(nproc)" ARCH=mips CROSS_COMPILE=mipsel-linux- O=out/mipsel distclean malta_kvm_guest_defconfig mm/compaction.o
In file included from include/linux/dev_printk.h:14,
                 from include/linux/device.h:15,
                 from include/linux/node.h:18,
                 from include/linux/cpu.h:17,
                 from mm/compaction.c:11:
In function 'fragmentation_score_zone',
    inlined from '__compact_finished' at mm/compaction.c:1982:11,
    inlined from 'compact_zone' at mm/compaction.c:2062:8:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix(); \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix(); \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix(); \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix(); \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:281: mm/compaction.o] Error 1
make[3]: Target '__build' not remade because of errors.
make[2]: *** [Makefile:1765: mm] Error 2
make[2]: Target 'mm/compaction.o' not remade because of errors.
make[1]: *** [Makefile:336: __build_one_by_one] Error 2
make[1]: Target 'distclean' not remade because of errors.
make[1]: Target 'malta_kvm_guest_defconfig' not remade because of errors.
make[1]: Target 'mm/compaction.o' not remade because of errors.
make: *** [Makefile:185: __sub-make] Error 2
make: Target 'distclean' not remade because of errors.
make: Target 'malta_kvm_guest_defconfig' not remade because of errors.
make: Target 'mm/compaction.o' not remade because of errors.

I am not sure why MIPS is special with its handling of hugepage support
but I am far from a MIPS expert :)

Cheers,
Nathan
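
For readers following the thread, the threshold arithmetic from the quoted patch description (low = 100 - proactiveness, high = low + 10, node score as a mean of per-zone external fragmentation weighted by present_pages) can be sketched in userspace C. This is only an illustration of the description, not the code in mm/compaction.c: struct zone_sample and the helper names score_low(), score_high(), node_score() and should_compact() are hypothetical, and clamping the high threshold at 100 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a zone; not the kernel's struct zone. */
struct zone_sample {
	unsigned long present_pages;	/* weight of this zone in the node score */
	unsigned int extfrag;		/* external fragmentation, 0..100 */
};

/* low = 100 - proactiveness, as stated in the patch description. */
static unsigned int score_low(unsigned int proactiveness)
{
	assert(proactiveness <= 100);	/* the tunable's range is [0, 100] */
	return 100 - proactiveness;
}

/* high = low + 10; clamping at 100 is an assumption of this sketch. */
static unsigned int score_high(unsigned int proactiveness)
{
	unsigned int high = score_low(proactiveness) + 10;

	return high > 100 ? 100 : high;
}

/* Node score: mean of per-zone extfrag weighted by present_pages. */
static unsigned int node_score(const struct zone_sample *zones, size_t n)
{
	unsigned long long weighted = 0, total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		weighted += (unsigned long long)zones[i].extfrag *
			    zones[i].present_pages;
		total += zones[i].present_pages;
	}
	return total ? (unsigned int)(weighted / total) : 0;
}

/*
 * Hysteresis as described: compaction starts once the score exceeds the
 * high threshold and, once running, continues until the score reaches
 * the low threshold.
 */
static int should_compact(unsigned int score, unsigned int proactiveness,
			  int running)
{
	if (running)
		return score > score_low(proactiveness);
	return score > score_high(proactiveness);
}
```

With the default proactiveness of 20, score_low(20) and score_high(20) give the 80 and 90 quoted in the description, and should_compact() stays true from a score above 90 down to 80.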