From: Jan Stancek
Date: Wed, 12 Dec 2018 10:41:42 -0500 (EST)
Message-ID: <1125108393.85764095.1544629302243.JavaMail.zimbra@redhat.com>
In-Reply-To: <769820788.85756226.1544627443438.JavaMail.zimbra@redhat.com>
Subject: [bug?] poor migrate_pages() performance on arm64
To: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Cc: jstancek@redhat.com, "ltp@lists.linux.it"

Hi,

I'm observing migrate_pages() taking quite a long time on an arm64 system
(Huawei TaiShan 2280, 4 nodes, 64 CPUs). I'm using 4.20.0-rc6, but it's
reproducible with older kernels (4.14) as well.

The test (see [1] below) is a trivial C application that migrates the
current process from one node to another. A more complicated example is
LTP's migrate_pages03, where this was originally reported.

It takes 2+ seconds to migrate the process from one node to another:

# strace -f -t -T ./a.out
...
[pid 13754] 10:17:13 migrate_pages(0, 8, [0x0000000000000002], [0x0000000000000001]) = 1 <0.058115>
[pid 13754] 10:17:13 migrate_pages(0, 8, [0x0000000000000001], [0x0000000000000002]) = 12 <2.348186>
[pid 13754] 10:17:16 migrate_pages(0, 8, [0x0000000000000002], [0x0000000000000001]) = 1 <0.057889>
[pid 13754] 10:17:16 migrate_pages(0, 8, [0x0000000000000001], [0x0000000000000002]) = 10 <2.194890>
...

This scales with the number of children. For example, with MAXCHILD 1000
it takes ~33 seconds:

# strace -f -t -T ./a.out
...
[pid 13773] 10:17:55 migrate_pages(0, 8, [0x0000000000000001], [0x0000000000000002]) = 11 <33.615550>
[pid 13773] 10:18:29 migrate_pages(0, 8, [0x0000000000000002], [0x0000000000000001]) = 2 <5.460270>
...

It appears to be related to migration of shared pages, presumably
executable code of glibc.

If I run [1] without CAP_SYS_NICE, it completes very quickly:

# sudo -u nobody strace -f -t -T ./a.out
...
[pid 14847] 10:24:57 migrate_pages(0, 8, [0x0000000000000001], [0x0000000000000002]) = 0 <0.000172>
[pid 14847] 10:24:57 migrate_pages(0, 8, [0x0000000000000002], [0x0000000000000001]) = 0 <0.000091>
[pid 14847] 10:24:57 migrate_pages(0, 8, [0x0000000000000001], [0x0000000000000002]) = 0 <0.000074>
[pid 14847] 10:24:57 migrate_pages(0, 8, [0x0000000000000002], [0x0000000000000001]) = 0 <0.000069>
...

Looking at perf, most of the time is spent invalidating the icache.

-  100.00%     0.00%  a.out    [kernel.kallsyms]  [k] __sys_trace_return
   - __sys_trace_return
      - 100.00% __se_sys_migrate_pages
           do_migrate_pages.part.9
         - migrate_pages
            - 99.92% rmap_walk
               - 99.92% rmap_walk_file
                  - 99.90% remove_migration_pte
                     - 99.85% __sync_icache_dcache
                          __flush_cache_user_range

Percent│      nop
       │      ubfx   x3, x3, #16, #4
       │      mov    x2, #0x4                   // #4
       │      lsl    x2, x2, x3
       │      sub    x3, x2, #0x1
       │      bic    x4, x0, x3
  1.82 │      dc     cvau, x4
       │      add    x4, x4, x2
       │      cmp    x4, x1
       │    → b.cc   0xffff00000809efc8         // b.lo, b.ul, fffff7f61067
       │      dsb    ish
       │      nop
  0.07 │      nop
       │      mrs    x3, ctr_el0
       │      nop
       │      and    x3, x3, #0xf
       │      mov    x2, #0x4                   // #4
       │      lsl    x2, x2, x3
       │      sub    x3, x2, #0x1
       │      bic    x3, x0, x3
 96.17 │      ic     ivau, x3
       │      add    x3, x3, x2
       │      cmp    x3, x1
       │    → b.cc   0xffff00000809f000         // b.lo, b.ul, fffff7f61067
  0.10 │      dsb    ish
       │      isb
  1.85 │      mov    x0, #0x0                   // #0
       │78: ← ret
       │      mov    x0, #0xfffffffffffffff2    // #-14
       │    ↑ b      78

Regards,
Jan

[1]
----- 8< -----
/*
 * The header names were lost from the archived copy; the four below
 * are the ones this program needs in order to build.
 */
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>

#define MAXCHILD 10

int main(void)
{
	long node1 = 1, node2 = 2;	/* single-word node masks */
	int i, child;
	int pids[MAXCHILD];

	/* Fork idle children so that shared pages have many mappers. */
	for (i = 0; i < MAXCHILD; i++) {
		child = fork();
		if (child == 0) {
			sleep(600);
			exit(0);
		}
		pids[i] = child;
	}

	/* Bounce the current process (pid 0) between the two masks. */
	for (i = 0; i < 5; i++) {
		syscall(__NR_migrate_pages, 0, 8, &node1, &node2);
		syscall(__NR_migrate_pages, 0, 8, &node2, &node1);
	}

	for (i = 0; i < MAXCHILD; i++) {
		kill(pids[i], SIGKILL);
	}

	return 0;
}
----- >8 -----