From: Hillf Danton <hdanton@sina.com>
To: Hu Weiwen
Cc: Hillf Danton, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Very slow page fault caused by migrate_pages
Date: Sun, 16 May 2021 16:56:44 +0800
Message-Id: <20210516085644.13800-1-hdanton@sina.com>
In-Reply-To: <20210516073956.GA550523@laptop>
References: <20210515170104.GA469239@laptop> <20210516025035.5724-1-hdanton@sina.com>

On Sun, 16 May 2021 15:39:56 +0800 Hu Weiwen wrote:
>On Sun, May 16, 2021 at 10:50:35AM +0800, Hillf Danton wrote:
>> On Sun, 16 May 2021 01:01:04 +0800 Hu Weiwen wrote:
>> > Hi all,
>> >
>> > We observed our program sometimes running over 1000x slower than normal and
>> > cannot find the root cause. I would be grateful if someone can give me a hint.
>> > I also searched on this mailing list but failed to find anything related.
>> >
>> > The affected process spends almost 100% of its CPU time in the kernel. First I
>> > ran `perf' to identify where the time is spent. It turns out 99.93% of the time
>> > is spent in do_page_fault, and 98.16% in migrate_pages. And `perf record -e
>> > faults -t ` shows 3 seconds per fault, which seems too much to me.
>> >
>> > Then I ran `perf record -e compaction:*` and saw 12k
>> > compaction:mm_compaction_migratepages events per second. Almost all of them have
>> > nr_migrated=0 and nr_failed greater than 0. So it seems a lot of migration
>> > failures are happening, but I cannot understand what causes them. The more
>> > complete perf output is attached below.
>> >
>> > This system is used to run machine-learning training programs,
>> > with 2 Intel Xeon CPUs, 512GB of RAM and 8 NVIDIA GPUs attached,
>> > running Ubuntu 20.04 with kernel version 5.4.0-72-generic.
>>
>> Given the do_huge_pmd_anonymous_page in your perf report below, what you
>> need is likely commit facdaa917c4d ("mm: proactive compaction").
>>
>> You need to port it back to the linux-5.4 kernel powering your box if there is
>> anything that prevents you from using the mainline kernel. Feel free to
>> report again if it does not work for you.
>
>Thank you for your reply. Will try it. But it is hard to say whether it is
>working; we can only reproduce this about once per month across 10+ similar boxes.
>
>But I think things still do not add up.
>
>The last time I saw this, I left it running in this state for 6 hours, but it
>still did not get better. If it were just fragmented memory causing too much
>compaction on huge page allocation, I'd expect it to resolve itself in minutes.

Sounds reasonable, given that "367GB of RAM is available when we observe the
problem" (free -g shows the available RAM).

>The commit message of facdaa917c4d reports 700GB of memory being allocated in 27
>minutes without that patch, but we hardly made any progress in 6 hours.
>
>It also reports at most 33ms of allocation latency without that patch, but we
>see 3 seconds per page fault.
>
>Over 99% of the pages passed to `migrate_pages' fail to migrate. This is
>concerning and might relate to our issue. I plan to investigate this further
>the next time I see the issue. Any tips on how to find out why the migrations
>failed are appreciated.

I know of no such tips, if they exist, apart from migrate.c itself; a rough
tracing sketch is appended after your perf output below.

If d479960e44f2 ("mm: disable LRU pagevec during the migration temporarily")
is not a cure, then try to see which commit works with the GPU module unloaded.

>
>> > Load average is 13, and 367GB of RAM is available when
>> > we observe the problem this time. We have observed the very same problem on
>> > multiple nodes, running different programs, with different models of CPU and GPU.
>> >
>> > Currently, only rebooting the system can resolve this. When it happens, every
>> > process on the system is affected, and it is not resolved by restarting the
>> > affected process.
>> >
>> > Thanks,
>> > Hu Weiwen
>> > SCUT student
>> >
>> >
>> > `perf report` output, collected with `sudo perf record -F 99 --pid 2413158 -g -- sleep 10`:
>> >
>> > Samples: 993 of event 'cycles', Event count (approx.): 36810932529
>> >   Children      Self  Command  Shared Object  Symbol
>> > -   99.93%     0.00%  python   libc-2.31.so   [.] 0x00007ff5fcc78984
>> >        0x7ff5fcc78984
>> >        page_fault
>> >        do_page_fault
>> >        __do_page_fault
>> >        do_user_addr_fault
>> >        handle_mm_fault
>> >        __handle_mm_fault
>> >        do_huge_pmd_anonymous_page
>> >        alloc_pages_vma
>> >        __alloc_pages_nodemask
>> >        __alloc_pages_slowpath
>> >        __alloc_pages_direct_compact
>> >        try_to_compact_pages
>> >        compact_zone_order
>> >      - compact_zone
>> >         - 98.16% migrate_pages
>> >            - 93.30% try_to_unmap
>> >               - rmap_walk
>> >                  - 92.12% rmap_walk_file
>> >                     - 90.45% try_to_unmap_one
>> >                        - 83.48% ptep_clear_flush
>> >                           - 76.33% flush_tlb_mm_range
>> >                              - 74.35% native_flush_tlb_others
>> >                                 - 73.75% on_each_cpu_cond_mask
>> >                                    - 70.85% on_each_cpu_mask
>> >                                       - 69.57% smp_call_function_many
>> >                                          - 67.20% smp_call_function_single
>> >                                               4.93% llist_add_batch
>> >                                             + 1.09% generic_exec_single
>> >                                            1.19% smp_call_function_single
>> >                           + 2.01% zalloc_cpumask_var
>> >                          1.45% page_vma_mapped_walk
>> >                          0.78% down_read
>> >                          0.79% up_read
>> >              + 3.29% remove_migration_ptes
>> >             0.59% move_to_new_page
>> >
>> >
>> > `perf report` output, collected with `sudo perf record -e compaction:* -a -- sleep 10`:
>> >
>> > Samples: 133K of event 'compaction:mm_compaction_migratepages', Event count (approx.): 133394
>> > Overhead  Trace output
>> >   26.25%  nr_migrated=0 nr_failed=32
>> >    5.00%  nr_migrated=0 nr_failed=2
>> >    4.64%  nr_migrated=0 nr_failed=4
>> >    4.24%  nr_migrated=0 nr_failed=3
>> >    4.20%  nr_migrated=0 nr_failed=6
>> >    3.83%  nr_migrated=0 nr_failed=1
>> >    3.79%  nr_migrated=0 nr_failed=5
>> >    3.62%  nr_migrated=0 nr_failed=7
>> >    3.45%  nr_migrated=0 nr_failed=8
>> >    3.06%  nr_migrated=0 nr_failed=9
>> >    2.68%  nr_migrated=0 nr_failed=10
>> >    2.68%  nr_migrated=0 nr_failed=11
>> >    2.42%  nr_migrated=0 nr_failed=12
>> >    2.35%  nr_migrated=0 nr_failed=13
>> >    2.11%  nr_migrated=0 nr_failed=14
>> >    1.93%  nr_migrated=0 nr_failed=16
>> >    1.90%  nr_migrated=0 nr_failed=15
>> >    1.64%  nr_migrated=0 nr_failed=17
>> >    1.50%  nr_migrated=0 nr_failed=18
>> >    1.25%  nr_migrated=0 nr_failed=19
>> >    1.25%  nr_migrated=0 nr_failed=20
>> >    1.22%  nr_migrated=0 nr_failed=21
>> >    ...
>> >    0.00%  nr_migrated=0 nr_failed=207
>> >    0.00%  nr_migrated=0 nr_failed=258
>> >    0.00%  nr_migrated=0 nr_failed=273
>> >    0.00%  nr_migrated=0 nr_failed=277
>> >    0.00%  nr_migrated=10 nr_failed=22
>> >    0.00%  nr_migrated=13 nr_failed=19
>> >    0.00%  nr_migrated=2 nr_failed=2
>> >    0.00%  nr_migrated=3 nr_failed=13
>> >    0.00%  nr_migrated=3 nr_failed=17
>> >    0.00%  nr_migrated=3 nr_failed=25
>> >    0.00%  nr_migrated=3 nr_failed=27
>> >    0.00%  nr_migrated=3 nr_failed=28
>> >    0.00%  nr_migrated=4 nr_failed=10
>> >    0.00%  nr_migrated=4 nr_failed=15
>> >    0.00%  nr_migrated=4 nr_failed=24
>> >    0.00%  nr_migrated=5 nr_failed=10
>> >    0.00%  nr_migrated=5 nr_failed=15
>> >    0.00%  nr_migrated=6 nr_failed=20
>> >    0.00%  nr_migrated=6 nr_failed=21
>> >    0.00%  nr_migrated=8 nr_failed=19
>
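On the "why did migrate fail" question, here is the rough sketch mentioned
above of what could be captured the next time a box wedges. It is untested
and only a suggestion: it assumes debugfs is mounted at /sys/kernel/debug,
that the tracepoints below are built into your 5.4 kernel, and that
vm.compaction_proactiveness exists only after the facdaa917c4d backport is
applied.

  # Global compaction/migration/THP counters; diff a healthy window
  # against a stalled one.
  grep -E 'compact|pgmigrate|thp_fault' /proc/vmstat

  # Per-zone fragmentation picture; extfrag_index values approaching 1.000
  # at high orders mean order-9 (THP) allocations are failing because of
  # fragmentation rather than lack of free memory.
  cat /proc/buddyinfo
  cat /sys/kernel/debug/extfrag/extfrag_index

  # The migrate tracepoint reports per-call succeeded/failed counts plus the
  # migration mode and reason; perf script then shows which tasks and call
  # chains issue the failing calls.
  sudo perf record -e migrate:mm_migrate_pages -a -g -- sleep 10
  sudo perf script

  # Present only once the facdaa917c4d backport is applied; 0 disables
  # proactive compaction, larger values compact more eagerly.
  cat /proc/sys/vm/compaction_proactiveness 2>/dev/null

Comparing those vmstat counters between a healthy window and a stalled one
should at least show whether pgmigrate_fail explodes together with the
page-fault stalls, and the tracepoint's reason field should narrow down who
is driving the failing migrations.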