From: wangjianxing <wangjianxing@loongson.cn>
Subject: Re: [PATCH v2 1/1] mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush
To: Andrew Morton
Cc: peterz@infradead.org, will@kernel.org, aneesh.kumar@linux.ibm.com, npiggin@gmail.com, linux-arch@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20220317072857.2635262-1-wangjianxing@loongson.cn> <20220317164011.27d7341715de12d890ca244a@linux-foundation.org>
In-Reply-To: <20220317164011.27d7341715de12d890ca244a@linux-foundation.org>
Date: Sat, 19 Mar 2022 12:07:37 +0800

On 03/18/2022 07:40 AM, Andrew Morton wrote:
> On Thu, 17 Mar 2022 03:28:57 -0400 Jianxing Wang <wangjianxing@loongson.cn> wrote:
>
>> Freeing a large list of pages can cause rcu_sched starvation on
>> non-preemptible kernels. However, free_unref_page_list() can't
>> cond_resched() itself, as it may be called from interrupt or atomic
>> context; in particular, atomic context can't be detected when
>> CONFIG_PREEMPTION=n.
>>
>> The TLB flush batch count depends on PAGE_SIZE, so it becomes too
>> large when PAGE_SIZE > 4K; limit the free batch count to 512 and add
>> a schedule point in tlb_batch_pages_flush().
>>
>> rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
>> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
>> [...]
>> Call Trace:
>>    free_unref_page_list+0x19c/0x270
>>    release_pages+0x3cc/0x498
>>    tlb_flush_mmu_free+0x44/0x70
>>    zap_pte_range+0x450/0x738
>>    unmap_page_range+0x108/0x240
>>    unmap_vmas+0x74/0xf0
>>    unmap_region+0xb0/0x120
>>    do_munmap+0x264/0x438
>>    vm_munmap+0x58/0xa0
>>    sys_munmap+0x10/0x20
>>    syscall_common+0x24/0x38
> tlb_batch_pages_flush() doesn't appear in this trace.  I assume the call
> sequence is
>
> zap_pte_range
> ->tlb_flush_mmu
>   ->tlb_flush_mmu_free
>
> correct?
Yeah, you are right.
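
For background, the per-batch capacity scales with PAGE_SIZE through
MAX_GATHER_BATCH. Below is a rough userspace sketch of that arithmetic;
the 16-byte struct mmu_gather_batch header is my assumption for the
sketch, the real definition lives in include/asm-generic/tlb.h:

#include <stdio.h>

int main(void)
{
	/*
	 * Assumed header of struct mmu_gather_batch: one next pointer
	 * plus two unsigned ints (nr, max); see include/asm-generic/tlb.h
	 * for the real layout.
	 */
	const unsigned long header = sizeof(void *) + 2 * sizeof(unsigned int);
	const unsigned long page_sizes[] = { 4096UL, 16384UL, 65536UL };

	for (int i = 0; i < 3; i++) {
		unsigned long ps = page_sizes[i];

		/* MAX_GATHER_BATCH = (PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *) */
		printf("PAGE_SIZE=%5lu -> up to %lu pages per batch\n",
		       ps, (ps - header) / sizeof(void *));
	}
	return 0;
}

On 64-bit this gives roughly 510 pages per batch with 4K pages, but
2046 with 16K and 8190 with 64K, which is why the patch caps each
free_pages_and_swap_cache() call at 512.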
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -47,8 +47,20 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>  	struct mmu_gather_batch *batch;
>>
>>  	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>> -		free_pages_and_swap_cache(batch->pages, batch->nr);
>> -		batch->nr = 0;
>> +		struct page **pages = batch->pages;
>> +
>> +		do {
>> +			/*
>> +			 * limit free batch count when PAGE_SIZE > 4K
>> +			 */
>> +			unsigned int nr = min(512U, batch->nr);
>> +
>> +			free_pages_and_swap_cache(pages, nr);
>> +			pages += nr;
>> +			batch->nr -= nr;
>> +
>> +			cond_resched();
>> +		} while (batch->nr);
>>  	}
> The patch looks safe enough.  But again, it's unlikely to work if the
> calling task has realtime policy.  The same can be said of the
> cond_resched() in zap_pte_range(), and presumably many others.
Yes, cond_resched() can't help when the calling task has realtime policy; sorry, but I have no good idea for that case right now.
> I'll save this away for now and will revisit after 5.18-rc1.
>
> How serious is this problem?  Under precisely what circumstances were
> you able to trigger this?  In other words, do you believe that a
> backport into -stable kernels is needed and if so, why?
>
> Thanks.

The issue is detected in a guest with KVM CPU 200% overcommit; however, I didn't see the warning on the host running the same application.
I'm sure the patch is needed for the guest kernel, but not sure about the host.

> Under precisely what circumstances were you able to trigger this?
Set up two virtual machines on one host machine; each VM has the same number of CPUs as the host and half of the host's memory.
Then run ltpstress.sh in each VM and the rcu stall warning will appear. The kernel has preemption disabled; append 'preempt=none' to the kernel command line if dynamic preemption is enabled.
It could be detected on a Loongson machine (32 cores, 128G memory) and on a ProLiant DL380 Gen9 (x86 E5-2680, 28 cores, 64G memory).
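
For reference, a minimal userspace loop that exercises the munmap()
path from the stall trace above; this is a toy stand-in for the
ltpstress workload, and the region size and iteration count are
arbitrary choices for the sketch:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Repeatedly map, touch and unmap a large anonymous region so each
 * munmap() walks the zap_pte_range() -> tlb_flush_mmu_free() ->
 * release_pages() path from the stall trace.
 */
int main(void)
{
	const size_t len = 1UL << 30; /* 1 GiB */

	for (int i = 0; i < 64; i++) {
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 0x5a, len);	/* populate every PTE */
		munmap(p, len);		/* free the whole range in one call */
	}
	return 0;
}

Running several instances of this in parallel inside the overcommitted
guest should approximate the load that triggered the warning.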
