[PATCH v2 1/1] mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush

* [PATCH v2 1/1] mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush
@ 2022-03-17  7:28 Jianxing Wang
  2022-03-17 23:40 ` Andrew Morton
  0 siblings, 1 reply; 3+ messages in thread
From: Jianxing Wang @ 2022-03-17  7:28 UTC
  To: peterz
  Cc: will, aneesh.kumar, akpm, npiggin, linux-arch, linux-mm,
	linux-kernel, Jianxing Wang

Freeing a large list of pages may starve rcu_sched on
non-preemptible kernels. However, free_unref_page_list() may not be
able to call cond_resched(), as it may be called in interrupt or atomic
context, and atomic context cannot be detected with CONFIG_PREEMPTION=n.

The TLB flush batch count depends on PAGE_SIZE and is too large when
PAGE_SIZE > 4K, so limit the free batch count to 512 and add a
scheduling point in tlb_batch_pages_flush().

rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
[...]
Call Trace:
   free_unref_page_list+0x19c/0x270
   release_pages+0x3cc/0x498
   tlb_flush_mmu_free+0x44/0x70
   zap_pte_range+0x450/0x738
   unmap_page_range+0x108/0x240
   unmap_vmas+0x74/0xf0
   unmap_region+0xb0/0x120
   do_munmap+0x264/0x438
   vm_munmap+0x58/0xa0
   sys_munmap+0x10/0x20
   syscall_common+0x24/0x38

Signed-off-by: Jianxing Wang <wangjianxing@loongson.cn>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>

---
ChangeLog:
V1 -> V2: limit free batch count directly in tlb_batch_pages_flush
---
---
 mm/mmu_gather.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index afb7185ffdc4..a71924bd38c0 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -47,8 +47,20 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 	struct mmu_gather_batch *batch;

 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		free_pages_and_swap_cache(batch->pages, batch->nr);
-		batch->nr = 0;
+		struct page **pages = batch->pages;
+
+		do {
+			/*
+			 * limit free batch count when PAGE_SIZE > 4K
+			 */
+			unsigned int nr = min(512U, batch->nr);
+
+			free_pages_and_swap_cache(pages, nr);
+			pages += nr;
+			batch->nr -= nr;
+
+			cond_resched();
+		} while (batch->nr);
 	}
 	tlb->active = &tlb->local;
 }
-- 
2.31.1
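
The effect of the 512-page limit can be sketched in userspace C. This is a rough illustration only, not kernel code: it assumes the per-batch capacity is (PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *), with a 16-byte batch header and 8-byte pointers on a 64-bit build. The names batch_capacity(), stub_free_pages(), stub_cond_resched() and flush_batch_chunked() are hypothetical stand-ins invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for a 64-bit build: 16-byte batch header, 8-byte pointers. */
#define HDR_BYTES  16U
#define PTR_BYTES  8U
#define FREE_CHUNK 512U	/* per-call limit taken from the patch */

/* Page pointers that fit in one batch, assuming a batch fills one page. */
static size_t batch_capacity(size_t page_size)
{
	return (page_size - HDR_BYTES) / PTR_BYTES;
}

/* Counters updated by the stubs below. */
static unsigned int free_calls;
static unsigned int largest_chunk;
static unsigned int resched_points;

/* Hypothetical stand-in for free_pages_and_swap_cache(). */
static void stub_free_pages(void **pages, unsigned int nr)
{
	(void)pages;
	free_calls++;
	if (nr > largest_chunk)
		largest_chunk = nr;
}

/* Hypothetical stand-in for cond_resched(). */
static void stub_cond_resched(void)
{
	resched_points++;
}

/* Same chunking shape as the patched loop, applied to a single batch. */
static void flush_batch_chunked(void **pages, unsigned int nr)
{
	while (nr) {
		unsigned int chunk = nr < FREE_CHUNK ? nr : FREE_CHUNK;

		stub_free_pages(pages, chunk);
		pages += chunk;
		nr -= chunk;
		stub_cond_resched();
	}
}

/* Checks the arithmetic for a full batch with 16K pages. */
static void self_check(void)
{
	static void *pages[2046];

	free_calls = largest_chunk = resched_points = 0;
	assert(batch_capacity(16384) == 2046);
	flush_batch_chunked(pages, 2046);
	assert(free_calls == 4);	/* 512 + 512 + 512 + 510 */
	assert(largest_chunk == 512);
	assert(resched_points == 4);
}
```

Under these assumptions a 4K-page batch holds 510 pages and is already below the limit, so it is freed in one call. A 16K-page batch holds 2046 pages and is split into four calls, each followed by a scheduling point.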

* Re: [PATCH v2 1/1] mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush
  2022-03-17  7:28 [PATCH v2 1/1] mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush Jianxing Wang
@ 2022-03-17 23:40 ` Andrew Morton
  2022-03-19  4:07   ` wangjianxing
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2022-03-17 23:40 UTC
  To: Jianxing Wang
  Cc: peterz, will, aneesh.kumar, npiggin, linux-arch, linux-mm, linux-kernel

On Thu, 17 Mar 2022 03:28:57 -0400 Jianxing Wang <wangjianxing@loongson.cn> wrote:

> Freeing a large list of pages may starve rcu_sched on
> non-preemptible kernels. However, free_unref_page_list() may not be
> able to call cond_resched(), as it may be called in interrupt or atomic
> context, and atomic context cannot be detected with CONFIG_PREEMPTION=n.
>
> The TLB flush batch count depends on PAGE_SIZE and is too large when
> PAGE_SIZE > 4K, so limit the free batch count to 512 and add a
> scheduling point in tlb_batch_pages_flush().
> 
> rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
> [...]
> Call Trace:
>    free_unref_page_list+0x19c/0x270
>    release_pages+0x3cc/0x498
>    tlb_flush_mmu_free+0x44/0x70
>    zap_pte_range+0x450/0x738
>    unmap_page_range+0x108/0x240
>    unmap_vmas+0x74/0xf0
>    unmap_region+0xb0/0x120
>    do_munmap+0x264/0x438
>    vm_munmap+0x58/0xa0
>    sys_munmap+0x10/0x20
>    syscall_common+0x24/0x38

tlb_batch_pages_flush() doesn't appear in this trace.  I assume the call
sequence is

zap_pte_range
->tlb_flush_mmu
  ->tlb_flush_mmu_free

correct?

> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -47,8 +47,20 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>  	struct mmu_gather_batch *batch;
>  
>  	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
> -		free_pages_and_swap_cache(batch->pages, batch->nr);
> -		batch->nr = 0;
> +		struct page **pages = batch->pages;
> +
> +		do {
> +			/*
> +			 * limit free batch count when PAGE_SIZE > 4K
> +			 */
> +			unsigned int nr = min(512U, batch->nr);
> +
> +			free_pages_and_swap_cache(pages, nr);
> +			pages += nr;
> +			batch->nr -= nr;
> +
> +			cond_resched();
> +		} while (batch->nr);
>  	}

The patch looks safe enough.  But again, it's unlikely to work if the
calling task has realtime policy.  The same can be said of the
cond_resched() in zap_pte_range(), and presumably many others.

I'll save this away for now and will revisit after 5.18-rc1.

How serious is this problem?  Under precisely what circumstances were
you able to trigger this?  In other words, do you believe that a
backport into -stable kernels is needed and if so, why?

Thanks.


* Re: [PATCH v2 1/1] mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush
  2022-03-17 23:40 ` Andrew Morton
@ 2022-03-19  4:07   ` wangjianxing
  0 siblings, 0 replies; 3+ messages in thread
From: wangjianxing @ 2022-03-19  4:07 UTC
  To: Andrew Morton
  Cc: peterz, will, aneesh.kumar, npiggin, linux-arch, linux-mm, linux-kernel

On 03/18/2022 07:40 AM, Andrew Morton wrote:
> On Thu, 17 Mar 2022 03:28:57 -0400 Jianxing Wang <wangjianxing@loongson.cn> wrote:
>
>> Freeing a large list of pages may starve rcu_sched on
>> non-preemptible kernels. However, free_unref_page_list() may not be
>> able to call cond_resched(), as it may be called in interrupt or atomic
>> context, and atomic context cannot be detected with CONFIG_PREEMPTION=n.
>>
>> The TLB flush batch count depends on PAGE_SIZE and is too large when
>> PAGE_SIZE > 4K, so limit the free batch count to 512 and add a
>> scheduling point in tlb_batch_pages_flush().
>>
>> rcu: rcu_sched kthread starved for 5359 jiffies! g454793 f0x0
>> RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=19
>> [...]
>> Call Trace:
>>     free_unref_page_list+0x19c/0x270
>>     release_pages+0x3cc/0x498
>>     tlb_flush_mmu_free+0x44/0x70
>>     zap_pte_range+0x450/0x738
>>     unmap_page_range+0x108/0x240
>>     unmap_vmas+0x74/0xf0
>>     unmap_region+0xb0/0x120
>>     do_munmap+0x264/0x438
>>     vm_munmap+0x58/0xa0
>>     sys_munmap+0x10/0x20
>>     syscall_common+0x24/0x38
> tlb_batch_pages_flush() doesn't appear in this trace.  I assume the call
> sequence is
>
> zap_pte_range
> ->tlb_flush_mmu
>    ->tlb_flush_mmu_free
>
> correct?
Yeah, you are right.
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -47,8 +47,20 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>>   	struct mmu_gather_batch *batch;
>>   
>>   	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>> -		free_pages_and_swap_cache(batch->pages, batch->nr);
>> -		batch->nr = 0;
>> +		struct page **pages = batch->pages;
>> +
>> +		do {
>> +			/*
>> +			 * limit free batch count when PAGE_SIZE > 4K
>> +			 */
>> +			unsigned int nr = min(512U, batch->nr);
>> +
>> +			free_pages_and_swap_cache(pages, nr);
>> +			pages += nr;
>> +			batch->nr -= nr;
>> +
>> +			cond_resched();
>> +		} while (batch->nr);
>>   	}
> The patch looks safe enough.  But again, it's unlikely to work if the
> calling task has realtime policy.  The same can be said of the
> cond_resched() in zap_pte_range(), and presumably many others.
Yes, cond_resched() does not help in a task with realtime policy. Sorry,
I have no good idea for that case at the moment.
> I'll save this away for now and will revisit after 5.18-rc1.
>
> How serious is this problem?  Under precisely what circumstances were
> you able to trigger this?  In other words, do you believe that a
> backport into -stable kernels is needed and if so, why?
>
> Thanks.
>
The issue was detected in a guest with 200% KVM CPU overcommit. However, I
did not see the warning on the host with the same application.
I'm sure that the patch is needed for the guest kernel, but I'm not sure
about the host.

> Under precisely what circumstances were you able to trigger this?
Set up two virtual machines on one host machine, each VM with the same
number of CPUs as the host and half of the host's memory.
Then run ltpstress.sh in each VM, and the RCU stall warning will appear.
The kernel has preemption disabled; append 'preempt=none' to the kernel
command line if dynamic preemption is enabled.
It can be reproduced on a Loongson machine (32 cores, 128G memory) and a
ProLiant DL380 Gen9 (x86 E5-2680, 28 cores, 64G memory).

