From: Baoquan He <bhe@redhat.com>
To: Uladzislau Rezki, Hailong Liu
Cc: Nick Bowler, linux-kernel@vger.kernel.org, Linux regressions mailing list, linux-mm@kvack.org, sparclinux@vger.kernel.org, Andrew Morton
Date: Tue, 25 Jun 2024 23:49:20 +0800
Subject: Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
On 06/25/24 at 05:33pm, Uladzislau Rezki wrote:
> On Tue, Jun 25, 2024 at 09:02:43PM +0800, Baoquan He wrote:
> > On 06/25/24 at 02:40pm, Uladzislau Rezki wrote:
> > > On Tue, Jun 25, 2024 at 07:40:21PM +0800, Baoquan He wrote:
> > > > On 06/25/24 at 12:32pm, Uladzislau Rezki wrote:
> > > > > On Tue, Jun 25, 2024 at 11:30:33AM +0800, Baoquan He wrote:
> > > > > > On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > > > > > > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > > > > > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > > > > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > > > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > > > > > ......
> > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > > > > > > --- a/mm/vmalloc.c
> > > > > > > > > > +++ b/mm/vmalloc.c
> > > > > > > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > > > > > >  static struct xarray *
> > > > > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > > > > >  {
> > > > > > > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > > > > > >
> > > > > > > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > The problem i see is about not-initializing of the:
> > > > > > > > >
> > > > > > > > > for_each_possible_cpu(i) {
> > > > > > > > > 	struct vmap_block_queue *vbq;
> > > > > > > > > 	struct vfree_deferred *p;
> > > > > > > > >
> > > > > > > > > 	vbq = &per_cpu(vmap_block_queue, i);
> > > > > > > > > 	spin_lock_init(&vbq->lock);
> > > > > > > > > 	INIT_LIST_HEAD(&vbq->free);
> > > > > > > > > 	p = &per_cpu(vfree_deferred, i);
> > > > > > > > > 	init_llist_head(&p->list);
> > > > > > > > > 	INIT_WORK(&p->wq, delayed_vfree_work);
> > > > > > > > > 	xa_init(&vbq->vmap_blocks);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > > > > > > can be non sequential :-/
> > > > > > > > >
> > > > > > > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > > > > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > > > > > > >
> > > > > > > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > > > > > > are different with my understanding.
> > > > > > > >
> > > > > > > > kernel/smp.c
> > > > > > > > void __init setup_nr_cpu_ids(void)
> > > > > > > > {
> > > > > > > > 	set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > > > > > > }
> > > > > > > >
> > > > > > > I see that it is not a weak function, so it is generic, thus the
> > > > > > > behavior can not be overwritten, which is great. This does what we
> > > > > > > need.
> > > > > > >
> > > > > > > Thank you for checking this, you are right!
> > > > > >
> > > > > > Thanks for confirming this.
> > > > > >
> > > > > > > Then it is just a matter of proper initialization of the hash:
> > > > > > >
> > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > index 5d3aa2dc88a8..1733946f7a12 100644
> > > > > > > --- a/mm/vmalloc.c
> > > > > > > +++ b/mm/vmalloc.c
> > > > > > > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> > > > > > >  	 */
> > > > > > >  	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> > > > > > >
> > > > > > > -	for_each_possible_cpu(i) {
> > > > > > > +	/*
> > > > > > > +	 * We use "nr_cpu_ids" here because some architectures
> > > > > > > +	 * may have "gaps" in cpu-possible-mask. It is OK for
> > > > > > > +	 * per-cpu approaches but is not OK for cases where it
> > > > > > > +	 * can be used as hashes also.
> > > > > > > +	 */
> > > > > > > +	for (i = 0; i < nr_cpu_ids; i++) {
> > > > > >
> > > > > > I was wrong in my earlier comments. Percpu variables are only available
> > > > > > on possible CPUs. For the nonexistent possible CPUs of the static percpu
> > > > > > variable vmap_block_queue, there isn't memory allocated and mapped for
> > > > > > them, so accessing them will cause problems.
> > > > > >
> > > > > > In Nick's case, there are only CPU0 and CPU2. If you access
> > > > > > &per_cpu(vmap_block_queue, 1), a problem occurs. So I think we may need
> > > > > > to take another way for vbq.
> > > > > > E.g.:
> > > > > > 1) Storing the vb in the nearest neighbouring vbq on a possible CPU,
> > > > > > as in the draft patch below;
> > > > > > 2) creating a normal array of size nr_cpu_ids to store vbq, so we can
> > > > > > store/fetch each vbq on a non-possible CPU?
> > > > > >
> > > > > A correct way, i think, is to create a normal array. A quick fix can be
> > > > > to stick to the next possible CPU.
> > > > >
> > > > > > Way 1) is simpler; the existing code can be adapted a little, just as
> > > > > > below.
> > > > > >
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index 633363997dec..59a8951cc6c0 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > >  static struct xarray *
> > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > >  {
> > > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > > +
> > > > > > +	if (!cpu_possible(index))
> > > > > > +		index = cpumask_next(index, cpu_possible_mask);
> > > > > >
> > > > > cpumask_next() can return nr_cpu_ids if no next bits set.
> > > >
> > > > It won't. nr_cpu_ids is the largest index + 1, so the hashed index is
> > > > in the range 0 <= index <= (nr_cpu_ids - 1). E.g. if cpu_possible_mask
> > > > is b10001111, nr_cpu_ids is 8 and the largest set bit is cpu7.
> > > > cpu_possible(index) will check that, so the largest value cpumask_next()
> > > > can return here is (nr_cpu_ids - 1).
> > > >
> > > /**
> > >  * cpumask_next - get the next cpu in a cpumask
> > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > >  * @srcp: the cpumask pointer
> > >  *
> > >  * Return: >= nr_cpu_ids if no further cpus set.
> > >  */
> >
> > Ah, I got what you mean. In the vbq case, it has no chance to get
> > nr_cpu_ids as a return value.
Because the hashed index limits the
> > range to [0, nr_cpu_ids-1], and cpu_possible(index) guarantees it
> > won't be the highest cpu number (nr_cpu_ids-1), since CPU[nr_cpu_ids-1]
> > must be a possible CPU.
> >
> > Do I miss some corner cases?
> >
> Right. We guarantee that the highest CPU is available by doing % nr_cpu_ids,
> so we do not need to use a *next_wrap() variant. You do not miss anything :)
>
> Hailong Liu has proposed a simpler version:
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 11fe5ea208aa..e1e63ffb9c57 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1994,8 +1994,9 @@ static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
>  	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +	int cpu = cpumask_nth(index, cpu_possible_mask);
>
> -	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> +	return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
>
> which maps the hash index to the index-th set CPU in cpu_possible_mask.
>
> The only thing that can be updated in the patch is to replace
> num_possible_cpus() with nr_cpu_ids.
>
> Any thoughts? I think we need to fix it with a minor change so it is
> easier to back-port to stable kernels.

Yeah, sounds good, since the regression commit was merged in v6.3. Please
feel free to post this and the hash array patch separately for formal
review.

By the way, while replying to this mail, I checked cpumask_nth() again.
I suspect it may take more checking than cpu_possible(), given that most
systems don't have gaps in cpu_possible_mask. I could be dizzy at this
moment.

static inline
unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
{
	return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
}