* [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-09-25 15:57 ` Bart Van Assche
2025-09-25 8:15 ` [PATCH 02/10] blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create() Yu Kuai
` (8 subsequent siblings)
9 siblings, 1 reply; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
It's safe to iterate blkgs with the cgroup lock or the RCU read lock held.
Stop nesting queue_lock under the RCU lock, and prepare to protect blkgs
with blkcg_mutex instead of queue_lock.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-cgroup-rwstat.c | 2 --
block/blk-cgroup.c | 16 +++++-----------
2 files changed, 5 insertions(+), 13 deletions(-)
diff --git a/block/blk-cgroup-rwstat.c b/block/blk-cgroup-rwstat.c
index a55fb0c53558..d41ade993312 100644
--- a/block/blk-cgroup-rwstat.c
+++ b/block/blk-cgroup-rwstat.c
@@ -101,8 +101,6 @@ void blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, struct blkcg_policy *pol,
struct cgroup_subsys_state *pos_css;
unsigned int i;
- lockdep_assert_held(&blkg->q->queue_lock);
-
memset(sum, 0, sizeof(*sum));
rcu_read_lock();
blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f93de34fe87d..2d767ae61d2f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -712,12 +712,9 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
u64 total = 0;
rcu_read_lock();
- hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
- spin_lock_irq(&blkg->q->queue_lock);
- if (blkcg_policy_enabled(blkg->q, pol))
+ hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node)
+ if (blkcg_policy_enabled(blkg->q, pol) && blkg->online)
total += prfill(sf, blkg->pd[pol->plid], data);
- spin_unlock_irq(&blkg->q->queue_lock);
- }
rcu_read_unlock();
if (show_total)
@@ -1242,13 +1239,10 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
else
css_rstat_flush(&blkcg->css);
- rcu_read_lock();
- hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
- spin_lock_irq(&blkg->q->queue_lock);
+ spin_lock_irq(&blkcg->lock);
+ hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node)
blkcg_print_one_stat(blkg, sf);
- spin_unlock_irq(&blkg->q->queue_lock);
- }
- rcu_read_unlock();
+ spin_unlock_irq(&blkcg->lock);
return 0;
}
--
2.39.2
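Condensed from the two hunks above, the iteration patterns after this patch
look roughly like the following (a sketch, not the complete functions):

	/* blkcg_print_blkgs(): readers only need RCU plus the blkg->online check */
	rcu_read_lock();
	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node)
		if (blkcg_policy_enabled(blkg->q, pol) && blkg->online)
			total += prfill(sf, blkg->pd[pol->plid], data);
	rcu_read_unlock();

	/* blkcg_print_stat(): the per-blkcg spinlock replaces the per-queue queue_lock */
	spin_lock_irq(&blkcg->lock);
	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node)
		blkcg_print_one_stat(blkg, sf);
	spin_unlock_irq(&blkcg->lock);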
* Re: [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs
2025-09-25 8:15 ` [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs Yu Kuai
@ 2025-09-25 15:57 ` Bart Van Assche
2025-09-25 17:07 ` Yu Kuai
0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2025-09-25 15:57 UTC (permalink / raw)
To: Yu Kuai, tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 9/25/25 1:15 AM, Yu Kuai wrote:
> It's safe to iterate blkgs with cgroup lock or rcu lock held, prevent
> nested queue_lock under rcu lock, and prepare to convert protecting
> blkcg with blkcg_mutex instead of queuelock.
Iterating blkgs without holding q->queue_lock is safe but accessing the
blkg members without holding that lock is not safe since q->queue_lock
is acquired by all code that modifies blkg members. Should perhaps a new
spinlock be introduced to serialize blkg modifications?
Thanks,
Bart.
* Re: [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs
2025-09-25 15:57 ` Bart Van Assche
@ 2025-09-25 17:07 ` Yu Kuai
2025-09-26 0:57 ` Yu Kuai
0 siblings, 1 reply; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 17:07 UTC (permalink / raw)
To: Bart Van Assche, Yu Kuai, tj, ming.lei, nilay, hch, josef, axboe,
akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
Hi,
在 2025/9/25 23:57, Bart Van Assche 写道:
> On 9/25/25 1:15 AM, Yu Kuai wrote:
>> It's safe to iterate blkgs with cgroup lock or rcu lock held, prevent
>> nested queue_lock under rcu lock, and prepare to convert protecting
>> blkcg with blkcg_mutex instead of queuelock.
>
> Iterating blkgs without holding q->queue_lock is safe but accessing the
> blkg members without holding that lock is not safe since q->queue_lock
> is acquired by all code that modifies blkg members. Should perhaps a new
> spinlock be introduced to serialize blkg modifications?
>
No need for a new lock; I think blkcg->lock can do that.
Thanks,
Kuai
> Thanks,
>
> Bart.
>
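For context, the list-modification side already runs under blkcg->lock;
condensed from blkg_create()/blkg_destroy() in the current tree (a sketch,
not the exact code):

	/* blkg_create(): link into blkcg->blkg_list under blkcg->lock */
	spin_lock(&blkcg->lock);
	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
	spin_unlock(&blkcg->lock);

	/* blkg_destroy(): unlink under blkcg->lock as well */
	lockdep_assert_held(&blkcg->lock);
	hlist_del_init_rcu(&blkg->blkcg_node);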
* Re: [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs
2025-09-25 17:07 ` Yu Kuai
@ 2025-09-26 0:57 ` Yu Kuai
2025-09-26 17:19 ` Bart Van Assche
0 siblings, 1 reply; 19+ messages in thread
From: Yu Kuai @ 2025-09-26 0:57 UTC (permalink / raw)
To: Yu Kuai, Bart Van Assche, Yu Kuai, tj, ming.lei, nilay, hch,
josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
Hi,
在 2025/09/26 1:07, Yu Kuai 写道:
> Hi,
>
> 在 2025/9/25 23:57, Bart Van Assche 写道:
>> On 9/25/25 1:15 AM, Yu Kuai wrote:
>>> It's safe to iterate blkgs with cgroup lock or rcu lock held, prevent
>>> nested queue_lock under rcu lock, and prepare to convert protecting
>>> blkcg with blkcg_mutex instead of queuelock.
>>
>> Iterating blkgs without holding q->queue_lock is safe but accessing the
>> blkg members without holding that lock is not safe since q->queue_lock
>> is acquired by all code that modifies blkg members. Should perhaps a new
>> spinlock be introduced to serialize blkg modifications?
Actually, only blkcg_print_blkgs() is using RCU in this patch, and taking
a look at the callers, I don't see any that have to hold queue_lock. Can
you explain in detail which field of blkg is problematic in this
patch?
Thanks,
Kuai
>>
> No need for a new lock, I think blkcg->lock can do that.
>
> Thanks,
> Kuai
>
>> Thanks,
>>
>> Bart.
>>
> .
>
* Re: [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs
2025-09-26 0:57 ` Yu Kuai
@ 2025-09-26 17:19 ` Bart Van Assche
2025-09-29 1:02 ` Yu Kuai
0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2025-09-26 17:19 UTC (permalink / raw)
To: Yu Kuai, Yu Kuai, tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
On 9/25/25 5:57 PM, Yu Kuai wrote:
> 在 2025/09/26 1:07, Yu Kuai 写道:
>> 在 2025/9/25 23:57, Bart Van Assche 写道:
>>> On 9/25/25 1:15 AM, Yu Kuai wrote:
>>>> It's safe to iterate blkgs with cgroup lock or rcu lock held, prevent
>>>> nested queue_lock under rcu lock, and prepare to convert protecting
>>>> blkcg with blkcg_mutex instead of queuelock.
>>>
>>> Iterating blkgs without holding q->queue_lock is safe but accessing the
>>> blkg members without holding that lock is not safe since q->queue_lock
>>> is acquired by all code that modifies blkg members. Should perhaps a new
>>> spinlock be introduced to serialize blkg modifications?
>
> Actually, only blkcg_print_blkgs() is using rcu in this patch, and take
> a look at the callers, I don't see anyone have to hold queue_lock. Can
> you explain in detail which field from blkg is problematic in this
> patch?
I'm not a cgroup expert so I cannot answer the above question. But I
think it's clear that the description of this patch is not sufficient
motivation for it. Removing the blkg->q->queue_lock lock and
unlock calls requires a detailed review of all blkcg_print_blkgs() and
blkcg_print_stat() callers, and there is no evidence in the patch
description that such a review has happened.
Thanks,
Bart.
* Re: [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs
2025-09-26 17:19 ` Bart Van Assche
@ 2025-09-29 1:02 ` Yu Kuai
0 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-29 1:02 UTC (permalink / raw)
To: Bart Van Assche, Yu Kuai, Yu Kuai, tj, ming.lei, nilay, hch,
josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
Hi,
在 2025/09/27 1:19, Bart Van Assche 写道:
> On 9/25/25 5:57 PM, Yu Kuai wrote:
>> 在 2025/09/26 1:07, Yu Kuai 写道:
>>> 在 2025/9/25 23:57, Bart Van Assche 写道:
>>>> On 9/25/25 1:15 AM, Yu Kuai wrote:
>>>>> It's safe to iterate blkgs with cgroup lock or rcu lock held, prevent
>>>>> nested queue_lock under rcu lock, and prepare to convert protecting
>>>>> blkcg with blkcg_mutex instead of queuelock.
>>>>
>>>> Iterating blkgs without holding q->queue_lock is safe but accessing the
>>>> blkg members without holding that lock is not safe since q->queue_lock
>>>> is acquired by all code that modifies blkg members. Should perhaps a
>>>> new
>>>> spinlock be introduced to serialize blkg modifications?
>>
>> Actually, only blkcg_print_blkgs() is using rcu in this patch, and take
>> a look at the callers, I don't see anyone have to hold queue_lock. Can
>> you explain in detail which field from blkg is problematic in this
>> patch?
>
> I'm not a cgroup expert so I cannot answer the above question. But I
> think it's clear that the description of this patch is not sufficient as
> motivation for this patch. Removing the blkg->q->queue_lock lock and
> unlock calls requires a detailed review of all blkcg_print_blkgs() and
> blkcg_print_stat() callers. There is no evidence available in the patch
> description that shows that such a review has happened.
>
OK, I'll explain in more detail in the commit message.
Thanks,
Kuai
> Thanks,
>
> Bart.
> .
>
* [PATCH 02/10] blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create()
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
2025-09-25 8:15 ` [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-09-25 8:15 ` [PATCH 03/10] blk-cgroup: don't nest queu_lock under rcu in bio_associate_blkg() Yu Kuai
` (7 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Change the lookup to two steps:
1) in the fast path, hold the RCU read lock and do blkg_lookup();
2) in the slow path, hold queue_lock directly and don't nest it under the
RCU lock.
Prepare to protect blkgs with blkcg_mutex instead of
queue_lock.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-cgroup.c | 57 +++++++++++++++++++++++++++++-----------------
1 file changed, 36 insertions(+), 21 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2d767ae61d2f..9ffc3195f7e4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -467,22 +467,17 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
{
struct request_queue *q = disk->queue;
struct blkcg_gq *blkg;
- unsigned long flags;
-
- WARN_ON_ONCE(!rcu_read_lock_held());
- blkg = blkg_lookup(blkcg, q);
- if (blkg)
- return blkg;
-
- spin_lock_irqsave(&q->queue_lock, flags);
+ rcu_read_lock();
blkg = blkg_lookup(blkcg, q);
if (blkg) {
if (blkcg != &blkcg_root &&
blkg != rcu_dereference(blkcg->blkg_hint))
rcu_assign_pointer(blkcg->blkg_hint, blkg);
- goto found;
+ rcu_read_unlock();
+ return blkg;
}
+ rcu_read_unlock();
/*
* Create blkgs walking down from blkcg_root to @blkcg, so that all
@@ -514,8 +509,6 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
break;
}
-found:
- spin_unlock_irqrestore(&q->queue_lock, flags);
return blkg;
}
@@ -2077,6 +2070,18 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
atomic64_add(delta, &blkg->delay_nsec);
}
+static inline struct blkcg_gq *blkg_lookup_tryget(struct blkcg_gq *blkg)
+{
+retry:
+ if (blkg_tryget(blkg))
+ return blkg;
+
+ blkg = blkg->parent;
+ if (blkg)
+ goto retry;
+
+ return NULL;
+}
/**
* blkg_tryget_closest - try and get a blkg ref on the closet blkg
* @bio: target bio
@@ -2089,20 +2094,30 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
struct cgroup_subsys_state *css)
{
- struct blkcg_gq *blkg, *ret_blkg = NULL;
+ struct request_queue *q = bio->bi_bdev->bd_queue;
+ struct blkcg *blkcg = css_to_blkcg(css);
+ struct blkcg_gq *blkg;
rcu_read_lock();
- blkg = blkg_lookup_create(css_to_blkcg(css), bio->bi_bdev->bd_disk);
- while (blkg) {
- if (blkg_tryget(blkg)) {
- ret_blkg = blkg;
- break;
- }
- blkg = blkg->parent;
- }
+ blkg = blkg_lookup(blkcg, q);
+ if (likely(blkg))
+ blkg = blkg_lookup_tryget(blkg);
rcu_read_unlock();
- return ret_blkg;
+ if (blkg)
+ return blkg;
+
+ /*
+ * Fast path failed, we're probably issuing IO in this cgroup the first
+ * time, hold lock to create new blkg.
+ */
+ spin_lock_irq(&q->queue_lock);
+ blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
+ if (blkg)
+ blkg = blkg_lookup_tryget(blkg);
+ spin_unlock_irq(&q->queue_lock);
+
+ return blkg;
}
/**
--
2.39.2
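Condensed, the resulting flow in blkg_tryget_closest() is a lockless lookup
first and a locked create only on a miss (a sketch of the hunk above, not
the complete function):

	rcu_read_lock();
	blkg = blkg_lookup(blkcg, q);		/* fast path: no queue_lock */
	if (likely(blkg))
		blkg = blkg_lookup_tryget(blkg);
	rcu_read_unlock();

	if (blkg)
		return blkg;

	/* slow path: likely the first IO in this cgroup, create the blkg */
	spin_lock_irq(&q->queue_lock);
	blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
	if (blkg)
		blkg = blkg_lookup_tryget(blkg);
	spin_unlock_irq(&q->queue_lock);
	return blkg;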
* [PATCH 03/10] blk-cgroup: don't nest queu_lock under rcu in bio_associate_blkg()
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
2025-09-25 8:15 ` [PATCH 01/10] blk-cgroup: use cgroup lock and rcu to protect iterating blkcg blkgs Yu Kuai
2025-09-25 8:15 ` [PATCH 02/10] blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create() Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-09-25 8:15 ` [PATCH 04/10] blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs() Yu Kuai
` (6 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
If the bio is already associated with a blkg, the blkcg is pinned until
the bio is done, so no RCU protection is needed; otherwise protect
blkcg_css() with RCU independently. Prepare to protect blkgs with
blkcg_mutex instead of queue_lock.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-cgroup.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9ffc3195f7e4..53a64bfe4a24 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -2165,16 +2165,20 @@ void bio_associate_blkg(struct bio *bio)
if (blk_op_is_passthrough(bio->bi_opf))
return;
- rcu_read_lock();
-
- if (bio->bi_blkg)
+ if (bio->bi_blkg) {
css = bio_blkcg_css(bio);
- else
+ bio_associate_blkg_from_css(bio, css);
+ } else {
+ rcu_read_lock();
css = blkcg_css();
+ if (!css_tryget_online(css))
+ css = NULL;
+ rcu_read_unlock();
- bio_associate_blkg_from_css(bio, css);
-
- rcu_read_unlock();
+ bio_associate_blkg_from_css(bio, css);
+ if (css)
+ css_put(css);
+ }
}
EXPORT_SYMBOL_GPL(bio_associate_blkg);
--
2.39.2
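The pattern the else-branch relies on is taking a css reference under RCU so
the css stays valid after rcu_read_unlock(); a commented sketch of the hunk
above (bio_associate_blkg_from_css() falls back to the root blkg when css is
NULL):

	rcu_read_lock();
	css = blkcg_css();			/* current task's io-controller css */
	if (!css_tryget_online(css))		/* pin it past rcu_read_unlock() */
		css = NULL;
	rcu_read_unlock();

	bio_associate_blkg_from_css(bio, css);
	if (css)
		css_put(css);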
* [PATCH 04/10] blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs()
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
` (2 preceding siblings ...)
2025-09-25 8:15 ` [PATCH 03/10] blk-cgroup: don't nest queu_lock under rcu in bio_associate_blkg() Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-09-25 8:15 ` [PATCH 05/10] mm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page() Yu Kuai
` (5 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
The correct lock order is q->queue_lock before blkcg->lock. To prevent
deadlock in blkcg_destroy_blkgs(), trylock is currently used for
q->queue_lock while blkcg->lock is already held, which is hacky.
Hence refactor blkcg_destroy_blkgs(): hold blkcg->lock to grab a reference
to the first blkg and release the lock, then take q->queue_lock and
blkcg->lock in the correct order to destroy the blkg. This is a super cold
path, so it's fine to grab and release the locks repeatedly.
Also prepare to protect blkgs with blkcg_mutex instead of
queue_lock.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-cgroup.c | 45 ++++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 19 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 53a64bfe4a24..795efb5ccb5e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1283,6 +1283,21 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
* This finally frees the blkcg.
*/
+static struct blkcg_gq *blkcg_get_first_blkg(struct blkcg *blkcg)
+{
+ struct blkcg_gq *blkg = NULL;
+
+ spin_lock_irq(&blkcg->lock);
+ if (!hlist_empty(&blkcg->blkg_list)) {
+ blkg = hlist_entry(blkcg->blkg_list.first, struct blkcg_gq,
+ blkcg_node);
+ blkg_get(blkg);
+ }
+ spin_unlock_irq(&blkcg->lock);
+
+ return blkg;
+}
+
/**
* blkcg_destroy_blkgs - responsible for shooting down blkgs
* @blkcg: blkcg of interest
@@ -1296,32 +1311,24 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
*/
static void blkcg_destroy_blkgs(struct blkcg *blkcg)
{
- might_sleep();
+ struct blkcg_gq *blkg;
- spin_lock_irq(&blkcg->lock);
+ might_sleep();
- while (!hlist_empty(&blkcg->blkg_list)) {
- struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
- struct blkcg_gq, blkcg_node);
+ while ((blkg = blkcg_get_first_blkg(blkcg))) {
struct request_queue *q = blkg->q;
- if (need_resched() || !spin_trylock(&q->queue_lock)) {
- /*
- * Given that the system can accumulate a huge number
- * of blkgs in pathological cases, check to see if we
- * need to rescheduling to avoid softlockup.
- */
- spin_unlock_irq(&blkcg->lock);
- cond_resched();
- spin_lock_irq(&blkcg->lock);
- continue;
- }
+ spin_lock_irq(&q->queue_lock);
+ spin_lock(&blkcg->lock);
blkg_destroy(blkg);
- spin_unlock(&q->queue_lock);
- }
- spin_unlock_irq(&blkcg->lock);
+ spin_unlock(&blkcg->lock);
+ spin_unlock_irq(&q->queue_lock);
+
+ blkg_put(blkg);
+ cond_resched();
+ }
}
/**
--
2.39.2
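The refactored loop then takes the two locks in the documented order on each
iteration; a commented sketch of the result (not the full function):

	while ((blkg = blkcg_get_first_blkg(blkcg))) {	/* takes a blkg reference */
		struct request_queue *q = blkg->q;

		spin_lock_irq(&q->queue_lock);		/* outer lock first ... */
		spin_lock(&blkcg->lock);		/* ... then blkcg->lock */
		blkg_destroy(blkg);
		spin_unlock(&blkcg->lock);
		spin_unlock_irq(&q->queue_lock);

		blkg_put(blkg);		/* drop the reference taken above */
		cond_resched();		/* cold path, avoid softlockups */
	}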
* [PATCH 05/10] mm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page()
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
` (3 preceding siblings ...)
2025-09-25 8:15 ` [PATCH 04/10] blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs() Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-09-25 8:15 ` [PATCH 06/10] block, bfq: don't grab queue_lock to initialize bfq Yu Kuai
` (4 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Prepare to protect blkgs with blkcg_mutex instead of
queue_lock.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
mm/page_io.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index a2056a5ecb13..4f4cc9370573 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -313,8 +313,13 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
rcu_read_lock();
css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
- bio_associate_blkg_from_css(bio, css);
+ if (!css || !css_tryget_online(css))
+ css = NULL;
rcu_read_unlock();
+
+ bio_associate_blkg_from_css(bio, css);
+ if (css)
+ css_put(css);
}
#else
#define bio_associate_blkg_from_page(bio, folio) do { } while (0)
--
2.39.2
* [PATCH 06/10] block, bfq: don't grab queue_lock to initialize bfq
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
` (4 preceding siblings ...)
2025-09-25 8:15 ` [PATCH 05/10] mm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page() Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-09-25 8:15 ` [PATCH 07/10] blk-cgroup: convert to protect blkgs with blkcg_mutex Yu Kuai
` (3 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
The request_queue is frozen and quiesced during the elevator init_sched()
method, so there is no point in holding queue_lock for protection.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/bfq-iosched.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 4a8d3d96bfe4..a77b98187370 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7213,10 +7213,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
return -ENOMEM;
eq->elevator_data = bfqd;
-
- spin_lock_irq(&q->queue_lock);
q->elevator = eq;
- spin_unlock_irq(&q->queue_lock);
/*
* Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
@@ -7249,7 +7246,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
* If the disk supports multiple actuators, copy independent
* access ranges from the request queue structure.
*/
- spin_lock_irq(&q->queue_lock);
if (ia_ranges) {
/*
* Check if the disk ia_ranges size exceeds the current bfq
@@ -7275,7 +7271,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
bfqd->sector[0] = 0;
bfqd->nr_sectors[0] = get_capacity(q->disk);
}
- spin_unlock_irq(&q->queue_lock);
INIT_LIST_HEAD(&bfqd->dispatch);
--
2.39.2
* [PATCH 07/10] blk-cgroup: convert to protect blkgs with blkcg_mutex
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
` (5 preceding siblings ...)
2025-09-25 8:15 ` [PATCH 06/10] block, bfq: don't grab queue_lock to initialize bfq Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-09-25 8:15 ` [PATCH 08/10] blk-cgroup: remove radix_tree_preload() Yu Kuai
` (2 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
With the previous modifications, queue_lock is no longer grabbed under
other spinlocks or RCU to protect blkgs, so it's OK to convert to
blkcg_mutex directly.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/bfq-cgroup.c | 6 +--
block/bfq-iosched.c | 8 ++--
block/blk-cgroup.c | 104 ++++++++++++++------------------------------
block/blk-cgroup.h | 6 +--
4 files changed, 42 insertions(+), 82 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 9fb9f3533150..8af471d565d9 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -405,7 +405,7 @@ static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
parent = bfqg_parent(bfqg);
- lockdep_assert_held(&bfqg_to_blkg(bfqg)->q->queue_lock);
+ lockdep_assert_held(&bfqg_to_blkg(bfqg)->q->blkcg_mutex);
if (unlikely(!parent))
return;
@@ -866,7 +866,7 @@ static void bfq_reparent_active_queues(struct bfq_data *bfqd,
* and reparent its children entities.
* @pd: descriptor of the policy going offline.
*
- * blkio already grabs the queue_lock for us, so no need to use
+ * blkio already grabs the blkcg_mutex for us, so no need to use
* RCU-based magic
*/
static void bfq_pd_offline(struct blkg_policy_data *pd)
@@ -1139,7 +1139,7 @@ static u64 bfqg_prfill_stat_recursive(struct seq_file *sf,
struct cgroup_subsys_state *pos_css;
u64 sum = 0;
- lockdep_assert_held(&blkg->q->queue_lock);
+ lockdep_assert_held(&blkg->q->blkcg_mutex);
rcu_read_lock();
blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a77b98187370..4ffbe4383dd2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5266,7 +5266,7 @@ static void bfq_update_dispatch_stats(struct request_queue *q,
* In addition, the following queue lock guarantees that
* bfqq_group(bfqq) exists as well.
*/
- spin_lock_irq(&q->queue_lock);
+ mutex_lock(&q->blkcg_mutex);
if (idle_timer_disabled)
/*
* Since the idle timer has been disabled,
@@ -5285,7 +5285,7 @@ static void bfq_update_dispatch_stats(struct request_queue *q,
bfqg_stats_set_start_empty_time(bfqg);
bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
}
- spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
}
#else
static inline void bfq_update_dispatch_stats(struct request_queue *q,
@@ -6218,11 +6218,11 @@ static void bfq_update_insert_stats(struct request_queue *q,
* In addition, the following queue lock guarantees that
* bfqq_group(bfqq) exists as well.
*/
- spin_lock_irq(&q->queue_lock);
+ mutex_lock(&q->blkcg_mutex);
bfqg_stats_update_io_add(bfqq_group(bfqq), bfqq, cmd_flags);
if (idle_timer_disabled)
bfqg_stats_update_idle_time(bfqq_group(bfqq));
- spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
}
#else
static inline void bfq_update_insert_stats(struct request_queue *q,
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 795efb5ccb5e..280f29a713b6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -130,9 +130,7 @@ static void blkg_free_workfn(struct work_struct *work)
blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
if (blkg->parent)
blkg_put(blkg->parent);
- spin_lock_irq(&q->queue_lock);
list_del_init(&blkg->q_node);
- spin_unlock_irq(&q->queue_lock);
mutex_unlock(&q->blkcg_mutex);
blk_put_queue(q);
@@ -372,7 +370,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
struct blkcg_gq *blkg;
int i, ret;
- lockdep_assert_held(&disk->queue->queue_lock);
+ lockdep_assert_held(&disk->queue->blkcg_mutex);
/* request_queue is dying, do not create/recreate a blkg */
if (blk_queue_dying(disk->queue)) {
@@ -457,7 +455,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
* Lookup blkg for the @blkcg - @disk pair. If it doesn't exist, try to
* create one. blkg creation is performed recursively from blkcg_root such
* that all non-root blkg's have access to the parent blkg. This function
- * should be called under RCU read lock and takes @disk->queue->queue_lock.
+ * should be called under RCU read lock and takes @disk->queue->blkcg_mutex.
*
* Returns the blkg or the closest blkg if blkg_create() fails as it walks
* down from root.
@@ -517,7 +515,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
struct blkcg *blkcg = blkg->blkcg;
int i;
- lockdep_assert_held(&blkg->q->queue_lock);
+ lockdep_assert_held(&blkg->q->blkcg_mutex);
lockdep_assert_held(&blkcg->lock);
/*
@@ -546,8 +544,8 @@ static void blkg_destroy(struct blkcg_gq *blkg)
/*
* Both setting lookup hint to and clearing it from @blkg are done
- * under queue_lock. If it's not pointing to @blkg now, it never
- * will. Hint assignment itself can race safely.
+ * under q->blkcg_mutex and blkcg->lock. If it's not pointing to @blkg
+ * now, it never will. Hint assignment itself can race safely.
*/
if (rcu_access_pointer(blkcg->blkg_hint) == blkg)
rcu_assign_pointer(blkcg->blkg_hint, NULL);
@@ -567,25 +565,20 @@ static void blkg_destroy_all(struct gendisk *disk)
int i;
restart:
- spin_lock_irq(&q->queue_lock);
+ mutex_lock(&q->blkcg_mutex);
list_for_each_entry(blkg, &q->blkg_list, q_node) {
struct blkcg *blkcg = blkg->blkcg;
if (hlist_unhashed(&blkg->blkcg_node))
continue;
- spin_lock(&blkcg->lock);
+ spin_lock_irq(&blkcg->lock);
blkg_destroy(blkg);
- spin_unlock(&blkcg->lock);
+ spin_unlock_irq(&blkcg->lock);
- /*
- * in order to avoid holding the spin lock for too long, release
- * it when a batch of blkgs are destroyed.
- */
if (!(--count)) {
count = BLKG_DESTROY_BATCH_SIZE;
- spin_unlock_irq(&q->queue_lock);
- cond_resched();
+ mutex_unlock(&q->blkcg_mutex);
goto restart;
}
}
@@ -603,7 +596,7 @@ static void blkg_destroy_all(struct gendisk *disk)
}
q->root_blkg = NULL;
- spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
}
static void blkg_iostat_set(struct blkg_iostat *dst, struct blkg_iostat *src)
@@ -853,7 +846,7 @@ unsigned long __must_check blkg_conf_open_bdev_frozen(struct blkg_conf_ctx *ctx)
*/
int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
struct blkg_conf_ctx *ctx)
- __acquires(&bdev->bd_queue->queue_lock)
+ __acquires(&bdev->bd_queue->blkcg_mutex)
{
struct gendisk *disk;
struct request_queue *q;
@@ -869,7 +862,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
/* Prevent concurrent with blkcg_deactivate_policy() */
mutex_lock(&q->blkcg_mutex);
- spin_lock_irq(&q->queue_lock);
if (!blkcg_policy_enabled(q, pol)) {
ret = -EOPNOTSUPP;
@@ -895,23 +887,18 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
parent = blkcg_parent(parent);
}
- /* Drop locks to do new blkg allocation with GFP_KERNEL. */
- spin_unlock_irq(&q->queue_lock);
-
new_blkg = blkg_alloc(pos, disk, GFP_NOIO);
if (unlikely(!new_blkg)) {
ret = -ENOMEM;
- goto fail_exit;
+ goto fail_unlock;
}
if (radix_tree_preload(GFP_KERNEL)) {
blkg_free(new_blkg);
ret = -ENOMEM;
- goto fail_exit;
+ goto fail_unlock;
}
- spin_lock_irq(&q->queue_lock);
-
if (!blkcg_policy_enabled(q, pol)) {
blkg_free(new_blkg);
ret = -EOPNOTSUPP;
@@ -935,15 +922,12 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
goto success;
}
success:
- mutex_unlock(&q->blkcg_mutex);
ctx->blkg = blkg;
return 0;
fail_preloaded:
radix_tree_preload_end();
fail_unlock:
- spin_unlock_irq(&q->queue_lock);
-fail_exit:
mutex_unlock(&q->blkcg_mutex);
/*
* If queue was bypassing, we should retry. Do so after a
@@ -967,11 +951,11 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
* blkg_conf_ctx's initialized with blkg_conf_init().
*/
void blkg_conf_exit(struct blkg_conf_ctx *ctx)
- __releases(&ctx->bdev->bd_queue->queue_lock)
+ __releases(&ctx->bdev->bd_queue->blkcg_mutex)
__releases(&ctx->bdev->bd_queue->rq_qos_mutex)
{
if (ctx->blkg) {
- spin_unlock_irq(&bdev_get_queue(ctx->bdev)->queue_lock);
+ mutex_unlock(&bdev_get_queue(ctx->bdev)->blkcg_mutex);
ctx->blkg = NULL;
}
@@ -1318,13 +1302,13 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
while ((blkg = blkcg_get_first_blkg(blkcg))) {
struct request_queue *q = blkg->q;
- spin_lock_irq(&q->queue_lock);
- spin_lock(&blkcg->lock);
+ mutex_lock(&q->blkcg_mutex);
+ spin_lock_irq(&blkcg->lock);
blkg_destroy(blkg);
- spin_unlock(&blkcg->lock);
- spin_unlock_irq(&q->queue_lock);
+ spin_unlock_irq(&blkcg->lock);
+ mutex_unlock(&q->blkcg_mutex);
blkg_put(blkg);
cond_resched();
@@ -1501,24 +1485,23 @@ int blkcg_init_disk(struct gendisk *disk)
if (!new_blkg)
return -ENOMEM;
- preloaded = !radix_tree_preload(GFP_KERNEL);
+ mutex_lock(&q->blkcg_mutex);
+ preloaded = !radix_tree_preload(GFP_NOIO);
/* Make sure the root blkg exists. */
- /* spin_lock_irq can serve as RCU read-side critical section. */
- spin_lock_irq(&q->queue_lock);
blkg = blkg_create(&blkcg_root, disk, new_blkg);
if (IS_ERR(blkg))
goto err_unlock;
q->root_blkg = blkg;
- spin_unlock_irq(&q->queue_lock);
if (preloaded)
radix_tree_preload_end();
+ mutex_unlock(&q->blkcg_mutex);
return 0;
err_unlock:
- spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
if (preloaded)
radix_tree_preload_end();
return PTR_ERR(blkg);
@@ -1595,8 +1578,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
if (queue_is_mq(q))
memflags = blk_mq_freeze_queue(q);
-retry:
- spin_lock_irq(&q->queue_lock);
+ mutex_lock(&q->blkcg_mutex);
/* blkg_list is pushed at the head, reverse walk to initialize parents first */
list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
@@ -1605,36 +1587,17 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
if (blkg->pd[pol->plid])
continue;
- /* If prealloc matches, use it; otherwise try GFP_NOWAIT */
+ /* If prealloc matches, use it */
if (blkg == pinned_blkg) {
pd = pd_prealloc;
pd_prealloc = NULL;
} else {
pd = pol->pd_alloc_fn(disk, blkg->blkcg,
- GFP_NOWAIT);
+ GFP_NOIO);
}
- if (!pd) {
- /*
- * GFP_NOWAIT failed. Free the existing one and
- * prealloc for @blkg w/ GFP_KERNEL.
- */
- if (pinned_blkg)
- blkg_put(pinned_blkg);
- blkg_get(blkg);
- pinned_blkg = blkg;
-
- spin_unlock_irq(&q->queue_lock);
-
- if (pd_prealloc)
- pol->pd_free_fn(pd_prealloc);
- pd_prealloc = pol->pd_alloc_fn(disk, blkg->blkcg,
- GFP_KERNEL);
- if (pd_prealloc)
- goto retry;
- else
- goto enomem;
- }
+ if (!pd)
+ goto enomem;
spin_lock(&blkg->blkcg->lock);
@@ -1655,8 +1618,8 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
__set_bit(pol->plid, q->blkcg_pols);
ret = 0;
- spin_unlock_irq(&q->queue_lock);
out:
+ mutex_unlock(&q->blkcg_mutex);
if (queue_is_mq(q))
blk_mq_unfreeze_queue(q, memflags);
if (pinned_blkg)
@@ -1667,7 +1630,6 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
enomem:
/* alloc failed, take down everything */
- spin_lock_irq(&q->queue_lock);
list_for_each_entry(blkg, &q->blkg_list, q_node) {
struct blkcg *blkcg = blkg->blkcg;
struct blkg_policy_data *pd;
@@ -1683,7 +1645,7 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
}
spin_unlock(&blkcg->lock);
}
- spin_unlock_irq(&q->queue_lock);
+
ret = -ENOMEM;
goto out;
}
@@ -1711,7 +1673,6 @@ void blkcg_deactivate_policy(struct gendisk *disk,
memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->blkcg_mutex);
- spin_lock_irq(&q->queue_lock);
__clear_bit(pol->plid, q->blkcg_pols);
@@ -1728,7 +1689,6 @@ void blkcg_deactivate_policy(struct gendisk *disk,
spin_unlock(&blkcg->lock);
}
- spin_unlock_irq(&q->queue_lock);
mutex_unlock(&q->blkcg_mutex);
if (queue_is_mq(q))
@@ -2118,11 +2078,11 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct bio *bio,
* Fast path failed, we're probably issuing IO in this cgroup the first
* time, hold lock to create new blkg.
*/
- spin_lock_irq(&q->queue_lock);
+ mutex_lock(&q->blkcg_mutex);
blkg = blkg_lookup_create(blkcg, bio->bi_bdev->bd_disk);
if (blkg)
blkg = blkg_lookup_tryget(blkg);
- spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
return blkg;
}
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 1cce3294634d..60c1da02f437 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -261,7 +261,7 @@ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
return q->root_blkg;
blkg = rcu_dereference_check(blkcg->blkg_hint,
- lockdep_is_held(&q->queue_lock));
+ lockdep_is_held(&q->blkcg_mutex));
if (blkg && blkg->q == q)
return blkg;
@@ -345,8 +345,8 @@ static inline void blkg_put(struct blkcg_gq *blkg)
* @p_blkg: target blkg to walk descendants of
*
* Walk @c_blkg through the descendants of @p_blkg. Must be used with RCU
- * read locked. If called under either blkcg or queue lock, the iteration
- * is guaranteed to include all and only online blkgs. The caller may
+ * read locked. If called under either blkcg->lock or q->blkcg_mutex, the
+ * iteration is guaranteed to include all and only online blkgs. The caller may
* update @pos_css by calling css_rightmost_descendant() to skip subtree.
* @p_blkg is included in the iteration and the first node to be visited.
*/
--
2.39.2
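An informal summary of blkg protection after this patch, derived from the
lockdep assertions and comments touched above:

	/*
	 * q->blkcg_mutex - sleeping lock; serialises blkg creation/destruction
	 *                  and policy (de)activation on a queue, replacing the
	 *                  queue_lock usage removed above.
	 * blkcg->lock    - spinlock; protects blkcg->blkg_list and the hint
	 *                  clearing in blkg_destroy().
	 * RCU            - read-side walks of the blkg lists need neither lock.
	 *
	 * Lock order: q->blkcg_mutex, then blkcg->lock.
	 */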
* [PATCH 08/10] blk-cgroup: remove radix_tree_preload()
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
` (6 preceding siblings ...)
2025-09-25 8:15 ` [PATCH 07/10] blk-cgroup: convert to protect blkgs with blkcg_mutex Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-10-03 7:37 ` Christoph Hellwig
2025-09-25 8:15 ` [PATCH 09/10] blk-cgroup: remove preallocate blkg for blkg_create() Yu Kuai
2025-09-25 8:15 ` [PATCH 10/10] blk-throttle: fix possible deadlock due to queue_lock in timer Yu Kuai
9 siblings, 1 reply; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Now that blkcg_mutex is used to protect blkgs, memory allocation no
longer needs to be non-blocking, so radix_tree_preload() is no longer needed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-cgroup.c | 20 ++------------------
1 file changed, 2 insertions(+), 18 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 280f29a713b6..bfc74cfebd3e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -893,16 +893,10 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
goto fail_unlock;
}
- if (radix_tree_preload(GFP_KERNEL)) {
- blkg_free(new_blkg);
- ret = -ENOMEM;
- goto fail_unlock;
- }
-
if (!blkcg_policy_enabled(q, pol)) {
blkg_free(new_blkg);
ret = -EOPNOTSUPP;
- goto fail_preloaded;
+ goto fail_unlock;
}
blkg = blkg_lookup(pos, q);
@@ -912,12 +906,10 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
blkg = blkg_create(pos, disk, new_blkg);
if (IS_ERR(blkg)) {
ret = PTR_ERR(blkg);
- goto fail_preloaded;
+ goto fail_unlock;
}
}
- radix_tree_preload_end();
-
if (pos == blkcg)
goto success;
}
@@ -925,8 +917,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
ctx->blkg = blkg;
return 0;
-fail_preloaded:
- radix_tree_preload_end();
fail_unlock:
mutex_unlock(&q->blkcg_mutex);
/*
@@ -1479,14 +1469,12 @@ int blkcg_init_disk(struct gendisk *disk)
{
struct request_queue *q = disk->queue;
struct blkcg_gq *new_blkg, *blkg;
- bool preloaded;
new_blkg = blkg_alloc(&blkcg_root, disk, GFP_KERNEL);
if (!new_blkg)
return -ENOMEM;
mutex_lock(&q->blkcg_mutex);
- preloaded = !radix_tree_preload(GFP_NOIO);
/* Make sure the root blkg exists. */
blkg = blkg_create(&blkcg_root, disk, new_blkg);
@@ -1494,16 +1482,12 @@ int blkcg_init_disk(struct gendisk *disk)
goto err_unlock;
q->root_blkg = blkg;
- if (preloaded)
- radix_tree_preload_end();
mutex_unlock(&q->blkcg_mutex);
return 0;
err_unlock:
mutex_unlock(&q->blkcg_mutex);
- if (preloaded)
- radix_tree_preload_end();
return PTR_ERR(blkg);
}
--
2.39.2
* Re: [PATCH 08/10] blk-cgroup: remove radix_tree_preload()
2025-09-25 8:15 ` [PATCH 08/10] blk-cgroup: remove radix_tree_preload() Yu Kuai
@ 2025-10-03 7:37 ` Christoph Hellwig
2025-10-06 1:55 ` Yu Kuai
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-10-03 7:37 UTC (permalink / raw)
To: Yu Kuai
Cc: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal, cgroups,
linux-block, linux-kernel, linux-mm, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On Thu, Sep 25, 2025 at 04:15:23PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Now that blkcg_mutex is used to protect blkgs, memory allocation no
> longer need to be non-blocking, this is not needed.
Btw, this might also be a good time to convert from the old radix tree
to an xarray.
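Roughly, such a conversion would replace blkcg->blkg_tree and the
radix_tree_insert()/radix_tree_delete() calls with an xarray indexed by
queue id. A purely illustrative sketch (the blkg_array name and the error
handling are assumptions, not part of this series):

	/* hypothetical: struct blkcg gains "struct xarray blkg_array;" */
	xa_init(&blkcg->blkg_array);

	/* gfp must match the calling context (blkcg->lock is still a spinlock) */
	ret = xa_insert(&blkcg->blkg_array, disk->queue->id, blkg, GFP_NOWAIT);
	if (ret)	/* -EBUSY if already present, -ENOMEM on allocation failure */
		goto err;

	/* lookup and removal */
	blkg = xa_load(&blkcg->blkg_array, q->id);
	xa_erase(&blkcg->blkg_array, q->id);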
* Re: [PATCH 08/10] blk-cgroup: remove radix_tree_preload()
2025-10-03 7:37 ` Christoph Hellwig
@ 2025-10-06 1:55 ` Yu Kuai
2025-10-06 6:48 ` Christoph Hellwig
0 siblings, 1 reply; 19+ messages in thread
From: Yu Kuai @ 2025-10-06 1:55 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: tj, ming.lei, nilay, josef, axboe, akpm, vgoyal, cgroups,
linux-block, linux-kernel, linux-mm, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
Hi,
在 2025/10/3 15:37, Christoph Hellwig 写道:
> On Thu, Sep 25, 2025 at 04:15:23PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Now that blkcg_mutex is used to protect blkgs, memory allocation no
>> longer need to be non-blocking, this is not needed.
> Btw, this might also be a good time to convert from the old radix tree
> to an xarray.
>
Sure, I can do that, perhaps after this set and the blkg_conf_prep() API cleanup.
Thanks,
Kuai
* Re: [PATCH 08/10] blk-cgroup: remove radix_tree_preload()
2025-10-06 1:55 ` Yu Kuai
@ 2025-10-06 6:48 ` Christoph Hellwig
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2025-10-06 6:48 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, Yu Kuai, tj, ming.lei, nilay, josef, axboe,
akpm, vgoyal, cgroups, linux-block, linux-kernel, linux-mm,
yukuai3, yi.zhang, yangerkun, johnny.chenyi
On Mon, Oct 06, 2025 at 09:55:39AM +0800, Yu Kuai wrote:
> Sure, I can do that, perhaps after this set and cleanup blkg_conf_prep apis.
Yes, that sounds sensible.
* [PATCH 09/10] blk-cgroup: remove preallocate blkg for blkg_create()
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
` (7 preceding siblings ...)
2025-09-25 8:15 ` [PATCH 08/10] blk-cgroup: remove radix_tree_preload() Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
2025-09-25 8:15 ` [PATCH 10/10] blk-throttle: fix possible deadlock due to queue_lock in timer Yu Kuai
9 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Now that blkg_create() is protected by blkcg_mutex, there is no need to
preallocate the blkg; remove the related code.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-cgroup.c | 91 +++++++++++++++++-----------------------------
1 file changed, 33 insertions(+), 58 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bfc74cfebd3e..60b8c742f876 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -364,10 +364,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
* If @new_blkg is %NULL, this function tries to allocate a new one as
* necessary using %GFP_NOWAIT. @new_blkg is always consumed on return.
*/
-static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
- struct blkcg_gq *new_blkg)
+static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk)
{
- struct blkcg_gq *blkg;
+ struct blkcg_gq *blkg = NULL;
int i, ret;
lockdep_assert_held(&disk->queue->blkcg_mutex);
@@ -384,15 +383,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
goto err_free_blkg;
}
- /* allocate */
- if (!new_blkg) {
- new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
- if (unlikely(!new_blkg)) {
- ret = -ENOMEM;
- goto err_put_css;
- }
+ blkg = blkg_alloc(blkcg, disk, GFP_NOIO);
+ if (unlikely(!blkg)) {
+ ret = -ENOMEM;
+ goto err_put_css;
}
- blkg = new_blkg;
/* link parent */
if (blkcg_parent(blkcg)) {
@@ -415,35 +410,34 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
/* insert */
spin_lock(&blkcg->lock);
ret = radix_tree_insert(&blkcg->blkg_tree, disk->queue->id, blkg);
- if (likely(!ret)) {
- hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
- list_add(&blkg->q_node, &disk->queue->blkg_list);
+ if (unlikely(ret)) {
+ spin_unlock(&blkcg->lock);
+ blkg_put(blkg);
+ return ERR_PTR(ret);
+ }
- for (i = 0; i < BLKCG_MAX_POLS; i++) {
- struct blkcg_policy *pol = blkcg_policy[i];
+ hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
+ list_add(&blkg->q_node, &disk->queue->blkg_list);
- if (blkg->pd[i]) {
- if (pol->pd_online_fn)
- pol->pd_online_fn(blkg->pd[i]);
- blkg->pd[i]->online = true;
- }
+ for (i = 0; i < BLKCG_MAX_POLS; i++) {
+ struct blkcg_policy *pol = blkcg_policy[i];
+
+ if (blkg->pd[i]) {
+ if (pol->pd_online_fn)
+ pol->pd_online_fn(blkg->pd[i]);
+ blkg->pd[i]->online = true;
}
}
+
blkg->online = true;
spin_unlock(&blkcg->lock);
-
- if (!ret)
- return blkg;
-
- /* @blkg failed fully initialized, use the usual release path */
- blkg_put(blkg);
- return ERR_PTR(ret);
+ return blkg;
err_put_css:
css_put(&blkcg->css);
err_free_blkg:
- if (new_blkg)
- blkg_free(new_blkg);
+ if (blkg)
+ blkg_free(blkg);
return ERR_PTR(ret);
}
@@ -498,7 +492,7 @@ static struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
parent = blkcg_parent(parent);
}
- blkg = blkg_create(pos, disk, NULL);
+ blkg = blkg_create(pos, disk);
if (IS_ERR(blkg)) {
blkg = ret_blkg;
break;
@@ -879,7 +873,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
while (true) {
struct blkcg *pos = blkcg;
struct blkcg *parent;
- struct blkcg_gq *new_blkg;
parent = blkcg_parent(blkcg);
while (parent && !blkg_lookup(parent, q)) {
@@ -887,23 +880,14 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
parent = blkcg_parent(parent);
}
- new_blkg = blkg_alloc(pos, disk, GFP_NOIO);
- if (unlikely(!new_blkg)) {
- ret = -ENOMEM;
- goto fail_unlock;
- }
-
if (!blkcg_policy_enabled(q, pol)) {
- blkg_free(new_blkg);
ret = -EOPNOTSUPP;
goto fail_unlock;
}
blkg = blkg_lookup(pos, q);
- if (blkg) {
- blkg_free(new_blkg);
- } else {
- blkg = blkg_create(pos, disk, new_blkg);
+ if (!blkg) {
+ blkg = blkg_create(pos, disk);
if (IS_ERR(blkg)) {
ret = PTR_ERR(blkg);
goto fail_unlock;
@@ -1468,27 +1452,18 @@ void blkg_init_queue(struct request_queue *q)
int blkcg_init_disk(struct gendisk *disk)
{
struct request_queue *q = disk->queue;
- struct blkcg_gq *new_blkg, *blkg;
-
- new_blkg = blkg_alloc(&blkcg_root, disk, GFP_KERNEL);
- if (!new_blkg)
- return -ENOMEM;
+ struct blkcg_gq *blkg;
+ /* Make sure the root blkg exists. */
mutex_lock(&q->blkcg_mutex);
+ blkg = blkg_create(&blkcg_root, disk);
+ mutex_unlock(&q->blkcg_mutex);
- /* Make sure the root blkg exists. */
- blkg = blkg_create(&blkcg_root, disk, new_blkg);
if (IS_ERR(blkg))
- goto err_unlock;
- q->root_blkg = blkg;
-
- mutex_unlock(&q->blkcg_mutex);
+ return PTR_ERR(blkg);
+ q->root_blkg = blkg;
return 0;
-
-err_unlock:
- mutex_unlock(&q->blkcg_mutex);
- return PTR_ERR(blkg);
}
void blkcg_exit_disk(struct gendisk *disk)
--
2.39.2
* [PATCH 10/10] blk-throttle: fix possible deadlock due to queue_lock in timer
2025-09-25 8:15 [PATCH 00/10] blk-cgroup: don't use queue_lock for protection and fix deadlock Yu Kuai
` (8 preceding siblings ...)
2025-09-25 8:15 ` [PATCH 09/10] blk-cgroup: remove preallocate blkg for blkg_create() Yu Kuai
@ 2025-09-25 8:15 ` Yu Kuai
9 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-09-25 8:15 UTC (permalink / raw)
To: tj, ming.lei, nilay, hch, josef, axboe, akpm, vgoyal
Cc: cgroups, linux-block, linux-kernel, linux-mm, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Abusing queue_lock to protect blk-throttle can cause deadlock:
1) throtl_pending_timer_fn() holds the lock while throtl_pd_free()
flushes the timer; this is fixed by the previous patches, which protect
blkgs with blkcg_mutex instead of queue_lock.
2) queue_lock can be held from hardirq context, so if
throtl_pending_timer_fn() is interrupted by a hardirq, deadlock can be
triggered as well.
Stop abusing queue_lock to protect blk-throttle and introduce a new
internal lock, td->lock, for protection. Since the new lock is never
grabbed from hardirq context, it's safe to use spin_lock_bh() from
thread context and spin_lock() directly from softirq context.
Fixes: 6e1a5704cbbd ("blk-throttle: dispatch from throtl_pending_timer_fn()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-throttle.c | 31 +++++++++++++------------------
1 file changed, 13 insertions(+), 18 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2c5b64b1a724..a2fa440559c9 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -33,6 +33,7 @@ static struct workqueue_struct *kthrotld_workqueue;
struct throtl_data
{
+ spinlock_t lock;
/* service tree for active throtl groups */
struct throtl_service_queue service_queue;
@@ -1140,7 +1141,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
else
q = td->queue;
- spin_lock_irq(&q->queue_lock);
+ spin_lock(&td->lock);
if (!q->root_blkg)
goto out_unlock;
@@ -1166,9 +1167,9 @@ static void throtl_pending_timer_fn(struct timer_list *t)
break;
/* this dispatch windows is still open, relax and repeat */
- spin_unlock_irq(&q->queue_lock);
+ spin_unlock(&td->lock);
cpu_relax();
- spin_lock_irq(&q->queue_lock);
+ spin_lock(&td->lock);
}
if (!dispatched)
@@ -1191,7 +1192,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
queue_work(kthrotld_workqueue, &td->dispatch_work);
}
out_unlock:
- spin_unlock_irq(&q->queue_lock);
+ spin_unlock(&td->lock);
}
/**
@@ -1207,7 +1208,6 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
struct throtl_data *td = container_of(work, struct throtl_data,
dispatch_work);
struct throtl_service_queue *td_sq = &td->service_queue;
- struct request_queue *q = td->queue;
struct bio_list bio_list_on_stack;
struct bio *bio;
struct blk_plug plug;
@@ -1215,11 +1215,11 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
bio_list_init(&bio_list_on_stack);
- spin_lock_irq(&q->queue_lock);
+ spin_lock_bh(&td->lock);
for (rw = READ; rw <= WRITE; rw++)
while ((bio = throtl_pop_queued(td_sq, NULL, rw)))
bio_list_add(&bio_list_on_stack, bio);
- spin_unlock_irq(&q->queue_lock);
+ spin_unlock_bh(&td->lock);
if (!bio_list_empty(&bio_list_on_stack)) {
blk_start_plug(&plug);
@@ -1297,7 +1297,7 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
rcu_read_unlock();
/*
- * We're already holding queue_lock and know @tg is valid. Let's
+ * We're already holding td->lock and know @tg is valid. Let's
* apply the new config directly.
*
* Restart the slices for both READ and WRITES. It might happen
@@ -1324,6 +1324,7 @@ static int blk_throtl_init(struct gendisk *disk)
if (!td)
return -ENOMEM;
+ spin_lock_init(&td->lock);
INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
throtl_service_queue_init(&td->service_queue);
@@ -1694,12 +1695,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
if (!blk_throtl_activated(q))
return;
- spin_lock_irq(&q->queue_lock);
- /*
- * queue_lock is held, rcu lock is not needed here technically.
- * However, rcu lock is still held to emphasize that following
- * path need RCU protection and to prevent warning from lockdep.
- */
+ spin_lock_bh(&q->td->lock);
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
/*
@@ -1713,7 +1709,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
tg_flush_bios(blkg_to_tg(blkg));
}
rcu_read_unlock();
- spin_unlock_irq(&q->queue_lock);
+ spin_unlock_bh(&q->td->lock);
}
static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
@@ -1746,7 +1742,6 @@ static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
bool __blk_throtl_bio(struct bio *bio)
{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
struct blkcg_gq *blkg = bio->bi_blkg;
struct throtl_qnode *qn = NULL;
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1756,7 +1751,7 @@ bool __blk_throtl_bio(struct bio *bio)
struct throtl_data *td = tg->td;
rcu_read_lock();
- spin_lock_irq(&q->queue_lock);
+ spin_lock_bh(&td->lock);
sq = &tg->service_queue;
while (true) {
@@ -1832,7 +1827,7 @@ bool __blk_throtl_bio(struct bio *bio)
}
out_unlock:
- spin_unlock_irq(&q->queue_lock);
+ spin_unlock_bh(&td->lock);
rcu_read_unlock();
return throttled;
--
2.39.2
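The spin_lock_bh()/spin_lock() split follows the usual rule for a lock shared
between process context and a timer callback: process context must disable
bottom halves, while the callback, already running in softirq context, can
take the lock directly. A generic sketch of that pattern with hypothetical
names (not code from this series):

	#include <linux/spinlock.h>
	#include <linux/timer.h>

	static DEFINE_SPINLOCK(example_lock);

	static void example_timer_fn(struct timer_list *t)
	{
		/* softirq context: a plain spin_lock() is enough */
		spin_lock(&example_lock);
		/* ... dispatch pending work ... */
		spin_unlock(&example_lock);
	}

	static void example_submit(void)
	{
		/* process context: disable BHs so the timer can't fire on this
		 * CPU while the lock is held and then spin on it forever */
		spin_lock_bh(&example_lock);
		/* ... queue new work ... */
		spin_unlock_bh(&example_lock);
	}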