Re: 6.18.13 iwlwifi deadlock allocating cma while work-item is active.

From: Ben Greear <greearb@candelatech.com>
To: Tejun Heo <tj@kernel.org>
Cc: Johannes Berg <johannes@sipsolutions.net>,
	linux-wireless <linux-wireless@vger.kernel.org>,
	Miriam Rachel <miriam.rachel.korenblit@intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: 6.18.13 iwlwifi deadlock allocating cma while work-item is active.
Date: Tue, 10 Mar 2026 12:18:49 -0700
Message-ID: <729164a1-9dd4-c9a4-f092-d93d775257e0@candelatech.com>
In-Reply-To: <5b9b93df8774810a43fceb359906604b@kernel.org>

On 3/10/26 11:06, Tejun Heo wrote:
> Hello,
> 
> Thanks for the detailed dump. One thing that doesn't look right is the
> number of pending work items on pool 22 (CPU 5). The pool reports 2 idle
> workers, yet there are 7+ work items sitting in the pending list across
> multiple workqueues. If the pool were making forward progress, those items
> would have been picked up by the idle workers. So, the pool itself seems to
> be stuck for some reason, and the cfg80211 mutex stall may be a consequence
> rather than the cause.
> 
> Let's try using drgn on the crash dump. I'm attaching a prompt that you can
> feed to Claude (or any LLM with tool access to drgn). It contains workqueue
> internals documentation, drgn code snippets, and a systematic investigation
> procedure. The idea is:
> 
> 1. Generate the crash dump when the deadlock is happening:
> 
>       echo c > /proc/sysrq-trigger
> 
> 2. After the crash kernel boots, create the dump file:
> 
>       makedumpfile -c -d 31 /proc/vmcore /tmp/vmcore.dmp
> 
> 3. Feed the attached prompt to Claude with drgn access to the dump. It
>     should produce a Markdown report with its findings that you can post
>     back here.
> 
> This is a bit experimental, so let's see whether it works. Either way, the
> report should at least give us concrete data points to work with.
> 
> Thanks.

Thanks for that.  It will probably be a few days before I get back to debugging
that lockup, as I'm trying to get something ready for our internal release (using
the kthread work-around).

While working on another bug, I found evidence (but not yet proof) that the code below
can be called multiple times for the same object.  I suspect this may be the cause of
the list corruption I'm tracking (my debugging logs and work-arounds are in the function below).

But could this work-item (re)initialization also explain the workqueue system
misbehaving?  Switching to kthreads 'fixes' the problem for me, and that
really shouldn't make a difference to the code below, so it is probably
not related?


void ieee80211_link_init(struct ieee80211_sub_if_data *sdata,
			 int link_id,
			 struct ieee80211_link_data *link,
			 struct ieee80211_bss_conf *link_conf)
{
	struct ieee80211_local *local = sdata->local;
	bool deflink = link_id < 0;

	lockdep_assert_wiphy(local->hw.wiphy);

	if (link_id < 0)
		link_id = 0;

	if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN) {
		struct ieee80211_sub_if_data *ap_bss;
		struct ieee80211_bss_conf *ap_bss_conf;

		ap_bss = container_of(sdata->bss,
				      struct ieee80211_sub_if_data, u.ap);
		ap_bss_conf = sdata_dereference(ap_bss->vif.link_conf[link_id],
						ap_bss);
		memcpy(link_conf, ap_bss_conf, sizeof(*link_conf));
	}

	link->sdata = sdata;
	link->link_id = link_id;
	link->conf = link_conf;
	link_conf->link_id = link_id;
	link_conf->vif = &sdata->vif;
	link->ap_power_level = IEEE80211_UNSET_POWER_LEVEL;
	link->user_power_level = sdata->local->user_power_level;
	link_conf->txpower = INT_MIN;

	wiphy_work_init(&link->csa.finalize_work,
			ieee80211_csa_finalize_work);
	wiphy_work_init(&link->color_change_finalize_work,
			ieee80211_color_change_finalize_work);
	wiphy_delayed_work_init(&link->color_collision_detect_work,
				ieee80211_color_collision_detection_work);
	/* I see some sort of list corruption where links don't get removed from
	 * chanctx lists.  I think that could happen if the link is still on a list
	 * when we get here, and deflink appears to have a chance of doing that.
	 * So, remove the link from the list first if it is indeed on one.
	 */
	if (WARN_ON_ONCE((link->assigned_chanctx_list.next != LIST_POISON1)
			 && (link->assigned_chanctx_list.next != link->assigned_chanctx_list.prev)
			 && (link->assigned_chanctx_list.next))) {
		sdata_err(sdata, "link-init: %d called while already in an assigned-chan-ctx list, clearing.\n",
			  link_id);
		list_del(&link->assigned_chanctx_list);
	}
	if (WARN_ON_ONCE((link->reserved_chanctx_list.next != LIST_POISON1)
			 && (link->reserved_chanctx_list.next != link->reserved_chanctx_list.prev)
			 && (link->reserved_chanctx_list.next))) {
		sdata_err(sdata, "link-init: %d called while already in a reserved-chan-ctx list, clearing.\n",
			  link_id);
		list_del(&link->reserved_chanctx_list);
	}

	INIT_LIST_HEAD(&link->assigned_chanctx_list);
	INIT_LIST_HEAD(&link->reserved_chanctx_list);
	wiphy_delayed_work_init(&link->dfs_cac_timer_work,
				ieee80211_dfs_cac_timer_work);

	if (!deflink) {
		switch (sdata->vif.type) {
		case NL80211_IFTYPE_AP:
		case NL80211_IFTYPE_AP_VLAN:
			ether_addr_copy(link_conf->addr,
					sdata->wdev.links[link_id].addr);
			link_conf->bssid = link_conf->addr;
			WARN_ON(!(sdata->wdev.valid_links & BIT(link_id)));
			break;
		case NL80211_IFTYPE_STATION:
			/* station sets the bssid in ieee80211_mgd_setup_link */
			break;
		default:
			WARN_ON(1);
		}

		ieee80211_link_debugfs_add(link);
	}

	rcu_assign_pointer(sdata->vif.link_conf[link_id], link_conf);
	rcu_assign_pointer(sdata->link[link_id], link);
}


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
