From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 8995EC433EF
	for <linux-mm@archiver.kernel.org>; Tue, 19 Jul 2022 10:41:21 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id EF1836B0071; Tue, 19 Jul 2022 06:41:20 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id EA2336B0073; Tue, 19 Jul 2022 06:41:20 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id D91746B0074; Tue, 19 Jul 2022 06:41:20 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id CA4D66B0071
	for <linux-mm@kvack.org>; Tue, 19 Jul 2022 06:41:20 -0400 (EDT)
Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay10.hostedemail.com (Postfix) with ESMTP id 93FD4717
	for <linux-mm@kvack.org>; Tue, 19 Jul 2022 10:41:20 +0000 (UTC)
X-FDA: 79703507520.15.35175F9
Received: from mail3-162.sinamail.sina.com.cn (mail3-162.sinamail.sina.com.cn [202.108.3.162])
	by imf25.hostedemail.com (Postfix) with SMTP id 8E0DEA0098
	for <linux-mm@kvack.org>; Tue, 19 Jul 2022 10:41:17 +0000 (UTC)
Received: from unknown (HELO localhost.localdomain)([114.249.59.48])
	by sina.com (172.16.97.27) with ESMTP
	id 62D68A3F0001030F; Tue, 19 Jul 2022 18:41:05 +0800 (CST)
X-Sender: hdanton@sina.com
X-Auth-ID: hdanton@sina.com
X-SMAIL-MID: 90076549285386
From: Hillf Danton <hdanton@sina.com>
To: Doug Anderson <dianders@chromium.org>
Cc: Waiman Long <longman@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm@kvack.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v5] locking/rwsem: Make handoff bit handling more consistent
Date: Tue, 19 Jul 2022 18:41:04 +0800
Message-Id: <20220719104104.1634-1-hdanton@sina.com>
In-Reply-To: <CAD=FV=URCo5xv3k3jWbxV1uRkUU5k6bcnuB1puZhxayEyVc6-A@mail.gmail.com>
References: <20211116012912.723980-1-longman@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
ARC-Authentication-Results: i=1;
	imf25.hostedemail.com;
	dkim=none;
	dmarc=none;
	spf=pass (imf25.hostedemail.com: domain of hdanton@sina.com designates 202.108.3.162 as permitted sender) smtp.mailfrom=hdanton@sina.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1658227280;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=DIlC5msPepLhk41WiInnprCu48InA2Fy+q7OB0+vJX0=;
	b=yhriLkkM3yDuHmmbJdkP/Mcuc65he+Ize3Yf+LD3JTxbjO3kNtjU/WkEhWXe10JYRUQiGk
	Ta2qHivMAPOWMCw63tTGzBu9CVtW+VhUW/2ze/0LzMrpkkkYOTjk8mCv7PEvF7RKsfYvLy
	N8wrYQqYP+audIZ0L6oMIoz0Z2+w+gY=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1658227280; a=rsa-sha256;
	cv=none;
	b=rW/GWfLKHp27A568wl59qgBirtKoqydtbgAW6ohllAAffH6ui7LBo5OL2/SiVk2OEqAWDC
	yGVfLoK6TgJe+0XVwzkR/k6klyfhP2nw25kfpPZONn9P1C/2JTiWp45/1qYskxfF6SHq2v
	94l51ybh2ehzGymiMPpvFDFORkvyeXk=
X-Rspamd-Queue-Id: 8E0DEA0098
Authentication-Results: imf25.hostedemail.com;
	dkim=none;
	dmarc=none;
	spf=pass (imf25.hostedemail.com: domain of hdanton@sina.com designates 202.108.3.162 as permitted sender) smtp.mailfrom=hdanton@sina.com
X-Rspam-User: 
X-Rspamd-Server: rspam03
X-Stat-Signature: sngf5aht4fh3ahyceoehi6ioa4ss7w81
X-HE-Tag: 1658227277-391891
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Mon, 18 Jul 2022 17:27:28 -0700 Doug Anderson <dianders@chromium.org> wrote:
> 
> I've been tracking down an occasional hang at reboot on my system and
> I've ended up at this as the first bad commit. I will not pretend to
> understand the intricacies of the rwsem implementation, but I can
> describe what I saw. I have also produced a fairly small test case
> that reproduces the problem rather quickly.
> 
> First, what I saw:
> 
> My system failed to fully boot up and eventually the "hung task"
> detection kicked in. Many tasks in my system were hung all waiting on
> the "kernfs_rwsem". No tasks actually had the semaphore--it only had
> tasks waiting.
> 
> Of the tasks waiting, 3 of them were doing a down_write(). The rest
> were all waiting on down_read().
> 
> 2 of the tasks waiting on the down_write() were locked to CPU0. One of
> these tasks was a bound kworker. Another of these tasks was a threaded
> IRQ handler. The threaded IRQ handler was set to "real time" priority
> because in setup_irq_thread() you can see the call to
> sched_set_fifo().
> 
> At the time the hung task detector kicked in, the real time task was
> actually active on a CPU. Specifically it was running in the for (;;)
> loop in rwsem_down_write_slowpath(). rwsem_try_write_lock() had
> clearly just returned false which meant we didn't get the lock.
> Everything else was sitting in schedule().
> 
> I managed to get the real time task into kgdb and I could analyze its
> state as well as the state of "sem". The real time task was _not_ the
> first waiter. The kworker was the first waiter. The
> "waiter.handoff_set" was set to "true" for the real time task. The
> rwsem owner was OWNER_NULL.
> 
> Looking through the code and watching what was happening.
> 
> 1. The function rwsem_try_write_lock() was instantly returning false
> since `handoff` is set and we're not first.
> 2. After we get back into rwsem_down_write_slowpath() we'll see the
> handoff set and we'll try to spin on the owner. There is no owner, so
> this is a noop.
> 3. Since there's no owner, we'll go right back to the start of the loop.
> 
> So basically the real time thread (the threaded IRQ handler) was
> locked to CPU0 and spinning as fast as possible. The "first waiter"
> for the semaphore was blocked from running because it could only run
> on CPU0 but was _not_ real time priority.
> 
> -
> 
> So all the analysis above was done on the Chrome OS 5.15 kernel
> branch, which has ${SUBJECT} patch from the stable tree. The code
> looks reasonably the same on mainline.
> 
> ...and also, I coded up a test case that can reproduce this on
> mainline. It's ugly/hacky but it gets the job done. This reproduces
> the problem at the top of mainline as of commit 80e19f34c288 ("Merge
> tag 'hte/for-5.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux").
> 
> For me, I was only able to reproduce this without "lockdep" enabled.
> My lockdep configs were:
> 
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_PROVE_RCU=y
> CONFIG_PROVE_LOCKING=y
> CONFIG_DEBUG_ATOMIC_SLEEP=y
> 
> I don't know for sure if lockdep is actually required to reproduce.
> 
> -
> 
> OK, so here's my hacky test case. In my case, I put a call to this
> test function in a convenient debugfs "show" function to make it easy
> to trigger. You can put it wherever.
> 
> struct test_data {
>         struct rw_semaphore *rwsem;
>         int i;
>         bool should_sleep;
> };
> 
> static int test_thread_fn(void *data)
> {
>         struct test_data *test_data = data;
>         struct rw_semaphore *rwsem = test_data->rwsem;
>         ktime_t start;
> 
>         trace_printk("Starting\n");
>         start = ktime_get();
>         while (ktime_to_ms(ktime_sub(ktime_get(), start)) < 60000) {
>                 trace_printk("About to grab\n");
>                 down_write(rwsem);
>                 trace_printk("Grabbed write %d\n", test_data->i);
>                 schedule();
>                 up_write(rwsem);
>                 trace_printk("Released write %d\n", test_data->i);
>                 if (test_data->should_sleep)
>                         msleep(1);
>         }
>         trace_printk("Done\n");
> 
>         return 0;
> }
> 
> static void test(void)
> {
>         static struct task_struct *t[10];
>         static struct test_data test_data[10];
>         static DECLARE_RWSEM(rwsem);
>         int i;
> 
>         trace_printk("About to create threads\n");
> 
>         for (i = 0; i < ARRAY_SIZE(t); i++) {
>                 test_data[i].rwsem = &rwsem;
>                 test_data[i].i = i;
> 
>                 if (i == ARRAY_SIZE(t) - 1) {
>                         /*
>                          * Last thread will be bound to CPU0 and realtime.
>                          * Have it sleep to give other threads a chance to
>                          * run and contend.
>                          */
>                         test_data[i].should_sleep = true;
>                         t[i] = kthread_create_on_cpu(test_thread_fn,
>                                                      &test_data[i], 0,
>                                                      "test0 FIFO-%u");
>                         sched_set_fifo(t[i]);
>                 } else if (i == ARRAY_SIZE(t) - 2) {
>                         /* 2nd to last thread will be bound to CPU0 */
>                         t[i] = kthread_create_on_cpu(test_thread_fn,
>                                                      &test_data[i], 0,
>                                                      "test0-%u");
>                 } else {
>                         /* All other threads are just normal */
>                         t[i] = kthread_create(test_thread_fn,
>                                               &test_data[i], "test");
>                 }
>                 wake_up_process(t[i]);
>                 msleep(10);
>         }
> }
> 
> -
> 
> With the reproducer above, I was able to:
> 
> 1. Validate that on chromeos-5.15 I could revert ${SUBJECT} patch and
> the problem went away.
> 
> 2. I could go to mainline at exactly the commit hash of ${SUBJECT}
> patch, see the problem, then revert ${SUBJECT} patch and see the
> problem go away.
> 
> Thus I'm fairly confident that the problem is related to ${SUBJECT} patch.
> 
> -
> 
> I'm hoping that someone on this thread can propose a fix. I'm happy to
> test, but I was hoping not to have to become an expert on the rwsem
> implementation to try to figure out the proper fix.
> 

See if it makes sense to only allow the first waiter to spin on owner.

Hillf

--- mainline/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -337,7 +337,7 @@ struct rwsem_waiter {
 	unsigned long timeout;
 
 	/* Writer only, not initialized in reader */
-	bool handoff_set;
+	bool handoff_set, first;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -604,6 +604,7 @@ static inline bool rwsem_try_write_lock(
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	waiter->first = first;
 	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -1114,6 +1115,7 @@ rwsem_down_write_slowpath(struct rw_sema
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 	waiter.handoff_set = false;
+	waiter.first = false;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	rwsem_add_waiter(sem, &waiter);
@@ -1158,7 +1160,7 @@ rwsem_down_write_slowpath(struct rw_sema
 		 * In this case, we attempt to acquire the lock again
 		 * without sleeping.
 		 */
-		if (waiter.handoff_set) {
+		if (waiter.handoff_set && waiter.first) {
 			enum owner_state owner_state;
 
 			preempt_disable();