From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8995EC433EF for ; Tue, 19 Jul 2022 10:41:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id EF1836B0071; Tue, 19 Jul 2022 06:41:20 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EA2336B0073; Tue, 19 Jul 2022 06:41:20 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D91746B0074; Tue, 19 Jul 2022 06:41:20 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id CA4D66B0071 for ; Tue, 19 Jul 2022 06:41:20 -0400 (EDT) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 93FD4717 for ; Tue, 19 Jul 2022 10:41:20 +0000 (UTC) X-FDA: 79703507520.15.35175F9 Received: from mail3-162.sinamail.sina.com.cn (mail3-162.sinamail.sina.com.cn [202.108.3.162]) by imf25.hostedemail.com (Postfix) with SMTP id 8E0DEA0098 for ; Tue, 19 Jul 2022 10:41:17 +0000 (UTC) Received: from unknown (HELO localhost.localdomain)([114.249.59.48]) by sina.com (172.16.97.27) with ESMTP id 62D68A3F0001030F; Tue, 19 Jul 2022 18:41:05 +0800 (CST) X-Sender: hdanton@sina.com X-Auth-ID: hdanton@sina.com X-SMAIL-MID: 90076549285386 From: Hillf Danton To: Doug Anderson Cc: Waiman Long , Peter Zijlstra , Will Deacon , Davidlohr Bueso , linux-mm@kvack.org, LKML Subject: Re: [PATCH v5] locking/rwsem: Make handoff bit handling more consistent Date: Tue, 19 Jul 2022 18:41:04 +0800 Message-Id: <20220719104104.1634-1-hdanton@sina.com> In-Reply-To: References: <20211116012912.723980-1-longman@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit ARC-Authentication-Results: i=1; imf25.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf25.hostedemail.com: domain of hdanton@sina.com designates 202.108.3.162 as permitted sender) smtp.mailfrom=hdanton@sina.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1658227280; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=DIlC5msPepLhk41WiInnprCu48InA2Fy+q7OB0+vJX0=; b=yhriLkkM3yDuHmmbJdkP/Mcuc65he+Ize3Yf+LD3JTxbjO3kNtjU/WkEhWXe10JYRUQiGk Ta2qHivMAPOWMCw63tTGzBu9CVtW+VhUW/2ze/0LzMrpkkkYOTjk8mCv7PEvF7RKsfYvLy N8wrYQqYP+audIZ0L6oMIoz0Z2+w+gY= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1658227280; a=rsa-sha256; cv=none; b=rW/GWfLKHp27A568wl59qgBirtKoqydtbgAW6ohllAAffH6ui7LBo5OL2/SiVk2OEqAWDC yGVfLoK6TgJe+0XVwzkR/k6klyfhP2nw25kfpPZONn9P1C/2JTiWp45/1qYskxfF6SHq2v 94l51ybh2ehzGymiMPpvFDFORkvyeXk= X-Rspamd-Queue-Id: 8E0DEA0098 Authentication-Results: imf25.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf25.hostedemail.com: domain of hdanton@sina.com designates 202.108.3.162 as permitted sender) smtp.mailfrom=hdanton@sina.com X-Rspam-User: X-Rspamd-Server: rspam03 X-Stat-Signature: sngf5aht4fh3ahyceoehi6ioa4ss7w81 X-HE-Tag: 1658227277-391891 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, 18 Jul 2022 17:27:28 -0700 Doug Anderson wrote: > > I've been tracking down an occasional hang at reboot on my system and > I've ended up at this as the first bad commit. I will not pretend to > understand the intricacies of the rwsem implementation, but I can > describe what I saw. I have also produced a fairly small test case > that reproduces the problem rather quickly. > > First, what I saw: > > My system failed to fully boot up and eventually the "hung task" > detection kicked in. Many tasks in my system were hung all waiting on > the "kernfs_rwsem". No tasks actually had the semaphore--it only had > tasks waiting. > > Of the tasks waiting, 3 of them were doing a down_write(). The rest > were all waiting on down_read(). > > 2 of the tasks waiting on the down_write() were locked to CPU0. One of > these tasks was a bound kworker. Another of these tasks was a threaded > IRQ handler. The threaded IRQ handler was set to "real time" priority > because in setup_irq_thread() you can see the call to > sched_set_fifo(). > > At the time the hung task detector kicked in, the real time task was > actually active on a CPU. Specifically it was running in the for (;;) > loop in rwsem_down_write_slowpath(). rwsem_try_write_lock() had > clearly just returned false which meant we didn't get the lock. > Everything else was sitting in schedule(). > > I managed to get the real time task into kgdb and I could analyze its > state as well as the state of "sem". The real time task was _not_ the > first waiter. The kworker was the first waiter. The > "waiter.handoff_set" was set to "true" for the real time task. The > rwsem owner was OWNER_NULL. > > Looking through the code and watching what was happening. > > 1. The function rwsem_try_write_lock() was instantly returning false > since `handoff` is set and we're not first. > 2. After we get back into rwsem_down_write_slowpath() we'll see the > handoff set and we'll try to spin on the owner. There is no owner, so > this is a noop. > 3. Since there's no owner, we'll go right back to the start of the loop. > > So basically the real time thread (the threaded IRQ handler) was > locked to CPU0 and spinning as fast as possible. The "first waiter" > for the semaphore was blocked from running because it could only run > on CPU0 but was _not_ real time priority. > > - > > So all the analysis above was done on the Chrome OS 5.15 kernel > branch, which has ${SUBJECT} patch from the stable tree. The code > looks reasonably the same on mainline. > > ...and also, I coded up a test case that can reproduce this on > mainline. It's ugly/hacky but it gets the job done. This reproduces > the problem at the top of mainline as of commit 80e19f34c288 ("Merge > tag 'hte/for-5.19' of > git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux"). > > For me, I was only able to reproduce this without "lockdep" enabled. > My lockdep configs were: > > CONFIG_DEBUG_RT_MUTEXES=y > CONFIG_DEBUG_SPINLOCK=y > CONFIG_DEBUG_MUTEXES=y > CONFIG_PROVE_RCU=y > CONFIG_PROVE_LOCKING=y > CONFIG_DEBUG_ATOMIC_SLEEP=y > > I don't know for sure if lockdep is actually required to reproduce. > > - > > OK, so here's my hacky test case. In my case, I put a call to this > test function in a convenient debugfs "show" function to make it easy > to trigger. You can put it wherever. > > struct test_data { > struct rw_semaphore *rwsem; > int i; > bool should_sleep; > }; > > static int test_thread_fn(void *data) > { > struct test_data *test_data = data; > struct rw_semaphore *rwsem = test_data->rwsem; > ktime_t start; > > trace_printk("Starting\n"); > start = ktime_get(); > while (ktime_to_ms(ktime_sub(ktime_get(), start)) < 60000) { > trace_printk("About to grab\n"); > down_write(rwsem); > trace_printk("Grabbed write %d\n", test_data->i); > schedule(); > up_write(rwsem); > trace_printk("Released write %d\n", test_data->i); > if (test_data->should_sleep) > msleep(1); > } > trace_printk("Done\n"); > > return 0; > } > > static void test(void) > { > static struct task_struct *t[10]; > static struct test_data test_data[10]; > static DECLARE_RWSEM(rwsem); > int i; > > trace_printk("About to create threads\n"); > > for (i = 0; i < ARRAY_SIZE(t); i++) { > test_data[i].rwsem = &rwsem; > test_data[i].i = i; > > if (i == ARRAY_SIZE(t) - 1) { > /* > * Last thread will be bound to CPU0 and realtime. > * Have it sleep to give other threads a chance to > * run and contend. > */ > test_data[i].should_sleep = true; > t[i] = kthread_create_on_cpu(test_thread_fn, > &test_data[i], 0, > "test0 FIFO-%u"); > sched_set_fifo(t[i]); > } else if (i == ARRAY_SIZE(t) - 2) { > /* 2nd to last thread will be bound to CPU0 */ > t[i] = kthread_create_on_cpu(test_thread_fn, > &test_data[i], 0, > "test0-%u"); > } else { > /* All other threads are just normal */ > t[i] = kthread_create(test_thread_fn, > &test_data[i], "test"); > } > wake_up_process(t[i]); > msleep(10); > } > } > > - > > With the reproducer above, I was able to: > > 1. Validate that on chromeos-5.15 I could revert ${SUBJECT} patch and > the problem went away. > > 2. I could go to mainline at exactly the commit hash of ${SUBJECT} > patch, see the problem, then revert ${SUBJECT} patch and see the > problem go away. > > Thus I'm fairly confident that the problem is related to ${SUBJECT} patch. > > - > > I'm hoping that someone on this thread can propose a fix. I'm happy to > test, but I was hoping not to have to become an expert on the rwsem > implementation to try to figure out the proper fix. > See if it makes sense to only allow the first waiter to spin on owner. Hillf --- mainline/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -337,7 +337,7 @@ struct rwsem_waiter { unsigned long timeout; /* Writer only, not initialized in reader */ - bool handoff_set; + bool handoff_set, first; }; #define rwsem_first_waiter(sem) \ list_first_entry(&sem->wait_list, struct rwsem_waiter, list) @@ -604,6 +604,7 @@ static inline bool rwsem_try_write_lock( lockdep_assert_held(&sem->wait_lock); + waiter->first = first; count = atomic_long_read(&sem->count); do { bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF); @@ -1114,6 +1115,7 @@ rwsem_down_write_slowpath(struct rw_sema waiter.type = RWSEM_WAITING_FOR_WRITE; waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT; waiter.handoff_set = false; + waiter.first = false; raw_spin_lock_irq(&sem->wait_lock); rwsem_add_waiter(sem, &waiter); @@ -1158,7 +1160,7 @@ rwsem_down_write_slowpath(struct rw_sema * In this case, we attempt to acquire the lock again * without sleeping. */ - if (waiter.handoff_set) { + if (waiter.handoff_set && waiter.first) { enum owner_state owner_state; preempt_disable();