linux-mm.kvack.org archive mirror
From: Doug Anderson <dianders@chromium.org>
To: Hillf Danton <hdanton@sina.com>
Cc: Waiman Long <longman@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	 Will Deacon <will@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	 Linux Memory Management List <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v5] locking/rwsem: Make handoff bit handling more consistent
Date: Tue, 19 Jul 2022 08:30:02 -0700	[thread overview]
Message-ID: <CAD=FV=Xng_Mcd-9SaK29XSbsthDWLTR6sz53VEktAQFng6a27A@mail.gmail.com> (raw)
In-Reply-To: <20220719104104.1634-1-hdanton@sina.com>

Hi,

On Tue, Jul 19, 2022 at 3:41 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 18 Jul 2022 17:27:28 -0700 Doug Anderson <dianders@chromium.org> wrote:
> >
> > I've been tracking down an occasional hang at reboot on my system and
> > I've ended up at this as the first bad commit. I will not pretend to
> > understand the intricacies of the rwsem implementation, but I can
> > describe what I saw. I have also produced a fairly small test case
> > that reproduces the problem rather quickly.
> >
> > First, what I saw:
> >
> > My system failed to fully boot up and eventually the "hung task"
> > detection kicked in. Many tasks in my system were hung, all waiting on
> > the "kernfs_rwsem". No tasks actually had the semaphore--it only had
> > tasks waiting.
> >
> > Of the waiting tasks, 3 were doing a down_write(); the rest were all
> > waiting on down_read().
> >
> > 2 of the tasks waiting on the down_write() were locked to CPU0. One
> > of them was a bound kworker; the other was a threaded IRQ handler.
> > The threaded IRQ handler runs at "real time" priority because of the
> > sched_set_fifo() call in setup_irq_thread().
> >
> > At the time the hung task detector kicked in, the real time task was
> > actually active on a CPU. Specifically, it was running in the for (;;)
> > loop in rwsem_down_write_slowpath(); rwsem_try_write_lock() had
> > clearly just returned false, which meant we didn't get the lock.
> > Everything else was sitting in schedule().
> >
> > I managed to get the real time task into kgdb and I could analyze its
> > state as well as the state of "sem". The real time task was _not_ the
> > first waiter. The kworker was the first waiter. The
> > "waiter.handoff_set" was set to "true" for the real time task. The
> > rwsem owner was OWNER_NULL.
> >
> > Looking through the code, here is what was happening:
> >
> > 1. rwsem_try_write_lock() instantly returns false since the handoff
> > bit is set and we're not the first waiter.
> > 2. Back in rwsem_down_write_slowpath() we see handoff_set and try to
> > spin on the owner. There is no owner, so this is a noop.
> > 3. Since there's no owner, we go right back to the start of the loop.
> >
> > So basically the real time thread (the threaded IRQ handler) was
> > locked to CPU0 and spinning as fast as possible. The "first waiter"
> > for the semaphore was blocked from running because it could only run
> > on CPU0 and did _not_ have real time priority.
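> >
> > To make that concrete, here is a rough paraphrase (not the verbatim
> > mainline code) of the part of rwsem_down_write_slowpath() that the
> > real time task was stuck in:
> >
> >         for (;;) {
> >                 if (rwsem_try_write_lock(sem, &waiter))
> >                         break;  /* fails instantly: handoff set, not first */
> >
> >                 if (waiter.handoff_set) {
> >                         /* owner is OWNER_NULL, so there's nothing to spin on */
> >                         if (rwsem_spin_on_owner(sem) == OWNER_NULL)
> >                                 continue;  /* back to the top without sleeping */
> >                 }
> >
> >                 schedule();  /* the spinning real time task never gets here */
> >         }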
> >
> > -
> >
> > So all the analysis above was done on the Chrome OS 5.15 kernel
> > branch, which has ${SUBJECT} patch from the stable tree. The code
> > looks essentially the same on mainline.
> >
> > ...and also, I coded up a test case that can reproduce this on
> > mainline. It's ugly/hacky but it gets the job done. This reproduces
> > the problem at the top of mainline as of commit 80e19f34c288 ("Merge
> > tag 'hte/for-5.19' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux").
> >
> > For me, I was only able to reproduce this without "lockdep" enabled.
> > My lockdep configs were:
> >
> > CONFIG_DEBUG_RT_MUTEXES=y
> > CONFIG_DEBUG_SPINLOCK=y
> > CONFIG_DEBUG_MUTEXES=y
> > CONFIG_PROVE_RCU=y
> > CONFIG_PROVE_LOCKING=y
> > CONFIG_DEBUG_ATOMIC_SLEEP=y
> >
> > I don't know for sure if lockdep is actually required to reproduce.
> >
> > -
> >
> > OK, so here's my hacky test case. In my case, I put a call to this
> > test function in a convenient debugfs "show" function to make it easy
> > to trigger. You can put it wherever.
> >
> > struct test_data {
> >         struct rw_semaphore *rwsem;
> >         int i;
> >         bool should_sleep;
> > };
> >
> > static int test_thread_fn(void *data)
> > {
> >         struct test_data *test_data = data;
> >         struct rw_semaphore *rwsem = test_data->rwsem;
> >         ktime_t start;
> >
> >         trace_printk("Starting\n");
> >         start = ktime_get();
> >         /* Hammer the rwsem for 60 seconds */
> >         while (ktime_to_ms(ktime_sub(ktime_get(), start)) < 60000) {
> >                 trace_printk("About to grab\n");
> >                 down_write(rwsem);
> >                 trace_printk("Grabbed write %d\n", test_data->i);
> >                 /* Yield while holding the lock so waiters can pile up */
> >                 schedule();
> >                 up_write(rwsem);
> >                 trace_printk("Released write %d\n", test_data->i);
> >                 if (test_data->should_sleep)
> >                         msleep(1);
> >         }
> >         trace_printk("Done\n");
> >
> >         return 0;
> > }
> >
> > static void test(void)
> > {
> >         static struct task_struct *t[10];
> >         static struct test_data test_data[10];
> >         static DECLARE_RWSEM(rwsem);
> >         int i;
> >
> >         trace_printk("About to create threads\n");
> >
> >         for (i = 0; i < ARRAY_SIZE(t); i++) {
> >                 test_data[i].rwsem = &rwsem;
> >                 test_data[i].i = i;
> >
> >                 if (i == ARRAY_SIZE(t) - 1) {
> >                         /*
> >                          * Last thread will be bound to CPU0 and realtime.
> >                          * Have it sleep to give other threads a chance to
> >                          * run and contend.
> >                          */
> >                         test_data[i].should_sleep = true;
> >                         t[i] = kthread_create_on_cpu(test_thread_fn,
> >                                                      &test_data[i], 0,
> >                                                      "test0 FIFO-%u");
> >                         sched_set_fifo(t[i]);
> >                 } else if (i == ARRAY_SIZE(t) - 2) {
> >                         /* 2nd to last thread will be bound to CPU0 */
> >                         t[i] = kthread_create_on_cpu(test_thread_fn,
> >                                                      &test_data[i], 0,
> >                                                      "test0-%u");
> >                 } else {
> >                         /* All other threads are just normal */
> >                         t[i] = kthread_create(test_thread_fn,
> >                                               &test_data[i], "test-%d", i);
> >                 }
> >                 wake_up_process(t[i]);
> >                 msleep(10);
> >         }
> > }
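> >
> > For reference, here is a minimal sketch of the kind of debugfs hook I
> > used to trigger this. The file name and identifiers below are just
> > placeholders; call test() from wherever is convenient for you:
> >
> > /* needs <linux/debugfs.h> and <linux/seq_file.h> */
> > static int rwsem_test_show(struct seq_file *s, void *unused)
> > {
> >         test();
> >         seq_puts(s, "rwsem test threads started\n");
> >         return 0;
> > }
> > DEFINE_SHOW_ATTRIBUTE(rwsem_test);
> >
> > /* e.g. from some handy init function: */
> > debugfs_create_file("rwsem_test", 0444, NULL, NULL, &rwsem_test_fops);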
> >
> > -
> >
> > With the reproducer above, I was able to:
> >
> > 1. Validate that on chromeos-5.15 I could revert ${SUBJECT} patch and
> > the problem went away.
> >
> > 2. I could go to mainline at exactly the commit hash of ${SUBJECT}
> > patch, see the problem, then revert ${SUBJECT} patch and see the
> > problem go away.
> >
> > Thus I'm fairly confident that the problem is related to ${SUBJECT} patch.
> >
> > -
> >
> > I'm hoping that someone on this thread can propose a fix. I'm happy to
> > test, but I was hoping not to have to become an expert on the rwsem
> > implementation to try to figure out the proper fix.
> >
>
> See if it makes sense to only allow the first waiter to spin on owner.
>
> Hillf
>
> --- mainline/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -337,7 +337,7 @@ struct rwsem_waiter {
>         unsigned long timeout;
>
>         /* Writer only, not initialized in reader */
> -       bool handoff_set;
> +       bool handoff_set, first;
>  };
>  #define rwsem_first_waiter(sem) \
>         list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
> @@ -604,6 +604,7 @@ static inline bool rwsem_try_write_lock(
>
>         lockdep_assert_held(&sem->wait_lock);
>
> +       waiter->first = first;
>         count = atomic_long_read(&sem->count);
>         do {
>                 bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
> @@ -1114,6 +1115,7 @@ rwsem_down_write_slowpath(struct rw_sema
>         waiter.type = RWSEM_WAITING_FOR_WRITE;
>         waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
>         waiter.handoff_set = false;
> +       waiter.first = false;
>
>         raw_spin_lock_irq(&sem->wait_lock);
>         rwsem_add_waiter(sem, &waiter);
> @@ -1158,7 +1160,7 @@ rwsem_down_write_slowpath(struct rw_sema
>                  * In this case, we attempt to acquire the lock again
>                  * without sleeping.
>                  */
> -               if (waiter.handoff_set) {
> +               if (waiter.handoff_set && waiter.first) {

Your patch does fix my test case, so FWIW:

Tested-by: Douglas Anderson <dianders@chromium.org>

I haven't done any stress testing other than my test case, though, so
I can't speak to whether there might be any other unintended issues.


-Doug



Thread overview: 11+ messages
     [not found] <20211116012912.723980-1-longman@redhat.com>
     [not found] ` <CAD=FV=URCo5xv3k3jWbxV1uRkUU5k6bcnuB1puZhxayEyVc6-A@mail.gmail.com>
2022-07-19 10:41   ` Hillf Danton
2022-07-19 15:30     ` Doug Anderson [this message]
2022-07-22 11:55       ` Hillf Danton
2022-07-22 14:02         ` Doug Anderson
2022-07-23  0:17           ` Hillf Danton
2022-07-23  1:27             ` Hillf Danton
2022-08-05 17:14             ` Doug Anderson
2022-08-05 19:02               ` Waiman Long
2022-08-05 19:16                 ` Doug Anderson
2022-08-30 16:18                   ` Doug Anderson
2022-08-31 11:08                     ` Hillf Danton
