From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 06 Apr 2020 20:11:23 -0700
From: Andrew Morton
To: akpm@linux-foundation.org, dbueso@suse.de, jbaron@akamai.com, linux-mm@kvack.org, mm-commits@vger.kernel.org, normalperson@yhbt.net, rpenyaev@suse.de, torvalds@linux-foundation.org, viro@zeniv.linux.org.uk
Subject: [patch 138/166] fs/epoll: make nesting accounting safe for -rt kernel
Message-ID: <20200407031123.COnBt0S6b%akpm@linux-foundation.org>
In-Reply-To: <20200406200254.a69ebd9e08c4074e41ddebaf@linux-foundation.org>

From: Jason Baron
Subject: fs/epoll: make nesting accounting safe for -rt kernel

Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set
ep_poll_safewake() can take several non-raw spinlocks after disabling
interrupts.  Since a spinlock can block in the -rt kernel, we can't take a
spinlock after disabling interrupts.  So let's re-work how we determine
the nesting level such that it plays nicely with the -rt kernel.

Let's introduce a 'nests' field in struct eventpoll that records the
current nesting level during ep_poll_callback().  Then, if we nest again
we can find the previous struct eventpoll that we were called from and
increase our count by 1.  The 'nests' field is protected by
ep->poll_wait.lock.

I've also moved the visited field to reduce the size of struct eventpoll
from 184 bytes to 176 bytes on x86_64 for !CONFIG_DEBUG_LOCK_ALLOC, which
is typical for a production config.

Link: http://lkml.kernel.org/r/1582739816-13167-1-git-send-email-jbaron@akamai.com
Reported-by: Davidlohr Bueso
Signed-off-by: Jason Baron
Reviewed-by: Davidlohr Bueso
Cc: Roman Penyaev
Cc: Eric Wong
Cc: Al Viro
Signed-off-by: Andrew Morton
---

 fs/eventpoll.c |   64 +++++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 21 deletions(-)

--- a/fs/eventpoll.c~fs-epoll-make-nesting-accounting-safe-for-rt-kernel
+++ a/fs/eventpoll.c
@@ -218,13 +218,18 @@ struct eventpoll {
 	struct file *file;
 
 	/* used to optimize loop detection check */
-	int visited;
 	struct list_head visited_list_link;
+	int visited;
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	/* used to track busy poll napi_id */
 	unsigned int napi_id;
 #endif
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/* tracks wakeup nests for lockdep validation */
+	u8 nests;
+#endif
 };
 
 /* Wait structure used by the poll hooks */
@@ -545,30 +550,47 @@ out_unlock:
  */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
-static DEFINE_PER_CPU(int, wakeup_nest);
-
-static void ep_poll_safewake(wait_queue_head_t *wq)
+static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi)
 {
+	struct eventpoll *ep_src;
 	unsigned long flags;
-	int subclass;
+	u8 nests = 0;
 
-	local_irq_save(flags);
-	preempt_disable();
-	subclass = __this_cpu_read(wakeup_nest);
-	spin_lock_nested(&wq->lock, subclass + 1);
-	__this_cpu_inc(wakeup_nest);
-	wake_up_locked_poll(wq, POLLIN);
-	__this_cpu_dec(wakeup_nest);
-	spin_unlock(&wq->lock);
-	local_irq_restore(flags);
-	preempt_enable();
+	/*
+	 * To set the subclass or nesting level for spin_lock_irqsave_nested()
+	 * it might be natural to create a per-cpu nest count. However, since
+	 * we can recurse on ep->poll_wait.lock, and a non-raw spinlock can
+	 * schedule() in the -rt kernel, the per-cpu variable is no longer
+	 * protected. Thus, we are introducing a per eventpoll nest field.
+	 * If we are not being called from ep_poll_callback(), epi is NULL and
+	 * we are at the first level of nesting, 0. Otherwise, we are being
+	 * called from ep_poll_callback() and if a previous wakeup source is
+	 * not an epoll file itself, we are at depth 1 since the wakeup source
+	 * is depth 0. If the wakeup source is a previous epoll file in the
+	 * wakeup chain then we use its nests value and record ours as
+	 * nests + 1. The previous epoll file nests value is stable since it is
+	 * already holding its own poll_wait.lock.
+	 */
+	if (epi) {
+		if ((is_file_epoll(epi->ffd.file))) {
+			ep_src = epi->ffd.file->private_data;
+			nests = ep_src->nests;
+		} else {
+			nests = 1;
+		}
+	}
+	spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests);
+	ep->nests = nests + 1;
+	wake_up_locked_poll(&ep->poll_wait, EPOLLIN);
+	ep->nests = 0;
+	spin_unlock_irqrestore(&ep->poll_wait.lock, flags);
 }
 
 #else
 
-static void ep_poll_safewake(wait_queue_head_t *wq)
+static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi)
 {
-	wake_up_poll(wq, EPOLLIN);
+	wake_up_poll(&ep->poll_wait, EPOLLIN);
 }
 
 #endif
@@ -789,7 +811,7 @@ static void ep_free(struct eventpoll *ep
 
 	/* We need to release all tasks waiting for these file */
 	if (waitqueue_active(&ep->poll_wait))
-		ep_poll_safewake(&ep->poll_wait);
+		ep_poll_safewake(ep, NULL);
 
 	/*
 	 * We need to lock this because we could be hit by
@@ -1258,7 +1280,7 @@ out_unlock:
 
 	/* We have to call this outside the lock */
 	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
+		ep_poll_safewake(ep, epi);
 
 	if (!(epi->event.events & EPOLLEXCLUSIVE))
 		ewake = 1;
@@ -1562,7 +1584,7 @@ static int ep_insert(struct eventpoll *e
 
 	/* We have to call this outside the lock */
 	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
+		ep_poll_safewake(ep, NULL);
 
 	return 0;
 
@@ -1666,7 +1688,7 @@ static int ep_modify(struct eventpoll *e
 
 	/* We have to call this outside the lock */
 	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
+		ep_poll_safewake(ep, NULL);
 
 	return 0;
 }
_
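
For illustration only (this is not part of the patch): the nesting rule
described in the ep_poll_safewake() comment can be modelled in plain
userspace C.  The names below (model_ep, model_epi, model_subclass,
model_wake_begin, model_wake_end) are hypothetical stand-ins, no kernel
API is used, and the locking itself is left out.  It is a rough sketch
of how the lockdep subclass is derived from the wakeup source, assuming
the semantics stated in the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct eventpoll; only 'nests' is modelled. */
struct model_ep {
	uint8_t nests;	/* nonzero only while this instance is mid-wakeup */
};

/* Hypothetical stand-in for struct epitem. */
struct model_epi {
	/* non-NULL when the wakeup source is itself an epoll file */
	struct model_ep *src_ep;
};

/*
 * Subclass that would be handed to spin_lock_irqsave_nested():
 *   no epi              -> 0 (not called from the poll callback)
 *   non-epoll source    -> 1 (the source itself counts as depth 0)
 *   epoll source        -> the source's recorded nests value
 */
static uint8_t model_subclass(const struct model_epi *epi)
{
	if (!epi)
		return 0;
	if (epi->src_ep)
		return epi->src_ep->nests;
	return 1;
}

/* Models the start of the locked section: record our depth for nested callers. */
static uint8_t model_wake_begin(struct model_ep *ep, const struct model_epi *epi)
{
	uint8_t nests = model_subclass(epi);

	assert(ep != NULL);
	ep->nests = (uint8_t)(nests + 1);
	return nests;
}

/* Models the end of the locked section: the depth record is cleared. */
static void model_wake_end(struct model_ep *ep)
{
	assert(ep != NULL);
	ep->nests = 0;
}
```

In this model, a chain in which instance A is woken directly, A wakes B,
and B wakes C yields subclasses 0, 1 and 2 respectively, and every
'nests' field is back to 0 once the chain unwinds.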