Date: Wed, 22 Jul 2020 14:29:36 -0700 (PDT)
From: Hugh Dickins
To: Linus Torvalds
Cc: Michal Hocko, Linux-MM, LKML, Andrew Morton, Tim Chen, Michal Hocko
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
References: <20200721063258.17140-1-mhocko@kernel.org>

On Wed, 22 Jul 2020, Linus Torvalds wrote:
>
> I do wonder if we should make that PAGE_WAIT_TABLE_SIZE be larger. 256
> entries seems potentially ridiculously small, and aliasing not only
> increases the waitqueue length, it also potentially causes more
> contention on the waitqueue spinlock (which is already likely seeing
> some false sharing on a cacheline basis due to the fairly dense array
> of waitqueue entries: wait_queue_head is intentionally fairly small
> and dense unless you have lots of spinlock debugging options enabled).
>
> That hashed wait-queue size is an independent issue, though. But it
> might be part of "some loads can get into some really nasty behavior
> in corner cases"

I don't think we've ever suffered from unlock_page() softlockups when
booting, as Michal reports; but we do have a forkbomby test (and a few
others) which occasionally suffer that way, on machines with many cpus.
We run with the three imperfect patches appended below, which together
seemed to improve, but not entirely solve, the situation:

mm,sched: make page_wait_table[] four times bigger
mm,sched: wait_on_page_bit_common() add back to head
mm,sched: __wake_up_common() break only after waking

 kernel/sched/wait.c |  5 ++++-
 mm/filemap.c        | 10 ++++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

I've rediffed them to 5.8-rc6, and checked that they build and sustain
a load for a few minutes: so they're not entirely ridiculous on latest
kernels, but certainly not thoroughly retested.

I'm rather desperate to stay out of the discussion here, given how far
behind I am on responding to other threads; and I may not be able to
defend these beyond what I said in the original commit messages. But
seeing this thread, I thought I'd better put them up for your eyes.

(And, no, I don't think I have a Copyright on changing an 8 to a 10:
you've probably gone further already; just included for completeness.)

Hugh

[PATCH] mm,sched: make page_wait_table[] four times bigger

The current page_wait_table[] size is 256 entries: bump that up to 1024
to reduce latency from frequent page hash collisions. No science in
choosing fourfold: just "a little bigger".

Signed-off-by: Hugh Dickins

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -968,7 +968,7 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-#define PAGE_WAIT_TABLE_BITS 8
+#define PAGE_WAIT_TABLE_BITS 10
 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
 static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

[PATCH] mm,sched: wait_on_page_bit_common() add back to head

wait_on_page_bit_common() always adds to the tail of the wait queue.
That is of course right when adding an entry for the first time; but
when woken, with the bit found already set, is adding back to the tail
right? It looks unfair: the waiter would get placed after recent
arrivals, and is in danger of indefinite starvation.
Instead, add back to the head of the queue on repeat passes: not quite
right either, but if that happens again and again, the ordering will be
reversed each time, so it should work out reasonably fair.

Signed-off-by: Hugh Dickins

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1109,6 +1109,7 @@ static inline int wait_on_page_bit_commo
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
 	bool bit_is_set;
+	bool first_time = true;
 	bool thrashing = false;
 	bool delayacct = false;
 	unsigned long pflags;
@@ -1134,7 +1135,12 @@ static inline int wait_on_page_bit_commo
 		spin_lock_irq(&q->lock);

 		if (likely(list_empty(&wait->entry))) {
-			__add_wait_queue_entry_tail(q, wait);
+			if (first_time) {
+				__add_wait_queue_entry_tail(q, wait);
+				first_time = false;
+			} else {
+				__add_wait_queue(q, wait);
+			}
 			SetPageWaiters(page);
 		}

[PATCH] mm,sched: __wake_up_common() break only after waking

4.14 commit 2554db916586 ("sched/wait: Break up long wake list walk")
added WQ_FLAG_BOOKMARK early breakout from __wake_up_common(): it lets
the waker drop and retake the irq-safe wait queue lock every 64 entries.
It was introduced to handle an Intel customer issue with long page wait
queues, but it also applies to all paths using __wake_up_common_lock().

It would probably be more useful to our non-preemptible kernel if it
could do a cond_resched() along with its cpu_relax(); but although most
unlock_page() calls would be fine with that, some would not -
page_endio(), for example; and it would be a big job to sort them, and
it is not clear that converting some but not others would really help
anyway.

A patch that I've been running with, which does help somewhat to reduce
those unlock_page() softlockups, is this weakening of 2554db916586:
don't break out until at least one wakeup has been issued for the page.
In the worst case (waking a page at the very end of a hash queue shared
with many waiters on another page), this would simply revert to the old
behavior, where there was no WQ_FLAG_BOOKMARK early breakout at all.

Whilst it did not set out to change behavior for other
__wake_up_common_lock() users, review suggests that this change is
probably good for them too.

Signed-off-by: Hugh Dickins

--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -68,6 +68,7 @@ static int __wake_up_common(struct wait_
 			wait_queue_entry_t *bookmark)
 {
 	wait_queue_entry_t *curr, *next;
+	int woken = 0;
 	int cnt = 0;

 	lockdep_assert_held(&wq_head->lock);
@@ -95,8 +96,10 @@ static int __wake_up_common(struct wait_
 			break;
 		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
 			break;
+		if (ret)
+			woken++;

-		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
+		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) && woken &&
 				(&next->entry != &wq_head->head)) {
 			bookmark->flags = WQ_FLAG_BOOKMARK;
 			list_add_tail(&bookmark->entry, &next->entry);