Date: Mon, 27 Jul 2020 21:35:12 +0200
From: Greg KH
To: Hugh Dickins
Cc: Linus Torvalds, Oleg Nesterov, Michal Hocko, Linux-MM, LKML, Andrew Morton, Tim Chen, Michal Hocko
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
Message-ID: <20200727193512.GA236164@kroah.com>
References: <20200724152424.GC17209@redhat.com> <20200725101445.GB3870@redhat.com>

On Sun, Jul 26, 2020 at 01:30:32PM -0700, Hugh Dickins wrote:
> On Sat, 25 Jul 2020, Hugh Dickins wrote:
> > On Sat, 25 Jul 2020, Hugh Dickins wrote:
> > > On Sat, 25 Jul 2020, Linus Torvalds wrote:
> > > > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov wrote:
> > > > >
> > > > > Heh. I too thought about this. And just in case, your patch looks correct
> > > > > to me. But I can't really comment this behavioural change. Perhaps it
> > > > > should come in a separate patch?
> > > >
> > > > We could do that. At the same time, I think both parts change how the
> > > > waitqueue works that it might as well just be one "fix page_bit_wait
> > > > waitqueue usage".
> > > >
> > > > But let's wait to see what Hugh's numbers say.
> > >
> > > Oh no, no no: sorry for getting your hopes up there, I won't come up
> > > with any numbers more significant than "0 out of 10" machines crashed.
> > > I know it would be *really* useful if I could come up with performance
> > > comparisons, or steer someone else to do so: but I'm sorry, cannot.
> > >
> > > Currently it's actually 1 out of 10 machines crashed, for the same
> > > driverland issue seen last time, maybe it's a bad machine; and another
> > > 1 out of the 10 machines went AWOL for unknown reasons, but probably
> > > something outside the kernel got confused by the stress.  No reason
> > > to suspect your changes at all (but some unanalyzed "failure"s, of
> > > dubious significance, accumulating like last time).
> > >
> > > I'm optimistic: nothing has happened to warn us off your changes.
> >
> > Less optimistic now, I'm afraid.
> >
> > The machine I said had (twice) crashed coincidentally in driverland
> > (some USB completion thing): that machine I set running a comparison
> > kernel without your changes this morning, while the others still
> > running with your changes; and it has now passed the point where it
> > twice crashed before (the most troublesome test), without crashing.
> >
> > Surprising: maybe still just coincidence, but I must look closer at
> > the crashes.
> >
> > The others have now completed, and one other crashed in that
> > troublesome test, but sadly without yielding any crash info.
> >
> > I've just set comparison runs going on them all, to judge whether
> > to take the "failure"s seriously; and I'll look more closely at them.
>
> The comparison runs have not yet completed (except for the one started
> early), but they have all got past the most interesting tests, and it's
> clear that they do not have the "failure"s seen with your patches.
>
> From that I can only conclude that your patches make a difference.
>
> I've deduced nothing useful from the logs, will have to leave that
> to others here with more experience of them.  But my assumption now
> is that you have successfully removed one bottleneck, so the tests
> get somewhat further and now stick in the next bottleneck, whatever
> that may be.  Which shows up as "failure", where the unlock_page()
> wake_up_page_bit() bottleneck had allowed the tests to proceed in
> a more serially sedate way.
>
> The xhci handle_cmd_completion list_del bugs (on an older version
> of the driver): weird, nothing to do with page wakeups, I'll just
> have to assume that it's some driver bug exposed by the greater
> stress allowed down, and let driver people investigate (if it
> still manifests) when we take in your improvements.

Linus just pointed me at this thread.

If you could run:
	echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control

and run the same workload to see if anything shows up in the log when
xhci crashes, that would be great.

Although if you are using an "older version" of the driver, there's not
much I can suggest except update to a newer one :)

thanks,

greg k-h
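
For anyone repeating the debugging step suggested above, here is a minimal
sketch of how it might be wrapped in a script. It assumes debugfs is mounted
at /sys/kernel/debug, the kernel was built with CONFIG_DYNAMIC_DEBUG, and the
script runs as root. The helper name set_module_debug, the CONTROL variable,
the ./run-workload placeholder and the log file name are all hypothetical and
exist only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: toggle dynamic debug output for one kernel module.
# CONTROL defaults to the standard debugfs control file; it can be
# overridden (for example, pointed at a scratch file) to try the helper
# without touching the running kernel.
CONTROL="${CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

# set_module_debug MODULE FLAGS
#   FLAGS is "=p" to enable printing of the module's debug messages,
#   or "-p" to switch that printing off again.
set_module_debug() {
    # Written without a trailing newline, like "echo -n" in the mail above.
    printf 'module %s %s' "$1" "$2" > "$CONTROL"
}

# Example session (assumed workflow, not taken from the thread):
#   set_module_debug xhci_hcd =p    # enable xhci_hcd debug messages
#   ./run-workload                  # placeholder for the failing test
#   dmesg > xhci-debug.log          # save the kernel log for the report
#   set_module_debug xhci_hcd -p    # turn the extra output off again
```

Keeping the control path in a variable is only a convenience for dry runs;
on a real system the default path is the one quoted in the mail.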