Date: Thu, 7 Mar 2024 09:06:18 -0800
From: Andrew Morton
To: Yafang Shao
Cc: yuzhao@google.com, linux-mm@kvack.org, stable@vger.kernel.org
Subject: Re: [PATCH] mm: mglru: Fix soft lockup attributed to scanning folios
Message-Id: <20240307090618.50da28040e1263f8af39046f@linux-foundation.org>
In-Reply-To: <20240307031952.2123-1-laoar.shao@gmail.com>
References: <20240307031952.2123-1-laoar.shao@gmail.com>

On Thu, 7 Mar 2024 11:19:52 +0800 Yafang Shao wrote:

> After we enabled mglru on our 384C1536GB production servers, we
> encountered frequent soft lockups attributed to scanning folios.
>
> The soft lockup as follows,
>
> ...
>
> There were a total of 22 tasks waiting for this spinlock
> (RDI: ffff99d2b6ff9050):
>
>  crash> foreach RU bt | grep -B 8 queued_spin_lock_slowpath | grep "RDI: ffff99d2b6ff9050" | wc -l
>  22

If we're holding the lock for this long then there's a possibility of
getting hit by the NMI watchdog also.

> Additionally, two other threads were also engaged in scanning folios, one
> with 19 waiters and the other with 15 waiters.
>
> To address this issue under heavy reclaim conditions, we introduced a
> hotfix version of the fix, incorporating cond_resched() in scan_folios().
> Following the application of this hotfix to our servers, the soft lockup
> issue ceased.
>
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4367,6 +4367,10 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>
>  		if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
>  			break;
> +
> +		spin_unlock_irq(&lruvec->lru_lock);
> +		cond_resched();
> +		spin_lock_irq(&lruvec->lru_lock);
>  	}

Presumably wrapping this with `if (need_resched())' will save some work.

This lock is held for a reason.  I'd like to see an analysis of why
this change is safe.
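
For illustration only, here is a minimal userspace sketch of the guard
suggested above: release the lock and yield only when a reschedule is
actually pending, instead of on every loop iteration. It is an analogy,
not kernel code. The names fake_lruvec, fake_need_resched() and
scan_items() are hypothetical, a pthread mutex stands in for
lruvec->lru_lock, and sched_yield() stands in for cond_resched().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the lruvec; not a kernel structure. */
struct fake_lruvec {
	pthread_mutex_t lru_lock;	/* stands in for lruvec->lru_lock */
	bool resched_pending;		/* stands in for the need_resched() state */
	int lock_drops;			/* times the lock was released mid-scan */
};

/* Hypothetical stand-in for need_resched(): is a reschedule pending? */
static bool fake_need_resched(const struct fake_lruvec *lv)
{
	return lv->resched_pending;
}

/*
 * Scan `remaining` items with the lock held. The lock is dropped only when
 * a reschedule is pending, so the common case pays no unlock/lock cost.
 * Returns the number of items scanned.
 */
static int scan_items(struct fake_lruvec *lv, int remaining)
{
	int scanned = 0;

	assert(remaining >= 0);

	pthread_mutex_lock(&lv->lru_lock);
	while (remaining-- > 0) {
		scanned++;	/* placeholder for the per-item scanning work */

		if (fake_need_resched(lv)) {
			/* Simulate the scheduler clearing the pending request. */
			lv->resched_pending = false;
			lv->lock_drops++;

			/* Anything protected by the lock may change from here... */
			pthread_mutex_unlock(&lv->lru_lock);
			sched_yield();
			pthread_mutex_lock(&lv->lru_lock);
			/* ...to here, so state read earlier may now be stale. */
		}
	}
	pthread_mutex_unlock(&lv->lru_lock);

	return scanned;
}
```

With resched_pending clear, scan_items() never releases the lock; with
it set, the lock is released once and the scan then continues. The
comments around the unlock/lock pair mark the window that the requested
safety analysis would have to cover.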