Date: Tue, 12 Mar 2024 17:08:22 -0400
From: Johannes Weiner
To: Yu Zhao
Cc: Axel Rasmussen, Yafang Shao, Chris Down, cgroups@vger.kernel.org, kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: MGLRU premature memcg OOM on slow writes
Message-ID: <20240312210822.GB65481@cmpxchg.org>
References: <20240229235134.2447718-1-axelrasmussen@google.com>

On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> Yes, these two are among the differences between the active/inactive
> LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> positions of dirty pages.
> The active/inactive LRU moves dirty pages
> all the way to the end of the line (reclaim happens at the front)
> whereas MGLRU moves them into the middle, during direct reclaim. The
> rationale for MGLRU was that this way those dirty pages would still
> be counted as "inactive" (or cold).

Note that activating the page is not a statement on the page's
hotness. It's simply to park it away from the scanner. We could as
well have moved it to the unevictable list - this is just easier.

folio_end_writeback() will call folio_rotate_reclaimable() and move it
back to the inactive tail, to make it the very next reclaim target as
soon as it's clean.

> This theory can be quickly verified by comparing how much
> nr_vmscan_immediate_reclaim grows, i.e.,
>
>   Before the copy
>     grep nr_vmscan_immediate_reclaim /proc/vmstat
>   And then after the copy
>     grep nr_vmscan_immediate_reclaim /proc/vmstat
>
> The growth should be trivial for MGLRU and nontrivial for the
> active/inactive LRU.
>
> If this is indeed the case, I'd appreciate very much if anyone could
> try the following (I'll try it myself too later next week).
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..020f5d98b9a1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>  	}
>
>  	/* waiting for writeback */
> -	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> -	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> -		gen = folio_inc_gen(lruvec, folio, true);
> -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> +	if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> +		DEFINE_MAX_SEQ(lruvec);
> +		int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> +		old_gen = folio_update_gen(folio, new_gen);
> +		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> +		list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
>  		return true;

Right, because MGLRU sorts these pages out before calling the scanner,
so they never get marked for immediate reclaim.

But that also implies they won't get rotated back to the tail when
writeback finishes.

Doesn't that mean that you now have pages that

a) came from the oldest generation and were only deferred due to their
   writeback state, and

b) are now clean and should be reclaimed. But since they're
   permanently advanced to the next gen, you'll instead reclaim pages
   that were originally ahead of them, and likely hotter.

Isn't that an age inversion?

Back to the broader question though: if reclaim demand outstrips clean
pages and the only viable candidates are dirty ones (e.g. an
allocation spike in the presence of dirty/writeback pages), there only
seem to be 3 options:

1) sleep-wait for writeback
2) continue scanning, aka busy-wait for writeback + age inversions
3) find nothing and declare OOM

Since you're not doing 1), it must be one of the other two, no? One
way or another it has to either pace-match to IO completions, or OOM.
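
To make the age inversion concern concrete, here is a toy userspace
model. It is only a sketch, not kernel code: struct toy_folio,
TOY_MAX_GEN and the toy_* helpers are made up for illustration, and
generations are reduced to a plain integer (0 = oldest). It assumes
that a dirty folio found in the oldest generation is parked in the
youngest one, and compares what reclaim picks next when writeback
completion does or does not rotate the folio back.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_GEN 2	/* youngest generation in this toy model (assumption) */

struct toy_folio {
	int id;
	int gen;	/* 0 = oldest, TOY_MAX_GEN = youngest */
	bool dirty;
};

/* Park a dirty folio away from the scanner by moving it to the youngest gen. */
static void toy_defer_dirty(struct toy_folio *f)
{
	if (f->dirty)
		f->gen = TOY_MAX_GEN;
}

/* Writeback completes; optionally rotate the folio back to the oldest gen. */
static void toy_end_writeback(struct toy_folio *f, bool rotate)
{
	f->dirty = false;
	if (rotate)
		f->gen = 0;
}

/* Next victim: the clean folio in the lowest generation, or -1 if none. */
static int toy_next_victim(const struct toy_folio *folios, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (folios[i].dirty)
			continue;
		if (best < 0 || folios[i].gen < folios[best].gen)
			best = (int)i;
	}
	return best < 0 ? -1 : folios[best].id;
}

/* Run one scenario and return the id of the folio reclaimed first. */
int toy_run(bool rotate_on_writeback_end)
{
	struct toy_folio folios[] = {
		{ .id = 0, .gen = 0, .dirty = true  },	/* coldest, but dirty */
		{ .id = 1, .gen = 1, .dirty = false },	/* younger, clean */
		{ .id = 2, .gen = 2, .dirty = false },	/* youngest, clean */
	};

	toy_defer_dirty(&folios[0]);
	toy_end_writeback(&folios[0], rotate_on_writeback_end);
	return toy_next_victim(folios, sizeof(folios) / sizeof(folios[0]));
}
```

In this model toy_run(false) returns 1: the younger folio is reclaimed
while the now-clean, originally coldest folio sits in the youngest
generation. toy_run(true) returns 0, which corresponds to the rotation
that folio_rotate_reclaimable() provides on the active/inactive LRU.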