From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yu Zhao <yuzhao@google.com>
Date: Tue, 12 Mar 2024 22:08:13 -0400
Subject: Re: MGLRU premature memcg OOM on slow writes
To: Johannes Weiner
Cc: Axel Rasmussen, Yafang Shao, Chris Down, cgroups@vger.kernel.org,
 kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <20240312210822.GB65481@cmpxchg.org>
References: <20240229235134.2447718-1-axelrasmussen@google.com>
 <20240312210822.GB65481@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Tue, Mar 12, 2024 at 5:08 PM Johannes Weiner wrote:
>
> On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> > Yes, these two are among the differences between the active/inactive
> > LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> > positions of dirty pages. The active/inactive LRU moves dirty pages
> > all the way to the end of the line (reclaim happens at the front)
> > whereas MGLRU moves them into the middle, during direct reclaim. The
> > rationale for MGLRU was that this way those dirty pages would still
> > be counted as "inactive" (or cold).
>
> Note that activating the page is not a statement on the page's
> hotness. It's simply to park it away from the scanner.
> We could as
> well have moved it to the unevictable list - this is just easier.
>
> folio_end_writeback() will call folio_rotate_reclaimable() and move it
> back to the inactive tail, to make it the very next reclaim target as
> soon as it's clean.
>
> > This theory can be quickly verified by comparing how much
> > nr_vmscan_immediate_reclaim grows, i.e.,
> >
> >   Before the copy
> >     grep nr_vmscan_immediate_reclaim /proc/vmstat
> >   And then after the copy
> >     grep nr_vmscan_immediate_reclaim /proc/vmstat
> >
> > The growth should be trivial for MGLRU and nontrivial for the
> > active/inactive LRU.
> >
> > If this is indeed the case, I'd appreciate very much if anyone could
> > try the following (I'll try it myself too later next week).
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4255619a1a31..020f5d98b9a1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
> >  	}
> >
> >  	/* waiting for writeback */
> > -	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > -	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > -		gen = folio_inc_gen(lruvec, folio, true);
> > -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > +	if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > +		DEFINE_MAX_SEQ(lruvec);
> > +		int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > +
> > +		old_gen = folio_update_gen(folio, new_gen);
> > +		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > +		list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> >  		return true;
>
> Right, because MGLRU sorts these pages out before calling the scanner,
> so they never get marked for immediate reclaim.
>
> But that also implies they won't get rotated back to the tail when
> writeback finishes.

Those dirty pages are marked by PG_reclaim either by

  folio_inc_gen()
  {
          ...
          if (reclaiming)
                  new_flags |= BIT(PG_reclaim);
          ...
  }

or [1], which I missed initially. So they should be rotated on
writeback finishing up.

[1] https://lore.kernel.org/linux-mm/ZfC2612ZYwwxpOmR@google.com/

> Doesn't that mean that you now have pages that
>
> a) came from the oldest generation and were only deferred due to their
>    writeback state, and
>
> b) are now clean and should be reclaimed. But since they're
>    permanently advanced to the next gen, you'll instead reclaim pages
>    that were originally ahead of them, and likely hotter.
>
> Isn't that an age inversion?
>
> Back to the broader question though: if reclaim demand outstrips clean
> pages and the only viable candidates are dirty ones (e.g. an
> allocation spike in the presence of dirty/writeback pages), there only
> seem to be 3 options:
>
> 1) sleep-wait for writeback
> 2) continue scanning, aka busy-wait for writeback + age inversions
> 3) find nothing and declare OOM
>
> Since you're not doing 1), it must be one of the other two, no? One
> way or another it has to either pace-match to IO completions, or OOM.

Yes, and in this case, 2) is possible but 3) is very likely.

MGLRU doesn't do 1) for sure (in the reclaim path of course). I didn't
find any throttling on dirty pages for cgroup v2 either in the
active/inactive LRU -- I assume Chris was on v2, and hence my take on
throttling on dirty pages in the reclaim path not being the key for
his case.

With the above change, I'm hoping balance_dirty_pages() will wake up
the flusher, again for Chris' case, so that MGLRU won't have to call
wakeup_flusher_threads(), since it can wake up the flusher too often
and in turn cause excessive IOs when considering SSD wearout.