Date: Tue, 12 Mar 2024 14:07:04 -0600
From: Yu Zhao
To: Axel Rasmussen
Cc: Yafang Shao, Chris Down, cgroups@vger.kernel.org, hannes@cmpxchg.org, kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: MGLRU premature memcg OOM on slow writes
References: <20240229235134.2447718-1-axelrasmussen@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote:
> On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao wrote:
> >
> > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen wrote:
> > >
> > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down wrote:
> > > >
> > > > Axel Rasmussen writes:
> > > > >A couple of dumb questions. In your test, do you have any of the following
> > > > >configured / enabled?
> > > > >
> > > > >/proc/sys/vm/laptop_mode
> > > > >memory.low
> > > > >memory.min
> > > >
> > > > None of these are enabled.
> > > > The issue is trivially reproducible by writing to
> > > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > > is also susceptible to this on global reclaim (although it's less likely due to
> > > > page diversity).
> > > >
> > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > > >looks like it simply will not do this.
> > > > >
> > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > > >makes sense to me at least that doing writeback every time we age is too
> > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > > >
> > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > > thing at a time :-)
> > >
> > >
> > > Hmm, so I have a patch which I think will help with this situation,
> > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > > then I can verify the patch fixes it).
> >
> > We encountered the same premature OOM issue caused by numerous dirty pages.
> > The issue disappears after we revert the commit 14aa8b2d5c2e
> > "mm/mglru: don't sync disk for each aging cycle"
> >
> > To aid in replicating the issue, we've developed a straightforward
> > script, which consistently reproduces it, even on the latest kernel.
> > You can find the script provided below:
> >
> > ```
> > #!/bin/bash
> >
> > MEMCG="/sys/fs/cgroup/memory/mglru"
> > ENABLE=$1
> >
> > # Avoid waking up the flusher
> > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
> >
> > if [ ! -d ${MEMCG} ]; then
> >     mkdir -p ${MEMCG}
> > fi
> >
> > echo $$ > ${MEMCG}/cgroup.procs
> > echo 1g > ${MEMCG}/memory.limit_in_bytes
> >
> > if [ $ENABLE -eq 0 ]; then
> >     echo 0 > /sys/kernel/mm/lru_gen/enabled
> > else
> >     echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> > fi
> >
> > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> > rm -rf /data0/mglru.test
> > ```
> >
> > This issue disappears as well after we disable the mglru.
> >
> > We hope this script proves helpful in identifying and addressing the
> > root cause. We eagerly await your insights and proposed fixes.
>
> Thanks Yafang, I was able to reproduce the issue using this script.
>
> Perhaps interestingly, I was not able to reproduce it with cgroupv2
> memcgs. I know writeback semantics are quite a bit different there, so
> perhaps that explains why.
>
> Unfortunately, it also reproduces even with the commit I had in mind
> (basically stealing the "if (all isolated pages are unqueued dirty) {
> wakeup_flusher_threads(); reclaim_throttle(); }" from
> shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
> I'll need to spend some more time on this; I'm planning to send
> something out for testing next week.

Hi Chris,

My apologies for not getting back to you sooner. And thanks everyone
for all the input!

My take is that Chris' premature OOM kills were NOT really due to
the flusher not waking up or missing throttling. Yes, these two are
among the differences between the active/inactive LRU and MGLRU, but
their roles, IMO, are not as important as the LRU positions of dirty
pages.
The active/inactive LRU moves dirty pages all the way to the end of
the line (reclaim happens at the front) whereas MGLRU moves them into
the middle, during direct reclaim. The rationale for MGLRU was that
this way those dirty pages would still be counted as "inactive" (or
cold).

This theory can be quickly verified by comparing how much
nr_vmscan_immediate_reclaim grows, i.e.,

  Before the copy
    grep nr_vmscan_immediate_reclaim /proc/vmstat
  And then after the copy
    grep nr_vmscan_immediate_reclaim /proc/vmstat

The growth should be trivial for MGLRU and nontrivial for the
active/inactive LRU.

If this is indeed the case, I'd appreciate it very much if anyone
could try the following (I'll try it myself too later next week).

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4255619a1a31..020f5d98b9a1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* waiting for writeback */
-	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
-	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
-		gen = folio_inc_gen(lruvec, folio, true);
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
+	if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+		DEFINE_MAX_SEQ(lruvec);
+		int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+		list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 		return true;
 	}
 

> > > If I understand the issue right, all we should need to do is get a
> > > slow filesystem, and then generate a bunch of dirty file pages on it,
> > > while running in a tightly constrained memcg. To that end, I tried the
> > > following script. But, in reality I seem to get little or no
> > > accumulation of dirty file pages.
> > >
> > > I thought maybe fio does something different than rsync which you said
> > > you originally tried, so I also tried rsync (copying /usr/bin into
> > > this loop mount) and didn't run into an OOM situation either.
> > >
> > > Maybe some dirty ratio settings need tweaking or something to get the
> > > behavior you see? Or maybe my test has a dumb mistake in it. :)
> > >
> > >
> > >
> > > #!/usr/bin/env bash
> > >
> > > echo 0 > /proc/sys/vm/laptop_mode || exit 1
> > > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> > >
> > > echo "Allocate disk image"
> > > IMAGE_SIZE_MIB=1024
> > > IMAGE_PATH=/tmp/slow.img
> > > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> > >
> > > echo "Setup loop device"
> > > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> > > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> > >
> > > echo "Create dm-slow"
> > > DM_NAME=dm-slow
> > > DM_DEV=/dev/mapper/$DM_NAME
> > > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> > >
> > > echo "Create fs"
> > > mkfs.ext4 "$DM_DEV" || exit 1
> > >
> > > echo "Mount fs"
> > > MOUNT_PATH="/tmp/$DM_NAME"
> > > mkdir -p "$MOUNT_PATH" || exit 1
> > > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> > >
> > > echo "Generate dirty file pages"
> > > systemd-run --wait --pipe --collect -p MemoryMax=32M \
> > >     fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> > >     -numjobs=10 -nrfiles=90 -filesize=1048576 \
> > >     -fallocate=posix \
> > >     -blocksize=4k -ioengine=mmap \
> > >     -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> > >     -runtime=300 -time_based
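
For anyone who wants to script the nr_vmscan_immediate_reclaim
comparison suggested earlier in this message, here is a minimal sketch.
It only wraps the two grep steps around a workload and prints the
difference. The helper names (read_counter, counter_growth) and the
VMSTAT override are illustrative assumptions, not existing tooling; the
only kernel interface used is /proc/vmstat.

```shell
#!/bin/bash
# Sketch only: report how much one /proc/vmstat counter grows while a
# command runs. VMSTAT, read_counter and counter_growth are hypothetical
# names chosen for this example.

# Allow pointing at a different file (e.g. a saved snapshot) for testing.
VMSTAT="${VMSTAT:-/proc/vmstat}"

# Print the current value of the counter named in $1.
read_counter() {
    awk -v key="$1" '$1 == key { print $2 }' "$VMSTAT"
}

# Usage: counter_growth <counter> <command> [args...]
# Samples the counter, runs the command, samples again, prints the delta.
counter_growth() {
    local key="$1"
    shift
    local before after
    before=$(read_counter "$key")
    # The workload may be OOM-killed; that must not abort the measurement.
    "$@" >/dev/null 2>&1 || true
    after=$(read_counter "$key")
    # A counter missing from the file is treated as 0.
    echo $(( ${after:-0} - ${before:-0} ))
}
```

With the functions sourced into the reproducer quoted above (after the
shell has been moved into the memcg), the copy step could be measured
with something like:

    counter_growth nr_vmscan_immediate_reclaim \
        dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023

and the result compared between lru_gen enabled and disabled. Note that
/proc/vmstat is system-wide, so other reclaim activity on the machine
will also show up in the delta.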