Date: Fri, 8 Mar 2024 16:22:12 -0500
From: Johannes Weiner
To: Axel Rasmussen
Cc: Chris Down, cgroups@vger.kernel.org, kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, yuzhao@google.com
Subject: Re: MGLRU premature memcg OOM on slow writes
Message-ID: <20240308212212.GA38843@cmpxchg.org>
References: <20240229235134.2447718-1-axelrasmussen@google.com>

On Fri, Mar 08, 2024 at 11:18:28AM -0800, Axel Rasmussen wrote:
> On Thu, Feb 29, 2024 at 4:30 PM Chris Down wrote:
> >
> > Axel Rasmussen writes:
> > >A couple of dumb questions. In your test, do you have any of the following
> > >configured / enabled?
> > >
> > >/proc/sys/vm/laptop_mode
> > >memory.low
> > >memory.min
> >
> > None of these are enabled. The issue is trivially reproducible by writing to
> > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > is also susceptible to this on global reclaim (although it's less likely due to
> > page diversity).
> >
> > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > >looks like it simply will not do this.
> > >
> > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > >makes sense to me at least that doing writeback every time we age is too
> > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> >
> > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > thing at a time :-)
>
>
> Hmm, so I have a patch which I think will help with this situation,
> but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> then I can verify the patch fixes it).
>
> If I understand the issue right, all we should need to do is get a
> slow filesystem, and then generate a bunch of dirty file pages on it,
> while running in a tightly constrained memcg. To that end, I tried the
> following script. But, in reality I seem to get little or no
> accumulation of dirty file pages.
>
> I thought maybe fio does something different than rsync which you said
> you originally tried, so I also tried rsync (copying /usr/bin into
> this loop mount) and didn't run into an OOM situation either.
>
> Maybe some dirty ratio settings need tweaking or something to get the
> behavior you see? Or maybe my test has a dumb mistake in it. :)
>
>
>
> #!/usr/bin/env bash
>
> echo 0 > /proc/sys/vm/laptop_mode || exit 1
> echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
>
> echo "Allocate disk image"
> IMAGE_SIZE_MIB=1024
> IMAGE_PATH=/tmp/slow.img
> dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
>
> echo "Setup loop device"
> LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
>
> echo "Create dm-slow"
> DM_NAME=dm-slow
> DM_DEV=/dev/mapper/$DM_NAME
> echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
>
> echo "Create fs"
> mkfs.ext4 "$DM_DEV" || exit 1
>
> echo "Mount fs"
> MOUNT_PATH="/tmp/$DM_NAME"
> mkdir -p "$MOUNT_PATH" || exit 1
> mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
>
> echo "Generate dirty file pages"
> systemd-run --wait --pipe --collect -p MemoryMax=32M \
>         fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
>         -numjobs=10 -nrfiles=90 -filesize=1048576 \
>         -fallocate=posix \
>         -blocksize=4k -ioengine=mmap \
>         -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
>         -runtime=300 -time_based

By doing only the writes in the cgroup, you might just be running into
balance_dirty_pages(), which wakes the flushers and slows the
writing/allocating task before hitting the cg memory limit.

I think the key to what happens in Chris's case is:

1) The cgroup has a certain share of dirty pages, but in aggregate
   they are below the cgroup dirty limit (dirty < mdtc->avail * ratio)
   such that no writeback/dirty throttling is triggered from
   balance_dirty_pages().

2) An unthrottled burst of (non-dirtying) allocations causes reclaim
   demand that suddenly exceeds the reclaimable clean pages on the LRU.

Now you get into a situation where allocation and reclaim rate exceeds
the writeback rate and the only reclaimable pages left on the LRU are
dirty. In this case reclaim needs to wake the flushers and wait for
writeback instead of blowing through the priority cycles and OOMing.
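
To put rough numbers on condition 1), here is a deliberately simplified
sketch of the "dirty < avail * ratio" comparison. The function name, the
KiB units and the 20% ratio are assumptions made for illustration only;
the real check in balance_dirty_pages() involves more terms than this.

```shell
# Simplified model of condition 1): compare a cgroup's dirty pages
# against avail * ratio. All names here are hypothetical; this is not
# the kernel's actual calculation.
# $1 = dirty KiB
# $2 = available KiB (stand-in for mdtc->avail)
# $3 = dirty ratio in percent (assumed value, pick your own)
dirty_state() {
    limit_kb=$(( $2 * $3 / 100 ))
    if [ "$1" -lt "$limit_kb" ]; then
        echo unthrottled
    else
        echo throttled
    fi
}

# 4 MiB dirty with 32 MiB available at an assumed 20% ratio:
# the limit works out to 6553 KiB, so the writer is not throttled.
dirty_state 4096 32768 20
```

In this model a writer can sit below the limit indefinitely, so nothing
on the dirtying side slows it down before a burst of other allocations
arrives.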

Chris might be causing 2) by having the read side of the copy in the
cgroup as well, especially if he's copying larger files that can
saturate the readahead window and cause bigger allocation bursts. Those
readahead pages are accounted to the cgroup and on the LRU as soon as
they're allocated, but remain locked and unreclaimable until the read
IO finishes.
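
A minimal sketch of what that could look like on top of the quoted
script: run a plain copy, reads included, inside the limited scope. The
helper name and the source path are hypothetical, the 32M default is
taken from the quoted script, and whether this actually triggers the
OOM is untested. The helper only prints the command so it can be
inspected before being run as root.

```shell
# Print (do not run) a copy command whose read and write sides both
# execute inside one memory-limited scope.
# $1 = source file (hypothetical: a large file on fast storage)
# $2 = destination directory on the slow mount
# $3 = memcg limit, defaulting to the 32M used in the quoted script
memcg_copy_cmd() {
    src=$1
    dst=$2
    limit=${3:-32M}
    printf '%s\n' "systemd-run --wait --pipe --collect -p MemoryMax=$limit cp $src $dst"
}

# Example: show the command for copying one large file to the slow mount.
memcg_copy_cmd /var/tmp/big.bin /tmp/dm-slow
```

The source file should be large enough that readahead produces bigger
allocation bursts than copying /usr/bin does.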