From: Axel Rasmussen
Date: Tue, 12 Mar 2024 09:44:19 -0700
Subject: Re: MGLRU premature memcg OOM on slow writes
To: Yafang Shao
Cc: Chris Down, cgroups@vger.kernel.org, hannes@cmpxchg.org, kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, yuzhao@google.com
References: <20240229235134.2447718-1-axelrasmussen@google.com>

On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao wrote:
>
> On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen wrote:
> >
> > On Thu, Feb 29, 2024 at 4:30 PM Chris Down wrote:
> > >
> > > Axel Rasmussen writes:
> > > >A couple of dumb questions. In your test, do you have any of the following
> > > >configured / enabled?
> > > >
> > > >/proc/sys/vm/laptop_mode
> > > >memory.low
> > > >memory.min
> > >
> > > None of these are enabled. The issue is trivially reproducible by writing to
> > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > is also susceptible to this on global reclaim (although it's less likely due to
> > > page diversity).
> > >
> > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > >looks like it simply will not do this.
> > > >
> > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > >makes sense to me at least that doing writeback every time we age is too
> > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > >
> > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > thing at a time :-)
> >
> >
> > Hmm, so I have a patch which I think will help with this situation,
> > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > then I can verify the patch fixes it).
>
> We encountered the same premature OOM issue caused by numerous dirty pages.
> The issue disappears after we revert the commit 14aa8b2d5c2e
> "mm/mglru: don't sync disk for each aging cycle"
>
> To aid in replicating the issue, we've developed a straightforward
> script, which consistently reproduces it, even on the latest kernel.
> You can find the script provided below:
>
> ```
> #!/bin/bash
>
> MEMCG="/sys/fs/cgroup/memory/mglru"
> ENABLE=$1
>
> # Avoid waking up the flusher
> sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
>
> if [ ! -d ${MEMCG} ]; then
>     mkdir -p ${MEMCG}
> fi
>
> echo $$ > ${MEMCG}/cgroup.procs
> echo 1g > ${MEMCG}/memory.limit_in_bytes
>
> if [ $ENABLE -eq 0 ]; then
>     echo 0 > /sys/kernel/mm/lru_gen/enabled
> else
>     echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> fi
>
> dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> rm -rf /data0/mglru.test
> ```
>
> This issue disappears as well after we disable the mglru.
>
> We hope this script proves helpful in identifying and addressing the
> root cause. We eagerly await your insights and proposed fixes.

Thanks Yafang, I was able to reproduce the issue using this script.

Perhaps interestingly, I was not able to reproduce it with cgroupv2
memcgs. I know writeback semantics are quite a bit different there, so
perhaps that explains why.

Unfortunately, it also reproduces even with the commit I had in mind
(basically stealing the "if (all isolated pages are unqueued dirty) {
wakeup_flusher_threads(); reclaim_throttle(); }" from
shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
I'll need to spend some more time on this; I'm planning to send
something out for testing next week.

>
> >
> > If I understand the issue right, all we should need to do is get a
> > slow filesystem, and then generate a bunch of dirty file pages on it,
> > while running in a tightly constrained memcg. To that end, I tried the
> > following script. But, in reality I seem to get little or no
> > accumulation of dirty file pages.
> >
> > I thought maybe fio does something different than rsync which you said
> > you originally tried, so I also tried rsync (copying /usr/bin into
> > this loop mount) and didn't run into an OOM situation either.
> >
> > Maybe some dirty ratio settings need tweaking or something to get the
> > behavior you see? Or maybe my test has a dumb mistake in it. :)
> >
> >
> >
> > #!/usr/bin/env bash
> >
> > echo 0 > /proc/sys/vm/laptop_mode || exit 1
> > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> >
> > echo "Allocate disk image"
> > IMAGE_SIZE_MIB=1024
> > IMAGE_PATH=/tmp/slow.img
> > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> >
> > echo "Setup loop device"
> > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> >
> > echo "Create dm-slow"
> > DM_NAME=dm-slow
> > DM_DEV=/dev/mapper/$DM_NAME
> > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> >
> > echo "Create fs"
> > mkfs.ext4 "$DM_DEV" || exit 1
> >
> > echo "Mount fs"
> > MOUNT_PATH="/tmp/$DM_NAME"
> > mkdir -p "$MOUNT_PATH" || exit 1
> > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> >
> > echo "Generate dirty file pages"
> > systemd-run --wait --pipe --collect -p MemoryMax=32M \
> >         fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> >         -numjobs=10 -nrfiles=90 -filesize=1048576 \
> >         -fallocate=posix \
> >         -blocksize=4k -ioengine=mmap \
> >         -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> >         -runtime=300 -time_based
> >
> >
> --
> Regards
> Yafang
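
For anyone trying either reproducer, it may help to confirm whether dirty pages are actually piling up in the memcg before the OOM, since the fio test above saw little or no accumulation. The following is only a rough sketch and not something from the thread: `dirty_stats` is a hypothetical helper name, and it assumes the usual memory.stat field names (dirty / writeback on cgroup v1, file_dirty / file_writeback on cgroup v2).

```bash
#!/bin/bash

# Hypothetical helper (not from the thread): print the dirty and
# writeback byte counts found in a memcg memory.stat file.
# Assumes cgroup v1 field names (dirty, writeback) or cgroup v2
# field names (file_dirty, file_writeback); exact-match on the
# first column so that total_dirty etc. are ignored.
dirty_stats() {
    awk 'BEGIN { d = 0; w = 0 }
         $1 == "dirty"     || $1 == "file_dirty"     { d = $2 }
         $1 == "writeback" || $1 == "file_writeback" { w = $2 }
         END { print "dirty=" d " writeback=" w }' "$1"
}
```

This could be run from a second shell while dd or fio is going, for example `while sleep 1; do dirty_stats /sys/fs/cgroup/memory/mglru/memory.stat; done`, where the path is the v1 memcg created by the script quoted above.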