From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8bac892e-15e2-e95b-5b2b-0981f1279e4c@gmail.com>
Date: Thu, 27 Apr 2023 08:03:08 +0100
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
From: "Colin King (gmail)" <colin.i.king@gmail.com>
To: Hui Wang, Gao Xiang, Michal Hocko
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com, shy828301@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz, hch@infradead.org, mgorman@suse.de, Phillip Lougher
References: <20230426051030.112007-1-hui.wang@canonical.com> <20230426051030.112007-2-hui.wang@canonical.com> <68b085fe-3347-507c-d739-0dc9b27ebe05@linux.alibaba.com> <4aa48b6a-362d-de1b-f0ff-9bb8dafbdcc7@canonical.com>
In-Reply-To: <4aa48b6a-362d-de1b-f0ff-9bb8dafbdcc7@canonical.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
On 27/04/2023 04:47, Hui Wang wrote:
>
> On 4/27/23 09:18, Gao Xiang wrote:
>>
>> On 2023/4/26 19:07, Hui Wang wrote:
>>>
>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>> [CC squashfs maintainer]
>>>>
>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>> If we run stress-ng on a squashfs filesystem, the system enters a
>>>>> state that looks like a hang: stress-ng cannot finish running and
>>>>> the console does not react to user input.
>>>>>
>>>>> This issue happens on all arm/arm64 platforms we are working on;
>>>>> through debugging, we found it is introduced by the oom handling
>>>>> in the kernel.
>>>>>
>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>> alloc_page(). In this case, if there is no memory left,
>>>>> out_of_memory() is called without __GFP_FS, so the oom killer is
>>>>> not triggered and the process loops endlessly, waiting for others
>>>>> to trigger the oom killer and release some memory. But on a
>>>>> system whose whole root filesystem is squashfs, nearly all
>>>>> userspace processes call out_of_memory() without __GFP_FS, so we
>>>>> see the system enter a hang-like state when running stress-ng.
>>>>>
>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>> __GFP_FS before returning from out_of_memory() when __GFP_FS is
>>>>> absent.
>>>> I do not think this is an appropriate way to deal with this issue.
>>>> Does it even make sense to trigger the OOM killer for something
>>>> like readahead? Would it be more mindful to fail the allocation
>>>> instead? That being said, should allocations from
>>>> squashfs_readahead use __GFP_RETRY_MAYFAIL instead?
>>>
>>> Thanks for your comment. This issue can hardly be reproduced on an
>>> ext4 filesystem, because ext4->readahead() does not call
>>> alloc_page().
>>> If we change ext4->readahead() as below, it becomes easy to
>>> reproduce this issue on ext4 (repeatedly run: $ stress-ng --bigheap
>>> ${num_of_cpu_threads} --sequential 0 --timeout 30s --skip-silent
>>> --verbose)
>>>
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>   static void ext4_readahead(struct readahead_control *rac)
>>>   {
>>>          struct inode *inode = rac->mapping->host;
>>> +       struct page *tmp_page;
>>>
>>>          /* If the file has inline data, no need to do readahead. */
>>>          if (ext4_has_inline_data(inode))
>>>                  return;
>>>
>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>> +
>>>          ext4_mpage_readpages(inode, rac, NULL);
>>> +
>>> +       if (tmp_page)
>>> +               __free_page(tmp_page);
>>>   }
>>>
>
> Hi Xiang and Michal,
>
>> Is it tested with a pure ext4 without any other fs background?
>
> Basically yes. Maybe there is a squashfs mounted for python3 in my
> test environment, but stress-ng and the shared libraries it needs are
> on the ext4.

One could build a static version of stress-ng to remove the need for
shared library loading at run time:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make clean
STATIC=1 make -j 8

>> I don't think it's true that "ext4->readahead() doesn't call
>> alloc_page()", since even ext2/ext4 uses the buffer head interfaces
>> to read metadata (extents or the old block mapping) from its
>> bd_inode for readahead, which indirectly allocates some extra pages
>> into the page cache as well.
>
> Calling alloc_page() or allocating memory in readahead() is not a
> problem by itself. Suppose we have 4 processes (A, B, C and D).
> Processes A, B and C enter out_of_memory() because they allocate
> memory in readahead(); they loop, waiting for some memory to be
> released. Process D then enters out_of_memory() with __GFP_FS, so it
> can trigger the oom killer; A, B and C get their memory and return
> from readahead(), and there is no hang.
>
> But if all 4 processes enter out_of_memory() from readahead(), they
> loop and wait endlessly; there is no process left to trigger the oom
> killer, so users see the system as hung.
>
> I applied my ext4->readahead() change to linux-next and tested it on
> my Ubuntu classic server for arm64; I could reproduce the hang within
> 1 minute with a 100% rate. I guess the issue is easy to reproduce
> because it is an embedded environment: the total number of processes
> in the system is very limited, nearly all userspace processes
> eventually reach out_of_memory() from ext4_readahead(), and nearly
> all kthreads do not reach out_of_memory() for a long time, which
> leaves the system in a hang-like state (not a real hang).
>
> And this is why I wrote a patch to let a specific kthread trigger the
> oom killer forcibly (my initial patch).
>
>> The only difference here is the total number of pages to be
>> allocated, but all the extra compressed data taking extra
>> allocations makes it worse. So I think it much depends on what your
>> stress workload looks like, and I'm not even sure it's a real issue,
>> since if you stop the stress workload, it immediately recovers (it
>> just may not oom directly).
>
> Yes, it is not a real hang. All userspace processes are looping and
> waiting for other processes to release or reclaim memory. And in this
> case, we can't stop the stress workload, since users can't control
> the system through the console.
>
> So Michal,
>
> I don't know if you read "[PATCH 0/1] mm/oom_kill: system enters a
> state something like hang when running stress-ng". Do you know why
> out_of_memory() returns immediately if there is no __GFP_FS? Could we
> drop these lines directly:
>
>     /*
>      * The OOM killer does not compensate for IO-less reclaim.
>      * pagefault_out_of_memory lost its gfp context so we have to
>      * make sure exclude 0 mask - all other users should have at least
>      * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>      * invoke the OOM killer even if it is a GFP_NOFS allocation.
>      */
>     if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>         return true;
>
> Thanks,
>
> Hui.
>
>> Thanks,
>> Gao Xiang