Message-ID: <68b085fe-3347-507c-d739-0dc9b27ebe05@linux.alibaba.com>
Date: Thu, 27 Apr 2023 09:18:47 +0800
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
To: Hui Wang, Michal Hocko
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com, colin.i.king@gmail.com, shy828301@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz, hch@infradead.org, mgorman@suse.de, dan.carpenter@oracle.com, Phillip Lougher, Yang Shi
References: <20230426051030.112007-1-hui.wang@canonical.com> <20230426051030.112007-2-hui.wang@canonical.com>
From: Gao Xiang

On 2023/4/26 19:07, Hui Wang wrote:
>
> On 4/26/23 16:33, Michal Hocko wrote:
>> [CC squashfs maintainer]
>>
>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>> If we run the stress-ng in the filesystem of squashfs, the system
>>> will be in a state something like hang, the stress-ng couldn't
>>> finish running and the console couldn't react to users' input.
>>>
>>> This issue happens on all arm/arm64 platforms we are working on,
>>> through debugging, we found this issue is introduced by oom handling
>>> in the kernel.
>>>
>>> The fs->readahead() is called between memalloc_nofs_save() and
>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>> alloc_page(), in this case, if there is no memory left, the
>>> out_of_memory() will be called without __GFP_FS, then the oom killer
>>> will not be triggered and this process will loop endlessly and wait
>>> for others to trigger oom killer to release some memory. But for a
>>> system with the whole root filesystem constructed by squashfs,
>>> nearly all userspace processes will call out_of_memory() without
>>> __GFP_FS, so we will see that the system enters a state something like
>>> hang when running stress-ng.
>>>
>>> To fix it, we could trigger a kthread to call page_alloc() with
>>> __GFP_FS before returning from out_of_memory() due to without
>>> __GFP_FS.
>> I do not think this is an appropriate way to deal with this issue.
>> Does it even make sense to trigger OOM killer for something like
>> readahead? Would it be more mindful to fail the allocation instead?
>> That being said should allocations from squashfs_readahead use
>> __GFP_RETRY_MAYFAIL instead?
>
> Thanks for your comment, and this issue could hardly be reproduced on
> ext4 filesystem, that is because the ext4->readahead() doesn't call
> alloc_page(). If changing the ext4->readahead() as below, it will be
> easy to reproduce this issue with the ext4 filesystem (repeatedly run:
> $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 --timeout 30s --skip-silent --verbose)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ffbbd9626bd8..8b9db0b9d0b8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>  static void ext4_readahead(struct readahead_control *rac)
>  {
>         struct inode *inode = rac->mapping->host;
> +       struct page *tmp_page;
>
>         /* If the file has inline data, no need to do readahead. */
>         if (ext4_has_inline_data(inode))
>                 return;
>
> +       tmp_page = alloc_page(GFP_KERNEL);
> +
>         ext4_mpage_readpages(inode, rac, NULL);
> +
> +       if (tmp_page)
> +               __free_page(tmp_page);
>  }
>

Was this tested on a pure ext4 setup, without any other filesystem in
the background?

I don't think it's true that "ext4->readahead() doesn't call
alloc_page()": as far as I know, even ext2/ext4 use the buffer head
interfaces to read metadata (extents or the old block mapping) from the
bd_inode for readahead, which indirectly allocates some extra pages
into the page cache as well.

The only difference here is the total number of pages to be allocated;
the extra allocations needed for compressed data just make things
worse.

So I think it depends a lot on how stressful your workload is, and I'm
not even sure it's a real issue, since the system recovers immediately
once you stop the stress workload (it just may not OOM directly).

Thanks,
Gao Xiang
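
For reference, the behaviour described in the quoted patch description
(an allocation inside a memalloc_nofs_save()/memalloc_nofs_restore()
scope reaching out_of_memory() without __GFP_FS, so that no victim is
picked) can be modelled in user space roughly as below. This is only a
sketch under stated assumptions: every model_* name and MODEL_* bit
value is made up for illustration and is not a kernel symbol, and the
model leaves out everything else the real OOM path checks (memcg OOM,
for instance).

```c
#include <stdbool.h>

/* Illustrative bit values only; the real kernel GFP constants differ. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_KERNEL  (MODEL_GFP_IO | MODEL_GFP_FS)

/* Hypothetical stand-in for the per-task "NOFS scope active" flag. */
static bool model_task_nofs;

/* Models memalloc_nofs_save(): enter a NOFS scope, return the old state. */
static bool model_nofs_save(void)
{
    bool old = model_task_nofs;

    model_task_nofs = true;
    return old;
}

/* Models memalloc_nofs_restore(): put back the state saved above. */
static void model_nofs_restore(bool old)
{
    model_task_nofs = old;
}

/* Inside a NOFS scope the FS bit is stripped from every allocation mask. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
    return model_task_nofs ? (gfp & ~MODEL_GFP_FS) : gfp;
}

/*
 * Models the early return discussed in the thread: without the FS bit
 * the OOM path returns without selecting a victim, so the caller just
 * keeps retrying the allocation.
 */
static bool model_oom_would_kill(unsigned int gfp)
{
    return (model_effective_gfp(gfp) & MODEL_GFP_FS) != 0;
}
```

In this model, model_oom_would_kill(MODEL_GFP_KERNEL) is true outside
a NOFS scope and false between model_nofs_save() and
model_nofs_restore(), which is the situation a GFP_KERNEL alloc_page()
call made from ->readahead() ends up in.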