Date: Thu, 27 Apr 2023 13:22:55 +0800
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
From: Hui Wang <hui.wang@canonical.com>
To: Phillip Lougher, Yang Shi
Cc: Michal Hocko, linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com, colin.i.king@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz, hch@infradead.org, mgorman@suse.de, Gao Xiang
References: <20230426051030.112007-1-hui.wang@canonical.com> <20230426051030.112007-2-hui.wang@canonical.com> <9f827ae2-eaec-8660-35fa-71e218d5a2c5@squashfs.org.uk> <72cf1f14-c02b-033e-6fa9-8558e628ffb6@squashfs.org.uk> <553e6668-bebd-7411-9f69-d62e9658da1d@squashfs.org.uk> <1f181fe6-60f4-6b71-b8ca-4f6365de0b4c@canonical.com>

On 4/27/23 09:37, Phillip Lougher wrote:
>
> On 27/04/2023 01:42, Hui Wang wrote:
>>
>> On 4/27/23 03:34, Phillip Lougher wrote:
>>>
>>> On 26/04/2023 20:06, Phillip Lougher wrote:
>>>>
>>>> On 26/04/2023 19:26, Yang Shi wrote:
>>>>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher wrote:
>>>>>>
>>>>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>>>>> [CC squashfs maintainer]
>>>>>>>>>
>>>>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>>>>> If we run stress-ng on a squashfs filesystem, the system
>>>>>>>>>> appears to hang: stress-ng cannot finish running and the
>>>>>>>>>> console stops reacting to user input.
>>>>>>>>>>
>>>>>>>>>> This issue happens on all the arm/arm64 platforms we are
>>>>>>>>>> working on. Through debugging, we found that it is introduced
>>>>>>>>>> by the OOM handling in the kernel.
>>>>>>>>>>
>>>>>>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>>>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>>>>>>> alloc_page(). If no memory is left, out_of_memory() is
>>>>>>>>>> entered without __GFP_FS, so the OOM killer is not triggered
>>>>>>>>>> and the process loops endlessly, waiting for someone else to
>>>>>>>>>> trigger the OOM killer and release some memory. On a system
>>>>>>>>>> whose whole root filesystem is squashfs, nearly all userspace
>>>>>>>>>> processes call out_of_memory() without __GFP_FS, so the
>>>>>>>>>> system appears to hang when running stress-ng.
>>>>>>>>>>
>>>>>>>>>> To fix it, we could wake a kthread that calls alloc_page()
>>>>>>>>>> with __GFP_FS before out_of_memory() returns early because
>>>>>>>>>> __GFP_FS is missing.
>>>>>>>>> I do not think this is an appropriate way to deal with this
>>>>>>>>> issue. Does it even make sense to trigger the OOM killer for
>>>>>>>>> something like readahead? Would it be more mindful to fail
>>>>>>>>> the allocation instead? That being said, should allocations
>>>>>>>>> from squashfs_readahead() use __GFP_RETRY_MAYFAIL instead?
>>>>>>>> Thanks for your comment. This issue can hardly be reproduced
>>>>>>>> on an ext4 filesystem, because ext4->readahead() does not call
>>>>>>>> alloc_page(). With the change below to ext4->readahead(), it
>>>>>>>> becomes easy to reproduce on ext4 as well (repeatedly run:
>>>>>>>> $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>>>>>> --timeout 30s --skip-silent --verbose):
>>>>>>>>
>>>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>>>>> --- a/fs/ext4/inode.c
>>>>>>>> +++ b/fs/ext4/inode.c
>>>>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>>>>>>  static void ext4_readahead(struct readahead_control *rac)
>>>>>>>>  {
>>>>>>>>         struct inode *inode = rac->mapping->host;
>>>>>>>> +       struct page *tmp_page;
>>>>>>>>
>>>>>>>>         /* If the file has inline data, no need to do readahead. */
>>>>>>>>         if (ext4_has_inline_data(inode))
>>>>>>>>                 return;
>>>>>>>>
>>>>>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>>>>>> +
>>>>>>>>         ext4_mpage_readpages(inode, rac, NULL);
>>>>>>>> +
>>>>>>>> +       if (tmp_page)
>>>>>>>> +               __free_page(tmp_page);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> BTW, I applied my patch to linux-next and ran the OOM
>>>>>>>> stress-ng test cases overnight; there was no hang, oops or
>>>>>>>> crash, so it looks like there is no big problem with using a
>>>>>>>> kthread to trigger the OOM killer in this case.
>>>>>>>>
>>>>>>>> And hi squashfs maintainer, I checked the filesystem code, and
>>>>>>>> it looks like most filesystems do not call alloc_page() in
>>>>>>>> their readahead(). Could you please take a look at this issue?
>>>>>>>> Thanks.
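For context, the mechanism the commit message describes is roughly the
following (a simplified sketch paraphrasing mm/readahead.c,
include/linux/sched/mm.h and mm/oom_kill.c, not the exact upstream
code):

	/* The readahead path runs inside a scoped-NOFS section: */
	unsigned int flags = memalloc_nofs_save(); /* sets PF_MEMALLOC_NOFS */

	/*
	 * current_gfp_context() now clears __GFP_FS from every
	 * allocation made in this scope, so alloc_page(GFP_KERNEL)
	 * inside ->readahead() behaves as if __GFP_FS were absent.
	 */
	mapping->a_ops->readahead(rac);

	memalloc_nofs_restore(flags);

	/* ... and out_of_memory() bails out early for such allocations: */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
		return true;	/* no OOM kill; the allocator keeps retrying */
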
>>>>>>>
>>>>>>> This will be because most filesystems don't need to do so.
>>>>>>> Squashfs is a compressed filesystem with large blocks covering
>>>>>>> much more than one page, and it decompresses these blocks in
>>>>>>> squashfs_readahead(). If __readahead_batch() does not return
>>>>>>> the full set of pages covering the Squashfs block, it allocates
>>>>>>> a temporary page for the decompressors to decompress into, to
>>>>>>> "fill in the hole".
>>>>>>>
>>>>>>> What can be done here as far as Squashfs is concerned ... I
>>>>>>> could move the page allocation out of the readahead path
>>>>>>> (e.g. do it at mount time).
>>>>>> You could try this patch, which does that. Compile tested only.
>>>>> The kmalloc_array() may call alloc_page() and trigger this
>>>>> problem too, IIUC. Should it be pre-allocated as well?
>>>>
>>>> That is a much smaller allocation, so it entirely depends on
>>>> whether it is an issue or not. There are also a number of other
>>>> small memory allocations in the path.
>>>>
>>>> The whole point of this patch is to move the *biggest* allocation,
>>>> which is the reported issue, and then see what happens. No point
>>>> in making this test patch more involved and complex than necessary
>>>> at this stage.
>>>>
>>>> Phillip
>>>>
>>>
>>> Also be aware that this stress-ng-triggered issue is new, and
>>> apparently didn't occur last year. So it is reasonable to assume
>>> the issue has been introduced as a side effect of the readahead
>>> improvements. One of these is this allocation of a temporary page
>>> to decompress into, rather than falling back to entirely
>>> decompressing into a pre-allocated buffer (allocated at mount
>>> time). The small memory allocations have been there for many years.
>>>
>>> Allocating the page at mount time effectively puts the memory
>>> allocation situation back to how it was last year, before the
>>> readahead work.
>>>
>>> Phillip
>>>
>> Thanks Phillip and Yang.
>>
>> And Phillip,
>>
>> I tested your change; it didn't help. According to my debugging, the
>> OOM happens when allocating memory for the bio, at the line
>> "struct page *page = alloc_page(GFP_NOIO);" in squashfs_bio_read().
>> Other filesystems just use the pages already held in the
>> "struct readahead_control" for the bio, but squashfs allocates a new
>> page for it (presumably because squashfs is a compressed
>> filesystem).
>>
Hi Phillip,

> The test patch was a process of elimination: it removed the obvious
> change from last year.
>
> It is also because it is a compressed filesystem. In most
> filesystems, what is read off disk in I/O is what ends up in the page
> cache. In a compressed filesystem, what is read in isn't what ends up
> in the page cache.
>
Understood.

>> BTW, this is not a new issue for squashfs; we have uc20 (linux-5.4
>> kernel) and uc22 (linux-5.15 kernel), and all have this issue. The
>> issue already existed in squashfs_readpage() in the 5.4 kernel.
>
> That information would have been rather useful in the initial report,
> and would have saved me from wasting my time. Thanks for that.

Sorry, I didn't mention it before.

>
> Now, in the squashfs_readpage() situation, do processes hang or
> crash? In the squashfs_readpage() path __GFP_NOFS should not be in
> effect. So is the OOM killer being invoked in this code path or not?
> Does alloc_page() in the bio code return NULL, and/or invoke the OOM
> killer, or does it get stuck? Don't keep this information to yourself
> so that I have to guess.
>
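As an illustration of Michal's earlier __GFP_RETRY_MAYFAIL suggestion
applied to the allocation Hui points at, a change along these lines
(hypothetical and untested; not a patch anyone in this thread has
posted) would make the bio path fail instead of looping in the
allocator:

	/*
	 * In squashfs_bio_read(): give up after the reclaim retries
	 * are exhausted instead of retrying the allocation forever.
	 */
	struct page *page = alloc_page(GFP_NOIO | __GFP_RETRY_MAYFAIL);

	if (!page)
		return -ENOMEM;	/* fail this read rather than hang */
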
In the squashfs_readpage() situation, the process still hangs; the
__GFP_FS-less allocation context also applies to squashfs_readpage().
Please see the call trace below, captured on linux-5.4:

[  118.131804] wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww current->comm = stress-ng-bighe oc->gfp_mask = 8c50 (GFP_FS is not set in gfp_mask)
[  118.142829] ------------[ cut here ]------------
[  118.142843] WARNING: CPU: 1 PID: 794 at mm/oom_kill.c:1097 out_of_memory+0x2dc/0x340
[  118.142845] Modules linked in:
[  118.142851] CPU: 1 PID: 794 Comm: stress-ng-bighe Tainted: G        W         5.4.0+ #152
[  118.142853] Hardware name: LS1028A RDB Board (DT)
[  118.142857] pstate: 60400005 (nZCv daif +PAN -UAO)
[  118.142860] pc : out_of_memory+0x2dc/0x340
[  118.142863] lr : out_of_memory+0xe4/0x340
[  118.142865] sp : ffff8000115cb580
[  118.142867] x29: ffff8000115cb580 x28: 0000000000000000
[  118.142871] x27: ffffcefc623dab80 x26: 000031039de16790
[  118.142875] x25: ffffcefc621e9878 x24: 0000000000000100
[  118.142878] x23: ffffcefc62278000 x22: 0000000000000000
[  118.142881] x21: ffff8000115cb6f8 x20: ffff00206272e740
[  118.142885] x19: 0000000000000000 x18: ffffcefc622268f8
[  118.142888] x17: 0000000000000000 x16: 0000000000000000
[  118.142891] x15: ffffcefc621e8a38 x14: 1a9f17e4f9444a3e
[  118.142894] x13: 0000000000000001 x12: 0000000000000400
[  118.142897] x11: 0000000000000400 x10: 0000000000000a90
[  118.142900] x9 : ffff8000115cb2c0 x8 : ffff00206272f230
[  118.142903] x7 : 0000001b81dc2360 x6 : 0000000000000000
[  118.142906] x5 : 0000000000000000 x4 : ffff00207f7db210
[  118.142909] x3 : 0000000000000000 x2 : 0000000000000000
[  118.142912] x1 : ffff00206272e740 x0 : 0000000000000000
[  118.142915] Call trace:
[  118.142919]  out_of_memory+0x2dc/0x340
[  118.142924]  __alloc_pages_nodemask+0xf04/0x1090
[  118.142928]  alloc_slab_page+0x34/0x430
[  118.142931]  allocate_slab+0x474/0x500
[  118.142935]  ___slab_alloc.constprop.0+0x1e4/0x64c
[  118.142938]  __slab_alloc.constprop.0+0x54/0xb0
[  118.142941]  kmem_cache_alloc+0x31c/0x350
[  118.142945]  alloc_buffer_head+0x2c/0xac
[  118.142948]  alloc_page_buffers+0xb8/0x210
[  118.142951]  __getblk_gfp+0x180/0x39c
[  118.142955]  squashfs_read_data+0x2a4/0x6f0
[  118.142958]  squashfs_readpage_block+0x2c4/0x630
[  118.142961]  squashfs_readpage+0x5e4/0x98c
[  118.142964]  filemap_fault+0x17c/0x720
[  118.142967]  __do_fault+0x44/0x110
[  118.142970]  __handle_mm_fault+0x930/0xdac
[  118.142973]  handle_mm_fault+0xc8/0x190
[  118.142978]  do_page_fault+0x134/0x5a0
[  118.142982]  do_translation_fault+0xe0/0x108
[  118.142985]  do_mem_abort+0x54/0xb0
[  118.142988]  el0_da+0x1c/0x20
[  118.142990] ---[ end trace c105c6721d4e890e ]---

I guess this is where __GFP_FS is dropped in linux-5.4
(grow_dev_page() in fs/buffer.c):

	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;

>> I guess that using pre-allocated memory for the bio would help.
>
> We'll see.
>
> As far as I can see, you've made the system run out of memory and are
> now complaining about the result. There is nothing unconventional
> about Squashfs's handling of out-of-memory, and most filesystems put
> into an out-of-memory situation will fail.
>
Understood. Squashfs is now used in Ubuntu Core and will be in many IoT
products, so we really need to find a solution for this.

Thanks,

Hui.

> Phillip
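For reference, the helper in the grow_dev_page() line quoted above
works like this (paraphrasing include/linux/pagemap.h from the 5.4
era; worth double-checking against the exact tree):

	/*
	 * ANDs the mapping's default gfp mask with the caller's
	 * constraint, so passing ~__GFP_FS guarantees the resulting
	 * mask never carries __GFP_FS -- which is why these
	 * buffer-head allocations cannot invoke the OOM killer.
	 */
	static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
						   gfp_t gfp_mask)
	{
		return mapping_gfp_mask(mapping) & gfp_mask;
	}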