From: Hui Wang <hui.wang@canonical.com>
To: Phillip Lougher <phillip@squashfs.org.uk>,
Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>,
linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com,
colin.i.king@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz,
hch@infradead.org, mgorman@suse.de,
Gao Xiang <hsiangkao@linux.alibaba.com>
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
Date: Thu, 27 Apr 2023 13:22:55 +0800 [thread overview]
Message-ID: <ac7f1fec-fa51-8e5f-ec47-aa63c4c90c2f@canonical.com> (raw)
In-Reply-To: <d691388f-e43c-017f-3f4e-1af834e25159@squashfs.org.uk>
On 4/27/23 09:37, Phillip Lougher wrote:
>
> On 27/04/2023 01:42, Hui Wang wrote:
>>
>> On 4/27/23 03:34, Phillip Lougher wrote:
>>>
>>> On 26/04/2023 20:06, Phillip Lougher wrote:
>>>>
>>>> On 26/04/2023 19:26, Yang Shi wrote:
>>>>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
>>>>> <phillip@squashfs.org.uk> wrote:
>>>>>>
>>>>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>>>>> [CC squashfs maintainer]
>>>>>>>>>
>>>>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>>>>> If we run stress-ng on a squashfs filesystem, the system
>>>>>>>>>> enters a hang-like state: stress-ng cannot finish running
>>>>>>>>>> and the console stops reacting to user input.
>>>>>>>>>>
>>>>>>>>>> This issue happens on all arm/arm64 platforms we are working
>>>>>>>>>> on. Through debugging, we found it is introduced by the OOM
>>>>>>>>>> handling in the kernel.
>>>>>>>>>>
>>>>>>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>>>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>>>>>>> alloc_page(). In this case, if there is no memory left,
>>>>>>>>>> out_of_memory() is called without __GFP_FS, so the OOM killer
>>>>>>>>>> is not triggered and the process loops endlessly, waiting for
>>>>>>>>>> some other process to trigger the OOM killer and release some
>>>>>>>>>> memory. But on a system whose whole root filesystem is
>>>>>>>>>> squashfs, nearly all userspace processes call out_of_memory()
>>>>>>>>>> without __GFP_FS, so the system enters a hang-like state when
>>>>>>>>>> running stress-ng.
>>>>>>>>>>
>>>>>>>>>> To fix it, we could wake a kthread that calls page_alloc()
>>>>>>>>>> with __GFP_FS before out_of_memory() returns because
>>>>>>>>>> __GFP_FS is missing.
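(Purely for illustration, a minimal sketch of the idea described above;
the helper names, the wait queue and the wakeup condition are my
assumptions, not taken from the actual patch:)

        /* Hypothetical worker: woken when a NOFS allocation hits OOM, it
         * performs one __GFP_FS allocation so the OOM killer can run on
         * behalf of the allocator that was not allowed to invoke it. */
        static DECLARE_WAIT_QUEUE_HEAD(oom_fs_wq);
        static bool oom_fs_pending;

        static int oom_fs_kthread_fn(void *unused)
        {
                while (!kthread_should_stop()) {
                        wait_event_interruptible(oom_fs_wq, oom_fs_pending ||
                                                 kthread_should_stop());
                        if (oom_fs_pending) {
                                struct page *page = alloc_page(GFP_KERNEL);

                                if (page)
                                        __free_page(page);
                                oom_fs_pending = false;
                        }
                }
                return 0;
        }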
>>>>>>>>> I do not think this is an appropriate way to deal with this
>>>>>>>>> issue. Does it even make sense to trigger the OOM killer for
>>>>>>>>> something like readahead? Would it be more sensible to fail the
>>>>>>>>> allocation instead? That being said, should allocations from
>>>>>>>>> squashfs_readahead() use __GFP_RETRY_MAYFAIL instead?
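(For illustration only, not code from this thread: Michal's suggestion
would amount to letting the readahead allocation fail instead of looping,
roughly as in the sketch below; the flags are real kernel GFP flags, the
exact call site is my assumption:)

        /* With __GFP_RETRY_MAYFAIL the allocator gives up after retrying
         * instead of looping forever, so the readahead request can simply
         * be abandoned when memory is tight. */
        struct page *page = alloc_page(GFP_NOIO | __GFP_RETRY_MAYFAIL);

        if (!page)
                return -ENOMEM; /* drop this readahead instead of hanging */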
>>>>>>>> Thanks for your comment. This issue can hardly be reproduced on
>>>>>>>> an ext4 filesystem, because ext4's ->readahead() doesn't call
>>>>>>>> alloc_page(). With ext4_readahead() changed as below, it is easy
>>>>>>>> to reproduce this issue on ext4 as well (repeatedly run:
>>>>>>>> $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>>>>>> --timeout 30s --skip-silent --verbose)
>>>>>>>>
>>>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>>>>> --- a/fs/ext4/inode.c
>>>>>>>> +++ b/fs/ext4/inode.c
>>>>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>>>>>>  static void ext4_readahead(struct readahead_control *rac)
>>>>>>>>  {
>>>>>>>>          struct inode *inode = rac->mapping->host;
>>>>>>>> +        struct page *tmp_page;
>>>>>>>>  
>>>>>>>>          /* If the file has inline data, no need to do readahead. */
>>>>>>>>          if (ext4_has_inline_data(inode))
>>>>>>>>                  return;
>>>>>>>>  
>>>>>>>> +        tmp_page = alloc_page(GFP_KERNEL);
>>>>>>>> +
>>>>>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>>>>>> +
>>>>>>>> +        if (tmp_page)
>>>>>>>> +                __free_page(tmp_page);
>>>>>>>>  }
>>>>>>>>
>>>>>>>>
>>>>>>>> BTW, I applied my patch to linux-next and ran the OOM stress-ng
>>>>>>>> test cases overnight; there was no hang, oops or crash, so it
>>>>>>>> looks like there is no big problem with using a kthread to
>>>>>>>> trigger the OOM killer in this case.
>>>>>>>>
>>>>>>>> And hi squashfs maintainer, I checked the filesystem code, and
>>>>>>>> it looks like most filesystems do not call alloc_page() in
>>>>>>>> ->readahead(). Could you please help take a look at this issue?
>>>>>>>> Thanks.
>>>>>>>
>>>>>>> This will be because most filesystems don't need to do so.
>>>>>>> Squashfs is
>>>>>>> a compressed filesystem with large blocks covering much more
>>>>>>> than one
>>>>>>> page, and it decompresses these blocks in
>>>>>>> squashfs_readahead(). If
>>>>>>> __readahead_batch() does not return the full set of pages
>>>>>>> covering the
>>>>>>> Squashfs block, it allocates a temporary page for the
>>>>>>> decompressors to
>>>>>>> decompress into to "fill in the hole".
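(For reference, a rough sketch of the pattern Phillip describes; the
__readahead_batch() call is the real helper from include/linux/pagemap.h,
but the surrounding code is my simplification, not the actual
squashfs_readahead() source:)

        /*
         * Squashfs works on whole compressed blocks.  If the readahead
         * batch does not cover the full block, a temporary page is
         * allocated for the decompressor to write the missing part into.
         * Under memalloc_nofs_save() this allocation cannot invoke the
         * OOM killer.
         */
        nr_pages = __readahead_batch(rac, pages, max_pages);
        if (nr_pages < max_pages) {
                tmp = alloc_page(GFP_KERNEL); /* the "fill in the hole" page */
                if (!tmp)
                        goto skip_readahead;
        }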
>>>>>>>
>>>>>>> What can be done here as far as Squashfs is concerned .... I could
>>>>>>> move the page allocation out of the readahead path (e.g. do it at
>>>>>>> mount time).
>>>>>> You could try this patch which does that. Compile tested only.
>>>>> The kmalloc_array() may call alloc_page() and trigger this problem
>>>>> too, IIUC. Should it be pre-allocated as well?
>>>>
>>>>
>>>> That is a much smaller allocation, and so it entirely depends
>>>> whether it is an issue or not. There are also a number of other
>>>> small memory allocations in the path as well.
>>>>
>>>> The whole point of this patch is to move the *biggest* allocation
>>>> which is the reported issue and then see what happens. No point
>>>> in making this test patch more involved and complex than necessary
>>>> at this point.
>>>>
>>>> Phillip
>>>>
>>>
>>> Also be aware this stress-ng triggered issue is new, and apparently
>>> didn't occur last year. So it is reasonable to assume the issue
>>> has been introduced as a side effect of the readahead improvements.
>>> One of these is this allocation of a temporary page to decompress
>>> into rather than falling back to entirely decompressing into a
>>> pre-allocated buffer (allocated at mount time). The small memory
>>> allocations have been there for many years.
>>>
>>> Allocating the page at mount time effectively puts the memory
>>> allocation situation back to how it was last year before the
>>> readahead work.
>>>
>>> Phillip
>>>
>> Thanks Phillip and Yang.
>>
>> And Phillip,
>>
>> I tested your change, but it didn't help. According to my debugging,
>> the OOM happens while allocating memory for the bio, at the line
>> "struct page *page = alloc_page(GFP_NOIO);" in squashfs_bio_read().
>> Other filesystems just use the pages pre-allocated in the
>> "struct readahead_control" for the bio, but squashfs allocates new
>> pages for the bio (presumably because squashfs is a compressed
>> filesystem).
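(For clarity, a simplified sketch of the allocation I am referring to; it
is modelled on squashfs_bio_read() in fs/squashfs/block.c but is my own
abbreviation, not the exact upstream code:)

        /*
         * The pages that receive the compressed data read from disk are
         * freshly allocated with GFP_NOIO; they are not the page-cache
         * pages supplied by readahead, because what is read off disk
         * still has to be decompressed before it reaches the page cache.
         */
        for (i = 0; i < page_count; i++) {
                struct page *page = alloc_page(GFP_NOIO); /* OOMs here */

                if (!page)
                        goto out_free;
                if (!bio_add_page(bio, page, PAGE_SIZE, 0)) {
                        __free_page(page);
                        goto out_free;
                }
        }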
>>
Hi Phillip,
> The test patch was a process of elimination; it removed the obvious
> change from last year.
>
> It is also because it is a compressed filesystem: in most filesystems,
> what is read off disk in I/O is what ends up in the page cache. In a
> compressed filesystem, what is read in isn't what ends up in the page
> cache.
>
Understood.
>> BTW, this is not a new issue for squashfs: we have uc20 (linux-5.4
>> kernel) and uc22 (linux-5.15 kernel), and both have this issue. The
>> issue already existed in squashfs_readpage() in the 5.4 kernel.
>
> That information would have been rather useful in the initial report,
> and would have saved me from wasting my time. Thanks for that.
Sorry, I didn't mention it before.
>
> Now, in the squashfs_readpage() situation, do processes hang or
> crash? In the squashfs_readpage() path __GFP_NOFS should not be in
> effect. So is the OOM killer being invoked in this code path or
> not? Does alloc_page() in the bio code return NULL and/or invoke
> the OOM killer, or does it get stuck? Don't keep this information to
> yourself so that I have to guess.
>
In the squashfs_readpage() situation, the process still hangs, and
__GFP_NOFS also applies to squashfs_readpage(); please see the call
trace below, captured from linux-5.4:
[ 118.131804] wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww current->comm =
stress-ng-bighe oc->gfp_mask = 8c50 (GFP_FS is not set in gfp_mask)
[ 118.142829] ------------[ cut here ]------------
[ 118.142843] WARNING: CPU: 1 PID: 794 at mm/oom_kill.c:1097
out_of_memory+0x2dc/0x340
[ 118.142845] Modules linked in:
[ 118.142851] CPU: 1 PID: 794 Comm: stress-ng-bighe Tainted: G
W 5.4.0+ #152
[ 118.142853] Hardware name: LS1028A RDB Board (DT)
[ 118.142857] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 118.142860] pc : out_of_memory+0x2dc/0x340
[ 118.142863] lr : out_of_memory+0xe4/0x340
[ 118.142865] sp : ffff8000115cb580
[ 118.142867] x29: ffff8000115cb580 x28: 0000000000000000
[ 118.142871] x27: ffffcefc623dab80 x26: 000031039de16790
[ 118.142875] x25: ffffcefc621e9878 x24: 0000000000000100
[ 118.142878] x23: ffffcefc62278000 x22: 0000000000000000
[ 118.142881] x21: ffff8000115cb6f8 x20: ffff00206272e740
[ 118.142885] x19: 0000000000000000 x18: ffffcefc622268f8
[ 118.142888] x17: 0000000000000000 x16: 0000000000000000
[ 118.142891] x15: ffffcefc621e8a38 x14: 1a9f17e4f9444a3e
[ 118.142894] x13: 0000000000000001 x12: 0000000000000400
[ 118.142897] x11: 0000000000000400 x10: 0000000000000a90
[ 118.142900] x9 : ffff8000115cb2c0 x8 : ffff00206272f230
[ 118.142903] x7 : 0000001b81dc2360 x6 : 0000000000000000
[ 118.142906] x5 : 0000000000000000 x4 : ffff00207f7db210
[ 118.142909] x3 : 0000000000000000 x2 : 0000000000000000
[ 118.142912] x1 : ffff00206272e740 x0 : 0000000000000000
[ 118.142915] Call trace:
[ 118.142919] out_of_memory+0x2dc/0x340
[ 118.142924] __alloc_pages_nodemask+0xf04/0x1090
[ 118.142928] alloc_slab_page+0x34/0x430
[ 118.142931] allocate_slab+0x474/0x500
[ 118.142935] ___slab_alloc.constprop.0+0x1e4/0x64c
[ 118.142938] __slab_alloc.constprop.0+0x54/0xb0
[ 118.142941] kmem_cache_alloc+0x31c/0x350
[ 118.142945] alloc_buffer_head+0x2c/0xac
[ 118.142948] alloc_page_buffers+0xb8/0x210
[ 118.142951] __getblk_gfp+0x180/0x39c
[ 118.142955] squashfs_read_data+0x2a4/0x6f0
[ 118.142958] squashfs_readpage_block+0x2c4/0x630
[ 118.142961] squashfs_readpage+0x5e4/0x98c
[ 118.142964] filemap_fault+0x17c/0x720
[ 118.142967] __do_fault+0x44/0x110
[ 118.142970] __handle_mm_fault+0x930/0xdac
[ 118.142973] handle_mm_fault+0xc8/0x190
[ 118.142978] do_page_fault+0x134/0x5a0
[ 118.142982] do_translation_fault+0xe0/0x108
[ 118.142985] do_mem_abort+0x54/0xb0
[ 118.142988] el0_da+0x1c/0x20
[ 118.142990] ---[ end trace c105c6721d4e890e ]---
I guess this is where __GFP_FS is dropped in linux-5.4 (grow_dev_page()
in fs/buffer.c):
gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
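(To spell out what that line does, mapping_gfp_constraint() is just a
mask intersection, so __GFP_FS is unconditionally cleared here; this is a
restatement for clarity, the helper itself lives in
include/linux/pagemap.h:)

        /*
         * mapping_gfp_constraint(mapping, mask) is
         * mapping_gfp_mask(mapping) & mask, so passing ~__GFP_FS strips
         * __GFP_FS from whatever the mapping allows before the
         * buffer-head/page allocation.  That is why out_of_memory() is
         * reached without __GFP_FS in the squashfs_readpage() path shown
         * in the trace above.
         */
        gfp_mask = (mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS) | gfp;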
>> I guess if we could use pre-allocated memory for the bio, it would help.
>
> We'll see.
>
> As far as I can see, you've made the system run out of memory and are
> now complaining about the result. There's nothing unconventional
> about Squashfs' handling of out of memory, and most filesystems put
> into an out-of-memory situation will fail.
>
Understood. Squashfs is now used in Ubuntu Core and will be in many IoT
products, so we really need to find a solution for this.
Thanks,
Hui.
> Phillip
>
>
>>
>> Thanks,
>>
>> Hui.
>>
>>
>
>
>>
Thread overview: 25+ messages
2023-04-26 5:10 [PATCH 0/1] mm/oom_kill: system enters a state something like hang when running stress-ng Hui Wang
2023-04-26 5:10 ` [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS Hui Wang
2023-04-26 8:33 ` Michal Hocko
2023-04-26 11:07 ` Hui Wang
2023-04-26 16:44 ` Phillip Lougher
2023-04-26 17:38 ` Phillip Lougher
2023-04-26 18:26 ` Yang Shi
2023-04-26 19:06 ` Phillip Lougher
2023-04-26 19:34 ` Phillip Lougher
2023-04-27 0:42 ` Hui Wang
2023-04-27 1:37 ` Phillip Lougher
2023-04-27 5:22 ` Hui Wang [this message]
2023-04-27 1:18 ` Gao Xiang
2023-04-27 3:47 ` Hui Wang
2023-04-27 4:17 ` Gao Xiang
2023-04-27 7:03 ` Colin King (gmail)
2023-04-27 7:49 ` Hui Wang
2023-04-28 19:53 ` Michal Hocko
2023-05-03 11:49 ` Hui Wang
2023-05-03 12:20 ` Michal Hocko
2023-05-03 18:41 ` Phillip Lougher
2023-05-03 19:10 ` Phillip Lougher
2023-05-03 19:38 ` Hui Wang
2023-05-07 21:07 ` Phillip Lougher
2023-05-08 10:05 ` Hui Wang