From: Hui Wang <hui.wang@canonical.com>
To: "Colin King (gmail)" <colin.i.king@gmail.com>,
Gao Xiang <hsiangkao@linux.alibaba.com>,
Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com,
shy828301@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz,
hch@infradead.org, mgorman@suse.de,
Phillip Lougher <phillip@squashfs.org.uk>
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
Date: Thu, 27 Apr 2023 15:49:19 +0800
Message-ID: <5eac5a89-e8b6-4a57-91af-7b21012042d3@canonical.com>
In-Reply-To: <8bac892e-15e2-e95b-5b2b-0981f1279e4c@gmail.com>

On 4/27/23 15:03, Colin King (gmail) wrote:
> On 27/04/2023 04:47, Hui Wang wrote:
>>
>> On 4/27/23 09:18, Gao Xiang wrote:
>>>
>>>
>>> On 2023/4/26 19:07, Hui Wang wrote:
>>>>
>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>> [CC squashfs maintainer]
>>>>>
>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>> If we run stress-ng on a squashfs filesystem, the system ends up
>>>>>> in a state that looks like a hang: stress-ng cannot finish
>>>>>> running and the console stops reacting to user input.
>>>>>>
>>>>>> This issue happens on all arm/arm64 platforms we are working on.
>>>>>> Through debugging, we found it is introduced by the oom handling
>>>>>> in the kernel.
>>>>>>
>>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>>> alloc_page(). In this case, if there is no memory left,
>>>>>> out_of_memory() is called without __GFP_FS, so the oom killer is
>>>>>> not triggered and the process loops endlessly, waiting for others
>>>>>> to trigger the oom killer and release some memory. But on a
>>>>>> system whose whole root filesystem is built from squashfs, nearly
>>>>>> all userspace processes will call out_of_memory() without
>>>>>> __GFP_FS, so the system enters a hang-like state when running
>>>>>> stress-ng.
>>>>>>
>>>>>> To fix it, we could trigger a kthread to call alloc_page() with
>>>>>> __GFP_FS before out_of_memory() returns early due to the missing
>>>>>> __GFP_FS.
>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>> Does it even make sense to trigger the OOM killer for something
>>>>> like readahead? Would it be more mindful to fail the allocation
>>>>> instead? That being said, should allocations from
>>>>> squashfs_readahead() use __GFP_RETRY_MAYFAIL instead?
>>>>
>>>> Thanks for your comment. This issue can hardly be reproduced on an
>>>> ext4 filesystem, because ext4->readahead() doesn't call
>>>> alloc_page(). If we change ext4->readahead() as below, the issue
>>>> becomes easy to reproduce on ext4 as well (repeatedly run:
>>>> $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>> --timeout 30s --skip-silent --verbose)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>>  static void ext4_readahead(struct readahead_control *rac)
>>>>  {
>>>>          struct inode *inode = rac->mapping->host;
>>>> +        struct page *tmp_page;
>>>>
>>>>          /* If the file has inline data, no need to do readahead. */
>>>>          if (ext4_has_inline_data(inode))
>>>>                  return;
>>>>
>>>> +        tmp_page = alloc_page(GFP_KERNEL);
>>>> +
>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>> +
>>>> +        if (tmp_page)
>>>> +                __free_page(tmp_page);
>>>>  }
>>>>
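
(Side note: as far as I understand, this alloc_page(GFP_KERNEL) ends up
behaving like a NOFS allocation because readahead runs inside the
memalloc_nofs_save()/memalloc_nofs_restore() scope mentioned above, so
current_gfp_context() masks __GFP_FS off the allocation before the
allocator and out_of_memory() look at it.)
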
>>>
>> Hi Xiang and Michal,
>>> Is it tested with a pure ext4 without any other fs background?
>>>
>> Basically yes. There may be a squashfs mounted for python3 in my test
>> environment, but stress-ng and the shared libraries it needs are on
>> ext4.
>
> One could build a static version of stress-ng to remove the need for
> shared library loading at run time:
>
> git clone https://github.com/ColinIanKing/stress-ng
> cd stress-ng
> make clean
> STATIC=1 make -j 8
>
I did that already: I copied the static binary to /home/ubuntu under
uc20/uc22 and ran it from /home/ubuntu, and the hang issue no longer
occurs. The folder /home/ubuntu/ is on an ext4 filesystem, which
confirms the issue only happens on squashfs.

And if I build it without STATIC=1, it hangs even when run from
/home/ubuntu/, because the system needs to load the shared libraries
from a squashfs folder.
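
For reference, the steps I used boil down to roughly the following (a
sketch of my setup; /home/ubuntu is the ext4 mount, and
${num_of_cpu_threads} is the number of CPU threads on the target):

# build a static stress-ng and run it from the ext4 home directory
git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make clean
STATIC=1 make -j 8
cp stress-ng /home/ubuntu/
cd /home/ubuntu
# repeat this run until the hang-like state appears
./stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 \
    --timeout 30s --skip-silent --verbose
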
Thanks,
Hui.
>
>>> I don't think it's true that "ext4->readahead() doesn't call
>>> alloc_page()", since I think even ext2/ext4 uses the buffer head
>>> interfaces to read metadata (extents or old block mapping) from its
>>> bd_inode for readahead, which indirectly allocates some extra pages
>>> into the page cache as well.
>>
>> Calling alloc_page() or allocating memory in readahead() is not a
>> problem by itself. Suppose we have 4 processes (A, B, C and D).
>> Processes A, B and C enter out_of_memory() because they allocate
>> memory in readahead(); they loop and wait for some memory to be
>> released. If process D then enters out_of_memory() with __GFP_FS, it
>> can trigger the oom killer, so A, B and C get their memory and return
>> to readahead(), and there is no system hang.
>>
>> But if all 4 processes enter out_of_memory() from readahead(), they
>> loop and wait endlessly; there is no process left to trigger the oom
>> killer, so to the user the system appears to hang.
>>
>> I applied my change to ext4->readahead() on linux-next and tested it
>> on my Ubuntu classic server for arm64; I could reproduce the hang
>> issue within 1 minute with a 100% rate. I guess the issue is easy to
>> reproduce because this is an embedded environment: the total number
>> of processes in the system is very limited, nearly all userspace
>> processes eventually reach out_of_memory() from ext4_readahead(), and
>> hardly any kthread reaches out_of_memory() for a long time, which
>> leaves the system in a hang-like state (not a real hang).
>>
>> And this is why I wrote a patch to let a specific kthread trigger the
>> oom killer forcibly (my initial patch).
>>
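
(For readers who don't have that initial patch at hand, the rough shape
of the idea is something like the sketch below. This is not the code
from the posted patch, just an uncompiled illustration, and the
oom_helper_* names are made up.)

#include <linux/gfp.h>
#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(oom_helper_wq);
static bool oom_helper_kick;

/* started early via kthread_run(oom_helper_fn, NULL, "oom_helper") */
static int oom_helper_fn(void *unused)
{
        while (!kthread_should_stop()) {
                struct page *page;

                wait_event_interruptible(oom_helper_wq,
                                         READ_ONCE(oom_helper_kick) ||
                                         kthread_should_stop());
                WRITE_ONCE(oom_helper_kick, false);

                /*
                 * GFP_KERNEL includes __GFP_FS, so this allocation may
                 * invoke the OOM killer, which the NOFS allocators
                 * spinning in the slow path cannot do themselves.
                 */
                page = alloc_page(GFP_KERNEL);
                if (page)
                        __free_page(page);
        }
        return 0;
}

/* called by out_of_memory() right before it bails out without __GFP_FS */
static void kick_oom_helper(void)
{
        WRITE_ONCE(oom_helper_kick, true);
        wake_up(&oom_helper_wq);
}
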
>>
>>>
>>> The only difference here is the total number of pages to be
>>> allocated, but the extra allocations needed for compressed data make
>>> things worse. So I think it depends a lot on how stressful your
>>> workload is, and I'm not even sure it's a real issue, since if you
>>> stop the stress workload it will immediately recover (it just may
>>> not oom directly).
>>>
>> Yes, it is not a real hang: all userspace processes are looping and
>> waiting for other processes to release or reclaim memory. And in this
>> case, we can't stop the stress workload, since users can't control
>> the system through the console.
>>
>> So Michal,
>>
>> I don't know if you have read "[PATCH 0/1] mm/oom_kill: system enters
>> a state something like hang when running stress-ng". Do you know why
>> out_of_memory() returns immediately if there is no __GFP_FS? Could we
>> drop these lines directly:
>>
>>         /*
>>          * The OOM killer does not compensate for IO-less reclaim.
>>          * pagefault_out_of_memory lost its gfp context so we have to
>>          * make sure exclude 0 mask - all other users should have at least
>>          * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>          * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>          */
>>         if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) &&
>>             !is_memcg_oom(oc))
>>                 return true;
>>
>>
>> Thanks,
>>
>> Hui.
>>
>>> Thanks,
>>> Gao Xiang
>