From: Phillip Lougher <phillip@squashfs.org.uk>
To: Yang Shi <shy828301@gmail.com>
Cc: Hui Wang <hui.wang@canonical.com>, Michal Hocko <mhocko@suse.com>,
linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com,
colin.i.king@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz,
hch@infradead.org, mgorman@suse.de, dan.carpenter@oracle.com
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
Date: Wed, 26 Apr 2023 20:06:33 +0100 [thread overview]
Message-ID: <b6b4b90a-fedd-ef32-767d-2e76b8993c69@squashfs.org.uk> (raw)
In-Reply-To: <CAHbLzko3DDr02QrYbSteQ2GZR_DZYxKnsGR_8tPjY8vBXs6Cvw@mail.gmail.com>
On 26/04/2023 19:26, Yang Shi wrote:
> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
> <phillip@squashfs.org.uk> wrote:
>>
>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>> [CC squashfs maintainer]
>>>>>
>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>> If we run stress-ng on a squashfs filesystem, the system ends
>>>>>> up in a hang-like state: stress-ng never finishes and the
>>>>>> console stops reacting to user input.
>>>>>>
>>>>>> This issue happens on all the arm/arm64 platforms we work on.
>>>>>> Through debugging, we found it is introduced by the oom handling
>>>>>> in the kernel.
>>>>>>
>>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>>> alloc_page(). If no memory is left, out_of_memory() is called
>>>>>> without __GFP_FS, so the oom killer is not triggered and the
>>>>>> process loops endlessly, waiting for some other task to trigger
>>>>>> the oom killer and release memory. On a system whose whole root
>>>>>> filesystem is squashfs, nearly all userspace processes call
>>>>>> out_of_memory() without __GFP_FS, so the system enters a
>>>>>> hang-like state when running stress-ng.
>>>>>>
>>>>>> To fix it, we could wake a kthread that calls alloc_page() with
>>>>>> __GFP_FS before out_of_memory() returns without killing anything
>>>>>> because __GFP_FS is absent.
>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>> Does it even make sense to trigger OOM killer for something like
>>>>> readahead? Would it be more mindful to fail the allocation instead?
>>>>> That being said should allocations from squashfs_readahead use
>>>>> __GFP_RETRY_MAYFAIL instead?
>>>> Thanks for your comment. This issue can hardly be reproduced on an
>>>> ext4 filesystem, because ext4->readahead() doesn't call
>>>> alloc_page(). If ext4->readahead() is changed as below, the issue
>>>> becomes easy to reproduce on ext4 as well (repeatedly run:
>>>> $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>> --timeout 30s --skip-silent --verbose)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>>  static void ext4_readahead(struct readahead_control *rac)
>>>>  {
>>>>  	struct inode *inode = rac->mapping->host;
>>>> +	struct page *tmp_page;
>>>>  
>>>>  	/* If the file has inline data, no need to do readahead. */
>>>>  	if (ext4_has_inline_data(inode))
>>>>  		return;
>>>>  
>>>> +	tmp_page = alloc_page(GFP_KERNEL);
>>>> +
>>>>  	ext4_mpage_readpages(inode, rac, NULL);
>>>> +
>>>> +	if (tmp_page)
>>>> +		__free_page(tmp_page);
>>>>  }
>>>>
>>>>
>>>> BTW, I applied my patch to linux-next and ran the oom stress-ng
>>>> test cases overnight: no hang, oops or crash, so using a kthread
>>>> to trigger the oom killer looks safe in this case.
>>>>
>>>> And hi squashfs maintainer, I checked the filesystem code, and most
>>>> filesystems do not call alloc_page() in their readahead(). Could
>>>> you please help take a look at this issue? Thanks.
>>>
>>> That will be because most filesystems don't need to do so. Squashfs
>>> is a compressed filesystem whose large blocks cover much more than
>>> one page, and it decompresses these blocks in squashfs_readahead().
>>> If __readahead_batch() does not return the full set of pages
>>> covering the Squashfs block, it allocates a temporary page for the
>>> decompressors to decompress into, to "fill in the hole".
>>>
>>> What can be done here as far as Squashfs is concerned: I could move
>>> the page allocation out of the readahead path (e.g. do it at mount
>>> time).
>> You could try this patch which does that. Compile tested only.
> The kmalloc_array() may call alloc_page() to trigger this problem too
> IIUC. It should be pre-allocated as well?
That is a much smaller allocation, so whether it is an issue at all
remains to be seen. There are a number of other small memory
allocations in that path as well.
The whole point of this patch is to move the *biggest* allocation,
which is the reported issue, and then see what happens. No point in
making this test patch more involved and complex than necessary at
this stage.
Phillip
>> fs/squashfs/page_actor.c | 10 +---------
>> fs/squashfs/page_actor.h | 1 -
>> fs/squashfs/squashfs_fs_sb.h | 1 +
>> fs/squashfs/super.c | 10 ++++++++++
>> 4 files changed, 12 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
>> index 81af6c4ca115..6cce239eca66 100644
>> --- a/fs/squashfs/page_actor.c
>> +++ b/fs/squashfs/page_actor.c
>> @@ -110,15 +110,7 @@ struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_
>> if (actor == NULL)
>> return NULL;
>>
>> - if (msblk->decompressor->alloc_buffer) {
>> - actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> -
>> - if (actor->tmp_buffer == NULL) {
>> - kfree(actor);
>> - return NULL;
>> - }
>> - } else
>> - actor->tmp_buffer = NULL;
>> + actor->tmp_buffer = msblk->actor_page;
>>
>> actor->length = length ? : pages * PAGE_SIZE;
>> actor->page = page;
>> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
>> index 97d4983559b1..df5e999afa42 100644
>> --- a/fs/squashfs/page_actor.h
>> +++ b/fs/squashfs/page_actor.h
>> @@ -34,7 +34,6 @@ static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *
>> {
>> struct page *last_page = actor->last_page;
>>
>> - kfree(actor->tmp_buffer);
>> kfree(actor);
>> return last_page;
>> }
>> diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
>> index 72f6f4b37863..8feddc9e6cce 100644
>> --- a/fs/squashfs/squashfs_fs_sb.h
>> +++ b/fs/squashfs/squashfs_fs_sb.h
>> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>> struct squashfs_cache *block_cache;
>> struct squashfs_cache *fragment_cache;
>> struct squashfs_cache *read_page;
>> + void *actor_page;
>> int next_meta_index;
>> __le64 *id_table;
>> __le64 *fragment_index;
>> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
>> index e090fae48e68..674dc187d961 100644
>> --- a/fs/squashfs/super.c
>> +++ b/fs/squashfs/super.c
>> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>> goto failed_mount;
>> }
>>
>> +
>> + /* Allocate page for squashfs_readahead()/squashfs_read_folio() */
>> + if (msblk->decompressor->alloc_buffer) {
>> + msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> +
>> + if (msblk->actor_page == NULL)
>> + goto failed_mount;
>> + }
>> +
>> msblk->stream = squashfs_decompressor_setup(sb, flags);
>> if (IS_ERR(msblk->stream)) {
>> err = PTR_ERR(msblk->stream);
>> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>> squashfs_cache_delete(msblk->block_cache);
>> squashfs_cache_delete(msblk->fragment_cache);
>> squashfs_cache_delete(msblk->read_page);
>> + kfree(msblk->actor_page);
>> msblk->thread_ops->destroy(msblk);
>> kfree(msblk->inode_lookup_table);
>> kfree(msblk->fragment_index);
>> --
>> 2.35.1
>>
>>> Adding __GFP_RETRY_MAYFAIL so the allocation can fail will mean
>>> Squashfs returning I/O errors due to lack of memory. That will
>>> cause a lot of applications to crash in a low-memory situation.
>>> So a crash rather than a hang.
>>>
>>> Phillip
>>>
>>>
>>>
>>>
Thread overview: 25+ messages
2023-04-26 5:10 [PATCH 0/1] mm/oom_kill: system enters a state something like hang when running stress-ng Hui Wang
2023-04-26 5:10 ` [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS Hui Wang
2023-04-26 8:33 ` Michal Hocko
2023-04-26 11:07 ` Hui Wang
2023-04-26 16:44 ` Phillip Lougher
2023-04-26 17:38 ` Phillip Lougher
2023-04-26 18:26 ` Yang Shi
2023-04-26 19:06 ` Phillip Lougher [this message]
2023-04-26 19:34 ` Phillip Lougher
2023-04-27 0:42 ` Hui Wang
2023-04-27 1:37 ` Phillip Lougher
2023-04-27 5:22 ` Hui Wang
2023-04-27 1:18 ` Gao Xiang
2023-04-27 3:47 ` Hui Wang
2023-04-27 4:17 ` Gao Xiang
2023-04-27 7:03 ` Colin King (gmail)
2023-04-27 7:49 ` Hui Wang
2023-04-28 19:53 ` Michal Hocko
2023-05-03 11:49 ` Hui Wang
2023-05-03 12:20 ` Michal Hocko
2023-05-03 18:41 ` Phillip Lougher
2023-05-03 19:10 ` Phillip Lougher
2023-05-03 19:38 ` Hui Wang
2023-05-07 21:07 ` Phillip Lougher
2023-05-08 10:05 ` Hui Wang