linux-mm.kvack.org archive mirror
From: Hui Wang <hui.wang@canonical.com>
To: Phillip Lougher <phillip@squashfs.org.uk>,
	Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com,
	colin.i.king@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz,
	hch@infradead.org, mgorman@suse.de
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
Date: Thu, 27 Apr 2023 08:42:45 +0800
Message-ID: <1f181fe6-60f4-6b71-b8ca-4f6365de0b4c@canonical.com>
In-Reply-To: <553e6668-bebd-7411-9f69-d62e9658da1d@squashfs.org.uk>


On 4/27/23 03:34, Phillip Lougher wrote:
>
> On 26/04/2023 20:06, Phillip Lougher wrote:
>>
>> On 26/04/2023 19:26, Yang Shi wrote:
>>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
>>> <phillip@squashfs.org.uk> wrote:
>>>>
>>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>>> [CC squashfs maintainer]
>>>>>>>
>>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>>> If we run stress-ng on a squashfs filesystem, the system enters a
>>>>>>>> hang-like state: stress-ng cannot finish running and the console
>>>>>>>> stops reacting to user input.
>>>>>>>>
>>>>>>>> This issue happens on all arm/arm64 platforms we are working on.
>>>>>>>> Through debugging, we found it is introduced by the oom handling
>>>>>>>> in the kernel.
>>>>>>>>
>>>>>>>> The fs->readahead() hook is called between memalloc_nofs_save()
>>>>>>>> and memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>>>>> alloc_page(). If no memory is left at that point, out_of_memory()
>>>>>>>> is entered without __GFP_FS, so the oom killer is not triggered
>>>>>>>> and the process loops endlessly, waiting for someone else to
>>>>>>>> trigger the oom killer and release some memory. But on a system
>>>>>>>> whose whole root filesystem is constructed from squashfs, nearly
>>>>>>>> all userspace processes end up calling out_of_memory() without
>>>>>>>> __GFP_FS, so the system enters a hang-like state when running
>>>>>>>> stress-ng.
>>>>>>>>
>>>>>>>> To fix it, we could wake a kthread that calls alloc_page() with
>>>>>>>> __GFP_FS before out_of_memory() returns early for lack of
>>>>>>>> __GFP_FS.
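
For illustration, the idea would look roughly like this (a sketch only,
not the actual patch; oom_fs_threadfn and oom_fs_kick are made-up
names):

#include <linux/kthread.h>
#include <linux/gfp.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(oom_fs_wq);
static bool oom_fs_pending;

/* Runs in a context that may use __GFP_FS, unlike the NOFS task whose
 * allocation failed. */
static int oom_fs_threadfn(void *unused)
{
	while (!kthread_should_stop()) {
		struct page *page;

		wait_event(oom_fs_wq,
			   oom_fs_pending || kthread_should_stop());
		oom_fs_pending = false;

		/* GFP_KERNEL includes __GFP_FS, so this allocation is
		 * allowed to invoke the OOM killer under pressure. */
		page = alloc_page(GFP_KERNEL);
		if (page)
			__free_page(page);
	}
	return 0;
}

/* Hypothetically called from out_of_memory() just before it bails out
 * because gfp_mask lacks __GFP_FS. */
static void oom_fs_kick(void)
{
	oom_fs_pending = true;
	wake_up(&oom_fs_wq);
}

The thread itself would be created once at boot with something like
kthread_run(oom_fs_threadfn, NULL, "oom_fs").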
>>>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>>>> Does it even make sense to trigger OOM killer for something like
>>>>>>> readahead? Would it be more mindful to fail the allocation instead?
>>>>>>> That being said should allocations from squashfs_readahead use
>>>>>>> __GFP_RETRY_MAYFAIL instead?
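
Michal's suggestion would amount to something like the following at the
allocation site (a sketch, not a tested change):

	/* __GFP_RETRY_MAYFAIL retries reclaim hard but eventually
	 * returns NULL instead of looping on an OOM killer that the
	 * NOFS context cannot invoke; the caller must then handle
	 * the failure. */
	struct page *page = alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL);

	if (!page)
		return -ENOMEM;	/* fail the readahead instead of hanging */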
>>>>>> Thanks for your comment. This issue can hardly be reproduced on an
>>>>>> ext4 filesystem because ext4's readahead() doesn't call
>>>>>> alloc_page(). With ext4's readahead() changed as below, the issue
>>>>>> becomes easy to reproduce on ext4 as well (repeatedly run:
>>>>>> $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>>>> --timeout 30s --skip-silent --verbose)
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>>>>   static void ext4_readahead(struct readahead_control *rac)
>>>>>>   {
>>>>>>          struct inode *inode = rac->mapping->host;
>>>>>> +       struct page *tmp_page;
>>>>>>
>>>>>>          /* If the file has inline data, no need to do readahead. */
>>>>>>          if (ext4_has_inline_data(inode))
>>>>>>                  return;
>>>>>>
>>>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>>>> +
>>>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>>>> +
>>>>>> +       if (tmp_page)
>>>>>> +               __free_page(tmp_page);
>>>>>>   }
>>>>>>
>>>>>>
>>>>>> BTW, I applied my patch to linux-next and ran the oom stress-ng
>>>>>> testcases overnight: no hang, oops or crash, so using a kthread to
>>>>>> trigger the oom killer does not seem to cause any big problem in
>>>>>> this case.
>>>>>>
>>>>>> And hi squashfs maintainer, I checked the filesystem code and it
>>>>>> looks like most filesystems do not call alloc_page() in their
>>>>>> readahead(); could you please help take a look at this issue?
>>>>>> Thanks.
>>>>>
>>>>> This will be because most filesystems don't need to do so. Squashfs
>>>>> is a compressed filesystem with large blocks covering much more than
>>>>> one page, and it decompresses these blocks in squashfs_readahead().
>>>>> If __readahead_batch() does not return the full set of pages
>>>>> covering the Squashfs block, it allocates a temporary page for the
>>>>> decompressors to decompress into to "fill in the hole".
>>>>>
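
A simplified sketch of the pattern Phillip describes (names and
structure are mine, not the actual fs/squashfs code):

#include <linux/pagemap.h>
#include <linux/gfp.h>

static void block_readahead(struct readahead_control *ractl,
			    struct page **pages, unsigned int max_pages)
{
	unsigned int i, nr;

	/* Take whatever page cache pages readahead supplied. */
	nr = __readahead_batch(ractl, pages, max_pages);
	if (nr < max_pages) {
		/* The block is only partially covered, so allocate one
		 * temporary page for the decompressor to write the
		 * uncovered parts into; its contents are discarded.
		 * This is the alloc_page() that runs under NOFS. */
		struct page *tmp = alloc_page(GFP_KERNEL);

		if (!tmp)
			return;
		for (i = nr; i < max_pages; i++)
			pages[i] = tmp;
	}
	/* ... decompress the whole block into pages[] ... */
}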
>>>>> What can be done here as far as Squashfs is concerned .... I could
>>>>> move the page allocation out of the readahead path (e.g. do it at
>>>>> mount time).
>>>> You could try this patch which does that.  Compile tested only.
>>> The kmalloc_array() may call alloc_page() and trigger this problem
>>> too IIUC. Should it be pre-allocated as well?
>>
>>
>> That is a much smaller allocation, and so it entirely depends on
>> whether it is an issue or not. There are also a number of other small
>> memory allocations in the path as well.
>>
>> The whole point of this patch is to move the *biggest* allocation,
>> which is the reported issue, and then see what happens. No point in
>> making this test patch more involved and complex than necessary at
>> this point.
>>
>> Phillip
>>
>
> Also be aware this stress-ng triggered issue is new, and apparently
> didn't occur last year. So it is reasonable to assume the issue has
> been introduced as a side effect of the readahead improvements. One of
> these is this allocation of a temporary page to decompress into rather
> than falling back to entirely decompressing into a pre-allocated
> buffer (allocated at mount time). The small memory allocations have
> been there for many years.
>
> Allocating the page at mount time effectively puts the memory
> allocation situation back to how it was last year, before the
> readahead work.
>
> Phillip
>
Thanks Phillip and Yang.

And Phillip,

I tested your change, but it didn't help. According to my debugging, the
OOM happens while allocating memory for the bio, at the line
"struct page *page = alloc_page(GFP_NOIO);" in squashfs_bio_read().
Other filesystems just use the pages already provided through "struct
readahead_control" for the bio, but squashfs allocates new pages for the
bio (maybe because squashfs is a compressed filesystem).
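
Paraphrasing that allocation site (not verbatim kernel source), the
loop in squashfs_bio_read() does roughly:

	for (i = 0; i < page_count; i++) {
		/* GFP_NOIO lacks both __GFP_IO and __GFP_FS, so when
		 * this allocation reaches out_of_memory() the OOM
		 * killer is never invoked and the allocator just keeps
		 * retrying: the reported hang. */
		struct page *page = alloc_page(GFP_NOIO);

		if (!page) {
			error = -ENOMEM;
			goto out_free_bio;
		}
		if (!bio_add_page(bio, page, PAGE_SIZE, 0)) {
			error = -EIO;
			goto out_free_bio;
		}
	}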

BTW, this is not a new issue for squashfs: we have uc20 (linux-5.4
kernel) and uc22 (linux-5.15 kernel), and both have this issue. The
issue already existed in squashfs_readpage() in the 5.4 kernel.

I guess that using pre-allocated memory for the bio would help,
something like the sketch below.
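
For example (a rough sketch only; msblk->bio_pool and both helpers are
hypothetical, nothing like them exists in squashfs today):

#include <linux/mempool.h>

static int squashfs_create_bio_pool(struct squashfs_sb_info *msblk)
{
	/* Mount time: reserve enough pages to back one full bio. */
	msblk->bio_pool = mempool_create_page_pool(BIO_MAX_VECS, 0);
	return msblk->bio_pool ? 0 : -ENOMEM;
}

static struct page *squashfs_get_bio_page(struct squashfs_sb_info *msblk)
{
	/* mempool_alloc() waits for a pooled page to be freed instead
	 * of leaning on reclaim, so forward progress no longer depends
	 * on an OOM killer that !__GFP_FS allocations cannot reach. */
	return mempool_alloc(msblk->bio_pool, GFP_NOIO);
}

The trade-off is that the pool bounds the number of in-flight reads,
and concurrent readers would need to tolerate waiting on each other.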

Thanks,

Hui.



>
>>>>    fs/squashfs/page_actor.c     | 10 +---------
>>>>    fs/squashfs/page_actor.h     |  1 -
>>>>    fs/squashfs/squashfs_fs_sb.h |  1 +
>>>>    fs/squashfs/super.c          | 10 ++++++++++
>>>>    4 files changed, 12 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
>>>> index 81af6c4ca115..6cce239eca66 100644
>>>> --- a/fs/squashfs/page_actor.c
>>>> +++ b/fs/squashfs/page_actor.c
>>>> @@ -110,15 +110,7 @@ struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_
>>>>          if (actor == NULL)
>>>>                  return NULL;
>>>>
>>>> -       if (msblk->decompressor->alloc_buffer) {
>>>> -               actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>>> -
>>>> -               if (actor->tmp_buffer == NULL) {
>>>> -                       kfree(actor);
>>>> -                       return NULL;
>>>> -               }
>>>> -       } else
>>>> -               actor->tmp_buffer = NULL;
>>>> +       actor->tmp_buffer = msblk->actor_page;
>>>>
>>>>          actor->length = length ? : pages * PAGE_SIZE;
>>>>          actor->page = page;
>>>> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
>>>> index 97d4983559b1..df5e999afa42 100644
>>>> --- a/fs/squashfs/page_actor.h
>>>> +++ b/fs/squashfs/page_actor.h
>>>> @@ -34,7 +34,6 @@ static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *
>>>>    {
>>>>          struct page *last_page = actor->last_page;
>>>>
>>>> -       kfree(actor->tmp_buffer);
>>>>          kfree(actor);
>>>>          return last_page;
>>>>    }
>>>> diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
>>>> index 72f6f4b37863..8feddc9e6cce 100644
>>>> --- a/fs/squashfs/squashfs_fs_sb.h
>>>> +++ b/fs/squashfs/squashfs_fs_sb.h
>>>> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>>>>          struct squashfs_cache *block_cache;
>>>>          struct squashfs_cache *fragment_cache;
>>>>          struct squashfs_cache                   *read_page;
>>>> +       void                                    *actor_page;
>>>>          int next_meta_index;
>>>>          __le64                                  *id_table;
>>>>          __le64 *fragment_index;
>>>> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
>>>> index e090fae48e68..674dc187d961 100644
>>>> --- a/fs/squashfs/super.c
>>>> +++ b/fs/squashfs/super.c
>>>> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>>>>                  goto failed_mount;
>>>>          }
>>>>
>>>> +
>>>> +       /* Allocate page for squashfs_readahead()/squashfs_read_folio() */
>>>> +       if (msblk->decompressor->alloc_buffer) {
>>>> +               msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>>> +
>>>> +               if(msblk->actor_page == NULL)
>>>> +                       goto failed_mount;
>>>> +       }
>>>> +
>>>>          msblk->stream = squashfs_decompressor_setup(sb, flags);
>>>>          if (IS_ERR(msblk->stream)) {
>>>>                  err = PTR_ERR(msblk->stream);
>>>> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>>>>          squashfs_cache_delete(msblk->block_cache);
>>>>          squashfs_cache_delete(msblk->fragment_cache);
>>>>          squashfs_cache_delete(msblk->read_page);
>>>> +       kfree(msblk->actor_page);
>>>>          msblk->thread_ops->destroy(msblk);
>>>>          kfree(msblk->inode_lookup_table);
>>>>          kfree(msblk->fragment_index);
>>>> -- 
>>>> 2.35.1
>>>>
>>>>> Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs
>>>>> returning I/O failures due to no memory.  That will cause a lot of
>>>>> applications to crash in a low memory situation.  So a crash rather
>>>>> than a hang.
>>>>>
>>>>> Phillip
>>>>>
>>>>>
>>>>>
>>>>>


