From: Hui Wang <hui.wang@canonical.com>
To: Phillip Lougher <phillip@squashfs.org.uk>,
Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>,
linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com,
colin.i.king@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz,
hch@infradead.org, mgorman@suse.de,
Gao Xiang <hsiangkao@linux.alibaba.com>
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
Date: Thu, 27 Apr 2023 13:22:55 +0800 [thread overview]
Message-ID: <ac7f1fec-fa51-8e5f-ec47-aa63c4c90c2f@canonical.com> (raw)
In-Reply-To: <d691388f-e43c-017f-3f4e-1af834e25159@squashfs.org.uk>
On 4/27/23 09:37, Phillip Lougher wrote:
>
> On 27/04/2023 01:42, Hui Wang wrote:
>>
>> On 4/27/23 03:34, Phillip Lougher wrote:
>>>
>>> On 26/04/2023 20:06, Phillip Lougher wrote:
>>>>
>>>> On 26/04/2023 19:26, Yang Shi wrote:
>>>>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
>>>>> <phillip@squashfs.org.uk> wrote:
>>>>>>
>>>>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>>>>> [CC squashfs maintainer]
>>>>>>>>>
>>>>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>>>>> If we run stress-ng on a squashfs filesystem, the system
>>>>>>>>>> enters a hang-like state: stress-ng cannot finish running
>>>>>>>>>> and the console stops reacting to user input.
>>>>>>>>>>
>>>>>>>>>> This issue happens on all arm/arm64 platforms we are working
>>>>>>>>>> on. Through debugging, we found it is introduced by the OOM
>>>>>>>>>> handling in the kernel.
>>>>>>>>>>
>>>>>>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>>>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>>>>>>> alloc_page(). In this case, if there is no memory left,
>>>>>>>>>> out_of_memory() is called without __GFP_FS, so the OOM killer
>>>>>>>>>> is not triggered and the process loops endlessly, waiting for
>>>>>>>>>> some other process to trigger the OOM killer and release some
>>>>>>>>>> memory. But on a system whose whole root filesystem is
>>>>>>>>>> squashfs, nearly all userspace processes call out_of_memory()
>>>>>>>>>> without __GFP_FS, so the system enters a hang-like state when
>>>>>>>>>> running stress-ng.
>>>>>>>>>>
>>>>>>>>>> To fix it, we could wake a kthread that calls page_alloc()
>>>>>>>>>> with __GFP_FS before out_of_memory() returns because
>>>>>>>>>> __GFP_FS is missing.
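(Purely for illustration, a minimal sketch of the idea described above;
the helper names, the wait queue and the wakeup condition are my
assumptions, not taken from the actual patch:)

        /* Hypothetical worker: woken when a NOFS allocation hits OOM, it
         * performs one __GFP_FS allocation so the OOM killer can run on
         * behalf of the allocator that was not allowed to invoke it. */
        static DECLARE_WAIT_QUEUE_HEAD(oom_fs_wq);
        static bool oom_fs_pending;

        static int oom_fs_kthread_fn(void *unused)
        {
                while (!kthread_should_stop()) {
                        wait_event_interruptible(oom_fs_wq, oom_fs_pending ||
                                                 kthread_should_stop());
                        if (oom_fs_pending) {
                                struct page *page = alloc_page(GFP_KERNEL);

                                if (page)
                                        __free_page(page);
                                oom_fs_pending = false;
                        }
                }
                return 0;
        }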
>>>>>>>>> I do not think this is an appropriate way to deal with this
>>>>>>>>> issue. Does it even make sense to trigger the OOM killer for
>>>>>>>>> something like readahead? Would it be more sensible to fail the
>>>>>>>>> allocation instead? That being said, should allocations from
>>>>>>>>> squashfs_readahead() use __GFP_RETRY_MAYFAIL instead?
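(For illustration only, not code from this thread: Michal's suggestion
would amount to letting the readahead allocation fail instead of looping,
roughly as in the sketch below; the flags are real kernel GFP flags, the
exact call site is my assumption:)

        /* With __GFP_RETRY_MAYFAIL the allocator gives up after retrying
         * instead of looping forever, so the readahead request can simply
         * be abandoned when memory is tight. */
        struct page *page = alloc_page(GFP_NOIO | __GFP_RETRY_MAYFAIL);

        if (!page)
                return -ENOMEM; /* drop this readahead instead of hanging */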
>>>>>>>> Thanks for your comment. This issue can hardly be reproduced on
>>>>>>>> an ext4 filesystem, because ext4's ->readahead() doesn't call
>>>>>>>> alloc_page(). With ext4_readahead() changed as below, it is easy
>>>>>>>> to reproduce this issue on ext4 as well (repeatedly run:
>>>>>>>> $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>>>>>> --timeout 30s --skip-silent --verbose)
>>>>>>>>
>>>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>>>>> --- a/fs/ext4/inode.c
>>>>>>>> +++ b/fs/ext4/inode.c
>>>>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>>>>>>  static void ext4_readahead(struct readahead_control *rac)
>>>>>>>>  {
>>>>>>>>          struct inode *inode = rac->mapping->host;
>>>>>>>> +        struct page *tmp_page;
>>>>>>>>  
>>>>>>>>          /* If the file has inline data, no need to do readahead. */
>>>>>>>>          if (ext4_has_inline_data(inode))
>>>>>>>>                  return;
>>>>>>>>  
>>>>>>>> +        tmp_page = alloc_page(GFP_KERNEL);
>>>>>>>> +
>>>>>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>>>>>> +
>>>>>>>> +        if (tmp_page)
>>>>>>>> +                __free_page(tmp_page);
>>>>>>>>  }
>>>>>>>>
>>>>>>>>
>>>>>>>> BTW, I applied my patch to linux-next and ran the OOM stress-ng
>>>>>>>> test cases overnight; there was no hang, oops or crash, so it
>>>>>>>> looks like there is no big problem with using a kthread to
>>>>>>>> trigger the OOM killer in this case.
>>>>>>>>
>>>>>>>> And hi squashfs maintainer, I checked the filesystem code, and
>>>>>>>> it looks like most filesystems do not call alloc_page() in
>>>>>>>> ->readahead(). Could you please help take a look at this issue?
>>>>>>>> Thanks.
>>>>>>>
>>>>>>> This will be because most filesystems don't need to do so.
>>>>>>> Squashfs is
>>>>>>> a compressed filesystem with large blocks covering much more
>>>>>>> than one
>>>>>>> page, and it decompresses these blocks in
>>>>>>> squashfs_readahead(). If
>>>>>>> __readahead_batch() does not return the full set of pages
>>>>>>> covering the
>>>>>>> Squashfs block, it allocates a temporary page for the
>>>>>>> decompressors to
>>>>>>> decompress into to "fill in the hole".
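(For reference, a rough sketch of the pattern Phillip describes; the
__readahead_batch() call is the real helper from include/linux/pagemap.h,
but the surrounding code is my simplification, not the actual
squashfs_readahead() source:)

        /*
         * Squashfs works on whole compressed blocks.  If the readahead
         * batch does not cover the full block, a temporary page is
         * allocated for the decompressor to write the missing part into.
         * Under memalloc_nofs_save() this allocation cannot invoke the
         * OOM killer.
         */
        nr_pages = __readahead_batch(rac, pages, max_pages);
        if (nr_pages < max_pages) {
                tmp = alloc_page(GFP_KERNEL); /* the "fill in the hole" page */
                if (!tmp)
                        goto skip_readahead;
        }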
>>>>>>>
>>>>>>> What can be done here as far as Squashfs is concerned .... I could
>>>>>>> move the page allocation out of the readahead path (e.g. do it at
>>>>>>> mount time).
>>>>>> You could try this patch which does that. Compile tested only.
>>>>> The kmalloc_array() may call alloc_page() and trigger this problem
>>>>> too, IIUC. Should it be pre-allocated as well?
>>>>
>>>>
>>>> That is a much smaller allocation, and so it entirely depends
>>>> whether it is an issue or not. There are also a number of other
>>>> small memory allocations in the path as well.
>>>>
>>>> The whole point of this patch is to move the *biggest* allocation
>>>> which is the reported issue and then see what happens. No point
>>>> in making this test patch more involved and complex than necessary
>>>> at this point.
>>>>
>>>> Phillip
>>>>
>>>
>>> Also be aware this stress-ng triggered issue is new, and apparently
>>> didn't occur last year. So it is reasonable to assume the issue
>>> has been introduced as a side effect of the readahead improvements.
>>> One of these is this allocation of a temporary page to decompress
>>> into rather than falling back to entirely decompressing into a
>>> pre-allocated buffer (allocated at mount time). The small memory
>>> allocations have been there for many years.
>>>
>>> Allocating the page at mount time effectively puts the memory
>>> allocation situation back to how it was last year before the
>>> readahead work.
>>>
>>> Phillip
>>>
>> Thanks Phillip and Yang.
>>
>> And Phillip,
>>
>> I tested your change, but it didn't help. According to my debugging,
>> the OOM happens while allocating memory for the bio, at the line
>> "struct page *page = alloc_page(GFP_NOIO);" in squashfs_bio_read().
>> Other filesystems just use the pages pre-allocated in the
>> "struct readahead_control" for the bio, but squashfs allocates new
>> pages for the bio (presumably because squashfs is a compressed
>> filesystem).
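(For clarity, a simplified sketch of the allocation I am referring to; it
is modelled on squashfs_bio_read() in fs/squashfs/block.c but is my own
abbreviation, not the exact upstream code:)

        /*
         * The pages that receive the compressed data read from disk are
         * freshly allocated with GFP_NOIO; they are not the page-cache
         * pages supplied by readahead, because what is read off disk
         * still has to be decompressed before it reaches the page cache.
         */
        for (i = 0; i < page_count; i++) {
                struct page *page = alloc_page(GFP_NOIO); /* OOMs here */

                if (!page)
                        goto out_free;
                if (!bio_add_page(bio, page, PAGE_SIZE, 0)) {
                        __free_page(page);
                        goto out_free;
                }
        }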
>>
Hi Phillip,
> The test patch was a process of elimination; it removed the obvious
> change from last year.
>
> It is also because it is a compressed filesystem: in most filesystems,
> what is read off disk in I/O is what ends up in the page cache. In a
> compressed filesystem, what is read in isn't what ends up in the page
> cache.
>
Understood.
>> BTW, this is not a new issue for squashfs: we have uc20 (linux-5.4
>> kernel) and uc22 (linux-5.15 kernel), and both have this issue. The
>> issue already existed in squashfs_readpage() in the 5.4 kernel.
>
> That information would have been rather useful in the initial report,
> and would have saved me from wasting my time. Thanks for that.
Sorry, I didn't mention it before.
>
> Now, in the squashfs_readpage() situation, do processes hang or
> crash? In the squashfs_readpage() path __GFP_NOFS should not be in
> effect. So is the OOM killer being invoked in this code path or
> not? Does alloc_page() in the bio code return NULL and/or invoke
> the OOM killer, or does it get stuck? Don't keep this information to
> yourself so that I have to guess.
>
In the squashfs_readpage() situation, the process still hangs, and
__GFP_NOFS also applies to squashfs_readpage(); please see the call
trace below, captured from linux-5.4:
[ 118.131804] wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww current->comm =
stress-ng-bighe oc->gfp_mask = 8c50 (GFP_FS is not set in gfp_mask)
[ 118.142829] ------------[ cut here ]------------
[ 118.142843] WARNING: CPU: 1 PID: 794 at mm/oom_kill.c:1097
out_of_memory+0x2dc/0x340
[ 118.142845] Modules linked in:
[ 118.142851] CPU: 1 PID: 794 Comm: stress-ng-bighe Tainted: G
W 5.4.0+ #152
[ 118.142853] Hardware name: LS1028A RDB Board (DT)
[ 118.142857] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 118.142860] pc : out_of_memory+0x2dc/0x340
[ 118.142863] lr : out_of_memory+0xe4/0x340
[ 118.142865] sp : ffff8000115cb580
[ 118.142867] x29: ffff8000115cb580 x28: 0000000000000000
[ 118.142871] x27: ffffcefc623dab80 x26: 000031039de16790
[ 118.142875] x25: ffffcefc621e9878 x24: 0000000000000100
[ 118.142878] x23: ffffcefc62278000 x22: 0000000000000000
[ 118.142881] x21: ffff8000115cb6f8 x20: ffff00206272e740
[ 118.142885] x19: 0000000000000000 x18: ffffcefc622268f8
[ 118.142888] x17: 0000000000000000 x16: 0000000000000000
[ 118.142891] x15: ffffcefc621e8a38 x14: 1a9f17e4f9444a3e
[ 118.142894] x13: 0000000000000001 x12: 0000000000000400
[ 118.142897] x11: 0000000000000400 x10: 0000000000000a90
[ 118.142900] x9 : ffff8000115cb2c0 x8 : ffff00206272f230
[ 118.142903] x7 : 0000001b81dc2360 x6 : 0000000000000000
[ 118.142906] x5 : 0000000000000000 x4 : ffff00207f7db210
[ 118.142909] x3 : 0000000000000000 x2 : 0000000000000000
[ 118.142912] x1 : ffff00206272e740 x0 : 0000000000000000
[ 118.142915] Call trace:
[ 118.142919] out_of_memory+0x2dc/0x340
[ 118.142924] __alloc_pages_nodemask+0xf04/0x1090
[ 118.142928] alloc_slab_page+0x34/0x430
[ 118.142931] allocate_slab+0x474/0x500
[ 118.142935] ___slab_alloc.constprop.0+0x1e4/0x64c
[ 118.142938] __slab_alloc.constprop.0+0x54/0xb0
[ 118.142941] kmem_cache_alloc+0x31c/0x350
[ 118.142945] alloc_buffer_head+0x2c/0xac
[ 118.142948] alloc_page_buffers+0xb8/0x210
[ 118.142951] __getblk_gfp+0x180/0x39c
[ 118.142955] squashfs_read_data+0x2a4/0x6f0
[ 118.142958] squashfs_readpage_block+0x2c4/0x630
[ 118.142961] squashfs_readpage+0x5e4/0x98c
[ 118.142964] filemap_fault+0x17c/0x720
[ 118.142967] __do_fault+0x44/0x110
[ 118.142970] __handle_mm_fault+0x930/0xdac
[ 118.142973] handle_mm_fault+0xc8/0x190
[ 118.142978] do_page_fault+0x134/0x5a0
[ 118.142982] do_translation_fault+0xe0/0x108
[ 118.142985] do_mem_abort+0x54/0xb0
[ 118.142988] el0_da+0x1c/0x20
[ 118.142990] ---[ end trace c105c6721d4e890e ]---
I guess this is where __GFP_FS is dropped in linux-5.4 (grow_dev_page()
in fs/buffer.c):
gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
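(To spell out what that line does, mapping_gfp_constraint() is just a
mask intersection, so __GFP_FS is unconditionally cleared here; this is a
restatement for clarity, the helper itself lives in
include/linux/pagemap.h:)

        /*
         * mapping_gfp_constraint(mapping, mask) is
         * mapping_gfp_mask(mapping) & mask, so passing ~__GFP_FS strips
         * __GFP_FS from whatever the mapping allows before the
         * buffer-head/page allocation.  That is why out_of_memory() is
         * reached without __GFP_FS in the squashfs_readpage() path shown
         * in the trace above.
         */
        gfp_mask = (mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS) | gfp;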
>> I guess if we could use pre-allocated memory for the bio, it would help.
>
> We'll see.
>
> As far as I can see, you've made the system run out of memory and are
> now complaining about the result. There's nothing unconventional
> about Squashfs' handling of out of memory, and most filesystems put
> into an out-of-memory situation will fail.
>
Understood. Squashfs is now used in Ubuntu Core and will be in many IoT
products, so we really need to find a solution for this.
Thanks,
Hui.
> Phillip
>
>
>>
>> Thanks,
>>
>> Hui.
>>
>>
>
>
>>
Thread overview: 25+ messages
2023-04-26 5:10 [PATCH 0/1] mm/oom_kill: system enters a state something like hang when running stress-ng Hui Wang
2023-04-26 5:10 ` [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS Hui Wang
2023-04-26 8:33 ` Michal Hocko
2023-04-26 11:07 ` Hui Wang
2023-04-26 16:44 ` Phillip Lougher
2023-04-26 17:38 ` Phillip Lougher
2023-04-26 18:26 ` Yang Shi
2023-04-26 19:06 ` Phillip Lougher
2023-04-26 19:34 ` Phillip Lougher
2023-04-27 0:42 ` Hui Wang
2023-04-27 1:37 ` Phillip Lougher
2023-04-27 5:22 ` Hui Wang [this message]
2023-04-27 1:18 ` Gao Xiang
2023-04-27 3:47 ` Hui Wang
2023-04-27 4:17 ` Gao Xiang
2023-04-27 7:03 ` Colin King (gmail)
2023-04-27 7:49 ` Hui Wang
2023-04-28 19:53 ` Michal Hocko
2023-05-03 11:49 ` Hui Wang
2023-05-03 12:20 ` Michal Hocko
2023-05-03 18:41 ` Phillip Lougher
2023-05-03 19:10 ` Phillip Lougher
2023-05-03 19:38 ` Hui Wang
2023-05-07 21:07 ` Phillip Lougher
2023-05-08 10:05 ` Hui Wang