linux-mm.kvack.org archive mirror
* [PATCH 0/1] mm/oom_kill: system enters a state something like hang when running stress-ng
@ 2023-04-26  5:10 Hui Wang
  2023-04-26  5:10 ` [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS Hui Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Hui Wang @ 2023-04-26  5:10 UTC (permalink / raw)
  To: linux-mm, akpm, mhocko
  Cc: surenb, colin.i.king, shy828301, hannes, vbabka, hch, mgorman,
	dan.carpenter, hui.wang

When we run stress-ng on UC (Ubuntu Core), the system ends up in a
state similar to a hang. We found that any test case which can trigger
the OOM path (like stress-ng-bigheap, stress-ng-brk, ...) on UC is
highly likely to leave the system in this hang-like state. We discussed
this issue here:
https://github.com/ColinIanKing/stress-ng/pull/270

The root cause is that Ubuntu Core is built on squashfs, and
out_of_memory() will not trigger the OOM killer for an allocation
without __GFP_FS.

For squashfs side:
memalloc_nofs_save()
  squashfs_readahead()--->...--->alloc_page() /* alloc without __GFP_FS */
memalloc_nofs_restore()

For oom side:
out_of_memory()
  if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
          return true;
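
For reference, a minimal sketch (illustration only, not actual kernel
or squashfs code; memalloc_nofs_save()/memalloc_nofs_restore() and
current_gfp_context() are the real kernel API) of why an allocation
made inside such a scope reaches out_of_memory() without __GFP_FS:

unsigned int nofs_flags = memalloc_nofs_save(); /* sets PF_MEMALLOC_NOFS */
struct page *page;

/*
 * While PF_MEMALLOC_NOFS is set, the page allocator applies
 * current_gfp_context(), which clears __GFP_FS, so this GFP_KERNEL
 * allocation behaves like GFP_NOFS and out_of_memory() sees a
 * gfp_mask without __GFP_FS.
 */
page = alloc_page(GFP_KERNEL);	/* effectively GFP_NOFS here */
if (page)
	__free_page(page);

memalloc_nofs_restore(nofs_flags);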

This check in out_of_memory() has existed for a long time. Removing
these two lines fixes the issue, but I don't know whether that is
acceptable. From my understanding, removing them should be safe: if an
allocation without __GFP_FS triggers the OOM killer, the killer simply
selects a process to kill, no different from an allocation with
__GFP_FS.

Since this check has been in place for so long, changing it is risky.
Instead, this patch schedules a work item that triggers an allocation
with __GFP_FS when needed, which I think is a safer way to resolve the
issue. Please help review it.

Thanks.

Hui Wang (1):
  mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS

 mm/oom_kill.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26  5:10 [PATCH 0/1] mm/oom_kill: system enters a state something like hang when running stress-ng Hui Wang
@ 2023-04-26  5:10 ` Hui Wang
  2023-04-26  8:33   ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Hui Wang @ 2023-04-26  5:10 UTC (permalink / raw)
  To: linux-mm, akpm, mhocko
  Cc: surenb, colin.i.king, shy828301, hannes, vbabka, hch, mgorman,
	dan.carpenter, hui.wang

If we run stress-ng on a squashfs filesystem, the system enters a
state resembling a hang: stress-ng cannot finish running and the
console does not react to user input.

This issue happens on all arm/arm64 platforms we are working on.
Through debugging, we found that it is caused by the OOM handling in
the kernel.

fs->readahead() is called between memalloc_nofs_save() and
memalloc_nofs_restore(), and squashfs_readahead() calls alloc_page().
If no memory is left, out_of_memory() is called without __GFP_FS, the
OOM killer is not triggered, and the allocating process loops
endlessly, waiting for someone else to trigger the OOM killer and
release some memory. On a system whose whole root filesystem is built
on squashfs, nearly all userspace processes end up calling
out_of_memory() without __GFP_FS, so the system enters a hang-like
state when running stress-ng.

To fix it, we schedule a work item that calls alloc_page() with
__GFP_FS whenever out_of_memory() bails out early because the
triggering allocation lacks __GFP_FS.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Hui Wang <hui.wang@canonical.com>
---
 mm/oom_kill.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..c9c38d6b8580 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1094,6 +1094,24 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+/*
+ * If an oom occurs without the __GFP_FS flag in the gfp_mask, the oom killer
+ * will not be triggered. In this case, we could call schedule_work to run
+ * trigger_oom_killer_work() to trigger an oom forcibly with __GFP_FS flag,
+ * this could make the oom killer run with a fair chance.
+ */
+static void trigger_oom_killer_work(struct work_struct *work)
+{
+	struct page *tmp_page;
+
+	/* This could trigger an oom forcibly with a chance */
+	tmp_page = alloc_page(GFP_KERNEL);
+	if (tmp_page)
+		__free_page(tmp_page);
+}
+
+static DECLARE_WORK(oom_trigger_work, trigger_oom_killer_work);
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -1135,8 +1153,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
 	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) {
+		schedule_work(&oom_trigger_work);
 		return true;
+	}
 
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
-- 
2.34.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26  5:10 ` [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS Hui Wang
@ 2023-04-26  8:33   ` Michal Hocko
  2023-04-26 11:07     ` Hui Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2023-04-26  8:33 UTC (permalink / raw)
  To: Hui Wang
  Cc: linux-mm, akpm, surenb, colin.i.king, shy828301, hannes, vbabka,
	hch, mgorman, dan.carpenter, Phillip Lougher

[CC squashfs maintainer]

On Wed 26-04-23 13:10:30, Hui Wang wrote:
> If we run the stress-ng in the filesystem of squashfs, the system
> will be in a state something like hang, the stress-ng couldn't
> finish running and the console couldn't react to users' input.
> 
> This issue happens on all arm/arm64 platforms we are working on,
> through debugging, we found this issue is introduced by oom handling
> in the kernel.
> 
> The fs->readahead() is called between memalloc_nofs_save() and
> memalloc_nofs_restore(), and the squashfs_readahead() calls
> alloc_page(), in this case, if there is no memory left, the
> out_of_memory() will be called without __GFP_FS, then the oom killer
> will not be triggered and this process will loop endlessly and wait
> for others to trigger oom killer to release some memory. But for a
> system with the whole root filesystem constructed by squashfs,
> nearly all userspace processes will call out_of_memory() without
> __GFP_FS, so we will see that the system enters a state something like
> hang when running stress-ng.
> 
> To fix it, we could trigger a kthread to call page_alloc() with
> __GFP_FS before returning from out_of_memory() due to without
> __GFP_FS.

I do not think this is an appropriate way to deal with this issue.
Does it even make sense to trigger OOM killer for something like
readahead? Would it be more mindful to fail the allocation instead?
That being said should allocations from squashfs_readahead use
__GFP_RETRY_MAYFAIL instead?
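
A minimal sketch of that last alternative, assuming it sits at the
temporary-page allocation site inside squashfs_readahead() (the call
site is simplified here, not the actual fs/squashfs code):

struct page *tmp_page;

/* Allow the allocation to fail under extreme memory pressure instead
 * of retrying in the allocator indefinitely. */
tmp_page = alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL);
if (!tmp_page)
	return;	/* give up on this readahead request */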
 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Colin Ian King <colin.i.king@gmail.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Hui Wang <hui.wang@canonical.com>
> ---
>  mm/oom_kill.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 044e1eed720e..c9c38d6b8580 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1094,6 +1094,24 @@ int unregister_oom_notifier(struct notifier_block *nb)
>  }
>  EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>  
> +/*
> + * If an oom occurs without the __GFP_FS flag in the gfp_mask, the oom killer
> + * will not be triggered. In this case, we could call schedule_work to run
> + * trigger_oom_killer_work() to trigger an oom forcibly with __GFP_FS flag,
> + * this could make the oom killer run with a fair chance.
> + */
> +static void trigger_oom_killer_work(struct work_struct *work)
> +{
> +	struct page *tmp_page;
> +
> +	/* This could trigger an oom forcibly with a chance */
> +	tmp_page = alloc_page(GFP_KERNEL);
> +	if (tmp_page)
> +		__free_page(tmp_page);
> +}
> +
> +static DECLARE_WORK(oom_trigger_work, trigger_oom_killer_work);
> +
>  /**
>   * out_of_memory - kill the "best" process when we run out of memory
>   * @oc: pointer to struct oom_control
> @@ -1135,8 +1153,10 @@ bool out_of_memory(struct oom_control *oc)
>  	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>  	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
>  	 */
> -	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) {
> +		schedule_work(&oom_trigger_work);
>  		return true;
> +	}
>  
>  	/*
>  	 * Check if there were limitations on the allocation (only relevant for
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26  8:33   ` Michal Hocko
@ 2023-04-26 11:07     ` Hui Wang
  2023-04-26 16:44       ` Phillip Lougher
  2023-04-27  1:18       ` Gao Xiang
  0 siblings, 2 replies; 25+ messages in thread
From: Hui Wang @ 2023-04-26 11:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, akpm, surenb, colin.i.king, shy828301, hannes, vbabka,
	hch, mgorman, dan.carpenter, Phillip Lougher


On 4/26/23 16:33, Michal Hocko wrote:
> [CC squashfs maintainer]
>
> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>> If we run the stress-ng in the filesystem of squashfs, the system
>> will be in a state something like hang, the stress-ng couldn't
>> finish running and the console couldn't react to users' input.
>>
>> This issue happens on all arm/arm64 platforms we are working on,
>> through debugging, we found this issue is introduced by oom handling
>> in the kernel.
>>
>> The fs->readahead() is called between memalloc_nofs_save() and
>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>> alloc_page(), in this case, if there is no memory left, the
>> out_of_memory() will be called without __GFP_FS, then the oom killer
>> will not be triggered and this process will loop endlessly and wait
>> for others to trigger oom killer to release some memory. But for a
>> system with the whole root filesystem constructed by squashfs,
>> nearly all userspace processes will call out_of_memory() without
>> __GFP_FS, so we will see that the system enters a state something like
>> hang when running stress-ng.
>>
>> To fix it, we could trigger a kthread to call page_alloc() with
>> __GFP_FS before returning from out_of_memory() due to without
>> __GFP_FS.
> I do not think this is an appropriate way to deal with this issue.
> Does it even make sense to trigger OOM killer for something like
> readahead? Would it be more mindful to fail the allocation instead?
> That being said should allocations from squashfs_readahead use
> __GFP_RETRY_MAYFAIL instead?

Thanks for your comment. This issue can hardly be reproduced on an
ext4 filesystem because ext4_readahead() doesn't call alloc_page(). If
we change ext4_readahead() as below, it becomes easy to reproduce the
issue on ext4 as well (repeatedly run: $stress-ng --bigheap
${num_of_cpu_threads} --sequential 0 --timeout 30s --skip-silent
--verbose):

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ffbbd9626bd8..8b9db0b9d0b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
  static void ext4_readahead(struct readahead_control *rac)
  {
         struct inode *inode = rac->mapping->host;
+       struct page *tmp_page;

         /* If the file has inline data, no need to do readahead. */
         if (ext4_has_inline_data(inode))
                 return;

+       tmp_page = alloc_page(GFP_KERNEL);
+
         ext4_mpage_readpages(inode, rac, NULL);
+
+       if (tmp_page)
+               __free_page(tmp_page);
  }


BTW, I applied my patch to linux-next and ran the OOM stress-ng test
cases overnight; there was no hang, oops or crash, so it looks like
there is no big problem with triggering the OOM killer from a work
item in this case.

And hi squashfs maintainer, I checked the filesystem code and it looks
like most filesystems do not call alloc_page() in their readahead()
implementations. Could you please help take a look at this issue?
Thanks.


>   
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Colin Ian King <colin.i.king@gmail.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Christoph Hellwig <hch@infradead.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Dan Carpenter <dan.carpenter@oracle.com>
>> Signed-off-by: Hui Wang <hui.wang@canonical.com>
>> ---
>>   mm/oom_kill.c | 22 +++++++++++++++++++++-
>>   1 file changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 044e1eed720e..c9c38d6b8580 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1094,6 +1094,24 @@ int unregister_oom_notifier(struct notifier_block *nb)
>>   }
>>   EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>>   
>> +/*
>> + * If an oom occurs without the __GFP_FS flag in the gfp_mask, the oom killer
>> + * will not be triggered. In this case, we could call schedule_work to run
>> + * trigger_oom_killer_work() to trigger an oom forcibly with __GFP_FS flag,
>> + * this could make the oom killer run with a fair chance.
>> + */
>> +static void trigger_oom_killer_work(struct work_struct *work)
>> +{
>> +	struct page *tmp_page;
>> +
>> +	/* This could trigger an oom forcibly with a chance */
>> +	tmp_page = alloc_page(GFP_KERNEL);
>> +	if (tmp_page)
>> +		__free_page(tmp_page);
>> +}
>> +
>> +static DECLARE_WORK(oom_trigger_work, trigger_oom_killer_work);
>> +
>>   /**
>>    * out_of_memory - kill the "best" process when we run out of memory
>>    * @oc: pointer to struct oom_control
>> @@ -1135,8 +1153,10 @@ bool out_of_memory(struct oom_control *oc)
>>   	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>   	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>   	 */
>> -	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) {
>> +		schedule_work(&oom_trigger_work);
>>   		return true;
>> +	}
>>   
>>   	/*
>>   	 * Check if there were limitations on the allocation (only relevant for
>> -- 
>> 2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26 11:07     ` Hui Wang
@ 2023-04-26 16:44       ` Phillip Lougher
  2023-04-26 17:38         ` Phillip Lougher
  2023-04-27  1:18       ` Gao Xiang
  1 sibling, 1 reply; 25+ messages in thread
From: Phillip Lougher @ 2023-04-26 16:44 UTC (permalink / raw)
  To: Hui Wang, Michal Hocko
  Cc: linux-mm, akpm, surenb, colin.i.king, shy828301, hannes, vbabka,
	hch, mgorman, dan.carpenter


On 26/04/2023 12:07, Hui Wang wrote:
>
> On 4/26/23 16:33, Michal Hocko wrote:
>> [CC squashfs maintainer]
>>
>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>> If we run the stress-ng in the filesystem of squashfs, the system
>>> will be in a state something like hang, the stress-ng couldn't
>>> finish running and the console couldn't react to users' input.
>>>
>>> This issue happens on all arm/arm64 platforms we are working on,
>>> through debugging, we found this issue is introduced by oom handling
>>> in the kernel.
>>>
>>> The fs->readahead() is called between memalloc_nofs_save() and
>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>> alloc_page(), in this case, if there is no memory left, the
>>> out_of_memory() will be called without __GFP_FS, then the oom killer
>>> will not be triggered and this process will loop endlessly and wait
>>> for others to trigger oom killer to release some memory. But for a
>>> system with the whole root filesystem constructed by squashfs,
>>> nearly all userspace processes will call out_of_memory() without
>>> __GFP_FS, so we will see that the system enters a state something like
>>> hang when running stress-ng.
>>>
>>> To fix it, we could trigger a kthread to call page_alloc() with
>>> __GFP_FS before returning from out_of_memory() due to without
>>> __GFP_FS.
>> I do not think this is an appropriate way to deal with this issue.
>> Does it even make sense to trigger OOM killer for something like
>> readahead? Would it be more mindful to fail the allocation instead?
>> That being said should allocations from squashfs_readahead use
>> __GFP_RETRY_MAYFAIL instead?
>
> Thanks for your comment, and this issue could hardly be reproduced on 
> ext4 filesystem, that is because the ext4->readahead() doesn't call 
> alloc_page(). If changing the ext4->readahead() as below, it will be 
> easy to reproduce this issue with the ext4 filesystem (repeatedly run: 
> $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 --timeout 
> 30s --skip-silent --verbose)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ffbbd9626bd8..8b9db0b9d0b8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, 
> struct folio *folio)
>  static void ext4_readahead(struct readahead_control *rac)
>  {
>         struct inode *inode = rac->mapping->host;
> +       struct page *tmp_page;
>
>         /* If the file has inline data, no need to do readahead. */
>         if (ext4_has_inline_data(inode))
>                 return;
>
> +       tmp_page = alloc_page(GFP_KERNEL);
> +
>         ext4_mpage_readpages(inode, rac, NULL);
> +
> +       if (tmp_page)
> +               __free_page(tmp_page);
>  }
>
>
> BTW, I applied my patch to the linux-next and ran the oom stress-ng 
> testcases overnight, there is no hang, oops or crash, looks like there 
> is no big problem to use a kthread to trigger the oom killer in this 
> case.
>
> And Hi squashfs maintainer, I checked the code of filesystem, looks 
> like most filesystems will not call alloc_page() in the readahead(), 
> could you please help take a look at this issue, thanks.


This will be because most filesystems don't need to do so. Squashfs is a 
compressed filesystem with large blocks covering much more than one 
page, and it decompresses these blocks in squashfs_readahead().   If 
__readahead_batch() does not return the full set of pages covering the 
Squashfs block, it allocates a temporary page for the decompressors to 
decompress into, to "fill in the hole".
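
A rough sketch of that case (illustration only, not the actual
fs/squashfs code; the function name and pages_per_block are
hypothetical):

/* Returns a temporary buffer when readahead did not hand us every
 * page covered by the compressed block, NULL otherwise. */
static void *get_decompress_target(struct readahead_control *rac,
				   struct page **pages,
				   unsigned int pages_per_block)
{
	unsigned int nr = __readahead_batch(rac, pages, pages_per_block);

	if (nr < pages_per_block)
		/*
		 * A "hole" in the batch: allocate a temporary buffer for
		 * the decompressor, inside the caller's memalloc_nofs
		 * scope, i.e. effectively without __GFP_FS.
		 */
		return kmalloc(PAGE_SIZE, GFP_KERNEL);

	return NULL;
}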

What can be done here as far as Squashfs is concerned ....  I could move 
the page allocation out of the readahead path (e.g. do it at mount time).

Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs 
returning I/O failures due to no memory.  That will cause a lot of 
applications to crash in a low memory situation.  So a crash rather than 
a hang.

Phillip






^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26 16:44       ` Phillip Lougher
@ 2023-04-26 17:38         ` Phillip Lougher
  2023-04-26 18:26           ` Yang Shi
  0 siblings, 1 reply; 25+ messages in thread
From: Phillip Lougher @ 2023-04-26 17:38 UTC (permalink / raw)
  To: Hui Wang, Michal Hocko
  Cc: linux-mm, akpm, surenb, colin.i.king, shy828301, hannes, vbabka,
	hch, mgorman, dan.carpenter


On 26/04/2023 17:44, Phillip Lougher wrote:
>
> On 26/04/2023 12:07, Hui Wang wrote:
>>
>> On 4/26/23 16:33, Michal Hocko wrote:
>>> [CC squashfs maintainer]
>>>
>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>> If we run the stress-ng in the filesystem of squashfs, the system
>>>> will be in a state something like hang, the stress-ng couldn't
>>>> finish running and the console couldn't react to users' input.
>>>>
>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>> through debugging, we found this issue is introduced by oom handling
>>>> in the kernel.
>>>>
>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>> alloc_page(), in this case, if there is no memory left, the
>>>> out_of_memory() will be called without __GFP_FS, then the oom killer
>>>> will not be triggered and this process will loop endlessly and wait
>>>> for others to trigger oom killer to release some memory. But for a
>>>> system with the whole root filesystem constructed by squashfs,
>>>> nearly all userspace processes will call out_of_memory() without
>>>> __GFP_FS, so we will see that the system enters a state something like
>>>> hang when running stress-ng.
>>>>
>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>> __GFP_FS before returning from out_of_memory() due to without
>>>> __GFP_FS.
>>> I do not think this is an appropriate way to deal with this issue.
>>> Does it even make sense to trigger OOM killer for something like
>>> readahead? Would it be more mindful to fail the allocation instead?
>>> That being said should allocations from squashfs_readahead use
>>> __GFP_RETRY_MAYFAIL instead?
>>
>> Thanks for your comment, and this issue could hardly be reproduced on 
>> ext4 filesystem, that is because the ext4->readahead() doesn't call 
>> alloc_page(). If changing the ext4->readahead() as below, it will be 
>> easy to reproduce this issue with the ext4 filesystem (repeatedly 
>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 
>> --timeout 30s --skip-silent --verbose)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, 
>> struct folio *folio)
>>  static void ext4_readahead(struct readahead_control *rac)
>>  {
>>         struct inode *inode = rac->mapping->host;
>> +       struct page *tmp_page;
>>
>>         /* If the file has inline data, no need to do readahead. */
>>         if (ext4_has_inline_data(inode))
>>                 return;
>>
>> +       tmp_page = alloc_page(GFP_KERNEL);
>> +
>>         ext4_mpage_readpages(inode, rac, NULL);
>> +
>> +       if (tmp_page)
>> +               __free_page(tmp_page);
>>  }
>>
>>
>> BTW, I applied my patch to the linux-next and ran the oom stress-ng 
>> testcases overnight, there is no hang, oops or crash, looks like 
>> there is no big problem to use a kthread to trigger the oom killer in 
>> this case.
>>
>> And Hi squashfs maintainer, I checked the code of filesystem, looks 
>> like most filesystems will not call alloc_page() in the readahead(), 
>> could you please help take a look at this issue, thanks.
>
>
> This will be because most filesystems don't need to do so. Squashfs is 
> a compressed filesystem with large blocks covering much more than one 
> page, and it decompresses these blocks in squashfs_readahead().   If 
> __readahead_batch() does not return the full set of pages covering the 
> Squashfs block, it allocates a temporary page for the decompressors to 
> decompress into to "fill in the hole".
>
> What can be done here as far as Squashfs is concerned ....  I could 
> move the page allocation out of the readahead path (e.g. do it at 
> mount time).

You could try this patch which does that.  Compile tested only.

  fs/squashfs/page_actor.c     | 10 +---------
  fs/squashfs/page_actor.h     |  1 -
  fs/squashfs/squashfs_fs_sb.h |  1 +
  fs/squashfs/super.c          | 10 ++++++++++
  4 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
index 81af6c4ca115..6cce239eca66 100644
--- a/fs/squashfs/page_actor.c
+++ b/fs/squashfs/page_actor.c
@@ -110,15 +110,7 @@ struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_
  	if (actor == NULL)
  		return NULL;
  
-	if (msblk->decompressor->alloc_buffer) {
-		actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
-
-		if (actor->tmp_buffer == NULL) {
-			kfree(actor);
-			return NULL;
-		}
-	} else
-		actor->tmp_buffer = NULL;
+	actor->tmp_buffer = msblk->actor_page;
  
  	actor->length = length ? : pages * PAGE_SIZE;
  	actor->page = page;
diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
index 97d4983559b1..df5e999afa42 100644
--- a/fs/squashfs/page_actor.h
+++ b/fs/squashfs/page_actor.h
@@ -34,7 +34,6 @@ static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *
  {
  	struct page *last_page = actor->last_page;
  
-	kfree(actor->tmp_buffer);
  	kfree(actor);
  	return last_page;
  }
diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
index 72f6f4b37863..8feddc9e6cce 100644
--- a/fs/squashfs/squashfs_fs_sb.h
+++ b/fs/squashfs/squashfs_fs_sb.h
@@ -47,6 +47,7 @@ struct squashfs_sb_info {
  	struct squashfs_cache			*block_cache;
  	struct squashfs_cache			*fragment_cache;
  	struct squashfs_cache			*read_page;
+	void					*actor_page;
  	int					next_meta_index;
  	__le64					*id_table;
  	__le64					*fragment_index;
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index e090fae48e68..674dc187d961 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -329,6 +329,15 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
  		goto failed_mount;
  	}
  
+
+	/* Allocate page for squashfs_readahead()/squashfs_read_folio() */
+	if (msblk->decompressor->alloc_buffer) {
+		msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+
+		if(msblk->actor_page == NULL)
+			goto failed_mount;
+	}
+
  	msblk->stream = squashfs_decompressor_setup(sb, flags);
  	if (IS_ERR(msblk->stream)) {
  		err = PTR_ERR(msblk->stream);
@@ -454,6 +463,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
  	squashfs_cache_delete(msblk->block_cache);
  	squashfs_cache_delete(msblk->fragment_cache);
  	squashfs_cache_delete(msblk->read_page);
+	kfree(msblk->actor_page);
  	msblk->thread_ops->destroy(msblk);
  	kfree(msblk->inode_lookup_table);
  	kfree(msblk->fragment_index);
-- 
2.35.1

>
> Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs 
> returning I/O failures due to no memory.  That will cause a lot of 
> applications to crash in a low memory situation.  So a crash rather 
> than a hang.
>
> Phillip
>
>
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26 17:38         ` Phillip Lougher
@ 2023-04-26 18:26           ` Yang Shi
  2023-04-26 19:06             ` Phillip Lougher
  0 siblings, 1 reply; 25+ messages in thread
From: Yang Shi @ 2023-04-26 18:26 UTC (permalink / raw)
  To: Phillip Lougher
  Cc: Hui Wang, Michal Hocko, linux-mm, akpm, surenb, colin.i.king,
	hannes, vbabka, hch, mgorman, dan.carpenter

On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
<phillip@squashfs.org.uk> wrote:
>
>
> On 26/04/2023 17:44, Phillip Lougher wrote:
> >
> > On 26/04/2023 12:07, Hui Wang wrote:
> >>
> >> On 4/26/23 16:33, Michal Hocko wrote:
> >>> [CC squashfs maintainer]
> >>>
> >>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
> >>>> If we run the stress-ng in the filesystem of squashfs, the system
> >>>> will be in a state something like hang, the stress-ng couldn't
> >>>> finish running and the console couldn't react to users' input.
> >>>>
> >>>> This issue happens on all arm/arm64 platforms we are working on,
> >>>> through debugging, we found this issue is introduced by oom handling
> >>>> in the kernel.
> >>>>
> >>>> The fs->readahead() is called between memalloc_nofs_save() and
> >>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
> >>>> alloc_page(), in this case, if there is no memory left, the
> >>>> out_of_memory() will be called without __GFP_FS, then the oom killer
> >>>> will not be triggered and this process will loop endlessly and wait
> >>>> for others to trigger oom killer to release some memory. But for a
> >>>> system with the whole root filesystem constructed by squashfs,
> >>>> nearly all userspace processes will call out_of_memory() without
> >>>> __GFP_FS, so we will see that the system enters a state something like
> >>>> hang when running stress-ng.
> >>>>
> >>>> To fix it, we could trigger a kthread to call page_alloc() with
> >>>> __GFP_FS before returning from out_of_memory() due to without
> >>>> __GFP_FS.
> >>> I do not think this is an appropriate way to deal with this issue.
> >>> Does it even make sense to trigger OOM killer for something like
> >>> readahead? Would it be more mindful to fail the allocation instead?
> >>> That being said should allocations from squashfs_readahead use
> >>> __GFP_RETRY_MAYFAIL instead?
> >>
> >> Thanks for your comment, and this issue could hardly be reproduced on
> >> ext4 filesystem, that is because the ext4->readahead() doesn't call
> >> alloc_page(). If changing the ext4->readahead() as below, it will be
> >> easy to reproduce this issue with the ext4 filesystem (repeatedly
> >> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
> >> --timeout 30s --skip-silent --verbose)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index ffbbd9626bd8..8b9db0b9d0b8 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file,
> >> struct folio *folio)
> >>  static void ext4_readahead(struct readahead_control *rac)
> >>  {
> >>         struct inode *inode = rac->mapping->host;
> >> +       struct page *tmp_page;
> >>
> >>         /* If the file has inline data, no need to do readahead. */
> >>         if (ext4_has_inline_data(inode))
> >>                 return;
> >>
> >> +       tmp_page = alloc_page(GFP_KERNEL);
> >> +
> >>         ext4_mpage_readpages(inode, rac, NULL);
> >> +
> >> +       if (tmp_page)
> >> +               __free_page(tmp_page);
> >>  }
> >>
> >>
> >> BTW, I applied my patch to the linux-next and ran the oom stress-ng
> >> testcases overnight, there is no hang, oops or crash, looks like
> >> there is no big problem to use a kthread to trigger the oom killer in
> >> this case.
> >>
> >> And Hi squashfs maintainer, I checked the code of filesystem, looks
> >> like most filesystems will not call alloc_page() in the readahead(),
> >> could you please help take a look at this issue, thanks.
> >
> >
> > This will be because most filesystems don't need to do so. Squashfs is
> > a compressed filesystem with large blocks covering much more than one
> > page, and it decompresses these blocks in squashfs_readahead().   If
> > __readahead_batch() does not return the full set of pages covering the
> > Squashfs block, it allocates a temporary page for the decompressors to
> > decompress into to "fill in the hole".
> >
> > What can be done here as far as Squashfs is concerned ....  I could
> > move the page allocation out of the readahead path (e.g. do it at
> > mount time).
>
> You could try this patch which does that.  Compile tested only.

The kmalloc_array() may call alloc_page() and trigger this problem too,
IIUC. Should it be pre-allocated as well?

>
>   fs/squashfs/page_actor.c     | 10 +---------
>   fs/squashfs/page_actor.h     |  1 -
>   fs/squashfs/squashfs_fs_sb.h |  1 +
>   fs/squashfs/super.c          | 10 ++++++++++
>   4 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
> index 81af6c4ca115..6cce239eca66 100644
> --- a/fs/squashfs/page_actor.c
> +++ b/fs/squashfs/page_actor.c
> @@ -110,15 +110,7 @@ struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_
>         if (actor == NULL)
>                 return NULL;
>
> -       if (msblk->decompressor->alloc_buffer) {
> -               actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
> -
> -               if (actor->tmp_buffer == NULL) {
> -                       kfree(actor);
> -                       return NULL;
> -               }
> -       } else
> -               actor->tmp_buffer = NULL;
> +       actor->tmp_buffer = msblk->actor_page;
>
>         actor->length = length ? : pages * PAGE_SIZE;
>         actor->page = page;
> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
> index 97d4983559b1..df5e999afa42 100644
> --- a/fs/squashfs/page_actor.h
> +++ b/fs/squashfs/page_actor.h
> @@ -34,7 +34,6 @@ static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *
>   {
>         struct page *last_page = actor->last_page;
>
> -       kfree(actor->tmp_buffer);
>         kfree(actor);
>         return last_page;
>   }
> diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
> index 72f6f4b37863..8feddc9e6cce 100644
> --- a/fs/squashfs/squashfs_fs_sb.h
> +++ b/fs/squashfs/squashfs_fs_sb.h
> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>         struct squashfs_cache                   *block_cache;
>         struct squashfs_cache                   *fragment_cache;
>         struct squashfs_cache                   *read_page;
> +       void                                    *actor_page;
>         int                                     next_meta_index;
>         __le64                                  *id_table;
>         __le64                                  *fragment_index;
> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
> index e090fae48e68..674dc187d961 100644
> --- a/fs/squashfs/super.c
> +++ b/fs/squashfs/super.c
> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>                 goto failed_mount;
>         }
>
> +
> +       /* Allocate page for squashfs_readahead()/squashfs_read_folio() */
> +       if (msblk->decompressor->alloc_buffer) {
> +               msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +
> +               if(msblk->actor_page == NULL)
> +                       goto failed_mount;
> +       }
> +
>         msblk->stream = squashfs_decompressor_setup(sb, flags);
>         if (IS_ERR(msblk->stream)) {
>                 err = PTR_ERR(msblk->stream);
> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>         squashfs_cache_delete(msblk->block_cache);
>         squashfs_cache_delete(msblk->fragment_cache);
>         squashfs_cache_delete(msblk->read_page);
> +       kfree(msblk->actor_page);
>         msblk->thread_ops->destroy(msblk);
>         kfree(msblk->inode_lookup_table);
>         kfree(msblk->fragment_index);
> --
> 2.35.1
>
> >
> > Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs
> > returning I/O failures due to no memory.  That will cause a lot of
> > applications to crash in a low memory situation.  So a crash rather
> > than a hang.
> >
> > Phillip
> >
> >
> >
> >


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26 18:26           ` Yang Shi
@ 2023-04-26 19:06             ` Phillip Lougher
  2023-04-26 19:34               ` Phillip Lougher
  0 siblings, 1 reply; 25+ messages in thread
From: Phillip Lougher @ 2023-04-26 19:06 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hui Wang, Michal Hocko, linux-mm, akpm, surenb, colin.i.king,
	hannes, vbabka, hch, mgorman, dan.carpenter


On 26/04/2023 19:26, Yang Shi wrote:
> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
> <phillip@squashfs.org.uk> wrote:
>>
>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>> [CC squashfs maintainer]
>>>>>
>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>> If we run the stress-ng in the filesystem of squashfs, the system
>>>>>> will be in a state something like hang, the stress-ng couldn't
>>>>>> finish running and the console couldn't react to users' input.
>>>>>>
>>>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>>>> through debugging, we found this issue is introduced by oom handling
>>>>>> in the kernel.
>>>>>>
>>>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>>>> alloc_page(), in this case, if there is no memory left, the
>>>>>> out_of_memory() will be called without __GFP_FS, then the oom killer
>>>>>> will not be triggered and this process will loop endlessly and wait
>>>>>> for others to trigger oom killer to release some memory. But for a
>>>>>> system with the whole root filesystem constructed by squashfs,
>>>>>> nearly all userspace processes will call out_of_memory() without
>>>>>> __GFP_FS, so we will see that the system enters a state something like
>>>>>> hang when running stress-ng.
>>>>>>
>>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>>> __GFP_FS before returning from out_of_memory() due to without
>>>>>> __GFP_FS.
>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>> Does it even make sense to trigger OOM killer for something like
>>>>> readahead? Would it be more mindful to fail the allocation instead?
>>>>> That being said should allocations from squashfs_readahead use
>>>>> __GFP_RETRY_MAYFAIL instead?
>>>> Thanks for your comment, and this issue could hardly be reproduced on
>>>> ext4 filesystem, that is because the ext4->readahead() doesn't call
>>>> alloc_page(). If changing the ext4->readahead() as below, it will be
>>>> easy to reproduce this issue with the ext4 filesystem (repeatedly
>>>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>> --timeout 30s --skip-silent --verbose)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file,
>>>> struct folio *folio)
>>>>   static void ext4_readahead(struct readahead_control *rac)
>>>>   {
>>>>          struct inode *inode = rac->mapping->host;
>>>> +       struct page *tmp_page;
>>>>
>>>>          /* If the file has inline data, no need to do readahead. */
>>>>          if (ext4_has_inline_data(inode))
>>>>                  return;
>>>>
>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>> +
>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>> +
>>>> +       if (tmp_page)
>>>> +               __free_page(tmp_page);
>>>>   }
>>>>
>>>>
>>>> BTW, I applied my patch to the linux-next and ran the oom stress-ng
>>>> testcases overnight, there is no hang, oops or crash, looks like
>>>> there is no big problem to use a kthread to trigger the oom killer in
>>>> this case.
>>>>
>>>> And Hi squashfs maintainer, I checked the code of filesystem, looks
>>>> like most filesystems will not call alloc_page() in the readahead(),
>>>> could you please help take a look at this issue, thanks.
>>>
>>> This will be because most filesystems don't need to do so. Squashfs is
>>> a compressed filesystem with large blocks covering much more than one
>>> page, and it decompresses these blocks in squashfs_readahead().   If
>>> __readahead_batch() does not return the full set of pages covering the
>>> Squashfs block, it allocates a temporary page for the decompressors to
>>> decompress into to "fill in the hole".
>>>
>>> What can be done here as far as Squashfs is concerned ....  I could
>>> move the page allocation out of the readahead path (e.g. do it at
>>> mount time).
>> You could try this patch which does that.  Compile tested only.
> The kmalloc_array() may call alloc_page() to trigger this problem too
> IIUC. It should be pre-allocated as well?


That is a much smaller allocation, and so it entirely depends on
whether it is an issue or not.  There are also a number of other small memory 
allocations in the path as well.

The whole point of this patch is to move the *biggest* allocation which 
is the reported issue and then see what happens.   No point in making 
this test patch more involved and complex than necessary at this point.

Phillip

>>    fs/squashfs/page_actor.c     | 10 +---------
>>    fs/squashfs/page_actor.h     |  1 -
>>    fs/squashfs/squashfs_fs_sb.h |  1 +
>>    fs/squashfs/super.c          | 10 ++++++++++
>>    4 files changed, 12 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
>> index 81af6c4ca115..6cce239eca66 100644
>> --- a/fs/squashfs/page_actor.c
>> +++ b/fs/squashfs/page_actor.c
>> @@ -110,15 +110,7 @@ struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_
>>          if (actor == NULL)
>>                  return NULL;
>>
>> -       if (msblk->decompressor->alloc_buffer) {
>> -               actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> -
>> -               if (actor->tmp_buffer == NULL) {
>> -                       kfree(actor);
>> -                       return NULL;
>> -               }
>> -       } else
>> -               actor->tmp_buffer = NULL;
>> +       actor->tmp_buffer = msblk->actor_page;
>>
>>          actor->length = length ? : pages * PAGE_SIZE;
>>          actor->page = page;
>> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
>> index 97d4983559b1..df5e999afa42 100644
>> --- a/fs/squashfs/page_actor.h
>> +++ b/fs/squashfs/page_actor.h
>> @@ -34,7 +34,6 @@ static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *
>>    {
>>          struct page *last_page = actor->last_page;
>>
>> -       kfree(actor->tmp_buffer);
>>          kfree(actor);
>>          return last_page;
>>    }
>> diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
>> index 72f6f4b37863..8feddc9e6cce 100644
>> --- a/fs/squashfs/squashfs_fs_sb.h
>> +++ b/fs/squashfs/squashfs_fs_sb.h
>> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>>          struct squashfs_cache                   *block_cache;
>>          struct squashfs_cache                   *fragment_cache;
>>          struct squashfs_cache                   *read_page;
>> +       void                                    *actor_page;
>>          int                                     next_meta_index;
>>          __le64                                  *id_table;
>>          __le64                                  *fragment_index;
>> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
>> index e090fae48e68..674dc187d961 100644
>> --- a/fs/squashfs/super.c
>> +++ b/fs/squashfs/super.c
>> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>>                  goto failed_mount;
>>          }
>>
>> +
>> +       /* Allocate page for squashfs_readahead()/squashfs_read_folio() */
>> +       if (msblk->decompressor->alloc_buffer) {
>> +               msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> +
>> +               if(msblk->actor_page == NULL)
>> +                       goto failed_mount;
>> +       }
>> +
>>          msblk->stream = squashfs_decompressor_setup(sb, flags);
>>          if (IS_ERR(msblk->stream)) {
>>                  err = PTR_ERR(msblk->stream);
>> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>>          squashfs_cache_delete(msblk->block_cache);
>>          squashfs_cache_delete(msblk->fragment_cache);
>>          squashfs_cache_delete(msblk->read_page);
>> +       kfree(msblk->actor_page);
>>          msblk->thread_ops->destroy(msblk);
>>          kfree(msblk->inode_lookup_table);
>>          kfree(msblk->fragment_index);
>> --
>> 2.35.1
>>
>>> Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs
>>> returning I/O failures due to no memory.  That will cause a lot of
>>> applications to crash in a low memory situation.  So a crash rather
>>> than a hang.
>>>
>>> Phillip
>>>
>>>
>>>
>>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26 19:06             ` Phillip Lougher
@ 2023-04-26 19:34               ` Phillip Lougher
  2023-04-27  0:42                 ` Hui Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Phillip Lougher @ 2023-04-26 19:34 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hui Wang, Michal Hocko, linux-mm, akpm, surenb, colin.i.king,
	hannes, vbabka, hch, mgorman, dan.carpenter


On 26/04/2023 20:06, Phillip Lougher wrote:
>
> On 26/04/2023 19:26, Yang Shi wrote:
>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
>> <phillip@squashfs.org.uk> wrote:
>>>
>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>> [CC squashfs maintainer]
>>>>>>
>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>> If we run the stress-ng in the filesystem of squashfs, the system
>>>>>>> will be in a state something like hang, the stress-ng couldn't
>>>>>>> finish running and the console couldn't react to users' input.
>>>>>>>
>>>>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>>>>> through debugging, we found this issue is introduced by oom 
>>>>>>> handling
>>>>>>> in the kernel.
>>>>>>>
>>>>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>>>>> alloc_page(), in this case, if there is no memory left, the
>>>>>>> out_of_memory() will be called without __GFP_FS, then the oom 
>>>>>>> killer
>>>>>>> will not be triggered and this process will loop endlessly and wait
>>>>>>> for others to trigger oom killer to release some memory. But for a
>>>>>>> system with the whole root filesystem constructed by squashfs,
>>>>>>> nearly all userspace processes will call out_of_memory() without
>>>>>>> __GFP_FS, so we will see that the system enters a state 
>>>>>>> something like
>>>>>>> hang when running stress-ng.
>>>>>>>
>>>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>>>> __GFP_FS before returning from out_of_memory() due to without
>>>>>>> __GFP_FS.
>>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>>> Does it even make sense to trigger OOM killer for something like
>>>>>> readahead? Would it be more mindful to fail the allocation instead?
>>>>>> That being said should allocations from squashfs_readahead use
>>>>>> __GFP_RETRY_MAYFAIL instead?
>>>>> Thanks for your comment, and this issue could hardly be reproduced on
>>>>> ext4 filesystem, that is because the ext4->readahead() doesn't call
>>>>> alloc_page(). If changing the ext4->readahead() as below, it will be
>>>>> easy to reproduce this issue with the ext4 filesystem (repeatedly
>>>>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>>> --timeout 30s --skip-silent --verbose)
>>>>>
>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>> --- a/fs/ext4/inode.c
>>>>> +++ b/fs/ext4/inode.c
>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file,
>>>>> struct folio *folio)
>>>>>   static void ext4_readahead(struct readahead_control *rac)
>>>>>   {
>>>>>          struct inode *inode = rac->mapping->host;
>>>>> +       struct page *tmp_page;
>>>>>
>>>>>          /* If the file has inline data, no need to do readahead. */
>>>>>          if (ext4_has_inline_data(inode))
>>>>>                  return;
>>>>>
>>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>>> +
>>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>>> +
>>>>> +       if (tmp_page)
>>>>> +               __free_page(tmp_page);
>>>>>   }
>>>>>
>>>>>
>>>>> BTW, I applied my patch to the linux-next and ran the oom stress-ng
>>>>> testcases overnight, there is no hang, oops or crash, looks like
>>>>> there is no big problem to use a kthread to trigger the oom killer in
>>>>> this case.
>>>>>
>>>>> And Hi squashfs maintainer, I checked the code of filesystem, looks
>>>>> like most filesystems will not call alloc_page() in the readahead(),
>>>>> could you please help take a look at this issue, thanks.
>>>>
>>>> This will be because most filesystems don't need to do so. Squashfs is
>>>> a compressed filesystem with large blocks covering much more than one
>>>> page, and it decompresses these blocks in squashfs_readahead().   If
>>>> __readahead_batch() does not return the full set of pages covering the
>>>> Squashfs block, it allocates a temporary page for the decompressors to
>>>> decompress into to "fill in the hole".
>>>>
>>>> What can be done here as far as Squashfs is concerned .... I could
>>>> move the page allocation out of the readahead path (e.g. do it at
>>>> mount time).
>>> You could try this patch which does that.  Compile tested only.
>> The kmalloc_array() may call alloc_page() to trigger this problem too
>> IIUC. It should be pre-allocated as well?
>
>
> That is a much smaller allocation, and so it entirely depends whether 
> it is an issue or not.  There are also a number of other small memory 
> allocations in the path as well.
>
> The whole point of this patch is to move the *biggest* allocation 
> which is the reported issue and then see what happens.   No point in 
> making this test patch more involved and complex than necessary at 
> this point.
>
> Phillip
>

Also be aware this stress-ng triggered issue is new, and apparently 
didn't occur last year.   So it is reasonable to assume the issue has 
been introduced as a side effect of the readahead improvements.  One of 
these is this allocation of a temporary page to decompress into rather 
than falling back to entirely decompressing into a pre-allocated buffer 
(allocated at mount time).  The small memory allocations have been there 
for many years.

Allocating the page at mount time effectively puts the memory allocation 
situation back to how it was last year before the readahead work.

Phillip


>>>    fs/squashfs/page_actor.c     | 10 +---------
>>>    fs/squashfs/page_actor.h     |  1 -
>>>    fs/squashfs/squashfs_fs_sb.h |  1 +
>>>    fs/squashfs/super.c          | 10 ++++++++++
>>>    4 files changed, 12 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
>>> index 81af6c4ca115..6cce239eca66 100644
>>> --- a/fs/squashfs/page_actor.c
>>> +++ b/fs/squashfs/page_actor.c
>>> @@ -110,15 +110,7 @@ struct squashfs_page_actor 
>>> *squashfs_page_actor_init_special(struct squashfs_sb_
>>>          if (actor == NULL)
>>>                  return NULL;
>>>
>>> -       if (msblk->decompressor->alloc_buffer) {
>>> -               actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>> -
>>> -               if (actor->tmp_buffer == NULL) {
>>> -                       kfree(actor);
>>> -                       return NULL;
>>> -               }
>>> -       } else
>>> -               actor->tmp_buffer = NULL;
>>> +       actor->tmp_buffer = msblk->actor_page;
>>>
>>>          actor->length = length ? : pages * PAGE_SIZE;
>>>          actor->page = page;
>>> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
>>> index 97d4983559b1..df5e999afa42 100644
>>> --- a/fs/squashfs/page_actor.h
>>> +++ b/fs/squashfs/page_actor.h
>>> @@ -34,7 +34,6 @@ static inline struct page 
>>> *squashfs_page_actor_free(struct squashfs_page_actor *
>>>    {
>>>          struct page *last_page = actor->last_page;
>>>
>>> -       kfree(actor->tmp_buffer);
>>>          kfree(actor);
>>>          return last_page;
>>>    }
>>> diff --git a/fs/squashfs/squashfs_fs_sb.h 
>>> b/fs/squashfs/squashfs_fs_sb.h
>>> index 72f6f4b37863..8feddc9e6cce 100644
>>> --- a/fs/squashfs/squashfs_fs_sb.h
>>> +++ b/fs/squashfs/squashfs_fs_sb.h
>>> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>>>          struct squashfs_cache                   *block_cache;
>>>          struct squashfs_cache *fragment_cache;
>>>          struct squashfs_cache                   *read_page;
>>> +       void                                    *actor_page;
>>>          int next_meta_index;
>>>          __le64                                  *id_table;
>>>          __le64 *fragment_index;
>>> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
>>> index e090fae48e68..674dc187d961 100644
>>> --- a/fs/squashfs/super.c
>>> +++ b/fs/squashfs/super.c
>>> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct 
>>> super_block *sb, struct fs_context *fc)
>>>                  goto failed_mount;
>>>          }
>>>
>>> +
>>> +       /* Allocate page for 
>>> squashfs_readahead()/squashfs_read_folio() */
>>> +       if (msblk->decompressor->alloc_buffer) {
>>> +               msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>> +
>>> +               if(msblk->actor_page == NULL)
>>> +                       goto failed_mount;
>>> +       }
>>> +
>>>          msblk->stream = squashfs_decompressor_setup(sb, flags);
>>>          if (IS_ERR(msblk->stream)) {
>>>                  err = PTR_ERR(msblk->stream);
>>> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct 
>>> super_block *sb, struct fs_context *fc)
>>>          squashfs_cache_delete(msblk->block_cache);
>>>          squashfs_cache_delete(msblk->fragment_cache);
>>>          squashfs_cache_delete(msblk->read_page);
>>> +       kfree(msblk->actor_page);
>>>          msblk->thread_ops->destroy(msblk);
>>>          kfree(msblk->inode_lookup_table);
>>>          kfree(msblk->fragment_index);
>>> -- 
>>> 2.35.1
>>>
>>>> Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs
>>>> returning I/O failures due to no memory.  That will cause a lot of
>>>> applications to crash in a low memory situation.  So a crash rather
>>>> than a hang.
>>>>
>>>> Phillip
>>>>
>>>>
>>>>
>>>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26 19:34               ` Phillip Lougher
@ 2023-04-27  0:42                 ` Hui Wang
  2023-04-27  1:37                   ` Phillip Lougher
  0 siblings, 1 reply; 25+ messages in thread
From: Hui Wang @ 2023-04-27  0:42 UTC (permalink / raw)
  To: Phillip Lougher, Yang Shi
  Cc: Michal Hocko, linux-mm, akpm, surenb, colin.i.king, hannes,
	vbabka, hch, mgorman


On 4/27/23 03:34, Phillip Lougher wrote:
>
> On 26/04/2023 20:06, Phillip Lougher wrote:
>>
>> On 26/04/2023 19:26, Yang Shi wrote:
>>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
>>> <phillip@squashfs.org.uk> wrote:
>>>>
>>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>>> [CC squashfs maintainer]
>>>>>>>
>>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>>> If we run the stress-ng in the filesystem of squashfs, the system
>>>>>>>> will be in a state something like hang, the stress-ng couldn't
>>>>>>>> finish running and the console couldn't react to users' input.
>>>>>>>>
>>>>>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>>>>>> through debugging, we found this issue is introduced by oom 
>>>>>>>> handling
>>>>>>>> in the kernel.
>>>>>>>>
>>>>>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>>>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>>>>>> alloc_page(), in this case, if there is no memory left, the
>>>>>>>> out_of_memory() will be called without __GFP_FS, then the oom 
>>>>>>>> killer
>>>>>>>> will not be triggered and this process will loop endlessly and 
>>>>>>>> wait
>>>>>>>> for others to trigger oom killer to release some memory. But for a
>>>>>>>> system with the whole root filesystem constructed by squashfs,
>>>>>>>> nearly all userspace processes will call out_of_memory() without
>>>>>>>> __GFP_FS, so we will see that the system enters a state 
>>>>>>>> something like
>>>>>>>> hang when running stress-ng.
>>>>>>>>
>>>>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>>>>> __GFP_FS before returning from out_of_memory() due to without
>>>>>>>> __GFP_FS.
>>>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>>>> Does it even make sense to trigger OOM killer for something like
>>>>>>> readahead? Would it be more mindful to fail the allocation instead?
>>>>>>> That being said should allocations from squashfs_readahead use
>>>>>>> __GFP_RETRY_MAYFAIL instead?
>>>>>> Thanks for your comment, and this issue could hardly be 
>>>>>> reproduced on
>>>>>> ext4 filesystem, that is because the ext4->readahead() doesn't call
>>>>>> alloc_page(). If changing the ext4->readahead() as below, it will be
>>>>>> easy to reproduce this issue with the ext4 filesystem (repeatedly
>>>>>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>>>> --timeout 30s --skip-silent --verbose)
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file 
>>>>>> *file,
>>>>>> struct folio *folio)
>>>>>>   static void ext4_readahead(struct readahead_control *rac)
>>>>>>   {
>>>>>>          struct inode *inode = rac->mapping->host;
>>>>>> +       struct page *tmp_page;
>>>>>>
>>>>>>          /* If the file has inline data, no need to do readahead. */
>>>>>>          if (ext4_has_inline_data(inode))
>>>>>>                  return;
>>>>>>
>>>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>>>> +
>>>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>>>> +
>>>>>> +       if (tmp_page)
>>>>>> +               __free_page(tmp_page);
>>>>>>   }
>>>>>>
>>>>>>
>>>>>> BTW, I applied my patch to the linux-next and ran the oom stress-ng
>>>>>> testcases overnight, there is no hang, oops or crash, looks like
>>>>>> there is no big problem to use a kthread to trigger the oom 
>>>>>> killer in
>>>>>> this case.
>>>>>>
>>>>>> And Hi squashfs maintainer, I checked the code of filesystem, looks
>>>>>> like most filesystems will not call alloc_page() in the readahead(),
>>>>>> could you please help take a look at this issue, thanks.
>>>>>
>>>>> This will be because most filesystems don't need to do so. 
>>>>> Squashfs is
>>>>> a compressed filesystem with large blocks covering much more than one
>>>>> page, and it decompresses these blocks in squashfs_readahead().   If
>>>>> __readahead_batch() does not return the full set of pages covering 
>>>>> the
>>>>> Squashfs block, it allocates a temporary page for the 
>>>>> decompressors to
>>>>> decompress into to "fill in the hole".
>>>>>
>>>>> What can be done here as far as Squashfs is concerned .... I could
>>>>> move the page allocation out of the readahead path (e.g. do it at
>>>>> mount time).
>>>> You could try this patch which does that.  Compile tested only.
>>> The kmalloc_array() may call alloc_page() to trigger this problem too
>>> IIUC. It should be pre-allocated as well?
>>
>>
>> That is a much smaller allocation, and so it entirely depends whether 
>> it is an issue or not.  There are also a number of other small memory 
>> allocations in the path as well.
>>
>> The whole point of this patch is to move the *biggest* allocation 
>> which is the reported issue and then see what happens.   No point in 
>> making this test patch more involved and complex than necessary at 
>> this point.
>>
>> Phillip
>>
>
> Also be aware this stress-ng triggered issue is new, and apparently 
> didn't occur last year.   So it is reasonable to assume the issue has 
> been introduced as a side effect of the readahead improvements.  One 
> of these is this allocation of a temporary page to decompress into 
> rather than falling back to entirely decompressing into a 
> pre-allocated buffer (allocated at mount time).  The small memory 
> allocations have been there for many years.
>
> Allocating the page at mount time effectively puts the memory 
> allocation situation back to how it was last year before the readahead 
> work.
>
> Phillip
>
Thanks Phillip and Yang.

And Phillip,

I tested your change, but it didn't help. According to my debugging, the OOM 
happens when allocating memory for the bio, at the line 
"struct page *page = alloc_page(GFP_NOIO);" in squashfs_bio_read(). 
Other filesystems just use the pages already present in the "struct 
readahead_control" for the bio, but squashfs allocates a new page for the 
bio (presumably because squashfs is a compressed filesystem).

BTW, this is not a new issue for squashfs: we have uc20 (linux-5.4 
kernel) and uc22 (linux-5.15 kernel), and both have this issue. The issue 
already existed in squashfs_readpage() in the 5.4 kernel.

I guess that if we could use pre-allocated memory for the bio, it would 
help; a rough sketch of that idea is below.
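
A minimal sketch of that idea (illustrative only, not a proposed patch: 
the bio_page_pool field and the helper names below are invented, and the 
pool size is a guess):

#include <linux/mempool.h>
#include <linux/bio.h>

/* called from squashfs_fill_super(): reserve enough pages for one bio */
static int squashfs_create_bio_pool(struct squashfs_sb_info *msblk)
{
	msblk->bio_page_pool = mempool_create_page_pool(BIO_MAX_VECS, 0);
	return msblk->bio_page_pool ? 0 : -ENOMEM;
}

/*
 * Used in squashfs_bio_read() instead of alloc_page(GFP_NOIO):
 * mempool_alloc() falls back to the reserved pages and waits for one to
 * be returned rather than looping in the page allocator.
 */
static struct page *squashfs_get_bio_page(struct squashfs_sb_info *msblk)
{
	return mempool_alloc(msblk->bio_page_pool, GFP_NOIO);
}

/* bio completion: give the page back to the pool instead of __free_page() */
static void squashfs_put_bio_page(struct squashfs_sb_info *msblk,
				  struct page *page)
{
	mempool_free(page, msblk->bio_page_pool);
}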

Thanks,

Hui.



>
>>>>    fs/squashfs/page_actor.c     | 10 +---------
>>>>    fs/squashfs/page_actor.h     |  1 -
>>>>    fs/squashfs/squashfs_fs_sb.h |  1 +
>>>>    fs/squashfs/super.c          | 10 ++++++++++
>>>>    4 files changed, 12 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
>>>> index 81af6c4ca115..6cce239eca66 100644
>>>> --- a/fs/squashfs/page_actor.c
>>>> +++ b/fs/squashfs/page_actor.c
>>>> @@ -110,15 +110,7 @@ struct squashfs_page_actor 
>>>> *squashfs_page_actor_init_special(struct squashfs_sb_
>>>>          if (actor == NULL)
>>>>                  return NULL;
>>>>
>>>> -       if (msblk->decompressor->alloc_buffer) {
>>>> -               actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>>> -
>>>> -               if (actor->tmp_buffer == NULL) {
>>>> -                       kfree(actor);
>>>> -                       return NULL;
>>>> -               }
>>>> -       } else
>>>> -               actor->tmp_buffer = NULL;
>>>> +       actor->tmp_buffer = msblk->actor_page;
>>>>
>>>>          actor->length = length ? : pages * PAGE_SIZE;
>>>>          actor->page = page;
>>>> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
>>>> index 97d4983559b1..df5e999afa42 100644
>>>> --- a/fs/squashfs/page_actor.h
>>>> +++ b/fs/squashfs/page_actor.h
>>>> @@ -34,7 +34,6 @@ static inline struct page 
>>>> *squashfs_page_actor_free(struct squashfs_page_actor *
>>>>    {
>>>>          struct page *last_page = actor->last_page;
>>>>
>>>> -       kfree(actor->tmp_buffer);
>>>>          kfree(actor);
>>>>          return last_page;
>>>>    }
>>>> diff --git a/fs/squashfs/squashfs_fs_sb.h 
>>>> b/fs/squashfs/squashfs_fs_sb.h
>>>> index 72f6f4b37863..8feddc9e6cce 100644
>>>> --- a/fs/squashfs/squashfs_fs_sb.h
>>>> +++ b/fs/squashfs/squashfs_fs_sb.h
>>>> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>>>>          struct squashfs_cache *block_cache;
>>>>          struct squashfs_cache *fragment_cache;
>>>>          struct squashfs_cache                   *read_page;
>>>> +       void                                    *actor_page;
>>>>          int next_meta_index;
>>>>          __le64                                  *id_table;
>>>>          __le64 *fragment_index;
>>>> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
>>>> index e090fae48e68..674dc187d961 100644
>>>> --- a/fs/squashfs/super.c
>>>> +++ b/fs/squashfs/super.c
>>>> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct 
>>>> super_block *sb, struct fs_context *fc)
>>>>                  goto failed_mount;
>>>>          }
>>>>
>>>> +
>>>> +       /* Allocate page for 
>>>> squashfs_readahead()/squashfs_read_folio() */
>>>> +       if (msblk->decompressor->alloc_buffer) {
>>>> +               msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>>> +
>>>> +               if(msblk->actor_page == NULL)
>>>> +                       goto failed_mount;
>>>> +       }
>>>> +
>>>>          msblk->stream = squashfs_decompressor_setup(sb, flags);
>>>>          if (IS_ERR(msblk->stream)) {
>>>>                  err = PTR_ERR(msblk->stream);
>>>> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct 
>>>> super_block *sb, struct fs_context *fc)
>>>>          squashfs_cache_delete(msblk->block_cache);
>>>>          squashfs_cache_delete(msblk->fragment_cache);
>>>>          squashfs_cache_delete(msblk->read_page);
>>>> +       kfree(msblk->actor_page);
>>>>          msblk->thread_ops->destroy(msblk);
>>>>          kfree(msblk->inode_lookup_table);
>>>>          kfree(msblk->fragment_index);
>>>> -- 
>>>> 2.35.1
>>>>
>>>>> Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs
>>>>> returning I/O failures due to no memory.  That will cause a lot of
>>>>> applications to crash in a low memory situation.  So a crash rather
>>>>> than a hang.
>>>>>
>>>>> Phillip
>>>>>
>>>>>
>>>>>
>>>>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-26 11:07     ` Hui Wang
  2023-04-26 16:44       ` Phillip Lougher
@ 2023-04-27  1:18       ` Gao Xiang
  2023-04-27  3:47         ` Hui Wang
  1 sibling, 1 reply; 25+ messages in thread
From: Gao Xiang @ 2023-04-27  1:18 UTC (permalink / raw)
  To: Hui Wang, Michal Hocko
  Cc: linux-mm, akpm, surenb, colin.i.king, shy828301, hannes, vbabka,
	hch, mgorman, dan.carpenter, Phillip Lougher, Yang Shi



On 2023/4/26 19:07, Hui Wang wrote:
> 
> On 4/26/23 16:33, Michal Hocko wrote:
>> [CC squashfs maintainer]
>>
>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>> If we run the stress-ng in the filesystem of squashfs, the system
>>> will be in a state something like hang, the stress-ng couldn't
>>> finish running and the console couldn't react to users' input.
>>>
>>> This issue happens on all arm/arm64 platforms we are working on,
>>> through debugging, we found this issue is introduced by oom handling
>>> in the kernel.
>>>
>>> The fs->readahead() is called between memalloc_nofs_save() and
>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>> alloc_page(), in this case, if there is no memory left, the
>>> out_of_memory() will be called without __GFP_FS, then the oom killer
>>> will not be triggered and this process will loop endlessly and wait
>>> for others to trigger oom killer to release some memory. But for a
>>> system with the whole root filesystem constructed by squashfs,
>>> nearly all userspace processes will call out_of_memory() without
>>> __GFP_FS, so we will see that the system enters a state something like
>>> hang when running stress-ng.
>>>
>>> To fix it, we could trigger a kthread to call page_alloc() with
>>> __GFP_FS before returning from out_of_memory() due to without
>>> __GFP_FS.
>> I do not think this is an appropriate way to deal with this issue.
>> Does it even make sense to trigger OOM killer for something like
>> readahead? Would it be more mindful to fail the allocation instead?
>> That being said should allocations from squashfs_readahead use
>> __GFP_RETRY_MAYFAIL instead?
> 
> Thanks for your comment, and this issue could hardly be reproduced on ext4 filesystem, that is because the ext4->readahead() doesn't call alloc_page(). If changing the ext4->readahead() as below, it will be easy to reproduce this issue with the ext4 filesystem (repeatedly run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 --timeout 30s --skip-silent --verbose)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ffbbd9626bd8..8b9db0b9d0b8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>   static void ext4_readahead(struct readahead_control *rac)
>   {
>          struct inode *inode = rac->mapping->host;
> +       struct page *tmp_page;
> 
>          /* If the file has inline data, no need to do readahead. */
>          if (ext4_has_inline_data(inode))
>                  return;
> 
> +       tmp_page = alloc_page(GFP_KERNEL);
> +
>          ext4_mpage_readpages(inode, rac, NULL);
> +
> +       if (tmp_page)
> +               __free_page(tmp_page);
>   }
> 

Is it tested with a pure ext4 without any other fs background?

I don't think it's true that "ext4->readahead() doesn't call
alloc_page()" since I think even ext2/ext4 uses buffer head
interfaces to read metadata (extents or old block mapping)
from its bd_inode for readahead, which indirectly allocates
some extra pages to page cache as well.

The only difference here is the total number of pages to be
allocated, but all the extra allocations needed for compressed data
make things worse.  So I think it depends a lot on how stressful your
stress workload is, and I'm not even sure it's a real issue, since if
you stop the stress workload the system will recover immediately (it
just may not oom directly).

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-27  0:42                 ` Hui Wang
@ 2023-04-27  1:37                   ` Phillip Lougher
  2023-04-27  5:22                     ` Hui Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Phillip Lougher @ 2023-04-27  1:37 UTC (permalink / raw)
  To: Hui Wang, Yang Shi
  Cc: Michal Hocko, linux-mm, akpm, surenb, colin.i.king, hannes,
	vbabka, hch, mgorman


On 27/04/2023 01:42, Hui Wang wrote:
>
> On 4/27/23 03:34, Phillip Lougher wrote:
>>
>> On 26/04/2023 20:06, Phillip Lougher wrote:
>>>
>>> On 26/04/2023 19:26, Yang Shi wrote:
>>>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
>>>> <phillip@squashfs.org.uk> wrote:
>>>>>
>>>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>>>> [CC squashfs maintainer]
>>>>>>>>
>>>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>>>> If we run the stress-ng in the filesystem of squashfs, the system
>>>>>>>>> will be in a state something like hang, the stress-ng couldn't
>>>>>>>>> finish running and the console couldn't react to users' input.
>>>>>>>>>
>>>>>>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>>>>>>> through debugging, we found this issue is introduced by oom 
>>>>>>>>> handling
>>>>>>>>> in the kernel.
>>>>>>>>>
>>>>>>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>>>>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>>>>>>> alloc_page(), in this case, if there is no memory left, the
>>>>>>>>> out_of_memory() will be called without __GFP_FS, then the oom 
>>>>>>>>> killer
>>>>>>>>> will not be triggered and this process will loop endlessly and 
>>>>>>>>> wait
>>>>>>>>> for others to trigger oom killer to release some memory. But 
>>>>>>>>> for a
>>>>>>>>> system with the whole root filesystem constructed by squashfs,
>>>>>>>>> nearly all userspace processes will call out_of_memory() without
>>>>>>>>> __GFP_FS, so we will see that the system enters a state 
>>>>>>>>> something like
>>>>>>>>> hang when running stress-ng.
>>>>>>>>>
>>>>>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>>>>>> __GFP_FS before returning from out_of_memory() due to without
>>>>>>>>> __GFP_FS.
>>>>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>>>>> Does it even make sense to trigger OOM killer for something like
>>>>>>>> readahead? Would it be more mindful to fail the allocation 
>>>>>>>> instead?
>>>>>>>> That being said should allocations from squashfs_readahead use
>>>>>>>> __GFP_RETRY_MAYFAIL instead?
>>>>>>> Thanks for your comment, and this issue could hardly be 
>>>>>>> reproduced on
>>>>>>> ext4 filesystem, that is because the ext4->readahead() doesn't call
>>>>>>> alloc_page(). If changing the ext4->readahead() as below, it 
>>>>>>> will be
>>>>>>> easy to reproduce this issue with the ext4 filesystem (repeatedly
>>>>>>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>>>>> --timeout 30s --skip-silent --verbose)
>>>>>>>
>>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>>>> --- a/fs/ext4/inode.c
>>>>>>> +++ b/fs/ext4/inode.c
>>>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file 
>>>>>>> *file,
>>>>>>> struct folio *folio)
>>>>>>>   static void ext4_readahead(struct readahead_control *rac)
>>>>>>>   {
>>>>>>>          struct inode *inode = rac->mapping->host;
>>>>>>> +       struct page *tmp_page;
>>>>>>>
>>>>>>>          /* If the file has inline data, no need to do 
>>>>>>> readahead. */
>>>>>>>          if (ext4_has_inline_data(inode))
>>>>>>>                  return;
>>>>>>>
>>>>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>>>>> +
>>>>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>>>>> +
>>>>>>> +       if (tmp_page)
>>>>>>> +               __free_page(tmp_page);
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>> BTW, I applied my patch to the linux-next and ran the oom stress-ng
>>>>>>> testcases overnight, there is no hang, oops or crash, looks like
>>>>>>> there is no big problem to use a kthread to trigger the oom 
>>>>>>> killer in
>>>>>>> this case.
>>>>>>>
>>>>>>> And Hi squashfs maintainer, I checked the code of filesystem, looks
>>>>>>> like most filesystems will not call alloc_page() in the 
>>>>>>> readahead(),
>>>>>>> could you please help take a look at this issue, thanks.
>>>>>>
>>>>>> This will be because most filesystems don't need to do so. 
>>>>>> Squashfs is
>>>>>> a compressed filesystem with large blocks covering much more than 
>>>>>> one
>>>>>> page, and it decompresses these blocks in squashfs_readahead().   If
>>>>>> __readahead_batch() does not return the full set of pages 
>>>>>> covering the
>>>>>> Squashfs block, it allocates a temporary page for the 
>>>>>> decompressors to
>>>>>> decompress into to "fill in the hole".
>>>>>>
>>>>>> What can be done here as far as Squashfs is concerned .... I could
>>>>>> move the page allocation out of the readahead path (e.g. do it at
>>>>>> mount time).
>>>>> You could try this patch which does that.  Compile tested only.
>>>> The kmalloc_array() may call alloc_page() to trigger this problem too
>>>> IIUC. It should be pre-allocated as well?
>>>
>>>
>>> That is a much smaller allocation, and so it entirely depends 
>>> whether it is an issue or not.  There are also a number of other 
>>> small memory allocations in the path as well.
>>>
>>> The whole point of this patch is to move the *biggest* allocation 
>>> which is the reported issue and then see what happens.   No point in 
>>> making this test patch more involved and complex than necessary at 
>>> this point.
>>>
>>> Phillip
>>>
>>
>> Also be aware this stress-ng triggered issue is new, and apparently 
>> didn't occur last year.   So it is reasonable to assume the issue has 
>> been introduced as a side effect of the readahead improvements.  One 
>> of these is this allocation of a temporary page to decompress into 
>> rather than falling back to entirely decompressing into a 
>> pre-allocated buffer (allocated at mount time).  The small memory 
>> allocations have been there for many years.
>>
>> Allocating the page at mount time effectively puts the memory 
>> allocation situation back to how it was last year before the 
>> readahead work.
>>
>> Phillip
>>
> Thanks Phillip and Yang.
>
> And Phillip,
>
> I tested your change, it didn't help. According to my debug, the OOM 
> happens at the place of allocating memory for bio, it is at the line 
> of "struct page *page = alloc_page(GFP_NOIO);" in the 
> squashfs_bio_read(). Other filesystems just use the pre-allocated 
> memory in the "struct readahead_control" to do the bio, but squashfs 
> allocate the new page to do the bio (maybe because the squashfs is a 
> compressed filesystem).
>
The test patch was a process of elimination; it removed the obvious 
change from last year.

It is also because it is a compressed filesystem: in most filesystems, 
what is read off disk in I/O is what ends up in the page cache.  In a 
compressed filesystem, what is read in isn't what ends up in the page cache.

> BTW, this is not a new issue for squashfs, we have uc20 (linux-5.4 
> kernel) and uc22 (linux-5.15 kernel), all have this issue. The issue 
> already existed in the squahsfs_readpage() in the 5.4 kernel.

That information would have been rather useful in the initial report, 
and would have saved me from wasting my time.  Thanks for that.

Now, in the squashfs_readpage() situation, do processes hang or crash?  
In the squashfs_readpage() path __GFP_NOFS should not be in effect.  So 
is the OOM killer being invoked in this code path or not?  Does 
alloc_page() in the bio code return NULL, and/or invoke the OOM killer, 
or does it get stuck?  Don't keep this information to yourself so that I 
have to guess.

> I guess if could use pre-allocated memory to do the bio, it will help.

We'll see.

As far as I can see, you've made the system run out of memory and are 
now complaining about the result.  There's nothing unconventional about 
Squashfs' handling of out-of-memory conditions, and most filesystems put 
into an out-of-memory situation will fail.

Phillip


>
> Thanks,
>
> Hui.
>
>


>
>>
>>>>>    fs/squashfs/page_actor.c     | 10 +---------
>>>>>    fs/squashfs/page_actor.h     |  1 -
>>>>>    fs/squashfs/squashfs_fs_sb.h |  1 +
>>>>>    fs/squashfs/super.c          | 10 ++++++++++
>>>>>    4 files changed, 12 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
>>>>> index 81af6c4ca115..6cce239eca66 100644
>>>>> --- a/fs/squashfs/page_actor.c
>>>>> +++ b/fs/squashfs/page_actor.c
>>>>> @@ -110,15 +110,7 @@ struct squashfs_page_actor 
>>>>> *squashfs_page_actor_init_special(struct squashfs_sb_
>>>>>          if (actor == NULL)
>>>>>                  return NULL;
>>>>>
>>>>> -       if (msblk->decompressor->alloc_buffer) {
>>>>> -               actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>>>> -
>>>>> -               if (actor->tmp_buffer == NULL) {
>>>>> -                       kfree(actor);
>>>>> -                       return NULL;
>>>>> -               }
>>>>> -       } else
>>>>> -               actor->tmp_buffer = NULL;
>>>>> +       actor->tmp_buffer = msblk->actor_page;
>>>>>
>>>>>          actor->length = length ? : pages * PAGE_SIZE;
>>>>>          actor->page = page;
>>>>> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
>>>>> index 97d4983559b1..df5e999afa42 100644
>>>>> --- a/fs/squashfs/page_actor.h
>>>>> +++ b/fs/squashfs/page_actor.h
>>>>> @@ -34,7 +34,6 @@ static inline struct page 
>>>>> *squashfs_page_actor_free(struct squashfs_page_actor *
>>>>>    {
>>>>>          struct page *last_page = actor->last_page;
>>>>>
>>>>> -       kfree(actor->tmp_buffer);
>>>>>          kfree(actor);
>>>>>          return last_page;
>>>>>    }
>>>>> diff --git a/fs/squashfs/squashfs_fs_sb.h 
>>>>> b/fs/squashfs/squashfs_fs_sb.h
>>>>> index 72f6f4b37863..8feddc9e6cce 100644
>>>>> --- a/fs/squashfs/squashfs_fs_sb.h
>>>>> +++ b/fs/squashfs/squashfs_fs_sb.h
>>>>> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>>>>>          struct squashfs_cache *block_cache;
>>>>>          struct squashfs_cache *fragment_cache;
>>>>>          struct squashfs_cache *read_page;
>>>>> +       void *actor_page;
>>>>>          int next_meta_index;
>>>>>          __le64 *id_table;
>>>>>          __le64 *fragment_index;
>>>>> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
>>>>> index e090fae48e68..674dc187d961 100644
>>>>> --- a/fs/squashfs/super.c
>>>>> +++ b/fs/squashfs/super.c
>>>>> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct 
>>>>> super_block *sb, struct fs_context *fc)
>>>>>                  goto failed_mount;
>>>>>          }
>>>>>
>>>>> +
>>>>> +       /* Allocate page for 
>>>>> squashfs_readahead()/squashfs_read_folio() */
>>>>> +       if (msblk->decompressor->alloc_buffer) {
>>>>> +               msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>>>> +
>>>>> +               if(msblk->actor_page == NULL)
>>>>> +                       goto failed_mount;
>>>>> +       }
>>>>> +
>>>>>          msblk->stream = squashfs_decompressor_setup(sb, flags);
>>>>>          if (IS_ERR(msblk->stream)) {
>>>>>                  err = PTR_ERR(msblk->stream);
>>>>> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct 
>>>>> super_block *sb, struct fs_context *fc)
>>>>>          squashfs_cache_delete(msblk->block_cache);
>>>>>          squashfs_cache_delete(msblk->fragment_cache);
>>>>>          squashfs_cache_delete(msblk->read_page);
>>>>> +       kfree(msblk->actor_page);
>>>>>          msblk->thread_ops->destroy(msblk);
>>>>>          kfree(msblk->inode_lookup_table);
>>>>>          kfree(msblk->fragment_index);
>>>>> -- 
>>>>> 2.35.1
>>>>>
>>>>>> Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean 
>>>>>> Squashfs
>>>>>> returning I/O failures due to no memory.  That will cause a lot of
>>>>>> applications to crash in a low memory situation.  So a crash rather
>>>>>> than a hang.
>>>>>>
>>>>>> Phillip
>>>>>>
>>>>>>
>>>>>>
>>>>>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-27  1:18       ` Gao Xiang
@ 2023-04-27  3:47         ` Hui Wang
  2023-04-27  4:17           ` Gao Xiang
                             ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Hui Wang @ 2023-04-27  3:47 UTC (permalink / raw)
  To: Gao Xiang, Michal Hocko
  Cc: linux-mm, akpm, surenb, colin.i.king, shy828301, hannes, vbabka,
	hch, mgorman, Phillip Lougher


On 4/27/23 09:18, Gao Xiang wrote:
>
>
> On 2023/4/26 19:07, Hui Wang wrote:
>>
>> On 4/26/23 16:33, Michal Hocko wrote:
>>> [CC squashfs maintainer]
>>>
>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>> If we run the stress-ng in the filesystem of squashfs, the system
>>>> will be in a state something like hang, the stress-ng couldn't
>>>> finish running and the console couldn't react to users' input.
>>>>
>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>> through debugging, we found this issue is introduced by oom handling
>>>> in the kernel.
>>>>
>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>> alloc_page(), in this case, if there is no memory left, the
>>>> out_of_memory() will be called without __GFP_FS, then the oom killer
>>>> will not be triggered and this process will loop endlessly and wait
>>>> for others to trigger oom killer to release some memory. But for a
>>>> system with the whole root filesystem constructed by squashfs,
>>>> nearly all userspace processes will call out_of_memory() without
>>>> __GFP_FS, so we will see that the system enters a state something like
>>>> hang when running stress-ng.
>>>>
>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>> __GFP_FS before returning from out_of_memory() due to without
>>>> __GFP_FS.
>>> I do not think this is an appropriate way to deal with this issue.
>>> Does it even make sense to trigger OOM killer for something like
>>> readahead? Would it be more mindful to fail the allocation instead?
>>> That being said should allocations from squashfs_readahead use
>>> __GFP_RETRY_MAYFAIL instead?
>>
>> Thanks for your comment, and this issue could hardly be reproduced on 
>> ext4 filesystem, that is because the ext4->readahead() doesn't call 
>> alloc_page(). If changing the ext4->readahead() as below, it will be 
>> easy to reproduce this issue with the ext4 filesystem (repeatedly 
>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 
>> --timeout 30s --skip-silent --verbose)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, 
>> struct folio *folio)
>>   static void ext4_readahead(struct readahead_control *rac)
>>   {
>>          struct inode *inode = rac->mapping->host;
>> +       struct page *tmp_page;
>>
>>          /* If the file has inline data, no need to do readahead. */
>>          if (ext4_has_inline_data(inode))
>>                  return;
>>
>> +       tmp_page = alloc_page(GFP_KERNEL);
>> +
>>          ext4_mpage_readpages(inode, rac, NULL);
>> +
>> +       if (tmp_page)
>> +               __free_page(tmp_page);
>>   }
>>
>
Hi Xiang and Michal,
> Is it tested with a pure ext4 without any other fs background?
>
Basically yes. There may be a squashfs mounted for python3 in my test 
environment, but stress-ng and the shared libs it needs are on the ext4.
> I don't think it's true that "ext4->readahead() doesn't call
> alloc_page()" since I think even ext2/ext4 uses buffer head
> interfaces to read metadata (extents or old block mapping)
> from its bd_inode for readahead, which indirectly allocates
> some extra pages to page cache as well.

Calling alloc_page() or allocating memory in readahead() is not a 
problem by itself. Suppose we have 4 processes (A, B, C and D). Processes 
A, B and C enter out_of_memory() because of allocations in readahead(); 
they loop and wait for some memory to be released. If process D then 
enters out_of_memory() with __GFP_FS, it can trigger the OOM killer, so 
A, B and C can get their memory and return to readahead(), and there is 
no system hang.

But if all 4 processes enter out_of_memory() from readahead(), they will 
loop and wait endlessly, and there is no process left to trigger the OOM 
killer, so users will think the system is hung.

I applied my change to ext4->readahead() on linux-next and tested it on 
my Ubuntu classic server for arm64; I could reproduce the hang issue 
within 1 minute at a 100% rate. I guess it is easy to reproduce because 
it is an embedded environment: the total number of processes in the 
system is very limited, nearly all userspace processes eventually reach 
out_of_memory() from ext4_readahead(), and nearly no kthread reaches 
out_of_memory() for a long time, which leaves the system in a state like 
a hang (not a real hang).

And this is why I wrote a patch that lets a dedicated kthread trigger 
the OOM killer forcibly (my initial patch); a rough sketch of the idea 
follows.
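
A rough, hedged sketch of that approach (this is not the submitted 
patch; the thread name, wait queue and flag below are invented for 
illustration). The helper only has to perform one allocation that 
carries __GFP_FS, so the normal out_of_memory() path is allowed to run:

#include <linux/kthread.h>
#include <linux/gfp.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(oom_helper_wq);
static bool oom_helper_requested;

/* hypothetically called from out_of_memory() when __GFP_FS is missing */
static void request_fs_oom(void)
{
	oom_helper_requested = true;
	wake_up(&oom_helper_wq);
}

static int oom_helper_thread(void *unused)
{
	struct page *page;

	while (!kthread_should_stop()) {
		wait_event_interruptible(oom_helper_wq,
					 oom_helper_requested ||
					 kthread_should_stop());
		oom_helper_requested = false;

		/*
		 * GFP_KERNEL carries __GFP_FS, so if the system is still
		 * out of memory this allocation may invoke the OOM killer
		 * on behalf of the NOFS/NOIO allocators that could not.
		 */
		page = alloc_page(GFP_KERNEL);
		if (page)
			__free_page(page);
	}
	return 0;
}

/* started once at boot, e.g.:
 *	kthread_run(oom_helper_thread, NULL, "oom_fs_helper");
 */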


>
> The difference only here is the total number of pages to be
> allocated here, but many extra compressed data takeing extra
> allocation causes worse.  So I think it much depends on how
> stressful does your stress workload work like, and I'm even
> not sure it's a real issue since if you stop the stress
> workload, it will immediately recover (only it may not oom
> directly).
>
Yes, it is not a real hang. All userspace processes are looping and 
waiting for other processes to release or reclaim memory. And in this 
case, we can't stop the stress workload, since users can't control the 
system through the console.

So Michal,

I don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a 
state something like hang when running stress-ng" cover letter, but do 
you know why out_of_memory() returns immediately if __GFP_FS is not set? 
Could we drop these lines directly:

     /*
      * The OOM killer does not compensate for IO-less reclaim.
      * pagefault_out_of_memory lost its gfp context so we have to
      * make sure exclude 0 mask - all other users should have at least
      * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
      * invoke the OOM killer even if it is a GFP_NOFS allocation.
      */
     if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
         return true;


Thanks,

Hui.

> Thanks,
> Gao Xiang


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-27  3:47         ` Hui Wang
@ 2023-04-27  4:17           ` Gao Xiang
  2023-04-27  7:03           ` Colin King (gmail)
  2023-04-28 19:53           ` Michal Hocko
  2 siblings, 0 replies; 25+ messages in thread
From: Gao Xiang @ 2023-04-27  4:17 UTC (permalink / raw)
  To: Hui Wang, Michal Hocko
  Cc: linux-mm, akpm, surenb, colin.i.king, shy828301, hannes, vbabka,
	hch, mgorman, Phillip Lougher



On 2023/4/27 11:47, Hui Wang wrote:
> 

...

>>
> Hi Xiang and Michal,
>> Is it tested with a pure ext4 without any other fs background?
>>
> Basically yes. Maybe there is a squashfs mounted for python3 in my test environment. But stress-ng and its needed sharing libs are in the ext4.
>> I don't think it's true that "ext4->readahead() doesn't call
>> alloc_page()" since I think even ext2/ext4 uses buffer head
>> interfaces to read metadata (extents or old block mapping)
>> from its bd_inode for readahead, which indirectly allocates
>> some extra pages to page cache as well.
> 
> Calling alloc_page() or allocating memory in the readahead() is not a problem, suppose we have 4 processes (A, B, C and D). Process A, B and C are entering out_of_memory() because of allocating memory in the readahead(), they are looping and waiting for some memory be released. And process D could enter out_of_memory() with __GFP_FS, then it could trigger oom killer, so A, B and C could get the memory and return to the readahead(), there is no system hang issue.
> 
> But if all 4 processes enter out_of_memory() from readahead(), they will loop and wait endlessly, there is no process to trigger oom killer,  so the users will think the system is getting hang.
> 
> I applied my change for ext4->readahead to linux-next, and tested it on my ubuntu classic server for arm64, I could reproduce the hang issue within 1 minutes with 100% rate. I guess it is easy to reproduce the issue because it is an embedded environment, the total number of processes in the system is very limited, nearly all userspace processes will finally reach out_of_memory() from ext4_readahead(), and nearly all kthreads will not reach out_of_memory() for long time, that makes the system in a state like hang (not real hang).
> 
> And this is why I wrote a patch to let a specific kthread trigger oom killer forcibly (my initial patch).
> 
> 
>>
>> The difference only here is the total number of pages to be
>> allocated here, but many extra compressed data takeing extra
>> allocation causes worse.  So I think it much depends on how
>> stressful does your stress workload work like, and I'm even
>> not sure it's a real issue since if you stop the stress
>> workload, it will immediately recover (only it may not oom
>> directly).
>>
> Yes, it is not a real hang. All userspace processes are looping and waiting for other processes to release or reclaim memory. And in this case, we can't stop the stress workload since users can't control the system through console.
> 
> So Michal,
> 
> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a state something like hang when running stress-ng", do you know why out_of_memory() will return immediately if there is no __GFP_FS, could we drop these lines directly:
> 
>      /*
>       * The OOM killer does not compensate for IO-less reclaim.
>       * pagefault_out_of_memory lost its gfp context so we have to
>       * make sure exclude 0 mask - all other users should have at least
>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>       */
>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>          return true;

I'm curious about this as well.

Apart from that, we also have some stacked virtual block devices, which 
mostly use GFP_NOIO allocations.  So if they allocate extra pages with 
GFP_NOIO like this, the OOM kill won't happen either.  And mostly such 
block drivers cannot fail I/Os (I/O error), since that has worse results 
for the upper filesystems and end users.

Generally I don't think reserving more pages makes sense, since it
still depends on the stress workload (and it makes more pages private
and unusable for other subsystems and user programs), so we still have
two possible outcomes:

   - wait (hang) for previous requests to free pages;

   - return ENOMEM instead (see the sketch below).
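
For the second outcome, a minimal sketch of what the earlier 
__GFP_RETRY_MAYFAIL suggestion could look like for a temporary 
readahead-path allocation (illustrative only; the helper name is 
invented and the exact call site and error handling would need care):

#include <linux/gfp.h>

static struct page *ra_alloc_tmp_page(void)
{
	/*
	 * __GFP_RETRY_MAYFAIL retries reclaim hard but eventually returns
	 * NULL instead of blocking on the OOM killer, so the caller can
	 * skip the readahead and leave the pages to a later on-demand read.
	 */
	return alloc_page(GFP_NOIO | __GFP_RETRY_MAYFAIL);
}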

Thanks,
Gao Xiang

> 
> 
> Thanks,
> 
> Hui.
> 
>> Thanks,
>> Gao Xiang


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-27  1:37                   ` Phillip Lougher
@ 2023-04-27  5:22                     ` Hui Wang
  0 siblings, 0 replies; 25+ messages in thread
From: Hui Wang @ 2023-04-27  5:22 UTC (permalink / raw)
  To: Phillip Lougher, Yang Shi
  Cc: Michal Hocko, linux-mm, akpm, surenb, colin.i.king, hannes,
	vbabka, hch, mgorman, Gao Xiang


On 4/27/23 09:37, Phillip Lougher wrote:
>
> On 27/04/2023 01:42, Hui Wang wrote:
>>
>> On 4/27/23 03:34, Phillip Lougher wrote:
>>>
>>> On 26/04/2023 20:06, Phillip Lougher wrote:
>>>>
>>>> On 26/04/2023 19:26, Yang Shi wrote:
>>>>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
>>>>> <phillip@squashfs.org.uk> wrote:
>>>>>>
>>>>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>>>>> [CC squashfs maintainer]
>>>>>>>>>
>>>>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>>>>> If we run the stress-ng in the filesystem of squashfs, the 
>>>>>>>>>> system
>>>>>>>>>> will be in a state something like hang, the stress-ng couldn't
>>>>>>>>>> finish running and the console couldn't react to users' input.
>>>>>>>>>>
>>>>>>>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>>>>>>>> through debugging, we found this issue is introduced by oom 
>>>>>>>>>> handling
>>>>>>>>>> in the kernel.
>>>>>>>>>>
>>>>>>>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>>>>>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>>>>>>>> alloc_page(), in this case, if there is no memory left, the
>>>>>>>>>> out_of_memory() will be called without __GFP_FS, then the oom 
>>>>>>>>>> killer
>>>>>>>>>> will not be triggered and this process will loop endlessly 
>>>>>>>>>> and wait
>>>>>>>>>> for others to trigger oom killer to release some memory. But 
>>>>>>>>>> for a
>>>>>>>>>> system with the whole root filesystem constructed by squashfs,
>>>>>>>>>> nearly all userspace processes will call out_of_memory() without
>>>>>>>>>> __GFP_FS, so we will see that the system enters a state 
>>>>>>>>>> something like
>>>>>>>>>> hang when running stress-ng.
>>>>>>>>>>
>>>>>>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>>>>>>> __GFP_FS before returning from out_of_memory() due to without
>>>>>>>>>> __GFP_FS.
>>>>>>>>> I do not think this is an appropriate way to deal with this 
>>>>>>>>> issue.
>>>>>>>>> Does it even make sense to trigger OOM killer for something like
>>>>>>>>> readahead? Would it be more mindful to fail the allocation 
>>>>>>>>> instead?
>>>>>>>>> That being said should allocations from squashfs_readahead use
>>>>>>>>> __GFP_RETRY_MAYFAIL instead?
>>>>>>>> Thanks for your comment, and this issue could hardly be 
>>>>>>>> reproduced on
>>>>>>>> ext4 filesystem, that is because the ext4->readahead() doesn't 
>>>>>>>> call
>>>>>>>> alloc_page(). If changing the ext4->readahead() as below, it 
>>>>>>>> will be
>>>>>>>> easy to reproduce this issue with the ext4 filesystem (repeatedly
>>>>>>>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>>>>>> --timeout 30s --skip-silent --verbose)
>>>>>>>>
>>>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>>>>> --- a/fs/ext4/inode.c
>>>>>>>> +++ b/fs/ext4/inode.c
>>>>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file 
>>>>>>>> *file,
>>>>>>>> struct folio *folio)
>>>>>>>>   static void ext4_readahead(struct readahead_control *rac)
>>>>>>>>   {
>>>>>>>>          struct inode *inode = rac->mapping->host;
>>>>>>>> +       struct page *tmp_page;
>>>>>>>>
>>>>>>>>          /* If the file has inline data, no need to do 
>>>>>>>> readahead. */
>>>>>>>>          if (ext4_has_inline_data(inode))
>>>>>>>>                  return;
>>>>>>>>
>>>>>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>>>>>> +
>>>>>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>>>>>> +
>>>>>>>> +       if (tmp_page)
>>>>>>>> +               __free_page(tmp_page);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>
>>>>>>>> BTW, I applied my patch to the linux-next and ran the oom 
>>>>>>>> stress-ng
>>>>>>>> testcases overnight, there is no hang, oops or crash, looks like
>>>>>>>> there is no big problem to use a kthread to trigger the oom 
>>>>>>>> killer in
>>>>>>>> this case.
>>>>>>>>
>>>>>>>> And Hi squashfs maintainer, I checked the code of filesystem, 
>>>>>>>> looks
>>>>>>>> like most filesystems will not call alloc_page() in the 
>>>>>>>> readahead(),
>>>>>>>> could you please help take a look at this issue, thanks.
>>>>>>>
>>>>>>> This will be because most filesystems don't need to do so. 
>>>>>>> Squashfs is
>>>>>>> a compressed filesystem with large blocks covering much more 
>>>>>>> than one
>>>>>>> page, and it decompresses these blocks in 
>>>>>>> squashfs_readahead().   If
>>>>>>> __readahead_batch() does not return the full set of pages 
>>>>>>> covering the
>>>>>>> Squashfs block, it allocates a temporary page for the 
>>>>>>> decompressors to
>>>>>>> decompress into to "fill in the hole".
>>>>>>>
>>>>>>> What can be done here as far as Squashfs is concerned .... I could
>>>>>>> move the page allocation out of the readahead path (e.g. do it at
>>>>>>> mount time).
>>>>>> You could try this patch which does that.  Compile tested only.
>>>>> The kmalloc_array() may call alloc_page() to trigger this problem too
>>>>> IIUC. It should be pre-allocated as well?
>>>>
>>>>
>>>> That is a much smaller allocation, and so it entirely depends 
>>>> whether it is an issue or not.  There are also a number of other 
>>>> small memory allocations in the path as well.
>>>>
>>>> The whole point of this patch is to move the *biggest* allocation 
>>>> which is the reported issue and then see what happens.   No point 
>>>> in making this test patch more involved and complex than necessary 
>>>> at this point.
>>>>
>>>> Phillip
>>>>
>>>
>>> Also be aware this stress-ng triggered issue is new, and apparently 
>>> didn't occur last year.   So it is reasonable to assume the issue 
>>> has been introduced as a side effect of the readahead improvements.  
>>> One of these is this allocation of a temporary page to decompress 
>>> into rather than falling back to entirely decompressing into a 
>>> pre-allocated buffer (allocated at mount time).  The small memory 
>>> allocations have been there for many years.
>>>
>>> Allocating the page at mount time effectively puts the memory 
>>> allocation situation back to how it was last year before the 
>>> readahead work.
>>>
>>> Phillip
>>>
>> Thanks Phillip and Yang.
>>
>> And Phillip,
>>
>> I tested your change, it didn't help. According to my debug, the OOM 
>> happens at the place of allocating memory for bio, it is at the line 
>> of "struct page *page = alloc_page(GFP_NOIO);" in the 
>> squashfs_bio_read(). Other filesystems just use the pre-allocated 
>> memory in the "struct readahead_control" to do the bio, but squashfs 
>> allocate the new page to do the bio (maybe because the squashfs is a 
>> compressed filesystem).
>>
Hi Phillip,
> The test patch was a process of elimination, it removed the obvious 
> change from last year.
>
> It is also because it is a compressed filesystem, in most filesystems 
> what is read off disk in I/O is what ends up in the page cache.  In a 
> compressed filesystem what is read in isn't what ends up in the page 
> cache.
>
Understood.
>> BTW, this is not a new issue for squashfs, we have uc20 (linux-5.4 
>> kernel) and uc22 (linux-5.15 kernel), all have this issue. The issue 
>> already existed in the squahsfs_readpage() in the 5.4 kernel.
>
> That information would have been rather useful in the initial report, 
> and saved myself from wasting my time.  Thanks for that.
Sorry, I didn't mention it before.
>
> Now in the squashfs_readpage() situation does processes hang or 
> crash?  In the squashfs_readpage() path __GFP_NOFS should not be in 
> effect.  So is the OOM killer being invoked in this code path or 
> not?   Does alloc_page() in the bio code return NULL, and/or invoke 
> the OMM killer or does it get stuck.  Don't keep this information to 
> yourself so I have to guess.
>
In the squashfs_readpage() situation, the process still hangs, and the 
__GFP_NOFS constraint also applies to squashfs_readpage(); please see 
the call trace below, which I captured from linux-5.4:

[  118.131804] wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww current->comm = 
stress-ng-bighe oc->gfp_mask = 8c50 (GFP_FS is not set in gfp_mask)
[  118.142829] ------------[ cut here ]------------
[  118.142843] WARNING: CPU: 1 PID: 794 at mm/oom_kill.c:1097 
out_of_memory+0x2dc/0x340
[  118.142845] Modules linked in:
[  118.142851] CPU: 1 PID: 794 Comm: stress-ng-bighe Tainted: G        
W         5.4.0+ #152
[  118.142853] Hardware name: LS1028A RDB Board (DT)
[  118.142857] pstate: 60400005 (nZCv daif +PAN -UAO)
[  118.142860] pc : out_of_memory+0x2dc/0x340
[  118.142863] lr : out_of_memory+0xe4/0x340
[  118.142865] sp : ffff8000115cb580
[  118.142867] x29: ffff8000115cb580 x28: 0000000000000000
[  118.142871] x27: ffffcefc623dab80 x26: 000031039de16790
[  118.142875] x25: ffffcefc621e9878 x24: 0000000000000100
[  118.142878] x23: ffffcefc62278000 x22: 0000000000000000
[  118.142881] x21: ffff8000115cb6f8 x20: ffff00206272e740
[  118.142885] x19: 0000000000000000 x18: ffffcefc622268f8
[  118.142888] x17: 0000000000000000 x16: 0000000000000000
[  118.142891] x15: ffffcefc621e8a38 x14: 1a9f17e4f9444a3e
[  118.142894] x13: 0000000000000001 x12: 0000000000000400
[  118.142897] x11: 0000000000000400 x10: 0000000000000a90
[  118.142900] x9 : ffff8000115cb2c0 x8 : ffff00206272f230
[  118.142903] x7 : 0000001b81dc2360 x6 : 0000000000000000
[  118.142906] x5 : 0000000000000000 x4 : ffff00207f7db210
[  118.142909] x3 : 0000000000000000 x2 : 0000000000000000
[  118.142912] x1 : ffff00206272e740 x0 : 0000000000000000
[  118.142915] Call trace:
[  118.142919]  out_of_memory+0x2dc/0x340
[  118.142924]  __alloc_pages_nodemask+0xf04/0x1090
[  118.142928]  alloc_slab_page+0x34/0x430
[  118.142931]  allocate_slab+0x474/0x500
[  118.142935]  ___slab_alloc.constprop.0+0x1e4/0x64c
[  118.142938]  __slab_alloc.constprop.0+0x54/0xb0
[  118.142941]  kmem_cache_alloc+0x31c/0x350
[  118.142945]  alloc_buffer_head+0x2c/0xac
[  118.142948]  alloc_page_buffers+0xb8/0x210
[  118.142951]  __getblk_gfp+0x180/0x39c
[  118.142955]  squashfs_read_data+0x2a4/0x6f0
[  118.142958]  squashfs_readpage_block+0x2c4/0x630
[  118.142961]  squashfs_readpage+0x5e4/0x98c
[  118.142964]  filemap_fault+0x17c/0x720
[  118.142967]  __do_fault+0x44/0x110
[  118.142970]  __handle_mm_fault+0x930/0xdac
[  118.142973]  handle_mm_fault+0xc8/0x190
[  118.142978]  do_page_fault+0x134/0x5a0
[  118.142982]  do_translation_fault+0xe0/0x108
[  118.142985]  do_mem_abort+0x54/0xb0
[  118.142988]  el0_da+0x1c/0x20
[  118.142990] ---[ end trace c105c6721d4e890e ]---


I guess this is where __GFP_FS gets dropped in linux-5.4 
(grow_dev_page() in fs/buffer.c): gfp_mask = 
mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
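
For reference, that helper just masks the mapping's allowed flags (as 
defined in include/linux/pagemap.h in current mainline), so passing 
~__GFP_FS clears __GFP_FS from the mapping's mask before the caller ORs 
its own gfp back in:

static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
					   gfp_t gfp_mask)
{
	return mapping_gfp_mask(mapping) & gfp_mask;
}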

>> I guess if could use pre-allocated memory to do the bio, it will help.
>
> We'll see.
>
> As far I can see it you've made the system run out of memory, and are 
> now complaining about the result.  There's nothing unconventional 
> about Squashfs handling of out of memory, and most filesystems put 
> into an out of memory situation will fail.
>
Understood.  And squashfs is now used in Ubuntu Core and will be in 
many IoT products, so we really need to find a solution for this.

Thanks,

Hui.


> Phillip
>
>
>>
>> Thanks,
>>
>> Hui.
>>
>>
>
>
>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-27  3:47         ` Hui Wang
  2023-04-27  4:17           ` Gao Xiang
@ 2023-04-27  7:03           ` Colin King (gmail)
  2023-04-27  7:49             ` Hui Wang
  2023-04-28 19:53           ` Michal Hocko
  2 siblings, 1 reply; 25+ messages in thread
From: Colin King (gmail) @ 2023-04-27  7:03 UTC (permalink / raw)
  To: Hui Wang, Gao Xiang, Michal Hocko
  Cc: linux-mm, akpm, surenb, shy828301, hannes, vbabka, hch, mgorman,
	Phillip Lougher

On 27/04/2023 04:47, Hui Wang wrote:
> 
> On 4/27/23 09:18, Gao Xiang wrote:
>>
>>
>> On 2023/4/26 19:07, Hui Wang wrote:
>>>
>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>> [CC squashfs maintainer]
>>>>
>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>> If we run the stress-ng in the filesystem of squashfs, the system
>>>>> will be in a state something like hang, the stress-ng couldn't
>>>>> finish running and the console couldn't react to users' input.
>>>>>
>>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>>> through debugging, we found this issue is introduced by oom handling
>>>>> in the kernel.
>>>>>
>>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>>> alloc_page(), in this case, if there is no memory left, the
>>>>> out_of_memory() will be called without __GFP_FS, then the oom killer
>>>>> will not be triggered and this process will loop endlessly and wait
>>>>> for others to trigger oom killer to release some memory. But for a
>>>>> system with the whole root filesystem constructed by squashfs,
>>>>> nearly all userspace processes will call out_of_memory() without
>>>>> __GFP_FS, so we will see that the system enters a state something like
>>>>> hang when running stress-ng.
>>>>>
>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>> __GFP_FS before returning from out_of_memory() due to without
>>>>> __GFP_FS.
>>>> I do not think this is an appropriate way to deal with this issue.
>>>> Does it even make sense to trigger OOM killer for something like
>>>> readahead? Would it be more mindful to fail the allocation instead?
>>>> That being said should allocations from squashfs_readahead use
>>>> __GFP_RETRY_MAYFAIL instead?
>>>
>>> Thanks for your comment, and this issue could hardly be reproduced on 
>>> ext4 filesystem, that is because the ext4->readahead() doesn't call 
>>> alloc_page(). If changing the ext4->readahead() as below, it will be 
>>> easy to reproduce this issue with the ext4 filesystem (repeatedly 
>>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 
>>> --timeout 30s --skip-silent --verbose)
>>>
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, 
>>> struct folio *folio)
>>>   static void ext4_readahead(struct readahead_control *rac)
>>>   {
>>>          struct inode *inode = rac->mapping->host;
>>> +       struct page *tmp_page;
>>>
>>>          /* If the file has inline data, no need to do readahead. */
>>>          if (ext4_has_inline_data(inode))
>>>                  return;
>>>
>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>> +
>>>          ext4_mpage_readpages(inode, rac, NULL);
>>> +
>>> +       if (tmp_page)
>>> +               __free_page(tmp_page);
>>>   }
>>>
>>
> Hi Xiang and Michal,
>> Is it tested with a pure ext4 without any other fs background?
>>
> Basically yes. Maybe there is a squashfs mounted for python3 in my test 
> environment. But stress-ng and its needed sharing libs are in the ext4.

One could build a static version of stress-ng to remove the need for 
shared library loading at run time:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make clean
STATIC=1 make -j 8


>> I don't think it's true that "ext4->readahead() doesn't call
>> alloc_page()" since I think even ext2/ext4 uses buffer head
>> interfaces to read metadata (extents or old block mapping)
>> from its bd_inode for readahead, which indirectly allocates
>> some extra pages to page cache as well.
> 
> Calling alloc_page() or allocating memory in the readahead() is not a 
> problem, suppose we have 4 processes (A, B, C and D). Process A, B and C 
> are entering out_of_memory() because of allocating memory in the 
> readahead(), they are looping and waiting for some memory be released. 
> And process D could enter out_of_memory() with __GFP_FS, then it could 
> trigger oom killer, so A, B and C could get the memory and return to the 
> readahead(), there is no system hang issue.
> 
> But if all 4 processes enter out_of_memory() from readahead(), they will 
> loop and wait endlessly, there is no process to trigger oom killer,  so 
> the users will think the system is getting hang.
> 
> I applied my change for ext4->readahead to linux-next, and tested it on 
> my ubuntu classic server for arm64, I could reproduce the hang issue 
> within 1 minutes with 100% rate. I guess it is easy to reproduce the 
> issue because it is an embedded environment, the total number of 
> processes in the system is very limited, nearly all userspace processes 
> will finally reach out_of_memory() from ext4_readahead(), and nearly all 
> kthreads will not reach out_of_memory() for long time, that makes the 
> system in a state like hang (not real hang).
> 
> And this is why I wrote a patch to let a specific kthread trigger oom 
> killer forcibly (my initial patch).
> 
> 
>>
>> The difference only here is the total number of pages to be
>> allocated here, but many extra compressed data takeing extra
>> allocation causes worse.  So I think it much depends on how
>> stressful does your stress workload work like, and I'm even
>> not sure it's a real issue since if you stop the stress
>> workload, it will immediately recover (only it may not oom
>> directly).
>>
> Yes, it is not a real hang. All userspace processes are looping and 
> waiting for other processes to release or reclaim memory. And in this 
> case, we can't stop the stress workload since users can't control the 
> system through console.
> 
> So Michal,
> 
> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a 
> state something like hang when running stress-ng", do you know why 
> out_of_memory() will return immediately if there is no __GFP_FS, could 
> we drop these lines directly:
> 
>      /*
>       * The OOM killer does not compensate for IO-less reclaim.
>       * pagefault_out_of_memory lost its gfp context so we have to
>       * make sure exclude 0 mask - all other users should have at least
>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>       */
>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>          return true;
> 
> 
> Thanks,
> 
> Hui.
> 
>> Thanks,
>> Gao Xiang



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-27  7:03           ` Colin King (gmail)
@ 2023-04-27  7:49             ` Hui Wang
  0 siblings, 0 replies; 25+ messages in thread
From: Hui Wang @ 2023-04-27  7:49 UTC (permalink / raw)
  To: Colin King (gmail), Gao Xiang, Michal Hocko
  Cc: linux-mm, akpm, surenb, shy828301, hannes, vbabka, hch, mgorman,
	Phillip Lougher


On 4/27/23 15:03, Colin King (gmail) wrote:
> On 27/04/2023 04:47, Hui Wang wrote:
>>
>> On 4/27/23 09:18, Gao Xiang wrote:
>>>
>>>
>>> On 2023/4/26 19:07, Hui Wang wrote:
>>>>
>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>> [CC squashfs maintainer]
>>>>>
>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>> If we run the stress-ng in the filesystem of squashfs, the system
>>>>>> will be in a state something like hang, the stress-ng couldn't
>>>>>> finish running and the console couldn't react to users' input.
>>>>>>
>>>>>> This issue happens on all arm/arm64 platforms we are working on,
>>>>>> through debugging, we found this issue is introduced by oom handling
>>>>>> in the kernel.
>>>>>>
>>>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls
>>>>>> alloc_page(), in this case, if there is no memory left, the
>>>>>> out_of_memory() will be called without __GFP_FS, then the oom killer
>>>>>> will not be triggered and this process will loop endlessly and wait
>>>>>> for others to trigger oom killer to release some memory. But for a
>>>>>> system with the whole root filesystem constructed by squashfs,
>>>>>> nearly all userspace processes will call out_of_memory() without
>>>>>> __GFP_FS, so we will see that the system enters a state something 
>>>>>> like
>>>>>> hang when running stress-ng.
>>>>>>
>>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>>> __GFP_FS before returning from out_of_memory() due to without
>>>>>> __GFP_FS.
>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>> Does it even make sense to trigger OOM killer for something like
>>>>> readahead? Would it be more mindful to fail the allocation instead?
>>>>> That being said should allocations from squashfs_readahead use
>>>>> __GFP_RETRY_MAYFAIL instead?
>>>>
>>>> Thanks for your comment. This issue can hardly be reproduced on an 
>>>> ext4 filesystem, because ext4->readahead() doesn't call 
>>>> alloc_page(). If ext4->readahead() is changed as below, it becomes 
>>>> easy to reproduce this issue with the ext4 filesystem 
>>>> (repeatedly run: $stress-ng --bigheap ${num_of_cpu_threads} 
>>>> --sequential 0 --timeout 30s --skip-silent --verbose)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file 
>>>> *file, struct folio *folio)
>>>>   static void ext4_readahead(struct readahead_control *rac)
>>>>   {
>>>>          struct inode *inode = rac->mapping->host;
>>>> +       struct page *tmp_page;
>>>>
>>>>          /* If the file has inline data, no need to do readahead. */
>>>>          if (ext4_has_inline_data(inode))
>>>>                  return;
>>>>
>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>> +
>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>> +
>>>> +       if (tmp_page)
>>>> +               __free_page(tmp_page);
>>>>   }
>>>>
>>>
>> Hi Xiang and Michal,
>>> Is it tested with a pure ext4 without any other fs background?
>>>
>> Basically yes. Maybe there is a squashfs mounted for python3 in my 
>> test environment, but stress-ng and the shared libraries it needs are 
>> on the ext4 filesystem.
>
> One could build a static version of stress-ng to remove the need for 
> shared library loading at run time:
>
> git clone https://github.com/ColinIanKing/stress-ng
> cd stress-ng
> make clean
> STATIC=1 make -j 8
>
I did that already: I copied it to /home/ubuntu under uc20/uc22 and ran 
it from /home/ubuntu, and there is no hang issue anymore. The folder 
/home/ubuntu/ is on an ext4 filesystem, which proves the issue only 
happens on squashfs.

And if I build it without STATIC=1, it will hang even if I run it from 
/home/ubuntu/, because the system needs to load shared libraries from a 
squashfs folder.

Thanks,

Hui.

>
>>> I don't think it's true that "ext4->readahead() doesn't call
>>> alloc_page()", since even ext2/ext4 uses the buffer head
>>> interfaces to read metadata (extents or the old block mapping)
>>> from its bd_inode for readahead, which indirectly allocates
>>> some extra pages into the page cache as well.
>>
>> Calling alloc_page() or allocating memory in readahead() is not a 
>> problem by itself. Suppose we have 4 processes (A, B, C and D). 
>> Processes A, B and C are entering out_of_memory() because of 
>> allocating memory in readahead(); they are looping and waiting for 
>> some memory to be released. Process D could enter out_of_memory() 
>> with __GFP_FS, so it could trigger the oom killer; then A, B and C 
>> could get the memory and return to readahead(), and there is no 
>> system hang issue.
>>
>> But if all 4 processes enter out_of_memory() from readahead(), they 
>> will loop and wait endlessly; there is no process to trigger the oom 
>> killer, so users will think the system is hanging.
>>
>> I applied my change for ext4->readahead to linux-next, and tested it 
>> on my ubuntu classic server for arm64; I could reproduce the hang 
>> issue within 1 minute with a 100% reproduction rate. I guess it is 
>> easy to reproduce the issue because it is an embedded environment: 
>> the total number of processes in the system is very limited, nearly 
>> all userspace processes will finally reach out_of_memory() from 
>> ext4_readahead(), and nearly all kthreads will not reach 
>> out_of_memory() for a long time, which makes the system end up in a 
>> state like hang (not a real hang).
>>
>> And this is why I wrote a patch to let a specific kthread trigger oom 
>> killer forcibly (my initial patch).
>>
>>
>>>
>>> The only difference here is the total number of pages to be
>>> allocated, but the extra allocations needed for the compressed
>>> data make things worse.  So I think it depends a lot on how
>>> stressful your stress workload is, and I'm even not sure it's a
>>> real issue, since if you stop the stress workload, it will
>>> immediately recover (it just may not oom directly).
>>>
>> Yes, it is not a real hang. All userspace processes are looping and 
>> waiting for other processes to release or reclaim memory. And in this 
>> case, we can't stop the stress workload since users can't control the 
>> system through the console.
>>
>> So Michal,
>>
>> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a 
>> state something like hang when running stress-ng", do you know why 
>> out_of_memory() will return immediately if there is no __GFP_FS, 
>> could we drop these lines directly:
>>
>>      /*
>>       * The OOM killer does not compensate for IO-less reclaim.
>>       * pagefault_out_of_memory lost its gfp context so we have to
>>       * make sure exclude 0 mask - all other users should have at least
>>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>       */
>>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && 
>> !is_memcg_oom(oc))
>>          return true;
>>
>>
>> Thanks,
>>
>> Hui.
>>
>>> Thanks,
>>> Gao Xiang
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-27  3:47         ` Hui Wang
  2023-04-27  4:17           ` Gao Xiang
  2023-04-27  7:03           ` Colin King (gmail)
@ 2023-04-28 19:53           ` Michal Hocko
  2023-05-03 11:49             ` Hui Wang
  2 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2023-04-28 19:53 UTC (permalink / raw)
  To: Hui Wang
  Cc: Gao Xiang, linux-mm, akpm, surenb, colin.i.king, shy828301,
	hannes, vbabka, hch, mgorman, Phillip Lougher

On Thu 27-04-23 11:47:10, Hui Wang wrote:
[...]
> So Michal,
> 
> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a state
> something like hang when running stress-ng", do you know why out_of_memory()
> will return immediately if there is no __GFP_FS, could we drop these lines
> directly:
> 
>     /*
>      * The OOM killer does not compensate for IO-less reclaim.
>      * pagefault_out_of_memory lost its gfp context so we have to
>      * make sure exclude 0 mask - all other users should have at least
>      * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>      * invoke the OOM killer even if it is a GFP_NOFS allocation.
>      */
>     if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>         return true;

The comment is rather hard to grasp without an intimate knowledge of the
memory reclaim. The primary reason is that an allocation context
without __GFP_FS (and also __GFP_IO) cannot perform a full memory
reclaim, because the fs or the storage subsystem might be holding locks
required for the memory reclaim. This means that a large amount of
reclaimable memory is out of sight of that specific direct reclaim
context. If we allowed the oom killer to trigger, we could invoke the
oom killer while there is a lot of otherwise reclaimable memory. As you
can imagine, that is not something many users would appreciate, as the
oom kill is a very disruptive operation. In this case we rely on kswapd
or other GFP_KERNEL-like allocation contexts to make forward progress
instead. If there is really nothing reclaimable then the oom killer
would eventually hit from elsewhere.
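
For reference, the effect of the scoped NOFS API can be sketched roughly
as below. This is a simplified illustration in the spirit of
current_gfp_context() (the helper name here is made up), not the exact
kernel code:

	/*
	 * While a task is inside a memalloc_nofs_save()/
	 * memalloc_nofs_restore() scope, PF_MEMALLOC_NOFS is set and the
	 * effective gfp mask used for reclaim (and therefore seen by
	 * out_of_memory()) has __GFP_FS cleared, even if the caller
	 * passed GFP_KERNEL.
	 */
	static inline gfp_t effective_gfp(gfp_t flags)
	{
		unsigned int pflags = READ_ONCE(current->flags);

		if (pflags & PF_MEMALLOC_NOIO)
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (pflags & PF_MEMALLOC_NOFS)
			flags &= ~__GFP_FS;

		return flags;
	}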

HTH
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-04-28 19:53           ` Michal Hocko
@ 2023-05-03 11:49             ` Hui Wang
  2023-05-03 12:20               ` Michal Hocko
  2023-05-03 19:10               ` Phillip Lougher
  0 siblings, 2 replies; 25+ messages in thread
From: Hui Wang @ 2023-05-03 11:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gao Xiang, linux-mm, akpm, surenb, colin.i.king, shy828301,
	hannes, vbabka, hch, mgorman, Phillip Lougher


On 4/29/23 03:53, Michal Hocko wrote:
> On Thu 27-04-23 11:47:10, Hui Wang wrote:
> [...]
>> So Michal,
>>
>> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a state
>> something like hang when running stress-ng", do you know why out_of_memory()
>> will return immediately if there is no __GFP_FS, could we drop these lines
>> directly:
>>
>>      /*
>>       * The OOM killer does not compensate for IO-less reclaim.
>>       * pagefault_out_of_memory lost its gfp context so we have to
>>       * make sure exclude 0 mask - all other users should have at least
>>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>       */
>>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>>          return true;
> The comment is rather hard to grasp without an intimate knowledge of the
> memory reclaim. The primary reason is that the allocation context
> without __GFP_FS (and also __GFP_IO) cannot perform a full memory
> reclaim because fs or the storage subsystem might be holding locks
> required for the memory reclaim. This means that a large amount of
> reclaimable memory is out of sight of the specific direct reclaim
> context. If we allowed oom killer to trigger we could invoke the oom
> killer while there is a lot of otherwise reclaimable memory. As you can
> imagine not something many users would appreciate as the oom kill is a
> very disruptive operation. In this case we rely on kswapd or other
> GFP_KERNEL like allocation context to make forward progress instead. If there is
> really nothing reclaimable then the oom killer would eventually hit from
> elsewhere.
>
> HTH
Hi Michal,

Understood. Thanks for the explanation. So we can't remove those 2 lines 
of code.

Here in my patch, I let a kthread allocate a page with GFP_KERNEL. It 
could possibly trigger reclaim and, if nothing is reclaimable, trigger 
the oom killer. Do you think it is a safe workaround for the issue we 
are currently facing?
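
Schematically, the idea is something like the sketch below (the names
are illustrative and this is not the actual patch): a helper work item
does a GFP_KERNEL allocation, which is allowed to enter full reclaim
and, as a last resort, the oom killer, and then frees the page again.

	/* Illustrative sketch only, not the actual patch. */
	static void oom_fs_helper_fn(struct work_struct *work)
	{
		struct page *page = alloc_page(GFP_KERNEL);

		if (page)
			__free_page(page);
	}
	static DECLARE_WORK(oom_fs_helper, oom_fs_helper_fn);

	/* queued from out_of_memory() when the caller lacks __GFP_FS */
	schedule_work(&oom_fs_helper);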


And Hi Phillip,

What is your opinion on it? Do you have a direction to solve this issue 
from the filesystem side?

Thanks,

Hui.

>   


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-05-03 11:49             ` Hui Wang
@ 2023-05-03 12:20               ` Michal Hocko
  2023-05-03 18:41                 ` Phillip Lougher
  2023-05-03 19:10               ` Phillip Lougher
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2023-05-03 12:20 UTC (permalink / raw)
  To: Hui Wang
  Cc: Gao Xiang, linux-mm, akpm, surenb, colin.i.king, shy828301,
	hannes, vbabka, hch, mgorman, Phillip Lougher

On Wed 03-05-23 19:49:19, Hui Wang wrote:
> 
> On 4/29/23 03:53, Michal Hocko wrote:
> > On Thu 27-04-23 11:47:10, Hui Wang wrote:
> > [...]
> > > So Michal,
> > > 
> > > Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a state
> > > something like hang when running stress-ng", do you know why out_of_memory()
> > > will return immediately if there is no __GFP_FS, could we drop these lines
> > > directly:
> > > 
> > >      /*
> > >       * The OOM killer does not compensate for IO-less reclaim.
> > >       * pagefault_out_of_memory lost its gfp context so we have to
> > >       * make sure exclude 0 mask - all other users should have at least
> > >       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
> > >       * invoke the OOM killer even if it is a GFP_NOFS allocation.
> > >       */
> > >      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
> > >          return true;
> > The comment is rather hard to grasp without an intimate knowledge of the
> > memory reclaim. The primary reason is that the allocation context
> > without __GFP_FS (and also __GFP_IO) cannot perform a full memory
> > reclaim because fs or the storage subsystem might be holding locks
> > required for the memory reclaim. This means that a large amount of
> > reclaimable memory is out of sight of the specific direct reclaim
> > context. If we allowed oom killer to trigger we could invoke the oom
> > killer while there is a lot of otherwise reclaimable memory. As you can
> > imagine not something many users would appreciate as the oom kill is a
> > very disruptive operation. In this case we rely on kswapd or other
> > GFP_KERNEL like allocation context to make forward progress instead. If there is
> > really nothing reclaimable then the oom killer would eventually hit from
> > elsewhere.
> > 
> > HTH
> Hi Michal,
> 
> Understand. Thanks for explanation. So we can't remove those 2 lines of
> code.
> 
> Here in my patch, letting a kthread allocate a page with GFP_KERNEL, It
> could possibly trigger the reclaim and if nothing reclaimable, trigger the
> oom killer. Do you think it is a safe workaround for the issue we are facing
> currently?

I have to say I really dislike this workaround. Allocating memory just
to release it and potentially hit the oom killer is really not a very
mindful approach to the problem. It is not a reliable way either,
because you depend on the WQ context, which might be clogged for the
very same lack of memory. This issue simply doesn't have a simple and
neat solution, unfortunately.

I would prefer it if the fs could be less demanding from the NOFS
context, if that is possible at all.
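
One way to read that suggestion, sketched with assumed flags and an
assumed helper name (not a tested change): let the best-effort readahead
allocations fail under memory pressure instead of retrying forever in
the NOFS/NOIO context, e.g.

	/* hypothetical helper, roughly what "fail instead of retry"
	 * could look like in the squashfs readahead path
	 */
	static struct page *squashfs_ra_alloc_page(void)
	{
		return alloc_page(GFP_NOIO | __GFP_RETRY_MAYFAIL |
				  __GFP_NOWARN);
	}

	/* callers would treat a NULL return as "skip this readahead" */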
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-05-03 12:20               ` Michal Hocko
@ 2023-05-03 18:41                 ` Phillip Lougher
  0 siblings, 0 replies; 25+ messages in thread
From: Phillip Lougher @ 2023-05-03 18:41 UTC (permalink / raw)
  To: Michal Hocko, Hui Wang
  Cc: Gao Xiang, linux-mm, akpm, surenb, colin.i.king, shy828301,
	hannes, vbabka, hch, mgorman

On 03/05/2023 13:20, Michal Hocko wrote:
> On Wed 03-05-23 19:49:19, Hui Wang wrote:
>> On 4/29/23 03:53, Michal Hocko wrote:
>>> On Thu 27-04-23 11:47:10, Hui Wang wrote:
>>> [...]
>>>> So Michal,
>>>>
>>>> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a state
>>>> something like hang when running stress-ng", do you know why out_of_memory()
>>>> will return immediately if there is no __GFP_FS, could we drop these lines
>>>> directly:
>>>>
>>>>       /*
>>>>        * The OOM killer does not compensate for IO-less reclaim.
>>>>        * pagefault_out_of_memory lost its gfp context so we have to
>>>>        * make sure exclude 0 mask - all other users should have at least
>>>>        * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>>>        * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>>>        */
>>>>       if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>>>>           return true;
>>> The comment is rather hard to grasp without an intimate knowledge of the
>>> memory reclaim. The primary reason is that the allocation context
>>> without __GFP_FS (and also __GFP_IO) cannot perform a full memory
>>> reclaim because fs or the storage subsystem might be holding locks
>>> required for the memory reclaim. This means that a large amount of
>>> reclaimable memory is out of sight of the specific direct reclaim
>>> context. If we allowed oom killer to trigger we could invoke the oom
>>> killer while there is a lot of otherwise reclaimable memory. As you can
>>> imagine not something many users would appreciate as the oom kill is a
>>> very disruptive operation. In this case we rely on kswapd or other
>>> GFP_KERNEL like allocation context to make forward progress instead. If there is
>>> really nothing reclaimable then the oom killer would eventually hit from
>>> elsewhere.
>>>
>>> HTH
>> Hi Michal,
>>
>> Understand. Thanks for explanation. So we can't remove those 2 lines of
>> code.
>>
>> Here in my patch, letting a kthread allocate a page with GFP_KERNEL, It
>> could possibly trigger the reclaim and if nothing reclaimable, trigger the
>> oom killer. Do you think it is a safe workaround for the issue we are facing
>> currently?
> I have to say I really dislike this workaround. Allocating memory just
> to release it and potentially hit the oom killer is really not very
> mindful approach to the problem. It is not a reliable way either because
> you depend on the WQ context which might be clogged for the very same
> lack of memory. This issue simply doesn't have a simple and neat
> solution unfortunately.

Agree.

> I would prefer if the fs could be less demanding from NOFS context if
> that is possible at all.

This does seem to be the best solution.

Phillip


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-05-03 11:49             ` Hui Wang
  2023-05-03 12:20               ` Michal Hocko
@ 2023-05-03 19:10               ` Phillip Lougher
  2023-05-03 19:38                 ` Hui Wang
  2023-05-07 21:07                 ` Phillip Lougher
  1 sibling, 2 replies; 25+ messages in thread
From: Phillip Lougher @ 2023-05-03 19:10 UTC (permalink / raw)
  To: Hui Wang, Michal Hocko
  Cc: Gao Xiang, linux-mm, akpm, surenb, colin.i.king, shy828301,
	hannes, vbabka, hch, mgorman


On 03/05/2023 12:49, Hui Wang wrote:
>
> On 4/29/23 03:53, Michal Hocko wrote:
>> On Thu 27-04-23 11:47:10, Hui Wang wrote:
>> [...]
>>> So Michal,
>>>
>>> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters a 
>>> state
>>> something like hang when running stress-ng", do you know why 
>>> out_of_memory()
>>> will return immediately if there is no __GFP_FS, could we drop these 
>>> lines
>>> directly:
>>>
>>>      /*
>>>       * The OOM killer does not compensate for IO-less reclaim.
>>>       * pagefault_out_of_memory lost its gfp context so we have to
>>>       * make sure exclude 0 mask - all other users should have at least
>>>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>>       */
>>>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && 
>>> !is_memcg_oom(oc))
>>>          return true;
>> The comment is rather hard to grasp without an intimate knowledge of the
>> memory reclaim. The primary reason is that the allocation context
>> without __GFP_FS (and also __GFP_IO) cannot perform a full memory
>> reclaim because fs or the storage subsystem might be holding locks
>> required for the memory reclaim. This means that a large amount of
>> reclaimable memory is out of sight of the specific direct reclaim
>> context. If we allowed oom killer to trigger we could invoke the oom
>> killer while there is a lot of otherwise reclaimable memory. As you can
>> imagine not something many users would appreciate as the oom kill is a
>> very disruptive operation. In this case we rely on kswapd or other
>> GFP_KERNEL like allocation context to make forward progress instead. If there is
>> really nothing reclaimable then the oom killer would eventually hit from
>> elsewhere.
>>
>> HTH
> Hi Michal,
>
> Understand. Thanks for explanation. So we can't remove those 2 lines 
> of code.
>
> Here in my patch, letting a kthread allocate a page with GFP_KERNEL, 
> It could possibly trigger the reclaim and if nothing reclaimable, 
> trigger the oom killer. Do you think it is a safe workaround for the 
> issue we are facing currently?
>
>
> And Hi Phillip,
>
> What is your opinion on it, do you have a direction to solve this 
> issue from filesystem?
>

The following patch creates the concept of "squashfs contexts", which 
moves all memory dynamically allocated (in a readahead/read_page path) 
into a single structure which can be allocated and deleted once.  It 
then creates a pool of these at filesystem mount time.  Threads entering 
readahead/read_page will take a context from the pool, and will then 
perform no dynamic memory allocation.

The final patch series will make this a non-default build option for 
systems that need this.

Phillip

 From e4dd08d0850796132f9874cd7e0808373d86cbf2 Mon Sep 17 00:00:00 2001
From: Phillip Lougher <phillip@squashfs.org.uk>
Date: Tue, 2 May 2023 20:16:55 +0100
Subject: [PATCH] read_context

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
---
  fs/squashfs/Makefile                    |   1 +
  fs/squashfs/block.c                     |  67 ++++-----
  fs/squashfs/cache.c                     |  66 ++++++---
  fs/squashfs/decompressor.c              |  14 +-
  fs/squashfs/decompressor_multi.c        |   2 +
  fs/squashfs/decompressor_multi_percpu.c |   2 +
  fs/squashfs/decompressor_single.c       |   2 +
  fs/squashfs/file.c                      |  61 ++++----
  fs/squashfs/file_direct.c               |  59 ++++----
  fs/squashfs/page_actor.c                |  37 ++---
  fs/squashfs/page_actor.h                |   8 +-
  fs/squashfs/read_context.c              | 178 ++++++++++++++++++++++++
  fs/squashfs/read_context.h              |  31 +++++
  fs/squashfs/squashfs.h                  |  12 +-
  fs/squashfs/squashfs_fs_sb.h            |   3 +-
  fs/squashfs/super.c                     |  13 +-
  fs/squashfs/symlink.c                   |   3 +-
  17 files changed, 387 insertions(+), 172 deletions(-)
  create mode 100644 fs/squashfs/read_context.c
  create mode 100644 fs/squashfs/read_context.h

diff --git a/fs/squashfs/Makefile b/fs/squashfs/Makefile
index 477c89a519ee..93511720e94f 100644
--- a/fs/squashfs/Makefile
+++ b/fs/squashfs/Makefile
@@ -6,6 +6,7 @@
  obj-$(CONFIG_SQUASHFS) += squashfs.o
  squashfs-y += block.o cache.o dir.o export.o file.o fragment.o id.o inode.o
  squashfs-y += namei.o super.o symlink.o decompressor.o page_actor.o
+squashfs-y += read_context.o
  squashfs-$(CONFIG_SQUASHFS_FILE_CACHE) += file_cache.o
  squashfs-$(CONFIG_SQUASHFS_FILE_DIRECT) += file_direct.o
  squashfs-$(CONFIG_SQUASHFS_DECOMP_SINGLE) += decompressor_single.o
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index bed3bb8b27fa..110fcfd2a0eb 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -23,9 +23,10 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "squashfs.h"
  #include "decompressor.h"
-#include "page_actor.h"
  
  /*
   * Returns the amount of bytes copied to the page actor.
@@ -77,7 +78,7 @@ static int copy_bio_to_actor(struct bio *bio,
  }
  
  static int squashfs_bio_read(struct super_block *sb, u64 index, int length,
-			     struct bio **biop, int *block_offset)
+		     struct squashfs_context *context, int *block_offset)
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
  	const u64 read_start = round_down(index, msblk->devblksize);
@@ -87,44 +88,33 @@ static int squashfs_bio_read(struct super_block *sb, u64 index, int length,
  	int offset = read_start - round_down(index, PAGE_SIZE);
  	int total_len = (block_end - block) << msblk->devblksize_log2;
  	const int page_count = DIV_ROUND_UP(total_len + offset, PAGE_SIZE);
-	int error, i;
-	struct bio *bio;
+	struct bio *bio = context->bio;
+	int error = -EIO, i;
  
-	bio = bio_kmalloc(page_count, GFP_NOIO);
-	if (!bio)
-		return -ENOMEM;
  	bio_init(bio, sb->s_bdev, bio->bi_inline_vecs, page_count, REQ_OP_READ);
  	bio->bi_iter.bi_sector = block * (msblk->devblksize >> SECTOR_SHIFT);
  
  	for (i = 0; i < page_count; ++i) {
  		unsigned int len =
  			min_t(unsigned int, PAGE_SIZE - offset, total_len);
-		struct page *page = alloc_page(GFP_NOIO);
+		int res = bio_add_page(bio, context->bio_page[i], len, offset);
+
+		if (!res)
+			goto out_uninit_bio;
  
-		if (!page) {
-			error = -ENOMEM;
-			goto out_free_bio;
-		}
-		if (!bio_add_page(bio, page, len, offset)) {
-			error = -EIO;
-			goto out_free_bio;
-		}
  		offset = 0;
  		total_len -= len;
  	}
  
  	error = submit_bio_wait(bio);
  	if (error)
-		goto out_free_bio;
+		goto out_uninit_bio;
  
-	*biop = bio;
  	*block_offset = index & ((1 << msblk->devblksize_log2) - 1);
  	return 0;
  
-out_free_bio:
-	bio_free_pages(bio);
-	bio_uninit(bio);
-	kfree(bio);
+out_uninit_bio:
+	bio_uninit(context->bio);
  	return error;
  }
  
@@ -138,10 +128,10 @@ static int squashfs_bio_read(struct super_block *sb, u64 index, int length,
   * algorithms).
   */
  int squashfs_read_data(struct super_block *sb, u64 index, int length,
-		       u64 *next_index, struct squashfs_page_actor *output)
+		       u64 *next_index, struct squashfs_context *context)
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
-	struct bio *bio = NULL;
+	struct squashfs_page_actor *output = context->actor;
  	int compressed;
  	int res;
  	int offset;
@@ -166,13 +156,13 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
  			res = -EIO;
  			goto out;
  		}
-		res = squashfs_bio_read(sb, index, 2, &bio, &offset);
+		res = squashfs_bio_read(sb, index, 2, context, &offset);
  		if (res)
  			goto out;
  
-		if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+		if (WARN_ON_ONCE(!bio_next_segment(context->bio, &iter_all))) {
  			res = -EIO;
-			goto out_free_bio;
+			goto out_uninit_bio;
  		}
  		/* Extract the length of the metadata block */
  		data = bvec_virt(bvec);
@@ -180,16 +170,14 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
  		if (offset < bvec->bv_len - 1) {
  			length |= data[offset + 1] << 8;
  		} else {
-			if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+			if (WARN_ON_ONCE(!bio_next_segment(context->bio, &iter_all))) {
  				res = -EIO;
-				goto out_free_bio;
+				goto out_uninit_bio;
  			}
  			data = bvec_virt(bvec);
  			length |= data[0] << 8;
  		}
-		bio_free_pages(bio);
-		bio_uninit(bio);
-		kfree(bio);
+		bio_uninit(context->bio);
  
  		compressed = SQUASHFS_COMPRESSED(length);
  		length = SQUASHFS_COMPRESSED_SIZE(length);
@@ -207,24 +195,23 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
  	if (next_index)
  		*next_index = index + length;
  
-	res = squashfs_bio_read(sb, index, length, &bio, &offset);
+	res = squashfs_bio_read(sb, index, length, context, &offset);
  	if (res)
  		goto out;
  
  	if (compressed) {
  		if (!msblk->stream) {
  			res = -EIO;
-			goto out_free_bio;
+			goto out_uninit_bio;
  		}
-		res = msblk->thread_ops->decompress(msblk, bio, offset, length, output);
+		res = msblk->thread_ops->decompress(msblk, context->bio,
+						offset, length, output);
  	} else {
-		res = copy_bio_to_actor(bio, output, offset, length);
+		res = copy_bio_to_actor(context->bio, output, offset, length);
  	}
  
-out_free_bio:
-	bio_free_pages(bio);
-	bio_uninit(bio);
-	kfree(bio);
+out_uninit_bio:
+	bio_uninit(context->bio);
  out:
  	if (res < 0) {
  		ERROR("Failed to read block 0x%llx: %d\n", index, res);
diff --git a/fs/squashfs/cache.c b/fs/squashfs/cache.c
index 5062326d0efb..98af4f696edd 100644
--- a/fs/squashfs/cache.c
+++ b/fs/squashfs/cache.c
@@ -42,19 +42,24 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
-#include "squashfs.h"
+#include "read_context.h"
  #include "page_actor.h"
+#include "squashfs.h"
  
  /*
   * Look-up block in cache, and increment usage count.  If not in cache, read
   * and decompress it from disk.
   */
  struct squashfs_cache_entry *squashfs_cache_get(struct super_block *sb,
-	struct squashfs_cache *cache, u64 block, int length)
+	struct squashfs_cache *cache, struct squashfs_contexts *contexts,
+	struct squashfs_context *context, u64 block, int length)
  {
-	int i, n;
+	int i, n, need_context = context == NULL;
  	struct squashfs_cache_entry *entry;
  
+	if (cache->contexts)
+		contexts = cache->contexts;
+
  	spin_lock(&cache->lock);
  
  	while (1) {
@@ -107,8 +112,17 @@ struct squashfs_cache_entry *squashfs_cache_get(struct super_block *sb,
  			entry->error = 0;
  			spin_unlock(&cache->lock);
  
+			if (need_context)
+				context = squashfs_get_context(contexts);
+
+			squashfs_page_actor_init(context->actor, entry->data,
+				cache->pages, 0);
+
  			entry->length = squashfs_read_data(sb, block, length,
-				&entry->next_index, entry->actor);
+				&entry->next_index, context);
+
+			if (need_context)
+				squashfs_put_context(contexts, context);
  
  			spin_lock(&cache->lock);
  
@@ -207,9 +221,9 @@ void squashfs_cache_delete(struct squashfs_cache *cache)
  				kfree(cache->entry[i].data[j]);
  			kfree(cache->entry[i].data);
  		}
-		kfree(cache->entry[i].actor);
  	}
  
+	squashfs_delete_contexts(cache->contexts);
  	kfree(cache->entry);
  	kfree(cache);
  }
@@ -221,7 +235,7 @@ void squashfs_cache_delete(struct squashfs_cache *cache)
   * is allocated as a sequence of kmalloced PAGE_SIZE buffers.
   */
  struct squashfs_cache *squashfs_cache_init(char *name, int entries,
-	int block_size)
+	int block_size, int alloc_contexts)
  {
  	int i, j;
  	struct squashfs_cache *cache = kzalloc(sizeof(*cache), GFP_KERNEL);
@@ -237,6 +251,15 @@ struct squashfs_cache *squashfs_cache_init(char *name, int entries,
  		goto cleanup;
  	}
  
+	if (alloc_contexts) {
+		cache->contexts = squashfs_create_contexts(block_size, entries,
+									0, 0);
+		if (cache->contexts == NULL) {
+			ERROR("Failed to allocate %s cache\n", name);
+			goto cleanup;
+		}
+	}
+
  	cache->curr_blk = 0;
  	cache->next_blk = 0;
  	cache->unused = entries;
@@ -268,13 +291,6 @@ struct squashfs_cache *squashfs_cache_init(char *name, int entries,
  				goto cleanup;
  			}
  		}
-
-		entry->actor = squashfs_page_actor_init(entry->data,
-						cache->pages, 0);
-		if (entry->actor == NULL) {
-			ERROR("Failed to allocate %s cache entry\n", name);
-			goto cleanup;
-		}
  	}
  
  	return cache;
@@ -341,7 +357,8 @@ int squashfs_read_metadata(struct super_block *sb, void *buffer,
  		return -EIO;
  
  	while (length) {
-		entry = squashfs_cache_get(sb, msblk->block_cache, *block, 0);
+		entry = squashfs_cache_get(sb, msblk->block_cache, NULL, NULL,
+								*block, 0);
  		if (entry->error) {
  			res = entry->error;
  			goto error;
@@ -377,12 +394,12 @@ int squashfs_read_metadata(struct super_block *sb, void *buffer,
   * filesystem.  If necessary read and decompress it from disk.
   */
  struct squashfs_cache_entry *squashfs_get_fragment(struct super_block *sb,
-				u64 start_block, int length)
+		u64 start_block, int length, struct squashfs_context *context)
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
  
-	return squashfs_cache_get(sb, msblk->fragment_cache, start_block,
-		length);
+	return squashfs_cache_get(sb, msblk->fragment_cache, msblk->contexts,
+		context, start_block, length);
  }
  
  
@@ -396,7 +413,8 @@ struct squashfs_cache_entry *squashfs_get_datablock(struct super_block *sb,
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
  
-	return squashfs_cache_get(sb, msblk->read_page, start_block, length);
+	return squashfs_cache_get(sb, msblk->read_page, msblk->contexts, NULL,
+							start_block, length);
  }
  
  
@@ -408,7 +426,7 @@ void *squashfs_read_table(struct super_block *sb, u64 block, int length)
  	int pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
  	int i, res;
  	void *table, *buffer, **data;
-	struct squashfs_page_actor *actor;
+	struct squashfs_context *context;
  
  	table = buffer = kmalloc(length, GFP_KERNEL);
  	if (table == NULL)
@@ -420,20 +438,22 @@ void *squashfs_read_table(struct super_block *sb, u64 block, int length)
  		goto failed;
  	}
  
-	actor = squashfs_page_actor_init(data, pages, length);
-	if (actor == NULL) {
+	context = squashfs_create_context(length, 0, 0);
+	if (context == NULL) {
  		res = -ENOMEM;
  		goto failed2;
  	}
  
+	squashfs_page_actor_init(context->actor, data, pages, length);
+
  	for (i = 0; i < pages; i++, buffer += PAGE_SIZE)
  		data[i] = buffer;
  
  	res = squashfs_read_data(sb, block, length |
-		SQUASHFS_COMPRESSED_BIT_BLOCK, NULL, actor);
+		SQUASHFS_COMPRESSED_BIT_BLOCK, NULL, context);
  
+	squashfs_delete_context(context);
  	kfree(data);
-	kfree(actor);
  
  	if (res < 0)
  		goto failed;
diff --git a/fs/squashfs/decompressor.c b/fs/squashfs/decompressor.c
index 8893cb9b4198..d12255edb129 100644
--- a/fs/squashfs/decompressor.c
+++ b/fs/squashfs/decompressor.c
@@ -15,9 +15,10 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "decompressor.h"
  #include "squashfs.h"
-#include "page_actor.h"
  
  /*
   * This file (and decompressor.h) implements a decompressor framework for
@@ -89,7 +90,7 @@ static void *get_comp_opts(struct super_block *sb, unsigned short flags)
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
  	void *buffer = NULL, *comp_opts;
-	struct squashfs_page_actor *actor = NULL;
+	struct squashfs_context *context;
  	int length = 0;
  
  	/*
@@ -102,14 +103,16 @@ static void *get_comp_opts(struct super_block *sb, unsigned short flags)
  			goto out;
  		}
  
-		actor = squashfs_page_actor_init(&buffer, 1, 0);
-		if (actor == NULL) {
+		context = squashfs_create_context(PAGE_SIZE, 0, 0);
+		if (context == NULL) {
  			comp_opts = ERR_PTR(-ENOMEM);
  			goto out;
  		}
  
+		squashfs_page_actor_init(context->actor, &buffer, 1, 0);
  		length = squashfs_read_data(sb,
-			sizeof(struct squashfs_super_block), 0, NULL, actor);
+			sizeof(struct squashfs_super_block), 0, NULL, context);
+		squashfs_delete_context(context);
  
  		if (length < 0) {
  			comp_opts = ERR_PTR(length);
@@ -120,7 +123,6 @@ static void *get_comp_opts(struct super_block *sb, unsigned short flags)
  	comp_opts = squashfs_comp_opts(msblk, buffer, length);
  
  out:
-	kfree(actor);
  	kfree(buffer);
  	return comp_opts;
  }
diff --git a/fs/squashfs/decompressor_multi.c b/fs/squashfs/decompressor_multi.c
index 416c53eedbd1..0a31012abec7 100644
--- a/fs/squashfs/decompressor_multi.c
+++ b/fs/squashfs/decompressor_multi.c
@@ -13,6 +13,8 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "decompressor.h"
  #include "squashfs.h"
  
diff --git a/fs/squashfs/decompressor_multi_percpu.c b/fs/squashfs/decompressor_multi_percpu.c
index 1dfadf76ed9a..12094aaba19f 100644
--- a/fs/squashfs/decompressor_multi_percpu.c
+++ b/fs/squashfs/decompressor_multi_percpu.c
@@ -12,6 +12,8 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "decompressor.h"
  #include "squashfs.h"
  
diff --git a/fs/squashfs/decompressor_single.c b/fs/squashfs/decompressor_single.c
index 6f161887710b..8d9b73103165 100644
--- a/fs/squashfs/decompressor_single.c
+++ b/fs/squashfs/decompressor_single.c
@@ -11,6 +11,8 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "decompressor.h"
  #include "squashfs.h"
  
diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
index 8ba8c4c50770..d8ca63a3eb0e 100644
--- a/fs/squashfs/file.c
+++ b/fs/squashfs/file.c
@@ -38,8 +38,9 @@
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
  #include "squashfs_fs_i.h"
-#include "squashfs.h"
+#include "read_context.h"
  #include "page_actor.h"
+#include "squashfs.h"
  
  /*
   * Locate cache slot in range [offset, index] for specified inode.  If
@@ -424,7 +425,7 @@ static int squashfs_readpage_fragment(struct page *page, int expected)
  	struct inode *inode = page->mapping->host;
  	struct squashfs_cache_entry *buffer = squashfs_get_fragment(inode->i_sb,
  		squashfs_i(inode)->fragment_block,
-		squashfs_i(inode)->fragment_size);
+		squashfs_i(inode)->fragment_size, NULL);
  	int res = buffer->error;
  
  	if (res)
@@ -497,13 +498,13 @@ static int squashfs_read_folio(struct file *file, struct folio *folio)
  	return res;
  }
  
-static int squashfs_readahead_fragment(struct page **page,
+static int squashfs_readahead_fragment(struct squashfs_context *context,
  	unsigned int pages, unsigned int expected)
  {
-	struct inode *inode = page[0]->mapping->host;
+	struct inode *inode = context->page[0]->mapping->host;
  	struct squashfs_cache_entry *buffer = squashfs_get_fragment(inode->i_sb,
  		squashfs_i(inode)->fragment_block,
-		squashfs_i(inode)->fragment_size);
+		squashfs_i(inode)->fragment_size, context);
  	struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
  	unsigned int n, mask = (1 << (msblk->block_log - PAGE_SHIFT)) - 1;
  	int error = buffer->error;
@@ -514,18 +515,18 @@ static int squashfs_readahead_fragment(struct page **page,
  	expected += squashfs_i(inode)->fragment_offset;
  
  	for (n = 0; n < pages; n++) {
-		unsigned int base = (page[n]->index & mask) << PAGE_SHIFT;
+		unsigned int base = (context->page[n]->index & mask) << PAGE_SHIFT;
  		unsigned int offset = base + squashfs_i(inode)->fragment_offset;
  
  		if (expected > offset) {
  			unsigned int avail = min_t(unsigned int, expected -
  				offset, PAGE_SIZE);
  
-			squashfs_fill_page(page[n], buffer, offset, avail);
+			squashfs_fill_page(context->page[n], buffer, offset, avail);
  		}
  
-		unlock_page(page[n]);
-		put_page(page[n]);
+		unlock_page(context->page[n]);
+		put_page(context->page[n]);
  	}
  
  out:
@@ -541,16 +542,15 @@ static void squashfs_readahead(struct readahead_control *ractl)
  	unsigned short shift = msblk->block_log - PAGE_SHIFT;
  	loff_t start = readahead_pos(ractl) & ~mask;
  	size_t len = readahead_length(ractl) + readahead_pos(ractl) - start;
-	struct squashfs_page_actor *actor;
+	struct squashfs_context *context;
  	unsigned int nr_pages = 0;
-	struct page **pages;
  	int i, file_end = i_size_read(inode) >> msblk->block_log;
  	unsigned int max_pages = 1UL << shift;
  
  	readahead_expand(ractl, start, (len | mask) + 1);
  
-	pages = kmalloc_array(max_pages, sizeof(void *), GFP_KERNEL);
-	if (!pages)
+	context = squashfs_get_context(msblk->contexts);
+	if (context == NULL)
  		return;
  
  	for (;;) {
@@ -566,21 +566,21 @@ static void squashfs_readahead(struct readahead_control *ractl)
  
  		max_pages = (expected + PAGE_SIZE - 1) >> PAGE_SHIFT;
  
-		nr_pages = __readahead_batch(ractl, pages, max_pages);
+		nr_pages = __readahead_batch(ractl, context->page, max_pages);
  		if (!nr_pages)
  			break;
  
  		if (readahead_pos(ractl) >= i_size_read(inode))
  			goto skip_pages;
  
-		index = pages[0]->index >> shift;
+		index = context->page[0]->index >> shift;
  
-		if ((pages[nr_pages - 1]->index >> shift) != index)
+		if ((context->page[nr_pages - 1]->index >> shift) != index)
  			goto skip_pages;
  
  		if (index == file_end && squashfs_i(inode)->fragment_block !=
  						SQUASHFS_INVALID_BLK) {
-			res = squashfs_readahead_fragment(pages, nr_pages,
+			res = squashfs_readahead_fragment(context, nr_pages,
  							  expected);
  			if (res)
  				goto skip_pages;
@@ -591,14 +591,13 @@ static void squashfs_readahead(struct readahead_control *ractl)
  		if (bsize == 0)
  			goto skip_pages;
  
-		actor = squashfs_page_actor_init_special(msblk, pages, nr_pages,
-							 expected);
-		if (!actor)
-			goto skip_pages;
+		squashfs_page_actor_init_special(context, msblk, nr_pages,
+							expected);
  
-		res = squashfs_read_data(inode->i_sb, block, bsize, NULL, actor);
+		res = squashfs_read_data(inode->i_sb, block, bsize, NULL,
+							context);
  
-		last_page = squashfs_page_actor_free(actor);
+		last_page = squashfs_page_actor_free(context->actor);
  
  		if (res == expected) {
  			int bytes;
@@ -610,26 +609,26 @@ static void squashfs_readahead(struct readahead_control *ractl)
  					     PAGE_SIZE - bytes);
  
  			for (i = 0; i < nr_pages; i++) {
-				flush_dcache_page(pages[i]);
-				SetPageUptodate(pages[i]);
+				flush_dcache_page(context->page[i]);
+				SetPageUptodate(context->page[i]);
  			}
  		}
  
  		for (i = 0; i < nr_pages; i++) {
-			unlock_page(pages[i]);
-			put_page(pages[i]);
+			unlock_page(context->page[i]);
+			put_page(context->page[i]);
  		}
  	}
  
-	kfree(pages);
+	squashfs_put_context(msblk->contexts, context);
  	return;
  
  skip_pages:
  	for (i = 0; i < nr_pages; i++) {
-		unlock_page(pages[i]);
-		put_page(pages[i]);
+		unlock_page(context->page[i]);
+		put_page(context->page[i]);
  	}
-	kfree(pages);
+	squashfs_put_context(msblk->contexts, context);
  }
  
  const struct address_space_operations squashfs_aops = {
diff --git a/fs/squashfs/file_direct.c b/fs/squashfs/file_direct.c
index f1ccad519e28..ab7b990cf71d 100644
--- a/fs/squashfs/file_direct.c
+++ b/fs/squashfs/file_direct.c
@@ -15,8 +15,9 @@
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
  #include "squashfs_fs_i.h"
-#include "squashfs.h"
+#include "read_context.h"
  #include "page_actor.h"
+#include "squashfs.h"
  
  /* Read separately compressed datablock directly into page cache */
  int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
@@ -31,8 +32,7 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  	int start_index = target_page->index & ~mask;
  	int end_index = start_index | mask;
  	int i, n, pages, bytes, res = -ENOMEM;
-	struct page **page;
-	struct squashfs_page_actor *actor;
+	struct squashfs_context *context;
  	void *pageaddr;
  
  	if (end_index > file_end)
@@ -40,21 +40,21 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  
  	pages = end_index - start_index + 1;
  
-	page = kmalloc_array(pages, sizeof(void *), GFP_KERNEL);
-	if (page == NULL)
+	context = squashfs_get_context(msblk->contexts);
+	if (context == NULL)
  		return res;
  
  	/* Try to grab all the pages covered by the Squashfs block */
  	for (i = 0, n = start_index; n <= end_index; n++) {
-		page[i] = (n == target_page->index) ? target_page :
+		context->page[i] = (n == target_page->index) ? target_page :
  			grab_cache_page_nowait(target_page->mapping, n);
  
-		if (page[i] == NULL)
+		if (context->page[i] == NULL)
  			continue;
  
-		if (PageUptodate(page[i])) {
-			unlock_page(page[i]);
-			put_page(page[i]);
+		if (PageUptodate(context->page[i])) {
+			unlock_page(context->page[i]);
+			put_page(context->page[i]);
  			continue;
  		}
  
@@ -64,17 +64,15 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  	pages = i;
  
  	/*
-	 * Create a "page actor" which will kmap and kunmap the
+	 * Initialise "page actor" which will kmap and kunmap the
  	 * page cache pages appropriately within the decompressor
  	 */
-	actor = squashfs_page_actor_init_special(msblk, page, pages, expected);
-	if (actor == NULL)
-		goto out;
+	squashfs_page_actor_init_special(context, msblk, pages, expected);
  
  	/* Decompress directly into the page cache buffers */
-	res = squashfs_read_data(inode->i_sb, block, bsize, NULL, actor);
+	res = squashfs_read_data(inode->i_sb, block, bsize, NULL, context);
  
-	squashfs_page_actor_free(actor);
+	squashfs_page_actor_free(context->actor);
  
  	if (res < 0)
  		goto mark_errored;
@@ -86,22 +84,22 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  
  	/* Last page (if present) may have trailing bytes not filled */
  	bytes = res % PAGE_SIZE;
-	if (page[pages - 1]->index == end_index && bytes) {
-		pageaddr = kmap_local_page(page[pages - 1]);
+	if (context->page[pages - 1]->index == end_index && bytes) {
+		pageaddr = kmap_local_page(context->page[pages - 1]);
  		memset(pageaddr + bytes, 0, PAGE_SIZE - bytes);
  		kunmap_local(pageaddr);
  	}
  
  	/* Mark pages as uptodate, unlock and release */
  	for (i = 0; i < pages; i++) {
-		flush_dcache_page(page[i]);
-		SetPageUptodate(page[i]);
-		unlock_page(page[i]);
-		if (page[i] != target_page)
-			put_page(page[i]);
+		flush_dcache_page(context->page[i]);
+		SetPageUptodate(context->page[i]);
+		unlock_page(context->page[i]);
+		if (context->page[i] != target_page)
+			put_page(context->page[i]);
  	}
  
-	kfree(page);
+	squashfs_put_context(msblk->contexts, context);
  
  	return 0;
  
@@ -110,15 +108,14 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  	 * dealt with by the caller
  	 */
  	for (i = 0; i < pages; i++) {
-		if (page[i] == NULL || page[i] == target_page)
+		if (context->page[i] == NULL || context->page[i] == target_page)
  			continue;
-		flush_dcache_page(page[i]);
-		SetPageError(page[i]);
-		unlock_page(page[i]);
-		put_page(page[i]);
+		flush_dcache_page(context->page[i]);
+		SetPageError(context->page[i]);
+		unlock_page(context->page[i]);
+		put_page(context->page[i]);
  	}
  
-out:
-	kfree(page);
+	squashfs_put_context(msblk->contexts, context);
  	return res;
  }
diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
index 81af6c4ca115..8ff50034afee 100644
--- a/fs/squashfs/page_actor.c
+++ b/fs/squashfs/page_actor.c
@@ -8,8 +8,9 @@
  #include <linux/slab.h>
  #include <linux/pagemap.h>
  #include "squashfs_fs_sb.h"
-#include "decompressor.h"
+#include "read_context.h"
  #include "page_actor.h"
+#include "decompressor.h"
  
  /*
   * This file contains implementations of page_actor for decompressing into
@@ -40,14 +41,9 @@ static void cache_finish_page(struct squashfs_page_actor *actor)
  	/* empty */
  }
  
-struct squashfs_page_actor *squashfs_page_actor_init(void **buffer,
+void squashfs_page_actor_init(struct squashfs_page_actor *actor, void **buffer,
  	int pages, int length)
  {
-	struct squashfs_page_actor *actor = kmalloc(sizeof(*actor), GFP_KERNEL);
-
-	if (actor == NULL)
-		return NULL;
-
  	actor->length = length ? : pages * PAGE_SIZE;
  	actor->buffer = buffer;
  	actor->pages = pages;
@@ -56,7 +52,6 @@ struct squashfs_page_actor *squashfs_page_actor_init(void **buffer,
  	actor->squashfs_first_page = cache_first_page;
  	actor->squashfs_next_page = cache_next_page;
  	actor->squashfs_finish_page = cache_finish_page;
-	return actor;
  }
  
  /* Implementation of page_actor for decompressing directly into page cache. */
@@ -102,35 +97,23 @@ static void direct_finish_page(struct squashfs_page_actor *actor)
  		kunmap_local(actor->pageaddr);
  }
  
-struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_info *msblk,
-	struct page **page, int pages, int length)
+void squashfs_page_actor_init_special(struct squashfs_context *context,
+	struct squashfs_sb_info *msblk, int pages, int length)
  {
-	struct squashfs_page_actor *actor = kmalloc(sizeof(*actor), GFP_KERNEL);
-
-	if (actor == NULL)
-		return NULL;
-
-	if (msblk->decompressor->alloc_buffer) {
-		actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
-
-		if (actor->tmp_buffer == NULL) {
-			kfree(actor);
-			return NULL;
-		}
-	} else
-		actor->tmp_buffer = NULL;
+	struct squashfs_page_actor *actor = context->actor;
  
+	actor->tmp_buffer = context->hole;
  	actor->length = length ? : pages * PAGE_SIZE;
-	actor->page = page;
+	actor->page = context->page;
  	actor->pages = pages;
  	actor->next_page = 0;
  	actor->returned_pages = 0;
-	actor->next_index = page[0]->index & ~((1 << (msblk->block_log - PAGE_SHIFT)) - 1);
+	actor->next_index = context->page[0]->index & ~((1 <<
+					(msblk->block_log - PAGE_SHIFT)) - 1);
  	actor->pageaddr = NULL;
  	actor->last_page = NULL;
  	actor->alloc_buffer = msblk->decompressor->alloc_buffer;
  	actor->squashfs_first_page = direct_first_page;
  	actor->squashfs_next_page = direct_next_page;
  	actor->squashfs_finish_page = direct_finish_page;
-	return actor;
  }
diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
index 97d4983559b1..36fbc93d8907 100644
--- a/fs/squashfs/page_actor.h
+++ b/fs/squashfs/page_actor.h
@@ -25,17 +25,15 @@ struct squashfs_page_actor {
  	pgoff_t	next_index;
  };
  
-extern struct squashfs_page_actor *squashfs_page_actor_init(void **buffer,
+extern void squashfs_page_actor_init(struct squashfs_page_actor *actor, void **buffer,
  				int pages, int length);
-extern struct squashfs_page_actor *squashfs_page_actor_init_special(
+extern void squashfs_page_actor_init_special(struct squashfs_context *context,
  				struct squashfs_sb_info *msblk,
-				struct page **page, int pages, int length);
+				int pages, int length);
  static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *actor)
  {
  	struct page *last_page = actor->last_page;
  
-	kfree(actor->tmp_buffer);
-	kfree(actor);
  	return last_page;
  }
  static inline void *squashfs_first_page(struct squashfs_page_actor *actor)
diff --git a/fs/squashfs/read_context.c b/fs/squashfs/read_context.c
new file mode 100644
index 000000000000..6ad151b034ed
--- /dev/null
+++ b/fs/squashfs/read_context.c
@@ -0,0 +1,178 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2023
+ * Phillip Lougher <phillip@squashfs.org.uk>
+ *
+ * read_context.c
+ */
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/pagemap.h>
+#include <linux/bio.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs.h"
+#include "page_actor.h"
+
+
+void squashfs_delete_context(struct squashfs_context *context)
+{
+	int i;
+
+	for (i = 0; i < context->bio_pages; i++)
+		if (context->bio_page[i])
+			__free_page(context->bio_page[i]);
+
+	kfree(context->bio);
+	kfree(context->page);
+	kfree(context->actor);
+	kfree(context->hole);
+}
+
+
+struct squashfs_context *squashfs_create_context(int length, int alloc_hole,
+							int alloc_array)
+{
+	int pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	int bio_pages = pages + 1, i;
+	struct squashfs_context *context;
+
+	context = kzalloc(sizeof(struct squashfs_context), GFP_KERNEL);
+	if (context == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	context->bio_page = kcalloc(bio_pages, sizeof(struct bio_pages *),
+							GFP_KERNEL);
+
+	if (context->bio_page == NULL)
+		goto cleanup;
+
+	context->bio_pages = bio_pages;
+
+	for (i = 0; i < bio_pages; i++) {
+		context->bio_page[i] = alloc_page(GFP_KERNEL);
+
+		if (context->bio_page[i] == NULL)
+			goto cleanup;
+	}
+
+	context->bio = bio_kmalloc(bio_pages, GFP_KERNEL);
+	if (context->bio == NULL)
+		goto cleanup;
+
+	context->actor = kmalloc(sizeof(struct squashfs_page_actor),
+							GFP_KERNEL);
+	if (context->actor == NULL)
+		goto cleanup;
+
+	if (alloc_array) {
+		context->page = kmalloc_array(pages, sizeof(struct page *),
+								GFP_KERNEL);
+		if (context->page == NULL)
+			goto cleanup;
+	}
+
+	if (alloc_hole) {
+		context->hole = kmalloc(PAGE_SIZE, GFP_KERNEL);
+		if (context->hole == NULL)
+			goto cleanup;
+	}
+
+	return context;
+
+cleanup:
+	squashfs_delete_context(context);
+	return ERR_PTR(-ENOMEM);
+}
+
+
+struct squashfs_context *squashfs_get_context(struct squashfs_contexts *contexts)
+{
+	struct squashfs_context *context;
+
+	while (1) {
+		mutex_lock(&contexts->mutex);
+
+		if (!list_empty(&contexts->list)) {
+			context = list_entry(contexts->list.prev,
+					struct squashfs_context, list);
+			list_del(&context->list);
+			mutex_unlock(&contexts->mutex);
+			break;
+		}
+
+		mutex_unlock(&contexts->mutex);
+		wait_event(contexts->wait_queue, !list_empty(&contexts->list));
+	}
+
+	return context;
+}
+
+
+void squashfs_put_context(struct squashfs_contexts *contexts,
+					struct squashfs_context *context)
+{
+
+	mutex_lock(&contexts->mutex);
+	list_add(&context->list, &contexts->list);
+	mutex_unlock(&contexts->mutex);
+	wake_up(&contexts->wait_queue);
+}
+
+
+void squashfs_delete_contexts(struct squashfs_contexts *contexts)
+{
+	struct squashfs_context *context;
+
+	if (contexts == NULL)
+		return;
+
+	while (!list_empty(&contexts->list)) {
+		context = list_entry(contexts->list.prev,
+				struct squashfs_context, list);
+		list_del(&context->list);
+		squashfs_delete_context(context);
+	}
+}
+
+
+struct squashfs_contexts *squashfs_create_contexts(int block_size, int entries,
+						int alloc_hole, int alloc_array)
+{
+	int i;
+	struct squashfs_contexts *contexts;
+
+	contexts = kzalloc(sizeof(struct squashfs_contexts), GFP_KERNEL);
+	if (contexts == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&contexts->mutex);
+	INIT_LIST_HEAD(&contexts->list);
+	init_waitqueue_head(&contexts->wait_queue);
+
+	for (i = 0; i < entries; i++) {
+		struct squashfs_context *context;
+
+		context = squashfs_create_context(block_size, alloc_hole,
+								alloc_array);
+		if (context == NULL)
+			goto cleanup;
+
+		list_add(&context->list, &contexts->list);
+	}
+
+	return contexts;
+
+cleanup:
+	squashfs_delete_contexts(contexts);
+	return ERR_PTR(-ENOMEM);
+}
diff --git a/fs/squashfs/read_context.h b/fs/squashfs/read_context.h
new file mode 100644
index 000000000000..a9b93a43893b
--- /dev/null
+++ b/fs/squashfs/read_context.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef READ_CONTEXT_H
+#define READ_CONTEXT_H
+/*
+ * Copyright (c) 2023
+ * Phillip Lougher <phillip@squashfs.org.uk>
+ */
+
+struct squashfs_context {
+	int				bio_pages;
+	struct bio			*bio;
+	struct page			**bio_page;
+	struct squashfs_page_actor	*actor;
+	struct page			**page;
+	void				*hole;
+	struct list_head		list;
+};
+
+struct squashfs_contexts {
+	struct mutex		mutex;
+	struct list_head	list;
+	wait_queue_head_t	wait_queue;
+};
+
+extern void squashfs_delete_context(struct squashfs_context *context);
+extern struct squashfs_context *squashfs_create_context(int length, int alloc_hole, int alloc_array);
+extern struct squashfs_context *squashfs_get_context(struct squashfs_contexts *contexts);
+extern void squashfs_put_context(struct squashfs_contexts *contexts, struct squashfs_context *context);
+extern void squashfs_delete_contexts(struct squashfs_contexts *contexts);
+extern struct squashfs_contexts *squashfs_create_contexts(int block_size, int entries, int alloc_hole, int alloc_array);
+#endif
diff --git a/fs/squashfs/squashfs.h b/fs/squashfs/squashfs.h
index a6164fdf9435..c5db13907e9f 100644
--- a/fs/squashfs/squashfs.h
+++ b/fs/squashfs/squashfs.h
@@ -8,6 +8,9 @@
   * squashfs.h
   */
  
+#include "read_context.h"
+#include "page_actor.h"
+
  #define TRACE(s, args...)	pr_debug("SQUASHFS: "s, ## args)
  
  #define ERROR(s, args...)	pr_err("SQUASHFS error: "s, ## args)
@@ -16,19 +19,20 @@
  
  /* block.c */
  extern int squashfs_read_data(struct super_block *, u64, int, u64 *,
-				struct squashfs_page_actor *);
+				struct squashfs_context *);
  
  /* cache.c */
-extern struct squashfs_cache *squashfs_cache_init(char *, int, int);
+extern struct squashfs_cache *squashfs_cache_init(char *, int, int, int);
  extern void squashfs_cache_delete(struct squashfs_cache *);
  extern struct squashfs_cache_entry *squashfs_cache_get(struct super_block *,
-				struct squashfs_cache *, u64, int);
+		struct squashfs_cache *, struct squashfs_contexts *,
+		struct squashfs_context *, u64, int);
  extern void squashfs_cache_put(struct squashfs_cache_entry *);
  extern int squashfs_copy_data(void *, struct squashfs_cache_entry *, int, int);
  extern int squashfs_read_metadata(struct super_block *, void *, u64 *,
  				int *, int);
  extern struct squashfs_cache_entry *squashfs_get_fragment(struct super_block *,
-				u64, int);
+				u64, int, struct squashfs_context *);
  extern struct squashfs_cache_entry *squashfs_get_datablock(struct super_block *,
  				u64, int);
  extern void *squashfs_read_table(struct super_block *, u64, int);
diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
index 72f6f4b37863..6cee243be9bb 100644
--- a/fs/squashfs/squashfs_fs_sb.h
+++ b/fs/squashfs/squashfs_fs_sb.h
@@ -23,6 +23,7 @@ struct squashfs_cache {
  	int			pages;
  	spinlock_t		lock;
  	wait_queue_head_t	wait_queue;
+	struct squashfs_contexts *contexts;
  	struct squashfs_cache_entry *entry;
  };
  
@@ -37,7 +38,6 @@ struct squashfs_cache_entry {
  	wait_queue_head_t	wait_queue;
  	struct squashfs_cache	*cache;
  	void			**data;
-	struct squashfs_page_actor	*actor;
  };
  
  struct squashfs_sb_info {
@@ -47,6 +47,7 @@ struct squashfs_sb_info {
  	struct squashfs_cache			*block_cache;
  	struct squashfs_cache			*fragment_cache;
  	struct squashfs_cache			*read_page;
+	struct squashfs_contexts		*contexts;
  	int					next_meta_index;
  	__le64					*id_table;
  	__le64					*fragment_index;
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index e090fae48e68..3ba5c2b5c9e0 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -33,6 +33,7 @@
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
  #include "squashfs_fs_i.h"
+#include "read_context.h"
  #include "squashfs.h"
  #include "decompressor.h"
  #include "xattr.h"
@@ -317,18 +318,23 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
  	err = -ENOMEM;
  
  	msblk->block_cache = squashfs_cache_init("metadata",
-			SQUASHFS_CACHED_BLKS, SQUASHFS_METADATA_SIZE);
+			SQUASHFS_CACHED_BLKS, SQUASHFS_METADATA_SIZE, 1);
  	if (msblk->block_cache == NULL)
  		goto failed_mount;
  
  	/* Allocate read_page block */
  	msblk->read_page = squashfs_cache_init("data",
-		msblk->max_thread_num, msblk->block_size);
+		msblk->max_thread_num, msblk->block_size, 0);
  	if (msblk->read_page == NULL) {
  		errorf(fc, "Failed to allocate read_page block");
  		goto failed_mount;
  	}
  
+	msblk->contexts = squashfs_create_contexts(msblk->block_size,
+		msblk->max_thread_num, msblk->decompressor->alloc_buffer, 1);
+	if (msblk->contexts == NULL)
+		goto failed_mount;
+
  	msblk->stream = squashfs_decompressor_setup(sb, flags);
  	if (IS_ERR(msblk->stream)) {
  		err = PTR_ERR(msblk->stream);
@@ -392,7 +398,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
  		goto check_directory_table;
  
  	msblk->fragment_cache = squashfs_cache_init("fragment",
-		SQUASHFS_CACHED_FRAGMENTS, msblk->block_size);
+		SQUASHFS_CACHED_FRAGMENTS, msblk->block_size, 0);
  	if (msblk->fragment_cache == NULL) {
  		err = -ENOMEM;
  		goto failed_mount;
@@ -454,6 +460,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
  	squashfs_cache_delete(msblk->block_cache);
  	squashfs_cache_delete(msblk->fragment_cache);
  	squashfs_cache_delete(msblk->read_page);
+	squashfs_delete_contexts(msblk->contexts);
  	msblk->thread_ops->destroy(msblk);
  	kfree(msblk->inode_lookup_table);
  	kfree(msblk->fragment_index);
diff --git a/fs/squashfs/symlink.c b/fs/squashfs/symlink.c
index 2bf977a52c2c..8471a3a5491b 100644
--- a/fs/squashfs/symlink.c
+++ b/fs/squashfs/symlink.c
@@ -69,7 +69,8 @@ static int squashfs_symlink_read_folio(struct file *file, struct folio *folio)
  	 * blocks, we may need to call squashfs_cache_get multiple times.
  	 */
  	for (bytes = 0; bytes < length; offset = 0, bytes += copied) {
-		entry = squashfs_cache_get(sb, msblk->block_cache, block, 0);
+		entry = squashfs_cache_get(sb, msblk->block_cache, NULL, NULL,
+								block, 0);
  		if (entry->error) {
  			ERROR("Unable to read symlink [%llx:%x]\n",
  				squashfs_i(inode)->start,
-- 
2.35.1


> Thanks,
>
> Hui.
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-05-03 19:10               ` Phillip Lougher
@ 2023-05-03 19:38                 ` Hui Wang
  2023-05-07 21:07                 ` Phillip Lougher
  1 sibling, 0 replies; 25+ messages in thread
From: Hui Wang @ 2023-05-03 19:38 UTC (permalink / raw)
  To: Phillip Lougher, Michal Hocko
  Cc: Gao Xiang, linux-mm, akpm, surenb, colin.i.king, shy828301,
	hannes, vbabka, hch, mgorman


On 5/4/23 03:10, Phillip Lougher wrote:
>
> On 03/05/2023 12:49, Hui Wang wrote:
>>
>> On 4/29/23 03:53, Michal Hocko wrote:
>>> On Thu 27-04-23 11:47:10, Hui Wang wrote:
>>> [...]
>>>> So Michal,
>>>>
>>>> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters 
>>>> a state
>>>> something like hang when running stress-ng", do you know why 
>>>> out_of_memory()
>>>> will return immediately if there is no __GFP_FS, could we drop 
>>>> these lines
>>>> directly:
>>>>
>>>>      /*
>>>>       * The OOM killer does not compensate for IO-less reclaim.
>>>>       * pagefault_out_of_memory lost its gfp context so we have to
>>>>       * make sure exclude 0 mask - all other users should have at 
>>>> least
>>>>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>>>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>>>       */
>>>>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && 
>>>> !is_memcg_oom(oc))
>>>>          return true;
>>> The comment is rather hard to grasp without an intimate knowledge of 
>>> the
>>> memory reclaim. The primary reason is that the allocation context
>>> without __GFP_FS (and also __GFP_IO) cannot perform a full memory
>>> reclaim because fs or the storage subsystem might be holding locks
>>> required for the memory reclaim. This means that a large amount of
>>> reclaimable memory is out of sight of the specific direct reclaim
>>> context. If we allowed oom killer to trigger we could invoke the oom
>>> killer while there is a lot of otherwise reclaimable memory. As you can
>>> imagine not something many users would appreciate as the oom kill is a
>>> very disruptive operation. In this case we rely on kswapd or other
>>> GFP_KERNEL like allocation context to make forward instead. If there is
>>> really nothing reclaimable then the oom killer would eventually hit 
>>> from
>>> elsewhere.
>>>
>>> HTH
>> Hi Michal,
>>
>> Understand. Thanks for explanation. So we can't remove those 2 lines 
>> of code.
>>
>> Here in my patch, letting a kthread allocate a page with GFP_KERNEL, 
>> It could possibly trigger the reclaim and if nothing reclaimable, 
>> trigger the oom killer. Do you think it is a safe workaround for the 
>> issue we are facing currently?
>>
>>
>> And Hi Phillip,
>>
>> What is your opinion on it, do you have a direction to solve this 
>> issue from filesystem?
>>
>
> The following patch creates the concept of "squashfs contexts", which 
> moves all memory dynamically allocated (in a readahead/read_page path) 
> into a single structure which can be allocated and deleted once.  It 
> then creates a pool of these at filesystem mount time.  Threads 
> entering readahead/read_page will take a context from the pool, and 
> will then perform no dynamic memory allocation.
>
> The final patch-series will make this a non-default build option for 
> systems that need this.
>
> Phillip
>
>
Hi Phillip,

Got it. I will verify the patch next Monday or Tuesday. I am traveling 
now and will be back in the office next Monday.

Thanks,

Hui.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-05-03 19:10               ` Phillip Lougher
  2023-05-03 19:38                 ` Hui Wang
@ 2023-05-07 21:07                 ` Phillip Lougher
  2023-05-08 10:05                   ` Hui Wang
  1 sibling, 1 reply; 25+ messages in thread
From: Phillip Lougher @ 2023-05-07 21:07 UTC (permalink / raw)
  To: Hui Wang, Michal Hocko
  Cc: Gao Xiang, linux-mm, akpm, surenb, colin.i.king, shy828301,
	hannes, vbabka, hch, mgorman


On 03/05/2023 20:10, Phillip Lougher wrote:
>
> On 03/05/2023 12:49, Hui Wang wrote:
>>
>> On 4/29/23 03:53, Michal Hocko wrote:
>>> On Thu 27-04-23 11:47:10, Hui Wang wrote:
>>> [...]
>>>> So Michal,
>>>>
>>>> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters 
>>>> a state
>>>> something like hang when running stress-ng", do you know why 
>>>> out_of_memory()
>>>> will return immediately if there is no __GFP_FS, could we drop 
>>>> these lines
>>>> directly:
>>>>
>>>>      /*
>>>>       * The OOM killer does not compensate for IO-less reclaim.
>>>>       * pagefault_out_of_memory lost its gfp context so we have to
>>>>       * make sure exclude 0 mask - all other users should have at 
>>>> least
>>>>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>>>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>>>       */
>>>>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && 
>>>> !is_memcg_oom(oc))
>>>>          return true;
>>> The comment is rather hard to grasp without an intimate knowledge of 
>>> the
>>> memory reclaim. The primary reason is that the allocation context
>>> without __GFP_FS (and also __GFP_IO) cannot perform a full memory
>>> reclaim because fs or the storage subsystem might be holding locks
>>> required for the memory reclaim. This means that a large amount of
>>> reclaimable memory is out of sight of the specific direct reclaim
>>> context. If we allowed oom killer to trigger we could invoke the oom
>>> killer while there is a lot of otherwise reclaimable memory. As you can
>>> imagine not something many users would appreciate as the oom kill is a
>>> very disruptive operation. In this case we rely on kswapd or other
>>> GFP_KERNEL like allocation context to make forward instead. If there is
>>> really nothing reclaimable then the oom killer would eventually hit 
>>> from
>>> elsewhere.
>>>
>>> HTH
>> Hi Michal,
>>
>> Understand. Thanks for explanation. So we can't remove those 2 lines 
>> of code.
>>
>> Here in my patch, letting a kthread allocate a page with GFP_KERNEL, 
>> It could possibly trigger the reclaim and if nothing reclaimable, 
>> trigger the oom killer. Do you think it is a safe workaround for the 
>> issue we are facing currently?
>>
>>
>> And Hi Phillip,
>>
>> What is your opinion on it, do you have a direction to solve this 
>> issue from filesystem?
>>
>
> The following patch creates the concept of "squashfs contexts", which 
> moves all memory dynamically allocated (in a readahead/read_page path) 
> into a single structure which can be allocated and deleted once.  It 
> then creates a pool of these at filesystem mount time.  Threads 
> entering readahead/read_page will take a context from the pool, and 
> will then perform no dynamic memory allocation.
>
> The final patch-series will make this a non-default build option for 
> systems that need this.
>
> Phillip
>
>

An updated version of the patch.
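
In outline, the readahead/read_page paths no longer allocate memory per
call; instead they borrow a preallocated context from the pool created
at mount time. A simplified sketch of the pattern (not the exact code in
the diff below; error handling omitted):

  /* mount time: preallocate one context per decompressor thread */
  msblk->contexts = squashfs_create_contexts(msblk->block_size,
          msblk->max_thread_num, msblk->decompressor->alloc_buffer, 1);

  /* readahead/read_page: borrow a context, no further allocation */
  context = squashfs_get_context(msblk->contexts);

  res = squashfs_read_data(inode->i_sb, block, bsize, NULL, context);

  /* return the context to the pool, waking any waiting thread */
  squashfs_put_context(msblk->contexts, context);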

 From 24f22dbd7fed8d0a90d11a4ce80545d30a9acfdc Mon Sep 17 00:00:00 2001
From: Phillip Lougher <phillip@squashfs.org.uk>
Date: Sun, 7 May 2023 20:16:55 +0100
Subject: [PATCH] read_context v2

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
---
  fs/squashfs/Makefile                    |   1 +
  fs/squashfs/block.c                     |  67 ++++-----
  fs/squashfs/cache.c                     |  66 ++++++---
  fs/squashfs/decompressor.c              |  14 +-
  fs/squashfs/decompressor_multi.c        |   2 +
  fs/squashfs/decompressor_multi_percpu.c |   2 +
  fs/squashfs/decompressor_single.c       |   2 +
  fs/squashfs/file.c                      |  61 ++++----
  fs/squashfs/file_direct.c               |  59 ++++----
  fs/squashfs/page_actor.c                |  37 ++---
  fs/squashfs/page_actor.h                |   8 +-
  fs/squashfs/read_context.c              | 181 ++++++++++++++++++++++++
  fs/squashfs/read_context.h              |  31 ++++
  fs/squashfs/squashfs.h                  |  12 +-
  fs/squashfs/squashfs_fs_sb.h            |   3 +-
  fs/squashfs/super.c                     |  13 +-
  fs/squashfs/symlink.c                   |   3 +-
  17 files changed, 390 insertions(+), 172 deletions(-)
  create mode 100644 fs/squashfs/read_context.c
  create mode 100644 fs/squashfs/read_context.h

diff --git a/fs/squashfs/Makefile b/fs/squashfs/Makefile
index 477c89a519ee..93511720e94f 100644
--- a/fs/squashfs/Makefile
+++ b/fs/squashfs/Makefile
@@ -6,6 +6,7 @@
  obj-$(CONFIG_SQUASHFS) += squashfs.o
  squashfs-y += block.o cache.o dir.o export.o file.o fragment.o id.o inode.o
  squashfs-y += namei.o super.o symlink.o decompressor.o page_actor.o
+squashfs-y += read_context.o
  squashfs-$(CONFIG_SQUASHFS_FILE_CACHE) += file_cache.o
  squashfs-$(CONFIG_SQUASHFS_FILE_DIRECT) += file_direct.o
  squashfs-$(CONFIG_SQUASHFS_DECOMP_SINGLE) += decompressor_single.o
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index bed3bb8b27fa..110fcfd2a0eb 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -23,9 +23,10 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "squashfs.h"
  #include "decompressor.h"
-#include "page_actor.h"
  
  /*
   * Returns the amount of bytes copied to the page actor.
@@ -77,7 +78,7 @@ static int copy_bio_to_actor(struct bio *bio,
  }
  
  static int squashfs_bio_read(struct super_block *sb, u64 index, int length,
-			     struct bio **biop, int *block_offset)
+		     struct squashfs_context *context, int *block_offset)
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
  	const u64 read_start = round_down(index, msblk->devblksize);
@@ -87,44 +88,33 @@ static int squashfs_bio_read(struct super_block *sb, u64 index, int length,
  	int offset = read_start - round_down(index, PAGE_SIZE);
  	int total_len = (block_end - block) << msblk->devblksize_log2;
  	const int page_count = DIV_ROUND_UP(total_len + offset, PAGE_SIZE);
-	int error, i;
-	struct bio *bio;
+	struct bio *bio = context->bio;
+	int error = -EIO, i;
  
-	bio = bio_kmalloc(page_count, GFP_NOIO);
-	if (!bio)
-		return -ENOMEM;
  	bio_init(bio, sb->s_bdev, bio->bi_inline_vecs, page_count, REQ_OP_READ);
  	bio->bi_iter.bi_sector = block * (msblk->devblksize >> SECTOR_SHIFT);
  
  	for (i = 0; i < page_count; ++i) {
  		unsigned int len =
  			min_t(unsigned int, PAGE_SIZE - offset, total_len);
-		struct page *page = alloc_page(GFP_NOIO);
+		int res = bio_add_page(bio, context->bio_page[i], len, offset);
+
+		if (!res)
+			goto out_uninit_bio;
  
-		if (!page) {
-			error = -ENOMEM;
-			goto out_free_bio;
-		}
-		if (!bio_add_page(bio, page, len, offset)) {
-			error = -EIO;
-			goto out_free_bio;
-		}
  		offset = 0;
  		total_len -= len;
  	}
  
  	error = submit_bio_wait(bio);
  	if (error)
-		goto out_free_bio;
+		goto out_uninit_bio;
  
-	*biop = bio;
  	*block_offset = index & ((1 << msblk->devblksize_log2) - 1);
  	return 0;
  
-out_free_bio:
-	bio_free_pages(bio);
-	bio_uninit(bio);
-	kfree(bio);
+out_uninit_bio:
+	bio_uninit(context->bio);
  	return error;
  }
  
@@ -138,10 +128,10 @@ static int squashfs_bio_read(struct super_block *sb, u64 index, int length,
   * algorithms).
   */
  int squashfs_read_data(struct super_block *sb, u64 index, int length,
-		       u64 *next_index, struct squashfs_page_actor *output)
+		       u64 *next_index, struct squashfs_context *context)
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
-	struct bio *bio = NULL;
+	struct squashfs_page_actor *output = context->actor;
  	int compressed;
  	int res;
  	int offset;
@@ -166,13 +156,13 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
  			res = -EIO;
  			goto out;
  		}
-		res = squashfs_bio_read(sb, index, 2, &bio, &offset);
+		res = squashfs_bio_read(sb, index, 2, context, &offset);
  		if (res)
  			goto out;
  
-		if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+		if (WARN_ON_ONCE(!bio_next_segment(context->bio, &iter_all))) {
  			res = -EIO;
-			goto out_free_bio;
+			goto out_uninit_bio;
  		}
  		/* Extract the length of the metadata block */
  		data = bvec_virt(bvec);
@@ -180,16 +170,14 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
  		if (offset < bvec->bv_len - 1) {
  			length |= data[offset + 1] << 8;
  		} else {
-			if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+			if (WARN_ON_ONCE(!bio_next_segment(context->bio, &iter_all))) {
  				res = -EIO;
-				goto out_free_bio;
+				goto out_uninit_bio;
  			}
  			data = bvec_virt(bvec);
  			length |= data[0] << 8;
  		}
-		bio_free_pages(bio);
-		bio_uninit(bio);
-		kfree(bio);
+		bio_uninit(context->bio);
  
  		compressed = SQUASHFS_COMPRESSED(length);
  		length = SQUASHFS_COMPRESSED_SIZE(length);
@@ -207,24 +195,23 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
  	if (next_index)
  		*next_index = index + length;
  
-	res = squashfs_bio_read(sb, index, length, &bio, &offset);
+	res = squashfs_bio_read(sb, index, length, context, &offset);
  	if (res)
  		goto out;
  
  	if (compressed) {
  		if (!msblk->stream) {
  			res = -EIO;
-			goto out_free_bio;
+			goto out_uninit_bio;
  		}
-		res = msblk->thread_ops->decompress(msblk, bio, offset, length, output);
+		res = msblk->thread_ops->decompress(msblk, context->bio,
+						offset, length, output);
  	} else {
-		res = copy_bio_to_actor(bio, output, offset, length);
+		res = copy_bio_to_actor(context->bio, output, offset, length);
  	}
  
-out_free_bio:
-	bio_free_pages(bio);
-	bio_uninit(bio);
-	kfree(bio);
+out_uninit_bio:
+	bio_uninit(context->bio);
  out:
  	if (res < 0) {
  		ERROR("Failed to read block 0x%llx: %d\n", index, res);
diff --git a/fs/squashfs/cache.c b/fs/squashfs/cache.c
index 5062326d0efb..98af4f696edd 100644
--- a/fs/squashfs/cache.c
+++ b/fs/squashfs/cache.c
@@ -42,19 +42,24 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
-#include "squashfs.h"
+#include "read_context.h"
  #include "page_actor.h"
+#include "squashfs.h"
  
  /*
   * Look-up block in cache, and increment usage count.  If not in cache, read
   * and decompress it from disk.
   */
  struct squashfs_cache_entry *squashfs_cache_get(struct super_block *sb,
-	struct squashfs_cache *cache, u64 block, int length)
+	struct squashfs_cache *cache, struct squashfs_contexts *contexts,
+	struct squashfs_context *context, u64 block, int length)
  {
-	int i, n;
+	int i, n, need_context = context == NULL;
  	struct squashfs_cache_entry *entry;
  
+	if (cache->contexts)
+		contexts = cache->contexts;
+
  	spin_lock(&cache->lock);
  
  	while (1) {
@@ -107,8 +112,17 @@ struct squashfs_cache_entry *squashfs_cache_get(struct super_block *sb,
  			entry->error = 0;
  			spin_unlock(&cache->lock);
  
+			if (need_context)
+				context = squashfs_get_context(contexts);
+
+			squashfs_page_actor_init(context->actor, entry->data,
+				cache->pages, 0);
+
  			entry->length = squashfs_read_data(sb, block, length,
-				&entry->next_index, entry->actor);
+				&entry->next_index, context);
+
+			if (need_context)
+				squashfs_put_context(contexts, context);
  
  			spin_lock(&cache->lock);
  
@@ -207,9 +221,9 @@ void squashfs_cache_delete(struct squashfs_cache *cache)
  				kfree(cache->entry[i].data[j]);
  			kfree(cache->entry[i].data);
  		}
-		kfree(cache->entry[i].actor);
  	}
  
+	squashfs_delete_contexts(cache->contexts);
  	kfree(cache->entry);
  	kfree(cache);
  }
@@ -221,7 +235,7 @@ void squashfs_cache_delete(struct squashfs_cache *cache)
   * is allocated as a sequence of kmalloced PAGE_SIZE buffers.
   */
  struct squashfs_cache *squashfs_cache_init(char *name, int entries,
-	int block_size)
+	int block_size, int alloc_contexts)
  {
  	int i, j;
  	struct squashfs_cache *cache = kzalloc(sizeof(*cache), GFP_KERNEL);
@@ -237,6 +251,15 @@ struct squashfs_cache *squashfs_cache_init(char *name, int entries,
  		goto cleanup;
  	}
  
+	if (alloc_contexts) {
+		cache->contexts = squashfs_create_contexts(block_size, entries,
+									0, 0);
+		if (cache->contexts == NULL) {
+			ERROR("Failed to allocate %s cache\n", name);
+			goto cleanup;
+		}
+	}
+
  	cache->curr_blk = 0;
  	cache->next_blk = 0;
  	cache->unused = entries;
@@ -268,13 +291,6 @@ struct squashfs_cache *squashfs_cache_init(char *name, int entries,
  				goto cleanup;
  			}
  		}
-
-		entry->actor = squashfs_page_actor_init(entry->data,
-						cache->pages, 0);
-		if (entry->actor == NULL) {
-			ERROR("Failed to allocate %s cache entry\n", name);
-			goto cleanup;
-		}
  	}
  
  	return cache;
@@ -341,7 +357,8 @@ int squashfs_read_metadata(struct super_block *sb, void *buffer,
  		return -EIO;
  
  	while (length) {
-		entry = squashfs_cache_get(sb, msblk->block_cache, *block, 0);
+		entry = squashfs_cache_get(sb, msblk->block_cache, NULL, NULL,
+								*block, 0);
  		if (entry->error) {
  			res = entry->error;
  			goto error;
@@ -377,12 +394,12 @@ int squashfs_read_metadata(struct super_block *sb, void *buffer,
   * filesystem.  If necessary read and decompress it from disk.
   */
  struct squashfs_cache_entry *squashfs_get_fragment(struct super_block *sb,
-				u64 start_block, int length)
+		u64 start_block, int length, struct squashfs_context *context)
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
  
-	return squashfs_cache_get(sb, msblk->fragment_cache, start_block,
-		length);
+	return squashfs_cache_get(sb, msblk->fragment_cache, msblk->contexts,
+		context, start_block, length);
  }
  
  
@@ -396,7 +413,8 @@ struct squashfs_cache_entry *squashfs_get_datablock(struct super_block *sb,
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
  
-	return squashfs_cache_get(sb, msblk->read_page, start_block, length);
+	return squashfs_cache_get(sb, msblk->read_page, msblk->contexts, NULL,
+							start_block, length);
  }
  
  
@@ -408,7 +426,7 @@ void *squashfs_read_table(struct super_block *sb, u64 block, int length)
  	int pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
  	int i, res;
  	void *table, *buffer, **data;
-	struct squashfs_page_actor *actor;
+	struct squashfs_context *context;
  
  	table = buffer = kmalloc(length, GFP_KERNEL);
  	if (table == NULL)
@@ -420,20 +438,22 @@ void *squashfs_read_table(struct super_block *sb, u64 block, int length)
  		goto failed;
  	}
  
-	actor = squashfs_page_actor_init(data, pages, length);
-	if (actor == NULL) {
+	context = squashfs_create_context(length, 0, 0);
+	if (context == NULL) {
  		res = -ENOMEM;
  		goto failed2;
  	}
  
+	squashfs_page_actor_init(context->actor, data, pages, length);
+
  	for (i = 0; i < pages; i++, buffer += PAGE_SIZE)
  		data[i] = buffer;
  
  	res = squashfs_read_data(sb, block, length |
-		SQUASHFS_COMPRESSED_BIT_BLOCK, NULL, actor);
+		SQUASHFS_COMPRESSED_BIT_BLOCK, NULL, context);
  
+	squashfs_delete_context(context);
  	kfree(data);
-	kfree(actor);
  
  	if (res < 0)
  		goto failed;
diff --git a/fs/squashfs/decompressor.c b/fs/squashfs/decompressor.c
index 8893cb9b4198..d12255edb129 100644
--- a/fs/squashfs/decompressor.c
+++ b/fs/squashfs/decompressor.c
@@ -15,9 +15,10 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "decompressor.h"
  #include "squashfs.h"
-#include "page_actor.h"
  
  /*
   * This file (and decompressor.h) implements a decompressor framework for
@@ -89,7 +90,7 @@ static void *get_comp_opts(struct super_block *sb, unsigned short flags)
  {
  	struct squashfs_sb_info *msblk = sb->s_fs_info;
  	void *buffer = NULL, *comp_opts;
-	struct squashfs_page_actor *actor = NULL;
+	struct squashfs_context *context;
  	int length = 0;
  
  	/*
@@ -102,14 +103,16 @@ static void *get_comp_opts(struct super_block *sb, unsigned short flags)
  			goto out;
  		}
  
-		actor = squashfs_page_actor_init(&buffer, 1, 0);
-		if (actor == NULL) {
+		context = squashfs_create_context(PAGE_SIZE, 0, 0);
+		if (context == NULL) {
  			comp_opts = ERR_PTR(-ENOMEM);
  			goto out;
  		}
  
+		squashfs_page_actor_init(context->actor, &buffer, 1, 0);
  		length = squashfs_read_data(sb,
-			sizeof(struct squashfs_super_block), 0, NULL, actor);
+			sizeof(struct squashfs_super_block), 0, NULL, context);
+		squashfs_delete_context(context);
  
  		if (length < 0) {
  			comp_opts = ERR_PTR(length);
@@ -120,7 +123,6 @@ static void *get_comp_opts(struct super_block *sb, unsigned short flags)
  	comp_opts = squashfs_comp_opts(msblk, buffer, length);
  
  out:
-	kfree(actor);
  	kfree(buffer);
  	return comp_opts;
  }
diff --git a/fs/squashfs/decompressor_multi.c b/fs/squashfs/decompressor_multi.c
index 416c53eedbd1..0a31012abec7 100644
--- a/fs/squashfs/decompressor_multi.c
+++ b/fs/squashfs/decompressor_multi.c
@@ -13,6 +13,8 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "decompressor.h"
  #include "squashfs.h"
  
diff --git a/fs/squashfs/decompressor_multi_percpu.c b/fs/squashfs/decompressor_multi_percpu.c
index 1dfadf76ed9a..12094aaba19f 100644
--- a/fs/squashfs/decompressor_multi_percpu.c
+++ b/fs/squashfs/decompressor_multi_percpu.c
@@ -12,6 +12,8 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "decompressor.h"
  #include "squashfs.h"
  
diff --git a/fs/squashfs/decompressor_single.c b/fs/squashfs/decompressor_single.c
index 6f161887710b..8d9b73103165 100644
--- a/fs/squashfs/decompressor_single.c
+++ b/fs/squashfs/decompressor_single.c
@@ -11,6 +11,8 @@
  
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
+#include "read_context.h"
+#include "page_actor.h"
  #include "decompressor.h"
  #include "squashfs.h"
  
diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
index 8ba8c4c50770..d8ca63a3eb0e 100644
--- a/fs/squashfs/file.c
+++ b/fs/squashfs/file.c
@@ -38,8 +38,9 @@
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
  #include "squashfs_fs_i.h"
-#include "squashfs.h"
+#include "read_context.h"
  #include "page_actor.h"
+#include "squashfs.h"
  
  /*
   * Locate cache slot in range [offset, index] for specified inode.  If
@@ -424,7 +425,7 @@ static int squashfs_readpage_fragment(struct page *page, int expected)
  	struct inode *inode = page->mapping->host;
  	struct squashfs_cache_entry *buffer = squashfs_get_fragment(inode->i_sb,
  		squashfs_i(inode)->fragment_block,
-		squashfs_i(inode)->fragment_size);
+		squashfs_i(inode)->fragment_size, NULL);
  	int res = buffer->error;
  
  	if (res)
@@ -497,13 +498,13 @@ static int squashfs_read_folio(struct file *file, struct folio *folio)
  	return res;
  }
  
-static int squashfs_readahead_fragment(struct page **page,
+static int squashfs_readahead_fragment(struct squashfs_context *context,
  	unsigned int pages, unsigned int expected)
  {
-	struct inode *inode = page[0]->mapping->host;
+	struct inode *inode = context->page[0]->mapping->host;
  	struct squashfs_cache_entry *buffer = squashfs_get_fragment(inode->i_sb,
  		squashfs_i(inode)->fragment_block,
-		squashfs_i(inode)->fragment_size);
+		squashfs_i(inode)->fragment_size, context);
  	struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
  	unsigned int n, mask = (1 << (msblk->block_log - PAGE_SHIFT)) - 1;
  	int error = buffer->error;
@@ -514,18 +515,18 @@ static int squashfs_readahead_fragment(struct page **page,
  	expected += squashfs_i(inode)->fragment_offset;
  
  	for (n = 0; n < pages; n++) {
-		unsigned int base = (page[n]->index & mask) << PAGE_SHIFT;
+		unsigned int base = (context->page[n]->index & mask) << PAGE_SHIFT;
  		unsigned int offset = base + squashfs_i(inode)->fragment_offset;
  
  		if (expected > offset) {
  			unsigned int avail = min_t(unsigned int, expected -
  				offset, PAGE_SIZE);
  
-			squashfs_fill_page(page[n], buffer, offset, avail);
+			squashfs_fill_page(context->page[n], buffer, offset, avail);
  		}
  
-		unlock_page(page[n]);
-		put_page(page[n]);
+		unlock_page(context->page[n]);
+		put_page(context->page[n]);
  	}
  
  out:
@@ -541,16 +542,15 @@ static void squashfs_readahead(struct readahead_control *ractl)
  	unsigned short shift = msblk->block_log - PAGE_SHIFT;
  	loff_t start = readahead_pos(ractl) & ~mask;
  	size_t len = readahead_length(ractl) + readahead_pos(ractl) - start;
-	struct squashfs_page_actor *actor;
+	struct squashfs_context *context;
  	unsigned int nr_pages = 0;
-	struct page **pages;
  	int i, file_end = i_size_read(inode) >> msblk->block_log;
  	unsigned int max_pages = 1UL << shift;
  
  	readahead_expand(ractl, start, (len | mask) + 1);
  
-	pages = kmalloc_array(max_pages, sizeof(void *), GFP_KERNEL);
-	if (!pages)
+	context = squashfs_get_context(msblk->contexts);
+	if (context == NULL)
  		return;
  
  	for (;;) {
@@ -566,21 +566,21 @@ static void squashfs_readahead(struct readahead_control *ractl)
  
  		max_pages = (expected + PAGE_SIZE - 1) >> PAGE_SHIFT;
  
-		nr_pages = __readahead_batch(ractl, pages, max_pages);
+		nr_pages = __readahead_batch(ractl, context->page, max_pages);
  		if (!nr_pages)
  			break;
  
  		if (readahead_pos(ractl) >= i_size_read(inode))
  			goto skip_pages;
  
-		index = pages[0]->index >> shift;
+		index = context->page[0]->index >> shift;
  
-		if ((pages[nr_pages - 1]->index >> shift) != index)
+		if ((context->page[nr_pages - 1]->index >> shift) != index)
  			goto skip_pages;
  
  		if (index == file_end && squashfs_i(inode)->fragment_block !=
  						SQUASHFS_INVALID_BLK) {
-			res = squashfs_readahead_fragment(pages, nr_pages,
+			res = squashfs_readahead_fragment(context, nr_pages,
  							  expected);
  			if (res)
  				goto skip_pages;
@@ -591,14 +591,13 @@ static void squashfs_readahead(struct readahead_control *ractl)
  		if (bsize == 0)
  			goto skip_pages;
  
-		actor = squashfs_page_actor_init_special(msblk, pages, nr_pages,
-							 expected);
-		if (!actor)
-			goto skip_pages;
+		squashfs_page_actor_init_special(context, msblk, nr_pages,
+							expected);
  
-		res = squashfs_read_data(inode->i_sb, block, bsize, NULL, actor);
+		res = squashfs_read_data(inode->i_sb, block, bsize, NULL,
+							context);
  
-		last_page = squashfs_page_actor_free(actor);
+		last_page = squashfs_page_actor_free(context->actor);
  
  		if (res == expected) {
  			int bytes;
@@ -610,26 +609,26 @@ static void squashfs_readahead(struct readahead_control *ractl)
  					     PAGE_SIZE - bytes);
  
  			for (i = 0; i < nr_pages; i++) {
-				flush_dcache_page(pages[i]);
-				SetPageUptodate(pages[i]);
+				flush_dcache_page(context->page[i]);
+				SetPageUptodate(context->page[i]);
  			}
  		}
  
  		for (i = 0; i < nr_pages; i++) {
-			unlock_page(pages[i]);
-			put_page(pages[i]);
+			unlock_page(context->page[i]);
+			put_page(context->page[i]);
  		}
  	}
  
-	kfree(pages);
+	squashfs_put_context(msblk->contexts, context);
  	return;
  
  skip_pages:
  	for (i = 0; i < nr_pages; i++) {
-		unlock_page(pages[i]);
-		put_page(pages[i]);
+		unlock_page(context->page[i]);
+		put_page(context->page[i]);
  	}
-	kfree(pages);
+	squashfs_put_context(msblk->contexts, context);
  }
  
  const struct address_space_operations squashfs_aops = {
diff --git a/fs/squashfs/file_direct.c b/fs/squashfs/file_direct.c
index f1ccad519e28..ab7b990cf71d 100644
--- a/fs/squashfs/file_direct.c
+++ b/fs/squashfs/file_direct.c
@@ -15,8 +15,9 @@
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
  #include "squashfs_fs_i.h"
-#include "squashfs.h"
+#include "read_context.h"
  #include "page_actor.h"
+#include "squashfs.h"
  
  /* Read separately compressed datablock directly into page cache */
  int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
@@ -31,8 +32,7 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  	int start_index = target_page->index & ~mask;
  	int end_index = start_index | mask;
  	int i, n, pages, bytes, res = -ENOMEM;
-	struct page **page;
-	struct squashfs_page_actor *actor;
+	struct squashfs_context *context;
  	void *pageaddr;
  
  	if (end_index > file_end)
@@ -40,21 +40,21 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  
  	pages = end_index - start_index + 1;
  
-	page = kmalloc_array(pages, sizeof(void *), GFP_KERNEL);
-	if (page == NULL)
+	context = squashfs_get_context(msblk->contexts);
+	if (context == NULL)
  		return res;
  
  	/* Try to grab all the pages covered by the Squashfs block */
  	for (i = 0, n = start_index; n <= end_index; n++) {
-		page[i] = (n == target_page->index) ? target_page :
+		context->page[i] = (n == target_page->index) ? target_page :
  			grab_cache_page_nowait(target_page->mapping, n);
  
-		if (page[i] == NULL)
+		if (context->page[i] == NULL)
  			continue;
  
-		if (PageUptodate(page[i])) {
-			unlock_page(page[i]);
-			put_page(page[i]);
+		if (PageUptodate(context->page[i])) {
+			unlock_page(context->page[i]);
+			put_page(context->page[i]);
  			continue;
  		}
  
@@ -64,17 +64,15 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  	pages = i;
  
  	/*
-	 * Create a "page actor" which will kmap and kunmap the
+	 * Initialise "page actor" which will kmap and kunmap the
  	 * page cache pages appropriately within the decompressor
  	 */
-	actor = squashfs_page_actor_init_special(msblk, page, pages, expected);
-	if (actor == NULL)
-		goto out;
+	squashfs_page_actor_init_special(context, msblk, pages, expected);
  
  	/* Decompress directly into the page cache buffers */
-	res = squashfs_read_data(inode->i_sb, block, bsize, NULL, actor);
+	res = squashfs_read_data(inode->i_sb, block, bsize, NULL, context);
  
-	squashfs_page_actor_free(actor);
+	squashfs_page_actor_free(context->actor);
  
  	if (res < 0)
  		goto mark_errored;
@@ -86,22 +84,22 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  
  	/* Last page (if present) may have trailing bytes not filled */
  	bytes = res % PAGE_SIZE;
-	if (page[pages - 1]->index == end_index && bytes) {
-		pageaddr = kmap_local_page(page[pages - 1]);
+	if (context->page[pages - 1]->index == end_index && bytes) {
+		pageaddr = kmap_local_page(context->page[pages - 1]);
  		memset(pageaddr + bytes, 0, PAGE_SIZE - bytes);
  		kunmap_local(pageaddr);
  	}
  
  	/* Mark pages as uptodate, unlock and release */
  	for (i = 0; i < pages; i++) {
-		flush_dcache_page(page[i]);
-		SetPageUptodate(page[i]);
-		unlock_page(page[i]);
-		if (page[i] != target_page)
-			put_page(page[i]);
+		flush_dcache_page(context->page[i]);
+		SetPageUptodate(context->page[i]);
+		unlock_page(context->page[i]);
+		if (context->page[i] != target_page)
+			put_page(context->page[i]);
  	}
  
-	kfree(page);
+	squashfs_put_context(msblk->contexts, context);
  
  	return 0;
  
@@ -110,15 +108,14 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize,
  	 * dealt with by the caller
  	 */
  	for (i = 0; i < pages; i++) {
-		if (page[i] == NULL || page[i] == target_page)
+		if (context->page[i] == NULL || context->page[i] == target_page)
  			continue;
-		flush_dcache_page(page[i]);
-		SetPageError(page[i]);
-		unlock_page(page[i]);
-		put_page(page[i]);
+		flush_dcache_page(context->page[i]);
+		SetPageError(context->page[i]);
+		unlock_page(context->page[i]);
+		put_page(context->page[i]);
  	}
  
-out:
-	kfree(page);
+	squashfs_put_context(msblk->contexts, context);
  	return res;
  }
diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
index 81af6c4ca115..8ff50034afee 100644
--- a/fs/squashfs/page_actor.c
+++ b/fs/squashfs/page_actor.c
@@ -8,8 +8,9 @@
  #include <linux/slab.h>
  #include <linux/pagemap.h>
  #include "squashfs_fs_sb.h"
-#include "decompressor.h"
+#include "read_context.h"
  #include "page_actor.h"
+#include "decompressor.h"
  
  /*
   * This file contains implementations of page_actor for decompressing into
@@ -40,14 +41,9 @@ static void cache_finish_page(struct squashfs_page_actor *actor)
  	/* empty */
  }
  
-struct squashfs_page_actor *squashfs_page_actor_init(void **buffer,
+void squashfs_page_actor_init(struct squashfs_page_actor *actor, void **buffer,
  	int pages, int length)
  {
-	struct squashfs_page_actor *actor = kmalloc(sizeof(*actor), GFP_KERNEL);
-
-	if (actor == NULL)
-		return NULL;
-
  	actor->length = length ? : pages * PAGE_SIZE;
  	actor->buffer = buffer;
  	actor->pages = pages;
@@ -56,7 +52,6 @@ struct squashfs_page_actor *squashfs_page_actor_init(void **buffer,
  	actor->squashfs_first_page = cache_first_page;
  	actor->squashfs_next_page = cache_next_page;
  	actor->squashfs_finish_page = cache_finish_page;
-	return actor;
  }
  
  /* Implementation of page_actor for decompressing directly into page cache. */
@@ -102,35 +97,23 @@ static void direct_finish_page(struct squashfs_page_actor *actor)
  		kunmap_local(actor->pageaddr);
  }
  
-struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_info *msblk,
-	struct page **page, int pages, int length)
+void squashfs_page_actor_init_special(struct squashfs_context *context,
+	struct squashfs_sb_info *msblk, int pages, int length)
  {
-	struct squashfs_page_actor *actor = kmalloc(sizeof(*actor), GFP_KERNEL);
-
-	if (actor == NULL)
-		return NULL;
-
-	if (msblk->decompressor->alloc_buffer) {
-		actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
-
-		if (actor->tmp_buffer == NULL) {
-			kfree(actor);
-			return NULL;
-		}
-	} else
-		actor->tmp_buffer = NULL;
+	struct squashfs_page_actor *actor = context->actor;
  
+	actor->tmp_buffer = context->hole;
  	actor->length = length ? : pages * PAGE_SIZE;
-	actor->page = page;
+	actor->page = context->page;
  	actor->pages = pages;
  	actor->next_page = 0;
  	actor->returned_pages = 0;
-	actor->next_index = page[0]->index & ~((1 << (msblk->block_log - PAGE_SHIFT)) - 1);
+	actor->next_index = context->page[0]->index & ~((1 <<
+					(msblk->block_log - PAGE_SHIFT)) - 1);
  	actor->pageaddr = NULL;
  	actor->last_page = NULL;
  	actor->alloc_buffer = msblk->decompressor->alloc_buffer;
  	actor->squashfs_first_page = direct_first_page;
  	actor->squashfs_next_page = direct_next_page;
  	actor->squashfs_finish_page = direct_finish_page;
-	return actor;
  }
diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
index 97d4983559b1..36fbc93d8907 100644
--- a/fs/squashfs/page_actor.h
+++ b/fs/squashfs/page_actor.h
@@ -25,17 +25,15 @@ struct squashfs_page_actor {
  	pgoff_t	next_index;
  };
  
-extern struct squashfs_page_actor *squashfs_page_actor_init(void **buffer,
+extern void squashfs_page_actor_init(struct squashfs_page_actor *actor, void **buffer,
  				int pages, int length);
-extern struct squashfs_page_actor *squashfs_page_actor_init_special(
+extern void squashfs_page_actor_init_special(struct squashfs_context *context,
  				struct squashfs_sb_info *msblk,
-				struct page **page, int pages, int length);
+				int pages, int length);
  static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *actor)
  {
  	struct page *last_page = actor->last_page;
  
-	kfree(actor->tmp_buffer);
-	kfree(actor);
  	return last_page;
  }
  static inline void *squashfs_first_page(struct squashfs_page_actor *actor)
diff --git a/fs/squashfs/read_context.c b/fs/squashfs/read_context.c
new file mode 100644
index 000000000000..35a52ed572d7
--- /dev/null
+++ b/fs/squashfs/read_context.c
@@ -0,0 +1,181 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2023
+ * Phillip Lougher <phillip@squashfs.org.uk>
+ *
+ * read_context.c
+ */
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/pagemap.h>
+#include <linux/bio.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs.h"
+#include "page_actor.h"
+
+
+void squashfs_delete_context(struct squashfs_context *context)
+{
+	int i;
+
+	for (i = 0; i < context->bio_pages; i++)
+		if (context->bio_page[i])
+			__free_page(context->bio_page[i]);
+
+	kfree(context->bio);
+	kfree(context->page);
+	kfree(context->actor);
+	kfree(context->hole);
+	kfree(context);
+}
+
+
+struct squashfs_context *squashfs_create_context(int length, int alloc_hole,
+							int alloc_array)
+{
+	int pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	int bio_pages = pages + 1, i;
+	struct squashfs_context *context;
+
+	context = kzalloc(sizeof(struct squashfs_context), GFP_KERNEL);
+	if (context == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	context->bio_page = kcalloc(bio_pages, sizeof(struct bio_pages *),
+							GFP_KERNEL);
+
+	if (context->bio_page == NULL)
+		goto cleanup;
+
+	context->bio_pages = bio_pages;
+
+	for (i = 0; i < bio_pages; i++) {
+		context->bio_page[i] = alloc_page(GFP_KERNEL);
+
+		if (context->bio_page[i] == NULL)
+			goto cleanup;
+	}
+
+	context->bio = bio_kmalloc(bio_pages, GFP_KERNEL);
+	if (context->bio == NULL)
+		goto cleanup;
+
+	context->actor = kmalloc(sizeof(struct squashfs_page_actor),
+							GFP_KERNEL);
+	if (context->actor == NULL)
+		goto cleanup;
+
+	if (alloc_array) {
+		context->page = kmalloc_array(pages, sizeof(struct page *),
+								GFP_KERNEL);
+		if (context->page == NULL)
+			goto cleanup;
+	}
+
+	if (alloc_hole) {
+		context->hole = kmalloc(PAGE_SIZE, GFP_KERNEL);
+		if (context->hole == NULL)
+			goto cleanup;
+	}
+
+	return context;
+
+cleanup:
+	squashfs_delete_context(context);
+	return ERR_PTR(-ENOMEM);
+}
+
+
+struct squashfs_context *squashfs_get_context(struct squashfs_contexts *contexts)
+{
+	struct squashfs_context *context;
+
+	while (1) {
+		mutex_lock(&contexts->mutex);
+
+		if (!list_empty(&contexts->list)) {
+			context = list_entry(contexts->list.prev,
+					struct squashfs_context, list);
+			list_del(&context->list);
+			mutex_unlock(&contexts->mutex);
+			break;
+		}
+
+		mutex_unlock(&contexts->mutex);
+		wait_event(contexts->wait_queue, !list_empty(&contexts->list));
+	}
+
+	return context;
+}
+
+
+void squashfs_put_context(struct squashfs_contexts *contexts,
+					struct squashfs_context *context)
+{
+
+	mutex_lock(&contexts->mutex);
+	list_add(&context->list, &contexts->list);
+	mutex_unlock(&contexts->mutex);
+	wake_up(&contexts->wait_queue);
+}
+
+
+void squashfs_delete_contexts(struct squashfs_contexts *contexts)
+{
+	struct squashfs_context *context;
+
+	if (contexts == NULL)
+		return;
+
+	while (!list_empty(&contexts->list)) {
+		context = list_entry(contexts->list.prev,
+				struct squashfs_context, list);
+		list_del(&context->list);
+		squashfs_delete_context(context);
+	}
+
+	kfree(contexts);
+}
+
+
+struct squashfs_contexts *squashfs_create_contexts(int block_size, int entries,
+						int alloc_hole, int alloc_array)
+{
+	int i;
+	struct squashfs_contexts *contexts;
+
+	contexts = kzalloc(sizeof(struct squashfs_contexts), GFP_KERNEL);
+	if (contexts == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&contexts->mutex);
+	INIT_LIST_HEAD(&contexts->list);
+	init_waitqueue_head(&contexts->wait_queue);
+
+	for (i = 0; i < entries; i++) {
+		struct squashfs_context *context;
+
+		context = squashfs_create_context(block_size, alloc_hole,
+								alloc_array);
+		if (context == NULL)
+			goto cleanup;
+
+		list_add(&context->list, &contexts->list);
+	}
+
+	return contexts;
+
+cleanup:
+	squashfs_delete_contexts(contexts);
+	return ERR_PTR(-ENOMEM);
+}
diff --git a/fs/squashfs/read_context.h b/fs/squashfs/read_context.h
new file mode 100644
index 000000000000..a9b93a43893b
--- /dev/null
+++ b/fs/squashfs/read_context.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef READ_CONTEXT_H
+#define READ_CONTEXT_H
+/*
+ * Copyright (c) 2023
+ * Phillip Lougher <phillip@squashfs.org.uk>
+ */
+
+struct squashfs_context {
+	int				bio_pages;
+	struct bio			*bio;
+	struct page			**bio_page;
+	struct squashfs_page_actor	*actor;
+	struct page			**page;
+	void				*hole;
+	struct list_head		list;
+};
+
+struct squashfs_contexts {
+	struct mutex		mutex;
+	struct list_head	list;
+	wait_queue_head_t	wait_queue;
+};
+
+extern void squashfs_delete_context(struct squashfs_context *context);
+extern struct squashfs_context *squashfs_create_context(int length, int alloc_hole, int alloc_array);
+extern struct squashfs_context *squashfs_get_context(struct squashfs_contexts *contexts);
+extern void squashfs_put_context(struct squashfs_contexts *contexts, struct squashfs_context *context);
+extern void squashfs_delete_contexts(struct squashfs_contexts *contexts);
+extern struct squashfs_contexts *squashfs_create_contexts(int block_size, int entries, int alloc_hole, int alloc_array);
+#endif
diff --git a/fs/squashfs/squashfs.h b/fs/squashfs/squashfs.h
index a6164fdf9435..c5db13907e9f 100644
--- a/fs/squashfs/squashfs.h
+++ b/fs/squashfs/squashfs.h
@@ -8,6 +8,9 @@
   * squashfs.h
   */
  
+#include "read_context.h"
+#include "page_actor.h"
+
  #define TRACE(s, args...)	pr_debug("SQUASHFS: "s, ## args)
  
  #define ERROR(s, args...)	pr_err("SQUASHFS error: "s, ## args)
@@ -16,19 +19,20 @@
  
  /* block.c */
  extern int squashfs_read_data(struct super_block *, u64, int, u64 *,
-				struct squashfs_page_actor *);
+				struct squashfs_context *);
  
  /* cache.c */
-extern struct squashfs_cache *squashfs_cache_init(char *, int, int);
+extern struct squashfs_cache *squashfs_cache_init(char *, int, int, int);
  extern void squashfs_cache_delete(struct squashfs_cache *);
  extern struct squashfs_cache_entry *squashfs_cache_get(struct super_block *,
-				struct squashfs_cache *, u64, int);
+		struct squashfs_cache *, struct squashfs_contexts *,
+		struct squashfs_context *, u64, int);
  extern void squashfs_cache_put(struct squashfs_cache_entry *);
  extern int squashfs_copy_data(void *, struct squashfs_cache_entry *, int, int);
  extern int squashfs_read_metadata(struct super_block *, void *, u64 *,
  				int *, int);
  extern struct squashfs_cache_entry *squashfs_get_fragment(struct super_block *,
-				u64, int);
+				u64, int, struct squashfs_context *);
  extern struct squashfs_cache_entry *squashfs_get_datablock(struct super_block *,
  				u64, int);
  extern void *squashfs_read_table(struct super_block *, u64, int);
diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
index 72f6f4b37863..6cee243be9bb 100644
--- a/fs/squashfs/squashfs_fs_sb.h
+++ b/fs/squashfs/squashfs_fs_sb.h
@@ -23,6 +23,7 @@ struct squashfs_cache {
  	int			pages;
  	spinlock_t		lock;
  	wait_queue_head_t	wait_queue;
+	struct squashfs_contexts *contexts;
  	struct squashfs_cache_entry *entry;
  };
  
@@ -37,7 +38,6 @@ struct squashfs_cache_entry {
  	wait_queue_head_t	wait_queue;
  	struct squashfs_cache	*cache;
  	void			**data;
-	struct squashfs_page_actor	*actor;
  };
  
  struct squashfs_sb_info {
@@ -47,6 +47,7 @@ struct squashfs_sb_info {
  	struct squashfs_cache			*block_cache;
  	struct squashfs_cache			*fragment_cache;
  	struct squashfs_cache			*read_page;
+	struct squashfs_contexts		*contexts;
  	int					next_meta_index;
  	__le64					*id_table;
  	__le64					*fragment_index;
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index e090fae48e68..3ba5c2b5c9e0 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -33,6 +33,7 @@
  #include "squashfs_fs.h"
  #include "squashfs_fs_sb.h"
  #include "squashfs_fs_i.h"
+#include "read_context.h"
  #include "squashfs.h"
  #include "decompressor.h"
  #include "xattr.h"
@@ -317,18 +318,23 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
  	err = -ENOMEM;
  
  	msblk->block_cache = squashfs_cache_init("metadata",
-			SQUASHFS_CACHED_BLKS, SQUASHFS_METADATA_SIZE);
+			SQUASHFS_CACHED_BLKS, SQUASHFS_METADATA_SIZE, 1);
  	if (msblk->block_cache == NULL)
  		goto failed_mount;
  
  	/* Allocate read_page block */
  	msblk->read_page = squashfs_cache_init("data",
-		msblk->max_thread_num, msblk->block_size);
+		msblk->max_thread_num, msblk->block_size, 0);
  	if (msblk->read_page == NULL) {
  		errorf(fc, "Failed to allocate read_page block");
  		goto failed_mount;
  	}
  
+	msblk->contexts = squashfs_create_contexts(msblk->block_size,
+		msblk->max_thread_num, msblk->decompressor->alloc_buffer, 1);
+	if (msblk->contexts == NULL)
+		goto failed_mount;
+
  	msblk->stream = squashfs_decompressor_setup(sb, flags);
  	if (IS_ERR(msblk->stream)) {
  		err = PTR_ERR(msblk->stream);
@@ -392,7 +398,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
  		goto check_directory_table;
  
  	msblk->fragment_cache = squashfs_cache_init("fragment",
-		SQUASHFS_CACHED_FRAGMENTS, msblk->block_size);
+		SQUASHFS_CACHED_FRAGMENTS, msblk->block_size, 0);
  	if (msblk->fragment_cache == NULL) {
  		err = -ENOMEM;
  		goto failed_mount;
@@ -454,6 +460,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
  	squashfs_cache_delete(msblk->block_cache);
  	squashfs_cache_delete(msblk->fragment_cache);
  	squashfs_cache_delete(msblk->read_page);
+	squashfs_delete_contexts(msblk->contexts);
  	msblk->thread_ops->destroy(msblk);
  	kfree(msblk->inode_lookup_table);
  	kfree(msblk->fragment_index);
diff --git a/fs/squashfs/symlink.c b/fs/squashfs/symlink.c
index 2bf977a52c2c..8471a3a5491b 100644
--- a/fs/squashfs/symlink.c
+++ b/fs/squashfs/symlink.c
@@ -69,7 +69,8 @@ static int squashfs_symlink_read_folio(struct file *file, struct folio *folio)
  	 * blocks, we may need to call squashfs_cache_get multiple times.
  	 */
  	for (bytes = 0; bytes < length; offset = 0, bytes += copied) {
-		entry = squashfs_cache_get(sb, msblk->block_cache, block, 0);
+		entry = squashfs_cache_get(sb, msblk->block_cache, NULL, NULL,
+								block, 0);
  		if (entry->error) {
  			ERROR("Unable to read symlink [%llx:%x]\n",
  				squashfs_i(inode)->start,
-- 
2.35.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
  2023-05-07 21:07                 ` Phillip Lougher
@ 2023-05-08 10:05                   ` Hui Wang
  0 siblings, 0 replies; 25+ messages in thread
From: Hui Wang @ 2023-05-08 10:05 UTC (permalink / raw)
  To: Phillip Lougher, Michal Hocko
  Cc: Gao Xiang, linux-mm, akpm, surenb, colin.i.king, shy828301,
	hannes, vbabka, hch, mgorman


On 5/8/23 05:07, Phillip Lougher wrote:
>
> On 03/05/2023 20:10, Phillip Lougher wrote:
>>
>> On 03/05/2023 12:49, Hui Wang wrote:
>>>
>>> On 4/29/23 03:53, Michal Hocko wrote:
>>>> On Thu 27-04-23 11:47:10, Hui Wang wrote:
>>>> [...]
>>>>> So Michal,
>>>>>
>>>>> Don't know if you read the "[PATCH 0/1] mm/oom_kill: system enters 
>>>>> a state
>>>>> something like hang when running stress-ng", do you know why 
>>>>> out_of_memory()
>>>>> will return immediately if there is no __GFP_FS, could we drop 
>>>>> these lines
>>>>> directly:
>>>>>
>>>>>      /*
>>>>>       * The OOM killer does not compensate for IO-less reclaim.
>>>>>       * pagefault_out_of_memory lost its gfp context so we have to
>>>>>       * make sure exclude 0 mask - all other users should have at 
>>>>> least
>>>>>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() 
>>>>> has to
>>>>>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>>>>       */
>>>>>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && 
>>>>> !is_memcg_oom(oc))
>>>>>          return true;
>>>> The comment is rather hard to grasp without an intimate knowledge 
>>>> of the
>>>> memory reclaim. The primary reason is that the allocation context
>>>> without __GFP_FS (and also __GFP_IO) cannot perform a full memory
>>>> reclaim because fs or the storage subsystem might be holding locks
>>>> required for the memory reclaim. This means that a large amount of
>>>> reclaimable memory is out of sight of the specific direct reclaim
>>>> context. If we allowed oom killer to trigger we could invoke the oom
>>>> killer while there is a lot of otherwise reclaimable memory. As you 
>>>> can
>>>> imagine not something many users would appreciate as the oom kill is a
>>>> very disruptive operation. In this case we rely on kswapd or other
>>>> GFP_KERNEL like allocation context to make forward instead. If 
>>>> there is
>>>> really nothing reclaimable then the oom killer would eventually hit 
>>>> from
>>>> elsewhere.
>>>>
>>>> HTH
>>> Hi Michal,
>>>
>>> Understand. Thanks for explanation. So we can't remove those 2 lines 
>>> of code.
>>>
>>> Here in my patch, letting a kthread allocate a page with GFP_KERNEL, 
>>> It could possibly trigger the reclaim and if nothing reclaimable, 
>>> trigger the oom killer. Do you think it is a safe workaround for the 
>>> issue we are facing currently?
>>>
>>>
>>> And Hi Phillip,
>>>
>>> What is your opinion on it, do you have a direction to solve this 
>>> issue from filesystem?
>>>
>>
>> The following patch creates the concept of "squashfs contexts", which 
>> moves all memory dynamically allocated (in a readahead/read_page 
>> path) into a single structure which can be allocated and deleted 
>> once.  It then creates a pool of these at filesystem mount time.  
>> Threads entering readahead/read_page will take a context from the 
>> pool, and will then perform no dynamic memory allocation.
>>
>> The final patch-series will make this a non-default build option for 
>> systems that need this.
>>
>> Phillip
>>
>>
>
> An updated version of the patch.

Hi Phillip,

The patch fixes the issue.

I built and tested the linux-next 20230505 kernel on my arm64 board.
Without your patch, the system would hang when I ran "$ stress-ng 
--bigheap 2 --sequential 0 --timeout 30s --skip-silent --verbose" or 
"stress-ng --sequential 0 --class os --timeout 30 --skip-silent 
--verbose". After applying your patch, I repeated the bigheap testcase 
60 times and the hang did not happen; I also ran "--class os", 
"--class vm" and "--class scheduler", and all of them passed.

Thanks,

Hui.

>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-05-08 10:05 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-26  5:10 [PATCH 0/1] mm/oom_kill: system enters a state something like hang when running stress-ng Hui Wang
2023-04-26  5:10 ` [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS Hui Wang
2023-04-26  8:33   ` Michal Hocko
2023-04-26 11:07     ` Hui Wang
2023-04-26 16:44       ` Phillip Lougher
2023-04-26 17:38         ` Phillip Lougher
2023-04-26 18:26           ` Yang Shi
2023-04-26 19:06             ` Phillip Lougher
2023-04-26 19:34               ` Phillip Lougher
2023-04-27  0:42                 ` Hui Wang
2023-04-27  1:37                   ` Phillip Lougher
2023-04-27  5:22                     ` Hui Wang
2023-04-27  1:18       ` Gao Xiang
2023-04-27  3:47         ` Hui Wang
2023-04-27  4:17           ` Gao Xiang
2023-04-27  7:03           ` Colin King (gmail)
2023-04-27  7:49             ` Hui Wang
2023-04-28 19:53           ` Michal Hocko
2023-05-03 11:49             ` Hui Wang
2023-05-03 12:20               ` Michal Hocko
2023-05-03 18:41                 ` Phillip Lougher
2023-05-03 19:10               ` Phillip Lougher
2023-05-03 19:38                 ` Hui Wang
2023-05-07 21:07                 ` Phillip Lougher
2023-05-08 10:05                   ` Hui Wang
