From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1f181fe6-60f4-6b71-b8ca-4f6365de0b4c@canonical.com>
Date: Thu, 27 Apr 2023 08:42:45 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0)
 Gecko/20100101 Thunderbird/102.10.0
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
Content-Language: en-US
To: Phillip Lougher, Yang Shi
Cc: Michal Hocko, linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com, colin.i.king@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz, hch@infradead.org, mgorman@suse.de
References: <20230426051030.112007-1-hui.wang@canonical.com> <20230426051030.112007-2-hui.wang@canonical.com> <9f827ae2-eaec-8660-35fa-71e218d5a2c5@squashfs.org.uk> <72cf1f14-c02b-033e-6fa9-8558e628ffb6@squashfs.org.uk> <553e6668-bebd-7411-9f69-d62e9658da1d@squashfs.org.uk>
From: Hui Wang
In-Reply-To: <553e6668-bebd-7411-9f69-d62e9658da1d@squashfs.org.uk>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 4/27/23 03:34, Phillip Lougher wrote:
>
> On 26/04/2023 20:06, Phillip Lougher wrote:
>>
>> On 26/04/2023 19:26, Yang Shi wrote:
>>> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher wrote:
>>>>
>>>> On 26/04/2023 17:44, Phillip Lougher wrote:
>>>>> On 26/04/2023 12:07, Hui Wang wrote:
>>>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>>>> [CC squashfs maintainer]
>>>>>>>
>>>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>>>> If we run stress-ng on a squashfs filesystem, the system ends
>>>>>>>> up in a hang-like state: stress-ng cannot finish running and
>>>>>>>> the console stops reacting to user input.
>>>>>>>>
>>>>>>>> This issue happens on all arm/arm64 platforms we are working
>>>>>>>> on. Through debugging, we found it is introduced by the oom
>>>>>>>> handling in the kernel.
>>>>>>>>
>>>>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>>>>> alloc_page(). In this case, if there is no memory left,
>>>>>>>> out_of_memory() is called without __GFP_FS, so the oom killer
>>>>>>>> is not triggered and this process loops endlessly, waiting for
>>>>>>>> others to trigger the oom killer and release some memory.
>>>>>>>> But for a system whose whole root filesystem is constructed
>>>>>>>> from squashfs, nearly all userspace processes call
>>>>>>>> out_of_memory() without __GFP_FS, so we see the system enter a
>>>>>>>> hang-like state when running stress-ng.
>>>>>>>>
>>>>>>>> To fix it, we could trigger a kthread to call page_alloc()
>>>>>>>> with __GFP_FS before returning from out_of_memory() when
>>>>>>>> __GFP_FS is missing.
>>>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>>>> Does it even make sense to trigger the OOM killer for something
>>>>>>> like readahead? Would it be more mindful to fail the allocation
>>>>>>> instead? That being said, should allocations from
>>>>>>> squashfs_readahead use __GFP_RETRY_MAYFAIL instead?
>>>>>> Thanks for your comment. This issue can hardly be reproduced on
>>>>>> an ext4 filesystem, because ext4->readahead() doesn't call
>>>>>> alloc_page(). If ext4->readahead() is changed as below, the issue
>>>>>> becomes easy to reproduce on ext4 (repeatedly run:
>>>>>> $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 --timeout 30s --skip-silent --verbose)
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>>>>   static void ext4_readahead(struct readahead_control *rac)
>>>>>>   {
>>>>>>          struct inode *inode = rac->mapping->host;
>>>>>> +       struct page *tmp_page;
>>>>>>
>>>>>>          /* If the file has inline data, no need to do readahead. */
>>>>>>          if (ext4_has_inline_data(inode))
>>>>>>                  return;
>>>>>>
>>>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>>>> +
>>>>>>          ext4_mpage_readpages(inode, rac, NULL);
>>>>>> +
>>>>>> +       if (tmp_page)
>>>>>> +               __free_page(tmp_page);
>>>>>>   }
>>>>>>
>>>>>> BTW, I applied my patch to linux-next and ran the oom stress-ng
>>>>>> testcases overnight; there was no hang, oops or crash, so it
>>>>>> looks like there is no big problem with using a kthread to
>>>>>> trigger the oom killer in this case.
>>>>>>
>>>>>> And hi squashfs maintainer, I checked the filesystem code, and it
>>>>>> looks like most filesystems do not call alloc_page() in
>>>>>> readahead(). Could you please help take a look at this issue?
>>>>>> Thanks.
>>>>>
>>>>> This will be because most filesystems don't need to do so.
>>>>> Squashfs is a compressed filesystem with large blocks covering
>>>>> much more than one page, and it decompresses these blocks in
>>>>> squashfs_readahead(). If __readahead_batch() does not return the
>>>>> full set of pages covering the Squashfs block, it allocates a
>>>>> temporary page for the decompressors to decompress into to "fill
>>>>> in the hole".
>>>>>
>>>>> What can be done here as far as Squashfs is concerned ... I could
>>>>> move the page allocation out of the readahead path (e.g. do it at
>>>>> mount time).
>>>> You could try this patch, which does that. Compile tested only.
>>> The kmalloc_array() may call alloc_page() and trigger this problem
>>> too, IIUC. Should it be pre-allocated as well?
>>
>> That is a much smaller allocation, so it entirely depends on whether
>> it is an issue or not. There are also a number of other small memory
>> allocations in the path.
>>
>> The whole point of this patch is to move the *biggest* allocation,
>> which is the reported issue, and then see what happens.
>> No point in making this test patch more involved and complex than
>> necessary at this point.
>>
>> Phillip
>>
>
> Also be aware this stress-ng triggered issue is new, and apparently
> didn't occur last year. So it is reasonable to assume the issue has
> been introduced as a side effect of the readahead improvements. One
> of these is this allocation of a temporary page to decompress into,
> rather than falling back to entirely decompressing into a
> pre-allocated buffer (allocated at mount time). The small memory
> allocations have been there for many years.
>
> Allocating the page at mount time effectively puts the memory
> allocation situation back to how it was last year, before the
> readahead work.
>
> Phillip
>

Thanks Phillip and Yang.

And Phillip, I tested your change; it didn't help. According to my
debugging, the OOM happens when allocating memory for the bio, at the
line "struct page *page = alloc_page(GFP_NOIO);" in
squashfs_bio_read(). Other filesystems just use the pre-allocated
memory in the "struct readahead_control" to do the bio, but squashfs
allocates a new page to do the bio (maybe because squashfs is a
compressed filesystem).

BTW, this is not a new issue for squashfs: we have uc20 (linux-5.4
kernel) and uc22 (linux-5.15 kernel), and both have this issue. The
issue already existed in squashfs_readpage() in the 5.4 kernel.

I guess that if the bio could use pre-allocated memory, it would help.

Thanks,

Hui.
>
>>>>   fs/squashfs/page_actor.c     | 10 +---------
>>>>   fs/squashfs/page_actor.h     |  1 -
>>>>   fs/squashfs/squashfs_fs_sb.h |  1 +
>>>>   fs/squashfs/super.c          | 10 ++++++++++
>>>>   4 files changed, 12 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
>>>> index 81af6c4ca115..6cce239eca66 100644
>>>> --- a/fs/squashfs/page_actor.c
>>>> +++ b/fs/squashfs/page_actor.c
>>>> @@ -110,15 +110,7 @@ struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_
>>>>         if (actor == NULL)
>>>>                 return NULL;
>>>>
>>>> -       if (msblk->decompressor->alloc_buffer) {
>>>> -               actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>>> -
>>>> -               if (actor->tmp_buffer == NULL) {
>>>> -                       kfree(actor);
>>>> -                       return NULL;
>>>> -               }
>>>> -       } else
>>>> -               actor->tmp_buffer = NULL;
>>>> +       actor->tmp_buffer = msblk->actor_page;
>>>>
>>>>         actor->length = length ? : pages * PAGE_SIZE;
>>>>         actor->page = page;
>>>> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
>>>> index 97d4983559b1..df5e999afa42 100644
>>>> --- a/fs/squashfs/page_actor.h
>>>> +++ b/fs/squashfs/page_actor.h
>>>> @@ -34,7 +34,6 @@ static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *
>>>>   {
>>>>         struct page *last_page = actor->last_page;
>>>>
>>>> -       kfree(actor->tmp_buffer);
>>>>         kfree(actor);
>>>>         return last_page;
>>>>   }
>>>> diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
>>>> index 72f6f4b37863..8feddc9e6cce 100644
>>>> --- a/fs/squashfs/squashfs_fs_sb.h
>>>> +++ b/fs/squashfs/squashfs_fs_sb.h
>>>> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>>>>         struct squashfs_cache *block_cache;
>>>>         struct squashfs_cache *fragment_cache;
>>>>         struct squashfs_cache                   *read_page;
>>>> +       void                                    *actor_page;
>>>>         int next_meta_index;
>>>>         __le64                                  *id_table;
>>>>         __le64 *fragment_index;
>>>> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
>>>> index e090fae48e68..674dc187d961 100644
>>>> --- a/fs/squashfs/super.c
>>>> +++ b/fs/squashfs/super.c
>>>> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>>>>                 goto failed_mount;
>>>>         }
>>>>
>>>> +
>>>> +       /* Allocate page for squashfs_readahead()/squashfs_read_folio() */
>>>> +       if (msblk->decompressor->alloc_buffer) {
>>>> +               msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
>>>> +
>>>> +               if (msblk->actor_page == NULL)
>>>> +                       goto failed_mount;
>>>> +       }
>>>> +
>>>>         msblk->stream = squashfs_decompressor_setup(sb, flags);
>>>>         if (IS_ERR(msblk->stream)) {
>>>>                 err = PTR_ERR(msblk->stream);
>>>> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>>>>         squashfs_cache_delete(msblk->block_cache);
>>>>         squashfs_cache_delete(msblk->fragment_cache);
>>>>         squashfs_cache_delete(msblk->read_page);
>>>> +       kfree(msblk->actor_page);
>>>>         msblk->thread_ops->destroy(msblk);
>>>>         kfree(msblk->inode_lookup_table);
>>>>         kfree(msblk->fragment_index);
>>>> --
>>>> 2.35.1
>>>>
>>>>> Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean
>>>>> Squashfs returning I/O failures due to no memory. That will cause
>>>>> a lot of applications to crash in a low memory situation. So a
>>>>> crash rather than a hang.
>>>>>
>>>>> Phillip