From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9B848C77B7C for ; Wed, 26 Apr 2023 19:35:08 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 98CAF6B0122; Wed, 26 Apr 2023 15:35:07 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 93BE16B0124; Wed, 26 Apr 2023 15:35:07 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 803096B0125; Wed, 26 Apr 2023 15:35:07 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 726656B0122 for ; Wed, 26 Apr 2023 15:35:07 -0400 (EDT) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 31FE31A03CE for ; Wed, 26 Apr 2023 19:35:07 +0000 (UTC) X-FDA: 80724545454.02.C470EAA Received: from p3plwbeout18-04.prod.phx3.secureserver.net (p3plsmtp18-04-2.prod.phx3.secureserver.net [173.201.193.188]) by imf17.hostedemail.com (Postfix) with ESMTP id D233E40018 for ; Wed, 26 Apr 2023 19:35:03 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=none; spf=none (imf17.hostedemail.com: domain of phillip@squashfs.org.uk has no SPF policy when checking 173.201.193.188) smtp.mailfrom=phillip@squashfs.org.uk; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1682537704; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=1yI5zhWxKrTEghmxuesN0VwnR2R5KxJ/k4KFbuxxuRo=; b=VUNogfmC3cVTX0PHeMUGUDLrQfZ+uwOZ2X6RIUDZDbv5lx6zFvy/1dZPLM/VT4OeLvgDwQ 16EL/eaBb/yWLvp8A1Xx4x6GTjFJ0bmjjByTxG/QqQPQicBmc3mL0yZnQrl5/ZVIrFLvdw DT9zM5smtfdLT+03Mv0XRS4HNpVCaSw= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=none; spf=none (imf17.hostedemail.com: domain of phillip@squashfs.org.uk has no SPF policy when checking 173.201.193.188) smtp.mailfrom=phillip@squashfs.org.uk; dmarc=none ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1682537704; a=rsa-sha256; cv=none; b=BIzZGdwtGQdaoRuct87ZLXzCzuMVKs2FollhptL3+rbnlUXcKSfsaWEPdPuZpOytWKGmCF osXEzAqx96iKRiNr6xVsswEGfwryr6AJRWY4GL7loKvutu3Tt/nIqQQc3M7U18TrCKP0xI V2RsigqJZbiiJiYbr914vTlQiWp2F30= Received: from mailex.mailcore.me ([94.136.40.142]) by :WBEOUT: with ESMTP id rkurpHjmYPjFBrkuspmg4v; Wed, 26 Apr 2023 12:35:02 -0700 X-SECURESERVER-ACCT: phillip@squashfs.org.uk X-SID: rkurpHjmYPjFB Received: from 82-69-79-175.dsl.in-addr.zen.co.uk ([82.69.79.175] helo=[192.168.178.87]) by smtp01.mailcore.me with esmtpa (Exim 4.94.2) (envelope-from ) id 1prkuo-000CEw-02; Wed, 26 Apr 2023 20:34:58 +0100 Message-ID: <553e6668-bebd-7411-9f69-d62e9658da1d@squashfs.org.uk> Date: Wed, 26 Apr 2023 20:34:59 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0 Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS Content-Language: en-GB From: Phillip Lougher To: Yang Shi Cc: Hui Wang , Michal Hocko , linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com, colin.i.king@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz, hch@infradead.org, mgorman@suse.de, dan.carpenter@oracle.com References: <20230426051030.112007-1-hui.wang@canonical.com> <20230426051030.112007-2-hui.wang@canonical.com> <9f827ae2-eaec-8660-35fa-71e218d5a2c5@squashfs.org.uk> <72cf1f14-c02b-033e-6fa9-8558e628ffb6@squashfs.org.uk> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Mailcore-Auth: 439999529 X-Mailcore-Domain: 1394945 X-123-reg-Authenticated: phillip@squashfs.org.uk X-Originating-IP: 82.69.79.175 X-CMAE-Envelope: MS4xfGZwP5DoKVg/Rop8CSF/QsdAf178kbjNXDV5f9mP50TyKPHJ9qeUhACX2UHtQ+ed+Z6No9jnzdProCLRT/RTBCQPuV26EmljNqo5WEt5906GWAsFWc0f B5zXAqBDK0egxVgOZ3lnStE7DtHTPgv0a7222jDd8U0ku12vbM0GW0jEgAbGHnZClGI60iSZAzAZUAf3Z8Ekh4yyeGEEKq4SJU4= X-Stat-Signature: jmacp1c9btdxu3af1b9qzxs3bmm5chjw X-Rspam-User: X-Rspamd-Queue-Id: D233E40018 X-Rspamd-Server: rspam06 X-CMAE-Analysis: v=2.4 cv=Gdiv3ybL c=1 sm=1 tr=0 ts=64497ce9 a=PWWIS3U6KuXUdfh4/ywz0w==:117 a=84ok6UeoqCVsigPHarzEiQ==:17 a=ggZhUymU-5wA:10 a=IkcTkHD0fZMA:10 a=dKHAf1wccvYA:10 a=FXvPX3liAAAA:8 a=AsW3oUPRGk-T-r-rQwwA:9 a=QEXdDO2ut3YA:10 a=UObqyxdv-6Yh2QiB9mM_:22 X-HE-Tag: 1682537703-249406 X-HE-Meta: U2FsdGVkX1/g9LE97ZjNrt0hYr0j44MDe9SF85tDQIonoVT/RSUtcj+Udf+Y2lZM/vWzDsZketdyx+Jk24xxviy6q+RUoBzNPLS5rqvioWA7hrl/61eNT0vizoFpCyfI1Y2/XhZvPW00H02SVy++Mxr+zWGjNrBhoSmp4hh6oFgmcsiD/pD4ECwal7Jt8OcBcevS3UhQWhGcRJYJFqD8YJvQMCQrA3VBSce/T+i+u9EM5RUFocA9IXKd7cz87CcrFktkJR5Gui9LEJ1acPY17+kiHPW/KkmwRY2lYVPlGOQzSlhaNGqY5vsqiLJ8w3EzPqp834ngI8sK8APF3cGW+mxBb+gCKBH4Uvp3o5lG/RIUbr+dwS1pIQlO+a3ROU8Qfzp+THcqXwHEV6HBG/FthlFTqg75f+rfWJac5UlZ6pph3leMsNl2jIL9jrZzPSK3//++4DbkiRyRfPt+O7dVT5LyI8HNt5DEWSdbG5MpedeyjoJsEfSXXoDX+DjtSX/xgVDQPZRNSLu5ZhWrF6l2yb7smNO/JCGCQGasEoeFmyTrBIsFu0zvTBys6J1jA0b+BLmjYPgwC3DklIWNAiP6kcjBrOE7imrbnixTrTbYdq4PKG9ghA4SSkF+nH/w0yQ/DG6KnBAFA0Y5OnWfrhitrL3IxG/4BzbU0tXYzaz18em9bEbYQAfd+D8pDS1s8Vea6KaMT515Ph5ky8GCmoWHI30vF4ej27xdoIdp4e1SzGpAfOj+O2QgVlPUI7kk6QM8Ca4YWODXqTzwQoCm9nPxP44A4TaRIGZ9TkgxI0y+yvamFCTxiUUJ5IZYo7qby8SNXrGc4V2GcDLb9+GONxt3a5xEVjVesvM1nQEMi90oSubqRiqWEufYatzOvMBt2x/Ie2cdnumzzsEcjajWrqdniCvMlo8S4hqCKjFA15/qRDBtWkaEd2FdRcCuQ9n0gD7u9yldgZBOjhm+5o5iUBv ZLAKmGwK 4ugMWAu4NMd1CkLFgP0xKpnmNMJAYY17KMGJE0Z2wgTllynaCw/y1LJP6oC9KRDCMyjepHaPrTl9KGFdVsvWRmW7KJQZJ6uFHdA0YU2xZYW6LpUpL1Z7tR6F7v6C8cUutXRe+iTmJZLd2eaIxzoO1y0BWC7gU3Lteh8sngnE5wXyCWBGMbiXDtK6xdygPaqE6mwECWB36p1+pmjPt2MGmCbcwGy1QW0My3sHSsh03eJ5+E1QR4QL3HvKlJBlwn2pYRper34wFnVgCcDxwXwySuhxCc0JZ7RhLVimKo9Oi1y+CfA4PaS5hEN4EWio45A4kNAI26q9vsuAHsRcGrfB/Bq7JBXOUPLZbRjXKVZHaS94vXjxWHjwsdhPpfQbu1QDEIkNUgWdxkBUDOAkexJzfpVvvhDDluya7BpCbk44y/3T42SbxW1EQTNaAFu8cOsmuyCegn5WhmIQSnP/5kIieOAuNEb3HsUvNsXjH+vhpvclMPBOG/+xQCvvwJzkFKv8+19zCewI63OrLzRBPXM1NUx+DvyY93pEgQrDlKOwKtClVM0fPqOtTi8sfgTRZLrcYiPD7QQV0HwVlN9p/WdjV7I/q4kDzZSjFvXSBjVy3HqdSG9xKNbpZdNoIQ6rzdrupzkuBEbIZH/hA4nK3HEAWhefCsztXME4BhAboz4FBb4Nc9drVi+SDvhak7iixH9WOVPvGce4Jbze6PkGdNaezPuiioJ2Qg6pqKH5RbsSXMlhmnWs= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 26/04/2023 20:06, Phillip Lougher wrote: > > On 26/04/2023 19:26, Yang Shi wrote: >> On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher >> wrote: >>> >>> On 26/04/2023 17:44, Phillip Lougher wrote: >>>> On 26/04/2023 12:07, Hui Wang wrote: >>>>> On 4/26/23 16:33, Michal Hocko wrote: >>>>>> [CC squashfs maintainer] >>>>>> >>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote: >>>>>>> If we run the stress-ng in the filesystem of squashfs, the system >>>>>>> will be in a state something like hang, the stress-ng couldn't >>>>>>> finish running and the console couldn't react to users' input. >>>>>>> >>>>>>> This issue happens on all arm/arm64 platforms we are working on, >>>>>>> through debugging, we found this issue is introduced by oom >>>>>>> handling >>>>>>> in the kernel. >>>>>>> >>>>>>> The fs->readahead() is called between memalloc_nofs_save() and >>>>>>> memalloc_nofs_restore(), and the squashfs_readahead() calls >>>>>>> alloc_page(), in this case, if there is no memory left, the >>>>>>> out_of_memory() will be called without __GFP_FS, then the oom >>>>>>> killer >>>>>>> will not be triggered and this process will loop endlessly and wait >>>>>>> for others to trigger oom killer to release some memory. But for a >>>>>>> system with the whole root filesystem constructed by squashfs, >>>>>>> nearly all userspace processes will call out_of_memory() without >>>>>>> __GFP_FS, so we will see that the system enters a state >>>>>>> something like >>>>>>> hang when running stress-ng. >>>>>>> >>>>>>> To fix it, we could trigger a kthread to call page_alloc() with >>>>>>> __GFP_FS before returning from out_of_memory() due to without >>>>>>> __GFP_FS. >>>>>> I do not think this is an appropriate way to deal with this issue. >>>>>> Does it even make sense to trigger OOM killer for something like >>>>>> readahead? Would it be more mindful to fail the allocation instead? >>>>>> That being said should allocations from squashfs_readahead use >>>>>> __GFP_RETRY_MAYFAIL instead? >>>>> Thanks for your comment, and this issue could hardly be reproduced on >>>>> ext4 filesystem, that is because the ext4->readahead() doesn't call >>>>> alloc_page(). If changing the ext4->readahead() as below, it will be >>>>> easy to reproduce this issue with the ext4 filesystem (repeatedly >>>>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 >>>>> --timeout 30s --skip-silent --verbose) >>>>> >>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644 >>>>> --- a/fs/ext4/inode.c >>>>> +++ b/fs/ext4/inode.c >>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, >>>>> struct folio *folio) >>>>>   static void ext4_readahead(struct readahead_control *rac) >>>>>   { >>>>>          struct inode *inode = rac->mapping->host; >>>>> +       struct page *tmp_page; >>>>> >>>>>          /* If the file has inline data, no need to do readahead. */ >>>>>          if (ext4_has_inline_data(inode)) >>>>>                  return; >>>>> >>>>> +       tmp_page = alloc_page(GFP_KERNEL); >>>>> + >>>>>          ext4_mpage_readpages(inode, rac, NULL); >>>>> + >>>>> +       if (tmp_page) >>>>> +               __free_page(tmp_page); >>>>>   } >>>>> >>>>> >>>>> BTW, I applied my patch to the linux-next and ran the oom stress-ng >>>>> testcases overnight, there is no hang, oops or crash, looks like >>>>> there is no big problem to use a kthread to trigger the oom killer in >>>>> this case. >>>>> >>>>> And Hi squashfs maintainer, I checked the code of filesystem, looks >>>>> like most filesystems will not call alloc_page() in the readahead(), >>>>> could you please help take a look at this issue, thanks. >>>> >>>> This will be because most filesystems don't need to do so. Squashfs is >>>> a compressed filesystem with large blocks covering much more than one >>>> page, and it decompresses these blocks in squashfs_readahead().   If >>>> __readahead_batch() does not return the full set of pages covering the >>>> Squashfs block, it allocates a temporary page for the decompressors to >>>> decompress into to "fill in the hole". >>>> >>>> What can be done here as far as Squashfs is concerned .... I could >>>> move the page allocation out of the readahead path (e.g. do it at >>>> mount time). >>> You could try this patch which does that.  Compile tested only. >> The kmalloc_array() may call alloc_page() to trigger this problem too >> IIUC. It should be pre-allocated as well? > > > That is a much smaller allocation, and so it entirely depends whether > it is an issue or not.  There are also a number of other small memory > allocations in the path as well. > > The whole point of this patch is to move the *biggest* allocation > which is the reported issue and then see what happens.   No point in > making this test patch more involved and complex than necessary at > this point. > > Phillip > Also be aware this stress-ng triggered issue is new, and apparently didn't occur last year.   So it is reasonable to assume the issue has been introduced as a side effect of the readahead improvements.  One of these is this allocation of a temporary page to decompress into rather than falling back to entirely decompressing into a pre-allocated buffer (allocated at mount time).  The small memory allocations have been there for many years. Allocating the page at mount time effectively puts the memory allocation situation back to how it was last year before the readahead work. Phillip >>>    fs/squashfs/page_actor.c     | 10 +--------- >>>    fs/squashfs/page_actor.h     |  1 - >>>    fs/squashfs/squashfs_fs_sb.h |  1 + >>>    fs/squashfs/super.c          | 10 ++++++++++ >>>    4 files changed, 12 insertions(+), 10 deletions(-) >>> >>> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c >>> index 81af6c4ca115..6cce239eca66 100644 >>> --- a/fs/squashfs/page_actor.c >>> +++ b/fs/squashfs/page_actor.c >>> @@ -110,15 +110,7 @@ struct squashfs_page_actor >>> *squashfs_page_actor_init_special(struct squashfs_sb_ >>>          if (actor == NULL) >>>                  return NULL; >>> >>> -       if (msblk->decompressor->alloc_buffer) { >>> -               actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL); >>> - >>> -               if (actor->tmp_buffer == NULL) { >>> -                       kfree(actor); >>> -                       return NULL; >>> -               } >>> -       } else >>> -               actor->tmp_buffer = NULL; >>> +       actor->tmp_buffer = msblk->actor_page; >>> >>>          actor->length = length ? : pages * PAGE_SIZE; >>>          actor->page = page; >>> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h >>> index 97d4983559b1..df5e999afa42 100644 >>> --- a/fs/squashfs/page_actor.h >>> +++ b/fs/squashfs/page_actor.h >>> @@ -34,7 +34,6 @@ static inline struct page >>> *squashfs_page_actor_free(struct squashfs_page_actor * >>>    { >>>          struct page *last_page = actor->last_page; >>> >>> -       kfree(actor->tmp_buffer); >>>          kfree(actor); >>>          return last_page; >>>    } >>> diff --git a/fs/squashfs/squashfs_fs_sb.h >>> b/fs/squashfs/squashfs_fs_sb.h >>> index 72f6f4b37863..8feddc9e6cce 100644 >>> --- a/fs/squashfs/squashfs_fs_sb.h >>> +++ b/fs/squashfs/squashfs_fs_sb.h >>> @@ -47,6 +47,7 @@ struct squashfs_sb_info { >>>          struct squashfs_cache                   *block_cache; >>>          struct squashfs_cache *fragment_cache; >>>          struct squashfs_cache                   *read_page; >>> +       void                                    *actor_page; >>>          int next_meta_index; >>>          __le64                                  *id_table; >>>          __le64 *fragment_index; >>> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c >>> index e090fae48e68..674dc187d961 100644 >>> --- a/fs/squashfs/super.c >>> +++ b/fs/squashfs/super.c >>> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct >>> super_block *sb, struct fs_context *fc) >>>                  goto failed_mount; >>>          } >>> >>> + >>> +       /* Allocate page for >>> squashfs_readahead()/squashfs_read_folio() */ >>> +       if (msblk->decompressor->alloc_buffer) { >>> +               msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL); >>> + >>> +               if(msblk->actor_page == NULL) >>> +                       goto failed_mount; >>> +       } >>> + >>>          msblk->stream = squashfs_decompressor_setup(sb, flags); >>>          if (IS_ERR(msblk->stream)) { >>>                  err = PTR_ERR(msblk->stream); >>> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct >>> super_block *sb, struct fs_context *fc) >>>          squashfs_cache_delete(msblk->block_cache); >>>          squashfs_cache_delete(msblk->fragment_cache); >>>          squashfs_cache_delete(msblk->read_page); >>> +       kfree(msblk->actor_page); >>>          msblk->thread_ops->destroy(msblk); >>>          kfree(msblk->inode_lookup_table); >>>          kfree(msblk->fragment_index); >>> -- >>> 2.35.1 >>> >>>> Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs >>>> returning I/O failures due to no memory.  That will cause a lot of >>>> applications to crash in a low memory situation.  So a crash rather >>>> than a hang. >>>> >>>> Phillip >>>> >>>> >>>> >>>>