Message-ID: <4aa48b6a-362d-de1b-f0ff-9bb8dafbdcc7@canonical.com>
Date: Thu, 27 Apr 2023 11:47:10 +0800
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
To: Gao Xiang, Michal Hocko
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com, colin.i.king@gmail.com, shy828301@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz, hch@infradead.org, mgorman@suse.de, Phillip Lougher
References: <20230426051030.112007-1-hui.wang@canonical.com> <20230426051030.112007-2-hui.wang@canonical.com> <68b085fe-3347-507c-d739-0dc9b27ebe05@linux.alibaba.com>
From: Hui Wang <hui.wang@canonical.com>
In-Reply-To: <68b085fe-3347-507c-d739-0dc9b27ebe05@linux.alibaba.com>

On 4/27/23 09:18, Gao Xiang wrote:
>
> On 2023/4/26 19:07, Hui Wang wrote:
>>
>> On 4/26/23 16:33, Michal Hocko wrote:
>>> [CC squashfs maintainer]
>>>
>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>> If we run stress-ng on a squashfs filesystem, the system enters a
>>>> state that looks like a hang: stress-ng cannot finish running and
>>>> the console stops reacting to user input.
>>>>
>>>> This issue happens on all arm/arm64 platforms we are working on.
>>>> Through debugging, we found it is introduced by the OOM handling
>>>> in the kernel.
>>>>
>>>> The fs->readahead() is called between memalloc_nofs_save() and
>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>> alloc_page(). In this case, if there is no memory left,
>>>> out_of_memory() will be called without __GFP_FS, so the OOM killer
>>>> will not be triggered and the process will loop endlessly, waiting
>>>> for others to trigger the OOM killer and release some memory. But
>>>> on a system whose whole root filesystem is squashfs, nearly all
>>>> userspace processes call out_of_memory() without __GFP_FS, so the
>>>> system enters a hang-like state when running stress-ng.
>>>>
>>>> To fix it, we could trigger a kthread to call alloc_page() with
>>>> __GFP_FS before returning from out_of_memory() in the !__GFP_FS
>>>> case.
>>> I do not think this is an appropriate way to deal with this issue.
>>> Does it even make sense to trigger the OOM killer for something like
>>> readahead? Would it be more mindful to fail the allocation instead?
>>> That being said, should allocations from squashfs_readahead() use
>>> __GFP_RETRY_MAYFAIL instead?
>>
>> Thanks for your comment. This issue can hardly be reproduced on an
>> ext4 filesystem, because ext4->readahead() does not call
>> alloc_page(). If ext4->readahead() is changed as below, the issue
>> becomes easy to reproduce on ext4 as well (repeatedly run:
>> $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 \
>>     --timeout 30s --skip-silent --verbose)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>  static void ext4_readahead(struct readahead_control *rac)
>>  {
>>         struct inode *inode = rac->mapping->host;
>> +       struct page *tmp_page;
>>
>>         /* If the file has inline data, no need to do readahead. */
>>         if (ext4_has_inline_data(inode))
>>                 return;
>>
>> +       tmp_page = alloc_page(GFP_KERNEL);
>> +
>>         ext4_mpage_readpages(inode, rac, NULL);
>> +
>> +       if (tmp_page)
>> +               __free_page(tmp_page);
>>  }
>>
>
Hi Xiang and Michal,

> Is it tested with a pure ext4 without any other fs background?

Basically yes. Maybe there is a squashfs mounted for python3 in my
test environment, but stress-ng and the shared libraries it needs are
on the ext4.

> I don't think it's true that "ext4->readahead() doesn't call
> alloc_page()" since I think even ext2/ext4 uses buffer head
> interfaces to read metadata (extents or old block mapping)
> from its bd_inode for readahead, which indirectly allocates
> some extra pages to page cache as well.

Calling alloc_page() or allocating memory in readahead() is not a
problem by itself. Suppose we have 4 processes (A, B, C and D).
Processes A, B and C enter out_of_memory() because of allocating
memory in readahead(); they loop and wait for some memory to be
released. Process D can then enter out_of_memory() with __GFP_FS and
trigger the OOM killer, so A, B and C get their memory and return from
readahead(): there is no hang. But if all 4 processes enter
out_of_memory() from readahead(), they loop and wait endlessly; no
process can trigger the OOM killer, so users perceive the system as
hung.
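To make the mechanism concrete, here is roughly how an allocation made
inside a readahead scope ends up at out_of_memory() without __GFP_FS.
This is a simplified sketch of the scoped-NOFS API usage, with a
made-up function name; it is not the actual squashfs or mm code:

#include <linux/sched/mm.h>
#include <linux/gfp.h>

/* Hypothetical illustration, not real kernel code. */
static void nofs_scope_sketch(void)
{
	unsigned int flags;
	struct page *page;

	/* Everything below runs with PF_MEMALLOC_NOFS set. */
	flags = memalloc_nofs_save();

	/*
	 * Inside the scope, current_gfp_context() strips __GFP_FS from
	 * the effective mask, so this GFP_KERNEL allocation is treated
	 * as GFP_KERNEL & ~__GFP_FS: out_of_memory() sees no __GFP_FS
	 * and returns without killing anything.
	 */
	page = alloc_page(GFP_KERNEL);
	if (page)
		__free_page(page);

	memalloc_nofs_restore(flags);
}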
I applied my change to ext4->readahead() on linux-next and tested it
on my Ubuntu classic server for arm64; I could reproduce the hang
issue within 1 minute at a 100% rate. I guess the issue is easy to
reproduce because this is an embedded environment: the total number of
processes in the system is very limited, nearly all userspace
processes eventually reach out_of_memory() from ext4_readahead(), and
nearly all kthreads will not reach out_of_memory() for a long time.
That leaves the system in a hang-like state (not a real hang). This is
why I wrote a patch to let a specific kthread trigger the OOM killer
forcibly (my initial patch).

> The difference only here is the total number of pages to be
> allocated here, but the extra allocations for compressed data make
> it worse. So I think it much depends on how stressful your stress
> workload is, and I'm even not sure it's a real issue, since if you
> stop the stress workload, it will immediately recover (only it may
> not OOM directly).

Yes, it is not a real hang. All userspace processes are looping and
waiting for other processes to release or reclaim memory. And in this
case we cannot stop the stress workload, since users cannot control
the system through the console.

So Michal, I don't know if you have read "[PATCH 0/1] mm/oom_kill:
system enters a state something like hang when running stress-ng". Do
you know why out_of_memory() returns immediately if there is no
__GFP_FS? Could we drop these lines directly:

    /*
     * The OOM killer does not compensate for IO-less reclaim.
     * pagefault_out_of_memory lost its gfp context so we have to
     * make sure exclude 0 mask - all other users should have at least
     * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
     * invoke the OOM killer even if it is a GFP_NOFS allocation.
     */
    if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
        return true;

Thanks,
Hui.

> Thanks,
> Gao Xiang
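P.S. For reference, the kthread idea from my initial patch looks
roughly like this. It is a rough hypothetical sketch of the approach
(all names are made up, and locking is omitted for brevity), not the
code that was actually posted:

#include <linux/kthread.h>
#include <linux/gfp.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(oom_kick_wq);
static bool oom_kick_pending;	/* no locking, illustration only */

/*
 * Helper thread: performs a __GFP_FS allocation on behalf of the NOFS
 * allocators that are stuck looping in the slow path.
 */
static int oom_kick_thread(void *unused)
{
	struct page *page;

	while (!kthread_should_stop()) {
		wait_event(oom_kick_wq,
			   oom_kick_pending || kthread_should_stop());
		oom_kick_pending = false;

		/*
		 * GFP_KERNEL carries __GFP_FS, so if memory is still
		 * exhausted this allocation may invoke the OOM killer,
		 * which the NOFS callers themselves cannot do.
		 */
		page = alloc_page(GFP_KERNEL);
		if (page)
			__free_page(page);
	}
	return 0;
}

/*
 * Would be called from out_of_memory() just before the !__GFP_FS
 * early return quoted above.
 */
static void kick_oom_helper(void)
{
	oom_kick_pending = true;
	wake_up(&oom_kick_wq);
}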