From: Gao Xiang
Date: Thu, 27 Apr 2023 12:17:02 +0800
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
To: Hui Wang, Michal Hocko
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com,
    colin.i.king@gmail.com, shy828301@gmail.com, hannes@cmpxchg.org,
    vbabka@suse.cz, hch@infradead.org, mgorman@suse.de, Phillip Lougher
References: <20230426051030.112007-1-hui.wang@canonical.com>
    <20230426051030.112007-2-hui.wang@canonical.com>
    <68b085fe-3347-507c-d739-0dc9b27ebe05@linux.alibaba.com>
    <4aa48b6a-362d-de1b-f0ff-9bb8dafbdcc7@canonical.com>
In-Reply-To: <4aa48b6a-362d-de1b-f0ff-9bb8dafbdcc7@canonical.com>

On 2023/4/27 11:47, Hui Wang wrote:
>

...

>>
> Hi Xiang and Michal,
>> Is it tested with a pure ext4 without any other fs background?
>>
> Basically yes. Maybe there is a squashfs mounted for python3 in my
> test environment. But stress-ng and its needed sharing libs are in
> the ext4.
>> I don't think it's true that "ext4->readahead() doesn't call
>> alloc_page()" since I think even ext2/ext4 uses buffer head
>> interfaces to read metadata (extents or old block mapping)
>> from its bd_inode for readahead, which indirectly allocates
>> some extra pages to page cache as well.
>
> Calling alloc_page() or allocating memory in readahead() is not a
> problem. Suppose we have 4 processes (A, B, C and D). Processes A, B
> and C enter out_of_memory() because of allocating memory in
> readahead(); they loop and wait for some memory to be released.
> Process D could enter out_of_memory() with __GFP_FS and trigger the
> oom killer, so A, B and C could get the memory and return to
> readahead(); there is no system hang issue.
>
> But if all 4 processes enter out_of_memory() from readahead(), they
> will loop and wait endlessly; there is no process to trigger the oom
> killer, so users will think the system is hanging.
>
> I applied my change for ext4->readahead to linux-next and tested it
> on my ubuntu classic server for arm64; I could reproduce the hang
> issue within 1 minute at a 100% rate. I guess it is easy to reproduce
> the issue because it is an embedded environment: the total number of
> processes in the system is very limited, nearly all userspace
> processes will finally reach out_of_memory() from ext4_readahead(),
> and nearly all kthreads will not reach out_of_memory() for a long
> time, which puts the system in a hang-like state (not a real hang).
>
> And this is why I wrote a patch to let a specific kthread trigger
> the oom killer forcibly (my initial patch).
>
>
>>
>> The only difference here is the total number of pages to be
>> allocated, but lots of extra compressed data taking extra
>> allocations makes it worse.
>> So I think it very much depends on how stressful your stress
>> workload is, and I'm not even sure it's a real issue, since if
>> you stop the stress workload it will immediately recover (only
>> it may not oom directly).
>>
> Yes, it is not a real hang. All userspace processes are looping and
> waiting for other processes to release or reclaim memory. And in this
> case, we can't stop the stress workload since users can't control the
> system through the console.
>
> So Michal,
>
> I don't know if you read "[PATCH 0/1] mm/oom_kill: system enters a
> state something like hang when running stress-ng". Do you know why
> out_of_memory() returns immediately if there is no __GFP_FS? Could we
> drop these lines directly:
>
>     /*
>      * The OOM killer does not compensate for IO-less reclaim.
>      * pagefault_out_of_memory lost its gfp context so we have to
>      * make sure exclude 0 mask - all other users should have at least
>      * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>      * invoke the OOM killer even if it is a GFP_NOFS allocation.
>      */
>     if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>         return true;

I'm curious about this as well.

Apart from that, we also have some stacked virtual block devices,
which mostly use GFP_NOIO allocations. So if they allocate extra
pages with GFP_NOIO like this, the oom kill won't happen either.
And mostly such block drivers cannot fail I/Os (with an I/O error),
since that would have worse results for upper fses and end users.

Generally I don't think reserving more pages makes sense, since it
still depends on the stress workload (and it makes more pages private
and unusable to other subsystems and user programs), so we are still
left with two outcomes:

  - wait (hang) for previous requests to get free pages;

  - return ENOMEM instead.

Thanks,
Gao Xiang

>
>
> Thanks,
>
> Hui.
>
>> Thanks,
>> Gao Xiang
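
For readers following the thread, the early return quoted above and the four-process scenario can be modelled in user space. This is only a rough sketch, not kernel code: every `sk_`/`SK_` name is hypothetical, the flag bit values are made up for illustration (the real ones are defined in the kernel's GFP headers), and the real out_of_memory() does far more than this one check. The sketch assumes that GFP_NOFS clears only the FS bit and that GFP_NOIO clears both the IO and FS bits.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, chosen only for this illustration. */
#define SK_GFP_DIRECT_RECLAIM 0x1u
#define SK_GFP_IO             0x2u
#define SK_GFP_FS             0x4u

#define SK_GFP_KERNEL (SK_GFP_DIRECT_RECLAIM | SK_GFP_IO | SK_GFP_FS)
#define SK_GFP_NOFS   (SK_GFP_DIRECT_RECLAIM | SK_GFP_IO)
#define SK_GFP_NOIO   (SK_GFP_DIRECT_RECLAIM)

/* Simplified stand-in for the fields of struct oom_control used by the check. */
struct sk_oom_control {
	unsigned int gfp_mask;
	bool memcg_oom;		/* stands in for is_memcg_oom(oc) */
};

/*
 * Mirrors the quoted condition: returns true when the OOM path would
 * return early without picking a victim.
 */
static bool sk_oom_skips_kill(const struct sk_oom_control *oc)
{
	return oc->gfp_mask && !(oc->gfp_mask & SK_GFP_FS) && !oc->memcg_oom;
}

/*
 * Returns true if at least one of the allocators currently stuck in the
 * OOM path is allowed to invoke the OOM killer.
 */
static bool sk_any_can_trigger_oom(const struct sk_oom_control *ocs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!sk_oom_skips_kill(&ocs[i]))
			return true;
	}
	return false;
}
```

With three NOFS allocators plus one GFP_KERNEL allocator (process D in the example above), `sk_any_can_trigger_oom()` returns true. With four NOFS allocators it returns false, which corresponds to the hang-like state described in the thread. A GFP_NOIO allocation is skipped in the same way, because it also lacks the FS bit.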