Message-ID: <5eac5a89-e8b6-4a57-91af-7b21012042d3@canonical.com>
Date: Thu, 27 Apr 2023 15:49:19 +0800
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
From: Hui Wang <hui.wang@canonical.com>
To: "Colin King (gmail)", Gao Xiang, Michal Hocko
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com, shy828301@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz, hch@infradead.org, mgorman@suse.de, Phillip Lougher
References: <20230426051030.112007-1-hui.wang@canonical.com> <20230426051030.112007-2-hui.wang@canonical.com> <68b085fe-3347-507c-d739-0dc9b27ebe05@linux.alibaba.com> <4aa48b6a-362d-de1b-f0ff-9bb8dafbdcc7@canonical.com> <8bac892e-15e2-e95b-5b2b-0981f1279e4c@gmail.com>
In-Reply-To: <8bac892e-15e2-e95b-5b2b-0981f1279e4c@gmail.com>

On 4/27/23 15:03, Colin King (gmail) wrote:
> On 27/04/2023 04:47, Hui Wang wrote:
>>
>> On 4/27/23 09:18, Gao Xiang wrote:
>>>
>>>
>>> On 2023/4/26 19:07, Hui Wang wrote:
>>>>
>>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>>> [CC squashfs maintainer]
>>>>>
>>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>>> If we run stress-ng on a squashfs filesystem, the system ends up
>>>>>> in a state that looks like a hang: stress-ng cannot finish
>>>>>> running and the console stops reacting to users' input.
>>>>>>
>>>>>> This issue happens on all arm/arm64 platforms we are working on.
>>>>>> Through debugging, we found this issue is introduced by the oom
>>>>>> handling in the kernel.
>>>>>>
>>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>>> alloc_page(). In this case, if there is no memory left,
>>>>>> out_of_memory() will be called without __GFP_FS, so the oom killer
>>>>>> will not be triggered and this process will loop endlessly,
>>>>>> waiting for others to trigger the oom killer and release some
>>>>>> memory. But on a system whose whole root filesystem is constructed
>>>>>> from squashfs, nearly all userspace processes will call
>>>>>> out_of_memory() without __GFP_FS, so the system enters a state
>>>>>> that looks like a hang when running stress-ng.
>>>>>>
>>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>>> __GFP_FS before returning from out_of_memory() because __GFP_FS
>>>>>> is missing.
>>>>> I do not think this is an appropriate way to deal with this issue.
>>>>> Does it even make sense to trigger the OOM killer for something
>>>>> like readahead? Would it be more mindful to fail the allocation
>>>>> instead? That being said, should allocations from
>>>>> squashfs_readahead use __GFP_RETRY_MAYFAIL instead?
>>>>
>>>> Thanks for your comment. This issue can hardly be reproduced on an
>>>> ext4 filesystem, because ext4->readahead() doesn't call
>>>> alloc_page(). If we change ext4->readahead() as below, it becomes
>>>> easy to reproduce this issue on ext4 (repeatedly run:
>>>> $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
>>>> --timeout 30s --skip-silent --verbose)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>>  static void ext4_readahead(struct readahead_control *rac)
>>>>  {
>>>>         struct inode *inode = rac->mapping->host;
>>>> +       struct page *tmp_page;
>>>>
>>>>         /* If the file has inline data, no need to do readahead. */
>>>>         if (ext4_has_inline_data(inode))
>>>>                 return;
>>>>
>>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>>> +
>>>>         ext4_mpage_readpages(inode, rac, NULL);
>>>> +
>>>> +       if (tmp_page)
>>>> +               __free_page(tmp_page);
>>>>  }
>>>>
>>>
>> Hi Xiang and Michal,
>>> Is it tested with a pure ext4 without any other fs background?
>>>
>> Basically yes. Maybe there is a squashfs mounted for python3 in my
>> test environment. But stress-ng and the shared libs it needs are on
>> the ext4.
>
> One could build a static version of stress-ng to remove the need for
> shared library loading at run time:
>
> git clone https://github.com/ColinIanKing/stress-ng
> cd stress-ng
> make clean
> STATIC=1 make -j 8
>

I did that already, copied it to /home/ubuntu under uc20/uc22 and ran
it from /home/ubuntu; there is no hang issue anymore. The folder
/home/ubuntu/ is on an ext4 filesystem, which proves the issue only
happens on squashfs. And if I build it without STATIC=1, it will hang
even if I run it from /home/ubuntu/, because the system needs to load
shared libs from the squashfs folder.

Thanks,

Hui.
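For readers following the mechanism described in the changelog quoted
above, here is a minimal sketch of the path. It is not taken from any
patch in this thread; the function name is made up and the body is
simplified, but the helpers (memalloc_nofs_save()/restore()) and the
effect on the gfp mask are the ones the changelog refers to:

#include <linux/sched/mm.h>	/* memalloc_nofs_save()/memalloc_nofs_restore() */
#include <linux/gfp.h>

/* Hypothetical helper, only to illustrate a NOFS-scoped readahead path. */
static void nofs_readahead_sketch(void)
{
	unsigned int nofs;
	struct page *page;

	nofs = memalloc_nofs_save();	/* sets PF_MEMALLOC_NOFS for this task */

	/*
	 * A squashfs-style bounce-page allocation. Because the task is in a
	 * NOFS scope, current_gfp_context() reduces the effective mask to
	 * GFP_KERNEL & ~__GFP_FS, so if this allocation runs out of memory,
	 * out_of_memory() returns early (see the check quoted later in this
	 * thread) and the allocator just keeps retrying.
	 */
	page = alloc_page(GFP_KERNEL);
	if (page)
		__free_page(page);

	memalloc_nofs_restore(nofs);
}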
>
>>> I don't think it's true that "ext4->readahead() doesn't call
>>> alloc_page()", since I think even ext2/ext4 uses buffer head
>>> interfaces to read metadata (extents or old block mapping)
>>> from its bd_inode for readahead, which indirectly allocates
>>> some extra pages to the page cache as well.
>>
>> Calling alloc_page() or allocating memory in readahead() is not a
>> problem by itself. Suppose we have 4 processes (A, B, C and D).
>> Processes A, B and C enter out_of_memory() because of allocating
>> memory in readahead(); they loop while waiting for some memory to be
>> released. If process D enters out_of_memory() with __GFP_FS, it can
>> trigger the oom killer, so A, B and C get memory and return to
>> readahead(); there is no system hang issue.
>>
>> But if all 4 processes enter out_of_memory() from readahead(), they
>> will loop and wait endlessly; there is no process to trigger the oom
>> killer, so users will think the system is hung.
>>
>> I applied my change for ext4->readahead() to linux-next and tested it
>> on my ubuntu classic server for arm64; I could reproduce the hang
>> issue within 1 minute with a 100% rate. I guess it is easy to
>> reproduce the issue because it is an embedded environment: the total
>> number of processes in the system is very limited, nearly all
>> userspace processes will eventually reach out_of_memory() from
>> ext4_readahead(), and nearly all kthreads will not reach
>> out_of_memory() for a long time, which leaves the system in a state
>> like a hang (not a real hang).
>>
>> And this is why I wrote a patch to let a specific kthread trigger the
>> oom killer forcibly (my initial patch).
>>
>>
>>>
>>> The only difference here is the total number of pages to be
>>> allocated, but the many extra allocations for compressed data make
>>> it worse. So I think it much depends on what your stress workload
>>> looks like, and I'm even not sure it's a real issue, since if you
>>> stop the stress workload, it will immediately recover (only it may
>>> not oom directly).
>>>
>> Yes, it is not a real hang. All userspace processes are looping and
>> waiting for other processes to release or reclaim memory. And in this
>> case, we can't stop the stress workload since users can't control the
>> system through the console.
>>
>> So Michal,
>>
>> I don't know if you read "[PATCH 0/1] mm/oom_kill: system enters a
>> state something like hang when running stress-ng"; do you know why
>> out_of_memory() returns immediately if there is no __GFP_FS, and
>> could we drop these lines directly:
>>
>>      /*
>>       * The OOM killer does not compensate for IO-less reclaim.
>>       * pagefault_out_of_memory lost its gfp context so we have to
>>       * make sure exclude 0 mask - all other users should have at least
>>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>>       */
>>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>>          return true;
>>
>>
>> Thanks,
>>
>> Hui.
>>
>>> Thanks,
>>> Gao Xiang
>
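As a point of comparison, Michal's __GFP_RETRY_MAYFAIL suggestion would
look roughly like the sketch below for a readahead-style allocation.
This is only an illustration under the assumption that failing a
readahead allocation is acceptable; it is not a patch to
squashfs_readahead(), and the helper names are made up:

#include <linux/gfp.h>

/*
 * Hypothetical readahead-path allocation that is allowed to fail.
 * __GFP_RETRY_MAYFAIL retries reclaim but eventually returns NULL
 * instead of looping indefinitely, so the caller can skip the readahead
 * rather than waiting for someone else to trigger the OOM killer.
 */
static struct page *readahead_alloc_page_sketch(void)
{
	return alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL);
}

static void readahead_step_sketch(void)
{
	struct page *page = readahead_alloc_page_sketch();

	if (!page)
		return;		/* readahead is best-effort: just give up */

	/* ... fill the page and add it to the page cache here ... */
	__free_page(page);
}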