Date: Fri, 6 Nov 2020 09:10:26 +0100
From: Michal Hocko
To: Feng Tang
Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
    Mel Gorman, dave.hansen@intel.com, ying.huang@intel.com,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
Message-ID: <20201106081026.GB7247@dhcp22.suse.cz>
In-Reply-To: <20201106070656.GA129085@shbuild999.sh.intel.com>
References: <20201104085343.GA18718@dhcp22.suse.cz>
 <20201105014028.GA86777@shbuild999.sh.intel.com>
 <20201105120818.GC21348@dhcp22.suse.cz>
 <4029c079-b1f3-f290-26b6-a819c52f5200@suse.cz>
 <20201105125828.GG21348@dhcp22.suse.cz>
 <20201105130710.GB16525@shbuild999.sh.intel.com>
 <20201105131245.GH21348@dhcp22.suse.cz>
 <20201105134305.GA16424@shbuild999.sh.intel.com>
 <20201105161612.GM21348@dhcp22.suse.cz>
 <20201106070656.GA129085@shbuild999.sh.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Fri 06-11-20 15:06:56, Feng Tang wrote:
> On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote:
> > On Thu 05-11-20 21:43:05, Feng Tang wrote:
> > > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > > > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > > > [...]
> > > > > My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'
> > > >
> > > > Can you provide the full information please? Which node has been
> > > > requested? Which cpuset does the calling process run in, and which
> > > > node has the allocation succeeded from? A bare dump_stack without
> > > > any further context is not really helpful.
> > >
> > > I don't have the same platform as the original report, so I simulated
> > > a similar setup (with fakenuma and movablecore) which has 2 memory
> > > nodes: node 0 has DMA/DMA32/Movable zones, while node 1 has only the
> > > Movable zone. With it, I can reproduce the same error and the same
> > > OOM callstack as the original report (as in the cover letter).
> > >
> > > The test command is:
> > >
> > >   # docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> > >
> > > To debug I only added some tracing in __alloc_pages_nodemask(), and
> > > for the callstack which gets the page successfully:
> > >
> > > [  567.510903] Call Trace:
> > > [  567.510909]  dump_stack+0x74/0x9a
> > > [  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > [  567.510913]  alloc_pages_current+0x87/0xe0
> > > [  567.510914]  __vmalloc_node_range+0x14c/0x240
> > > [  567.510918]  module_alloc+0x82/0xe0
> > > [  567.510921]  bpf_jit_alloc_exec+0xe/0x10
> > > [  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
> > > [  567.510925]  bpf_int_jit_compile+0x145/0x424
> > > [  567.510926]  bpf_prog_select_runtime+0xac/0x130
> >
> > As already said, this doesn't really tell much without the additional
> > information.
> >
> > > The incoming nodemask parameter is NULL, so the function first tries
> > > the cpuset nodemask (1 here), and the granted zoneidx is only 2,
> > > which makes the 'ac's preferred zone NULL. So it goes into
> > > __alloc_pages_slowpath(), which first sets the nodemask to NULL, and
> > > this time it gets a preferred zone: zone DMA32 from node 0, and the
> > > following get_page_from_freelist() allocates one page from that zone.
> >
> > I do not follow. Both the hot and slow paths of the allocator set
> > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are enabled,
> > IIRC. This is later enforced in get_page_from_freelist. There are some
> > exceptions when the allocating process can run away from its cpuset -
> > e.g. IRQs, OOM victims and a few other cases - but definitely not a
> > random allocation. There might be some subtle details that have
> > changed, or that I might have forgotten, but
>
> yes, I was confused too.
> IIUC, the key check inside get_page_from_freelist() is
>
> 	if (cpusets_enabled() &&
> 		(alloc_flags & ALLOC_CPUSET) &&
> 		!__cpuset_zone_allowed(zone, gfp_mask))
>
> In our case (a kernel page got allocated), the first 2 conditions are
> true, and for __cpuset_zone_allowed(), the place that can return true
> is the check against the parent cpuset's nodemask:
>
> 	cs = nearest_hardwall_ancestor(task_cs(current));
> 	allowed = node_isset(node, cs->mems_allowed);
>
> This will override the ALLOC_CPUSET check.

Yes, and this is ok because that is the defined hierarchical semantic of
cpusets, which applies to any !hardwalled allocation.

Cpusets are quite non-intuitive. Re-reading the previous discussion I
have realized that my trying not to go into those details might have
misled you. Let me try again and clarify that now.

I was talking in the context of the patch you are proposing, and that is
a clear violation of the cpuset isolation. Especially for hardwalled
setups, because it allows spilling over to other nodes, which shouldn't
be possible except for a few exceptions which shouldn't generate a lot
of allocations (e.g. an OOM victim exiting, IRQ context).

What I was not talking about, and should have been clearer about, is
that without hardwall resp. exclusive nodes the isolation is best effort
only for most kernel allocation requests (or more specifically those
without __GFP_HARDWALL). Your patch doesn't distinguish between those
and any non-movable allocations, and effectively allows a runaway even
for hardwalled allocations which are not movable. Those can be
controlled by userspace very easily.

I hope this clarifies it a bit more, and sorry if I misled you.

-- 
Michal Hocko
SUSE Labs