Date: Fri, 6 Nov 2020 09:10:26 +0100
From: Michal Hocko
To: Feng Tang
Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
    Mel Gorman, dave.hansen@intel.com, ying.huang@intel.com,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
Message-ID: <20201106081026.GB7247@dhcp22.suse.cz>
In-Reply-To: <20201106070656.GA129085@shbuild999.sh.intel.com>
References: <20201104085343.GA18718@dhcp22.suse.cz>
 <20201105014028.GA86777@shbuild999.sh.intel.com>
 <20201105120818.GC21348@dhcp22.suse.cz>
 <4029c079-b1f3-f290-26b6-a819c52f5200@suse.cz>
 <20201105125828.GG21348@dhcp22.suse.cz>
 <20201105130710.GB16525@shbuild999.sh.intel.com>
 <20201105131245.GH21348@dhcp22.suse.cz>
 <20201105134305.GA16424@shbuild999.sh.intel.com>
 <20201105161612.GM21348@dhcp22.suse.cz>
 <20201106070656.GA129085@shbuild999.sh.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Fri 06-11-20 15:06:56, Feng Tang wrote:
> On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote:
> > On Thu 05-11-20 21:43:05, Feng Tang wrote:
> > > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > > > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > > > [...]
> > > > > My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'
> > > >
> > > > Can you provide the full information please? Which node has been
> > > > requested? Which cpuset does the calling process run in, and which
> > > > node has the allocation succeeded from? A bare dump_stack without
> > > > any further context is not really helpful.
> > >
> > > I don't have the same platform as the original report, so I simulated
> > > a similar setup (with fakenuma and movablecore) which has 2 memory
> > > nodes: node 0 has DMA/DMA32/Movable zones, while node 1 has only the
> > > Movable zone. With it, I can reproduce the same error and the same
> > > OOM callstack as the original report (as in the cover letter).
> > >
> > > The test command is:
> > >
> > >   # docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> > >
> > > To debug I only added some tracing in __alloc_pages_nodemask(), and
> > > for the callstack which gets the page successfully:
> > >
> > > [  567.510903] Call Trace:
> > > [  567.510909]  dump_stack+0x74/0x9a
> > > [  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > [  567.510913]  alloc_pages_current+0x87/0xe0
> > > [  567.510914]  __vmalloc_node_range+0x14c/0x240
> > > [  567.510918]  module_alloc+0x82/0xe0
> > > [  567.510921]  bpf_jit_alloc_exec+0xe/0x10
> > > [  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
> > > [  567.510925]  bpf_int_jit_compile+0x145/0x424
> > > [  567.510926]  bpf_prog_select_runtime+0xac/0x130
> >
> > As already said, this doesn't really tell much without the additional
> > information.
> >
> > > The incoming nodemask parameter is NULL, so the function first tries
> > > the cpuset nodemask (1 here), and the granted zoneidx is only 2,
> > > which makes the 'ac's preferred zone NULL. So it goes into
> > > __alloc_pages_slowpath(), which first sets the nodemask to NULL, and
> > > this time it gets a preferred zone: zone DMA32 from node 0, and the
> > > following get_page_from_freelist() allocates one page from that zone.
> >
> > I do not follow. Both the hot and slow paths of the allocator set
> > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are enabled,
> > IIRC. This is later enforced in get_page_from_freelist. There are some
> > exceptions when the allocating process can run away from its cpuset -
> > e.g. IRQs, OOM victims and a few other cases - but definitely not a
> > random allocation. There might be some subtle details that have
> > changed, or that I might have forgotten, but
>
> yes, I was confused too.
> IIUC, the key check inside get_page_from_freelist() is
>
> 	if (cpusets_enabled() &&
> 		(alloc_flags & ALLOC_CPUSET) &&
> 		!__cpuset_zone_allowed(zone, gfp_mask))
>
> In our case (a kernel page got allocated), the first 2 conditions are
> true, and for __cpuset_zone_allowed(), the place that can return true
> is the check against the parent cpuset's nodemask:
>
> 	cs = nearest_hardwall_ancestor(task_cs(current));
> 	allowed = node_isset(node, cs->mems_allowed);
>
> This will override the ALLOC_CPUSET check.

Yes, and this is ok because that is the defined hierarchical semantic of
cpusets, which applies to any !hardwalled allocation.

Cpusets are quite non-intuitive. Re-reading the previous discussion I
have realized that my trying not to go into those details might have
misled you. Let me try again and clarify that now.

I was talking in the context of the patch you are proposing, and that is
a clear violation of the cpuset isolation. Especially for hardwalled
setups, because it allows spilling over to other nodes, which shouldn't
be possible except for a few exceptions which shouldn't generate a lot
of allocations (e.g. an OOM victim exiting, IRQ context).

What I was not talking about, and should have been clearer about, is
that without hardwall resp. exclusive nodes the isolation is best effort
only for most kernel allocation requests (or more specifically those
without __GFP_HARDWALL). Your patch doesn't distinguish between those
and any non-movable allocations, and effectively allows a runaway even
for hardwalled allocations which are not movable. Those can be
controlled by userspace very easily.

I hope this clarifies it a bit more, and sorry if I misled you.

-- 
Michal Hocko
SUSE Labs