Date: Tue, 23 Aug 2022 17:38:31 +0200
From: Michal Hocko
To: Mel Gorman
Cc: David Hildenbrand, Patrick Daly, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org, Juergen Gross
Subject: Re: Race condition in build_all_zonelists() when offlining movable zone
References: <20220823094950.ocjyur2h3mqnqbeg@suse.de>
	<0fc01e47-51f3-baf7-2d46-72291422f695@redhat.com>
	<20220823110946.o3eawk3kghaykcim@suse.de>
	<20220823125850.o3nhkmikmv7vyxq4@suse.de>
	<37031749-ff57-2f90-5c90-f16473f31e37@redhat.com>
	<20220823151415.zorod7xtmoiu6wy3@suse.de>
In-Reply-To: <20220823151415.zorod7xtmoiu6wy3@suse.de>

On Tue 23-08-22 16:14:15, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
> > > I think I agree that 6aa303defb74 is most likely not the origin of this.
> > > It could only have been the origin in weird corner cases where we
> > > actually succeed offlining one memory block (adjust present+managed) and
> > > end up with managed=0 and present!=0 -- which barely happens in
> > > practice, especially for ZONE_MOVABLE. (Yes, there is memory ballooning
> > > that adjusts managed pages dynamically and might provoke such a
> > > situation on ZONE_MOVABLE.)
> >
> > OK, thanks for the correction, David. Then I would agree that a Fixes tag
> > could be more confusing than helpful and your summary above would be a
> > great part of the changelog.
>
> Given that 6aa303defb74 still makes it potentially worse, it's as good a
> Fixes target as any, given that anything prior to that commit would need
> careful examination. The race changes shape going further back in time
> until memory hot-remove was initially added, and anyone who needs to go
> that far back will also need to check whether the ZLC needs special
> treatment.
>
> Provisional patch and changelog are below. I'd still like to get a
> Tested-by from Patrick to confirm it still fixes the problem before
> posting formally.
>
> --8<--
> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
>
> Patrick Daly reported the following problem:
>
> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
> [0] - ZONE_MOVABLE
> [1] - ZONE_NORMAL
> [2] - NULL
>
> For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
> offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
> memory_offline operation removes the last page from ZONE_MOVABLE,
> build_all_zonelists() & build_zonerefs_node() will update
> node_zonelists as shown below. Only populated zones are added.
>
> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
> [0] - ZONE_NORMAL
> [1] - NULL
> [2] - NULL
>
> The race is simple -- page allocation could be in progress when a memory
> hot-remove operation triggers a zonelist rebuild that removes zones.
> The allocation request will still have a valid ac->preferred_zoneref that
> now points to NULL and triggers an OOM kill.
>
> This problem probably always existed but may be slightly easier to trigger
> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator"), which distinguishes between
> zones that are completely unpopulated and zones that have valid pages,
> all of them reserved. Memory hotplug had multiple stages with timing
> considerations around the managed/present page updates, the zonelist
> rebuild and the zone span updates. As David Hildenbrand puts it:
>
>	memory offlining adjusts managed+present pages of the zone
>	essentially in one go. If after the adjustments, the zone is no
>	longer populated (present==0), we rebuild the zone lists.
>
>	Once that's done, we try shrinking the zone (start+spanned
>	pages) -- which results in zone_start_pfn == 0 if there are no
>	more pages. That happens *after* rebuilding the zonelists via
>	remove_pfn_range_from_zone().
>
> The only requirement to fix the race is that a page allocation request
> identifies when a zonelist rebuild has happened since the allocation
> request started and no page has yet been allocated. Use a seqlock_t to
> track zonelist updates: the read side of the zonelist is lockless, while
> the rebuild and the counter update are protected by the embedded spinlock.
>
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
> Reported-by: Patrick Daly
> Signed-off-by: Mel Gorman
> Cc:
> ---
>  mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
>  1 file changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..216e21048ddf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(fs_reclaim_release);
>  #endif
>
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so that the allocation can be retried. Reader side does
> + * not lock and retries the allocation if zonelist changes. Writer side is
> + * protected by the embedded spin_lock.
> + */
> +DEFINE_SEQLOCK(zonelist_update_seq);
> +
> +static unsigned int zonelist_iter_begin(void)
> +{

You likely want something like

	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
		return read_seqbegin(&zonelist_update_seq);
	return 0;

> +	return read_seqbegin(&zonelist_update_seq);
> +}
> +
> +static unsigned int check_retry_zonelist(unsigned int seq)
> +{

	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
		return read_seqretry(&zonelist_update_seq, seq);
	return seq;

> +	return read_seqretry(&zonelist_update_seq, seq);
> +}
> +

to avoid overhead on systems without HOTREMOVE configured. Other than
that LGTM.

Acked-by: Michal Hocko

Thanks!
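
For anyone unfamiliar with the pattern the fix relies on: the reader
samples a sequence counter before the lockless zonelist walk and rechecks
it before acting on the result, while the writer serializes rebuilds and
bumps the counter around them. Below is a minimal userspace sketch of
that pattern. All names here (zonelist_seq, iter_begin, check_retry,
rebuild_zonelist, fake_zonelist) are illustrative, not kernel APIs, and
the kernel's seqlock_t additionally handles the memory barriers and
lockdep annotations this sketch glosses over.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint zonelist_seq;	/* even = stable, odd = rebuild in flight */
static pthread_mutex_t zonelist_lock = PTHREAD_MUTEX_INITIALIZER;
static int fake_zonelist[3] = { 2, 1, 0 };	/* stand-in for a node's zonelist */

/* Reader side: sample the sequence before the lockless walk. */
static unsigned int iter_begin(void)
{
	unsigned int seq;

	while ((seq = atomic_load(&zonelist_seq)) & 1)
		;	/* a writer is mid-rebuild; wait for an even value */
	return seq;
}

/* Reader side: nonzero if a rebuild raced with us and we must restart. */
static int check_retry(unsigned int seq)
{
	return atomic_load(&zonelist_seq) != seq;
}

/* Writer side: serialize rebuilds and bump the sequence around them. */
static void rebuild_zonelist(int new_first)
{
	pthread_mutex_lock(&zonelist_lock);
	atomic_fetch_add(&zonelist_seq, 1);	/* odd: rebuild in progress */
	fake_zonelist[0] = new_first;		/* the "rebuild" */
	atomic_fetch_add(&zonelist_seq, 1);	/* even: stable again */
	pthread_mutex_unlock(&zonelist_lock);
}

int main(void)
{
	unsigned int seq;
	int first;

	do {
		seq = iter_begin();
		first = fake_zonelist[0];	/* lockless read of the structure */
	} while (check_retry(seq));		/* restart rather than act on stale data */

	printf("read zone id %d at seq %u\n", first, seq);
	rebuild_zonelist(5);
	return 0;
}

The design point, as in the patch itself, is that readers pay only two
counter reads on the fast path and never block the writer; the cost of a
concurrent rebuild falls on the rare racing reader, which simply restarts.
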
>  /* Perform direct synchronous page reclaim */
>  static unsigned long
>  __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int compaction_retries;
>  	int no_progress_loops;
>  	unsigned int cpuset_mems_cookie;
> +	unsigned int zonelist_iter_cookie;
>  	int reserve_flags;
>
>  	/*
> @@ -5011,11 +5030,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>  		gfp_mask &= ~__GFP_ATOMIC;
>
> -retry_cpuset:
> +restart:
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
>  	compact_priority = DEF_COMPACT_PRIORITY;
>  	cpuset_mems_cookie = read_mems_allowed_begin();
> +	zonelist_iter_cookie = zonelist_iter_begin();
>
>  	/*
>  	 * The fast path uses conservative alloc_flags to succeed only until
> @@ -5187,9 +5207,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto retry;
>
>
> -	/* Deal with possible cpuset update races before we start OOM killing */
> -	if (check_retry_cpuset(cpuset_mems_cookie, ac))
> -		goto retry_cpuset;
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * an unnecessary OOM kill.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
> +		goto restart;
>
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> -	static DEFINE_SPINLOCK(lock);
>
> -	spin_lock(&lock);
> +	write_seqlock(&zonelist_update_seq);
>
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>
> -	spin_unlock(&lock);
> +	write_sequnlock(&zonelist_update_seq);
>  }
>
>  static noinline void __init

-- 
Michal Hocko
SUSE Labs