From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 44AD8C32772 for ; Tue, 23 Aug 2022 15:51:33 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 946A16B0073; Tue, 23 Aug 2022 11:51:32 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8A67A6B0074; Tue, 23 Aug 2022 11:51:32 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6D266940007; Tue, 23 Aug 2022 11:51:32 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 56BDE6B0073 for ; Tue, 23 Aug 2022 11:51:32 -0400 (EDT) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 1F88E81586 for ; Tue, 23 Aug 2022 15:51:32 +0000 (UTC) X-FDA: 79831297224.06.85315CE Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf05.hostedemail.com (Postfix) with ESMTP id D6101100032 for ; Tue, 23 Aug 2022 15:51:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1661269889; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=z4UADzd2G0onG8slch5OIAcI4i1jO4ZASZYRyFg4K/M=; b=NcXOF1kLyPJf5z1k4GRsBGtI4ZlQdNdvP2dXnZ6/JZ4XcVP+l6qUMfnErJ7LEg3tlPaAzs eKDlXWkKPuWjv04zbh82Iz2pdqWnGBKcJoS3oUOoKkiYWfgQjAFStGgfZAbdaLoPvYwa5y iWr3zq6ZvyAW+JBJejbRcyfiklk82Uk= Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-248-jZ8YxOU0NJiVuI30P-S5JA-1; Tue, 23 Aug 2022 11:51:28 -0400 X-MC-Unique: jZ8YxOU0NJiVuI30P-S5JA-1 Received: by mail-wm1-f72.google.com with SMTP id j22-20020a05600c485600b003a5e4420552so10444277wmo.8 for ; Tue, 23 Aug 2022 08:51:27 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:organization:from:references :cc:to:content-language:subject:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc; bh=z4UADzd2G0onG8slch5OIAcI4i1jO4ZASZYRyFg4K/M=; b=3HHMNMcLR9Nls0Sd2nD33Cle4HIi9HLiD1AFxrmzZVPdjGttTHfIFZ1DMe1AVxsoeK LaovrwiPhILkIRM3tpOZI+q1vhnwGBgLwYry7tc7gXtO3Z2l3ceHXrAp14W8SOTkLPmW vJPQRUQbcoSt6FKOk5iUycLvEf/8UngxrSaHi/VwOPHmArtH7Dn6VruDChchhNyQj1Hc 93A9mbLira4wySiJW6Z4oPWTSHwzBr/VChBQlwGnSeq1WpQHlUHUYk5GkTrCSalScYfJ 0gWX48I2zM9Sa/wKhOj/nH1IUMfvZp19e27hd7IdXyejKqFkV3MC7HRK242JIVNE6d30 i9SQ== X-Gm-Message-State: ACgBeo3xgdZKnpSrqjbUtZB0IpsvutwgpThubjpcNIBSj0d+z0gD7YTX OxtxTB+7ReNDwzAk7QiTFujqT3Rf1rYOWNQQuqbvECRFQRYRLqv3PQvVHk2qLS8OjMnT2j3FYJz t+cNrZLTIaz0= X-Received: by 2002:a5d:4a11:0:b0:225:2f5e:c704 with SMTP id m17-20020a5d4a11000000b002252f5ec704mr13590487wrq.703.1661269886865; Tue, 23 Aug 2022 08:51:26 -0700 (PDT) X-Google-Smtp-Source: AA6agR4EwQ/q6X4+HC5vTZ2W8bEhcuhj7Th4Hl8z9pXgAaRh73bG08MzABgfU319tykMaTUQYjkDuA== X-Received: by 2002:a5d:4a11:0:b0:225:2f5e:c704 with SMTP id m17-20020a5d4a11000000b002252f5ec704mr13590474wrq.703.1661269886545; Tue, 23 Aug 2022 08:51:26 -0700 (PDT) Received: from ?IPV6:2003:cb:c70b:1600:c48b:1fab:a330:5182? (p200300cbc70b1600c48b1faba3305182.dip0.t-ipconnect.de. [2003:cb:c70b:1600:c48b:1fab:a330:5182]) by smtp.gmail.com with ESMTPSA id g1-20020adffc81000000b0022520aba90asm14922998wrr.107.2022.08.23.08.51.25 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 23 Aug 2022 08:51:26 -0700 (PDT) Message-ID: <23eda994-44bc-b05c-9e1b-e0095f7ca547@redhat.com> Date: Tue, 23 Aug 2022 17:51:25 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0 Subject: Re: Race condition in build_all_zonelists() when offlining movable zone To: Michal Hocko , Mel Gorman Cc: Patrick Daly , linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, Juergen Gross References: <20220823094950.ocjyur2h3mqnqbeg@suse.de> <0fc01e47-51f3-baf7-2d46-72291422f695@redhat.com> <20220823110946.o3eawk3kghaykcim@suse.de> <20220823125850.o3nhkmikmv7vyxq4@suse.de> <37031749-ff57-2f90-5c90-f16473f31e37@redhat.com> <20220823151415.zorod7xtmoiu6wy3@suse.de> From: David Hildenbrand Organization: Red Hat In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1661269889; a=rsa-sha256; cv=none; b=DCHStbQmvUfmbfpQYMRqp8BIcYIUq+i8qsiWuNWZwpKrVM1wlJyuX/xnRqNCKr4928VWET waATgZ+ieGnRLycRsFo9z+Vml3NptOW9W7qV9aIYC8fVGsbz3prOWMYKJG9oeTZXRjJUi/ QF40avEql72c3GU7fYtAliCvkUuViME= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=NcXOF1kL; spf=pass (imf05.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1661269889; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=z4UADzd2G0onG8slch5OIAcI4i1jO4ZASZYRyFg4K/M=; b=lPgiLTPvEqepYc6SeQwe20wYBTFLpzDQoo5b3BwH65Zt9VF8SXCn66KERkQrxFgphqP/ul 7K1DGyqChF3IwYPu3gVnK8kA4PyLWoVwTn1OeXfp/SX0LCl22izDkFBDOxj/z5mipAG/iu WZff89VZ3kaH8Z13FLC8n1v/LsAoN0Y= X-Rspam-User: X-Stat-Signature: bue9aut7643h6a13owceg47or75exea4 X-Rspamd-Queue-Id: D6101100032 X-Rspamd-Server: rspam12 Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=NcXOF1kL; spf=pass (imf05.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com X-HE-Tag: 1661269889-724038 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 23.08.22 17:38, Michal Hocko wrote: > On Tue 23-08-22 16:14:15, Mel Gorman wrote: >> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote: >>>> I think I agree that 6aa303defb74 is most likely not the origin of this. >>>> It could only have been the origin in weird corner cases where we >>>> actually succeed offlining one memory block (adjust present+managed) and >>>> end up with managed=0 and present!=0 -- which barely happens in >>>> practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning >>>> that adjusts managed pages dynamically and might provoke such a >>>> situation on ZONE_MOVABLE) >>> >>> OK, thanks for the correction David. Then I would agree that Fixes tag >>> could be more confusing than helpful and your above summary would be a >>> great part of the changelog. >>> >> >> Given that 6aa303defb74 still makes it potentially worse, it's as good a >> Fixes-by as any given that anything prior to that commit would need careful >> examination. The race changes shape going further back in time until memory >> hot-remove was initially added and if someone needs to go that far back, >> they'll also need to check if the ZLC needs special treatment. >> >> Provisional patch and changelog is below. I'd still like to get a Tested-by >> from Patrick to confirm it still fixes the problem before posting formally. >> >> --8<-- >> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation >> >> Patrick Daly reported the following problem; >> >> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation >> [0] - ZONE_MOVABLE >> [1] - ZONE_NORMAL >> [2] - NULL >> >> For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the >> offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent >> memory_offline operation removes the last page from ZONE_MOVABLE, >> build_all_zonelists() & build_zonerefs_node() will update >> node_zonelists as shown below. Only populated zones are added. >> >> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation >> [0] - ZONE_NORMAL >> [1] - NULL >> [2] - NULL >> >> The race is simple -- page allocation could be in progress when a memory >> hot-remove operation triggers a zonelist rebuild that removes zones. >> The allocation request will still have a valid ac->preferred_zoneref that >> is now pointing to NULL and triggers an OOM kill. >> >> This problem probably always existed but may be slighly easier to trigger s/slighly/slightly/ >> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones >> with pages managed by the buddy allocator") which distinguishes between >> zones that are completely unpopulated versus zones that have valid pages >> but they are all reserved. Memory hotplug had multiple stages with Not necessarily reserved, simply not managed by the buddy (e.g., early allocations, memory ballooning / virtio-mem). >> timing considerations around managed/present page updates, the zonelist >> rebuild and the zone span updates. As David Hildenbrand puts it >> >> memory offlining adjusts managed+present pages of the zone >> essentially in one go. If after the adjustments, the zone is no >> longer populated (present==0), we rebuild the zone lists. >> >> Once that's done, we try shrinking the zone (start+spanned >> pages) -- which results in zone_start_pfn == 0 if there are no >> more pages. That happens *after* rebuilding the zonelists via >> remove_pfn_range_from_zone(). >> >> The only requirement to fix the race is that a page allocation request >> identifies when a zonelist rebuild has happened since the allocation >> request started and no page has yet been allocated. Use a seqlock_t to track >> zonelist updates with a lockless read-side of the zonelist and protecting >> the rebuild and update of the counter with a spinlock. >> >> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") >> Reported-by: Patrick Daly >> Signed-off-by: Mel Gorman >> Cc: >> --- >> mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++------- >> 1 file changed, 30 insertions(+), 7 deletions(-) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index e5486d47406e..216e21048ddf 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask) >> EXPORT_SYMBOL_GPL(fs_reclaim_release); >> #endif >> >> +/* >> + * Zonelists may change due to hotplug during allocation. Detect when zonelists >> + * have been rebuilt so allocation retries. Reader side does not lock and >> + * retries the allocation if zonelist changes. Writer side is protected by the >> + * embedded spin_lock. >> + */ >> +DEFINE_SEQLOCK(zonelist_update_seq); >> + >> +static unsigned int zonelist_iter_begin(void) >> +{ > > You likely want something like > if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) > return read_seqbegin(&zonelist_update_seq); > return 0; > >> + return read_seqbegin(&zonelist_update_seq); >> +} >> + >> +static unsigned int check_retry_zonelist(unsigned int seq) >> +{ > if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) > return read_seqretry(&zonelist_update_seq, seq); > return seq; > >> + return read_seqretry(&zonelist_update_seq, seq); >> +} >> + > > to avoid overhead on systems without HOTREMOVE configured. > > Other than that LGTM. > Acked-by: Michal Hocko Makes sense to me, although I wonder how much it will matter in practice. Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb