From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 85F18C433DF for ; Tue, 18 Aug 2020 11:01:04 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 500762067C for ; Tue, 18 Aug 2020 11:01:04 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 500762067C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=suse.de Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id E83E38D000D; Tue, 18 Aug 2020 07:01:03 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E0E258D0002; Tue, 18 Aug 2020 07:01:03 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CFC7B8D000D; Tue, 18 Aug 2020 07:01:03 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0233.hostedemail.com [216.40.44.233]) by kanga.kvack.org (Postfix) with ESMTP id B65138D0002 for ; Tue, 18 Aug 2020 07:01:03 -0400 (EDT) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 6F411181AEF1E for ; Tue, 18 Aug 2020 11:01:03 +0000 (UTC) X-FDA: 77163397206.08.plate45_1e0009b2701e Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin08.hostedemail.com (Postfix) with ESMTP id 46FCE1819E798 for ; Tue, 18 Aug 2020 11:01:03 +0000 (UTC) X-HE-Tag: plate45_1e0009b2701e X-Filterd-Recvd-Size: 7309 Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by imf04.hostedemail.com (Postfix) with ESMTP for ; Tue, 18 Aug 2020 11:01:02 +0000 (UTC) X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 192E3ABD2; Tue, 18 Aug 2020 11:01:27 +0000 (UTC) From: Oscar Salvador To: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.com, vbabka@suse.com, david@redhat.com, Oscar Salvador , Vlastimil Babka Subject: [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context Date: Tue, 18 Aug 2020 13:00:46 +0200 Message-Id: <20200818110046.6664-1-osalvador@suse.de> X-Mailer: git-send-email 2.28.0 MIME-Version: 1.0 X-Rspamd-Queue-Id: 46FCE1819E798 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam01 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Recently a customer of ours experienced a crash when booting the system while enabling memory-hotplug. The problem is that Normal zones on different nodes don't get their priva= te zone->pageset allocated, and keep sharing the initial boot_pageset. The sharing between zones is normally safe as explained by the comment fo= r boot_pageset - it's a percpu structure, and manipulations are done with disabled interrupts, and boot_pageset is set up in a way that any page pl= aced on its pcplist is immediately flushed to shared zone's freelist, because pcp->high =3D=3D 1. However, the hotplug operation updates pcp->high to a higher value as it expects to be operating on a private pageset. The problem is in build_all_zonelists(), which is called when the first r= ange of pages is onlined for the Normal zone of node X or Y: if (system_state =3D=3D SYSTEM_BOOTING) { build_all_zonelists_init(); } else { #ifdef CONFIG_MEMORY_HOTPLUG if (zone) setup_zone_pageset(zone); #endif /* we have to stop all cpus to guarantee there is no user of zonelist */ stop_machine(__build_all_zonelists, pgdat, NULL); /* cpuset refresh routine should be here */ } When called during hotplug, it should execute the setup_zone_pageset(zone= ) which allocates the private pageset. However, with memhp_default_state=3Donline, this happens early while system_state =3D=3D SYSTEM_BOOTING is still true, hence this step is skip= ped. (and build_all_zonelists_init() is probably unsafe anyway at this point). Another hotplug operation on the same zone then leads to zone_pcp_update(= zone) called from online_pages(), which updates the pcp->high for the shared boot_pageset to a value higher than 1. At that point, pages freed from Node X and Y Normal zones can end up on t= he same pcplist and from there they can be freed to the wrong zone's freelist, leading to the corruption and crashes. Please, note that upstream has fixed that differently (and unintentionall= y) by adding another boot state (SYSTEM_SCHEDULING), which is set before smp_in= it(). That should happen before memory hotplug events even with memhp_default_s= tate=3Donline. Backporting that would be too intrusive. Signed-off-by: Oscar Salvador Debugged-by: Vlastimil Babka --- include/linux/mmzone.h | 3 ++- init/main.c | 2 +- mm/memory_hotplug.c | 10 +++++----- mm/page_alloc.c | 7 ++++--- 4 files changed, 12 insertions(+), 10 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e3d7754f25f0..5c7645e156a5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -756,7 +756,8 @@ static inline bool is_dev_zone(const struct zone *zon= e) #include =20 extern struct mutex zonelists_mutex; -void build_all_zonelists(pg_data_t *pgdat, struct zone *zone); +void build_all_zonelists(pg_data_t *pgdat, struct zone *zone, + bool hotplug_context); void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzon= e_idx); bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned lo= ng mark, int classzone_idx, unsigned int alloc_flags, diff --git a/init/main.c b/init/main.c index d47860dbe896..7ad08957dd18 100644 --- a/init/main.c +++ b/init/main.c @@ -512,7 +512,7 @@ asmlinkage __visible void __init start_kernel(void) smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ boot_cpu_hotplug_init(); =20 - build_all_zonelists(NULL, NULL); + build_all_zonelists(NULL, NULL, false); page_alloc_init(); =20 pr_notice("Kernel command line: %s\n", boot_command_line); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 449999657c0b..a4ffe5996317 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1125,7 +1125,7 @@ int __ref online_pages(unsigned long pfn, unsigned = long nr_pages, int online_typ mutex_lock(&zonelists_mutex); if (!populated_zone(zone)) { need_zonelists_rebuild =3D 1; - build_all_zonelists(NULL, zone); + build_all_zonelists(NULL, zone, true); } =20 ret =3D walk_system_ram_range(pfn, nr_pages, &onlined_pages, @@ -1146,7 +1146,7 @@ int __ref online_pages(unsigned long pfn, unsigned = long nr_pages, int online_typ if (onlined_pages) { node_states_set_node(nid, &arg); if (need_zonelists_rebuild) - build_all_zonelists(NULL, NULL); + build_all_zonelists(NULL, NULL, true); else zone_pcp_update(zone); } @@ -1220,7 +1220,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u= 64 start) * to access not-initialized zonelist, build here. */ mutex_lock(&zonelists_mutex); - build_all_zonelists(pgdat, NULL); + build_all_zonelists(pgdat, NULL, true); mutex_unlock(&zonelists_mutex); =20 /* @@ -1276,7 +1276,7 @@ int try_online_node(int nid) =20 if (pgdat->node_zonelists->_zonerefs->zone =3D=3D NULL) { mutex_lock(&zonelists_mutex); - build_all_zonelists(NULL, NULL); + build_all_zonelists(NULL, NULL, true); mutex_unlock(&zonelists_mutex); } =20 @@ -2016,7 +2016,7 @@ static int __ref __offline_pages(unsigned long star= t_pfn, if (!populated_zone(zone)) { zone_pcp_reset(zone); mutex_lock(&zonelists_mutex); - build_all_zonelists(NULL, NULL); + build_all_zonelists(NULL, NULL, true); mutex_unlock(&zonelists_mutex); } else zone_pcp_update(zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index de00e0fec484..f394dd87fa03 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4608,7 +4608,7 @@ int numa_zonelist_order_handler(struct ctl_table *t= able, int write, user_zonelist_order =3D oldval; } else if (oldval !=3D user_zonelist_order) { mutex_lock(&zonelists_mutex); - build_all_zonelists(NULL, NULL); + build_all_zonelists(NULL, NULL, false); mutex_unlock(&zonelists_mutex); } } @@ -4988,11 +4988,12 @@ build_all_zonelists_init(void) * (2) call of __init annotated helper build_all_zonelists_init * [protected by SYSTEM_BOOTING]. */ -void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) +void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone, + bool hotplug_context) { set_zonelist_order(); =20 - if (system_state =3D=3D SYSTEM_BOOTING) { + if (system_state =3D=3D SYSTEM_BOOTING && !hotplug_context) { build_all_zonelists_init(); } else { #ifdef CONFIG_MEMORY_HOTPLUG --=20 2.26.2