From: Vlastimil Babka <vbabka@suse.cz>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Michal Hocko, Pavel Tatashin,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, Vlastimil Babka,
	Michal Hocko
Subject: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation
Date: Mon,  7 Sep 2020 18:36:28 +0200
Message-Id: <20200907163628.26495-6-vbabka@suse.cz>
X-Mailer: git-send-email 2.28.0
In-Reply-To: <20200907163628.26495-1-vbabka@suse.cz>
References: <20200907163628.26495-1-vbabka@suse.cz>

Page isolation can race with a process freeing pages to pcplists in a way
that a page from an isolated pageblock can end up on a pcplist. This can be
fixed by repeated draining of pcplists, as done by the patch
"mm/memory_hotplug: drain per-cpu pages again during memory offline" in [1].
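
To make the race window concrete, here is a simplified sketch of the freeing
side (an illustration approximated from the current free path, not part of
this patch): the pageblock's migratetype is sampled before the page is queued
on the pcplist, so isolation plus a drain on another CPU can slip in between
the two steps, and the page then sits on the pcplist after the drain.

  /* Simplified sketch of the current free path, for illustration only. */
  void free_unref_page(struct page *page)
  {
          unsigned long flags;
          unsigned long pfn = page_to_pfn(page);

          /* samples and caches the pageblock migratetype, e.g. MIGRATE_MOVABLE */
          if (!free_unref_page_prepare(page, pfn))
                  return;

          /*
           * Window: another CPU can mark the pageblock MIGRATE_ISOLATE and
           * drain pcplists here, before the page is queued below.
           */
          local_irq_save(flags);
          free_unref_page_commit(page, pfn);      /* page goes to this CPU's pcplist */
          local_irq_restore(flags);
  }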
David and Michal would prefer that this race be closed in a way that callers
of page isolation don't need to care about draining. David suggested disabling
pcplist usage completely during page isolation, instead of repeatedly draining
the pcplists.

To achieve this without adding special cases to the alloc/free fast paths, we
can use the same 'trick' as the boot pagesets - when pcp->high is 0, any
pcplist addition is immediately flushed.

The race can thus be closed by setting pcp->high to 0 and draining pcplists
once in start_isolate_page_range(). The drain serializes after processes that
have already disabled interrupts and read the old value of pcp->high in
free_unref_page_commit(), while processes that have not yet disabled
interrupts will observe pcp->high == 0 when they are rescheduled and will skip
pcplists. This guarantees no stray pages on pcplists in zones where isolation
happens.

We can use the variable zone->nr_isolate_pageblock (protected by zone->lock)
to detect transitions from 0 to 1 (change pcp->high to 0 and issue a drain)
and from 1 to 0 (restore the original pcp->high and batch values cached in
struct zone). External updates to high and batch are prevented by taking
pcp_batch_high_lock. To allow multiple isolations in parallel, this lock is
changed from a mutex to a rwsem.

For callers that pair start_isolate_page_range() with undo_isolate_page_range()
properly, this is transparent. Currently that's alloc_contig_range().
__offline_pages() doesn't call undo_isolate_page_range() in the success case,
so it has to be careful to restore pcp->high and batch and to unlock
pcp_batch_high_lock itself.

This commit also changes drain_all_pages() so that the drain used for page
isolation does not trust reading pcp->count - I believe that could be racy and
lead to missing some CPUs in the drain. If others agree, this can be separated
out and potentially backported.
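
To illustrate the 'trick', here is a simplified sketch of the existing tail of
free_unref_page_commit() (illustration only, not part of this patch): once
pcp->high is 0, the pcp->count >= pcp->high check fires on every free, so a
page that lands on a pcplist is handed back to the buddy allocator right away.

  /* Simplified sketch of the existing tail of free_unref_page_commit(). */
  list_add(&page->lru, &pcp->lists[migratetype]);
  pcp->count++;
  if (pcp->count >= pcp->high) {
          unsigned long batch = READ_ONCE(pcp->batch);

          /* with high == 0 (and batch == 1) every free flushes the pcplist */
          free_pcppages_bulk(zone, batch, pcp);
  }

This is the same mechanism the boot pagesets rely on, which is why no new
checks are needed in the alloc/free fast paths.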
[1] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/

Suggested-by: David Hildenbrand
Suggested-by: Michal Hocko
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp.h |  1 +
 mm/internal.h       |  4 ++++
 mm/memory_hotplug.c | 24 ++++++++-----------
 mm/page_alloc.c     | 58 +++++++++++++++++++++++++++++----------------
 mm/page_isolation.c | 45 ++++++++++++++++++++++++++++-------
 5 files changed, 89 insertions(+), 43 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..cc52c5cc9fab 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -592,6 +592,7 @@ extern void page_frag_free(void *addr);
 
 void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
+void __drain_all_pages(struct zone *zone, bool page_isolation);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
diff --git a/mm/internal.h b/mm/internal.h
index 10c677655912..c157af87a9ed 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -7,6 +7,7 @@
 #ifndef __MM_INTERNAL_H
 #define __MM_INTERNAL_H
 
+#include <linux/rwsem.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
@@ -201,8 +202,11 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
+extern struct rw_semaphore pcp_batch_high_lock;
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern void zone_update_pageset_high_and_batch(struct zone *zone,
+		unsigned long high, unsigned long batch);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b11a269e2356..a978ac32279b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1485,6 +1485,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	struct zone *zone;
 	struct memory_notify arg;
 	char *reason;
+	bool unisolated_last = false;
 
 	mem_hotplug_begin();
 
@@ -1575,20 +1576,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		/* check again */
 		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
 					    NULL, check_pages_isolated_cb);
-		/*
-		 * per-cpu pages are drained in start_isolate_page_range, but if
-		 * there are still pages that are not free, make sure that we
-		 * drain again, because when we isolated range we might
-		 * have raced with another thread that was adding pages to pcp
-		 * list.
-		 *
-		 * Forward progress should be still guaranteed because
-		 * pages on the pcp list can only belong to MOVABLE_ZONE
-		 * because has_unmovable_pages explicitly checks for
-		 * PageBuddy on freed pages on other zones.
-		 */
-		if (ret)
-			drain_all_pages(zone);
 	} while (ret);
 
 	/* Ok, all of our target is isolated.
@@ -1602,8 +1589,17 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	 * pageblocks zone counter here.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
+	if (nr_isolate_pageblock && nr_isolate_pageblock ==
+					zone->nr_isolate_pageblock)
+		unisolated_last = true;
 	zone->nr_isolate_pageblock -= nr_isolate_pageblock;
 	spin_unlock_irqrestore(&zone->lock, flags);
+	if (unisolated_last) {
+		zone_update_pageset_high_and_batch(zone, zone->pageset_high,
+			zone->pageset_batch);
+	}
+	/* pairs with start_isolate_page_range() */
+	up_read(&pcp_batch_high_lock);
 
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 004350a2b6ca..d82f3bec7953 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -78,7 +78,7 @@
 #include "page_reporting.h"
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
-static DEFINE_MUTEX(pcp_batch_high_lock);
+DECLARE_RWSEM(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)
 
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
@@ -2958,14 +2958,7 @@ static void drain_local_pages_wq(struct work_struct *work)
 	preempt_enable();
 }
 
-/*
- * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
- *
- * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
- */
-void drain_all_pages(struct zone *zone)
+void __drain_all_pages(struct zone *zone, bool page_isolation)
 {
 	int cpu;
 
@@ -3004,7 +2997,13 @@ void drain_all_pages(struct zone *zone)
 		struct zone *z;
 		bool has_pcps = false;
 
-		if (zone) {
+		if (page_isolation) {
+			/*
+			 * For page isolation, don't trust the racy pcp.count
+			 * check. We need to flush really everything.
+			 */
+			has_pcps = true;
+		} else if (zone) {
 			pcp = per_cpu_ptr(zone->pageset, cpu);
 			if (pcp->pcp.count)
 				has_pcps = true;
@@ -3037,6 +3036,18 @@ void drain_all_pages(struct zone *zone)
 	mutex_unlock(&pcpu_drain_mutex);
 }
 
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
+ *
+ * When zone parameter is non-NULL, spill just the single zone's pages.
+ *
+ * Note that this can be extremely slow as the draining happens in a workqueue.
+ */
+void drain_all_pages(struct zone *zone)
+{
+	__drain_all_pages(zone, false);
+}
+
 #ifdef CONFIG_HIBERNATION
 
 /*
@@ -6237,13 +6248,23 @@ static void pageset_init(struct per_cpu_pageset *p)
 	pcp->batch = 1;
 }
 
+void zone_update_pageset_high_and_batch(struct zone *zone, unsigned long high,
+		unsigned long batch)
+{
+	struct per_cpu_pageset *p;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		p = per_cpu_ptr(zone->pageset, cpu);
+		pageset_update(&p->pcp, high, batch);
+	}
+}
+
 static void zone_set_pageset_high_and_batch(struct zone *zone, bool force_update)
 {
 	unsigned long new_high;
 	unsigned long new_batch;
 	int fraction = READ_ONCE(percpu_pagelist_fraction);
-	int cpu;
-	struct per_cpu_pageset *p;
 
 	if (fraction) {
 		new_high = zone_managed_pages(zone) / fraction;
@@ -6264,10 +6285,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone, bool force_update
 		return;
 	}
 
-	for_each_possible_cpu(cpu) {
-		p = per_cpu_ptr(zone->pageset, cpu);
-		pageset_update(&p->pcp, new_high, new_batch);
-	}
+	zone_update_pageset_high_and_batch(zone, new_high, new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -8026,7 +8044,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	int old_percpu_pagelist_fraction;
 	int ret;
 
-	mutex_lock(&pcp_batch_high_lock);
+	down_write(&pcp_batch_high_lock);
 	old_percpu_pagelist_fraction = percpu_pagelist_fraction;
 
 	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
@@ -8048,7 +8066,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	for_each_populated_zone(zone)
 		zone_set_pageset_high_and_batch(zone, false);
 out:
-	mutex_unlock(&pcp_batch_high_lock);
+	up_write(&pcp_batch_high_lock);
 	return ret;
 }
 
@@ -8661,9 +8679,9 @@ EXPORT_SYMBOL(free_contig_range);
  */
 void __meminit zone_pcp_update(struct zone *zone)
 {
-	mutex_lock(&pcp_batch_high_lock);
+	down_write(&pcp_batch_high_lock);
 	zone_set_pageset_high_and_batch(zone, false);
-	mutex_unlock(&pcp_batch_high_lock);
+	up_write(&pcp_batch_high_lock);
 }
 
 void zone_pcp_reset(struct zone *zone)
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 63a3db10a8c0..ceada64abd1f 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -21,6 +21,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	struct zone *zone;
 	unsigned long flags;
 	int ret = -EBUSY;
+	bool first_isolated_pageblock = false;
 
 	zone = page_zone(page);
 
@@ -45,6 +46,8 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 
 	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	zone->nr_isolate_pageblock++;
+	if (zone->nr_isolate_pageblock == 1)
+		first_isolated_pageblock = true;
 	nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
 								NULL);
 
@@ -54,8 +57,9 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
-	if (!ret) {
-		drain_all_pages(zone);
+	if (!ret && first_isolated_pageblock) {
+		zone_update_pageset_high_and_batch(zone, 0, 1);
+		__drain_all_pages(zone, true);
 	} else {
 		WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
 
@@ -78,6 +82,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 	unsigned int order;
 	unsigned long pfn, buddy_pfn;
 	struct page *buddy;
+	bool unisolated_last = false;
 
 	zone = page_zone(page);
 	spin_lock_irqsave(&zone->lock, flags);
@@ -120,8 +125,14 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 	if (isolated_page)
 		__putback_isolated_page(page, order, migratetype);
 	zone->nr_isolate_pageblock--;
+	if (zone->nr_isolate_pageblock == 0)
+		unisolated_last = true;
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
+	if (unisolated_last) {
+		zone_update_pageset_high_and_batch(zone, zone->pageset_high,
+			zone->pageset_batch);
+	}
 }
 
 static inline struct page *
@@ -170,14 +181,17 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * pageblocks we may have modified and return -EBUSY to caller. This
  * prevents two threads from simultaneously working on overlapping ranges.
  *
- * Please note that there is no strong synchronization with the page allocator
- * either. Pages might be freed while their page blocks are marked ISOLATED.
- * In some cases pages might still end up on pcp lists and that would allow
- * for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs drain_all_pages might be needed
- * (e.g. __offline_pages will need to call it after check for isolated range for
- * a next retry).
+ * To synchronize with page allocator users freeing pages on the pcplists, we
+ * disable them by setting their allowed usage (pcp->high) to 0, and issue a
+ * drain. This is only needed when isolating the first pageblock of a zone.
  *
+ * Successful call to start_isolate_page_range() has to be paired with
+ * undo_isolate_page_range() for proper accounting of zone->nr_isolate_pageblock
+ * (which controls pcplist enabling/disabling discussed above, including
+ * handling of pcp_batch_high_lock).
+ * If undo_isolate_page_range() is not used, this has to be handled manually
+ * by caller.
+ *
  * Return: the number of isolated pageblocks on success and -EBUSY if any part
  * of range cannot be isolated.
  */
@@ -192,6 +206,13 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
 	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
 
+	/*
+	 * We are going to change pcplists's high and batch values temporarily,
+	 * so block any updates via sysctl. Caller must unlock by
+	 * undo_isolate_page_range() or finish_isolate_page_range().
+	 */
+	down_read(&pcp_batch_high_lock);
+
 	for (pfn = start_pfn;
 	     pfn < end_pfn;
 	     pfn += pageblock_nr_pages) {
@@ -215,6 +236,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 		unset_migratetype_isolate(page, migratetype);
 	}
 
+	up_read(&pcp_batch_high_lock);
+
 	return -EBUSY;
 }
 
@@ -238,7 +261,11 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			continue;
 		unset_migratetype_isolate(page, migratetype);
 	}
+
+	up_read(&pcp_batch_high_lock);
 }
+
+
 /*
  * Test all pages in the range is free(means isolated) or not.
  * all pages in [start_pfn...end_pfn) must be in the same zone.
-- 
2.28.0