From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sun, 6 Nov 2022 08:42:32 -0800 (PST)
From: Hugh Dickins
To: Mel Gorman
cc: Andrew Morton, Yu Zhao, Vlastimil Babka, Nicolas Saenz Julienne,
    Marcelo Tosatti, Michal Hocko, Hugh Dickins, Marek Szyprowski,
    LKML, Linux-MM
Subject: Re: [PATCH v2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations
In-Reply-To: <20221104142259.5hohev5hzvwanbi2@techsingularity.net>
Message-ID: <97b7ae87-797c-4ebb-d2d3-9415975188@google.com>
References: <20221104142259.5hohev5hzvwanbi2@techsingularity.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Fri, 4 Nov 2022, Mel Gorman wrote:

> Changelog since v1
> o Use trylock in free_unref_page_list due to IO completion from softirq
>   context
>
> The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe, as a task
> allocating from the PCP must not re-enter the allocator from IRQ context.
> In each instance where IRQ-reentrancy is possible, the lock is acquired
> using pcp_spin_trylock_irqsave(), even though IRQs are disabled and
> re-entrancy is impossible.
>
> Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the
> common case, at the cost of some IRQ allocations taking a slower path.
> If the PCP lists need to be refilled, the zone lock still needs to
> disable IRQs, but that will only happen on PCP refill and drain. If an
> IRQ is raised while a PCP allocation is in progress, the trylock will
> fail and the allocation will fall back to using the buddy lists directly.
> Note that this may not be a universal win if an interrupt-intensive
> workload also allocates heavily from interrupt context and contends
> heavily on the zone->lock as a result.
>
> [yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
> Signed-off-by: Mel Gorman

Hi Mel,

I think you Cc'ed me for the purpose of giving this patch a run, and
reporting if it's not good.  That is the case, I'm afraid.

I first tried it on a v6.1-rc3 kernel, and very soon crashed under load
with something about PageBuddy in the output.  When I reverted, no
problem; I thought maybe it's dependent on other commits in akpm's tree.

Later I tried on current mm-unstable, which is living up to the name in
other ways; but once other issues were patched, it soon crashed under
load, with a GPF, probably for non-canonical address 0xdead0000000000f8,
in compact_zone < compact_zone_order < try_to_compact_pages < .... <
shmem_alloc_hugefolio < ...

I do try to exercise compaction as hard as I can, even to the point of
having a hack in what used to be called shmem_getpage_gfp(), reverting
to the stronger attempt to get huge pages, before Rik weakened the
effect of huge=always with vma_thp_gfp_mask() in 5.12: so shmem is
probably applying stronger flags for compaction than it would in your
tree - I'm using GFP_TRANSHUGE_LIGHT | __GFP_RECLAIM | __GFP_NORETRY
there.

Sorry for not giving you more info: I'm rather hoping that compaction is
relevant, and will give you a clue (maybe that capture code, which
surprised us once before??).

What I'm really trying to do is fix the bug in Linus's rmap/TLB series,
and its interaction with my rmap series, and report back on his series
(asking for temporary drop), before next-20221107 goes down in flames.

I'd advocate for dropping this patch of yours too; but if it's giving
nobody else any trouble, I can easily continue to patch it out.

Hugh

> ---
>  mm/page_alloc.c | 122 ++++++++++++++++++++++++--------------------------------
>  1 file changed, 53 insertions(+), 69 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e20ade858e71..ae410adf36fb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -170,21 +170,12 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>  	_ret;	\
>  })
>
> -#define pcpu_spin_lock_irqsave(type, member, ptr, flags)	\
> +#define pcpu_spin_trylock(type, member, ptr)	\
>  ({	\
>  	type *_ret;	\
>  	pcpu_task_pin();	\
>  	_ret = this_cpu_ptr(ptr);	\
> -	spin_lock_irqsave(&_ret->member, flags);	\
> -	_ret;	\
> -})
> -
> -#define pcpu_spin_trylock_irqsave(type, member, ptr, flags)	\
> -({	\
> -	type *_ret;	\
> -	pcpu_task_pin();	\
> -	_ret = this_cpu_ptr(ptr);	\
> -	if (!spin_trylock_irqsave(&_ret->member, flags)) {	\
> +	if (!spin_trylock(&_ret->member)) {	\
>  		pcpu_task_unpin();	\
>  		_ret = NULL;	\
>  	}	\
> @@ -197,27 +188,16 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>  	pcpu_task_unpin();	\
>  })
>
> -#define pcpu_spin_unlock_irqrestore(member, ptr, flags)	\
> -({	\
> -	spin_unlock_irqrestore(&ptr->member, flags);	\
> -	pcpu_task_unpin();	\
> -})
> -
>  /* struct per_cpu_pages specific helpers. */
>  #define pcp_spin_lock(ptr)	\
>  	pcpu_spin_lock(struct per_cpu_pages, lock, ptr)
>
> -#define pcp_spin_lock_irqsave(ptr, flags)	\
> -	pcpu_spin_lock_irqsave(struct per_cpu_pages, lock, ptr, flags)
> -
> -#define pcp_spin_trylock_irqsave(ptr, flags)	\
> -	pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
> +#define pcp_spin_trylock(ptr)	\
> +	pcpu_spin_trylock(struct per_cpu_pages, lock, ptr)
>
>  #define pcp_spin_unlock(ptr)	\
>  	pcpu_spin_unlock(lock, ptr)
>
> -#define pcp_spin_unlock_irqrestore(ptr, flags)	\
> -	pcpu_spin_unlock_irqrestore(lock, ptr, flags)
>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>  DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
> @@ -1546,6 +1526,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  					struct per_cpu_pages *pcp,
>  					int pindex)
>  {
> +	unsigned long flags;
>  	int min_pindex = 0;
>  	int max_pindex = NR_PCP_LISTS - 1;
>  	unsigned int order;
> @@ -1561,8 +1542,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	/* Ensure requested pindex is drained first. */
>  	pindex = pindex - 1;
>
> -	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
> -	spin_lock(&zone->lock);
> +	spin_lock_irqsave(&zone->lock, flags);
>  	isolated_pageblocks = has_isolate_pageblock(zone);
>
>  	while (count > 0) {
> @@ -1610,7 +1590,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		} while (count > 0 && !list_empty(list));
>  	}
>
> -	spin_unlock(&zone->lock);
> +	spin_unlock_irqrestore(&zone->lock, flags);
>  }
>
>  static void free_one_page(struct zone *zone,
> @@ -3124,10 +3104,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			unsigned long count, struct list_head *list,
>  			int migratetype, unsigned int alloc_flags)
>  {
> +	unsigned long flags;
>  	int i, allocated = 0;
>
> -	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
> -	spin_lock(&zone->lock);
> +	spin_lock_irqsave(&zone->lock, flags);
>  	for (i = 0; i < count; ++i) {
>  		struct page *page = __rmqueue(zone, order, migratetype,
>  							alloc_flags);
> @@ -3161,7 +3141,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	 * pages added to the pcp list.
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> -	spin_unlock(&zone->lock);
> +	spin_unlock_irqrestore(&zone->lock, flags);
>  	return allocated;
>  }
>
> @@ -3178,16 +3158,9 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  	batch = READ_ONCE(pcp->batch);
>  	to_drain = min(pcp->count, batch);
>  	if (to_drain > 0) {
> -		unsigned long flags;
> -
> -		/*
> -		 * free_pcppages_bulk expects IRQs disabled for zone->lock
> -		 * so even though pcp->lock is not intended to be IRQ-safe,
> -		 * it's needed in this context.
> -		 */
> -		spin_lock_irqsave(&pcp->lock, flags);
> +		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -		spin_unlock_irqrestore(&pcp->lock, flags);
> +		spin_unlock(&pcp->lock);
>  	}
>  }
>  #endif
> @@ -3201,12 +3174,9 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>
>  	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>  	if (pcp->count) {
> -		unsigned long flags;
> -
> -		/* See drain_zone_pages on why this is disabling IRQs */
> -		spin_lock_irqsave(&pcp->lock, flags);
> +		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, pcp->count, pcp, 0);
> -		spin_unlock_irqrestore(&pcp->lock, flags);
> +		spin_unlock(&pcp->lock);
>  	}
>  }
>
> @@ -3472,7 +3442,6 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>   */
>  void free_unref_page(struct page *page, unsigned int order)
>  {
> -	unsigned long flags;
>  	unsigned long __maybe_unused UP_flags;
>  	struct per_cpu_pages *pcp;
>  	struct zone *zone;
> @@ -3500,10 +3469,10 @@ void free_unref_page(struct page *page, unsigned int order)
>
>  	zone = page_zone(page);
>  	pcp_trylock_prepare(UP_flags);
> -	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
> +	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  	if (pcp) {
>  		free_unref_page_commit(zone, pcp, page, migratetype, order);
> -		pcp_spin_unlock_irqrestore(pcp, flags);
> +		pcp_spin_unlock(pcp);
>  	} else {
>  		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
>  	}
> @@ -3515,10 +3484,10 @@ void free_unref_page(struct page *page, unsigned int order)
>   */
>  void free_unref_page_list(struct list_head *list)
>  {
> +	unsigned long __maybe_unused UP_flags;
>  	struct page *page, *next;
>  	struct per_cpu_pages *pcp = NULL;
>  	struct zone *locked_zone = NULL;
> -	unsigned long flags;
>  	int batch_count = 0;
>  	int migratetype;
>
> @@ -3547,11 +3516,26 @@ void free_unref_page_list(struct list_head *list)
>
>  		/* Different zone, different pcp lock. */
>  		if (zone != locked_zone) {
> -			if (pcp)
> -				pcp_spin_unlock_irqrestore(pcp, flags);
> +			if (pcp) {
> +				pcp_spin_unlock(pcp);
> +				pcp_trylock_finish(UP_flags);
> +			}
>
> +			/*
> +			 * trylock is necessary as pages may be getting freed
> +			 * from IRQ or SoftIRQ context after an IO completion.
> +			 */
> +			pcp_trylock_prepare(UP_flags);
> +			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> +			if (!pcp) {
> +				pcp_trylock_finish(UP_flags);
> +				list_del(&page->lru);
> +				free_one_page(page_zone(page), page,
> +					      page_to_pfn(page), 0, migratetype,
> +					      FPI_NONE);
> +				continue;
> +			}
>  			locked_zone = zone;
> -			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
>  		}
>
>  		/*
> @@ -3566,18 +3550,23 @@ void free_unref_page_list(struct list_head *list)
>  		free_unref_page_commit(zone, pcp, page, migratetype, 0);
>
>  		/*
> -		 * Guard against excessive IRQ disabled times when we get
> -		 * a large list of pages to free.
> +		 * Guard against excessive IRQ disabled times when freeing
> +		 * a large list of pages. Lock will be reacquired if
> +		 * necessary on the next iteration.
>  		 */
>  		if (++batch_count == SWAP_CLUSTER_MAX) {
> -			pcp_spin_unlock_irqrestore(pcp, flags);
> +			pcp_spin_unlock(pcp);
> +			pcp_trylock_finish(UP_flags);
>  			batch_count = 0;
> -			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
> +			pcp = NULL;
> +			locked_zone = NULL;
>  		}
>  	}
>
> -	if (pcp)
> -		pcp_spin_unlock_irqrestore(pcp, flags);
> +	if (pcp) {
> +		pcp_spin_unlock(pcp);
> +		pcp_trylock_finish(UP_flags);
> +	}
>  }
>
>  /*
> @@ -3778,15 +3767,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	struct per_cpu_pages *pcp;
>  	struct list_head *list;
>  	struct page *page;
> -	unsigned long flags;
>  	unsigned long __maybe_unused UP_flags;
>
> -	/*
> -	 * spin_trylock may fail due to a parallel drain. In the future, the
> -	 * trylock will also protect against IRQ reentrancy.
> -	 */
> +	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
>  	pcp_trylock_prepare(UP_flags);
> -	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
> +	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  	if (!pcp) {
>  		pcp_trylock_finish(UP_flags);
>  		return NULL;
> @@ -3800,7 +3785,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	pcp->free_factor >>= 1;
>  	list = &pcp->lists[order_to_pindex(migratetype, order)];
>  	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> -	pcp_spin_unlock_irqrestore(pcp, flags);
> +	pcp_spin_unlock(pcp);
>  	pcp_trylock_finish(UP_flags);
>  	if (page) {
>  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> @@ -5368,7 +5353,6 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  						struct page **page_array)
>  {
>  	struct page *page;
> -	unsigned long flags;
>  	unsigned long __maybe_unused UP_flags;
>  	struct zone *zone;
>  	struct zoneref *z;
> @@ -5450,9 +5434,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  	if (unlikely(!zone))
>  		goto failed;
>
> -	/* Is a parallel drain in progress? */
> +	/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
>  	pcp_trylock_prepare(UP_flags);
> -	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
> +	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  	if (!pcp)
>  		goto failed_irq;
>
> @@ -5471,7 +5455,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  		if (unlikely(!page)) {
>  			/* Try and allocate at least one page */
>  			if (!nr_account) {
> -				pcp_spin_unlock_irqrestore(pcp, flags);
> +				pcp_spin_unlock(pcp);
>  				goto failed_irq;
>  			}
>  			break;
> @@ -5486,7 +5470,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  		nr_populated++;
>  	}
>
> -	pcp_spin_unlock_irqrestore(pcp, flags);
> +	pcp_spin_unlock(pcp);
>  	pcp_trylock_finish(UP_flags);
>
>  	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
>
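
For readers who want to see the shape of the trylock-with-fallback scheme
described in the changelog above outside the kernel, here is a minimal
userspace sketch.  It uses pthread spinlocks in place of pcp->lock and
zone->lock, and the names fast_list, slow_list, free_fast() and free_slow()
are invented for illustration; it is not the kernel code, only the pattern:
try the uncontended per-CPU lock without touching IRQ state, and divert to
the slower shared path whenever the trylock fails.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; };

static pthread_spinlock_t fast_lock;	/* stand-in for pcp->lock */
static pthread_spinlock_t slow_lock;	/* stand-in for zone->lock */
static struct node *fast_list;		/* stand-in for the PCP lists */
static struct node *slow_list;		/* stand-in for the buddy lists */

/* Slow path: always succeeds, at the cost of contending slow_lock. */
static void free_slow(struct node *n)
{
	pthread_spin_lock(&slow_lock);
	n->next = slow_list;
	slow_list = n;
	pthread_spin_unlock(&slow_lock);
}

/*
 * Fast path: trylock only.  If the lock is already held, do not spin
 * and do not disable anything -- divert the item to the slow path,
 * the way a failed pcp_spin_trylock() falls back to free_one_page().
 */
static void free_fast(struct node *n)
{
	if (pthread_spin_trylock(&fast_lock) != 0) {
		free_slow(n);
		return;
	}
	n->next = fast_list;
	fast_list = n;
	pthread_spin_unlock(&fast_lock);
}

int main(void)
{
	int fast = 0, slow = 0;

	pthread_spin_init(&fast_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&slow_lock, PTHREAD_PROCESS_PRIVATE);

	/* Uncontended frees all take the fast path. */
	for (int i = 0; i < 8; i++)
		free_fast(malloc(sizeof(struct node)));

	/*
	 * Simulate the contended case: hold fast_lock ourselves, the way
	 * an interrupt arriving mid-allocation would find it held, and
	 * watch the free get diverted to the slow path.
	 */
	pthread_spin_lock(&fast_lock);
	free_fast(malloc(sizeof(struct node)));
	pthread_spin_unlock(&fast_lock);

	for (struct node *n = fast_list; n; n = n->next)
		fast++;
	for (struct node *n = slow_list; n; n = n->next)
		slow++;
	printf("fast list: %d, slow list: %d\n", fast, slow);
	return 0;
}

Build with cc -pthread; it should print "fast list: 8, slow list: 1", the
one diverted free being the simulated reentrant case.  The point of the
pattern, as in the patch, is that the fast path never waits and never has
to disable interrupts; contention only costs a trip through the slower
shared lock.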