From: Mel Gorman <mgorman@suse.de>
To: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>,
Patrick Daly <quic_pdaly@quicinc.com>,
linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
Juergen Gross <jgross@suse.com>
Subject: Re: Race condition in build_all_zonelists() when offlining movable zone
Date: Tue, 23 Aug 2022 10:49:50 +0100
Message-ID: <20220823094950.ocjyur2h3mqnqbeg@suse.de>
In-Reply-To: <f70ac93a-8dd2-e157-8c25-790ee8838e0d@redhat.com>

On Tue, Aug 23, 2022 at 10:52:34AM +0200, David Hildenbrand wrote:
> On 23.08.22 10:33, Mel Gorman wrote:
> > On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
> >>> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
> >>>  	static DEFINE_SPINLOCK(lock);
> >>>  
> >>>  	spin_lock(&lock);
> >>> +	write_seqcount_begin(&zonelist_update_seq);
> >>>  
> >>>  #ifdef CONFIG_NUMA
> >>>  	memset(node_load, 0, sizeof(node_load));
> >>> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
> >>>  #endif
> >>>  	}
> >>>  
> >>> +	write_seqcount_end(&zonelist_update_seq);
> >>>  	spin_unlock(&lock);
> >>
> >> Do we want to get rid of the static lock by using a seqlock_t instead of
> >> a seqcount_t?
> >>
> >
> > I do not think so because it's a relatively heavy lock compared to the
> > counter and the read side.
>
> I was primarily asking because seqlock.h states: "Sequential locks
> (seqlock_t): Sequence counters with an embedded spinlock for writer
> serialization and non-preemptibility." That seems to be precisely what
> we are doing here.
>
> >
> > As the read-side can be called from hardirq or softirq context, the
> > write-side needs to disable irqs or bottom halves as well according to the
> > documentation. That is relatively minor as the write side is rare, but
> > it's trickier because the calling context can be either IRQ or softirq,
> > so the IRQ protection would have to be used.
>
> Naive me would just have used write_seqlock()/write_sequnlock() and
> read_seqbegin()/read_seqretry(), which would result in almost the same
> code as your change -- but with both mechanisms encapsulated.
>
>
> Yeah, there are the special write_seqlock_bh()/write_sequnlock_bh(),
> write_sequnlock_irq() ... variants, but IIRC we don't have to care about
> those at all when just using the primitives as above. Most probably I am
> missing something important, though.
>
You're not missing anything important. I'm just not a massive fan of the
API naming because it's unclear from the context whether it's a plain
counter or a locked counter, and I felt it was better to keep the locking
explicit. A seqlock version is below. I updated the comments and naming to
make it clear that the read side is for iteration, to document the locking
protocol, and to match the retry naming with the cpuset equivalent. It
boots on KVM but would need another test from Patrick to be certain it
still works. Patrick, would you mind testing this version please?
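
For reference, the read side boils down to the canonical seqlock
cookie-and-retry pattern; a minimal standalone sketch (example_seq and
example_read are hypothetical names, not the patch itself):

	#include <linux/seqlock.h>

	static DEFINE_SEQLOCK(example_seq);

	static void example_read(void)
	{
		unsigned int seq;

		do {
			seq = read_seqbegin(&example_seq);
			/* ... read data a writer may rebuild concurrently ... */
		} while (read_seqretry(&example_seq, seq));
	}

The patch does not loop like this in the allocator. It samples the cookie
once with zonelist_iter_begin() and checks it only on the failure path, so
a concurrent zonelist rebuild forces a retry rather than a premature OOM
kill.
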
---8<---
 mm/page_alloc.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..a644c7b638a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
 
+/*
+ * Zonelists may change due to hotplug during allocation. Detect when zonelists
+ * have been rebuilt so the allocation can be retried. The reader side does not
+ * lock and retries the allocation if the zonelist changed. The writer side is
+ * protected by the embedded spin_lock.
+ */
+DEFINE_SEQLOCK(zonelist_update_seq);
+
+static unsigned int zonelist_iter_begin(void)
+{
+	return read_seqbegin(&zonelist_update_seq);
+}
+
+static unsigned int check_retry_zonelist(unsigned int seq)
+{
+	return read_seqretry(&zonelist_update_seq, seq);
+}
+
 /* Perform direct synchronous page reclaim */
 static unsigned long
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int compaction_retries;
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
+	unsigned int zonelist_iter_cookie;
 	int reserve_flags;
 
 	/*
@@ -5016,6 +5035,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	zonelist_iter_cookie = zonelist_iter_begin();
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -5187,8 +5207,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 
-	/* Deal with possible cpuset update races before we start OOM killing */
-	if (check_retry_cpuset(cpuset_mems_cookie, ac))
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * an unnecessary OOM kill.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
 		goto retry_cpuset;
 
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
@@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
-	static DEFINE_SPINLOCK(lock);
 
-	spin_lock(&lock);
+	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	spin_unlock(&lock);
+	write_sequnlock(&zonelist_update_seq);
 }
 
 static noinline void __init