linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Petr Mladek <pmladek@suse.com>, linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Luis Claudio R. Goncalves" <lgoncalv@redhat.com>,
	Boqun Feng <boqun.feng@gmail.com>, Ingo Molnar <mingo@redhat.com>,
	John Ogness <john.ogness@linutronix.de>,
	Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Waiman Long <longman@redhat.com>, Will Deacon <will@kernel.org>
Subject: Re: [PATCH v2] mm/page_alloc: don't check zonelist_update_seq from atomic allocations
Date: Thu, 10 Aug 2023 18:58:59 +0900	[thread overview]
Message-ID: <566173d4-84d1-c76b-6fe4-f5ea5f24f613@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <20230810072637.6Sc3UU3R@linutronix.de>

On 2023/08/10 16:26, Sebastian Andrzej Siewior wrote:
> On 2023-08-09 20:03:00 [+0900], Tetsuo Handa wrote:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 7d3460c7a480..5557d9a2ff2c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3638,26 +3638,44 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
> …
>> -static DEFINE_SEQLOCK(zonelist_update_seq);
>> +static unsigned int zonelist_update_seq;
>>  
>>  static unsigned int zonelist_iter_begin(void)
>>  {
>>  	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
>> -		return read_seqbegin(&zonelist_update_seq);
>> +		/* See comment above. */
>> +		return data_race(READ_ONCE(zonelist_update_seq));
> 
> This is open coded raw_read_seqcount() while it should have been
> raw_seqcount_begin().

Not an open coded raw_read_seqcount(). You explained us at
https://lkml.kernel.org/r/20230623101111.7tuAg5p5@linutronix.de that
seqprop_sequence() behaves differently if CONFIG_PREEMPT_RT=y.

The point of my proposal is to get rid of

  spin_lock(s->lock);
  spin_unlock(s->lock);

 from zonelist_iter_begin().

Also, my version avoids KCSAN warning by using data_race() and avoids papering
over KCSAN warnings between zonelist_iter_begin() and check_retry_zonelist()
by not using kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX).

  /*
   * The seqlock seqcount_t interface does not prescribe a precise sequence of
   * read begin/retry/end. For readers, typically there is a call to
   * read_seqcount_begin() and read_seqcount_retry(), however, there are more
   * esoteric cases which do not follow this pattern.
   *
   * As a consequence, we take the following best-effort approach for raw usage
   * via seqcount_t under KCSAN: upon beginning a seq-reader critical section,
   * pessimistically mark the next KCSAN_SEQLOCK_REGION_MAX memory accesses as
   * atomics; if there is a matching read_seqcount_retry() call, no following
   * memory operations are considered atomic. Usage of the seqlock_t interface
   * is not affected.
   */

The section between zonelist_iter_begin() and check_retry_zonelist() is very
complicated where concurrency bug that is unrelated to zonelist counters could
be found and fixed.

> 
>>  	return 0;
>>  }
>>  
>> -static unsigned int check_retry_zonelist(unsigned int seq)
>> +static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
>>  {
>> -	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
>> -		return read_seqretry(&zonelist_update_seq, seq);
>> +	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
>> +		/* See comment above. */
>> +		unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq));
>>  
>> -	return seq;
>> +		/*
>> +		 * "seq != seq2" indicates that __build_all_zonelists() has
>> +		 * started or has finished rebuilding zonelists, hence retry.
>> +		 * "seq == seq2 && (seq2 & 1)" indicates that
>> +		 * __build_all_zonelists() is still rebuilding zonelists
>> +		 * with context switching disabled, hence retry.
>> +		 * "seq == seq2 && !(seq2 & 1)" indicates that
>> +		 * __build_all_zonelists() did not rebuild zonelists, hence
>> +		 * no retry.
>> +		 */
>> +		return unlikely(seq != seq2 || (seq2 & 1));
> 
> open coded read_seqcount_retry().

Not an open coded read_seqcount_retry(), for read_seqcount_retry() checks
only "seq != seq2" condition. We need to check "seq2 & 1" condition too.

> 
>> +	}
>> +
>> +	return 0;
>>  }
>>  
>>  /* Perform direct synchronous page reclaim */
>> @@ -5136,22 +5154,17 @@ static void __build_all_zonelists(void *data)
>>  	int nid;
>>  	int __maybe_unused cpu;
>>  	pg_data_t *self = data;
>> +	static DEFINE_SPINLOCK(lock);
>>  	unsigned long flags;
>>  
>> -	/*
>> -	 * Explicitly disable this CPU's interrupts before taking seqlock
>> -	 * to prevent any IRQ handler from calling into the page allocator
>> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
>> -	 */
>> -	local_irq_save(flags);
>> -	/*
>> -	 * Explicitly disable this CPU's synchronous printk() before taking
>> -	 * seqlock to prevent any printk() from trying to hold port->lock, for
>> -	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
>> -	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>> -	 */
>> -	printk_deferred_enter();
>> -	write_seqlock(&zonelist_update_seq);
>> +#ifdef CONFIG_PREEMPT_RT
>> +	migrate_disable()
>> +#endif
> 
> There is no justification/ explanation why migrate_disable() here is
> needed on PREEMPT_RT and I don't see one.

This migrate_disable() is a compensation for removing

  spin_lock(s->lock);
  spin_unlock(s->lock);

 from zonelist_iter_begin(). Since neither the proposed zonelist_iter_begin()
nor the proposed check_retry_zonelist() holds the spinlock, we need to
guarantee that the thread which has performed the opening zonelist_update_seq++
can continue execution till the closing zonelist_update_seq++ without sleeping.
Calling interrupt handlers are fine, for interrupt handlers can't do
__GFP_DIRECT_RECLAIM allocation, which in turn guarantees that interrupt
handlers switched from the thread which has performed the opening
zonelist_update_seq++ won't deadlock.

> 
> There are two changes here:
> - The replacement of seqlock_t with something open coded 

Yes.

> - Logic change when a retry is needed (the gfp mask is considered).

Yes.

> 
> I am not a big fan of open coding things especially when not needed and
> then there is this ifdef which is not needed as well. I don't comment on
> the logic change.

If __build_all_zonelists() can run without being switched to other threads
(except interrupt handlers), I consider that this approach works.

> 
> Can we please put an end to this?
> 
>> +	/* Serialize increments of zonelist_update_seq. */
>> +	spin_lock_irqsave(&lock, flags);
>> +	zonelist_update_seq++;
>> +	/* Tell check_retry_zonelist() that we started rebuilding zonelists. */
>> +	smp_wmb();
>>  
>>  #ifdef CONFIG_NUMA
>>  	memset(node_load, 0, sizeof(node_load));
> 
> Sebastian
> 



  reply	other threads:[~2023-08-10  9:59 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-09 11:03 Tetsuo Handa
2023-08-09 12:49 ` Michal Hocko
2023-08-10  7:26 ` Sebastian Andrzej Siewior
2023-08-10  9:58   ` Tetsuo Handa [this message]
2023-08-10 15:35     ` Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=566173d4-84d1-c76b-6fe4-f5ea5f24f613@I-love.SAKURA.ne.jp \
    --to=penguin-kernel@i-love.sakura.ne.jp \
    --cc=akpm@linux-foundation.org \
    --cc=bigeasy@linutronix.de \
    --cc=boqun.feng@gmail.com \
    --cc=john.ogness@linutronix.de \
    --cc=lgoncalv@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longman@redhat.com \
    --cc=mgorman@techsingularity.net \
    --cc=mhocko@suse.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pmladek@suse.com \
    --cc=tglx@linutronix.de \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox