linux-mm.kvack.org archive mirror
* [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations
       [not found] ` <f6bd471c-f961-ef5e-21c5-bf158be19d12@linux.intel.com>
@ 2023-04-02 10:48   ` Tetsuo Handa
  2023-04-03  8:15     ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2023-04-02 10:48 UTC (permalink / raw)
  To: Mel Gorman, Patrick Daly, Michal Hocko, David Hildenbrand, Andrew Morton
  Cc: syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

syzbot is reporting a circular locking dependency which involves the
zonelist_update_seq seqlock, because zonelist_update_seq is checked
while doing GFP_ATOMIC allocations.

Since zonelist_update_seq is checked for avoiding unnecessary OOM kill,
there is no need to check zonelist_update_seq for memory allocations
which will fail without OOM kill.

Therefore, let's keep locking dependency simple, by not checking
zonelist_update_seq from !__GFP_DIRECT_RECLAIM allocations.

Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10
Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7136c36c5d01..80ef79b23865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4759,17 +4759,17 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
  */
 static DEFINE_SEQLOCK(zonelist_update_seq);
 
-static unsigned int zonelist_iter_begin(void)
+static unsigned int zonelist_iter_begin(gfp_t gfp_mask)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp_mask & __GFP_DIRECT_RECLAIM))
 		return read_seqbegin(&zonelist_update_seq);
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(unsigned int seq, gfp_t gfp_mask)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp_mask & __GFP_DIRECT_RECLAIM))
 		return read_seqretry(&zonelist_update_seq, seq);
 
 	return seq;
@@ -5083,7 +5083,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
-	zonelist_iter_cookie = zonelist_iter_begin();
+	zonelist_iter_cookie = zonelist_iter_begin(gfp_mask);
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -5261,7 +5261,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(zonelist_iter_cookie, gfp_mask))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -5287,7 +5287,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(zonelist_iter_cookie, gfp_mask))
 		goto restart;
 
 	/*
-- 
2.34.1



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations
  2023-04-02 10:48   ` [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations Tetsuo Handa
@ 2023-04-03  8:15     ` Michal Hocko
  2023-04-03 11:14       ` Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-04-03  8:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Mel Gorman, Patrick Daly, David Hildenbrand, Andrew Morton,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On Sun 02-04-23 19:48:29, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock, for zonelist_update_seq is checked
> while doing GFP_ATOMIC allocation.

Could you explain the deadlock scenario?

> Since zonelist_update_seq is checked for avoiding unnecessary OOM kill,
> there is no need to check zonelist_update_seq for memory allocations
> which will fail without OOM kill.
> 
> Therefore, let's keep locking dependency simple, by not checking
> zonelist_update_seq from !__GFP_DIRECT_RECLAIM allocations.
> 
> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>

Is this
https://lore.kernel.org/all/0000000000001d74d205f7c1821f@google.com/ the
underlying report ?

> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations
  2023-04-03  8:15     ` Michal Hocko
@ 2023-04-03 11:14       ` Tetsuo Handa
  2023-04-03 12:09         ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2023-04-03 11:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Patrick Daly, David Hildenbrand, Andrew Morton,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm,
	Petr Mladek, Sergey Senozhatsky, Steven Rostedt, John Ogness

On 2023/04/03 17:15, Michal Hocko wrote:
> Is this
> https://lore.kernel.org/all/0000000000001d74d205f7c1821f@google.com/ the
> underlying report ?

Yes.

> Could you explain the deadlock scenario?

build_zonelists() from __build_all_zonelists() calls printk() with
zonelist_update_seq held.

printk() holds console_owner lock for synchronous printing, and then upon
unlock of console_owner lock, printk() holds port_lock_key and port->lock.

tty_insert_flip_string_and_push_buffer() from pty_write() conditionally calls
kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. But since commit 3d36424b3b58,
zonelist_update_seq is checked by GFP_ATOMIC allocation (i.e. a new locking dependency
was added by that commit).

  CPU0                                                       CPU1
  pty_write() {
    tty_insert_flip_string_and_push_buffer() {
                                                             __build_all_zonelists() {
      spin_lock_irqsave(&port->lock, flags);
      tty_insert_flip_string() {
        tty_insert_flip_string_fixed_flag() {
          __tty_buffer_request_room() {
            tty_buffer_alloc() {
              kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                __alloc_pages_slowpath() {
                                                               write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
                                                               // interrupt handler starts
                                                                 handle_irq() {
                                                                   serial8250_interrupt() {
                                                                     serial8250_tx_chars() {
                                                                       tty_port_tty_get() {
                                                                         spin_lock_irqsave(&port->lock, flags); // spins here waiting for kmalloc() from tty_insert_flip_string() to complete
                  zonelist_iter_begin() {
                    read_seqbegin(&zonelist_update_seq) {
                      // spins here waiting for interrupt handler to complete if zonelist_update_seq.seqcount is odd
                                                                         tty = tty_kref_get(port->tty);
                                                                         spin_unlock_irqrestore(&port->lock, flags);
                                                                       }
                                                                     }
                                                                   }
                                                                 }
                                                               // interrupt handler ends
                                                               write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
                                                             }
                    }
                  }
                }
              }
            }
          }
        }
      }
      spin_unlock_irqrestore(&port->lock, flags);
    }
  }

Well, it seems that read_mems_allowed_begin() is protected by calling
local_irq_save(flags) before write_seqcount_begin(&current->mems_allowed_seq).

Can zonelist_iter_begin() be protected as well (i.e. call local_irq_save(flags)
before write_seqlock(&zonelist_update_seq)) ?

But even if write_seqlock(&zonelist_update_seq) is called with local irqs disabled,
wouldn't port_lock_key after all trigger this warning again?



This bug report might be a suggestion that we want two versions of
__alloc_pages_slowpath(): one for atomic context, geared towards smaller
kernel stack usage and a simplified locking dependency (because atomic
allocations can happen from subtle contexts, including interrupt handlers),
and a noinline version for schedulable context, geared towards larger kernel
stack usage and the complicated locking dependencies needed to implement rich
retry paths, including direct reclaim and the OOM kill...




* Re: [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations
  2023-04-03 11:14       ` Tetsuo Handa
@ 2023-04-03 12:09         ` Michal Hocko
  2023-04-03 12:51           ` Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-04-03 12:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Mel Gorman, Patrick Daly, David Hildenbrand, Andrew Morton,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm,
	Petr Mladek, Sergey Senozhatsky, Steven Rostedt, John Ogness

On Mon 03-04-23 20:14:28, Tetsuo Handa wrote:
> On 2023/04/03 17:15, Michal Hocko wrote:
> > Is this
> > https://lore.kernel.org/all/0000000000001d74d205f7c1821f@google.com/ the
> > underlying report ?
> 
> Yes.
> 
> > Could you explain the deadlock scenario?
> 
> build_zonelists() from __build_all_zonelists() calls printk() with
> zonelist_update_seq held.
> 
> printk() holds console_owner lock for synchronous printing, and then upon
> unlock of console_owner lock, printk() holds port_lock_key and port->lock.
> 
> tty_insert_flip_string_and_push_buffer() from pty_write() conditionally calls
> kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. But since commit 3d36424b3b58,
> zonelist_update_seq is checked by GFP_ATOMIC allocation (i.e. a new locking dependency
> was added by that commit).
> 
>   CPU0                                                       CPU1
>   pty_write() {
>     tty_insert_flip_string_and_push_buffer() {
>                                                              __build_all_zonelists() {
>       spin_lock_irqsave(&port->lock, flags);
>       tty_insert_flip_string() {
>         tty_insert_flip_string_fixed_flag() {
>           __tty_buffer_request_room() {
>             tty_buffer_alloc() {
>               kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
>                 __alloc_pages_slowpath() {
>                                                                write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
>                                                                // interrupt handler starts
>                                                                  handle_irq() {
>                                                                    serial8250_interrupt() {
>                                                                      serial8250_tx_chars() {
>                                                                        tty_port_tty_get() {
>                                                                          spin_lock_irqsave(&port->lock, flags); // spins here waiting for kmalloc() from tty_insert_flip_string() to complete
>                   zonelist_iter_begin() {
>                     read_seqbegin(&zonelist_update_seq) {
>                       // spins here waiting for interrupt handler to complete if zonelist_update_seq.seqcount is odd
>                                                                          tty = tty_kref_get(port->tty);
>                                                                          spin_unlock_irqrestore(&port->lock, flags);
>                                                                        }
>                                                                      }
>                                                                    }
>                                                                  }
>                                                                // interrupt handler ends
>                                                                write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>                                                              }
>                     }
>                   }
>                 }
>               }
>             }
>           }
>         }
>       }
>       spin_unlock_irqrestore(&port->lock, flags);
>     }
>   }

Thank you! IIUC this can only happen when there is a race with the
memory hotplug. So pretty much a very rare event. Also I am not really
sure this really requires any changes at the allocator level. I would
much rather sacrifice the printk in build_zonelists or pull it out of
the locked section. Or would printk_deferred help in this case?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations
  2023-04-03 12:09         ` Michal Hocko
@ 2023-04-03 12:51           ` Tetsuo Handa
  2023-04-03 13:44             ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2023-04-03 12:51 UTC (permalink / raw)
  To: Michal Hocko, Petr Mladek, Sergey Senozhatsky, Steven Rostedt,
	John Ogness
  Cc: Mel Gorman, Patrick Daly, David Hildenbrand, Andrew Morton,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On 2023/04/03 21:09, Michal Hocko wrote:
> On Mon 03-04-23 20:14:28, Tetsuo Handa wrote:
>> Well, it seems that read_mems_allowed_begin() is protected by calling
>> local_irq_save(flags) before write_seqcount_begin(&current->mems_allowed_seq).
>> 
>> Can zonelist_iter_begin() be protected as well (i.e. call local_irq_save(flags)
>> before write_seqlock(&zonelist_update_seq)) ?
>> 
>> But even if write_seqlock(&zonelist_update_seq) is called with local irq disabled,
>> port_lock_key after all makes this warning again?

Hmm, local_irq_save(flags) before write_seqlock(&zonelist_update_seq) won't help.
Synchronous printk() will try to hold port->lock from process context even if local
irqs are disabled, won't it? This is not limited to interrupt handlers: any synchronous
printk() inside the write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
section is unsafe.

> Thank you! IIUC this can only happen when there is a race with the
> memory hotplug. So pretty much a very rare event.

Right.

>                                                   Also I am not really
> sure this really requires any changes at the allocator level. I would
> much rather sacrifice the printk in build_zonelists or pull it out of
> the locked section. Or would printk_deferred help in this case?

Just moving printk() out of the write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
section is not sufficient. This problem will happen as long as an interrupt handler tries to hold port->lock.
Disabling local irqs will also be needed.

By the way, does this case qualify as a user of printk_deferred(), given that printk_deferred() says
  /*
   * Special printk facility for scheduler/timekeeping use only, _DO_NOT_USE_ !
   */
  __printf(1, 2) __cold int _printk_deferred(const char *fmt, ...);

?

Since this is a problem introduced by an mm change, I think fixing it on the
mm side is cleaner. Can't there be a different approach? For example, can't we
replace

	cpuset_mems_cookie = read_mems_allowed_begin();
	zonelist_iter_cookie = zonelist_iter_begin();

and

	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
	    check_retry_zonelist(zonelist_iter_cookie))

with different conditions, like recalculating cpuset/zonelist at the last second and
checking, immediately before giving up the allocation or invoking the OOM kill, whether they have changed?




* Re: [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations
  2023-04-03 12:51           ` Tetsuo Handa
@ 2023-04-03 13:44             ` Michal Hocko
  2023-04-03 15:12               ` Petr Mladek
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-04-03 13:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	Mel Gorman, Patrick Daly, David Hildenbrand, Andrew Morton,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On Mon 03-04-23 21:51:29, Tetsuo Handa wrote:
> On 2023/04/03 21:09, Michal Hocko wrote:
> > On Mon 03-04-23 20:14:28, Tetsuo Handa wrote:
> >> Well, it seems that read_mems_allowed_begin() is protected by calling
> >> local_irq_save(flags) before write_seqcount_begin(&current->mems_allowed_seq).
> >> 
> >> Can zonelist_iter_begin() be protected as well (i.e. call local_irq_save(flags)
> >> before write_seqlock(&zonelist_update_seq)) ?
> >> 
> >> But even if write_seqlock(&zonelist_update_seq) is called with local irq disabled,
> >> port_lock_key after all makes this warning again?
> 
> Hmm, local_irq_save(flags) before write_seqlock(&zonelist_update_seq) won't help.
> Synchronous printk() will try to hold port->lock from process context even if local
> irq is disabled, won't it? Not limited to interrupt handler but any synchronous printk()
> inside write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
> section is not safe.
> 
> > Thank you! IIUC this can only happen when there is a race with the
> > memory hotplug. So pretty much a very rare event.
> 
> Right.
> 
> >                                                   Also I am not really
> > sure this really requires any changes at the allocator level. I would
> > much rather sacrifice the printk in build_zonelists or pull it out of
> > the locked section. Or would printk_deferred help in this case?
> 
> Just moving printk() out of write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
> section is not sufficient. This problem will happen as long as interrupt handler tries to hold port->lock.

I do not follow. How is a printk outside of zonelist_update_seq going to
cause a dead/live lock? There shouldn't be any other locks (apart from
hotplug) taken in that path IIRC.

> Also disabling local irq will be needed.

Why?

> By the way, is this case qualified as a user of printk_deferred(), for printk_deferred() says
> 
>   /*
>    * Special printk facility for scheduler/timekeeping use only, _DO_NOT_USE_ !
>    */
>   __printf(1, 2) __cold int _printk_deferred(const char *fmt, ...);
> 
> ?

Dunno, question for the printk maintainers. I know they want to limit the
usage. Maybe this qualifies as an exception-worthy case as well.


> Since this is a problem introduced by an mm change, I think fixing it on the
> mm side is cleaner.

Agreed. That would be one of the options I have mentioned. I do not
think the printk information serves such a big role that we couldn't live
without it.

> Can't there be a different approach? For example, can't we
> replace
> 
> 	cpuset_mems_cookie = read_mems_allowed_begin();
> 	zonelist_iter_cookie = zonelist_iter_begin();
> 
> and
> 
> 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> 	    check_retry_zonelist(zonelist_iter_cookie))
> 
> with different conditions, like recalculate cpuset/zonelist in the last second and
> check immediately before giving up allocation or OOM kill whether they have changed?

Dunno, and honestly that is a subtle piece of code and I would rather not
touch it just because we have limitations in printk usage. Especially
considering the above.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations
  2023-04-03 13:44             ` Michal Hocko
@ 2023-04-03 15:12               ` Petr Mladek
  2023-04-04  0:37                 ` [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Petr Mladek @ 2023-04-03 15:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	Mel Gorman, Patrick Daly, David Hildenbrand, Andrew Morton,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On Mon 2023-04-03 15:44:34, Michal Hocko wrote:
> On Mon 03-04-23 21:51:29, Tetsuo Handa wrote:
> > On 2023/04/03 21:09, Michal Hocko wrote:
> > > On Mon 03-04-23 20:14:28, Tetsuo Handa wrote:
> > >> Well, it seems that read_mems_allowed_begin() is protected by calling
> > >> local_irq_save(flags) before write_seqcount_begin(&current->mems_allowed_seq).
> > >> 
> > >> Can zonelist_iter_begin() be protected as well (i.e. call local_irq_save(flags)
> > >> before write_seqlock(&zonelist_update_seq)) ?
> > >> 
> > >> But even if write_seqlock(&zonelist_update_seq) is called with local irq disabled,
> > >> port_lock_key after all makes this warning again?
> > 
> > Hmm, local_irq_save(flags) before write_seqlock(&zonelist_update_seq) won't help.
> > Synchronous printk() will try to hold port->lock from process context even if local
> > irq is disabled, won't it? Not limited to interrupt handler but any synchronous printk()
> > inside write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
> > section is not safe.
> > 
> > > Thank you! IIUC this can only happen when there is a race with the
> > > memory hotplug. So pretty much a very rare event.
> > 
> > Right.
> > 
> > >                                                   Also I am not really
> > > sure this really requires any changes at the allocator level. I would
> > > much rather sacrifice the printk in build_zonelists or pull it out of
> > > the locked section. Or would printk_deferred help in this case?
> > 
> > Just moving printk() out of write_seqlock(&zonelist_update_seq) / write_sequnlock(&zonelist_update_seq)
> > section is not sufficient. This problem will happen as long as interrupt handler tries to hold port->lock.
> 
> I do not follow. How is a printk outside of zonelist_update_seq going to
> cause a dead/live lock? There shouldn't be any other locks (apart from
> hotplug) taken in that path IIRC.
> 
> > Also disabling local irq will be needed.
> 
> Why?
> 
> > By the way, is this case qualified as a user of printk_deferred(), for printk_deferred() says
> > 
> >   /*
> >    * Special printk facility for scheduler/timekeeping use only, _DO_NOT_USE_ !
> >    */
> >   __printf(1, 2) __cold int _printk_deferred(const char *fmt, ...);
> > 
> > ?
> 
> Dunno, question for the printk maintainers. I know they want to limit the
> usage. Maybe this qualifies as an exception-worthy case as well.

Sigh, I am afraid that printk_deferred() is the best option at this
moment.

I see that there are more pr_info()/pr_cont() calls in build_zonelists.
You might want to use printk_deferred_enter()/exit() around them.

These problems should disappear once printk() messages get processed
in a kthread. And pr_info() is not critical information by definition,
so it is a perfect candidate for deferring.

Unfortunately, introducing the kthreads takes ages. So, we will have
to live with printk_deferred() for some time.


Important:

printk_deferred() still should be used with care. The messages will
be pushed to the console in another random context that might be
atomic. The more messages we defer, the bigger the risk of softlockups
we create.

Rules of thumb:

  + printk_deferred() would make sense for printing a few lines
    in code where the dependency against the console driver
    code is really complicated. That seems to be the case here.

  + printk_deferred() is not a good solution for long reports. It
    would just increase the risk of softlockups and could break
    the system.

Best Regards,
Petr



* [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-03 15:12               ` Petr Mladek
@ 2023-04-04  0:37                 ` Tetsuo Handa
  2023-04-04  2:11                   ` Sergey Senozhatsky
                                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Tetsuo Handa @ 2023-04-04  0:37 UTC (permalink / raw)
  To: Petr Mladek, Michal Hocko, Patrick Daly, Mel Gorman,
	David Hildenbrand, Andrew Morton
  Cc: Sergey Senozhatsky, Steven Rostedt, John Ogness, syzkaller-bugs,
	Ilpo Järvinen, syzbot, linux-mm

syzbot is reporting a circular locking dependency which involves the
zonelist_update_seq seqlock [1], because this lock is checked by memory
allocation requests which do not need to be retried.

We somehow need to prevent __alloc_pages_slowpath() from checking
this lock. Since Petr Mladek thinks that __build_all_zonelists() can
become a candidate for deferring printk() [2], let's make sure that the
current CPU/thread won't reach __alloc_pages_slowpath() while this lock
is in use.

Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
---
 mm/page_alloc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7136c36c5d01..64fa77b8d24a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	unsigned long flags;
 
+	/*
+	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
+	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
+	 * odd has to be avoided.
+	 *
+	 * Explicitly disable local irqs in order to avoid calling
+	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
+	 * Also, explicitly prevent printk() from synchronously waiting for
+	 * port->lock because tty_insert_flip_string_and_push_buffer() might
+	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.
+	 */
+	local_irq_save(flags);
+	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
 	}
 
 	write_sequnlock(&zonelist_update_seq);
+	printk_deferred_exit();
+	local_irq_restore(flags);
 }
 
 static noinline void __init
-- 
2.34.1




* Re: [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04  0:37                 ` [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock Tetsuo Handa
@ 2023-04-04  2:11                   ` Sergey Senozhatsky
  2023-04-04  7:43                   ` Petr Mladek
  2023-04-04  7:54                   ` Michal Hocko
  2 siblings, 0 replies; 20+ messages in thread
From: Sergey Senozhatsky @ 2023-04-04  2:11 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, Michal Hocko, Patrick Daly, Mel Gorman,
	David Hildenbrand, Andrew Morton, Sergey Senozhatsky,
	Steven Rostedt, John Ogness, syzkaller-bugs, Ilpo Järvinen,
	syzbot, linux-mm

On (23/04/04 09:37), Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> We somehow need to prevent __alloc_pages_slowpath() from checking
> this lock. Since Petr Mladek thinks that __build_all_zonelists() can
> become a candidate for deferring printk() [2], let's make sure that
> current CPU/thread won't reach __alloc_pages_slowpath() while this lock
> is in use.
> 
> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Petr Mladek <pmladek@suse.com>

Yeah that looks like one of those printk_deferred() cases.
FWIW
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>



* Re: [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04  0:37                 ` [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock Tetsuo Handa
  2023-04-04  2:11                   ` Sergey Senozhatsky
@ 2023-04-04  7:43                   ` Petr Mladek
  2023-04-04  7:54                   ` Michal Hocko
  2 siblings, 0 replies; 20+ messages in thread
From: Petr Mladek @ 2023-04-04  7:43 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Patrick Daly, Mel Gorman, David Hildenbrand,
	Andrew Morton, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On Tue 2023-04-04 09:37:25, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> We somehow need to prevent __alloc_pages_slowpath() from checking
> this lock. Since Petr Mladek thinks that __build_all_zonelists() can
> become a candidate for deferring printk() [2], let's make sure that
> current CPU/thread won't reach __alloc_pages_slowpath() while this lock
> is in use.
> 
> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]

From the description it is far from obvious how printk() is involved.
It might make sense to paste the entire lockdep splat. The links
are not guaranteed to stay around.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	unsigned long flags;
>  
> +	/*
> +	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
> +	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
> +	 * odd have to be avoided.
> +	 *
> +	 * Explicitly disable local irqs in order to avoid calling
> +	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
> +	 * Also, explicitly prevent printk() from synchronously waiting for
> +	 * port->lock because tty_insert_flip_string_and_push_buffer() might
> +	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.
> +	 */
> +	local_irq_save(flags);

The comment above printk_deferred_enter definition in
include/linux/printk.h says that interrupts need to be disabled.

But strictly speaking, it should be enough to disable preemption
these days. The reason is that it uses a per-CPU reference counter.

Note that it used to be really important to disable interrupts
in the past. The messages were temporarily stored in a per-CPU buffer
and the lockless algorithm was not safe for reentrancy.

> +	printk_deferred_enter();
>  	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
> @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
>  	}
>  
>  	write_sequnlock(&zonelist_update_seq);
> +	printk_deferred_exit();
> +	local_irq_restore(flags);
>  }
>  
>  static noinline void __init

Otherwise, it looks fine from the printk() POV.

Best Regards,
Petr


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04  0:37                 ` [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock Tetsuo Handa
  2023-04-04  2:11                   ` Sergey Senozhatsky
  2023-04-04  7:43                   ` Petr Mladek
@ 2023-04-04  7:54                   ` Michal Hocko
  2023-04-04  8:20                     ` Tetsuo Handa
  2 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-04-04  7:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, Patrick Daly, Mel Gorman, David Hildenbrand,
	Andrew Morton, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On Tue 04-04-23 09:37:25, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> We somehow need to prevent __alloc_pages_slowpath() from checking
> this lock. Since Petr Mladek thinks that __build_all_zonelists() can
> become a candidate for deferring printk() [2], let's make sure that
> current CPU/thread won't reach __alloc_pages_slowpath() while this lock
> is in use.

I agree with Petr that the lockdep splat and the lockup explanation
should be part of the changelog. Just reuse what you had in previous
email.

> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Petr Mladek <pmladek@suse.com>
> ---
>  mm/page_alloc.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7136c36c5d01..64fa77b8d24a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	unsigned long flags;
>  
> +	/*
> +	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
> +	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
> +	 * odd have to be avoided.
> +	 *
> +	 * Explicitly disable local irqs in order to avoid calling
> +	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
> +	 * Also, explicitly prevent printk() from synchronously waiting for
> +	 * port->lock because tty_insert_flip_string_and_push_buffer() might
> +	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.

This explanation of local_irq_save just doesn't make any sense. You do
not prevent any other cpu from entering the IRQ and doing the same
thing. If the sole purpose of local_irq_save is to conform
printk_deferred_enter then state that instead. Although it seems that
Petr believes that preempt_disable should be sufficient and then it
would be preferred as well. This would require update to the comment for
printk_deferred_enter though.

Thanks!

> +	 */
> +	local_irq_save(flags);
> +	printk_deferred_enter();
>  	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
> @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
>  	}
>  
>  	write_sequnlock(&zonelist_update_seq);
> +	printk_deferred_exit();
> +	local_irq_restore(flags);
>  }
>  
>  static noinline void __init
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04  7:54                   ` Michal Hocko
@ 2023-04-04  8:20                     ` Tetsuo Handa
  2023-04-04 11:05                       ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2023-04-04  8:20 UTC (permalink / raw)
  To: Michal Hocko, Petr Mladek
  Cc: Patrick Daly, Mel Gorman, David Hildenbrand, Andrew Morton,
	Sergey Senozhatsky, Steven Rostedt, John Ogness, syzkaller-bugs,
	Ilpo Järvinen, syzbot, linux-mm

On 2023/04/04 16:54, Michal Hocko wrote:
>> +	/*
>> +	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
>> +	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
>> +	 * odd have to be avoided.
>> +	 *
>> +	 * Explicitly disable local irqs in order to avoid calling
>> +	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
>> +	 * Also, explicitly prevent printk() from synchronously waiting for
>> +	 * port->lock because tty_insert_flip_string_and_push_buffer() might
>> +	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.
> 
> This explanation of local_irq_save just doesn't make any sense. You do
> not prevent any other cpu from entering the IRQ and doing the same
> thing.

There is no need to prevent other CPUs from doing the same thing.
The intent of local_irq_save() here is to avoid the sequence below.

  CPU0
  ----
  __build_all_zonelists() {
    write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
    // e.g. timer interrupt handler runs at this moment
      some_timer_func() {
        kmalloc(GFP_ATOMIC) {
          __alloc_pages_slowpath() {
            read_seqbegin(&zonelist_update_seq) {
              // forever spins because zonelist_update_seq.seqcount is odd
            }
          }
        }
      }
    // e.g. timer interrupt handler finishes
    write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
  }

>        If the sole purpose of local_irq_save is to conform
> printk_deferred_enter then state that instead.

As described above, this local_irq_save() is not intended to satisfy
printk_deferred_enter()'s requirements. This is the same as
set_mems_allowed() calling local_irq_save() before
write_seqcount_begin(&current->mems_allowed_seq).

>                                                Although it seems that
> Petr believes that preempt_disable should be sufficient and then it
> would be preferred as well. This would require update to the comment for
> printk_deferred_enter though.

This patch is supposed to be backported to stable kernels where commit
3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists
and page allocation") was backported.




* Re: [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04  8:20                     ` Tetsuo Handa
@ 2023-04-04 11:05                       ` Michal Hocko
  2023-04-04 11:19                         ` Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-04-04 11:05 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, Patrick Daly, Mel Gorman, David Hildenbrand,
	Andrew Morton, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On Tue 04-04-23 17:20:02, Tetsuo Handa wrote:
> On 2023/04/04 16:54, Michal Hocko wrote:
> >> +	/*
> >> +	 * Since __alloc_pages_slowpath() spins if zonelist_update_seq.seqcount
> >> +	 * is odd, any memory allocation while zonelist_update_seq.seqcount is
> >> +	 * odd have to be avoided.
> >> +	 *
> >> +	 * Explicitly disable local irqs in order to avoid calling
> >> +	 * kmalloc(GFP_ATOMIC) from e.g. timer interrupt handler.
> >> +	 * Also, explicitly prevent printk() from synchronously waiting for
> >> +	 * port->lock because tty_insert_flip_string_and_push_buffer() might
> >> +	 * call kmalloc(GFP_ATOMIC | __GFP_NOWARN) while holding port->lock.
> > 
> > This explanation of local_irq_save just doesn't make any sense. You do
> > not prevent any other cpu from entering the IRQ and doing the same
> > thing.
> 
> There is no need to prevent other CPUs from doing the same thing.
> The intent of local_irq_save() here is to avoid below sequence.
> 
>   CPU0
>   ----
>   __build_all_zonelists() {
>     write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
>     // e.g. timer interrupt handler runs at this moment
>       some_timer_func() {
>         kmalloc(GFP_ATOMIC) {
>           __alloc_pages_slowpath() {
>             read_seqbegin(&zonelist_update_seq) {
>               // forever spins because zonelist_update_seq.seqcount is odd
>             }
>           }
>         }
>       }
>     // e.g. timer interrupt handler finishes
>     write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>   }

OK, now we are on the same page. Your previous bc630622-8b24-e4ca-2685-64880b5a7647@I-love.SAKURA.ne.jp
explanation had an intermediary lock in the dependency (port->lock). I
hadn't realized that the actual scenario could be simpler than that.
But you are right that GFP_ATOMIC allocations from IRQ context are quite
common, so this is a more general situation that doesn't really need to
involve TTY and its locking oddities.

This all is quite subtle and easy to miss so please make sure to
describe both scenarios in the changelog. For the comment above I would
rephrase as follows:
	/*
	 * Explicitly disable interrupts before taking seqlock to prevent
	 * any IRQ handler from calling into the page allocator
	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and live
	 * lock.
	 */

Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04 11:05                       ` Michal Hocko
@ 2023-04-04 11:19                         ` Tetsuo Handa
  2023-04-04 14:31                           ` [PATCH v2] " Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2023-04-04 11:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Petr Mladek, Patrick Daly, Mel Gorman, David Hildenbrand,
	Andrew Morton, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On 2023/04/04 20:05, Michal Hocko wrote:
> But you are right that GFP_ATOMIC allocations from IRQ context are quite
> common so this is a more general situation that doesn't really need to
> involve TTY and some locking oddities there.

Yes, that's why "[PATCH] mm/page_alloc: don't check zonelist_update_seq from
atomic allocations" was proposed; I consider that zonelist_update_seq should
not be checked from atomic context.

From the lockdep perspective, keeping the locking dependencies simple is
better, even if we go with the printk_deferred_enter()/printk_deferred_exit()
approach.




* [PATCH v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04 11:19                         ` Tetsuo Handa
@ 2023-04-04 14:31                           ` Tetsuo Handa
  2023-04-04 15:20                             ` Michal Hocko
  2023-04-04 21:25                             ` Andrew Morton
  0 siblings, 2 replies; 20+ messages in thread
From: Tetsuo Handa @ 2023-04-04 14:31 UTC (permalink / raw)
  To: Michal Hocko, Petr Mladek
  Cc: Patrick Daly, Mel Gorman, David Hildenbrand, Andrew Morton,
	Sergey Senozhatsky, Steven Rostedt, John Ogness, syzkaller-bugs,
	Ilpo Järvinen, syzbot, linux-mm

syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.

One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.

  CPU0
  ----
  __build_all_zonelists() {
    write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
    // e.g. timer interrupt handler runs at this moment
      some_timer_func() {
        kmalloc(GFP_ATOMIC) {
          __alloc_pages_slowpath() {
            read_seqbegin(&zonelist_update_seq) {
              // spins forever because zonelist_update_seq.seqcount is odd
            }
          }
        }
      }
    // e.g. timer interrupt handler finishes
    write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
  }

This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.

Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer()
with port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.

  CPU0                                   CPU1
  ----                                   ----
  pty_write() {
    tty_insert_flip_string_and_push_buffer() {
                                         __build_all_zonelists() {
                                           write_seqlock(&zonelist_update_seq);
                                           build_zonelists() {
                                             printk() {
                                               vprintk() {
                                                 vprintk_default() {
                                                   vprintk_emit() {
                                                     console_unlock() {
                                                       console_flush_all() {
                                                         console_emit_next_record() {
                                                           con->write() = serial8250_console_write() {
      spin_lock_irqsave(&port->lock, flags);
      tty_insert_flip_string() {
        tty_insert_flip_string_fixed_flag() {
          __tty_buffer_request_room() {
            tty_buffer_alloc() {
              kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                __alloc_pages_slowpath() {
                  zonelist_iter_begin() {
                    read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
                                                             spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
                    }
                  }
                }
              }
            }
          }
        }
      }
      spin_unlock_irqrestore(&port->lock, flags);
                                                             // message is printed to console
                                                             spin_unlock_irqrestore(&port->lock, flags);
                                                           }
                                                         }
                                                       }
                                                     }
                                                   }
                                                 }
                                               }
                                             }
                                           }
                                           write_sequnlock(&zonelist_update_seq);
                                         }
    }
  }

This deadlock scenario can be eliminated by

  preventing interrupt context from calling kmalloc(GFP_ATOMIC)

and

  preventing printk() from calling console_flush_all()

while zonelist_update_seq.seqcount is odd.

Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by

  disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)

and

  disabling synchronous printk() in order to avoid console_flush_all()

.

As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC) inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...

Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
---
Changes in v2:
  Update patch description and comment.

 mm/page_alloc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7136c36c5d01..e8b4f294d763 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	unsigned long flags;
 
+	/*
+	 * Explicitly disable this CPU's interrupts before taking seqlock
+	 * to prevent any IRQ handler from calling into the page allocator
+	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
+	 */
+	local_irq_save(flags);
+	/*
+	 * Explicitly disable this CPU's synchronous printk() before taking
+	 * seqlock to prevent any printk() from trying to hold port->lock, for
+	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
+	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
+	 */
+	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
 	}
 
 	write_sequnlock(&zonelist_update_seq);
+	printk_deferred_exit();
+	local_irq_restore(flags);
 }
 
 static noinline void __init
-- 
2.34.1




* Re: [PATCH v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04 14:31                           ` [PATCH v2] " Tetsuo Handa
@ 2023-04-04 15:20                             ` Michal Hocko
  2023-04-05  9:02                               ` Mel Gorman
  2023-04-04 21:25                             ` Andrew Morton
  1 sibling, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-04-04 15:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, Patrick Daly, Mel Gorman, David Hildenbrand,
	Andrew Morton, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On Tue 04-04-23 23:31:58, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
> 
>   CPU0
>   ----
>   __build_all_zonelists() {
>     write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
>     // e.g. timer interrupt handler runs at this moment
>       some_timer_func() {
>         kmalloc(GFP_ATOMIC) {
>           __alloc_pages_slowpath() {
>             read_seqbegin(&zonelist_update_seq) {
>               // spins forever because zonelist_update_seq.seqcount is odd
>             }
>           }
>         }
>       }
>     // e.g. timer interrupt handler finishes
>     write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>   }
> 
> This deadlock scenario can be easily eliminated by not calling
> read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
> requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
> requests. But Michal Hocko does not know whether we should go with this
> approach.

It would have been more useful to explain why that is not preferred or
desirable.

> Another deadlock scenario which syzbot is reporting is a race between
> kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer()
> with port->lock held and printk() from __build_all_zonelists() with
> zonelist_update_seq held.
> 
>   CPU0                                   CPU1
>   ----                                   ----
>   pty_write() {
>     tty_insert_flip_string_and_push_buffer() {
>                                          __build_all_zonelists() {
>                                            write_seqlock(&zonelist_update_seq);
>                                            build_zonelists() {
>                                              printk() {
>                                                vprintk() {
>                                                  vprintk_default() {
>                                                    vprintk_emit() {
>                                                      console_unlock() {
>                                                        console_flush_all() {
>                                                          console_emit_next_record() {
>                                                            con->write() = serial8250_console_write() {
>       spin_lock_irqsave(&port->lock, flags);
>       tty_insert_flip_string() {
>         tty_insert_flip_string_fixed_flag() {
>           __tty_buffer_request_room() {
>             tty_buffer_alloc() {
>               kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
>                 __alloc_pages_slowpath() {
>                   zonelist_iter_begin() {
>                     read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
>                                                              spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
>                     }
>                   }
>                 }
>               }
>             }
>           }
>         }
>       }
>       spin_unlock_irqrestore(&port->lock, flags);
>                                                              // message is printed to console
>                                                              spin_unlock_irqrestore(&port->lock, flags);
>                                                            }
>                                                          }
>                                                        }
>                                                      }
>                                                    }
>                                                  }
>                                                }
>                                              }
>                                            }
>                                            write_sequnlock(&zonelist_update_seq);
>                                          }
>     }
>   }
> 
> This deadlock scenario can be eliminated by
> 
>   preventing interrupt context from calling kmalloc(GFP_ATOMIC)
> 
> and
> 
>   preventing printk() from calling console_flush_all()
> 
> while zonelist_update_seq.seqcount is odd.
> 
> Since Petr Mladek thinks that __build_all_zonelists() can become a
> candidate for deferring printk() [2], let's address this problem by
> 
>   disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
> 
> and
> 
>   disabling synchronous printk() in order to avoid console_flush_all()
> 
> .
> 
> As a side effect of minimizing duration of zonelist_update_seq.seqcount
> being odd by disabling synchronous printk(), latency at
> read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
> __GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
> lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
> do not record unnecessary locking dependency) from interrupt context is
> still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC) inside
> write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
> section...

I have a really hard time wrapping my head around this changelog. I would
rephrase it as follows.

The syzbot has noticed the following deadlock scenario[1]
	CPU0					CPU1
   pty_write() {
     tty_insert_flip_string_and_push_buffer() {
                                          __build_all_zonelists() {
                                            write_seqlock(&zonelist_update_seq); (A)
                                            build_zonelists() {
                                              printk() {
                                                vprintk() {
                                                  vprintk_default() {
                                                    vprintk_emit() {
                                                      console_unlock() {
                                                        console_flush_all() {
                                                          console_emit_next_record() {
                                                            con->write() = serial8250_console_write() {
       spin_lock_irqsave(&port->lock, flags); (B)
                                                              spin_lock_irqsave(&port->lock, flags); <<< spinning on (B)
       tty_insert_flip_string() {
         tty_insert_flip_string_fixed_flag() {
           __tty_buffer_request_room() {
             tty_buffer_alloc() {
               kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                 __alloc_pages_slowpath() {
                   zonelist_iter_begin() {
                     read_seqbegin(&zonelist_update_seq); <<< spinning on (A)

This can happen during a memory hotplug operation. This means that the
holder of the write_seqlock on zonelist_update_seq is not allowed to
call into the synchronous printk code path. This can be avoided by
using a deferred printk context.

This is not the only problematic scenario though. Another one would be
   __build_all_zonelists() {
     write_seqlock(&zonelist_update_seq); <<< (A)
       <IRQ>
         kmalloc(GFP_ATOMIC) {
           __alloc_pages_slowpath() {
             read_seqbegin(&zonelist_update_seq) <<< spinning on (A)

Allocations from (soft)IRQ contexts are quite common. This can be
avoided by disabling interrupts for this path so we cannot livelock
against ourselves.

> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Petr Mladek <pmladek@suse.com>

Anyway, the patch is correct.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Changes in v2:
>   Update patch description and comment.
> 
>  mm/page_alloc.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7136c36c5d01..e8b4f294d763 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	unsigned long flags;
>  
> +	/*
> +	 * Explicitly disable this CPU's interrupts before taking seqlock
> +	 * to prevent any IRQ handler from calling into the page allocator
> +	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 */
> +	local_irq_save(flags);
> +	/*
> +	 * Explicitly disable this CPU's synchronous printk() before taking
> +	 * seqlock to prevent any printk() from trying to hold port->lock, for
> +	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
> +	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> +	 */
> +	printk_deferred_enter();
>  	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
> @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
>  	}
>  
>  	write_sequnlock(&zonelist_update_seq);
> +	printk_deferred_exit();
> +	local_irq_restore(flags);
>  }
>  
>  static noinline void __init
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04 14:31                           ` [PATCH v2] " Tetsuo Handa
  2023-04-04 15:20                             ` Michal Hocko
@ 2023-04-04 21:25                             ` Andrew Morton
  2023-04-05  8:28                               ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2023-04-04 21:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Petr Mladek, Patrick Daly, Mel Gorman,
	David Hildenbrand, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, syzkaller-bugs, Ilpo Järvinen, syzbot,
	linux-mm

On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.

I queued this, along with a note that an updated changelog is likely.

Do we feel that a -stable backport is warranted?  I think so, from your
earlier comments.  Please add the cc:stable to the changelog in this
situation.




* Re: [PATCH v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04 21:25                             ` Andrew Morton
@ 2023-04-05  8:28                               ` Michal Hocko
  2023-04-05  8:53                                 ` Petr Mladek
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-04-05  8:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, Petr Mladek, Patrick Daly, Mel Gorman,
	David Hildenbrand, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, syzkaller-bugs, Ilpo Järvinen, syzbot,
	linux-mm

On Tue 04-04-23 14:25:28, Andrew Morton wrote:
> On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> > syzbot is reporting circular locking dependency which involves
> > zonelist_update_seq seqlock [1], for this lock is checked by memory
> > allocation requests which do not need to be retried.
> 
> I queued this, along with a note that an updated changelog is likely.
> 
> Do we feel that a -stable backport is warranted?  I think so, from your
> earlier comments.  Please add the cc:stable to the changelog in this
> situation.

Memory hotplug is a pretty rare event, so the deadlock is quite unlikely.
On the other hand, the fix is pretty easy, so it shouldn't hurt to have
it in stable kernels.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-05  8:28                               ` Michal Hocko
@ 2023-04-05  8:53                                 ` Petr Mladek
  0 siblings, 0 replies; 20+ messages in thread
From: Petr Mladek @ 2023-04-05  8:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, Patrick Daly, Mel Gorman,
	David Hildenbrand, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, syzkaller-bugs, Ilpo Järvinen, syzbot,
	linux-mm

On Wed 2023-04-05 10:28:07, Michal Hocko wrote:
> On Tue 04-04-23 14:25:28, Andrew Morton wrote:
> > On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > 
> > > syzbot is reporting a circular locking dependency which involves the
> > > zonelist_update_seq seqlock [1], because this lock is checked by memory
> > > allocation requests which do not need to be retried.
> > 
> > I queued this, along with a note that an updated changelog is likely.
> > 
> > Do we feel that a -stable backport is warranted?  I think so, from your
> > earlier comments.  Please add the cc:stable to the changelog in this
> > situation.
> 
> Memory hotplug is a pretty rare event, so the deadlock is quite unlikely.
> On the other hand, the fix is pretty simple, so it shouldn't hurt to have
> it in stable kernels.

Note that printk_deferred_enter()/exit() has been added in v5.15-rc1
by the commit 85e3e7fbbb720b9897 ("printk: remove NMI tracking").

The commit has non-trivial dependencies. Any backport for older stable
kernels would need a custom patch just adding these two wrappers.

Best Regards,
Petr



* Re: [PATCH v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  2023-04-04 15:20                             ` Michal Hocko
@ 2023-04-05  9:02                               ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2023-04-05  9:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Petr Mladek, Patrick Daly, David Hildenbrand,
	Andrew Morton, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	syzkaller-bugs, Ilpo Järvinen, syzbot, linux-mm

On Tue, Apr 04, 2023 at 05:20:35PM +0200, Michal Hocko wrote:
> On Tue 04-04-23 23:31:58, Tetsuo Handa wrote:
> > syzbot is reporting a circular locking dependency which involves the
> > zonelist_update_seq seqlock [1], because this lock is checked by memory
> > allocation requests which do not need to be retried.
> > 
> > One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
> > 
> >   CPU0
> >   ----
> >   __build_all_zonelists() {
> >     write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
> >     // e.g. timer interrupt handler runs at this moment
> >       some_timer_func() {
> >         kmalloc(GFP_ATOMIC) {
> >           __alloc_pages_slowpath() {
> >             read_seqbegin(&zonelist_update_seq) {
> >               // spins forever because zonelist_update_seq.seqcount is odd
> >             }
> >           }
> >         }
> >       }
> >     // e.g. timer interrupt handler finishes
> >     write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
> >   }
> > 
> > > This deadlock scenario can be easily eliminated by not calling
> > > read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
> > > requests, because retrying is applicable only to __GFP_DIRECT_RECLAIM
> > > allocation requests. But Michal Hocko does not know whether we should go
> > > with this approach.
> 
> It would have been more useful to explain why that is not preferred or
> desirable.
> 

It would have undesirable side-effects. !__GFP_DIRECT_RECLAIM does not mean
that the caller is in interrupt context or that it is a re-entrant
allocation request from IRQ context. Special-casing !__GFP_DIRECT_RECLAIM
could allow a speculative allocation to simply fail prematurely due to
memory hotplug.

	This deadlock could be mitigated by not calling
	read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM
	allocation requests, but that has undesirable consequences.
	!__GFP_DIRECT_RECLAIM applies to allocations other than
	atomic allocations from interrupt context, such as GFP_NOWAIT
	allocations. These types of allocations could prematurely fail due to
	a memory hotplug race or cpuset update and, while this is probably
	harmless and recoverable, it's an undesirable and unnecessary way to
	fix the deadlock.

I imagine there could also be checks for callers from IRQ context, but on
some older chips checking whether IRQs are disabled is expensive, so that
is also an undesirable fix. Maybe the same is true for newer CPUs, but I
didn't check. The printk_deferred approach seems the most harmless option
for now, and maybe printk_deferred will ultimately go away.

Other than potential changelog updates

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs



end of thread, other threads:[~2023-04-05  9:02 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <000000000000b21f0a05e9ec310d@google.com>
     [not found] ` <f6bd471c-f961-ef5e-21c5-bf158be19d12@linux.intel.com>
2023-04-02 10:48   ` [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations Tetsuo Handa
2023-04-03  8:15     ` Michal Hocko
2023-04-03 11:14       ` Tetsuo Handa
2023-04-03 12:09         ` Michal Hocko
2023-04-03 12:51           ` Tetsuo Handa
2023-04-03 13:44             ` Michal Hocko
2023-04-03 15:12               ` Petr Mladek
2023-04-04  0:37                 ` [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock Tetsuo Handa
2023-04-04  2:11                   ` Sergey Senozhatsky
2023-04-04  7:43                   ` Petr Mladek
2023-04-04  7:54                   ` Michal Hocko
2023-04-04  8:20                     ` Tetsuo Handa
2023-04-04 11:05                       ` Michal Hocko
2023-04-04 11:19                         ` Tetsuo Handa
2023-04-04 14:31                           ` [PATCH v2] " Tetsuo Handa
2023-04-04 15:20                             ` Michal Hocko
2023-04-05  9:02                               ` Mel Gorman
2023-04-04 21:25                             ` Andrew Morton
2023-04-05  8:28                               ` Michal Hocko
2023-04-05  8:53                                 ` Petr Mladek
