Re: [PATCH v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Michal Hocko <mhocko@suse.com>
To: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: "Petr Mladek" <pmladek@suse.com>,
	"Patrick Daly" <quic_pdaly@quicinc.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"David Hildenbrand" <david@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Sergey Senozhatsky" <senozhatsky@chromium.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"John Ogness" <john.ogness@linutronix.de>,
	syzkaller-bugs@googlegroups.com,
	"Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>,
	syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
Date: Tue, 4 Apr 2023 17:20:35 +0200	[thread overview]
Message-ID: <ZCxAQ++B3Io/dk6E@dhcp22.suse.cz> (raw)
In-Reply-To: <8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp>

On Tue 04-04-23 23:31:58, Tetsuo Handa wrote:
> syzbot is reporting circular locking dependency which involves
> zonelist_update_seq seqlock [1], for this lock is checked by memory
> allocation requests which do not need to be retried.
> 
> One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
> 
>   CPU0
>   ----
>   __build_all_zonelists() {
>     write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
>     // e.g. timer interrupt handler runs at this moment
>       some_timer_func() {
>         kmalloc(GFP_ATOMIC) {
>           __alloc_pages_slowpath() {
>             read_seqbegin(&zonelist_update_seq) {
>               // spins forever because zonelist_update_seq.seqcount is odd
>             }
>           }
>         }
>       }
>     // e.g. timer interrupt handler finishes
>     write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>   }
> 
> This deadlock scenario can be easily eliminated by not calling
> read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
> requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
> requests. But Michal Hocko does not know whether we should go with this
> approach.

It would have been more useful to explain why that is not preferred or
desirable.

> Another deadlock scenario which syzbot is reporting is a race between
> kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer()
> with port->lock held and printk() from __build_all_zonelists() with
> zonelist_update_seq held.
> 
>   CPU0                                   CPU1
>   ----                                   ----
>   pty_write() {
>     tty_insert_flip_string_and_push_buffer() {
>                                          __build_all_zonelists() {
>                                            write_seqlock(&zonelist_update_seq);
>                                            build_zonelists() {
>                                              printk() {
>                                                vprintk() {
>                                                  vprintk_default() {
>                                                    vprintk_emit() {
>                                                      console_unlock() {
>                                                        console_flush_all() {
>                                                          console_emit_next_record() {
>                                                            con->write() = serial8250_console_write() {
>       spin_lock_irqsave(&port->lock, flags);
>       tty_insert_flip_string() {
>         tty_insert_flip_string_fixed_flag() {
>           __tty_buffer_request_room() {
>             tty_buffer_alloc() {
>               kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
>                 __alloc_pages_slowpath() {
>                   zonelist_iter_begin() {
>                     read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
>                                                              spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
>                     }
>                   }
>                 }
>               }
>             }
>           }
>         }
>       }
>       spin_unlock_irqrestore(&port->lock, flags);
>                                                              // message is printed to console
>                                                              spin_unlock_irqrestore(&port->lock, flags);
>                                                            }
>                                                          }
>                                                        }
>                                                      }
>                                                    }
>                                                  }
>                                                }
>                                              }
>                                            }
>                                            write_sequnlock(&zonelist_update_seq);
>                                          }
>     }
>   }
> 
> This deadlock scenario can be eliminated by
> 
>   preventing interrupt context from calling kmalloc(GFP_ATOMIC)
> 
> and
> 
>   preventing printk() from calling console_flush_all()
> 
> while zonelist_update_seq.seqcount is odd.
> 
> Since Petr Mladek thinks that __build_all_zonelists() can become a
> candidate for deferring printk() [2], let's address this problem by
> 
>   disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
> 
> and
> 
>   disabling synchronous printk() in order to avoid console_flush_all()
> 
> .
> 
> As a side effect of minimizing duration of zonelist_update_seq.seqcount
> being odd by disabling synchronous printk(), latency at
> read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
> __GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
> lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
> do not record unnecessary locking dependency) from interrupt context is
> still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC) inside
> write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
> section...

I have really hard time to wrap my head around this changelog. I would
rephrase as follows.

The syzbot has noticed the following deadlock scenario[1]
	CPU0					CPU1
   pty_write() {
     tty_insert_flip_string_and_push_buffer() {
                                          __build_all_zonelists() {
                                            write_seqlock(&zonelist_update_seq); (A)
                                            build_zonelists() {
                                              printk() {
                                                vprintk() {
                                                  vprintk_default() {
                                                    vprintk_emit() {
                                                      console_unlock() {
                                                        console_flush_all() {
                                                          console_emit_next_record() {
                                                            con->write() = serial8250_console_write() {
       spin_lock_irqsave(&port->lock, flags); (B)
                                                              spin_lock_irqsave(&port->lock, flags); <<< spinning on (B)
       tty_insert_flip_string() {
         tty_insert_flip_string_fixed_flag() {
           __tty_buffer_request_room() {
             tty_buffer_alloc() {
               kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                 __alloc_pages_slowpath() {
                   zonelist_iter_begin() {
                     read_seqbegin(&zonelist_update_seq); <<< spinning on (A)

This can happen during memory hotplug operation. This means that
write_seqlock on zonelist_update_seq is not allowed to call into
synchronous printk code path. This can be avoided by using a deferred
printk context.

This is not the only problematic scenario though. Another one would be
   __build_all_zonelists() {
     write_seqlock(&zonelist_update_seq); <<< (A)
       <IRQ>
         kmalloc(GFP_ATOMIC) {
           __alloc_pages_slowpath() {
             read_seqbegin(&zonelist_update_seq) <<< spinning on (A)

Allocations from (soft)IRQ contexts are quite common. This can be
avoided by disabling interrupts for this path so we won't self livelock.

> Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
> Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
> Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Petr Mladek <pmladek@suse.com>

Anyway the patch is correct
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Changes in v2:
>   Update patch description and comment.
> 
>  mm/page_alloc.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7136c36c5d01..e8b4f294d763 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> +	unsigned long flags;
>  
> +	/*
> +	 * Explicitly disable this CPU's interrupts before taking seqlock
> +	 * to prevent any IRQ handler from calling into the page allocator
> +	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 */
> +	local_irq_save(flags);
> +	/*
> +	 * Explicitly disable this CPU's synchronous printk() before taking
> +	 * seqlock to prevent any printk() from trying to hold port->lock, for
> +	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
> +	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> +	 */
> +	printk_deferred_enter();
>  	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
> @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data)
>  	}
>  
>  	write_sequnlock(&zonelist_update_seq);
> +	printk_deferred_exit();
> +	local_irq_restore(flags);
>  }
>  
>  static noinline void __init
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs

next prev parent reply	other threads:[~2023-04-04 15:20 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <000000000000b21f0a05e9ec310d@google.com>
     [not found] ` <f6bd471c-f961-ef5e-21c5-bf158be19d12@linux.intel.com>
2023-04-02 10:48   ` [PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations Tetsuo Handa
2023-04-03  8:15     ` Michal Hocko
2023-04-03 11:14       ` Tetsuo Handa
2023-04-03 12:09         ` Michal Hocko
2023-04-03 12:51           ` Tetsuo Handa
2023-04-03 13:44             ` Michal Hocko
2023-04-03 15:12               ` Petr Mladek
2023-04-04  0:37                 ` [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock Tetsuo Handa
2023-04-04  2:11                   ` Sergey Senozhatsky
2023-04-04  7:43                   ` Petr Mladek
2023-04-04  7:54                   ` Michal Hocko
2023-04-04  8:20                     ` Tetsuo Handa
2023-04-04 11:05                       ` Michal Hocko
2023-04-04 11:19                         ` Tetsuo Handa
2023-04-04 14:31                           ` [PATCH v2] " Tetsuo Handa
2023-04-04 15:20                             ` Michal Hocko [this message]
2023-04-05  9:02                               ` Mel Gorman
2023-04-04 21:25                             ` Andrew Morton
2023-04-05  8:28                               ` Michal Hocko
2023-04-05  8:53                                 ` Petr Mladek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZCxAQ++B3Io/dk6E@dhcp22.suse.cz \
    --to=mhocko@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=ilpo.jarvinen@linux.intel.com \
    --cc=john.ogness@linutronix.de \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    --cc=pmladek@suse.com \
    --cc=quic_pdaly@quicinc.com \
    --cc=rostedt@goodmis.org \
    --cc=senozhatsky@chromium.org \
    --cc=syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox