[PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining
@ 2015-02-11 15:44 Vitaly Kuznetsov
  2015-02-11 15:44 ` [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug Vitaly Kuznetsov
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-02-11 15:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Andrew Morton, Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka,
	David Rientjes, Fabian Frederick, Zhang Zhen, Vladimir Davydov,
	Wang Nan
  Cc: linux-kernel, devel, linux-mm

If newly added memory is brought online with e.g. udev rule:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
the following deadlock is observed (and easily reproducable):

First participant, worker thread doing add_memory():

[  724.948846] kworker/0:1     D ffff88000412f9c8 13248    27      2 0x00000000
[  724.973543] Workqueue: events hot_add_req [hv_balloon]
[  724.991736]  ffff88000412f9c8 0000000000000000 ffff88003fa1dc30 00000000000151c0
[  725.019725]  0000000000000246 ffff88000412ffd8 00000000000151c0 ffff88003a77a4e0
[  725.046486]  ffff88003fa1dc30 00000001032a6000 ffff88003a7ca838 ffff88003a7ca898
[  725.072969] Call Trace:
[  725.082690]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
[  725.103799]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
[  725.122367]  [<ffffffff815ed773>] ? device_attach+0x23/0xb0
[  725.140992]  [<ffffffff815ed773>] device_attach+0x23/0xb0
[  725.159131]  [<ffffffff815ecba0>] bus_probe_device+0xb0/0xe0
[  725.177055]  [<ffffffff815ea693>] device_add+0x443/0x650
[  725.195558]  [<ffffffff815ea8be>] device_register+0x1e/0x30
[  725.213133]  [<ffffffff81601790>] init_memory_block+0xd0/0xf0
[  725.231533]  [<ffffffff816018f1>] register_new_memory+0xb1/0xd0
[  725.250769]  [<ffffffff81a961cf>] __add_pages+0x13f/0x250
[  725.269642]  [<ffffffff81063770>] ? arch_add_memory+0x70/0xf0
[  725.288764]  [<ffffffff81063770>] arch_add_memory+0x70/0xf0
[  725.306117]  [<ffffffff81a95f8f>] add_memory+0xef/0x1f0
[  725.322466]  [<ffffffffa00293af>] hot_add_req+0x33f/0xf90 [hv_balloon]
[  725.342777]  [<ffffffff8109509f>] process_one_work+0x1df/0x4e0
[  725.361459]  [<ffffffff8109502d>] ? process_one_work+0x16d/0x4e0
[  725.380390]  [<ffffffff810954bb>] worker_thread+0x11b/0x450
[  725.397684]  [<ffffffff810953a0>] ? process_one_work+0x4e0/0x4e0
[  725.416533]  [<ffffffff8109ac33>] kthread+0xf3/0x110
[  725.433372]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
[  725.453749]  [<ffffffff81ab1dfc>] ret_from_fork+0x7c/0xb0
[  725.470994]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
[  725.491469] 6 locks held by kworker/0:1/27:
[  725.505037]  #0:  ("events"){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
[  725.533370]  #1:  ((&dm_device.ha_wrk.wrk)){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
[  725.565580]  #2:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
[  725.594369]  #3:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
[  725.628554]  #4:  (mem_sysfs_mutex){......}, at: [<ffffffff81601873>] register_new_memory+0x33/0xd0
[  725.658519]  #5:  (&dev->mutex){......}, at: [<ffffffff815ed773>] device_attach+0x23/0xb0

Second participant, udev:

[  725.750889] systemd-udevd   D ffff88003b94fc68 14016   888    530 0x00000004
[  725.773767]  ffff88003b94fc68 0000000000000000 ffff8800034949c0 00000000000151c0
[  725.798332]  ffffffff8210d980 ffff88003b94ffd8 00000000000151c0 ffff880037a69270
[  725.822841]  ffff8800034949c0 0000000100000001 ffff8800034949c0 ffffffff81ff2b48
[  725.849184] Call Trace:
[  725.858987]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
[  725.879231]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
[  725.897860]  [<ffffffff811e656f>] ? mem_hotplug_begin+0x4f/0x80
[  725.916698]  [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
[  725.935064]  [<ffffffff811e6525>] ? mem_hotplug_begin+0x5/0x80
[  725.953464]  [<ffffffff81a9631b>] online_pages+0x3b/0x520
[  725.971542]  [<ffffffff815eb0b3>] ? device_online+0x23/0xa0
[  725.989207]  [<ffffffff81601524>] memory_subsys_online+0x64/0xc0
[  726.008513]  [<ffffffff815eb0fd>] device_online+0x6d/0xa0
[  726.025579]  [<ffffffff816012eb>] store_mem_state+0x5b/0xe0
[  726.043400]  [<ffffffff815e8258>] dev_attr_store+0x18/0x30
[  726.060506]  [<ffffffff8127a808>] sysfs_kf_write+0x48/0x60
[  726.077940]  [<ffffffff81279d1b>] kernfs_fop_write+0x13b/0x1a0
[  726.099416]  [<ffffffff811f9f67>] vfs_write+0xb7/0x1f0
[  726.115748]  [<ffffffff811fabf8>] SyS_write+0x58/0xd0
[  726.131933]  [<ffffffff81ab1ea9>] system_call_fastpath+0x12/0x17
[  726.150691] 7 locks held by systemd-udevd/888:
[  726.165044]  #0:  (sb_writers#3){......}, at: [<ffffffff811fa063>] vfs_write+0x1b3/0x1f0
[  726.192422]  #1:  (&of->mutex){......}, at: [<ffffffff81279c46>] kernfs_fop_write+0x66/0x1a0
[  726.220289]  #2:  (s_active#60){......}, at: [<ffffffff81279c4e>] kernfs_fop_write+0x6e/0x1a0
[  726.249382]  #3:  (device_hotplug_lock){......}, at: [<ffffffff815e9c15>] lock_device_hotplug_sysfs+0x15/0x50
[  726.281901]  #4:  (&dev->mutex){......}, at: [<ffffffff815eb0b3>] device_online+0x23/0xa0
[  726.308619]  #5:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
[  726.337994]  #6:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80

In short: onlining grabs device lock and then tries to do mem_hotplug_begin()
while add_memory() is between mem_hotplug_begin() and mem_hotplug_done() and it
tries grabbing device lock.

To my understanding ACPI memory hotplug doesn't have the same issue as
device_hotplug_lock is being grabbed when the ACPI device is added.

Solve the issue by grabbing device_hotplug_lock before doing add_memory(). If
we do that, lock_device_hotplug_sysfs() will cause syscall retry which will
eventually succeed. To support the change we need to export lock_device_hotplug/
unlock_device_hotplug. This approach can be completely wrong though.

Vitaly Kuznetsov (3):
  driver core: export lock_device_hotplug/unlock_device_hotplug
  memory_hotplug: add note about holding device_hotplug_lock and
    add_memory()
  Drivers: hv: balloon: fix deadlock between memory adding and onlining

 drivers/base/core.c     |  2 ++
 drivers/hv/hv_balloon.c | 10 ++++++++++
 mm/memory_hotplug.c     |  6 +++++-
 3 files changed, 17 insertions(+), 1 deletion(-)

-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug
  2015-02-11 15:44 [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining Vitaly Kuznetsov
@ 2015-02-11 15:44 ` Vitaly Kuznetsov
  2015-02-11 20:39   ` Andrew Morton
  2015-02-11 15:44 ` [PATCH 2/3] memory_hotplug: add note about holding device_hotplug_lock and add_memory() Vitaly Kuznetsov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-02-11 15:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Andrew Morton, Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka,
	David Rientjes, Fabian Frederick, Zhang Zhen, Vladimir Davydov,
	Wang Nan
  Cc: linux-kernel, devel, linux-mm

add_memory() is supposed to be run with device_hotplug_lock grabbed, otherwise
it can race with e.g. device_online(). Allow external modules (hv_balloon for
now) to lock device hotplug.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 drivers/base/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 97e2baf..b3073af 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -55,11 +55,13 @@ void lock_device_hotplug(void)
 {
 	mutex_lock(&device_hotplug_lock);
 }
+EXPORT_SYMBOL_GPL(lock_device_hotplug);
 
 void unlock_device_hotplug(void)
 {
 	mutex_unlock(&device_hotplug_lock);
 }
+EXPORT_SYMBOL_GPL(unlock_device_hotplug);
 
 int lock_device_hotplug_sysfs(void)
 {
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug
  2015-02-11 15:44 ` [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug Vitaly Kuznetsov
@ 2015-02-11 20:39   ` Andrew Morton
  2015-02-12  4:04     ` Rafael J. Wysocki
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2015-02-11 20:39 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka, David Rientjes,
	Fabian Frederick, Zhang Zhen, Vladimir Davydov, Wang Nan,
	linux-kernel, devel, linux-mm, Rafael J. Wysocki

On Wed, 11 Feb 2015 16:44:20 +0100 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:

> add_memory() is supposed to be run with device_hotplug_lock grabbed, otherwise
> it can race with e.g. device_online(). Allow external modules (hv_balloon for
> now) to lock device hotplug.
> 
> ...
>
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -55,11 +55,13 @@ void lock_device_hotplug(void)
>  {
>  	mutex_lock(&device_hotplug_lock);
>  }
> +EXPORT_SYMBOL_GPL(lock_device_hotplug);
>  
>  void unlock_device_hotplug(void)
>  {
>  	mutex_unlock(&device_hotplug_lock);
>  }
> +EXPORT_SYMBOL_GPL(unlock_device_hotplug);
>  
>  int lock_device_hotplug_sysfs(void)
>  {

It's kinda crazy that lock_device_hotplug_sysfs() didn't get any
documentation.  I suggest adding this while you're in there:


--- a/drivers/base/core.c~a
+++ a/drivers/base/core.c
@@ -61,6 +61,9 @@ void unlock_device_hotplug(void)
 	mutex_unlock(&device_hotplug_lock);
 }
 
+/*
+ * "git show 5e33bc4165f3ed" for details
+ */
 int lock_device_hotplug_sysfs(void)
 {
 	if (mutex_trylock(&device_hotplug_lock))

which is a bit lazy but whatev.

I'll assume that Greg (or Rafael?) will be processing this patchset.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug
  2015-02-11 20:39   ` Andrew Morton
@ 2015-02-12  4:04     ` Rafael J. Wysocki
  2015-02-12 10:16       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 11+ messages in thread
From: Rafael J. Wysocki @ 2015-02-12  4:04 UTC (permalink / raw)
  To: Andrew Morton, Vitaly Kuznetsov
  Cc: Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka, David Rientjes,
	Fabian Frederick, Zhang Zhen, Vladimir Davydov, Wang Nan,
	linux-kernel, devel, linux-mm

On Wednesday, February 11, 2015 12:39:47 PM Andrew Morton wrote:
> On Wed, 11 Feb 2015 16:44:20 +0100 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> 
> > add_memory() is supposed to be run with device_hotplug_lock grabbed, otherwise
> > it can race with e.g. device_online(). Allow external modules (hv_balloon for
> > now) to lock device hotplug.
> > 
> > ...
> >
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -55,11 +55,13 @@ void lock_device_hotplug(void)
> >  {
> >  	mutex_lock(&device_hotplug_lock);
> >  }
> > +EXPORT_SYMBOL_GPL(lock_device_hotplug);
> >  
> >  void unlock_device_hotplug(void)
> >  {
> >  	mutex_unlock(&device_hotplug_lock);
> >  }
> > +EXPORT_SYMBOL_GPL(unlock_device_hotplug);
> >  
> >  int lock_device_hotplug_sysfs(void)
> >  {
> 
> It's kinda crazy that lock_device_hotplug_sysfs() didn't get any
> documentation.  I suggest adding this while you're in there:
> 
> 
> --- a/drivers/base/core.c~a
> +++ a/drivers/base/core.c
> @@ -61,6 +61,9 @@ void unlock_device_hotplug(void)
>  	mutex_unlock(&device_hotplug_lock);
>  }
>  
> +/*
> + * "git show 5e33bc4165f3ed" for details
> + */
>  int lock_device_hotplug_sysfs(void)
>  {
>  	if (mutex_trylock(&device_hotplug_lock))
> 
> which is a bit lazy but whatev.
> 
> I'll assume that Greg (or Rafael?) will be processing this patchset.

Well, I would do that if I saw it (my address in the CC has been deprecated
for several months now).

Vitaly, can you please resend with a CC to a valid address of mine, please?

Rafael

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug
  2015-02-12  4:04     ` Rafael J. Wysocki
@ 2015-02-12 10:16       ` Vitaly Kuznetsov
  0 siblings, 0 replies; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-02-12 10:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Greg Kroah-Hartman, K. Y. Srinivasan,
	Haiyang Zhang, Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka,
	David Rientjes, Fabian Frederick, Zhang Zhen, Vladimir Davydov,
	Wang Nan, linux-kernel, devel, linux-mm

"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Wednesday, February 11, 2015 12:39:47 PM Andrew Morton wrote:
>> On Wed, 11 Feb 2015 16:44:20 +0100 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>> 
>> > add_memory() is supposed to be run with device_hotplug_lock grabbed, otherwise
>> > it can race with e.g. device_online(). Allow external modules (hv_balloon for
>> > now) to lock device hotplug.
>> > 
>> > ...
>> >
>> > --- a/drivers/base/core.c
>> > +++ b/drivers/base/core.c
>> > @@ -55,11 +55,13 @@ void lock_device_hotplug(void)
>> >  {
>> >  	mutex_lock(&device_hotplug_lock);
>> >  }
>> > +EXPORT_SYMBOL_GPL(lock_device_hotplug);
>> >  
>> >  void unlock_device_hotplug(void)
>> >  {
>> >  	mutex_unlock(&device_hotplug_lock);
>> >  }
>> > +EXPORT_SYMBOL_GPL(unlock_device_hotplug);
>> >  
>> >  int lock_device_hotplug_sysfs(void)
>> >  {
>> 
>> It's kinda crazy that lock_device_hotplug_sysfs() didn't get any
>> documentation.  I suggest adding this while you're in there:
>> 
>> 
>> --- a/drivers/base/core.c~a
>> +++ a/drivers/base/core.c
>> @@ -61,6 +61,9 @@ void unlock_device_hotplug(void)
>>  	mutex_unlock(&device_hotplug_lock);
>>  }
>>  
>> +/*
>> + * "git show 5e33bc4165f3ed" for details
>> + */
>>  int lock_device_hotplug_sysfs(void)
>>  {
>>  	if (mutex_trylock(&device_hotplug_lock))
>> 
>> which is a bit lazy but whatev.
>> 
>> I'll assume that Greg (or Rafael?) will be processing this patchset.
>
> Well, I would do that if I saw it (my address in the CC has been deprecated
> for several months now).
>
> Vitaly, can you please resend with a CC to a valid address of mine,
> please?

Yes, sure, for some reason you were not on the get_maintainer.pl output.

>
> Rafael

-- 
  Vitaly

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/3] memory_hotplug: add note about holding device_hotplug_lock and add_memory()
  2015-02-11 15:44 [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining Vitaly Kuznetsov
  2015-02-11 15:44 ` [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug Vitaly Kuznetsov
@ 2015-02-11 15:44 ` Vitaly Kuznetsov
  2015-02-11 15:44 ` [PATCH 3/3] Drivers: hv: balloon: fix deadlock between memory adding and onlining Vitaly Kuznetsov
  2015-02-11 20:30 ` [PATCH 0/3] memory_hotplug: hyperv: " David Rientjes
  3 siblings, 0 replies; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-02-11 15:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Andrew Morton, Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka,
	David Rientjes, Fabian Frederick, Zhang Zhen, Vladimir Davydov,
	Wang Nan
  Cc: linux-kernel, devel, linux-mm

add_memory() is supposed to be run with device_hotplug_lock grabbed, otherwise
it can race with e.g. device_online(). ACPI memory hotplug does that already
but e.g. Hyper-V ballooning driver doesn't.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 mm/memory_hotplug.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fab107..41638eb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1213,7 +1213,11 @@ int zone_for_memory(int nid, u64 start, u64 size, int zone_default)
 	return zone_default;
 }
 
-/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
+/*
+ * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
+ * and online/offline operations before this call.
+ * We are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG.
+ */
 int __ref add_memory(int nid, u64 start, u64 size)
 {
 	pg_data_t *pgdat = NULL;
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/3] Drivers: hv: balloon: fix deadlock between memory adding and onlining
  2015-02-11 15:44 [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining Vitaly Kuznetsov
  2015-02-11 15:44 ` [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug Vitaly Kuznetsov
  2015-02-11 15:44 ` [PATCH 2/3] memory_hotplug: add note about holding device_hotplug_lock and add_memory() Vitaly Kuznetsov
@ 2015-02-11 15:44 ` Vitaly Kuznetsov
  2015-02-11 20:30 ` [PATCH 0/3] memory_hotplug: hyperv: " David Rientjes
  3 siblings, 0 replies; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-02-11 15:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Andrew Morton, Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka,
	David Rientjes, Fabian Frederick, Zhang Zhen, Vladimir Davydov,
	Wang Nan
  Cc: linux-kernel, devel, linux-mm

If newly added memory is brought online with e.g. udev rule:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
the following deadlock is observed (and easily reproducable):

First participant, worker thread doing add_memory():
...
[  725.491469] 6 locks held by kworker/0:1/27:
[  725.505037]  #0:  ("events"){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
[  725.533370]  #1:  ((&dm_device.ha_wrk.wrk)){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
[  725.565580]  #2:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
[  725.594369]  #3:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
[  725.628554]  #4:  (mem_sysfs_mutex){......}, at: [<ffffffff81601873>] register_new_memory+0x33/0xd0
[  725.658519]  #5:  (&dev->mutex){......}, at: [<ffffffff815ed773>] device_attach+0x23/0xb0

Second participant, udev:
...
[  726.150691] 7 locks held by systemd-udevd/888:
[  726.165044]  #0:  (sb_writers#3){......}, at: [<ffffffff811fa063>] vfs_write+0x1b3/0x1f0
[  726.192422]  #1:  (&of->mutex){......}, at: [<ffffffff81279c46>] kernfs_fop_write+0x66/0x1a0
[  726.220289]  #2:  (s_active#60){......}, at: [<ffffffff81279c4e>] kernfs_fop_write+0x6e/0x1a0
[  726.249382]  #3:  (device_hotplug_lock){......}, at: [<ffffffff815e9c15>] lock_device_hotplug_sysfs+0x15/0x50
[  726.281901]  #4:  (&dev->mutex){......}, at: [<ffffffff815eb0b3>] device_online+0x23/0xa0
[  726.308619]  #5:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
[  726.337994]  #6:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80

Solve the issue bu grabbing device_hotplug_lock before doing add_memory(). If
we do that, lock_device_hotplug_sysfs() will cause syscall retry which will
eventually succeed.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 drivers/hv/hv_balloon.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index b958ded..0af1aa2 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -592,9 +592,19 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 		dm_device.ha_waiting = true;
 
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
+
+		/*
+		 * Grab hotplug lock as we'll be doing device_register() and we
+		 * need to protect against someone (e.g. udev doing memory
+		 * onlining) locking it before we're done.
+		 */
+		lock_device_hotplug();
+
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
 				(HA_CHUNK << PAGE_SHIFT));
 
+		unlock_device_hotplug();
+
 		if (ret) {
 			pr_info("hot_add memory failed error is %d\n", ret);
 			if (ret == -EEXIST) {
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining
  2015-02-11 15:44 [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining Vitaly Kuznetsov
                   ` (2 preceding siblings ...)
  2015-02-11 15:44 ` [PATCH 3/3] Drivers: hv: balloon: fix deadlock between memory adding and onlining Vitaly Kuznetsov
@ 2015-02-11 20:30 ` David Rientjes
  2015-02-12  6:39   ` David Rientjes
  2015-02-12 10:15   ` Vitaly Kuznetsov
  3 siblings, 2 replies; 11+ messages in thread
From: David Rientjes @ 2015-02-11 20:30 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Andrew Morton, Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka,
	Fabian Frederick, Zhang Zhen, Vladimir Davydov, Wang Nan,
	linux-kernel, devel, linux-mm

On Wed, 11 Feb 2015, Vitaly Kuznetsov wrote:

> If newly added memory is brought online with e.g. udev rule:
> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
> the following deadlock is observed (and easily reproducable):
> 
> First participant, worker thread doing add_memory():
> 
> [  724.948846] kworker/0:1     D ffff88000412f9c8 13248    27      2 0x00000000
> [  724.973543] Workqueue: events hot_add_req [hv_balloon]
> [  724.991736]  ffff88000412f9c8 0000000000000000 ffff88003fa1dc30 00000000000151c0
> [  725.019725]  0000000000000246 ffff88000412ffd8 00000000000151c0 ffff88003a77a4e0
> [  725.046486]  ffff88003fa1dc30 00000001032a6000 ffff88003a7ca838 ffff88003a7ca898
> [  725.072969] Call Trace:
> [  725.082690]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
> [  725.103799]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
> [  725.122367]  [<ffffffff815ed773>] ? device_attach+0x23/0xb0
> [  725.140992]  [<ffffffff815ed773>] device_attach+0x23/0xb0
> [  725.159131]  [<ffffffff815ecba0>] bus_probe_device+0xb0/0xe0
> [  725.177055]  [<ffffffff815ea693>] device_add+0x443/0x650
> [  725.195558]  [<ffffffff815ea8be>] device_register+0x1e/0x30
> [  725.213133]  [<ffffffff81601790>] init_memory_block+0xd0/0xf0
> [  725.231533]  [<ffffffff816018f1>] register_new_memory+0xb1/0xd0
> [  725.250769]  [<ffffffff81a961cf>] __add_pages+0x13f/0x250
> [  725.269642]  [<ffffffff81063770>] ? arch_add_memory+0x70/0xf0
> [  725.288764]  [<ffffffff81063770>] arch_add_memory+0x70/0xf0
> [  725.306117]  [<ffffffff81a95f8f>] add_memory+0xef/0x1f0
> [  725.322466]  [<ffffffffa00293af>] hot_add_req+0x33f/0xf90 [hv_balloon]
> [  725.342777]  [<ffffffff8109509f>] process_one_work+0x1df/0x4e0
> [  725.361459]  [<ffffffff8109502d>] ? process_one_work+0x16d/0x4e0
> [  725.380390]  [<ffffffff810954bb>] worker_thread+0x11b/0x450
> [  725.397684]  [<ffffffff810953a0>] ? process_one_work+0x4e0/0x4e0
> [  725.416533]  [<ffffffff8109ac33>] kthread+0xf3/0x110
> [  725.433372]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
> [  725.453749]  [<ffffffff81ab1dfc>] ret_from_fork+0x7c/0xb0
> [  725.470994]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
> [  725.491469] 6 locks held by kworker/0:1/27:
> [  725.505037]  #0:  ("events"){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
> [  725.533370]  #1:  ((&dm_device.ha_wrk.wrk)){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
> [  725.565580]  #2:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
> [  725.594369]  #3:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
> [  725.628554]  #4:  (mem_sysfs_mutex){......}, at: [<ffffffff81601873>] register_new_memory+0x33/0xd0
> [  725.658519]  #5:  (&dev->mutex){......}, at: [<ffffffff815ed773>] device_attach+0x23/0xb0
> 
> Second participant, udev:
> 
> [  725.750889] systemd-udevd   D ffff88003b94fc68 14016   888    530 0x00000004
> [  725.773767]  ffff88003b94fc68 0000000000000000 ffff8800034949c0 00000000000151c0
> [  725.798332]  ffffffff8210d980 ffff88003b94ffd8 00000000000151c0 ffff880037a69270
> [  725.822841]  ffff8800034949c0 0000000100000001 ffff8800034949c0 ffffffff81ff2b48
> [  725.849184] Call Trace:
> [  725.858987]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
> [  725.879231]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
> [  725.897860]  [<ffffffff811e656f>] ? mem_hotplug_begin+0x4f/0x80
> [  725.916698]  [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
> [  725.935064]  [<ffffffff811e6525>] ? mem_hotplug_begin+0x5/0x80
> [  725.953464]  [<ffffffff81a9631b>] online_pages+0x3b/0x520
> [  725.971542]  [<ffffffff815eb0b3>] ? device_online+0x23/0xa0
> [  725.989207]  [<ffffffff81601524>] memory_subsys_online+0x64/0xc0
> [  726.008513]  [<ffffffff815eb0fd>] device_online+0x6d/0xa0
> [  726.025579]  [<ffffffff816012eb>] store_mem_state+0x5b/0xe0
> [  726.043400]  [<ffffffff815e8258>] dev_attr_store+0x18/0x30
> [  726.060506]  [<ffffffff8127a808>] sysfs_kf_write+0x48/0x60
> [  726.077940]  [<ffffffff81279d1b>] kernfs_fop_write+0x13b/0x1a0
> [  726.099416]  [<ffffffff811f9f67>] vfs_write+0xb7/0x1f0
> [  726.115748]  [<ffffffff811fabf8>] SyS_write+0x58/0xd0
> [  726.131933]  [<ffffffff81ab1ea9>] system_call_fastpath+0x12/0x17
> [  726.150691] 7 locks held by systemd-udevd/888:
> [  726.165044]  #0:  (sb_writers#3){......}, at: [<ffffffff811fa063>] vfs_write+0x1b3/0x1f0
> [  726.192422]  #1:  (&of->mutex){......}, at: [<ffffffff81279c46>] kernfs_fop_write+0x66/0x1a0
> [  726.220289]  #2:  (s_active#60){......}, at: [<ffffffff81279c4e>] kernfs_fop_write+0x6e/0x1a0
> [  726.249382]  #3:  (device_hotplug_lock){......}, at: [<ffffffff815e9c15>] lock_device_hotplug_sysfs+0x15/0x50
> [  726.281901]  #4:  (&dev->mutex){......}, at: [<ffffffff815eb0b3>] device_online+0x23/0xa0
> [  726.308619]  #5:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
> [  726.337994]  #6:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
> 
> In short: onlining grabs device lock and then tries to do mem_hotplug_begin()
> while add_memory() is between mem_hotplug_begin() and mem_hotplug_done() and it
> tries grabbing device lock.
> 
> To my understanding ACPI memory hotplug doesn't have the same issue as
> device_hotplug_lock is being grabbed when the ACPI device is added.
> 
> Solve the issue by grabbing device_hotplug_lock before doing add_memory(). If
> we do that, lock_device_hotplug_sysfs() will cause syscall retry which will
> eventually succeed. To support the change we need to export lock_device_hotplug/
> unlock_device_hotplug. This approach can be completely wrong though.

Saying the approach could be completely wrong doesn't inspire a lot of 
confidence.  I assume this output is from the hung task detector, is there 
any other lockdep output that suggests there's a possible deadlock?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining
  2015-02-11 20:30 ` [PATCH 0/3] memory_hotplug: hyperv: " David Rientjes
@ 2015-02-12  6:39   ` David Rientjes
  2015-02-12 15:39     ` Vitaly Kuznetsov
  2015-02-12 10:15   ` Vitaly Kuznetsov
  1 sibling, 1 reply; 11+ messages in thread
From: David Rientjes @ 2015-02-12  6:39 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, K. Y. Srinivasan,
	Haiyang Zhang, Andrew Morton, Yasuaki Ishimatsu, Tang Chen,
	Vlastimil Babka, Fabian Frederick, Zhang Zhen, Vladimir Davydov,
	Wang Nan, linux-kernel, devel, linux-mm

On Wed, 11 Feb 2015, David Rientjes wrote:

> > If newly added memory is brought online with e.g. udev rule:
> > SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
> > the following deadlock is observed (and easily reproducable):
> > 
> > First participant, worker thread doing add_memory():
> > 
> > [  724.948846] kworker/0:1     D ffff88000412f9c8 13248    27      2 0x00000000
> > [  724.973543] Workqueue: events hot_add_req [hv_balloon]
> > [  724.991736]  ffff88000412f9c8 0000000000000000 ffff88003fa1dc30 00000000000151c0
> > [  725.019725]  0000000000000246 ffff88000412ffd8 00000000000151c0 ffff88003a77a4e0
> > [  725.046486]  ffff88003fa1dc30 00000001032a6000 ffff88003a7ca838 ffff88003a7ca898
> > [  725.072969] Call Trace:
> > [  725.082690]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
> > [  725.103799]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
> > [  725.122367]  [<ffffffff815ed773>] ? device_attach+0x23/0xb0
> > [  725.140992]  [<ffffffff815ed773>] device_attach+0x23/0xb0
> > [  725.159131]  [<ffffffff815ecba0>] bus_probe_device+0xb0/0xe0
> > [  725.177055]  [<ffffffff815ea693>] device_add+0x443/0x650
> > [  725.195558]  [<ffffffff815ea8be>] device_register+0x1e/0x30
> > [  725.213133]  [<ffffffff81601790>] init_memory_block+0xd0/0xf0
> > [  725.231533]  [<ffffffff816018f1>] register_new_memory+0xb1/0xd0
> > [  725.250769]  [<ffffffff81a961cf>] __add_pages+0x13f/0x250
> > [  725.269642]  [<ffffffff81063770>] ? arch_add_memory+0x70/0xf0
> > [  725.288764]  [<ffffffff81063770>] arch_add_memory+0x70/0xf0
> > [  725.306117]  [<ffffffff81a95f8f>] add_memory+0xef/0x1f0
> > [  725.322466]  [<ffffffffa00293af>] hot_add_req+0x33f/0xf90 [hv_balloon]
> > [  725.342777]  [<ffffffff8109509f>] process_one_work+0x1df/0x4e0
> > [  725.361459]  [<ffffffff8109502d>] ? process_one_work+0x16d/0x4e0
> > [  725.380390]  [<ffffffff810954bb>] worker_thread+0x11b/0x450
> > [  725.397684]  [<ffffffff810953a0>] ? process_one_work+0x4e0/0x4e0
> > [  725.416533]  [<ffffffff8109ac33>] kthread+0xf3/0x110
> > [  725.433372]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
> > [  725.453749]  [<ffffffff81ab1dfc>] ret_from_fork+0x7c/0xb0
> > [  725.470994]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
> > [  725.491469] 6 locks held by kworker/0:1/27:
> > [  725.505037]  #0:  ("events"){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
> > [  725.533370]  #1:  ((&dm_device.ha_wrk.wrk)){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
> > [  725.565580]  #2:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
> > [  725.594369]  #3:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
> > [  725.628554]  #4:  (mem_sysfs_mutex){......}, at: [<ffffffff81601873>] register_new_memory+0x33/0xd0
> > [  725.658519]  #5:  (&dev->mutex){......}, at: [<ffffffff815ed773>] device_attach+0x23/0xb0
> > 
> > Second participant, udev:
> > 
> > [  725.750889] systemd-udevd   D ffff88003b94fc68 14016   888    530 0x00000004
> > [  725.773767]  ffff88003b94fc68 0000000000000000 ffff8800034949c0 00000000000151c0
> > [  725.798332]  ffffffff8210d980 ffff88003b94ffd8 00000000000151c0 ffff880037a69270
> > [  725.822841]  ffff8800034949c0 0000000100000001 ffff8800034949c0 ffffffff81ff2b48
> > [  725.849184] Call Trace:
> > [  725.858987]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
> > [  725.879231]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
> > [  725.897860]  [<ffffffff811e656f>] ? mem_hotplug_begin+0x4f/0x80
> > [  725.916698]  [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
> > [  725.935064]  [<ffffffff811e6525>] ? mem_hotplug_begin+0x5/0x80
> > [  725.953464]  [<ffffffff81a9631b>] online_pages+0x3b/0x520
> > [  725.971542]  [<ffffffff815eb0b3>] ? device_online+0x23/0xa0
> > [  725.989207]  [<ffffffff81601524>] memory_subsys_online+0x64/0xc0
> > [  726.008513]  [<ffffffff815eb0fd>] device_online+0x6d/0xa0
> > [  726.025579]  [<ffffffff816012eb>] store_mem_state+0x5b/0xe0
> > [  726.043400]  [<ffffffff815e8258>] dev_attr_store+0x18/0x30
> > [  726.060506]  [<ffffffff8127a808>] sysfs_kf_write+0x48/0x60
> > [  726.077940]  [<ffffffff81279d1b>] kernfs_fop_write+0x13b/0x1a0
> > [  726.099416]  [<ffffffff811f9f67>] vfs_write+0xb7/0x1f0
> > [  726.115748]  [<ffffffff811fabf8>] SyS_write+0x58/0xd0
> > [  726.131933]  [<ffffffff81ab1ea9>] system_call_fastpath+0x12/0x17
> > [  726.150691] 7 locks held by systemd-udevd/888:
> > [  726.165044]  #0:  (sb_writers#3){......}, at: [<ffffffff811fa063>] vfs_write+0x1b3/0x1f0
> > [  726.192422]  #1:  (&of->mutex){......}, at: [<ffffffff81279c46>] kernfs_fop_write+0x66/0x1a0
> > [  726.220289]  #2:  (s_active#60){......}, at: [<ffffffff81279c4e>] kernfs_fop_write+0x6e/0x1a0
> > [  726.249382]  #3:  (device_hotplug_lock){......}, at: [<ffffffff815e9c15>] lock_device_hotplug_sysfs+0x15/0x50
> > [  726.281901]  #4:  (&dev->mutex){......}, at: [<ffffffff815eb0b3>] device_online+0x23/0xa0
> > [  726.308619]  #5:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
> > [  726.337994]  #6:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
> > 
> > In short: onlining grabs device lock and then tries to do mem_hotplug_begin()
> > while add_memory() is between mem_hotplug_begin() and mem_hotplug_done() and it
> > tries grabbing device lock.
> > 
> > To my understanding ACPI memory hotplug doesn't have the same issue as
> > device_hotplug_lock is being grabbed when the ACPI device is added.
> > 
> > Solve the issue by grabbing device_hotplug_lock before doing add_memory(). If
> > we do that, lock_device_hotplug_sysfs() will cause syscall retry which will
> > eventually succeed. To support the change we need to export lock_device_hotplug/
> > unlock_device_hotplug. This approach can be completely wrong though.
> 
> Saying the approach could be completely wrong doesn't inspire a lot of 
> confidence.  I assume this output is from the hung task detector, is there 
> any other lockdep output that suggests there's a possible deadlock?
> 

Ok, I looked at this and the problem is that kworker/0 is onlining memory 
and serializes memory hot-add with mem_hotplug_begin() before registering 
the new memory block.  This is the appropriate lock ordering, we want to 
do mem_hotplug_begin() before device_lock(dev) which takes dev->mutex 
since we must disallow concurrent hot-add events from looking up the same 
memory block.

The issue only arises when systemd-udevd takes device_lock(dev) to 
transition a memory block from offline to online and 
memory_subsys_online() callbacks require mem_hotplug_begin().

Understanding this is pretty simple: in the kworker/0 case, we must create 
a memory block and add the range by probing; in the systemd-udevd case, we 
already have a memory block and need to transition its state.

Your approach to resolve this dependency is to serialize all of this with 
device_hotplug_lock so that only one thread can be doing 
mem_hotplug_begin() -> device_lock() or device_lock() -> 
mem_hotplug_begin() at a time.  I don't think resolving a locking 
dependency is appropriate by just serializing them with another lock; 
rather, I think the solution is to truly make one lock depend on the other 
for memory hotplug.

I already mentioned that the appropriate lock ordering is 
mem_hotplug_begin() -> device_lock() since we can't possibly know the 
device that we are onlining for probe (we must create a new device, it 
didn't exist before probe).

So all we need to do is require store_mem_state() to take 
mem_hotplug_begin() before doing device_online() and requiring all  
memory_subsys_online() callbacks to assume the protection, which they 
already require anyway.

Could you try this patch out instead of your series?  I did it for memory 
hot-remove as well just to simplify the dependency, but it would also be 
possible to just do mem_hotplug_begin() when onlining since we already 
have the memory block registered for hot-remove.  It's simpler this way.
---
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -219,6 +219,7 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
+ * Must already be protected by mem_hotplug_begin().
  */
 static int
 memory_block_action(unsigned long phys_index, unsigned long action, int online_type)
@@ -286,6 +287,7 @@ static int memory_subsys_online(struct device *dev)
 	if (mem->online_type < 0)
 		mem->online_type = MMOP_ONLINE_KEEP;
 
+	/* Already under protection of mem_hotplug_begin() */
 	ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
 
 	/* clear online_type */
@@ -328,17 +330,19 @@ store_mem_state(struct device *dev,
 		goto err;
 	}
 
+	/*
+	 * Memory hotplug needs to hold mem_hotplug_begin() for probe to find
+	 * the correct memory block to online before doing device_online(dev),
+	 * which will take dev->mutex.  Take the lock early to prevent an
+	 * inversion, memory_subsys_online() callbacks will be implemented by
+	 * assuming it's already protected.
+	 */
+	mem_hotplug_begin();
+
 	switch (online_type) {
 	case MMOP_ONLINE_KERNEL:
 	case MMOP_ONLINE_MOVABLE:
 	case MMOP_ONLINE_KEEP:
-		/*
-		 * mem->online_type is not protected so there can be a
-		 * race here.  However, when racing online, the first
-		 * will succeed and the second will just return as the
-		 * block will already be online.  The online type
-		 * could be either one, but that is expected.
-		 */
 		mem->online_type = online_type;
 		ret = device_online(&mem->dev);
 		break;
@@ -349,6 +353,8 @@ store_mem_state(struct device *dev,
 		ret = -EINVAL; /* should never happen */
 	}
 
+	mem_hotplug_done();
+
 err:
 	unlock_device_hotplug();
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -192,6 +192,9 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page,
 void get_online_mems(void);
 void put_online_mems(void);
 
+void mem_hotplug_begin(void);
+void mem_hotplug_done(void);
+
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
  * Stub functions for when hotplug is off
@@ -231,6 +234,9 @@ static inline int try_online_node(int nid)
 static inline void get_online_mems(void) {}
 static inline void put_online_mems(void) {}
 
+static inline void mem_hotplug_begin(void) {}
+static inline void mem_hotplug_done(void) {}
+
 #endif /* ! CONFIG_MEMORY_HOTPLUG */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -104,7 +104,7 @@ void put_online_mems(void)
 
 }
 
-static void mem_hotplug_begin(void)
+void mem_hotplug_begin(void)
 {
 	mem_hotplug.active_writer = current;
 
@@ -119,7 +119,7 @@ static void mem_hotplug_begin(void)
 	}
 }
 
-static void mem_hotplug_done(void)
+void mem_hotplug_done(void)
 {
 	mem_hotplug.active_writer = NULL;
 	mutex_unlock(&mem_hotplug.lock);
@@ -959,6 +959,7 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 }
 
 
+/* Must be protected by mem_hotplug_begin() */
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
 {
 	unsigned long flags;
@@ -969,7 +970,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	int ret;
 	struct memory_notify arg;
 
-	mem_hotplug_begin();
 	/*
 	 * This doesn't need a lock to do pfn_to_page().
 	 * The section can't be removed here because of the
@@ -977,21 +977,20 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	 */
 	zone = page_zone(pfn_to_page(pfn));
 
-	ret = -EINVAL;
 	if ((zone_idx(zone) > ZONE_NORMAL ||
 	    online_type == MMOP_ONLINE_MOVABLE) &&
 	    !can_online_high_movable(zone))
-		goto out;
+		return -EINVAL;
 
 	if (online_type == MMOP_ONLINE_KERNEL &&
 	    zone_idx(zone) == ZONE_MOVABLE) {
 		if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages))
-			goto out;
+			return -EINVAL;
 	}
 	if (online_type == MMOP_ONLINE_MOVABLE &&
 	    zone_idx(zone) == ZONE_MOVABLE - 1) {
 		if (move_pfn_range_right(zone, zone + 1, pfn, pfn + nr_pages))
-			goto out;
+			return -EINVAL;
 	}
 
 	/* Previous code may changed the zone of the pfn range */
@@ -1007,7 +1006,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	ret = notifier_to_errno(ret);
 	if (ret) {
 		memory_notify(MEM_CANCEL_ONLINE, &arg);
-		goto out;
+		return ret;
 	}
 	/*
 	 * If this zone is not populated, then it is not in zonelist.
@@ -1031,7 +1030,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 		       (((unsigned long long) pfn + nr_pages)
 			    << PAGE_SHIFT) - 1);
 		memory_notify(MEM_CANCEL_ONLINE, &arg);
-		goto out;
+		return ret;
 	}
 
 	zone->present_pages += onlined_pages;
@@ -1061,9 +1060,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 
 	if (onlined_pages)
 		memory_notify(MEM_ONLINE, &arg);
-out:
-	mem_hotplug_done();
-	return ret;
+
+	return 0;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
@@ -1684,21 +1682,18 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	if (!test_pages_in_a_zone(start_pfn, end_pfn))
 		return -EINVAL;
 
-	mem_hotplug_begin();
-
 	zone = page_zone(pfn_to_page(start_pfn));
 	node = zone_to_nid(zone);
 	nr_pages = end_pfn - start_pfn;
 
-	ret = -EINVAL;
 	if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages))
-		goto out;
+		return -EINVAL;
 
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
 				       MIGRATE_MOVABLE, true);
 	if (ret)
-		goto out;
+		return ret;
 
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
@@ -1791,7 +1786,6 @@ repeat:
 	writeback_set_ratelimit();
 
 	memory_notify(MEM_OFFLINE, &arg);
-	mem_hotplug_done();
 	return 0;
 
 failed_removal:
@@ -1801,12 +1795,10 @@ failed_removal:
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 	/* pushback to free area */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
-
-out:
-	mem_hotplug_done();
 	return ret;
 }
 
+/* Must be protected by mem_hotplug_begin() */
 int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining
  2015-02-12  6:39   ` David Rientjes
@ 2015-02-12 15:39     ` Vitaly Kuznetsov
  0 siblings, 0 replies; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-02-12 15:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, K. Y. Srinivasan,
	Haiyang Zhang, Andrew Morton, Yasuaki Ishimatsu, Tang Chen,
	Vlastimil Babka, Fabian Frederick, Zhang Zhen, Vladimir Davydov,
	Wang Nan, linux-kernel, devel, linux-mm

David Rientjes <rientjes@google.com> writes:

> On Wed, 11 Feb 2015, David Rientjes wrote:
>
>> > If newly added memory is brought online with e.g. udev rule:
>> > SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
>> > the following deadlock is observed (and easily reproducable):
>> > 
>> > First participant, worker thread doing add_memory():
>> > 
>> > [  724.948846] kworker/0:1     D ffff88000412f9c8 13248    27      2 0x00000000
>> > [  724.973543] Workqueue: events hot_add_req [hv_balloon]
>> > [  724.991736]  ffff88000412f9c8 0000000000000000 ffff88003fa1dc30 00000000000151c0
>> > [  725.019725]  0000000000000246 ffff88000412ffd8 00000000000151c0 ffff88003a77a4e0
>> > [  725.046486]  ffff88003fa1dc30 00000001032a6000 ffff88003a7ca838 ffff88003a7ca898
>> > [  725.072969] Call Trace:
>> > [  725.082690]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
>> > [  725.103799]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
>> > [  725.122367]  [<ffffffff815ed773>] ? device_attach+0x23/0xb0
>> > [  725.140992]  [<ffffffff815ed773>] device_attach+0x23/0xb0
>> > [  725.159131]  [<ffffffff815ecba0>] bus_probe_device+0xb0/0xe0
>> > [  725.177055]  [<ffffffff815ea693>] device_add+0x443/0x650
>> > [  725.195558]  [<ffffffff815ea8be>] device_register+0x1e/0x30
>> > [  725.213133]  [<ffffffff81601790>] init_memory_block+0xd0/0xf0
>> > [  725.231533]  [<ffffffff816018f1>] register_new_memory+0xb1/0xd0
>> > [  725.250769]  [<ffffffff81a961cf>] __add_pages+0x13f/0x250
>> > [  725.269642]  [<ffffffff81063770>] ? arch_add_memory+0x70/0xf0
>> > [  725.288764]  [<ffffffff81063770>] arch_add_memory+0x70/0xf0
>> > [  725.306117]  [<ffffffff81a95f8f>] add_memory+0xef/0x1f0
>> > [  725.322466]  [<ffffffffa00293af>] hot_add_req+0x33f/0xf90 [hv_balloon]
>> > [  725.342777]  [<ffffffff8109509f>] process_one_work+0x1df/0x4e0
>> > [  725.361459]  [<ffffffff8109502d>] ? process_one_work+0x16d/0x4e0
>> > [  725.380390]  [<ffffffff810954bb>] worker_thread+0x11b/0x450
>> > [  725.397684]  [<ffffffff810953a0>] ? process_one_work+0x4e0/0x4e0
>> > [  725.416533]  [<ffffffff8109ac33>] kthread+0xf3/0x110
>> > [  725.433372]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
>> > [  725.453749]  [<ffffffff81ab1dfc>] ret_from_fork+0x7c/0xb0
>> > [  725.470994]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
>> > [  725.491469] 6 locks held by kworker/0:1/27:
>> > [  725.505037]  #0:  ("events"){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
>> > [  725.533370]  #1:  ((&dm_device.ha_wrk.wrk)){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
>> > [  725.565580]  #2:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
>> > [  725.594369]  #3:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
>> > [  725.628554]  #4:  (mem_sysfs_mutex){......}, at: [<ffffffff81601873>] register_new_memory+0x33/0xd0
>> > [  725.658519]  #5:  (&dev->mutex){......}, at: [<ffffffff815ed773>] device_attach+0x23/0xb0
>> > 
>> > Second participant, udev:
>> > 
>> > [  725.750889] systemd-udevd   D ffff88003b94fc68 14016   888    530 0x00000004
>> > [  725.773767]  ffff88003b94fc68 0000000000000000 ffff8800034949c0 00000000000151c0
>> > [  725.798332]  ffffffff8210d980 ffff88003b94ffd8 00000000000151c0 ffff880037a69270
>> > [  725.822841]  ffff8800034949c0 0000000100000001 ffff8800034949c0 ffffffff81ff2b48
>> > [  725.849184] Call Trace:
>> > [  725.858987]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
>> > [  725.879231]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
>> > [  725.897860]  [<ffffffff811e656f>] ? mem_hotplug_begin+0x4f/0x80
>> > [  725.916698]  [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
>> > [  725.935064]  [<ffffffff811e6525>] ? mem_hotplug_begin+0x5/0x80
>> > [  725.953464]  [<ffffffff81a9631b>] online_pages+0x3b/0x520
>> > [  725.971542]  [<ffffffff815eb0b3>] ? device_online+0x23/0xa0
>> > [  725.989207]  [<ffffffff81601524>] memory_subsys_online+0x64/0xc0
>> > [  726.008513]  [<ffffffff815eb0fd>] device_online+0x6d/0xa0
>> > [  726.025579]  [<ffffffff816012eb>] store_mem_state+0x5b/0xe0
>> > [  726.043400]  [<ffffffff815e8258>] dev_attr_store+0x18/0x30
>> > [  726.060506]  [<ffffffff8127a808>] sysfs_kf_write+0x48/0x60
>> > [  726.077940]  [<ffffffff81279d1b>] kernfs_fop_write+0x13b/0x1a0
>> > [  726.099416]  [<ffffffff811f9f67>] vfs_write+0xb7/0x1f0
>> > [  726.115748]  [<ffffffff811fabf8>] SyS_write+0x58/0xd0
>> > [  726.131933]  [<ffffffff81ab1ea9>] system_call_fastpath+0x12/0x17
>> > [  726.150691] 7 locks held by systemd-udevd/888:
>> > [  726.165044]  #0:  (sb_writers#3){......}, at: [<ffffffff811fa063>] vfs_write+0x1b3/0x1f0
>> > [  726.192422]  #1:  (&of->mutex){......}, at: [<ffffffff81279c46>] kernfs_fop_write+0x66/0x1a0
>> > [  726.220289]  #2:  (s_active#60){......}, at: [<ffffffff81279c4e>] kernfs_fop_write+0x6e/0x1a0
>> > [  726.249382]  #3:  (device_hotplug_lock){......}, at: [<ffffffff815e9c15>] lock_device_hotplug_sysfs+0x15/0x50
>> > [  726.281901]  #4:  (&dev->mutex){......}, at: [<ffffffff815eb0b3>] device_online+0x23/0xa0
>> > [  726.308619]  #5:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
>> > [  726.337994]  #6:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
>> > 
>> > In short: onlining grabs device lock and then tries to do mem_hotplug_begin()
>> > while add_memory() is between mem_hotplug_begin() and mem_hotplug_done() and it
>> > tries grabbing device lock.
>> > 
>> > To my understanding ACPI memory hotplug doesn't have the same issue as
>> > device_hotplug_lock is being grabbed when the ACPI device is added.
>> > 
>> > Solve the issue by grabbing device_hotplug_lock before doing add_memory(). If
>> > we do that, lock_device_hotplug_sysfs() will cause syscall retry which will
>> > eventually succeed. To support the change we need to export lock_device_hotplug/
>> > unlock_device_hotplug. This approach can be completely wrong though.
>> 
>> Saying the approach could be completely wrong doesn't inspire a lot of 
>> confidence.  I assume this output is from the hung task detector, is there 
>> any other lockdep output that suggests there's a possible deadlock?
>> 
>
> Ok, I looked at this and the problem is that kworker/0 is onlining memory 
> and serializes memory hot-add with mem_hotplug_begin() before registering 
> the new memory block.  This is the appropriate lock ordering, we want to 
> do mem_hotplug_begin() before device_lock(dev) which takes dev->mutex 
> since we must disallow concurrent hot-add events from looking up the same 
> memory block.
>
> The issue only arises when systemd-udevd takes device_lock(dev) to 
> transition a memory block from offline to online and 
> memory_subsys_online() callbacks require mem_hotplug_begin().
>
> Understanding this is pretty simple: in the kworker/0 case, we must create 
> a memory block and add the range by probing; in the systemd-udevd case, we 
> already have a memory block and need to transition its state.
>
> Your approach to resolve this dependency is to serialize all of this with 
> device_hotplug_lock so that only one thread can be doing 
> mem_hotplug_begin() -> device_lock() or device_lock() -> 
> mem_hotplug_begin() at a time.  I don't think resolving a locking 
> dependency is appropriate by just serializing them with another lock; 
> rather, I think the solution is to truly make one lock depend on the other 
> for memory hotplug.
>
> I already mentioned that the appropriate lock ordering is 
> mem_hotplug_begin() -> device_lock() since we can't possibly know the 
> device that we are onlining for probe (we must create a new device, it 
> didn't exist before probe).
>
> So all we need to do is require store_mem_state() to take 
> mem_hotplug_begin() before doing device_online() and requiring all  
> memory_subsys_online() callbacks to assume the protection, which they 
> already require anyway.
>
> Could you try this patch out instead of your series?  I did it for memory 
> hot-remove as well just to simplify the dependency, but it would also be 
> possible to just do mem_hotplug_begin() when onlining since we already 
> have the memory block registered for hot-remove.  It's simpler this
> way.

Thanks, I tested your patch and it also solves the issue. Haven't tested
hotremove though (as it is not supported by Hyper-V).

I also agree this approach is better. Please let me know in case you
want me to send it out.

> ---
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -219,6 +219,7 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
>  /*
>   * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
>   * OK to have direct references to sparsemem variables in here.
> + * Must already be protected by mem_hotplug_begin().
>   */
>  static int
>  memory_block_action(unsigned long phys_index, unsigned long action, int online_type)
> @@ -286,6 +287,7 @@ static int memory_subsys_online(struct device *dev)
>  	if (mem->online_type < 0)
>  		mem->online_type = MMOP_ONLINE_KEEP;
>
> +	/* Already under protection of mem_hotplug_begin() */
>  	ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
>
>  	/* clear online_type */
> @@ -328,17 +330,19 @@ store_mem_state(struct device *dev,
>  		goto err;
>  	}
>
> +	/*
> +	 * Memory hotplug needs to hold mem_hotplug_begin() for probe to find
> +	 * the correct memory block to online before doing device_online(dev),
> +	 * which will take dev->mutex.  Take the lock early to prevent an
> +	 * inversion, memory_subsys_online() callbacks will be implemented by
> +	 * assuming it's already protected.
> +	 */
> +	mem_hotplug_begin();
> +
>  	switch (online_type) {
>  	case MMOP_ONLINE_KERNEL:
>  	case MMOP_ONLINE_MOVABLE:
>  	case MMOP_ONLINE_KEEP:
> -		/*
> -		 * mem->online_type is not protected so there can be a
> -		 * race here.  However, when racing online, the first
> -		 * will succeed and the second will just return as the
> -		 * block will already be online.  The online type
> -		 * could be either one, but that is expected.
> -		 */
>  		mem->online_type = online_type;
>  		ret = device_online(&mem->dev);
>  		break;
> @@ -349,6 +353,8 @@ store_mem_state(struct device *dev,
>  		ret = -EINVAL; /* should never happen */
>  	}
>
> +	mem_hotplug_done();
> +
>  err:
>  	unlock_device_hotplug();
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -192,6 +192,9 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page,
>  void get_online_mems(void);
>  void put_online_mems(void);
>
> +void mem_hotplug_begin(void);
> +void mem_hotplug_done(void);
> +
>  #else /* ! CONFIG_MEMORY_HOTPLUG */
>  /*
>   * Stub functions for when hotplug is off
> @@ -231,6 +234,9 @@ static inline int try_online_node(int nid)
>  static inline void get_online_mems(void) {}
>  static inline void put_online_mems(void) {}
>
> +static inline void mem_hotplug_begin(void) {}
> +static inline void mem_hotplug_done(void) {}
> +
>  #endif /* ! CONFIG_MEMORY_HOTPLUG */
>
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -104,7 +104,7 @@ void put_online_mems(void)
>
>  }
>
> -static void mem_hotplug_begin(void)
> +void mem_hotplug_begin(void)
>  {
>  	mem_hotplug.active_writer = current;
>
> @@ -119,7 +119,7 @@ static void mem_hotplug_begin(void)
>  	}
>  }
>
> -static void mem_hotplug_done(void)
> +void mem_hotplug_done(void)
>  {
>  	mem_hotplug.active_writer = NULL;
>  	mutex_unlock(&mem_hotplug.lock);
> @@ -959,6 +959,7 @@ static void node_states_set_node(int node, struct memory_notify *arg)
>  }
>
> +/* Must be protected by mem_hotplug_begin() */
>  int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
>  {
>  	unsigned long flags;
> @@ -969,7 +970,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	int ret;
>  	struct memory_notify arg;
>
> -	mem_hotplug_begin();
>  	/*
>  	 * This doesn't need a lock to do pfn_to_page().
>  	 * The section can't be removed here because of the
> @@ -977,21 +977,20 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	 */
>  	zone = page_zone(pfn_to_page(pfn));
>
> -	ret = -EINVAL;
>  	if ((zone_idx(zone) > ZONE_NORMAL ||
>  	    online_type == MMOP_ONLINE_MOVABLE) &&
>  	    !can_online_high_movable(zone))
> -		goto out;
> +		return -EINVAL;
>
>  	if (online_type == MMOP_ONLINE_KERNEL &&
>  	    zone_idx(zone) == ZONE_MOVABLE) {
>  		if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages))
> -			goto out;
> +			return -EINVAL;
>  	}
>  	if (online_type == MMOP_ONLINE_MOVABLE &&
>  	    zone_idx(zone) == ZONE_MOVABLE - 1) {
>  		if (move_pfn_range_right(zone, zone + 1, pfn, pfn + nr_pages))
> -			goto out;
> +			return -EINVAL;
>  	}
>
>  	/* Previous code may changed the zone of the pfn range */
> @@ -1007,7 +1006,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	ret = notifier_to_errno(ret);
>  	if (ret) {
>  		memory_notify(MEM_CANCEL_ONLINE, &arg);
> -		goto out;
> +		return ret;
>  	}
>  	/*
>  	 * If this zone is not populated, then it is not in zonelist.
> @@ -1031,7 +1030,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  		       (((unsigned long long) pfn + nr_pages)
>  			    << PAGE_SHIFT) - 1);
>  		memory_notify(MEM_CANCEL_ONLINE, &arg);
> -		goto out;
> +		return ret;
>  	}
>
>  	zone->present_pages += onlined_pages;
> @@ -1061,9 +1060,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>
>  	if (onlined_pages)
>  		memory_notify(MEM_ONLINE, &arg);
> -out:
> -	mem_hotplug_done();
> -	return ret;
> +
> +	return 0;
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
>
> @@ -1684,21 +1682,18 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	if (!test_pages_in_a_zone(start_pfn, end_pfn))
>  		return -EINVAL;
>
> -	mem_hotplug_begin();
> -
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	node = zone_to_nid(zone);
>  	nr_pages = end_pfn - start_pfn;
>
> -	ret = -EINVAL;
>  	if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages))
> -		goto out;
> +		return -EINVAL;
>
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE, true);
>  	if (ret)
> -		goto out;
> +		return ret;
>
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
> @@ -1791,7 +1786,6 @@ repeat:
>  	writeback_set_ratelimit();
>
>  	memory_notify(MEM_OFFLINE, &arg);
> -	mem_hotplug_done();
>  	return 0;
>
>  failed_removal:
> @@ -1801,12 +1795,10 @@ failed_removal:
>  	memory_notify(MEM_CANCEL_OFFLINE, &arg);
>  	/* pushback to free area */
>  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> -
> -out:
> -	mem_hotplug_done();
>  	return ret;
>  }
>
> +/* Must be protected by mem_hotplug_begin() */
>  int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
>  	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);

-- 
  Vitaly

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining
  2015-02-11 20:30 ` [PATCH 0/3] memory_hotplug: hyperv: " David Rientjes
  2015-02-12  6:39   ` David Rientjes
@ 2015-02-12 10:15   ` Vitaly Kuznetsov
  1 sibling, 0 replies; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-02-12 10:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Andrew Morton, Yasuaki Ishimatsu, Tang Chen, Vlastimil Babka,
	Fabian Frederick, Zhang Zhen, Vladimir Davydov, Wang Nan,
	linux-kernel, devel, linux-mm

David Rientjes <rientjes@google.com> writes:

> On Wed, 11 Feb 2015, Vitaly Kuznetsov wrote:
>
>> If newly added memory is brought online with e.g. udev rule:
>> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
>> the following deadlock is observed (and easily reproducable):
>> 
>> First participant, worker thread doing add_memory():
>> 
>> [  724.948846] kworker/0:1     D ffff88000412f9c8 13248    27      2 0x00000000
>> [  724.973543] Workqueue: events hot_add_req [hv_balloon]
>> [  724.991736]  ffff88000412f9c8 0000000000000000 ffff88003fa1dc30 00000000000151c0
>> [  725.019725]  0000000000000246 ffff88000412ffd8 00000000000151c0 ffff88003a77a4e0
>> [  725.046486]  ffff88003fa1dc30 00000001032a6000 ffff88003a7ca838 ffff88003a7ca898
>> [  725.072969] Call Trace:
>> [  725.082690]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
>> [  725.103799]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
>> [  725.122367]  [<ffffffff815ed773>] ? device_attach+0x23/0xb0
>> [  725.140992]  [<ffffffff815ed773>] device_attach+0x23/0xb0
>> [  725.159131]  [<ffffffff815ecba0>] bus_probe_device+0xb0/0xe0
>> [  725.177055]  [<ffffffff815ea693>] device_add+0x443/0x650
>> [  725.195558]  [<ffffffff815ea8be>] device_register+0x1e/0x30
>> [  725.213133]  [<ffffffff81601790>] init_memory_block+0xd0/0xf0
>> [  725.231533]  [<ffffffff816018f1>] register_new_memory+0xb1/0xd0
>> [  725.250769]  [<ffffffff81a961cf>] __add_pages+0x13f/0x250
>> [  725.269642]  [<ffffffff81063770>] ? arch_add_memory+0x70/0xf0
>> [  725.288764]  [<ffffffff81063770>] arch_add_memory+0x70/0xf0
>> [  725.306117]  [<ffffffff81a95f8f>] add_memory+0xef/0x1f0
>> [  725.322466]  [<ffffffffa00293af>] hot_add_req+0x33f/0xf90 [hv_balloon]
>> [  725.342777]  [<ffffffff8109509f>] process_one_work+0x1df/0x4e0
>> [  725.361459]  [<ffffffff8109502d>] ? process_one_work+0x16d/0x4e0
>> [  725.380390]  [<ffffffff810954bb>] worker_thread+0x11b/0x450
>> [  725.397684]  [<ffffffff810953a0>] ? process_one_work+0x4e0/0x4e0
>> [  725.416533]  [<ffffffff8109ac33>] kthread+0xf3/0x110
>> [  725.433372]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
>> [  725.453749]  [<ffffffff81ab1dfc>] ret_from_fork+0x7c/0xb0
>> [  725.470994]  [<ffffffff8109ab40>] ? kthread_create_on_node+0x240/0x240
>> [  725.491469] 6 locks held by kworker/0:1/27:
>> [  725.505037]  #0:  ("events"){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
>> [  725.533370]  #1:  ((&dm_device.ha_wrk.wrk)){......}, at: [<ffffffff8109502d>] process_one_work+0x16d/0x4e0
>> [  725.565580]  #2:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
>> [  725.594369]  #3:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
>> [  725.628554]  #4:  (mem_sysfs_mutex){......}, at: [<ffffffff81601873>] register_new_memory+0x33/0xd0
>> [  725.658519]  #5:  (&dev->mutex){......}, at: [<ffffffff815ed773>] device_attach+0x23/0xb0
>> 
>> Second participant, udev:
>> 
>> [  725.750889] systemd-udevd   D ffff88003b94fc68 14016   888    530 0x00000004
>> [  725.773767]  ffff88003b94fc68 0000000000000000 ffff8800034949c0 00000000000151c0
>> [  725.798332]  ffffffff8210d980 ffff88003b94ffd8 00000000000151c0 ffff880037a69270
>> [  725.822841]  ffff8800034949c0 0000000100000001 ffff8800034949c0 ffffffff81ff2b48
>> [  725.849184] Call Trace:
>> [  725.858987]  [<ffffffff81aac0a9>] schedule_preempt_disabled+0x29/0x70
>> [  725.879231]  [<ffffffff81aae33b>] mutex_lock_nested+0x14b/0x470
>> [  725.897860]  [<ffffffff811e656f>] ? mem_hotplug_begin+0x4f/0x80
>> [  725.916698]  [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
>> [  725.935064]  [<ffffffff811e6525>] ? mem_hotplug_begin+0x5/0x80
>> [  725.953464]  [<ffffffff81a9631b>] online_pages+0x3b/0x520
>> [  725.971542]  [<ffffffff815eb0b3>] ? device_online+0x23/0xa0
>> [  725.989207]  [<ffffffff81601524>] memory_subsys_online+0x64/0xc0
>> [  726.008513]  [<ffffffff815eb0fd>] device_online+0x6d/0xa0
>> [  726.025579]  [<ffffffff816012eb>] store_mem_state+0x5b/0xe0
>> [  726.043400]  [<ffffffff815e8258>] dev_attr_store+0x18/0x30
>> [  726.060506]  [<ffffffff8127a808>] sysfs_kf_write+0x48/0x60
>> [  726.077940]  [<ffffffff81279d1b>] kernfs_fop_write+0x13b/0x1a0
>> [  726.099416]  [<ffffffff811f9f67>] vfs_write+0xb7/0x1f0
>> [  726.115748]  [<ffffffff811fabf8>] SyS_write+0x58/0xd0
>> [  726.131933]  [<ffffffff81ab1ea9>] system_call_fastpath+0x12/0x17
>> [  726.150691] 7 locks held by systemd-udevd/888:
>> [  726.165044]  #0:  (sb_writers#3){......}, at: [<ffffffff811fa063>] vfs_write+0x1b3/0x1f0
>> [  726.192422]  #1:  (&of->mutex){......}, at: [<ffffffff81279c46>] kernfs_fop_write+0x66/0x1a0
>> [  726.220289]  #2:  (s_active#60){......}, at: [<ffffffff81279c4e>] kernfs_fop_write+0x6e/0x1a0
>> [  726.249382]  #3:  (device_hotplug_lock){......}, at: [<ffffffff815e9c15>] lock_device_hotplug_sysfs+0x15/0x50
>> [  726.281901]  #4:  (&dev->mutex){......}, at: [<ffffffff815eb0b3>] device_online+0x23/0xa0
>> [  726.308619]  #5:  (mem_hotplug.lock){......}, at: [<ffffffff811e6525>] mem_hotplug_begin+0x5/0x80
>> [  726.337994]  #6:  (mem_hotplug.lock#2){......}, at: [<ffffffff811e656f>] mem_hotplug_begin+0x4f/0x80
>> 
>> In short: onlining grabs device lock and then tries to do mem_hotplug_begin()
>> while add_memory() is between mem_hotplug_begin() and mem_hotplug_done() and it
>> tries grabbing device lock.
>> 
>> To my understanding ACPI memory hotplug doesn't have the same issue as
>> device_hotplug_lock is being grabbed when the ACPI device is added.
>> 
>> Solve the issue by grabbing device_hotplug_lock before doing add_memory(). If
>> we do that, lock_device_hotplug_sysfs() will cause syscall retry which will
>> eventually succeed. To support the change we need to export lock_device_hotplug/
>> unlock_device_hotplug. This approach can be completely wrong though.
>
> Saying the approach could be completely wrong doesn't inspire a lot of 
> confidence.  I assume this output is from the hung task detector, is there 
> any other lockdep output that suggests there's a possible deadlock?

I said 'can be completely wrong' not because I'm not sure about the
cause of the deadlock (see locks #.2,3,5 in worker thread and locks
4,5,6 in systemd-udev) and not because I'm not sure my patch solves the
issue (as you can see lock #3 in systemd-udevd was taken before the
dev->mutex so we should be safe). My testing also showed the issue is
gone. I rather wasn't sure there is no other way to obtain this lock
indirectly or do some other synchronization.

-- 
  Vitaly

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-02-12 15:40 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-11 15:44 [PATCH 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining Vitaly Kuznetsov
2015-02-11 15:44 ` [PATCH 1/3] driver core: export lock_device_hotplug/unlock_device_hotplug Vitaly Kuznetsov
2015-02-11 20:39   ` Andrew Morton
2015-02-12  4:04     ` Rafael J. Wysocki
2015-02-12 10:16       ` Vitaly Kuznetsov
2015-02-11 15:44 ` [PATCH 2/3] memory_hotplug: add note about holding device_hotplug_lock and add_memory() Vitaly Kuznetsov
2015-02-11 15:44 ` [PATCH 3/3] Drivers: hv: balloon: fix deadlock between memory adding and onlining Vitaly Kuznetsov
2015-02-11 20:30 ` [PATCH 0/3] memory_hotplug: hyperv: " David Rientjes
2015-02-12  6:39   ` David Rientjes
2015-02-12 15:39     ` Vitaly Kuznetsov
2015-02-12 10:15   ` Vitaly Kuznetsov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox