From: Dmitry Vyukov <dvyukov@google.com>
To: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Michal Hocko <mhocko@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
cgroups@vger.kernel.org,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
syzkaller <syzkaller@googlegroups.com>,
Kostya Serebryany <kcc@google.com>,
Alexander Potapenko <glider@google.com>,
Sasha Levin <sasha.levin@oracle.com>,
Eric Dumazet <edumazet@google.com>,
Greg Thelen <gthelen@google.com>, Tejun Heo <tj@kernel.org>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: WARNING in handle_mm_fault
Date: Wed, 25 Nov 2015 18:32:15 +0100 [thread overview]
Message-ID: <CACT4Y+YhCXznHzu8ieQgA0spMq5=-Yfodt1jLDYz96rDoF5tyg@mail.gmail.com> (raw)
In-Reply-To: <CACT4Y+ZdF09hOnb_bL4GNjytSMMGvNde8=9pdZt6gZQB1sp0hQ@mail.gmail.com>
On Wed, Nov 25, 2015 at 6:21 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Nov 25, 2015 at 4:27 PM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> Dmitry Vyukov wrote:
>>> If the race described in
>>> http://www.spinics.net/lists/cgroups/msg14078.html does actually
>>> happen, then there is nothing to check.
>>> https://gcc.gnu.org/ml/gcc/2012-02/msg00005.html talks about different
>>> memory locations, if there is store-widening involving different
>>> memory locations, then this is a compiler bug. But the race happens on
>>> a single memory location, in such case the code is buggy.
>>>
>>
>> All ->in_execve ->in_iowait ->sched_reset_on_fork ->sched_contributes_to_load
>> ->sched_migrated ->memcg_may_oom ->memcg_kmem_skip_account ->brk_randomized
>> shares the same byte.
>>
>> sched_fork(p) modifies p->sched_reset_on_fork but p is not yet visible.
>> __sched_setscheduler(p) modifies p->sched_reset_on_fork.
>> try_to_wake_up(p) modifies p->sched_contributes_to_load.
>> perf_event_task_migrate(p) modifies p->sched_migrated.
>>
>> Trying to reproduce this problem with
>>
>> static __always_inline bool
>> perf_sw_migrate_enabled(void)
>> {
>> - if (static_key_false(&perf_swevent_enabled[PERF_COUNT_SW_CPU_MIGRATIONS]))
>> - return true;
>> return false;
>> }
>>
>> would help testing ->sched_migrated case.
>
>
>
>
>
>
>
>
> I have some progress.
>
> With the following patch:
>
> dvyukov@dvyukov-z840:~/src/linux-dvyukov$ git diff
> include/linux/sched.h mm/memory.c
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2fae7d8..4c126a1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1455,6 +1455,8 @@ struct task_struct {
> /* Used for emulating ABI behavior of previous Linux versions */
> unsigned int personality;
>
> + union {
> + struct {
> unsigned in_execve:1; /* Tell the LSMs that the process is doing an
> * execve */
> unsigned in_iowait:1;
> @@ -1463,18 +1465,24 @@ struct task_struct {
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> unsigned sched_migrated:1;
> + unsigned dummy_a:1;
> #ifdef CONFIG_MEMCG
> unsigned memcg_may_oom:1;
> #endif
> + unsigned dummy_b:1;
> #ifdef CONFIG_MEMCG_KMEM
> unsigned memcg_kmem_skip_account:1;
> #endif
> #ifdef CONFIG_COMPAT_BRK
> unsigned brk_randomized:1;
> #endif
> + };
> + unsigned nonatomic_flags;
> + };
>
> unsigned long atomic_flags; /* Flags needing atomic access. */
>
> +
> struct restart_block restart_block;
>
> pid_t pid;
> diff --git a/mm/memory.c b/mm/memory.c
> index deb679c..6351dac 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -62,6 +62,7 @@
> #include <linux/dma-debug.h>
> #include <linux/debugfs.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/kasan.h>
>
> #include <asm/io.h>
> #include <asm/pgalloc.h>
> @@ -3436,12 +3437,45 @@ int handle_mm_fault(struct mm_struct *mm,
> struct vm_area_struct *vma,
> * Enable the memcg OOM handling for faults triggered in user
> * space. Kernel faults are handled more gracefully.
> */
> - if (flags & FAULT_FLAG_USER)
> + if (flags & FAULT_FLAG_USER) {
> + volatile int x;
> + unsigned f0, f1;
> + f0 = READ_ONCE(current->nonatomic_flags);
> + for (x = 0; x < 1000; x++) {
> + WRITE_ONCE(current->nonatomic_flags, 0xaeaeaeae);
> + cpu_relax();
> + WRITE_ONCE(current->nonatomic_flags, 0xaeaeaeab);
> + cpu_relax();
> + f1 = READ_ONCE(current->nonatomic_flags);
> + if (f1 != 0xaeaeaeab) {
> + pr_err("enable: flags 0x%x -> 0x%x\n", f0, f1);
> + break;
> + }
> + }
> + WRITE_ONCE(current->nonatomic_flags, f0);
> +
> mem_cgroup_oom_enable();
> + }
>
> ret = __handle_mm_fault(mm, vma, address, flags);
>
> if (flags & FAULT_FLAG_USER) {
> + volatile int x;
> + unsigned f0, f1;
> + f0 = READ_ONCE(current->nonatomic_flags);
> + for (x = 0; x < 1000; x++) {
> + WRITE_ONCE(current->nonatomic_flags, 0xaeaeaeae);
> + cpu_relax();
> + WRITE_ONCE(current->nonatomic_flags, 0xaeaeaeab);
> + cpu_relax();
> + f1 = READ_ONCE(current->nonatomic_flags);
> + if (f1 != 0xaeaeaeab) {
> + pr_err("enable: flags 0x%x -> 0x%x\n", f0, f1);
> + break;
> + }
> + }
> + WRITE_ONCE(current->nonatomic_flags, f0);
> +
> mem_cgroup_oom_disable();
> /*
> * The task may have entered a memcg OOM situation but
>
>
> I see:
>
> [ 153.484152] enable: flags 0x8 -> 0xaeaeaeaf
> [ 168.707786] enable: flags 0x8 -> 0xaeaeaeae
> [ 169.654966] enable: flags 0x40 -> 0xaeaeaeae
> [ 176.809080] enable: flags 0x48 -> 0xaeaeaeaa
> [ 177.496219] enable: flags 0x8 -> 0xaeaeaeaf
> [ 193.266703] enable: flags 0x0 -> 0xaeaeaeae
> [ 199.536435] enable: flags 0x8 -> 0xaeaeaeae
> [ 210.650809] enable: flags 0x48 -> 0xaeaeaeaf
> [ 210.869397] enable: flags 0x8 -> 0xaeaeaeaf
> [ 216.150804] enable: flags 0x8 -> 0xaeaeaeaa
> [ 231.607211] enable: flags 0x8 -> 0xaeaeaeaf
> [ 260.677408] enable: flags 0x48 -> 0xaeaeaeae
> [ 272.065364] enable: flags 0x40 -> 0xaeaeaeaf
> [ 281.594973] enable: flags 0x48 -> 0xaeaeaeaf
> [ 282.899860] enable: flags 0x8 -> 0xaeaeaeaf
> [ 286.472173] enable: flags 0x8 -> 0xaeaeaeae
> [ 286.763203] enable: flags 0x8 -> 0xaeaeaeaf
> [ 288.229107] enable: flags 0x0 -> 0xaeaeaeaf
> [ 291.336522] enable: flags 0x8 -> 0xaeaeaeae
> [ 310.082981] enable: flags 0x48 -> 0xaeaeaeaf
> [ 313.798935] enable: flags 0x8 -> 0xaeaeaeaf
> [ 343.340508] enable: flags 0x8 -> 0xaeaeaeaf
> [ 344.170635] enable: flags 0x48 -> 0xaeaeaeaf
> [ 357.568555] enable: flags 0x8 -> 0xaeaeaeaf
> [ 359.158179] enable: flags 0x48 -> 0xaeaeaeaf
> [ 361.188300] enable: flags 0x40 -> 0xaeaeaeaa
> [ 365.636639] enable: flags 0x8 -> 0xaeaeaeaf
With a better check:
volatile int x;
unsigned f0, f1;
f0 = READ_ONCE(current->nonatomic_flags);
for (x = 0; x < 1000; x++) {
WRITE_ONCE(current->nonatomic_flags, 0xaeaeaeff);
cpu_relax();
WRITE_ONCE(current->nonatomic_flags, 0xaeaeae00);
cpu_relax();
f1 = READ_ONCE(current->nonatomic_flags);
if (f1 != 0xaeaeae00) {
pr_err("enable1: flags 0x%x -> 0x%x\n", f0, f1);
break;
}
WRITE_ONCE(current->nonatomic_flags, 0xaeaeae00);
cpu_relax();
WRITE_ONCE(current->nonatomic_flags, 0xaeaeaeff);
cpu_relax();
f1 = READ_ONCE(current->nonatomic_flags);
if (f1 != 0xaeaeaeff) {
pr_err("enable2: flags 0x%x -> 0x%x\n", f0, f1);
break;
}
}
WRITE_ONCE(current->nonatomic_flags, f0);
I see:
[ 82.339662] enable1: flags 0x8 -> 0xaeaeae04
[ 102.743386] enable1: flags 0x0 -> 0xaeaeaeff
[ 122.209687] enable2: flags 0x0 -> 0xaeaeae04
[ 142.366938] enable1: flags 0x8 -> 0xaeaeae04
[ 157.273155] diable2: flags 0x40 -> 0xaeaeaefb
[ 162.320346] enable2: flags 0x8 -> 0xaeaeae00
[ 163.241090] enable2: flags 0x0 -> 0xaeaeaefb
[ 194.266300] diable2: flags 0x40 -> 0xaeaeaefb
[ 196.247483] enable1: flags 0x8 -> 0xaeaeae04
[ 219.996095] enable2: flags 0x0 -> 0xaeaeaefb
[ 228.088207] diable1: flags 0x48 -> 0xaeaeae04
[ 228.802678] diable2: flags 0x40 -> 0xaeaeaefb
[ 241.829173] enable1: flags 0x8 -> 0xaeaeae04
[ 257.601127] diable2: flags 0x48 -> 0xaeaeae04
[ 265.207038] enable2: flags 0x8 -> 0xaeaeaefb
[ 269.887365] enable1: flags 0x0 -> 0xaeaeae04
[ 272.254086] diable1: flags 0x40 -> 0xaeaeae04
[ 272.480384] enable1: flags 0x8 -> 0xaeaeae04
[ 276.430762] enable2: flags 0x8 -> 0xaeaeaefb
[ 289.526677] enable1: flags 0x8 -> 0xaeaeae04
Which suggests that somebody messes with 3-rd bit (both sets and
resets). Assuming that compiler does not reorder fields, this bit is
sched_reset_on_fork.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-11-25 17:32 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-11-24 13:50 Dmitry Vyukov
2015-11-24 22:31 ` Johannes Weiner
2015-11-25 13:05 ` Dmitry Vyukov
2015-11-25 13:12 ` Dmitry Vyukov
2015-11-25 8:44 ` Michal Hocko
2015-11-25 10:51 ` Tetsuo Handa
2015-11-25 13:08 ` Dmitry Vyukov
2015-11-25 15:27 ` Tetsuo Handa
2015-11-25 17:21 ` Dmitry Vyukov
2015-11-25 17:32 ` Dmitry Vyukov [this message]
2015-11-25 17:37 ` Michal Hocko
2015-11-25 17:44 ` Dmitry Vyukov
2015-11-26 11:33 ` Tetsuo Handa
2015-11-30 16:20 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CACT4Y+YhCXznHzu8ieQgA0spMq5=-Yfodt1jLDYz96rDoF5tyg@mail.gmail.com' \
--to=dvyukov@google.com \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=edumazet@google.com \
--cc=glider@google.com \
--cc=gthelen@google.com \
--cc=hannes@cmpxchg.org \
--cc=kcc@google.com \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=penguin-kernel@i-love.sakura.ne.jp \
--cc=peterz@infradead.org \
--cc=sasha.levin@oracle.com \
--cc=syzkaller@googlegroups.com \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox