From: Li Zefan <lizf@cn.fujitsu.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Nick Piggin <nickpiggin@yahoo.com.au>,
Tim LaBerge <tim.laberge@quantum.com>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
FNST-Wang Chen <wangchen@cn.fujitsu.com>
Subject: Re: Corruption with O_DIRECT and unaligned user buffers
Date: Fri, 19 Dec 2008 16:45:47 +0800 [thread overview]
Message-ID: <494B5F3B.806@cn.fujitsu.com> (raw)
In-Reply-To: <494B50C9.7080308@cn.fujitsu.com>
Li Zefan wrote:
> KAMEZAWA Hiroyuki wrote:
>> On Thu, 18 Dec 2008 16:29:52 +0100
>> Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>>> On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote:
>>>> On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
>>>>> The solution either involves synchronising forks and get_user_pages,
>>>>> or probably better, to do copy on fork rather than COW in the case
>>>>> that we detect a page is subject to get_user_pages. The trick is in
>>>>> the details :)
>>> From: Andrea Arcangeli <aarcange@redhat.com>
>>> Subject: fork-o_direct-race
>>>
>>> Think a thread writing constantly to the last 512bytes of a page, while another
>>> thread read and writes to/from the first 512bytes of the page. We can lose
>>> O_DIRECT reads, the very moment we mark any pte wrprotected because a third
>>> unrelated thread forks off a child.
>>>
>>> This fixes it by never wprotecting anon ptes if there can be any direct I/O in
>>> flight to the page, and by instantiating a readonly pte and triggering a COW in
>>> the child. The only trouble here are O_DIRECT reads (writes to memory, read
>>> from disk). Checking the page_count under the PT lock guarantees no
>>> get_user_pages could be running under us because if somebody wants to write to
>>> the page, it has to break any cow first and that requires taking the PT lock in
>>> follow_page before increasing the page count.
>>>
>>> The COW triggered inside fork will run while the parent pte is read-write, this
>>> is not usual but that's ok as it's only a page copy and it doesn't modify the
>>> page contents.
>>>
>>> In the long term there should be a smp_wmb() in between page_cache_get and
>>> SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the
>>> PageSwapCache and the page_count() to remove the trylock op.
>>>
>>> Fixed version of original patch from Nick Piggin.
>>>
>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> Confirmed this fixes the problem.
>>
>
> We tested with RHEL 5.2 + patch on i386 using the test program provided by
> Tim LaBerge, though the program can pass but sometimes hanged. strace log is
> attached, and we'll test it again with LOCKDEP enabled to see if we can get
> some other information.
>
# ./dma_thread -a 512
Using 2 workers.
Using alignment 512.
Read buffer: 0xb7e4e000.
Reading file 1.
Reading file 2.
...
Reading file 26.
Reading file 27.
(hang here, Ctrl+C can break the process)
And we modified the program to use 'dma_thread -a 512 -w 1', we can still see
hung in a very low frequency.
==============
Here is a snapshop of call trace:
dma_thread S 00000035 2872 20296 8797 23256 (NOTLB)
f7018e78 00000046 1f593e7d 00000035 f7018e84 00000002 00000000 00000006
f4c35530 f71ac030 1f5a03da 00000035 0000c55d 00000001 f4c3563c c1a80044
f7018f04 f7018f1c b7e4cbd8 00000046 00000000 00000002 00000001 7fffffff
Call Trace:
[<c061bd10>] schedule_timeout+0x13/0x8c
[<c043c435>] do_futex+0x1e2/0xb38
[<c061d316>] _spin_unlock+0x14/0x1c
[<c0465938>] do_wp_page+0x3fb/0x405
[<c0466da0>] __handle_mm_fault+0x858/0x8b8
[<c041e5f3>] default_wake_function+0x0/0xc
[<c044e32b>] audit_syscall_entry+0x14b/0x17d
[<c043ce9c>] sys_futex+0x111/0x127
[<c0408076>] do_syscall_trace+0xab/0xb1
[<c0404f53>] syscall_call+0x7/0xb
=======================
dma_thread S 00000035 3304 23256 8797 23258 20296 (NOTLB)
f4e24f50 00000046 1ec0e26d 00000035 c073ea10 416db065 00000046 00000003
f70ac030 c1b7eab0 1ec7d7c9 00000035 0006f55c 00000001 f70ac13c c1a80044
00005ada f4cc0030 00000000 00000246 ffffffff 00000000 00000000 f53acab0
Call Trace:
[<c0426b84>] do_wait+0x8b5/0x9a3
[<c044e32b>] audit_syscall_entry+0x14b/0x17d
[<c041e5f3>] default_wake_function+0x0/0xc
[<c0426c99>] sys_wait4+0x27/0x2a
[<c0426caf>] sys_waitpid+0x13/0x17
[<c0404f53>] syscall_call+0x7/0xb
=======================
dma_thread R running 3412 23258 23256 (NOTLB)
...
...
Showing all locks held in the system:
4 locks held by kseriod/82:
#0: (serio_mutex){--..}, at: [<c059c7f6>] serio_thread+0x13/0x28d
#1: (&serio->drv_mutex){--..}, at: [<c059be7e>] serio_connect_driver+0x16/0x2c
#2: (psmouse_mutex){--..}, at: [<c05a41ce>] psmouse_connect+0x18/0x211
#3: (&ps2_mutex_key){--..}, at: [<c059e4bd>] ps2_command+0x80/0x2dc
=============================================
> BTW, the patch works fine on IA64.
>
>> Hmm, but, fork() gets slower.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2008-12-19 8:45 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-11-14 17:04 Tim LaBerge
2008-11-19 4:25 ` Nick Piggin
2008-11-19 6:52 ` Nick Piggin
2008-11-19 16:58 ` Andrea Arcangeli
2008-12-18 15:29 ` Andrea Arcangeli
2008-12-19 2:21 ` KAMEZAWA Hiroyuki
2008-12-19 5:06 ` KAMEZAWA Hiroyuki
2008-12-19 6:34 ` KOSAKI Motohiro
2008-12-20 16:02 ` Andrea Arcangeli
2008-12-19 7:19 ` KAMEZAWA Hiroyuki
2008-12-19 7:44 ` Li Zefan
2008-12-19 8:45 ` Li Zefan [this message]
2008-12-19 20:27 ` Andrea Arcangeli
2008-12-20 15:55 ` Andrea Arcangeli
2008-12-19 11:51 ` Li Zefan
2008-12-19 12:14 ` KOSAKI Motohiro
2008-12-19 12:58 ` Hugh Dickins
2008-12-19 20:34 ` Andrea Arcangeli
2014-06-27 2:08 Xiaoguang Wang
2014-07-01 4:18 ` Hugh Dickins
2014-07-02 11:39 ` Xiaoguang Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=494B5F3B.806@cn.fujitsu.com \
--to=lizf@cn.fujitsu.com \
--cc=aarcange@redhat.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nickpiggin@yahoo.com.au \
--cc=tim.laberge@quantum.com \
--cc=wangchen@cn.fujitsu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox