* [PATCH] mm: save/restore current->journal_info in handle_mm_fault
@ 2017-12-14 10:55 Yan, Zheng
  2017-12-14 13:30 ` Michal Hocko
  2017-12-14 13:43 ` Jan Kara
  0 siblings, 2 replies; 8+ messages in thread
From: Yan, Zheng @ 2017-12-14 10:55 UTC
  To: linux-kernel, linux-fsdevel, ceph-devel, linux-ext4, linux-btrfs,
	linux-mm, akpm
  Cc: viro, jlayton, Yan, Zheng, stable

We recently got an Oops report:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: jbd2__journal_start+0x38/0x1a2
[...]
Call Trace:
  ext4_page_mkwrite+0x307/0x52b
  _ext4_get_block+0xd8/0xd8
  do_page_mkwrite+0x6e/0xd8
  handle_mm_fault+0x686/0xf9b
  mntput_no_expire+0x1f/0x21e
  __do_page_fault+0x21d/0x465
  dput+0x4a/0x2f7
  page_fault+0x22/0x30
  copy_user_generic_string+0x2c/0x40
  copy_page_to_iter+0x8c/0x2b8
  generic_file_read_iter+0x26e/0x845
  timerqueue_del+0x31/0x90
  ceph_read_iter+0x697/0xa33 [ceph]
  hrtimer_cancel+0x23/0x41
  futex_wait+0x1c8/0x24d
  get_futex_key+0x32c/0x39a
  __vfs_read+0xe0/0x130
  vfs_read.part.1+0x6c/0x123
  handle_mm_fault+0x831/0xf9b
  __fget+0x7e/0xbf
  SyS_read+0x4d/0xb5

ceph_read_iter() uses current->journal_info to pass context info to
ceph_readpages(). Because ceph_readpages() needs to know if its caller
has already gotten capability of using page cache (distinguish read
from readahead/fadvise). ceph_read_iter() set current->journal_info,
then calls generic_file_read_iter().

In above Oops, page fault happened when copying data to userspace.
Page fault handler called ext4_page_mkwrite(). Ext4 code read
current->journal_info and assumed it is journal handle.

I checked other filesystems, btrfs probably suffers similar problem
for its readpage. (page fault happens when write() copies data from
userspace memory and the memory is mapped to a file in btrfs.
verify_parent_transid() can be called during readpage)

Cc: stable@vger.kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 mm/memory.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index a728bed16c20..db2a50233c49 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags)
 {
 	int ret;
+	void *old_journal_info;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_oom_enable();
 
+	/*
+	 * Fault can happen when filesystem A's read_iter()/write_iter()
+	 * copies data to/from userspace. Filesystem A may have set
+	 * current->journal_info. If the userspace memory is MAP_SHARED
+	 * mapped to a file in filesystem B, we later may call filesystem
+	 * B's vm operation. Filesystem B may also want to read/set
+	 * current->journal_info.
+	 */
+	old_journal_info = current->journal_info;
+	current->journal_info = NULL;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	current->journal_info = old_journal_info;
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_oom_disable();
 		/*
--
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
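The leak described in the patch can be modelled in user space. What follows is only a minimal sketch, not kernel code: `struct task`, `current_task`, `fs_a_read_iter()`, `fs_b_page_mkwrite()`, `fault_unpatched()` and `fault_patched()` are hypothetical stand-ins introduced for illustration. It assumes "filesystem B" simply trusts whatever it finds in `journal_info`, as the Oops suggests ext4 did.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the single task_struct field that matters here. */
struct task {
	void *journal_info;
};

static struct task current_task;
#define current (&current_task)

/* What "filesystem B" found in journal_info the last time it was entered. */
static void *seen_by_fs_b;

/* Stand-in for filesystem B's ->page_mkwrite(): it trusts journal_info. */
static void fs_b_page_mkwrite(void)
{
	seen_by_fs_b = current->journal_info;
}

/* Fault path without the patch: filesystem A's value reaches filesystem B. */
static void fault_unpatched(void)
{
	fs_b_page_mkwrite();
}

/* Fault path with the save/clear/restore sequence from the patch. */
static void fault_patched(void)
{
	void *old_journal_info = current->journal_info;

	current->journal_info = NULL;
	fs_b_page_mkwrite();
	current->journal_info = old_journal_info;
}

/*
 * Stand-in for filesystem A's read_iter(): it stashes private context in
 * journal_info and then "faults" while copying to userspace. Returns the
 * value filesystem B observed during the fault.
 */
static void *fs_a_read_iter(void (*fault)(void), void *ctx)
{
	current->journal_info = ctx;
	fault();
	assert(current->journal_info == ctx); /* A's context must survive */
	current->journal_info = NULL;
	return seen_by_fs_b;
}
```

Calling `fs_a_read_iter(fault_unpatched, &ctx)` hands `&ctx` to filesystem B, while the patched variant hands it NULL and still leaves filesystem A's context intact afterwards. The model covers only the page-fault path; it says nothing about other ways of recursing into a second filesystem.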
* Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
  2017-12-14 10:55 [PATCH] mm: save/restore current->journal_info in handle_mm_fault Yan, Zheng
@ 2017-12-14 13:30 ` Michal Hocko
  2017-12-14 13:43 ` Jan Kara
  1 sibling, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2017-12-14 13:30 UTC
  To: Yan, Zheng
  Cc: linux-kernel, linux-fsdevel, ceph-devel, linux-ext4, linux-btrfs,
	linux-mm, akpm, viro, jlayton, stable

On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
> We recently got an Oops report:
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: jbd2__journal_start+0x38/0x1a2
> [...]
> Call Trace:
>   ext4_page_mkwrite+0x307/0x52b
>   _ext4_get_block+0xd8/0xd8
>   do_page_mkwrite+0x6e/0xd8
>   handle_mm_fault+0x686/0xf9b
>   mntput_no_expire+0x1f/0x21e
>   __do_page_fault+0x21d/0x465
>   dput+0x4a/0x2f7
>   page_fault+0x22/0x30
>   copy_user_generic_string+0x2c/0x40
>   copy_page_to_iter+0x8c/0x2b8
>   generic_file_read_iter+0x26e/0x845
>   timerqueue_del+0x31/0x90
>   ceph_read_iter+0x697/0xa33 [ceph]
>   hrtimer_cancel+0x23/0x41
>   futex_wait+0x1c8/0x24d
>   get_futex_key+0x32c/0x39a
>   __vfs_read+0xe0/0x130
>   vfs_read.part.1+0x6c/0x123
>   handle_mm_fault+0x831/0xf9b
>   __fget+0x7e/0xbf
>   SyS_read+0x4d/0xb5
>
> ceph_read_iter() uses current->journal_info to pass context info to
> ceph_readpages(). Because ceph_readpages() needs to know if its caller
> has already gotten capability of using page cache (distinguish read
> from readahead/fadvise). ceph_read_iter() set current->journal_info,
> then calls generic_file_read_iter().
>
> In above Oops, page fault happened when copying data to userspace.
> Page fault handler called ext4_page_mkwrite(). Ext4 code read
> current->journal_info and assumed it is journal handle.
>
> I checked other filesystems, btrfs probably suffers similar problem
> for its readpage. (page fault happens when write() copies data from
> userspace memory and the memory is mapped to a file in btrfs.
> verify_parent_transid() can be called during readpage)
>
> Cc: stable@vger.kernel.org
> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>

I am not an FS expert so (ab)using journal_info for unrelated purposes
might be acceptable in general but hooking into the generic PF path like
this is just too ugly to live. Can this be limited to FS code so that
not everybody has to pay additional cycles? With a big fat warning that
(ab)users might want to find a better way to communicate their internal
stuff.

> ---
>  mm/memory.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a728bed16c20..db2a50233c49 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		unsigned int flags)
>  {
>  	int ret;
> +	void *old_journal_info;
>
>  	__set_current_state(TASK_RUNNING);
>
> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	if (flags & FAULT_FLAG_USER)
>  		mem_cgroup_oom_enable();
>
> +	/*
> +	 * Fault can happen when filesystem A's read_iter()/write_iter()
> +	 * copies data to/from userspace. Filesystem A may have set
> +	 * current->journal_info. If the userspace memory is MAP_SHARED
> +	 * mapped to a file in filesystem B, we later may call filesystem
> +	 * B's vm operation. Filesystem B may also want to read/set
> +	 * current->journal_info.
> +	 */
> +	old_journal_info = current->journal_info;
> +	current->journal_info = NULL;
> +
>  	if (unlikely(is_vm_hugetlb_page(vma)))
>  		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>  	else
>  		ret = __handle_mm_fault(vma, address, flags);
>
> +	current->journal_info = old_journal_info;
> +
>  	if (flags & FAULT_FLAG_USER) {
>  		mem_cgroup_oom_disable();
>  		/*
> --
> 2.13.6

--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
  2017-12-14 10:55 [PATCH] mm: save/restore current->journal_info in handle_mm_fault Yan, Zheng
  2017-12-14 13:30 ` Michal Hocko
@ 2017-12-14 13:43 ` Jan Kara
  2017-12-14 14:30 ` Yan, Zheng
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Kara @ 2017-12-14 13:43 UTC
  To: Yan, Zheng
  Cc: linux-kernel, linux-fsdevel, ceph-devel, linux-ext4, linux-btrfs,
	linux-mm, akpm, viro, jlayton, stable

On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
> We recently got an Oops report:
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: jbd2__journal_start+0x38/0x1a2
> [...]
> Call Trace:
>   ext4_page_mkwrite+0x307/0x52b
>   _ext4_get_block+0xd8/0xd8
>   do_page_mkwrite+0x6e/0xd8
>   handle_mm_fault+0x686/0xf9b
>   mntput_no_expire+0x1f/0x21e
>   __do_page_fault+0x21d/0x465
>   dput+0x4a/0x2f7
>   page_fault+0x22/0x30
>   copy_user_generic_string+0x2c/0x40
>   copy_page_to_iter+0x8c/0x2b8
>   generic_file_read_iter+0x26e/0x845
>   timerqueue_del+0x31/0x90
>   ceph_read_iter+0x697/0xa33 [ceph]
>   hrtimer_cancel+0x23/0x41
>   futex_wait+0x1c8/0x24d
>   get_futex_key+0x32c/0x39a
>   __vfs_read+0xe0/0x130
>   vfs_read.part.1+0x6c/0x123
>   handle_mm_fault+0x831/0xf9b
>   __fget+0x7e/0xbf
>   SyS_read+0x4d/0xb5
>
> ceph_read_iter() uses current->journal_info to pass context info to
> ceph_readpages(). Because ceph_readpages() needs to know if its caller
> has already gotten capability of using page cache (distinguish read
> from readahead/fadvise). ceph_read_iter() set current->journal_info,
> then calls generic_file_read_iter().
>
> In above Oops, page fault happened when copying data to userspace.
> Page fault handler called ext4_page_mkwrite(). Ext4 code read
> current->journal_info and assumed it is journal handle.
>
> I checked other filesystems, btrfs probably suffers similar problem
> for its readpage. (page fault happens when write() copies data from
> userspace memory and the memory is mapped to a file in btrfs.
> verify_parent_transid() can be called during readpage)
>
> Cc: stable@vger.kernel.org
> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>

I agree with the analysis but the patch is too ugly to live. Ceph just
should not be abusing current->journal_info for passing information between
two random functions or when it does hackery like this, it should just
make sure the pieces hold together. Polluting generic code to accommodate
this hack in Ceph is not acceptable. Also bear in mind there are likely
other code paths (e.g. memory reclaim) which could recurse into another
filesystem confusing it with non-NULL current->journal_info in the same
way.

In this particular case I'm not sure why ceph passes 'filp' into
readpage() / readpages() handler when it already gets that pointer as part
of arguments...

								Honza

> diff --git a/mm/memory.c b/mm/memory.c
> index a728bed16c20..db2a50233c49 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		unsigned int flags)
>  {
>  	int ret;
> +	void *old_journal_info;
>
>  	__set_current_state(TASK_RUNNING);
>
> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	if (flags & FAULT_FLAG_USER)
>  		mem_cgroup_oom_enable();
>
> +	/*
> +	 * Fault can happen when filesystem A's read_iter()/write_iter()
> +	 * copies data to/from userspace. Filesystem A may have set
> +	 * current->journal_info. If the userspace memory is MAP_SHARED
> +	 * mapped to a file in filesystem B, we later may call filesystem
> +	 * B's vm operation. Filesystem B may also want to read/set
> +	 * current->journal_info.
> +	 */
> +	old_journal_info = current->journal_info;
> +	current->journal_info = NULL;
> +
>  	if (unlikely(is_vm_hugetlb_page(vma)))
>  		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>  	else
>  		ret = __handle_mm_fault(vma, address, flags);
>
> +	current->journal_info = old_journal_info;
> +
>  	if (flags & FAULT_FLAG_USER) {
>  		mem_cgroup_oom_disable();
>  		/*
> --
> 2.13.6
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
  2017-12-14 13:43 ` Jan Kara
@ 2017-12-14 14:30 ` Yan, Zheng
  2017-12-14 16:53 ` Jan Kara
  2017-12-14 20:48 ` Andreas Dilger
  0 siblings, 2 replies; 8+ messages in thread
From: Yan, Zheng @ 2017-12-14 14:30 UTC
  To: Jan Kara
  Cc: Yan, Zheng, Linux Kernel Mailing List,
	Linux FS-devel Mailing List, ceph-devel, linux-ext4, linux-btrfs,
	linux-mm, Andrew Morton, Al Viro, Jeff Layton, stable

On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <jack@suse.cz> wrote:
> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> We recently got an Oops report:
>>
>> BUG: unable to handle kernel NULL pointer dereference at (null)
>> IP: jbd2__journal_start+0x38/0x1a2
>> [...]
>> Call Trace:
>>   ext4_page_mkwrite+0x307/0x52b
>>   _ext4_get_block+0xd8/0xd8
>>   do_page_mkwrite+0x6e/0xd8
>>   handle_mm_fault+0x686/0xf9b
>>   mntput_no_expire+0x1f/0x21e
>>   __do_page_fault+0x21d/0x465
>>   dput+0x4a/0x2f7
>>   page_fault+0x22/0x30
>>   copy_user_generic_string+0x2c/0x40
>>   copy_page_to_iter+0x8c/0x2b8
>>   generic_file_read_iter+0x26e/0x845
>>   timerqueue_del+0x31/0x90
>>   ceph_read_iter+0x697/0xa33 [ceph]
>>   hrtimer_cancel+0x23/0x41
>>   futex_wait+0x1c8/0x24d
>>   get_futex_key+0x32c/0x39a
>>   __vfs_read+0xe0/0x130
>>   vfs_read.part.1+0x6c/0x123
>>   handle_mm_fault+0x831/0xf9b
>>   __fget+0x7e/0xbf
>>   SyS_read+0x4d/0xb5
>>
>> ceph_read_iter() uses current->journal_info to pass context info to
>> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> has already gotten capability of using page cache (distinguish read
>> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> then calls generic_file_read_iter().
>>
>> In above Oops, page fault happened when copying data to userspace.
>> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> current->journal_info and assumed it is journal handle.
>>
>> I checked other filesystems, btrfs probably suffers similar problem
>> for its readpage. (page fault happens when write() copies data from
>> userspace memory and the memory is mapped to a file in btrfs.
>> verify_parent_transid() can be called during readpage)
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>
> I agree with the analysis but the patch is too ugly to live. Ceph just
> should not be abusing current->journal_info for passing information between
> two random functions or when it does hackery like this, it should just
> make sure the pieces hold together. Polluting generic code to accommodate
> this hack in Ceph is not acceptable. Also bear in mind there are likely
> other code paths (e.g. memory reclaim) which could recurse into another
> filesystem confusing it with non-NULL current->journal_info in the same
> way.

But ...

Some filesystems set journal_info in write_begin(), then clear it
in write_end(). If the buffer for the write is mapped to another
filesystem, current->journal_info can leak to the latter filesystem's
page_readpage(). The latter filesystem may read current->journal_info
and treat it as its own journal handle. Besides, most filesystems' VM
fault handler is filemap_fault(), which may also trigger memory reclaim.

>
> In this particular case I'm not sure why ceph passes 'filp' into
> readpage() / readpages() handler when it already gets that pointer as part
> of arguments...

It is actually a flag that tells ceph_readpages() whether its caller is
ceph_read_iter() or readahead/fadvise/madvise, because when multiple
clients read/write a file at the same time, the page cache should be
disabled.

Regards
Yan, Zheng

>
>                                                                 Honza
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a728bed16c20..db2a50233c49 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>  		unsigned int flags)
>>  {
>>  	int ret;
>> +	void *old_journal_info;
>>
>>  	__set_current_state(TASK_RUNNING);
>>
>> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>  	if (flags & FAULT_FLAG_USER)
>>  		mem_cgroup_oom_enable();
>>
>> +	/*
>> +	 * Fault can happen when filesystem A's read_iter()/write_iter()
>> +	 * copies data to/from userspace. Filesystem A may have set
>> +	 * current->journal_info. If the userspace memory is MAP_SHARED
>> +	 * mapped to a file in filesystem B, we later may call filesystem
>> +	 * B's vm operation. Filesystem B may also want to read/set
>> +	 * current->journal_info.
>> +	 */
>> +	old_journal_info = current->journal_info;
>> +	current->journal_info = NULL;
>> +
>>  	if (unlikely(is_vm_hugetlb_page(vma)))
>>  		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>>  	else
>>  		ret = __handle_mm_fault(vma, address, flags);
>>
>> +	current->journal_info = old_journal_info;
>> +
>>  	if (flags & FAULT_FLAG_USER) {
>>  		mem_cgroup_oom_disable();
>>  		/*
>> --
>> 2.13.6
>>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
  2017-12-14 14:30 ` Yan, Zheng
@ 2017-12-14 16:53 ` Jan Kara
  2017-12-15  1:17 ` Yan, Zheng
  2017-12-14 20:48 ` Andreas Dilger
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Kara @ 2017-12-14 16:53 UTC
  To: Yan, Zheng
  Cc: Jan Kara, Yan, Zheng, Linux Kernel Mailing List,
	Linux FS-devel Mailing List, ceph-devel, linux-ext4, linux-btrfs,
	linux-mm, Andrew Morton, Al Viro, Jeff Layton, stable

On Thu 14-12-17 22:30:26, Yan, Zheng wrote:
> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
> >> We recently got an Oops report:
> >>
> >> BUG: unable to handle kernel NULL pointer dereference at (null)
> >> IP: jbd2__journal_start+0x38/0x1a2
> >> [...]
> >> Call Trace:
> >>   ext4_page_mkwrite+0x307/0x52b
> >>   _ext4_get_block+0xd8/0xd8
> >>   do_page_mkwrite+0x6e/0xd8
> >>   handle_mm_fault+0x686/0xf9b
> >>   mntput_no_expire+0x1f/0x21e
> >>   __do_page_fault+0x21d/0x465
> >>   dput+0x4a/0x2f7
> >>   page_fault+0x22/0x30
> >>   copy_user_generic_string+0x2c/0x40
> >>   copy_page_to_iter+0x8c/0x2b8
> >>   generic_file_read_iter+0x26e/0x845
> >>   timerqueue_del+0x31/0x90
> >>   ceph_read_iter+0x697/0xa33 [ceph]
> >>   hrtimer_cancel+0x23/0x41
> >>   futex_wait+0x1c8/0x24d
> >>   get_futex_key+0x32c/0x39a
> >>   __vfs_read+0xe0/0x130
> >>   vfs_read.part.1+0x6c/0x123
> >>   handle_mm_fault+0x831/0xf9b
> >>   __fget+0x7e/0xbf
> >>   SyS_read+0x4d/0xb5
> >>
> >> ceph_read_iter() uses current->journal_info to pass context info to
> >> ceph_readpages(). Because ceph_readpages() needs to know if its caller
> >> has already gotten capability of using page cache (distinguish read
> >> from readahead/fadvise). ceph_read_iter() set current->journal_info,
> >> then calls generic_file_read_iter().
> >>
> >> In above Oops, page fault happened when copying data to userspace.
> >> Page fault handler called ext4_page_mkwrite(). Ext4 code read
> >> current->journal_info and assumed it is journal handle.
> >>
> >> I checked other filesystems, btrfs probably suffers similar problem
> >> for its readpage. (page fault happens when write() copies data from
> >> userspace memory and the memory is mapped to a file in btrfs.
> >> verify_parent_transid() can be called during readpage)
> >>
> >> Cc: stable@vger.kernel.org
> >> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> >
> > I agree with the analysis but the patch is too ugly to live. Ceph just
> > should not be abusing current->journal_info for passing information between
> > two random functions or when it does hackery like this, it should just
> > make sure the pieces hold together. Polluting generic code to accommodate
> > this hack in Ceph is not acceptable. Also bear in mind there are likely
> > other code paths (e.g. memory reclaim) which could recurse into another
> > filesystem confusing it with non-NULL current->journal_info in the same
> > way.
>
> But ...
>
> Some filesystems set journal_info in write_begin(), then clear it
> in write_end(). If the buffer for the write is mapped to another
> filesystem, current->journal_info can leak to the latter filesystem's
> page_readpage(). The latter filesystem may read current->journal_info
> and treat it as its own journal handle. Besides, most filesystems' VM
> fault handler is filemap_fault(), which may also trigger memory reclaim.

Did you really observe this? Because the write path uses
iov_iter_copy_from_user_atomic() which does not allow page faults to
happen. All page faulting happens in iov_iter_fault_in_readable() before
->write_begin() is called. And the recursion problems like you mention
above are exactly the reason why things are done in a more complicated way
like this.

> >
> > In this particular case I'm not sure why ceph passes 'filp' into
> > readpage() / readpages() handler when it already gets that pointer as part
> > of arguments...
>
> It is actually a flag that tells ceph_readpages() whether its caller is
> ceph_read_iter() or readahead/fadvise/madvise, because when multiple
> clients read/write a file at the same time, the page cache should be
> disabled.

I'm not sure I understand the reasoning properly but from what you say
above it rather seems the 'hint' should be stored in the inode (or possibly
struct file)?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
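The ordering described for the write path (fault the user buffer in first, then copy with faults disallowed while the filesystem's handle is live) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct user_buf`, `handle_fault()`, `fault_in_readable()`, `copy_from_user_atomic()` and `model_write()` are hypothetical stand-ins, not the kernel's iov_iter helpers, and the `journal_info` assignments stand in for what a filesystem might do in ->write_begin() and ->write_end().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct task {
	void *journal_info;
};
static struct task current_task;

/* Model of a user buffer whose backing page may not be resident yet. */
struct user_buf {
	const char *data;
	size_t len;
	bool resident;
};

/* Number of faults taken while journal_info was non-NULL. */
static int faults_with_handle_set;

/* Fault handler: it may enter another filesystem, so record its view. */
static void handle_fault(struct user_buf *ub)
{
	if (current_task.journal_info)
		faults_with_handle_set++;
	ub->resident = true;
}

/* Prefault step: touching the buffer is allowed to fault here. */
static void fault_in_readable(struct user_buf *ub)
{
	if (!ub->resident)
		handle_fault(ub);
}

/* Atomic copy: never faults, copies 0 bytes if the page is missing. */
static size_t copy_from_user_atomic(char *dst, const struct user_buf *ub)
{
	if (!ub->resident)
		return 0;
	memcpy(dst, ub->data, ub->len);
	return ub->len;
}

/* Write loop: prefault with no handle held, copy atomically with it held. */
static size_t model_write(char *dst, struct user_buf *ub, void *handle)
{
	size_t copied;

	assert(current_task.journal_info == NULL);
	do {
		fault_in_readable(ub);                  /* no handle held */
		current_task.journal_info = handle;     /* "write_begin" */
		copied = copy_from_user_atomic(dst, ub);
		current_task.journal_info = NULL;       /* "write_end" */
	} while (copied == 0 && ub->len != 0);

	return copied;
}
```

In this model the fault handler only ever runs from the prefault step, so it never sees a live handle. If the atomic copy copies nothing, the loop drops the handle and prefaults again rather than faulting with the handle held.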
* Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
  2017-12-14 16:53 ` Jan Kara
@ 2017-12-15  1:17 ` Yan, Zheng
  2017-12-15 10:33 ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Yan, Zheng @ 2017-12-15  1:17 UTC
  To: Jan Kara
  Cc: Yan, Zheng, Linux Kernel Mailing List,
	Linux FS-devel Mailing List, ceph-devel, linux-ext4, linux-btrfs,
	linux-mm, Andrew Morton, Al Viro, Jeff Layton, stable

On Fri, Dec 15, 2017 at 12:53 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 14-12-17 22:30:26, Yan, Zheng wrote:
>> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> >> We recently got an Oops report:
>> >>
>> >> BUG: unable to handle kernel NULL pointer dereference at (null)
>> >> IP: jbd2__journal_start+0x38/0x1a2
>> >> [...]
>> >> Call Trace:
>> >>   ext4_page_mkwrite+0x307/0x52b
>> >>   _ext4_get_block+0xd8/0xd8
>> >>   do_page_mkwrite+0x6e/0xd8
>> >>   handle_mm_fault+0x686/0xf9b
>> >>   mntput_no_expire+0x1f/0x21e
>> >>   __do_page_fault+0x21d/0x465
>> >>   dput+0x4a/0x2f7
>> >>   page_fault+0x22/0x30
>> >>   copy_user_generic_string+0x2c/0x40
>> >>   copy_page_to_iter+0x8c/0x2b8
>> >>   generic_file_read_iter+0x26e/0x845
>> >>   timerqueue_del+0x31/0x90
>> >>   ceph_read_iter+0x697/0xa33 [ceph]
>> >>   hrtimer_cancel+0x23/0x41
>> >>   futex_wait+0x1c8/0x24d
>> >>   get_futex_key+0x32c/0x39a
>> >>   __vfs_read+0xe0/0x130
>> >>   vfs_read.part.1+0x6c/0x123
>> >>   handle_mm_fault+0x831/0xf9b
>> >>   __fget+0x7e/0xbf
>> >>   SyS_read+0x4d/0xb5
>> >>
>> >> ceph_read_iter() uses current->journal_info to pass context info to
>> >> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> >> has already gotten capability of using page cache (distinguish read
>> >> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> >> then calls generic_file_read_iter().
>> >>
>> >> In above Oops, page fault happened when copying data to userspace.
>> >> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> >> current->journal_info and assumed it is journal handle.
>> >>
>> >> I checked other filesystems, btrfs probably suffers similar problem
>> >> for its readpage. (page fault happens when write() copies data from
>> >> userspace memory and the memory is mapped to a file in btrfs.
>> >> verify_parent_transid() can be called during readpage)
>> >>
>> >> Cc: stable@vger.kernel.org
>> >> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>> >
>> > I agree with the analysis but the patch is too ugly to live. Ceph just
>> > should not be abusing current->journal_info for passing information between
>> > two random functions or when it does hackery like this, it should just
>> > make sure the pieces hold together. Polluting generic code to accommodate
>> > this hack in Ceph is not acceptable. Also bear in mind there are likely
>> > other code paths (e.g. memory reclaim) which could recurse into another
>> > filesystem confusing it with non-NULL current->journal_info in the same
>> > way.
>>
>> But ...
>>
>> Some filesystems set journal_info in write_begin(), then clear it
>> in write_end(). If the buffer for the write is mapped to another
>> filesystem, current->journal_info can leak to the latter filesystem's
>> page_readpage(). The latter filesystem may read current->journal_info
>> and treat it as its own journal handle. Besides, most filesystems' VM
>> fault handler is filemap_fault(), which may also trigger memory reclaim.
>
> Did you really observe this? Because the write path uses
> iov_iter_copy_from_user_atomic() which does not allow page faults to
> happen. All page faulting happens in iov_iter_fault_in_readable() before
> ->write_begin() is called. And the recursion problems like you mention
> above are exactly the reason why things are done in a more complicated way
> like this.

I think you are right.

>
>> >
>> > In this particular case I'm not sure why ceph passes 'filp' into
>> > readpage() / readpages() handler when it already gets that pointer as part
>> > of arguments...
>>
>> It is actually a flag that tells ceph_readpages() whether its caller is
>> ceph_read_iter() or readahead/fadvise/madvise, because when multiple
>> clients read/write a file at the same time, the page cache should be
>> disabled.
>
> I'm not sure I understand the reasoning properly but from what you say
> above it rather seems the 'hint' should be stored in the inode (or possibly
> struct file)?
>

The capability to use the page cache is held by the process that got it.
ceph_read_iter() first gets the capability, calls
generic_file_read_iter(), then releases the capability. The capability
cannot easily be stored in the inode or file because the server can
revoke it at any time if the caller does not hold it.

Regards
Yan, Zheng

>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
* Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
  2017-12-15  1:17 ` Yan, Zheng
@ 2017-12-15 10:33 ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2017-12-15 10:33 UTC
  To: Yan, Zheng
  Cc: Jan Kara, Yan, Zheng, Linux Kernel Mailing List,
	Linux FS-devel Mailing List, ceph-devel, linux-ext4, linux-btrfs,
	linux-mm, Andrew Morton, Al Viro, Jeff Layton, stable

On Fri 15-12-17 09:17:42, Yan, Zheng wrote:
> On Fri, Dec 15, 2017 at 12:53 AM, Jan Kara <jack@suse.cz> wrote:
> >> >
> >> > In this particular case I'm not sure why ceph passes 'filp' into
> >> > readpage() / readpages() handler when it already gets that pointer as part
> >> > of arguments...
> >>
> >> It is actually a flag that tells ceph_readpages() whether its caller is
> >> ceph_read_iter() or readahead/fadvise/madvise, because when multiple
> >> clients read/write a file at the same time, the page cache should be
> >> disabled.
> >
> > I'm not sure I understand the reasoning properly but from what you say
> > above it rather seems the 'hint' should be stored in the inode (or possibly
> > struct file)?
> >
>
> The capability to use the page cache is held by the process that got it.
> ceph_read_iter() first gets the capability, calls
> generic_file_read_iter(), then releases the capability. The capability
> cannot easily be stored in the inode or file because the server can
> revoke it at any time if the caller does not hold it.

OK, understood. But using storage in task_struct (such as journal_info) is
problematic as it has hard-to-fix recursion issues, as the bug you're trying
to fix shows (it is difficult to track down all the paths that can recurse
into another filesystem which will clobber the stored info). So either you
have to come up with some scheme to safely use current->journal_info (by
somehow tracking the owner as Andreas suggests) and convert all users to it,
or you have to come up with some scheme propagating the information through
the inode / file->private_data and use it in Ceph.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
  2017-12-14 14:30   ` Yan, Zheng
  2017-12-14 16:53     ` Jan Kara
@ 2017-12-14 20:48     ` Andreas Dilger
  1 sibling, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2017-12-14 20:48 UTC
  To: Yan, Zheng
  Cc: Jan Kara, Yan, Zheng, Linux Kernel Mailing List,
      Linux FS-devel Mailing List, ceph-devel, linux-ext4, linux-btrfs,
      linux-mm, Andrew Morton, Al Viro, Jeff Layton

[remove stable@ as this is not really a stable patch]

On Dec 14, 2017, at 7:30 AM, Yan, Zheng <ukernel@gmail.com> wrote:
>
> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <jack@suse.cz> wrote:
>> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>>> We recently got an Oops report:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at (null)
>>> IP: jbd2__journal_start+0x38/0x1a2
>>> [...]
>>> Call Trace:
>>>   ext4_page_mkwrite+0x307/0x52b
>>>   _ext4_get_block+0xd8/0xd8
>>>   do_page_mkwrite+0x6e/0xd8
>>>   handle_mm_fault+0x686/0xf9b
>>>   mntput_no_expire+0x1f/0x21e
>>>   __do_page_fault+0x21d/0x465
>>>   dput+0x4a/0x2f7
>>>   page_fault+0x22/0x30
>>>   copy_user_generic_string+0x2c/0x40
>>>   copy_page_to_iter+0x8c/0x2b8
>>>   generic_file_read_iter+0x26e/0x845
>>>   timerqueue_del+0x31/0x90
>>>   ceph_read_iter+0x697/0xa33 [ceph]
>>>   hrtimer_cancel+0x23/0x41
>>>   futex_wait+0x1c8/0x24d
>>>   get_futex_key+0x32c/0x39a
>>>   __vfs_read+0xe0/0x130
>>>   vfs_read.part.1+0x6c/0x123
>>>   handle_mm_fault+0x831/0xf9b
>>>   __fget+0x7e/0xbf
>>>   SyS_read+0x4d/0xb5
>>>
>>> ceph_read_iter() uses current->journal_info to pass context info to
>>> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>>> has already gotten capability of using page cache (distinguish read
>>> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>>> then calls generic_file_read_iter().
>>>
>>> In above Oops, page fault happened when copying data to userspace.
>>> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>>> current->journal_info and assumed it is journal handle.
>>>
>>> I checked other filesystems, btrfs probably suffers similar problem
>>> for its readpage. (page fault happens when write() copies data from
>>> userspace memory and the memory is mapped to a file in btrfs.
>>> verify_parent_transid() can be called during readpage)
>>>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>
>> I agree with the analysis but the patch is too ugly to live. Ceph just
>> should not be abusing current->journal_info for passing information between
>> two random functions or when it does a hackery like this, it should just
>> make sure the pieces hold together. Polluting generic code to accommodate
>> this hack in Ceph is not acceptable. Also bear in mind there are likely
>> other code paths (e.g. memory reclaim) which could recurse into another
>> filesystem confusing it with non-NULL current->journal_info in the same
>> way.
>
> But ...
>
> Some filesystems set journal_info in write_begin(), then clear it
> in write_end(). If the buffer for the write is mapped to another filesystem,
> current->journal_info can leak to the latter filesystem's page_readpage().
> The latter filesystem may read current->journal_info and treat it as its own
> journal handle. Besides, most filesystems' vm fault handler is
> filemap_fault(), and filemap may also trigger memory reclaim.

Shouldn't the memory reclaim be prevented from recursing into the other
filesystem by use of GFP_NOFS, or the new memalloc_nofs annotation? I don't
think that ext4 is ever using current->journal_info on any read paths, only
in case of writes.

>> In this particular case I'm not sure why ceph passes 'filp' into the
>> readpage() / readpages() handler when it already gets that pointer as part
>> of arguments...
>
> It is actually a flag which tells ceph_readpages() if its caller is
> ceph_read_iter or readahead/fadvise/madvise, because when there are
> multiple clients reading/writing a file at the same time, the page cache
> should be disabled.

I've wanted something similar for other reasons. It would be better to have
a separate fs-specific pointer in the task struct to handle this kind of
information. This can be used by the filesystem "upper half" to communicate
with the "lower half" (doing the writeout or other IO below the VFS), and the
"lower half" can use ->journal_info for handling the writeout.

However, some care would be needed to ensure that other processes accessing
this pointer would only do so if it is their own. Something like
->fs_private_sb and ->fs_private_data would allow this sanely. If
->fs_private_sb != sb in the filesystem then ->fs_private_data is not valid
for this fs and cannot be used by the current filesystem code.

Alternately, we could have a single ->fs_private pointer to reduce impact
on task_struct so long as all filesystems used the first field of the
structure to point to "sb", probably with a library helper to ensure this
was done consistently:

    data = current_fs_private_get(sb);
    current_fs_private_set(sb, data);
    data = current_fs_private_alloc(sb, size, gfp);

or whatever.

> Regards
> Yan, Zheng
>
>>
>> Honza
>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index a728bed16c20..db2a50233c49 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>               unsigned int flags)
>>>  {
>>>       int ret;
>>> +     void *old_journal_info;
>>>
>>>       __set_current_state(TASK_RUNNING);
>>>
>>> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>       if (flags & FAULT_FLAG_USER)
>>>               mem_cgroup_oom_enable();
>>>
>>> +     /*
>>> +      * Fault can happen when filesystem A's read_iter()/write_iter()
>>> +      * copies data to/from userspace. Filesystem A may have set
>>> +      * current->journal_info. If the userspace memory is MAP_SHARED
>>> +      * mapped to a file in filesystem B, we later may call filesystem
>>> +      * B's vm operation. Filesystem B may also want to read/set
>>> +      * current->journal_info.
>>> +      */
>>> +     old_journal_info = current->journal_info;
>>> +     current->journal_info = NULL;
>>> +
>>>       if (unlikely(is_vm_hugetlb_page(vma)))
>>>               ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>>>       else
>>>               ret = __handle_mm_fault(vma, address, flags);
>>>
>>> +     current->journal_info = old_journal_info;
>>> +
>>>       if (flags & FAULT_FLAG_USER) {
>>>               mem_cgroup_oom_disable();
>>>               /*
>>> --
>>> 2.13.6
>>>
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR

Cheers, Andreas
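
For illustration only, the single ->fs_private pointer idea from the message
above might look roughly like the userspace sketch below. Nothing here is
kernel code: struct task, current_task, struct fs_private_hdr and
struct ceph_read_ctx are hypothetical stand-ins, the helpers only borrow the
names suggested in the mail, and current_fs_private_alloc() is left out. The
point it tries to show is the ownership check: a filesystem gets the pointer
back only when the first field names its own superblock.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct super_block; only its identity matters. */
struct super_block { const char *name; };

/* Hypothetical common header: every per-task blob starts with its owner. */
struct fs_private_hdr { struct super_block *sb; };

/* Stand-in for task_struct carrying the single proposed pointer. */
struct task { void *fs_private; };
static struct task current_task;	/* models "current" */

/* Return the blob only if it belongs to sb, otherwise NULL. */
static void *current_fs_private_get(struct super_block *sb)
{
	struct fs_private_hdr *hdr = current_task.fs_private;

	return (hdr && hdr->sb == sb) ? hdr : NULL;
}

/* Install data for sb; data must begin with a struct fs_private_hdr. */
static void current_fs_private_set(struct super_block *sb, void *data)
{
	if (data)
		((struct fs_private_hdr *)data)->sb = sb;
	current_task.fs_private = data;
}

/* Hypothetical ceph-side context: "caller already holds the cache cap". */
struct ceph_read_ctx {
	struct fs_private_hdr hdr;	/* must stay the first field */
	int got_cache_cap;
};

/* Models the Oops scenario: an ext4 path runs while ceph context is set. */
static int fs_private_demo(void)
{
	struct super_block ceph_sb = { "ceph" }, ext4_sb = { "ext4" };
	struct ceph_read_ctx ctx = { .got_cache_cap = 1 };
	int ok;

	current_fs_private_set(&ceph_sb, &ctx);
	ok = current_fs_private_get(&ext4_sb) == NULL &&	/* not ext4's */
	     current_fs_private_get(&ceph_sb) == &ctx;		/* ceph's own */
	current_fs_private_set(&ceph_sb, NULL);
	assert(current_fs_private_get(&ceph_sb) == NULL);
	return ok ? 0 : -1;
}
```

In this model a fault path belonging to a different superblock sees NULL
rather than a foreign pointer, which is the property the ->fs_private_sb
check is meant to give. Whether a real implementation would save and restore
a previous value on nesting is not addressed in the mail and is not modelled
here.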
end of thread, other threads:[~2017-12-15 10:33 UTC | newest]

Thread overview: 8+ messages
2017-12-14 10:55 [PATCH] mm: save/restore current->journal_info in handle_mm_fault Yan, Zheng
2017-12-14 13:30 ` Michal Hocko
2017-12-14 13:43 ` Jan Kara
2017-12-14 14:30   ` Yan, Zheng
2017-12-14 16:53     ` Jan Kara
2017-12-15  1:17       ` Yan, Zheng
2017-12-15 10:33         ` Jan Kara
2017-12-14 20:48     ` Andreas Dilger