From: Dan Williams <dan.j.williams@intel.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Matthew Wilcox <willy@linux.intel.com>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Linux MM <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
Matthew Wilcox <matthew.r.wilcox@intel.com>
Subject: Re: [PATCH v4 1/2] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
Date: Wed, 7 Oct 2015 09:19:28 -0700 [thread overview]
Message-ID: <CAPcyv4ikrVqYc4QxAaZiPMc4awh6h6WG0kP5nH=ZnuNTJVDwiQ@mail.gmail.com> (raw)
In-Reply-To: <1444170529-12814-2-git-send-email-ross.zwisler@linux.intel.com>
On Tue, Oct 6, 2015 at 3:28 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> This reverts commits 46c043ede4711e8d598b9d63c5616c1fedb0605e
> and 8346c416d17bf5b4ea1508662959bb62e73fd6a5.
>
> The following two locking commits in the DAX code:
>
> commit 843172978bb9 ("dax: fix race between simultaneous faults")
> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
>
> introduced a number of deadlocks and other issues, and need to be
> reverted for the v4.3 kernel. The list of issues in DAX after these
> commits (some newly introduced by the commits, some preexisting) can be
> found here:
>
> https://lkml.org/lkml/2015/9/25/602
>
> This revert keeps the PMEM API changes to the zeroing code in
> __dax_pmd_fault(), which were added by this commit:
>
> commit d77e92e270ed ("dax: update PMD fault handler with PMEM API")
>
> It also keeps the code dropping mapping->i_mmap_rwsem before calling
> unmap_mapping_range(), but converts it to a read lock since that's what is
> now used by the rest of __dax_pmd_fault(). This is needed to avoid
> recursively acquiring mapping->i_mmap_rwsem, once with a read lock in
> __dax_pmd_fault() and once with a write lock in unmap_mapping_range().
I think it is safe to say that this has now morphed into a full blown
fix and the "revert" label no longer applies. But, I'll let Andrew
weigh in if he wants that fixed up or will replace these patches in
-mm:
revert-mm-take-i_mmap_lock-in-unmap_mapping_range-for-dax.patch
revert-dax-fix-race-between-simultaneous-faults.patch
dax-temporarily-disable-dax-pmd-fault-path.patch
...with this new series.
However, a question below:
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
> fs/dax.c | 37 +++++++++++++------------------------
> mm/memory.c | 11 +++++++++--
> 2 files changed, 22 insertions(+), 26 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index bcfb14b..f665bc9 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -569,36 +569,14 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
> goto fallback;
>
> - sector = bh.b_blocknr << (blkbits - 9);
> -
> - if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> - int i;
> -
> - length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
> - bh.b_size);
> - if (length < 0) {
> - result = VM_FAULT_SIGBUS;
> - goto out;
> - }
> - if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> - goto fallback;
> -
> - for (i = 0; i < PTRS_PER_PMD; i++)
> - clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> - wmb_pmem();
> - count_vm_event(PGMAJFAULT);
> - mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> - result |= VM_FAULT_MAJOR;
> - }
> -
> /*
> * If we allocated new storage, make sure no process has any
> * zero pages covering this hole
> */
> if (buffer_new(&bh)) {
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
> unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
> }
>
> /*
> @@ -635,6 +613,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> result = VM_FAULT_NOPAGE;
> spin_unlock(ptl);
> } else {
> + sector = bh.b_blocknr << (blkbits - 9);
> length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
> bh.b_size);
> if (length < 0) {
> @@ -644,6 +623,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> goto fallback;
>
> + if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> + int i;
> + for (i = 0; i < PTRS_PER_PMD; i++)
> + clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> + wmb_pmem();
> + count_vm_event(PGMAJFAULT);
> + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> + result |= VM_FAULT_MAJOR;
> + }
> +
> result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9cb2747..5ec066f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2426,10 +2426,17 @@ void unmap_mapping_range(struct address_space *mapping,
> if (details.last_index < details.first_index)
> details.last_index = ULONG_MAX;
>
> - i_mmap_lock_write(mapping);
> +
> + /*
> + * DAX already holds i_mmap_lock to serialise file truncate vs
> + * page fault and page fault vs page fault.
> + */
> + if (!IS_DAX(mapping->host))
> + i_mmap_lock_write(mapping);
> if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> unmap_mapping_range_tree(&mapping->i_mmap, &details);
> - i_mmap_unlock_write(mapping);
> + if (!IS_DAX(mapping->host))
> + i_mmap_unlock_write(mapping);
> }
> EXPORT_SYMBOL(unmap_mapping_range);
What about cases where unmap_mapping_range() is called without an fs
lock? For the get_user_pages() and ZONE_DEVICE implementation I'm
looking to call truncate_pagecache() from the driver shutdown path to
revoke usage of the struct page's that were allocated by
devm_memremap_pages().
Likely I'm introducing a path through unmap_mapping_range() that does
not exist today, but I don't like that unmap_mapping_range() with this
change is presuming a given locking context. It's not clear to me how
this routine is safe when it optionally takes i_mmap_lock_write(), at
a minimum this needs documenting, and possibly assertions if the
locking assumptions are violated.
invalidate_inode_pages2_range() seems to call unmap_mapping_range()
without the the correct locking, but this was just a quick scan.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-10-07 16:19 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-10-06 22:28 [PATCH v4 0/2] Revert locking changes in DAX for v4.3 Ross Zwisler
2015-10-06 22:28 ` [PATCH v4 1/2] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Ross Zwisler
2015-10-07 16:19 ` Dan Williams [this message]
2015-10-07 21:39 ` Ross Zwisler
2015-10-06 22:28 ` [PATCH v4 2/2] Revert "dax: fix race between simultaneous faults" Ross Zwisler
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAPcyv4ikrVqYc4QxAaZiPMc4awh6h6WG0kP5nH=ZnuNTJVDwiQ@mail.gmail.com' \
--to=dan.j.williams@intel.com \
--cc=akpm@linux-foundation.org \
--cc=david@fromorbit.com \
--cc=jack@suse.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nvdimm@lists.01.org \
--cc=matthew.r.wilcox@intel.com \
--cc=ross.zwisler@linux.intel.com \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox