From: Jan Kara <jack@suse.cz>
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>,
linux-nvdimm@lists.01.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org,
Dan Williams <dan.j.williams@intel.com>
Subject: Re: [PATCH 3/3] dax: Clear dirty entry tags on cache flush
Date: Wed, 29 Jun 2016 22:47:34 +0200 [thread overview]
Message-ID: <20160629204734.GE16831@quack2.suse.cz> (raw)
In-Reply-To: <20160628213857.GA15457@linux.intel.com>
On Tue 28-06-16 15:38:57, Ross Zwisler wrote:
> On Tue, Jun 21, 2016 at 05:45:15PM +0200, Jan Kara wrote:
> > Currently we never clear dirty tags in DAX mappings and thus address
> > ranges to flush accumulate. Now that we have locking of radix tree
> > entries, we have all the locking necessary to reliably clear the radix
> > tree dirty tag when flushing caches for corresponding address range.
> > Similarly to page_mkclean() we also have to write-protect pages to get a
> > page fault when the page is next written to so that we can mark the
> > entry dirty again.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
>
> I think we still have a race where we can end up with a writeable PTE but a
> clean DAX radix tree entry. Here is the flow:
>
> Thread 1 Thread 2
> ======== ========
> wp_pfn_shared()
> dax_pfn_mkwrite()
> get_unlocked_mapping_entry()
> radix_tree_tag_set(DIRTY)
> put_unlocked_mapping_entry()
> dax_writeback_one()
> lock_slot()
> radix_tree_tag_clear(TOWRITE)
> dax_mapping_entry_mkclean()
> wb_cache_pmem()
> radix_tree_tag_clear(DIRTY)
> put_locked_mapping_entry()
> wp_page_reuse()
> maybe_mkwrite()
>
> When this ends the radix tree entry will have been written back and the
> TOWRITE and DIRTY radix tree tags will have been cleared, but the PTE will be
> writeable due to the last maybe_mkwrite() call. This will result in no new
> dax_pfn_mkwrite() calls happening to re-dirty the radix tree entry.
Good point, thanks for spotting this! There are several ways to fix this -
one is the callback you suggest below but I don't like that much. Another
is returning VM_FAULT_DAX_LOCKED as we do for cow faults handle that
properly in page_mkwrite() paths. Finally, we could just handle PTE changes
within the pfn_mkwrite() handler like we already do for the ->fault path
and thus keep the locking private to DAX code. I'm not yet decided what's
best. I'll think about it.
> Essentially the problem is that we don't hold any locks through all the work
> done by wp_pfn_shared() so that we can do maybe_mkwrite() safely.
>
> Perhaps we can lock the radix tree slot in dax_pfn_mkwrite(), then have
> another callback into DAX after the wp_page_reuse() => maybe_mkwrite() to
> unlock the slot? This would guarantee that they happen together from DAX's
> point of view.
>
> Thread 1's flow would then be:
>
> Thread 1
> ========
> wp_pfn_shared()
> dax_pfn_mkwrite()
> lock_slot()
> radix_tree_tag_set(DIRTY)
> wp_page_reuse()
> maybe_mkwrite()
> new_dax_call_to_unlock_slot()
> put_unlocked_mapping_entry()
>
> > ---
> > fs/dax.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 68 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5209f8cd0bee..c0c4eecb5f73 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -31,6 +31,7 @@
> > #include <linux/vmstat.h>
> > #include <linux/pfn_t.h>
> > #include <linux/sizes.h>
> > +#include <linux/mmu_notifier.h>
> >
> > /*
> > * We use lowest available bit in exceptional entry for locking, other two
> > @@ -665,6 +666,59 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > return new_entry;
> > }
> >
> > +static inline unsigned long
> > +pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
> > +{
> > + unsigned long address;
> > +
> > + address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> > + return address;
> > +}
> > +
> > +/* Walk all mappings of a given index of a file and writeprotect them */
> > +static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > + pgoff_t index, unsigned long pfn)
> > +{
> > + struct vm_area_struct *vma;
> > + pte_t *ptep;
> > + pte_t pte;
> > + spinlock_t *ptl;
> > + bool changed;
> > +
> > + i_mmap_lock_read(mapping);
> > + vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
> > + unsigned long address;
> > +
> > + cond_resched();
>
> Do we really need to cond_resched() between every PTE clean? Maybe it would
> be better to cond_resched() after each call to dax_writeback_one() in
> dax_writeback_mapping_range() or something so we can be sure we've done a good
> amount of work between each?
This is copied over from rmap code and I'd prefer to keep the code similar.
Note that cond_resched() does not mean we are going to reschedule. It is
pretty cheap call and we will be rescheduled only if we have used our CPU
slice...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2016-06-29 20:48 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-06-21 15:45 [PATCH 0/3 v1] dax: Clear dirty bits after flushing caches Jan Kara
2016-06-21 15:45 ` [PATCH 1/3] dax: Make cache flushing protected by entry lock Jan Kara
2016-06-24 21:44 ` Ross Zwisler
2016-06-29 20:28 ` Jan Kara
2016-06-21 15:45 ` [PATCH 2/3] mm: Export follow_pte() Jan Kara
2016-06-24 21:55 ` Ross Zwisler
2016-06-29 20:29 ` Jan Kara
2016-06-21 15:45 ` [PATCH 3/3] dax: Clear dirty entry tags on cache flush Jan Kara
2016-06-21 17:31 ` kbuild test robot
2016-06-21 20:59 ` kbuild test robot
2016-06-23 10:47 ` Jan Kara
2016-06-28 21:38 ` Ross Zwisler
2016-06-29 20:47 ` Jan Kara [this message]
2016-06-28 21:41 ` [PATCH 0/3 v1] dax: Clear dirty bits after flushing caches Ross Zwisler
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160629204734.GE16831@quack2.suse.cz \
--to=jack@suse.cz \
--cc=dan.j.williams@intel.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nvdimm@lists.01.org \
--cc=ross.zwisler@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox