Re: [PATCH 1/6] mm: tracking shared dirty pages

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Hugh Dickins <hugh@veritas.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@osdl.org>,
	David Howells <dhowells@redhat.com>,
	Christoph Lameter <christoph@lameter.com>,
	Martin Bligh <mbligh@google.com>, Nick Piggin <npiggin@suse.de>,
	Linus Torvalds <torvalds@osdl.org>
Subject: Re: [PATCH 1/6] mm: tracking shared dirty pages
Date: Thu, 22 Jun 2006 21:52:24 +0100 (BST)	[thread overview]
Message-ID: <Pine.LNX.4.64.0606222126310.26805@blonde.wat.veritas.com> (raw)
In-Reply-To: <20060619175253.24655.96323.sendpatchset@lappy>

On Mon, 19 Jun 2006, Peter Zijlstra wrote:
> 
> People expressed the need to track dirty pages in shared mappings.
> 
> Linus outlined the general idea of doing that through making clean
> writable pages write-protected and taking the write fault.
> 
> Index: 2.6-mm/include/linux/mm.h
> ===================================================================
> --- 2.6-mm.orig/include/linux/mm.h	2006-06-14 10:29:04.000000000 +0200
> +++ 2.6-mm/include/linux/mm.h	2006-06-19 14:45:18.000000000 +0200
> @@ -182,6 +182,12 @@ extern unsigned int kobjsize(const void 
>  #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
>  #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
>  
> +static inline int is_shared_writable(unsigned int flags)
> +{
> +	return (flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
> +		(VM_SHARED|VM_WRITE);
> +}
> +

Andrew asked for the inclusion of VM_PFNMAP to be commented there,
I don't believe that's enough: a function called "is_shared_writable"
should be testing precisely that, or people will misuse it.

Either you change the name to "is_shared_writable_but_not_pfnmap"
or somesuch, or you split out the VM_PFNMAP test, or you do away
with the function and make the tests explicit inline.  As before,
my instinctive preference is the latter: I really want to see what's
being tested (especially in do_wp_page); but perhaps it'll just look
too ugly all over - give it a try and see.

>  /*
>   * mapping from the currently active vm_flags protection bits (the
>   * low four bits) to a page protection mask..
> Index: 2.6-mm/mm/memory.c
> ===================================================================
> --- 2.6-mm.orig/mm/memory.c	2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/memory.c	2006-06-19 16:20:06.000000000 +0200
> @@ -938,6 +938,12 @@ struct page *follow_page(struct vm_area_
>  	pte = *ptep;
>  	if (!pte_present(pte))
>  		goto unlock;
> +	/*
> +	 * This is not fully correct in the light of trapping write faults
> +	 * for writable shared mappings. However since we're going to mark
> +	 * the page dirty anyway some few lines downward, we might as well
> +	 * take the write fault now.
> +	 */

I don't understand what you're getting at here: please explain,
what is not fully correct and why?  In mail first, then we can
decide what the comment should say, or if it should be removed.
follow_page isn't making a pte writable, so what's the issue?

>  	if ((flags & FOLL_WRITE) && !pte_write(pte))
>  		goto unlock;
>  	page = vm_normal_page(vma, address, pte);
> @@ -1458,13 +1464,14 @@ static int do_wp_page(struct mm_struct *
>  {
>  	struct page *old_page, *new_page;
>  	pte_t entry;
> -	int reuse, ret = VM_FAULT_MINOR;
> +	int reuse = 0, ret = VM_FAULT_MINOR;
> +	struct page *dirty_page = NULL;
>  
>  	old_page = vm_normal_page(vma, address, orig_pte);
>  	if (!old_page)
>  		goto gotten;
>  
> -	if (unlikely(vma->vm_flags & VM_SHARED)) {
> +	if (unlikely(is_shared_writable(vma->vm_flags))) {

Most interesting line in the series, yes, and I'd find it
easier to think through if it showed the flags test explicitly:
	if ((vma->vm_flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
		(VM_SHARED|VM_WRITE))

Yes, Andrew, you're right it's a change in behaviour from David's
page_mkwrite patch.  I've realized that when I was originally
reviewing David's patch, I believed do_wp_page was mistaken to be
doing COW on VM_SHARED areas.  But Linus has since asserted very
forcefully that it's intentional, that ptrace poke on a VM_SHARED
area which is currently not !VM_WRITE should COW it, so I mentioned
that to Peter.

Has he got the test right there now?  Ummm... maybe: my brain
exploded weeks ago.  Several strangenesses collide here, I'll
try again tomorrow, maybe others will argue it to certainty before.

>  		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
>  			/*
>  			 * Notify the address space that the page is about to
> Index: 2.6-mm/mm/mmap.c
> ===================================================================
> --- 2.6-mm.orig/mm/mmap.c	2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/mmap.c	2006-06-19 15:41:53.000000000 +0200
> @@ -25,6 +25,7 @@
>  #include <linux/mount.h>
>  #include <linux/mempolicy.h>
>  #include <linux/rmap.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/cacheflush.h>
> @@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file 
>  	struct rb_node ** rb_link, * rb_parent;
>  	int accountable = 1;
>  	unsigned long charged = 0, reqprot = prot;
> +	struct address_space *mapping = NULL;
>  
>  	if (file) {
>  		if (is_file_hugepages(file))
> @@ -1084,18 +1086,13 @@ munmap_back:
>  		error = file->f_op->mmap(file, vma);
>  		if (error)
>  			goto unmap_and_free_vma;
> +

Do you really need this blank line?

>  	} else if (vm_flags & VM_SHARED) {
>  		error = shmem_zero_setup(vma);
>  		if (error)
>  			goto free_vma;
>  	}
>  
> -	/* Don't make the VMA automatically writable if it's shared, but the
> -	 * backer wishes to know when pages are first written to */
> -	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> -		vma->vm_page_prot =
> -			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
> -
>  	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
>  	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
>  	 * that memory reservation must be checked; but that reservation
> @@ -1113,6 +1110,31 @@ munmap_back:
>  	pgoff = vma->vm_pgoff;
>  	vm_flags = vma->vm_flags;
>  
> +	/*
> +	 * Tracking of dirty pages for shared writable mappings. Do this by
> +	 * write protecting writable pages, and mark dirty in the write fault.
> +	 *
> +	 * Modify vma->vm_page_prot (the default protection for new pages)
> +	 * to this effect.
> +	 *
> +	 * Cannot do before because the condition depends on:
> +	 *  - backing_dev_info having the right capabilities
> +	 *    (set by f_op->open())

Is that so, backing_dev_info set by f_op->open()?
And how would that be a problem here if it were so?

> +	 *  - vma->vm_flags being fully set
> +	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
> +	 *
> +	 *  Also, cannot reset vma->vm_page_prot from vma->vm_flags because
> +	 *  f_op->mmap() can modify it.
> +	 */
> +	if (is_shared_writable(vm_flags) && vma->vm_file)
> +		mapping = vma->vm_file->f_mapping;
> +	if ((mapping && mapping_cap_account_dirty(mapping)) ||
> +			(vma->vm_ops && vma->vm_ops->page_mkwrite))

The only way "mapping" might be set is just above.
Wouldn't it all be clearer (though more indented) if you said

	if (is_shared_writable(vm_flags) && vma->vm_file) {
		mapping = vma->vm_file->f_mapping;
		if ((mapping && mapping_cap_account_dirty(mapping)) ||
				(vma->vm_ops && vma->vm_ops->page_mkwrite)) {
			vma->vm_page_prot = whatever;
		}
	}

Or no need for "mapping" here at all if you change
mapping_cap_account_dirty(vma->vm_file->f_mapping)
to do the right thing with NULL.

> +		vma->vm_page_prot =
> +			__pgprot(pte_val
> +				(pte_wrprotect
> +				 (__pte(pgprot_val(vma->vm_page_prot)))));
> +

In other mail I've suggested saving vm_page_prot above, and
changing it here only if the driver's ->mmap did not change it.

I remain uneasy about interfering with the permissions expected by
strange drivers, but can't really justify my paranoia.  Certainly
you're right to exclude VM_PFNMAPs from this interference, that's
important; I'd be less uneasy if you also exclude VM_INSERTPAGEs,
they're strange too - but at least they're dealing with proper struct
pages, so should be able to handle an unexpected do_wp_page; that
leaves the driver nopage cases, which again should be okay now you're
(one way or another) protecting specially added vm_page_prot flags.

I guess I'm just paranoid; it's irritating me that we do not have
the right backing_dev_infos in place and having to hack around it.

>  	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
>  			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
>  		file = vma->vm_file;
> Index: 2.6-mm/mm/mprotect.c
> ===================================================================
> --- 2.6-mm.orig/mm/mprotect.c	2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/mprotect.c	2006-06-19 16:19:42.000000000 +0200
> @@ -21,6 +21,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/backing-dev.h>
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
>  #include <asm/cacheflush.h>
> @@ -124,6 +125,7 @@ mprotect_fixup(struct vm_area_struct *vm
>  	long nrpages = (end - start) >> PAGE_SHIFT;
>  	unsigned long charged = 0;
>  	unsigned int mask;
> +	struct address_space *mapping = NULL;
>  	pgprot_t newprot;
>  	pgoff_t pgoff;
>  	int error;
> @@ -179,7 +181,10 @@ success:
>  	/* Don't make the VMA automatically writable if it's shared, but the
>  	 * backer wishes to know when pages are first written to */
>  	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
> -	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> +	if (is_shared_writable(newflags) && vma->vm_file)
> +		mapping = vma->vm_file->f_mapping;
> +	if ((mapping && mapping_cap_account_dirty(mapping)) ||
> +			(vma->vm_ops && vma->vm_ops->page_mkwrite))

Similar remarks on indenting,
or letting mapping_cap_account_dirty take NULL mapping.

>  		mask &= ~VM_SHARED;
>  
>  	newprot = protection_map[newflags & mask];
> Index: 2.6-mm/mm/rmap.c
> ===================================================================
> --- 2.6-mm.orig/mm/rmap.c	2006-06-14 10:29:07.000000000 +0200
> +++ 2.6-mm/mm/rmap.c	2006-06-19 14:45:18.000000000 +0200
> @@ -53,6 +53,7 @@
>  #include <linux/rmap.h>
>  #include <linux/rcupdate.h>
>  #include <linux/module.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -434,6 +435,73 @@ int page_referenced(struct page *page, i
>  	return referenced;
>  }
>  
> +static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long address;
> +	pte_t *pte, entry;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	address = vma_address(page, vma);
> +	if (address == -EFAULT)
> +		goto out;
> +
> +	pte = page_check_address(page, mm, address, &ptl);
> +	if (!pte)
> +		goto out;
> +
> +	if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
> +		goto unlock;
> +
> +	entry = ptep_get_and_clear(mm, address, pte);
> +	entry = pte_mkclean(entry);
> +	if (protect)
> +		entry = pte_wrprotect(entry);
> +	ptep_establish(vma, address, pte, entry);
> +	lazy_mmu_prot_update(entry);
> +	ret = 1;
> +
> +unlock:
> +	pte_unmap_unlock(pte, ptl);
> +out:
> +	return ret;
> +}
> +
> +static int page_mkclean_file(struct address_space *mapping, struct page *page)
> +{
> +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	struct vm_area_struct *vma;
> +	struct prio_tree_iter iter;
> +	int ret = 0;
> +
> +	BUG_ON(PageAnon(page));
> +
> +	spin_lock(&mapping->i_mmap_lock);
> +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> +		int protect = mapping_cap_account_dirty(mapping) &&
> +			is_shared_writable(vma->vm_flags);
> +		ret += page_mkclean_one(page, vma, protect);

You have a good point here, one I'd completely missed: because a vma
may have been recently mprotected !VM_WRITE, you have to check readonly
mappings too.  Perhaps worth a comment.  But I think "is_shared_writable"
is not the best test here: just test for VM_SHARED vmas, they're the
only ones which can be mprotected to/from shared writable.  And then
I think you don't need to pass down an additional "protect" argument?
It's only being called for mapping_cap_account_dirty mappings anyway,
isn't it?

> +	}
> +	spin_unlock(&mapping->i_mmap_lock);
> +	return ret;
> +}

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2006-06-22 20:52 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
2006-06-19 17:52 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
2006-06-22  5:56   ` Andrew Morton
2006-06-22  6:07     ` Christoph Lameter
2006-06-22  6:15       ` Andrew Morton
2006-06-22 11:33     ` Peter Zijlstra
2006-06-22 13:17       ` Hugh Dickins
2006-06-22 20:52   ` Hugh Dickins [this message]
2006-06-22 23:02     ` Peter Zijlstra
2006-06-22 23:39     ` [PATCH] mm: tracking shared dirty pages -v10 Peter Zijlstra
2006-06-23  3:10       ` Jeff Dike
2006-06-23  3:31         ` Andrew Morton
2006-06-23  3:50           ` Jeff Dike
2006-06-23  4:01           ` H. Peter Anvin
2006-06-23 15:08             ` Jeff Dike
2006-06-23  6:08       ` Linus Torvalds
2006-06-23  7:27         ` Hugh Dickins
2006-06-23 17:00           ` Christoph Lameter
2006-06-23 17:22             ` Peter Zijlstra
2006-06-23 17:52               ` Christoph Lameter
2006-06-23 18:11                 ` Martin Bligh
2006-06-23 18:20                   ` Linus Torvalds
2006-06-23 17:56               ` Linus Torvalds
2006-06-23 18:03                 ` Peter Zijlstra
2006-06-23 18:23                   ` Christoph Lameter
2006-06-23 18:41                 ` Christoph Hellwig
2006-06-23 17:49           ` Linus Torvalds
2006-06-23 18:05             ` Arjan van de Ven
2006-06-23 18:08             ` Miklos Szeredi
2006-06-23 19:06       ` Hugh Dickins
2006-06-23 22:00         ` Peter Zijlstra
2006-06-23 22:35           ` Linus Torvalds
2006-06-23 22:44             ` Peter Zijlstra
2006-06-28 14:58         ` [RFC][PATCH] mm: fixup do_wp_page() Peter Zijlstra
2006-06-28 18:20           ` Hugh Dickins
2006-06-19 17:53 ` [PATCH 2/6] mm: balance dirty pages Peter Zijlstra
2006-06-19 17:53 ` [PATCH 3/6] mm: msync() cleanup Peter Zijlstra
2006-06-22 17:02   ` Hugh Dickins
2006-06-19 17:53 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
2006-06-22 17:21   ` Hugh Dickins
2006-06-19 17:53 ` [PATCH 5/6] mm: small cleanup of install_page() Peter Zijlstra
2006-06-19 17:53 ` [PATCH 6/6] mm: remove some update_mmu_cache() calls Peter Zijlstra
2006-06-22 16:29   ` Hugh Dickins
2006-06-22 16:37     ` Christoph Lameter
2006-06-22 17:35       ` Hugh Dickins
2006-06-22 18:31         ` Christoph Lameter
  -- strict thread matches above, loose matches on Subject: below --
2006-06-28 20:17 [PATCH 0/6] mm: tracking dirty pages -v14 Peter Zijlstra
2006-06-28 20:17 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
2006-06-13 11:21 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0606222126310.26805@blonde.wat.veritas.com \
    --to=hugh@veritas.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@osdl.org \
    --cc=christoph@lameter.com \
    --cc=dhowells@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mbligh@google.com \
    --cc=npiggin@suse.de \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox