linux-mm.kvack.org archive mirror
From: mel@skynet.ie (Mel Gorman)
To: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, clameter@sgi.com,
	riel@redhat.com, balbir@linux.vnet.ibm.com, andrea@suse.de,
	a.p.zijlstra@chello.nl, eric.whitney@hp.com, npiggin@suse.de
Subject: Re: [PATCH/RFC 2/14] Reclaim Scalability:  convert inode i_mmap_lock to reader/writer lock
Date: Mon, 17 Sep 2007 13:53:55 +0100	[thread overview]
Message-ID: <20070917125355.GH25706@skynet.ie> (raw)
In-Reply-To: <20070914205412.6536.34898.sendpatchset@localhost>

On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> PATCH/RFC 02/14 Reclaim Scalability:  make the inode i_mmap_lock a reader/writer lock
> 
> Against:  2.6.23-rc4-mm1
> 
> I have seen soft cpu lockups in page_referenced_file() due to 
> contention on the i_mmap_lock for different pages.  Making the
> i_mmap_lock a reader/writer lock should increase parallelism
> in vmscan for file-backed pages mapped into many address spaces.
> 

Same as the last patch. With respect to the rw-lock, we need to be sure
this is not regressing the general case. A mmap() microbenchmark would
show it up, but rapid mapping and unmapping doesn't feel particularly
realistic.

Considering the nature of the lock though, pretty much any benchmark
will show up problems, right?
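
For reference, the sort of mmap() microbenchmark I have in mind is just a
tight map/unmap loop over a file-backed region, something along the lines of
the sketch below (untested, illustrative only; the file path and iteration
count are arbitrary). Comparing the elapsed time on spinlock and rwlock
kernels should show any write-side regression directly:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>

#define ITERATIONS	100000
#define MAP_SIZE	(4UL << 20)	/* 4MB file-backed mapping */

int main(void)
{
	struct timeval start, end;
	long usec;
	int fd, i;

	/*
	 * Each map/unmap links/unlinks a vma on the file's i_mmap tree,
	 * which takes i_mmap_lock for write in vma_link()/unlink_file_vma().
	 */
	fd = open("/tmp/mmap-bench", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0) {
		perror("setup");
		exit(1);
	}

	gettimeofday(&start, NULL);
	for (i = 0; i < ITERATIONS; i++) {
		void *addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		munmap(addr, MAP_SIZE);
	}
	gettimeofday(&end, NULL);

	usec = (end.tv_sec - start.tv_sec) * 1000000L +
	       (end.tv_usec - start.tv_usec);
	printf("%d map/unmap cycles in %ld usec\n", ITERATIONS, usec);
	close(fd);
	return 0;
}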

> Read lock the i_mmap_lock for all usage except:
> 
> 1) mmap/munmap:  linking vma into i_mmap prio_tree or removing
> 2) unmap_mapping_range:   protecting vm_truncate_count
> 
> rmap:  try_to_unmap_file() required new cond_resched_rwlock().
> To reduce code duplication, I recast cond_resched_lock() as a
> [static inline] wrapper around reworked cond_resched_lock() =>
> __cond_resched_lock(void *lock, int type). 
> New cond_resched_rwlock() implemented as another wrapper.  
> 
> Note:  This patch is meant to address a situation I've seen
> running a large Oracle OLTP workload--1000s of users--on a
> large HP ia64 NUMA platform.  The system hung, spitting out
> "soft lockup" messages on the console.  Stack traces showed
> that all cpus were in page_referenced(), as mentioned above.
> I let the system run overnight in this state--it never
> recovered before I decided to reboot.
> 
> TODO:  I've yet to test this patch with the same workload
> to see what happens.  Don't have access to the system now.
> 

If you do get access to the machine, can you see if the patch reduces
the number of transactions Oracle is capable of?

> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  fs/hugetlbfs/inode.c  |    7 +++--
>  fs/inode.c            |    2 -
>  fs/revoke.c           |   18 +++++++-------
>  include/linux/fs.h    |    2 -
>  include/linux/mm.h    |    2 -
>  include/linux/sched.h |   17 ++++++++++---
>  kernel/fork.c         |    4 +--
>  kernel/sched.c        |   64 ++++++++++++++++++++++++++++++++++++++++++--------
>  mm/filemap_xip.c      |    4 +--
>  mm/fremap.c           |    4 +--
>  mm/hugetlb.c          |    8 +++---
>  mm/memory.c           |   13 +++++-----
>  mm/migrate.c          |    4 +--
>  mm/mmap.c             |   18 +++++++-------
>  mm/mremap.c           |    4 +--
>  mm/rmap.c             |   16 ++++++------
>  16 files changed, 123 insertions(+), 64 deletions(-)
> 
> Index: Linux/include/linux/fs.h
> ===================================================================
> --- Linux.orig/include/linux/fs.h	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/include/linux/fs.h	2007-09-10 11:43:26.000000000 -0400
> @@ -506,7 +506,7 @@ struct address_space {
>  	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
>  	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
>  	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
> -	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
> +	rwlock_t		i_mmap_lock;	/* protect tree, count, list */
>  	unsigned int		truncate_count;	/* Cover race condition with truncate */
>  	unsigned long		nrpages;	/* number of total pages */
>  	pgoff_t			writeback_index;/* writeback starts here */
> Index: Linux/include/linux/mm.h
> ===================================================================
> --- Linux.orig/include/linux/mm.h	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/include/linux/mm.h	2007-09-10 11:43:26.000000000 -0400
> @@ -684,7 +684,7 @@ struct zap_details {
>  	struct address_space *check_mapping;	/* Check page->mapping if set */
>  	pgoff_t	first_index;			/* Lowest page->index to unmap */
>  	pgoff_t last_index;			/* Highest page->index to unmap */
> -	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
> +	rwlock_t *i_mmap_lock;			/* For unmap_mapping_range: */
>  	unsigned long truncate_count;		/* Compare vm_truncate_count */
>  };
>  
> Index: Linux/fs/inode.c
> ===================================================================
> --- Linux.orig/fs/inode.c	2007-09-10 10:09:43.000000000 -0400
> +++ Linux/fs/inode.c	2007-09-10 11:43:26.000000000 -0400
> @@ -203,7 +203,7 @@ void inode_init_once(struct inode *inode
>  	init_rwsem(&inode->i_alloc_sem);
>  	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
>  	rwlock_init(&inode->i_data.tree_lock);
> -	spin_lock_init(&inode->i_data.i_mmap_lock);
> +	rwlock_init(&inode->i_data.i_mmap_lock);
>  	INIT_LIST_HEAD(&inode->i_data.private_list);
>  	spin_lock_init(&inode->i_data.private_lock);
>  	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
> Index: Linux/fs/hugetlbfs/inode.c
> ===================================================================
> --- Linux.orig/fs/hugetlbfs/inode.c	2007-09-10 10:09:43.000000000 -0400
> +++ Linux/fs/hugetlbfs/inode.c	2007-09-10 11:43:26.000000000 -0400
> @@ -411,6 +411,9 @@ static void hugetlbfs_drop_inode(struct 
>  		hugetlbfs_forget_inode(inode);
>  }
>  
> +/*
> + * LOCKING:  __unmap_hugepage_range() requires write lock on i_mmap_lock
> + */
>  static inline void
>  hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
>  {
> @@ -445,10 +448,10 @@ static int hugetlb_vmtruncate(struct ino
>  	pgoff = offset >> PAGE_SHIFT;
>  
>  	i_size_write(inode, offset);
> -	spin_lock(&mapping->i_mmap_lock);
> +	write_lock(&mapping->i_mmap_lock);
>  	if (!prio_tree_empty(&mapping->i_mmap))
>  		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> -	spin_unlock(&mapping->i_mmap_lock);
> +	write_unlock(&mapping->i_mmap_lock);
>  	truncate_hugepages(inode, offset);
>  	return 0;
>  }
> Index: Linux/fs/revoke.c
> ===================================================================
> --- Linux.orig/fs/revoke.c	2007-09-10 10:09:44.000000000 -0400
> +++ Linux/fs/revoke.c	2007-09-10 11:43:26.000000000 -0400
> @@ -272,7 +272,7 @@ static int revoke_break_cow(struct files
>  
>  /*
>   *	 LOCKING: down_write(&mm->mmap_sem)
> - *	 	    -> spin_lock(&mapping->i_mmap_lock)
> + *	 	    -> write_lock(&mapping->i_mmap_lock)
>   */
>  static int revoke_vma(struct vm_area_struct *vma, struct zap_details *details)
>  {
> @@ -298,14 +298,14 @@ static int revoke_vma(struct vm_area_str
>  	return 0;
>  
>    out_need_break:
> -	spin_unlock(details->i_mmap_lock);
> +	write_unlock(details->i_mmap_lock);
>  	cond_resched();
> -	spin_lock(details->i_mmap_lock);
> +	write_lock(details->i_mmap_lock);
>  	return -EINTR;
>  }
>  
>  /*
> - *	LOCKING: spin_lock(&mapping->i_mmap_lock)
> + *	LOCKING: write_lock(&mapping->i_mmap_lock)
>   */
>  static int revoke_mm(struct mm_struct *mm, struct address_space *mapping,
>  		     struct file *to_exclude)
> @@ -335,7 +335,7 @@ static int revoke_mm(struct mm_struct *m
>  		if (err)
>  			break;
>  
> -		__unlink_file_vma(vma);
> +		__unlink_file_vma(vma);	/* requires write_lock(i_mmap_lock) */
>  		fput(vma->vm_file);
>  		vma->vm_file = NULL;
>  	}
> @@ -345,7 +345,7 @@ static int revoke_mm(struct mm_struct *m
>  }
>  
>  /*
> - *	LOCKING: spin_lock(&mapping->i_mmap_lock)
> + *	LOCKING: write_lock(&mapping->i_mmap_lock)
>   */
>  static void revoke_mapping_tree(struct address_space *mapping,
>  				struct file *to_exclude)
> @@ -377,7 +377,7 @@ static void revoke_mapping_tree(struct a
>  }
>  
>  /*
> - *	LOCKING: spin_lock(&mapping->i_mmap_lock)
> + *	LOCKING: write_lock(&mapping->i_mmap_lock)
>   */
>  static void revoke_mapping_list(struct address_space *mapping,
>  				struct file *to_exclude)
> @@ -408,12 +408,12 @@ static void revoke_mapping_list(struct a
>  
>  static void revoke_mapping(struct address_space *mapping, struct file *to_exclude)
>  {
> -	spin_lock(&mapping->i_mmap_lock);
> +	write_lock(&mapping->i_mmap_lock);
>  	if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
>  		revoke_mapping_tree(mapping, to_exclude);
>  	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
>  		revoke_mapping_list(mapping, to_exclude);
> -	spin_unlock(&mapping->i_mmap_lock);
> +	write_unlock(&mapping->i_mmap_lock);
>  }
>  
>  static void restore_file(struct revokefs_inode_info *info)
> Index: Linux/kernel/fork.c
> ===================================================================
> --- Linux.orig/kernel/fork.c	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/kernel/fork.c	2007-09-10 11:43:26.000000000 -0400
> @@ -262,12 +262,12 @@ static inline int dup_mmap(struct mm_str
>  				atomic_dec(&inode->i_writecount);
>  
>  			/* insert tmp into the share list, just after mpnt */
> -			spin_lock(&file->f_mapping->i_mmap_lock);
> +			write_lock(&file->f_mapping->i_mmap_lock);
>  			tmp->vm_truncate_count = mpnt->vm_truncate_count;
>  			flush_dcache_mmap_lock(file->f_mapping);
>  			vma_prio_tree_add(tmp, mpnt);
>  			flush_dcache_mmap_unlock(file->f_mapping);
> -			spin_unlock(&file->f_mapping->i_mmap_lock);
> +			write_unlock(&file->f_mapping->i_mmap_lock);
>  		}
>  
>  		/*
> Index: Linux/mm/filemap_xip.c
> ===================================================================
> --- Linux.orig/mm/filemap_xip.c	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/filemap_xip.c	2007-09-10 11:43:26.000000000 -0400
> @@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapp
>  	if (!page)
>  		return;
>  
> -	spin_lock(&mapping->i_mmap_lock);
> +	read_lock(&mapping->i_mmap_lock);
>  	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
>  		mm = vma->vm_mm;
>  		address = vma->vm_start +
> @@ -200,7 +200,7 @@ __xip_unmap (struct address_space * mapp
>  			page_cache_release(page);
>  		}
>  	}
> -	spin_unlock(&mapping->i_mmap_lock);
> +	read_unlock(&mapping->i_mmap_lock);
>  }
>  
>  /*
> Index: Linux/mm/fremap.c
> ===================================================================
> --- Linux.orig/mm/fremap.c	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/fremap.c	2007-09-10 11:43:26.000000000 -0400
> @@ -200,13 +200,13 @@ asmlinkage long sys_remap_file_pages(uns
>  			}
>  			goto out;
>  		}
> -		spin_lock(&mapping->i_mmap_lock);
> +		write_lock(&mapping->i_mmap_lock);
>  		flush_dcache_mmap_lock(mapping);
>  		vma->vm_flags |= VM_NONLINEAR;
>  		vma_prio_tree_remove(vma, &mapping->i_mmap);
>  		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
>  		flush_dcache_mmap_unlock(mapping);
> -		spin_unlock(&mapping->i_mmap_lock);
> +		write_unlock(&mapping->i_mmap_lock);
>  	}
>  
>  	err = populate_range(mm, vma, start, size, pgoff);
> Index: Linux/mm/hugetlb.c
> ===================================================================
> --- Linux.orig/mm/hugetlb.c	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/hugetlb.c	2007-09-10 11:43:26.000000000 -0400
> @@ -451,9 +451,9 @@ void unmap_hugepage_range(struct vm_area
>  	 * do nothing in this case.
>  	 */
>  	if (vma->vm_file) {
> -		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> +		write_lock(&vma->vm_file->f_mapping->i_mmap_lock);
>  		__unmap_hugepage_range(vma, start, end);
> -		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
> +		write_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
>  	}
>  }
>  
> @@ -693,7 +693,7 @@ void hugetlb_change_protection(struct vm
>  	BUG_ON(address >= end);
>  	flush_cache_range(vma, address, end);
>  
> -	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> +	read_lock(&vma->vm_file->f_mapping->i_mmap_lock);
>  	spin_lock(&mm->page_table_lock);
>  	for (; address < end; address += HPAGE_SIZE) {
>  		ptep = huge_pte_offset(mm, address);
> @@ -708,7 +708,7 @@ void hugetlb_change_protection(struct vm
>  		}
>  	}
>  	spin_unlock(&mm->page_table_lock);
> -	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
> +	read_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
>  
>  	flush_tlb_range(vma, start, end);
>  }
> Index: Linux/mm/memory.c
> ===================================================================
> --- Linux.orig/mm/memory.c	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/memory.c	2007-09-10 11:43:26.000000000 -0400
> @@ -816,7 +816,7 @@ unsigned long unmap_vmas(struct mmu_gath
>  	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
>  	int tlb_start_valid = 0;
>  	unsigned long start = start_addr;
> -	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
> +	rwlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
>  	int fullmm = (*tlbp)->fullmm;
>  
>  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> @@ -1728,7 +1728,7 @@ unwritable_page:
>   * can't efficiently keep all vmas in step with mapping->truncate_count:
>   * so instead reset them all whenever it wraps back to 0 (then go to 1).
>   * mapping->truncate_count and vma->vm_truncate_count are protected by
> - * i_mmap_lock.
> + * write locked i_mmap_lock.
>   *
>   * In order to make forward progress despite repeatedly restarting some
>   * large vma, note the restart_addr from unmap_vmas when it breaks out:
> @@ -1793,9 +1793,10 @@ again:
>  			goto again;
>  	}
>  
> -	spin_unlock(details->i_mmap_lock);
> +//TODO:  why not cond_resched_lock() here [rwlock version]?
> +	write_unlock(details->i_mmap_lock);
>  	cond_resched();
> -	spin_lock(details->i_mmap_lock);
> +	write_lock(details->i_mmap_lock);

I guess it's not used because it just doesn't exist :/ . Not sure why
that is, although the likely answer is that no one thought it was
necessary.
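
FWIW, with the cond_resched_rwlock() this patch introduces, the open-coded
unlock/resched/lock sequence above could presumably be collapsed to something
like the sketch below. It is not an exact drop-in, because the helper only
drops the lock when a reschedule or lockbreak is actually pending:

	/*
	 * Sketch only: write variant (second argument non-zero), since
	 * unmap_mapping_range() holds i_mmap_lock for write in this series.
	 */
	cond_resched_rwlock(details->i_mmap_lock, 1);
	return -EINTR;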

>  	return -EINTR;
>  }
>  
> @@ -1891,7 +1892,7 @@ void unmap_mapping_range(struct address_
>  		details.last_index = ULONG_MAX;
>  	details.i_mmap_lock = &mapping->i_mmap_lock;
>  
> -	spin_lock(&mapping->i_mmap_lock);
> +	write_lock(&mapping->i_mmap_lock);
>  
>  	/* Protect against endless unmapping loops */
>  	mapping->truncate_count++;
> @@ -1906,7 +1907,7 @@ void unmap_mapping_range(struct address_
>  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
>  	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
>  		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
> -	spin_unlock(&mapping->i_mmap_lock);
> +	write_unlock(&mapping->i_mmap_lock);
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
>  
> Index: Linux/mm/migrate.c
> ===================================================================
> --- Linux.orig/mm/migrate.c	2007-09-10 11:43:11.000000000 -0400
> +++ Linux/mm/migrate.c	2007-09-10 11:43:26.000000000 -0400
> @@ -207,12 +207,12 @@ static void remove_file_migration_ptes(s
>  	if (!mapping)
>  		return;
>  
> -	spin_lock(&mapping->i_mmap_lock);
> +	read_lock(&mapping->i_mmap_lock);
>  
>  	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
>  		remove_migration_pte(vma, old, new);
>  
> -	spin_unlock(&mapping->i_mmap_lock);
> +	read_unlock(&mapping->i_mmap_lock);
>  }
>  
>  /*
> Index: Linux/mm/mmap.c
> ===================================================================
> --- Linux.orig/mm/mmap.c	2007-09-10 11:43:11.000000000 -0400
> +++ Linux/mm/mmap.c	2007-09-10 11:43:26.000000000 -0400
> @@ -182,7 +182,7 @@ error:
>  }
>  
>  /*
> - * Requires inode->i_mapping->i_mmap_lock
> + * Requires write locked inode->i_mapping->i_mmap_lock
>   */
>  static void __remove_shared_vm_struct(struct vm_area_struct *vma,
>  		struct file *file, struct address_space *mapping)
> @@ -201,7 +201,7 @@ static void __remove_shared_vm_struct(st
>  }
>  
>  /*
> - * Requires inode->i_mapping->i_mmap_lock
> + * Requires write locked inode->i_mapping->i_mmap_lock
>   */
>  void __unlink_file_vma(struct vm_area_struct *vma)
>  {
> @@ -221,9 +221,9 @@ void unlink_file_vma(struct vm_area_stru
>  
>  	if (file) {
>  		struct address_space *mapping = file->f_mapping;
> -		spin_lock(&mapping->i_mmap_lock);
> +		write_lock(&mapping->i_mmap_lock);
>  		__remove_shared_vm_struct(vma, file, mapping);
> -		spin_unlock(&mapping->i_mmap_lock);
> +		write_unlock(&mapping->i_mmap_lock);
>  	}
>  }
>  
> @@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m
>  		mapping = vma->vm_file->f_mapping;
>  
>  	if (mapping) {
> -		spin_lock(&mapping->i_mmap_lock);
> +		write_lock(&mapping->i_mmap_lock);
>  		vma->vm_truncate_count = mapping->truncate_count;
>  	}
>  	anon_vma_lock(vma);
> @@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m
>  
>  	anon_vma_unlock(vma);
>  	if (mapping)
> -		spin_unlock(&mapping->i_mmap_lock);
> +		write_unlock(&mapping->i_mmap_lock);
>  
>  	mm->map_count++;
>  	validate_mm(mm);
> @@ -542,7 +542,7 @@ again:			remove_next = 1 + (end > next->
>  		mapping = file->f_mapping;
>  		if (!(vma->vm_flags & VM_NONLINEAR))
>  			root = &mapping->i_mmap;
> -		spin_lock(&mapping->i_mmap_lock);
> +		write_lock(&mapping->i_mmap_lock);
>  		if (importer &&
>  		    vma->vm_truncate_count != next->vm_truncate_count) {
>  			/*
> @@ -626,7 +626,7 @@ again:			remove_next = 1 + (end > next->
>  	if (anon_vma)
>  		write_unlock(&anon_vma->rwlock);
>  	if (mapping)
> -		spin_unlock(&mapping->i_mmap_lock);
> +		write_unlock(&mapping->i_mmap_lock);
>  
>  	if (remove_next) {
>  		if (file)
> @@ -2064,7 +2064,7 @@ void exit_mmap(struct mm_struct *mm)
>  
>  /* Insert vm structure into process list sorted by address
>   * and into the inode's i_mmap tree.  If vm_file is non-NULL
> - * then i_mmap_lock is taken here.
> + * then i_mmap_lock is write locked here.
>   */
>  int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
>  {
> Index: Linux/mm/mremap.c
> ===================================================================
> --- Linux.orig/mm/mremap.c	2007-09-10 10:09:38.000000000 -0400
> +++ Linux/mm/mremap.c	2007-09-10 11:43:26.000000000 -0400
> @@ -83,7 +83,7 @@ static void move_ptes(struct vm_area_str
>  		 * and we propagate stale pages into the dst afterward.
>  		 */
>  		mapping = vma->vm_file->f_mapping;
> -		spin_lock(&mapping->i_mmap_lock);
> +		read_lock(&mapping->i_mmap_lock);
>  		if (new_vma->vm_truncate_count &&
>  		    new_vma->vm_truncate_count != vma->vm_truncate_count)
>  			new_vma->vm_truncate_count = 0;
> @@ -115,7 +115,7 @@ static void move_ptes(struct vm_area_str
>  	pte_unmap_nested(new_pte - 1);
>  	pte_unmap_unlock(old_pte - 1, old_ptl);
>  	if (mapping)
> -		spin_unlock(&mapping->i_mmap_lock);
> +		read_unlock(&mapping->i_mmap_lock);
>  }
>  
>  #define LATENCY_LIMIT	(64 * PAGE_SIZE)
> Index: Linux/mm/rmap.c
> ===================================================================
> --- Linux.orig/mm/rmap.c	2007-09-10 11:43:11.000000000 -0400
> +++ Linux/mm/rmap.c	2007-09-10 11:43:26.000000000 -0400
> @@ -365,7 +365,7 @@ static int page_referenced_file(struct p
>  	 */
>  	BUG_ON(!PageLocked(page));
>  
> -	spin_lock(&mapping->i_mmap_lock);
> +	read_lock(&mapping->i_mmap_lock);
>  
>  	/*
>  	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
> @@ -391,7 +391,7 @@ static int page_referenced_file(struct p
>  			break;
>  	}
>  
> -	spin_unlock(&mapping->i_mmap_lock);
> +	read_unlock(&mapping->i_mmap_lock);
>  	return referenced;
>  }
>  
> @@ -472,12 +472,12 @@ static int page_mkclean_file(struct addr
>  
>  	BUG_ON(PageAnon(page));
>  
> -	spin_lock(&mapping->i_mmap_lock);
> +	read_lock(&mapping->i_mmap_lock);
>  	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
>  		if (vma->vm_flags & VM_SHARED)
>  			ret += page_mkclean_one(page, vma);
>  	}
> -	spin_unlock(&mapping->i_mmap_lock);
> +	read_unlock(&mapping->i_mmap_lock);
>  	return ret;
>  }
>  
> @@ -904,7 +904,7 @@ static int try_to_unmap_file(struct page
>  	unsigned long max_nl_size = 0;
>  	unsigned int mapcount;
>  
> -	spin_lock(&mapping->i_mmap_lock);
> +	read_lock(&mapping->i_mmap_lock);
>  	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
>  		ret = try_to_unmap_one(page, vma, migration);
>  		if (ret == SWAP_FAIL || !page_mapped(page))
> @@ -941,7 +941,7 @@ static int try_to_unmap_file(struct page
>  	mapcount = page_mapcount(page);
>  	if (!mapcount)
>  		goto out;
> -	cond_resched_lock(&mapping->i_mmap_lock);
> +	cond_resched_rwlock(&mapping->i_mmap_lock, 0);
>  
>  	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
>  	if (max_nl_cursor == 0)
> @@ -963,7 +963,7 @@ static int try_to_unmap_file(struct page
>  			}
>  			vma->vm_private_data = (void *) max_nl_cursor;
>  		}
> -		cond_resched_lock(&mapping->i_mmap_lock);
> +		cond_resched_rwlock(&mapping->i_mmap_lock, 0);
>  		max_nl_cursor += CLUSTER_SIZE;
>  	} while (max_nl_cursor <= max_nl_size);
>  
> @@ -975,7 +975,7 @@ static int try_to_unmap_file(struct page
>  	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
>  		vma->vm_private_data = NULL;
>  out:
> -	spin_unlock(&mapping->i_mmap_lock);
> +	read_unlock(&mapping->i_mmap_lock);
>  	return ret;
>  }
>  
> Index: Linux/include/linux/sched.h
> ===================================================================
> --- Linux.orig/include/linux/sched.h	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/include/linux/sched.h	2007-09-10 11:43:26.000000000 -0400
> @@ -1823,12 +1823,23 @@ static inline int need_resched(void)
>   * cond_resched() and cond_resched_lock(): latency reduction via
>   * explicit rescheduling in places that are safe. The return
>   * value indicates whether a reschedule was done in fact.
> - * cond_resched_lock() will drop the spinlock before scheduling,
> - * cond_resched_softirq() will enable bhs before scheduling.
> + * cond_resched_softirq() will enable bhs before scheduling,
> + * cond_resched_*lock() will drop the *lock before scheduling.
>   */
>  extern int cond_resched(void);
> -extern int cond_resched_lock(spinlock_t * lock);
>  extern int cond_resched_softirq(void);
> +extern int __cond_resched_lock(void * lock, int lock_type);
> +
> +#define COND_RESCHED_SPIN  2
> +static inline int cond_resched_lock(spinlock_t * lock)
> +{
> +	return __cond_resched_lock(lock, COND_RESCHED_SPIN);
> +}
> +
> +static inline int cond_resched_rwlock(rwlock_t * lock, int write_lock)
> +{
> +	return __cond_resched_lock(lock, !!write_lock);
> +}
>  
>  /*
>   * Does a critical section need to be broken due to another
> Index: Linux/kernel/sched.c
> ===================================================================
> --- Linux.orig/kernel/sched.c	2007-09-10 10:09:47.000000000 -0400
> +++ Linux/kernel/sched.c	2007-09-10 11:43:26.000000000 -0400
> @@ -4635,34 +4635,78 @@ int __sched cond_resched(void)
>  EXPORT_SYMBOL(cond_resched);
>  
>  /*
> - * cond_resched_lock() - if a reschedule is pending, drop the given lock,
> + * helper functions for __cond_resched_lock()
> + */
> +static int __need_lockbreak(void *lock, int type)
> +{
> +	if (likely(type == COND_RESCHED_SPIN))
> +		return need_lockbreak((spinlock_t *)lock);
> +	else
> +		return need_lockbreak((rwlock_t *)lock);
> +}
> +
> +static void __reacquire_lock(void *lock, int type)
> +{
> +	if (likely(type == COND_RESCHED_SPIN))
> +		spin_lock((spinlock_t *)lock);
> +	else if (type)
> +		write_lock((rwlock_t *)lock);
> +	else
> +		read_lock((rwlock_t *)lock);
> +}
> +
> +static void __drop_lock(void *lock, int type)
> +{
> +	if (likely(type == COND_RESCHED_SPIN))
> +		spin_unlock((spinlock_t *)lock);
> +	else if (type)
> +		write_unlock((rwlock_t *)lock);
> +	else
> +		read_unlock((rwlock_t *)lock);
> +}
> +
> +static void __release_lock(void *lock, int type)
> +{
> +	if (likely(type == COND_RESCHED_SPIN))
> +		spin_release(&((spinlock_t *)lock)->dep_map, 1, _RET_IP_);
> +	else
> +		rwlock_release(&((rwlock_t *)lock)->dep_map, 1, _RET_IP_);
> +}
> +
> +/*
> + * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
>   * call schedule, and on return reacquire the lock.
>   *
> + * Lock type:
> + *  0 = rwlock held for read
> + *  1 = rwlock held for write
> + *  2 = COND_RESCHED_SPIN = spinlock
> + *
>   * This works OK both with and without CONFIG_PREEMPT.  We do strange low-level
>   * operations here to prevent schedule() from being called twice (once via
> - * spin_unlock(), once by hand).
> + * *_unlock(), once by hand).
>   */
> -int cond_resched_lock(spinlock_t *lock)
> +int __cond_resched_lock(void *lock, int type)
>  {
>  	int ret = 0;
>  
> -	if (need_lockbreak(lock)) {
> -		spin_unlock(lock);
> +	if (__need_lockbreak(lock, type)) {
> +		__drop_lock(lock, type);
>  		cpu_relax();
>  		ret = 1;
> -		spin_lock(lock);
> +		__reacquire_lock(lock, type);
>  	}
>  	if (need_resched() && system_state == SYSTEM_RUNNING) {
> -		spin_release(&lock->dep_map, 1, _THIS_IP_);
> -		_raw_spin_unlock(lock);
> +		__release_lock(lock, type);
> +		__drop_lock(lock, type);
>  		preempt_enable_no_resched();
>  		__cond_resched();
>  		ret = 1;
> -		spin_lock(lock);
> +		__reacquire_lock(lock, type);
>  	}
>  	return ret;
>  }
> -EXPORT_SYMBOL(cond_resched_lock);
> +EXPORT_SYMBOL(__cond_resched_lock);
>  

This whole block looks like it belongs in another patch. Also, is it really
worth having a single function that handles all lock types with loads of
branches instead of type-specific versions?
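
By "specific versions" I mean something along the lines of the following for
the read-side case -- an untested sketch only, mirroring the existing
cond_resched_lock() with the spinlock calls swapped for their rwlock
read-side equivalents (the name cond_resched_rdlock() is made up here):

/*
 * Sketch of a type-specific variant, as an alternative to routing
 * everything through the branch-heavy __cond_resched_lock().
 */
int cond_resched_rdlock(rwlock_t *lock)
{
	int ret = 0;

	if (need_lockbreak(lock)) {
		read_unlock(lock);
		cpu_relax();
		ret = 1;
		read_lock(lock);
	}
	if (need_resched() && system_state == SYSTEM_RUNNING) {
		rwlock_release(&lock->dep_map, 1, _THIS_IP_);
		_raw_read_unlock(lock);
		preempt_enable_no_resched();
		__cond_resched();
		ret = 1;
		read_lock(lock);
	}
	return ret;
}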

>  int __sched cond_resched_softirq(void)
>  {
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

Thread overview: 77+ messages
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
2007-09-14 20:54 ` [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock Lee Schermerhorn
2007-09-17 11:02   ` Mel Gorman
2007-09-18  2:41     ` KAMEZAWA Hiroyuki
2007-09-18 11:01       ` Mel Gorman
2007-09-18 14:57         ` Rik van Riel
2007-09-18 15:37       ` Lee Schermerhorn
2007-09-18 20:17     ` Lee Schermerhorn
2007-09-20 10:19       ` Mel Gorman
2007-09-14 20:54 ` [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock Lee Schermerhorn
2007-09-17 12:53   ` Mel Gorman [this message]
2007-09-20  1:24   ` Andrea Arcangeli
2007-09-20 14:10     ` Lee Schermerhorn
2007-09-20 14:16       ` Andrea Arcangeli
2007-09-14 20:54 ` [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c Lee Schermerhorn
2007-09-14 21:34   ` Peter Zijlstra
2007-09-15  1:55     ` Rik van Riel
2007-09-17 14:11     ` Lee Schermerhorn
2007-09-17  9:20   ` Balbir Singh
2007-09-17 19:19     ` Lee Schermerhorn
2007-09-14 20:54 ` [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function Lee Schermerhorn
2007-09-15  2:00   ` Rik van Riel
2007-09-17 13:19   ` Mel Gorman
2007-09-18  1:58   ` KAMEZAWA Hiroyuki
2007-09-18  2:27     ` Rik van Riel
2007-09-18  2:40       ` KAMEZAWA Hiroyuki
2007-09-18 15:04     ` Lee Schermerhorn
2007-09-18 19:41       ` Christoph Lameter
2007-09-19  0:30       ` KAMEZAWA Hiroyuki
2007-09-19 16:58         ` Lee Schermerhorn
2007-09-20  0:56           ` KAMEZAWA Hiroyuki
2007-09-14 20:54 ` [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables Lee Schermerhorn
2007-09-17 13:40   ` Mel Gorman
2007-09-17 14:17     ` Lee Schermerhorn
2007-09-17 14:39       ` Lee Schermerhorn
2007-09-17 18:58   ` Balbir Singh
2007-09-17 19:12     ` Lee Schermerhorn
2007-09-17 19:36       ` Balbir Singh
2007-09-17 19:36     ` Rik van Riel
2007-09-17 20:21       ` Balbir Singh
2007-09-17 21:01         ` Rik van Riel
2007-09-14 20:54 ` [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure" Lee Schermerhorn
2007-09-14 22:47   ` Christoph Lameter
2007-09-17 15:17     ` Lee Schermerhorn
2007-09-17 18:41       ` Christoph Lameter
2007-09-18  9:54         ` Mel Gorman
2007-09-18 19:45           ` Christoph Lameter
2007-09-19 11:11             ` Mel Gorman
2007-09-19 18:03               ` Christoph Lameter
2007-09-19  6:00   ` Balbir Singh
2007-09-19 14:47     ` Lee Schermerhorn
2007-09-14 20:54 ` [PATCH/RFC 7/14] Reclaim Scalability: Non-reclaimable page statistics Lee Schermerhorn
2007-09-17  1:56   ` Rik van Riel
2007-09-14 20:54 ` [PATCH/RFC 8/14] Reclaim Scalability: Ram Disk Pages are non-reclaimable Lee Schermerhorn
2007-09-17  1:57   ` Rik van Riel
2007-09-17 14:40     ` Lee Schermerhorn
2007-09-17 18:42       ` Christoph Lameter
2007-09-14 20:54 ` [PATCH/RFC 9/14] Reclaim Scalability: SHM_LOCKED pages are nonreclaimable Lee Schermerhorn
2007-09-17  2:18   ` Rik van Riel
2007-09-14 20:55 ` [PATCH/RFC 10/14] Reclaim Scalability: track anon_vma "related vmas" Lee Schermerhorn
2007-09-17  2:52   ` Rik van Riel
2007-09-17 15:52     ` Lee Schermerhorn
2007-09-14 20:55 ` [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available Lee Schermerhorn
2007-09-17  2:53   ` Rik van Riel
2007-09-18 17:46     ` Lee Schermerhorn
2007-09-18 20:01       ` Rik van Riel
2007-09-19 14:55         ` Lee Schermerhorn
2007-09-18  2:59   ` KAMEZAWA Hiroyuki
2007-09-18 15:47     ` Lee Schermerhorn
2007-09-14 20:55 ` [PATCH/RFC 12/14] Reclaim Scalability: Non-reclaimable Mlock'ed pages Lee Schermerhorn
2007-09-14 20:55 ` [PATCH/RFC 13/14] Reclaim Scalability: Handle Mlock'ed pages during map/unmap and truncate Lee Schermerhorn
2007-09-14 20:55 ` [PATCH/RFC 14/14] Reclaim Scalability: cull non-reclaimable anon pages in fault path Lee Schermerhorn
2007-09-14 21:11 ` [PATCH/RFC 0/14] Page Reclaim Scalability Peter Zijlstra
2007-09-14 21:42   ` Linus Torvalds
2007-09-14 22:02     ` Peter Zijlstra
2007-09-15  0:07       ` Linus Torvalds
2007-09-17  6:44 ` Balbir Singh
