From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id D10E56B0044 for ; Fri, 18 Dec 2009 15:03:58 -0500 (EST) Date: Fri, 18 Dec 2009 20:03:46 +0000 From: Mel Gorman Subject: Re: [PATCH 25 of 28] transparent hugepage core Message-ID: <20091218200345.GH21194@csn.ul.ie> References: <4d96699c8fb89a4a22eb.1261076428@v2.random> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <4d96699c8fb89a4a22eb.1261076428@v2.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Andi Kleen , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , Andrew Morton List-ID: On Thu, Dec 17, 2009 at 07:00:28PM -0000, Andrea Arcangeli wrote: > From: Andrea Arcangeli > > Lately I've been working to make KVM use hugepages transparently > without the usual restrictions of hugetlbfs. Some of the restrictions > I'd like to see removed: > > 1) hugepages have to be swappable or the guest physical memory remains > locked in RAM and can't be paged out to swap > > 2) if a hugepage allocation fails, regular pages should be allocated > instead and mixed in the same vma without any failure and without > userland noticing > > 3) if some task quits and more hugepages become available in the > buddy, guest physical memory backed by regular pages should be > relocated on hugepages automatically in regions under > madvise(MADV_HUGEPAGE) (ideally event driven by waking up the > kernel deamon if the order=HPAGE_SHIFT-PAGE_SHIFT list becomes not > null) > > 4) avoidance of reservation and maximization of use of hugepages whenever > possible. Reservation (needed to avoid runtime fatal faliures) may be ok for > 1 machine with 1 database with 1 database cache with 1 database cache size > known at boot time. It's definitely not feasible with a virtualization > hypervisor usage like RHEV-H that runs an unknown number of virtual machines > with an unknown size of each virtual machine with an unknown amount of > pagecache that could be potentially useful in the host for guest not using > O_DIRECT (aka cache=off). > > hugepages in the virtualization hypervisor (and also in the guest!) are > much more important than in a regular host not using virtualization, becasue > with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in > case only the hypervisor uses transparent hugepages, and they decrease the > tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and > the linux guest both uses this patch (though the guest will limit the addition > speedup to anonymous regions only for now...). Even more important is that the > tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow > paging or no-virtualization scenario. So maximizing the amount of virtual > memory cached by the TLB pays off significantly more with NPT/EPT than without > (even if there would be no significant speedup in the tlb-miss runtime). > > The first (and more tedious) part of this work requires allowing the VM to > handle anonymous hugepages mixed with regular pages transparently on regular > anonymous vmas. This is what this patch tries to achieve in the least intrusive > possible way. 
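To make the madvise(MADV_HUGEPAGE) interface above concrete, here is a minimal userland sketch (illustration only, not part of the patch) of how an application would opt a region in once the series is applied. The numeric MADV_HUGEPAGE fallback and the 2M hugepage size are assumptions for x86-64; the real definition comes from the patched headers.

============
/* sketch: hint an anonymous region for transparent hugepages */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14    /* assumed value; use the patched <asm/mman.h> */
#endif

#define LEN (512UL*1024*1024)
#define HPAGE_2M (2UL*1024*1024)

int main(void)
{
    void *p;

    /* 2M-align so whole pmds fall inside the mapping */
    if (posix_memalign(&p, HPAGE_2M, LEN))
        return 1;
    /* fails with EINVAL on a kernel without this series */
    if (madvise(p, LEN, MADV_HUGEPAGE))
        perror("madvise");
    memset(p, 0, LEN);    /* these faults can now use huge pmds */
    free(p);
    return 0;
}
============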
We want hugepages and hugetlb to be used in a way so that all > applications can benefit without changes (as usual we leverage the KVM > virtualization design: by improving the Linux VM at large, KVM gets the > performance boost too). > > The most important design choice is: always fallback to 4k allocation > if the hugepage allocation fails! This is the _very_ opposite of some > large pagecache patches that failed with -EIO back then if a 64k (or > similar) allocation failed... > > Second important decision (to reduce the impact of the feature on the > existing pagetable handling code) is that at any time we can split an > hugepage into 512 regular pages and it has to be done with an > operation that can't fail. This way the reliability of the swapping > isn't decreased (no need to allocate memory when we are short on > memory to swap) and it's trivial to plug a split_huge_page* one-liner > where needed without polluting the VM. Over time we can teach > mprotect, mremap and friends to handle pmd_trans_huge natively without > calling split_huge_page*. The fact it can't fail isn't just for swap: > if split_huge_page would return -ENOMEM (instead of the current void) > we'd need to rollback the mprotect from the middle of it (ideally > including undoing the split_vma) which would be a big change and in > the very wrong direction (it'd likely be simpler not to call > split_huge_page at all and to teach mprotect and friends to handle > hugepages instead of rolling them back from the middle). In short the > very value of split_huge_page is that it can't fail. > > The collapsing and madvise(MADV_HUGEPAGE) part will remain separated > and incremental and it'll just be an "harmless" addition later if this > initial part is agreed upon. It also should be noted that locking-wise > replacing regular pages with hugepages is going to be very easy if > compared to what I'm doing below in split_huge_page, as it will only > happen when page_count(page) matches page_mapcount(page) if we can > take the PG_lock and mmap_sem in write mode. collapse_huge_page will > be a "best effort" that (unlike split_huge_page) can fail at the > minimal sign of trouble and we can try again later. collapse_huge_page > will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will > work similar to madvise(MADV_MERGEABLE). > > The default I like is that transparent hugepages are used at page fault time. > This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The > control knob can be set to three values "always", "madvise", "never" which > mean respectively that hugepages are always used, or only inside > madvise(MADV_HUGEPAGE) regions, or never used. > /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage > allocation should defrag memory aggressively "always", only inside "madvise" > regions, or "never". > > The pmd_trans_splitting/pmd_trans_huge locking is very solid. The > put_page (from get_user_page users that can't use mmu notifier like > O_DIRECT) that runs against a __split_huge_page_refcount instead was a > pain to serialize in a way that would result always in a coherent page > count for both tail and head. I think my locking solution with a > compound_lock taken only after the page_first is valid and is still a > PageHead should be safe but it surely needs review from SMP race point > of view. 
In short there is no current existing way to serialize the > O_DIRECT final put_page against split_huge_page_refcount so I had to > invent a new one (O_DIRECT loses knowledge on the mapping status by > the time gup_fast returns so...). And I didn't want to impact all > gup/gup_fast users for now, maybe if we change the gup interface > substantially we can avoid this locking, I admit I didn't think too > much about it because changing the gup unpinning interface would be > invasive. > > If we ignored O_DIRECT we could stick to the existing compound > refcounting code, by simply adding a > get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu > notifier user) would call it without FOLL_GET (and if FOLL_GET isn't > set we'd just BUG_ON if nobody registered itself in the current task > mmu notifier list yet). But O_DIRECT is fundamental for decent > performance of virtualized I/O on fast storage so we can't avoid it to > solve the race of put_page against split_huge_page_refcount to achieve > a complete hugepage feature for KVM. > > Swap and oom works fine (well just like with regular pages ;). MMU > notifier is handled transparently too, with the exception of the young > bit on the pmd, that didn't have a range check but I think KVM will be > fine because the whole point of hugepages is that EPT/NPT will also > use a huge pmd when they notice gup returns pages with PageCompound set, > so they won't care of a range and there's just the pmd young bit to > check in that case. > > NOTE: in some cases if the L2 cache is small, this may slowdown and > waste memory during COWs because 4M of memory are accessed in a single > fault instead of 8k (the payoff is that after COW the program can run > faster). So we might want to switch the copy_huge_page (and > clear_huge_page too) to not temporal stores. I also extensively > researched ways to avoid this cache trashing with a full prefault > logic that would cow in 8k/16k/32k/64k up to 1M (I can send those > patches that fully implemented prefault) but I concluded they're not > worth it and they add an huge additional complexity and they remove all tlb > benefits until the full hugepage has been faulted in, to save a little bit of > memory and some cache during app startup, but they still don't improve > substantially the cache-trashing during startup if the prefault happens in >4k > chunks. One reason is that those 4k pte entries copied are still mapped on a > perfectly cache-colored hugepage, so the trashing is the worst one can generate > in those copies (cow of 4k page copies aren't so well colored so they trashes > less, but again this results in software running faster after the page fault). > Those prefault patches allowed things like a pte where post-cow pages were > local 4k regular anon pages and the not-yet-cowed pte entries were pointing in > the middle of some hugepage mapped read-only. If it doesn't payoff > substantially with todays hardware it will payoff even less in the future with > larger l2 caches, and the prefault logic would blot the VM a lot. If one is > emebdded transparent_hugepage can be disabled during boot with sysfs or with > the boot commandline parameter transparent_hugepage=0 (or > transparent_hugepage=2 to restrict hugepages inside madvise regions) that will > ensure not a single hugepage is allocated at boot time. 
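For completeness, a small sketch (again an illustration of the knobs described above, not code from the patch) that reads the current policy and switches both files to "madvise"; it assumes the sysfs layout this patch creates under /sys/kernel/mm/transparent_hugepage and needs root to write.

============
/* sketch: show the current policy and restrict THP to madvise regions */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void set_policy(const char *file, const char *mode)
{
    char path[128], cur[64];
    ssize_t n;
    int fd;

    snprintf(path, sizeof(path),
             "/sys/kernel/mm/transparent_hugepage/%s", file);
    fd = open(path, O_RDWR);
    if (fd < 0) {
        perror(path);    /* kernel without this series */
        return;
    }
    n = read(fd, cur, sizeof(cur) - 1);
    if (n > 0) {
        cur[n] = 0;
        printf("%s: %s", file, cur);    /* [brackets] mark the active value */
    }
    lseek(fd, 0, SEEK_SET);
    if (write(fd, mode, strlen(mode)) < 0)
        perror(path);
    close(fd);
}

int main(void)
{
    /* hugepages (and the __GFP_REPEAT "defrag" mode) only for
     * madvise(MADV_HUGEPAGE) regions */
    set_policy("enabled", "madvise");
    set_policy("defrag", "madvise");
    return 0;
}
============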
It is simple enough to > just disable transparent hugepage globally and let transparent hugepages be > allocated selectively by applications in the MADV_HUGEPAGE region (both at page > fault time, and if enabled with the collapse_huge_page too through the kernel > daemon). > > This patch supports only hugepages mapped in the pmd, archs that have > smaller hugepages will not fit in this patch alone. Also some archs like power > have certain tlb limits that prevents mixing different page size in the same > regions so they will not fit in this framework that requires "graceful > fallback" to basic PAGE_SIZE in case of physical memory fragmentation. > hugetlbfs remains a perfect fit for those because its software limits happen to > match the hardware limits. hugetlbfs also remains a perfect fit for hugepage > sizes like 1GByte that cannot be hoped to be found not fragmented after a > certain system uptime and that would be very expensive to defragment with > relocation, so requiring reservation. hugetlbfs is the "reservation way", the > point of transparent hugepages is not to have any reservation at all and > maximizing the use of cache and hugepages at all times automatically. > > Some performance result: > > vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep > ages3 > memset page fault 1566023 > memset tlb miss 453854 > memset second tlb miss 453321 > random access tlb miss 41635 > random access second tlb miss 41658 > vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 > memset page fault 1566471 > memset tlb miss 453375 > memset second tlb miss 453320 > random access tlb miss 41636 > random access second tlb miss 41637 > vmx andrea # ./largepages3 > memset page fault 1566642 > memset tlb miss 453417 > memset second tlb miss 453313 > random access tlb miss 41630 > random access second tlb miss 41647 > vmx andrea # ./largepages3 > memset page fault 1566872 > memset tlb miss 453418 > memset second tlb miss 453315 > random access tlb miss 41618 > random access second tlb miss 41659 > vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage > vmx andrea # ./largepages3 > memset page fault 2182476 > memset tlb miss 460305 > memset second tlb miss 460179 > random access tlb miss 44483 > random access second tlb miss 44186 > vmx andrea # ./largepages3 > memset page fault 2182791 > memset tlb miss 460742 > memset second tlb miss 459962 > random access tlb miss 43981 > random access second tlb miss 43988 > > ============ > #include > #include > #include > #include > > #define SIZE (3UL*1024*1024*1024) > > int main() > { > char *p = malloc(SIZE), *p2; > struct timeval before, after; > > gettimeofday(&before, NULL); > memset(p, 0, SIZE); > gettimeofday(&after, NULL); > printf("memset page fault %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > after.tv_usec-before.tv_usec); > > gettimeofday(&before, NULL); > memset(p, 0, SIZE); > gettimeofday(&after, NULL); > printf("memset tlb miss %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > after.tv_usec-before.tv_usec); > > gettimeofday(&before, NULL); > memset(p, 0, SIZE); > gettimeofday(&after, NULL); > printf("memset second tlb miss %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > after.tv_usec-before.tv_usec); > > gettimeofday(&before, NULL); > for (p2 = p; p2 < p+SIZE; p2 += 4096) > *p2 = 0; > gettimeofday(&after, NULL); > printf("random access tlb miss %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > 
after.tv_usec-before.tv_usec); > > gettimeofday(&before, NULL); > for (p2 = p; p2 < p+SIZE; p2 += 4096) > *p2 = 0; > gettimeofday(&after, NULL); > printf("random access second tlb miss %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > after.tv_usec-before.tv_usec); > > return 0; > } > ============ > > Signed-off-by: Andrea Arcangeli > --- > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > new file mode 100644 > --- /dev/null > +++ b/include/linux/huge_mm.h > @@ -0,0 +1,110 @@ > +#ifndef _LINUX_HUGE_MM_H > +#define _LINUX_HUGE_MM_H > + > +extern int do_huge_anonymous_page(struct mm_struct *mm, > + struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, > + unsigned int flags); > +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, > + struct vm_area_struct *vma); > +extern int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, > + pmd_t orig_pmd); > +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm); The naming of "huge" might bite in the ass later if/when transparent support is applied to multiple page sizes. Granted, it's not happening any time soon. > +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm, > + unsigned long addr, > + pmd_t *pmd, > + unsigned int flags); > +extern int zap_pmd_trans_huge(struct mmu_gather *tlb, > + struct vm_area_struct *vma, > + pmd_t *pmd); > + > +enum transparent_hugepage_flag { > + TRANSPARENT_HUGEPAGE_FLAG, > + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, > + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, Defrag is misleading. Glancing through the rest of the patch, "try harder" would be a more appropriate term because it uses __GFP_REPEAT. > + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, > +#ifdef CONFIG_DEBUG_VM > + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG, > +#endif > +}; > + > +#define transparent_hugepage_enabled(__vma) \ > + (transparent_hugepage_flags & (1< + (transparent_hugepage_flags & \ > + (1< + (__vma)->vm_flags & VM_HUGEPAGE)) > +#define transparent_hugepage_defrag(__vma) \ > + (transparent_hugepage_flags & \ > + (1< + (transparent_hugepage_flags & \ > + (1< + (__vma)->vm_flags & VM_HUGEPAGE)) > +#ifdef CONFIG_DEBUG_VM > +#define transparent_hugepage_debug_cow() \ > + (transparent_hugepage_flags & \ > + (1< +#else /* CONFIG_DEBUG_VM */ > +#define transparent_hugepage_debug_cow() 0 > +#endif /* CONFIG_DEBUG_VM */ > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +extern unsigned long transparent_hugepage_flags; > +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, > + pmd_t *dst_pmd, pmd_t *src_pmd, > + struct vm_area_struct *vma, > + unsigned long addr, unsigned long end); > +extern int handle_pte_fault(struct mm_struct *mm, > + struct vm_area_struct *vma, unsigned long address, > + pte_t *pte, pmd_t *pmd, unsigned int flags); > +extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address, > + pmd_t *pmd); > +extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd); > +extern int split_huge_page(struct page *page); > +#define split_huge_page_mm(__mm, __addr, __pmd) \ > + do { \ > + if (unlikely(pmd_trans_huge(*(__pmd)))) \ > + __split_huge_page_mm(__mm, __addr, __pmd); \ > + } while (0) > +#define split_huge_page_vma(__vma, __pmd) \ > + do { \ > + if (unlikely(pmd_trans_huge(*(__pmd)))) \ > + __split_huge_page_vma(__vma, __pmd); \ > + } while (0) > +#define wait_split_huge_page(__anon_vma, __pmd) \ > + do { \ > + smp_mb(); \ > + 
spin_unlock_wait(&(__anon_vma)->lock); \ > + smp_mb(); \ > + VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \ > + pmd_trans_huge(*(__pmd))); \ > + } while (0) > +#define HPAGE_ORDER (HPAGE_SHIFT-PAGE_SHIFT) > +#define HPAGE_NR (1< + > +enum page_check_address_pmd_flag { > + PAGE_CHECK_ADDRESS_PMD_FLAG, > + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, > + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, > +}; > +extern pmd_t *page_check_address_pmd(struct page *page, > + struct mm_struct *mm, > + unsigned long address, > + enum page_check_address_pmd_flag flag); > +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma); > +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ > +#define transparent_hugepage_flags 0UL > +static inline int split_huge_page(struct page *page) > +{ > + return 0; > +} > +#define split_huge_page_mm(__mm, __addr, __pmd) \ > + do { } while (0) > +#define split_huge_page_vma(__vma, __pmd) \ > + do { } while (0) > +#define wait_split_huge_page(__anon_vma, __pmd) \ > + do { } while (0) > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + > +#endif /* _LINUX_HUGE_MM_H */ > diff --git a/include/linux/mm.h b/include/linux/mm.h > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -234,6 +234,7 @@ struct inode; > * files which need it (119 of them) > */ > #include > +#include > > /* > * Methods to modify the page usage count. > diff --git a/mm/Makefile b/mm/Makefile > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f > obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o > obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o > obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o > +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > new file mode 100644 > --- /dev/null > +++ b/mm/huge_memory.c Similar concern on naming. Later someone will get confused as to why there is hugetlbfs and huge_memory. > @@ -0,0 +1,792 @@ > +/* > + * Copyright (C) 2009 Red Hat, Inc. > + * > + * This work is licensed under the terms of the GNU GPL, version 2. See > + * the COPYING file in the top-level directory.
> + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "internal.h" > + > +unsigned long transparent_hugepage_flags __read_mostly = > + (1< + > +#ifdef CONFIG_SYSFS > +static ssize_t double_flag_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf, > + enum transparent_hugepage_flag enabled, > + enum transparent_hugepage_flag req_madv) > +{ > + if (test_bit(enabled, &transparent_hugepage_flags)) { > + VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags)); > + return sprintf(buf, "[always] madvise never\n"); > + } else if (test_bit(req_madv, &transparent_hugepage_flags)) > + return sprintf(buf, "always [madvise] never\n"); > + else > + return sprintf(buf, "always madvise [never]\n"); > +} > +static ssize_t double_flag_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count, > + enum transparent_hugepage_flag enabled, > + enum transparent_hugepage_flag req_madv) > +{ > + if (!memcmp("always", buf, > + min(sizeof("always")-1, count))) { > + set_bit(enabled, &transparent_hugepage_flags); > + clear_bit(req_madv, &transparent_hugepage_flags); > + } else if (!memcmp("madvise", buf, > + min(sizeof("madvise")-1, count))) { > + clear_bit(enabled, &transparent_hugepage_flags); > + set_bit(req_madv, &transparent_hugepage_flags); > + } else if (!memcmp("never", buf, > + min(sizeof("never")-1, count))) { > + clear_bit(enabled, &transparent_hugepage_flags); > + clear_bit(req_madv, &transparent_hugepage_flags); > + } else > + return -EINVAL; > + > + return count; > +} > + > +static ssize_t enabled_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return double_flag_show(kobj, attr, buf, > + TRANSPARENT_HUGEPAGE_FLAG, > + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG); > +} > +static ssize_t enabled_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + return double_flag_store(kobj, attr, buf, count, > + TRANSPARENT_HUGEPAGE_FLAG, > + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG); > +} > +static struct kobj_attribute enabled_attr = > + __ATTR(enabled, 0644, enabled_show, enabled_store); > + > +static ssize_t defrag_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return double_flag_show(kobj, attr, buf, > + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, > + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG); > +} > +static ssize_t defrag_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + return double_flag_store(kobj, attr, buf, count, > + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, > + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG); > +} > +static struct kobj_attribute defrag_attr = > + __ATTR(defrag, 0644, defrag_show, defrag_store); > + > +static ssize_t single_flag_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf, > + enum transparent_hugepage_flag flag) > +{ > + if (test_bit(flag, &transparent_hugepage_flags)) > + return sprintf(buf, "[yes] no\n"); > + else > + return sprintf(buf, "yes [no]\n"); > +} > +static ssize_t single_flag_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count, > + enum transparent_hugepage_flag flag) > +{ > + if (!memcmp("yes", buf, > + min(sizeof("yes")-1, count))) { > + set_bit(flag, &transparent_hugepage_flags); > + } else if (!memcmp("no", buf, > + min(sizeof("no")-1, count))) { > + clear_bit(flag, &transparent_hugepage_flags); > + } else > + return -EINVAL; > + > + 
return count; > +} > + > +#ifdef CONFIG_DEBUG_VM > +static ssize_t debug_cow_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return single_flag_show(kobj, attr, buf, > + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG); > +} > +static ssize_t debug_cow_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + return single_flag_store(kobj, attr, buf, count, > + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG); > +} > +static struct kobj_attribute debug_cow_attr = > + __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store); > +#endif /* CONFIG_DEBUG_VM */ > + > +static struct attribute *hugepage_attr[] = { > + &enabled_attr.attr, > + &defrag_attr.attr, > +#ifdef CONFIG_DEBUG_VM > + &debug_cow_attr.attr, > +#endif > + NULL, > +}; > + > +static struct attribute_group hugepage_attr_group = { > + .attrs = hugepage_attr, > + .name = "transparent_hugepage", > +}; > +#endif /* CONFIG_SYSFS */ > + > +static int __init ksm_init(void) > +{ > +#ifdef CONFIG_SYSFS > + int err; > + > + err = sysfs_create_group(mm_kobj, &hugepage_attr_group); > + if (err) > + printk(KERN_ERR "hugepage: register sysfs failed\n"); > +#endif > + return 0; > +} > +module_init(ksm_init) > + > +static int __init setup_transparent_hugepage(char *str) > +{ > + if (!str) > + return 0; > + transparent_hugepage_flags = simple_strtoul(str, &str, 0); > + return 1; > +} > +__setup("transparent_hugepage=", setup_transparent_hugepage); > + > + > +static void prepare_pmd_huge_pte(pgtable_t pgtable, > + struct mm_struct *mm) > +{ > + VM_BUG_ON(spin_can_lock(&mm->page_table_lock)); > + > + /* FIFO */ > + if (!mm->pmd_huge_pte) > + INIT_LIST_HEAD(&pgtable->lru); > + else > + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru); > + mm->pmd_huge_pte = pgtable; > +} > + > +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) > +{ > + if (likely(vma->vm_flags & VM_WRITE)) > + pmd = pmd_mkwrite(pmd); > + return pmd; > +} > + > +static int __do_huge_anonymous_page(struct mm_struct *mm, > + struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, > + struct page *page, Maybe this should be do_pmd_anonymous page and match what do_anonymous_page does as much as possible. This might offset any future problems related to transparently handling pages at other page table levels. > + unsigned long haddr) > +{ > + int ret = 0; > + pgtable_t pgtable; > + > + VM_BUG_ON(!PageCompound(page)); > + pgtable = pte_alloc_one(mm, address); > + if (unlikely(!pgtable)) { > + put_page(page); > + return VM_FAULT_OOM; > + } > + > + clear_huge_page(page, haddr, HPAGE_NR); > + Ideally insead of defining things like HPAGE_NR, the existing functions for multiple huge page sizes would be extended to return the "huge page size corresponding to a PMD". > + __SetPageUptodate(page); > + smp_wmb(); > + Need to explain why smp_wmb() is needed there. It doesn't look like you're protecting the bit set itself. More likely you are making sure the writes in clear_huge_page() have finished but that's a guess. Comment. > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_none(*pmd))) { > + put_page(page); > + pte_free(mm, pgtable); Racing fault already filled in the PTE? If so, comment please. Again, matching how do_anonymous_page() does a similar job would help comprehension. 
> + } else { > + pmd_t entry; > + entry = mk_pmd(page, vma->vm_page_prot); > + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > + entry = pmd_mkhuge(entry); > + page_add_new_anon_rmap(page, vma, haddr); > + set_pmd_at(mm, haddr, pmd, entry); > + prepare_pmd_huge_pte(pgtable, mm); > + } > + spin_unlock(&mm->page_table_lock); > + > + return ret; > +} > + > +int do_huge_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, > + unsigned int flags) > +{ > + struct page *page; > + unsigned long haddr = address & HPAGE_MASK; > + pte_t *pte; > + > + if (haddr >= vma->vm_start && haddr + HPAGE_SIZE <= vma->vm_end) { > + if (unlikely(anon_vma_prepare(vma))) > + return VM_FAULT_OOM; > + page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP| > + (transparent_hugepage_defrag(vma) ? GFP_HIGHUSER_MOVABLE should only be used if hugepages_treat_as_movable is set in /proc/sys/vm. This should be GFP_HIGHUSER only. > + __GFP_REPEAT : 0)|__GFP_NOWARN, > + HPAGE_ORDER); > + if (unlikely(!page)) > + goto out; > + > + return __do_huge_anonymous_page(mm, vma, > + address, pmd, > + page, haddr); > + } > +out: > + pte = pte_alloc_map(mm, vma, pmd, address); > + if (!pte) > + return VM_FAULT_OOM; > + return handle_pte_fault(mm, vma, address, pte, pmd, flags); > +} > + > +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, > + struct vm_area_struct *vma) > +{ > + struct page *src_page; > + pmd_t pmd; > + pgtable_t pgtable; > + int ret; > + > + ret = -ENOMEM; > + pgtable = pte_alloc_one(dst_mm, addr); > + if (unlikely(!pgtable)) > + goto out; > + > + spin_lock(&dst_mm->page_table_lock); > + spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING); > + > + ret = -EAGAIN; > + pmd = *src_pmd; > + if (unlikely(!pmd_trans_huge(pmd))) > + goto out_unlock; > + if (unlikely(pmd_trans_splitting(pmd))) { > + /* split huge page running from under us */ > + spin_unlock(&src_mm->page_table_lock); > + spin_unlock(&dst_mm->page_table_lock); > + > + wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */ > + goto out; > + } > + src_page = pmd_pgtable(pmd); > + VM_BUG_ON(!PageHead(src_page)); > + get_page(src_page); > + page_dup_rmap(src_page); > + add_mm_counter(dst_mm, anon_rss, HPAGE_NR); > + > + pmdp_set_wrprotect(src_mm, addr, src_pmd); > + pmd = pmd_mkold(pmd_wrprotect(pmd)); > + set_pmd_at(dst_mm, addr, dst_pmd, pmd); > + prepare_pmd_huge_pte(pgtable, dst_mm); > + > + ret = 0; > +out_unlock: > + spin_unlock(&src_mm->page_table_lock); > + spin_unlock(&dst_mm->page_table_lock); > +out: > + return ret; > +} > + > +/* no "address" argument so destroys page coloring of some arch */ > +pgtable_t get_pmd_huge_pte(struct mm_struct *mm) > +{ > + pgtable_t pgtable; > + > + VM_BUG_ON(spin_can_lock(&mm->page_table_lock)); > + > + /* FIFO */ > + pgtable = mm->pmd_huge_pte; > + if (list_empty(&pgtable->lru)) > + mm->pmd_huge_pte = NULL; /* debug */ > + else { > + mm->pmd_huge_pte = list_entry(pgtable->lru.next, > + struct page, lru); > + list_del(&pgtable->lru); > + } > + return pgtable; > +} > + > +int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, pmd_t orig_pmd) > +{ > + int ret = 0, i; > + struct page *page, *new_page; > + unsigned long haddr; > + struct page **pages; > + > + VM_BUG_ON(!vma->anon_vma); > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > + goto out_unlock; > + > + page = pmd_pgtable(orig_pmd); > + 
VM_BUG_ON(!PageCompound(page) || !PageHead(page)); > + haddr = address & HPAGE_MASK; > + if (page_mapcount(page) == 1) { > + pmd_t entry; > + entry = pmd_mkyoung(orig_pmd); > + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > + if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1)) > + update_mmu_cache(vma, address, entry); > + ret |= VM_FAULT_WRITE; > + goto out_unlock; > + } > + spin_unlock(&mm->page_table_lock); > + > + new_page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP| > + (transparent_hugepage_defrag(vma) ? > + __GFP_REPEAT : 0)|__GFP_NOWARN, > + HPAGE_ORDER); > + if (transparent_hugepage_debug_cow() && new_page) { > + put_page(new_page); > + new_page = NULL; > + } > + if (unlikely(!new_page)) { This entire block needs to be in a demote_pmd_page() or something similar. It's on the hefty side for being in the main function. That said, I didn't spot anything wrong in there either. > + pgtable_t pgtable; > + pmd_t _pmd; > + > + pages = kzalloc(sizeof(struct page *) * HPAGE_NR, > + GFP_KERNEL); > + if (unlikely(!pages)) { > + ret |= VM_FAULT_OOM; > + goto out; > + } > + > + for (i = 0; i < HPAGE_NR; i++) { > + pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE, > + vma, address); > + if (unlikely(!pages[i])) { > + while (--i >= 0) > + put_page(pages[i]); > + kfree(pages); > + ret |= VM_FAULT_OOM; > + goto out; > + } > + } > + > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > + goto out_free_pages; > + else > + get_page(page); > + spin_unlock(&mm->page_table_lock); > + > + might_sleep(); Is this check really necessary? We could already go to sleep more easily when allocating pages. > + for (i = 0; i < HPAGE_NR; i++) { > + copy_user_highpage(pages[i], page + i, More nasty naming there. It needs to be clearer that pages holds the demoted base pages and page is the existing compound page.
> + haddr + PAGE_SHIFT*i, vma); > + __SetPageUptodate(pages[i]); > + cond_resched(); > + } > + > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > + goto out_free_pages; > + else > + put_page(page); > + > + pmdp_clear_flush_notify(vma, haddr, pmd); > + /* leave pmd empty until pte is filled */ > + > + pgtable = get_pmd_huge_pte(mm); > + pmd_populate(mm, &_pmd, pgtable); > + > + for (i = 0; i < HPAGE_NR; > + i++, haddr += PAGE_SIZE) { > + pte_t *pte, entry; > + entry = mk_pte(pages[i], vma->vm_page_prot); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > + page_add_new_anon_rmap(pages[i], vma, haddr); > + pte = pte_offset_map(&_pmd, haddr); > + VM_BUG_ON(!pte_none(*pte)); > + set_pte_at(mm, haddr, pte, entry); > + pte_unmap(pte); > + } > + kfree(pages); > + > + mm->nr_ptes++; > + smp_wmb(); /* make pte visible before pmd */ > + pmd_populate(mm, pmd, pgtable); > + spin_unlock(&mm->page_table_lock); > + > + ret |= VM_FAULT_WRITE; > + page_remove_rmap(page); > + put_page(page); > + goto out; > + } > + > + copy_huge_page(new_page, page, haddr, vma, HPAGE_NR); > + __SetPageUptodate(new_page); > + > + smp_wmb(); > + > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > + put_page(new_page); > + else { > + pmd_t entry; > + entry = mk_pmd(new_page, vma->vm_page_prot); > + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > + entry = pmd_mkhuge(entry); > + pmdp_clear_flush_notify(vma, haddr, pmd); > + page_add_new_anon_rmap(new_page, vma, haddr); > + set_pmd_at(mm, haddr, pmd, entry); > + update_mmu_cache(vma, address, entry); > + page_remove_rmap(page); > + put_page(page); > + ret |= VM_FAULT_WRITE; > + } > +out_unlock: > + spin_unlock(&mm->page_table_lock); > +out: > + return ret; > + > +out_free_pages: > + for (i = 0; i < HPAGE_NR; i++) > + put_page(pages[i]); > + kfree(pages); > + goto out_unlock; > +} > + > +struct page *follow_trans_huge_pmd(struct mm_struct *mm, > + unsigned long addr, > + pmd_t *pmd, > + unsigned int flags) > +{ > + struct page *page = NULL; > + > + VM_BUG_ON(spin_can_lock(&mm->page_table_lock)); > + > + if (flags & FOLL_WRITE && !pmd_write(*pmd)) > + goto out; > + > + page = pmd_pgtable(*pmd); > + VM_BUG_ON(!PageHead(page)); > + if (flags & FOLL_TOUCH) { > + pmd_t _pmd; > + /* > + * We should set the dirty bit only for FOLL_WRITE but > + * for now the dirty bit in the pmd is meaningless. > + * And if the dirty bit will become meaningful and > + * we'll only set it with FOLL_WRITE, an atomic > + * set_bit will be required on the pmd to set the > + * young bit, instead of the current set_pmd_at. > + */ > + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd)); > + set_pmd_at(mm, addr & HPAGE_MASK, pmd, _pmd); > + } > + page += (addr & ~HPAGE_MASK) >> PAGE_SHIFT; More HPAGE vs PMD here. 
> + VM_BUG_ON(!PageCompound(page)); > + if (flags & FOLL_GET) > + get_page(page); > + > +out: > + return page; > +} > + > +int zap_pmd_trans_huge(struct mmu_gather *tlb, struct vm_area_struct *vma, > + pmd_t *pmd) > +{ > + int ret = 0; > + > + spin_lock(&tlb->mm->page_table_lock); > + if (likely(pmd_trans_huge(*pmd))) { > + if (unlikely(pmd_trans_splitting(*pmd))) { > + spin_unlock(&tlb->mm->page_table_lock); > + wait_split_huge_page(vma->anon_vma, > + pmd); > + } else { > + struct page *page; > + pgtable_t pgtable; > + pgtable = get_pmd_huge_pte(tlb->mm); > + page = pfn_to_page(pmd_pfn(*pmd)); > + VM_BUG_ON(!PageCompound(page)); > + pmd_clear(pmd); > + page_remove_rmap(page); > + VM_BUG_ON(page_mapcount(page) < 0); > + spin_unlock(&tlb->mm->page_table_lock); > + add_mm_counter(tlb->mm, anon_rss, -HPAGE_NR); > + tlb_remove_page(tlb, page); > + pte_free(tlb->mm, pgtable); > + ret = 1; > + } > + } else > + spin_unlock(&tlb->mm->page_table_lock); > + > + return ret; > +} > + > +pmd_t *page_check_address_pmd(struct page *page, > + struct mm_struct *mm, > + unsigned long address, > + enum page_check_address_pmd_flag flag) > +{ > + pgd_t *pgd; > + pud_t *pud; > + pmd_t *pmd, *ret = NULL; > + > + pgd = pgd_offset(mm, address); > + if (!pgd_present(*pgd)) > + goto out; > + > + pud = pud_offset(pgd, address); > + if (!pud_present(*pud)) > + goto out; > + > + pmd = pmd_offset(pud, address); > + if (pmd_none(*pmd)) > + goto out; > + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG && > + pmd_trans_splitting(*pmd)); > + if (pmd_trans_huge(*pmd) && pmd_pgtable(*pmd) == page) { > + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG && > + !pmd_trans_splitting(*pmd)); > + ret = pmd; > + } > +out: > + return ret; > +} > + > +static int __split_huge_page_splitting(struct page *page, > + struct vm_area_struct *vma, > + unsigned long address) > +{ > + struct mm_struct *mm = vma->vm_mm; > + pmd_t *pmd; > + int ret = 0; > + > + spin_lock(&mm->page_table_lock); > + pmd = page_check_address_pmd(page, mm, address, > + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG); > + if (pmd) { > + /* > + * We can't temporarily set the pmd to null in order > + * to split it, pmd_huge must remain on at all times. > + */ Why, to avoid a double fault? Or to avoid a case where the huge page is being split, another fault occurs and zero-filled pages get faulted in? I'm afraid I ran out of time at this point. It'll be after the holidays before I get time for a proper go at it. Sorry. 
> + pmdp_splitting_flush_notify(vma, address, pmd); > + ret = 1; > + } > + spin_unlock(&mm->page_table_lock); > + > + return ret; > +} > + > +static void __split_huge_page_refcount(struct page *page) > +{ > + int i; > + unsigned long head_index = page->index; > + > + compound_lock(page); > + > + for (i = 1; i < HPAGE_NR; i++) { > + struct page *page_tail = page + i; > + > + /* tail_page->_count cannot change */ > + atomic_sub(atomic_read(&page_tail->_count), &page->_count); > + BUG_ON(page_count(page) <= 0); > + atomic_add(page_mapcount(page) + 1, &page_tail->_count); > + BUG_ON(atomic_read(&page_tail->_count) <= 0); > + > + /* after clearing PageTail the gup refcount can be released */ > + smp_mb(); > + > + page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; > + page_tail->flags |= (page->flags & > + ((1L << PG_referenced) | > + (1L << PG_swapbacked) | > + (1L << PG_mlocked) | > + (1L << PG_uptodate))); > + page_tail->flags |= (1L << PG_dirty); > + > + /* > + * 1) clear PageTail before overwriting first_page > + * 2) clear PageTail before clearing PageHead for VM_BUG_ON > + */ > + smp_wmb(); > + > + BUG_ON(page_mapcount(page_tail)); > + page_tail->_mapcount = page->_mapcount; > + BUG_ON(page_tail->mapping); > + page_tail->mapping = page->mapping; > + page_tail->index = ++head_index; > + BUG_ON(!PageAnon(page_tail)); > + BUG_ON(!PageUptodate(page_tail)); > + BUG_ON(!PageDirty(page_tail)); > + BUG_ON(!PageSwapBacked(page_tail)); > + > + if (page_evictable(page_tail, NULL)) > + lru_cache_add_lru(page_tail, LRU_ACTIVE_ANON); > + else > + add_page_to_unevictable_list(page_tail); > + put_page(page_tail); > + } > + > + ClearPageCompound(page); > + compound_unlock(page); > +} > + > +static int __split_huge_page_map(struct page *page, > + struct vm_area_struct *vma, > + unsigned long address) > +{ > + struct mm_struct *mm = vma->vm_mm; > + pmd_t *pmd, _pmd; > + int ret = 0, i; > + pgtable_t pgtable; > + unsigned long haddr; > + > + spin_lock(&mm->page_table_lock); > + pmd = page_check_address_pmd(page, mm, address, > + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); > + if (pmd) { > + pgtable = get_pmd_huge_pte(mm); > + pmd_populate(mm, &_pmd, pgtable); > + > + for (i = 0, haddr = address; i < HPAGE_NR; > + i++, haddr += PAGE_SIZE) { > + pte_t *pte, entry; > + entry = mk_pte(page + i, vma->vm_page_prot); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > + if (!pmd_write(*pmd)) > + entry = pte_wrprotect(entry); > + else > + BUG_ON(page_mapcount(page) != 1); > + if (!pmd_young(*pmd)) > + entry = pte_mkold(entry); > + pte = pte_offset_map(&_pmd, haddr); > + BUG_ON(!pte_none(*pte)); > + set_pte_at(mm, haddr, pte, entry); > + pte_unmap(pte); > + } > + > + mm->nr_ptes++; > + smp_wmb(); /* make pte visible before pmd */ > + pmd_populate(mm, pmd, pgtable); > + ret = 1; > + } > + spin_unlock(&mm->page_table_lock); > + > + return ret; > +} > + > +/* must be called with anon_vma->lock hold */ > +static void __split_huge_page(struct page *page, > + struct anon_vma *anon_vma) > +{ > + int mapcount, mapcount2; > + struct vm_area_struct *vma; > + > + BUG_ON(!PageHead(page)); > + > + mapcount = 0; > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { > + unsigned long addr = vma_address(page, vma); > + if (addr == -EFAULT) > + continue; > + mapcount += __split_huge_page_splitting(page, vma, addr); > + } > + BUG_ON(mapcount != page_mapcount(page)); > + > + __split_huge_page_refcount(page); > + > + mapcount2 = 0; > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { > + unsigned long addr = 
vma_address(page, vma); > + if (addr == -EFAULT) > + continue; > + mapcount2 += __split_huge_page_map(page, vma, addr); > + } > + BUG_ON(mapcount != mapcount2); > +} > + > +/* must run with mmap_sem to prevent vma to go away */ > +void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd) > +{ > + struct page *page; > + struct anon_vma *anon_vma; > + struct mm_struct *mm; > + > + BUG_ON(vma->vm_flags & VM_HUGETLB); > + > + mm = vma->vm_mm; > + BUG_ON(down_write_trylock(&mm->mmap_sem)); > + > + anon_vma = vma->anon_vma; > + > + spin_lock(&anon_vma->lock); > + BUG_ON(pmd_trans_splitting(*pmd)); > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_trans_huge(*pmd))) { > + spin_unlock(&mm->page_table_lock); > + spin_unlock(&anon_vma->lock); > + return; > + } > + page = pmd_pgtable(*pmd); > + spin_unlock(&mm->page_table_lock); > + > + __split_huge_page(page, anon_vma); > + > + spin_unlock(&anon_vma->lock); > + BUG_ON(pmd_trans_huge(*pmd)); > +} > + > +/* must run with mmap_sem to prevent vma to go away */ > +void __split_huge_page_mm(struct mm_struct *mm, > + unsigned long address, > + pmd_t *pmd) > +{ > + struct vm_area_struct *vma; > + > + vma = find_vma(mm, address); > + BUG_ON(vma->vm_start > address); > + BUG_ON(vma->vm_mm != mm); > + > + __split_huge_page_vma(vma, pmd); > +} > + > +int split_huge_page(struct page *page) > +{ > + struct anon_vma *anon_vma; > + int ret = 1; > + > + BUG_ON(!PageAnon(page)); > + anon_vma = page_lock_anon_vma(page); > + if (!anon_vma) > + goto out; > + ret = 0; > + if (!PageCompound(page)) > + goto out_unlock; > + > + BUG_ON(!PageSwapBacked(page)); > + __split_huge_page(page, anon_vma); > + > + BUG_ON(PageCompound(page)); > +out_unlock: > + page_unlock_anon_vma(anon_vma); > +out: > + return ret; > +} > diff --git a/mm/memory.c b/mm/memory.c > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -647,9 +647,9 @@ out_set_pte: > return 0; > } > > -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, > - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, > - unsigned long addr, unsigned long end) > +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, > + pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, > + unsigned long addr, unsigned long end) > { > pte_t *orig_src_pte, *orig_dst_pte; > pte_t *src_pte, *dst_pte; > @@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct > src_pmd = pmd_offset(src_pud, addr); > do { > next = pmd_addr_end(addr, end); > + if (pmd_trans_huge(*src_pmd)) { > + int err; > + err = copy_huge_pmd(dst_mm, src_mm, > + dst_pmd, src_pmd, addr, vma); > + if (err == -ENOMEM) > + return -ENOMEM; > + if (!err) > + continue; > + /* fall through */ > + } > if (pmd_none_or_clear_bad(src_pmd)) > continue; > if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, > @@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang > pmd = pmd_offset(pud, addr); > do { > next = pmd_addr_end(addr, end); > + if (pmd_trans_huge(*pmd)) { > + if (next-addr != HPAGE_SIZE) > + split_huge_page_vma(vma, pmd); > + else if (zap_pmd_trans_huge(tlb, vma, pmd)) { > + (*zap_work)--; > + continue; > + } > + /* fall through */ > + } > if (pmd_none_or_clear_bad(pmd)) { > (*zap_work)--; > continue; > @@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_ > pmd = pmd_offset(pud, address); > if (pmd_none(*pmd)) > goto no_page_table; > - if (pmd_huge(*pmd)) { > + if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) { > BUG_ON(flags & FOLL_GET); > page = follow_huge_pmd(mm, address, pmd, flags & 
FOLL_WRITE); > goto out; > } > + if (pmd_trans_huge(*pmd)) { > + spin_lock(&mm->page_table_lock); > + if (likely(pmd_trans_huge(*pmd))) { > + if (unlikely(pmd_trans_splitting(*pmd))) { > + spin_unlock(&mm->page_table_lock); > + wait_split_huge_page(vma->anon_vma, pmd); > + } else { > + page = follow_trans_huge_pmd(mm, address, > + pmd, flags); > + spin_unlock(&mm->page_table_lock); > + goto out; > + } > + } else > + spin_unlock(&mm->page_table_lock); > + /* fall through */ > + } > if (unlikely(pmd_bad(*pmd))) > goto no_page_table; > > @@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct > pmd = pmd_offset(pud, pg); > if (pmd_none(*pmd)) > return i ? : -EFAULT; > + VM_BUG_ON(pmd_trans_huge(*pmd)); > pte = pte_offset_map(pmd, pg); > if (pte_none(*pte)) { > pte_unmap(pte); > @@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_ > * but allow concurrent faults), and pte mapped but not yet locked. > * We return with mmap_sem still held, but pte unmapped and unlocked. > */ > -static inline int handle_pte_fault(struct mm_struct *mm, > - struct vm_area_struct *vma, unsigned long address, > - pte_t *pte, pmd_t *pmd, unsigned int flags) > +int handle_pte_fault(struct mm_struct *mm, > + struct vm_area_struct *vma, unsigned long address, > + pte_t *pte, pmd_t *pmd, unsigned int flags) > { > pte_t entry; > spinlock_t *ptl; > @@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm > pmd = pmd_alloc(mm, pud, address); > if (!pmd) > return VM_FAULT_OOM; > + if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) { > + if (!vma->vm_ops) > + return do_huge_anonymous_page(mm, vma, address, > + pmd, flags); > + } else { > + pmd_t orig_pmd = *pmd; > + barrier(); > + if (pmd_trans_huge(orig_pmd)) { > + if (flags & FAULT_FLAG_WRITE && > + !pmd_write(orig_pmd) && > + !pmd_trans_splitting(orig_pmd)) > + return do_huge_wp_page(mm, vma, address, > + pmd, orig_pmd); > + return 0; > + } > + } > pte = pte_alloc_map(mm, vma, pmd, address); > if (!pte) > return VM_FAULT_OOM; > @@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct * > goto out; > > pmd = pmd_offset(pud, address); > + VM_BUG_ON(pmd_trans_huge(*pmd)); > if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) > goto out; > > diff --git a/mm/rmap.c b/mm/rmap.c > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -56,6 +56,7 @@ > #include > #include > #include > +#include > > #include > > @@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm > * Returns virtual address or -EFAULT if page's index/offset is not > * within the range mapped the @vma. > */ > -static inline unsigned long > +inline unsigned long > vma_address(struct page *page, struct vm_area_struct *vma) > { > pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); > @@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag > unsigned long *vm_flags) > { > struct mm_struct *mm = vma->vm_mm; > - pte_t *pte; > - spinlock_t *ptl; > int referenced = 0; > > - pte = page_check_address(page, mm, address, &ptl, 0); > - if (!pte) > - goto out; > - > /* > * Don't want to elevate referenced for mlocked page that gets this far, > * in order that it progresses to try_to_unmap and is moved to the > * unevictable list. > */ > if (vma->vm_flags & VM_LOCKED) { > - *mapcount = 1; /* break early from loop */ > + *mapcount = 0; /* break early from loop */ > *vm_flags |= VM_LOCKED; > - goto out_unmap; > - } > - > - if (ptep_clear_flush_young_notify(vma, address, pte)) { > - /* > - * Don't treat a reference through a sequentially read > - * mapping as such. 
If the page has been used in > - * another mapping, we will catch it; if this other > - * mapping is already gone, the unmap path will have > - * set PG_referenced or activated the page. > - */ > - if (likely(!VM_SequentialReadHint(vma))) > - referenced++; > + goto out; > } > > /* Pretend the page is referenced if the task has the > @@ -380,9 +363,43 @@ int page_referenced_one(struct page *pag > rwsem_is_locked(&mm->mmap_sem)) > referenced++; > > -out_unmap: > + if (unlikely(PageCompound(page))) { > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + pmd_t *pmd; > + > + spin_lock(&mm->page_table_lock); > + pmd = page_check_address_pmd(page, mm, address, > + PAGE_CHECK_ADDRESS_PMD_FLAG); > + if (pmd && !pmd_trans_splitting(*pmd) && > + pmdp_clear_flush_young_notify(vma, address, pmd)) > + referenced++; > + spin_unlock(&mm->page_table_lock); > +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ > + VM_BUG_ON(1); > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + } else { > + pte_t *pte; > + spinlock_t *ptl; > + > + pte = page_check_address(page, mm, address, &ptl, 0); > + if (!pte) > + goto out; > + > + if (ptep_clear_flush_young_notify(vma, address, pte)) { > + /* > + * Don't treat a reference through a sequentially read > + * mapping as such. If the page has been used in > + * another mapping, we will catch it; if this other > + * mapping is already gone, the unmap path will have > + * set PG_referenced or activated the page. > + */ > + if (likely(!VM_SequentialReadHint(vma))) > + referenced++; > + } > + pte_unmap_unlock(pte, ptl); > + } > + > (*mapcount)--; > - pte_unmap_unlock(pte, ptl); > > if (referenced) > *vm_flags |= vma->vm_flags; > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org