From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id D10E56B0044 for ; Fri, 18 Dec 2009 15:03:58 -0500 (EST) Date: Fri, 18 Dec 2009 20:03:46 +0000 From: Mel Gorman Subject: Re: [PATCH 25 of 28] transparent hugepage core Message-ID: <20091218200345.GH21194@csn.ul.ie> References: <4d96699c8fb89a4a22eb.1261076428@v2.random> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <4d96699c8fb89a4a22eb.1261076428@v2.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Andi Kleen , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , Andrew Morton List-ID: On Thu, Dec 17, 2009 at 07:00:28PM -0000, Andrea Arcangeli wrote: > From: Andrea Arcangeli > > Lately I've been working to make KVM use hugepages transparently > without the usual restrictions of hugetlbfs. Some of the restrictions > I'd like to see removed: > > 1) hugepages have to be swappable or the guest physical memory remains > locked in RAM and can't be paged out to swap > > 2) if a hugepage allocation fails, regular pages should be allocated > instead and mixed in the same vma without any failure and without > userland noticing > > 3) if some task quits and more hugepages become available in the > buddy, guest physical memory backed by regular pages should be > relocated on hugepages automatically in regions under > madvise(MADV_HUGEPAGE) (ideally event driven by waking up the > kernel deamon if the order=HPAGE_SHIFT-PAGE_SHIFT list becomes not > null) > > 4) avoidance of reservation and maximization of use of hugepages whenever > possible. Reservation (needed to avoid runtime fatal faliures) may be ok for > 1 machine with 1 database with 1 database cache with 1 database cache size > known at boot time. It's definitely not feasible with a virtualization > hypervisor usage like RHEV-H that runs an unknown number of virtual machines > with an unknown size of each virtual machine with an unknown amount of > pagecache that could be potentially useful in the host for guest not using > O_DIRECT (aka cache=off). > > hugepages in the virtualization hypervisor (and also in the guest!) are > much more important than in a regular host not using virtualization, becasue > with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in > case only the hypervisor uses transparent hugepages, and they decrease the > tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and > the linux guest both uses this patch (though the guest will limit the addition > speedup to anonymous regions only for now...). Even more important is that the > tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow > paging or no-virtualization scenario. So maximizing the amount of virtual > memory cached by the TLB pays off significantly more with NPT/EPT than without > (even if there would be no significant speedup in the tlb-miss runtime). > > The first (and more tedious) part of this work requires allowing the VM to > handle anonymous hugepages mixed with regular pages transparently on regular > anonymous vmas. This is what this patch tries to achieve in the least intrusive > possible way. 
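To make the madvise(MADV_HUGEPAGE) interface above concrete, here is a minimal userland sketch (illustration only, not part of the patch) of how an application would opt a region in once the series is applied. The numeric MADV_HUGEPAGE fallback and the 2M hugepage size are assumptions for x86-64; the real definition comes from the patched headers.

============
/* sketch: hint an anonymous region for transparent hugepages */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14    /* assumed value; use the patched <asm/mman.h> */
#endif

#define LEN (512UL*1024*1024)
#define HPAGE_2M (2UL*1024*1024)

int main(void)
{
    void *p;

    /* 2M-align so whole pmds fall inside the mapping */
    if (posix_memalign(&p, HPAGE_2M, LEN))
        return 1;
    /* fails with EINVAL on a kernel without this series */
    if (madvise(p, LEN, MADV_HUGEPAGE))
        perror("madvise");
    memset(p, 0, LEN);    /* these faults can now use huge pmds */
    free(p);
    return 0;
}
============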
We want hugepages and hugetlb to be used in a way so that all > applications can benefit without changes (as usual we leverage the KVM > virtualization design: by improving the Linux VM at large, KVM gets the > performance boost too). > > The most important design choice is: always fallback to 4k allocation > if the hugepage allocation fails! This is the _very_ opposite of some > large pagecache patches that failed with -EIO back then if a 64k (or > similar) allocation failed... > > Second important decision (to reduce the impact of the feature on the > existing pagetable handling code) is that at any time we can split an > hugepage into 512 regular pages and it has to be done with an > operation that can't fail. This way the reliability of the swapping > isn't decreased (no need to allocate memory when we are short on > memory to swap) and it's trivial to plug a split_huge_page* one-liner > where needed without polluting the VM. Over time we can teach > mprotect, mremap and friends to handle pmd_trans_huge natively without > calling split_huge_page*. The fact it can't fail isn't just for swap: > if split_huge_page would return -ENOMEM (instead of the current void) > we'd need to rollback the mprotect from the middle of it (ideally > including undoing the split_vma) which would be a big change and in > the very wrong direction (it'd likely be simpler not to call > split_huge_page at all and to teach mprotect and friends to handle > hugepages instead of rolling them back from the middle). In short the > very value of split_huge_page is that it can't fail. > > The collapsing and madvise(MADV_HUGEPAGE) part will remain separated > and incremental and it'll just be an "harmless" addition later if this > initial part is agreed upon. It also should be noted that locking-wise > replacing regular pages with hugepages is going to be very easy if > compared to what I'm doing below in split_huge_page, as it will only > happen when page_count(page) matches page_mapcount(page) if we can > take the PG_lock and mmap_sem in write mode. collapse_huge_page will > be a "best effort" that (unlike split_huge_page) can fail at the > minimal sign of trouble and we can try again later. collapse_huge_page > will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will > work similar to madvise(MADV_MERGEABLE). > > The default I like is that transparent hugepages are used at page fault time. > This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The > control knob can be set to three values "always", "madvise", "never" which > mean respectively that hugepages are always used, or only inside > madvise(MADV_HUGEPAGE) regions, or never used. > /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage > allocation should defrag memory aggressively "always", only inside "madvise" > regions, or "never". > > The pmd_trans_splitting/pmd_trans_huge locking is very solid. The > put_page (from get_user_page users that can't use mmu notifier like > O_DIRECT) that runs against a __split_huge_page_refcount instead was a > pain to serialize in a way that would result always in a coherent page > count for both tail and head. I think my locking solution with a > compound_lock taken only after the page_first is valid and is still a > PageHead should be safe but it surely needs review from SMP race point > of view. 
In short there is no current existing way to serialize the > O_DIRECT final put_page against split_huge_page_refcount so I had to > invent a new one (O_DIRECT loses knowledge on the mapping status by > the time gup_fast returns so...). And I didn't want to impact all > gup/gup_fast users for now, maybe if we change the gup interface > substantially we can avoid this locking, I admit I didn't think too > much about it because changing the gup unpinning interface would be > invasive. > > If we ignored O_DIRECT we could stick to the existing compound > refcounting code, by simply adding a > get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu > notifier user) would call it without FOLL_GET (and if FOLL_GET isn't > set we'd just BUG_ON if nobody registered itself in the current task > mmu notifier list yet). But O_DIRECT is fundamental for decent > performance of virtualized I/O on fast storage so we can't avoid it to > solve the race of put_page against split_huge_page_refcount to achieve > a complete hugepage feature for KVM. > > Swap and oom works fine (well just like with regular pages ;). MMU > notifier is handled transparently too, with the exception of the young > bit on the pmd, that didn't have a range check but I think KVM will be > fine because the whole point of hugepages is that EPT/NPT will also > use a huge pmd when they notice gup returns pages with PageCompound set, > so they won't care of a range and there's just the pmd young bit to > check in that case. > > NOTE: in some cases if the L2 cache is small, this may slowdown and > waste memory during COWs because 4M of memory are accessed in a single > fault instead of 8k (the payoff is that after COW the program can run > faster). So we might want to switch the copy_huge_page (and > clear_huge_page too) to not temporal stores. I also extensively > researched ways to avoid this cache trashing with a full prefault > logic that would cow in 8k/16k/32k/64k up to 1M (I can send those > patches that fully implemented prefault) but I concluded they're not > worth it and they add an huge additional complexity and they remove all tlb > benefits until the full hugepage has been faulted in, to save a little bit of > memory and some cache during app startup, but they still don't improve > substantially the cache-trashing during startup if the prefault happens in >4k > chunks. One reason is that those 4k pte entries copied are still mapped on a > perfectly cache-colored hugepage, so the trashing is the worst one can generate > in those copies (cow of 4k page copies aren't so well colored so they trashes > less, but again this results in software running faster after the page fault). > Those prefault patches allowed things like a pte where post-cow pages were > local 4k regular anon pages and the not-yet-cowed pte entries were pointing in > the middle of some hugepage mapped read-only. If it doesn't payoff > substantially with todays hardware it will payoff even less in the future with > larger l2 caches, and the prefault logic would blot the VM a lot. If one is > emebdded transparent_hugepage can be disabled during boot with sysfs or with > the boot commandline parameter transparent_hugepage=0 (or > transparent_hugepage=2 to restrict hugepages inside madvise regions) that will > ensure not a single hugepage is allocated at boot time. 
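For completeness, a small sketch (again an illustration of the knobs described above, not code from the patch) that reads the current policy and switches both files to "madvise"; it assumes the sysfs layout this patch creates under /sys/kernel/mm/transparent_hugepage and needs root to write.

============
/* sketch: show the current policy and restrict THP to madvise regions */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void set_policy(const char *file, const char *mode)
{
    char path[128], cur[64];
    ssize_t n;
    int fd;

    snprintf(path, sizeof(path),
             "/sys/kernel/mm/transparent_hugepage/%s", file);
    fd = open(path, O_RDWR);
    if (fd < 0) {
        perror(path);    /* kernel without this series */
        return;
    }
    n = read(fd, cur, sizeof(cur) - 1);
    if (n > 0) {
        cur[n] = 0;
        printf("%s: %s", file, cur);    /* [brackets] mark the active value */
    }
    lseek(fd, 0, SEEK_SET);
    if (write(fd, mode, strlen(mode)) < 0)
        perror(path);
    close(fd);
}

int main(void)
{
    /* hugepages (and the __GFP_REPEAT "defrag" mode) only for
     * madvise(MADV_HUGEPAGE) regions */
    set_policy("enabled", "madvise");
    set_policy("defrag", "madvise");
    return 0;
}
============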
It is simple enough to > just disable transparent hugepage globally and let transparent hugepages be > allocated selectively by applications in the MADV_HUGEPAGE region (both at page > fault time, and if enabled with the collapse_huge_page too through the kernel > daemon). > > This patch supports only hugepages mapped in the pmd, archs that have > smaller hugepages will not fit in this patch alone. Also some archs like power > have certain tlb limits that prevents mixing different page size in the same > regions so they will not fit in this framework that requires "graceful > fallback" to basic PAGE_SIZE in case of physical memory fragmentation. > hugetlbfs remains a perfect fit for those because its software limits happen to > match the hardware limits. hugetlbfs also remains a perfect fit for hugepage > sizes like 1GByte that cannot be hoped to be found not fragmented after a > certain system uptime and that would be very expensive to defragment with > relocation, so requiring reservation. hugetlbfs is the "reservation way", the > point of transparent hugepages is not to have any reservation at all and > maximizing the use of cache and hugepages at all times automatically. > > Some performance result: > > vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep > ages3 > memset page fault 1566023 > memset tlb miss 453854 > memset second tlb miss 453321 > random access tlb miss 41635 > random access second tlb miss 41658 > vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 > memset page fault 1566471 > memset tlb miss 453375 > memset second tlb miss 453320 > random access tlb miss 41636 > random access second tlb miss 41637 > vmx andrea # ./largepages3 > memset page fault 1566642 > memset tlb miss 453417 > memset second tlb miss 453313 > random access tlb miss 41630 > random access second tlb miss 41647 > vmx andrea # ./largepages3 > memset page fault 1566872 > memset tlb miss 453418 > memset second tlb miss 453315 > random access tlb miss 41618 > random access second tlb miss 41659 > vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage > vmx andrea # ./largepages3 > memset page fault 2182476 > memset tlb miss 460305 > memset second tlb miss 460179 > random access tlb miss 44483 > random access second tlb miss 44186 > vmx andrea # ./largepages3 > memset page fault 2182791 > memset tlb miss 460742 > memset second tlb miss 459962 > random access tlb miss 43981 > random access second tlb miss 43988 > > ============ > #include > #include > #include > #include > > #define SIZE (3UL*1024*1024*1024) > > int main() > { > char *p = malloc(SIZE), *p2; > struct timeval before, after; > > gettimeofday(&before, NULL); > memset(p, 0, SIZE); > gettimeofday(&after, NULL); > printf("memset page fault %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > after.tv_usec-before.tv_usec); > > gettimeofday(&before, NULL); > memset(p, 0, SIZE); > gettimeofday(&after, NULL); > printf("memset tlb miss %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > after.tv_usec-before.tv_usec); > > gettimeofday(&before, NULL); > memset(p, 0, SIZE); > gettimeofday(&after, NULL); > printf("memset second tlb miss %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > after.tv_usec-before.tv_usec); > > gettimeofday(&before, NULL); > for (p2 = p; p2 < p+SIZE; p2 += 4096) > *p2 = 0; > gettimeofday(&after, NULL); > printf("random access tlb miss %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > 
after.tv_usec-before.tv_usec); > > gettimeofday(&before, NULL); > for (p2 = p; p2 < p+SIZE; p2 += 4096) > *p2 = 0; > gettimeofday(&after, NULL); > printf("random access second tlb miss %Lu\n", > (after.tv_sec-before.tv_sec)*1000000UL + > after.tv_usec-before.tv_usec); > > return 0; > } > ============ > > Signed-off-by: Andrea Arcangeli > --- > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > new file mode 100644 > --- /dev/null > +++ b/include/linux/huge_mm.h > @@ -0,0 +1,110 @@ > +#ifndef _LINUX_HUGE_MM_H > +#define _LINUX_HUGE_MM_H > + > +extern int do_huge_anonymous_page(struct mm_struct *mm, > + struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, > + unsigned int flags); > +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, > + struct vm_area_struct *vma); > +extern int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, > + pmd_t orig_pmd); > +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm); The naming of "huge" might bite in the ass later if/when transparent support is applied to multiple page sizes. Granted, it's not happening any time soon. > +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm, > + unsigned long addr, > + pmd_t *pmd, > + unsigned int flags); > +extern int zap_pmd_trans_huge(struct mmu_gather *tlb, > + struct vm_area_struct *vma, > + pmd_t *pmd); > + > +enum transparent_hugepage_flag { > + TRANSPARENT_HUGEPAGE_FLAG, > + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, > + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, Defrag is misleading. Glancing through the rest of the patch, "try harder" would be a more appropriate term because it uses __GFP_REPEAT. > + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, > +#ifdef CONFIG_DEBUG_VM > + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG, > +#endif > +}; > + > +#define transparent_hugepage_enabled(__vma) \ > + (transparent_hugepage_flags & (1< + (transparent_hugepage_flags & \ > + (1< + (__vma)->vm_flags & VM_HUGEPAGE)) > +#define transparent_hugepage_defrag(__vma) \ > + (transparent_hugepage_flags & \ > + (1< + (transparent_hugepage_flags & \ > + (1< + (__vma)->vm_flags & VM_HUGEPAGE)) > +#ifdef CONFIG_DEBUG_VM > +#define transparent_hugepage_debug_cow() \ > + (transparent_hugepage_flags & \ > + (1< +#else /* CONFIG_DEBUG_VM */ > +#define transparent_hugepage_debug_cow() 0 > +#endif /* CONFIG_DEBUG_VM */ > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +extern unsigned long transparent_hugepage_flags; > +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, > + pmd_t *dst_pmd, pmd_t *src_pmd, > + struct vm_area_struct *vma, > + unsigned long addr, unsigned long end); > +extern int handle_pte_fault(struct mm_struct *mm, > + struct vm_area_struct *vma, unsigned long address, > + pte_t *pte, pmd_t *pmd, unsigned int flags); > +extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address, > + pmd_t *pmd); > +extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd); > +extern int split_huge_page(struct page *page); > +#define split_huge_page_mm(__mm, __addr, __pmd) \ > + do { \ > + if (unlikely(pmd_trans_huge(*(__pmd)))) \ > + __split_huge_page_mm(__mm, __addr, __pmd); \ > + } while (0) > +#define split_huge_page_vma(__vma, __pmd) \ > + do { \ > + if (unlikely(pmd_trans_huge(*(__pmd)))) \ > + __split_huge_page_vma(__vma, __pmd); \ > + } while (0) > +#define wait_split_huge_page(__anon_vma, __pmd) \ > + do { \ > + smp_mb(); \ > + 
spin_unlock_wait(&(__anon_vma)->lock); \ > + smp_mb(); \ > + VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \ > + pmd_trans_huge(*(__pmd))); \ > + } while (0) > +#define HPAGE_ORDER (HPAGE_SHIFT-PAGE_SHIFT) > +#define HPAGE_NR (1< + > +enum page_check_address_pmd_flag { > + PAGE_CHECK_ADDRESS_PMD_FLAG, > + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, > + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, > +}; > +extern pmd_t *page_check_address_pmd(struct page *page, > + struct mm_struct *mm, > + unsigned long address, > + enum page_check_address_pmd_flag flag); > +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma); > +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ > +#define transparent_hugepage_flags 0UL > +static inline int split_huge_page(struct page *page) > +{ > + return 0; > +} > +#define split_huge_page_mm(__mm, __addr, __pmd) \ > + do { } while (0) > +#define split_huge_page_vma(__vma, __pmd) \ > + do { } while (0) > +#define wait_split_huge_page(__anon_vma, __pmd) \ > + do { } while (0) > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + > +#endif /* _LINUX_HUGE_MM_H */ > diff --git a/include/linux/mm.h b/include/linux/mm.h > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -234,6 +234,7 @@ struct inode; > * files which need it (119 of them) > */ > #include > +#include > > /* > * Methods to modify the page usage count. > diff --git a/mm/Makefile b/mm/Makefile > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f > obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o > obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o > obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o > +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > new file mode 100644 > --- /dev/null > +++ b/mm/huge_memory.c Similar concern on naming. Later someone will get confused as to why there is hugetlbfs and huge_memory. > @@ -0,0 +1,792 @@ > +/* > + * Copyright (C) 2009 Red Hat, Inc. > + * > + * This work is licensed under the terms of the GNU GPL, version 2. See > + * the COPYING file in the top-level directory.
> + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "internal.h" > + > +unsigned long transparent_hugepage_flags __read_mostly = > + (1< + > +#ifdef CONFIG_SYSFS > +static ssize_t double_flag_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf, > + enum transparent_hugepage_flag enabled, > + enum transparent_hugepage_flag req_madv) > +{ > + if (test_bit(enabled, &transparent_hugepage_flags)) { > + VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags)); > + return sprintf(buf, "[always] madvise never\n"); > + } else if (test_bit(req_madv, &transparent_hugepage_flags)) > + return sprintf(buf, "always [madvise] never\n"); > + else > + return sprintf(buf, "always madvise [never]\n"); > +} > +static ssize_t double_flag_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count, > + enum transparent_hugepage_flag enabled, > + enum transparent_hugepage_flag req_madv) > +{ > + if (!memcmp("always", buf, > + min(sizeof("always")-1, count))) { > + set_bit(enabled, &transparent_hugepage_flags); > + clear_bit(req_madv, &transparent_hugepage_flags); > + } else if (!memcmp("madvise", buf, > + min(sizeof("madvise")-1, count))) { > + clear_bit(enabled, &transparent_hugepage_flags); > + set_bit(req_madv, &transparent_hugepage_flags); > + } else if (!memcmp("never", buf, > + min(sizeof("never")-1, count))) { > + clear_bit(enabled, &transparent_hugepage_flags); > + clear_bit(req_madv, &transparent_hugepage_flags); > + } else > + return -EINVAL; > + > + return count; > +} > + > +static ssize_t enabled_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return double_flag_show(kobj, attr, buf, > + TRANSPARENT_HUGEPAGE_FLAG, > + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG); > +} > +static ssize_t enabled_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + return double_flag_store(kobj, attr, buf, count, > + TRANSPARENT_HUGEPAGE_FLAG, > + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG); > +} > +static struct kobj_attribute enabled_attr = > + __ATTR(enabled, 0644, enabled_show, enabled_store); > + > +static ssize_t defrag_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return double_flag_show(kobj, attr, buf, > + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, > + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG); > +} > +static ssize_t defrag_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + return double_flag_store(kobj, attr, buf, count, > + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, > + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG); > +} > +static struct kobj_attribute defrag_attr = > + __ATTR(defrag, 0644, defrag_show, defrag_store); > + > +static ssize_t single_flag_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf, > + enum transparent_hugepage_flag flag) > +{ > + if (test_bit(flag, &transparent_hugepage_flags)) > + return sprintf(buf, "[yes] no\n"); > + else > + return sprintf(buf, "yes [no]\n"); > +} > +static ssize_t single_flag_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count, > + enum transparent_hugepage_flag flag) > +{ > + if (!memcmp("yes", buf, > + min(sizeof("yes")-1, count))) { > + set_bit(flag, &transparent_hugepage_flags); > + } else if (!memcmp("no", buf, > + min(sizeof("no")-1, count))) { > + clear_bit(flag, &transparent_hugepage_flags); > + } else > + return -EINVAL; > + > + 
return count; > +} > + > +#ifdef CONFIG_DEBUG_VM > +static ssize_t debug_cow_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return single_flag_show(kobj, attr, buf, > + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG); > +} > +static ssize_t debug_cow_store(struct kobject *kobj, > + struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + return single_flag_store(kobj, attr, buf, count, > + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG); > +} > +static struct kobj_attribute debug_cow_attr = > + __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store); > +#endif /* CONFIG_DEBUG_VM */ > + > +static struct attribute *hugepage_attr[] = { > + &enabled_attr.attr, > + &defrag_attr.attr, > +#ifdef CONFIG_DEBUG_VM > + &debug_cow_attr.attr, > +#endif > + NULL, > +}; > + > +static struct attribute_group hugepage_attr_group = { > + .attrs = hugepage_attr, > + .name = "transparent_hugepage", > +}; > +#endif /* CONFIG_SYSFS */ > + > +static int __init ksm_init(void) > +{ > +#ifdef CONFIG_SYSFS > + int err; > + > + err = sysfs_create_group(mm_kobj, &hugepage_attr_group); > + if (err) > + printk(KERN_ERR "hugepage: register sysfs failed\n"); > +#endif > + return 0; > +} > +module_init(ksm_init) > + > +static int __init setup_transparent_hugepage(char *str) > +{ > + if (!str) > + return 0; > + transparent_hugepage_flags = simple_strtoul(str, &str, 0); > + return 1; > +} > +__setup("transparent_hugepage=", setup_transparent_hugepage); > + > + > +static void prepare_pmd_huge_pte(pgtable_t pgtable, > + struct mm_struct *mm) > +{ > + VM_BUG_ON(spin_can_lock(&mm->page_table_lock)); > + > + /* FIFO */ > + if (!mm->pmd_huge_pte) > + INIT_LIST_HEAD(&pgtable->lru); > + else > + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru); > + mm->pmd_huge_pte = pgtable; > +} > + > +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) > +{ > + if (likely(vma->vm_flags & VM_WRITE)) > + pmd = pmd_mkwrite(pmd); > + return pmd; > +} > + > +static int __do_huge_anonymous_page(struct mm_struct *mm, > + struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, > + struct page *page, Maybe this should be do_pmd_anonymous page and match what do_anonymous_page does as much as possible. This might offset any future problems related to transparently handling pages at other page table levels. > + unsigned long haddr) > +{ > + int ret = 0; > + pgtable_t pgtable; > + > + VM_BUG_ON(!PageCompound(page)); > + pgtable = pte_alloc_one(mm, address); > + if (unlikely(!pgtable)) { > + put_page(page); > + return VM_FAULT_OOM; > + } > + > + clear_huge_page(page, haddr, HPAGE_NR); > + Ideally insead of defining things like HPAGE_NR, the existing functions for multiple huge page sizes would be extended to return the "huge page size corresponding to a PMD". > + __SetPageUptodate(page); > + smp_wmb(); > + Need to explain why smp_wmb() is needed there. It doesn't look like you're protecting the bit set itself. More likely you are making sure the writes in clear_huge_page() have finished but that's a guess. Comment. > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_none(*pmd))) { > + put_page(page); > + pte_free(mm, pgtable); Racing fault already filled in the PTE? If so, comment please. Again, matching how do_anonymous_page() does a similar job would help comprehension. 
> + } else { > + pmd_t entry; > + entry = mk_pmd(page, vma->vm_page_prot); > + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > + entry = pmd_mkhuge(entry); > + page_add_new_anon_rmap(page, vma, haddr); > + set_pmd_at(mm, haddr, pmd, entry); > + prepare_pmd_huge_pte(pgtable, mm); > + } > + spin_unlock(&mm->page_table_lock); > + > + return ret; > +} > + > +int do_huge_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, > + unsigned int flags) > +{ > + struct page *page; > + unsigned long haddr = address & HPAGE_MASK; > + pte_t *pte; > + > + if (haddr >= vma->vm_start && haddr + HPAGE_SIZE <= vma->vm_end) { > + if (unlikely(anon_vma_prepare(vma))) > + return VM_FAULT_OOM; > + page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP| > + (transparent_hugepage_defrag(vma) ? GFP_HIGHUSER_MOVABLE should only be used if hugepages_treat_as_movable is set in /proc/sys/vm. This should be GFP_HIGHUSER only. > + __GFP_REPEAT : 0)|__GFP_NOWARN, > + HPAGE_ORDER); > + if (unlikely(!page)) > + goto out; > + > + return __do_huge_anonymous_page(mm, vma, > + address, pmd, > + page, haddr); > + } > +out: > + pte = pte_alloc_map(mm, vma, pmd, address); > + if (!pte) > + return VM_FAULT_OOM; > + return handle_pte_fault(mm, vma, address, pte, pmd, flags); > +} > + > +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, > + struct vm_area_struct *vma) > +{ > + struct page *src_page; > + pmd_t pmd; > + pgtable_t pgtable; > + int ret; > + > + ret = -ENOMEM; > + pgtable = pte_alloc_one(dst_mm, addr); > + if (unlikely(!pgtable)) > + goto out; > + > + spin_lock(&dst_mm->page_table_lock); > + spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING); > + > + ret = -EAGAIN; > + pmd = *src_pmd; > + if (unlikely(!pmd_trans_huge(pmd))) > + goto out_unlock; > + if (unlikely(pmd_trans_splitting(pmd))) { > + /* split huge page running from under us */ > + spin_unlock(&src_mm->page_table_lock); > + spin_unlock(&dst_mm->page_table_lock); > + > + wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */ > + goto out; > + } > + src_page = pmd_pgtable(pmd); > + VM_BUG_ON(!PageHead(src_page)); > + get_page(src_page); > + page_dup_rmap(src_page); > + add_mm_counter(dst_mm, anon_rss, HPAGE_NR); > + > + pmdp_set_wrprotect(src_mm, addr, src_pmd); > + pmd = pmd_mkold(pmd_wrprotect(pmd)); > + set_pmd_at(dst_mm, addr, dst_pmd, pmd); > + prepare_pmd_huge_pte(pgtable, dst_mm); > + > + ret = 0; > +out_unlock: > + spin_unlock(&src_mm->page_table_lock); > + spin_unlock(&dst_mm->page_table_lock); > +out: > + return ret; > +} > + > +/* no "address" argument so destroys page coloring of some arch */ > +pgtable_t get_pmd_huge_pte(struct mm_struct *mm) > +{ > + pgtable_t pgtable; > + > + VM_BUG_ON(spin_can_lock(&mm->page_table_lock)); > + > + /* FIFO */ > + pgtable = mm->pmd_huge_pte; > + if (list_empty(&pgtable->lru)) > + mm->pmd_huge_pte = NULL; /* debug */ > + else { > + mm->pmd_huge_pte = list_entry(pgtable->lru.next, > + struct page, lru); > + list_del(&pgtable->lru); > + } > + return pgtable; > +} > + > +int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmd, pmd_t orig_pmd) > +{ > + int ret = 0, i; > + struct page *page, *new_page; > + unsigned long haddr; > + struct page **pages; > + > + VM_BUG_ON(!vma->anon_vma); > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > + goto out_unlock; > + > + page = pmd_pgtable(orig_pmd); > + 
VM_BUG_ON(!PageCompound(page) || !PageHead(page)); > + haddr = address & HPAGE_MASK; > + if (page_mapcount(page) == 1) { > + pmd_t entry; > + entry = pmd_mkyoung(orig_pmd); > + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > + if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1)) > + update_mmu_cache(vma, address, entry); > + ret |= VM_FAULT_WRITE; > + goto out_unlock; > + } > + spin_unlock(&mm->page_table_lock); > + > + new_page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP| > + (transparent_hugepage_defrag(vma) ? > + __GFP_REPEAT : 0)|__GFP_NOWARN, > + HPAGE_ORDER); > + if (transparent_hugepage_debug_cow() && new_page) { > + put_page(new_page); > + new_page = NULL; > + } > + if (unlikely(!new_page)) { This entire block needs to be in a demote_pmd_page() or something similar. It's on the hefty side for being in the main function. That said, I didn't spot anything wrong in there either. > + pgtable_t pgtable; > + pmd_t _pmd; > + > + pages = kzalloc(sizeof(struct page *) * HPAGE_NR, > + GFP_KERNEL); > + if (unlikely(!pages)) { > + ret |= VM_FAULT_OOM; > + goto out; > + } > + > + for (i = 0; i < HPAGE_NR; i++) { > + pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE, > + vma, address); > + if (unlikely(!pages[i])) { > + while (--i >= 0) > + put_page(pages[i]); > + kfree(pages); > + ret |= VM_FAULT_OOM; > + goto out; > + } > + } > + > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > + goto out_free_pages; > + else > + get_page(page); > + spin_unlock(&mm->page_table_lock); > + > + might_sleep(); Is this check really necessary? We could already go to sleep more easily when allocating pages. > + for (i = 0; i < HPAGE_NR; i++) { > + copy_user_highpage(pages[i], page + i, More nasty naming there. It needs to be clearer that pages holds the demoted base pages and page is the existing compound page.
> + haddr + PAGE_SHIFT*i, vma); > + __SetPageUptodate(pages[i]); > + cond_resched(); > + } > + > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > + goto out_free_pages; > + else > + put_page(page); > + > + pmdp_clear_flush_notify(vma, haddr, pmd); > + /* leave pmd empty until pte is filled */ > + > + pgtable = get_pmd_huge_pte(mm); > + pmd_populate(mm, &_pmd, pgtable); > + > + for (i = 0; i < HPAGE_NR; > + i++, haddr += PAGE_SIZE) { > + pte_t *pte, entry; > + entry = mk_pte(pages[i], vma->vm_page_prot); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > + page_add_new_anon_rmap(pages[i], vma, haddr); > + pte = pte_offset_map(&_pmd, haddr); > + VM_BUG_ON(!pte_none(*pte)); > + set_pte_at(mm, haddr, pte, entry); > + pte_unmap(pte); > + } > + kfree(pages); > + > + mm->nr_ptes++; > + smp_wmb(); /* make pte visible before pmd */ > + pmd_populate(mm, pmd, pgtable); > + spin_unlock(&mm->page_table_lock); > + > + ret |= VM_FAULT_WRITE; > + page_remove_rmap(page); > + put_page(page); > + goto out; > + } > + > + copy_huge_page(new_page, page, haddr, vma, HPAGE_NR); > + __SetPageUptodate(new_page); > + > + smp_wmb(); > + > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > + put_page(new_page); > + else { > + pmd_t entry; > + entry = mk_pmd(new_page, vma->vm_page_prot); > + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > + entry = pmd_mkhuge(entry); > + pmdp_clear_flush_notify(vma, haddr, pmd); > + page_add_new_anon_rmap(new_page, vma, haddr); > + set_pmd_at(mm, haddr, pmd, entry); > + update_mmu_cache(vma, address, entry); > + page_remove_rmap(page); > + put_page(page); > + ret |= VM_FAULT_WRITE; > + } > +out_unlock: > + spin_unlock(&mm->page_table_lock); > +out: > + return ret; > + > +out_free_pages: > + for (i = 0; i < HPAGE_NR; i++) > + put_page(pages[i]); > + kfree(pages); > + goto out_unlock; > +} > + > +struct page *follow_trans_huge_pmd(struct mm_struct *mm, > + unsigned long addr, > + pmd_t *pmd, > + unsigned int flags) > +{ > + struct page *page = NULL; > + > + VM_BUG_ON(spin_can_lock(&mm->page_table_lock)); > + > + if (flags & FOLL_WRITE && !pmd_write(*pmd)) > + goto out; > + > + page = pmd_pgtable(*pmd); > + VM_BUG_ON(!PageHead(page)); > + if (flags & FOLL_TOUCH) { > + pmd_t _pmd; > + /* > + * We should set the dirty bit only for FOLL_WRITE but > + * for now the dirty bit in the pmd is meaningless. > + * And if the dirty bit will become meaningful and > + * we'll only set it with FOLL_WRITE, an atomic > + * set_bit will be required on the pmd to set the > + * young bit, instead of the current set_pmd_at. > + */ > + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd)); > + set_pmd_at(mm, addr & HPAGE_MASK, pmd, _pmd); > + } > + page += (addr & ~HPAGE_MASK) >> PAGE_SHIFT; More HPAGE vs PMD here. 
> + VM_BUG_ON(!PageCompound(page)); > + if (flags & FOLL_GET) > + get_page(page); > + > +out: > + return page; > +} > + > +int zap_pmd_trans_huge(struct mmu_gather *tlb, struct vm_area_struct *vma, > + pmd_t *pmd) > +{ > + int ret = 0; > + > + spin_lock(&tlb->mm->page_table_lock); > + if (likely(pmd_trans_huge(*pmd))) { > + if (unlikely(pmd_trans_splitting(*pmd))) { > + spin_unlock(&tlb->mm->page_table_lock); > + wait_split_huge_page(vma->anon_vma, > + pmd); > + } else { > + struct page *page; > + pgtable_t pgtable; > + pgtable = get_pmd_huge_pte(tlb->mm); > + page = pfn_to_page(pmd_pfn(*pmd)); > + VM_BUG_ON(!PageCompound(page)); > + pmd_clear(pmd); > + page_remove_rmap(page); > + VM_BUG_ON(page_mapcount(page) < 0); > + spin_unlock(&tlb->mm->page_table_lock); > + add_mm_counter(tlb->mm, anon_rss, -HPAGE_NR); > + tlb_remove_page(tlb, page); > + pte_free(tlb->mm, pgtable); > + ret = 1; > + } > + } else > + spin_unlock(&tlb->mm->page_table_lock); > + > + return ret; > +} > + > +pmd_t *page_check_address_pmd(struct page *page, > + struct mm_struct *mm, > + unsigned long address, > + enum page_check_address_pmd_flag flag) > +{ > + pgd_t *pgd; > + pud_t *pud; > + pmd_t *pmd, *ret = NULL; > + > + pgd = pgd_offset(mm, address); > + if (!pgd_present(*pgd)) > + goto out; > + > + pud = pud_offset(pgd, address); > + if (!pud_present(*pud)) > + goto out; > + > + pmd = pmd_offset(pud, address); > + if (pmd_none(*pmd)) > + goto out; > + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG && > + pmd_trans_splitting(*pmd)); > + if (pmd_trans_huge(*pmd) && pmd_pgtable(*pmd) == page) { > + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG && > + !pmd_trans_splitting(*pmd)); > + ret = pmd; > + } > +out: > + return ret; > +} > + > +static int __split_huge_page_splitting(struct page *page, > + struct vm_area_struct *vma, > + unsigned long address) > +{ > + struct mm_struct *mm = vma->vm_mm; > + pmd_t *pmd; > + int ret = 0; > + > + spin_lock(&mm->page_table_lock); > + pmd = page_check_address_pmd(page, mm, address, > + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG); > + if (pmd) { > + /* > + * We can't temporarily set the pmd to null in order > + * to split it, pmd_huge must remain on at all times. > + */ Why, to avoid a double fault? Or to avoid a case where the huge page is being split, another fault occurs and zero-filled pages get faulted in? I'm afraid I ran out of time at this point. It'll be after the holidays before I get time for a proper go at it. Sorry. 
> + pmdp_splitting_flush_notify(vma, address, pmd); > + ret = 1; > + } > + spin_unlock(&mm->page_table_lock); > + > + return ret; > +} > + > +static void __split_huge_page_refcount(struct page *page) > +{ > + int i; > + unsigned long head_index = page->index; > + > + compound_lock(page); > + > + for (i = 1; i < HPAGE_NR; i++) { > + struct page *page_tail = page + i; > + > + /* tail_page->_count cannot change */ > + atomic_sub(atomic_read(&page_tail->_count), &page->_count); > + BUG_ON(page_count(page) <= 0); > + atomic_add(page_mapcount(page) + 1, &page_tail->_count); > + BUG_ON(atomic_read(&page_tail->_count) <= 0); > + > + /* after clearing PageTail the gup refcount can be released */ > + smp_mb(); > + > + page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; > + page_tail->flags |= (page->flags & > + ((1L << PG_referenced) | > + (1L << PG_swapbacked) | > + (1L << PG_mlocked) | > + (1L << PG_uptodate))); > + page_tail->flags |= (1L << PG_dirty); > + > + /* > + * 1) clear PageTail before overwriting first_page > + * 2) clear PageTail before clearing PageHead for VM_BUG_ON > + */ > + smp_wmb(); > + > + BUG_ON(page_mapcount(page_tail)); > + page_tail->_mapcount = page->_mapcount; > + BUG_ON(page_tail->mapping); > + page_tail->mapping = page->mapping; > + page_tail->index = ++head_index; > + BUG_ON(!PageAnon(page_tail)); > + BUG_ON(!PageUptodate(page_tail)); > + BUG_ON(!PageDirty(page_tail)); > + BUG_ON(!PageSwapBacked(page_tail)); > + > + if (page_evictable(page_tail, NULL)) > + lru_cache_add_lru(page_tail, LRU_ACTIVE_ANON); > + else > + add_page_to_unevictable_list(page_tail); > + put_page(page_tail); > + } > + > + ClearPageCompound(page); > + compound_unlock(page); > +} > + > +static int __split_huge_page_map(struct page *page, > + struct vm_area_struct *vma, > + unsigned long address) > +{ > + struct mm_struct *mm = vma->vm_mm; > + pmd_t *pmd, _pmd; > + int ret = 0, i; > + pgtable_t pgtable; > + unsigned long haddr; > + > + spin_lock(&mm->page_table_lock); > + pmd = page_check_address_pmd(page, mm, address, > + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); > + if (pmd) { > + pgtable = get_pmd_huge_pte(mm); > + pmd_populate(mm, &_pmd, pgtable); > + > + for (i = 0, haddr = address; i < HPAGE_NR; > + i++, haddr += PAGE_SIZE) { > + pte_t *pte, entry; > + entry = mk_pte(page + i, vma->vm_page_prot); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > + if (!pmd_write(*pmd)) > + entry = pte_wrprotect(entry); > + else > + BUG_ON(page_mapcount(page) != 1); > + if (!pmd_young(*pmd)) > + entry = pte_mkold(entry); > + pte = pte_offset_map(&_pmd, haddr); > + BUG_ON(!pte_none(*pte)); > + set_pte_at(mm, haddr, pte, entry); > + pte_unmap(pte); > + } > + > + mm->nr_ptes++; > + smp_wmb(); /* make pte visible before pmd */ > + pmd_populate(mm, pmd, pgtable); > + ret = 1; > + } > + spin_unlock(&mm->page_table_lock); > + > + return ret; > +} > + > +/* must be called with anon_vma->lock hold */ > +static void __split_huge_page(struct page *page, > + struct anon_vma *anon_vma) > +{ > + int mapcount, mapcount2; > + struct vm_area_struct *vma; > + > + BUG_ON(!PageHead(page)); > + > + mapcount = 0; > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { > + unsigned long addr = vma_address(page, vma); > + if (addr == -EFAULT) > + continue; > + mapcount += __split_huge_page_splitting(page, vma, addr); > + } > + BUG_ON(mapcount != page_mapcount(page)); > + > + __split_huge_page_refcount(page); > + > + mapcount2 = 0; > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { > + unsigned long addr = 
vma_address(page, vma); > + if (addr == -EFAULT) > + continue; > + mapcount2 += __split_huge_page_map(page, vma, addr); > + } > + BUG_ON(mapcount != mapcount2); > +} > + > +/* must run with mmap_sem to prevent vma to go away */ > +void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd) > +{ > + struct page *page; > + struct anon_vma *anon_vma; > + struct mm_struct *mm; > + > + BUG_ON(vma->vm_flags & VM_HUGETLB); > + > + mm = vma->vm_mm; > + BUG_ON(down_write_trylock(&mm->mmap_sem)); > + > + anon_vma = vma->anon_vma; > + > + spin_lock(&anon_vma->lock); > + BUG_ON(pmd_trans_splitting(*pmd)); > + spin_lock(&mm->page_table_lock); > + if (unlikely(!pmd_trans_huge(*pmd))) { > + spin_unlock(&mm->page_table_lock); > + spin_unlock(&anon_vma->lock); > + return; > + } > + page = pmd_pgtable(*pmd); > + spin_unlock(&mm->page_table_lock); > + > + __split_huge_page(page, anon_vma); > + > + spin_unlock(&anon_vma->lock); > + BUG_ON(pmd_trans_huge(*pmd)); > +} > + > +/* must run with mmap_sem to prevent vma to go away */ > +void __split_huge_page_mm(struct mm_struct *mm, > + unsigned long address, > + pmd_t *pmd) > +{ > + struct vm_area_struct *vma; > + > + vma = find_vma(mm, address); > + BUG_ON(vma->vm_start > address); > + BUG_ON(vma->vm_mm != mm); > + > + __split_huge_page_vma(vma, pmd); > +} > + > +int split_huge_page(struct page *page) > +{ > + struct anon_vma *anon_vma; > + int ret = 1; > + > + BUG_ON(!PageAnon(page)); > + anon_vma = page_lock_anon_vma(page); > + if (!anon_vma) > + goto out; > + ret = 0; > + if (!PageCompound(page)) > + goto out_unlock; > + > + BUG_ON(!PageSwapBacked(page)); > + __split_huge_page(page, anon_vma); > + > + BUG_ON(PageCompound(page)); > +out_unlock: > + page_unlock_anon_vma(anon_vma); > +out: > + return ret; > +} > diff --git a/mm/memory.c b/mm/memory.c > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -647,9 +647,9 @@ out_set_pte: > return 0; > } > > -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, > - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, > - unsigned long addr, unsigned long end) > +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, > + pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, > + unsigned long addr, unsigned long end) > { > pte_t *orig_src_pte, *orig_dst_pte; > pte_t *src_pte, *dst_pte; > @@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct > src_pmd = pmd_offset(src_pud, addr); > do { > next = pmd_addr_end(addr, end); > + if (pmd_trans_huge(*src_pmd)) { > + int err; > + err = copy_huge_pmd(dst_mm, src_mm, > + dst_pmd, src_pmd, addr, vma); > + if (err == -ENOMEM) > + return -ENOMEM; > + if (!err) > + continue; > + /* fall through */ > + } > if (pmd_none_or_clear_bad(src_pmd)) > continue; > if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, > @@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang > pmd = pmd_offset(pud, addr); > do { > next = pmd_addr_end(addr, end); > + if (pmd_trans_huge(*pmd)) { > + if (next-addr != HPAGE_SIZE) > + split_huge_page_vma(vma, pmd); > + else if (zap_pmd_trans_huge(tlb, vma, pmd)) { > + (*zap_work)--; > + continue; > + } > + /* fall through */ > + } > if (pmd_none_or_clear_bad(pmd)) { > (*zap_work)--; > continue; > @@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_ > pmd = pmd_offset(pud, address); > if (pmd_none(*pmd)) > goto no_page_table; > - if (pmd_huge(*pmd)) { > + if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) { > BUG_ON(flags & FOLL_GET); > page = follow_huge_pmd(mm, address, pmd, flags & 
FOLL_WRITE); > goto out; > } > + if (pmd_trans_huge(*pmd)) { > + spin_lock(&mm->page_table_lock); > + if (likely(pmd_trans_huge(*pmd))) { > + if (unlikely(pmd_trans_splitting(*pmd))) { > + spin_unlock(&mm->page_table_lock); > + wait_split_huge_page(vma->anon_vma, pmd); > + } else { > + page = follow_trans_huge_pmd(mm, address, > + pmd, flags); > + spin_unlock(&mm->page_table_lock); > + goto out; > + } > + } else > + spin_unlock(&mm->page_table_lock); > + /* fall through */ > + } > if (unlikely(pmd_bad(*pmd))) > goto no_page_table; > > @@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct > pmd = pmd_offset(pud, pg); > if (pmd_none(*pmd)) > return i ? : -EFAULT; > + VM_BUG_ON(pmd_trans_huge(*pmd)); > pte = pte_offset_map(pmd, pg); > if (pte_none(*pte)) { > pte_unmap(pte); > @@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_ > * but allow concurrent faults), and pte mapped but not yet locked. > * We return with mmap_sem still held, but pte unmapped and unlocked. > */ > -static inline int handle_pte_fault(struct mm_struct *mm, > - struct vm_area_struct *vma, unsigned long address, > - pte_t *pte, pmd_t *pmd, unsigned int flags) > +int handle_pte_fault(struct mm_struct *mm, > + struct vm_area_struct *vma, unsigned long address, > + pte_t *pte, pmd_t *pmd, unsigned int flags) > { > pte_t entry; > spinlock_t *ptl; > @@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm > pmd = pmd_alloc(mm, pud, address); > if (!pmd) > return VM_FAULT_OOM; > + if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) { > + if (!vma->vm_ops) > + return do_huge_anonymous_page(mm, vma, address, > + pmd, flags); > + } else { > + pmd_t orig_pmd = *pmd; > + barrier(); > + if (pmd_trans_huge(orig_pmd)) { > + if (flags & FAULT_FLAG_WRITE && > + !pmd_write(orig_pmd) && > + !pmd_trans_splitting(orig_pmd)) > + return do_huge_wp_page(mm, vma, address, > + pmd, orig_pmd); > + return 0; > + } > + } > pte = pte_alloc_map(mm, vma, pmd, address); > if (!pte) > return VM_FAULT_OOM; > @@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct * > goto out; > > pmd = pmd_offset(pud, address); > + VM_BUG_ON(pmd_trans_huge(*pmd)); > if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) > goto out; > > diff --git a/mm/rmap.c b/mm/rmap.c > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -56,6 +56,7 @@ > #include > #include > #include > +#include > > #include > > @@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm > * Returns virtual address or -EFAULT if page's index/offset is not > * within the range mapped the @vma. > */ > -static inline unsigned long > +inline unsigned long > vma_address(struct page *page, struct vm_area_struct *vma) > { > pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); > @@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag > unsigned long *vm_flags) > { > struct mm_struct *mm = vma->vm_mm; > - pte_t *pte; > - spinlock_t *ptl; > int referenced = 0; > > - pte = page_check_address(page, mm, address, &ptl, 0); > - if (!pte) > - goto out; > - > /* > * Don't want to elevate referenced for mlocked page that gets this far, > * in order that it progresses to try_to_unmap and is moved to the > * unevictable list. > */ > if (vma->vm_flags & VM_LOCKED) { > - *mapcount = 1; /* break early from loop */ > + *mapcount = 0; /* break early from loop */ > *vm_flags |= VM_LOCKED; > - goto out_unmap; > - } > - > - if (ptep_clear_flush_young_notify(vma, address, pte)) { > - /* > - * Don't treat a reference through a sequentially read > - * mapping as such. 
If the page has been used in > - * another mapping, we will catch it; if this other > - * mapping is already gone, the unmap path will have > - * set PG_referenced or activated the page. > - */ > - if (likely(!VM_SequentialReadHint(vma))) > - referenced++; > + goto out; > } > > /* Pretend the page is referenced if the task has the > @@ -380,9 +363,43 @@ int page_referenced_one(struct page *pag > rwsem_is_locked(&mm->mmap_sem)) > referenced++; > > -out_unmap: > + if (unlikely(PageCompound(page))) { > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + pmd_t *pmd; > + > + spin_lock(&mm->page_table_lock); > + pmd = page_check_address_pmd(page, mm, address, > + PAGE_CHECK_ADDRESS_PMD_FLAG); > + if (pmd && !pmd_trans_splitting(*pmd) && > + pmdp_clear_flush_young_notify(vma, address, pmd)) > + referenced++; > + spin_unlock(&mm->page_table_lock); > +#else /* CONFIG_TRANSPARENT_HUGEPAGE */ > + VM_BUG_ON(1); > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + } else { > + pte_t *pte; > + spinlock_t *ptl; > + > + pte = page_check_address(page, mm, address, &ptl, 0); > + if (!pte) > + goto out; > + > + if (ptep_clear_flush_young_notify(vma, address, pte)) { > + /* > + * Don't treat a reference through a sequentially read > + * mapping as such. If the page has been used in > + * another mapping, we will catch it; if this other > + * mapping is already gone, the unmap path will have > + * set PG_referenced or activated the page. > + */ > + if (likely(!VM_SequentialReadHint(vma))) > + referenced++; > + } > + pte_unmap_unlock(pte, ptl); > + } > + > (*mapcount)--; > - pte_unmap_unlock(pte, ptl); > > if (referenced) > *vm_flags |= vma->vm_flags; > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org