* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
@ 2003-01-31 23:13 ` David S. Miller
2003-01-31 23:36 ` Andrew Morton
2003-01-31 23:16 ` Andrew Morton
` (12 subsequent siblings)
13 siblings, 1 reply; 48+ messages in thread
From: David S. Miller @ 2003-01-31 23:13 UTC (permalink / raw)
To: akpm; +Cc: rohit.seth, davidm, anton, wli, linux-mm
- need to implement either hugepage_vma()/follow_huge_addr() or
pmd_huge()/follow_huge_pmd(), depending on whether a page's hugeness can be
determined via pmd inspection. Implementations of both schemes for ia32
are here.
Remind me why we can't just look at the PTE? Why can't
we end up doing something like:
	if (!pmd_is_huge(pmd)) {
		ptep = ...;
		if (pte_is_huge(*ptep)) {
		}
	}
That is what all these systems besides x86 and PPC-BAT are doing. I
don't see a real requirement for a full VMA lookup in these cases.
The page tables fully describe whether we have huge stuff here or not.
* Re: hugepage patches
2003-01-31 23:13 ` David S. Miller
@ 2003-01-31 23:36 ` Andrew Morton
2003-01-31 23:23 ` David S. Miller
0 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-01-31 23:36 UTC (permalink / raw)
To: David S. Miller; +Cc: rohit.seth, davidm, anton, wli, linux-mm
"David S. Miller" <davem@redhat.com> wrote:
>
> Remind me why we can't just look at the PTE?
Diktat ;)
Linus Torvalds <torvalds@transmeta.com> wrote:
>
> ...
> Your big-page approach makes the assumption that I refuse to make - namely
> that the "big page" is somehow attached to the page tables, and to the pmd
> in particular.
>
> On many architectures, big pages are totally independent of the smaller
> pages, and don't necessarily have any of the x86 alignment/size
> restrictions.
>
> While on an x86, a big page is always the size of a PMD, on a ppc it can
> be any power-of-two size and alignment from 128kB to 256MB. And fixing
> that to a pmd boundary just doesn't work. They have other restrictions
> instead: they are mapped by the "BAT array", and there are 8 of those (and
> I think Linux/PPC uses a few of them for the kernel itself).
>
> So a portable big-page approach must _not_ tie the big pages to the page
> tables. I don't like big pages particularly, but if I add big page support
> to the kernel I want to at least do it in such a way that other people
> than just Intel can use it.
>
> Portability means that
> - the architecture must be able to set its large pages totally
> independently of the page tables.
> - the architecture may have other non-size-related limits on the large
> page areas, like "only 6 large page areas can be allocated per VM"
>
> and quite frankly, anything that goes in and mucks with the VM deeply is
> bound to fail, I think. The patch that Intel made (with some input from
> me) and which I attached to the previous email does this, and has almost
> zero impact on the "normal" MM code.
>
> Linus
* Re: hugepage patches
2003-01-31 23:36 ` Andrew Morton
@ 2003-01-31 23:23 ` David S. Miller
2003-01-31 23:45 ` Andrew Morton
0 siblings, 1 reply; 48+ messages in thread
From: David S. Miller @ 2003-01-31 23:23 UTC (permalink / raw)
To: akpm; +Cc: rohit.seth, davidm, anton, wli, linux-mm
"David S. Miller" <davem@redhat.com> wrote:
>
> Remind me why we can't just look at the PTE?
Diktat ;)
I understand, but give _ME_ a way to use the pagetables if
that is how things are implemented. Don't force me to do
a VMA lookup if I need not.
* Re: hugepage patches
2003-01-31 23:23 ` David S. Miller
@ 2003-01-31 23:45 ` Andrew Morton
2003-01-31 23:48 ` David S. Miller
0 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-01-31 23:45 UTC (permalink / raw)
To: David S. Miller; +Cc: rohit.seth, davidm, anton, wli, linux-mm
"David S. Miller" <davem@redhat.com> wrote:
>
> From: Andrew Morton <akpm@digeo.com>
> Date: Fri, 31 Jan 2003 15:36:26 -0800
>
> "David S. Miller" <davem@redhat.com> wrote:
> >
> > Remind me why we can't just look at the PTE?
>
> Diktat ;)
>
> I understand, but give _ME_ a way to use the pagetables if
> that is how things are implemented. Don't force me to do
> a VMA lookup if I need not.
I did? pmd_huge()/follow_huge_pmd(). Patch 2/4.
It might not be 100% appropriate for the sparc64 pagetable representation - I
just guessed...
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
2003-01-31 23:13 ` David S. Miller
@ 2003-01-31 23:16 ` Andrew Morton
2003-01-31 23:17 ` Andrew Morton
` (11 subsequent siblings)
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-01-31 23:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: davem, rohit.seth, davidm, anton, wli, linux-mm
1/4
Using a futex in a large page causes a kernel lockup in __pin_page() -
because __pin_page's page revalidation uses follow_page(), and follow_page()
doesn't work for hugepages.
The patch fixes up follow_page() to return the appropriate 4k page for
hugepages.
This incurs a vma lookup for each follow_page(), which is considerable
overhead in some situations. We only _need_ to do this if the architecture
cannot determine a page's hugeness from the contents of the PMD.
So this patch is a "reference" implementation for, say, PPC BAT-based
hugepages.
arch/i386/mm/hugetlbpage.c | 29 +++++++++++++++++++++++++++++
include/linux/hugetlb.h | 18 ++++++++++++++++--
include/linux/sched.h | 4 +++-
mm/memory.c | 5 +++++
mm/mmap.c | 2 +-
linux/mm.h | 0
6 files changed, 54 insertions(+), 4 deletions(-)
diff -puN mm/memory.c~pin_page-fix mm/memory.c
--- 25/mm/memory.c~pin_page-fix Fri Jan 31 13:32:13 2003
+++ 25-akpm/mm/memory.c Fri Jan 31 14:29:59 2003
@@ -607,6 +607,11 @@ follow_page(struct mm_struct *mm, unsign
pmd_t *pmd;
pte_t *ptep, pte;
unsigned long pfn;
+ struct vm_area_struct *vma;
+
+ vma = hugepage_vma(mm, address);
+ if (vma)
+ return follow_huge_addr(mm, vma, address, write);
pgd = pgd_offset(mm, address);
if (pgd_none(*pgd) || pgd_bad(*pgd))
diff -puN include/linux/hugetlb.h~pin_page-fix include/linux/hugetlb.h
--- 25/include/linux/hugetlb.h~pin_page-fix Fri Jan 31 13:32:13 2003
+++ 25-akpm/include/linux/hugetlb.h Fri Jan 31 14:29:59 2003
@@ -20,16 +20,28 @@ int hugetlb_prefault(struct address_spac
void huge_page_release(struct page *);
int hugetlb_report_meminfo(char *);
int is_hugepage_mem_enough(size_t);
-
+struct page *follow_huge_addr(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write);
+struct vm_area_struct *hugepage_vma(struct mm_struct *mm,
+ unsigned long address);
extern int htlbpage_max;
+static inline void
+mark_mm_hugetlb(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ if (is_vm_hugetlb_page(vma))
+ mm->used_hugetlb = 1;
+}
+
#else /* !CONFIG_HUGETLB_PAGE */
+
static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
{
return 0;
}
-#define follow_hugetlb_page(m,v,p,vs,a,b,i) ({ BUG(); 0; })
+#define follow_hugetlb_page(m,v,p,vs,a,b,i) ({ BUG(); 0; })
+#define follow_huge_addr(mm, vma, addr, write) 0
#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
#define zap_hugepage_range(vma, start, len) BUG()
@@ -37,6 +49,8 @@ static inline int is_vm_hugetlb_page(str
#define huge_page_release(page) BUG()
#define is_hugepage_mem_enough(size) 0
#define hugetlb_report_meminfo(buf) 0
+#define hugepage_vma(mm, addr) 0
+#define mark_mm_hugetlb(mm, vma) do { } while (0)
#endif /* !CONFIG_HUGETLB_PAGE */
diff -puN arch/i386/mm/hugetlbpage.c~pin_page-fix arch/i386/mm/hugetlbpage.c
--- 25/arch/i386/mm/hugetlbpage.c~pin_page-fix Fri Jan 31 13:32:13 2003
+++ 25-akpm/arch/i386/mm/hugetlbpage.c Fri Jan 31 14:29:59 2003
@@ -150,6 +150,35 @@ back1:
return i;
}
+struct page *
+follow_huge_addr(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address, int write)
+{
+ unsigned long start = address;
+ int length = 1;
+ int nr;
+ struct page *page;
+
+ nr = follow_hugetlb_page(mm, vma, &page, NULL, &start, &length, 0);
+ if (nr == 1)
+ return page;
+ return NULL;
+}
+
+/*
+ * If virtual address `addr' lies within a huge page, return its controlling
+ * VMA, else NULL.
+ */
+struct vm_area_struct *hugepage_vma(struct mm_struct *mm, unsigned long addr)
+{
+ if (mm->used_hugetlb) {
+ struct vm_area_struct *vma = find_vma(mm, addr);
+ if (vma && is_vm_hugetlb_page(vma))
+ return vma;
+ }
+ return NULL;
+}
+
void free_huge_page(struct page *page)
{
BUG_ON(page_count(page));
diff -puN mm/mmap.c~pin_page-fix mm/mmap.c
--- 25/mm/mmap.c~pin_page-fix Fri Jan 31 13:32:13 2003
+++ 25-akpm/mm/mmap.c Fri Jan 31 13:32:13 2003
@@ -362,6 +362,7 @@ static void vma_link(struct mm_struct *m
if (mapping)
up(&mapping->i_shared_sem);
+ mark_mm_hugetlb(mm, vma);
mm->map_count++;
validate_mm(mm);
}
@@ -1427,7 +1428,6 @@ void exit_mmap(struct mm_struct *mm)
kmem_cache_free(vm_area_cachep, vma);
vma = next;
}
-
}
/* Insert vm structure into process list sorted by address
diff -puN include/linux/mm.h~pin_page-fix include/linux/mm.h
diff -puN include/linux/sched.h~pin_page-fix include/linux/sched.h
--- 25/include/linux/sched.h~pin_page-fix Fri Jan 31 13:32:13 2003
+++ 25-akpm/include/linux/sched.h Fri Jan 31 13:32:13 2003
@@ -203,7 +203,9 @@ struct mm_struct {
unsigned long swap_address;
unsigned dumpable:1;
-
+#ifdef CONFIG_HUGETLB_PAGE
+ int used_hugetlb;
+#endif
/* Architecture-specific MM context */
mm_context_t context;
_
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
2003-01-31 23:13 ` David S. Miller
2003-01-31 23:16 ` Andrew Morton
@ 2003-01-31 23:17 ` Andrew Morton
2003-01-31 23:18 ` Andrew Morton
` (10 subsequent siblings)
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-01-31 23:17 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
2/4
ia32 and others can determine a page's hugeness by inspecting the pmd's value
directly. No need to perform a VMA lookup against the user's virtual
address.
This patch ifdef's away the VMA-based implementation of
hugepage-aware-follow_page for ia32 and replaces it with a pmd-based
implementation.
The intent is that each architecture will implement one scheme or the other. So the architecture either:
1: Implements hugepage_vma()/follow_huge_addr(), and stubs out
pmd_huge()/follow_huge_pmd(), or
2: Implements pmd_huge()/follow_huge_pmd(), and stubs out
hugepage_vma()/follow_huge_addr()
arch/i386/mm/hugetlbpage.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
include/asm-i386/pgtable.h | 5 +++++
include/linux/hugetlb.h | 3 +++
mm/memory.c | 6 +++++-
4 files changed, 58 insertions(+), 1 deletion(-)
diff -puN mm/memory.c~pin_page-pmd mm/memory.c
--- 25/mm/memory.c~pin_page-pmd Fri Jan 31 14:30:01 2003
+++ 25-akpm/mm/memory.c Fri Jan 31 14:30:01 2003
@@ -618,7 +618,11 @@ follow_page(struct mm_struct *mm, unsign
goto out;
pmd = pmd_offset(pgd, address);
- if (pmd_none(*pmd) || pmd_bad(*pmd))
+ if (pmd_none(*pmd))
+ goto out;
+ if (pmd_huge(*pmd))
+ return follow_huge_pmd(mm, address, pmd, write);
+ if (pmd_bad(*pmd))
goto out;
ptep = pte_offset_map(pmd, address);
diff -puN include/linux/hugetlb.h~pin_page-pmd include/linux/hugetlb.h
--- 25/include/linux/hugetlb.h~pin_page-pmd Fri Jan 31 14:30:01 2003
+++ 25-akpm/include/linux/hugetlb.h Fri Jan 31 14:30:01 2003
@@ -24,6 +24,8 @@ struct page *follow_huge_addr(struct mm_
unsigned long address, int write);
struct vm_area_struct *hugepage_vma(struct mm_struct *mm,
unsigned long address);
+struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd, int write);
extern int htlbpage_max;
static inline void
@@ -51,6 +53,7 @@ static inline int is_vm_hugetlb_page(str
#define hugetlb_report_meminfo(buf) 0
#define hugepage_vma(mm, addr) 0
#define mark_mm_hugetlb(mm, vma) do { } while (0)
+#define follow_huge_pmd(mm, addr, pmd, write) 0
#endif /* !CONFIG_HUGETLB_PAGE */
diff -puN include/asm-i386/pgtable.h~pin_page-pmd include/asm-i386/pgtable.h
--- 25/include/asm-i386/pgtable.h~pin_page-pmd Fri Jan 31 14:30:01 2003
+++ 25-akpm/include/asm-i386/pgtable.h Fri Jan 31 14:30:01 2003
@@ -177,6 +177,11 @@ extern unsigned long pg0[1024];
#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
+#ifdef CONFIG_HUGETLB_PAGE
+int pmd_huge(pmd_t pmd);
+#else
+#define pmd_huge(x) 0
+#endif
#define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))
diff -puN arch/i386/mm/hugetlbpage.c~pin_page-pmd arch/i386/mm/hugetlbpage.c
--- 25/arch/i386/mm/hugetlbpage.c~pin_page-pmd Fri Jan 31 14:30:01 2003
+++ 25-akpm/arch/i386/mm/hugetlbpage.c Fri Jan 31 14:30:01 2003
@@ -150,6 +150,7 @@ back1:
return i;
}
+#if 0 /* This is just for testing */
struct page *
follow_huge_addr(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address, int write)
@@ -179,6 +180,50 @@ struct vm_area_struct *hugepage_vma(stru
return NULL;
}
+int pmd_huge(pmd_t pmd)
+{
+ return 0;
+}
+
+struct page *
+follow_huge_pmd(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd, int write)
+{
+ return NULL;
+}
+
+#else
+
+struct page *
+follow_huge_addr(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address, int write)
+{
+ return NULL;
+}
+
+struct vm_area_struct *hugepage_vma(struct mm_struct *mm, unsigned long addr)
+{
+ return NULL;
+}
+
+int pmd_huge(pmd_t pmd)
+{
+ return !!(pmd_val(pmd) & _PAGE_PSE);
+}
+
+struct page *
+follow_huge_pmd(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd, int write)
+{
+ struct page *page;
+
+ page = pte_page(*(pte_t *)pmd);
+ if (page)
+ page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
+ return page;
+}
+#endif
+
void free_huge_page(struct page *page)
{
BUG_ON(page_count(page));
_
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (2 preceding siblings ...)
2003-01-31 23:17 ` Andrew Morton
@ 2003-01-31 23:18 ` Andrew Morton
2003-01-31 23:18 ` Andrew Morton
` (9 subsequent siblings)
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-01-31 23:18 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
3/4
We currently have a problem when things like ptrace, futexes and direct-io
try to pin user pages. If the user's address is in a huge page, we're
elevating the refcount of a constituent 4k page, not the head page of the
high-order allocation unit.
To solve this, a generic way of handling higher-order pages has been
implemented:
- A higher-order page is called a "compound page". This name was chosen
because "huge page", "large page", "super page", etc. all seem to mean
different things to different people.
- The first (controlling) 4k page of a compound page is referred to as the
"head" page.
- The remaining pages are tail pages.
All pages have PG_compound set. All pages have their lru.next pointing at
the head page (even the head page has this).
The head page's lru.prev, if non-zero, holds the address of the compound
page's put_page() function.
The order of the allocation is stored in the first tail page's lru.prev.
This is only for debug at present. This usage means that zero-order pages
may not be compound.
The above relationships are established for _all_ higher-order pages in the
page allocator. This has some cost, but not much - mainly another atomic op
during fork().
This functionality is only enabled if CONFIG_HUGETLB_PAGE, although it could
be turned on permanently. There's a little extra cost in get_page/put_page.
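For illustration, here is a minimal sketch (not part of the patches; the
helper names are hypothetical) of how a caller could use these relationships:

static struct page *compound_head_of(struct page *page)
{
	if (!PageCompound(page))
		return page;
	/* every constituent page's lru.next points at the head page */
	return (struct page *)page->lru.next;
}

static int compound_order_of(struct page *head)
{
	/* debug-only in this patch: order lives in the first tail page */
	return (int)(unsigned long)head[1].lru.prev;
}

static void compound_put(struct page *head)
{
	/* head->lru.prev, if set, is the compound page's destructor */
	void (*dtor)(struct page *) = (void (*)(struct page *))head->lru.prev;

	if (dtor)
		dtor(head);	/* e.g. free_huge_page() for hugetlb pages */
}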
linux/mm.h | 35 ++++++++++++++++++++++++++--
linux/page-flags.h | 7 ++++-
page_alloc.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 105 insertions(+), 3 deletions(-)
diff -puN include/linux/page-flags.h~compound-pages include/linux/page-flags.h
--- 25/include/linux/page-flags.h~compound-pages 2003-01-30 23:43:18.000000000 -0800
+++ 25-akpm/include/linux/page-flags.h 2003-01-30 23:43:18.000000000 -0800
@@ -72,7 +72,8 @@
#define PG_direct 16 /* ->pte_chain points directly at pte */
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
-#define PG_reclaim 18 /* To be recalimed asap */
+#define PG_reclaim 18 /* To be reclaimed asap */
+#define PG_compound 19 /* Part of a compound page */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -251,6 +252,10 @@ extern void get_full_page_state(struct p
#define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags)
#define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
+#define PageCompound(page) test_bit(PG_compound, &(page)->flags)
+#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
+#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)
+
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
diff -puN mm/page_alloc.c~compound-pages mm/page_alloc.c
--- 25/mm/page_alloc.c~compound-pages 2003-01-30 23:43:18.000000000 -0800
+++ 25-akpm/mm/page_alloc.c 2003-01-31 01:47:02.000000000 -0800
@@ -85,6 +85,62 @@ static void bad_page(const char *functio
page->mapping = NULL;
}
+#ifndef CONFIG_HUGETLB_PAGE
+#define prep_compound_page(page, order) do { } while (0)
+#define destroy_compound_page(page, order) do { } while (0)
+#else
+/*
+ * Higher-order pages are called "compound pages". They are structured thusly:
+ *
+ * The first PAGE_SIZE page is called the "head page".
+ *
+ * The remaining PAGE_SIZE pages are called "tail pages".
+ *
+ * All pages have PG_compound set. All pages have their lru.next pointing at
+ * the head page (even the head page has this).
+ *
+ * The head page's lru.prev, if non-zero, holds the address of the compound
+ * page's put_page() function.
+ *
+ * The order of the allocation is stored in the first tail page's lru.prev.
+ * This is only for debug at present. This usage means that zero-order pages
+ * may not be compound.
+ */
+static void prep_compound_page(struct page *page, int order)
+{
+ int i;
+ int nr_pages = 1 << order;
+
+ page->lru.prev = NULL;
+ page[1].lru.prev = (void *)order;
+ for (i = 0; i < nr_pages; i++) {
+ struct page *p = page + i;
+
+ SetPageCompound(p);
+ p->lru.next = (void *)page;
+ }
+}
+
+static void destroy_compound_page(struct page *page, int order)
+{
+ int i;
+ int nr_pages = 1 << order;
+
+ if (page[1].lru.prev != (void *)order)
+ bad_page(__FUNCTION__, page);
+
+ for (i = 0; i < nr_pages; i++) {
+ struct page *p = page + i;
+
+ if (!PageCompound(p))
+ bad_page(__FUNCTION__, page);
+ if (p->lru.next != (void *)page)
+ bad_page(__FUNCTION__, page);
+ ClearPageCompound(p);
+ }
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
/*
* Freeing function for a buddy system allocator.
*
@@ -114,6 +170,8 @@ static inline void __free_pages_bulk (st
{
unsigned long page_idx, index;
+ if (order)
+ destroy_compound_page(page, order);
page_idx = page - base;
if (page_idx & ~mask)
BUG();
@@ -409,6 +467,12 @@ void free_cold_page(struct page *page)
free_hot_cold_page(page, 1);
}
+/*
+ * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
+ * we cheat by calling it from here, in the order > 0 path. Saves a branch
+ * or two.
+ */
+
static struct page *buffered_rmqueue(struct zone *zone, int order, int cold)
{
unsigned long flags;
@@ -435,6 +499,8 @@ static struct page *buffered_rmqueue(str
spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order);
spin_unlock_irqrestore(&zone->lock, flags);
+ if (order && page)
+ prep_compound_page(page, order);
}
if (page != NULL) {
diff -puN include/linux/mm.h~compound-pages include/linux/mm.h
--- 25/include/linux/mm.h~compound-pages 2003-01-30 23:43:18.000000000 -0800
+++ 25-akpm/include/linux/mm.h 2003-01-30 23:43:18.000000000 -0800
@@ -208,24 +208,55 @@ struct page {
* Also, many kernel routines increase the page count before a critical
* routine so they can be sure the page doesn't go away from under them.
*/
-#define get_page(p) atomic_inc(&(p)->count)
-#define __put_page(p) atomic_dec(&(p)->count)
#define put_page_testzero(p) \
({ \
BUG_ON(page_count(page) == 0); \
atomic_dec_and_test(&(p)->count); \
})
+
#define page_count(p) atomic_read(&(p)->count)
#define set_page_count(p,v) atomic_set(&(p)->count, v)
+#define __put_page(p) atomic_dec(&(p)->count)
extern void FASTCALL(__page_cache_release(struct page *));
+#ifdef CONFIG_HUGETLB_PAGE
+
+static inline void get_page(struct page *page)
+{
+ if (PageCompound(page))
+ page = (struct page *)page->lru.next;
+ atomic_inc(&page->count);
+}
+
static inline void put_page(struct page *page)
{
+ if (PageCompound(page)) {
+ page = (struct page *)page->lru.next;
+ if (page->lru.prev) { /* destructor? */
+ (*(void (*)(struct page *))page->lru.prev)(page);
+ return;
+ }
+ }
if (!PageReserved(page) && put_page_testzero(page))
__page_cache_release(page);
}
+#else /* CONFIG_HUGETLB_PAGE */
+
+static inline void get_page(struct page *page)
+{
+ atomic_inc(&page->count);
+}
+
+static inline void put_page(struct page *page)
+{
+ if (!PageReserved(page) && put_page_testzero(page))
+ __page_cache_release(page);
+}
+
+#endif /* CONFIG_HUGETLB_PAGE */
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
_
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (3 preceding siblings ...)
2003-01-31 23:18 ` Andrew Morton
@ 2003-01-31 23:18 ` Andrew Morton
2003-02-01 8:58 ` Ingo Oeser
2003-02-02 10:55 ` Andrew Morton
` (8 subsequent siblings)
13 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-01-31 23:18 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
4/4
The odd thing about hugetlb is that it maintains its own freelist of pages.
And it has to do that, else it would trivially run out of pages due to buddy
fragmentation.
So we don't want callers of put_page() to be passing those pages
to __free_pages_ok() on the final put().
So hugetlb installs a destructor in the compound pages to point at
free_huge_page(), which knows how to put these pages back onto the free list.
Also, don't mark hugepages as all PageReserved any more. That's preventing
callers from doing proper refcounting. Any code which does a user pagetable
walk and hits part of a hugepage will now handle it transparently.
arch/i386/mm/hugetlbpage.c | 22 ++++++++++------------
arch/ia64/mm/hugetlbpage.c | 8 ++------
arch/sparc64/mm/hugetlbpage.c | 7 +------
3 files changed, 13 insertions(+), 24 deletions(-)
diff -puN arch/i386/mm/hugetlbpage.c~compound-pages-hugetlb arch/i386/mm/hugetlbpage.c
--- 25/arch/i386/mm/hugetlbpage.c~compound-pages-hugetlb Fri Jan 31 14:34:55 2003
+++ 25-akpm/arch/i386/mm/hugetlbpage.c Fri Jan 31 14:35:16 2003
@@ -46,6 +46,7 @@ static struct page *alloc_hugetlb_page(v
htlbpagemem--;
spin_unlock(&htlbpage_lock);
set_page_count(page, 1);
+ page->lru.prev = (void *)huge_page_release;
for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
clear_highpage(&page[i]);
return page;
@@ -134,6 +135,7 @@ back1:
page = pte_page(pte);
if (pages) {
page += ((start & ~HPAGE_MASK) >> PAGE_SHIFT);
+ get_page(page);
pages[i] = page;
}
if (vmas)
@@ -218,8 +220,10 @@ follow_huge_pmd(struct mm_struct *mm, un
struct page *page;
page = pte_page(*(pte_t *)pmd);
- if (page)
+ if (page) {
page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
+ get_page(page);
+ }
return page;
}
#endif
@@ -372,8 +376,8 @@ int try_to_free_low(int count)
int set_hugetlb_mem_size(int count)
{
- int j, lcount;
- struct page *page, *map;
+ int lcount;
+ struct page *page;
extern long htlbzone_pages;
extern struct list_head htlbpage_freelist;
@@ -389,11 +393,6 @@ int set_hugetlb_mem_size(int count)
page = alloc_pages(__GFP_HIGHMEM, HUGETLB_PAGE_ORDER);
if (page == NULL)
break;
- map = page;
- for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
- SetPageReserved(map);
- map++;
- }
spin_lock(&htlbpage_lock);
list_add(&page->list, &htlbpage_freelist);
htlbpagemem++;
@@ -415,7 +414,8 @@ int set_hugetlb_mem_size(int count)
return (int) htlbzone_pages;
}
-int hugetlb_sysctl_handler(ctl_table *table, int write, struct file *file, void *buffer, size_t *length)
+int hugetlb_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void *buffer, size_t *length)
{
proc_dointvec(table, write, file, buffer, length);
htlbpage_max = set_hugetlb_mem_size(htlbpage_max);
@@ -432,15 +432,13 @@ __setup("hugepages=", hugetlb_setup);
static int __init hugetlb_init(void)
{
- int i, j;
+ int i;
struct page *page;
for (i = 0; i < htlbpage_max; ++i) {
page = alloc_pages(__GFP_HIGHMEM, HUGETLB_PAGE_ORDER);
if (!page)
break;
- for (j = 0; j < HPAGE_SIZE/PAGE_SIZE; ++j)
- SetPageReserved(&page[j]);
spin_lock(&htlbpage_lock);
list_add(&page->list, &htlbpage_freelist);
spin_unlock(&htlbpage_lock);
diff -puN arch/ia64/mm/hugetlbpage.c~compound-pages-hugetlb arch/ia64/mm/hugetlbpage.c
--- 25/arch/ia64/mm/hugetlbpage.c~compound-pages-hugetlb Fri Jan 31 15:04:32 2003
+++ 25-akpm/arch/ia64/mm/hugetlbpage.c Fri Jan 31 15:06:27 2003
@@ -227,6 +227,7 @@ back1:
page = pte_page(pte);
if (pages) {
page += ((start & ~HPAGE_MASK) >> PAGE_SHIFT);
+ get_page(page);
pages[i] = page;
}
if (vmas)
@@ -303,11 +304,6 @@ set_hugetlb_mem_size (int count)
page = alloc_pages(__GFP_HIGHMEM, HUGETLB_PAGE_ORDER);
if (page == NULL)
break;
- map = page;
- for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
- SetPageReserved(map);
- map++;
- }
spin_lock(&htlbpage_lock);
list_add(&page->list, &htlbpage_freelist);
htlbpagemem++;
@@ -327,7 +323,7 @@ set_hugetlb_mem_size (int count)
map = page;
for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
map->flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
- 1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
+ 1 << PG_dirty | 1 << PG_active |
1 << PG_private | 1<< PG_writeback);
map++;
}
diff -puN arch/sparc64/mm/hugetlbpage.c~compound-pages-hugetlb arch/sparc64/mm/hugetlbpage.c
--- 25/arch/sparc64/mm/hugetlbpage.c~compound-pages-hugetlb Fri Jan 31 15:05:00 2003
+++ 25-akpm/arch/sparc64/mm/hugetlbpage.c Fri Jan 31 15:06:35 2003
@@ -288,6 +288,7 @@ back1:
page = pte_page(pte);
if (pages) {
page += ((start & ~HPAGE_MASK) >> PAGE_SHIFT);
+ get_page(page);
pages[i] = page;
}
if (vmas)
@@ -584,11 +585,6 @@ int set_hugetlb_mem_size(int count)
page = alloc_pages(GFP_ATOMIC, HUGETLB_PAGE_ORDER);
if (page == NULL)
break;
- map = page;
- for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
- SetPageReserved(map);
- map++;
- }
spin_lock(&htlbpage_lock);
list_add(&page->list, &htlbpage_freelist);
htlbpagemem++;
@@ -613,7 +609,6 @@ int set_hugetlb_mem_size(int count)
map->flags &= ~(1UL << PG_locked | 1UL << PG_error |
1UL << PG_referenced |
1UL << PG_dirty | 1UL << PG_active |
- 1UL << PG_reserved |
1UL << PG_private | 1UL << PG_writeback);
set_page_count(page, 0);
map++;
_
* Re: hugepage patches
2003-01-31 23:18 ` Andrew Morton
@ 2003-02-01 8:58 ` Ingo Oeser
2003-02-01 9:31 ` Andrew Morton
0 siblings, 1 reply; 48+ messages in thread
From: Ingo Oeser @ 2003-02-01 8:58 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
Hi Andrew,
On Fri, Jan 31, 2003 at 03:18:58PM -0800, Andrew Morton wrote:
> Also, don't mark hugepages as all PageReserved any more. That's preventing
> callers from doing proper refcounting. Any code which does a user pagetable
> walk and hits part of a hugepage will now handle it transparently.
Heh, that's helping me a lot and makes get_one_user_page very
simple again (and simplifies the follow_huge_* stuff even more).
This could help the futex slow-path and remove loads of code.
Once this hugetlb stuff settles down a bit, I'll rewrite the
page-walking again to accommodate this. No API changes, just
internal rewrites.
So please tell the linux-mm list when it's finished, and I'll have
something ready for -mm in the first week of March[1].
Regards
Ingo Oeser
[1] Important exams in February, sorry.
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth
* Re: hugepage patches
2003-02-01 8:58 ` Ingo Oeser
@ 2003-02-01 9:31 ` Andrew Morton
2003-02-01 10:00 ` William Lee Irwin III
0 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-02-01 9:31 UTC (permalink / raw)
To: Ingo Oeser; +Cc: linux-mm
Ingo Oeser <ingo.oeser@informatik.tu-chemnitz.de> wrote:
>
> Hi Andrew,
>
> On Fri, Jan 31, 2003 at 03:18:58PM -0800, Andrew Morton wrote:
> > Also, don't mark hugepages as all PageReserved any more. That's preventing
> > callers from doing proper refcounting. Any code which does a user pagetable
> > walk and hits part of a hugepage will now handle it transparently.
>
> Heh, that's helping me a lot and makes get_one_user_page very
> simple again (and simplifies the follow_huge_* stuff even more).
>
> This could help the futex slow-path and remove loads of code.
>
> Once this hugetlb stuff settles down a bit, I'll rewrite the
> page-walking again to accommodate this. No API changes, just
> internal rewrites.
OK...
> So please tell the linux-mm list when it's finished, and I'll have
> something ready for -mm in the first week of March[1].
Well I'm thinking of renaming it to hugebugfs. It should be settled down
shortly.
* Re: hugepage patches
2003-02-01 9:31 ` Andrew Morton
@ 2003-02-01 10:00 ` William Lee Irwin III
2003-02-01 10:14 ` Andrew Morton
0 siblings, 1 reply; 48+ messages in thread
From: William Lee Irwin III @ 2003-02-01 10:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ingo Oeser, linux-mm
On Sat, Feb 01, 2003 at 01:31:36AM -0800, Andrew Morton wrote:
> Well I'm thinking of renaming it to hugebugfs. It should be settled down
> shortly.
We've had a difference of opinion wrt. the proper mechanism for
referring things to the head of their superpage. I guess in one
sense I could be blamed for not following directions, but I _really_
didn't want to go in the direction of killing ->lru for all time.
There is also other shite I'd _really_ rather not get into publicly.
-- wli
* Re: hugepage patches
2003-02-01 10:00 ` William Lee Irwin III
@ 2003-02-01 10:14 ` Andrew Morton
0 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-02-01 10:14 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: ingo.oeser, linux-mm
William Lee Irwin III <wli@holomorphy.com> wrote:
>
> On Sat, Feb 01, 2003 at 01:31:36AM -0800, Andrew Morton wrote:
> > Well I'm thinking of renaming it to hugebugfs. It should be settled down
> > shortly.
>
> We've had a difference of opinion wrt. the proper mechanism for
> referring things to the head of their superpage. I guess in one
> sense I could be blamed for not following directions, but I _really_
> didn't want to go in the direction of killing ->lru for all time.
It's not killed - tons of stuff can be stuck at page[1].
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (4 preceding siblings ...)
2003-01-31 23:18 ` Andrew Morton
@ 2003-02-02 10:55 ` Andrew Morton
2003-02-02 10:55 ` Andrew Morton
` (7 subsequent siblings)
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:55 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
5/4
get_unmapped_area for hugetlbfs
Having to specify the mapping address is a pain. Give hugetlbfs files a
file_operations.get_unmapped_area().
The implementation is in hugetlbfs rather than in arch code because it's
probably common to several architectures. If the architecture has special
needs it can define HAVE_ARCH_HUGETLB_UNMAPPED_AREA and go it alone. Just
like HAVE_ARCH_UNMAPPED_AREA.
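For illustration, a hypothetical userspace sketch of what this enables
(huge_fd is assumed to be an open hugetlbfs file):

#include <sys/mman.h>

/* With this patch the kernel picks a suitably HPAGE_SIZE-aligned address
 * when addr is NULL, instead of requiring the caller to supply one. */
void *map_two_hugepages(int huge_fd, size_t hpage_size)
{
	return mmap(NULL, 2 * hpage_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, huge_fd, 0);
}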
hugetlbfs/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 44 insertions(+), 2 deletions(-)
diff -puN fs/hugetlbfs/inode.c~hugetlbfs-get_unmapped_area fs/hugetlbfs/inode.c
--- 25/fs/hugetlbfs/inode.c~hugetlbfs-get_unmapped_area 2003-02-01 01:13:03.000000000 -0800
+++ 25-akpm/fs/hugetlbfs/inode.c 2003-02-02 01:17:01.000000000 -0800
@@ -74,6 +74,47 @@ static int hugetlbfs_file_mmap(struct fi
}
/*
+ * Called under down_write(mmap_sem), page_table_lock is not held
+ */
+
+#ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
+unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags);
+#else
+static unsigned long
+hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
+ if (len & ~HPAGE_MASK)
+ return -EINVAL;
+ if (len > TASK_SIZE)
+ return -ENOMEM;
+
+ if (addr) {
+ addr = ALIGN(addr, HPAGE_SIZE);
+ vma = find_vma(mm, addr);
+ if (TASK_SIZE - len >= addr &&
+ (!vma || addr + len <= vma->vm_start))
+ return addr;
+ }
+
+ addr = ALIGN(mm->free_area_cache, HPAGE_SIZE);
+
+ for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+ /* At this point: (!vma || addr < vma->vm_end). */
+ if (TASK_SIZE - len < addr)
+ return -ENOMEM;
+ if (!vma || addr + len <= vma->vm_start)
+ return addr;
+ addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+ }
+}
+#endif
+
+/*
* Read a page. Again trivial. If it didn't already exist
* in the page cache, it is zero-filled.
*/
@@ -466,8 +507,9 @@ static struct address_space_operations h
};
struct file_operations hugetlbfs_file_operations = {
- .mmap = hugetlbfs_file_mmap,
- .fsync = simple_sync_file,
+ .mmap = hugetlbfs_file_mmap,
+ .fsync = simple_sync_file,
+ .get_unmapped_area = hugetlb_get_unmapped_area,
};
static struct inode_operations hugetlbfs_dir_inode_operations = {
_
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (5 preceding siblings ...)
2003-02-02 10:55 ` Andrew Morton
@ 2003-02-02 10:55 ` Andrew Morton
2003-02-02 19:59 ` William Lee Irwin III
2003-02-02 10:55 ` Andrew Morton
` (6 subsequent siblings)
13 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:55 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
6/4
hugetlbfs: fix truncate
- Opening a hugetlbfs file with O_TRUNC calls the generic vmtruncate() function
and nukes the kernel.
Give S_ISREG hugetlbfs files an inode_operations, and hence a setattr
which knows how to handle these files.
- Don't permit the user to truncate hugetlbfs files to sizes which are not
a multiple of HPAGE_SIZE.
- We don't support expanding in ftruncate(), so remove that code.
hugetlbfs/inode.c | 39 ++++++++++++++++-----------------------
1 files changed, 16 insertions(+), 23 deletions(-)
diff -puN fs/hugetlbfs/inode.c~hugetlbfs-truncate-fix fs/hugetlbfs/inode.c
--- 25/fs/hugetlbfs/inode.c~hugetlbfs-truncate-fix 2003-02-02 01:17:04.000000000 -0800
+++ 25-akpm/fs/hugetlbfs/inode.c 2003-02-02 01:17:04.000000000 -0800
@@ -34,6 +34,7 @@ static struct super_operations hugetlbfs
static struct address_space_operations hugetlbfs_aops;
struct file_operations hugetlbfs_file_operations;
static struct inode_operations hugetlbfs_dir_inode_operations;
+static struct inode_operations hugetlbfs_inode_operations;
static struct backing_dev_info hugetlbfs_backing_dev_info = {
.ra_pages = 0, /* No readahead */
@@ -326,44 +327,29 @@ static void hugetlb_vmtruncate_list(stru
}
}
+/*
+ * Expanding truncates are not allowed.
+ */
static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
{
unsigned long pgoff;
struct address_space *mapping = inode->i_mapping;
- unsigned long limit;
- pgoff = (offset + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
+ if (offset > inode->i_size)
+ return -EINVAL;
- if (inode->i_size < offset)
- goto do_expand;
+ BUG_ON(offset & ~HPAGE_MASK);
+ pgoff = offset >> HPAGE_SHIFT;
inode->i_size = offset;
down(&mapping->i_shared_sem);
- if (list_empty(&mapping->i_mmap) && list_empty(&mapping->i_mmap_shared))
- goto out_unlock;
if (!list_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
if (!list_empty(&mapping->i_mmap_shared))
hugetlb_vmtruncate_list(&mapping->i_mmap_shared, pgoff);
-
-out_unlock:
up(&mapping->i_shared_sem);
truncate_hugepages(mapping, offset);
return 0;
-
-do_expand:
- limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
- if (limit != RLIM_INFINITY && offset > limit)
- goto out_sig;
- if (offset > inode->i_sb->s_maxbytes)
- goto out;
- inode->i_size = offset;
- return 0;
-
-out_sig:
- send_sig(SIGXFSZ, current, 0);
-out:
- return -EFBIG;
}
static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
@@ -390,7 +376,9 @@ static int hugetlbfs_setattr(struct dent
goto out;
if (ia_valid & ATTR_SIZE) {
- error = hugetlb_vmtruncate(inode, attr->ia_size);
+ error = -EINVAL;
+ if (!(attr->ia_size & ~HPAGE_MASK))
+ error = hugetlb_vmtruncate(inode, attr->ia_size);
if (error)
goto out;
attr->ia_valid &= ~ATTR_SIZE;
@@ -425,6 +413,7 @@ hugetlbfs_get_inode(struct super_block *
init_special_inode(inode, mode, dev);
break;
case S_IFREG:
+ inode->i_op = &hugetlbfs_inode_operations;
inode->i_fop = &hugetlbfs_file_operations;
break;
case S_IFDIR:
@@ -525,6 +514,10 @@ static struct inode_operations hugetlbfs
.setattr = hugetlbfs_setattr,
};
+static struct inode_operations hugetlbfs_inode_operations = {
+ .setattr = hugetlbfs_setattr,
+};
+
static struct super_operations hugetlbfs_ops = {
.statfs = simple_statfs,
.drop_inode = hugetlbfs_drop_inode,
_
* Re: hugepage patches
2003-02-02 10:55 ` Andrew Morton
@ 2003-02-02 19:59 ` William Lee Irwin III
2003-02-02 20:49 ` Andrew Morton
0 siblings, 1 reply; 48+ messages in thread
From: William Lee Irwin III @ 2003-02-02 19:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: davem, rohit.seth, davidm, anton, linux-mm
On Sun, Feb 02, 2003 at 02:55:46AM -0800, Andrew Morton wrote:
> 6/4
> hugetlbfs: fix truncate
> - Opening a hugetlbfs file with O_TRUNC calls the generic vmtruncate() function
> and nukes the kernel.
> Give S_ISREG hugetlbfs files an inode_operations, and hence a setattr
> which knows how to handle these files.
> - Don't permit the user to truncate hugetlbfs files to sizes which are not
> a multiple of HPAGE_SIZE.
> - We don't support expanding in ftruncate(), so remove that code.
erm, IIRC ftruncate() was the only way to expand the things; without
read() or write() showing up this creates a huge semantic deficit.
When I wake up the rest of the way I'll eventually remember which
debate I lost that introduced an alternative method.
Leaving .setattr out of the non-directory inode ops and/or not having
a non-directory i_ops is a relatively huge omission. Not sure how
anything actually survived that.
-- wli
* Re: hugepage patches
2003-02-02 19:59 ` William Lee Irwin III
@ 2003-02-02 20:49 ` Andrew Morton
2003-02-03 15:09 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 20:49 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: davem, rohit.seth, davidm, anton, linux-mm
William Lee Irwin III <wli@holomorphy.com> wrote:
>
> On Sun, Feb 02, 2003 at 02:55:46AM -0800, Andrew Morton wrote:
> > 6/4
> > hugetlbfs: fix truncate
> > > - Opening a hugetlbfs file with O_TRUNC calls the generic vmtruncate() function
> > > and nukes the kernel.
> > > Give S_ISREG hugetlbfs files an inode_operations, and hence a setattr
> > > which knows how to handle these files.
> > - Don't permit the user to truncate hugetlbfs files to sizes which are not
> > a multiple of HPAGE_SIZE.
> > - We don't support expanding in ftruncate(), so remove that code.
>
> erm, IIRC ftruncate() was the only way to expand the things;
Expanding ftruncate would be nice, but the current way of performing
the page instantiation at mmap() time seems sufficient.
* Re: hugepage patches
2003-02-02 20:49 ` Andrew Morton
@ 2003-02-03 15:09 ` Eric W. Biederman
2003-02-03 21:29 ` Andrew Morton
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2003-02-03 15:09 UTC (permalink / raw)
To: Andrew Morton
Cc: William Lee Irwin III, davem, rohit.seth, davidm, anton, linux-mm
Andrew Morton <akpm@digeo.com> writes:
> William Lee Irwin III <wli@holomorphy.com> wrote:
> >
> > On Sun, Feb 02, 2003 at 02:55:46AM -0800, Andrew Morton wrote:
> > > 6/4
> > > hugetlbfs: fix truncate
> > > - Opening a hugetlbfs file with O_TRUNC calls the generic vmtruncate() function
> > > and nukes the kernel.
> > > Give S_ISREG hugetlbfs files an inode_operations, and hence a setattr
> > > which knows how to handle these files.
> > > - Don't permit the user to truncate hugetlbfs files to sizes which are not
> > > a multiple of HPAGE_SIZE.
> > > - We don't support expanding in ftruncate(), so remove that code.
> >
> > erm, IIRC ftruncate() was the only way to expand the things;
>
> Expanding ftruncate would be nice, but the current way of performing
> the page instantiation at mmap() time seems sufficient.
Having an expanding/shrinking ftruncate will trivially allow posix shared
memory semantics.
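For concreteness, a minimal sketch of that usage (hypothetical code: it
assumes a hugetlbfs mount at /mnt/huge, 4MB huge pages, and an ftruncate()
which sets the size up front):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (4UL * 1024 * 1024)	/* assumption: 4MB huge pages */

int main(void)
{
	size_t len = 4 * HPAGE_SIZE;	/* sizes must be HPAGE_SIZE-aligned */
	void *p;
	int fd;

	fd = open("/mnt/huge/segment", O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return 1;
	if (ftruncate(fd, len) < 0)	/* set the size first... */
		return 1;
	/* ...then map it; MAP_SHARED gives shm-style sharing */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return p == MAP_FAILED;
}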
I am trying to digest the idea of a mmap that grows a file. There isn't
anything else that works that way, is there?
It looks like you are removing the limit checking from hugetlbfs, by
removing the expansion code from ftruncate. And given the fact that
nothing else grows in mmap, I suspect the code will be much easier to
write and maintain if the growth is constrained to happen in ftruncate.
mmap growing a file just sounds totally non-intuitive. Though I do
agree, allocating that page at the time of growth sounds reasonable.
I may be missing something, but it looks like there is no code present
to prevent simultaneous page allocations from conflicting
when i_size is grown.
Eric
* Re: hugepage patches
2003-02-03 15:09 ` Eric W. Biederman
@ 2003-02-03 21:29 ` Andrew Morton
2003-02-04 5:37 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-02-03 21:29 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: wli, davem, rohit.seth, davidm, anton, linux-mm
ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> >
> > Expanding ftruncate would be nice, but the current way of performing
> > the page instantiation at mmap() time seems sufficient.
>
> Having an expanding/shrinking ftruncate will trivially allow posix shared
> memory semantics.
>
> I am trying to digest the idea of a mmap that grows a file. There isn't
> > anything else that works that way, is there?
Not that I can think of.
> It looks like you are removing the limit checking from hugetlbfs, by
> removing the expansion code from ftruncate.
There was no expansion code.
The code I took out was vestigial. We can put it all back if we decide to
add a new expand-with-ftruncate feature to hugetlbfs.
> And given the fact that
> nothing else grows in mmap, I suspect the code will be much easier to
> write and maintain if the growth is constrained to happen in ftruncate.
That would require a fault handler. We don't have one of those for hugetlbs.
Probably not hard to add one though.
> I may be missing something, but it looks like there is no code present
> to prevent simultaneous page allocations from conflicting
> when i_size is grown.
All the mmap code runs under down_write(current->mm->mmap_sem);
* Re: hugepage patches
2003-02-03 21:29 ` Andrew Morton
@ 2003-02-04 5:37 ` Eric W. Biederman
2003-02-04 5:50 ` William Lee Irwin III
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2003-02-04 5:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: wli, davem, rohit.seth, davidm, anton, linux-mm
Andrew Morton <akpm@digeo.com> writes:
> ebiederm@xmission.com (Eric W. Biederman) wrote:
> >
> > >
> > > Expanding ftruncate would be nice, but the current way of performing
> > > the page instantiation at mmap() time seems sufficient.
> >
> > Having an expanding/shrinking ftruncate will trivially allow posix shared
> > memory semantics.
> >
> > I am trying to digest the idea of a mmap that grows a file. There isn't
> > anything else that works that way, is there?
>
> Not that I can think of.
>
> > It looks like you are removing the limit checking from hugetlbfs, by
> > removing the expansion code from ftruncate.
>
> There was no expansion code.
inode->i_size was grown, but I admit no huge pages were allocated.
> The code I took out was vestigial. We can put it all back if we decide to
> add a new expand-with-ftruncate feature to hugetlbfs.
>
> > And given the fact that
> > nothing else grows in mmap, I suspect the code will be much easier to
> > write and maintain if the growth is constrained to happen in ftruncate.
>
> That would require a fault handler. We don't have one of those for hugetlbs.
> Probably not hard to add one though.
I don't see that ftruncate setting the size would require a fault
handler. ftruncate just needs to be called before mmap. But a fault
handler would certainly make the code more like the rest of the mmap
cases.
With a fault handler I start getting dangerous thoughts of paging
hugetlbfs to swap, probably not a good idea.
> > I may be missing something, but it looks like there is no code present
> > to prevent simultaneous page allocations from conflicting
> > when i_size is grown.
>
> All the mmap code runs under down_write(current->mm->mmap_sem);
Last I looked, i_size is commonly protected by inode->i_sem.
current->mm->mmap_sem really doesn't provide protection if there is
a shared area between mappings in two different mm's. Not a problem
if the mapping is private, but otherwise...
Does hugetlbfs support shared mappings? If it is exclusively
for private mappings the code makes much more sense than I am
thinking.
Eric
* Re: hugepage patches
2003-02-04 5:37 ` Eric W. Biederman
@ 2003-02-04 5:50 ` William Lee Irwin III
2003-02-04 7:06 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: William Lee Irwin III @ 2003-02-04 5:50 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, davem, rohit.seth, davidm, anton, linux-mm
On Mon, Feb 03, 2003 at 10:37:51PM -0700, Eric W. Biederman wrote:
> current->mm->mmap_sem really doesn't provide protection if there is
> a shared area between mappings in two different mm's. Not a problem
> if the mapping is private, but otherwise...
> Does hugetlbfs support shared mappings? If it is exclusively
> for private mappings the code makes much more sense than I am
> thinking.
It's supposedly for massively shared mappings to reduce PTE overhead.
Well, in theory there's some kind of TLB benefit, but the only thing
people really care about is that the x86 pagetable structure gets rid of
L3 space entirely, so you don't burn 12+GB of L3 pagetables for appserver loads.
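(For scale, one set of assumed workload numbers that yields a figure of
that order:

	3GB segment / 4kB pages = 786,432 PTEs, i.e. ~3MB of pagetables per process
	~3MB x 4,096 processes  = ~12GB of pagetables
	3GB segment / 4MB pages = 768 PMD entries, and no PTE pages at all)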
-- wli
* Re: hugepage patches
2003-02-04 5:50 ` William Lee Irwin III
@ 2003-02-04 7:06 ` Eric W. Biederman
2003-02-04 7:16 ` Martin J. Bligh
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2003-02-04 7:06 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Andrew Morton, davem, rohit.seth, davidm, anton, linux-mm
William Lee Irwin III <wli@holomorphy.com> writes:
> On Mon, Feb 03, 2003 at 10:37:51PM -0700, Eric W. Biederman wrote:
> > current->mm->mmap_sem really doesn't provide protection if there is
> > a shared area between mappings in two different mm's. Not a problem
> > if the mapping is private, but otherwise...
> > Does hugetlbfs support shared mappings? If it is exclusively
> > for private mappings the code makes much more sense than I am
> > thinking.
>
> It's supposedly for massively shared mappings to reduce PTE overhead.
O.k. Then the code definitely needs to handle shared mappings..
> Well, in theory there's some kind of TLB benefit, but the only thing
> people really care about is that the x86 pagetable structure gets rid of
> L3 space entirely, so you don't burn 12+GB of L3 pagetables for appserver loads.
I am with the group that actually cares more about the TLB benefit.
For HPC loads there is really only one application per machine. And with
just one page table, the only real advantage is the more efficient use
of the TLB.
Eric
* Re: hugepage patches
2003-02-04 7:06 ` Eric W. Biederman
@ 2003-02-04 7:16 ` Martin J. Bligh
2003-02-04 12:40 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Martin J. Bligh @ 2003-02-04 7:16 UTC (permalink / raw)
To: Eric W. Biederman, William Lee Irwin III
Cc: Andrew Morton, davem, rohit.seth, davidm, anton, linux-mm
> O.k. Then the code definitely needs to handle shared mappings..
Why? We just divided the pagetable size by a factor of 1000, so
the problem is no longer really there ;-)
>> Well, in theory there's some kind of TLB benefit, but the only thing
>> people really care about is that the x86 pagetable structure gets rid of
>> L3 space entirely, so you don't burn 12+GB of L3 pagetables for appserver loads.
>
> I am with the group that actually cares more about the TLB benefit.
> For HPC loads there is really only one application per machine. And with
> just one page table, the only real advantage is the more efficient use
> of the TLB.
The reason we don't see it much is that we mostly have P3's which only
have 4 entries for large pages. P4's would be much easier to demonstrate
such things on, and I don't think we've really tried very hard on that with
hugetlbfs (earlier Java work by the research group showed impressive
improvements on an earlier implementation).
M.
* Re: hugepage patches
2003-02-04 7:16 ` Martin J. Bligh
@ 2003-02-04 12:40 ` Eric W. Biederman
2003-02-04 15:55 ` Martin J. Bligh
2003-02-04 21:12 ` Andrew Morton
0 siblings, 2 replies; 48+ messages in thread
From: Eric W. Biederman @ 2003-02-04 12:40 UTC (permalink / raw)
To: Martin J. Bligh
Cc: William Lee Irwin III, Andrew Morton, davem, rohit.seth, davidm,
anton, linux-mm
"Martin J. Bligh" <mbligh@aracnet.com> writes:
> > O.k. Then the code definitely needs to handle shared mappings..
>
> Why? We just divided the pagetable size by a factor of 1000, so
> the problem is no longer really there ;-)
William said one of the cases was to handle massively shared
mappings. You cannot create a massively shared mapping except by
sharing.
Did I misunderstand what was meant by a massively shared mapping?
I can't imagine it being useful to guys like oracle without MAP_SHARED
support....
> >> Well, in theory there's some kind of TLB benefit, but the only thing
> >> people really care about is that the x86 pagetable structure gets rid of
> >> L3 space entirely, so you don't burn 12+GB of L3 pagetables for appserver loads.
> >
> > I am with the group that actually cares more about the TLB benefit.
> > For HPC loads there is really only one application per machine. And with
> > just one page table, the only real advantage is the more efficient use
> > of the TLB.
>
> The reason we don't see it much is that we mostly have P3's which only
> have 4 entries for large pages. P4's would be much easier to demonstrate
> such things on, and I don't think we've really tried very hard on that with
> hugetlbfs (earlier Java work by the research group showed impressive
> improvements on an earlier implementation).
Cool. I have no doubt the benefit is there. Measuring how large it
is will certainly be interesting.
Eric
* Re: hugepage patches
2003-02-04 12:40 ` Eric W. Biederman
@ 2003-02-04 15:55 ` Martin J. Bligh
2003-02-05 12:18 ` Eric W. Biederman
2003-02-04 21:12 ` Andrew Morton
1 sibling, 1 reply; 48+ messages in thread
From: Martin J. Bligh @ 2003-02-04 15:55 UTC (permalink / raw)
To: Eric W. Biederman
Cc: William Lee Irwin III, Andrew Morton, davem, rohit.seth, davidm,
anton, linux-mm
>> > O.k. Then the code definitely needs to handle shared mappings..
>>
>> Why? We just divided the pagetable size by a factor of 1000, so
>> the problem is no longer really there ;-)
>
> William said one of the cases was to handle massively shared
> mappings. You cannot create a massively shared mapping except by
> sharing.
>
> Did I misunderstand what was meant by a massively shared mapping?
>
> I can't imagine it being useful to guys like oracle without MAP_SHARED
> support....
Create a huge shmem segment and don't share the pagetables. Without large
pages, it's an enormous waste of space in mindless duplication. With large
pages, it's a much smaller waste of space (no PTEs) in mindless
duplication.
Still not optimal, but makes the problem manageable.
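To put rough, purely illustrative numbers on it: a 12GB shmem segment mapped
with 4k pages costs ~3 million PTEs per process - on the order of 12-24MB of
pagetables each, depending on PTE width - so a thousand appserver processes
really do burn the 12+GB of pagetables mentioned earlier in the thread. Map
the same segment with 4MB pages and it collapses to ~3000 PMD-level entries
per process, hence the factor-of-~1000 above.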
> Cool. I have no doubt the benefit is there. Measuring how large it
> is will certainly be interesting.
See the IBM research group's paper on large page support from last year's
OLS. Pretty impressive stuff.
M.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-02-04 15:55 ` Martin J. Bligh
@ 2003-02-05 12:18 ` Eric W. Biederman
0 siblings, 0 replies; 48+ messages in thread
From: Eric W. Biederman @ 2003-02-05 12:18 UTC (permalink / raw)
To: Martin J. Bligh
Cc: William Lee Irwin III, Andrew Morton, davem, rohit.seth, davidm,
anton, linux-mm
"Martin J. Bligh" <mbligh@aracnet.com> writes:
> > Did I misunderstand what was meant by a massively shared mapping?
> >
> > I can't imagine it being useful to guys like oracle without MAP_SHARED
> > support....
>
> Create a huge shmem segment and don't share the pagetables. Without large
> pages, it's an enormous waste of space in mindless duplication. With large
> pages, it's a much smaller waste of space (no PTEs) in mindless
> duplication.
> Still not optimal, but makes the problem manageable.
And this is exactly the mmap(MAP_SHARED) case. Where a single memory
segment is shared between multiple mm's.
Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-02-04 12:40 ` Eric W. Biederman
2003-02-04 15:55 ` Martin J. Bligh
@ 2003-02-04 21:12 ` Andrew Morton
2003-02-05 12:25 ` Eric W. Biederman
1 sibling, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-02-04 21:12 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: mbligh, wli, davem, rohit.seth, davidm, anton, linux-mm
ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> I can't imagine it being useful to guys like oracle without MAP_SHARED
> support....
MAP_SHARED is supported. I haven't tested it much though.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-02-04 21:12 ` Andrew Morton
@ 2003-02-05 12:25 ` Eric W. Biederman
2003-02-05 19:57 ` Andrew Morton
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2003-02-05 12:25 UTC (permalink / raw)
To: Andrew Morton; +Cc: mbligh, wli, davem, rohit.seth, davidm, anton, linux-mm
Andrew Morton <akpm@digeo.com> writes:
> ebiederm@xmission.com (Eric W. Biederman) wrote:
> >
> > I can't imagine it being useful to guys like oracle without MAP_SHARED
> > support....
>
> MAP_SHARED is supported. I haven't tested it much though.
Given that none of the standard kernel idioms to prevent races in
this kind of code are present, I would be very surprised if it
was not racy.
- inode->i_sem is not taken to protect inode->i_size.
- After successfully allocating a page, a test is not made to see if
another process with the same mapping has allocated the page first.
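The usual shape of the safe version is roughly this (a sketch from memory,
not the actual hugetlbfs code):

	down(&inode->i_sem);		/* protects i_size and insertion */
	page = find_get_page(mapping, idx);
	if (!page) {
		page = alloc_hugetlb_page();
		if (!page)
			goto out;		/* -ENOMEM */
		if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) {
			/* another mm with the same mapping got in
			 * first: drop ours, take the winner's page */
			free_huge_page(page);
			page = find_get_page(mapping, idx);
		}
	}
	/* ... instantiate the pte against 'page' ... */
	up(&inode->i_sem);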
Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-02-05 12:25 ` Eric W. Biederman
@ 2003-02-05 19:57 ` Andrew Morton
2003-02-05 20:00 ` Andrew Morton
0 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-02-05 19:57 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: mbligh, wli, davem, rohit.seth, davidm, anton, linux-mm
ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> Andrew Morton <akpm@digeo.com> writes:
>
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> > >
> > > I can't imagine it being useful to guys like oracle without MAP_SHARED
> > > support....
> >
> > MAP_SHARED is supported. I haven't tested it much though.
>
> Given that none of the standard kernel idioms to prevent races in
> this kind of code are present, I would be very surprised if it
> was not racy.
>
> - inode->i_sem is not taken to protect inode->i_size.
OK, I'll fix that up.
> - After successfully allocating a page, a test is not made to see if
> another process with the same mapping has allocated the page first.
In this case, add_to_page_cache() in hugetlb_prefault() will return -EEXIST,
and the page which lost the race will be freed again.
Uh, but we don't establish a pte against the page which got there first.
I'll fix that up too. Thanks.
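Something like this in the -EEXIST path, I expect (sketch only, untested):

	ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
	if (ret == -EEXIST) {
		/* lost the race: free ours, map the winner's page */
		free_huge_page(page);
		page = find_get_page(mapping, idx);
		if (!page) {
			ret = -EINVAL;	/* raced with a truncate too */
			goto out;
		}
		ret = 0;
	}
	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);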
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-02-05 19:57 ` Andrew Morton
@ 2003-02-05 20:00 ` Andrew Morton
0 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-02-05 20:00 UTC (permalink / raw)
To: ebiederm, mbligh, wli, davem, rohit.seth, davidm, anton, linux-mm
Andrew Morton <akpm@digeo.com> wrote:
>
> > - inode->i_sem is not taken to protect inode->i_size.
>
> OK, I'll fix that up.
>
> > - After successfully allocating a page, a test is not made to see if
> > another process with the same mapping has allocated the page first.
>
> In this case, add_to_page_cache() in hugetlb_prefault() will return -EEXIST,
> and the page which lost the race will be freed again.
>
> Uh, but we don't establish a pte against the page which got there first.
> I'll fix that up too. Thanks.
No, everything is OK, isn't it? The entire operation (i_size update and
allocate/add_to_page_cache()) is serialised under i_sem.
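i.e. (paraphrasing the call path, not quoting it):

	/* hugetlbfs_file_mmap() */
	down(&inode->i_sem);
	ret = hugetlb_prefault(mapping, vma);
	/*
	 * -> find_get_page() / alloc_hugetlb_page() /
	 *    add_to_page_cache() / i_size update all run under i_sem,
	 *    so two mappers cannot instantiate the same index
	 *    concurrently.
	 */
	up(&inode->i_sem);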
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (6 preceding siblings ...)
2003-02-02 10:55 ` Andrew Morton
@ 2003-02-02 10:55 ` Andrew Morton
2003-02-02 10:56 ` Andrew Morton
` (5 subsequent siblings)
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:55 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
7/4
hugetlbfs i_size fixes
We're expanding hugetlbfs i_size in the wrong place. If someone attempts to
mmap more pages than are available, i_size is updated to reflect the
attempted mapping size.
So set i_size only when pages are successfully added to the mapping.
i_size handling at truncate time is still a bit wrong - if the mapping has
pages at (say) page offset 100-200 and the mapping is truncated to (say) page
offset 50, i_size should be set to zero. But it is instead set to
50*HPAGE_SIZE. That's harmless.
i386/mm/hugetlbpage.c | 5 +++++
ia64/mm/hugetlbpage.c | 0
sparc64/mm/hugetlbpage.c | 0
x86_64/mm/hugetlbpage.c | 6 ++++++
hugetlbfs/inode.c | 5 -----
5 files changed, 11 insertions(+), 5 deletions(-)
diff -puN fs/hugetlbfs/inode.c~hugetlbfs-i_size-fix fs/hugetlbfs/inode.c
--- 25/fs/hugetlbfs/inode.c~hugetlbfs-i_size-fix 2003-02-01 02:07:22.000000000 -0800
+++ 25-akpm/fs/hugetlbfs/inode.c 2003-02-01 02:07:22.000000000 -0800
@@ -45,7 +45,6 @@ static int hugetlbfs_file_mmap(struct fi
{
struct inode *inode = file->f_dentry->d_inode;
struct address_space *mapping = inode->i_mapping;
- size_t len;
int ret;
if (!capable(CAP_IPC_LOCK))
@@ -66,10 +65,6 @@ static int hugetlbfs_file_mmap(struct fi
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
ret = hugetlb_prefault(mapping, vma);
- len = (vma->vm_end - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
- if (inode->i_size < len)
- inode->i_size = len;
-
up(&inode->i_sem);
return ret;
}
diff -puN arch/i386/mm/hugetlbpage.c~hugetlbfs-i_size-fix arch/i386/mm/hugetlbpage.c
--- 25/arch/i386/mm/hugetlbpage.c~hugetlbfs-i_size-fix 2003-02-01 02:07:22.000000000 -0800
+++ 25-akpm/arch/i386/mm/hugetlbpage.c 2003-02-01 02:07:22.000000000 -0800
@@ -284,6 +284,7 @@ void zap_hugepage_range(struct vm_area_s
int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
{
struct mm_struct *mm = current->mm;
+ struct inode *inode = mapping->host;
unsigned long addr;
int ret = 0;
@@ -307,6 +308,7 @@ int hugetlb_prefault(struct address_spac
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
page = find_get_page(mapping, idx);
if (!page) {
+ loff_t i_size;
page = alloc_hugetlb_page();
if (!page) {
ret = -ENOMEM;
@@ -318,6 +320,9 @@ int hugetlb_prefault(struct address_spac
free_huge_page(page);
goto out;
}
+ i_size = (loff_t)(idx + 1) * HPAGE_SIZE;
+ if (i_size > inode->i_size)
+ inode->i_size = i_size;
}
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
}
diff -puN arch/ia64/mm/hugetlbpage.c~hugetlbfs-i_size-fix arch/ia64/mm/hugetlbpage.c
diff -puN arch/sparc64/mm/hugetlbpage.c~hugetlbfs-i_size-fix arch/sparc64/mm/hugetlbpage.c
diff -puN arch/x86_64/mm/hugetlbpage.c~hugetlbfs-i_size-fix arch/x86_64/mm/hugetlbpage.c
--- 25/arch/x86_64/mm/hugetlbpage.c~hugetlbfs-i_size-fix 2003-02-01 02:07:22.000000000 -0800
+++ 25-akpm/arch/x86_64/mm/hugetlbpage.c 2003-02-01 02:07:22.000000000 -0800
@@ -205,6 +205,7 @@ void zap_hugepage_range(struct vm_area_s
int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
{
struct mm_struct *mm = current->mm;
+ struct inode *inode = mapping->host;
unsigned long addr;
int ret = 0;
@@ -228,6 +229,8 @@ int hugetlb_prefault(struct address_spac
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
page = find_get_page(mapping, idx);
if (!page) {
+ loff_t i_size;
+
page = alloc_hugetlb_page();
if (!page) {
ret = -ENOMEM;
@@ -239,6 +242,9 @@ int hugetlb_prefault(struct address_spac
free_huge_page(page);
goto out;
}
+ i_size = (loff_t)(idx + 1) * HPAGE_SIZE;
+ if (i_size > inode->i_size)
+ inode->i_size = i_size;
}
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
}
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (7 preceding siblings ...)
2003-02-02 10:55 ` Andrew Morton
@ 2003-02-02 10:56 ` Andrew Morton
2003-02-02 20:06 ` William Lee Irwin III
2003-02-02 10:56 ` Andrew Morton
` (4 subsequent siblings)
13 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:56 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
8/4
hugetlbfs cleanups
- Remove quota code.
- Remove extraneous copy-n-paste code from truncate: that's only for
physically-backed filesystems.
- Whitespace changes.
hugetlbfs/inode.c | 91 ++++++++----------------------------------------------
1 files changed, 15 insertions(+), 76 deletions(-)
diff -puN fs/hugetlbfs/inode.c~hugetlbfs-cleanup fs/hugetlbfs/inode.c
--- 25/fs/hugetlbfs/inode.c~hugetlbfs-cleanup 2003-02-02 01:17:07.000000000 -0800
+++ 25-akpm/fs/hugetlbfs/inode.c 2003-02-02 01:17:07.000000000 -0800
@@ -120,12 +120,16 @@ static int hugetlbfs_readpage(struct fil
return -EINVAL;
}
-static int hugetlbfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int
+hugetlbfs_prepare_write(struct file *file, struct page *page,
+ unsigned offset, unsigned to)
{
return -EINVAL;
}
-static int hugetlbfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int
+hugetlbfs_commit_write(struct file *file, struct page *page,
+ unsigned offset, unsigned to)
{
return -EINVAL;
}
@@ -140,28 +144,8 @@ void huge_pagevec_release(struct pagevec
pagevec_reinit(pvec);
}
-void truncate_partial_hugepage(struct page *page, unsigned partial)
-{
- int i;
- const unsigned piece = partial & (PAGE_SIZE - 1);
- const unsigned tailstart = PAGE_SIZE - piece;
- const unsigned whole_pages = partial / PAGE_SIZE;
- const unsigned last_page_offset = HPAGE_SIZE/PAGE_SIZE - whole_pages;
-
- for (i = HPAGE_SIZE/PAGE_SIZE - 1; i >= last_page_offset; ++i)
- memclear_highpage_flush(&page[i], 0, PAGE_SIZE);
-
- if (!piece)
- return;
-
- memclear_highpage_flush(&page[last_page_offset - 1], tailstart, piece);
-}
-
-void truncate_huge_page(struct address_space *mapping, struct page *page)
+void truncate_huge_page(struct page *page)
{
- if (page->mapping != mapping)
- return;
-
clear_page_dirty(page);
ClearPageUptodate(page);
remove_from_page_cache(page);
@@ -170,52 +154,13 @@ void truncate_huge_page(struct address_s
void truncate_hugepages(struct address_space *mapping, loff_t lstart)
{
- const pgoff_t start = (lstart + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
- const unsigned partial = lstart & (HPAGE_SIZE - 1);
+ const pgoff_t start = lstart >> HPAGE_SHIFT;
struct pagevec pvec;
pgoff_t next;
int i;
pagevec_init(&pvec, 0);
next = start;
-
- while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
- for (i = 0; i < pagevec_count(&pvec); ++i) {
- struct page *page = pvec.pages[i];
- pgoff_t page_index = page->index;
-
- if (page_index > next)
- next = page_index;
-
- ++next;
-
- if (TestSetPageLocked(page))
- continue;
-
- if (PageWriteback(page)) {
- unlock_page(page);
- continue;
- }
-
- truncate_huge_page(mapping, page);
- unlock_page(page);
- }
- huge_pagevec_release(&pvec);
- cond_resched();
- }
-
- if (partial) {
- struct page *page = find_lock_page(mapping, start - 1);
- if (page) {
- wait_on_page_writeback(page);
- truncate_partial_hugepage(page, partial);
- unlock_page(page);
- huge_page_release(page);
- }
- }
-
- next = start;
-
while (1) {
if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
if (next == start)
@@ -228,11 +173,10 @@ void truncate_hugepages(struct address_s
struct page *page = pvec.pages[i];
lock_page(page);
- wait_on_page_writeback(page);
if (page->index > next)
next = page->index;
++next;
- truncate_huge_page(mapping, page);
+ truncate_huge_page(page);
unlock_page(page);
}
huge_pagevec_release(&pvec);
@@ -363,13 +307,6 @@ static int hugetlbfs_setattr(struct dent
error = security_inode_setattr(dentry, attr);
if (error)
goto out;
-
- if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) ||
- (ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid))
- error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0;
- if (error)
- goto out;
-
if (ia_valid & ATTR_SIZE) {
error = -EINVAL;
if (!(attr->ia_size & ~HPAGE_MASK))
@@ -401,7 +338,7 @@ hugetlbfs_get_inode(struct super_block *
inode->i_blocks = 0;
inode->i_rdev = NODEV;
inode->i_mapping->a_ops = &hugetlbfs_aops;
- inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
+ inode->i_mapping->backing_dev_info = &hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
@@ -444,7 +381,7 @@ hugetlbfs_mknod(struct inode *dir, struc
return error;
}
-static int hugetlbfs_mkdir(struct inode * dir, struct dentry * dentry, int mode)
+static int hugetlbfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
int retval = hugetlbfs_mknod(dir, dentry, mode | S_IFDIR, 0);
if (!retval)
@@ -457,7 +394,8 @@ static int hugetlbfs_create(struct inode
return hugetlbfs_mknod(dir, dentry, mode | S_IFREG, 0);
}
-static int hugetlbfs_symlink(struct inode * dir, struct dentry *dentry, const char * symname)
+static int
+hugetlbfs_symlink(struct inode *dir, struct dentry *dentry, const char *symname)
{
struct inode *inode;
int error = -ENOSPC;
@@ -518,7 +456,8 @@ static struct super_operations hugetlbfs
.drop_inode = hugetlbfs_drop_inode,
};
-static int hugetlbfs_fill_super(struct super_block * sb, void * data, int silent)
+static int
+hugetlbfs_fill_super(struct super_block * sb, void * data, int silent)
{
struct inode * inode;
struct dentry * root;
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-02-02 10:56 ` Andrew Morton
@ 2003-02-02 20:06 ` William Lee Irwin III
0 siblings, 0 replies; 48+ messages in thread
From: William Lee Irwin III @ 2003-02-02 20:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: davem, rohit.seth, davidm, anton, linux-mm
On Sun, Feb 02, 2003 at 02:56:09AM -0800, Andrew Morton wrote:
> hugetlbfs cleanups
> - Remove quota code.
> - Remove extraneous copy-n-paste code from truncate: that's only for
> physically-backed filesystems.
> - Whitespace changes.
Quotas would allow per-user limits on the memory consumed by this stuff.
I guess since I've not pursued it / tested it / etc. out it goes...
-- wli
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (8 preceding siblings ...)
2003-02-02 10:56 ` Andrew Morton
@ 2003-02-02 10:56 ` Andrew Morton
2003-02-02 10:56 ` Andrew Morton
` (3 subsequent siblings)
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:56 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
9/4
Give all architectures a hugetlb_nopage().
If someone maps a hugetlbfs file, then truncates it, then references the part
of the mapping outside the truncation point, they take a pagefault and we end
up hitting hugetlb_nopage().
We want to prevent this from ever happening. This patch just makes sure that
all architectures have a goes-BUG hugetlb_nopage() to trap it.
i386/mm/hugetlbpage.c | 10 ++++++++--
ia64/mm/hugetlbpage.c | 11 +++++++++--
sparc64/mm/hugetlbpage.c | 8 ++++++++
x86_64/mm/hugetlbpage.c | 4 ++--
4 files changed, 27 insertions(+), 6 deletions(-)
diff -puN arch/i386/mm/hugetlbpage.c~hugetlbfs-nopage-cleanup arch/i386/mm/hugetlbpage.c
--- 25/arch/i386/mm/hugetlbpage.c~hugetlbfs-nopage-cleanup 2003-02-01 22:35:51.000000000 -0800
+++ 25-akpm/arch/i386/mm/hugetlbpage.c 2003-02-01 22:37:04.000000000 -0800
@@ -26,7 +26,6 @@ static long htlbpagemem;
int htlbpage_max;
static long htlbzone_pages;
-struct vm_operations_struct hugetlb_vm_ops;
static LIST_HEAD(htlbpage_freelist);
static spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
@@ -472,7 +471,14 @@ int is_hugepage_mem_enough(size_t size)
return 1;
}
-static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
+/*
+ * We cannot handle pagefaults against hugetlb pages at all. They cause
+ * handle_mm_fault() to try to instantiate regular-sized pages in the
+ * hugepage VMA. do_page_fault() is supposed to trap this, so BUG if we get
+ * this far.
+ */
+static struct page *
+hugetlb_nopage(struct vm_area_struct *vma, unsigned long address, int unused)
{
BUG();
return NULL;
diff -puN arch/ia64/mm/hugetlbpage.c~hugetlbfs-nopage-cleanup arch/ia64/mm/hugetlbpage.c
--- 25/arch/ia64/mm/hugetlbpage.c~hugetlbfs-nopage-cleanup 2003-02-01 22:35:51.000000000 -0800
+++ 25-akpm/arch/ia64/mm/hugetlbpage.c 2003-02-01 22:37:08.000000000 -0800
@@ -18,7 +18,6 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>
-static struct vm_operations_struct hugetlb_vm_ops;
struct list_head htlbpage_freelist;
spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
extern long htlbpagemem;
@@ -333,6 +332,14 @@ set_hugetlb_mem_size (int count)
return (int) htlbzone_pages;
}
+static struct page *
+hugetlb_nopage(struct vm_area_struct *vma, unsigned long address, int unused)
+{
+ BUG();
+ return NULL;
+}
+
static struct vm_operations_struct hugetlb_vm_ops = {
- .close = zap_hugetlb_resources
+ .nopage = hugetlb_nopage,
+ .close = zap_hugetlb_resources,
};
diff -puN arch/sparc64/mm/hugetlbpage.c~hugetlbfs-nopage-cleanup arch/sparc64/mm/hugetlbpage.c
--- 25/arch/sparc64/mm/hugetlbpage.c~hugetlbfs-nopage-cleanup 2003-02-01 22:35:51.000000000 -0800
+++ 25-akpm/arch/sparc64/mm/hugetlbpage.c 2003-02-01 22:37:13.000000000 -0800
@@ -619,6 +619,14 @@ int set_hugetlb_mem_size(int count)
return (int) htlbzone_pages;
}
+static struct page *
+hugetlb_nopage(struct vm_area_struct *vma, unsigned long address, int unused)
+{
+ BUG();
+ return NULL;
+}
+
static struct vm_operations_struct hugetlb_vm_ops = {
+ .nopage = hugetlb_nopage,
.close = zap_hugetlb_resources,
};
diff -puN arch/x86_64/mm/hugetlbpage.c~hugetlbfs-nopage-cleanup arch/x86_64/mm/hugetlbpage.c
--- 25/arch/x86_64/mm/hugetlbpage.c~hugetlbfs-nopage-cleanup 2003-02-01 22:35:51.000000000 -0800
+++ 25-akpm/arch/x86_64/mm/hugetlbpage.c 2003-02-01 22:37:19.000000000 -0800
@@ -25,7 +25,6 @@ static long htlbpagemem;
int htlbpage_max;
static long htlbzone_pages;
-struct vm_operations_struct hugetlb_vm_ops;
static LIST_HEAD(htlbpage_freelist);
static spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
@@ -349,7 +348,8 @@ int hugetlb_report_meminfo(char *buf)
HPAGE_SIZE/1024);
}
-static struct page * hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
+static struct page *
+hugetlb_nopage(struct vm_area_struct *vma, unsigned long address, int unused)
{
BUG();
return NULL;
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (9 preceding siblings ...)
2003-02-02 10:56 ` Andrew Morton
@ 2003-02-02 10:56 ` Andrew Morton
2003-02-02 10:57 ` Andrew Morton
` (2 subsequent siblings)
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:56 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
10/4
Fix hugetlbfs faults
If the underlying mapping was truncated and someone references the
now-unmapped memory the kernel will enter handle_mm_fault() and will start
instantiating PAGE_SIZE pte's inside the hugepage VMA. Everything goes
generally pear-shaped.
So trap this in handle_mm_fault(). It adds no overhead to non-hugepage
builds.
Another possible fix would be to not unmap the huge pages at all in truncate
- just anonymise them.
But I think we want full ftruncate semantics for hugepages for management
purposes.
i386/mm/fault.c | 0
memory.c | 4 ++++
2 files changed, 4 insertions(+)
diff -puN arch/i386/mm/fault.c~hugetlbfs-fault-fix arch/i386/mm/fault.c
diff -puN mm/memory.c~hugetlbfs-fault-fix mm/memory.c
--- 25/mm/memory.c~hugetlbfs-fault-fix 2003-02-01 22:46:48.000000000 -0800
+++ 25-akpm/mm/memory.c 2003-02-01 22:46:48.000000000 -0800
@@ -1447,6 +1447,10 @@ int handle_mm_fault(struct mm_struct *mm
pgd = pgd_offset(mm, address);
inc_page_state(pgfault);
+
+ if (is_vm_hugetlb_page(vma))
+ return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+
/*
* We need the page table lock to synchronize with kswapd
* and the SMP-safe atomic PTE updates.
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (10 preceding siblings ...)
2003-02-02 10:56 ` Andrew Morton
@ 2003-02-02 10:57 ` Andrew Morton
2003-02-02 10:57 ` Andrew Morton
2003-02-02 10:57 ` Andrew Morton
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:57 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
11/4
ia32 hugetlb cleanup
- whitespace
- remove unneeded spinlocking no-op.
i386/mm/hugetlbpage.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
diff -puN arch/i386/mm/hugetlbpage.c~hugetlbpage-cleanup arch/i386/mm/hugetlbpage.c
--- 25/arch/i386/mm/hugetlbpage.c~hugetlbpage-cleanup 2003-02-01 22:06:04.000000000 -0800
+++ 25-akpm/arch/i386/mm/hugetlbpage.c 2003-02-01 22:06:25.000000000 -0800
@@ -248,7 +248,9 @@ void huge_page_release(struct page *page
free_huge_page(page);
}
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+void
+unmap_hugepage_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -258,8 +260,6 @@ void unmap_hugepage_range(struct vm_area
BUG_ON(start & (HPAGE_SIZE - 1));
BUG_ON(end & (HPAGE_SIZE - 1));
- spin_lock(&htlbpage_lock);
- spin_unlock(&htlbpage_lock);
for (address = start; address < end; address += HPAGE_SIZE) {
pte = huge_pte_offset(mm, address);
if (pte_none(*pte))
@@ -272,7 +272,9 @@ void unmap_hugepage_range(struct vm_area
flush_tlb_range(vma, start, end);
}
-void zap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsigned long length)
+void
+zap_hugepage_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long length)
{
struct mm_struct *mm = vma->vm_mm;
spin_lock(&mm->page_table_lock);
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (11 preceding siblings ...)
2003-02-02 10:57 ` Andrew Morton
@ 2003-02-02 10:57 ` Andrew Morton
2003-02-02 20:17 ` William Lee Irwin III
2003-02-02 10:57 ` Andrew Morton
13 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:57 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
12/4
Fix hugetlb_vmtruncate_list()
This function is quite wrong - it has an "=" where it should have a "-", and
it confuses PAGE_SIZE and HPAGE_SIZE in its address and file offset arithmetic.
hugetlbfs/inode.c | 46 ++++++++++++++++++++++++++++++++--------------
1 files changed, 32 insertions(+), 14 deletions(-)
diff -puN fs/hugetlbfs/inode.c~hugetlb_vmtruncate-fixes fs/hugetlbfs/inode.c
--- 25/fs/hugetlbfs/inode.c~hugetlb_vmtruncate-fixes 2003-02-02 01:17:12.000000000 -0800
+++ 25-akpm/fs/hugetlbfs/inode.c 2003-02-02 02:53:49.000000000 -0800
@@ -240,29 +240,47 @@ static void hugetlbfs_drop_inode(struct
hugetlbfs_forget_inode(inode);
}
-static void hugetlb_vmtruncate_list(struct list_head *list, unsigned long pgoff)
+/*
+ * h_pgoff is in HPAGE_SIZE units.
+ * vma->vm_pgoff is in PAGE_SIZE units.
+ */
+static void
+hugetlb_vmtruncate_list(struct list_head *list, unsigned long h_pgoff)
{
- unsigned long start, end, length, delta;
struct vm_area_struct *vma;
list_for_each_entry(vma, list, shared) {
- start = vma->vm_start;
- end = vma->vm_end;
- length = end - start;
-
- if (vma->vm_pgoff >= pgoff) {
- zap_hugepage_range(vma, start, length);
+ unsigned long h_vm_pgoff;
+ unsigned long v_length;
+ unsigned long h_length;
+ unsigned long v_offset;
+
+ h_vm_pgoff = vma->vm_pgoff << (HPAGE_SHIFT - PAGE_SHIFT);
+ v_length = vma->vm_end - vma->vm_start;
+ h_length = v_length >> HPAGE_SHIFT;
+ v_offset = (h_pgoff - h_vm_pgoff) << HPAGE_SHIFT;
+
+ /*
+ * Is this VMA fully outside the truncation point?
+ */
+ if (h_vm_pgoff >= h_pgoff) {
+ zap_hugepage_range(vma, vma->vm_start, v_length);
continue;
}
- length >>= PAGE_SHIFT;
- delta = pgoff = vma->vm_pgoff;
- if (delta >= length)
+ /*
+ * Is this VMA fully inside the truncation point?
+ */
+ if (h_vm_pgoff + (v_length >> HPAGE_SHIFT) <= h_pgoff)
continue;
- start += delta << PAGE_SHIFT;
- length = (length - delta) << PAGE_SHIFT;
- zap_hugepage_range(vma, start, length);
+ /*
+ * The VMA straddles the truncation point. v_offset is the
+ * offset (in bytes) into the VMA where the point lies.
+ */
+ zap_hugepage_range(vma,
+ vma->vm_start + v_offset,
+ v_length - v_offset);
}
}
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-02-02 10:57 ` Andrew Morton
@ 2003-02-02 20:17 ` William Lee Irwin III
0 siblings, 0 replies; 48+ messages in thread
From: William Lee Irwin III @ 2003-02-02 20:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: davem, rohit.seth, davidm, anton, linux-mm
On Sun, Feb 02, 2003 at 02:57:20AM -0800, Andrew Morton wrote:
> 12/4
> Fix hugetlb_vmtruncate_list()
> This function is quite wrong - has an "=" where it should have an "-" and
> confuses PAGE_SIZE and HPAGE_SIZE in its address and file offset arithmetic.
AFAICT the = typo and passing in a pgoff shifted the wrong amount were
the bogons here; maybe there's another one somewhere else.
Heavy-handed but correct.
-- wli
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: hugepage patches
2003-01-31 23:15 hugepage patches Andrew Morton
` (12 preceding siblings ...)
2003-02-02 10:57 ` Andrew Morton
@ 2003-02-02 10:57 ` Andrew Morton
13 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2003-02-02 10:57 UTC (permalink / raw)
To: davem, rohit.seth, davidm, anton, wli, linux-mm
13/4
hugetlb mremap fix
If you attempt to perform a relocating 4k-aligned mremap and the new address
for the map lands on top of a hugepage VMA, do_mremap() will attempt to
perform a 4k-aligned unmap inside the hugetlb VMA. The hugetlb layer goes
BUG.
Fix that by trapping the poorly-aligned unmap attempt in do_munmap().
do_mremap() will then fall through, without having done anything, to the
place where it tests for a hugetlb VMA.
It would be neater to perform these checks on entry to do_mremap(), but that
would incur another VMA lookup.
Also, if you attempt to perform a 4k-aligned and/or sized munmap() inside a
hugepage VMA the same BUG happens. This patch fixes that too.
mmap.c | 5 +++++
1 files changed, 5 insertions(+)
diff -puN mm/mmap.c~hugetlb-mremap-fix mm/mmap.c
--- 25/mm/mmap.c~hugetlb-mremap-fix 2003-02-02 02:53:56.000000000 -0800
+++ 25-akpm/mm/mmap.c 2003-02-02 02:53:56.000000000 -0800
@@ -1227,6 +1227,11 @@ int do_munmap(struct mm_struct *mm, unsi
return 0;
/* we have start < mpnt->vm_end */
+ if (is_vm_hugetlb_page(mpnt)) {
+ if ((start & ~HPAGE_MASK) || (len & ~HPAGE_MASK))
+ return -EINVAL;
+ }
+
/* if it doesn't overlap, we have nothing.. */
end = start + len;
if (mpnt->vm_start >= end)
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 48+ messages in thread