* [PATCH v2 1/5] mm: factor out the order calculation into a new helper
2024-11-12 7:45 [PATCH v2 0/5] Support large folios for tmpfs Baolin Wang
@ 2024-11-12 7:45 ` Baolin Wang
2024-11-12 7:45 ` [PATCH v2 2/5] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap Baolin Wang
` (4 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-11-12 7:45 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Factor out the order calculation into a new helper, which can be reused
by shmem in the following patch.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
include/linux/pagemap.h | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bcf0865a38ae..d796c8a33647 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -727,6 +727,16 @@ typedef unsigned int __bitwise fgf_t;
#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
+static inline unsigned int filemap_get_order(size_t size)
+{
+ unsigned int shift = ilog2(size);
+
+ if (shift <= PAGE_SHIFT)
+ return 0;
+
+ return shift - PAGE_SHIFT;
+}
+
/**
* fgf_set_order - Encode a length in the fgf_t flags.
* @size: The suggested size of the folio to create.
@@ -740,11 +750,11 @@ typedef unsigned int __bitwise fgf_t;
*/
static inline fgf_t fgf_set_order(size_t size)
{
- unsigned int shift = ilog2(size);
+ unsigned int order = filemap_get_order(size);
- if (shift <= PAGE_SHIFT)
+ if (!order)
return 0;
- return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
+ return (__force fgf_t)(order << 26);
}
void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
--
2.39.3
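To see what the new helper computes, here is a minimal userspace sketch (an illustration only, not the kernel code), assuming 4 KiB pages so PAGE_SHIFT == 12 and modelling ilog2() with a shift loop:

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT 12                     /* assumption: 4 KiB pages */

static unsigned int ilog2_sz(size_t size) /* stand-in for the kernel's ilog2() */
{
        unsigned int shift = 0;

        while (size >>= 1)
                shift++;
        return shift;
}

static unsigned int filemap_get_order(size_t size)
{
        unsigned int shift = ilog2_sz(size);

        if (shift <= PAGE_SHIFT)
                return 0;
        return shift - PAGE_SHIFT;
}

int main(void)
{
        printf("%u\n", filemap_get_order(256 * 1024)); /* 256 KiB -> order 6 */
        printf("%u\n", filemap_get_order(4096));       /* one page -> order 0 */
        return 0;
}

fgf_set_order() then simply shifts that order into the flag bits starting at bit 26, so a zero order leaves the fgf_t flags untouched.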
* [PATCH v2 2/5] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap
2024-11-12 7:45 [PATCH v2 0/5] Support large folios for tmpfs Baolin Wang
2024-11-12 7:45 ` [PATCH v2 1/5] mm: factor out the order calculation into a new helper Baolin Wang
@ 2024-11-12 7:45 ` Baolin Wang
2024-11-12 16:03 ` David Hildenbrand
2024-11-12 7:45 ` [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs Baolin Wang
` (3 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-11-12 7:45 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Change shmem_huge_global_enabled() to return the suitable huge
order bitmap, and return 0 if huge pages are not allowed. This is a
preparation for supporting allocation of various huge orders for tmpfs
in the following patches.
No functional changes.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 579e58cb3262..86b2e417dc6f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -549,37 +549,37 @@ static bool shmem_confirm_swap(struct address_space *mapping,
static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
-static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
- loff_t write_end, bool shmem_huge_force,
- unsigned long vm_flags)
+static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
+ loff_t write_end, bool shmem_huge_force,
+ unsigned long vm_flags)
{
loff_t i_size;
if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
- return false;
+ return 0;
if (!S_ISREG(inode->i_mode))
- return false;
+ return 0;
if (shmem_huge == SHMEM_HUGE_DENY)
- return false;
+ return 0;
if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
- return true;
+ return BIT(HPAGE_PMD_ORDER);
switch (SHMEM_SB(inode->i_sb)->huge) {
case SHMEM_HUGE_ALWAYS:
- return true;
+ return BIT(HPAGE_PMD_ORDER);
case SHMEM_HUGE_WITHIN_SIZE:
index = round_up(index + 1, HPAGE_PMD_NR);
i_size = max(write_end, i_size_read(inode));
i_size = round_up(i_size, PAGE_SIZE);
if (i_size >> PAGE_SHIFT >= index)
- return true;
+ return BIT(HPAGE_PMD_ORDER);
fallthrough;
case SHMEM_HUGE_ADVISE:
if (vm_flags & VM_HUGEPAGE)
- return true;
+ return BIT(HPAGE_PMD_ORDER);
fallthrough;
default:
- return false;
+ return 0;
}
}
@@ -774,11 +774,11 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
return 0;
}
-static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
- loff_t write_end, bool shmem_huge_force,
- unsigned long vm_flags)
+static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
+ loff_t write_end, bool shmem_huge_force,
+ unsigned long vm_flags)
{
- return false;
+ return 0;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -1682,21 +1682,21 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
unsigned long mask = READ_ONCE(huge_shmem_orders_always);
unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
unsigned long vm_flags = vma ? vma->vm_flags : 0;
- bool global_huge;
+ unsigned int global_orders;
loff_t i_size;
int order;
if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
return 0;
- global_huge = shmem_huge_global_enabled(inode, index, write_end,
- shmem_huge_force, vm_flags);
+ global_orders = shmem_huge_global_enabled(inode, index, write_end,
+ shmem_huge_force, vm_flags);
if (!vma || !vma_is_anon_shmem(vma)) {
/*
* For tmpfs, we now only support PMD sized THP if huge page
* is enabled, otherwise fallback to order 0.
*/
- return global_huge ? BIT(HPAGE_PMD_ORDER) : 0;
+ return global_orders;
}
/*
@@ -1729,7 +1729,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
if (vm_flags & VM_HUGEPAGE)
mask |= READ_ONCE(huge_shmem_orders_madvise);
- if (global_huge)
+ if (global_orders > 0)
mask |= READ_ONCE(huge_shmem_orders_inherit);
return THP_ORDERS_ALL_FILE_DEFAULT & mask;
--
2.39.3
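For a concrete picture of the new return value, the sketch below (userspace only, assuming 4 KiB pages where HPAGE_PMD_ORDER is 9) shows that the old boolean "true" now becomes a one-bit mask with only the PMD order set, while 0 still means no huge orders are allowed:

#include <stdio.h>

#define HPAGE_PMD_ORDER 9               /* assumption: 2 MiB PMD with 4 KiB pages */
#define BIT(n)          (1U << (n))

int main(void)
{
        unsigned int none = 0;                        /* "not allowed" */
        unsigned int pmd_only = BIT(HPAGE_PMD_ORDER); /* the former "true" */

        printf("none:     0x%x\n", none);             /* 0x0 */
        printf("pmd only: 0x%x\n", pmd_only);         /* 0x200 */
        /* Later patches can OR further orders into this mask. */
        return 0;
}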
* Re: [PATCH v2 2/5] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap
2024-11-12 7:45 ` [PATCH v2 2/5] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap Baolin Wang
@ 2024-11-12 16:03 ` David Hildenbrand
0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-11-12 16:03 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, linux-mm, linux-kernel
On 12.11.24 08:45, Baolin Wang wrote:
> Change the shmem_huge_global_enabled() to return the suitable huge
> order bitmap, and return 0 if huge pages are not allowed. This is a
> preparation for supporting various huge orders allocation of tmpfs
> in the following patches.
>
> No functional changes.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs
2024-11-12 7:45 [PATCH v2 0/5] Support large folios for tmpfs Baolin Wang
2024-11-12 7:45 ` [PATCH v2 1/5] mm: factor out the order calculation into a new helper Baolin Wang
2024-11-12 7:45 ` [PATCH v2 2/5] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap Baolin Wang
@ 2024-11-12 7:45 ` Baolin Wang
2024-11-12 16:19 ` David Hildenbrand
2024-11-13 6:53 ` [PATCH] mm: shmem: add large folio support for tmpfs fix Baolin Wang
2024-11-12 7:45 ` [PATCH v2 4/5] mm: shmem: add a kernel command line to change the default huge policy for tmpfs Baolin Wang
` (2 subsequent siblings)
5 siblings, 2 replies; 23+ messages in thread
From: Baolin Wang @ 2024-11-12 7:45 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Add large folio support for tmpfs write and fallocate paths matching the
same high order preference mechanism used in the iomap buffered IO path
as used in __filemap_get_folio().
Add shmem_mapping_size_orders() to get a hint for the orders of the folio
based on the file size which takes care of the mapping requirements.
Traditionally, tmpfs only supported PMD-sized huge folios. However, nowadays,
with other file systems supporting any sized large folios and anonymous memory
being extended to support mTHP, we should not restrict tmpfs to allocating only
PMD-sized huge folios, making it more special. Instead, we should allow tmpfs
to allocate large folios of any size.
Considering that tmpfs already has the 'huge=' option to control the huge
folios allocation, we can extend the 'huge=' option to allow any sized huge
folios. The semantics of the 'huge=' mount option are:
huge=never: no huge folios of any size
huge=always: huge folios of any size
huge=within_size: like 'always' but respect the i_size
huge=advise: like 'always' if requested with fadvise()/madvise()
Note: for tmpfs mmap() faults, due to the lack of a write size hint, PMD-sized
huge folios are still allocated if huge=always/within_size/advise is set.
Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
semantics: 'deny' disables large folios of any size for tmpfs, while
'force' enables PMD-sized large folios for tmpfs.
Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 91 +++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 77 insertions(+), 14 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 86b2e417dc6f..a3203cf8860f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -549,10 +549,50 @@ static bool shmem_confirm_swap(struct address_space *mapping,
static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
+/**
+ * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @write_end: end of a write, could extend inode size.
+ *
+ * This returns huge orders for folios (when supported) based on the file size
+ * which the mapping currently allows at the given index. The index is relevant
+ * due to alignment considerations the mapping might have. The returned order
+ * may be less than the size passed.
+ *
+ * Return: The orders.
+ */
+static inline unsigned int
+shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
+{
+ unsigned int order;
+ size_t size;
+
+ if (!mapping_large_folio_support(mapping) || !write_end)
+ return 0;
+
+ /* Calculate the write size based on the write_end */
+ size = write_end - (index << PAGE_SHIFT);
+ order = filemap_get_order(size);
+ if (!order)
+ return 0;
+
+ /* If we're not aligned, allocate a smaller folio */
+ if (index & ((1UL << order) - 1))
+ order = __ffs(index);
+
+ order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
+ return order > 0 ? BIT(order + 1) - 1 : 0;
+}
+
static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
loff_t write_end, bool shmem_huge_force,
+ struct vm_area_struct *vma,
unsigned long vm_flags)
{
+ unsigned long within_size_orders;
+ unsigned int order;
+ pgoff_t aligned_index;
loff_t i_size;
if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
@@ -564,15 +604,41 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
return BIT(HPAGE_PMD_ORDER);
+ /*
+ * The huge order allocation for anon shmem is controlled through
+ * the mTHP interface, so we still use PMD-sized huge order to
+ * check whether global control is enabled.
+ *
+ * For tmpfs mmap()'s huge order, we still use PMD-sized order to
+ * allocate huge pages due to lack of a write size hint.
+ *
+ * Otherwise, tmpfs will allow getting a highest order hint based on
+ * the size of write and fallocate paths, then will try each allowable
+ * huge orders.
+ */
switch (SHMEM_SB(inode->i_sb)->huge) {
case SHMEM_HUGE_ALWAYS:
- return BIT(HPAGE_PMD_ORDER);
- case SHMEM_HUGE_WITHIN_SIZE:
- index = round_up(index + 1, HPAGE_PMD_NR);
- i_size = max(write_end, i_size_read(inode));
- i_size = round_up(i_size, PAGE_SIZE);
- if (i_size >> PAGE_SHIFT >= index)
+ if (vma)
return BIT(HPAGE_PMD_ORDER);
+
+ return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
+ case SHMEM_HUGE_WITHIN_SIZE:
+ if (vma)
+ within_size_orders = BIT(HPAGE_PMD_ORDER);
+ else
+ within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
+ index, write_end);
+
+ order = highest_order(within_size_orders);
+ while (within_size_orders) {
+ aligned_index = round_up(index + 1, 1 << order);
+ i_size = max(write_end, i_size_read(inode));
+ i_size = round_up(i_size, PAGE_SIZE);
+ if (i_size >> PAGE_SHIFT >= aligned_index)
+ return within_size_orders;
+
+ order = next_order(&within_size_orders, order);
+ }
fallthrough;
case SHMEM_HUGE_ADVISE:
if (vm_flags & VM_HUGEPAGE)
@@ -776,6 +842,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
loff_t write_end, bool shmem_huge_force,
+ struct vm_area_struct *vma,
unsigned long vm_flags)
{
return 0;
@@ -1173,7 +1240,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
generic_fillattr(idmap, request_mask, inode, stat);
inode_unlock_shared(inode);
- if (shmem_huge_global_enabled(inode, 0, 0, false, 0))
+ if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
stat->blksize = HPAGE_PMD_SIZE;
if (request_mask & STATX_BTIME) {
@@ -1690,14 +1757,10 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
return 0;
global_orders = shmem_huge_global_enabled(inode, index, write_end,
- shmem_huge_force, vm_flags);
- if (!vma || !vma_is_anon_shmem(vma)) {
- /*
- * For tmpfs, we now only support PMD sized THP if huge page
- * is enabled, otherwise fallback to order 0.
- */
+ shmem_huge_force, vma, vm_flags);
+ /* Tmpfs huge pages allocation */
+ if (!vma || !vma_is_anon_shmem(vma))
return global_orders;
- }
/*
* Following the 'deny' semantics of the top level, force the huge
--
2.39.3
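The following userspace sketch (illustrative only; it assumes 4 KiB pages and MAX_PAGECACHE_ORDER == 8, and skips the mapping_large_folio_support() and zero write_end checks) walks through the order-mask calculation that shmem_mapping_size_orders() performs for a couple of writes:

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT           12   /* assumption: 4 KiB pages */
#define MAX_PAGECACHE_ORDER  8    /* assumption */
#define BIT(n)               (1UL << (n))

static unsigned int filemap_get_order(size_t size)
{
        unsigned int shift = 0;

        while (size >>= 1)
                shift++;
        return shift > PAGE_SHIFT ? shift - PAGE_SHIFT : 0;
}

static unsigned long size_orders(unsigned long index, long long write_end)
{
        size_t size = write_end - ((long long)index << PAGE_SHIFT);
        unsigned int order = filemap_get_order(size);

        if (!order)
                return 0;
        if (index & ((1UL << order) - 1))       /* misaligned index */
                order = __builtin_ctzl(index);  /* __ffs() equivalent */
        if (order > MAX_PAGECACHE_ORDER)
                order = MAX_PAGECACHE_ORDER;
        return order > 0 ? BIT(order + 1) - 1 : 0;
}

int main(void)
{
        /* 1 MiB write at index 0: orders 0..8 allowed -> mask 0x1ff */
        printf("0x%lx\n", size_orders(0, 1 << 20));
        /* Same size starting at page index 4: capped at order 2 -> 0x7 */
        printf("0x%lx\n", size_orders(4, (4UL << PAGE_SHIFT) + (1 << 20)));
        return 0;
}

With that mask in hand, a tmpfs mounted with huge=always can try each allowed order in the write and fallocate paths instead of only the PMD-sized one, as described in the commit message.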
* Re: [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs
2024-11-12 7:45 ` [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs Baolin Wang
@ 2024-11-12 16:19 ` David Hildenbrand
2024-11-12 16:21 ` David Hildenbrand
2024-11-13 3:07 ` Baolin Wang
2024-11-13 6:53 ` [PATCH] mm: shmem: add large folio support for tmpfs fix Baolin Wang
1 sibling, 2 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-11-12 16:19 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, linux-mm, linux-kernel
On 12.11.24 08:45, Baolin Wang wrote:
> Add large folio support for tmpfs write and fallocate paths matching the
> same high order preference mechanism used in the iomap buffered IO path
> as used in __filemap_get_folio().
>
> Add shmem_mapping_size_orders() to get a hint for the orders of the folio
> based on the file size which takes care of the mapping requirements.
>
> Traditionally, tmpfs only supported PMD-sized huge folios. However nowadays
> with other file systems supporting any sized large folios, and extending
> anonymous to support mTHP, we should not restrict tmpfs to allocating only
> PMD-sized huge folios, making it more special. Instead, we should allow
> tmpfs can allocate any sized large folios.
>
> Considering that tmpfs already has the 'huge=' option to control the huge
> folios allocation, we can extend the 'huge=' option to allow any sized huge
> folios. The semantics of the 'huge=' mount option are:
>
> huge=never: no any sized huge folios
> huge=always: any sized huge folios
> huge=within_size: like 'always' but respect the i_size
> huge=advise: like 'always' if requested with fadvise()/madvise()
>
> Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
> allocate the PMD-sized huge folios if huge=always/within_size/advise is set.
>
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
> semantics. The 'deny' can disable any sized large folios for tmpfs, while
> the 'force' can enable PMD sized large folios for tmpfs.
>
> Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/shmem.c | 91 +++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 77 insertions(+), 14 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 86b2e417dc6f..a3203cf8860f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -549,10 +549,50 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>
> static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>
> +/**
> + * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
> + * @mapping: Target address_space.
> + * @index: The page index.
> + * @write_end: end of a write, could extend inode size.
> + *
> + * This returns huge orders for folios (when supported) based on the file size
> + * which the mapping currently allows at the given index. The index is relevant
> + * due to alignment considerations the mapping might have. The returned order
> + * may be less than the size passed.
> + *
> + * Return: The orders.
> + */
> +static inline unsigned int
> +shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
> +{
> + unsigned int order;
> + size_t size;
> +
> + if (!mapping_large_folio_support(mapping) || !write_end)
> + return 0;
> +
> + /* Calculate the write size based on the write_end */
> + size = write_end - (index << PAGE_SHIFT);
> + order = filemap_get_order(size);
> + if (!order)
> + return 0;
> +
> + /* If we're not aligned, allocate a smaller folio */
> + if (index & ((1UL << order) - 1))
> + order = __ffs(index);
> +
> + order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
> + return order > 0 ? BIT(order + 1) - 1 : 0;
> +}
> +
> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
> loff_t write_end, bool shmem_huge_force,
> + struct vm_area_struct *vma,
> unsigned long vm_flags)
> {
> + unsigned long within_size_orders;
> + unsigned int order;
> + pgoff_t aligned_index;
> loff_t i_size;
>
> if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
We can allow all orders up to MAX_PAGECACHE_ORDER,
shmem_mapping_size_orders() handles it properly.
So maybe we should drop this condition and use instead below where we have
return BIT(HPAGE_PMD_ORDER);
instead something like.
return HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ? 0 : BIT(HPAGE_PMD_ORDER);
Ideally, factoring it out somehow
int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ? 0 :
BIT(HPAGE_PMD_ORDER);
...
return maybe_pmd_order;
> @@ -564,15 +604,41 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
> if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
> return BIT(HPAGE_PMD_ORDER);
Why not force-enable all orders (of course, respecting
MAX_PAGECACHE_ORDER and possibly VMA)?
>
> + /*
> + * The huge order allocation for anon shmem is controlled through
> + * the mTHP interface, so we still use PMD-sized huge order to
> + * check whether global control is enabled.
> + *
> + * For tmpfs mmap()'s huge order, we still use PMD-sized order to
> + * allocate huge pages due to lack of a write size hint.
> + *
> + * Otherwise, tmpfs will allow getting a highest order hint based on
> + * the size of write and fallocate paths, then will try each allowable
> + * huge orders.
> + */
> switch (SHMEM_SB(inode->i_sb)->huge) {
> case SHMEM_HUGE_ALWAYS:
> - return BIT(HPAGE_PMD_ORDER);
> - case SHMEM_HUGE_WITHIN_SIZE:
> - index = round_up(index + 1, HPAGE_PMD_NR);
> - i_size = max(write_end, i_size_read(inode));
> - i_size = round_up(i_size, PAGE_SIZE);
> - if (i_size >> PAGE_SHIFT >= index)
> + if (vma)
> return BIT(HPAGE_PMD_ORDER);
> +
> + return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
> + case SHMEM_HUGE_WITHIN_SIZE:
> + if (vma)
> + within_size_orders = BIT(HPAGE_PMD_ORDER);
> + else
> + within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
> + index, write_end);
> +
> + order = highest_order(within_size_orders);
> + while (within_size_orders) {
> + aligned_index = round_up(index + 1, 1 << order);
> + i_size = max(write_end, i_size_read(inode));
> + i_size = round_up(i_size, PAGE_SIZE);
> + if (i_size >> PAGE_SHIFT >= aligned_index)
> + return within_size_orders;
> +
> + order = next_order(&within_size_orders, order);
> + }
> fallthrough;
> case SHMEM_HUGE_ADVISE:
> if (vm_flags & VM_HUGEPAGE)
I think the point here is that "write" -> no VMA -> vm_flags == 0 -> no
code changes needed :)
--
Cheers,
David / dhildenb
* Re: [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs
2024-11-12 16:19 ` David Hildenbrand
@ 2024-11-12 16:21 ` David Hildenbrand
2024-11-13 3:07 ` Baolin Wang
1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-11-12 16:21 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, linux-mm, linux-kernel
On 12.11.24 17:19, David Hildenbrand wrote:
> On 12.11.24 08:45, Baolin Wang wrote:
>> Add large folio support for tmpfs write and fallocate paths matching the
>> same high order preference mechanism used in the iomap buffered IO path
>> as used in __filemap_get_folio().
>>
>> Add shmem_mapping_size_orders() to get a hint for the orders of the folio
>> based on the file size which takes care of the mapping requirements.
>>
>> Traditionally, tmpfs only supported PMD-sized huge folios. However nowadays
>> with other file systems supporting any sized large folios, and extending
>> anonymous to support mTHP, we should not restrict tmpfs to allocating only
>> PMD-sized huge folios, making it more special. Instead, we should allow
>> tmpfs can allocate any sized large folios.
>>
>> Considering that tmpfs already has the 'huge=' option to control the huge
>> folios allocation, we can extend the 'huge=' option to allow any sized huge
>> folios. The semantics of the 'huge=' mount option are:
>>
>> huge=never: no any sized huge folios
>> huge=always: any sized huge folios
>> huge=within_size: like 'always' but respect the i_size
>> huge=advise: like 'always' if requested with fadvise()/madvise()
>>
>> Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
>> allocate the PMD-sized huge folios if huge=always/within_size/advise is set.
>>
>> Moreover, the 'deny' and 'force' testing options controlled by
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
>> semantics. The 'deny' can disable any sized large folios for tmpfs, while
>> the 'force' can enable PMD sized large folios for tmpfs.
>>
>> Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
>> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/shmem.c | 91 +++++++++++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 77 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 86b2e417dc6f..a3203cf8860f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -549,10 +549,50 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>>
>> static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>>
>> +/**
>> + * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
>> + * @mapping: Target address_space.
>> + * @index: The page index.
>> + * @write_end: end of a write, could extend inode size.
>> + *
>> + * This returns huge orders for folios (when supported) based on the file size
>> + * which the mapping currently allows at the given index. The index is relevant
>> + * due to alignment considerations the mapping might have. The returned order
>> + * may be less than the size passed.
>> + *
>> + * Return: The orders.
>> + */
>> +static inline unsigned int
>> +shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
>> +{
>> + unsigned int order;
>> + size_t size;
>> +
>> + if (!mapping_large_folio_support(mapping) || !write_end)
>> + return 0;
>> +
>> + /* Calculate the write size based on the write_end */
>> + size = write_end - (index << PAGE_SHIFT);
>> + order = filemap_get_order(size);
>> + if (!order)
>> + return 0;
>> +
>> + /* If we're not aligned, allocate a smaller folio */
>> + if (index & ((1UL << order) - 1))
>> + order = __ffs(index);
>> +
>> + order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>> + return order > 0 ? BIT(order + 1) - 1 : 0;
>> +}
>> +
>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>> loff_t write_end, bool shmem_huge_force,
>> + struct vm_area_struct *vma,
>> unsigned long vm_flags)
>> {
>> + unsigned long within_size_orders;
>> + unsigned int order;
>> + pgoff_t aligned_index;
>> loff_t i_size;
>>
>> if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>
> We can allow all orders up to MAX_PAGECACHE_ORDER,
> shmem_mapping_size_orders() handles it properly.
>
> So maybe we should drop this condition and use instead below where we have
>
> return BIT(HPAGE_PMD_ORDER);
>
> instead something like.
>
> return HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ? 0 : BIT(HPAGE_PMD_ORDER);
>
> Ideally, factoring it out somehow
>
>
> int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ? 0 :
> BIT(HPAGE_PMD_ORDER);
>
> ...
>
> return maybe_pmd_order;
>
>> @@ -564,15 +604,41 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>> if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>> return BIT(HPAGE_PMD_ORDER);
>
> Why not force-enable all orders (of course, respecting
> MAX_PAGECACHE_ORDER and possibly VMA)?
>
>>
>> + /*
>> + * The huge order allocation for anon shmem is controlled through
>> + * the mTHP interface, so we still use PMD-sized huge order to
>> + * check whether global control is enabled.
>> + *
>> + * For tmpfs mmap()'s huge order, we still use PMD-sized order to
>> + * allocate huge pages due to lack of a write size hint.
>> + *
>> + * Otherwise, tmpfs will allow getting a highest order hint based on
>> + * the size of write and fallocate paths, then will try each allowable
>> + * huge orders.
>> + */
>> switch (SHMEM_SB(inode->i_sb)->huge) {
>> case SHMEM_HUGE_ALWAYS:
>> - return BIT(HPAGE_PMD_ORDER);
>> - case SHMEM_HUGE_WITHIN_SIZE:
>> - index = round_up(index + 1, HPAGE_PMD_NR);
>> - i_size = max(write_end, i_size_read(inode));
>> - i_size = round_up(i_size, PAGE_SIZE);
>> - if (i_size >> PAGE_SHIFT >= index)
>> + if (vma)
>> return BIT(HPAGE_PMD_ORDER);
>> +
>> + return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
>> + case SHMEM_HUGE_WITHIN_SIZE:
>> + if (vma)
>> + within_size_orders = BIT(HPAGE_PMD_ORDER);
>> + else
>> + within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
>> + index, write_end);
>> +
>> + order = highest_order(within_size_orders);
>> + while (within_size_orders) {
>> + aligned_index = round_up(index + 1, 1 << order);
>> + i_size = max(write_end, i_size_read(inode));
>> + i_size = round_up(i_size, PAGE_SIZE);
>> + if (i_size >> PAGE_SHIFT >= aligned_index)
>> + return within_size_orders;
>> +
>> + order = next_order(&within_size_orders, order);
>> + }
>> fallthrough;
>> case SHMEM_HUGE_ADVISE:
>> if (vm_flags & VM_HUGEPAGE)
>
> I think the point here is that "write" -> no VMA -> vm_flags == 0 -> no
> code changes needed :)
... and now I wonder about documented "fadvise", because this here is
only concerned with madvise?
--
Cheers,
David / dhildenb
* Re: [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs
2024-11-12 16:19 ` David Hildenbrand
2024-11-12 16:21 ` David Hildenbrand
@ 2024-11-13 3:07 ` Baolin Wang
2024-11-15 13:48 ` David Hildenbrand
1 sibling, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-11-13 3:07 UTC (permalink / raw)
To: David Hildenbrand, akpm, hughd
Cc: willy, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, linux-mm, linux-kernel
On 2024/11/13 00:19, David Hildenbrand wrote:
> On 12.11.24 08:45, Baolin Wang wrote:
>> Add large folio support for tmpfs write and fallocate paths matching the
>> same high order preference mechanism used in the iomap buffered IO path
>> as used in __filemap_get_folio().
>>
>> Add shmem_mapping_size_orders() to get a hint for the orders of the folio
>> based on the file size which takes care of the mapping requirements.
>>
>> Traditionally, tmpfs only supported PMD-sized huge folios. However
>> nowadays
>> with other file systems supporting any sized large folios, and extending
>> anonymous to support mTHP, we should not restrict tmpfs to allocating
>> only
>> PMD-sized huge folios, making it more special. Instead, we should allow
>> tmpfs can allocate any sized large folios.
>>
>> Considering that tmpfs already has the 'huge=' option to control the huge
>> folios allocation, we can extend the 'huge=' option to allow any sized
>> huge
>> folios. The semantics of the 'huge=' mount option are:
>>
>> huge=never: no any sized huge folios
>> huge=always: any sized huge folios
>> huge=within_size: like 'always' but respect the i_size
>> huge=advise: like 'always' if requested with fadvise()/madvise()
>>
>> Note: for tmpfs mmap() faults, due to the lack of a write size hint,
>> still
>> allocate the PMD-sized huge folios if huge=always/within_size/advise
>> is set.
>>
>> Moreover, the 'deny' and 'force' testing options controlled by
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the
>> same
>> semantics. The 'deny' can disable any sized large folios for tmpfs, while
>> the 'force' can enable PMD sized large folios for tmpfs.
>>
>> Co-developed-by: Daniel Gomez <da.gomez@samsung.com>
>> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/shmem.c | 91 +++++++++++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 77 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 86b2e417dc6f..a3203cf8860f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -549,10 +549,50 @@ static bool shmem_confirm_swap(struct
>> address_space *mapping,
>> static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>> +/**
>> + * shmem_mapping_size_orders - Get allowable folio orders for the
>> given file size.
>> + * @mapping: Target address_space.
>> + * @index: The page index.
>> + * @write_end: end of a write, could extend inode size.
>> + *
>> + * This returns huge orders for folios (when supported) based on the
>> file size
>> + * which the mapping currently allows at the given index. The index
>> is relevant
>> + * due to alignment considerations the mapping might have. The
>> returned order
>> + * may be less than the size passed.
>> + *
>> + * Return: The orders.
>> + */
>> +static inline unsigned int
>> +shmem_mapping_size_orders(struct address_space *mapping, pgoff_t
>> index, loff_t write_end)
>> +{
>> + unsigned int order;
>> + size_t size;
>> +
>> + if (!mapping_large_folio_support(mapping) || !write_end)
>> + return 0;
>> +
>> + /* Calculate the write size based on the write_end */
>> + size = write_end - (index << PAGE_SHIFT);
>> + order = filemap_get_order(size);
>> + if (!order)
>> + return 0;
>> +
>> + /* If we're not aligned, allocate a smaller folio */
>> + if (index & ((1UL << order) - 1))
>> + order = __ffs(index);
>> +
>> + order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>> + return order > 0 ? BIT(order + 1) - 1 : 0;
>> +}
>> +
>> static unsigned int shmem_huge_global_enabled(struct inode *inode,
>> pgoff_t index,
>> loff_t write_end, bool shmem_huge_force,
>> + struct vm_area_struct *vma,
>> unsigned long vm_flags)
>> {
>> + unsigned long within_size_orders;
>> + unsigned int order;
>> + pgoff_t aligned_index;
>> loff_t i_size;
>> if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>
> We can allow all orders up to MAX_PAGECACHE_ORDER,
> shmem_mapping_size_orders() handles it properly.
>
> So maybe we should drop this condition and use instead below where we have
>
> return BIT(HPAGE_PMD_ORDER);
>
> instead something like.
>
> return HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ? 0 : BIT(HPAGE_PMD_ORDER);
>
> Ideally, factoring it out somehow
>
>
> int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ? 0 :
> BIT(HPAGE_PMD_ORDER);
>
> ...
>
> return maybe_pmd_order;
Good point. Will do.
>
>> @@ -564,15 +604,41 @@ static unsigned int
>> shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>> if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>> return BIT(HPAGE_PMD_ORDER);
>
> Why not force-enable all orders (of course, respecting
> MAX_PAGECACHE_ORDER and possibly VMA)?
The ‘force’ option will affect the tmpfs mmap()'s huge allocation, which
I intend to handle in a separate patch as we discussed. Additionally,
for the huge page allocation of tmpfs mmap(), I am also considering the
readahead approach for the pagecache.
>> + /*
>> + * The huge order allocation for anon shmem is controlled through
>> + * the mTHP interface, so we still use PMD-sized huge order to
>> + * check whether global control is enabled.
>> + *
>> + * For tmpfs mmap()'s huge order, we still use PMD-sized order to
>> + * allocate huge pages due to lack of a write size hint.
>> + *
>> + * Otherwise, tmpfs will allow getting a highest order hint based on
>> + * the size of write and fallocate paths, then will try each
>> allowable
>> + * huge orders.
>> + */
>> switch (SHMEM_SB(inode->i_sb)->huge) {
>> case SHMEM_HUGE_ALWAYS:
>> - return BIT(HPAGE_PMD_ORDER);
>> - case SHMEM_HUGE_WITHIN_SIZE:
>> - index = round_up(index + 1, HPAGE_PMD_NR);
>> - i_size = max(write_end, i_size_read(inode));
>> - i_size = round_up(i_size, PAGE_SIZE);
>> - if (i_size >> PAGE_SHIFT >= index)
>> + if (vma)
>> return BIT(HPAGE_PMD_ORDER);
>> +
>> + return shmem_mapping_size_orders(inode->i_mapping, index,
>> write_end);
>> + case SHMEM_HUGE_WITHIN_SIZE:
>> + if (vma)
>> + within_size_orders = BIT(HPAGE_PMD_ORDER);
>> + else
>> + within_size_orders =
>> shmem_mapping_size_orders(inode->i_mapping,
>> + index, write_end);
>> +
>> + order = highest_order(within_size_orders);
>> + while (within_size_orders) {
>> + aligned_index = round_up(index + 1, 1 << order);
>> + i_size = max(write_end, i_size_read(inode));
>> + i_size = round_up(i_size, PAGE_SIZE);
>> + if (i_size >> PAGE_SHIFT >= aligned_index)
>> + return within_size_orders;
>> +
>> + order = next_order(&within_size_orders, order);
>> + }
>> fallthrough;
>> case SHMEM_HUGE_ADVISE:
>> if (vm_flags & VM_HUGEPAGE)
>
> I think the point here is that "write" -> no VMA -> vm_flags == 0 -> no
> code changes needed :)
Yes. Currently, fadvise() has no HUGEPAGE handling, so I will drop
the 'fadvise' from the doc.
* Re: [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs
2024-11-13 3:07 ` Baolin Wang
@ 2024-11-15 13:48 ` David Hildenbrand
0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-11-15 13:48 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, linux-mm, linux-kernel
>>> return BIT(HPAGE_PMD_ORDER);
>>
>> Why not force-enable all orders (of course, respecting
>> MAX_PAGECACHE_ORDER and possibly VMA)?
>
> The ‘force’ option will affect the tmpfs mmap()'s huge allocation, which
> I intend to handle in a separate patch as we discussed. Additionally,
> for the huge page allocation of tmpfs mmap(), I am also considering the
> readahead approach for the pagecache.
Okay, we can change this later. Likely force/deny are a blast from the
past either way.
[...]
>>> +
>>> + order = highest_order(within_size_orders);
>>> + while (within_size_orders) {
>>> + aligned_index = round_up(index + 1, 1 << order);
>>> + i_size = max(write_end, i_size_read(inode));
>>> + i_size = round_up(i_size, PAGE_SIZE);
>>> + if (i_size >> PAGE_SHIFT >= aligned_index)
>>> + return within_size_orders;
>>> +
>>> + order = next_order(&within_size_orders, order);
>>> + }
>>> fallthrough;
>>> case SHMEM_HUGE_ADVISE:
>>> if (vm_flags & VM_HUGEPAGE)
>>
>> I think the point here is that "write" -> no VMA -> vm_flags == 0 -> no
>> code changes needed :)
>
> Yes. Currently the fadvise() have no HUGEPAGE handling, so I will drop
> the 'fadvise' in the doc.
Interesting that we documented it :)
--
Cheers,
David / dhildenb
* [PATCH] mm: shmem: add large folio support for tmpfs fix
2024-11-12 7:45 ` [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs Baolin Wang
2024-11-12 16:19 ` David Hildenbrand
@ 2024-11-13 6:53 ` Baolin Wang
1 sibling, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-11-13 6:53 UTC (permalink / raw)
To: baolin.wang
Cc: 21cnbao, akpm, da.gomez, david, hughd, ioworker0, linux-kernel,
linux-mm, ryan.roberts, wangkefeng.wang, willy
As David suggested, "We can allow all orders up to MAX_PAGECACHE_ORDER,
since shmem_mapping_size_orders() handles it properly", so we can drop
the 'MAX_PAGECACHE_ORDER' condition.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index a3203cf8860f..d54b24d65193 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -590,19 +590,19 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
struct vm_area_struct *vma,
unsigned long vm_flags)
{
+ unsigned int maybe_pmd_order = HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER ?
+ 0 : BIT(HPAGE_PMD_ORDER);
unsigned long within_size_orders;
unsigned int order;
pgoff_t aligned_index;
loff_t i_size;
- if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
- return 0;
if (!S_ISREG(inode->i_mode))
return 0;
if (shmem_huge == SHMEM_HUGE_DENY)
return 0;
if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
- return BIT(HPAGE_PMD_ORDER);
+ return maybe_pmd_order;
/*
* The huge order allocation for anon shmem is controlled through
@@ -619,12 +619,12 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
switch (SHMEM_SB(inode->i_sb)->huge) {
case SHMEM_HUGE_ALWAYS:
if (vma)
- return BIT(HPAGE_PMD_ORDER);
+ return maybe_pmd_order;
return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
case SHMEM_HUGE_WITHIN_SIZE:
if (vma)
- within_size_orders = BIT(HPAGE_PMD_ORDER);
+ within_size_orders = maybe_pmd_order;
else
within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
index, write_end);
@@ -642,7 +642,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
fallthrough;
case SHMEM_HUGE_ADVISE:
if (vm_flags & VM_HUGEPAGE)
- return BIT(HPAGE_PMD_ORDER);
+ return maybe_pmd_order;
fallthrough;
default:
return 0;
--
2.39.3
* [PATCH v2 4/5] mm: shmem: add a kernel command line to change the default huge policy for tmpfs
2024-11-12 7:45 [PATCH v2 0/5] Support large folios for tmpfs Baolin Wang
` (2 preceding siblings ...)
2024-11-12 7:45 ` [PATCH v2 3/5] mm: shmem: add large folio support for tmpfs Baolin Wang
@ 2024-11-12 7:45 ` Baolin Wang
2024-11-12 7:45 ` [PATCH v2 5/5] docs: tmpfs: update the huge folios policy for tmpfs and shmem Baolin Wang
5 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-11-12 7:45 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
Now tmpfs can allocate large folios of any size, but the default huge
policy is still 'never'. Thus, add a new kernel command line parameter to
change the default huge policy, which makes it easier to use large folios
for tmpfs; this is similar to the 'transparent_hugepage_shmem' cmdline for shmem.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
.../admin-guide/kernel-parameters.txt | 7 ++++++
Documentation/admin-guide/mm/transhuge.rst | 6 +++++
mm/shmem.c | 23 ++++++++++++++++++-
3 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b48d744d99b0..007e6cfada3e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6943,6 +6943,13 @@
See Documentation/admin-guide/mm/transhuge.rst
for more details.
+ transparent_hugepage_tmpfs= [KNL]
+ Format: [always|within_size|advise|never]
+ Can be used to control the default hugepage allocation policy
+ for the tmpfs mount.
+ See Documentation/admin-guide/mm/transhuge.rst
+ for more details.
+
trusted.source= [KEYS]
Format: <string>
This parameter identifies the trust source as a backend
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5034915f4e8e..9ae775eaacbe 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -332,6 +332,12 @@ allocation policy for the internal shmem mount by using the kernel parameter
seven valid policies for shmem (``always``, ``within_size``, ``advise``,
``never``, ``deny``, and ``force``).
+Similarly to ``transparent_hugepage_shmem``, you can control the default
+hugepage allocation policy for the tmpfs mount by using the kernel parameter
+``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the
+four valid policies for tmpfs (``always``, ``within_size``, ``advise``,
+``never``). The tmpfs mount default policy is ``never``.
+
In the same manner as ``thp_anon`` controls each supported anonymous THP
size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem``
has the same format as ``thp_anon``, but also supports the policy
diff --git a/mm/shmem.c b/mm/shmem.c
index a3203cf8860f..021760e91cea 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -548,6 +548,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
/* ifdef here to avoid bloating shmem.o when not necessary */
static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
+static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
/**
* shmem_mapping_size_orders - Get allowable folio orders for the given file size.
@@ -4780,7 +4781,12 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
sbinfo->gid = ctx->gid;
sbinfo->full_inums = ctx->full_inums;
sbinfo->mode = ctx->mode;
- sbinfo->huge = ctx->huge;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (ctx->seen & SHMEM_SEEN_HUGE)
+ sbinfo->huge = ctx->huge;
+ else
+ sbinfo->huge = tmpfs_huge;
+#endif
sbinfo->mpol = ctx->mpol;
ctx->mpol = NULL;
@@ -5259,6 +5265,21 @@ static int __init setup_transparent_hugepage_shmem(char *str)
}
__setup("transparent_hugepage_shmem=", setup_transparent_hugepage_shmem);
+static int __init setup_transparent_hugepage_tmpfs(char *str)
+{
+ int huge;
+
+ huge = shmem_parse_huge(str);
+ if (huge < 0) {
+ pr_warn("transparent_hugepage_tmpfs= cannot parse, ignored\n");
+ return huge;
+ }
+
+ tmpfs_huge = huge;
+ return 1;
+}
+__setup("transparent_hugepage_tmpfs=", setup_transparent_hugepage_tmpfs);
+
static char str_dup[PAGE_SIZE] __initdata;
static int __init setup_thp_shmem(char *str)
{
--
2.39.3
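As an illustration of what the new boot parameter does, here is a small userspace sketch of the string-to-policy parsing a handler like setup_transparent_hugepage_tmpfs() relies on; the enum values and helper name below are placeholders, since shmem_parse_huge() in mm/shmem.c is the authoritative implementation:

#include <stdio.h>
#include <string.h>

/* Placeholder values; the kernel uses the SHMEM_HUGE_* constants. */
enum { HUGE_NEVER, HUGE_ALWAYS, HUGE_WITHIN_SIZE, HUGE_ADVISE, HUGE_INVALID = -1 };

static int parse_tmpfs_huge(const char *str)
{
        if (!strcmp(str, "never"))
                return HUGE_NEVER;
        if (!strcmp(str, "always"))
                return HUGE_ALWAYS;
        if (!strcmp(str, "within_size"))
                return HUGE_WITHIN_SIZE;
        if (!strcmp(str, "advise"))
                return HUGE_ADVISE;
        return HUGE_INVALID;    /* unknown strings are rejected in this sketch */
}

int main(void)
{
        /* Booting with transparent_hugepage_tmpfs=within_size would select: */
        printf("%d\n", parse_tmpfs_huge("within_size"));
        return 0;
}

Mounts that pass an explicit huge= option still win: as the diff shows, sbinfo->huge only falls back to tmpfs_huge when SHMEM_SEEN_HUGE is not set.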
* [PATCH v2 5/5] docs: tmpfs: update the huge folios policy for tmpfs and shmem
2024-11-12 7:45 [PATCH v2 0/5] Support large folios for tmpfs Baolin Wang
` (3 preceding siblings ...)
2024-11-12 7:45 ` [PATCH v2 4/5] mm: shmem: add a kernel command line to change the default huge policy for tmpfs Baolin Wang
@ 2024-11-12 7:45 ` Baolin Wang
2024-11-13 6:57 ` [PATCH] docs: tmpfs: update the huge folios policy for tmpfs and shmem fix Baolin Wang
[not found] ` <CGME20241115131634eucas1p2db22b75fcc768a4bb6aa47ee180110cc@eucas1p2.samsung.com>
5 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-11-12 7:45 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
From: David Hildenbrand <david@redhat.com>
Update the huge folios policy for tmpfs and shmem.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Documentation/admin-guide/mm/transhuge.rst | 58 +++++++++++++++-------
1 file changed, 41 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 9ae775eaacbe..ba6edff728ed 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -358,8 +358,21 @@ default to ``never``.
Hugepages in tmpfs/shmem
========================
-You can control hugepage allocation policy in tmpfs with mount option
-``huge=``. It can have following values:
+Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
+it also supports smaller sizes just like anonymous memory, often referred
+to as "multi-size THP" (mTHP). Huge pages of any size are commonly
+represented in the kernel as "large folios".
+
+While there is fine control over the huge page sizes to use for the internal
+shmem mount (see below), ordinary tmpfs mounts will make use of all available
+huge page sizes without any control over the exact sizes, behaving more like
+other file systems.
+
+tmpfs mounts
+------------
+
+The THP allocation policy for tmpfs mounts can be adjusted using the mount
+option: ``huge=``. It can have following values:
always
Attempt to allocate huge pages every time we need a new page;
@@ -374,19 +387,19 @@ within_size
advise
Only allocate huge pages if requested with fadvise()/madvise();
-The default policy is ``never``.
+Remember, that the kernel may use huge pages of all available sizes, and
+that no fine control as for the internal tmpfs mount is available.
+
+The default policy in the past was ``never``, but it can now be adjusted
+using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``.
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
-There's also sysfs knob to control hugepage allocation policy for internal
-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
-
-In addition to policies listed above, shmem_enabled allows two further
-values:
+In addition to policies listed above, the sysfs knob
+/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
+allocation policy of tmpfs mounts, when set to the following values:
deny
For use in emergencies, to force the huge option off from
@@ -394,13 +407,24 @@ deny
force
Force the huge option on for all - very useful for testing;
-Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
-control mTHP allocation:
-'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
-and its value for each mTHP is essentially consistent with the global
-setting. An 'inherit' option is added to ensure compatibility with these
-global settings. Conversely, the options 'force' and 'deny' are dropped,
-which are rather testing artifacts from the old ages.
+shmem / internal tmpfs
+----------------------
+The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
+mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+To control the THP allocation policy for this internal tmpfs mount, the
+sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
+per THP size in
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
+can be used.
+
+The global knob has the same semantics as the ``huge=`` mount options
+for tmpfs mounts, except that the different huge page sizes can be controlled
+individually, and will only use the setting of the global knob when the
+per-size knob is set to 'inherit'.
+
+The options 'force' and 'deny' are dropped for the individual sizes, which
+are rather testing artifacts from the old ages.
always
Attempt to allocate <size> huge pages every time we need a new page;
--
2.39.3
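For completeness, the global knob described above can be inspected from user space with a few lines of C; this is just a reader sketch with minimal error handling, and the bracketed entry in the output marks the currently selected policy:

#include <stdio.h>

int main(void)
{
        char buf[128];
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");

        if (!f)
                return 1;
        if (fgets(buf, sizeof(buf), f))
                printf("%s", buf);  /* e.g. "always within_size advise [never] deny force" */
        fclose(f);
        return 0;
}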
* [PATCH] docs: tmpfs: update the huge folios policy for tmpfs and shmem fix
2024-11-12 7:45 ` [PATCH v2 5/5] docs: tmpfs: update the huge folios policy for tmpfs and shmem Baolin Wang
@ 2024-11-13 6:57 ` Baolin Wang
2024-11-20 21:35 ` Barry Song
0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-11-13 6:57 UTC (permalink / raw)
To: baolin.wang
Cc: 21cnbao, akpm, da.gomez, david, hughd, ioworker0, linux-kernel,
linux-mm, ryan.roberts, wangkefeng.wang, willy
Drop 'fadvise()' from the doc, since fadvise() has no HUGEPAGE advise
currently.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Documentation/admin-guide/mm/transhuge.rst | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index ba6edff728ed..333958ef0d5f 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -382,10 +382,10 @@ never
within_size
Only allocate huge page if it will be fully within i_size.
- Also respect fadvise()/madvise() hints;
+ Also respect madvise() hints;
advise
- Only allocate huge pages if requested with fadvise()/madvise();
+ Only allocate huge pages if requested with madvise();
Remember, that the kernel may use huge pages of all available sizes, and
that no fine control as for the internal tmpfs mount is available.
@@ -438,10 +438,10 @@ never
within_size
Only allocate <size> huge page if it will be fully within i_size.
- Also respect fadvise()/madvise() hints;
+ Also respect madvise() hints;
advise
- Only allocate <size> huge pages if requested with fadvise()/madvise();
+ Only allocate <size> huge pages if requested with madvise();
Need of application restart
===========================
--
2.39.3
* Re: [PATCH] docs: tmpfs: update the huge folios policy for tmpfs and shmem fix
2024-11-13 6:57 ` [PATCH] docs: tmpfs: update the huge folios policy for tmpfs and shmem fix Baolin Wang
@ 2024-11-20 21:35 ` Barry Song
2024-11-22 11:12 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Barry Song @ 2024-11-20 21:35 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, da.gomez, david, hughd, ioworker0, linux-kernel, linux-mm,
ryan.roberts, wangkefeng.wang, willy
On Wed, Nov 13, 2024 at 7:57 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Drop 'fadvise()' from the doc, since fadvise() has no HUGEPAGE advise
> currently.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
I couldn’t find any mention of HUGEPAGE in fadvise() either.
FADV_NORMAL
FADV_RANDOM
FADV_SEQUENTIAL
FADV_WILLNEED
FADV_DONTNEED
FADV_NOREUSE
> ---
> Documentation/admin-guide/mm/transhuge.rst | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index ba6edff728ed..333958ef0d5f 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -382,10 +382,10 @@ never
>
> within_size
> Only allocate huge page if it will be fully within i_size.
> - Also respect fadvise()/madvise() hints;
> + Also respect madvise() hints;
>
> advise
> - Only allocate huge pages if requested with fadvise()/madvise();
> + Only allocate huge pages if requested with madvise();
>
> Remember, that the kernel may use huge pages of all available sizes, and
> that no fine control as for the internal tmpfs mount is available.
> @@ -438,10 +438,10 @@ never
>
> within_size
> Only allocate <size> huge page if it will be fully within i_size.
> - Also respect fadvise()/madvise() hints;
> + Also respect madvise() hints;
>
> advise
> - Only allocate <size> huge pages if requested with fadvise()/madvise();
> + Only allocate <size> huge pages if requested with madvise();
>
> Need of application restart
> ===========================
> --
> 2.39.3
>
* Re: [PATCH] docs: tmpfs: update the huge folios policy for tmpfs and shmem fix
2024-11-20 21:35 ` Barry Song
@ 2024-11-22 11:12 ` David Hildenbrand
0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-11-22 11:12 UTC (permalink / raw)
To: Barry Song, Baolin Wang
Cc: akpm, da.gomez, hughd, ioworker0, linux-kernel, linux-mm,
ryan.roberts, wangkefeng.wang, willy
On 20.11.24 22:35, Barry Song wrote:
> On Wed, Nov 13, 2024 at 7:57 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Drop 'fadvise()' from the doc, since fadvise() has no HUGEPAGE advise
>> currently.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Reviewed-by: Barry Song <baohua@kernel.org>
>
> I couldn’t find any mention of HUGEPAGE in fadvise() either.
>
> FADV_NORMAL
> FADV_RANDOM
> FADV_SEQUENTIAL
> FADV_WILLNEED
> FADV_DONTNEED
> FADV_NOREUSE
Probably it was forward-looking, and that change never happened.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/5] Support large folios for tmpfs
@ 2024-11-15 13:16 ` Daniel Gomez
2024-11-15 13:35 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Daniel Gomez @ 2024-11-15 13:16 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
linux-mm, linux-kernel
On Tue Nov 12, 2024 at 8:45 AM CET, Baolin Wang wrote:
> Traditionally, tmpfs only supported PMD-sized huge folios. However nowadays
Nitpick:
We are mixing folios/pages and PMD-size huge here. For anyone not aware of
the memory folios conversion in the kernel, I think this makes it confusing.
Tmpfs has never supported folios, so this is not true. Can we rephrase
it?
Below you are also mixing terms huge/large folios etc. Can we be
consistent? I'd stick with folios (for order-0), and large folios (!
order-0). I'd use huge term only when referring to PMD-size pages.
> with other file systems supporting any sized large folios, and extending
> anonymous to support mTHP, we should not restrict tmpfs to allocating only
> PMD-sized huge folios, making it more special. Instead, we should allow
Again here.
> tmpfs can allocate any sized large folios.
>
> Considering that tmpfs already has the 'huge=' option to control the huge
> folios allocation, we can extend the 'huge=' option to allow any sized huge
'huge=' has never controlled folios.
> folios. The semantics of the 'huge=' mount option are:
>
> huge=never: no any sized huge folios
> huge=always: any sized huge folios
> huge=within_size: like 'always' but respect the i_size
> huge=advise: like 'always' if requested with fadvise()/madvise()
>
> Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
> allocate the PMD-sized huge folios if huge=always/within_size/advise is set.
>
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
> semantics. The 'deny' can disable any sized large folios for tmpfs, while
> the 'force' can enable PMD sized large folios for tmpfs.
>
> Any comments and suggestions are appreciated. Thanks.
>
> Changes from v1:
> - Add reviewed tag from Barry and David. Thanks.
> - Fix building warnings reported by kernel test robot.
> - Add a new patch to control the default huge policy for tmpfs.
>
> Changes from RFC v3:
> - Drop the huge=write_size option.
> - Allow any sized huge folios for the 'huge' option.
> - Update the documentation, per David.
>
> Changes from RFC v2:
> - Drop mTHP interfaces to control huge page allocation, per Matthew.
> - Add a new helper to calculate the order, suggested by Matthew.
> - Add a new huge=write_size option to allocate large folios based on
> the write size.
> - Add a new patch to update the documentation.
>
> Changes from RFC v1:
> - Drop patch 1.
> - Use 'write_end' to calculate the length in shmem_allowable_huge_orders().
> - Update shmem_mapping_size_order() per Daniel.
>
> Baolin Wang (4):
> mm: factor out the order calculation into a new helper
> mm: shmem: change shmem_huge_global_enabled() to return huge order
> bitmap
> mm: shmem: add large folio support for tmpfs
> mm: shmem: add a kernel command line to change the default huge policy
> for tmpfs
>
> David Hildenbrand (1):
> docs: tmpfs: update the huge folios policy for tmpfs and shmem
>
> .../admin-guide/kernel-parameters.txt | 7 +
> Documentation/admin-guide/mm/transhuge.rst | 64 ++++++--
> include/linux/pagemap.h | 16 +-
> mm/shmem.c | 148 ++++++++++++++----
> 4 files changed, 183 insertions(+), 52 deletions(-)
* Re: [PATCH v2 0/5] Support large folios for tmpfs
2024-11-15 13:16 ` [PATCH v2 0/5] Support large folios for tmpfs Daniel Gomez
@ 2024-11-15 13:35 ` David Hildenbrand
2024-11-15 15:35 ` Daniel Gomez
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2024-11-15 13:35 UTC (permalink / raw)
To: Daniel Gomez, Baolin Wang, akpm, hughd
Cc: willy, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
linux-mm, linux-kernel
On 15.11.24 14:16, Daniel Gomez wrote:
> On Tue Nov 12, 2024 at 8:45 AM CET, Baolin Wang wrote:
>> Traditionally, tmpfs only supported PMD-sized huge folios. However nowadays
>
> Nitpick:
> We are mixing here folios/page, PMD-size huge. For anyone not aware of
> Memory Folios conversion in the kernel I think this makes it confusing.
> Tmpfs has never supported folios so, this is not true. Can we rephrase
> it?
We had the exact same discussion when we added mTHP support to anonymous
memory.
I suggest you read:
https://lkml.kernel.org/r/65dbdf2a-9281-a3c3-b7e3-a79c5b60b357@redhat.com
Folios are an implementation detail on how we manage metadata. Nobody in
user space should even have to be aware of how we manage metadata for
larger chunks of memory ("huge pages") in the kernel.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/5] Support large folios for tmpfs
2024-11-15 13:35 ` David Hildenbrand
@ 2024-11-15 15:35 ` Daniel Gomez
2024-11-15 15:44 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Daniel Gomez @ 2024-11-15 15:35 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, akpm, hughd
Cc: willy, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
linux-mm, linux-kernel
On Fri Nov 15, 2024 at 2:35 PM CET, David Hildenbrand wrote:
> On 15.11.24 14:16, Daniel Gomez wrote:
>> On Tue Nov 12, 2024 at 8:45 AM CET, Baolin Wang wrote:
>>> Traditionally, tmpfs only supported PMD-sized huge folios. However nowadays
>>
>> Nitpick:
>> We are mixing here folios/page, PMD-size huge. For anyone not aware of
>> Memory Folios conversion in the kernel I think this makes it confusing.
>> Tmpfs has never supported folios so, this is not true. Can we rephrase
>> it?
>
> We had the exact same discussion when we added mTHP support to anonymous
> memory.
>
> I suggest you read:
>
> https://lkml.kernel.org/r/65dbdf2a-9281-a3c3-b7e3-a79c5b60b357@redhat.com
>
> Folios are an implementation detail on how we manage metadata. Nobody in
> user space should even have to be aware of how we manage metadata for
> larger chunks of memory ("huge pages") in the kernel.
I read it and I can't find where the use of "PMD-size huge folios" could
be a valid term. Tmpfs has never supported "folios", so I think using
"PMD-size huge pages" is more appropiate.
* Re: [PATCH v2 0/5] Support large folios for tmpfs
2024-11-15 15:35 ` Daniel Gomez
@ 2024-11-15 15:44 ` David Hildenbrand
0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-11-15 15:44 UTC (permalink / raw)
To: Daniel Gomez, Baolin Wang, akpm, hughd
Cc: willy, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
linux-mm, linux-kernel
On 15.11.24 16:35, Daniel Gomez wrote:
> On Fri Nov 15, 2024 at 2:35 PM CET, David Hildenbrand wrote:
>> On 15.11.24 14:16, Daniel Gomez wrote:
>>> On Tue Nov 12, 2024 at 8:45 AM CET, Baolin Wang wrote:
>>>> Traditionally, tmpfs only supported PMD-sized huge folios. However nowadays
>>>
>>> Nitpick:
>>> We are mixing here folios/page, PMD-size huge. For anyone not aware of
>>> Memory Folios conversion in the kernel I think this makes it confusing.
>>> Tmpfs has never supported folios so, this is not true. Can we rephrase
>>> it?
>>
>> We had the exact same discussion when we added mTHP support to anonymous
>> memory.
>>
>> I suggest you read:
>>
>> https://lkml.kernel.org/r/65dbdf2a-9281-a3c3-b7e3-a79c5b60b357@redhat.com
>>
>> Folios are an implementation detail on how we manage metadata. Nobody in
>> user space should even have to be aware of how we manage metadata for
>> larger chunks of memory ("huge pages") in the kernel.
>
> I read it and I can't find where the use of "PMD-size huge folios" could
> be a valid term. Tmpfs has never supported "folios", so I think using
> "PMD-size huge pages" is more appropiate.
Oh sorry, I completely agree. Yes, we should use that.
--
Cheers,
David / dhildenb