* [RFC PATCH v2 1/2] mm: shmem: add large folio support to the write and fallocate paths
From: Baolin Wang @ 2024-09-26 8:27 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
From: Daniel Gomez <da.gomez@samsung.com>
Add large folio support to the shmem write and fallocate paths, matching
the high-order preference mechanism used in the iomap buffered IO path
via __filemap_get_folio().
Add shmem_mapping_size_order() to get a hint for the folio order based
on the file size, taking the mapping's alignment requirements into account.
If the top-level huge page setting (controlled by '/sys/kernel/mm/transparent_hugepage/shmem_enabled')
is enabled, only PMD-sized THP is allowed, keeping the interface backward
compatible.
Co-developed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 48 insertions(+), 3 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 0613421e09e7..6dece90ff421 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1672,6 +1672,36 @@ bool shmem_hpage_pmd_enabled(void)
return false;
}
+/**
+ * shmem_mapping_size_order - Get maximum folio order for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @size: The suggested size of the folio to create.
+ *
+ * This returns the highest folio order (when supported) that the file size
+ * and the mapping currently allow at the given index. The index is relevant
+ * due to alignment considerations the mapping might have. The returned order
+ * may correspond to a smaller size than the one passed.
+ *
+ * Like the order calculation in __filemap_get_folio().
+ *
+ * Return: The order.
+ */
+static inline unsigned int
+shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, size_t size)
+{
+ unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE));
+
+ if (!mapping_large_folio_support(mapping))
+ return 0;
+
+ /* If we're not aligned, allocate a smaller folio */
+ if (index & ((1UL << order) - 1))
+ order = __ffs(index);
+
+ return min_t(size_t, order, MAX_PAGECACHE_ORDER);
+}
+
unsigned long shmem_allowable_huge_orders(struct inode *inode,
struct vm_area_struct *vma, pgoff_t index,
loff_t write_end, bool shmem_huge_force)
@@ -1694,11 +1724,26 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
global_huge = shmem_huge_global_enabled(inode, index, write_end,
shmem_huge_force, vma, vm_flags);
if (!vma || !vma_is_anon_shmem(vma)) {
+ size_t len;
+
+ /*
+ * For tmpfs, if the top-level huge page is enabled, we just allow
+ * PMD-sized THP to keep the interface backward compatible.
+ */
+ if (global_huge)
+ return BIT(HPAGE_PMD_ORDER);
+
+ if (!write_end)
+ return 0;
+
/*
- * For tmpfs, we now only support PMD sized THP if huge page
- * is enabled, otherwise fallback to order 0.
+ * Otherwise, get a highest order hint based on the size of
+ * the write or fallocate, then try each allowable huge
+ * order.
*/
- return global_huge ? BIT(HPAGE_PMD_ORDER) : 0;
+ len = write_end - (index << PAGE_SHIFT);
+ order = shmem_mapping_size_order(inode->i_mapping, index, len);
+ return order > 0 ? BIT(order + 1) - 1 : 0;
}
/*
--
2.39.3
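As a reading aid for the return value above: shmem_allowable_huge_orders()
returns a bitmap of allowable orders, not a single order. A minimal
standalone sketch of the mask arithmetic (illustrative only; the local
BIT() define stands in for the kernel's include/linux/bits.h helper, and
the function name is invented):

#define BIT(nr)		(1UL << (nr))

/* BIT(order + 1) - 1 sets bits 0..order, so a hint of order 4 yields
 * 0b11111: every order from 0 up to and including 4 is allowable. */
static unsigned long orders_up_to(unsigned int order)
{
	return order > 0 ? BIT(order + 1) - 1 : 0;
}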
* Re: [RFC PATCH v2 1/2] mm: shmem: add large folio support to the write and fallocate paths
From: Matthew Wilcox @ 2024-09-26 12:16 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel
On Thu, Sep 26, 2024 at 04:27:26PM +0800, Baolin Wang wrote:
> +static inline unsigned int
> +shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, size_t size)
> +{
> + unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE));
Why introduce the max_t() call here? Did nobody read the documentation
or implementation for get_order() before writing this patch?
Besides, get_order() is wrong (at least relative to other filesystems).
get_order() rounds up instead of down, so what should we do for a write()
of size 512 * 1024 + 1 byte? Other filesystems allocate an order-7 folio
plus an order-0 folio. This code would have us allocate an order-8 folio.
I think that's a bad idea.
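For a worked comparison of the two rounding behaviours, here is a
standalone userspace sketch, assuming 4KiB pages; the helper names are
invented for illustration:

#include <stdio.h>

#define PAGE_SHIFT 12	/* assume 4KiB pages */

/* get_order() behaviour: round up to the smallest order that covers size */
static unsigned int order_round_up(size_t size)
{
	unsigned int order = 0;

	while (((size_t)1 << (order + PAGE_SHIFT)) < size)
		order++;
	return order;
}

/* fgf_set_order() behaviour: round down to the largest order fully used */
static unsigned int order_round_down(size_t size)
{
	unsigned int shift = 63 - __builtin_clzl(size);	/* ilog2(size) */

	return shift <= PAGE_SHIFT ? 0 : shift - PAGE_SHIFT;
}

int main(void)
{
	size_t size = 512 * 1024 + 1;

	printf("round down: order-%u\n", order_round_down(size));	/* 7 */
	printf("round up:   order-%u\n", order_round_up(size));		/* 8 */
	return 0;
}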
* Re: [RFC PATCH v2 1/2] mm: shmem: add large folio support to the write and fallocate paths
From: Daniel Gomez @ 2024-09-26 12:58 UTC (permalink / raw)
To: Matthew Wilcox, Baolin Wang
Cc: akpm, hughd, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, linux-mm, linux-kernel
On 9/26/2024 2:16 PM, Matthew Wilcox wrote:
> On Thu, Sep 26, 2024 at 04:27:26PM +0800, Baolin Wang wrote:
>> +static inline unsigned int
>> +shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, size_t size)
>> +{
>> + unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE));
>
> Why introduce the max_t() call here? Did nobody read the documentation
> or implementation for get_order() before writing this patch?
get_order() result is undefined if the size is 0. I've used max_t() here
to avoid that case. Perhaps we should prevent that case before getting here?
>
> Besides, get_order() is wrong (at least relative to other filesystems).
> get_order() rounds up instead of down, so what should we do for a write()
> of size 512 * 1024 + 1 byte? Other filesystems allocate an order-7 folio
> plus an order-0 folio. This code would have us allocate an order-8 folio.
> I think that's a bad idea.
>
I think one of my earlier attempts was to use fgf_set_order +
FGF_GET_ORDER() as in iomap. But the solution taken there was to share
code between shmem and filemap and that wasn't considered a good idea.
Shall we just replicate iomap_get_folio()? Or else, what do you suggest
here?
Daniel
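For readers unfamiliar with the encoding Daniel refers to: fgf_set_order()
packs the suggested folio order into the top bits of the fgf_t flags word,
and FGF_GET_ORDER() unpacks it in __filemap_get_folio(). From
include/linux/pagemap.h at the time of this thread (the four-line body is
what Matthew's refactoring suggestion below is about):

#define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */

static inline fgf_t fgf_set_order(size_t size)
{
	unsigned int shift = ilog2(size);	/* rounds down, unlike get_order() */

	if (shift <= PAGE_SHIFT)
		return 0;
	return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
}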
* Re: [RFC PATCH v2 1/2] mm: shmem: add large folio support to the write and fallocate paths
From: Matthew Wilcox @ 2024-09-26 13:40 UTC (permalink / raw)
To: Daniel Gomez
Cc: Baolin Wang, akpm, hughd, david, wangkefeng.wang, 21cnbao,
ryan.roberts, ioworker0, linux-mm, linux-kernel
On Thu, Sep 26, 2024 at 02:58:31PM +0200, Daniel Gomez wrote:
> On 9/26/2024 2:16 PM, Matthew Wilcox wrote:
> > On Thu, Sep 26, 2024 at 04:27:26PM +0800, Baolin Wang wrote:
> > > +static inline unsigned int
> > > +shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, size_t size)
> > > +{
> > > + unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE));
> >
> > Why introduce the max_t() call here? Did nobody read the documentation
> > or implementation for get_order() before writing this patch?
>
> get_order() result is undefined if the size is 0. I've used max_t() here to
> avoid that case. Perhaps we should prevent that case before getting here?
Surely we've handled a length-0 write before we get here?
> I think one of my earlier attempts was to use fgf_set_order + FGF_GET_ORDER()
> as in iomap. But the solution taken there was to share code between shmem
> and filemap and that wasn't considered a good idea. Shall we just replicate
> iomap_get_folio()? Or else, what do you suggest here?
We could move three of the four lines from fgf_set_order() into a
new function and call it from both fgf_set_order() and shmem?
>
* Re: [RFC PATCH v2 1/2] mm: shmem: add large folio support to the write and fallocate paths
From: Baolin Wang @ 2024-09-27 2:12 UTC (permalink / raw)
To: Matthew Wilcox, Daniel Gomez
Cc: akpm, hughd, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, linux-mm, linux-kernel
On 2024/9/26 21:40, Matthew Wilcox wrote:
> On Thu, Sep 26, 2024 at 02:58:31PM +0200, Daniel Gomez wrote:
>> On 9/26/2024 2:16 PM, Matthew Wilcox wrote:
>>> On Thu, Sep 26, 2024 at 04:27:26PM +0800, Baolin Wang wrote:
>>>> +static inline unsigned int
>>>> +shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, size_t size)
>>>> +{
>>>> + unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE));
>>>
>>> Why introduce the max_t() call here? Did nobody read the documentation
>>> or implementation for get_order() before writing this patch?
>>
>> get_order() result is undefined if the size is 0. I've used max_t() here to
>> avoid that case. Perhaps we should prevent that case before getting here?
>
> Surely we've handled a length-0 write before we get here?
>
>> I think one of my earlier attempts was to use fgf_set_order + FGF_GET_ORDER()
>> as in iomap. But the solution taken there was to share code between shmem
>> and filemap and that wasn't considered a good idea. Shall we just replicate
>> iomap_get_folio()? Or else, what do you suggest here?
>
> We could move three of the four lines from fgf_set_order() into a
> new function and call it from both fgf_set_order() and shmem?
Sounds good. How about the following changes? Do you have a preferred
name for the new helper? Thanks.
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d9c7edb6422b..ce418acd2737 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -629,6 +629,16 @@ typedef unsigned int __bitwise fgf_t;
#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
+static inline unsigned int filemap_get_order(size_t size)
+{
+ unsigned int shift = ilog2(size);
+
+ if (shift <= PAGE_SHIFT)
+ return 0;
+
+ return shift - PAGE_SHIFT;
+}
+
/**
* fgf_set_order - Encode a length in the fgf_t flags.
* @size: The suggested size of the folio to create.
@@ -642,11 +652,11 @@ typedef unsigned int __bitwise fgf_t;
*/
static inline fgf_t fgf_set_order(size_t size)
{
- unsigned int shift = ilog2(size);
+ unsigned int order = filemap_get_order(size);
- if (shift <= PAGE_SHIFT)
+ if (!order)
return 0;
- return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
+ return (__force fgf_t)(order << 26);
}
void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
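With that helper in place, shmem_mapping_size_order() from patch 1 could
drop the get_order()/max_t() pair entirely; a hedged sketch of the rework,
not part of the posted diff, assuming length-0 writes are filtered out
before this point as discussed above:

static inline unsigned int
shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, size_t size)
{
	/* filemap_get_order() rounds down; sizes below two pages give order 0 */
	unsigned int order = filemap_get_order(size);

	if (!mapping_large_folio_support(mapping))
		return 0;

	/* If we're not aligned, allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __ffs(index);

	return min_t(size_t, order, MAX_PAGECACHE_ORDER);
}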
* [RFC PATCH v2 2/2] mm: shmem: use mTHP interface to control huge orders for tmpfs
From: Baolin Wang @ 2024-09-26 8:27 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, 21cnbao, ryan.roberts, ioworker0,
da.gomez, baolin.wang, linux-mm, linux-kernel
For the huge orders allowed by writable mmap() faults on tmpfs,
the mTHP interface is used to control the allowable huge orders,
while 'huge_shmem_orders_inherit' maintains backward compatibility
with the top-level interface.
For the huge orders allowed by the write() and fallocate() paths on
tmpfs, get a highest order hint based on the write or fallocate
length, then try each allowable huge order filtered by the mTHP
interfaces if set.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/memory.c | 4 ++--
mm/shmem.c | 51 ++++++++++++++++++++++++++-------------------------
2 files changed, 28 insertions(+), 27 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..99dd75b84605 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5098,10 +5098,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
/*
* Using per-page fault to maintain the uffd semantics, and same
- * approach also applies to non-anonymous-shmem faults to avoid
+ * approach also applies to non shmem/tmpfs faults to avoid
* inflating the RSS of the process.
*/
- if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
+ if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
nr_pages = 1;
} else if (nr_pages > 1) {
pgoff_t idx = folio_page_idx(folio, page);
diff --git a/mm/shmem.c b/mm/shmem.c
index 6dece90ff421..569d3ab37161 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1721,31 +1721,6 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
return 0;
- global_huge = shmem_huge_global_enabled(inode, index, write_end,
- shmem_huge_force, vma, vm_flags);
- if (!vma || !vma_is_anon_shmem(vma)) {
- size_t len;
-
- /*
- * For tmpfs, if the top-level huge page is enabled, we just allow
- * PMD-sized THP to keep the interface backward compatible.
- */
- if (global_huge)
- return BIT(HPAGE_PMD_ORDER);
-
- if (!write_end)
- return 0;
-
- /*
- * Otherwise, get a highest order hint based on the size of
- * the write or fallocate, then try each allowable huge
- * order.
- */
- len = write_end - (index << PAGE_SHIFT);
- order = shmem_mapping_size_order(inode->i_mapping, index, len);
- return order > 0 ? BIT(order + 1) - 1 : 0;
- }
-
/*
* Following the 'deny' semantics of the top level, force the huge
* option off from all mounts.
@@ -1776,9 +1751,35 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
if (vm_flags & VM_HUGEPAGE)
mask |= READ_ONCE(huge_shmem_orders_madvise);
+ global_huge = shmem_huge_global_enabled(inode, index, write_end,
+ shmem_huge_force, vma, vm_flags);
if (global_huge)
mask |= READ_ONCE(huge_shmem_orders_inherit);
+ /*
+ * For the huge orders allowed by writable mmap() faults on tmpfs,
+ * the mTHP interface is used to control the allowable huge orders,
+ * while 'huge_shmem_orders_inherit' maintains backward compatibility
+ * with the top-level interface.
+ *
+ * For the huge orders allowed by the write() and fallocate() paths on
+ * tmpfs, get a highest order hint based on the write or fallocate
+ * length, then try each allowable huge order filtered by the mTHP
+ * interfaces if set.
+ */
+ if (!vma && !global_huge) {
+ size_t len;
+
+ if (!write_end)
+ return 0;
+
+ len = write_end - (index << PAGE_SHIFT);
+ order = shmem_mapping_size_order(inode->i_mapping, index, len);
+ if (!mask)
+ return order > 0 ? BIT(order + 1) - 1 : 0;
+
+ mask &= BIT(order + 1) - 1;
+ }
return THP_ORDERS_ALL_FILE_DEFAULT & mask;
}
--
2.39.3
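To make the mask interplay in patch 2's final hunk concrete, a small
standalone sketch with illustrative values, assuming 4KiB pages (64K is
order 4, 2M is order 9); the BIT() define and function name are invented
for illustration:

#define BIT(nr)		(1UL << (nr))

static unsigned long filter_orders(unsigned int hint_order, unsigned long mthp_mask)
{
	unsigned long hint_mask = hint_order > 0 ? BIT(hint_order + 1) - 1 : 0;

	/* No per-size mTHP settings: allow every order up to the hint */
	if (!mthp_mask)
		return hint_mask;

	/* Otherwise the length-derived hint caps the mTHP mask */
	return mthp_mask & hint_mask;
}

/* e.g. filter_orders(6, BIT(4) | BIT(9)) == BIT(4): the admin enabled 64K
 * (order 4) and 2M (order 9), but a length hint of order 6 rules out 2M. */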
* Re: [RFC PATCH v2 0/2] Support large folios for tmpfs
From: Matthew Wilcox @ 2024-09-26 12:20 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel
On Thu, Sep 26, 2024 at 04:27:25PM +0800, Baolin Wang wrote:
> This RFC patch series attempts to support large folios for tmpfs. The first
> patch is based on Daniel's previous patches in [1], mainly using the length
> in the write and fallocate paths to get a highest order hint for large
> order allocation. The last patch adds mTHP filter control for tmpfs if mTHP
> is set for the following reasons:
>
> 1. Maintain backward compatibility for the control interface. Tmpfs already
> has a global 'huge=' mount option and '/sys/kernel/mm/transparent_hugepage/shmem_enabled'
> interface to control large order allocations. mTHP extends this capability to a
> per-size basis while maintaining good interface compatibility.
... it's confusing as hell to anyone who tries to understand it and
you've made it more complicated. Well done.
> 2. For the large order allocation of writable mmap() faults in tmpfs, we need
> something like the mTHP interfaces to control large orders, as well as ensuring
> consistent interfaces with shmem.
tmpfs and shmem do NOT need to be consistent! I don't know why anyone
thinks this is a goal. tmpfs should be consistent with OTHER FILE
SYSTEMS. shmem should do the right thing for the shared anon use case.
> 3. Ryan pointed out that large order allocations based on write length could
> lead to memory fragmentation issue. Just quoting Ryan's comment [2]:
> "And it's possible (likely even, in my opinion) that allocating lots of different
> folio sizes will exacerbate memory fragmentation, leading to more order-0
> fallbacks, which would hurt the overall system performance in the long run, vs
> restricting to a couple of folio sizes."
I disagree with this. It's a buddy allocator; it's resistant to this
kind of fragmentation.
* Re: [RFC PATCH v2 0/2] Support large folios for tmpfs
From: Baolin Wang @ 2024-09-27 2:36 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, hughd, david, wangkefeng.wang, 21cnbao, ryan.roberts,
ioworker0, da.gomez, linux-mm, linux-kernel
On 2024/9/26 20:20, Matthew Wilcox wrote:
> On Thu, Sep 26, 2024 at 04:27:25PM +0800, Baolin Wang wrote:
>> This RFC patch series attempts to support large folios for tmpfs. The first
>> patch is based on Daniel's previous patches in [1], mainly using the length
>> in the write and fallocate paths to get a highest order hint for large
>> order allocation. The last patch adds mTHP filter control for tmpfs if mTHP
>> is set for the following reasons:
>>
>> 1. Maintain backward compatibility for the control interface. Tmpfs already
>> has a global 'huge=' mount option and '/sys/kernel/mm/transparent_hugepage/shmem_enabled'
>> interface to control large order allocations. mTHP extends this capability to a
>> per-size basis while maintaining good interface compatibility.
>
> ... it's confusing as hell to anyone who tries to understand it and
> you've made it more complicated. Well done.
>
>> 2. For the large order allocation of writable mmap() faults in tmpfs, we need
>> something like the mTHP interfaces to control large orders, as well as ensuring
>> consistent interfaces with shmem.
>
> tmpfs and shmem do NOT need to be consistent! I don't know why anyone
> thinks this is a goal. tmpfs should be consistent with OTHER FILE
> SYSTEMS. shmem should do the right thing for the shared anon use case.
>
>> 3. Ryan pointed out that large order allocations based on write length could
>> lead to memory fragmentation issue. Just quoting Ryan's comment [2]:
>> "And it's possible (likely even, in my opinion) that allocating lots of different
>> folio sizes will exacerbate memory fragmentation, leading to more order-0
>> fallbacks, which would hurt the overall system performance in the long run, vs
>> restricting to a couple of folio sizes."
>
> I disagree with this. It's a buddy allocator; it's resistant to this
> kind of fragmentation.
Fine. Thanks for sharing your opinion. For now, I can drop patch 2 and
stop adding mTHP interfaces for tmpfs.