* [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
@ 2024-10-26 5:43 Kefeng Wang
2024-10-26 5:43 ` [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page() Kefeng Wang
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Kefeng Wang @ 2024-10-26 5:43 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Matthew Wilcox, Muchun Song, Huang, Ying,
linux-mm, Kefeng Wang
When clearing a gigantic page, clear_gigantic_page() zeroes the folio from the
first subpage to the last subpage. If addr_hint, which may not be the address
of the first page of the folio, is passed in directly, some architectures
could flush the wrong cache lines when they use addr_hint as a hint. For a
non-gigantic page, process_huge_page() calculates the base address internally,
so even a wrong addr_hint only has a performance impact (process_huge_page()
wants to process the target subpage last to keep its cache lines hot); there
is no functional impact.
Let's pass the real accessed address to folio_zero_user() and use the
aligned address in clear_gigantic_page() to fix it.
Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
v2:
- update changelog to clarify the impact, per Andrew
fs/hugetlbfs/inode.c | 2 +-
mm/memory.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a4441fb77f7c..a5ea006f403e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
error = PTR_ERR(folio);
goto out;
}
- folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
+ folio_zero_user(folio, addr);
__folio_mark_uptodate(folio);
error = hugetlb_add_to_page_cache(folio, mapping, index);
if (unlikely(error)) {
diff --git a/mm/memory.c b/mm/memory.c
index 75c2dfd04f72..ef47b7ea5ddd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
int i;
might_sleep();
+ addr = ALIGN_DOWN(addr, folio_size(folio));
for (i = 0; i < nr_pages; i++) {
cond_resched();
clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
--
2.27.0
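
For context, a minimal sketch of clear_gigantic_page() with this change
applied, reconstructed from the hunk above (the exact signature and
surrounding code in mm/memory.c may differ):

static void clear_gigantic_page(struct folio *folio, unsigned long addr,
				unsigned int nr_pages)
{
	int i;

	might_sleep();
	/* addr may be an unaligned faulting address; align it to the folio. */
	addr = ALIGN_DOWN(addr, folio_size(folio));
	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		/* Clear each subpage with its own user-space address. */
		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
	}
}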
* [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page()
2024-10-26 5:43 [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Kefeng Wang
@ 2024-10-26 5:43 ` Kefeng Wang
2024-10-28 10:01 ` David Hildenbrand
2024-10-28 6:17 ` [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Huang, Ying
2024-10-28 10:00 ` David Hildenbrand
2 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-10-26 5:43 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Matthew Wilcox, Muchun Song, Huang, Ying,
linux-mm, Kefeng Wang
When copying a gigantic page, copy_user_gigantic_page() copies the folio from
the first subpage to the last subpage. If addr_hint, which may not be the
address of the first page of the folio, is passed in directly, some
architectures could flush the wrong cache lines when they use addr_hint as a
hint. For a non-gigantic page, process_huge_page() calculates the base address
internally, so even a wrong addr_hint only has a performance impact
(process_huge_page() wants to process the target subpage last to keep its
cache lines hot); there is no functional impact.
Let's pass the real accessed address to copy_user_large_folio() and use
the aligned address in copy_user_gigantic_page() to fix it.
Fixes: 530dd9926dc1 ("mm: memory: improve copy_user_large_folio()")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
v2:
- update changelog to clarify the impact, per Andrew
mm/hugetlb.c | 5 ++---
mm/memory.c | 1 +
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c8c5da0f5d3..15b5d46d49d2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5338,7 +5338,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
break;
}
ret = copy_user_large_folio(new_folio, pte_folio,
- ALIGN_DOWN(addr, sz), dst_vma);
+ addr, dst_vma);
folio_put(pte_folio);
if (ret) {
folio_put(new_folio);
@@ -6641,8 +6641,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
*foliop = NULL;
goto out;
}
- ret = copy_user_large_folio(folio, *foliop,
- ALIGN_DOWN(dst_addr, size), dst_vma);
+ ret = copy_user_large_folio(folio, *foliop, dst_addr, dst_vma);
folio_put(*foliop);
*foliop = NULL;
if (ret) {
diff --git a/mm/memory.c b/mm/memory.c
index ef47b7ea5ddd..e5284bab659d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6860,6 +6860,7 @@ static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
struct page *dst_page;
struct page *src_page;
+ addr = ALIGN_DOWN(addr, folio_size(dst));
for (i = 0; i < nr_pages; i++) {
dst_page = folio_page(dst, i);
src_page = folio_page(src, i);
--
2.27.0
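
For context, a minimal sketch of copy_user_gigantic_page() with this change
applied, reconstructed from the hunk above (error handling for the
machine-check copy path is omitted and the exact signature may differ):

static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
				   unsigned long addr,
				   struct vm_area_struct *vma,
				   unsigned int nr_pages)
{
	struct page *dst_page;
	struct page *src_page;
	int i;

	/* addr may be an unaligned faulting address; align it to the folio. */
	addr = ALIGN_DOWN(addr, folio_size(dst));
	for (i = 0; i < nr_pages; i++) {
		dst_page = folio_page(dst, i);
		src_page = folio_page(src, i);

		cond_resched();
		copy_user_highpage(dst_page, src_page,
				   addr + i * PAGE_SIZE, vma);
	}
	return 0;
}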
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-26 5:43 [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Kefeng Wang
2024-10-26 5:43 ` [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page() Kefeng Wang
@ 2024-10-28 6:17 ` Huang, Ying
2024-10-28 6:35 ` Kefeng Wang
2024-10-28 10:00 ` David Hildenbrand
2 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-10-28 6:17 UTC (permalink / raw)
To: Kefeng Wang
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Muchun Song, linux-mm
Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> When clearing gigantic page, it zeros page from the first page to the
> last page, if directly passing addr_hint which maybe not the address
> of the first page of folio, then some archs could flush the wrong cache
> if it does use the addr_hint as a hint. For non-gigantic page, it
> calculates the base address inside, even passed the wrong addr_hint, it
> only has performance impact as the process_huge_page() wants to process
> target page last to keep its cache lines hot), no functional impact.
>
> Let's pass the real accessed address to folio_zero_user() and use the
> aligned address in clear_gigantic_page() to fix it.
>
> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> v2:
> - update changelog to clarify the impact, per Andrew
>
> fs/hugetlbfs/inode.c | 2 +-
> mm/memory.c | 1 +
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a4441fb77f7c..a5ea006f403e 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> error = PTR_ERR(folio);
> goto out;
> }
> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
> + folio_zero_user(folio, addr);
'addr' is set with the following statement above,
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
So, we just don't need to ALIGN_DOWN() here. Or do I miss something?
> __folio_mark_uptodate(folio);
> error = hugetlb_add_to_page_cache(folio, mapping, index);
> if (unlikely(error)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 75c2dfd04f72..ef47b7ea5ddd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
> int i;
>
> might_sleep();
> + addr = ALIGN_DOWN(addr, folio_size(folio));
> for (i = 0; i < nr_pages; i++) {
> cond_resched();
> clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
--
Best Regards,
Huang, Ying
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 6:17 ` [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Huang, Ying
@ 2024-10-28 6:35 ` Kefeng Wang
2024-10-28 7:03 ` Huang, Ying
0 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-10-28 6:35 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Muchun Song, linux-mm
On 2024/10/28 14:17, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>
>> When clearing gigantic page, it zeros page from the first page to the
>> last page, if directly passing addr_hint which maybe not the address
>> of the first page of folio, then some archs could flush the wrong cache
>> if it does use the addr_hint as a hint. For non-gigantic page, it
>> calculates the base address inside, even passed the wrong addr_hint, it
>> only has performance impact as the process_huge_page() wants to process
>> target page last to keep its cache lines hot), no functional impact.
>>
>> Let's pass the real accessed address to folio_zero_user() and use the
>> aligned address in clear_gigantic_page() to fix it.
>>
>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>> v2:
>> - update changelog to clarify the impact, per Andrew
>>
>> fs/hugetlbfs/inode.c | 2 +-
>> mm/memory.c | 1 +
>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index a4441fb77f7c..a5ea006f403e 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>> error = PTR_ERR(folio);
>> goto out;
>> }
>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>> + folio_zero_user(folio, addr);
>
> 'addr' is set with the following statement above,
>
> /* addr is the offset within the file (zero based) */
> addr = index * hpage_size;
>
> So, we just don't need to ALIGN_DOWN() here. Or do I miss something?
Yes, it is already aligned,
>
>> __folio_mark_uptodate(folio);
>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>> if (unlikely(error)) {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>> int i;
>>
>> might_sleep();
>> + addr = ALIGN_DOWN(addr, folio_size(folio));
but for hugetlb_no_page(), we do need to align the addr as it uses
vmf->real_address, so I moved the alignment into clear_gigantic_page().
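
Roughly the path in question (illustrative only, not the literal
mm/hugetlb.c code):

	/* in hugetlb_no_page(): the faulting address is passed as-is */
	folio_zero_user(folio, vmf->real_address);

	/* in clear_gigantic_page(): the alignment now happens here */
	addr = ALIGN_DOWN(addr, folio_size(folio));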
>> for (i = 0; i < nr_pages; i++) {
>> cond_resched();
>> clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>
> --
> Best Regards,
> Huang, Ying
>
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 6:35 ` Kefeng Wang
@ 2024-10-28 7:03 ` Huang, Ying
2024-10-28 8:35 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-10-28 7:03 UTC (permalink / raw)
To: Kefeng Wang
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Muchun Song, linux-mm
Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> On 2024/10/28 14:17, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>
>>> When clearing gigantic page, it zeros page from the first page to the
>>> last page, if directly passing addr_hint which maybe not the address
>>> of the first page of folio, then some archs could flush the wrong cache
>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>> calculates the base address inside, even passed the wrong addr_hint, it
>>> only has performance impact as the process_huge_page() wants to process
>>> target page last to keep its cache lines hot), no functional impact.
>>>
>>> Let's pass the real accessed address to folio_zero_user() and use the
>>> aligned address in clear_gigantic_page() to fix it.
>>>
>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>> v2:
>>> - update changelog to clarify the impact, per Andrew
>>>
>>> fs/hugetlbfs/inode.c | 2 +-
>>> mm/memory.c | 1 +
>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index a4441fb77f7c..a5ea006f403e 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>> error = PTR_ERR(folio);
>>> goto out;
>>> }
>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>> + folio_zero_user(folio, addr);
>> 'addr' is set with the following statement above,
>> /* addr is the offset within the file (zero based) */
>> addr = index * hpage_size;
>> So, we just don't need to ALIGN_DOWN() here. Or do I miss
>> something?
>
> Yes, it is already aligned,
>>
>>> __folio_mark_uptodate(folio);
>>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>>> if (unlikely(error)) {
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>> int i;
>>> might_sleep();
>>> + addr = ALIGN_DOWN(addr, folio_size(folio));
>
> but for hugetlb_no_page(), we do need to align the addr as it use
> vmf->real_address, so I move the alignment into the
> clear_gigantic_page.
That sounds good. You may need to revise the patch description to describe
why you made the change. Maybe something like below?
In the current kernel, hugetlb_no_page() calls folio_zero_user() with the
fault address, which may not be aligned with the huge page size. Then
folio_zero_user() may call clear_gigantic_page() with that address, while
clear_gigantic_page() requires the address to be huge-page aligned. So this
may cause memory corruption or an information leak.
>>> for (i = 0; i < nr_pages; i++) {
>>> cond_resched();
>>> clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
--
Best Regards,
Huang, Ying
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 7:03 ` Huang, Ying
@ 2024-10-28 8:35 ` Kefeng Wang
0 siblings, 0 replies; 33+ messages in thread
From: Kefeng Wang @ 2024-10-28 8:35 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Muchun Song, linux-mm
On 2024/10/28 15:03, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>
>> On 2024/10/28 14:17, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> When clearing gigantic page, it zeros page from the first page to the
>>>> last page, if directly passing addr_hint which maybe not the address
>>>> of the first page of folio, then some archs could flush the wrong cache
>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>> calculates the base address inside, even passed the wrong addr_hint, it
>>>> only has performance impact as the process_huge_page() wants to process
>>>> target page last to keep its cache lines hot), no functional impact.
>>>>
>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>> aligned address in clear_gigantic_page() to fix it.
>>>>
>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> ---
>>>> v2:
>>>> - update changelog to clarify the impact, per Andrew
>>>>
>>>> fs/hugetlbfs/inode.c | 2 +-
>>>> mm/memory.c | 1 +
>>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>> --- a/fs/hugetlbfs/inode.c
>>>> +++ b/fs/hugetlbfs/inode.c
>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>>> error = PTR_ERR(folio);
>>>> goto out;
>>>> }
>>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>> + folio_zero_user(folio, addr);
>>> 'addr' is set with the following statement above,
>>> /* addr is the offset within the file (zero based) */
>>> addr = index * hpage_size;
>>> So, we just don't need to ALIGN_DOWN() here. Or do I miss
>>> something?
>>
>> Yes, it is already aligned,
>>>
>>>> __folio_mark_uptodate(folio);
>>>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>> if (unlikely(error)) {
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>> int i;
>>>> might_sleep();
>>>> + addr = ALIGN_DOWN(addr, folio_size(folio));
>>
>> but for hugetlb_no_page(), we do need to align the addr as it use
>> vmf->real_address, so I move the alignment into the
>> clear_gigantic_page.
>
> That sounds good. You may need to revise patch description to describe
> why you make the change. May be something like below?
>
> In current kernel, hugetlb_no_page() calls folio_zero_user() with the
> fault address. Where the fault address may be not aligned with the huge
> page size. Then, folio_zero_user() may call clear_gigantic_page() with
> the address, while clear_gigantic_page() requires the address to be huge
> page size aligned. So, this may cause memory corruption or information
> leak.
OK, will use it and update all patches, thanks.
>
>>>> for (i = 0; i < nr_pages; i++) {
>>>> cond_resched();
>>>> clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>
> --
> Best Regards,
> Huang, Ying
>
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-26 5:43 [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Kefeng Wang
2024-10-26 5:43 ` [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page() Kefeng Wang
2024-10-28 6:17 ` [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Huang, Ying
@ 2024-10-28 10:00 ` David Hildenbrand
2024-10-28 12:52 ` Kefeng Wang
2 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2024-10-28 10:00 UTC (permalink / raw)
To: Kefeng Wang, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
On 26.10.24 07:43, Kefeng Wang wrote:
> When clearing gigantic page, it zeros page from the first page to the
> last page, if directly passing addr_hint which maybe not the address
> of the first page of folio, then some archs could flush the wrong cache
> if it does use the addr_hint as a hint. For non-gigantic page, it
> calculates the base address inside, even passed the wrong addr_hint, it
> only has performance impact as the process_huge_page() wants to process
> target page last to keep its cache lines hot), no functional impact.
>
> Let's pass the real accessed address to folio_zero_user() and use the
> aligned address in clear_gigantic_page() to fix it.
>
> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> v2:
> - update changelog to clarify the impact, per Andrew
>
> fs/hugetlbfs/inode.c | 2 +-
> mm/memory.c | 1 +
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a4441fb77f7c..a5ea006f403e 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> error = PTR_ERR(folio);
> goto out;
> }
> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
> + folio_zero_user(folio, addr);
> __folio_mark_uptodate(folio);
> error = hugetlb_add_to_page_cache(folio, mapping, index);
> if (unlikely(error)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 75c2dfd04f72..ef47b7ea5ddd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
> int i;
>
> might_sleep();
> + addr = ALIGN_DOWN(addr, folio_size(folio));
Right, that's what's effectively done in a very bad way in
process_huge_page()
unsigned long addr = addr_hint &
~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
That should all be cleaned up ... process_huge_page() likely shouldn't
even be consuming "nr_pages".
In clear_gigantic_page(), can you please rename the "unsigned long addr"
parameter to "unsigned long addr_hint" and use an additional
"unsigned long addr"?
--
Cheers,
David / dhildenb
* Re: [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page()
2024-10-26 5:43 ` [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page() Kefeng Wang
@ 2024-10-28 10:01 ` David Hildenbrand
0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2024-10-28 10:01 UTC (permalink / raw)
To: Kefeng Wang, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
On 26.10.24 07:43, Kefeng Wang wrote:
> When copying gigantic page, it copies page from the first page to the
> last page, if directly passing addr_hint which maybe not the address
> of the first page of folio, then some archs could flush the wrong cache
> if it does use the addr_hint as a hint. For non-gigantic page, it
> calculates the base address inside, even passed the wrong addr_hint, it
> only has performance impact as the process_huge_page() wants to process
> target page last to keep its cache lines hot), no functional impact.
>
> Let's pass the real accessed address to copy_user_large_folio() and use
> the aligned address in copy_user_gigantic_page() to fix it.
>
> Fixes: 530dd9926dc1 ("mm: memory: improve copy_user_large_folio()")
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> v2:
> - update changelog to clarify the impact, per Andrew
>
> mm/hugetlb.c | 5 ++---
> mm/memory.c | 1 +
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2c8c5da0f5d3..15b5d46d49d2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5338,7 +5338,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> break;
> }
> ret = copy_user_large_folio(new_folio, pte_folio,
> - ALIGN_DOWN(addr, sz), dst_vma);
> + addr, dst_vma);
> folio_put(pte_folio);
> if (ret) {
> folio_put(new_folio);
> @@ -6641,8 +6641,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> *foliop = NULL;
> goto out;
> }
> - ret = copy_user_large_folio(folio, *foliop,
> - ALIGN_DOWN(dst_addr, size), dst_vma);
> + ret = copy_user_large_folio(folio, *foliop, dst_addr, dst_vma);
> folio_put(*foliop);
> *foliop = NULL;
> if (ret) {
> diff --git a/mm/memory.c b/mm/memory.c
> index ef47b7ea5ddd..e5284bab659d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6860,6 +6860,7 @@ static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
> struct page *dst_page;
> struct page *src_page;
>
> + addr = ALIGN_DOWN(addr, folio_size(dst));
Same thing, please make it clearer that there is an "addr_hint" and an
"addr".
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 10:00 ` David Hildenbrand
@ 2024-10-28 12:52 ` Kefeng Wang
2024-10-28 13:14 ` David Hildenbrand
0 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-10-28 12:52 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
On 2024/10/28 18:00, David Hildenbrand wrote:
> On 26.10.24 07:43, Kefeng Wang wrote:
>> When clearing gigantic page, it zeros page from the first page to the
>> last page, if directly passing addr_hint which maybe not the address
>> of the first page of folio, then some archs could flush the wrong cache
>> if it does use the addr_hint as a hint. For non-gigantic page, it
>> calculates the base address inside, even passed the wrong addr_hint, it
>> only has performance impact as the process_huge_page() wants to process
>> target page last to keep its cache lines hot), no functional impact.
>>
>> Let's pass the real accessed address to folio_zero_user() and use the
>> aligned address in clear_gigantic_page() to fix it.
>>
>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>> folio_zero_user()")
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>> v2:
>> - update changelog to clarify the impact, per Andrew
>>
>> fs/hugetlbfs/inode.c | 2 +-
>> mm/memory.c | 1 +
>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index a4441fb77f7c..a5ea006f403e 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file,
>> int mode, loff_t offset,
>> error = PTR_ERR(folio);
>> goto out;
>> }
>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>> + folio_zero_user(folio, addr);
>> __folio_mark_uptodate(folio);
>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>> if (unlikely(error)) {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>> *folio, unsigned long addr,
>> int i;
>> might_sleep();
>> + addr = ALIGN_DOWN(addr, folio_size(folio));
>
> Right, that's what's effectively done in a very bad way in
> process_huge_page()
>
> unsigned long addr = addr_hint &
> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>
>
> That should all be cleaned up ... process_huge_page() likely shouldn't
Yes, let's fix the bug first,
> be even consuming "nr_pages".
Not sure about this part; it uses nr_pages as the end and calculates the
'base'.
>
>
> In clear_gigantic_page(), can you please rename the "unsigned long addr"
> parameter to unsigned long "addr_hint" and use an additional "unsigned
> long addr" ?
>
Sure.
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 12:52 ` Kefeng Wang
@ 2024-10-28 13:14 ` David Hildenbrand
2024-10-28 13:33 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2024-10-28 13:14 UTC (permalink / raw)
To: Kefeng Wang, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
On 28.10.24 13:52, Kefeng Wang wrote:
>
>
> On 2024/10/28 18:00, David Hildenbrand wrote:
>> On 26.10.24 07:43, Kefeng Wang wrote:
>>> When clearing gigantic page, it zeros page from the first page to the
>>> last page, if directly passing addr_hint which maybe not the address
>>> of the first page of folio, then some archs could flush the wrong cache
>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>> calculates the base address inside, even passed the wrong addr_hint, it
>>> only has performance impact as the process_huge_page() wants to process
>>> target page last to keep its cache lines hot), no functional impact.
>>>
>>> Let's pass the real accessed address to folio_zero_user() and use the
>>> aligned address in clear_gigantic_page() to fix it.
>>>
>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>> folio_zero_user()")
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>> v2:
>>> - update changelog to clarify the impact, per Andrew
>>>
>>> fs/hugetlbfs/inode.c | 2 +-
>>> mm/memory.c | 1 +
>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index a4441fb77f7c..a5ea006f403e 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file,
>>> int mode, loff_t offset,
>>> error = PTR_ERR(folio);
>>> goto out;
>>> }
>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>> + folio_zero_user(folio, addr);
>>> __folio_mark_uptodate(folio);
>>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>>> if (unlikely(error)) {
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>> *folio, unsigned long addr,
>>> int i;
>>> might_sleep();
>>> + addr = ALIGN_DOWN(addr, folio_size(folio));
>>
>> Right, that's what's effectively done in a very bad way in
>> process_huge_page()
>>
>> unsigned long addr = addr_hint &
>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>
>>
>> That should all be cleaned up ... process_huge_page() likely shouldn't
>
> Yes, let's fix the bug firstly,
>
>> be even consuming "nr_pages".
>
> No sure about this part, it uses nr_pages as the end and calculate the
> 'base'.
It should be using folio_nr_pages().
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 13:14 ` David Hildenbrand
@ 2024-10-28 13:33 ` Kefeng Wang
2024-10-28 13:46 ` David Hildenbrand
0 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-10-28 13:33 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
On 2024/10/28 21:14, David Hildenbrand wrote:
> On 28.10.24 13:52, Kefeng Wang wrote:
>>
>>
>> On 2024/10/28 18:00, David Hildenbrand wrote:
>>> On 26.10.24 07:43, Kefeng Wang wrote:
>>>> When clearing gigantic page, it zeros page from the first page to the
>>>> last page, if directly passing addr_hint which maybe not the address
>>>> of the first page of folio, then some archs could flush the wrong cache
>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>> calculates the base address inside, even passed the wrong addr_hint, it
>>>> only has performance impact as the process_huge_page() wants to process
>>>> target page last to keep its cache lines hot), no functional impact.
>>>>
>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>> aligned address in clear_gigantic_page() to fix it.
>>>>
>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>>> folio_zero_user()")
>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> ---
>>>> v2:
>>>> - update changelog to clarify the impact, per Andrew
>>>>
>>>> fs/hugetlbfs/inode.c | 2 +-
>>>> mm/memory.c | 1 +
>>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>> --- a/fs/hugetlbfs/inode.c
>>>> +++ b/fs/hugetlbfs/inode.c
>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file,
>>>> int mode, loff_t offset,
>>>> error = PTR_ERR(folio);
>>>> goto out;
>>>> }
>>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>> + folio_zero_user(folio, addr);
>>>> __folio_mark_uptodate(folio);
>>>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>> if (unlikely(error)) {
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>>> *folio, unsigned long addr,
>>>> int i;
>>>> might_sleep();
>>>> + addr = ALIGN_DOWN(addr, folio_size(folio));
>>>
>>> Right, that's what's effectively done in a very bad way in
>>> process_huge_page()
>>>
>>> unsigned long addr = addr_hint &
>>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>
>>>
>>> That should all be cleaned up ... process_huge_page() likely shouldn't
>>
>> Yes, let's fix the bug firstly,
>>
>>> be even consuming "nr_pages".
>>
>> No sure about this part, it uses nr_pages as the end and calculate the
>> 'base'.
>
> It should be using folio_nr_pages().
But process_huge_page() is called without an explicit folio argument. I'd
like to move the aligned-address calculation into folio_zero_user() and
copy_user_large_folio() (which I will rename to folio_copy_user()) in the
following cleanup patches, or should I do it in the fix patches?
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 13:33 ` Kefeng Wang
@ 2024-10-28 13:46 ` David Hildenbrand
2024-10-28 14:22 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2024-10-28 13:46 UTC (permalink / raw)
To: Kefeng Wang, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
On 28.10.24 14:33, Kefeng Wang wrote:
>
>
> On 2024/10/28 21:14, David Hildenbrand wrote:
>> On 28.10.24 13:52, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/10/28 18:00, David Hildenbrand wrote:
>>>> On 26.10.24 07:43, Kefeng Wang wrote:
>>>>> When clearing gigantic page, it zeros page from the first page to the
>>>>> last page, if directly passing addr_hint which maybe not the address
>>>>> of the first page of folio, then some archs could flush the wrong cache
>>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>>> calculates the base address inside, even passed the wrong addr_hint, it
>>>>> only has performance impact as the process_huge_page() wants to process
>>>>> target page last to keep its cache lines hot), no functional impact.
>>>>>
>>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>>> aligned address in clear_gigantic_page() to fix it.
>>>>>
>>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>>>> folio_zero_user()")
>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> ---
>>>>> v2:
>>>>> - update changelog to clarify the impact, per Andrew
>>>>>
>>>>> fs/hugetlbfs/inode.c | 2 +-
>>>>> mm/memory.c | 1 +
>>>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>>> --- a/fs/hugetlbfs/inode.c
>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file,
>>>>> int mode, loff_t offset,
>>>>> error = PTR_ERR(folio);
>>>>> goto out;
>>>>> }
>>>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>>> + folio_zero_user(folio, addr);
>>>>> __folio_mark_uptodate(folio);
>>>>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>>> if (unlikely(error)) {
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>>>> *folio, unsigned long addr,
>>>>> int i;
>>>>> might_sleep();
>>>>> + addr = ALIGN_DOWN(addr, folio_size(folio));
>>>>
>>>> Right, that's what's effectively done in a very bad way in
>>>> process_huge_page()
>>>>
>>>> unsigned long addr = addr_hint &
>>>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>>
>>>>
>>>> That should all be cleaned up ... process_huge_page() likely shouldn't
>>>
>>> Yes, let's fix the bug firstly,
>>>
>>>> be even consuming "nr_pages".
>>>
>>> No sure about this part, it uses nr_pages as the end and calculate the
>>> 'base'.
>>
>> It should be using folio_nr_pages().
>
> But process_huge_page() without an explicit folio argument, I'd like to
> move the aligned address calculate into the folio_zero_user and
> copy_user_large_folio(will rename it to folio_copy_user()) in the
> following cleanup patches, or do it in the fix patches?
First, why does folio_zero_user() call process_huge_page() for *a small
folio*? Because we like our code to be extra complicated to understand?
Or am I missing something important?
Second, we should be passing the folio to "process_huge_page" and likely
rename it to "folio_process_pages()" or sth like that. The function even
documents "of the specified huge page", but there is none specified. The
copy case might require a rework.
I think this code needs a serious cleanup ...
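Just to illustrate the direction (hypothetical name and signature, not an
agreed API):

/*
 * Hypothetical sketch only: pass the folio so the helper can derive
 * nr_pages and the base address itself; the "process the target subpage
 * last" iteration of today's process_huge_page() would stay as-is.
 */
static void folio_process_pages(struct folio *folio, unsigned long addr_hint,
				void (*process_subpage)(unsigned long addr,
							int idx, void *arg),
				void *arg)
{
	unsigned long base = ALIGN_DOWN(addr_hint, folio_size(folio));
	unsigned int nr_pages = folio_nr_pages(folio);

	/* ... existing left/right/target-last iteration, unchanged ... */
}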
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 13:46 ` David Hildenbrand
@ 2024-10-28 14:22 ` Kefeng Wang
2024-10-28 14:24 ` David Hildenbrand
0 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-10-28 14:22 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
On 2024/10/28 21:46, David Hildenbrand wrote:
> On 28.10.24 14:33, Kefeng Wang wrote:
>>
>>
>> On 2024/10/28 21:14, David Hildenbrand wrote:
>>> On 28.10.24 13:52, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/10/28 18:00, David Hildenbrand wrote:
>>>>> On 26.10.24 07:43, Kefeng Wang wrote:
>>>>>> When clearing gigantic page, it zeros page from the first page to the
>>>>>> last page, if directly passing addr_hint which maybe not the address
>>>>>> of the first page of folio, then some archs could flush the wrong
>>>>>> cache
>>>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>>>> calculates the base address inside, even passed the wrong
>>>>>> addr_hint, it
>>>>>> only has performance impact as the process_huge_page() wants to
>>>>>> process
>>>>>> target page last to keep its cache lines hot), no functional impact.
>>>>>>
>>>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>>>> aligned address in clear_gigantic_page() to fix it.
>>>>>>
>>>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>>>>> folio_zero_user()")
>>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>> ---
>>>>>> v2:
>>>>>> - update changelog to clarify the impact, per Andrew
>>>>>>
>>>>>> fs/hugetlbfs/inode.c | 2 +-
>>>>>> mm/memory.c | 1 +
>>>>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>>>> --- a/fs/hugetlbfs/inode.c
>>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file
>>>>>> *file,
>>>>>> int mode, loff_t offset,
>>>>>> error = PTR_ERR(folio);
>>>>>> goto out;
>>>>>> }
>>>>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>>>> + folio_zero_user(folio, addr);
>>>>>> __folio_mark_uptodate(folio);
>>>>>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>>>> if (unlikely(error)) {
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>>>>> *folio, unsigned long addr,
>>>>>> int i;
>>>>>> might_sleep();
>>>>>> + addr = ALIGN_DOWN(addr, folio_size(folio));
>>>>>
>>>>> Right, that's what's effectively done in a very bad way in
>>>>> process_huge_page()
>>>>>
>>>>> unsigned long addr = addr_hint &
>>>>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>>>
>>>>>
>>>>> That should all be cleaned up ... process_huge_page() likely shouldn't
>>>>
>>>> Yes, let's fix the bug firstly,
>>>>
>>>>> be even consuming "nr_pages".
>>>>
>>>> No sure about this part, it uses nr_pages as the end and calculate the
>>>> 'base'.
>>>
>>> It should be using folio_nr_pages().
>>
>> But process_huge_page() without an explicit folio argument, I'd like to
>> move the aligned address calculate into the folio_zero_user and
>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>> following cleanup patches, or do it in the fix patches?
>
> First, why does folio_zero_user() call process_huge_page() for *a small
> folio*? Because we like or code to be extra complicated to understand?
> Or am I missing something important?
folio_zero_user() was used for PMD-sized THP and HugeTLB before; after anon
mTHP support was added, it is used for order-2 through PMD-order THP and
HugeTLB, so it won't process a small folio if I understand correctly.
>
> Second, we should be passing the folio to "process_huge_page" and likely
> rename it to "folio_process_pages()" or sth like that. The function even
> documents "of the specified huge page", but there is none specified. The
> copy case might require a rework.
>
> I think this code needs a serious cleanup ...
>
Yes, I'd like to do more cleanup and rework them. I also found some
performance issues on my arm64 machine [1] which need to be addressed.
[1]
https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 14:22 ` Kefeng Wang
@ 2024-10-28 14:24 ` David Hildenbrand
2024-10-29 13:04 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2024-10-28 14:24 UTC (permalink / raw)
To: Kefeng Wang, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
On 28.10.24 15:22, Kefeng Wang wrote:
>
>
> On 2024/10/28 21:46, David Hildenbrand wrote:
>> On 28.10.24 14:33, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/10/28 21:14, David Hildenbrand wrote:
>>>> On 28.10.24 13:52, Kefeng Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/10/28 18:00, David Hildenbrand wrote:
>>>>>> On 26.10.24 07:43, Kefeng Wang wrote:
>>>>>>> When clearing gigantic page, it zeros page from the first page to the
>>>>>>> last page, if directly passing addr_hint which maybe not the address
>>>>>>> of the first page of folio, then some archs could flush the wrong
>>>>>>> cache
>>>>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>>>>> calculates the base address inside, even passed the wrong
>>>>>>> addr_hint, it
>>>>>>> only has performance impact as the process_huge_page() wants to
>>>>>>> process
>>>>>>> target page last to keep its cache lines hot), no functional impact.
>>>>>>>
>>>>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>>>>> aligned address in clear_gigantic_page() to fix it.
>>>>>>>
>>>>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>>>>>> folio_zero_user()")
>>>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>> ---
>>>>>>> v2:
>>>>>>> - update changelog to clarify the impact, per Andrew
>>>>>>>
>>>>>>> fs/hugetlbfs/inode.c | 2 +-
>>>>>>> mm/memory.c | 1 +
>>>>>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>>>>> --- a/fs/hugetlbfs/inode.c
>>>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file
>>>>>>> *file,
>>>>>>> int mode, loff_t offset,
>>>>>>> error = PTR_ERR(folio);
>>>>>>> goto out;
>>>>>>> }
>>>>>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>>>>> + folio_zero_user(folio, addr);
>>>>>>> __folio_mark_uptodate(folio);
>>>>>>> error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>>>>> if (unlikely(error)) {
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>>>>>> *folio, unsigned long addr,
>>>>>>> int i;
>>>>>>> might_sleep();
>>>>>>> + addr = ALIGN_DOWN(addr, folio_size(folio));
>>>>>>
>>>>>> Right, that's what's effectively done in a very bad way in
>>>>>> process_huge_page()
>>>>>>
>>>>>> unsigned long addr = addr_hint &
>>>>>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>>>>
>>>>>>
>>>>>> That should all be cleaned up ... process_huge_page() likely shouldn't
>>>>>
>>>>> Yes, let's fix the bug firstly,
>>>>>
>>>>>> be even consuming "nr_pages".
>>>>>
>>>>> No sure about this part, it uses nr_pages as the end and calculate the
>>>>> 'base'.
>>>>
>>>> It should be using folio_nr_pages().
>>>
>>> But process_huge_page() without an explicit folio argument, I'd like to
>>> move the aligned address calculate into the folio_zero_user and
>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>> following cleanup patches, or do it in the fix patches?
>>
>> First, why does folio_zero_user() call process_huge_page() for *a small
>> folio*? Because we like or code to be extra complicated to understand?
>> Or am I missing something important?
>
> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
> after anon mTHP supported, it is used for order-2~order-PMD-order THP
> and HugeTLB, so it won't process a small folio if I understand correctly.
And unfortunately neither the documentation nor the function name
expresses that :(
I'm happy to review any patches that improve the situation here :)
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-28 14:24 ` David Hildenbrand
@ 2024-10-29 13:04 ` Kefeng Wang
2024-10-29 14:04 ` David Hildenbrand
0 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-10-29 13:04 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm
>>>>>>>
>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>> shouldn't
>>>>>>
>>>>>> Yes, let's fix the bug firstly,
>>>>>>
>>>>>>> be even consuming "nr_pages".
>>>>>>
>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>> the
>>>>>> 'base'.
>>>>>
>>>>> It should be using folio_nr_pages().
>>>>
>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>> move the aligned address calculate into the folio_zero_user and
>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>> following cleanup patches, or do it in the fix patches?
>>>
>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>> folio*? Because we like or code to be extra complicated to understand?
>>> Or am I missing something important?
>>
>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>> and HugeTLB, so it won't process a small folio if I understand correctly.
>
> And unfortunately neither the documentation nor the function name
> expresses that :(
>
> I'm happy to review any patches that improve the situation here :)
>
Actually, could we drop process_huge_page() entirely? From my testcase [1],
process_huge_page() is not better than clearing/copying the page from first
to last, and sequential clearing/copying may be more beneficial to hardware
prefetching. Is there a way to let LKP run tests to check the performance?
Since process_huge_page() was submitted by Ying, what's your opinion?
(A rough sketch of the two clearing orders follows the test list below.)
[1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
case-anon-w-seq-hugetlb (2M PMD HugeTLB)
fallocate hugetlb 20G (2M PMD HugeTLB)
fallocate tmpfs 20G (2M PMD THP)
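
A rough sketch of the two clearing orders being compared (not the exact
mm/memory.c code, and the helper name below is made up for illustration):

/* 1) Plain sequential clear, first subpage to last (gigantic-page path). */
static void clear_folio_sequential(struct folio *folio, unsigned long base,
				   unsigned int nr_pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++)
		clear_user_highpage(folio_page(folio, i), base + i * PAGE_SIZE);
}

/*
 * 2) process_huge_page() order: clear the subpages far from the faulting
 *    address first and the target subpage last, so the target's cache
 *    lines are still hot when the faulting task resumes.
 */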
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-29 13:04 ` Kefeng Wang
@ 2024-10-29 14:04 ` David Hildenbrand
2024-10-30 1:04 ` Huang, Ying
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2024-10-29 14:04 UTC (permalink / raw)
To: Kefeng Wang, Andrew Morton
Cc: Matthew Wilcox, Muchun Song, Huang, Ying, linux-mm, Zi Yan
On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>
>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>> shouldn't
>>>>>>>
>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>
>>>>>>>> be even consuming "nr_pages".
>>>>>>>
>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>> the
>>>>>>> 'base'.
>>>>>>
>>>>>> It should be using folio_nr_pages().
>>>>>
>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>> move the aligned address calculate into the folio_zero_user and
>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>> following cleanup patches, or do it in the fix patches?
>>>>
>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>> folio*? Because we like or code to be extra complicated to understand?
>>>> Or am I missing something important?
>>>
>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>
>> And unfortunately neither the documentation nor the function name
>> expresses that :(
>>
>> I'm happy to review any patches that improve the situation here :)
>>
>
> Actually, could we drop the process_huge_page() totally, from my
> testcase[1], process_huge_page() is not better than clear/copy page
> from start to last, and sequential clearing/copying maybe more
> beneficial to the hardware prefetching, and is there a way to let lkp
> to test to check the performance, since the process_huge_page()
> was submitted by Ying, what's your opinion?
I questioned that just recently [1], and Ying assumed that it still
applies [2].
c79b57e462b5 ("mm: hugetlb: clear target
sub-page last when clearing huge page”) documents the scenario where
this matters -- anon-w-seq which you also run below.
If there is no performance benefit anymore, we should rip that out. But
likely we should check on multiple micro-architectures with multiple
#CPU configs that are relevant. c79b57e462b5 used a Xeon E5 v3 2699 with
72 processes on 2 NUMA nodes, maybe your test environment cannot
replicate that?
[1]
https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
[2]
https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
>
> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
But these are sequential, not random. I'd have thought access + zeroing
would be sequential either way. Did you run with random access as well?
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-29 14:04 ` David Hildenbrand
@ 2024-10-30 1:04 ` Huang, Ying
2024-10-30 3:04 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-10-30 1:04 UTC (permalink / raw)
To: Kefeng Wang, David Hildenbrand
Cc: Andrew Morton, Matthew Wilcox, Muchun Song, linux-mm, Zi Yan
David Hildenbrand <david@redhat.com> writes:
> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>
>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>> shouldn't
>>>>>>>>
>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>
>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>
>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>> the
>>>>>>>> 'base'.
>>>>>>>
>>>>>>> It should be using folio_nr_pages().
>>>>>>
>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>
>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>> Or am I missing something important?
>>>>
>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>
>>> And unfortunately neither the documentation nor the function name
>>> expresses that :(
>>>
>>> I'm happy to review any patches that improve the situation here :)
>>>
>> Actually, could we drop the process_huge_page() totally, from my
>> testcase[1], process_huge_page() is not better than clear/copy page
>> from start to last, and sequential clearing/copying maybe more
>> beneficial to the hardware prefetching, and is there a way to let lkp
>> to test to check the performance, since the process_huge_page()
>> was submitted by Ying, what's your opinion?
I don't think that it's a good idea to revert the commit without
studying and root-causing the issues. I can work together with you on
that. If we have solid and well-explained data proving that
process_huge_page() isn't beneficial, we can revert the commit.
> I questioned that just recently [1], and Ying assumed that it still
> applies [2].
>
> c79b57e462b5 ("mm: hugetlb: clear target
> sub-page last when clearing huge page”) documents the scenario where
> this matters -- anon-w-seq which you also run below.
>
> If there is no performance benefit anymore, we should rip that
> out. But likely we should check on multiple micro-architectures with
> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
> cannot replicate that?
>
>
> [1]
> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
> [2]
> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>
> But these are sequential, not random. I'd have thought access +
> zeroing would be sequentially either way. Did you run with random
> access as well>
--
Best Regards,
Huang, Ying
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-30 1:04 ` Huang, Ying
@ 2024-10-30 3:04 ` Kefeng Wang
2024-10-30 3:21 ` Huang, Ying
0 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-10-30 3:04 UTC (permalink / raw)
To: Huang, Ying, David Hildenbrand
Cc: Andrew Morton, Matthew Wilcox, Muchun Song, linux-mm, Zi Yan
On 2024/10/30 9:04, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
>
>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>
>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>> shouldn't
>>>>>>>>>
>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>
>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>
>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>> the
>>>>>>>>> 'base'.
>>>>>>>>
>>>>>>>> It should be using folio_nr_pages().
>>>>>>>
>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>
>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>> Or am I missing something important?
>>>>>
>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>
>>>> And unfortunately neither the documentation nor the function name
>>>> expresses that :(
>>>>
>>>> I'm happy to review any patches that improve the situation here :)
>>>>
>>> Actually, could we drop the process_huge_page() totally, from my
>>> testcase[1], process_huge_page() is not better than clear/copy page
>>> from start to last, and sequential clearing/copying maybe more
>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>> to test to check the performance, since the process_huge_page()
>>> was submitted by Ying, what's your opinion?
>
> I don't think that it's a good idea to revert the commit without
> studying and root causing the issues. I can work together with you on
> that. If we have solid and well explained data to prove
> process_huge_page() isn't benefitial, we can revert the commit.
Take 'fallocate 20G' as an example, before
Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

     3,118.94 msec task-clock              # 0.999 CPUs utilized
           30      context-switches        # 0.010 K/sec
            1      cpu-migrations          # 0.000 K/sec
          136      page-faults             # 0.044 K/sec
8,092,075,873      cycles                  # 2.594 GHz (92.82%)
1,624,587,663      instructions            # 0.20 insn per cycle (92.83%)
  395,341,850      branches                # 126.755 M/sec (92.82%)
    3,872,302      branch-misses           # 0.98% of all branches (92.83%)
1,398,066,701      L1-dcache-loads         # 448.251 M/sec (92.82%)
   58,124,626      L1-dcache-load-misses   # 4.16% of all L1-dcache accesses (92.82%)
    1,032,527      LLC-loads               # 0.331 M/sec (92.82%)
      498,684      LLC-load-misses         # 48.30% of all LL-cache accesses (92.84%)
  473,689,004      L1-icache-loads         # 151.875 M/sec (92.82%)
      356,721      L1-icache-load-misses   # 0.08% of all L1-icache accesses (92.85%)
1,947,644,987      dTLB-loads              # 624.458 M/sec (92.95%)
       10,185      dTLB-load-misses        # 0.00% of all dTLB cache accesses (92.96%)
  474,622,896      iTLB-loads              # 152.174 M/sec (92.95%)
           94      iTLB-load-misses        # 0.00% of all iTLB cache accesses (85.69%)

  3.122844830 seconds time elapsed
  0.000000000 seconds user
  3.107259000 seconds sys
and after (clear from start to end):
Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

     1,135.53 msec task-clock              # 0.999 CPUs utilized
           10      context-switches        # 0.009 K/sec
            1      cpu-migrations          # 0.001 K/sec
          137      page-faults             # 0.121 K/sec
2,946,673,587      cycles                  # 2.595 GHz (92.67%)
1,620,704,205      instructions            # 0.55 insn per cycle (92.61%)
  394,595,772      branches                # 347.499 M/sec (92.60%)
      130,756      branch-misses           # 0.03% of all branches (92.84%)
1,396,726,689      L1-dcache-loads         # 1230.022 M/sec (92.96%)
      338,344      L1-dcache-load-misses   # 0.02% of all L1-dcache accesses (92.95%)
      111,737      LLC-loads               # 0.098 M/sec (92.96%)
       67,486      LLC-load-misses         # 60.40% of all LL-cache accesses (92.96%)
  418,198,663      L1-icache-loads         # 368.285 M/sec (92.96%)
      173,764      L1-icache-load-misses   # 0.04% of all L1-icache accesses (92.96%)
2,203,364,632      dTLB-loads              # 1940.385 M/sec (92.96%)
       17,195      dTLB-load-misses        # 0.00% of all dTLB cache accesses (92.95%)
  418,198,365      iTLB-loads              # 368.285 M/sec (92.96%)
           79      iTLB-load-misses        # 0.00% of all iTLB cache accesses (85.34%)

  1.137015760 seconds time elapsed
  0.000000000 seconds user
  1.131266000 seconds sys
The IPC improved a lot, with fewer LLC-loads and more L1-dcache-loads,
but this depends on the implementation of the microarchitecture.
1) Will run some random-access tests to check the performance
difference, as David suggested.
2) Hope LKP can run more tests since it is very useful (more test sets
and different machines).
>
>> I questioned that just recently [1], and Ying assumed that it still
>> applies [2].
>>
>> c79b57e462b5 ("mm: hugetlb: clear target
>> sub-page last when clearing huge page”) documents the scenario where
>> this matters -- anon-w-seq which you also run below.
>>
>> If there is no performance benefit anymore, we should rip that
>> out. But likely we should check on multiple micro-architectures with
>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
>> cannot replicate that?>>
>>
>> [1]
>> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
>> [2]
>> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>
>>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>>
>> But these are sequential, not random. I'd have thought access +
>> zeroing would be sequentially either way. Did you run with random
>> access as well>
Will do.
> > --
> Best Regards,
> Huang, Ying
>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-30 3:04 ` Kefeng Wang
@ 2024-10-30 3:21 ` Huang, Ying
2024-10-30 5:05 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-10-30 3:21 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> On 2024/10/30 9:04, Huang, Ying wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>
>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>> shouldn't
>>>>>>>>>>
>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>
>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>
>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>>> the
>>>>>>>>>> 'base'.
>>>>>>>>>
>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>
>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>>
>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>>> Or am I missing something important?
>>>>>>
>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>>
>>>>> And unfortunately neither the documentation nor the function name
>>>>> expresses that :(
>>>>>
>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>
>>>> Actually, could we drop the process_huge_page() totally, from my
>>>> testcase[1], process_huge_page() is not better than clear/copy page
>>>> from start to last, and sequential clearing/copying maybe more
>>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>>> to test to check the performance, since the process_huge_page()
>>>> was submitted by Ying, what's your opinion?
>> I don't think that it's a good idea to revert the commit without
>> studying and root causing the issues. I can work together with you on
>> that. If we have solid and well explained data to prove
>> process_huge_page() isn't benefitial, we can revert the commit.
>
>
> Take 'fallocate 20G' as an example, before
>
> Performance counter stats for 'taskset -c 10 fallocate -l 20G
> /mnt/hugetlbfs/test':
IIUC, fallocate will zero the pages, but nothing will touch them
afterwards, right? If so, there's no cache benefit from clearing the
referenced page last.
> 3,118.94 msec task-clock # 0.999 CPUs
> utilized
> 30 context-switches # 0.010 K/sec
> 1 cpu-migrations # 0.000 K/sec
> 136 page-faults # 0.044 K/sec
> 8,092,075,873 cycles #
> 2.594 GHz (92.82%)
> 1,624,587,663 instructions # 0.20 insn per
> cycle (92.83%)
> 395,341,850 branches # 126.755 M/sec
> (92.82%)
> 3,872,302 branch-misses # 0.98% of all
> branches (92.83%)
> 1,398,066,701 L1-dcache-loads # 448.251 M/sec
> (92.82%)
> 58,124,626 L1-dcache-load-misses # 4.16% of all
> L1-dcache accesses (92.82%)
> 1,032,527 LLC-loads # 0.331 M/sec
> (92.82%)
> 498,684 LLC-load-misses # 48.30% of all
> LL-cache accesses (92.84%)
> 473,689,004 L1-icache-loads # 151.875 M/sec
> (92.82%)
> 356,721 L1-icache-load-misses # 0.08% of all
> L1-icache accesses (92.85%)
> 1,947,644,987 dTLB-loads # 624.458 M/sec
> (92.95%)
> 10,185 dTLB-load-misses # 0.00% of all
> dTLB cache accesses (92.96%)
> 474,622,896 iTLB-loads # 152.174 M/sec
> (92.95%)
> 94 iTLB-load-misses # 0.00% of all
> iTLB cache accesses (85.69%)
>
> 3.122844830 seconds time elapsed
>
> 0.000000000 seconds user
> 3.107259000 seconds sys
>
> and after(clear from start to end)
>
> Performance counter stats for 'taskset -c 10 fallocate -l 20G
> /mnt/hugetlbfs/test':
>
> 1,135.53 msec task-clock # 0.999 CPUs
> utilized
> 10 context-switches # 0.009 K/sec
> 1 cpu-migrations # 0.001 K/sec
> 137 page-faults # 0.121 K/sec
> 2,946,673,587 cycles #
> 2.595 GHz (92.67%)
> 1,620,704,205 instructions # 0.55 insn per
> cycle (92.61%)
> 394,595,772 branches # 347.499 M/sec
> (92.60%)
> 130,756 branch-misses # 0.03% of all
> branches (92.84%)
> 1,396,726,689 L1-dcache-loads # 1230.022 M/sec
> (92.96%)
> 338,344 L1-dcache-load-misses # 0.02% of all
> L1-dcache accesses (92.95%)
> 111,737 LLC-loads # 0.098 M/sec
> (92.96%)
> 67,486 LLC-load-misses # 60.40% of all
> LL-cache accesses (92.96%)
> 418,198,663 L1-icache-loads # 368.285 M/sec
> (92.96%)
> 173,764 L1-icache-load-misses # 0.04% of all
> L1-icache accesses (92.96%)
> 2,203,364,632 dTLB-loads # 1940.385 M/sec
> (92.96%)
> 17,195 dTLB-load-misses # 0.00% of all
> dTLB cache accesses (92.95%)
> 418,198,365 iTLB-loads # 368.285 M/sec
> (92.96%)
> 79 iTLB-load-misses # 0.00% of all
> iTLB cache accesses (85.34%)
>
> 1.137015760 seconds time elapsed
>
> 0.000000000 seconds user
> 1.131266000 seconds sys
>
> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but
> this depends on the implementation of the microarchitecture.
Anyway, we need to avoid (or at least reduce) the pure memory clearing
performance regression. Have you double-checked whether
process_huge_page() is inlined? A perf-profile result can be used to
check this too.
When you say from start to end, do you mean using clear_gigantic_page()
directly, or changing process_huge_page() to clear pages from start to end?
> 1) Will test some rand test to check the different of performance as
> David suggested.
>
> 2) Hope the LKP to run more tests since it is very useful(more test
> set and different machines)
I'm starting to use LKP to test.
--
Best Regards,
Huang, Ying
>
>>
>>> I questioned that just recently [1], and Ying assumed that it still
>>> applies [2].
>>>
>>> c79b57e462b5 ("mm: hugetlb: clear target
>>> sub-page last when clearing huge page”) documents the scenario where
>>> this matters -- anon-w-seq which you also run below.
>>>
>>> If there is no performance benefit anymore, we should rip that
>>> out. But likely we should check on multiple micro-architectures with
>>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
>>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
>>> cannot replicate that?>>
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
>>> [2]
>>> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>>
>>>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>>>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
>>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>>>
>>> But these are sequential, not random. I'd have thought access +
>>> zeroing would be sequentially either way. Did you run with random
>>> access as well>
>
> Will do.
>> > --
>> Best Regards,
>> Huang, Ying
>>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-30 3:21 ` Huang, Ying
@ 2024-10-30 5:05 ` Kefeng Wang
2024-10-31 8:39 ` Huang, Ying
2024-11-01 6:18 ` Huang, Ying
0 siblings, 2 replies; 33+ messages in thread
From: Kefeng Wang @ 2024-10-30 5:05 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
On 2024/10/30 11:21, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>
>> On 2024/10/30 9:04, Huang, Ying wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>> shouldn't
>>>>>>>>>>>
>>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>>
>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>
>>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>>>> the
>>>>>>>>>>> 'base'.
>>>>>>>>>>
>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>
>>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>>>
>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>>>> Or am I missing something important?
>>>>>>>
>>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>>>
>>>>>> And unfortunately neither the documentation nor the function name
>>>>>> expresses that :(
>>>>>>
>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>
>>>>> Actually, could we drop the process_huge_page() totally, from my
>>>>> testcase[1], process_huge_page() is not better than clear/copy page
>>>>> from start to last, and sequential clearing/copying maybe more
>>>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>>>> to test to check the performance, since the process_huge_page()
>>>>> was submitted by Ying, what's your opinion?
>>> I don't think that it's a good idea to revert the commit without
>>> studying and root causing the issues. I can work together with you on
>>> that. If we have solid and well explained data to prove
>>> process_huge_page() isn't benefitial, we can revert the commit.
>>
>>
>> Take 'fallocate 20G' as an example, before
>>
>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>> /mnt/hugetlbfs/test':
>
> IIUC, fallocate will zero pages, but will not touch them at all, right?
> If so, no cache benefit from clearing referenced page last.
Yes, for this case, we only clear the pages.
>
>> 3,118.94 msec task-clock # 0.999 CPUs
>> utilized
>> 30 context-switches # 0.010 K/sec
>> 1 cpu-migrations # 0.000 K/sec
>> 136 page-faults # 0.044 K/sec
>> 8,092,075,873 cycles #
>> 2.594 GHz (92.82%)
>> 1,624,587,663 instructions # 0.20 insn per
>> cycle (92.83%)
>> 395,341,850 branches # 126.755 M/sec
>> (92.82%)
>> 3,872,302 branch-misses # 0.98% of all
>> branches (92.83%)
>> 1,398,066,701 L1-dcache-loads # 448.251 M/sec
>> (92.82%)
>> 58,124,626 L1-dcache-load-misses # 4.16% of all
>> L1-dcache accesses (92.82%)
>> 1,032,527 LLC-loads # 0.331 M/sec
>> (92.82%)
>> 498,684 LLC-load-misses # 48.30% of all
>> LL-cache accesses (92.84%)
>> 473,689,004 L1-icache-loads # 151.875 M/sec
>> (92.82%)
>> 356,721 L1-icache-load-misses # 0.08% of all
>> L1-icache accesses (92.85%)
>> 1,947,644,987 dTLB-loads # 624.458 M/sec
>> (92.95%)
>> 10,185 dTLB-load-misses # 0.00% of all
>> dTLB cache accesses (92.96%)
>> 474,622,896 iTLB-loads # 152.174 M/sec
>> (92.95%)
>> 94 iTLB-load-misses # 0.00% of all
>> iTLB cache accesses (85.69%)
>>
>> 3.122844830 seconds time elapsed
>>
>> 0.000000000 seconds user
>> 3.107259000 seconds sys
>>
>> and after(clear from start to end)
>>
>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>> /mnt/hugetlbfs/test':
>>
>> 1,135.53 msec task-clock # 0.999 CPUs
>> utilized
>> 10 context-switches # 0.009 K/sec
>> 1 cpu-migrations # 0.001 K/sec
>> 137 page-faults # 0.121 K/sec
>> 2,946,673,587 cycles #
>> 2.595 GHz (92.67%)
>> 1,620,704,205 instructions # 0.55 insn per
>> cycle (92.61%)
>> 394,595,772 branches # 347.499 M/sec
>> (92.60%)
>> 130,756 branch-misses # 0.03% of all
>> branches (92.84%)
>> 1,396,726,689 L1-dcache-loads # 1230.022 M/sec
>> (92.96%)
>> 338,344 L1-dcache-load-misses # 0.02% of all
>> L1-dcache accesses (92.95%)
>> 111,737 LLC-loads # 0.098 M/sec
>> (92.96%)
>> 67,486 LLC-load-misses # 60.40% of all
>> LL-cache accesses (92.96%)
>> 418,198,663 L1-icache-loads # 368.285 M/sec
>> (92.96%)
>> 173,764 L1-icache-load-misses # 0.04% of all
>> L1-icache accesses (92.96%)
>> 2,203,364,632 dTLB-loads # 1940.385 M/sec
>> (92.96%)
>> 17,195 dTLB-load-misses # 0.00% of all
>> dTLB cache accesses (92.95%)
>> 418,198,365 iTLB-loads # 368.285 M/sec
>> (92.96%)
>> 79 iTLB-load-misses # 0.00% of all
>> iTLB cache accesses (85.34%)
>>
>> 1.137015760 seconds time elapsed
>>
>> 0.000000000 seconds user
>> 1.131266000 seconds sys
>>
>> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but
>> this depends on the implementation of the microarchitecture.
>
> Anyway, we need to avoid (or reduce at least) the pure memory clearing
> performance. Have you double checked whether process_huge_page() is
> inlined? Perf-profile result can be used to check this too.
>
Yes, I'm sure process_huge_page() is inlined.
> When you say from start to end, you mean to use clear_gigantic_page()
> directly, or change process_huge_page() to clear page from start to end?
>
Using clear_gigantic_page() directly and changing process_huge_page()
to clear pages from start to end are both good for performance with
sequential clearing, but I have no random-access test results so far.
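
To make "clear pages from start to end" concrete, a minimal sketch of
such a variant (the helper name is hypothetical; it keeps the
process_subpage() callback used today and only changes the visiting
order) could look like:

static inline int process_huge_page_forward(unsigned long addr_hint,
		unsigned int nr_pages,
		int (*process_subpage)(unsigned long addr, int idx, void *arg),
		void *arg)
{
	int i, ret;
	/* base address of the huge page containing addr_hint */
	unsigned long addr = ALIGN_DOWN(addr_hint,
					(unsigned long)nr_pages << PAGE_SHIFT);

	might_sleep();
	/* visit subpages strictly in ascending address order */
	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
		if (ret)
			return ret;
	}
	return 0;
}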
>> 1) Will test some rand test to check the different of performance as
>> David suggested.
>>
>> 2) Hope the LKP to run more tests since it is very useful(more test
>> set and different machines)
>
> I'm starting to use LKP to test.
Great.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-30 5:05 ` Kefeng Wang
@ 2024-10-31 8:39 ` Huang, Ying
2024-11-01 7:43 ` Kefeng Wang
2024-12-01 2:15 ` Andrew Morton
2024-11-01 6:18 ` Huang, Ying
1 sibling, 2 replies; 33+ messages in thread
From: Huang, Ying @ 2024-10-31 8:39 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
Kefeng Wang <wangkefeng.wang@huawei.com> writes:
[snip]
>
>>> 1) Will test some rand test to check the different of performance as
>>> David suggested.
>>>
>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>> set and different machines)
>> I'm starting to use LKP to test.
>
> Greet.
I have run some tests with LKP.
Firstly, there's almost no measurable difference between clearing pages
from start to end or from end to start on an Intel server CPU. I guess
that there's some similar optimization for both directions.
For the multi-process (process count equal to the logical CPU number)
vm-scalability/anon-w-seq test case, the benchmark score increases by
about 22.4%.
For the multi-process vm-scalability/anon-w-rand test case, there is no
measurable difference in the benchmark score.
So, the optimization mainly helps sequential workloads.
In summary, on x86, process_huge_page() will not introduce any
regression, and it helps some workloads.
However, on ARM64, it does introduce some regression because pages are
cleared from end to start. That needs to be addressed. I guess that the
regression can be resolved by doing more of the clearing from start to
end (but not all of it). For example, can you take a look at the patch
below? It uses a similar framework as before, but clears each small
chunk (mpage) from start to end. You can adjust MPAGE_NRPAGES to check
at which point the regression goes away.
WARNING: the patch is only build tested.
Best Regards,
Huang, Ying
-----------------------------------8<----------------------------------------
From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Thu, 31 Oct 2024 11:13:57 +0800
Subject: [PATCH] mpage clear
---
mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 67 insertions(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 3ccee51adfbb..1fdc548c4275 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6769,6 +6769,68 @@ static inline int process_huge_page(
return 0;
}
+#define MPAGE_NRPAGES (1<<4)
+#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES)
+static inline int clear_huge_page(
+ unsigned long addr_hint, unsigned int nr_pages,
+ int (*process_subpage)(unsigned long addr, int idx, void *arg),
+ void *arg)
+{
+ int i, n, base, l, ret;
+ unsigned long addr = addr_hint &
+ ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
+ unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
+
+ /* Process target subpage last to keep its cache lines hot */
+ might_sleep();
+ n = (addr_hint - addr) / MPAGE_SIZE;
+ if (2 * n <= nr_mpages) {
+ /* If target subpage in first half of huge page */
+ base = 0;
+ l = n;
+ /* Process subpages at the end of huge page */
+ for (i = nr_mpages - 1; i >= 2 * n; i--) {
+ cond_resched();
+ ret = process_subpage(addr + i * MPAGE_SIZE,
+ i * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ } else {
+ /* If target subpage in second half of huge page */
+ base = nr_mpages - 2 * (nr_mpages - n);
+ l = nr_mpages - n;
+ /* Process subpages at the begin of huge page */
+ for (i = 0; i < base; i++) {
+ cond_resched();
+ ret = process_subpage(addr + i * MPAGE_SIZE,
+ i * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ }
+ /*
+ * Process remaining subpages in left-right-left-right pattern
+ * towards the target subpage
+ */
+ for (i = 0; i < l; i++) {
+ int left_idx = base + i;
+ int right_idx = base + 2 * l - 1 - i;
+
+ cond_resched();
+ ret = process_subpage(addr + left_idx * MPAGE_SIZE,
+ left_idx * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ cond_resched();
+ ret = process_subpage(addr + right_idx * MPAGE_SIZE,
+ right_idx * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
static void clear_gigantic_page(struct folio *folio, unsigned long addr,
unsigned int nr_pages)
{
@@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
static int clear_subpage(unsigned long addr, int idx, void *arg)
{
struct folio *folio = arg;
+ int i;
- clear_user_highpage(folio_page(folio, idx), addr);
+ for (i = 0; i < MPAGE_NRPAGES; i++)
+ clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
return 0;
}
@@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
unsigned int nr_pages = folio_nr_pages(folio);
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
+ if (unlikely(nr_pages != HPAGE_PMD_NR))
clear_gigantic_page(folio, addr_hint, nr_pages);
else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
2.39.2
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-30 5:05 ` Kefeng Wang
2024-10-31 8:39 ` Huang, Ying
@ 2024-11-01 6:18 ` Huang, Ying
2024-11-01 7:51 ` Kefeng Wang
1 sibling, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-11-01 6:18 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> On 2024/10/30 11:21, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>
>>> On 2024/10/30 9:04, Huang, Ying wrote:
>>>> David Hildenbrand <david@redhat.com> writes:
>>>>
>>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>>> shouldn't
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>>>
>>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>>
>>>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>>>>> the
>>>>>>>>>>>> 'base'.
>>>>>>>>>>>
>>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>>
>>>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>>>>
>>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>>>>> Or am I missing something important?
>>>>>>>>
>>>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>>>>
>>>>>>> And unfortunately neither the documentation nor the function name
>>>>>>> expresses that :(
>>>>>>>
>>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>>
>>>>>> Actually, could we drop the process_huge_page() totally, from my
>>>>>> testcase[1], process_huge_page() is not better than clear/copy page
>>>>>> from start to last, and sequential clearing/copying maybe more
>>>>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>>>>> to test to check the performance, since the process_huge_page()
>>>>>> was submitted by Ying, what's your opinion?
>>>> I don't think that it's a good idea to revert the commit without
>>>> studying and root causing the issues. I can work together with you on
>>>> that. If we have solid and well explained data to prove
>>>> process_huge_page() isn't benefitial, we can revert the commit.
>>>
>>>
>>> Take 'fallocate 20G' as an example, before
>>>
>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>> /mnt/hugetlbfs/test':
>> IIUC, fallocate will zero pages, but will not touch them at all,
>> right?
>> If so, no cache benefit from clearing referenced page last.
>
>
> Yes, for this case, only clear page.
>>
>>> 3,118.94 msec task-clock # 0.999 CPUs
>>> utilized
>>> 30 context-switches # 0.010 K/sec
>>> 1 cpu-migrations # 0.000 K/sec
>>> 136 page-faults # 0.044 K/sec
>>> 8,092,075,873 cycles #
>>> 2.594 GHz (92.82%)
>>> 1,624,587,663 instructions # 0.20 insn per
>>> cycle (92.83%)
>>> 395,341,850 branches # 126.755 M/sec
>>> (92.82%)
>>> 3,872,302 branch-misses # 0.98% of all
>>> branches (92.83%)
>>> 1,398,066,701 L1-dcache-loads # 448.251 M/sec
>>> (92.82%)
>>> 58,124,626 L1-dcache-load-misses # 4.16% of all
>>> L1-dcache accesses (92.82%)
>>> 1,032,527 LLC-loads # 0.331 M/sec
>>> (92.82%)
>>> 498,684 LLC-load-misses # 48.30% of all
>>> LL-cache accesses (92.84%)
>>> 473,689,004 L1-icache-loads # 151.875 M/sec
>>> (92.82%)
>>> 356,721 L1-icache-load-misses # 0.08% of all
>>> L1-icache accesses (92.85%)
>>> 1,947,644,987 dTLB-loads # 624.458 M/sec
>>> (92.95%)
>>> 10,185 dTLB-load-misses # 0.00% of all
>>> dTLB cache accesses (92.96%)
>>> 474,622,896 iTLB-loads # 152.174 M/sec
>>> (92.95%)
>>> 94 iTLB-load-misses # 0.00% of all
>>> iTLB cache accesses (85.69%)
>>>
>>> 3.122844830 seconds time elapsed
>>>
>>> 0.000000000 seconds user
>>> 3.107259000 seconds sys
>>>
>>> and after(clear from start to end)
>>>
>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>> /mnt/hugetlbfs/test':
>>>
>>> 1,135.53 msec task-clock # 0.999 CPUs
>>> utilized
>>> 10 context-switches # 0.009 K/sec
>>> 1 cpu-migrations # 0.001 K/sec
>>> 137 page-faults # 0.121 K/sec
>>> 2,946,673,587 cycles #
>>> 2.595 GHz (92.67%)
>>> 1,620,704,205 instructions # 0.55 insn per
>>> cycle (92.61%)
>>> 394,595,772 branches # 347.499 M/sec
>>> (92.60%)
>>> 130,756 branch-misses # 0.03% of all
>>> branches (92.84%)
>>> 1,396,726,689 L1-dcache-loads # 1230.022 M/sec
>>> (92.96%)
>>> 338,344 L1-dcache-load-misses # 0.02% of all
>>> L1-dcache accesses (92.95%)
>>> 111,737 LLC-loads # 0.098 M/sec
>>> (92.96%)
>>> 67,486 LLC-load-misses # 60.40% of all
>>> LL-cache accesses (92.96%)
>>> 418,198,663 L1-icache-loads # 368.285 M/sec
>>> (92.96%)
>>> 173,764 L1-icache-load-misses # 0.04% of all
>>> L1-icache accesses (92.96%)
>>> 2,203,364,632 dTLB-loads # 1940.385 M/sec
>>> (92.96%)
>>> 17,195 dTLB-load-misses # 0.00% of all
>>> dTLB cache accesses (92.95%)
>>> 418,198,365 iTLB-loads # 368.285 M/sec
>>> (92.96%)
>>> 79 iTLB-load-misses # 0.00% of all
>>> iTLB cache accesses (85.34%)
>>>
>>> 1.137015760 seconds time elapsed
>>>
>>> 0.000000000 seconds user
>>> 1.131266000 seconds sys
>>>
>>> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but
>>> this depends on the implementation of the microarchitecture.
>> Anyway, we need to avoid (or reduce at least) the pure memory
>> clearing
>> performance. Have you double checked whether process_huge_page() is
>> inlined? Perf-profile result can be used to check this too.
>>
>
> Yes, I'm sure the process_huge_page() is inlined.
>
>> When you say from start to end, you mean to use clear_gigantic_page()
>> directly, or change process_huge_page() to clear page from start to end?
>>
>
> Using clear_gigantic_page() and changing process_huge_page() to clear
> page from start to end are both good for performance when sequential
> clearing, but no random test so far.
>
>>> 1) Will test some rand test to check the different of performance as
>>> David suggested.
>>>
>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>> set and different machines)
>> I'm starting to use LKP to test.
https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@oneplus.com/
I just remembered that we discussed a similar issue for arm64 before.
Can you take a look at it? There's more discussion and test results in
that thread; I think it may be helpful.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-31 8:39 ` Huang, Ying
@ 2024-11-01 7:43 ` Kefeng Wang
2024-11-01 8:16 ` Huang, Ying
2024-12-01 2:15 ` Andrew Morton
1 sibling, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-11-01 7:43 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
On 2024/10/31 16:39, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>
> [snip]
>>
>>>> 1) Will test some rand test to check the different of performance as
>>>> David suggested.>>>>
>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>> set and different machines)
>>> I'm starting to use LKP to test.
>>
>> Greet.
Sorry for the late reply,
>
> I have run some tests with LKP to test.
>
> Firstly, there's almost no measurable difference between clearing pages
> from start to end or from end to start on Intel server CPU. I guess
> that there's some similar optimization for both direction.
>
> For multiple processes (same as logical CPU number)
> vm-scalability/anon-w-seq test case, the benchmark score increases
> about 22.4%.
So process_huge_page() is better than clear_gigantic_page() on Intel?
Could you test the following case on x86?
echo 10240 >
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /hugetlbfs/
mount none /hugetlbfs/ -t hugetlbfs
rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
-d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
/hugetlbfs/test
>
> For multiple processes vm-scalability/anon-w-rand test case, no
> measurable difference for benchmark score.
>
> So, the optimization helps sequential workload mainly.
>
> In summary, on x86, process_huge_page() will not introduce any
> regression. And it helps some workload.
>
> However, on ARM64, it does introduce some regression for clearing pages
> from end to start. That needs to be addressed. I guess that the
> regression can be resolved via using more clearing from start to end
> (but not all). For example, can you take a look at the patch below?
> Which uses the similar framework as before, but clear each small trunk
> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
> the regression can be restored.
>
> WARNING: the patch is only build tested.
Base: baseline
Change1: using clear_gigantic_page() for 2M PMD
Change2: your patch with MPAGE_NRPAGES=16
Change3: Case3 + fix[1]
Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
1. For rand write,
case-anon-w-rand/case-anon-w-rand-hugetlb show no measurable difference
2. For seq write,
1) case-anon-w-seq-mt:
base:
real 0m2.490s 0m2.254s 0m2.272s
user 1m59.980s 2m23.431s 2m18.739s
sys 1m3.675s 1m15.462s 1m15.030s
Change1:
real 0m2.234s 0m2.225s 0m2.159s
user 2m56.105s 2m57.117s 3m0.489s
sys 0m17.064s 0m17.564s 0m16.150s
Change2:
real 0m2.244s 0m2.384s 0m2.370s
user 2m39.413s 2m41.990s 2m42.229s
sys 0m19.826s 0m18.491s 0m18.053s
Change3: // best performance
real 0m2.155s 0m2.204s 0m2.194s
user 3m2.640s 2m55.837s 3m0.902s
sys 0m17.346s 0m17.630s 0m18.197s
Change4:
real 0m2.287s 0m2.377s 0m2.284s
user 2m37.030s 2m52.868s 3m17.593s
sys 0m15.445s 0m34.430s 0m45.224s
2) case-anon-w-seq-hugetlb
very similar to 1); Change4 is slightly better than Change3, but not by
a big margin.
3) hugetlbfs fallocate 20G
Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) <
Change2(0m1.275s) < base(0m3.016s)
In summary, Change3 is the best and Change1 is good on my arm64 machine.
>
> Best Regards,
> Huang, Ying
>
> -----------------------------------8<----------------------------------------
> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Thu, 31 Oct 2024 11:13:57 +0800
> Subject: [PATCH] mpage clear
>
> ---
> mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 67 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ccee51adfbb..1fdc548c4275 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
> return 0;
> }
>
> +#define MPAGE_NRPAGES (1<<4)
> +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES)
> +static inline int clear_huge_page(
> + unsigned long addr_hint, unsigned int nr_pages,
> + int (*process_subpage)(unsigned long addr, int idx, void *arg),
> + void *arg)
> +{
> + int i, n, base, l, ret;
> + unsigned long addr = addr_hint &
> + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
> + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
> +
> + /* Process target subpage last to keep its cache lines hot */
> + might_sleep();
> + n = (addr_hint - addr) / MPAGE_SIZE;
> + if (2 * n <= nr_mpages) {
> + /* If target subpage in first half of huge page */
> + base = 0;
> + l = n;
> + /* Process subpages at the end of huge page */
> + for (i = nr_mpages - 1; i >= 2 * n; i--) {
> + cond_resched();
> + ret = process_subpage(addr + i * MPAGE_SIZE,
> + i * MPAGE_NRPAGES, arg);
> + if (ret)
> + return ret;
> + }
> + } else {
> + /* If target subpage in second half of huge page */
> + base = nr_mpages - 2 * (nr_mpages - n);
> + l = nr_mpages - n;
> + /* Process subpages at the begin of huge page */
> + for (i = 0; i < base; i++) {
> + cond_resched();
> + ret = process_subpage(addr + i * MPAGE_SIZE,
> + i * MPAGE_NRPAGES, arg);
> + if (ret)
> + return ret;
> + }
> + }
> + /*
> + * Process remaining subpages in left-right-left-right pattern
> + * towards the target subpage
> + */
> + for (i = 0; i < l; i++) {
> + int left_idx = base + i;
> + int right_idx = base + 2 * l - 1 - i;
> +
> + cond_resched();
> + ret = process_subpage(addr + left_idx * MPAGE_SIZE,
> + left_idx * MPAGE_NRPAGES, arg);
> + if (ret)
> + return ret;
> + cond_resched();
> + ret = process_subpage(addr + right_idx * MPAGE_SIZE,
> + right_idx * MPAGE_NRPAGES, arg);
> + if (ret)
> + return ret;
> + }
> + return 0;
> +}
> +
> static void clear_gigantic_page(struct folio *folio, unsigned long addr,
> unsigned int nr_pages)
> {
> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
> static int clear_subpage(unsigned long addr, int idx, void *arg)
> {
> struct folio *folio = arg;
> + int i;
>
> - clear_user_highpage(folio_page(folio, idx), addr);
> + for (i = 0; i < MPAGE_NRPAGES; i++)
> + clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
> return 0;
> }
>
> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> {
> unsigned int nr_pages = folio_nr_pages(folio);
>
> - if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
> + if (unlikely(nr_pages != HPAGE_PMD_NR))
> clear_gigantic_page(folio, addr_hint, nr_pages);
> else
> - process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
> + clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
> }
>
> static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
[1] fix patch
diff --git a/mm/memory.c b/mm/memory.c
index b22d4b83295b..aee99ede0c4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
base = 0;
l = n;
/* Process subpages at the end of huge page */
- for (i = nr_mpages - 1; i >= 2 * n; i--) {
+ for (i = 2 * n; i < nr_mpages; i++) {
cond_resched();
ret = process_subpage(addr + i * MPAGE_SIZE,
i * MPAGE_NRPAGES, arg);
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-11-01 6:18 ` Huang, Ying
@ 2024-11-01 7:51 ` Kefeng Wang
0 siblings, 0 replies; 33+ messages in thread
From: Kefeng Wang @ 2024-11-01 7:51 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
On 2024/11/1 14:18, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>
>> On 2024/10/30 11:21, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/30 9:04, Huang, Ying wrote:
>>>>> David Hildenbrand <david@redhat.com> writes:
>>>>>
>>>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>>>> shouldn't
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>>>
>>>>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>>>>>> the
>>>>>>>>>>>>> 'base'.
>>>>>>>>>>>>
>>>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>>>
>>>>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>>>>>
>>>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>>>>>> Or am I missing something important?
>>>>>>>>>
>>>>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>>>>>
>>>>>>>> And unfortunately neither the documentation nor the function name
>>>>>>>> expresses that :(
>>>>>>>>
>>>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>>>
>>>>>>> Actually, could we drop the process_huge_page() totally, from my
>>>>>>> testcase[1], process_huge_page() is not better than clear/copy page
>>>>>>> from start to last, and sequential clearing/copying maybe more
>>>>>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>>>>>> to test to check the performance, since the process_huge_page()
>>>>>>> was submitted by Ying, what's your opinion?
>>>>> I don't think that it's a good idea to revert the commit without
>>>>> studying and root causing the issues. I can work together with you on
>>>>> that. If we have solid and well explained data to prove
>>>>> process_huge_page() isn't benefitial, we can revert the commit.
>>>>
>>>>
>>>> Take 'fallocate 20G' as an example, before
>>>>
>>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>>> /mnt/hugetlbfs/test':
>>> IIUC, fallocate will zero pages, but will not touch them at all,
>>> right?
>>> If so, no cache benefit from clearing referenced page last.
>>
>>
>> Yes, for this case, only clear page.
>>>
>>>> 3,118.94 msec task-clock # 0.999 CPUs
>>>> utilized
>>>> 30 context-switches # 0.010 K/sec
>>>> 1 cpu-migrations # 0.000 K/sec
>>>> 136 page-faults # 0.044 K/sec
>>>> 8,092,075,873 cycles #
>>>> 2.594 GHz (92.82%)
>>>> 1,624,587,663 instructions # 0.20 insn per
>>>> cycle (92.83%)
>>>> 395,341,850 branches # 126.755 M/sec
>>>> (92.82%)
>>>> 3,872,302 branch-misses # 0.98% of all
>>>> branches (92.83%)
>>>> 1,398,066,701 L1-dcache-loads # 448.251 M/sec
>>>> (92.82%)
>>>> 58,124,626 L1-dcache-load-misses # 4.16% of all
>>>> L1-dcache accesses (92.82%)
>>>> 1,032,527 LLC-loads # 0.331 M/sec
>>>> (92.82%)
>>>> 498,684 LLC-load-misses # 48.30% of all
>>>> LL-cache accesses (92.84%)
>>>> 473,689,004 L1-icache-loads # 151.875 M/sec
>>>> (92.82%)
>>>> 356,721 L1-icache-load-misses # 0.08% of all
>>>> L1-icache accesses (92.85%)
>>>> 1,947,644,987 dTLB-loads # 624.458 M/sec
>>>> (92.95%)
>>>> 10,185 dTLB-load-misses # 0.00% of all
>>>> dTLB cache accesses (92.96%)
>>>> 474,622,896 iTLB-loads # 152.174 M/sec
>>>> (92.95%)
>>>> 94 iTLB-load-misses # 0.00% of all
>>>> iTLB cache accesses (85.69%)
>>>>
>>>> 3.122844830 seconds time elapsed
>>>>
>>>> 0.000000000 seconds user
>>>> 3.107259000 seconds sys
>>>>
>>>> and after(clear from start to end)
>>>>
>>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>>> /mnt/hugetlbfs/test':
>>>>
>>>> 1,135.53 msec task-clock # 0.999 CPUs
>>>> utilized
>>>> 10 context-switches # 0.009 K/sec
>>>> 1 cpu-migrations # 0.001 K/sec
>>>> 137 page-faults # 0.121 K/sec
>>>> 2,946,673,587 cycles #
>>>> 2.595 GHz (92.67%)
>>>> 1,620,704,205 instructions # 0.55 insn per
>>>> cycle (92.61%)
>>>> 394,595,772 branches # 347.499 M/sec
>>>> (92.60%)
>>>> 130,756 branch-misses # 0.03% of all
>>>> branches (92.84%)
>>>> 1,396,726,689 L1-dcache-loads # 1230.022 M/sec
>>>> (92.96%)
>>>> 338,344 L1-dcache-load-misses # 0.02% of all
>>>> L1-dcache accesses (92.95%)
>>>> 111,737 LLC-loads # 0.098 M/sec
>>>> (92.96%)
>>>> 67,486 LLC-load-misses # 60.40% of all
>>>> LL-cache accesses (92.96%)
>>>> 418,198,663 L1-icache-loads # 368.285 M/sec
>>>> (92.96%)
>>>> 173,764 L1-icache-load-misses # 0.04% of all
>>>> L1-icache accesses (92.96%)
>>>> 2,203,364,632 dTLB-loads # 1940.385 M/sec
>>>> (92.96%)
>>>> 17,195 dTLB-load-misses # 0.00% of all
>>>> dTLB cache accesses (92.95%)
>>>> 418,198,365 iTLB-loads # 368.285 M/sec
>>>> (92.96%)
>>>> 79 iTLB-load-misses # 0.00% of all
>>>> iTLB cache accesses (85.34%)
>>>>
>>>> 1.137015760 seconds time elapsed
>>>>
>>>> 0.000000000 seconds user
>>>> 1.131266000 seconds sys
>>>>
>>>> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but
>>>> this depends on the implementation of the microarchitecture.
>>> Anyway, we need to avoid (or reduce at least) the pure memory
>>> clearing
>>> performance. Have you double checked whether process_huge_page() is
>>> inlined? Perf-profile result can be used to check this too.
>>>
>>
>> Yes, I'm sure the process_huge_page() is inlined.
>>
>>> When you say from start to end, you mean to use clear_gigantic_page()
>>> directly, or change process_huge_page() to clear page from start to end?
>>>
>>
>> Using clear_gigantic_page() and changing process_huge_page() to clear
>> page from start to end are both good for performance when sequential
>> clearing, but no random test so far.
>>
>>>> 1) Will test some rand test to check the different of performance as
>>>> David suggested.
>>>>
>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>> set and different machines)
>>> I'm starting to use LKP to test.
>
> https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@oneplus.com/
>
> Just remembered that we have discussed a similar issue for arm64 before.
> Can you take a look at it? There's more discussion and tests/results in
> the thread, I think that may be helpful.
>
Thanks for the tips, will check it.
> --
> Best Regards,
> Huang, Ying
>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-11-01 7:43 ` Kefeng Wang
@ 2024-11-01 8:16 ` Huang, Ying
2024-11-01 9:45 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-11-01 8:16 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> On 2024/10/31 16:39, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>> [snip]
>>>
>>>>> 1) Will test some rand test to check the different of performance as
>>>>> David suggested.>>>>
>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>> set and different machines)
>>>> I'm starting to use LKP to test.
>>>
>>> Greet.
>
>
> Sorry for the late,
>
>> I have run some tests with LKP to test.
>> Firstly, there's almost no measurable difference between clearing
>> pages
>> from start to end or from end to start on Intel server CPU. I guess
>> that there's some similar optimization for both direction.
>> For multiple processes (same as logical CPU number)
>> vm-scalability/anon-w-seq test case, the benchmark score increases
>> about 22.4%.
>
> So process_huge_page is better than clear_gigantic_page() on Intel?
For the vm-scalability/anon-w-seq test case, it is, because the
performance of forward and backward clearing is almost the same, and
the user-space accesses get the cache-hot benefit.
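
For reference, the access pattern behind that cache-hot argument is
just a sequential write stream; a hypothetical user-space illustration
(not the benchmark itself, names and setup are made up) is:

#include <stddef.h>
#include <sys/mman.h>

/*
 * Each store that faults in a huge page is immediately followed by
 * more stores to the same (target) subpage, so clearing the target
 * subpage last leaves its cache lines hot for the user's writes.
 */
static void write_seq(size_t len)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return;
	for (size_t off = 0; off < len; off++)
		p[off] = 1;	/* sequential write stream */
	munmap(p, len);
}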
> Could you test the following case on x86?
> echo 10240 >
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> mkdir -p /hugetlbfs/
> mount none /hugetlbfs/ -t hugetlbfs
> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
> /hugetlbfs/test
It's not trivial for me to run this test, because 0day wraps test
cases. Do you know of an existing test case that provides this, for
example, in vm-scalability?
>> For multiple processes vm-scalability/anon-w-rand test case, no
>> measurable difference for benchmark score.
>> So, the optimization helps sequential workload mainly.
>> In summary, on x86, process_huge_page() will not introduce any
>> regression. And it helps some workload.
>> However, on ARM64, it does introduce some regression for clearing
>> pages
>> from end to start. That needs to be addressed. I guess that the
>> regression can be resolved via using more clearing from start to end
>> (but not all). For example, can you take a look at the patch below?
>> Which uses the similar framework as before, but clear each small trunk
>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>> the regression can be restored.
>> WARNING: the patch is only build tested.
>
>
> Base: baseline
> Change1: using clear_gigantic_page() for 2M PMD
> Change2: your patch with MPAGE_NRPAGES=16
> Change3: Case3 + fix[1]
What is case3?
> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>
> 1. For rand write,
> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>
> 2. For seq write,
>
> 1) case-anon-w-seq-mt:
Can you try case-anon-w-seq? That may be more stable.
> base:
> real 0m2.490s 0m2.254s 0m2.272s
> user 1m59.980s 2m23.431s 2m18.739s
> sys 1m3.675s 1m15.462s 1m15.030s
>
> Change1:
> real 0m2.234s 0m2.225s 0m2.159s
> user 2m56.105s 2m57.117s 3m0.489s
> sys 0m17.064s 0m17.564s 0m16.150s
>
> Change2:
> real 0m2.244s 0m2.384s 0m2.370s
> user 2m39.413s 2m41.990s 2m42.229s
> sys 0m19.826s 0m18.491s 0m18.053s
It appears strange. There's not much cache-hot benefit even if we clear
pages from end to beginning (with a larger chunk).
However, the sys time improves a lot. This shows that clearing pages in
large chunks helps on ARM64.
> Change3: // best performance
> real 0m2.155s 0m2.204s 0m2.194s
> user 3m2.640s 2m55.837s 3m0.902s
> sys 0m17.346s 0m17.630s 0m18.197s
>
> Change4:
> real 0m2.287s 0m2.377s 0m2.284s
> user 2m37.030s 2m52.868s 3m17.593s
> sys 0m15.445s 0m34.430s 0m45.224s
Change4 is essentially the same as Change1. I don't know why they are
different. Is there some large variation from run to run?
Can you further optimize the prototype patch below? I think that it has
the potential to fix your issue.
> 2) case-anon-w-seq-hugetlb
> very similar 1), Change4 slightly better than Change3, but not big
> different.
>
> 3) hugetlbfs fallocate 20G
> Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) <
> Change2(0m1.275s) < base(0m3.016s)
>
> In summary, the Change3 is best and Change1 is good on my arm64 machine.
>
>> Best Regards,
>> Huang, Ying
>> -----------------------------------8<----------------------------------------
>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@intel.com>
>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>> Subject: [PATCH] mpage clear
>> ---
>> mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 67 insertions(+), 3 deletions(-)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3ccee51adfbb..1fdc548c4275 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>> return 0;
>> }
>> +#define MPAGE_NRPAGES (1<<4)
>> +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES)
>> +static inline int clear_huge_page(
>> + unsigned long addr_hint, unsigned int nr_pages,
>> + int (*process_subpage)(unsigned long addr, int idx, void *arg),
>> + void *arg)
>> +{
>> + int i, n, base, l, ret;
>> + unsigned long addr = addr_hint &
>> + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>> + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>> +
>> + /* Process target subpage last to keep its cache lines hot */
>> + might_sleep();
>> + n = (addr_hint - addr) / MPAGE_SIZE;
>> + if (2 * n <= nr_mpages) {
>> + /* If target subpage in first half of huge page */
>> + base = 0;
>> + l = n;
>> + /* Process subpages at the end of huge page */
>> + for (i = nr_mpages - 1; i >= 2 * n; i--) {
>> + cond_resched();
>> + ret = process_subpage(addr + i * MPAGE_SIZE,
>> + i * MPAGE_NRPAGES, arg);
>> + if (ret)
>> + return ret;
>> + }
>> + } else {
>> + /* If target subpage in second half of huge page */
>> + base = nr_mpages - 2 * (nr_mpages - n);
>> + l = nr_mpages - n;
>> + /* Process subpages at the begin of huge page */
>> + for (i = 0; i < base; i++) {
>> + cond_resched();
>> + ret = process_subpage(addr + i * MPAGE_SIZE,
>> + i * MPAGE_NRPAGES, arg);
>> + if (ret)
>> + return ret;
>> + }
>> + }
>> + /*
>> + * Process remaining subpages in left-right-left-right pattern
>> + * towards the target subpage
>> + */
>> + for (i = 0; i < l; i++) {
>> + int left_idx = base + i;
>> + int right_idx = base + 2 * l - 1 - i;
>> +
>> + cond_resched();
>> + ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>> + left_idx * MPAGE_NRPAGES, arg);
>> + if (ret)
>> + return ret;
>> + cond_resched();
>> + ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>> + right_idx * MPAGE_NRPAGES, arg);
>> + if (ret)
>> + return ret;
>> + }
>> + return 0;
>> +}
>> +
>> static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>> unsigned int nr_pages)
>> {
>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>> static int clear_subpage(unsigned long addr, int idx, void *arg)
>> {
>> struct folio *folio = arg;
>> + int i;
>> - clear_user_highpage(folio_page(folio, idx), addr);
>> + for (i = 0; i < MPAGE_NRPAGES; i++)
>> + clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>> return 0;
>> }
>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio,
>> unsigned long addr_hint)
>> {
>> unsigned int nr_pages = folio_nr_pages(folio);
>> - if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>> + if (unlikely(nr_pages != HPAGE_PMD_NR))
>> clear_gigantic_page(folio, addr_hint, nr_pages);
>> else
>> - process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>> + clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>> }
>> static int copy_user_gigantic_page(struct folio *dst, struct
>> folio *src,
>
>
> [1] fix patch
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b22d4b83295b..aee99ede0c4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
> base = 0;
> l = n;
> /* Process subpages at the end of huge page */
> - for (i = nr_mpages - 1; i >= 2 * n; i--) {
> + for (i = 2 * n; i < nr_mpages; i++) {
> cond_resched();
> ret = process_subpage(addr + i * MPAGE_SIZE,
> i * MPAGE_NRPAGES, arg);
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-11-01 8:16 ` Huang, Ying
@ 2024-11-01 9:45 ` Kefeng Wang
2024-11-04 2:35 ` Huang, Ying
0 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-11-01 9:45 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
On 2024/11/1 16:16, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>
>> On 2024/10/31 16:39, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>> [snip]
>>>>
>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>> David suggested.>>>>
>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>>> set and different machines)
>>>>> I'm starting to use LKP to test.
>>>>
>>>> Greet.
>>
>>
>> Sorry for the late,
>>
>>> I have run some tests with LKP to test.
>>> Firstly, there's almost no measurable difference between clearing
>>> pages
>>> from start to end or from end to start on Intel server CPU. I guess
>>> that there's some similar optimization for both direction.
>>> For multiple processes (same as logical CPU number)
>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>> about 22.4%.
>>
>> So process_huge_page is better than clear_gigantic_page() on Intel?
>
> For vm-scalability/anon-w-seq test case, it is. Because the performance
> of forward and backward clearing is almost same, and the user space
> accessing has cache-hot benefit.
>
>> Could you test the following case on x86?
>> echo 10240 >
>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> mkdir -p /hugetlbfs/
>> mount none /hugetlbfs/ -t hugetlbfs
>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
>> /hugetlbfs/test
>
> It's not trivial for me to do this test. Because 0day wraps test cases.
> Do you know which existing test cases provide this? For example, in
> vm-scalability?
I don't know of a public fallocate test; I will try to find an Intel
machine to test this case.
>
>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>> measurable difference for benchmark score.
>>> So, the optimization helps sequential workload mainly.
>>> In summary, on x86, process_huge_page() will not introduce any
>>> regression. And it helps some workload.
>>> However, on ARM64, it does introduce some regression for clearing
>>> pages
>>> from end to start. That needs to be addressed. I guess that the
>>> regression can be resolved via using more clearing from start to end
>>> (but not all). For example, can you take a look at the patch below?
>>> Which uses the similar framework as before, but clear each small trunk
>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>> the regression can be restored.
>>> WARNING: the patch is only build tested.
>>
>>
>> Base: baseline
>> Change1: using clear_gigantic_page() for 2M PMD
>> Change2: your patch with MPAGE_NRPAGES=16
>> Change3: Case3 + fix[1]
>
> What is case3?
Oh, it is Change2.
>
>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>
>> 1. For rand write,
>> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>
>> 2. For seq write,
>>
>> 1) case-anon-w-seq-mt:
>
> Can you try case-anon-w-seq? That may be more stable.
>
>> base:
>> real 0m2.490s 0m2.254s 0m2.272s
>> user 1m59.980s 2m23.431s 2m18.739s
>> sys 1m3.675s 1m15.462s 1m15.030s
>>
>> Change1:
>> real 0m2.234s 0m2.225s 0m2.159s
>> user 2m56.105s 2m57.117s 3m0.489s
>> sys 0m17.064s 0m17.564s 0m16.150s
>>
>> Change2:
>> real 0m2.244s 0m2.384s 0m2.370s
>> user 2m39.413s 2m41.990s 2m42.229s
>> sys 0m19.826s 0m18.491s 0m18.053s
>
> It appears strange. There's no much cache hot benefit even if we clear
> pages from end to begin (with larger chunk).
>
> However, sys time improves a lot. This shows clearing page with large
> chunk helps on ARM64.
>
>> Change3: // best performance
>> real 0m2.155s 0m2.204s 0m2.194s
>> user 3m2.640s 2m55.837s 3m0.902s
>> sys 0m17.346s 0m17.630s 0m18.197s
>>
>> Change4:
>> real 0m2.287s 0m2.377s 0m2.284s
>> user 2m37.030s 2m52.868s 3m17.593s
>> sys 0m15.445s 0m34.430s 0m45.224s
>
> Change4 is essentially same as Change1. I don't know why they are
> different. Is there some large variation among run to run?
As shown above, I tested three times and the results are relatively
stable, at least for the 'real' time. I will try case-anon-w-seq.
>
> Can you further optimize the prototype patch below? I think that it has
> potential to fix your issue.
Yes, thanks for your help, but this will make process_huge_page() a
little more complicated :)
>
>> 2) case-anon-w-seq-hugetlb
>> very similar 1), Change4 slightly better than Change3, but not big
>> different.
>>
>> 3) hugetlbfs fallocate 20G
>> Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) <
>> Change2(0m1.275s) < base(0m3.016s)
>>
>> In summary, the Change3 is best and Change1 is good on my arm64 machine.
>>
>>> Best Regards,
>>> Huang, Ying
>>> -----------------------------------8<----------------------------------------
>>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>>> From: Huang Ying <ying.huang@intel.com>
>>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>>> Subject: [PATCH] mpage clear
>>> ---
>>> mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 67 insertions(+), 3 deletions(-)
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 3ccee51adfbb..1fdc548c4275 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>> return 0;
>>> }
>>> +#define MPAGE_NRPAGES (1<<4)
>>> +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES)
>>> +static inline int clear_huge_page(
>>> + unsigned long addr_hint, unsigned int nr_pages,
>>> + int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>> + void *arg)
>>> +{
>>> + int i, n, base, l, ret;
>>> + unsigned long addr = addr_hint &
>>> + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>> + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>>> +
>>> + /* Process target subpage last to keep its cache lines hot */
>>> + might_sleep();
>>> + n = (addr_hint - addr) / MPAGE_SIZE;
>>> + if (2 * n <= nr_mpages) {
>>> + /* If target subpage in first half of huge page */
>>> + base = 0;
>>> + l = n;
>>> + /* Process subpages at the end of huge page */
>>> + for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>> + cond_resched();
>>> + ret = process_subpage(addr + i * MPAGE_SIZE,
>>> + i * MPAGE_NRPAGES, arg);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> + } else {
>>> + /* If target subpage in second half of huge page */
>>> + base = nr_mpages - 2 * (nr_mpages - n);
>>> + l = nr_mpages - n;
>>> + /* Process subpages at the begin of huge page */
>>> + for (i = 0; i < base; i++) {
>>> + cond_resched();
>>> + ret = process_subpage(addr + i * MPAGE_SIZE,
>>> + i * MPAGE_NRPAGES, arg);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> + }
>>> + /*
>>> + * Process remaining subpages in left-right-left-right pattern
>>> + * towards the target subpage
>>> + */
>>> + for (i = 0; i < l; i++) {
>>> + int left_idx = base + i;
>>> + int right_idx = base + 2 * l - 1 - i;
>>> +
>>> + cond_resched();
>>> + ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>>> + left_idx * MPAGE_NRPAGES, arg);
>>> + if (ret)
>>> + return ret;
>>> + cond_resched();
>>> + ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>>> + right_idx * MPAGE_NRPAGES, arg);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>> unsigned int nr_pages)
>>> {
>>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>> static int clear_subpage(unsigned long addr, int idx, void *arg)
>>> {
>>> struct folio *folio = arg;
>>> + int i;
>>> - clear_user_highpage(folio_page(folio, idx), addr);
>>> + for (i = 0; i < MPAGE_NRPAGES; i++)
>>> + clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>> return 0;
>>> }
>>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio,
>>> unsigned long addr_hint)
>>> {
>>> unsigned int nr_pages = folio_nr_pages(folio);
>>> - if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>> + if (unlikely(nr_pages != HPAGE_PMD_NR))
>>> clear_gigantic_page(folio, addr_hint, nr_pages);
>>> else
>>> - process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>> + clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>> }
>>> static int copy_user_gigantic_page(struct folio *dst, struct
>>> folio *src,
>>
>>
>> [1] fix patch
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b22d4b83295b..aee99ede0c4f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>> base = 0;
>> l = n;
>> /* Process subpages at the end of huge page */
>> - for (i = nr_mpages - 1; i >= 2 * n; i--) {
>> + for (i = 2 * n; i < nr_mpages; i++) {
>> cond_resched();
>> ret = process_subpage(addr + i * MPAGE_SIZE,
>> i * MPAGE_NRPAGES, arg);
>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-11-01 9:45 ` Kefeng Wang
@ 2024-11-04 2:35 ` Huang, Ying
2024-11-05 2:06 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-11-04 2:35 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> On 2024/11/1 16:16, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>
>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>> [snip]
>>>>>
>>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>>> David suggested.>>>>
>>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>>>> set and different machines)
>>>>>> I'm starting to use LKP to test.
>>>>>
>>>>> Greet.
>>>
>>>
>>> Sorry for the late,
>>>
>>>> I have run some tests with LKP to test.
>>>> Firstly, there's almost no measurable difference between clearing
>>>> pages
>>>> from start to end or from end to start on Intel server CPU. I guess
>>>> that there's some similar optimization for both direction.
>>>> For multiple processes (same as logical CPU number)
>>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>>> about 22.4%.
>>>
>>> So process_huge_page is better than clear_gigantic_page() on Intel?
>> For vm-scalability/anon-w-seq test case, it is. Because the
>> performance
>> of forward and backward clearing is almost same, and the user space
>> accessing has cache-hot benefit.
>>
>>> Could you test the following case on x86?
>>> echo 10240 >
>>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> mkdir -p /hugetlbfs/
>>> mount none /hugetlbfs/ -t hugetlbfs
>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
>>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
>>> /hugetlbfs/test
>> It's not trivial for me to do this test. Because 0day wraps test
>> cases.
>> Do you know which existing test cases provide this? For example, in
>> vm-scalability?
>
> I don't know of a public fallocate test; I will try to find an Intel
> machine to test this case.
I don't expect it to change much, because we have observed that the
performance of forward and backward clearing is similar on Intel.
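
(For anyone who wants a quick sanity check of that on their own box,
something like the userspace sketch below compares the two directions; it
only measures memset() direction over an already-populated anonymous
mapping, not the kernel's clear_user_highpage() path, and the 512M buffer
size is an arbitrary pick.)

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE	4096UL
#define BUF	(512UL << 20)	/* 512M, assumed to be much larger than the LLC */

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *buf = mmap(NULL, BUF, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	double t0, t1, t2;
	unsigned long off;

	if (buf == MAP_FAILED)
		return 1;

	t0 = now();
	for (off = 0; off < BUF; off += PAGE)		/* start -> end */
		memset(buf + off, 0, PAGE);
	t1 = now();
	for (off = BUF; off > 0; off -= PAGE)		/* end -> start */
		memset(buf + off - PAGE, 0, PAGE);
	t2 = now();

	printf("forward:  %.3f s\nbackward: %.3f s\n", t1 - t0, t2 - t1);
	return 0;
}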
>>
>>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>>> measurable difference for benchmark score.
>>>> So, the optimization helps sequential workload mainly.
>>>> In summary, on x86, process_huge_page() will not introduce any
>>>> regression. And it helps some workload.
>>>> However, on ARM64, it does introduce some regression for clearing
>>>> pages
>>>> from end to start. That needs to be addressed. I guess that the
>>>> regression can be resolved via using more clearing from start to end
>>>> (but not all). For example, can you take a look at the patch below?
>>>> Which uses the similar framework as before, but clear each small trunk
>>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>>> the regression can be restored.
>>>> WARNING: the patch is only build tested.
>>>
>>>
>>> Base: baseline
>>> Change1: using clear_gigantic_page() for 2M PMD
>>> Change2: your patch with MPAGE_NRPAGES=16
>>> Change3: Case3 + fix[1]
>> What is case3?
>
> Oh, it is Change2.
Got it.
>>
>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>
>>> 1. For rand write,
>>> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>>
>>> 2. For seq write,
>>>
>>> 1) case-anon-w-seq-mt:
>> Can you try case-anon-w-seq? That may be more stable.
>>
>>> base:
>>> real 0m2.490s 0m2.254s 0m2.272s
>>> user 1m59.980s 2m23.431s 2m18.739s
>>> sys 1m3.675s 1m15.462s 1m15.030s
>>>
>>> Change1:
>>> real 0m2.234s 0m2.225s 0m2.159s
>>> user 2m56.105s 2m57.117s 3m0.489s
>>> sys 0m17.064s 0m17.564s 0m16.150s
>>>
>>> Change2:
>>> real 0m2.244s 0m2.384s 0m2.370s
>>> user 2m39.413s 2m41.990s 2m42.229s
>>> sys 0m19.826s 0m18.491s 0m18.053s
>> It appears strange. There's no much cache hot benefit even if we
>> clear
>> pages from end to begin (with larger chunk).
>> However, sys time improves a lot. This shows clearing page with
>> large
>> chunk helps on ARM64.
>>
>>> Change3: // best performance
>>> real 0m2.155s 0m2.204s 0m2.194s
>>> user 3m2.640s 2m55.837s 3m0.902s
>>> sys 0m17.346s 0m17.630s 0m18.197s
>>>
>>> Change4:
>>> real 0m2.287s 0m2.377s 0m2.284s
>>> user 2m37.030s 2m52.868s 3m17.593s
>>> sys 0m15.445s 0m34.430s 0m45.224s
>> Change4 is essentially same as Change1. I don't know why they are
>> different. Is there some large variation among run to run?
>
> As shown above, I tested three times and the results are relatively
> stable, at least for the real time. I will try case-anon-w-seq.
Can you also show the score of vm-scalability?
TBH, I cannot understand your results. For example, why is there a
measurable difference between Change3 and Change4? In both cases, the
kernel clears pages from start to end.
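
Just to make the two settings concrete (assuming 4K base pages and the 2M
PMD case used in the test), a trivial sketch of the chunk geometry each
MPAGE_NRPAGES value implies:

#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;		/* assumed 4K base page */
	const unsigned long huge_size = 2UL << 20;	/* 2M PMD huge page */
	const unsigned long nrpages[] = { 16, 64 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long mpage_size = nrpages[i] * page_size;

		printf("MPAGE_NRPAGES=%lu: %lu KB per chunk, %lu chunks per 2M page\n",
		       nrpages[i], mpage_size >> 10, huge_size / mpage_size);
	}
	return 0;
}

So MPAGE_NRPAGES=16 splits the 2M page into 32 chunks of 64 KB, while
MPAGE_NRPAGES=64 leaves only 8 chunks of 256 KB; that geometry difference
might still matter even though both variants end up clearing forward for
this workload.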
>> Can you further optimize the prototype patch below? I think that it
>> has
>> potential to fix your issue.
>
> Yes, thanks for your help, but this will make process_huge_page() a
> little more complicated :)
IMHO, we should try to root cause it, then find the proper
solution and optimize (simplify) it.
--
Best Regards,
Huang, Ying
>>
>>> 2) case-anon-w-seq-hugetlb
>>> very similar 1), Change4 slightly better than Change3, but not big
>>> different.
>>>
>>> 3) hugetlbfs fallocate 20G
>>> Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) <
>>> Change2(0m1.275s) < base(0m3.016s)
>>>
>>> In summary, the Change3 is best and Change1 is good on my arm64 machine.
>>>
>>>> Best Regards,
>>>> Huang, Ying
>>>> -----------------------------------8<----------------------------------------
>>>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>>>> From: Huang Ying <ying.huang@intel.com>
>>>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>>>> Subject: [PATCH] mpage clear
>>>> ---
>>>> mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>> 1 file changed, 67 insertions(+), 3 deletions(-)
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 3ccee51adfbb..1fdc548c4275 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>>> return 0;
>>>> }
>>>> +#define MPAGE_NRPAGES (1<<4)
>>>> +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES)
>>>> +static inline int clear_huge_page(
>>>> + unsigned long addr_hint, unsigned int nr_pages,
>>>> + int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>> + void *arg)
>>>> +{
>>>> + int i, n, base, l, ret;
>>>> + unsigned long addr = addr_hint &
>>>> + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>> + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>>>> +
>>>> + /* Process target subpage last to keep its cache lines hot */
>>>> + might_sleep();
>>>> + n = (addr_hint - addr) / MPAGE_SIZE;
>>>> + if (2 * n <= nr_mpages) {
>>>> + /* If target subpage in first half of huge page */
>>>> + base = 0;
>>>> + l = n;
>>>> + /* Process subpages at the end of huge page */
>>>> + for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>>> + cond_resched();
>>>> + ret = process_subpage(addr + i * MPAGE_SIZE,
>>>> + i * MPAGE_NRPAGES, arg);
>>>> + if (ret)
>>>> + return ret;
>>>> + }
>>>> + } else {
>>>> + /* If target subpage in second half of huge page */
>>>> + base = nr_mpages - 2 * (nr_mpages - n);
>>>> + l = nr_mpages - n;
>>>> + /* Process subpages at the begin of huge page */
>>>> + for (i = 0; i < base; i++) {
>>>> + cond_resched();
>>>> + ret = process_subpage(addr + i * MPAGE_SIZE,
>>>> + i * MPAGE_NRPAGES, arg);
>>>> + if (ret)
>>>> + return ret;
>>>> + }
>>>> + }
>>>> + /*
>>>> + * Process remaining subpages in left-right-left-right pattern
>>>> + * towards the target subpage
>>>> + */
>>>> + for (i = 0; i < l; i++) {
>>>> + int left_idx = base + i;
>>>> + int right_idx = base + 2 * l - 1 - i;
>>>> +
>>>> + cond_resched();
>>>> + ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>>>> + left_idx * MPAGE_NRPAGES, arg);
>>>> + if (ret)
>>>> + return ret;
>>>> + cond_resched();
>>>> + ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>>>> + right_idx * MPAGE_NRPAGES, arg);
>>>> + if (ret)
>>>> + return ret;
>>>> + }
>>>> + return 0;
>>>> +}
>>>> +
>>>> static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>> unsigned int nr_pages)
>>>> {
>>>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>> static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>> {
>>>> struct folio *folio = arg;
>>>> + int i;
>>>> - clear_user_highpage(folio_page(folio, idx), addr);
>>>> + for (i = 0; i < MPAGE_NRPAGES; i++)
>>>> + clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>>> return 0;
>>>> }
>>>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio,
>>>> unsigned long addr_hint)
>>>> {
>>>> unsigned int nr_pages = folio_nr_pages(folio);
>>>> - if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>>> + if (unlikely(nr_pages != HPAGE_PMD_NR))
>>>> clear_gigantic_page(folio, addr_hint, nr_pages);
>>>> else
>>>> - process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>> + clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>> }
>>>> static int copy_user_gigantic_page(struct folio *dst, struct
>>>> folio *src,
>>>
>>>
>>> [1] fix patch
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b22d4b83295b..aee99ede0c4f 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>>> base = 0;
>>> l = n;
>>> /* Process subpages at the end of huge page */
>>> - for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>> + for (i = 2 * n; i < nr_mpages; i++) {
>>> cond_resched();
>>> ret = process_subpage(addr + i * MPAGE_SIZE,
>>> i * MPAGE_NRPAGES, arg);
>>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-11-04 2:35 ` Huang, Ying
@ 2024-11-05 2:06 ` Kefeng Wang
0 siblings, 0 replies; 33+ messages in thread
From: Kefeng Wang @ 2024-11-05 2:06 UTC (permalink / raw)
To: Huang, Ying
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
On 2024/11/4 10:35, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>
>> On 2024/11/1 16:16, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>> [snip]
>>>>>>
>>>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>>>> David suggested.>>>>
>>>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>>>>> set and different machines)
>>>>>>> I'm starting to use LKP to test.
>>>>>>
>>>>>> Greet.
>>>>
>>>>
>>>> Sorry for the late,
>>>>
>>>>> I have run some tests with LKP to test.
>>>>> Firstly, there's almost no measurable difference between clearing
>>>>> pages
>>>>> from start to end or from end to start on Intel server CPU. I guess
>>>>> that there's some similar optimization for both direction.
>>>>> For multiple processes (same as logical CPU number)
>>>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>>>> about 22.4%.
>>>>
>>>> So process_huge_page is better than clear_gigantic_page() on Intel?
>>> For vm-scalability/anon-w-seq test case, it is. Because the
>>> performance
>>> of forward and backward clearing is almost same, and the user space
>>> accessing has cache-hot benefit.
>>>
>>>> Could you test the following case on x86?
>>>> echo 10240 >
>>>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>>> mkdir -p /hugetlbfs/
>>>> mount none /hugetlbfs/ -t hugetlbfs
>>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
>>>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
>>>> /hugetlbfs/test
>>> It's not trivial for me to do this test. Because 0day wraps test
>>> cases.
>>> Do you know which existing test cases provide this? For example, in
>>> vm-scalability?
>>
>> I don't know of a public fallocate test; I will try to find an Intel
>> machine to test this case.
>
> I don't expect it to change much, because we have observed that the
> performance of forward and backward clearing is similar on Intel.
I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz machine:
Caches (sum of all):
L1d: 1.1 MiB (36 instances)
L1i: 1.1 MiB (36 instances)
L2: 36 MiB (36 instances)
L3: 49.5 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
Before:

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,856.93 msec task-clock                         #    0.997 CPUs utilized
                 6      context-switches                   #    1.556 /sec
                 1      cpu-migrations                     #    0.259 /sec
               132      page-faults                        #   34.224 /sec
    11,520,934,848      cycles                             #    2.987 GHz                        (19.95%)
       213,731,011      instructions                       #    0.02  insn per cycle             (24.96%)
        58,164,361      branches                           #   15.080 M/sec                      (24.96%)
           262,547      branch-misses                      #    0.45% of all branches            (24.97%)
        96,029,321      CPU_CLK_UNHALTED.REF_XCLK          #   24.898 M/sec
                                                           #    0.3 % tma_frontend_bound
                                                           #    3.3 % tma_retiring
                                                           #   96.4 % tma_backend_bound
                                                           #    0.0 % tma_bad_speculation        (24.99%)
       149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE        #   38.822 M/sec                      (25.01%)
         2,486,326      INT_MISC.RECOVERY_CYCLES_ANY       #  644.638 K/sec                      (20.01%)
        95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #   24.883 M/sec                      (20.01%)
    11,526,783,305      CPU_CLK_UNHALTED.THREAD            #    2.989 G/sec                      (20.01%)
     1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS          #  393.855 M/sec                      (20.01%)
     1,526,020,825      UOPS_ISSUED.ANY                    #  395.657 M/sec                      (20.01%)
        59,784,189      L1-dcache-loads                    #   15.500 M/sec                      (20.01%)
       337,479,254      L1-dcache-load-misses              #  564.50% of all L1-dcache accesses  (20.02%)
           175,954      LLC-loads                          #   45.620 K/sec                      (20.02%)
            51,955      LLC-load-misses                    #   29.53% of all L1-icache accesses  (20.02%)
   <not supported>      L1-icache-loads
         2,864,230      L1-icache-load-misses                                                    (20.02%)
        59,769,391      dTLB-loads                         #   15.497 M/sec                      (20.02%)
               819      dTLB-load-misses                   #    0.00% of all dTLB cache accesses (20.02%)
             2,459      iTLB-loads                         #  637.553 /sec                       (20.01%)
               370      iTLB-load-misses                   #   15.05% of all iTLB cache accesses (19.98%)

       3.870393637 seconds time elapsed

       0.000000000 seconds user
       3.833021000 seconds sys
After (using clear_gigantic_page()):

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          4,426.18 msec task-clock                         #    0.994 CPUs utilized
                 8      context-switches                   #    1.807 /sec
                 1      cpu-migrations                     #    0.226 /sec
               131      page-faults                        #   29.597 /sec
    13,221,263,588      cycles                             #    2.987 GHz                        (19.98%)
       215,924,995      instructions                       #    0.02  insn per cycle             (25.00%)
        58,430,182      branches                           #   13.201 M/sec                      (25.01%)
           279,381      branch-misses                      #    0.48% of all branches            (25.03%)
       110,199,114      CPU_CLK_UNHALTED.REF_XCLK          #   24.897 M/sec
                                                           #    0.3 % tma_frontend_bound
                                                           #    2.9 % tma_retiring
                                                           #   96.8 % tma_backend_bound
                                                           #    0.0 % tma_bad_speculation        (25.06%)
       160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE        #   36.296 M/sec                      (25.07%)
         2,559,970      INT_MISC.RECOVERY_CYCLES_ANY       #  578.370 K/sec                      (20.05%)
       110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #   24.904 M/sec                      (20.05%)
    13,227,924,727      CPU_CLK_UNHALTED.THREAD            #    2.989 G/sec                      (20.03%)
     1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS          #  344.545 M/sec                      (20.01%)
     1,531,307,263      UOPS_ISSUED.ANY                    #  345.966 M/sec                      (19.98%)
        60,600,471      L1-dcache-loads                    #   13.691 M/sec                      (19.96%)
       337,576,917      L1-dcache-load-misses              #  557.05% of all L1-dcache accesses  (19.96%)
           177,157      LLC-loads                          #   40.025 K/sec                      (19.96%)
            48,056      LLC-load-misses                    #   27.13% of all L1-icache accesses  (19.97%)
   <not supported>      L1-icache-loads
         2,653,617      L1-icache-load-misses                                                    (19.97%)
        60,609,241      dTLB-loads                         #   13.693 M/sec                      (19.97%)
               530      dTLB-load-misses                   #    0.00% of all dTLB cache accesses (19.97%)
             1,952      iTLB-loads                         #  441.013 /sec                       (19.97%)
             3,059      iTLB-load-misses                   #  156.71% of all iTLB cache accesses (19.97%)

       4.450664421 seconds time elapsed

       0.000984000 seconds user
       4.397795000 seconds sys
This shows that backward clearing is better than forward clearing, at least on this CPU.
>
>>>
>>>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>>>> measurable difference for benchmark score.
>>>>> So, the optimization helps sequential workload mainly.
>>>>> In summary, on x86, process_huge_page() will not introduce any
>>>>> regression. And it helps some workload.
>>>>> However, on ARM64, it does introduce some regression for clearing
>>>>> pages
>>>>> from end to start. That needs to be addressed. I guess that the
>>>>> regression can be resolved via using more clearing from start to end
>>>>> (but not all). For example, can you take a look at the patch below?
>>>>> Which uses the similar framework as before, but clear each small trunk
>>>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>>>> the regression can be restored.
>>>>> WARNING: the patch is only build tested.
>>>>
>>>>
>>>> Base: baseline
>>>> Change1: using clear_gigantic_page() for 2M PMD
>>>> Change2: your patch with MPAGE_NRPAGES=16
>>>> Change3: Case3 + fix[1]
>>> What is case3?
>>
>> Oh, it is Change2.
>
> Got it.
>
>>>
>>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>>
>>>> 1. For rand write,
>>>> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>>>
>>>> 2. For seq write,
>>>>
>>>> 1) case-anon-w-seq-mt:
>>> Can you try case-anon-w-seq? That may be more stable.
>>>
>>>> base:
>>>> real 0m2.490s 0m2.254s 0m2.272s
>>>> user 1m59.980s 2m23.431s 2m18.739s
>>>> sys 1m3.675s 1m15.462s 1m15.030s
>>>>
>>>> Change1:
>>>> real 0m2.234s 0m2.225s 0m2.159s
>>>> user 2m56.105s 2m57.117s 3m0.489s
>>>> sys 0m17.064s 0m17.564s 0m16.150s
>>>>
>>>> Change2:
>>>> real 0m2.244s 0m2.384s 0m2.370s
>>>> user 2m39.413s 2m41.990s 2m42.229s
>>>> sys 0m19.826s 0m18.491s 0m18.053s
>>> It appears strange. There's no much cache hot benefit even if we
>>> clear
>>> pages from end to begin (with larger chunk).
>>> However, sys time improves a lot. This shows clearing page with
>>> large
>>> chunk helps on ARM64.
>>>
>>>> Change3: // best performance
>>>> real 0m2.155s 0m2.204s 0m2.194s
>>>> user 3m2.640s 2m55.837s 3m0.902s
>>>> sys 0m17.346s 0m17.630s 0m18.197s
>>>>
>>>> Change4:
>>>> real 0m2.287s 0m2.377s 0m2.284s
>>>> user 2m37.030s 2m52.868s 3m17.593s
>>>> sys 0m15.445s 0m34.430s 0m45.224s
>>> Change4 is essentially same as Change1. I don't know why they are
>>> different. Is there some large variation among run to run?
>>
>> As shown above, I tested three times and the results are relatively
>> stable, at least for the real time. I will try case-anon-w-seq.
>
> Can you also show the score of vm-scalability?
>
> TBH, I cannot understand your results. For example, why there are
> measurable difference between Change3 and Change4? In both cases, the
> kernel clears pages from start to end.
OK, I will retest once I can access the machine again.
>
>>> Can you further optimize the prototype patch below? I think that it
>>> has
>>> potential to fix your issue.
>>
>> Yes, thanks for your help, but this will make process_huge_page() a
>> little more complicated :)
>
> IMHO, we should try to root cause it, then try to find the proper
> solution and optimize (simplifies) it.
From the above fallocate test, it seems that different
microarchitectures perform differently on Intel too.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-10-31 8:39 ` Huang, Ying
2024-11-01 7:43 ` Kefeng Wang
@ 2024-12-01 2:15 ` Andrew Morton
2024-12-01 5:37 ` Huang, Ying
1 sibling, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2024-12-01 2:15 UTC (permalink / raw)
To: Huang, Ying
Cc: Kefeng Wang, David Hildenbrand, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> However, on ARM64, it does introduce some regression for clearing pages
> from end to start. That needs to be addressed. I guess that the
> regression can be resolved via using more clearing from start to end
> (but not all). For example, can you take a look at the patch below?
> Which uses the similar framework as before, but clear each small trunk
> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
> the regression can be restored.
>
> WARNING: the patch is only build tested.
I'm holding this patch in mm-unstable because it's unclear (to me) that
this regression has been adequately addressed?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-12-01 2:15 ` Andrew Morton
@ 2024-12-01 5:37 ` Huang, Ying
2024-12-02 1:03 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2024-12-01 5:37 UTC (permalink / raw)
To: Andrew Morton, Kefeng Wang
Cc: David Hildenbrand, Matthew Wilcox, Muchun Song, linux-mm, Zi Yan
Hi, Andrew,
Andrew Morton <akpm@linux-foundation.org> writes:
> On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> However, on ARM64, it does introduce some regression for clearing pages
>> from end to start. That needs to be addressed. I guess that the
>> regression can be resolved via using more clearing from start to end
>> (but not all). For example, can you take a look at the patch below?
>> Which uses the similar framework as before, but clear each small trunk
>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>> the regression can be restored.
>>
>> WARNING: the patch is only build tested.
>
> I'm holding this patch in mm-unstable because it's unclear (to me) that
> this regression has been adequately addressed?
Yes. This patch isn't ready to be merged yet. More work is needed to
root cause the problem and implement the proper fix. I guess that
Kefeng will continue to work on this.
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-12-01 5:37 ` Huang, Ying
@ 2024-12-02 1:03 ` Kefeng Wang
2024-12-06 1:47 ` Andrew Morton
0 siblings, 1 reply; 33+ messages in thread
From: Kefeng Wang @ 2024-12-02 1:03 UTC (permalink / raw)
To: Huang, Ying, Andrew Morton
Cc: David Hildenbrand, Matthew Wilcox, Muchun Song, linux-mm, Zi Yan
Hi Andrew and Ying,
On 2024/12/1 13:37, Huang, Ying wrote:
> Hi, Andrew,
>
> Andrew Morton <akpm@linux-foundation.org> writes:
>
>> On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>>
>>> However, on ARM64, it does introduce some regression for clearing pages
>>> from end to start. That needs to be addressed. I guess that the
>>> regression can be resolved via using more clearing from start to end
>>> (but not all). For example, can you take a look at the patch below?
>>> Which uses the similar framework as before, but clear each small trunk
>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>> the regression can be restored.
>>>
>>> WARNING: the patch is only build tested.
>>
>> I'm holding this patch in mm-unstable because it's unclear (to me) that
>> this regression has been adequately addressed?
>
> Yes. This patch isn't ready to be merged yet. More works are needed to
> root cause the problem and implement the proper fix. I guess that
> Kefeng will continue to work on this.
Sorry for the delay, I've been buried in other tasks.
The two bugfix patches below in -next should be merged first; they are
not related to the performance issue.
84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
9fbd869b19b8 mm: use aligned address in clear_gigantic_page()
and, following Ying's suggestion, I am trying to fix the performance
issue on arm64 (in a new series), but I need to confirm there is no
regression on other arm64 machines; I will send the new patches once
the testing is complete.
So Andrew, please pick up the above two bugfix patches for
v6.13-rcX if there is no objection.
>
> ---
> Best Regards,
> Huang, Ying
>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-12-02 1:03 ` Kefeng Wang
@ 2024-12-06 1:47 ` Andrew Morton
2024-12-06 2:08 ` Kefeng Wang
0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2024-12-06 1:47 UTC (permalink / raw)
To: Kefeng Wang
Cc: Huang, Ying, David Hildenbrand, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
On Mon, 2 Dec 2024 09:03:45 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>> WARNING: the patch is only build tested.
> >>
> >> I'm holding this patch in mm-unstable because it's unclear (to me) that
> >> this regression has been adequately addressed?
> >
> > Yes. This patch isn't ready to be merged yet. More work is needed to
> > root cause the problem and implement the proper fix. I guess that
> > Kefeng will continue to work on this.
>
> Sorry for the delay, I've been buried in other tasks.
>
> The two bugfix patches below in -next should be merged first; they are
> not related to the performance issue.
>
> 84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
> 9fbd869b19b8 mm: use aligned address in clear_gigantic_page()
>
>
> and, following Ying's suggestion, I am trying to fix the performance
> issue on arm64 (in a new series), but I need to confirm there is no
> regression on other arm64 machines; I will send the new patches once
> the testing is complete.
>
> So Andrew, please pick up the above two bugfix patches for
> v6.13-rcX if there is no objection.
I added cc:stable to both and moved them into mm-hotfixes-unstable,
targeting an upstream merge next week.
"mm: use aligned address in clear_gigantic_page()":
https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com
"mm: use aligned address in copy_user_gigantic_page()":
https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
2024-12-06 1:47 ` Andrew Morton
@ 2024-12-06 2:08 ` Kefeng Wang
0 siblings, 0 replies; 33+ messages in thread
From: Kefeng Wang @ 2024-12-06 2:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang, Ying, David Hildenbrand, Matthew Wilcox, Muchun Song,
linux-mm, Zi Yan
On 2024/12/6 9:47, Andrew Morton wrote:
> On Mon, 2 Dec 2024 09:03:45 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>>>>> WARNING: the patch is only build tested.
>>>>
>>>> I'm holding this patch in mm-unstable because it's unclear (to me) that
>>>> this regression has been adequately addressed?
>>>
>>> Yes. This patch isn't ready to be merged yet. More work is needed to
>>> root cause the problem and implement the proper fix. I guess that
>>> Kefeng will continue to work on this.
>>
>> Sorry for the delay, I've been buried in other tasks.
>>
>> The two bugfix patches below in -next should be merged first; they are
>> not related to the performance issue.
>>
>> 84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
>> 9fbd869b19b8 mm: use aligned address in clear_gigantic_page()
>>
>>
>> and, following Ying's suggestion, I am trying to fix the performance
>> issue on arm64 (in a new series), but I need to confirm there is no
>> regression on other arm64 machines; I will send the new patches once
>> the testing is complete.
>>
>> So Andrew, please pick up the above two bugfix patches for
>> v6.13-rcX if there is no objection.
>
> I added cc:stable to both and moved them into mm-hotfixes-unstable,
> targeting an upstream merge next week.
>
> "mm: use aligned address in clear_gigantic_page()":
> https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com
>
> "mm: use aligned address in copy_user_gigantic_page()":
> https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
>
Nice, thanks a lot.
^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2024-12-06 2:08 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-26 5:43 [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Kefeng Wang
2024-10-26 5:43 ` [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page() Kefeng Wang
2024-10-28 10:01 ` David Hildenbrand
2024-10-28 6:17 ` [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Huang, Ying
2024-10-28 6:35 ` Kefeng Wang
2024-10-28 7:03 ` Huang, Ying
2024-10-28 8:35 ` Kefeng Wang
2024-10-28 10:00 ` David Hildenbrand
2024-10-28 12:52 ` Kefeng Wang
2024-10-28 13:14 ` David Hildenbrand
2024-10-28 13:33 ` Kefeng Wang
2024-10-28 13:46 ` David Hildenbrand
2024-10-28 14:22 ` Kefeng Wang
2024-10-28 14:24 ` David Hildenbrand
2024-10-29 13:04 ` Kefeng Wang
2024-10-29 14:04 ` David Hildenbrand
2024-10-30 1:04 ` Huang, Ying
2024-10-30 3:04 ` Kefeng Wang
2024-10-30 3:21 ` Huang, Ying
2024-10-30 5:05 ` Kefeng Wang
2024-10-31 8:39 ` Huang, Ying
2024-11-01 7:43 ` Kefeng Wang
2024-11-01 8:16 ` Huang, Ying
2024-11-01 9:45 ` Kefeng Wang
2024-11-04 2:35 ` Huang, Ying
2024-11-05 2:06 ` Kefeng Wang
2024-12-01 2:15 ` Andrew Morton
2024-12-01 5:37 ` Huang, Ying
2024-12-02 1:03 ` Kefeng Wang
2024-12-06 1:47 ` Andrew Morton
2024-12-06 2:08 ` Kefeng Wang
2024-11-01 6:18 ` Huang, Ying
2024-11-01 7:51 ` Kefeng Wang