* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 13:47 [Question]: major faults are still triggered after mlockall when numa balancing zhangpeng (AS)
@ 2023-11-09 14:11 ` Peter Zijlstra
2023-11-09 14:29 ` Matthew Wilcox
2023-11-09 17:27 ` Matthew Wilcox
` (4 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2023-11-09 14:11 UTC (permalink / raw)
To: zhangpeng (AS)
Cc: linux-mm, linux-kernel, akpm, Matthew Wilcox, lstoakes, hughd,
david, fengwei.yin, vbabka, mgorman, mingo, riel, ying.huang,
hannes, Nanyong Sun, Kefeng Wang
On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
> Is there any way to avoid such a major fault?
man madvise
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 14:11 ` Peter Zijlstra
@ 2023-11-09 14:29 ` Matthew Wilcox
2023-11-09 15:15 ` Yin, Fengwei
0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-11-09 14:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: zhangpeng (AS),
linux-mm, linux-kernel, akpm, lstoakes, hughd, david,
fengwei.yin, vbabka, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun, Kefeng Wang
On Thu, Nov 09, 2023 at 03:11:41PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
> > Is there any way to avoid such a major fault?
>
> man madvise
but from the mlockall manpage:
mlockall() locks all pages mapped into the address space of the calling
process. This includes the pages of the code, data, and stack segment,
as well as shared libraries, user space kernel data, shared memory, and
memory-mapped files. All mapped pages are guaranteed to be resident in
RAM when the call returns successfully; the pages are guaranteed to
stay in RAM until later unlocked.
https://pubs.opengroup.org/onlinepubs/9699919799/functions/mlockall.html
isn't quite so explicit, but I do think that page cache should be locked
into memory.
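A minimal user-space sketch of the setup under discussion, offered as an
illustration and not as the reporter's actual test program: it calls
mlockall(MCL_CURRENT | MCL_FUTURE) and uses getrusage() to count major
faults around accesses to a global variable. The helper names
(major_faults, lock_all_memory, touch_global_and_count) are made up for
this example, and the sketch does not by itself reproduce the race with
NUMA balancing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* A global in the data segment, i.e. a private file mapping set up by the ELF loader. */
volatile long global_counter = 1;

/* Return the number of major faults this process has taken so far. */
long major_faults(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);

    assert(rc == 0);    /* RUSAGE_SELF with a valid pointer is not expected to fail */
    (void)rc;
    return ru.ru_majflt;
}

/*
 * Lock current and future mappings. Returns 0 on success, -1 on failure
 * (for example when RLIMIT_MEMLOCK is too low and CAP_IPC_LOCK is absent).
 */
int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return -1;
    }
    return 0;
}

/* Write to the global repeatedly and report how many major faults that cost. */
long touch_global_and_count(int iterations)
{
    long before = major_faults();

    for (int i = 0; i < iterations; i++)
        global_counter += i;

    return major_faults() - before;
}
```

If lock_all_memory() succeeds and the guarantee quoted above holds for
every access, touch_global_and_count() would be expected to return zero;
this thread is about a case where the count can still grow.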
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 14:29 ` Matthew Wilcox
@ 2023-11-09 15:15 ` Yin, Fengwei
0 siblings, 0 replies; 23+ messages in thread
From: Yin, Fengwei @ 2023-11-09 15:15 UTC (permalink / raw)
To: Matthew Wilcox, Peter Zijlstra
Cc: zhangpeng (AS),
linux-mm, linux-kernel, akpm, lstoakes, hughd, david, vbabka,
mgorman, mingo, riel, ying.huang, hannes, Nanyong Sun,
Kefeng Wang
On 11/9/2023 10:29 PM, Matthew Wilcox wrote:
> On Thu, Nov 09, 2023 at 03:11:41PM +0100, Peter Zijlstra wrote:
>> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
>>> Is there any way to avoid such a major fault?
>>
>> man madvise
>
> but from the mlockall manpage:
>
> mlockall() locks all pages mapped into the address space of the calling
> process. This includes the pages of the code, data, and stack segment,
> as well as shared libraries, user space kernel data, shared memory, and
> memory-mapped files. All mapped pages are guaranteed to be resident in
> RAM when the call returns successfully; the pages are guaranteed to
> stay in RAM until later unlocked.
>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/mlockall.html
> isn't quite so explicit, but I do think that page cache should be locked
> into memory.
Here is my understanding. It's related to a write to an mlocked private
file mapping. From Peng:
"For the data segment, the global variable area is a private mapping".
So it's the data segment of the ELF file, mapped privately by the ELF loader.
In this case, even if the ELF loader is updated to mlock the data segment, a
write will trigger COW, and a new anonymous page will be allocated and
mlocked. The original file-mapped page will be munlocked in
    do_wp_page()
      wp_page_copy()
        if (old_folio) {
          page_remove_rmap()
        }
So it's possible that the original file-mapped page is reclaimed, and a
later access will trigger a major fault.
Regards
Yin, Fengwei
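The COW behaviour described above can be observed from user space. The
following is a small sketch under stated assumptions, not kernel code: it
writes through a MAP_PRIVATE file mapping and then reads the file back
with pread() to show that the file's own data is untouched, because the
write landed in a private anonymous copy. The function name
private_write_demo and the scratch file path are invented for this
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write 'B' through a private mapping of a file whose first byte is 'A'.
 * Returns the first byte of the file as read back with pread(), or -1 on
 * error. *seen_through_mapping receives the byte as seen via the mapping.
 */
int private_write_demo(char *seen_through_mapping)
{
    char path[] = "/tmp/cow-demo-XXXXXX";  /* scratch file name, an assumption */
    long page = sysconf(_SC_PAGESIZE);
    char on_disk = 0;
    char *map;
    int result = -1;
    int fd;

    assert(seen_through_mapping != NULL);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                          /* the file goes away once fd is closed */

    if (ftruncate(fd, page) != 0 || pwrite(fd, "A", 1, 0) != 1)
        goto out_close;

    map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    map[0] = 'B';                          /* write fault: COW into an anonymous page */
    *seen_through_mapping = map[0];

    if (pread(fd, &on_disk, 1, 0) == 1)
        result = on_disk;                  /* the file's own data is unchanged */

    munmap(map, (size_t)page);
out_close:
    close(fd);
    return result;
}
```

After the write, the mapping and the file disagree: the mapping shows 'B'
while the file still holds 'A'. That is the sense in which the PTE no
longer refers to the original file page.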
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 13:47 [Question]: major faults are still triggered after mlockall when numa balancing zhangpeng (AS)
2023-11-09 14:11 ` Peter Zijlstra
@ 2023-11-09 17:27 ` Matthew Wilcox
2023-11-10 5:32 ` Huang, Ying
2023-11-10 9:39 ` zhangpeng (AS)
2023-11-09 22:54 ` Yang Shi
` (3 subsequent siblings)
5 siblings, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2023-11-09 17:27 UTC (permalink / raw)
To: zhangpeng (AS)
Cc: linux-mm, linux-kernel, akpm, lstoakes, hughd, david,
fengwei.yin, vbabka, peterz, mgorman, mingo, riel, ying.huang,
hannes, Nanyong Sun, Kefeng Wang
On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
> ptep_modify_prot_start() will clear the vmf->pte, until
> ptep_modify_prot_commit() assign a value to the vmf->pte.
[...]
> Our problem scenario is as follows:
>
>         task 1                      task 2
>         ------                      ------
> /* scan global variables */
> do_numa_page()
>   spin_lock(vmf->ptl)
>   ptep_modify_prot_start()
>   /* set vmf->pte as null */
>                             /* Access global variables */
>                             handle_pte_fault()
>                               /* no pte lock */
>                               do_pte_missing()
>                                 do_fault()
>                                   do_read_fault()
>   ptep_modify_prot_commit()
>   /* ptep update done */
>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>                                     do_fault_around()
>                                     __do_fault()
>                                       filemap_fault()
>                                         /* page cache is not available
>                                            and a major fault is triggered */
>                                         do_sync_mmap_readahead()
>                                         /* page_not_uptodate and goto
>                                            out_retry. */
>
> Is there any way to avoid such a major fault?
Yes, this looks like a bug.
It seems to me that the easiest way to fix this is not to zero the pte
but to make it protnone? That would send task 2 into do_numa_page()
where it would take the ptl, then check pte_same(), see that it's
changed and goto out, which will end up retrying the fault.
I'm not particularly expert at page table manipulation, so I'll let
somebody who is propose an actual patch. Or you could try to do it?
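The retry logic suggested above can be illustrated with a toy model. This
is only a sketch: toy_pte_t, the bit layout and the function names are
invented for the example and do not correspond to any real architecture
or kernel API. It shows the two ideas in the proposal: an entry marked
"protnone" is dispatched to the NUMA path and not to the missing-page
path, and the NUMA path backs out when the value it sampled no longer
matches what it finds under the lock (the role pte_same() plays).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_pte_t;          /* hypothetical stand-in for pte_t */

#define TOY_PRESENT   (1ull << 0)    /* invented bit: normal, accessible mapping */
#define TOY_PROTNONE  (1ull << 1)    /* invented bit: mapped, but faults on access */

enum toy_path { TOY_PATH_MISSING, TOY_PATH_NUMA, TOY_PATH_NONE };
enum toy_result { TOY_HANDLED, TOY_RETRY };

/* Which handler would a fault take for this entry value? */
enum toy_path toy_dispatch(toy_pte_t entry)
{
    if (entry == 0)
        return TOY_PATH_MISSING;     /* file fault path, may have to read the file */
    if (entry & TOY_PROTNONE)
        return TOY_PATH_NUMA;        /* NUMA hinting path, takes the lock */
    return TOY_PATH_NONE;            /* entry is usable, nothing to do */
}

/*
 * NUMA path: 'snapshot' was read without the lock, 'locked' is the value
 * found after taking it. If they differ, another task updated the entry
 * in between, so back out and let the access be retried.
 */
enum toy_result toy_numa_fault(toy_pte_t snapshot, toy_pte_t locked)
{
    assert(snapshot & TOY_PROTNONE); /* we only get here for protnone entries */

    if (snapshot != locked)          /* models the pte_same() check */
        return TOY_RETRY;
    return TOY_HANDLED;
}
```

In this model a transiently zeroed entry is the only value that reaches
TOY_PATH_MISSING, which is why keeping the entry protnone during the
update would steer a concurrent access away from the file fault path.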
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 17:27 ` Matthew Wilcox
@ 2023-11-10 5:32 ` Huang, Ying
2023-11-10 9:04 ` Yin, Fengwei
2023-11-10 9:39 ` zhangpeng (AS)
1 sibling, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2023-11-10 5:32 UTC (permalink / raw)
To: Matthew Wilcox
Cc: zhangpeng (AS),
linux-mm, linux-kernel, akpm, lstoakes, hughd, david,
fengwei.yin, vbabka, peterz, mgorman, mingo, riel, hannes,
Nanyong Sun, Kefeng Wang
Matthew Wilcox <willy@infradead.org> writes:
> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>> ptep_modify_prot_start() will clear the vmf->pte, until
>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>
> [...]
>
>> Our problem scenario is as follows:
>>
>>         task 1                      task 2
>>         ------                      ------
>> /* scan global variables */
>> do_numa_page()
>>   spin_lock(vmf->ptl)
>>   ptep_modify_prot_start()
>>   /* set vmf->pte as null */
>>                             /* Access global variables */
>>                             handle_pte_fault()
>>                               /* no pte lock */
>>                               do_pte_missing()
>>                                 do_fault()
>>                                   do_read_fault()
>>   ptep_modify_prot_commit()
>>   /* ptep update done */
>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>                                     do_fault_around()
>>                                     __do_fault()
>>                                       filemap_fault()
>>                                         /* page cache is not available
>>                                            and a major fault is triggered */
>>                                         do_sync_mmap_readahead()
>>                                         /* page_not_uptodate and goto
>>                                            out_retry. */
>>
>> Is there any way to avoid such a major fault?
>
> Yes, this looks like a bug.
>
> It seems to me that the easiest way to fix this is not to zero the pte
> but to make it protnone? That would send task 2 into do_numa_page()
> where it would take the ptl, then check pte_same(), see that it's
> changed and goto out, which will end up retrying the fault.
There are other places in the kernel where the PTE is cleared, for
example, move_ptes() in mremap.c. IIUC, we need to audit all of them.
Another possible solution is to check the PTE again with the PTL held
before reading in file data. This will increase the overhead of the major
fault path. Is it acceptable?
> I'm not particularly expert at page table manipulation, so I'll let
> somebody who is propose an actual patch. Or you could try to do it?
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-10 5:32 ` Huang, Ying
@ 2023-11-10 9:04 ` Yin, Fengwei
2023-11-13 2:02 ` Huang, Ying
0 siblings, 1 reply; 23+ messages in thread
From: Yin, Fengwei @ 2023-11-10 9:04 UTC (permalink / raw)
To: Huang, Ying, Matthew Wilcox
Cc: zhangpeng (AS),
linux-mm, linux-kernel, akpm, lstoakes, hughd, david, vbabka,
peterz, mgorman, mingo, riel, hannes, Nanyong Sun, Kefeng Wang
On 11/10/2023 1:32 PM, Huang, Ying wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>
>> [...]
>>
>>> Our problem scenario is as follows:
>>>
>>>         task 1                      task 2
>>>         ------                      ------
>>> /* scan global variables */
>>> do_numa_page()
>>>   spin_lock(vmf->ptl)
>>>   ptep_modify_prot_start()
>>>   /* set vmf->pte as null */
>>>                             /* Access global variables */
>>>                             handle_pte_fault()
>>>                               /* no pte lock */
>>>                               do_pte_missing()
>>>                                 do_fault()
>>>                                   do_read_fault()
>>>   ptep_modify_prot_commit()
>>>   /* ptep update done */
>>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>                                     do_fault_around()
>>>                                     __do_fault()
>>>                                       filemap_fault()
>>>                                         /* page cache is not available
>>>                                            and a major fault is triggered */
>>>                                         do_sync_mmap_readahead()
>>>                                         /* page_not_uptodate and goto
>>>                                            out_retry. */
>>>
>>> Is there any way to avoid such a major fault?
>>
>> Yes, this looks like a bug.
>>
>> It seems to me that the easiest way to fix this is not to zero the pte
>> but to make it protnone? That would send task 2 into do_numa_page()
>> where it would take the ptl, then check pte_same(), see that it's
>> changed and goto out, which will end up retrying the fault.
>
> There are other places in the kernel where the PTE is cleared, for
> example, move_ptes() in mremap.c. IIUC, we need to audit all them.
>
> Another possible solution is to check PTE again with PTL held before
> reading in file data. This will increase the overhead of major fault
> path. Is it acceptable?
What if we check the PTE without acquiring the page table lock?
Regards
Yin, Fengwei
>
>> I'm not particularly expert at page table manipulation, so I'll let
>> somebody who is propose an actual patch. Or you could try to do it?
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-10 9:04 ` Yin, Fengwei
@ 2023-11-13 2:02 ` Huang, Ying
2023-11-14 11:23 ` Yin, Fengwei
0 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2023-11-13 2:02 UTC (permalink / raw)
To: Yin, Fengwei
Cc: Matthew Wilcox, zhangpeng (AS),
linux-mm, linux-kernel, akpm, lstoakes, hughd, david, vbabka,
peterz, mgorman, mingo, riel, hannes, Nanyong Sun, Kefeng Wang
"Yin, Fengwei" <fengwei.yin@intel.com> writes:
> On 11/10/2023 1:32 PM, Huang, Ying wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>>> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
>>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>>
>>> [...]
>>>
>>>> Our problem scenario is as follows:
>>>>
>>>>         task 1                      task 2
>>>>         ------                      ------
>>>> /* scan global variables */
>>>> do_numa_page()
>>>>   spin_lock(vmf->ptl)
>>>>   ptep_modify_prot_start()
>>>>   /* set vmf->pte as null */
>>>>                             /* Access global variables */
>>>>                             handle_pte_fault()
>>>>                               /* no pte lock */
>>>>                               do_pte_missing()
>>>>                                 do_fault()
>>>>                                   do_read_fault()
>>>>   ptep_modify_prot_commit()
>>>>   /* ptep update done */
>>>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>>                                     do_fault_around()
>>>>                                     __do_fault()
>>>>                                       filemap_fault()
>>>>                                         /* page cache is not available
>>>>                                            and a major fault is triggered */
>>>>                                         do_sync_mmap_readahead()
>>>>                                         /* page_not_uptodate and goto
>>>>                                            out_retry. */
>>>>
>>>> Is there any way to avoid such a major fault?
>>>
>>> Yes, this looks like a bug.
>>>
>>> It seems to me that the easiest way to fix this is not to zero the pte
>>> but to make it protnone? That would send task 2 into do_numa_page()
>>> where it would take the ptl, then check pte_same(), see that it's
>>> changed and goto out, which will end up retrying the fault.
>>
>> There are other places in the kernel where the PTE is cleared, for
>> example, move_ptes() in mremap.c. IIUC, we need to audit all them.
>>
>> Another possible solution is to check PTE again with PTL held before
>> reading in file data. This will increase the overhead of major fault
>> path. Is it acceptable?
> What if we check the PTE without page table lock acquired?
The PTE is zeroed temporarily only with PTL held. So, if we acquire the
PTL in filemap_fault() and check the PTE, the PTE which is zeroed in
do_numa_page() will be non-zero now. So we can avoid the major fault.
But, if we don't acquire the PTL, the PTE may still be zero.
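The difference between the locked and the lockless check can be shown
with a single-threaded toy model that plays the interleaving step by
step. Everything here is an assumption made for illustration: struct
toy_slot, its fields and the toy_* functions are invented, and a boolean
stands in for the page table lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_slot {
    uint64_t pte;       /* hypothetical page table entry value */
    uint64_t saved;     /* old value kept by the updater while it holds the lock */
    bool     ptl_held;  /* models the page table lock */
};

/* Updater, step 1: take the lock and transiently clear the entry. */
void toy_update_start(struct toy_slot *s)
{
    assert(!s->ptl_held);
    s->ptl_held = true;
    s->saved = s->pte;
    s->pte = 0;
}

/* Updater, step 2: write the final value back and drop the lock. */
void toy_update_commit(struct toy_slot *s)
{
    assert(s->ptl_held);
    s->pte = s->saved;
    s->ptl_held = false;
}

/* Lockless peek: may observe the transient zero. */
bool toy_is_none_unlocked(const struct toy_slot *s)
{
    return s->pte == 0;
}

/*
 * Locked check. A real spin_lock() would wait for the updater, so the
 * model reports "would block" by returning false and leaving *is_none
 * untouched while the lock is held. Once the lock is free, it stores the
 * answer in *is_none and returns true.
 */
bool toy_is_none_locked(const struct toy_slot *s, bool *is_none)
{
    if (s->ptl_held)
        return false;
    *is_none = (s->pte == 0);
    return true;
}
```

Between toy_update_start() and toy_update_commit() the lockless peek
reports an empty entry, while the locked check cannot complete until the
commit has restored a non-zero value.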
--
Best Regards,
Huang, Ying
> Regards
> Yin, Fengwei
>
>>
>>> I'm not particularly expert at page table manipulation, so I'll let
>>> somebody who is propose an actual patch. Or you could try to do it?
>>
>> --
>> Best Regards,
>> Huang, Ying
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-13 2:02 ` Huang, Ying
@ 2023-11-14 11:23 ` Yin, Fengwei
2023-11-15 1:46 ` Huang, Ying
0 siblings, 1 reply; 23+ messages in thread
From: Yin, Fengwei @ 2023-11-14 11:23 UTC (permalink / raw)
To: Huang, Ying
Cc: Matthew Wilcox, zhangpeng (AS),
linux-mm, linux-kernel, akpm, lstoakes, hughd, david, vbabka,
peterz, mgorman, mingo, riel, hannes, Nanyong Sun, Kefeng Wang
On 11/13/2023 10:02 AM, Huang, Ying wrote:
>>> There are other places in the kernel where the PTE is cleared, for
>>> example, move_ptes() in mremap.c. IIUC, we need to audit all them.
>>>
>>> Another possible solution is to check PTE again with PTL held before
>>> reading in file data. This will increase the overhead of major fault
>>> path. Is it acceptable?
>> What if we check the PTE without page table lock acquired?
> The PTE is zeroed temporarily only with PTL held. So, if we acquire the
> PTL in filemap_fault() and check the PTE, the PTE which is zeroed in
> do_numa_page() will be non-zero now. So we can avoid the major fault.
Yes.
>
> But, if we don't acquire the PTL, the PTE may still be zero.
do_numa_page()/change_pte_range() do very little while the PTE is
cleared. Considering the code path of do_read_fault(), it's likely that
the PTE is non-zero.
My concern with acquiring the lock is that it adds an extra PTL
acquire/release to the other, more common cases.
Regards
Yin, Fengwei
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-14 11:23 ` Yin, Fengwei
@ 2023-11-15 1:46 ` Huang, Ying
0 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2023-11-15 1:46 UTC (permalink / raw)
To: Yin, Fengwei
Cc: Matthew Wilcox, zhangpeng (AS),
linux-mm, linux-kernel, akpm, lstoakes, hughd, david, vbabka,
peterz, mgorman, mingo, riel, hannes, Nanyong Sun, Kefeng Wang
"Yin, Fengwei" <fengwei.yin@intel.com> writes:
> On 11/13/2023 10:02 AM, Huang, Ying wrote:
>>>> There are other places in the kernel where the PTE is cleared, for
>>>> example, move_ptes() in mremap.c. IIUC, we need to audit all them.
>>>>
>>>> Another possible solution is to check PTE again with PTL held before
>>>> reading in file data. This will increase the overhead of major fault
>>>> path. Is it acceptable?
>>> What if we check the PTE without page table lock acquired?
>> The PTE is zeroed temporarily only with PTL held. So, if we acquire the
>> PTL in filemap_fault() and check the PTE, the PTE which is zeroed in
>> do_numa_page() will be non-zero now. So we can avoid the major fault.
> Yes.
>
>>
>> But, if we don't acquire the PTL, the PTE may still be zero.
> For do_numa_page()/change_pte_range(), it does very limit thing during
> PTE is cleared. Considering the code path of do_read_fault(), it's likely
> the PTE is none-zero.
It's possible per my understanding, although it doesn't feel good to
depend on some "race" condition.
> My concern to acquiring lock is that it brings extra PTL lock acquire/release
> for other more common cases.
Yes. Acquiring the PTL will bring some overhead.
Anyway, some performance testing is needed to compare the solutions.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 17:27 ` Matthew Wilcox
2023-11-10 5:32 ` Huang, Ying
@ 2023-11-10 9:39 ` zhangpeng (AS)
1 sibling, 0 replies; 23+ messages in thread
From: zhangpeng (AS) @ 2023-11-10 9:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, linux-kernel, akpm, lstoakes, hughd, david,
fengwei.yin, vbabka, peterz, mgorman, mingo, riel, ying.huang,
hannes, Nanyong Sun, Kefeng Wang
On 2023/11/10 1:27, Matthew Wilcox wrote:
> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>> ptep_modify_prot_start() will clear the vmf->pte, until
>> ptep_modify_prot_commit() assign a value to the vmf->pte.
> [...]
>
>> Our problem scenario is as follows:
>>
>>         task 1                      task 2
>>         ------                      ------
>> /* scan global variables */
>> do_numa_page()
>>   spin_lock(vmf->ptl)
>>   ptep_modify_prot_start()
>>   /* set vmf->pte as null */
>>                             /* Access global variables */
>>                             handle_pte_fault()
>>                               /* no pte lock */
>>                               do_pte_missing()
>>                                 do_fault()
>>                                   do_read_fault()
>>   ptep_modify_prot_commit()
>>   /* ptep update done */
>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>                                     do_fault_around()
>>                                     __do_fault()
>>                                       filemap_fault()
>>                                         /* page cache is not available
>>                                            and a major fault is triggered */
>>                                         do_sync_mmap_readahead()
>>                                         /* page_not_uptodate and goto
>>                                            out_retry. */
>>
>> Is there any way to avoid such a major fault?
> Yes, this looks like a bug.
>
> It seems to me that the easiest way to fix this is not to zero the pte
> but to make it protnone? That would send task 2 into do_numa_page()
> where it would take the ptl, then check pte_same(), see that it's
> changed and goto out, which will end up retrying the fault.
>
> I'm not particularly expert at page table manipulation, so I'll let
> somebody who is propose an actual patch. Or you could try to do it?
Thank you for your reply.
Sorry, I'm not particularly good at page table related manipulation
either. It would be great if somebody who is better at this part could
help solve it.
--
Best Regards,
Peng
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 13:47 [Question]: major faults are still triggered after mlockall when numa balancing zhangpeng (AS)
2023-11-09 14:11 ` Peter Zijlstra
2023-11-09 17:27 ` Matthew Wilcox
@ 2023-11-09 22:54 ` Yang Shi
2023-11-10 1:57 ` Yin, Fengwei
2023-11-09 23:21 ` Matthew Wilcox
` (2 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Yang Shi @ 2023-11-09 22:54 UTC (permalink / raw)
To: zhangpeng (AS)
Cc: linux-mm, linux-kernel, akpm, Matthew Wilcox, lstoakes, hughd,
david, fengwei.yin, vbabka, peterz, mgorman, mingo, riel,
ying.huang, hannes, Nanyong Sun, Kefeng Wang
On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
>
> Hi everyone,
>
> There is a performance issue that has been bothering us recently.
> This problem can reproduce in the latest mainline version (Linux 6.6).
>
> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
> to avoid performance problems caused by major fault.
>
> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
> ptep_modify_prot_start() will clear the vmf->pte, until
> ptep_modify_prot_commit() assign a value to the vmf->pte.
>
> For the data segment of the user-mode program, the global variable area
> is a private mapping. After the pagecache is loaded, the private
> anonymous page is generated after the COW is triggered. Mlockall can
> lock COW pages (anonymous pages), but the original file pages cannot
> be locked and may be reclaimed. If the global variable (private anon page)
> is accessed when vmf->pte is zero which is concurrently set by numa fault,
> a file page fault will be triggered.
>
> At this time, the original private file page may have been reclaimed.
> If the page cache is not available at this time, a major fault will be
> triggered and the file will be read, causing additional overhead.
>
> Our problem scenario is as follows:
>
>         task 1                      task 2
>         ------                      ------
> /* scan global variables */
> do_numa_page()
>   spin_lock(vmf->ptl)
>   ptep_modify_prot_start()
>   /* set vmf->pte as null */
>                             /* Access global variables */
>                             handle_pte_fault()
>                               /* no pte lock */
>                               do_pte_missing()
>                                 do_fault()
>                                   do_read_fault()
>   ptep_modify_prot_commit()
>   /* ptep update done */
>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>                                     do_fault_around()
>                                     __do_fault()
>                                       filemap_fault()
>                                         /* page cache is not available
>                                            and a major fault is triggered */
>                                         do_sync_mmap_readahead()
>                                         /* page_not_uptodate and goto
>                                            out_retry. */
>
> Is there any way to avoid such a major fault?
IMHO I don't think it is a bug. The man page quoted by Willy says "All
mapped pages are guaranteed to be resident in RAM when the call
returns successfully", but the later COW already made the file page
unmapped, right? The PTE pointed to the COW'ed anon page.
Hypothetically, if we kept the file page mlocked and unmapped,
munlock() would not have munlocked the file page at all; it would be
mlocked in memory forever.
>
> --
> Best Regards,
> Peng
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 22:54 ` Yang Shi
@ 2023-11-10 1:57 ` Yin, Fengwei
2023-11-10 3:39 ` Kefeng Wang
2023-11-14 1:41 ` Yang Shi
0 siblings, 2 replies; 23+ messages in thread
From: Yin, Fengwei @ 2023-11-10 1:57 UTC (permalink / raw)
To: Yang Shi, zhangpeng (AS)
Cc: linux-mm, linux-kernel, akpm, Matthew Wilcox, lstoakes, hughd,
david, vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun, Kefeng Wang
On 11/10/2023 6:54 AM, Yang Shi wrote:
> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
>>
>> Hi everyone,
>>
>> There is a performance issue that has been bothering us recently.
>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>
>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>> to avoid performance problems caused by major fault.
>>
>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>> ptep_modify_prot_start() will clear the vmf->pte, until
>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>
>> For the data segment of the user-mode program, the global variable area
>> is a private mapping. After the pagecache is loaded, the private
>> anonymous page is generated after the COW is triggered. Mlockall can
>> lock COW pages (anonymous pages), but the original file pages cannot
>> be locked and may be reclaimed. If the global variable (private anon page)
>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>> a file page fault will be triggered.
>>
>> At this time, the original private file page may have been reclaimed.
>> If the page cache is not available at this time, a major fault will be
>> triggered and the file will be read, causing additional overhead.
>>
>> Our problem scenario is as follows:
>>
>>         task 1                      task 2
>>         ------                      ------
>> /* scan global variables */
>> do_numa_page()
>>   spin_lock(vmf->ptl)
>>   ptep_modify_prot_start()
>>   /* set vmf->pte as null */
>>                             /* Access global variables */
>>                             handle_pte_fault()
>>                               /* no pte lock */
>>                               do_pte_missing()
>>                                 do_fault()
>>                                   do_read_fault()
>>   ptep_modify_prot_commit()
>>   /* ptep update done */
>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>                                     do_fault_around()
>>                                     __do_fault()
>>                                       filemap_fault()
>>                                         /* page cache is not available
>>                                            and a major fault is triggered */
>>                                         do_sync_mmap_readahead()
>>                                         /* page_not_uptodate and goto
>>                                            out_retry. */
>>
>> Is there any way to avoid such a major fault?
>
> IMHO I don't think it is a bug. The man page quoted by Willy says "All
> mapped pages are guaranteed to be resident in RAM when the call
> returns successfully", but the later COW already made the file page
> unmapped, right? The PTE pointed to the COW'ed anon page.
> Hypothetically if we kept the file page mlocked and unmapped,
> munlock() would have not munlocked the file page at all, it would be
> mlocked in memory forever.
But in this case, even the COW page is mlocked. There is a small window
in which the PTE is set to null in do_numa_page(). A data segment access
(to the COW page, which has nothing to do with the original page cache)
that happens in this small window will trigger filemap_fault() to fault
in the original page cache.
I had thought of double-checking whether vmf->pte is NULL in
do_read_fault(), but that's not reliable enough.
Matthew's idea to use protnone to block both hardware access and
do_pte_missing() looks more promising to me.
Regards
Yin, Fengwei
>
>>
>> --
>> Best Regards,
>> Peng
>>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-10 1:57 ` Yin, Fengwei
@ 2023-11-10 3:39 ` Kefeng Wang
2023-11-10 3:50 ` Yin, Fengwei
2023-11-14 1:41 ` Yang Shi
1 sibling, 1 reply; 23+ messages in thread
From: Kefeng Wang @ 2023-11-10 3:39 UTC (permalink / raw)
To: Yin, Fengwei, Yang Shi, zhangpeng (AS), Aneesh Kumar K.V
Cc: linux-mm, linux-kernel, akpm, Matthew Wilcox, lstoakes, hughd,
david, vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun
On 2023/11/10 9:57, Yin, Fengwei wrote:
>
>
> On 11/10/2023 6:54 AM, Yang Shi wrote:
>> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> There is a performance issue that has been bothering us recently.
>>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>>
>>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>>> to avoid performance problems caused by major fault.
>>>
>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>>
>>> For the data segment of the user-mode program, the global variable area
>>> is a private mapping. After the pagecache is loaded, the private
>>> anonymous page is generated after the COW is triggered. Mlockall can
>>> lock COW pages (anonymous pages), but the original file pages cannot
>>> be locked and may be reclaimed. If the global variable (private anon page)
>>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>>> a file page fault will be triggered.
>>>
>>> At this time, the original private file page may have been reclaimed.
>>> If the page cache is not available at this time, a major fault will be
>>> triggered and the file will be read, causing additional overhead.
>>>
>>> Our problem scenario is as follows:
>>>
>>>         task 1                      task 2
>>>         ------                      ------
>>> /* scan global variables */
>>> do_numa_page()
>>>   spin_lock(vmf->ptl)
>>>   ptep_modify_prot_start()
>>>   /* set vmf->pte as null */
>>>                             /* Access global variables */
>>>                             handle_pte_fault()
>>>                               /* no pte lock */
>>>                               do_pte_missing()
>>>                                 do_fault()
>>>                                   do_read_fault()
>>>   ptep_modify_prot_commit()
>>>   /* ptep update done */
>>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>                                     do_fault_around()
>>>                                     __do_fault()
>>>                                       filemap_fault()
>>>                                         /* page cache is not available
>>>                                            and a major fault is triggered */
>>>                                         do_sync_mmap_readahead()
>>>                                         /* page_not_uptodate and goto
>>>                                            out_retry. */
>>>
>>> Is there any way to avoid such a major fault?
>>
>> IMHO I don't think it is a bug. The man page quoted by Willy says "All
>> mapped pages are guaranteed to be resident in RAM when the call
>> returns successfully", but the later COW already made the file page
>> unmapped, right? The PTE pointed to the COW'ed anon page.
>> Hypothetically if we kept the file page mlocked and unmapped,
>> munlock() would have not munlocked the file page at all, it would be
>> mlocked in memory forever.
> But in this case, even the COW page is mlocked. There is small window
> that PTE is set to null in do_numa_page(). data segment access (it's to
> COW page which has nothing to do with original page cache) happens in
> this small window will trigger filemap_fault() to fault in original
> page cache.
>
> I had thought to do double check whether vmf->pte is NULL in do_read_fault().
> But it's not reliable enough.
>
> Matthew's idea to use protnone to block both hardware accessing and
> do_pte_missing() looks more promising to me.
Actually, we could revert the following patch to avoid this issue,
but it is a workaround from ppc...
commit cee216a696b2004017a5ecb583366093d90b1568
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Fri Feb 24 14:59:13 2017 -0800
    mm/autonuma: don't use set_pte_at when updating protnone ptes
    Architectures like ppc64, use privilege access bit to mark pte non
    accessible. This implies that kernel can do a copy_to_user to an
    address marked for numa fault. This also implies that there can be a
    parallel hardware update for the pte. set_pte_at cannot be used in
    such scenarios. Hence switch the pte update to use ptep_get_and_clear
    and set_pte_at combination.
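The "get and clear, then set" combination described in that commit
message can be pictured with C11 atomics. This is a user-space toy and
not the kernel implementation: the bit names and the toy_* functions are
invented, and atomic_exchange() merely stands in for the idea of
fetching the old entry and leaving zero behind in one step. The point of
the sketch is that a bit set concurrently before the exchange is
captured in the old value and carried into the new one, at the price of
a short interval in which the entry reads as zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_PROTNONE  (1ull << 1)   /* invented bit: entry faults on access */
#define TOY_DIRTY     (1ull << 2)   /* invented bit: set by "hardware" on write */

/* Models the get-and-clear step: fetch the entry and leave zero behind. */
uint64_t toy_get_and_clear(_Atomic uint64_t *pte)
{
    return atomic_exchange(pte, 0);
}

/* Models the set step: install the final value. */
void toy_set(_Atomic uint64_t *pte, uint64_t val)
{
    atomic_store(pte, val);
}

/* Drop PROTNONE while keeping every other bit present at clear time. */
uint64_t toy_clear_protnone(_Atomic uint64_t *pte)
{
    uint64_t old = toy_get_and_clear(pte);   /* entry reads as zero from here ... */
    uint64_t updated;

    assert(old != 0);                        /* caller passes a populated entry */
    updated = old & ~TOY_PROTNONE;
    toy_set(pte, updated);                   /* ... until here */
    return updated;
}
```

The interval marked in toy_clear_protnone() corresponds to the window
discussed in this thread, during which a lockless reader can mistake the
entry for an empty one.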
>
>
> Regards
> Yin, Fengwei
>
>>
>>>
>>> --
>>> Best Regards,
>>> Peng
>>>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-10 3:39 ` Kefeng Wang
@ 2023-11-10 3:50 ` Yin, Fengwei
2023-11-10 4:00 ` Aneesh Kumar K V
0 siblings, 1 reply; 23+ messages in thread
From: Yin, Fengwei @ 2023-11-10 3:50 UTC (permalink / raw)
To: Kefeng Wang, Yang Shi, zhangpeng (AS), Aneesh Kumar K.V
Cc: linux-mm, linux-kernel, akpm, Matthew Wilcox, lstoakes, hughd,
david, vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun
On 11/10/2023 11:39 AM, Kefeng Wang wrote:
>
>
> On 2023/11/10 9:57, Yin, Fengwei wrote:
>>
>>
>> On 11/10/2023 6:54 AM, Yang Shi wrote:
>>> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> There is a performance issue that has been bothering us recently.
>>>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>>>
>>>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>>>> to avoid performance problems caused by major fault.
>>>>
>>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>>>
>>>> For the data segment of the user-mode program, the global variable area
>>>> is a private mapping. After the pagecache is loaded, the private
>>>> anonymous page is generated after the COW is triggered. Mlockall can
>>>> lock COW pages (anonymous pages), but the original file pages cannot
>>>> be locked and may be reclaimed. If the global variable (private anon page)
>>>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>>>> a file page fault will be triggered.
>>>>
>>>> At this time, the original private file page may have been reclaimed.
>>>> If the page cache is not available at this time, a major fault will be
>>>> triggered and the file will be read, causing additional overhead.
>>>>
>>>> Our problem scenario is as follows:
>>>>
>>>> task 1 task 2
>>>> ------ ------
>>>> /* scan global variables */
>>>> do_numa_page()
>>>> spin_lock(vmf->ptl)
>>>> ptep_modify_prot_start()
>>>> /* set vmf->pte as null */
>>>> /* Access global variables */
>>>> handle_pte_fault()
>>>> /* no pte lock */
>>>> do_pte_missing()
>>>> do_fault()
>>>> do_read_fault()
>>>> ptep_modify_prot_commit()
>>>> /* ptep update done */
>>>> pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>> do_fault_around()
>>>> __do_fault()
>>>> filemap_fault()
>>>> /* page cache is not available
>>>> and a major fault is triggered */
>>>> do_sync_mmap_readahead()
>>>> /* page_not_uptodate and goto
>>>> out_retry. */
>>>>
>>>> Is there any way to avoid such a major fault?
>>>
>>> IMHO I don't think it is a bug. The man page quoted by Willy says "All
>>> mapped pages are guaranteed to be resident in RAM when the call
>>> returns successfully", but the later COW already made the file page
>>> unmapped, right? The PTE pointed to the COW'ed anon page.
>>> Hypothetically if we kept the file page mlocked and unmapped,
>>> munlock() would have not munlocked the file page at all, it would be
>>> mlocked in memory forever.
>> But in this case, even the COW page is mlocked. There is small window
>> that PTE is set to null in do_numa_page(). data segment access (it's to
>> COW page which has nothing to do with original page cache) happens in
>> this small window will trigger filemap_fault() to fault in original
>> page cache.
>>
>> I had thought to do double check whether vmf->pte is NULL in do_read_fault().
>> But it's not reliable enough.
>>
>> Matthew's idea to use protnone to block both hardware accessing and
>> do_pte_missing() looks more promising to me.
>
> Actual, we could revert the following patch to avoid this issue,
> but this workaroud from ppc...
>
> commit cee216a696b2004017a5ecb583366093d90b1568
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date: Fri Feb 24 14:59:13 2017 -0800
>
> mm/autonuma: don't use set_pte_at when updating protnone ptes
>
> Architectures like ppc64, use privilege access bit to mark pte non
> accessible. This implies that kernel can do a copy_to_user to an
> address marked for numa fault. This also implies that there can be a
> parallel hardware update for the pte. set_pte_at cannot be used in such
> scenarios. Hence switch the pte update to use ptep_get_and_clear and
> set_pte_at combination.
Oh. This means the protnone approach doesn't work for PPC.
>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Peng
>>>>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-10 3:50 ` Yin, Fengwei
@ 2023-11-10 4:00 ` Aneesh Kumar K V
0 siblings, 0 replies; 23+ messages in thread
From: Aneesh Kumar K V @ 2023-11-10 4:00 UTC (permalink / raw)
To: Yin, Fengwei, Kefeng Wang, Yang Shi, zhangpeng (AS)
Cc: linux-mm, linux-kernel, akpm, Matthew Wilcox, lstoakes, hughd,
david, vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun
On 11/10/23 9:20 AM, Yin, Fengwei wrote:
>
>
> On 11/10/2023 11:39 AM, Kefeng Wang wrote:
>>
>>
>> On 2023/11/10 9:57, Yin, Fengwei wrote:
>>>
>>>
>>> On 11/10/2023 6:54 AM, Yang Shi wrote:
>>>> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> There is a performance issue that has been bothering us recently.
>>>>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>>>>
>>>>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>>>>> to avoid performance problems caused by major fault.
>>>>>
>>>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>>>>
>>>>> For the data segment of the user-mode program, the global variable area
>>>>> is a private mapping. After the pagecache is loaded, the private
>>>>> anonymous page is generated after the COW is triggered. Mlockall can
>>>>> lock COW pages (anonymous pages), but the original file pages cannot
>>>>> be locked and may be reclaimed. If the global variable (private anon page)
>>>>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>>>>> a file page fault will be triggered.
>>>>>
>>>>> At this time, the original private file page may have been reclaimed.
>>>>> If the page cache is not available at this time, a major fault will be
>>>>> triggered and the file will be read, causing additional overhead.
>>>>>
>>>>> Our problem scenario is as follows:
>>>>>
>>>>> task 1 task 2
>>>>> ------ ------
>>>>> /* scan global variables */
>>>>> do_numa_page()
>>>>> spin_lock(vmf->ptl)
>>>>> ptep_modify_prot_start()
>>>>> /* set vmf->pte as null */
>>>>> /* Access global variables */
>>>>> handle_pte_fault()
>>>>> /* no pte lock */
>>>>> do_pte_missing()
>>>>> do_fault()
>>>>> do_read_fault()
>>>>> ptep_modify_prot_commit()
>>>>> /* ptep update done */
>>>>> pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>>> do_fault_around()
>>>>> __do_fault()
>>>>> filemap_fault()
>>>>> /* page cache is not available
>>>>> and a major fault is triggered */
>>>>> do_sync_mmap_readahead()
>>>>> /* page_not_uptodate and goto
>>>>> out_retry. */
>>>>>
>>>>> Is there any way to avoid such a major fault?
>>>>
>>>> IMHO I don't think it is a bug. The man page quoted by Willy says "All
>>>> mapped pages are guaranteed to be resident in RAM when the call
>>>> returns successfully", but the later COW already made the file page
>>>> unmapped, right? The PTE pointed to the COW'ed anon page.
>>>> Hypothetically if we kept the file page mlocked and unmapped,
>>>> munlock() would have not munlocked the file page at all, it would be
>>>> mlocked in memory forever.
>>> But in this case, even the COW page is mlocked. There is small window
>>> that PTE is set to null in do_numa_page(). data segment access (it's to
>>> COW page which has nothing to do with original page cache) happens in
>>> this small window will trigger filemap_fault() to fault in original
>>> page cache.
>>>
>>> I had thought to do double check whether vmf->pte is NULL in do_read_fault().
>>> But it's not reliable enough.
>>>
>>> Matthew's idea to use protnone to block both hardware accessing and
>>> do_pte_missing() looks more promising to me.
>>
>> Actual, we could revert the following patch to avoid this issue,
>> but this workaroud from ppc...
>>
>> commit cee216a696b2004017a5ecb583366093d90b1568
>> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> Date: Fri Feb 24 14:59:13 2017 -0800
>>
>> mm/autonuma: don't use set_pte_at when updating protnone ptes
>>
>> Architectures like ppc64, use privilege access bit to mark pte non
>> accessible. This implies that kernel can do a copy_to_user to an
>> address marked for numa fault. This also implies that there can be a
>> parallel hardware update for the pte. set_pte_at cannot be used in such
>> scenarios. Hence switch the pte update to use ptep_get_and_clear and
>> set_pte_at combination.
> Oh. This means the protnone approach doesn't work for PPC.
>
>
That is correct. I have yet to read the full thread. Can we make ptep_modify_prot_start()
not set pte = 0? One of the requirements for powerpc is to mark the pte hardware invalid
so that no TLB entries get inserted after that. The other option is to get a proper
pte_update API for the generic kernel so that architectures can do this without marking the
pte invalid.
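
As a rough illustration of the second option, here is a minimal userspace
sketch, not kernel code. It models a pte as a C11 atomic word and contrasts a
clear-then-commit update with a single compare-exchange update that never
stores 0. All names (model_pte_t, model_modify_start_clear,
model_modify_commit, model_pte_update) and the bit layout are assumptions made
up for this example. The model also ignores TLBs and parallel hardware updates,
which are exactly what the powerpc requirement above is about.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit layout for this model only; not a real architecture's. */
#define MODEL_PTE_ACCESSIBLE (UINT64_C(1) << 0)
#define MODEL_PTE_PROTNONE   (UINT64_C(1) << 1)

typedef _Atomic uint64_t model_pte_t;

/*
 * Clear-then-commit, modelled on the generic start/commit pair:
 * between the two calls a lockless reader observes 0.
 */
static uint64_t model_modify_start_clear(model_pte_t *ptep)
{
    return atomic_exchange(ptep, 0);
}

static void model_modify_commit(model_pte_t *ptep, uint64_t newval)
{
    atomic_store(ptep, newval);
}

/*
 * Single-step update: clear the bits in 'clr', set the bits in 'set',
 * and return the old value. The entry goes directly from the old value
 * to the new one, so it is never 0 unless the new value itself is 0.
 */
static uint64_t model_pte_update(model_pte_t *ptep, uint64_t clr, uint64_t set)
{
    uint64_t old = atomic_load(ptep);
    uint64_t updated;

    do {
        updated = (old & ~clr) | set;
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(ptep, &old, updated));

    return old;
}
```

With the first pair, a reader that samples the entry between start and commit
sees 0. With model_pte_update() there is no such intermediate state in this
model.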
-aneesh
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-10 1:57 ` Yin, Fengwei
2023-11-10 3:39 ` Kefeng Wang
@ 2023-11-14 1:41 ` Yang Shi
2023-11-14 11:10 ` Yin, Fengwei
1 sibling, 1 reply; 23+ messages in thread
From: Yang Shi @ 2023-11-14 1:41 UTC (permalink / raw)
To: Yin, Fengwei
Cc: zhangpeng (AS),
linux-mm, linux-kernel, akpm, Matthew Wilcox, lstoakes, hughd,
david, vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun, Kefeng Wang
On Thu, Nov 9, 2023 at 5:57 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 11/10/2023 6:54 AM, Yang Shi wrote:
> > On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
> >>
> >> Hi everyone,
> >>
> >> There is a performance issue that has been bothering us recently.
> >> This problem can reproduce in the latest mainline version (Linux 6.6).
> >>
> >> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
> >> to avoid performance problems caused by major fault.
> >>
> >> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
> >> ptep_modify_prot_start() will clear the vmf->pte, until
> >> ptep_modify_prot_commit() assign a value to the vmf->pte.
> >>
> >> For the data segment of the user-mode program, the global variable area
> >> is a private mapping. After the pagecache is loaded, the private
> >> anonymous page is generated after the COW is triggered. Mlockall can
> >> lock COW pages (anonymous pages), but the original file pages cannot
> >> be locked and may be reclaimed. If the global variable (private anon page)
> >> is accessed when vmf->pte is zero which is concurrently set by numa fault,
> >> a file page fault will be triggered.
> >>
> >> At this time, the original private file page may have been reclaimed.
> >> If the page cache is not available at this time, a major fault will be
> >> triggered and the file will be read, causing additional overhead.
> >>
> >> Our problem scenario is as follows:
> >>
> >> task 1 task 2
> >> ------ ------
> >> /* scan global variables */
> >> do_numa_page()
> >> spin_lock(vmf->ptl)
> >> ptep_modify_prot_start()
> >> /* set vmf->pte as null */
> >> /* Access global variables */
> >> handle_pte_fault()
> >> /* no pte lock */
> >> do_pte_missing()
> >> do_fault()
> >> do_read_fault()
> >> ptep_modify_prot_commit()
> >> /* ptep update done */
> >> pte_unmap_unlock(vmf->pte, vmf->ptl)
> >> do_fault_around()
> >> __do_fault()
> >> filemap_fault()
> >> /* page cache is not available
> >> and a major fault is triggered */
> >> do_sync_mmap_readahead()
> >> /* page_not_uptodate and goto
> >> out_retry. */
> >>
> >> Is there any way to avoid such a major fault?
> >
> > IMHO I don't think it is a bug. The man page quoted by Willy says "All
> > mapped pages are guaranteed to be resident in RAM when the call
> > returns successfully", but the later COW already made the file page
> > unmapped, right? The PTE pointed to the COW'ed anon page.
> > Hypothetically if we kept the file page mlocked and unmapped,
> > munlock() would have not munlocked the file page at all, it would be
> > mlocked in memory forever.
> But in this case, even the COW page is mlocked. There is small window
> that PTE is set to null in do_numa_page(). data segment access (it's to
> COW page which has nothing to do with original page cache) happens in
> this small window will trigger filemap_fault() to fault in original
> page cache.
Yes, my point is that this may not break mlockall, but the potential
optimization of avoiding the major fault may still stand.
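
One way to check from userspace whether major faults still occur after
mlockall() is to sample ru_majflt from getrusage() around the code of
interest. The following is a minimal sketch, assuming Linux with glibc. The
helper names (major_faults, lock_all, touch_globals, major_faults_during) and
the globals array are made up for illustration, and the sketch does not by
itself reproduce the race discussed in this thread.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Initialized global, so it is placed in the data segment: a private
 * file-backed mapping whose pages become anonymous once written (COW).
 * volatile keeps the compiler from dropping the accesses below.
 */
static volatile char globals[64 * 1024] = { 1 };

/* Cumulative major-fault count of this process, or -1 on error. */
static long major_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}

/*
 * Lock current and future mappings. Returns 0 on success and -1 on
 * failure (for example when RLIMIT_MEMLOCK is too low).
 */
static int lock_all(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Write one byte in every page of the global array. */
static void touch_globals(void)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0)
        page = 4096; /* fallback assumption */
    for (size_t i = 0; i < sizeof(globals); i += (size_t)page)
        globals[i] = (char)(globals[i] + 1);
}

/* Major faults incurred while running 'workload', or -1 on error. */
static long major_faults_during(void (*workload)(void))
{
    long before = major_faults();
    long after;

    if (before < 0)
        return -1;
    workload();
    after = major_faults();
    if (after < 0)
        return -1;
    return after - before;
}
```

A caller would invoke lock_all() once at startup and then wrap the
latency-sensitive section with major_faults_during(). A nonzero result means
the section still took at least one major fault.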
>
> I had thought to do double check whether vmf->pte is NULL in do_read_fault().
> But it's not reliable enough.
>
> Matthew's idea to use protnone to block both hardware accessing and
> do_pte_missing() looks more promising to me.
>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >> --
> >> Best Regards,
> >> Peng
> >>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-14 1:41 ` Yang Shi
@ 2023-11-14 11:10 ` Yin, Fengwei
0 siblings, 0 replies; 23+ messages in thread
From: Yin, Fengwei @ 2023-11-14 11:10 UTC (permalink / raw)
To: Yang Shi
Cc: zhangpeng (AS),
linux-mm, linux-kernel, akpm, Matthew Wilcox, lstoakes, hughd,
david, vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun, Kefeng Wang
On 11/14/2023 9:41 AM, Yang Shi wrote:
> On Thu, Nov 9, 2023 at 5:57 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 11/10/2023 6:54 AM, Yang Shi wrote:
>>> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> There is a performance issue that has been bothering us recently.
>>>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>>>
>>>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>>>> to avoid performance problems caused by major fault.
>>>>
>>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>>>
>>>> For the data segment of the user-mode program, the global variable area
>>>> is a private mapping. After the pagecache is loaded, the private
>>>> anonymous page is generated after the COW is triggered. Mlockall can
>>>> lock COW pages (anonymous pages), but the original file pages cannot
>>>> be locked and may be reclaimed. If the global variable (private anon page)
>>>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>>>> a file page fault will be triggered.
>>>>
>>>> At this time, the original private file page may have been reclaimed.
>>>> If the page cache is not available at this time, a major fault will be
>>>> triggered and the file will be read, causing additional overhead.
>>>>
>>>> Our problem scenario is as follows:
>>>>
>>>> task 1 task 2
>>>> ------ ------
>>>> /* scan global variables */
>>>> do_numa_page()
>>>> spin_lock(vmf->ptl)
>>>> ptep_modify_prot_start()
>>>> /* set vmf->pte as null */
>>>> /* Access global variables */
>>>> handle_pte_fault()
>>>> /* no pte lock */
>>>> do_pte_missing()
>>>> do_fault()
>>>> do_read_fault()
>>>> ptep_modify_prot_commit()
>>>> /* ptep update done */
>>>> pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>> do_fault_around()
>>>> __do_fault()
>>>> filemap_fault()
>>>> /* page cache is not available
>>>> and a major fault is triggered */
>>>> do_sync_mmap_readahead()
>>>> /* page_not_uptodate and goto
>>>> out_retry. */
>>>>
>>>> Is there any way to avoid such a major fault?
>>>
>>> IMHO I don't think it is a bug. The man page quoted by Willy says "All
>>> mapped pages are guaranteed to be resident in RAM when the call
>>> returns successfully", but the later COW already made the file page
>>> unmapped, right? The PTE pointed to the COW'ed anon page.
>>> Hypothetically if we kept the file page mlocked and unmapped,
>>> munlock() would have not munlocked the file page at all, it would be
>>> mlocked in memory forever.
>> But in this case, even the COW page is mlocked. There is small window
>> that PTE is set to null in do_numa_page(). data segment access (it's to
>> COW page which has nothing to do with original page cache) happens in
>> this small window will trigger filemap_fault() to fault in original
>> page cache.
>
> Yes, my point is that this may not break mlockall, but the potential
> optimization of avoiding the major fault may still stand.
Totally agree.
Regards
Yin, Fengwei
>
>>
>> I had thought to do double check whether vmf->pte is NULL in do_read_fault().
>> But it's not reliable enough.
>>
>> Matthew's idea to use protnone to block both hardware accessing and
>> do_pte_missing() looks more promising to me.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Peng
>>>>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 13:47 [Question]: major faults are still triggered after mlockall when numa balancing zhangpeng (AS)
` (2 preceding siblings ...)
2023-11-09 22:54 ` Yang Shi
@ 2023-11-09 23:21 ` Matthew Wilcox
2023-11-10 5:04 ` Aneesh Kumar K.V
2023-11-10 8:17 ` Aneesh Kumar K.V
5 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2023-11-09 23:21 UTC (permalink / raw)
To: zhangpeng (AS)
Cc: linux-mm, linux-kernel, akpm, lstoakes, hughd, david,
fengwei.yin, vbabka, peterz, mgorman, mingo, riel, ying.huang,
hannes, Nanyong Sun, Kefeng Wang, Aneesh Kumar K.V
I went spelunking to try to find out more about this issue, and I
discovered it's Aneesh's fault from 2017 ...
On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
> Hi everyone,
>
> There is a performance issue that has been bothering us recently.
> This problem can reproduce in the latest mainline version (Linux 6.6).
>
> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
> to avoid performance problems caused by major fault.
>
> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
> ptep_modify_prot_start() will clear the vmf->pte, until
> ptep_modify_prot_commit() assign a value to the vmf->pte.
>
> For the data segment of the user-mode program, the global variable area
> is a private mapping. After the pagecache is loaded, the private
> anonymous page is generated after the COW is triggered. Mlockall can
> lock COW pages (anonymous pages), but the original file pages cannot
> be locked and may be reclaimed. If the global variable (private anon page)
> is accessed when vmf->pte is zero which is concurrently set by numa fault,
> a file page fault will be triggered.
>
> At this time, the original private file page may have been reclaimed.
> If the page cache is not available at this time, a major fault will be
> triggered and the file will be read, causing additional overhead.
>
> Our problem scenario is as follows:
>
> task 1 task 2
> ------ ------
> /* scan global variables */
> do_numa_page()
> spin_lock(vmf->ptl)
> ptep_modify_prot_start()
> /* set vmf->pte as null */
> /* Access global variables */
> handle_pte_fault()
> /* no pte lock */
> do_pte_missing()
> do_fault()
> do_read_fault()
> ptep_modify_prot_commit()
> /* ptep update done */
> pte_unmap_unlock(vmf->pte, vmf->ptl)
> do_fault_around()
> __do_fault()
> filemap_fault()
> /* page cache is not available
> and a major fault is triggered */
> do_sync_mmap_readahead()
> /* page_not_uptodate and goto
> out_retry. */
>
> Is there any way to avoid such a major fault?
>
> --
> Best Regards,
> Peng
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 13:47 [Question]: major faults are still triggered after mlockall when numa balancing zhangpeng (AS)
` (3 preceding siblings ...)
2023-11-09 23:21 ` Matthew Wilcox
@ 2023-11-10 5:04 ` Aneesh Kumar K.V
2023-11-10 8:36 ` zhangpeng (AS)
2023-11-10 8:17 ` Aneesh Kumar K.V
5 siblings, 1 reply; 23+ messages in thread
From: Aneesh Kumar K.V @ 2023-11-10 5:04 UTC (permalink / raw)
To: zhangpeng (AS), linux-mm, linux-kernel
Cc: akpm, Matthew Wilcox, lstoakes, hughd, david, fengwei.yin,
vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun, Kefeng Wang
"zhangpeng (AS)" <zhangpeng362@huawei.com> writes:
> Hi everyone,
>
> There is a performance issue that has been bothering us recently.
> This problem can reproduce in the latest mainline version (Linux 6.6).
>
> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
> to avoid performance problems caused by major fault.
>
> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
> ptep_modify_prot_start() will clear the vmf->pte, until
> ptep_modify_prot_commit() assign a value to the vmf->pte.
>
> For the data segment of the user-mode program, the global variable area
> is a private mapping. After the pagecache is loaded, the private
> anonymous page is generated after the COW is triggered. Mlockall can
> lock COW pages (anonymous pages), but the original file pages cannot
> be locked and may be reclaimed. If the global variable (private anon page)
> is accessed when vmf->pte is zero which is concurrently set by numa fault,
> a file page fault will be triggered.
>
> At this time, the original private file page may have been reclaimed.
> If the page cache is not available at this time, a major fault will be
> triggered and the file will be read, causing additional overhead.
>
> Our problem scenario is as follows:
>
> task 1 task 2
> ------ ------
> /* scan global variables */
> do_numa_page()
> spin_lock(vmf->ptl)
> ptep_modify_prot_start()
> /* set vmf->pte as null */
> /* Access global variables */
> handle_pte_fault()
> /* no pte lock */
> do_pte_missing()
> do_fault()
> do_read_fault()
> ptep_modify_prot_commit()
> /* ptep update done */
> pte_unmap_unlock(vmf->pte, vmf->ptl)
> do_fault_around()
> __do_fault()
> filemap_fault()
> /* page cache is not available
> and a major fault is triggered */
> do_sync_mmap_readahead()
> /* page_not_uptodate and goto
> out_retry. */
>
> Is there any way to avoid such a major fault?
>
Is this also true w.r.t. change_pte_range(), in addition to do_numa_page()?
-aneesh
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-10 5:04 ` Aneesh Kumar K.V
@ 2023-11-10 8:36 ` zhangpeng (AS)
0 siblings, 0 replies; 23+ messages in thread
From: zhangpeng (AS) @ 2023-11-10 8:36 UTC (permalink / raw)
To: Aneesh Kumar K.V, linux-mm, linux-kernel
Cc: akpm, Matthew Wilcox, lstoakes, hughd, david, fengwei.yin,
vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun, Kefeng Wang
On 2023/11/10 13:04, Aneesh Kumar K.V wrote:
> "zhangpeng (AS)" <zhangpeng362@huawei.com> writes:
>
>> Hi everyone,
>>
>> There is a performance issue that has been bothering us recently.
>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>
>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>> to avoid performance problems caused by major fault.
>>
>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>> ptep_modify_prot_start() will clear the vmf->pte, until
>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>
>> For the data segment of the user-mode program, the global variable area
>> is a private mapping. After the pagecache is loaded, the private
>> anonymous page is generated after the COW is triggered. Mlockall can
>> lock COW pages (anonymous pages), but the original file pages cannot
>> be locked and may be reclaimed. If the global variable (private anon page)
>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>> a file page fault will be triggered.
>>
>> At this time, the original private file page may have been reclaimed.
>> If the page cache is not available at this time, a major fault will be
>> triggered and the file will be read, causing additional overhead.
>>
>> Our problem scenario is as follows:
>>
>> task 1 task 2
>> ------ ------
>> /* scan global variables */
>> do_numa_page()
>> spin_lock(vmf->ptl)
>> ptep_modify_prot_start()
>> /* set vmf->pte as null */
>> /* Access global variables */
>> handle_pte_fault()
>> /* no pte lock */
>> do_pte_missing()
>> do_fault()
>> do_read_fault()
>> ptep_modify_prot_commit()
>> /* ptep update done */
>> pte_unmap_unlock(vmf->pte, vmf->ptl)
>> do_fault_around()
>> __do_fault()
>> filemap_fault()
>> /* page cache is not available
>> and a major fault is triggered */
>> do_sync_mmap_readahead()
>> /* page_not_uptodate and goto
>> out_retry. */
>>
>> Is there any way to avoid such a major fault?
>>
> Is this also true w.r.t. change_pte_range(), in addition to do_numa_page()?
>
> -aneesh
It seems that change_pte_range() should have a similar problem.
However, the NUMA fault path (do_numa_page()) is generally triggered
relatively frequently, so the problem is more likely to occur there in
actual scenarios.
--
Best Regards,
Peng
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-09 13:47 [Question]: major faults are still triggered after mlockall when numa balancing zhangpeng (AS)
` (4 preceding siblings ...)
2023-11-10 5:04 ` Aneesh Kumar K.V
@ 2023-11-10 8:17 ` Aneesh Kumar K.V
2023-11-10 9:50 ` zhangpeng (AS)
5 siblings, 1 reply; 23+ messages in thread
From: Aneesh Kumar K.V @ 2023-11-10 8:17 UTC (permalink / raw)
To: zhangpeng (AS), linux-mm, linux-kernel
Cc: akpm, Matthew Wilcox, lstoakes, hughd, david, fengwei.yin,
vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun, Kefeng Wang
"zhangpeng (AS)" <zhangpeng362@huawei.com> writes:
> Hi everyone,
>
> There is a performance issue that has been bothering us recently.
> This problem can reproduce in the latest mainline version (Linux 6.6).
>
> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
> to avoid performance problems caused by major fault.
>
> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
> ptep_modify_prot_start() will clear the vmf->pte, until
> ptep_modify_prot_commit() assign a value to the vmf->pte.
>
A pte lookup doesn't expect the pte to be 0 after it has been initialized (we
check the pte value without holding the ptl, and if we find the pte value is 0
we return). So read-modify-write updates to the pte should make sure we
don't clear the pte, right? powerpc did that by marking the pte present
but invalid. Can we do something similar for other architectures? The default
implementation of ptep_modify_prot_start() via ptep_get_and_clear() can
result in a pte lookup returning the wrong pte, as explained in the report,
because we don't take the ptl and recheck when we find pte == 0.
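
The lookup side of this can be modelled in a few lines of userspace C. This
is only a sketch under stated assumptions: the bit names, the
model_fault_path enum and the three helpers are hypothetical and do not
correspond to any real architecture's pte layout. It shows how a lockless
lookup classifies the transient value left by a clear-based start versus one
that keeps the entry nonzero but marked invalid.

```c
#include <stdint.h>

/* Hypothetical bits for this model only. */
#define MODEL_PTE_PRESENT (UINT64_C(1) << 0) /* entry is usable */
#define MODEL_PTE_INVALID (UINT64_C(1) << 1) /* update in progress */

enum model_fault_path {
    MODEL_PATH_MISSING, /* entry is 0: treated as never populated */
    MODEL_PATH_RETRY,   /* entry is nonzero but not usable right now */
    MODEL_PATH_MAPPED,  /* entry is usable */
};

/* What a lookup without the page table lock concludes from one sample. */
static enum model_fault_path model_lockless_lookup(uint64_t pte)
{
    if (pte == 0)
        return MODEL_PATH_MISSING;
    if (pte & MODEL_PTE_INVALID)
        return MODEL_PATH_RETRY;
    if (pte & MODEL_PTE_PRESENT)
        return MODEL_PATH_MAPPED;
    return MODEL_PATH_RETRY;
}

/* Transient value when the update starts by clearing the entry. */
static uint64_t model_start_by_clearing(uint64_t pte)
{
    (void)pte;
    return 0;
}

/*
 * Transient value when the update starts by dropping PRESENT and
 * setting INVALID: the remaining bits are kept, so the entry stays
 * nonzero for the whole update.
 */
static uint64_t model_start_by_invalidating(uint64_t pte)
{
    return (pte & ~MODEL_PTE_PRESENT) | MODEL_PTE_INVALID;
}
```

In this model only the clear-based start makes the lookup take the "missing"
path, which corresponds to the do_pte_missing() branch in the reported
scenario.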
>
> For the data segment of the user-mode program, the global variable area
> is a private mapping. After the pagecache is loaded, the private
> anonymous page is generated after the COW is triggered. Mlockall can
> lock COW pages (anonymous pages), but the original file pages cannot
> be locked and may be reclaimed. If the global variable (private anon page)
> is accessed when vmf->pte is zero which is concurrently set by numa fault,
> a file page fault will be triggered.
>
> At this time, the original private file page may have been reclaimed.
> If the page cache is not available at this time, a major fault will be
> triggered and the file will be read, causing additional overhead.
>
> Our problem scenario is as follows:
>
> task 1 task 2
> ------ ------
> /* scan global variables */
> do_numa_page()
> spin_lock(vmf->ptl)
> ptep_modify_prot_start()
> /* set vmf->pte as null */
> /* Access global variables */
> handle_pte_fault()
> /* no pte lock */
> do_pte_missing()
> do_fault()
> do_read_fault()
> ptep_modify_prot_commit()
> /* ptep update done */
> pte_unmap_unlock(vmf->pte, vmf->ptl)
> do_fault_around()
> __do_fault()
> filemap_fault()
> /* page cache is not available
> and a major fault is triggered */
> do_sync_mmap_readahead()
> /* page_not_uptodate and goto
> out_retry. */
>
> Is there any way to avoid such a major fault?
>
-aneesh
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question]: major faults are still triggered after mlockall when numa balancing
2023-11-10 8:17 ` Aneesh Kumar K.V
@ 2023-11-10 9:50 ` zhangpeng (AS)
0 siblings, 0 replies; 23+ messages in thread
From: zhangpeng (AS) @ 2023-11-10 9:50 UTC (permalink / raw)
To: Aneesh Kumar K.V, linux-mm, linux-kernel
Cc: akpm, Matthew Wilcox, lstoakes, hughd, david, fengwei.yin,
vbabka, peterz, mgorman, mingo, riel, ying.huang, hannes,
Nanyong Sun, Kefeng Wang
On 2023/11/10 16:17, Aneesh Kumar K.V wrote:
> "zhangpeng (AS)" <zhangpeng362@huawei.com> writes:
>
>> Hi everyone,
>>
>> There is a performance issue that has been bothering us recently.
>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>
>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>> to avoid performance problems caused by major fault.
>>
>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>> ptep_modify_prot_start() will clear the vmf->pte, until
>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>
> A pte lookup doesn't expect the pte to be 0 after it has been initialized (we
> check the pte value without holding the ptl, and if we find the pte value is 0
> we return). So read-modify-write updates to the pte should make sure we
> don't clear the pte, right? powerpc did that by marking the pte present
> but invalid. Can we do something similar for other architectures? The default
> implementation of ptep_modify_prot_start() via ptep_get_and_clear() can
> result in a pte lookup returning the wrong pte, as explained in the report,
> because we don't take the ptl and recheck when we find pte == 0.
>
Thank you for helping to clarify the problem.
>> For the data segment of the user-mode program, the global variable area
>> is a private mapping. After the pagecache is loaded, the private
>> anonymous page is generated after the COW is triggered. Mlockall can
>> lock COW pages (anonymous pages), but the original file pages cannot
>> be locked and may be reclaimed. If the global variable (private anon page)
>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>> a file page fault will be triggered.
>>
>> At this time, the original private file page may have been reclaimed.
>> If the page cache is not available at this time, a major fault will be
>> triggered and the file will be read, causing additional overhead.
>>
>> Our problem scenario is as follows:
>>
>> task 1 task 2
>> ------ ------
>> /* scan global variables */
>> do_numa_page()
>> spin_lock(vmf->ptl)
>> ptep_modify_prot_start()
>> /* set vmf->pte as null */
>> /* Access global variables */
>> handle_pte_fault()
>> /* no pte lock */
>> do_pte_missing()
>> do_fault()
>> do_read_fault()
>> ptep_modify_prot_commit()
>> /* ptep update done */
>> pte_unmap_unlock(vmf->pte, vmf->ptl)
>> do_fault_around()
>> __do_fault()
>> filemap_fault()
>> /* page cache is not available
>> and a major fault is triggered */
>> do_sync_mmap_readahead()
>> /* page_not_uptodate and goto
>> out_retry. */
>>
>> Is there any way to avoid such a major fault?
>>
> -aneesh
--
Best Regards,
Peng
^ permalink raw reply [flat|nested] 23+ messages in thread