From: Yang Shi <shy828301@gmail.com>
To: "zhangpeng (AS)" <zhangpeng362@huawei.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 akpm@linux-foundation.org, Matthew Wilcox <willy@infradead.org>,
	lstoakes@gmail.com,  hughd@google.com, david@redhat.com,
	fengwei.yin@intel.com, vbabka@suse.cz,  peterz@infradead.org,
	mgorman@suse.de, mingo@redhat.com, riel@redhat.com,
	 ying.huang@intel.com, hannes@cmpxchg.org,
	Nanyong Sun <sunnanyong@huawei.com>,
	 Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: Re: [Question]: major faults are still triggered after mlockall when numa balancing
Date: Thu, 9 Nov 2023 14:54:15 -0800
Message-ID: <CAHbLzkqEytFbRoHU3=Y85tmTQ--XVQpwhVEXgDN0ss_PPv8VGA@mail.gmail.com>
In-Reply-To: <9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com>

On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@huawei.com> wrote:
>
> Hi everyone,
>
> There is a performance issue that has been bothering us recently.
> This problem can be reproduced on the latest mainline version (Linux 6.6).
>
> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user-mode process
> to avoid performance problems caused by major faults.
>
> During NUMA fault handling, do_numa_page() temporarily sets the PTE to 0:
> ptep_modify_prot_start() clears vmf->pte, and it stays clear until
> ptep_modify_prot_commit() assigns a new value to vmf->pte.
>
> For the data segment of a user-mode program, the global variable area
> is a private mapping. After the page cache is loaded, the first write
> triggers COW and creates a private anonymous page. mlockall() can
> lock the COW pages (anonymous pages), but the original file pages cannot
> be locked and may be reclaimed. If a global variable (private anon page)
> is accessed while vmf->pte is zero because of a concurrent NUMA fault,
> a file page fault is triggered.
>
> At this time, the original private file page may have been reclaimed.
> If the page cache is not available at this time, a major fault will be
> triggered and the file will be read, causing additional overhead.
>
> Our problem scenario is as follows:
>
> task 1                      task 2
> ------                      ------
> /* scan global variables */
> do_numa_page()
>    spin_lock(vmf->ptl)
>    ptep_modify_prot_start()
>    /* set vmf->pte as null */
>                              /* Access global variables */
>                              handle_pte_fault()
>                                /* no pte lock */
>                                do_pte_missing()
>                                  do_fault()
>                                    do_read_fault()
>    ptep_modify_prot_commit()
>    /* ptep update done */
>    pte_unmap_unlock(vmf->pte, vmf->ptl)
>                                      do_fault_around()
>                                      __do_fault()
>                                        filemap_fault()
>                                          /* page cache is not available
>                                          and a major fault is triggered */
>                                          do_sync_mmap_readahead()
>                                          /* page_not_uptodate and goto
>                                          out_retry. */
>
> Is there any way to avoid such a major fault?

IMHO this is not a bug. The man page quoted by Willy says "All
mapped pages are guaranteed to be resident in RAM when the call
returns successfully", but the later COW already unmapped the file
page, right? The PTE now points to the COW'ed anon page.
Hypothetically, if we kept the file page mlocked while unmapped,
munlock() would never munlock it, and it would stay mlocked in
memory forever.
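
For anyone who wants to watch this from user space, here is a minimal
sketch (not taken from the thread) that locks memory with mlockall() and
counts major faults through getrusage(). The names global_data,
major_faults(), lock_all_memory() and touch_and_count() are hypothetical
helpers made up for illustration, and the 64 KiB buffer size is arbitrary.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical stand-in for the program's global variables. Being
 * initialized, it is placed in .data, which starts life as a private
 * file-backed mapping and becomes anonymous memory on first write (COW). */
static char global_data[1 << 16] = { 1 };

/* Major faults charged to this process so far, or -1 on error. */
static long major_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}

/* Lock all current and future mappings. Returns 0 on success, -1 on
 * failure (for example when RLIMIT_MEMLOCK is too low). */
static int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return -1;
    }
    return 0;
}

/* Write one byte per page to force COW, then read one byte per page and
 * return how many major faults the read pass took (-1 on error). */
static long touch_and_count(volatile char *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    long before, after;
    size_t i;

    assert(buf != NULL && len > 0);
    if (page <= 0)
        return -1;

    for (i = 0; i < len; i += (size_t)page)
        buf[i] = 0x5a;          /* write: private anon page via COW */

    before = major_faults();
    for (i = 0; i < len; i += (size_t)page)
        (void)buf[i];           /* read: should hit the anon page */
    after = major_faults();

    return (before < 0 || after < 0) ? -1 : after - before;
}
```

A caller would run lock_all_memory() once at startup and then sample
touch_and_count(global_data, sizeof(global_data)) repeatedly. A nonzero
result with NUMA balancing enabled would be consistent with the race
described above, though a single pass is unlikely to hit it.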

>
> --
> Best Regards,
> Peng
>

