From: Yafang Shao <laoar.shao@gmail.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
David Rientjes <rientjes@google.com>,
Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
Date: Thu, 4 Jul 2024 21:27:59 +0800
Message-ID: <CALOAHbACG64faVearXF-eJQLVc4Viv=ShOtpLeQSfVwx2tdr=w@mail.gmail.com>
In-Reply-To: <87jzi3kphw.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >> >> >> >>
> >> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> > Currently, we're encountering latency spikes in our container environment
> >> >> >> >> > when a specific container with multiple Python-based tasks exits. These
> >> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
> >> >> >> >> > impacting latency for other containers attempting to allocate memory.
> >> >> >> >>
> >> >> >> >> Is this locking issue well understood? Is anyone working on it? A
> >> >> >> >> reasonably detailed description of the issue and a description of any
> >> >> >> >> ongoing work would be helpful here.
> >> >> >> >
> >> >> >> > In our containerized environment, we have a specific type of container
> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> >> > processes are organized as separate processes rather than threads due
> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> >> > containers hosted on the same machine experience significant latency
> >> >> >> > spikes.
> >> >> >> >
> >> >> >> > Our investigation using perf tracing revealed that the root cause of
> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> >> > primary contributor to the observed latency issues.
> >> >> >> >
> >> >> >> > +   77.02%   0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >> >> > -   76.98%   0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >> >> >    - 76.97% exit_mmap
> >> >> >> >       - 58.58% unmap_vmas
> >> >> >> >          - 58.55% unmap_single_vma
> >> >> >> >             - unmap_page_range
> >> >> >> >                - 58.32% zap_pte_range
> >> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >> >                         - 41.22% release_pages
> >> >> >> >                            - 33.29% free_unref_page_list
> >> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >> >                              0.60% __mod_lruvec_state
> >> >> >> >                           1.07% free_swap_cache
> >> >> >> >                   + 11.69% page_remove_rmap
> >> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >> >       - 17.34% remove_vma
> >> >> >> >          - 17.25% vm_area_free
> >> >> >> >             - 17.23% kmem_cache_free
> >> >> >> >                - 17.15% __slab_free
> >> >> >> >                   - 14.56% discard_slab
> >> >> >> >                        free_slab
> >> >> >> >                        __free_slab
> >> >> >> >                        __free_pages
> >> >> >> >                      - free_unref_page
> >> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >> >                            - free_pcppages_bulk
> >> >> >> >                               + 13.44% _raw_spin_lock
> >> >> >> >
> >> >> >> > By enabling the mm_page_pcpu_drain tracepoint, we can capture the
> >> >> >> > detailed call stack:
> >> >> >> >
> >> >> >> > <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
> >> >> >> > page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >> > <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> >> >> > => free_pcppages_bulk
> >> >> >> > => free_unref_page_commit
> >> >> >> > => free_unref_page_list
> >> >> >> > => release_pages
> >> >> >> > => free_pages_and_swap_cache
> >> >> >> > => tlb_flush_mmu
> >> >> >> > => zap_pte_range
> >> >> >> > => unmap_page_range
> >> >> >> > => unmap_single_vma
> >> >> >> > => unmap_vmas
> >> >> >> > => exit_mmap
> >> >> >> > => mmput
> >> >> >> > => do_exit
> >> >> >> > => do_group_exit
> >> >> >> > => get_signal
> >> >> >> > => arch_do_signal_or_restart
> >> >> >> > => exit_to_user_mode_prepare
> >> >> >> > => syscall_exit_to_user_mode
> >> >> >> > => do_syscall_64
> >> >> >> > => entry_SYSCALL_64_after_hwframe
> >> >> >> >
> >> >> >> > The servers experiencing these issues are equipped with 256 CPUs and
> >> >> >> > 1TB of memory, all within a single NUMA node. The zoneinfo is as
> >> >> >> > follows:
> >> >> >> >
> >> >> >> > Node 0, zone Normal
> >> >> >> > pages free 144465775
> >> >> >> > boost 0
> >> >> >> > min 1309270
> >> >> >> > low 1636587
> >> >> >> > high 1963904
> >> >> >> > spanned 564133888
> >> >> >> > present 296747008
> >> >> >> > managed 291974346
> >> >> >> > cma 0
> >> >> >> > protection: (0, 0, 0, 0)
> >> >> >> > ...
> >> >> >> > ...
> >> >> >> > pagesets
> >> >> >> > cpu: 0
> >> >> >> > count: 2217
> >> >> >> > high: 6392
> >> >> >> > batch: 63
> >> >> >> > vm stats threshold: 125
> >> >> >> > cpu: 1
> >> >> >> > count: 4510
> >> >> >> > high: 6392
> >> >> >> > batch: 63
> >> >> >> > vm stats threshold: 125
> >> >> >> > cpu: 2
> >> >> >> > count: 3059
> >> >> >> > high: 6392
> >> >> >> > batch: 63
> >> >> >> >
> >> >> >> > ...
> >> >> >> >
> >> >> >> > The high is around 100 times the batch size.
> >> >> >> >
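A quick sanity check of those numbers, assuming the usual pcp sizing logic in
this era of kernels (pcp->high derived from the zone's low watermark split
across the node's CPUs, with a floor of 4 * batch once the fraction knob is
set). Treat the formula as my reading rather than verbatim kernel code; it
does reproduce the zoneinfo above:

# Rough sanity check of the pcp sizing reported above (assumed formula,
# not verbatim kernel code): default high ~ low watermark / nr CPUs,
# with a floor of 4 * batch when the fraction knob is used.
low_wmark = 1636587    # "low" from the zoneinfo above, in pages
nr_cpus = 256          # all CPUs sit in a single NUMA node
batch = 63             # pcp batch from the zoneinfo above

default_high = low_wmark // nr_cpus                   # -> 6392, matches "high: 6392"
print("default pcp high:", default_high, "pages")
print("high / batch ratio:", default_high // batch)   # -> ~101
print("minimum high with the fraction knob:", 4 * batch, "pages")  # -> 252
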
> >> >> >> > We also traced the latency associated with the free_pcppages_bulk()
> >> >> >> > function during the container exit process:
> >> >> >> >
> >> >> >> > 19:48:54
> >> >> >> > nsecs : count distribution
> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> > 256 -> 511 : 148 |***************** |
> >> >> >> > 512 -> 1023 : 334 |****************************************|
> >> >> >> > 1024 -> 2047 : 33 |*** |
> >> >> >> > 2048 -> 4095 : 5 | |
> >> >> >> > 4096 -> 8191 : 7 | |
> >> >> >> > 8192 -> 16383 : 12 |* |
> >> >> >> > 16384 -> 32767 : 30 |*** |
> >> >> >> > 32768 -> 65535 : 21 |** |
> >> >> >> > 65536 -> 131071 : 15 |* |
> >> >> >> > 131072 -> 262143 : 27 |*** |
> >> >> >> > 262144 -> 524287 : 84 |********** |
> >> >> >> > 524288 -> 1048575 : 203 |************************ |
> >> >> >> > 1048576 -> 2097151 : 284 |********************************** |
> >> >> >> > 2097152 -> 4194303 : 327 |*************************************** |
> >> >> >> > 4194304 -> 8388607 : 215 |************************* |
> >> >> >> > 8388608 -> 16777215 : 116 |************* |
> >> >> >> > 16777216 -> 33554431 : 47 |***** |
> >> >> >> > 33554432 -> 67108863 : 8 | |
> >> >> >> > 67108864 -> 134217727 : 3 | |
> >> >> >> >
> >> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
> >> >> >> >
> >> >> >> > The latency can reach tens of milliseconds.
> >> >> >> >
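For anyone reproducing this: the histograms in this thread are in bcc
funclatency format. Below is a minimal sketch of that kind of measurement;
it is illustrative only (the exact tool and options used aren't stated here,
and the kprobe attach assumes free_pcppages_bulk() is not inlined):

#!/usr/bin/env python3
# Minimal funclatency-style sketch: log2 histogram of free_pcppages_bulk()
# run time in nanoseconds. Requires bcc and root; illustrative only.
from bcc import BPF
from time import sleep

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int trace_entry(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    dist.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="free_pcppages_bulk", fn_name="trace_entry")
b.attach_kretprobe(event="free_pcppages_bulk", fn_name="trace_return")

sleep(30)                      # sample while the containers exit
b["dist"].print_log2_hist("nsecs")
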
> >> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> >> >> >> > minimum pagelist high at 4 times the batch size, we were able to
> >> >> >> > significantly reduce the latency associated with the
> >> >> >> > free_pcppages_bulk() function during container exits:
> >> >> >> >
> >> >> >> > nsecs : count distribution
> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> > 128 -> 255 : 120 | |
> >> >> >> > 256 -> 511 : 365 |* |
> >> >> >> > 512 -> 1023 : 201 | |
> >> >> >> > 1024 -> 2047 : 103 | |
> >> >> >> > 2048 -> 4095 : 84 | |
> >> >> >> > 4096 -> 8191 : 87 | |
> >> >> >> > 8192 -> 16383 : 4777 |************** |
> >> >> >> > 16384 -> 32767 : 10572 |******************************* |
> >> >> >> > 32768 -> 65535 : 13544 |****************************************|
> >> >> >> > 65536 -> 131071 : 12723 |************************************* |
> >> >> >> > 131072 -> 262143 : 8604 |************************* |
> >> >> >> > 262144 -> 524287 : 3659 |********** |
> >> >> >> > 524288 -> 1048575 : 921 |** |
> >> >> >> > 1048576 -> 2097151 : 122 | |
> >> >> >> > 2097152 -> 4194303 : 5 | |
> >> >> >> >
> >> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
> >> >> >> >
> >> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> >> >> >> > knob to set the minimum pagelist high at a level that effectively
> >> >> >> > mitigated latency issues, we observed that other containers were no
> >> >> >> > longer experiencing similar complaints. As a result, we decided to
> >> >> >> > implement this tuning as a permanent workaround and have deployed it
> >> >> >> > across all clusters of servers where these containers may be deployed.
> >> >> >>
> >> >> >> Thanks for your detailed data.
> >> >> >>
> >> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
> >> >> >> shouldn't be a problem?
> >> >> >
> >> >> > Right. The problem arises when the process holds the lock for too
> >> >> > long, causing other processes that are attempting to allocate memory
> >> >> > to experience delays or wait times.
> >> >> >
> >> >> >> Because users care more about the total time of
> >> >> >> process exiting, that is, throughput. And I suspect that the zone->lock
> >> >> >> contention and page allocating/freeing throughput will be worse with
> >> >> >> your configuration?
> >> >> >
> >> >> > The reduction in throughput is minimal and not a significant concern.
> >> >> > Latency spikes, however, are a crucial metric for assessing system
> >> >> > stability and matter more to users: higher latency can lead to request
> >> >> > errors, impacting the user experience. Maintaining stability, even at
> >> >> > the cost of slightly lower throughput, is therefore preferable to
> >> >> > higher but unstable throughput.
> >> >> >
> >> >> >>
> >> >> >> But the latency of free_pcppages_bulk() and page allocation in other
> >> >> >> processes is a problem. And your configuration can help it.
> >> >> >>
> >> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX. In that way,
> >> >> >> you have a normal PCP size (high) but smaller PCP batch. I guess that
> >> >> >> may help both latency and throughput in your system. Could you give it
> >> >> >> a try?
> >> >> >
> >> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
> >> >> > configuration option. However, I've observed your recent improvements
> >> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
> >> >> > restrict the pcp batch scale factor to avoid too long latency"), which
> >> >> > has prompted me to experiment with manually setting the
> >> >> > pcp->free_factor to zero. While this adjustment provided some
> >> >> > improvement, the results were not as significant as I had hoped.
> >> >> >
> >> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
> >> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
> >> >> > to more easily adjust it.
> >> >>
> >> >> If you cannot test upstream behavior, it's hard to make changes to
> >> >> upstream. Could you find a way to do that?
> >> >
> >> > I'm afraid I can't run an upstream kernel in our production environment :(
> >> > Lots of code changes have to be made.
> >>
> >> Understood. Can you find a way to test the upstream behavior, if not the
> >> upstream kernel itself? Or test the upstream kernel in an environment
> >> that is similar to, though not exactly, production?
> >
> > I'm willing to give it a try, but it may take some time to achieve the
> > desired results.
>
> Thanks!

After I backported the series "mm: PCP high auto-tuning" (9 patches in
total) to our 6.1.y stable kernel and deployed it to our production
environment, I observed a significant reduction in latency. The results
are as follows:

nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 2 | |
2048 -> 4095 : 11 | |
4096 -> 8191 : 3 | |
8192 -> 16383 : 1 | |
16384 -> 32767 : 2 | |
32768 -> 65535 : 7 | |
65536 -> 131071 : 198 |********* |
131072 -> 262143 : 530 |************************ |
262144 -> 524287 : 824 |************************************** |
524288 -> 1048575 : 852 |****************************************|
1048576 -> 2097151 : 714 |********************************* |
2097152 -> 4194303 : 389 |****************** |
4194304 -> 8388607 : 143 |****** |
8388608 -> 16777215 : 29 |* |
16777216 -> 33554431 : 1 | |
avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708

Compared to the previous data, the maximum latency has been reduced to
less than 30ms.

Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max,
to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning
vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum
latency was further reduced to less than 2ms:

nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 36 | |
2048 -> 4095 : 5063 |***** |
4096 -> 8191 : 31226 |******************************** |
8192 -> 16383 : 37606 |*************************************** |
16384 -> 32767 : 38359 |****************************************|
32768 -> 65535 : 30652 |******************************* |
65536 -> 131071 : 18714 |******************* |
131072 -> 262143 : 7968 |******** |
262144 -> 524287 : 1996 |** |
524288 -> 1048575 : 302 | |
1048576 -> 2097151 : 19 | |
avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031

After multiple trials, I observed no significant differences between
runs.

Therefore, we decided to backport your improvements to our local
kernel. Additionally, I propose introducing a new sysctl knob,
vm.pcp_batch_scale_max, to the upstream kernel. This will enable users
to easily tune the setting based on their specific workloads.
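
To make the effect of the knob concrete: my understanding from commit
52166607ecc9 is that the number of pages returned to the buddy allocator
under a single zone->lock acquisition is capped at roughly
batch << pcp_batch_scale_max. On these machines the cap works out as in the
sketch below (arithmetic only, not the kernel code itself):

# Per-lock-hold free cap implied by pcp_batch_scale_max, assuming the cap
# is batch << scale (my reading of commit 52166607ecc9, not kernel code).
batch = 63            # pcp batch size from the zoneinfo earlier in the thread
page_kib = 4

for scale in (5, 0):  # default CONFIG value vs. the tuned value
    cap = batch << scale
    print(f"pcp_batch_scale_max={scale}: up to {cap} pages "
          f"(~{cap * page_kib} KiB) freed per zone->lock hold")
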
--
Regards
Yafang