From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 linux-mm@kvack.org,  Matthew Wilcox <willy@infradead.org>,
	 David Rientjes <rientjes@google.com>,
	 Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
Date: Wed, 03 Jul 2024 09:55:33 +0800	[thread overview]
Message-ID: <87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <CALOAHbDn+ax1oeaM4at+tNW6B+rEK6zy-32Upr7S5KcJu=JmOw@mail.gmail.com> (Yafang Shao's message of "Tue, 2 Jul 2024 20:07:57 +0800")

Yafang Shao <laoar.shao@gmail.com> writes:

> On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>> >>
>> >> On Mon,  1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>> >>
>> >> > Currently, we're encountering latency spikes in our container environment
>> >> > when a specific container with multiple Python-based tasks exits. These
>> >> > tasks may hold the zone->lock for an extended period, significantly
>> >> > impacting latency for other containers attempting to allocate memory.
>> >>
>> >> Is this locking issue well understood?  Is anyone working on it?  A
>> >> reasonably detailed description of the issue and a description of any
>> >> ongoing work would be helpful here.
>> >
>> > In our containerized environment, we have a specific type of container
>> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> > tasks run as separate processes rather than threads because the Python
>> > Global Interpreter Lock (GIL) would be a bottleneck in a multi-threaded
>> > setup. When such a container exits, other containers hosted on the same
>> > machine experience significant latency spikes.
>> >
>> > Our investigation using perf tracing revealed that the root cause of
>> > these spikes is the simultaneous execution of exit_mmap() by each of
>> > the exiting processes. This concurrent access to the zone->lock
>> > results in contention, which becomes a hotspot and negatively impacts
>> > performance. The perf results clearly indicate this contention as a
>> > primary contributor to the observed latency issues.
>> >
>> > +   77.02%     0.00%  uwsgi    [kernel.kallsyms]  [k] mmput
>> > -   76.98%     0.01%  uwsgi    [kernel.kallsyms]  [k] exit_mmap
>> >    - 76.97% exit_mmap
>> >       - 58.58% unmap_vmas
>> >          - 58.55% unmap_single_vma
>> >             - unmap_page_range
>> >                - 58.32% zap_pte_range
>> >                   - 42.88% tlb_flush_mmu
>> >                      - 42.76% free_pages_and_swap_cache
>> >                         - 41.22% release_pages
>> >                            - 33.29% free_unref_page_list
>> >                               - 32.37% free_unref_page_commit
>> >                                  - 31.64% free_pcppages_bulk
>> >                                     + 28.65% _raw_spin_lock
>> >                                       1.28% __list_del_entry_valid
>> >                            + 3.25% folio_lruvec_lock_irqsave
>> >                            + 0.75% __mem_cgroup_uncharge_list
>> >                              0.60% __mod_lruvec_state
>> >                           1.07% free_swap_cache
>> >                   + 11.69% page_remove_rmap
>> >                     0.64% __mod_lruvec_page_state
>> >       - 17.34% remove_vma
>> >          - 17.25% vm_area_free
>> >             - 17.23% kmem_cache_free
>> >                - 17.15% __slab_free
>> >                   - 14.56% discard_slab
>> >                        free_slab
>> >                        __free_slab
>> >                        __free_pages
>> >                      - free_unref_page
>> >                         - 13.50% free_unref_page_commit
>> >                            - free_pcppages_bulk
>> >                               + 13.44% _raw_spin_lock
>> >
>> > By enabling the mm_page_pcpu_drain tracepoint, we can see the detailed
>> > call stack:
>> >
>> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >            <...>-1540432 [224] d..3. 618048.023887: <stack trace>
>> >  => free_pcppages_bulk
>> >  => free_unref_page_commit
>> >  => free_unref_page_list
>> >  => release_pages
>> >  => free_pages_and_swap_cache
>> >  => tlb_flush_mmu
>> >  => zap_pte_range
>> >  => unmap_page_range
>> >  => unmap_single_vma
>> >  => unmap_vmas
>> >  => exit_mmap
>> >  => mmput
>> >  => do_exit
>> >  => do_group_exit
>> >  => get_signal
>> >  => arch_do_signal_or_restart
>> >  => exit_to_user_mode_prepare
>> >  => syscall_exit_to_user_mode
>> >  => do_syscall_64
>> >  => entry_SYSCALL_64_after_hwframe
>> >
>> > The servers experiencing these issues have 256 CPUs and 1TB of memory,
>> > all within a single NUMA node. The zoneinfo is as follows:
>> >
>> > Node 0, zone   Normal
>> >   pages free     144465775
>> >         boost    0
>> >         min      1309270
>> >         low      1636587
>> >         high     1963904
>> >         spanned  564133888
>> >         present  296747008
>> >         managed  291974346
>> >         cma      0
>> >         protection: (0, 0, 0, 0)
>> > ...
>> > ...
>> >   pagesets
>> >     cpu: 0
>> >               count: 2217
>> >               high:  6392
>> >               batch: 63
>> >   vm stats threshold: 125
>> >     cpu: 1
>> >               count: 4510
>> >               high:  6392
>> >               batch: 63
>> >   vm stats threshold: 125
>> >     cpu: 2
>> >               count: 3059
>> >               high:  6392
>> >               batch: 63
>> >
>> > ...
>> >
>> > The pcp high (6392) is around 100 times the batch size (63).
>> >
>> > We also traced the latency associated with the free_pcppages_bulk()
>> > function during the container exit process:
>> >
>> > 19:48:54
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 148      |*****************                       |
>> >        512 -> 1023       : 334      |****************************************|
>> >       1024 -> 2047       : 33       |***                                     |
>> >       2048 -> 4095       : 5        |                                        |
>> >       4096 -> 8191       : 7        |                                        |
>> >       8192 -> 16383      : 12       |*                                       |
>> >      16384 -> 32767      : 30       |***                                     |
>> >      32768 -> 65535      : 21       |**                                      |
>> >      65536 -> 131071     : 15       |*                                       |
>> >     131072 -> 262143     : 27       |***                                     |
>> >     262144 -> 524287     : 84       |**********                              |
>> >     524288 -> 1048575    : 203      |************************                |
>> >    1048576 -> 2097151    : 284      |**********************************      |
>> >    2097152 -> 4194303    : 327      |*************************************** |
>> >    4194304 -> 8388607    : 215      |*************************               |
>> >    8388608 -> 16777215   : 116      |*************                           |
>> >   16777216 -> 33554431   : 47       |*****                                   |
>> >   33554432 -> 67108863   : 8        |                                        |
>> >   67108864 -> 134217727  : 3        |                                        |
>> >
>> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >
>> > The latency can reach tens of milliseconds.
>> >
>> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
>> > minimum pagelist high at 4 times the batch size (with batch = 63, that
>> > caps pcp high at roughly 252 pages instead of 6392), we were able to
>> > significantly reduce the latency of the free_pcppages_bulk() function
>> > during container exits:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 120      |                                        |
>> >        256 -> 511        : 365      |*                                       |
>> >        512 -> 1023       : 201      |                                        |
>> >       1024 -> 2047       : 103      |                                        |
>> >       2048 -> 4095       : 84       |                                        |
>> >       4096 -> 8191       : 87       |                                        |
>> >       8192 -> 16383      : 4777     |**************                          |
>> >      16384 -> 32767      : 10572    |*******************************         |
>> >      32768 -> 65535      : 13544    |****************************************|
>> >      65536 -> 131071     : 12723    |*************************************   |
>> >     131072 -> 262143     : 8604     |*************************               |
>> >     262144 -> 524287     : 3659     |**********                              |
>> >     524288 -> 1048575    : 921      |**                                      |
>> >    1048576 -> 2097151    : 122      |                                        |
>> >    2097152 -> 4194303    : 5        |                                        |
>> >
>> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >
>> > After tuning the vm.percpu_pagelist_high_fraction sysctl knob to keep
>> > the pagelist high at a level that effectively mitigated the latency, the
>> > other containers stopped reporting similar complaints. As a result, we
>> > adopted this tuning as a permanent workaround and deployed it across all
>> > clusters of servers where these containers may be deployed.
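>> >
>> > For reference, the per-CPU high is derived roughly along these lines (a
>> > simplified sketch of the zone_highsize() logic in mm/page_alloc.c; the
>> > exact code differs between kernel versions, and the -1 branch reflects
>> > the behavior this patch proposes, not current upstream):
>> >
>> >     /* Simplified sketch, not the exact upstream code. */
>> >     static int sketch_zone_highsize(unsigned long managed_pages,
>> >                                     unsigned long low_wmark,
>> >                                     int fraction, int batch, int nr_cpus)
>> >     {
>> >             unsigned long total_pages;
>> >             int high;
>> >
>> >             if (fraction == -1)
>> >                     /* Proposed here: -1 requests the existing minimum. */
>> >                     return batch * 4;
>> >
>> >             if (!fraction)
>> >                     /* Default: size the pcp lists off the low watermark. */
>> >                     total_pages = low_wmark;
>> >             else
>> >                     /* Sysctl set: a fraction of the zone's managed pages. */
>> >                     total_pages = managed_pages / fraction;
>> >
>> >             /* Split across the CPUs local to the zone. */
>> >             high = total_pages / nr_cpus;
>> >
>> >             /* Existing floor: high never drops below 4 * batch. */
>> >             return high > batch * 4 ? high : batch * 4;
>> >     }
>> >
>> > With low_wmark = 1636587 and 256 CPUs this gives roughly the high of
>> > 6392 seen in the zoneinfo above, and a sufficiently large fraction
>> > drives it down to the 4 * batch floor (252 with batch 63).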
>>
>> Thanks for your detailed data.
>>
>> IIUC, the latency of free_pcppages_bulk() during process exiting
>> shouldn't be a problem?
>
> Right. The problem arises when an exiting process holds the zone->lock
> for too long, delaying other processes that are trying to allocate
> memory.
>
>> Because users care more about the total time of
>> process exiting, that is, throughput.  And I suspect that zone->lock
>> contention will be worse and page allocation/freeing throughput lower
>> with your configuration?
>
> The reduction in throughput is minimal and not a significant concern,
> whereas latency spikes, a crucial metric for assessing system stability,
> matter far more to users. Higher latency can lead to request errors,
> hurting the user experience. Maintaining stability, even at the cost of
> slightly lower throughput, is therefore preferable to higher throughput
> with unstable performance.
>
>>
>> But the latency of free_pcppages_bulk() and page allocation in other
>> processes is a problem.  And your configuration can help it.
>>
>> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
>> you have a normal PCP size (high) but smaller PCP batch.  I guess that
>> may help both latency and throughput in your system.  Could you give it
>> a try?
>
> Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
> configuration option. However, I've looked at your recent improvements to
> the zone->lock path, particularly commit 52166607ecc9 ("mm: restrict the
> pcp batch scale factor to avoid too long latency"), and experimented with
> manually setting pcp->free_factor to zero. That adjustment provided some
> improvement, but the results were not as significant as I had hoped.
>
> BTW, perhaps we should consider adding a sysctl knob as an alternative to
> CONFIG_PCP_BATCH_SCALE_MAX? That would let users adjust it more easily.
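> Something along these lines, for example (a hypothetical sketch only; the
> knob name and wiring are made up for illustration):
>
>     /* Hypothetical: expose the pcp batch scale factor as a sysctl. */
>     static int pcp_batch_scale_max = 5;  /* illustrative default */
>     static int scale_min = 0, scale_cap = 6;  /* mirrors the Kconfig range */
>
>     static struct ctl_table pcp_batch_sysctl[] = {
>             {
>                     .procname     = "percpu_pagelist_batch_scale_max",
>                     .data         = &pcp_batch_scale_max,
>                     .maxlen       = sizeof(int),
>                     .mode         = 0644,
>                     .proc_handler = proc_dointvec_minmax,
>                     .extra1       = &scale_min,
>                     .extra2       = &scale_cap,
>             },
>             { }
>     };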

If you cannot test upstream behavior, it's hard to make changes to
upstream.  Could you find a way to do that?

IIUC, PCP high will not influence allocate/free latency, but PCP batch will.
Your configuration influences PCP batch indirectly, via configuring PCP high.
So, it may be reasonable to find a way to adjust PCP batch directly.
But we need practical requirements and test methods first.
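
To illustrate the distinction: what bounds each zone->lock hold time on the
free path is how many pages are freed per call, which scales with the batch
and is only indirectly capped by high.  A simplified sketch of that
free-side sizing (not the exact upstream code; names shortened for
illustration):

    /* Simplified sketch of the free-side batch sizing (nr_pcp_free()-like). */
    static int sketch_nr_pcp_free(int count, int batch, int high,
                                  int free_factor, int scale_max)
    {
            int min_nr_free = batch;        /* free at least one batch per call */
            int max_nr_free = high - batch; /* leave roughly one batch cached   */
            int nr;

            /* Repeated frees without allocations scale the batch up, bounded
             * by CONFIG_PCP_BATCH_SCALE_MAX (scale_max here). */
            if (free_factor > scale_max)
                    free_factor = scale_max;
            nr = batch << free_factor;

            /* Clamp to [batch, high - batch]: a small high therefore also
             * limits how many pages are freed per zone->lock hold. */
            if (nr < min_nr_free)
                    nr = min_nr_free;
            if (nr > max_nr_free)
                    nr = max_nr_free;

            return nr < count ? nr : count;
    }

With high around 6392 this clamp allows thousands of pages to be freed under
a single zone->lock hold, while high = 4 * batch caps it at 3 * batch = 189
pages, which is consistent with the two latency histograms above.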

[snip]

--
Best Regards,
Huang, Ying

