From: Yafang Shao <laoar.shao@gmail.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
linux-mm@kvack.org
Subject: Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Thu, 11 Jul 2024 10:25:06 +0800
Message-ID: <CALOAHbDmLx3Ky6h9kFS_p8A6o-mR8Z46Jnr3d=nOEycJX0SqCg@mail.gmail.com>
In-Reply-To: <874j8yar3z.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > Background
> > ==========
> >
> > In our containerized environment, we have a specific type of container
> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> > processes are organized as separate processes rather than threads due
> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> > multi-threaded setup. Upon the exit of these containers, other
> > containers hosted on the same machine experience significant latency
> > spikes.
> >
> > Investigation
> > =============
> >
> > My investigation using perf tracing revealed that the root cause of
> > these spikes is the simultaneous execution of exit_mmap() by each of
> > the exiting processes. This concurrent access to the zone->lock
> > results in contention, which becomes a hotspot and negatively impacts
> > performance. The perf results clearly indicate this contention as a
> > primary contributor to the observed latency issues.
> >
> > +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> > -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >    - 76.97% exit_mmap
> >       - 58.58% unmap_vmas
> >          - 58.55% unmap_single_vma
> >             - unmap_page_range
> >                - 58.32% zap_pte_range
> >                   - 42.88% tlb_flush_mmu
> >                      - 42.76% free_pages_and_swap_cache
> >                         - 41.22% release_pages
> >                            - 33.29% free_unref_page_list
> >                               - 32.37% free_unref_page_commit
> >                                  - 31.64% free_pcppages_bulk
> >                                     + 28.65% _raw_spin_lock
> >                                       1.28% __list_del_entry_valid
> >                            + 3.25% folio_lruvec_lock_irqsave
> >                            + 0.75% __mem_cgroup_uncharge_list
> >                              0.60% __mod_lruvec_state
> >                           1.07% free_swap_cache
> >                   + 11.69% page_remove_rmap
> >                     0.64% __mod_lruvec_page_state
> >       - 17.34% remove_vma
> >          - 17.25% vm_area_free
> >             - 17.23% kmem_cache_free
> >                - 17.15% __slab_free
> >                   - 14.56% discard_slab
> >                        free_slab
> >                        __free_slab
> >                        __free_pages
> >                      - free_unref_page
> >                         - 13.50% free_unref_page_commit
> >                            - free_pcppages_bulk
> >                               + 13.44% _raw_spin_lock
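> >
> > The contention follows from how the pcp lists are drained:
> > free_pcppages_bulk() returns the whole batch to the buddy allocator
> > under a single zone->lock hold. A simplified sketch of the 6.1 logic
> > (paraphrased; list rotation and empty-list handling omitted):
> >
> >   static void free_pcppages_bulk(struct zone *zone, int count,
> >                                  struct per_cpu_pages *pcp, int pindex)
> >   {
> >           unsigned long flags;
> >
> >           /* one lock hold covers the entire batch */
> >           spin_lock_irqsave(&zone->lock, flags);
> >
> >           while (count > 0) {
> >                   struct list_head *list = &pcp->lists[pindex];
> >                   struct page *page = list_last_entry(list, struct page,
> >                                                       pcp_list);
> >                   int order = pindex_to_order(pindex);
> >
> >                   list_del(&page->pcp_list);
> >                   count -= 1 << order;
> >                   pcp->count -= 1 << order;
> >
> >                   /* merge the page back into the buddy free lists */
> >                   __free_one_page(page, page_to_pfn(page), zone, order,
> >                                   get_pcppage_migratetype(page), FPI_NONE);
> >           }
> >
> >           spin_unlock_irqrestore(&zone->lock, flags);
> >   }
> >
> > With many processes exiting at once, every CPU's drain contends for
> > the same zone->lock, so large batches translate directly into long
> > waits.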
>
> I don't think your change will reduce zone->lock contention cycles, so
> I don't see much value in the above data.
>
> > By enabling the mm_page_pcpu_drain tracepoint, we can identify the
> > pages being drained; the majority are regular order-0 user pages.
> >
> > <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> > <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> > => free_pcppages_bulk
> > => free_unref_page_commit
> > => free_unref_page_list
> > => release_pages
> > => free_pages_and_swap_cache
> > => tlb_flush_mmu
> > => zap_pte_range
> > => unmap_page_range
> > => unmap_single_vma
> > => unmap_vmas
> > => exit_mmap
> > => mmput
> > => do_exit
> > => do_group_exit
> > => get_signal
> > => arch_do_signal_or_restart
> > => exit_to_user_mode_prepare
> > => syscall_exit_to_user_mode
> > => do_syscall_64
> > => entry_SYSCALL_64_after_hwframe
> >
> > The servers experiencing these issues have 256 CPUs and 1TB of
> > memory, all within a single NUMA node. The zoneinfo is as follows:
> >
> > Node 0, zone   Normal
> >   pages free     144465775
> >         boost    0
> >         min      1309270
> >         low      1636587
> >         high     1963904
> >         spanned  564133888
> >         present  296747008
> >         managed  291974346
> >         cma      0
> >         protection: (0, 0, 0, 0)
> > ...
> >   pagesets
> >     cpu: 0
> >               count: 2217
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 1
> >               count: 4510
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 2
> >               count: 3059
> >               high:  6392
> >               batch: 63
> >
> > ...
> >
> > The pcp high is around 100 times the batch size (6392 / 63 ≈ 101).
> >
> > I also traced the latency associated with the free_pcppages_bulk()
> > function during the container exit process:
> >
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | |
> > 256 -> 511 : 148 |***************** |
> > 512 -> 1023 : 334 |****************************************|
> > 1024 -> 2047 : 33 |*** |
> > 2048 -> 4095 : 5 | |
> > 4096 -> 8191 : 7 | |
> > 8192 -> 16383 : 12 |* |
> > 16384 -> 32767 : 30 |*** |
> > 32768 -> 65535 : 21 |** |
> > 65536 -> 131071 : 15 |* |
> > 131072 -> 262143 : 27 |*** |
> > 262144 -> 524287 : 84 |********** |
> > 524288 -> 1048575 : 203 |************************ |
> > 1048576 -> 2097151 : 284 |********************************** |
> > 2097152 -> 4194303 : 327 |*************************************** |
> > 4194304 -> 8388607 : 215 |************************* |
> > 8388608 -> 16777215 : 116 |************* |
> > 16777216 -> 33554431 : 47 |***** |
> > 33554432 -> 67108863 : 8 | |
> > 67108864 -> 134217727 : 3 | |
> >
> > The latency can reach tens of milliseconds. That is consistent with
> > the pcp configuration above: a single free_pcppages_bulk() call may
> > free up to high - batch (roughly 6300) pages while holding zone->lock.
> >
> > Experimenting
> > =============
> >
> > vm.percpu_pagelist_high_fraction
> > --------------------------------
> >
> > Our production environment currently runs the stable 6.1.y kernel,
> > so my initial strategy was to tune the
>
> IMHO, we should focus on upstream activity in the cover letter and
> patch description. And I don't think it's necessary to describe the
> alternative solution in so much detail.
>
> > vm.percpu_pagelist_high_fraction parameter. Increasing it lowers the
> > pcp high watermark, and with it the number of pages freed per drain,
> > which shortens each zone->lock hold. After setting the sysctl to
> > 0x7fffffff, I observed a notable improvement in latency:
> >
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 120 | |
> > 256 -> 511 : 365 |* |
> > 512 -> 1023 : 201 | |
> > 1024 -> 2047 : 103 | |
> > 2048 -> 4095 : 84 | |
> > 4096 -> 8191 : 87 | |
> > 8192 -> 16383 : 4777 |************** |
> > 16384 -> 32767 : 10572 |******************************* |
> > 32768 -> 65535 : 13544 |****************************************|
> > 65536 -> 131071 : 12723 |************************************* |
> > 131072 -> 262143 : 8604 |************************* |
> > 262144 -> 524287 : 3659 |********** |
> > 524288 -> 1048575 : 921 |** |
> > 1048576 -> 2097151 : 122 | |
> > 2097152 -> 4194303 : 5 | |
> >
> > However, raising vm.percpu_pagelist_high_fraction also shrinks the
> > pcp high watermark, down to a floor of four times the batch size.
> > While this could in principle hurt throughput, as Ying highlighted[0],
> > we have yet to observe any significant throughput difference in our
> > production environment after this change.
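> >
> > For reference, this is roughly how 6.1 derives the pcp high watermark
> > from this sysctl (a simplified paraphrase of zone_highsize() in
> > mm/page_alloc.c; hotplug details omitted):
> >
> >   static int zone_highsize(struct zone *zone, int batch, int cpu_online)
> >   {
> >           unsigned long total_pages;
> >           int nr_split_cpus, high;
> >
> >           if (!percpu_pagelist_high_fraction)
> >                   /* default: based on the zone's low watermark */
> >                   total_pages = low_wmark_pages(zone);
> >           else
> >                   /* sysctl set: a fraction of the zone's managed pages */
> >                   total_pages = zone_managed_pages(zone) /
> >                                 percpu_pagelist_high_fraction;
> >
> >           /* split the budget across the CPUs local to this zone */
> >           nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) +
> >                           cpu_online;
> >           high = total_pages / nr_split_cpus;
> >
> >           /* never drop below four times the batch size */
> >           return max(high, batch << 2);
> >   }
> >
> > With 256 CPUs and batch = 63, setting the fraction to 0x7fffffff
> > drives total_pages toward zero, so high bottoms out at 63 << 2 = 252
> > pages (versus the 6392 above), and each drain under zone->lock
> > becomes correspondingly small.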
> >
> > Backporting the series "mm: PCP high auto-tuning"
> > -------------------------------------------------
>
> Again, not upstream activity. We can describe the upstream behavior
> directly.
Andrew asked me to provide a more comprehensive analysis of this
issue, so I have tried to lay out all the pertinent details here.
>
> > My second attempt was to backport the nine-patch series
> > "mm: PCP high auto-tuning"[1] to our 6.1.y stable kernel. After
> > deploying it in our production environment, I observed a pronounced
> > reduction in latency:
> >
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | |
> > 256 -> 511 : 0 | |
> > 512 -> 1023 : 0 | |
> > 1024 -> 2047 : 2 | |
> > 2048 -> 4095 : 11 | |
> > 4096 -> 8191 : 3 | |
> > 8192 -> 16383 : 1 | |
> > 16384 -> 32767 : 2 | |
> > 32768 -> 65535 : 7 | |
> > 65536 -> 131071 : 198 |********* |
> > 131072 -> 262143 : 530 |************************ |
> > 262144 -> 524287 : 824 |************************************** |
> > 524288 -> 1048575 : 852 |****************************************|
> > 1048576 -> 2097151 : 714 |********************************* |
> > 2097152 -> 4194303 : 389 |****************** |
> > 4194304 -> 8388607 : 143 |****** |
> > 8388608 -> 16777215 : 29 |* |
> > 16777216 -> 33554431 : 1 | |
> >
> > Compared to the previous data, the maximum latency has been reduced to
> > less than 30ms.
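> >
> > The main reason, as I understand the series, is that it bounds how
> > many pages are freed per lock hold: drains are chunked to at most
> > batch << CONFIG_PCP_BATCH_SCALE_MAX pages (63 << 5 = 2016 with the
> > default of 5), with the lock released between chunks. A simplified
> > sketch, paraphrasing drain_pages_zone() after the series; this is the
> > compile-time constant that patch 3 proposes to expose as the
> > vm.pcp_batch_scale_max sysctl:
> >
> >   static void drain_pages_zone(struct zone *zone, struct per_cpu_pages *pcp)
> >   {
> >           int count;
> >
> >           do {
> >                   spin_lock(&pcp->lock);
> >                   count = pcp->count;
> >                   if (count) {
> >                           /* bound each chunk so locks are dropped
> >                            * periodically and other CPUs can progress */
> >                           int to_drain = min(count,
> >                                   pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> >
> >                           free_pcppages_bulk(zone, to_drain, pcp, 0);
> >                           count -= to_drain;
> >                   }
> >                   spin_unlock(&pcp->lock);
> >           } while (count);
> >   }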
>
> People don't care much about page-freeing latency while processes are
> exiting; they care more about the total process exit time, that is,
> throughput. So, it would be better to show the page allocation latency
> that is affected by the simultaneously exiting processes.
I'm confused as well. Is this issue really that hard to understand?
--
Regards
Yafang