From: Yafang Shao <laoar.shao@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
David Rientjes <rientjes@google.com>,
"Huang, Ying" <ying.huang@intel.com>,
Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
Date: Tue, 2 Jul 2024 14:37:38 +0800 [thread overview]
Message-ID: <CALOAHbBZBq=wNGw2N_K9zMp0OW=x2HmOBCVg8c06+zwHiW=H8A@mail.gmail.com> (raw)
In-Reply-To: <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > Currently, we're encountering latency spikes in our container environment
> > when a specific container with multiple Python-based tasks exits. These
> > tasks may hold the zone->lock for an extended period, significantly
> > impacting latency for other containers attempting to allocate memory.
>
> Is this locking issue well understood? Is anyone working on it? A
> reasonably detailed description of the issue and a description of any
> ongoing work would be helpful here.
In our containerized environment, we have one type of container that runs
18 processes, each consuming roughly 6GB of RSS. The workload uses separate
processes rather than threads because the Python Global Interpreter Lock
(GIL) would be a bottleneck in a multi-threaded setup. When these containers
exit, other containers hosted on the same machine experience significant
latency spikes.
Our perf tracing shows that the root cause of these spikes is the exiting
processes running exit_mmap() concurrently. They all contend on zone->lock
while returning pages to the buddy allocator, and that contention becomes
the dominant hotspot. The perf call graph below shows this clearly (a
simplified sketch of the drain path follows the stack trace further down):
  +   77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
  -   76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
     - 76.97% exit_mmap
        - 58.58% unmap_vmas
           - 58.55% unmap_single_vma
              - unmap_page_range
                 - 58.32% zap_pte_range
                    - 42.88% tlb_flush_mmu
                       - 42.76% free_pages_and_swap_cache
                          - 41.22% release_pages
                             - 33.29% free_unref_page_list
                                - 32.37% free_unref_page_commit
                                   - 31.64% free_pcppages_bulk
                                      + 28.65% _raw_spin_lock
                                        1.28% __list_del_entry_valid
                             + 3.25% folio_lruvec_lock_irqsave
                             + 0.75% __mem_cgroup_uncharge_list
                               0.60% __mod_lruvec_state
                            1.07% free_swap_cache
                    + 11.69% page_remove_rmap
                      0.64% __mod_lruvec_page_state
        - 17.34% remove_vma
           - 17.25% vm_area_free
              - 17.23% kmem_cache_free
                 - 17.15% __slab_free
                    - 14.56% discard_slab
                         free_slab
                            __free_slab
                               __free_pages
                             - free_unref_page
                                - 13.50% free_unref_page_commit
                                   - free_pcppages_bulk
                                      + 13.44% _raw_spin_lock
By enabling the mm_page_pcpu_drain tracepoint, we can capture the detailed call stack:
    <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
<...>-1540432 [224] d..3. 618048.023887: <stack trace>
=> free_pcppages_bulk
=> free_unref_page_commit
=> free_unref_page_list
=> release_pages
=> free_pages_and_swap_cache
=> tlb_flush_mmu
=> zap_pte_range
=> unmap_page_range
=> unmap_single_vma
=> unmap_vmas
=> exit_mmap
=> mmput
=> do_exit
=> do_group_exit
=> get_signal
=> arch_do_signal_or_restart
=> exit_to_user_mode_prepare
=> syscall_exit_to_user_mode
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
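To spell out why a large per-CPU list makes this worse: free_pcppages_bulk()
hands every drained page back to the buddy allocator under a single
zone->lock hold, so the hold time grows with the number of pages sitting on
the per-CPU list, which is bounded by pcp->high. Roughly (a heavily
simplified sketch, not the exact upstream code; pick_pcp_page() is a made-up
placeholder for the real walk over pcp->lists[]):

    /*
     * Simplified sketch of mm/page_alloc.c:free_pcppages_bulk().
     * pick_pcp_page() is hypothetical; details are elided.
     */
    static void free_pcppages_bulk(struct zone *zone, int count,
                                   struct per_cpu_pages *pcp, int pindex)
    {
            unsigned long flags;

            /* One lock acquisition covers the whole batch of 'count' pages. */
            spin_lock_irqsave(&zone->lock, flags);
            while (count > 0) {
                    struct page *page = pick_pcp_page(pcp, &pindex);
                    int mt = get_pageblock_migratetype(page);

                    /* Merge the page back into the buddy free lists. */
                    __free_one_page(page, page_to_pfn(page), zone,
                                    pindex_to_order(pindex), mt, FPI_NONE);
                    count--;
            }
            /* Hold time therefore scales with 'count', bounded by pcp->high. */
            spin_unlock_irqrestore(&zone->lock, flags);
    }

With many CPUs draining their large per-CPU lists back to back at container
exit, those hold times stack up, and that is what the other containers see
as latency spikes.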
The servers experiencing these issues have 256 CPUs and 1TB of memory, all
in a single NUMA node. The relevant zoneinfo is as follows:
Node 0, zone   Normal
  pages free     144465775
        boost    0
        min      1309270
        low      1636587
        high     1963904
        spanned  564133888
        present  296747008
        managed  291974346
        cma      0
        protection: (0, 0, 0, 0)
  ...
  ...
  pagesets
    cpu: 0
              count: 2217
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 1
              count: 4510
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 2
              count: 3059
              high:  6392
              batch: 63
  ...
The per-CPU 'high' is around 100 times the batch size.
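That ratio is not a coincidence: if I am reading zone_highsize() correctly,
the default high is the zone's low watermark split across the CPUs local to
the zone, which on this machine works out to:

    1636587 (low watermark) / 256 (CPUs) ~= 6392 pages per CPU
    6392 / 63 (batch)                    ~= 101 * batch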
We also traced the latency associated with the free_pcppages_bulk()
function during the container exit process:
19:48:54
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 148 |***************** |
512 -> 1023 : 334 |****************************************|
1024 -> 2047 : 33 |*** |
2048 -> 4095 : 5 | |
4096 -> 8191 : 7 | |
8192 -> 16383 : 12 |* |
16384 -> 32767 : 30 |*** |
32768 -> 65535 : 21 |** |
65536 -> 131071 : 15 |* |
131072 -> 262143 : 27 |*** |
262144 -> 524287 : 84 |********** |
524288 -> 1048575 : 203 |************************ |
1048576 -> 2097151 : 284 |********************************** |
2097152 -> 4194303 : 327 |*************************************** |
4194304 -> 8388607 : 215 |************************* |
8388608 -> 16777215 : 116 |************* |
16777216 -> 33554431 : 47 |***** |
33554432 -> 67108863 : 8 | |
67108864 -> 134217727 : 3 | |
avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
The latency can reach tens of milliseconds.
By adjusting the vm.percpu_pagelist_high_fraction parameter so that the
pagelist high is clamped to its minimum of 4 times the batch size, we were
able to significantly reduce the latency of free_pcppages_bulk() during
container exits:
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 120 | |
256 -> 511 : 365 |* |
512 -> 1023 : 201 | |
1024 -> 2047 : 103 | |
2048 -> 4095 : 84 | |
4096 -> 8191 : 87 | |
8192 -> 16383 : 4777 |************** |
16384 -> 32767 : 10572 |******************************* |
32768 -> 65535 : 13544 |****************************************|
65536 -> 131071 : 12723 |************************************* |
131072 -> 262143 : 8604 |************************* |
262144 -> 524287 : 3659 |********** |
524288 -> 1048575 : 921 |** |
1048576 -> 2097151 : 122 | |
2097152 -> 4194303 : 5 | |
avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
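Comparing the two measurements directly: the mean per-call latency drops by
roughly 30x, while the total time spent in free_pcppages_bulk() stays about
the same, just spread across many more, much smaller drains:

    before: avg 3066311 ns (~3.1 ms),   1920 calls, total ~5.89 s
    after:  avg  103814 ns (~0.10 ms), 55925 calls, total ~5.81 s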
After tuning the vm.percpu_pagelist_high_fraction sysctl knob so that the
pagelist high sits at its minimum, the latency issues were effectively
mitigated and other containers stopped reporting similar complaints. We have
therefore adopted this tuning as a permanent workaround and rolled it out to
every cluster that may host these containers.
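For context on why the '-1' shortcut in this patch is convenient: the floor
we want is only 4 * batch = 4 * 63 = 252 pages per CPU, roughly 25 times
smaller than the default 6392. Reaching that floor through the existing
fraction interface needs an awkwardly large, machine-dependent value; the
figure below is derived from the zoneinfo above purely for illustration and
is not necessarily the exact value we deployed:

    fraction >= managed / (nr_cpus * 4 * batch)
              = 291974346 / (256 * 252)
             ~= 4526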
>
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -856,6 +856,10 @@ on per-cpu page lists. This entry only changes the value of hot per-cpu
> > page lists. A user can specify a number like 100 to allocate 1/100th of
> > each zone between per-cpu lists.
> >
> > +The minimum number of pages that can be stored in per-CPU page lists is
> > +four times the batch value. By writing '-1' to this sysctl, you can set
> > +this minimum value.
>
> I suggest we also describe why an operator would want to set this, and
> the expected effects of that action.
Will improve it.
>
> > The batch value of each per-cpu page list remains the same regardless of
> > the value of the high fraction so allocation latencies are unaffected.
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e22ce5675ca..e7313f9d704b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5486,6 +5486,10 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online,
> > int nr_split_cpus;
> > unsigned long total_pages;
> >
> > + /* Setting -1 to set the minimum pagelist size, four times the batch size */
>
> Some old-timers still use 80-column xterms ;)
Will change it.
Regards
Yafang