[PATCH v2 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max

From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org
Cc: ying.huang@intel.com, mgorman@techsingularity.net,
	linux-mm@kvack.org, Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH v2 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Mon, 29 Jul 2024 10:35:29 +0800
Message-ID: <20240729023532.1555-1-laoar.shao@gmail.com>

Background
==========

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. These
processes are organized as separate processes rather than threads due
to the Python Global Interpreter Lock (GIL) being a bottleneck in a
multi-threaded setup. Upon the exit of these containers, other
containers hosted on the same machine experience significant latency
spikes.

Investigation
=============

During my investigation of this issue, I found that the latency spikes were
caused by zone->lock contention, which can be illustrated as follows:

     CPU A (Freer)                CPU B (Allocator)
  lock zone->lock
  free pages                      lock zone->lock
  unlock zone->lock               
                                  alloc pages
                                  unlock zone->lock

If the Freer holds the zone->lock for an extended period, the Allocator
has to wait, and latency spikes occur.
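
As a rough user-space analogy of the interleaving above (not kernel code,
and not the reproducer from patch #3), the sketch below lets one thread hold
a lock for a configurable time while a second thread measures how long it
waits for it. The name allocator_wait and the sleep durations are
illustrative assumptions only.

```python
import threading
import time


def allocator_wait(hold_seconds):
    """Return how long an 'allocator' thread waits for a lock that a
    'freer' thread holds for hold_seconds. Pure user-space analogy."""
    zone_lock = threading.Lock()      # stands in for zone->lock
    lock_taken = threading.Event()

    def freer():
        with zone_lock:
            lock_taken.set()          # tell the allocator the lock is held
            time.sleep(hold_seconds)  # stands in for freeing a batch of pages

    t = threading.Thread(target=freer)
    t.start()
    lock_taken.wait()                 # make sure the freer got the lock first

    start = time.monotonic()
    with zone_lock:                   # the allocator blocks here
        waited = time.monotonic() - start
    t.join()
    return waited


short_wait = allocator_wait(0.005)
long_wait = allocator_wait(0.2)
print(f"short hold: waited {short_wait * 1000:.1f} ms")
print(f"long hold:  waited {long_wait * 1000:.1f} ms")
```

The allocator's wait tracks the freer's hold time, which is the effect the
diagram describes.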

I also wrote a Python script to reproduce it on my test servers. See the
details in patch #3. It is worth noting that the reproduction is based on
the upstream kernel.

Experimenting
=============

The more pages freed in one batch, the longer the zone->lock is held. So my
attempt involved reducing the batch size. After I restricted the batch to
the smallest size, there were no more complaints about latency spikes.
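
To put rough numbers on the batch-size trade-off, here is a small worked
calculation. It is only a sketch under stated assumptions: 4 KiB pages, a
pcp->batch of 63, a scale factor that bounds one free batch at
batch << scale, and scale values 0 through 6 chosen for illustration. The
6GB figure comes from the Background section; none of the other numbers are
taken from this series.

```python
import math

PAGE_SIZE = 4096          # assumption: 4 KiB base pages
PCP_BATCH = 63            # assumption: a typical pcp->batch value
RSS_BYTES = 6 * 1024**3   # ~6GB of RSS per process, from the Background section


def pages_per_lock_hold(scale, batch=PCP_BATCH):
    """Upper bound on pages freed under one lock hold (assumed: batch << scale)."""
    return batch << scale


def lock_acquisitions(total_pages, scale, batch=PCP_BATCH):
    """Minimum number of lock holds needed to free total_pages."""
    return math.ceil(total_pages / pages_per_lock_hold(scale, batch))


total_pages = RSS_BYTES // PAGE_SIZE
for scale in range(0, 7):  # illustrative range of scale factors
    print(scale, pages_per_lock_hold(scale), lock_acquisitions(total_pages, scale))
```

Under these assumptions, a smaller scale means many more, but much shorter,
lock holds for the same amount of freed memory, which gives a waiting
allocator more chances to take the lock.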

However, during my experiments, I found that CONFIG_PCP_BATCH_SCALE_MAX is
hard to use in practice, so I try to improve it in this series.

The Proposal
============

This series encompasses two minor refinements to the PCP high watermark
auto-tuning mechanism, along with the introduction of a new sysctl knob
that serves as a more practical alternative to the previous configuration
method.
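
For illustration, a minimal sketch of how the proposed knob could be
inspected from user space, assuming this series is applied and the sysctl
appears at the usual procfs location for its dotted name. The helper names
sysctl_path and read_sysctl are hypothetical. On a kernel without the knob,
the read simply reports that it is absent.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its procfs path."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the sysctl value as a stripped string, or None if the
    knob is absent or unreadable."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


knob = "vm.pcp_batch_scale_max"   # name proposed by this series
value = read_sysctl(knob)
if value is None:
    print(f"{knob} is not available on this kernel")
else:
    print(f"{knob} = {value}")
```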

Future work
===========

To ultimately mitigate the zone->lock contention issue, several suggestions
have been proposed. One approach involves dividing large zones into multiple
smaller zones, as suggested by Matthew[0], while another entails splitting
the zone->lock using a mechanism similar to memory arenas and shifting away
from relying solely on zone_id to identify the range of free lists a
particular page belongs to, as suggested by Mel[1]. However, implementing
these solutions is likely to necessitate a more extended development
effort.

Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [1]

Changes:
- v1 -> v2: Commit log refinement

- v1: mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  https://lwn.net/Articles/981069/

- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the
  minimum pagelist
  https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/

Yafang Shao (3):
  mm/page_alloc: A minor fix to the calculation of pcp->free_count
  mm/page_alloc: Avoid changing pcp->high decaying when adjusting
    CONFIG_PCP_BATCH_SCALE_MAX
  mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++
 mm/Kconfig                              | 11 -------
 mm/page_alloc.c                         | 40 ++++++++++++++++++-------
 3 files changed, 47 insertions(+), 21 deletions(-)

-- 
2.43.5

Thread overview: 14+ messages
2024-07-29  2:35 Yafang Shao [this message]
2024-07-29  2:35 ` [PATCH v2 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
2024-07-29  2:35 ` [PATCH v2 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
2024-07-29  2:35 ` [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-29  3:18   ` Huang, Ying
2024-07-29  3:40     ` Yafang Shao
2024-07-29  5:12       ` Huang, Ying
2024-07-29  5:45         ` Yafang Shao
2024-07-29  5:50           ` Huang, Ying
2024-07-29  6:00             ` Yafang Shao
2024-07-29  6:00               ` Huang, Ying
2024-07-29  6:13                 ` Yafang Shao
2024-07-29  6:14                   ` Huang, Ying
2024-07-29  7:50                     ` Yafang Shao
