From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Mon, 29 Jul 2024 11:18:48 +0800
Message-ID: <878qxkyjfr.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <20240729023532.1555-4-laoar.shao@gmail.com> (Yafang Shao's message of "Mon, 29 Jul 2024 10:35:32 +0800")
Hi, Yafang,
Yafang Shao <laoar.shao@gmail.com> writes:
> During my recent work to resolve latency spikes caused by zone->lock
> contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> in practice.
As we discussed before [1], I still find the description of the
zone->lock contention confusing. How about changing the description to
something like the following?
A larger page allocation/freeing batch number may cause a longer run time
of the code holding zone->lock. If zone->lock is heavily contended at the
same time, latency spikes may occur even for casual page allocation/freeing.
Although reducing the batch number cannot make the zone->lock contention
lighter, it can reduce the latency spikes effectively.
[1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> To demonstrate this, I wrote a Python script:
>
> import mmap
>
> size = 6 * 1024**3
>
> while True:
>     mm = mmap.mmap(-1, size)
>     mm[:] = b'\xff' * size
>     mm.close()
>
> Run this script 10 times in parallel and measure the allocation latency
> by tracing the duration of rmqueue_bulk() with the BCC tool
> funclatency[1]:
>
> funclatency -T -i 600 rmqueue_bulk
>
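> For example, the ten parallel instances can be launched with something
> like the following (the alloc_test.py file name is illustrative):
>
>     for i in $(seq 10); do python3 alloc_test.py & done
>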
> Here are the results for both AMD and Intel CPUs.
>
> AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> =====================================================================
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 12 | |
> 1024 -> 2047 : 9116 | |
> 2048 -> 4095 : 2004 | |
> 4096 -> 8191 : 2497 | |
> 8192 -> 16383 : 2127 | |
> 16384 -> 32767 : 2483 | |
> 32768 -> 65535 : 10102 | |
> 65536 -> 131071 : 212730 |******************* |
> 131072 -> 262143 : 314692 |***************************** |
> 262144 -> 524287 : 430058 |****************************************|
> 524288 -> 1048575 : 224032 |******************** |
> 1048576 -> 2097151 : 73567 |****** |
> 2097152 -> 4194303 : 17079 |* |
> 4194304 -> 8388607 : 3900 | |
> 8388608 -> 16777215 : 750 | |
> 16777216 -> 33554431 : 88 | |
> 33554432 -> 67108863 : 2 | |
>
> avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>
> The average allocation latency is 449us, and the maximum latency can
> exceed 30ms.
>
> - Value set to 0
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 92 | |
> 1024 -> 2047 : 8594 | |
> 2048 -> 4095 : 2042818 |****** |
> 4096 -> 8191 : 8737624 |************************** |
> 8192 -> 16383 : 13147872 |****************************************|
> 16384 -> 32767 : 8799951 |************************** |
> 32768 -> 65535 : 2879715 |******** |
> 65536 -> 131071 : 659600 |** |
> 131072 -> 262143 : 204004 | |
> 262144 -> 524287 : 78246 | |
> 524288 -> 1048575 : 30800 | |
> 1048576 -> 2097151 : 12251 | |
> 2097152 -> 4194303 : 2950 | |
> 4194304 -> 8388607 : 78 | |
>
> avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>
> The average latency is reduced significantly to 19us, and the maximum
> latency is reduced to less than 8ms.
>
> - Conclusion
>
> On this AMD CPU, reducing vm.pcp_batch_scale_max significantly reduces
> latency. Latency-sensitive applications will benefit from this tuning.
>
> However, I don't have access to other types of AMD CPUs, so I was unable to
> test it on different AMD models.
>
> Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> ============================================================
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 2419 | |
> 1024 -> 2047 : 34499 |* |
> 2048 -> 4095 : 4272 | |
> 4096 -> 8191 : 9035 | |
> 8192 -> 16383 : 4374 | |
> 16384 -> 32767 : 2963 | |
> 32768 -> 65535 : 6407 | |
> 65536 -> 131071 : 884806 |****************************************|
> 131072 -> 262143 : 145931 |****** |
> 262144 -> 524287 : 13406 | |
> 524288 -> 1048575 : 1874 | |
> 1048576 -> 2097151 : 249 | |
> 2097152 -> 4194303 : 28 | |
>
> avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>
> - Conclusion
>
> This Intel CPU works fine with the default setting.
>
> Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> ==============================================================
>
> Using the cpuset cgroup, we can restrict the test script to run on NUMA
> node 0 only.
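>
> One way to do that, assuming cgroup v2 with the cpuset controller
> available (the cgroup name is illustrative):
>
>     echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>     mkdir /sys/fs/cgroup/test
>     echo 0 > /sys/fs/cgroup/test/cpuset.mems
>     echo $$ > /sys/fs/cgroup/test/cgroup.procs   # then run the test from this shell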
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 46 | |
> 512 -> 1023 : 695 | |
> 1024 -> 2047 : 19950 |* |
> 2048 -> 4095 : 1788 | |
> 4096 -> 8191 : 3392 | |
> 8192 -> 16383 : 2569 | |
> 16384 -> 32767 : 2619 | |
> 32768 -> 65535 : 3809 | |
> 65536 -> 131071 : 616182 |****************************************|
> 131072 -> 262143 : 295587 |******************* |
> 262144 -> 524287 : 75357 |**** |
> 524288 -> 1048575 : 15471 |* |
> 1048576 -> 2097151 : 2939 | |
> 2097152 -> 4194303 : 243 | |
> 4194304 -> 8388607 : 3 | |
>
> avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>
> The zone->lock contention becomes severe when there is only a single NUMA
> node. The average latency is approximately 144us, with the maximum
> latency exceeding 4ms.
>
> - Value set to 0
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 24 | |
> 512 -> 1023 : 2686 | |
> 1024 -> 2047 : 10246 | |
> 2048 -> 4095 : 4061529 |********* |
> 4096 -> 8191 : 16894971 |****************************************|
> 8192 -> 16383 : 6279310 |************** |
> 16384 -> 32767 : 1658240 |*** |
> 32768 -> 65535 : 445760 |* |
> 65536 -> 131071 : 110817 | |
> 131072 -> 262143 : 20279 | |
> 262144 -> 524287 : 4176 | |
> 524288 -> 1048575 : 436 | |
> 1048576 -> 2097151 : 8 | |
> 2097152 -> 4194303 : 2 | |
>
> avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>
> After setting it to 0, the average latency is reduced to around 8us, and
> the maximum latency is less than 4ms.
>
> - Conclusion
>
> On this Intel CPU, the tuning doesn't help much with the worst-case
> latency, which stays in the single-digit millisecond range either way.
> Latency-sensitive applications work well with the default setting.
>
> It is worth noting that all of the above data were collected with an
> upstream kernel.
>
> Why introduce a sysctl knob?
> ============================
>
> From the above data, it's clear that different CPU types show quite
> different allocation latencies under zone->lock contention. Typically,
> people don't release individual kernel packages for each type of x86_64
> CPU.
>
> Furthermore, for latency-insensitive applications, we can keep the default
> setting for better throughput. In our production environment, we set this
> value to 0 for applications running on Kubernetes servers while keeping it
> at the default value of 5 for other applications like big data. It's not
> common to release individual kernel packages for each application.
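>
> With the new knob in place (assuming it is wired up as the patch title
> describes), the value can then be tuned per host through the regular
> sysctl interface, for example:
>
>     sysctl -w vm.pcp_batch_scale_max=0
>
> or made persistent via a drop-in file under /etc/sysctl.d/.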
Thanks for the detailed performance data!

Is there any downside observed when setting CONFIG_PCP_BATCH_SCALE_MAX to
0 in your environment? If not, I suggest using 0 as the default for
CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that a larger
CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After that,
if someone finds that some other workloads need a larger
CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
[snip]
--
Best Regards,
Huang, Ying