From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Mon, 29 Jul 2024 11:18:48 +0800
Message-ID: <878qxkyjfr.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <20240729023532.1555-4-laoar.shao@gmail.com> (Yafang Shao's message of "Mon, 29 Jul 2024 10:35:32 +0800")
Hi, Yafang,
Yafang Shao <laoar.shao@gmail.com> writes:
> During my recent work to resolve latency spikes caused by zone->lock
> contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> in practice.
As we discussed before [1], I still find the description of the
zone->lock contention confusing. How about changing the description to
something like the following?
A larger page allocation/freeing batch number may cause a longer run time
of the code holding zone->lock. If zone->lock is heavily contended at the
same time, latency spikes may occur even for casual page allocation/freeing.
Although reducing the batch number cannot make the zone->lock contention
lighter, it can reduce the latency spikes effectively.
[1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> To demonstrate this, I wrote a Python script:
>
> import mmap
>
> size = 6 * 1024**3
>
> while True:
>     mm = mmap.mmap(-1, size)
>     mm[:] = b'\xff' * size
>     mm.close()
>
> Run this script 10 times in parallel and measure the allocation latency
> by tracing the duration of rmqueue_bulk() with the BCC tool
> funclatency[1]:
>
> funclatency -T -i 600 rmqueue_bulk
>
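> For example, the ten parallel instances can be launched with something
> like the following (the alloc_test.py file name is illustrative):
>
>     for i in $(seq 10); do python3 alloc_test.py & done
>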
> Here are the results for both AMD and Intel CPUs.
>
> AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> =====================================================================
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 12 | |
> 1024 -> 2047 : 9116 | |
> 2048 -> 4095 : 2004 | |
> 4096 -> 8191 : 2497 | |
> 8192 -> 16383 : 2127 | |
> 16384 -> 32767 : 2483 | |
> 32768 -> 65535 : 10102 | |
> 65536 -> 131071 : 212730 |******************* |
> 131072 -> 262143 : 314692 |***************************** |
> 262144 -> 524287 : 430058 |****************************************|
> 524288 -> 1048575 : 224032 |******************** |
> 1048576 -> 2097151 : 73567 |****** |
> 2097152 -> 4194303 : 17079 |* |
> 4194304 -> 8388607 : 3900 | |
> 8388608 -> 16777215 : 750 | |
> 16777216 -> 33554431 : 88 | |
> 33554432 -> 67108863 : 2 | |
>
> avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>
> The average allocation latency is 449us, and the maximum latency can
> exceed 30ms.
>
> - Value set to 0
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 92 | |
> 1024 -> 2047 : 8594 | |
> 2048 -> 4095 : 2042818 |****** |
> 4096 -> 8191 : 8737624 |************************** |
> 8192 -> 16383 : 13147872 |****************************************|
> 16384 -> 32767 : 8799951 |************************** |
> 32768 -> 65535 : 2879715 |******** |
> 65536 -> 131071 : 659600 |** |
> 131072 -> 262143 : 204004 | |
> 262144 -> 524287 : 78246 | |
> 524288 -> 1048575 : 30800 | |
> 1048576 -> 2097151 : 12251 | |
> 2097152 -> 4194303 : 2950 | |
> 4194304 -> 8388607 : 78 | |
>
> avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>
> The average latency is reduced significantly to 19us, and the maximum
> latency is reduced to less than 8ms.
>
> - Conclusion
>
> On this AMD CPU, reducing vm.pcp_batch_scale_max significantly reduces
> latency. Latency-sensitive applications will benefit from this tuning.
>
> However, I don't have access to other types of AMD CPUs, so I was unable to
> test it on different AMD models.
>
> Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> ============================================================
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 2419 | |
> 1024 -> 2047 : 34499 |* |
> 2048 -> 4095 : 4272 | |
> 4096 -> 8191 : 9035 | |
> 8192 -> 16383 : 4374 | |
> 16384 -> 32767 : 2963 | |
> 32768 -> 65535 : 6407 | |
> 65536 -> 131071 : 884806 |****************************************|
> 131072 -> 262143 : 145931 |****** |
> 262144 -> 524287 : 13406 | |
> 524288 -> 1048575 : 1874 | |
> 1048576 -> 2097151 : 249 | |
> 2097152 -> 4194303 : 28 | |
>
> avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>
> - Conclusion
>
> This Intel CPU works fine with the default setting.
>
> Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> ==============================================================
>
> Using the cpuset cgroup, we can restrict the test script to run on NUMA
> node 0 only.
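>
> One way to do that, assuming cgroup v2 with the cpuset controller
> available (the cgroup name is illustrative):
>
>     echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>     mkdir /sys/fs/cgroup/test
>     echo 0 > /sys/fs/cgroup/test/cpuset.mems
>     echo $$ > /sys/fs/cgroup/test/cgroup.procs   # then run the test from this shell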
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 46 | |
> 512 -> 1023 : 695 | |
> 1024 -> 2047 : 19950 |* |
> 2048 -> 4095 : 1788 | |
> 4096 -> 8191 : 3392 | |
> 8192 -> 16383 : 2569 | |
> 16384 -> 32767 : 2619 | |
> 32768 -> 65535 : 3809 | |
> 65536 -> 131071 : 616182 |****************************************|
> 131072 -> 262143 : 295587 |******************* |
> 262144 -> 524287 : 75357 |**** |
> 524288 -> 1048575 : 15471 |* |
> 1048576 -> 2097151 : 2939 | |
> 2097152 -> 4194303 : 243 | |
> 4194304 -> 8388607 : 3 | |
>
> avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>
> The zone->lock contention becomes severe when there is only a single NUMA
> node. The average latency is approximately 144us, with the maximum
> latency exceeding 4ms.
>
> - Value set to 0
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 24 | |
> 512 -> 1023 : 2686 | |
> 1024 -> 2047 : 10246 | |
> 2048 -> 4095 : 4061529 |********* |
> 4096 -> 8191 : 16894971 |****************************************|
> 8192 -> 16383 : 6279310 |************** |
> 16384 -> 32767 : 1658240 |*** |
> 32768 -> 65535 : 445760 |* |
> 65536 -> 131071 : 110817 | |
> 131072 -> 262143 : 20279 | |
> 262144 -> 524287 : 4176 | |
> 524288 -> 1048575 : 436 | |
> 1048576 -> 2097151 : 8 | |
> 2097152 -> 4194303 : 2 | |
>
> avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>
> After setting it to 0, the average latency is reduced to around 8us, and
> the maximum latency is less than 4ms.
>
> - Conclusion
>
> On this Intel CPU, the tuning doesn't help much with the worst-case
> latency, which stays in the single-digit millisecond range either way.
> Latency-sensitive applications work well with the default setting.
>
> It is worth noting that all of the above data were collected with an
> upstream kernel.
>
> Why introduce a sysctl knob?
> ============================
>
> From the above data, it's clear that different CPU types show quite
> different allocation latencies under zone->lock contention. Typically,
> people don't release individual kernel packages for each type of x86_64
> CPU.
>
> Furthermore, for latency-insensitive applications, we can keep the default
> setting for better throughput. In our production environment, we set this
> value to 0 for applications running on Kubernetes servers while keeping it
> at the default value of 5 for other applications like big data. It's not
> common to release individual kernel packages for each application.
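>
> With the new knob in place (assuming it is wired up as the patch title
> describes), the value can then be tuned per host through the regular
> sysctl interface, for example:
>
>     sysctl -w vm.pcp_batch_scale_max=0
>
> or made persistent via a drop-in file under /etc/sysctl.d/.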
Thanks for the detailed performance data!

Is there any downside observed when setting CONFIG_PCP_BATCH_SCALE_MAX to
0 in your environment? If not, I suggest using 0 as the default for
CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that a larger
CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After that,
if someone finds that some other workloads need a larger
CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
[snip]
--
Best Regards,
Huang, Ying