[RFC 0/1] mm/mempolicy: weighted interleave system default weights
From: Gregory Price @ 2024-02-20 20:25 UTC
  To: linux-mm
  Cc: linux-kernel, ying.huang, hannes, dan.j.williams, dave.jiang,
	Gregory Price

Weighted interleave added a sysfs interface that lets users set the
interleave weights, with a default value of `1` until reasonable
system default code could be agreed upon.

This RFC series will suggest and solicit ideas for how to generate
these system defaults, and lay out some challenges in generating them.

Future work on the CXL driver (drivers/cxl) will introduce additional
code which registers HMAT information for hotplug memory provided
by CXL devices. This RFC does not presently provide that integration,
but will once the CXL work is upstream.


Interfaces introduced:
- mempolicy_set_node_perf
  Called when HMAT data for a node is reported to the system

Integration points:
- node_set_perf_attrs - for reporting bandwidth info to mempolicy
- get_il_weight and weighted interleave allocation interfaces to
  provide system defaults when applying weighted interleave.

New data in mempolicy:
- node_bw_table - cached bandwidth information about each node
- default_iw_table - the system default interleave weights


Note that because there are now multiple tables (default and sysfs),
the allocators fetch each weight individually, rather than via memcpy.
This means if weights change at runtime (extremely unlikely), the
allocators may temporarily see an "incorrect distribution" while the
system is being reweighted. This is not harmful (simply inaccurate)
and is a consequence of providing a clean way to revert to the
system default.
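
As a rough illustration of the per-weight lookup described above, here
is a minimal userspace sketch. It assumes a sysfs weight of 0 means
"not set, use the system default"; that convention, the table name
sysfs_iw_table, the helper lookup_il_weight, and MAX_NODES are all
hypothetical and not taken from the patch. Locking/RCU is omitted.

```c
#include <stdint.h>

#define MAX_NODES 8	/* illustrative only; not the kernel's node limit */

/* Hypothetical user-set table: 0 is assumed to mean "not set". */
static uint8_t sysfs_iw_table[MAX_NODES];
/* System default weights, named after the table in this RFC. */
static uint8_t default_iw_table[MAX_NODES];

/*
 * Fetch one node's weight: prefer the user's value, fall back to the
 * system default, and never return less than 1.
 */
static uint8_t lookup_il_weight(unsigned int node)
{
	uint8_t weight;

	if (node >= MAX_NODES)
		return 1;

	weight = sysfs_iw_table[node];
	if (!weight)
		weight = default_iw_table[node];

	return weight ? weight : 1;
}
```

Because each weight is read on its own, a caller walking several nodes
while the tables are being rewritten can see a mix of old and new
values, which is the temporary inaccuracy mentioned above.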


v1: Simple GCD reduction of basic bandwidth distribution.

Approach:
- whenever new coordinates are reported, recalculate all weights
- cache each node's min(read, write) bandwidth
- calculate the percentage each node's bandwidth is of the whole
- use GCD to reduce all percentages down to the minimum possible
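
The steps above can be sketched in plain userspace C as follows. This
is only an approximation of the idea, not the code in the patch: the
struct node_bw layout, the helper names, and the choice to clamp a 0%
share up to 1 are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node record; units only need to be consistent. */
struct node_bw {
	uint64_t read_bw;
	uint64_t write_bw;
};

static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* The value cached per node: min(read, write) bandwidth. */
static uint64_t node_min_bw(const struct node_bw *bw)
{
	return bw->read_bw < bw->write_bw ? bw->read_bw : bw->write_bw;
}

/*
 * Compute each node's floored percentage of the total bandwidth, then
 * divide every percentage by their common GCD.  Overflow of bw * 100
 * is ignored for brevity.
 */
static void reduce_default_weights(const struct node_bw *bw, size_t n,
				   uint8_t *weights)
{
	uint64_t total = 0, g = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += node_min_bw(&bw[i]);

	for (i = 0; i < n; i++) {
		uint64_t pct = total ? node_min_bw(&bw[i]) * 100 / total : 1;

		if (!pct)
			pct = 1;	/* assumption: never emit a 0 weight */
		weights[i] = (uint8_t)pct;
		g = gcd_u64(g, pct);
	}

	for (i = 0; i < n; i++)
		weights[i] /= (uint8_t)g;
}
```

With two nodes at 800 and 100 bandwidth units, the floored shares are
88 and 11, which reduce to (8:1). With 89 and 11 units the shares are
co-prime and stay at (89:11).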

The approach is simple and fast, and works reasonably well if the
numbers reported by HMAT for each node happen to land on easily
reducible percentages.  For example, a system presenting 88% of its
bandwidth on DRAM and 11% of its bandwidth on CXL (floored for
simplicity) will end up with default weights of (8:1), which keeps
each weight preferably small.

The downside of this approach is that it is susceptible to prime and
co-prime numbers keeping interleave weights large (e.g. 89:11 vs 8:1).
We prefer finer grained interleaves to prevent large swaths of
contiguous memory from landing on the same device.

Additionally, this approach hides the fact that multi-socket systems
experience chokepoints across sockets.  For example a 2-socket
system with 200GB/s on each socket from DDR does not mean a given
socket has an aggregate of 400GB/s of bandwidth.  Interconnects between
sockets provide less aggregate bandwidth than the DDR they provide
access to (e.g. 3 UPI lanes vs 8 DDR channels).

So this approach will reduce multi-socket interleave weights to (1:1)
by default if all sockets provide the same bandwidth.

Signed-off-by: Gregory Price <gregory.price@memverge.com>

Gregory Price (1):
  mm/mempolicy: introduce system default interleave weights

 drivers/acpi/numa/hmat.c  |   1 +
 drivers/base/node.c       |   7 +++
 include/linux/mempolicy.h |   4 ++
 mm/mempolicy.c            | 129 ++++++++++++++++++++++++++++++--------
 4 files changed, 116 insertions(+), 25 deletions(-)

-- 
2.39.1


