* [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave
@ 2024-01-03 22:41 Gregory Price
  2024-01-03 22:41 ` [PATCH v6 01/12] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Gregory Price @ 2024-01-03 22:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko,
	Zhongkun He, Frank van der Linden, John Groves, Jonathan Cameron


Weighted interleave is a new interleave policy intended to make
use of heterogeneous memory environments appearing with CXL.

To implement weighted interleave with task-local weights, we
need new syscalls capable of passing a weight array. This is
the justification for mempolicy2/mbind2 - which are designed
to be extensible to capture future policies as well.


The existing interleave mechanism does an even round-robin
distribution of memory across all nodes in a nodemask, while
weighted interleave distributes memory across nodes according
to a provided weight. (Weight = # of page allocations per round)
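
As an illustration only (a minimal userspace sketch of the selection
logic, not the kernel implementation), each node receives `weight`
consecutive page allocations before selection moves on to the next node
in the mask:

/* sketch only: weighted round-robin node selection, 5:1 (DRAM:CXL) */
#include <stdio.h>

static const int node_ids[] = { 0, 1 };  /* example: node0=DRAM, node1=CXL */
static const int weights[]  = { 5, 1 };
static const int nr_nodes   = 2;

static int cur_node = -1;
static int cur_weight;

/* return the node that should service the next page allocation */
static int next_alloc_node(void)
{
        if (cur_weight == 0) {
                cur_node = (cur_node + 1) % nr_nodes;
                cur_weight = weights[cur_node];
        }
        cur_weight--;
        return node_ids[cur_node];
}

int main(void)
{
        /* 12 allocations -> 10 land on node 0 and 2 on node 1 (5:1) */
        for (int i = 0; i < 12; i++)
                printf("page %2d -> node %d\n", i, next_alloc_node());
        return 0;
}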

Weighted interleave is intended to reduce average latency when
bandwidth is pressured - therefore increasing total throughput.
In other words: It allows greater use of the total available
bandwidth in a heterogeneous hardware environment (different
hardware provides different bandwidth capacity).

As bandwidth is pressured, latency increases - first linearly
and then exponentially. By keeping bandwidth usage distributed
according to available bandwidth, we therefore can reduce the
average latency of a cacheline fetch.

A good explanation of the bandwidth vs latency response curve:
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/

From the article:
```
Constant region:
    The latency response is fairly constant for the first 40%
    of the sustained bandwidth.
Linear region:
    In between 40% to 80% of the sustained bandwidth, the
    latency response increases almost linearly with the bandwidth
    demand of the system due to contention overhead by numerous
    memory requests.
Exponential region:
    Between 80% to 100% of the sustained bandwidth, the memory
    latency is dominated by the contention latency which can be
    as much as twice the idle latency or more.
Maximum sustained bandwidth :
    Is 65% to 75% of the theoretical maximum bandwidth.
```

As a general rule of thumb:
  * If bandwidth usage is low, latency does not increase. It is
    optimal to place data in the nearest (lowest latency) device.
  * If bandwidth usage is high, latency increases. It is optimal
    to place data such that bandwidth use is optimized per-device.

This is the top line goal: Provide a user a mechanism to target using
the "maximum sustained bandwidth" of each hardware component in a
heterogeneous memory system.
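
A hedged sketch of one way to pick such weights, assuming per-node
sustained bandwidth has already been measured (the ~200 GB/s vs ~22 GB/s
figures come from the stream results further below, and rounding to the
nearest integer ratio is an assumption, not something this series
mandates). It prints the corresponding sysfs commands described later in
this cover letter:

/* sketch: derive interleave weights proportional to measured bandwidth */
#include <stdio.h>

int main(void)
{
        unsigned int bw[2] = { 200, 22 }; /* GB/s: node1 (DRAM), node2 (CXL) */
        unsigned int min = bw[0] < bw[1] ? bw[0] : bw[1];

        for (unsigned int i = 0; i < 2; i++) {
                /* weight = bandwidth ratio vs the slowest node, rounded */
                unsigned int w = (bw[i] + min / 2) / min;
                printf("echo %u > /sys/kernel/mm/mempolicy/weighted_interleave/node%u\n",
                       w, i + 1);
        }
        return 0;
}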


For example, the stream benchmark demonstrates that default interleave
is actively harmful, whereas weighted interleave is beneficial. Default
interleave distributes data such that too much pressure is placed on
devices with lower available bandwidth.

Stream Benchmark (High level results, 1 Socket + 1 CXL Device)
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
Targeted weights   : +2.5% to +4% (consistently better than DRAM)

Global means the task-policy was set (set_mempolicy2), while targeted
means VMA policies were set (mbind2). We can see weighted interleave
is not always beneficial when applied globally, but is always
beneficial when applied to bandwidth-driving data areas. This is a
good reason to provide both mechanisms (Simplicity vs Control).


We implement sysfs entries for "system global" weights which can be
set by a daemon or administrator, and new extensible syscalls
(mempolicy2, mbind2) for task-local weights to be set by either
numactl or user-software.

We chose to implement an extensible mempolicy interface so that
future extensions can be captured, rather than adding additional
syscalls for every new mempolicy which requires new data.

MPOL_WEIGHTED_INTERLEAVE is included as an example extension.

There are 3 "phases" in the patch set that could be considered
separate merge candidates, but they are presented here as a single
series because the goal is a fully functional MPOL_WEIGHTED_INTERLEAVE.

1) Implement MPOL_WEIGHTED_INTERLEAVE with a sysfs extension for
   setting system-global weights via sysfs.
   (Patches 1-3)

2) Refactor mempolicy creation mechanism to use an extensible arg
   struct `struct mempolicy_param` to promote code re-use between
   the original mempolicy/mbind interfaces and the new interfaces.
   (Patches 4-7)

3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
   along with the addition of task-local weights so that per-task
   weights can be registered for MPOL_WEIGHTED_INTERLEAVE.
   (Patches 8-12)

Included below is LTP test information, performance test information,
and links to test software and a numactl branch which can be used for
testing.

= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

= LTP Testing Summary =
existing mempolicy & mbind tests: pass
mempolicy & mbind + weighted interleave (global weights): pass
mempolicy2 & mbind2 + weighted interleave (global weights): pass
mempolicy2 & mbind2 + weighted interleave (local weights): pass

= v6 notes =
- bug: resolved excessive stack usage w/ scratch area
- bug: bulk allocator uninitialized value (prev_node = NUMA_NO_NODE)
- bug: global weights should be unsigned (char -> u8)
- bug: return value in get_mempolicy (thanks dan.carpenter@linaro.org)
- refactor: refactor read_once operations into functions
- change: reduce mpol_params->pol_maxnodes size from u64 to u16
- change: add weight scratch space in mempolicy used during allocation
- change: kill MPOL_F_GWEIGHT flag
- change: 0-weight now implies "use global/default"
- change: simplify bulk allocator logic
- change: weights are now all u8 for consistency
- change: add default_iw_table (system default separate from sysfs)
- change: _args to _param in struct names
- change: sanitize_flags simplification
- documentation updates

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <ravis.opensrc@micron.com>

Hardware: Single-socket, multiple CXL memory expanders.

Workload:                               W2
Data Signature:                         2:1 read:write
DRAM only bandwidth (GBps):             298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only:                    1.38x
Gain over default interleave:           2.64x

Workload:                               W5
Data Signature:                         1:1 read:write
DRAM only bandwidth (GBps):             273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only:                    1.4x
Gain over default interleave:           2.26x

=====================================================================
Performance test - Stream
From - Gregory Price <gregory.price@memverge.com>

Hardware: Single socket, single CXL expander
numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
mbind2 weights     : +2.5% to +4% (consistently better than DRAM)

dram only:
numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Function     Direction    BestRateMBs     AvgTime      MinTime      MaxTime
Copy:        0->0            200923.2     0.032662     0.031853     0.033301
Scale:       0->0            202123.0     0.032526     0.031664     0.032970
Add:         0->0            208873.2     0.047322     0.045961     0.047884
Triad:       0->0            208523.8     0.047262     0.046038     0.048414

CXL-only:
numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0             22209.7     0.288661     0.288162     0.289342
Scale:       0->0             22288.2     0.287549     0.287147     0.288291
Add:         0->0             24419.1     0.393372     0.393135     0.393735
Triad:       0->0             24484.6     0.392337     0.392083     0.394331

Based on the above (~200 GB/s DRAM vs ~22 GB/s CXL), the optimal weights are ~9:1
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2

default interleave:
numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0             44666.2     0.143671     0.143285     0.144174
Scale:       0->0             44781.6     0.143256     0.142916     0.143713
Add:         0->0             48600.7     0.197719     0.197528     0.197858
Triad:       0->0             48727.5     0.197204     0.197014     0.197439

global weighted interleave:
numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0            190085.9     0.034289     0.033669     0.034645
Scale:       0->0            207677.4     0.031909     0.030817     0.033061
Add:         0->0            202036.8     0.048737     0.047516     0.053409
Triad:       0->0            217671.5     0.045819     0.044103     0.046755

targeted regions w/ global weights (stream modified to mbind2 the malloc'd regions):
numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
Copy:        0->0            205827.0     0.031445     0.031094     0.031984
Scale:       0->0            208171.8     0.031320     0.030744     0.032505
Add:         0->0            217352.0     0.045087     0.044168     0.046515
Triad:       0->0            216884.8     0.045062     0.044263     0.046982

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <hyeongtak.ji@sk.com>

Hardware: Single socket, Single CXL memory Expander

NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads:     56
Lookups:     170,000,000

Summary: +19% over DRAM. +47% over default interleave.

1. dram only
$ numactl -m 0 ./XSBench -s XL -p 5000000
Runtime:     36.235 seconds
Lookups/s:   4,691,618

2. default interleave
$ numactl -i 0,2 ./XSBench -s XL -p 5000000
Runtime:     55.243 seconds
Lookups/s:   3,077,293

3. weighted interleave
$ numactl -w -i 0,2 ./XSBench -s XL -p 5000000
Runtime:     29.262 seconds
Lookups/s:   5,809,513

=====================================================================
LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2

= Existing tests
set_mempolicy, get_mempolicy, mbind

MPOL_WEIGHTED_INTERLEAVE was added manually to test basic functionality,
but the tests were not adjusted for weighting.  The weights were left at
1 (the default), so the policy should behave like standard
MPOL_INTERLEAVE if the logic is correct.

== set_mempolicy01 : passed   18, failed   0
== set_mempolicy02 : passed   10, failed   0
== set_mempolicy03 : passed   64, failed   0
== set_mempolicy04 : passed   32, failed   0
== set_mempolicy05 - n/a on non-x86
== set_mempolicy06 : passed   10, failed   0
   this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
== set_mempolicy07 : passed   32, failed   0
   set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy01 : passed   12, failed   0
   change: added MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy02 : passed   2, failed   0
== mbind01 : passed   15, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind02 : passed   4, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind03 : passed   16, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind04 : passed   48, failed   0
   added MPOL_WEIGHTED_INTERLEAVE

= New Tests
set_mempolicy2, get_mempolicy2, mbind2

Took the original set_mempolicy and get_mempolicy tests, and updated
them to utilize the new mempolicy2 interfaces.  Added additional tests
for setting task-local weights to validate behavior.

== set_mempolicy201  : passed   18, failed   0
== set_mempolicy202  : passed   10, failed   0
== set_mempolicy203  : passed   64, failed   0
== set_mempolicy204  : passed   32, failed   0
== set_mempolicy205  : passed   10, failed   0
== set_mempolicy206  : passed   32, failed   0
== set_mempolicy207  : passed   6, failed   0
   new: MPOL_WEIGHTED_INTERLEAVE with task-local weights
== get_mempolicy201  : passed   12, failed   0
== get_mempolicy202  : passed   2, failed   0
== get_mempolicy203  : passed   6, failed   0
   new: fetch global and local weights
== mbind201  : passed   15, failed   0
== mbind202  : passed   4, failed   0
== mbind203  : passed   16, failed   0
== mbind204  : passed   48, failed   0

=====================================================================
Basic set_mempolicy2 test

The test below calls set_mempolicy2 with weighted interleave and
task-local weights, and uses pthread_create to demonstrate that the
mempolicy can be overridden by the child thread.

Manually validating the distribution via numa_maps:

007c0000 weighted interleave:0-1 heap anon=65794 dirty=65794 active=0 N0=54829 N1=10965 kernelpagesize_kB=4
7f3f2c000000 weighted interleave:0-1 anon=32768 dirty=32768 active=0 N0=5461 N1=27307 kernelpagesize_kB=4
7f3f34000000 weighted interleave:0-1 anon=16384 dirty=16384 active=0 N0=2731 N1=13653 kernelpagesize_kB=4
7f3f3bffe000 weighted interleave:0-1 anon=65538 dirty=65538 active=0 N0=10924 N1=54614 kernelpagesize_kB=4
7f3f5c000000 weighted interleave:0-1 anon=16384 dirty=16384 active=0 N0=2731 N1=13653 kernelpagesize_kB=4
7f3f60dfe000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54615 N1=10922 kernelpagesize_kB=4

Expected distribution is 5:1 or 1:5 (the lesser-weighted node should receive ~16.666% of pages)
1) 10965/65794 : 16.6656...
2) 5461/32768  : 16.6656...
3) 2731/16384  : 16.6687...
4) 10924/65538 : 16.6682...
5) 2731/16384  : 16.6687...
6) 10922/65537 : 16.6653...


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>
#include <errno.h>
#include <numaif.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/uio.h>
#include <sys/types.h>
#include <stdint.h>

#define MPOL_WEIGHTED_INTERLEAVE 6
#define SET_MEMPOLICY2(a, b) syscall(457, a, b, 0)

#define M256 (1024*1024*256)
#define PAGE_SIZE (4096)

struct mpol_param {
        /* Basic mempolicy settings */
        uint16_t mode;
        uint16_t mode_flags;
        int32_t home_node;
        uint16_t pol_maxnodes;
        uint8_t  resv[6];
        uint64_t pol_nodes;
        uint64_t il_weights;
};

struct mpol_param wil_param;
struct bitmask *wil_nodes;
unsigned char *weights;
int total_nodes = -1;
pthread_t tid;

void set_mempolicy_call(int which)
{
        weights = (unsigned char *)calloc(total_nodes, sizeof(unsigned char));
        wil_nodes = numa_allocate_nodemask();

        numa_bitmask_setbit(wil_nodes, 0); weights[0] = which ? 1 : 5;
        numa_bitmask_setbit(wil_nodes, 1); weights[1] = which ? 5 : 1;

        memset(&wil_param, 0, sizeof(wil_param));
        wil_param.mode = MPOL_WEIGHTED_INTERLEAVE;
        wil_param.mode_flags = 0;
        wil_param.pol_nodes = (uint64_t)(uintptr_t)wil_nodes->maskp;
        wil_param.pol_maxnodes = total_nodes;
        wil_param.il_weights = (uint64_t)(uintptr_t)weights;

        int ret = SET_MEMPOLICY2(&wil_param, sizeof(wil_param));
        fprintf(stderr, "set_mempolicy2 result: %d(%s)\n", ret, strerror(errno));
}

void *func(void *arg)
{
        char *mainmem;
        int i;

        set_mempolicy_call(1); /* weight 1 heavier */

        /* allocate and touch pages so they are placed per the new thread policy */
        mainmem = malloc(M256);
        memset(mainmem, 1, M256);
        for (i = 0; i < (M256/PAGE_SIZE); i++) {
                mainmem = malloc(PAGE_SIZE);
                mainmem[0] = 1;
        }
        printf("thread done %d\n", getpid());
        getchar();
        return arg;
}

int main()
{
        char * mainmem;
        int i;

        total_nodes = numa_max_node() + 1;

        set_mempolicy_call(0); /* weight 0 heavier */
        pthread_create(&tid, NULL, func, NULL);

        mainmem = malloc(M256);
        memset(mainmem, 1, M256);
        for (i = 0; i < (M256/PAGE_SIZE); i++) {
                mainmem = malloc(PAGE_SIZE);
                mainmem[0] = 1;
        }
        printf("main done %d\n", getpid());
        getchar();

        return 0;
}

=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master

command: numactl -w --interleave=0,1 ./eatmem

result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct

result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct

result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (void)
{
        char* mem = malloc(1024*1024*256);
        memset(mem, 1, 1024*1024*256);
        for (int i = 0; i  < ((1024*1024*256)/4096); i++)
        {
                mem = malloc(4096);
                mem[0] = 1;
        }
        printf("done\n");
        getchar();
        return 0;
}

=====================================================================

Suggested-by: Gregory Price <gregory.price@memverge.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Hasan Al Maruf <hasanalmaruf@fb.com>
Suggested-by: Hao Wang <haowang3@fb.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: tj <tj@kernel.org>
Suggested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: John Groves <john@jagalactic.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Srinivasulu Thanneeru <sthanneeru@micron.com>
Suggested-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>


Gregory Price (11):
  mm/mempolicy: refactor a read-once mechanism into a function for
    re-use
  mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted
    interleaving
  mm/mempolicy: refactor sanitize_mpol_flags for reuse
  mm/mempolicy: create struct mempolicy_param for creating new
    mempolicies
  mm/mempolicy: refactor kernel_get_mempolicy for code re-use
  mm/mempolicy: allow home_node to be set by mpol_new
  mm/mempolicy: add userland mempolicy arg structure
  mm/mempolicy: add set_mempolicy2 syscall
  mm/mempolicy: add get_mempolicy2 syscall
  mm/mempolicy: add the mbind2 syscall
  mm/mempolicy: extend mempolicy2 and mbind2 to support weighted
    interleave

Rakie Kim (1):
  mm/mempolicy: implement the sysfs-based weighted_interleave interface

 .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  26 +
 .../admin-guide/mm/numa_memory_policy.rst     |  67 ++
 arch/alpha/kernel/syscalls/syscall.tbl        |   3 +
 arch/arm/tools/syscall.tbl                    |   3 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   6 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   3 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   3 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   3 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   3 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   3 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   3 +
 arch/s390/kernel/syscalls/syscall.tbl         |   3 +
 arch/sh/kernel/syscalls/syscall.tbl           |   3 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   3 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   3 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   3 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   3 +
 include/linux/mempolicy.h                     |  19 +
 include/linux/syscalls.h                      |   8 +
 include/uapi/asm-generic/unistd.h             |   8 +-
 include/uapi/linux/mempolicy.h                |  16 +-
 kernel/sys_ni.c                               |   3 +
 mm/mempolicy.c                                | 976 +++++++++++++++---
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |   3 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |   3 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |   3 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |   3 +
 29 files changed, 1062 insertions(+), 127 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

-- 
2.39.1




* [PATCH v6 01/12] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
@ 2024-01-03 22:41 ` Gregory Price
  2024-01-03 22:41 ` [PATCH v6 02/12] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2024-01-03 22:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

From: Rakie Kim <rakie.kim@sk.com>

This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  └── weighted_interleave [2]
      ├── node0 [3]
      └── node1

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] weighted_interleave/: config interface for weighted interleave policy

[3] weighted_interleave/nodeN: weight for nodeN

Internally, there is a secondary table `default_iw_table`, which holds
kernel-internal default interleave weights for each possible node.

If the value for a node is set to `0`, the default value will be used.

If sysfs is disabled in the config, interleave weights will fall back
to `default_iw_table`.
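
For reference, a minimal userspace sketch (assuming a caller with write
permission on the sysfs files) that reads a node's effective weight and
then updates it; writing `0` or an empty string reverts the node to its
default weight:

/* sketch: read and update a weighted_interleave node weight via sysfs */
#include <stdio.h>

int main(void)
{
        const char *path = "/sys/kernel/mm/mempolicy/weighted_interleave/node1";
        unsigned int weight = 0;
        FILE *f;

        f = fopen(path, "r");
        if (!f)
                return 1;
        if (fscanf(f, "%u", &weight) != 1)
                weight = 0;
        fclose(f);
        printf("node1 effective weight: %u\n", weight);

        f = fopen(path, "w");
        if (!f)
                return 1;
        fprintf(f, "%u\n", 5); /* valid weights: 1-255; 0 resets to default */
        fclose(f);
        return 0;
}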

Suggested-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
---
 .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  26 +++
 mm/mempolicy.c                                | 178 ++++++++++++++++++
 3 files changed, 208 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..2dcf24f4384a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,4 @@
+What:		/sys/kernel/mm/mempolicy/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for Mempolicy
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..e6a38139bf0f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,26 @@
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Configuration Interface for the Weighted Interleave policy
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Weight configuration interface for nodeN
+
+		The interleave weight for a memory node (N). These weights are
+		utilized by processes which have set their mempolicy to
+		MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
+		omitting a task-local weight array.
+
+		These weights only affect new allocations, and changes at runtime
+		will not cause migrations on already allocated pages.
+
+		The minimum weight for a node is always 1.
+
+		Minimum weight: 1
+		Maximum weight: 255
+
+		Writing an empty string or `0` will reset the weight to the
+		system default. The system default may be set by the kernel
+		or drivers at boot or during hotplug events.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..30da1a1be707 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,23 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+/*
+ * default_iw_table is the kernel-internal default value interleave
+ * weight table. It is to be set by driver code capable of reading
+ * HMAT/CDAT information, and to provide mempolicy a sane set of
+ * default weight values for WEIGHTED_INTERLEAVE mode.
+ *
+ * By default, prior to HMAT/CDAT information being consumed, the
+ * default weight of all nodes is 1.  The default weight of any
+ * node can only be in the range 1-255. A 0-weight is not allowed.
+ */
+static u8 default_iw_table[MAX_NUMNODES];
+/*
+ * iw_table is the sysfs-set interleave weight table, a value of 0
+ * denotes that the default_iw_table value should be used.
+ */
+static u8 iw_table[MAX_NUMNODES];
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -3067,3 +3084,164 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+#ifdef CONFIG_SYSFS
+struct iw_node_attr {
+	struct kobj_attribute kobj_attr;
+	int nid;
+};
+
+static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
+			 char *buf)
+{
+	struct iw_node_attr *node_attr;
+	u8 weight;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+	weight = iw_table[node_attr->nid];
+	if (!weight)
+		weight = default_iw_table[node_attr->nid];
+	return sysfs_emit(buf, "%d\n", weight);
+}
+
+static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
+			  const char *buf, size_t count)
+{
+	struct iw_node_attr *node_attr;
+	u8 weight = 0;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+	/* If no input, revert to default weight */
+	if (count == 0 || sysfs_streq(buf, ""))
+		weight = 0;
+	else if (kstrtou8(buf, 0, &weight))
+		return -EINVAL;
+
+	iw_table[node_attr->nid] = weight;
+	return count;
+}
+
+static struct iw_node_attr *node_attrs[MAX_NUMNODES];
+
+static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
+				  struct kobject *parent)
+{
+	if (!node_attr)
+		return;
+	sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
+	kfree(node_attr->kobj_attr.attr.name);
+	kfree(node_attr);
+}
+
+static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
+{
+	int i;
+
+	for (i = 0; i < MAX_NUMNODES; i++)
+		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);
+	kobject_put(mempolicy_kobj);
+}
+
+static const struct kobj_type mempolicy_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.release = sysfs_mempolicy_release,
+};
+
+static int add_weight_node(int nid, struct kobject *wi_kobj)
+{
+	struct iw_node_attr *node_attr;
+	char *name;
+
+	node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
+	if (!node_attr)
+		return -ENOMEM;
+
+	name = kasprintf(GFP_KERNEL, "node%d", nid);
+	if (!name) {
+		kfree(node_attr);
+		return -ENOMEM;
+	}
+
+	sysfs_attr_init(&node_attr->kobj_attr.attr);
+	node_attr->kobj_attr.attr.name = name;
+	node_attr->kobj_attr.attr.mode = 0644;
+	node_attr->kobj_attr.show = node_show;
+	node_attr->kobj_attr.store = node_store;
+	node_attr->nid = nid;
+
+	if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
+		kfree(node_attr->kobj_attr.attr.name);
+		kfree(node_attr);
+		pr_err("failed to add attribute to weighted_interleave\n");
+		return -ENOMEM;
+	}
+
+	node_attrs[nid] = node_attr;
+	return 0;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+	struct kobject *wi_kobj;
+	int nid, err;
+
+	wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (!wi_kobj)
+		return -ENOMEM;
+
+	err = kobject_init_and_add(wi_kobj, &mempolicy_ktype, root_kobj,
+				   "weighted_interleave");
+	if (err) {
+		kfree(wi_kobj);
+		return err;
+	}
+
+	memset(node_attrs, 0, sizeof(node_attrs));
+	for_each_node_state(nid, N_POSSIBLE) {
+		err = add_weight_node(nid, wi_kobj);
+		if (err) {
+			pr_err("failed to add sysfs [node%d]\n", nid);
+			break;
+		}
+	}
+	if (err)
+		kobject_put(wi_kobj);
+	return 0;
+}
+
+static int __init mempolicy_sysfs_init(void)
+{
+	int err;
+	struct kobject *root_kobj;
+
+	memset(&default_iw_table, 1, sizeof(default_iw_table));
+	memset(&iw_table, 0, sizeof(iw_table));
+
+	root_kobj = kobject_create_and_add("mempolicy", mm_kobj);
+	if (!root_kobj) {
+		pr_err("failed to add mempolicy kobject to the system\n");
+		return -ENOMEM;
+	}
+
+	err = add_weighted_interleave_group(root_kobj);
+
+	if (err)
+		kobject_put(root_kobj);
+	return err;
+
+}
+#else
+static int __init mempolicy_sysfs_init(void)
+{
+	/*
+	 * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
+	 * MPOL_INTERLEAVE behavior, but is still defined separately to
+	 * allow task-local weighted interleave and system-defaults to
+	 * operate as intended.
+	 */
+	memset(&default_iw_table, 1, sizeof(default_iw_table));
+	memset(&iw_table, 0, sizeof(iw_table));
+	return 0;
+}
+#endif /* CONFIG_SYSFS */
+late_initcall(mempolicy_sysfs_init);
-- 
2.39.1




* [PATCH v6 02/12] mm/mempolicy: refactor a read-once mechanism into a function for re-use
  2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
  2024-01-03 22:41 ` [PATCH v6 01/12] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
@ 2024-01-03 22:41 ` Gregory Price
  2024-01-03 22:42 ` [PATCH v6 06/12] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2024-01-03 22:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

Move the use of barrier(), which forces policy->nodemask onto the stack,
into a function, read_once_policy_nodemask(), so that it may be re-used.

Suggested-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 30da1a1be707..6cdb00acb86b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1900,6 +1900,20 @@ unsigned int mempolicy_slab_node(void)
 	}
 }
 
+static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
+					      nodemask_t *mask)
+{
+	/*
+	 * barrier stabilizes the nodemask locally so that it can be iterated
+	 * over safely without concern for changes. Allocators validate node
+	 * selection does not violate mems_allowed, so this is safe.
+	 */
+	barrier();
+	__builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+	barrier();
+	return nodes_weight(*mask);
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1907,20 +1921,12 @@ unsigned int mempolicy_slab_node(void)
  */
 static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
-	nodemask_t nodemask = pol->nodes;
+	nodemask_t nodemask;
 	unsigned int target, nnodes;
 	int i;
 	int nid;
-	/*
-	 * The barrier will stabilize the nodemask in a register or on
-	 * the stack so that it will stop changing under the code.
-	 *
-	 * Between first_node() and next_node(), pol->nodes could be changed
-	 * by other threads. So we put pol->nodes in a local stack.
-	 */
-	barrier();
 
-	nnodes = nodes_weight(nodemask);
+	nnodes = read_once_policy_nodemask(pol, &nodemask);
 	if (!nnodes)
 		return numa_node_id();
 	target = ilx % nnodes;
-- 
2.39.1




* [PATCH v6 06/12] mm/mempolicy: refactor kernel_get_mempolicy for code re-use
  2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
  2024-01-03 22:41 ` [PATCH v6 01/12] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
  2024-01-03 22:41 ` [PATCH v6 02/12] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price
@ 2024-01-03 22:42 ` Gregory Price
  2024-01-03 22:42 ` [PATCH v6 08/12] mm/mempolicy: add userland mempolicy arg structure Gregory Price
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2024-01-03 22:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

Pull operation flag checking from inside do_get_mempolicy out
to kernel_get_mempolicy.  This allows us to flatten the
internal code, and break it into separate functions for future
syscalls (get_mempolicy2, process_get_mempolicy) to re-use the
code, even after additional extensions are made.

The primary change is that the flag is treated as the multiplexer
that it actually is.  For get_mempolicy, the flags argument represents 3
different primary operations:

if (flags & MPOL_F_MEMS_ALLOWED)
	return task->mems_allowed
else if (flags & MPOL_F_ADDR)
	return vma mempolicy information
else
	return task mempolicy information

Plus the behavior modifying flag:

if (flags & MPOL_F_NODE)
	change the return value of (int __user *policy)
	based on whether MPOL_F_ADDR was set.

The original behavior of get_mempolicy is retained, but we utilize
the new mempolicy_param structure to pass the operations down the
stack.  This will allow us to extend the internal functions without
affecting the legacy behavior of get_mempolicy.
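
For context, the three multiplexed operations as seen from userspace
(a hedged sketch using the existing get_mempolicy(2) wrapper from
libnuma; link with -lnuma):

/* sketch: the three get_mempolicy() operations selected by flags */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        struct bitmask *mask = numa_allocate_nodemask();
        char *buf = malloc(4096);
        int mode, node;

        buf[0] = 1;     /* fault the page in so it has a backing node */

        /* 1) MPOL_F_MEMS_ALLOWED: fetch task->mems_allowed */
        get_mempolicy(NULL, mask->maskp, mask->size, NULL, MPOL_F_MEMS_ALLOWED);
        printf("mems_allowed spans %u nodes\n", numa_bitmask_weight(mask));

        /* 2) no flags: fetch the task mempolicy mode and nodemask */
        get_mempolicy(&mode, mask->maskp, mask->size, NULL, 0);
        printf("task policy mode: %d\n", mode);

        /* 3) MPOL_F_ADDR | MPOL_F_NODE: node backing the page at addr */
        get_mempolicy(&node, NULL, 0, buf, MPOL_F_ADDR | MPOL_F_NODE);
        printf("page at %p resides on node %d\n", (void *)buf, node);

        numa_free_nodemask(mask);
        free(buf);
        return 0;
}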

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 244 +++++++++++++++++++++++++++++++------------------
 1 file changed, 154 insertions(+), 90 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1f6f19b5d157..db290cf540d7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -910,106 +910,109 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
 	return ret;
 }
 
-/* Retrieve NUMA policy */
-static long do_get_mempolicy(int *policy, nodemask_t *nmask,
-			     unsigned long addr, unsigned long flags)
+/* Retrieve the mems_allowed for current task */
+static inline long do_get_mems_allowed(nodemask_t *nmask)
 {
-	int err;
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	struct mempolicy *pol = current->mempolicy, *pol_refcount = NULL;
+	task_lock(current);
+	*nmask  = cpuset_current_mems_allowed;
+	task_unlock(current);
+	return 0;
+}
 
-	if (flags &
-		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
-		return -EINVAL;
+/* If the policy has additional node information to retrieve, return it */
+static long do_get_policy_node(struct mempolicy *pol)
+{
+	/*
+	 * For MPOL_INTERLEAVE, the extended node information is the next
+	 * node that will be selected for interleave. For weighted interleave
+	 * we return the next node based on the current weight.
+	 */
+	if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE)
+		return next_node_in(current->il_prev, pol->nodes);
 
-	if (flags & MPOL_F_MEMS_ALLOWED) {
-		if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
-			return -EINVAL;
-		*policy = 0;	/* just so it's initialized */
+	if (pol == current->mempolicy &&
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
+		if (pol->wil.cur_weight)
+			return current->il_prev;
+		else
+			return next_node_in(current->il_prev, pol->nodes);
+	}
+	return -EINVAL;
+}
+
+/* Handle user_nodemask condition when fetching nodemask for userspace */
+static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *nmask)
+{
+	if (mpol_store_user_nodemask(pol)) {
+		*nmask = pol->w.user_nodemask;
+	} else {
 		task_lock(current);
-		*nmask  = cpuset_current_mems_allowed;
+		get_policy_nodemask(pol, nmask);
 		task_unlock(current);
-		return 0;
 	}
+}
 
-	if (flags & MPOL_F_ADDR) {
-		pgoff_t ilx;		/* ignored here */
-		/*
-		 * Do NOT fall back to task policy if the
-		 * vma/shared policy at addr is NULL.  We
-		 * want to return MPOL_DEFAULT in this case.
-		 */
-		mmap_read_lock(mm);
-		vma = vma_lookup(mm, addr);
-		if (!vma) {
-			mmap_read_unlock(mm);
-			return -EFAULT;
-		}
-		pol = __get_vma_policy(vma, addr, &ilx);
-	} else if (addr)
-		return -EINVAL;
+/* Retrieve NUMA policy for a VMA associated with a given address */
+static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
+				 struct mempolicy_param *param)
+{
+	pgoff_t ilx;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	struct mempolicy *pol = NULL;
 
+	mmap_read_lock(mm);
+	vma = vma_lookup(mm, addr);
+	if (!vma) {
+		mmap_read_unlock(mm);
+		return -EFAULT;
+	}
+	pol = __get_vma_policy(vma, addr, &ilx);
 	if (!pol)
-		pol = &default_policy;	/* indicates default behavior */
+		pol = &default_policy;
+	else
+		mpol_get(pol);
+	mmap_read_unlock(mm);
 
-	if (flags & MPOL_F_NODE) {
-		if (flags & MPOL_F_ADDR) {
-			/*
-			 * Take a refcount on the mpol, because we are about to
-			 * drop the mmap_lock, after which only "pol" remains
-			 * valid, "vma" is stale.
-			 */
-			pol_refcount = pol;
-			vma = NULL;
-			mpol_get(pol);
-			mmap_read_unlock(mm);
-			err = lookup_node(mm, addr);
-			if (err < 0)
-				goto out;
-			*policy = err;
-		} else if (pol == current->mempolicy &&
-				pol->mode == MPOL_INTERLEAVE) {
-			*policy = next_node_in(current->il_prev, pol->nodes);
-		} else if (pol == current->mempolicy &&
-				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
-			if (pol->wil.cur_weight)
-				*policy = current->il_prev;
-			else
-				*policy = next_node_in(current->il_prev,
-						       pol->nodes);
-		} else {
-			err = -EINVAL;
-			goto out;
-		}
-	} else {
-		*policy = pol == &default_policy ? MPOL_DEFAULT :
-						pol->mode;
-		/*
-		 * Internal mempolicy flags must be masked off before exposing
-		 * the policy to userspace.
-		 */
-		*policy |= (pol->flags & MPOL_MODE_FLAGS);
-	}
+	/* Fetch the node for the given address */
+	if (addr_node)
+		*addr_node = lookup_node(mm, addr);
 
-	err = 0;
-	if (nmask) {
-		if (mpol_store_user_nodemask(pol)) {
-			*nmask = pol->w.user_nodemask;
-		} else {
-			task_lock(current);
-			get_policy_nodemask(pol, nmask);
-			task_unlock(current);
-		}
+	param->mode = pol == &default_policy ? MPOL_DEFAULT : pol->mode;
+	param->mode_flags = (pol->flags & MPOL_MODE_FLAGS);
+	param->home_node = pol->home_node;
+
+	if (param->policy_nodes)
+		do_get_mempolicy_nodemask(pol, param->policy_nodes);
+
+	if (pol != &default_policy) {
+		mpol_put(pol);
+		mpol_cond_put(pol);
 	}
 
- out:
-	mpol_cond_put(pol);
-	if (vma)
-		mmap_read_unlock(mm);
-	if (pol_refcount)
-		mpol_put(pol_refcount);
-	return err;
+	return 0;
+}
+
+/* Retrieve NUMA policy for the current task */
+static long do_get_task_mempolicy(struct mempolicy_param *param, int *pol_node)
+{
+	struct mempolicy *pol = current->mempolicy;
+
+	if (!pol)
+		pol = &default_policy;	/* indicates default behavior */
+
+	param->mode = pol == &default_policy ? MPOL_DEFAULT : pol->mode;
+	/* Internal flags must be masked off before exposing to userspace */
+	param->mode_flags = (pol->flags & MPOL_MODE_FLAGS);
+	param->home_node = NUMA_NO_NODE;
+
+	if (pol_node)
+		*pol_node = do_get_policy_node(pol);
+
+	if (param->policy_nodes)
+		do_get_mempolicy_nodemask(pol, param->policy_nodes);
+
+	return 0;
 }
 
 #ifdef CONFIG_MIGRATION
@@ -1742,16 +1745,77 @@ static int kernel_get_mempolicy(int __user *policy,
 				unsigned long addr,
 				unsigned long flags)
 {
+	struct mempolicy_param param;
 	int err;
-	int pval;
+	int address_node = NUMA_NO_NODE;
+	int pval = 0;
+	int pol_node = 0;
 	nodemask_t nodes;
 
 	if (nmask != NULL && maxnode < nr_node_ids)
 		return -EINVAL;
 
-	addr = untagged_addr(addr);
+	if (flags &
+		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
+		return -EINVAL;
 
-	err = do_get_mempolicy(&pval, &nodes, addr, flags);
+	/* Ensure any data that may be copied to userland is initialized */
+	memset(&param, 0, sizeof(param));
+	param.policy_nodes = &nodes;
+
+	/*
+	 * set_mempolicy was originally multiplexed based on 3 flags:
+	 *   MPOL_F_MEMS_ALLOWED:  fetch task->mems_allowed
+	 *   MPOL_F_ADDR        :  operate on vma->mempolicy
+	 *   MPOL_F_NODE        :  change return value of *policy
+	 *
+	 * Split this behavior out here, rather than internal functions,
+	 * so that the internal functions can be re-used by future
+	 * get_mempolicy2 interfaces and the arg structure made extensible
+	 */
+	if (flags & MPOL_F_MEMS_ALLOWED) {
+		if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
+			return -EINVAL;
+		pval = 0;	/* just so it's initialized */
+		err = do_get_mems_allowed(&nodes);
+	} else if (flags & MPOL_F_ADDR) {
+		/* If F_ADDR, we operate on a vma policy (or default) */
+		err = do_get_vma_mempolicy(untagged_addr(addr),
+					   &address_node, &param);
+		if (err)
+			return err;
+		 /* if (F_ADDR | F_NODE), *pval is the address' node */
+		if (flags & MPOL_F_NODE) {
+			/* if we failed to fetch, that's likely an EFAULT */
+			if (address_node < 0)
+				return address_node;
+			pval = address_node;
+		} else
+			pval = param.mode | param.mode_flags;
+	} else {
+		 /* if not F_ADDR and addr != null, EINVAL */
+		if (addr)
+			return -EINVAL;
+
+		err = do_get_task_mempolicy(&param, &pol_node);
+		if (err)
+			return err;
+		/*
+		 * if F_NODE was set and mode was MPOL_INTERLEAVE
+		 * *pval is equal to next interleave node.
+		 *
+		 * if pol_node < 0, this means the mode did not have
+		 * a compatible policy.  This presently emulates the
+		 * original behavior of (F_NODE) & (!MPOL_INTERLEAVE)
+		 * producing -EINVAL
+		 */
+		if (flags & MPOL_F_NODE) {
+			if (pol_node < 0)
+				return pol_node;
+			pval = pol_node;
+		} else
+			pval = param.mode | param.mode_flags;
+	}
 
 	if (err)
 		return err;
-- 
2.39.1




* [PATCH v6 08/12] mm/mempolicy: add userland mempolicy arg structure
  2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (2 preceding siblings ...)
  2024-01-03 22:42 ` [PATCH v6 06/12] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
@ 2024-01-03 22:42 ` Gregory Price
  2024-01-04  0:19   ` Randy Dunlap
  2024-01-03 22:42 ` [PATCH v6 10/12] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Gregory Price @ 2024-01-03 22:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Frank van der Linden

This patch adds the new user-api argument structure intended for
set_mempolicy2 and mbind2.

struct mpol_param {
  __u16 mode;
  __u16 mode_flags;
  __s32 home_node;          /* mbind2: policy home node */
  __u16 pol_maxnodes;
  __u8 resv[6];
  __aligned_u64 pol_nodes;  /* pointer to nodemask */
};

This structure is intended to be extensible as new mempolicy extensions
are added.

For example, set_mempolicy_home_node was added to allow vma mempolicies
to have a preferred/home node assigned.  This structure allows the user
to set the home node at the time the mempolicy is created, rather than
requiring an additional syscall.

Full breakdown of arguments as of this patch:
    mode:         Mempolicy mode (MPOL_DEFAULT, MPOL_INTERLEAVE)

    mode_flags:   Flags previously or'd into mode in set_mempolicy
                  (e.g.: MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES)

    home_node:    for mbind2.  Allows the setting of a policy's home
                  node with the use of MPOL_MF_HOME_NODE

    pol_maxnodes: Max number of nodes in the policy nodemask

    pol_nodes:    Policy nodemask

The reserved field accounts explicitly for a potential memory hole
in the structure.
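
As a usage illustration (a hedged sketch, not part of this patch: the
syscall number 457 is taken from the set_mempolicy2 syscall table
entries elsewhere in this series, and the structure is declared locally
since installed uapi headers will not yet carry it):

/* sketch: set a task interleave policy over nodes 0-1 via set_mempolicy2 */
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct mpol_param {
        uint16_t mode;
        uint16_t mode_flags;
        int32_t  home_node;
        uint16_t pol_maxnodes;
        uint8_t  resv[6];
        uint64_t pol_nodes;     /* pointer to the nodemask, as a u64 */
};

int main(void)
{
        unsigned long nodemask = 0x3;   /* nodes 0 and 1 */
        struct mpol_param p;

        memset(&p, 0, sizeof(p));
        p.mode = MPOL_INTERLEAVE;
        p.pol_maxnodes = 64;            /* bits available in the nodemask */
        p.pol_nodes = (uint64_t)(uintptr_t)&nodemask;

        /* 457 == set_mempolicy2 in the syscall tables added by this series */
        if (syscall(457, &p, sizeof(p), 0) != 0)
                perror("set_mempolicy2");
        return 0;
}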

Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst       | 17 +++++++++++++++++
 include/linux/syscalls.h                        |  1 +
 include/uapi/linux/mempolicy.h                  |  9 +++++++++
 3 files changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index a70f20ce1ffb..cbfc5f65ed77 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -480,6 +480,23 @@ closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.
 
+Extended Mempolicy Arguments::
+
+	struct mpol_param {
+		__u16 mode;
+		__u16 mode_flags;
+		__s32 home_node;	 /* mbind2: set home node */
+		__u64 pol_maxnodes;
+		__aligned_u64 pol_nodes; /* nodemask pointer */
+	};
+
+The extended mempolicy argument structure is defined to allow the mempolicy
+interfaces future extensibility without the need for additional system calls.
+
+The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
+all interfaces relative to their non-extended counterparts. Each additional
+field may only apply to specific extended interfaces.  See the respective
+extended interface man page for more details.
 
 Memory Policy Command Line Interface
 ====================================
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..fb0b4b2b9bea 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -74,6 +74,7 @@ struct landlock_ruleset_attr;
 enum landlock_rule_type;
 struct cachestat_range;
 struct cachestat;
+struct mpol_param;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 1f9bb10d1a47..109788c8be92 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -27,6 +27,15 @@ enum {
 	MPOL_MAX,	/* always last member of enum */
 };
 
+struct mpol_param {
+	__u16 mode;
+	__u16 mode_flags;
+	__s32 home_node;	/* mbind2: policy home node */
+	__u16 pol_maxnodes;
+	__u8 resv[6];
+	__aligned_u64 pol_nodes;
+};
+
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
-- 
2.39.1




* [PATCH v6 10/12] mm/mempolicy: add get_mempolicy2 syscall
  2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (3 preceding siblings ...)
  2024-01-03 22:42 ` [PATCH v6 08/12] mm/mempolicy: add userland mempolicy arg structure Gregory Price
@ 2024-01-03 22:42 ` Gregory Price
  2024-01-03 22:42 ` [PATCH v6 11/12] mm/mempolicy: add the mbind2 syscall Gregory Price
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2024-01-03 22:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko,
	Geert Uytterhoeven

get_mempolicy2 is an extensible get_mempolicy interface which allows
a user to retrieve the memory policy for a task or address.

Defined as:

get_mempolicy2(struct mpol_param *param, size_t size,
               unsigned long addr, unsigned long flags)

Top level input values:

mpol_param:   The structure into which information about the mempolicy
              is collected and returned to userspace.
addr:         if MPOL_F_ADDR is passed in `flags`, this address will be
              used to return the mempolicy details of the vma the
              address belongs to
flags:        if MPOL_F_ADDR, return mempolicy info vma containing addr
              else, returns task mempolicy information

Input values include the following fields of mpol_param:

pol_maxnodes: if pol_nodes is set, must describe max number of nodes
              to be copied to pol_nodes
pol_nodes:    if set, the nodemask of the policy returned here

Output values include the following fields of mpol_param:

mode:         mempolicy mode
mode_flags:   mempolicy mode flags
home_node:    policy home node will be returned here, or -1 if the
              policy has no home node.
pol_nodes:    if set, the nodemask for the mempolicy

MPOL_F_NODE has been dropped from get_mempolicy2 (EINVAL).

MPOL_F_NODE originally allowed for either the ability to acquire
the node the page of `addr` is presently allocated on - or, if
addr is not provided and the policy mode is MPOL_INTERLEAVE - it
would return the node that would be used for the next allocation.

Neither of these capabilities were pulled forward into get_mempolicy2
  a) both are still possible to acquire via get_mempolicy()
  b) both pieces of data are racy by definition and have not
     proven useful.
  c) Should such a use be identified, it can be easily added back
     into mempolicy2 as an extension.

MPOL_F_MEMS_ALLOWED has been dropped from get_mempolicy2 (EINVAL).

MPOL_F_MEMS_ALLOWED originally returned the task->mems_allowed
nodemask in the nodemask parameter instead of the task or vma
nodemask.
  a) this is still accessible from get_mempolicy()
  b) task->mems_allowed is not technically part of mempolicy,
     though it is related.
  c) should this warrant bringing forward (in the scenario
     where get_mempolicy is deprecated), it can be added as
     an explicit extension.  Or more smartly: implemented as
     its own syscall.
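
A hedged usage sketch (not part of this patch: syscall number 458 is
taken from the syscall table updates above, and mpol_param is declared
locally since installed uapi headers will not yet carry it):

/* sketch: query the calling task's mempolicy via get_mempolicy2 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct mpol_param {
        uint16_t mode;
        uint16_t mode_flags;
        int32_t  home_node;
        uint16_t pol_maxnodes;
        uint8_t  resv[6];
        uint64_t pol_nodes;     /* pointer to a nodemask buffer, as a u64 */
};

int main(void)
{
        unsigned long nodemask = 0;
        struct mpol_param p;

        memset(&p, 0, sizeof(p));
        p.pol_maxnodes = 64;    /* bits available in the nodemask buffer */
        p.pol_nodes = (uint64_t)(uintptr_t)&nodemask;

        /* 458 == get_mempolicy2 in the tables above; flags=0 -> task policy */
        if (syscall(458, &p, sizeof(p), 0UL, 0UL) != 0) {
                perror("get_mempolicy2");
                return 1;
        }
        printf("mode=%u mode_flags=%u home_node=%d nodemask=0x%lx\n",
               p.mode, p.mode_flags, p.home_node, nodemask);
        return 0;
}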

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 11 ++++-
 arch/alpha/kernel/syscalls/syscall.tbl        |  1 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/arm64/include/asm/unistd.h               |  2 +-
 arch/arm64/include/asm/unistd32.h             |  2 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  1 +
 arch/s390/kernel/syscalls/syscall.tbl         |  1 +
 arch/sh/kernel/syscalls/syscall.tbl           |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  1 +
 include/linux/syscalls.h                      |  2 +
 include/uapi/asm-generic/unistd.h             |  4 +-
 kernel/sys_ni.c                               |  1 +
 mm/mempolicy.c                                | 42 +++++++++++++++++++
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |  1 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |  1 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |  1 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |  1 +
 25 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 62a4ea14c646..a2ff6e89e48b 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -454,11 +454,20 @@ Get [Task] Memory Policy or Related Information::
 	long get_mempolicy(int *mode,
 			   const unsigned long *nmask, unsigned long maxnode,
 			   void *addr, int flags);
+	long get_mempolicy2(struct mpol_param *param, size_t size,
+			    unsigned long addr, unsigned long flags);
 
 Queries the "task/process memory policy" of the calling task, or the
 policy or location of a specified virtual address, depending on the
 'flags' argument.
 
+get_mempolicy2() is an extended version of get_mempolicy() capable of
+acquiring extended information about a mempolicy, including those
+that can only be set via set_mempolicy2() or mbind2().
+
+MPOL_F_NODE functionality has been removed from get_mempolicy2(),
+but can still be accessed via get_mempolicy().
+
 See the get_mempolicy(2) man page for more details
 
 
@@ -501,7 +510,7 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
-Extended interfaces (set_mempolicy2) use this argument structure.
+Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
 
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0dc288a1118a..0301a8b0a262 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -497,3 +497,4 @@
 565	common	futex_wait			sys_futex_wait
 566	common	futex_requeue			sys_futex_requeue
 567	common	set_mempolicy2			sys_set_mempolicy2
+568	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 50172ec0e1f5..771a33446e8e 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -471,3 +471,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 298313d2e0af..b63f870debaf 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		458
+#define __NR_compat_syscalls		459
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index cee8d669c342..f8d01007aee0 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -921,6 +921,8 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 #define __NR_set_mempolicy2 457
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 458
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 839d90c535f2..048a409e684c 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 567c8b883735..327b01bd6793 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -463,3 +463,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index cc0640e16f2f..921d58e1da23 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -396,3 +396,4 @@
 455	n32	futex_wait			sys_futex_wait
 456	n32	futex_requeue			sys_futex_requeue
 457	n32	set_mempolicy2			sys_set_mempolicy2
+458	n32	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index f7262fde98d9..9271c83c9993 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -445,3 +445,4 @@
 455	o32	futex_wait			sys_futex_wait
 456	o32	futex_requeue			sys_futex_requeue
 457	o32	set_mempolicy2			sys_set_mempolicy2
+458	o32	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index e10f0e8bd064..0654f3f89fc7 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4f03f5f42b78..ac11d2064e7a 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -544,3 +544,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index f98dadc2e9df..1cdcafe1ccca 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
+458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index f47ba9f2d05d..f71742024c29 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 53fb16616728..2fbf5dbe0620 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -503,3 +503,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4b4dc41b24ee..0af813b9a118 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -462,3 +462,4 @@
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
 457	i386	set_mempolicy2		sys_set_mempolicy2
+458	i386	get_mempolicy2		sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1bc2190bec27..0b777876fc15 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -379,6 +379,7 @@
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
 457	common	set_mempolicy2		sys_set_mempolicy2
+458	common	get_mempolicy2		sys_get_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index e26dc89399eb..4536c9a4227d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -428,3 +428,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b37ea6715456..c4dc5069bae7 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -821,6 +821,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned long addr, unsigned long flags);
+asmlinkage long sys_get_mempolicy2(struct mpol_param __user *param, size_t size,
+				   unsigned long addr, unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
 asmlinkage long sys_set_mempolicy2(struct mpol_param __user *param, size_t size,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 55486aba099f..719accc731db 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -830,9 +830,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 #define __NR_set_mempolicy2 457
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 458
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 458
+#define __NR_syscalls 459
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index fa1373c8bff8..6afbd3a41319 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -188,6 +188,7 @@ COND_SYSCALL(process_mrelease);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL(get_mempolicy);
+COND_SYSCALL(get_mempolicy2);
 COND_SYSCALL(set_mempolicy);
 COND_SYSCALL(set_mempolicy2);
 COND_SYSCALL(migrate_pages);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 84d877195deb..0b2e31d8636d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1873,6 +1873,48 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
 	return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags);
 }
 
+SYSCALL_DEFINE4(get_mempolicy2, struct mpol_param __user *, uparam, size_t, usize,
+		unsigned long, addr, unsigned long, flags)
+{
+	struct mpol_param kparam;
+	struct mempolicy_param mparam;
+	int err;
+	nodemask_t policy_nodemask;
+	unsigned long __user *nodes_ptr;
+
+	if (flags & ~(MPOL_F_ADDR))
+		return -EINVAL;
+
+	/* initialize any memory liable to be copied to userland */
+	memset(&mparam, 0, sizeof(mparam));
+
+	err = copy_struct_from_user(&kparam, sizeof(kparam), uparam, usize);
+	if (err)
+		return -EINVAL;
+
+	mparam.policy_nodes = kparam.pol_nodes ? &policy_nodemask : NULL;
+	if (flags & MPOL_F_ADDR)
+		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &mparam);
+	else
+		err = do_get_task_mempolicy(&mparam, NULL);
+
+	if (err)
+		return err;
+
+	kparam.mode = mparam.mode;
+	kparam.mode_flags = mparam.mode_flags;
+	kparam.home_node = mparam.home_node;
+	if (kparam.pol_nodes) {
+		nodes_ptr = u64_to_user_ptr(kparam.pol_nodes);
+		err = copy_nodes_to_user(nodes_ptr, kparam.pol_maxnodes,
+					 mparam.policy_nodes);
+		if (err)
+			return err;
+	}
+
+	return copy_to_user(uparam, &kparam, usize) ? -EFAULT : 0;
+}
+
 bool vma_migratable(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index bb1351df51d9..c34c6877379e 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -372,3 +372,4 @@
 455	n64	futex_wait			sys_futex_wait
 456	n64	futex_requeue			sys_futex_requeue
 457	n64	set_mempolicy2			sys_set_mempolicy2
+458	n64	get_mempolicy2			sys_get_mempolicy2
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 4f03f5f42b78..ac11d2064e7a 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -544,3 +544,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index f98dadc2e9df..1cdcafe1ccca 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
+458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index 21f2579679d4..edf338f32645 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -379,6 +379,7 @@
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
 457	common 	set_mempolicy2		sys_set_mempolicy2
+458	common 	get_mempolicy2		sys_get_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
-- 
2.39.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v6 11/12] mm/mempolicy: add the mbind2 syscall
  2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (4 preceding siblings ...)
  2024-01-03 22:42 ` [PATCH v6 10/12] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
@ 2024-01-03 22:42 ` Gregory Price
  2024-01-03 22:42 ` [PATCH v6 12/12] mm/mempolicy: extend mempolicy2 and mbind2 to support weighted interleave Gregory Price
  2024-01-11  3:41 ` [PATCH v6 00/12] mempolicy2, mbind2, and " Andi Kleen
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2024-01-03 22:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko,
	Frank van der Linden, Geert Uytterhoeven

mbind2 is an extensible mbind interface which allows a user to
set the mempolicy for one or more address ranges.

Defined as:

mbind2(unsigned long addr, unsigned long len, struct mpol_param *param,
       size_t size, unsigned long flags)

addr:         address of the memory range to operate on
len:          length of the memory range
flags:        MPOL_MF_HOME_NODE + original mbind() flags

Input values include the following fields of mpol_param:

mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)
mode_flags:   The MPOL_F_* flags that were previously passed in or'd
	      into the mode.  This was split to give future extensions
	      additional mode/flag space.
home_node:    if (flags & MPOL_MF_HOME_NODE), set the home node of the
	      policy to this value; otherwise it is ignored.
pol_maxnodes: The max number of nodes described by pol_nodes
pol_nodes:    the nodemask to apply for the memory policy

The semantics are otherwise the same as mbind(), except that
the home_node can be set.
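
For illustration, a minimal userspace sketch of calling mbind2 through
raw syscall(2).  This is only a sketch: it assumes the mpol_param layout
from patch 08, the x86_64 syscall number added below, and that no libc
wrapper exists; MPOL_BIND and node 1 are just example choices.

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mbind2
#define __NR_mbind2 459			/* x86_64 number from this patch */
#endif
#define MPOL_BIND	  2		/* existing uapi mode value */
#define MPOL_MF_HOME_NODE (1 << 4)	/* added by this patch */

struct mpol_param {			/* mirrors the proposed uapi struct */
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint16_t pol_maxnodes;
	uint8_t  resv[6];
	uint64_t pol_nodes;		/* user pointer to nodemask bits */
};

/* Bind [addr, addr + len) to node 1 and set its home node in one call. */
static long bind_to_node1(void *addr, unsigned long len)
{
	unsigned long nodemask = 1UL << 1;	/* assumes node 1 is online */
	struct mpol_param p;

	memset(&p, 0, sizeof(p));
	p.mode	       = MPOL_BIND;
	p.home_node    = 1;
	p.pol_maxnodes = 2;
	p.pol_nodes    = (uint64_t)(uintptr_t)&nodemask;

	return syscall(__NR_mbind2, addr, len, &p, sizeof(p),
		       (unsigned long)MPOL_MF_HOME_NODE);
}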

Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Rakie Kim <rakie.kim@sk.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 12 +++++-
 arch/alpha/kernel/syscalls/syscall.tbl        |  1 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/arm64/include/asm/unistd.h               |  2 +-
 arch/arm64/include/asm/unistd32.h             |  2 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  1 +
 arch/s390/kernel/syscalls/syscall.tbl         |  1 +
 arch/sh/kernel/syscalls/syscall.tbl           |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  1 +
 include/linux/syscalls.h                      |  3 ++
 include/uapi/asm-generic/unistd.h             |  4 +-
 include/uapi/linux/mempolicy.h                |  5 ++-
 kernel/sys_ni.c                               |  1 +
 mm/mempolicy.c                                | 43 +++++++++++++++++++
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |  1 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |  1 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |  1 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |  1 +
 26 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index a2ff6e89e48b..66a778d58899 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -476,12 +476,18 @@ Install VMA/Shared Policy for a Range of Task's Address Space::
 	long mbind(void *start, unsigned long len, int mode,
 		   const unsigned long *nmask, unsigned long maxnode,
 		   unsigned flags);
+	long mbind2(void* start, unsigned long len, struct mpol_param *param,
+		    size_t size, unsigned long flags);
 
 mbind() installs the policy specified by (mode, nmask, maxnodes) as a
 VMA policy for the range of the calling task's address space specified
 by the 'start' and 'len' arguments.  Additional actions may be
 requested via the 'flags' argument.
 
+mbind2() is an extended version of mbind() capable of setting extended
+mempolicy features. For example, one can set the home node for the memory
+policy without an additional call to set_mempolicy_home_node().
+
 See the mbind(2) man page for more details.
 
 Set home node for a Range of Task's Address Spacec::
@@ -497,6 +503,9 @@ closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.
 
+mbind2() also provides a way for the home node to be set at the time the
+mempolicy is set. See the mbind(2) man page for more details.
+
 Extended Mempolicy Arguments::
 
 	struct mpol_param {
@@ -510,7 +519,8 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
-Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
+Extended interfaces (set_mempolicy2, get_mempolicy2, and mbind2) use this
+argument structure.
 
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0301a8b0a262..e8239293c35a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -498,3 +498,4 @@
 566	common	futex_requeue			sys_futex_requeue
 567	common	set_mempolicy2			sys_set_mempolicy2
 568	common	get_mempolicy2			sys_get_mempolicy2
+569	common	mbind2				sys_mbind2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 771a33446e8e..a3f39750257a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -472,3 +472,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index b63f870debaf..abe10a833fcd 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		459
+#define __NR_compat_syscalls		460
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index f8d01007aee0..446b7f034332 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -923,6 +923,8 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
 #define __NR_get_mempolicy2 458
 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
+#define __NR_mbind2 459
+__SYSCALL(__NR_mbind2, sys_mbind2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 048a409e684c..9a12dface18e 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -458,3 +458,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 327b01bd6793..6cb740123137 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -464,3 +464,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 921d58e1da23..52cf720f8ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -397,3 +397,4 @@
 456	n32	futex_requeue			sys_futex_requeue
 457	n32	set_mempolicy2			sys_set_mempolicy2
 458	n32	get_mempolicy2			sys_get_mempolicy2
+459	n32	mbind2				sys_mbind2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 9271c83c9993..fd37c5301a48 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -446,3 +446,4 @@
 456	o32	futex_requeue			sys_futex_requeue
 457	o32	set_mempolicy2			sys_set_mempolicy2
 458	o32	get_mempolicy2			sys_get_mempolicy2
+459	o32	mbind2				sys_mbind2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 0654f3f89fc7..fcd67bc405b1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index ac11d2064e7a..89715417014c 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -545,3 +545,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 1cdcafe1ccca..c8304e0d0aa7 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
 458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
+459  common	mbind2			sys_mbind2			sys_mbind2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index f71742024c29..e5c51b6c367f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 2fbf5dbe0620..74527f585500 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -504,3 +504,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 0af813b9a118..be2e2aa17dd8 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -463,3 +463,4 @@
 456	i386	futex_requeue		sys_futex_requeue
 457	i386	set_mempolicy2		sys_set_mempolicy2
 458	i386	get_mempolicy2		sys_get_mempolicy2
+459	i386	mbind2			sys_mbind2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0b777876fc15..6e2347eb8773 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -380,6 +380,7 @@
 456	common	futex_requeue		sys_futex_requeue
 457	common	set_mempolicy2		sys_set_mempolicy2
 458	common	get_mempolicy2		sys_get_mempolicy2
+459	common	mbind2			sys_mbind2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 4536c9a4227d..f00a21317dc0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -429,3 +429,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c4dc5069bae7..02f5c1e94ae5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -817,6 +817,9 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				const unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned flags);
+asmlinkage long sys_mbind2(unsigned long start, unsigned long len,
+			   const struct mpol_param __user *param, size_t usize,
+			   unsigned long flags);
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 719accc731db..cd31599bb9cc 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
 #define __NR_get_mempolicy2 458
 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
+#define __NR_mbind2 459
+__SYSCALL(__NR_mbind2, sys_mbind2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 459
+#define __NR_syscalls 460
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 109788c8be92..7c7c384479fc 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -53,13 +53,14 @@ struct mpol_param {
 #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
 #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
 
-/* Flags for mbind */
+/* Flags for mbind/mbind2 */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
 #define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
 				   to policy */
 #define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
 #define MPOL_MF_LAZY	 (1<<3)	/* UNSUPPORTED FLAG: Lazy migrate on fault */
-#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+#define MPOL_MF_HOME_NODE (1<<4)	/* mbind2: set home node */
+#define MPOL_MF_INTERNAL (1<<5)	/* Internal flags start here */
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6afbd3a41319..2483b5afa99f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -187,6 +187,7 @@ COND_SYSCALL(process_madvise);
 COND_SYSCALL(process_mrelease);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
+COND_SYSCALL(mbind2);
 COND_SYSCALL(get_mempolicy);
 COND_SYSCALL(get_mempolicy2);
 COND_SYSCALL(set_mempolicy);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b2e31d8636d..53301e173c90 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1612,6 +1612,49 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 	return kernel_mbind(start, len, mode, nmask, maxnode, flags);
 }
 
+SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
+		const struct mpol_param __user *, uparam, size_t, usize,
+		unsigned long, flags)
+{
+	struct mpol_param kparam;
+	struct mempolicy_param mparam;
+	nodemask_t policy_nodes;
+	unsigned long __user *nodes_ptr;
+	int err;
+
+	if (!start || !len)
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kparam, sizeof(kparam), uparam, usize);
+	if (err)
+		return -EINVAL;
+
+	err = validate_mpol_flags(kparam.mode, &kparam.mode_flags);
+	if (err)
+		return err;
+
+	mparam.mode = kparam.mode;
+	mparam.mode_flags = kparam.mode_flags;
+
+	/* if home node given, validate it is online */
+	if (flags & MPOL_MF_HOME_NODE) {
+		if ((kparam.home_node >= MAX_NUMNODES) ||
+			!node_online(kparam.home_node))
+			return -EINVAL;
+		mparam.home_node = kparam.home_node;
+	} else
+		mparam.home_node = NUMA_NO_NODE;
+	flags &= ~MPOL_MF_HOME_NODE;
+
+	nodes_ptr = u64_to_user_ptr(kparam.pol_nodes);
+	err = get_nodes(&policy_nodes, nodes_ptr, kparam.pol_maxnodes);
+	if (err)
+		return err;
+	mparam.policy_nodes = &policy_nodes;
+
+	return do_mbind(untagged_addr(start), len, &mparam, flags);
+}
+
 /* Set the process memory policy */
 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 				 unsigned long maxnode)
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index c34c6877379e..4fd9f742d903 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -373,3 +373,4 @@
 456	n64	futex_requeue			sys_futex_requeue
 457	n64	set_mempolicy2			sys_set_mempolicy2
 458	n64	get_mempolicy2			sys_get_mempolicy2
+459	n64	mbind2				sys_mbind2
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index ac11d2064e7a..89715417014c 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -545,3 +545,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index 1cdcafe1ccca..c8304e0d0aa7 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
 458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
+459  common	mbind2			sys_mbind2			sys_mbind2
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index edf338f32645..3fc74241da5d 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -380,6 +380,7 @@
 456	common	futex_requeue		sys_futex_requeue
 457	common 	set_mempolicy2		sys_set_mempolicy2
 458	common 	get_mempolicy2		sys_get_mempolicy2
+459	common 	mbind2			sys_mbind2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
-- 
2.39.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v6 12/12] mm/mempolicy: extend mempolicy2 and mbind2 to support weighted interleave
  2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (5 preceding siblings ...)
  2024-01-03 22:42 ` [PATCH v6 11/12] mm/mempolicy: add the mbind2 syscall Gregory Price
@ 2024-01-03 22:42 ` Gregory Price
  2024-01-11  3:41 ` [PATCH v6 00/12] mempolicy2, mbind2, and " Andi Kleen
  7 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2024-01-03 22:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

Extend mempolicy2 and mbind2 to support weighted interleave, and
demonstrate the extensibility of the mpol_param structure.

To support weighted interleave we add interleave weight fields to the
following structures:

Kernel Internal:  (include/linux/mempolicy.h)
struct mempolicy {
	/* task-local weights to apply to weighted interleave */
	u8 weights[MAX_NUMNODES];
}
struct mempolicy_param {
	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
	u8 *il_weights;	/* of size MAX_NUMNODES */
}

UAPI: (/include/uapi/linux/mempolicy.h)
struct mpol_param {
	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
	__u8 *il_weights; /* of size pol_maxnodes */
}

The minimum weight of a node is always 1.  If the user desires 0
allocations on a node, the node should be removed from the nodemask.

If the user does not provide weights (il_weights == NULL), global
weights will be used during allocation. Changes made to global weights
will be reflected in future allocations.

If the user provides weights and a weight is set to 0, the weight for
that node will be initialized to the global value.

If a user provides weights and a node is not set in the node mask,
the weight for that node will be set to the globally defined weight.
This is so a reasonable default value can be expected if the nodemask
changes (e.g. cgroups causing a migration or a mems_allowed change).

Local weights are never updated when a global weight is updated.

Examples:

global weights: [4,4,2,2]
Set:  Nodes-0,1,2,3  Weights: NULL
      [global weights] are used.

Set:  Nodes-0,1,2,3  Weights: [1,2,3,4]
      local_weights = [1,2,3,4]

Set:  Nodes-0,2      Weights: [2,0,2,0]
      local_weights = [2,4,1,2]

Basic logic during allocation is as follows:

  weight = pol->wil.weights[node]
  /* if no local weight, use sysfs weight */
  if (!weight)
    weight = iw_table[node]
  /* if no sysfs weight, use system default */
  if (!weight)
    weight = default_iw_table[node]

To simplify creation and duplication of mempolicies, the weights are
added as a structure directly within mempolicy. This allows the
existing logic in __mpol_dup to copy the weights without additional
allocations:

if (old == current->mempolicy) {
	task_lock(current);
	*new = *old;
	task_unlock(current);
} else
	*new = *old;
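
For completeness, a userspace sketch of selecting weighted interleave
with task-local weights through set_mempolicy2().  Sketch only: the
struct mirror, the x86_64 syscall number (457, from the earlier
set_mempolicy2 patch), and the MPOL_WEIGHTED_INTERLEAVE value are
assumptions based on other patches in this series.

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_set_mempolicy2
#define __NR_set_mempolicy2 457		/* x86_64 number, earlier patch */
#endif
#define MPOL_WEIGHTED_INTERLEAVE 6	/* assumed value; defined earlier in this series */

struct mpol_param {			/* mirrors the uapi struct + this patch */
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint16_t pol_maxnodes;
	uint8_t  resv[6];
	uint64_t pol_nodes;		/* user pointer to nodemask bits */
	uint64_t il_weights;		/* u8[pol_maxnodes], indexed by node id */
};

/* Weighted interleave across nodes 0 and 2, 2:1 in favour of node 0. */
static long set_weighted_interleave(void)
{
	unsigned long nodemask = (1UL << 0) | (1UL << 2);
	uint8_t weights[3] = { 2, 0, 1 };	/* node 1 unused: 0 -> global */
	struct mpol_param p;

	memset(&p, 0, sizeof(p));
	p.mode	       = MPOL_WEIGHTED_INTERLEAVE;
	p.pol_maxnodes = 3;
	p.pol_nodes    = (uint64_t)(uintptr_t)&nodemask;
	p.il_weights   = (uint64_t)(uintptr_t)weights;

	return syscall(__NR_set_mempolicy2, &p, sizeof(p), 0UL);
}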

Suggested-by: Rakie Kim <rakie.kim@sk.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  12 ++
 include/linux/mempolicy.h                     |   2 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 134 ++++++++++++++++--
 4 files changed, 141 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 66a778d58899..620b54ff2cef 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -254,11 +254,22 @@ MPOL_WEIGHTED_INTERLEAVE
 	This mode operates the same as MPOL_INTERLEAVE, except that
 	interleaving behavior is executed based on weights set in
 	/sys/kernel/mm/mempolicy/weighted_interleave/
+	when configured to utilize global weights, or based on task-local
+	weights configured with set_mempolicy2(2) or mbind2(2).
 
 	Weighted interleave allocates pages on nodes according to a
 	weight.  For example if nodes [0,1] are weighted [5,2], 5 pages
 	will be allocated on node0 for every 2 pages allocated on node1.
 
+	When utilizing task-local weights, if a node is not set in the
+	nodemask, or its weight was set to 0, the local weight will be
+	set to the system default.  Updates to system default weights
+	will not be reflected in local weights.
+
+	The minimum weight for a node set in the policy nodemask is
+	always 1. If no allocations are desired on a node, the node
+	should be removed from the nodemask.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
@@ -514,6 +525,7 @@ Extended Mempolicy Arguments::
 		__s32 home_node;	 /* mbind2: set home node */
 		__u64 pol_maxnodes;
 		__aligned_u64 pol_nodes; /* nodemask pointer */
+		__aligned_u64 il_weights;  /* u8 buf of size pol_maxnodes */
 	};
 
 The extended mempolicy argument structure is defined to allow the mempolicy
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e6795e2d0cc2..9854790a9aac 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -58,6 +58,7 @@ struct mempolicy {
 	/* Weighted interleave settings */
 	struct {
 		u8 cur_weight;
+		u8 weights[MAX_NUMNODES];
 		u8 scratch_weights[MAX_NUMNODES]; /* Used to avoid allocations */
 	} wil;
 };
@@ -71,6 +72,7 @@ struct mempolicy_param {
 	unsigned short mode_flags;	/* policy mode flags */
 	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
 	nodemask_t *policy_nodes;	/* get/set/mbind */
+	u8 *il_weights;			/* for mode MPOL_WEIGHTED_INTERLEAVE */
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 7c7c384479fc..06e0fc2bb29b 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -34,6 +34,7 @@ struct mpol_param {
 	__u16 pol_maxnodes;
 	__u8 resv[6];
 	__aligned_u64 pol_nodes;
+	__aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(__u8) */
 };
 
 /* Flags for set_mempolicy */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 53301e173c90..78e7614e0cd4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -286,6 +286,7 @@ static struct mempolicy *mpol_new(struct mempolicy_param *param)
 	unsigned short mode = param->mode;
 	unsigned short flags = param->mode_flags;
 	nodemask_t *nodes = param->policy_nodes;
+	int node;
 
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
@@ -323,6 +324,23 @@ static struct mempolicy *mpol_new(struct mempolicy_param *param)
 	policy->flags = flags;
 	policy->home_node = param->home_node;
 	policy->wil.cur_weight = 0;
+	memset(policy->wil.weights, 0, MAX_NUMNODES);
+
+	/* If user provides weights, ensure all weights are set to something */
+	if (policy->mode == MPOL_WEIGHTED_INTERLEAVE && param->il_weights) {
+		for (node = 0; node < MAX_NUMNODES; node++) {
+			u8 weight = 0;
+
+			if (node_isset(node, *nodes))
+				weight = param->il_weights[node];
+			/* If a user sets a weight to 0, use global default */
+			if (!weight)
+				weight = iw_table[node];
+			if (!weight)
+				weight = default_iw_table[node];
+			policy->wil.weights[node] = weight;
+		}
+	}
 
 	return policy;
 }
@@ -952,6 +970,26 @@ static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *nmask)
 	}
 }
 
+static void do_get_mempolicy_il_weights(struct mempolicy *pol,
+					u8 weights[MAX_NUMNODES])
+{
+	int i = 0;
+
+	if (pol->mode != MPOL_WEIGHTED_INTERLEAVE) {
+		memset(weights, 0, MAX_NUMNODES);
+		return;
+	}
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		u8 weight = pol->wil.weights[i];
+
+		if (!weight)
+			weight = iw_table[i];
+		if (!weight)
+			weight = default_iw_table[i];
+		weights[i] = weight;
+	}
+}
+
 /* Retrieve NUMA policy for a VMA assocated with a given address  */
 static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
 				 struct mempolicy_param *param)
@@ -985,6 +1023,9 @@ static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
 	if (param->policy_nodes)
 		do_get_mempolicy_nodemask(pol, param->policy_nodes);
 
+	if (param->il_weights)
+		do_get_mempolicy_il_weights(pol, param->il_weights);
+
 	if (pol != &default_policy) {
 		mpol_put(pol);
 		mpol_cond_put(pol);
@@ -1012,6 +1053,9 @@ static long do_get_task_mempolicy(struct mempolicy_param *param, int *pol_node)
 	if (param->policy_nodes)
 		do_get_mempolicy_nodemask(pol, param->policy_nodes);
 
+	if (param->il_weights)
+		do_get_mempolicy_il_weights(pol, param->il_weights);
+
 	return 0;
 }
 
@@ -1620,6 +1664,8 @@ SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
 	struct mempolicy_param mparam;
 	nodemask_t policy_nodes;
 	unsigned long __user *nodes_ptr;
+	u8 *weights = NULL;
+	u8 __user *weights_ptr;
 	int err;
 
 	if (!start || !len)
@@ -1652,7 +1698,27 @@ SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
 		return err;
 	mparam.policy_nodes = &policy_nodes;
 
-	return do_mbind(untagged_addr(start), len, &mparam, flags);
+	if (kparam.mode == MPOL_WEIGHTED_INTERLEAVE) {
+		weights_ptr = u64_to_user_ptr(kparam.il_weights);
+		if (weights_ptr) {
+			weights = kzalloc(MAX_NUMNODES,
+					  GFP_KERNEL | __GFP_NORETRY);
+			if (!weights)
+				return -ENOMEM;
+			err = copy_struct_from_user(weights,
+						    MAX_NUMNODES,
+						    weights_ptr,
+						    kparam.pol_maxnodes);
+			if (err)
+				goto leave_weights;
+		}
+	}
+	mparam.il_weights = weights;
+
+	err = do_mbind(untagged_addr(start), len, &mparam, flags);
+leave_weights:
+	kfree(weights);
+	return err;
 }
 
 /* Set the process memory policy */
@@ -1696,6 +1762,8 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_param __user *, uparam,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	u8 *weights = NULL;
+	u8 __user *weights_ptr;
 
 	if (flags)
 		return -EINVAL;
@@ -1721,7 +1789,24 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_param __user *, uparam,
 	} else
 		mparam.policy_nodes = NULL;
 
-	return do_set_mempolicy(&mparam);
+	if (kparam.mode == MPOL_WEIGHTED_INTERLEAVE && kparam.il_weights) {
+		weights = kzalloc(MAX_NUMNODES, GFP_KERNEL | __GFP_NORETRY);
+		if (!weights)
+			return -ENOMEM;
+		weights_ptr = u64_to_user_ptr(kparam.il_weights);
+		err = copy_struct_from_user(weights,
+					    MAX_NUMNODES,
+					    weights_ptr,
+					    kparam.pol_maxnodes);
+		if (err)
+			goto leave_weights;
+	}
+	mparam.il_weights = weights;
+
+	err = do_set_mempolicy(&mparam);
+leave_weights:
+	kfree(weights);
+	return err;
 }
 
 static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
@@ -1924,6 +2009,8 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_param __user *, uparam, size_t, usiz
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	u8 __user *weights_ptr;
+	u8 *weights = NULL;
 
 	if (flags & ~(MPOL_F_ADDR))
 		return -EINVAL;
@@ -1935,6 +2022,13 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_param __user *, uparam, size_t, usiz
 	if (err)
 		return -EINVAL;
 
+	if (kparam.il_weights) {
+		weights = kzalloc(MAX_NUMNODES, GFP_KERNEL | __GFP_NORETRY);
+		if (!weights)
+			return -ENOMEM;
+	}
+	mparam.il_weights = weights;
+
 	mparam.policy_nodes = kparam.pol_nodes ? &policy_nodemask : NULL;
 	if (flags & MPOL_F_ADDR)
 		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &mparam);
@@ -1942,7 +2036,7 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_param __user *, uparam, size_t, usiz
 		err = do_get_task_mempolicy(&mparam, NULL);
 
 	if (err)
-		return err;
+		goto leave_weights;
 
 	kparam.mode = mparam.mode;
 	kparam.mode_flags = mparam.mode_flags;
@@ -1952,10 +2046,21 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_param __user *, uparam, size_t, usiz
 		err = copy_nodes_to_user(nodes_ptr, kparam.pol_maxnodes,
 					 mparam.policy_nodes);
 		if (err)
-			return err;
+			goto leave_weights;
+	}
+
+	if (kparam.mode == MPOL_WEIGHTED_INTERLEAVE && kparam.il_weights) {
+		weights_ptr = u64_to_user_ptr(kparam.il_weights);
+		if (copy_to_user(weights_ptr, weights, kparam.pol_maxnodes)) {
+			err = -EFAULT;
+			goto leave_weights;
+		}
 	}
 
-	return copy_to_user(uparam, &kparam, usize) ? -EFAULT : 0;
+	err = copy_to_user(uparam, &kparam, usize) ? -EFAULT : 0;
+leave_weights:
+	kfree(weights);
+	return err;
 }
 
 bool vma_migratable(struct vm_area_struct *vma)
@@ -2077,8 +2182,10 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 		return next;
 
 	if (!policy->wil.cur_weight) {
-		u8 next_weight = iw_table[next];
+		u8 next_weight = policy->wil.weights[next];
 
+		if (!next_weight)
+			next_weight = iw_table[next];
 		if (!next_weight)
 			next_weight = default_iw_table[next];
 		policy->wil.cur_weight = next_weight;
@@ -2175,8 +2282,10 @@ static unsigned int read_once_interleave_weights(struct mempolicy *pol,
 	/* Similar issue to read_once_policy_nodemask */
 	barrier();
 	for_each_node_mask(nid, *mask) {
-		u8 weight = iw_table[nid];
+		u8 weight = pol->wil.weights[nid];
 
+		if (!weight)
+			weight = iw_table[nid];
 		if (!weight)
 			weight = default_iw_table[nid];
 		weight_total += weight;
@@ -3115,21 +3224,28 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 	if (mpol) {
 		struct sp_node *sn;
 		struct mempolicy *npol;
+		u8 *weights = NULL;
 		NODEMASK_SCRATCH(scratch);
 
 		if (!scratch)
 			goto put_mpol;
 
+		weights = kzalloc(MAX_NUMNODES, GFP_KERNEL | __GFP_NORETRY);
+		if (!weights)
+			goto free_scratch;
+		memcpy(weights, mpol->wil.weights, MAX_NUMNODES);
+
 		memset(&mparam, 0, sizeof(mparam));
 		mparam.mode = mpol->mode;
 		mparam.mode_flags = mpol->flags;
 		mparam.policy_nodes = &mpol->w.user_nodemask;
 		mparam.home_node = NUMA_NO_NODE;
+		mparam.il_weights = weights;
 
 		/* contextualize the tmpfs mount point mempolicy to this file */
 		npol = mpol_new(&mparam);
 		if (IS_ERR(npol))
-			goto free_scratch; /* no valid nodemask intersection */
+			goto free_weights; /* no valid nodemask intersection */
 
 		task_lock(current);
 		ret = mpol_set_nodemask(npol, &mpol->w.user_nodemask, scratch);
@@ -3143,6 +3259,8 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 			sp_insert(sp, sn);
 put_npol:
 		mpol_put(npol);	/* drop initial ref on file's npol */
+free_weights:
+		kfree(weights);
 free_scratch:
 		NODEMASK_SCRATCH_FREE(scratch);
 put_mpol:
-- 
2.39.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 08/12] mm/mempolicy: add userland mempolicy arg structure
  2024-01-03 22:42 ` [PATCH v6 08/12] mm/mempolicy: add userland mempolicy arg structure Gregory Price
@ 2024-01-04  0:19   ` Randy Dunlap
  2024-01-04  2:03     ` Gregory Price
  0 siblings, 1 reply; 12+ messages in thread
From: Randy Dunlap @ 2024-01-04  0:19 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, linux-arch,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Frank van der Linden

Hi,

On 1/3/24 14:42, Gregory Price wrote:
> This patch adds the new user-api argument structure intended for
> set_mempolicy2 and mbind2.
> 
> struct mpol_param {
>   __u16 mode;
>   __u16 mode_flags;
>   __s32 home_node;          /* mbind2: policy home node */
>   __u16 pol_maxnodes;
>   __u8 resv[6];
>   __aligned_u64 *pol_nodes;
> };
> 
> This structure is intended to be extensible as new mempolicy extensions
> are added.
> 
> For example, set_mempolicy_home_node was added to allow vma mempolicies
> to have a preferred/home node assigned.  This structure allows the user
> to set the home node at the time mempolicy is created, rather than
> requiring an additional syscalls.
> 
> Full breakdown of arguments as of this patch:
>     mode:         Mempolicy mode (MPOL_DEFAULT, MPOL_INTERLEAVE)
> 
>     mode_flags:   Flags previously or'd into mode in set_mempolicy
>                   (e.g.: MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES)
> 
>     home_node:    for mbind2.  Allows the setting of a policy's home
>                   with the use of MPOL_MF_HOME_NODE
> 
>     pol_maxnodes: Max number of nodes in the policy nodemask
> 
>     pol_nodes:    Policy nodemask
> 
> The reserved field accounts explicitly for a potential memory hole
> in the structure.
> 
> Suggested-by: Frank van der Linden <fvdl@google.com>
> Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst       | 17 +++++++++++++++++
>  include/linux/syscalls.h                        |  1 +
>  include/uapi/linux/mempolicy.h                  |  9 +++++++++
>  3 files changed, 27 insertions(+)
> 
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index a70f20ce1ffb..cbfc5f65ed77 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -480,6 +480,23 @@ closest to which page allocation will come from. Specifying the home node overri
>  the default allocation policy to allocate memory close to the local node for an
>  executing CPU.
>  
> +Extended Mempolicy Arguments::
> +
> +	struct mpol_param {
> +		__u16 mode;
> +		__u16 mode_flags;
> +		__s32 home_node;	 /* mbind2: set home node */
> +		__u64 pol_maxnodes;
> +		__aligned_u64 pol_nodes; /* nodemask pointer */
> +	};
>

Can you make the above documentation struct agree with the
struct in the header below, please?
(just a difference in the size of pol_maxnodes and the
'resv' bytes)


> +The extended mempolicy argument structure is defined to allow the mempolicy
> +interfaces future extensibility without the need for additional system calls.
> +
> +The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
> +all interfaces relative to their non-extended counterparts. Each additional
> +field may only apply to specific extended interfaces.  See the respective
> +extended interface man page for more details.
>  
>  Memory Policy Command Line Interface
>  ====================================


> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 1f9bb10d1a47..109788c8be92 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -27,6 +27,15 @@ enum {
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> +struct mpol_param {
> +	__u16 mode;
> +	__u16 mode_flags;
> +	__s32 home_node;	/* mbind2: policy home node */
> +	__u16 pol_maxnodes;
> +	__u8 resv[6];
> +	__aligned_u64 pol_nodes;
> +};
> +
>  /* Flags for set_mempolicy */
>  #define MPOL_F_STATIC_NODES	(1 << 15)
>  #define MPOL_F_RELATIVE_NODES	(1 << 14)

-- 
#Randy


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 08/12] mm/mempolicy: add userland mempolicy arg structure
  2024-01-04  0:19   ` Randy Dunlap
@ 2024-01-04  2:03     ` Gregory Price
  2024-01-10 20:11       ` [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Price @ 2024-01-04  2:03 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, linux-arch, akpm, arnd, tglx, luto, mingo, bp,
	dave.hansen, hpa, mhocko, tj, ying.huang, corbet, rakie.kim,
	hyeongtak.ji, honggyu.kim, vtavarespetr, peterz, jgroves,
	ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Frank van der Linden

On Wed, Jan 03, 2024 at 04:19:03PM -0800, Randy Dunlap wrote:
> Hi,
> 
> On 1/3/24 14:42, Gregory Price wrote:
> > This patch adds the new user-api argument structure intended for
> > set_mempolicy2 and mbind2.
> > 
> > struct mpol_param {
> >   __u16 mode;
> >   __u16 mode_flags;
> >   __s32 home_node;          /* mbind2: policy home node */
> >   __u16 pol_maxnodes;
> >   __u8 resv[6];
> >   __aligned_u64 *pol_nodes;
> > };
> > 
> >  
> > +Extended Mempolicy Arguments::
> > +
> > +	struct mpol_param {
> > +		__u16 mode;
> > +		__u16 mode_flags;
> > +		__s32 home_node;	 /* mbind2: set home node */
> > +		__u64 pol_maxnodes;
> > +		__aligned_u64 pol_nodes; /* nodemask pointer */
> > +	};
> >
> 
> Can you make the above documentation struct agree with the
> struct in the header below, please?
> (just a difference in the size of pol_maxnodes and the
> 'resv' bytes)
> 
> 

*facepalm* made a note to double check this, and then still didn't.

Thank you for reviewing.  Will fix in the next pass of feedback.

~Gregory


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave
  2024-01-04  2:03     ` Gregory Price
@ 2024-01-10 20:11       ` Gregory Price
  0 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2024-01-10 20:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, akpm, linux-kernel

On Wed, Jan 10, 2024 at 02:11:39PM -0500, Andi Kleen wrote:
> > Weighted interleave is a new interleave policy intended to make use
> > of heterogeneous memory environments appearing with CXL.
> >
> > To implement weighted interleave with task-local weights, we need
> > new syscalls capable of passing a weight array. This is the
> > justification for mempolicy2/mbind2 - which are designed to be
> > extensible to capture future policies as well.
>
> I might be late to the party here, but it's not clear to me you really
> need the redesigned system calls. set_mempolicy has one argument left
> so it can be enhanced with a new pointer depending on a bit in mode.
> For mbind() it already uses all arguments, but it has a flags argument.
>
> But it's unclear to me if a fully flexible weight array is really
> needed here anyway. Can some common combinations be encoded in flags instead?
> I assume it's mainly about some nodes getting preference depending on
> some attribute.
>

(apologies for the re-send, I had a delivery failure on the list, want
to make sure it gets captured).

This is actually something I haven't written out in the RFC that I
probably should have: I'm also trying to make it so that a mempolicy
can be described in its entirety with a single syscall.

This cannot be done with the existing interfaces.
(see: the existence of set_mempolicy_home_node).
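
To make that concrete, the difference looks roughly like this (sketch
only, not code from the series):

	/* today: two syscalls, and the policy is visible half-configured */
	mbind(addr, len, MPOL_PREFERRED_MANY, &nodes, maxnode, 0);
	set_mempolicy_home_node((unsigned long)addr, len, home_node, 0);

	/* with mbind2: the entire description travels in one call */
	param.mode         = MPOL_PREFERRED_MANY;
	param.home_node    = home_node;
	param.pol_maxnodes = maxnode;
	param.pol_nodes    = (__u64)&nodes;
	mbind2(addr, len, &param, sizeof(param), MPOL_MF_HOME_NODE);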

Likewise you cannot fetch the entire mempolicy configuration with a
single get_mempolicy() call.  Certainly if task-local weights exist,
there's no room left to add that on either.

You'd really like to know that the policy you set is not changed between
calls to multiple interfaces. Right now, if you want to twiddle bits in
the mempolicy (like home_node), the syscall does: (*new = *old)+change.

That's... not great.

So I did consider extending set_mempolicy() to allow you to twiddle the
weight of a given node, but I was considering another proposal in the
process: process_set_mempolicy and process_mbind.

These interfaces were proposed to allow mempolicy to be changed by an
external task. (This is presently not possible).

That's basically how we got to this current iteration.


Re: fully flexible weight array in the task

At the end of the day, this is really about bandwidth distribution. For
a reasonable set of workloads, the system-global settings should be
sufficient.  However, it was at least recommended that we also explore
task-local weights along the way - so here we are.

I'm certainly open to changing this, or even just dropping the task-local
weight requirement entirely, but I did want to consider the other issues
(above) in the process so we don't design ourselves into a corner if we
have to go there anyway.

> So if you add such an attribute, perhaps configurable in sysfs, and
> then have flags like give weight + 1 on attribute, give weight + 2 on
> attribute, give weight + 4 on attribute. If more are needed, there are more bits.
> That would be a much more compact and simpler interface.
>
> For set_mempolicy either add a flags argument or encode in mode.
>
> It would also shrink the whole patchkit dramatically and be less risky.

I'm certainly not against this idea.  It is less flexible, but it does make
it simpler.  Another limitation, however, is that you have to make multiple
syscalls if you want to change the weights of multiple nodes.

I wanted to avoid that kind of ambiguity.

If we don't think the external changing interfaces are realistic, then this
is all moot, I'm down for whatever design is feasible.

>
> You perhaps underestimate the cost and risk of designing complex
> kernel interfaces, it all has to be hardened, audited, fuzzed, deployed, etc.
>

Definitely not under any crazy impression that something like this is a
quick process.  Just iterating as I go and gathering feedback (I think
we're on version 4 or 5 of this idea, and v7 of this patch series :]).

I fully expect the initial chunk of (MPOL_WEIGHTED_INTERLEAVE + sysfs)
will be a merge candidate long before the task-local weights will be,
if only because, as you said, it's a much simpler extension and less risky.

I appreciate the feedback, happy to keep hacking away,
~Gregory



* Re: [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave
  2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (6 preceding siblings ...)
  2024-01-03 22:42 ` [PATCH v6 12/12] mm/mempolicy: extend mempolicy2 and mbind2 to support weighted interleave Gregory Price
@ 2024-01-11  3:41 ` Andi Kleen
  7 siblings, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2024-01-11  3:41 UTC (permalink / raw)
  To: Gregory Price; +Cc: linux-mm, akpm



[Sorry for the resend; fixed the mailing list address.]

Gregory Price <gourry.memverge-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
writes:

> Weighted interleave is a new interleave policy intended to make
> use of heterogeneous memory environments appearing with CXL.
>
> To implement weighted interleave with task-local weights, we
> need new syscalls capable of passing a weight array. This is
> the justification for mempolicy2/mbind2 - which are designed
> to be extensible to capture future policies as well.

I might be late to the party here, but it's not clear to me you really
need the redesigned system calls. set_mempolicy has one argument left
so it can be enhanced with a new pointer depending on a bit in mode.
For mbind() it already uses all arguments, but it has a flags argument.

But it's unclear to me if a fully flexible weight array is really needed
here anyways. Can some common combinations be encoded in flags instead?
I assume it's mainly about some nodes getting preference depending on some attribute.

So if you add such an attribute, perhaps configurable in sysfs, and then
have flags like give weight + 1 on attribute, give weight + 2 on attribute,
give weight + 4 on attribute. If more are needed there are more bits.
That would be a much more compact and simpler interface.

For set_mempolicy either add a flags argument or encode in mode.

It would also shrink the whole patchkit dramatically and be less risky.

You perhaps underestimate the cost and risk of designing complex kernel
interfaces; it all has to be hardened, audited, fuzzed, deployed, etc.

-Andi




end of thread, other threads:[~2024-01-11  3:41 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-03 22:41 [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
2024-01-03 22:41 ` [PATCH v6 01/12] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
2024-01-03 22:41 ` [PATCH v6 02/12] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price
2024-01-03 22:42 ` [PATCH v6 06/12] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
2024-01-03 22:42 ` [PATCH v6 08/12] mm/mempolicy: add userland mempolicy arg structure Gregory Price
2024-01-04  0:19   ` Randy Dunlap
2024-01-04  2:03     ` Gregory Price
2024-01-10 20:11       ` [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave Gregory Price
2024-01-03 22:42 ` [PATCH v6 10/12] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
2024-01-03 22:42 ` [PATCH v6 11/12] mm/mempolicy: add the mbind2 syscall Gregory Price
2024-01-03 22:42 ` [PATCH v6 12/12] mm/mempolicy: extend mempolicy2 and mbind2 to support weighted interleave Gregory Price
2024-01-11  3:41 ` [PATCH v6 00/12] mempolicy2, mbind2, and " Andi Kleen
