linux-mm.kvack.org archive mirror
* [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement
@ 2024-10-16 19:24 Gregory Price
  2024-10-16 19:24 ` [PATCH v2 1/3] mm/memblock: implement memblock_advise_size_order and probe functions Gregory Price
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-16 19:24 UTC (permalink / raw)
  To: x86, linux-kernel, linux-acpi, linux-mm
  Cc: dan.j.williams, ira.weiny, david, dave.hansen, luto, peterz,
	tglx, mingo, bp, hpa, rafael, lenb, rppt, akpm, alison.schofield,
	Jonathan.Cameron, rrichter, ytcoode, haibo1.xu, dave.jiang

When physical address regions are not aligned to memory block size,
the misaligned portion is lost (stranded capacity).

Block size (min/max/selected) is architecture defined. Most architectures
tend to use the minimum block size or some simplistic heuristic. On x86,
memory block size increases up to 2GB, and is otherwise fitted to the
alignment of non-hotplug (special purpose memory).

CXL exposes its memory for management through the ACPI CEDT (CXL Early
Detection Table) in a field called the CXL Fixed Memory Window.  Per
the CXL specification, this memory must be aligned to at least 256MB.

When a CFMW aligns on a size less than the block size, this causes a
loss of up to 2GB per CFMW on x86.  It is not uncommon for CFMW to be
allocated per-device - though this behavior is BIOS defined.
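
To make the loss concrete, here is a rough user-space sketch (not kernel
code) that computes how much of a window cannot be covered by whole,
block-aligned memory blocks.  The helper names and the example addresses
are made up for illustration only.

```c
#include <stdint.h>

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Round x down to a multiple of align (align must be a power of two). */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

/*
 * Bytes of [base, base + size) that cannot be covered by whole,
 * block-aligned memory blocks of block_size bytes.
 */
static uint64_t stranded_bytes(uint64_t base, uint64_t size,
			       uint64_t block_size)
{
	uint64_t first = round_up_pow2(base, block_size);
	uint64_t last = round_down_pow2(base + size, block_size);

	if (last <= first)
		return size;	/* not even one whole block fits */
	return size - (last - first);
}
```

For a hypothetical 4GB window starting at 4GB + 256MB, a 2GB block size
strands 1.75GB at the head and 0.25GB at the tail, while a 256MB block
size strands nothing.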

This patch set provides 3 things:
 1) implement advise/probe functions in mm/memblock.c to report/probe
    architecture agnostic hotplug memory alignment advice.
 2) update x86 memblock size logic to consider the hotplug advice
 3) add code in acpi/numa/srat.c to report CFMW alignment advice

The advisement interfaces are designed to be called during arch_init
code prior to allocator and smp_init.  start_kernel will call these
through setup_arch() (via acpi and mm/init_64.c on x86), which occurs
prior to mm_core_init and smp_init - so no need for atomics.

There's an attempt to signal callers to advise() that probe has already
occurred, but this is predicated on the notion that probe() actually
occurs (which presently only happens on x86). This is to assist debugging
future users who may mistakenly call this after allocator or smp init.

Likewise, if probe() occurs more than once, we return -EBUSY to prevent
inconsistent values from being reported - i.e. this interaction should
happen exactly once, and all other behavior is an error / the probed
value should be acquired via memory_block_size_bytes() instead.
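
The intended call sequence can be modelled in user space roughly as
follows.  This is only a sketch of the semantics described above: the
names are shortened stand-ins, and a plain read-then-write replaces any
atomic operation since everything is assumed to run single-threaded.

```c
#include <errno.h>

#define SZO_PROBED	(-1)

/* 0: no advice yet, >0: smallest advised order, SZO_PROBED: already probed */
static int sz_order;

/* Record advice; only ever lowers the stored order. */
static int advise_size_order(int order)
{
	if (order <= 0)
		return -EINVAL;
	if (sz_order == SZO_PROBED)
		return -EBUSY;
	if (sz_order == 0 || order < sz_order)
		sz_order = order;
	return 0;
}

/* Fetch the advice exactly once; later calls get -EBUSY. */
static int probe_size_order(void)
{
	int rv = sz_order;

	sz_order = SZO_PROBED;
	return (rv == SZO_PROBED) ? -EBUSY : rv;
}
```

Several advisers may call advise_size_order() before the single probe;
the smallest order wins, and any advise or probe after that point fails
with -EBUSY.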

Suggested-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>

Gregory Price (3):
  mm/memblock: implement memblock_advise_size_order and probe functions
  x86: probe memblock size advisement value during mm init
  acpi,srat: reduce memory block size if CFMWS has a smaller alignment

 arch/x86/mm/init_64.c    | 16 +++++++++++++
 drivers/acpi/numa/srat.c | 42 ++++++++++++++++++++++++++++++++++
 include/linux/memblock.h |  2 ++
 mm/memblock.c            | 49 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 109 insertions(+)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/3] mm/memblock: implement memblock_advise_size_order and probe functions
  2024-10-16 19:24 [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement Gregory Price
@ 2024-10-16 19:24 ` Gregory Price
  2024-10-20  8:57   ` Mike Rapoport
  2024-10-16 19:24 ` [PATCH v2 2/3] x86: probe memblock size advisement value during mm init Gregory Price
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Gregory Price @ 2024-10-16 19:24 UTC (permalink / raw)
  To: x86, linux-kernel, linux-acpi, linux-mm
  Cc: dan.j.williams, ira.weiny, david, dave.hansen, luto, peterz,
	tglx, mingo, bp, hpa, rafael, lenb, rppt, akpm, alison.schofield,
	Jonathan.Cameron, rrichter, ytcoode, haibo1.xu, dave.jiang

Hotplug memory sources may have opinions on what the memblock size
should be - usually for alignment purposes.  For example, CXL memory
extents can be as small as 256MB with a matching physical alignment.

Implement memblock_advise_size_order for use during early init, prior
to allocator and smp init, for software to advise the system as to what
the preferred block size should be.

The probe function is meant for arch_init code to fetch this value
once during memblock size calculation. Use of the advisement value
is arch-specific, and no guarantee is made that it will be used.

Calls to either function after probe result in -EBUSY to signal that
advisement is ignored or that memory_block_size_bytes() should be used.

Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/memblock.h |  2 ++
 mm/memblock.c            | 49 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index fc4d75c6cec3..efb1f7cfbd58 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -111,6 +111,8 @@ static inline void memblock_discard(void) {}
 #endif
 
 void memblock_allow_resize(void);
+int memblock_advise_size_order(int order);
+int memblock_probe_size_order(void);
 int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
 		      enum memblock_flags flags);
 int memblock_add(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 3b9dc2d89b8a..e0bdba011564 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2009,6 +2009,55 @@ void __init memblock_allow_resize(void)
 	memblock_can_resize = 1;
 }
 
+/*
+ * @order: bit-order describing the preferred minimum block size
+ *
+ * Intended for use by early-boot software prior to smp and allocator init to
+ * advise the architecture what the minimum block size should be. Should only
+ * be called during arch init before allocator and smp init.
+ *
+ * This value can only decrease after it has been initially set, the intention
+ * is to identify the smallest supported alignment across all opinions.
+ *
+ * Use of this advisement value is arch-specific.
+ *
+ * Returns: 0 on success, -EINVAL if order is <=0, and -EBUSY if already probed
+ */
+static int memblock_sz_order;
+#define MEMBLOCK_SZO_PROBED (-1)
+int memblock_advise_size_order(int order)
+{
+	if (order <= 0)
+		return -EINVAL;
+
+	if (memblock_sz_order == MEMBLOCK_SZO_PROBED)
+		return -EBUSY;
+
+	if (memblock_sz_order)
+		memblock_sz_order = min(order, memblock_sz_order);
+	else
+		memblock_sz_order = order;
+
+	return 0;
+}
+
+/*
+ * memblock_probe_size_order is intended for arch init code to probe one time,
+ * for a suggested memory block size.  After the first call, the result will
+ * always be -EBUSY. A late user should call memory_block_size_bytes instead to
+ * determine the actual block size in use.
+ *
+ * Should only be called during arch init prior to allocator and smp init.
+ *
+ * Returns: block size order, 0 if never set, or -EBUSY if previously probed.
+ */
+int memblock_probe_size_order(void)
+{
+	int rv = xchg(&memblock_sz_order, -1);
+
+	return (rv == -1) ? -EBUSY : rv;
+}
+
 static int __init early_memblock(char *p)
 {
 	if (p && strstr(p, "debug"))
-- 
2.43.0




* [PATCH v2 2/3] x86: probe memblock size advisement value during mm init
  2024-10-16 19:24 [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement Gregory Price
  2024-10-16 19:24 ` [PATCH v2 1/3] mm/memblock: implement memblock_advise_size_order and probe functions Gregory Price
@ 2024-10-16 19:24 ` Gregory Price
  2024-10-21 11:12   ` David Hildenbrand
  2024-10-16 19:24 ` [PATCH v2 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment Gregory Price
  2024-10-21  9:51 ` [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement David Hildenbrand
  3 siblings, 1 reply; 13+ messages in thread
From: Gregory Price @ 2024-10-16 19:24 UTC (permalink / raw)
  To: x86, linux-kernel, linux-acpi, linux-mm
  Cc: dan.j.williams, ira.weiny, david, dave.hansen, luto, peterz,
	tglx, mingo, bp, hpa, rafael, lenb, rppt, akpm, alison.schofield,
	Jonathan.Cameron, rrichter, ytcoode, haibo1.xu, dave.jiang

Systems with hotplug may provide an advisement value on what the
memblock size should be.  Probe this value when the rest of the
configuration values are considered.

The new heuristic is as follows:

1) set_memory_block_size_order value if already set (cmdline param)
2) minimum block size if memory is less than large block limit
3) [new] hotplug advice: lesser of advised value or memory alignment
4) Max block size if system is bare-metal
5) Largest size that aligns to end of memory.
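
The ordering above can be sketched as a stand-alone user-space function.
This is an approximation for illustration, not the x86 code itself: the
function and macro names are made up, the inputs are passed as parameters
instead of being read from kernel state, and the 128MB minimum and 64GB
large-block limit are assumed values.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_BLOCK	(128ULL << 20)	/* assumed minimum block size */
#define MAX_BLOCK	(2ULL << 30)	/* 2GB cap */
#define LARGE_LIMIT	(64ULL << 30)	/* assumed large block limit */

/* Halve bz until boot_mem_end is aligned to it, or the minimum is hit. */
static uint64_t fit_to_end(uint64_t bz, uint64_t boot_mem_end)
{
	for (; bz > MIN_BLOCK; bz >>= 1) {
		if ((boot_mem_end & (bz - 1)) == 0)
			break;
	}
	return bz;
}

static uint64_t pick_block_size(uint64_t cmdline_bz, uint64_t boot_mem_end,
				int advised_order, bool bare_metal)
{
	uint64_t bz;

	if (cmdline_bz)			/* 1) explicit cmdline value */
		return cmdline_bz;
	if (boot_mem_end < LARGE_LIMIT)	/* 2) small system */
		return MIN_BLOCK;
	if (advised_order > 0) {	/* 3) hotplug advice */
		bz = advised_order >= 63 ? MAX_BLOCK : 1ULL << advised_order;
		if (bz > MAX_BLOCK)
			bz = MAX_BLOCK;
		if (bz < MIN_BLOCK)
			bz = MIN_BLOCK;
		return fit_to_end(bz, boot_mem_end);
	}
	if (bare_metal)			/* 4) bare metal */
		return MAX_BLOCK;
	return fit_to_end(MAX_BLOCK, boot_mem_end);	/* 5) best fit */
}
```

With 128GB of boot memory and an advised order of 28, the sketch picks
256MB even on bare metal, where it would otherwise pick 2GB.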

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 arch/x86/mm/init_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ff253648706f..b72923b12d99 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1439,6 +1439,7 @@ static unsigned long probe_memory_block_size(void)
 {
 	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
 	unsigned long bz;
+	int order;
 
 	/* If memory block size has been set, then use it */
 	bz = set_memory_block_size;
@@ -1451,6 +1452,21 @@ static unsigned long probe_memory_block_size(void)
 		goto done;
 	}
 
+	/* Consider hotplug advisement value (if set) */
+	order = memblock_probe_size_order();
+	bz = order > 0 ? (1UL << order) : 0;
+	if (bz) {
+		/* Align down to max and up to min supported */
+		bz = max(min(bz, MAX_BLOCK_SIZE), MIN_MEMORY_BLOCK_SIZE);
+		/* Use lesser of advisement and end of memory alignment */
+		for (; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
+			if (IS_ALIGNED(boot_mem_end, bz))
+				goto done;
+		}
+		/* Barring clean alignment, default to min block size */
+		goto done;
+	}
+
 	/*
 	 * Use max block size to minimize overhead on bare metal, where
 	 * alignment for memory hotplug isn't a concern.
-- 
2.43.0




* [PATCH v2 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment
  2024-10-16 19:24 [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement Gregory Price
  2024-10-16 19:24 ` [PATCH v2 1/3] mm/memblock: implement memblock_advise_size_order and probe functions Gregory Price
  2024-10-16 19:24 ` [PATCH v2 2/3] x86: probe memblock size advisement value during mm init Gregory Price
@ 2024-10-16 19:24 ` Gregory Price
  2024-10-21  9:51 ` [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement David Hildenbrand
  3 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-16 19:24 UTC (permalink / raw)
  To: x86, linux-kernel, linux-acpi, linux-mm
  Cc: dan.j.williams, ira.weiny, david, dave.hansen, luto, peterz,
	tglx, mingo, bp, hpa, rafael, lenb, rppt, akpm, alison.schofield,
	Jonathan.Cameron, rrichter, ytcoode, haibo1.xu, dave.jiang

The CXL Fixed Memory Window allows for memory aligned down to the
size of 256MB.  However, by default on x86, memory blocks increase
in size as total System RAM capacity increases. On x86, this caps
out at 2G when 64GB of System RAM is reached.

When the CFMWS regions are not aligned to memory block size, this
results in lost capacity on either side of the alignment.

Parse all CFMWS to detect the largest alignment common to all
regions, and advise memblock to reduce the block size accordingly.
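
As a rough user-space sketch of that search (struct window is a
hypothetical stand-in for the base/size pair of a CFMWS entry, and the
function names are invented for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define SZ_256M	(256ULL << 20)
#define SZ_64T	(64ULL << 40)

/* Hypothetical stand-in for the base/size pair of one CFMWS entry. */
struct window {
	uint64_t base;
	uint64_t size;
};

/*
 * Largest power of two (at most 64TB) to which both base and size are
 * aligned, or 0 if the window is not even 256MB aligned.
 */
static uint64_t window_alignment(const struct window *w)
{
	uint64_t bz;

	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
		if (((w->base | w->size) & (bz - 1)) == 0)
			return bz;
	}
	return 0;
}

/* Smallest per-window alignment, i.e. the largest one common to all. */
static uint64_t common_alignment(const struct window *w, size_t n)
{
	uint64_t best = SZ_64T;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t bz = window_alignment(&w[i]);

		if (bz == 0)
			continue;	/* violates the 256MB rule, skip it */
		if (bz < best)
			best = bz;
	}
	return best;
}
```

The search is done on 64-bit values throughout because window
alignments of 4GB and above are not representable in 32 bits.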

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/acpi/numa/srat.c | 42 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 44f91f2c6c5d..5fc03a99570e 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -14,6 +14,7 @@
 #include <linux/errno.h>
 #include <linux/acpi.h>
 #include <linux/memblock.h>
+#include <linux/memory.h>
 #include <linux/numa.h>
 #include <linux/nodemask.h>
 #include <linux/topology.h>
@@ -333,6 +334,35 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
 	return 0;
 }
 
+/*
+ * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
+ * systems default to larger alignments (2GB on x86). Misalignments can
+ * cause some capacity to become unreachable. Calculate the largest supported
+ * alignment for all CFMW to maximize the amount of mappable capacity.
+ */
+static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
+				   void *arg, const unsigned long table_end)
+{
+	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
+	u64 start = cfmws->base_hpa;
+	u64 size = cfmws->window_size;
+	unsigned long *fin_bz = arg;
+	unsigned long bz;
+
+	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
+		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
+			break;
+	}
+
+	/* Only adjust downward, we never want to increase block size */
+	if (bz < *fin_bz && bz >= SZ_256M)
+		*fin_bz = bz;
+	else if (bz < SZ_256M)
+		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
+
+	return 0;
+}
+
 static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
 				   void *arg, const unsigned long table_end)
 {
@@ -501,6 +531,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
 int __init acpi_numa_init(void)
 {
 	int i, fake_pxm, cnt = 0;
+	unsigned long bz = SZ_64T;
 
 	if (acpi_disabled)
 		return -EINVAL;
@@ -552,6 +583,17 @@ int __init acpi_numa_init(void)
 	}
 	last_real_pxm = fake_pxm;
 	fake_pxm++;
+
+	/* Calculate and set largest supported memory block size alignment */
+	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws, &bz);
+	if (bz >= SZ_256M) {
+		if (memblock_advise_size_order(__ffs(bz)) < 0)
+			pr_warn("CFMWS: memblock size advise failed\n");
+		else
+			pr_info("CFMWS: memblock advised size(%lu)\n", bz);
+	}
+
+	/* Then parse and fill the numa nodes with the described memory */
 	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
 			      &fake_pxm);
 
-- 
2.43.0




* Re: [PATCH v2 1/3] mm/memblock: implement memblock_advise_size_order and probe functions
  2024-10-16 19:24 ` [PATCH v2 1/3] mm/memblock: implement memblock_advise_size_order and probe functions Gregory Price
@ 2024-10-20  8:57   ` Mike Rapoport
  2024-10-21 14:39     ` Gregory Price
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2024-10-20  8:57 UTC (permalink / raw)
  To: Gregory Price
  Cc: x86, linux-kernel, linux-acpi, linux-mm, dan.j.williams,
	ira.weiny, david, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, rafael, lenb, akpm, alison.schofield, Jonathan.Cameron,
	rrichter, ytcoode, haibo1.xu, dave.jiang

On Wed, Oct 16, 2024 at 03:24:43PM -0400, Gregory Price wrote:
> Hotplug memory sources may have opinions on what the memblock size
> should be - usually for alignment purposes.  For example, CXL memory
> extents can be as small as 256MB with a matching physical alignment.
> 
> Implement memblock_advise_size_order for use during early init, prior
> to allocator and smp init, for software to advise the system as to what
> the preferred block size should be.
> 
> The probe function is meant for arch_init code to fetch this value
> once during memblock size calculation. Use of the advisement value
> is arch-specific, and no guarantee is made that it will be used.

I'm confused.
 
Aren't we talking about memory blocks for hotplugable memory here?
This functionality rather belongs to drivers/base/memory.c, doesn't it?

> Calls to either function after probe result in -EBUSY to signal that
> advisement is ignored or that memory_block_size_bytes() should be used.
> 
> Suggested-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  include/linux/memblock.h |  2 ++
>  mm/memblock.c            | 49 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 51 insertions(+)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index fc4d75c6cec3..efb1f7cfbd58 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -111,6 +111,8 @@ static inline void memblock_discard(void) {}
>  #endif
>  
>  void memblock_allow_resize(void);
> +int memblock_advise_size_order(int order);
> +int memblock_probe_size_order(void);
>  int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
>  		      enum memblock_flags flags);
>  int memblock_add(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 3b9dc2d89b8a..e0bdba011564 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -2009,6 +2009,55 @@ void __init memblock_allow_resize(void)
>  	memblock_can_resize = 1;
>  }
>  
> +/*
> + * @order: bit-order describing the preferred minimum block size
> + *
> + * Intended for use by early-boot software prior to smp and allocator init to
> + * advise the architecture what the minimum block size should be. Should only
> + * be called during arch init before allocator and smp init.
> + *
> + * This value can only decrease after it has been initially set, the intention
> + * is to identify the smallest supported alignment across all opinions.
> + *
> + * Use of this advisement value is arch-specific.
> + *
> + * Returns: 0 on success, -EINVAL if order is <=0, and -EBUSY if already probed
> + */
> +static int memblock_sz_order;
> +#define MEMBLOCK_SZO_PROBED (-1)
> +int memblock_advise_size_order(int order)
> +{
> +	if (order <= 0)
> +		return -EINVAL;
> +
> +	if (memblock_sz_order == MEMBLOCK_SZO_PROBED)
> +		return -EBUSY;
> +
> +	if (memblock_sz_order)
> +		memblock_sz_order = min(order, memblock_sz_order);
> +	else
> +		memblock_sz_order = order;
> +
> +	return 0;
> +}
> +
> +/*
> + * memblock_probe_size_order is intended for arch init code to probe one time,
> + * for a suggested memory block size.  After the first call, the result will
> + * always be -EBUSY. A late user should call memory_block_size_bytes instead to
> + * determine the actual block size in use.
> + *
> + * Should only be called during arch init prior to allocator and smp init.
> + *
> + * Returns: block size order, 0 if never set, or -EBUSY if previously probed.
> + */
> +int memblock_probe_size_order(void)
> +{
> +	int rv = xchg(&memblock_sz_order, -1);
> +
> +	return (rv == -1) ? -EBUSY : rv;
> +}
> +
>  static int __init early_memblock(char *p)
>  {
>  	if (p && strstr(p, "debug"))
> -- 
> 2.43.0
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement
  2024-10-16 19:24 [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement Gregory Price
                   ` (2 preceding siblings ...)
  2024-10-16 19:24 ` [PATCH v2 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment Gregory Price
@ 2024-10-21  9:51 ` David Hildenbrand
  2024-10-21 14:51   ` Gregory Price
  3 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2024-10-21  9:51 UTC (permalink / raw)
  To: Gregory Price, x86, linux-kernel, linux-acpi, linux-mm
  Cc: dan.j.williams, ira.weiny, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, rafael, lenb, rppt, akpm, alison.schofield,
	Jonathan.Cameron, rrichter, ytcoode, haibo1.xu, dave.jiang



On 16.10.24 at 21:24, Gregory Price wrote:
> When physical address regions are not aligned to memory block size,
> the misaligned portion is lost (stranded capacity).
> 
> Block size (min/max/selected) is architecture defined. Most architectures
> tend to use the minimum block size or some simplistic heuristic. On x86,
> memory block size increases up to 2GB, and is otherwise fitted to the
> alignment of non-hotplug (special purpose memory).
> 
> CXL exposes its memory for management through the ACPI CEDT (CXL Early
> Detection Table) in a field called the CXL Fixed Memory Window.  Per
> the CXL specification, this memory must be aligned to at least 256MB.
> 
> When a CFMW aligns on a size less than the block size, this causes a
> loss of up to 2GB per CFMW on x86.  It is not uncommon for CFMW to be
> allocated per-device - though this behavior is BIOS defined.
> 
> This patch set provides 3 things:
>   1) implement advise/probe functions in mm/memblock.c to report/probe
>      architecture agnostic hotplug memory alignment advice.
>   2) update x86 memblock size logic to consider the hotplug advice
>   3) add code in acpi/numa/srat.c to report CFMW alignment advice
> 
> The advisement interfaces are designed to be called during arch_init
> code prior to allocator and smp_init.  start_kernel will call these
> through setup_arch() (via acpi and mm/init_64.c on x86), which occurs
> prior to mm_core_init and smp_init - so no need for atomics.
> 
> There's an attempt to signal callers to advise() that probe has already
> occurred, but this is predicated on the notion that probe() actually
> occurs (which presently only happens on x86). This is to assist debugging
> future users who may mistakenly call this after allocator or smp init.
> 
> Likewise, if probe() occurs more than once, we return -EBUSY to prevent
> inconsistent values from being reported - i.e. this interaction should
> happen exactly once, and all other behavior is an error / the probed
> value should be acquired via memory_block_size_bytes() instead.
> 
> Suggested-by: Ira Weiny <ira.weiny@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>

Just as a side note, a while ago there was a discussion about variable-sized 
memory blocks -- essentially removing memory_block_size_bytes().

The main issue is that this would change /sys/devices/system/memory/ in ways that
could break existing user space. I believe there are other corner cases that are
a bit nasty to handle (e.g., removing parts of a larger memory block), but
they could likely be handled.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 2/3] x86: probe memblock size advisement value during mm init
  2024-10-16 19:24 ` [PATCH v2 2/3] x86: probe memblock size advisement value during mm init Gregory Price
@ 2024-10-21 11:12   ` David Hildenbrand
  2024-10-21 14:46     ` Gregory Price
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2024-10-21 11:12 UTC (permalink / raw)
  To: Gregory Price, x86, linux-kernel, linux-acpi, linux-mm
  Cc: dan.j.williams, ira.weiny, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, rafael, lenb, rppt, akpm, alison.schofield,
	Jonathan.Cameron, rrichter, ytcoode, haibo1.xu, dave.jiang



On 16.10.24 at 21:24, Gregory Price wrote:
> Systems with hotplug may provide an advisement value on what the
> memblock size should be.  Probe this value when the rest of the
> configuration values are considered.
> 
> The new heuristic is as follows
> 
> 1) set_memory_block_size_order value if already set (cmdline param)
> 2) minimum block size if memory is less than large block limit
> 3) [new] hotplug advise: lesser of advise value or memory alignment
> 4) Max block size if system is bare-metal
> 5) Largest size that aligns to end of memory.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>   arch/x86/mm/init_64.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ff253648706f..b72923b12d99 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1439,6 +1439,7 @@ static unsigned long probe_memory_block_size(void)
>   {
>   	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
>   	unsigned long bz;
> +	int order;
>   
>   	/* If memory block size has been set, then use it */
>   	bz = set_memory_block_size;
> @@ -1451,6 +1452,21 @@ static unsigned long probe_memory_block_size(void)
>   		goto done;
>   	}
>   
> +	/* Consider hotplug advisement value (if set) */
> +	order = memblock_probe_size_order();

"size_order" is a very weird name. Just return a size?

memory_block_advised_max_size()

or something like that?

> +	bz = order > 0 ? (1UL << order) : 0;
> +	if (bz) {
> +		/* Align down to max and up to min supported */
> +		bz = max(min(bz, MAX_BLOCK_SIZE), MIN_MEMORY_BLOCK_SIZE);
> +		/* Use lesser of advisement and end of memory alignment */
> +		for (; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
> +			if (IS_ALIGNED(boot_mem_end, bz))
> +				goto done;

This looks like duplicate code with the loop below.

Could we refactor it into something like:

advised_max_size = memory_block_advised_max_size();
if (!advised_max_size) {
	bz = MAX_BLOCK_SIZE;
	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
		goto done;
} else {
	bz = max(min(advised_max_size, MAX_BLOCK_SIZE), MIN_MEMORY_BLOCK_SIZE);
}

for (; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
	if (IS_ALIGNED(boot_mem_end, bz))
		break;
}



-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 1/3] mm/memblock: implement memblock_advise_size_order and probe functions
  2024-10-20  8:57   ` Mike Rapoport
@ 2024-10-21 14:39     ` Gregory Price
  0 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-21 14:39 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: x86, linux-kernel, linux-acpi, linux-mm, dan.j.williams,
	ira.weiny, david, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, rafael, lenb, akpm, alison.schofield, Jonathan.Cameron,
	rrichter, ytcoode, haibo1.xu, dave.jiang

On Sun, Oct 20, 2024 at 11:57:27AM +0300, Mike Rapoport wrote:
> On Wed, Oct 16, 2024 at 03:24:43PM -0400, Gregory Price wrote:
> > Hotplug memory sources may have opinions on what the memblock size
> > should be - usually for alignment purposes.  For example, CXL memory
> > extents can be as small as 256MB with a matching physical alignment.
> > 
> > Implement memblock_advise_size_order for use during early init, prior
> > to allocator and smp init, for software to advise the system as to what
> > the preferred block size should be.
> > 
> > The probe function is meant for arch_init code to fetch this value
> > once during memblock size calculation. Use of the advisement value
> > is arch-specific, and no guarantee is made that it will be used.
> 
> I'm confused.
>  
> Aren't we talking about memory blocks for hotplugable memory here?
> This functionality rather belongs to drivers/base/memory.c, doesn't it?
>

You're right, I should have put it there. I got distracted by the ifdef
mess around get_block_size_bytes/set...order and just tossed it into memblock
to avoid it.  I should be able to ifdef the definition in the header and move
it into memory.c.
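
One common shape for that kind of header ifdef is sketched below.  This
is purely hypothetical: it assumes CONFIG_MEMORY_HOTPLUG is the option
that gates drivers/base/memory.c, and both the function names and the
stub return values are invented for illustration.

```c
#include <errno.h>

/*
 * Hypothetical header fragment.  CONFIG_MEMORY_HOTPLUG and the function
 * names below are assumptions, not taken from the posted patches.
 */
#ifdef CONFIG_MEMORY_HOTPLUG
int memory_block_advise_size_order(int order);
int memory_block_probe_size_order(void);
#else
/* No hotplug support: reject advice and report that none was given. */
static inline int memory_block_advise_size_order(int order)
{
	(void)order;
	return -ENODEV;
}

static inline int memory_block_probe_size_order(void)
{
	return 0;
}
#endif
```

Callers then need no ifdefs of their own; when the option is off, the
stubs compile away to constants.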
 
> > Calls to either function after probe result in -EBUSY to signal that
> > advisement is ignored or that memory_block_size_bytes() should be used.
> > 
> > Suggested-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> > ---
> >  include/linux/memblock.h |  2 ++
> >  mm/memblock.c            | 49 ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 51 insertions(+)
> > 



* Re: [PATCH v2 2/3] x86: probe memblock size advisement value during mm init
  2024-10-21 11:12   ` David Hildenbrand
@ 2024-10-21 14:46     ` Gregory Price
  2024-10-21 15:57       ` David Hildenbrand
  2024-10-21 16:04       ` Gregory Price
  0 siblings, 2 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-21 14:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: x86, linux-kernel, linux-acpi, linux-mm, dan.j.williams,
	ira.weiny, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	rafael, lenb, rppt, akpm, alison.schofield, Jonathan.Cameron,
	rrichter, ytcoode, haibo1.xu, dave.jiang

On Mon, Oct 21, 2024 at 01:12:26PM +0200, David Hildenbrand wrote:
> 
> 
> On 16.10.24 at 21:24, Gregory Price wrote:
> > Systems with hotplug may provide an advisement value on what the
> > memblock size should be.  Probe this value when the rest of the
> > configuration values are considered.
> > 
> > The new heuristic is as follows
> > 
> > 1) set_memory_block_size_order value if already set (cmdline param)
> > 2) minimum block size if memory is less than large block limit
> > 3) [new] hotplug advise: lesser of advise value or memory alignment
> > 4) Max block size if system is bare-metal
> > 5) Largest size that aligns to end of memory.
> > 
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> > ---
> >   arch/x86/mm/init_64.c | 16 ++++++++++++++++
> >   1 file changed, 16 insertions(+)
> > 
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index ff253648706f..b72923b12d99 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1439,6 +1439,7 @@ static unsigned long probe_memory_block_size(void)
> >   {
> >   	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
> >   	unsigned long bz;
> > +	int order;
> >   	/* If memory block size has been set, then use it */
> >   	bz = set_memory_block_size;
> > @@ -1451,6 +1452,21 @@ static unsigned long probe_memory_block_size(void)
> >   		goto done;
> >   	}
> > +	/* Consider hotplug advisement value (if set) */
> > +	order = memblock_probe_size_order();
> 
> "size_order" is a very weird name. Just return a size?
> 
> memory_block_advised_max_size()
> 
> or something like that?
> 

There isn't technically an overall "max block size", nor any alignment
requirements - so order was a nice way of enforcing power-of-two alignment
while also having the ability to get a -1/-EBUSY/whatever out.

I can change it if it's a big sticking point - but that's my reasoning.
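
To illustrate the trade-off, a small user-space sketch follows (the
helper names are invented): any positive order maps to a power-of-two
size, so alignment holds by construction and negative values stay free
for error codes, whereas a raw size needs an explicit power-of-two check.

```c
#include <errno.h>
#include <stdint.h>

/* Map a probe result to bytes; 0 means "no usable advice". */
static uint64_t order_to_size(int order)
{
	if (order <= 0 || order >= 64)
		return 0;	/* unset, or a negative errno such as -EBUSY */
	return 1ULL << order;
}

/* Map a size to its order; -EINVAL unless size is a power of two. */
static int size_to_order(uint64_t size)
{
	int order = 0;

	if (size == 0 || (size & (size - 1)))
		return -EINVAL;
	while (size >>= 1)
		order++;
	return order;
}
```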

> > +	bz = order > 0 ? (1UL << order) : 0;
> > +	if (bz) {
> > +		/* Align down to max and up to min supported */
> > +		bz = max(min(bz, MAX_BLOCK_SIZE), MIN_MEMORY_BLOCK_SIZE);
> > +		/* Use lesser of advisement and end of memory alignment */
> > +		for (; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
> > +			if (IS_ALIGNED(boot_mem_end, bz))
> > +				goto done;
> 
> This looks like duplicate code with the loop below.
> 
> Could we refactor it into something like:
> 
> advised_max_size = memory_block_advised_max_size();
> if (!advised_max_size) {
> 	bz = MAX_BLOCK_SIZE;
> 	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> 		goto done;
> } else {
> 	bz = max(min(advised_max_size, MAX_BLOCK_SIZE), MIN_MEMORY_BLOCK_SIZE);
> }
> 
> for (; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
> 	if (IS_ALIGNED(boot_mem_end, bz))
> 		break;
> }
> 
>

This is better, will update.

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 



* Re: [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement
  2024-10-21  9:51 ` [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement David Hildenbrand
@ 2024-10-21 14:51   ` Gregory Price
  0 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-21 14:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: x86, linux-kernel, linux-acpi, linux-mm, dan.j.williams,
	ira.weiny, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	rafael, lenb, rppt, akpm, alison.schofield, Jonathan.Cameron,
	rrichter, ytcoode, haibo1.xu, dave.jiang

On Mon, Oct 21, 2024 at 11:51:38AM +0200, David Hildenbrand wrote:
> 
> 
> On 16.10.24 at 21:24, Gregory Price wrote:
> > When physical address regions are not aligned to memory block size,
> > the misaligned portion is lost (stranded capacity).
> > 
> > Block size (min/max/selected) is architecture defined. Most architectures
> > tend to use the minimum block size or some simplistic heuristic. On x86,
> > memory block size increases up to 2GB, and is otherwise fitted to the
> > alignment of non-hotplug (special purpose memory).
> > 
> > CXL exposes its memory for management through the ACPI CEDT (CXL Early
> > Detection Table) in a field called the CXL Fixed Memory Window.  Per
> > the CXL specification, this memory must be aligned to at least 256MB.
> > 
> > When a CFMW aligns on a size less than the block size, this causes a
> > loss of up to 2GB per CFMW on x86.  It is not uncommon for CFMW to be
> > allocated per-device - though this behavior is BIOS defined.
> > 
> > This patch set provides 3 things:
> >   1) implement advise/probe functions in mm/memblock.c to report/probe
> >      architecture agnostic hotplug memory alignment advice.
> >   2) update x86 memblock size logic to consider the hotplug advice
> >   3) add code in acpi/numa/srat.c to report CFMW alignment advice
> > 
> > The advisement interfaces are designed to be called during arch_init
> > code prior to allocator and smp_init.  start_kernel will call these
> > through setup_arch() (via acpi and mm/init_64.c on x86), which occurs
> > prior to mm_core_init and smp_init - so no need for atomics.
> > 
> > There's an attempt to signal callers to advise() that probe has already
> > occurred, but this is predicated on the notion that probe() actually
> > occurs (which presently only happens on x86). This is to assist debugging
> > future users who may mistakenly call this after allocator or smp init.
> > 
> > Likewise, if probe() occurs more than once, we return -EBUSY to prevent
> > inconsistent values from being reported - i.e. this interaction should
> > happen exactly once, and all other behavior is an error / the probed
> > value should be acquired via memory_block_size_bytes() instead.
> > 
> > Suggested-by: Ira Weiny <ira.weiny@intel.com>
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> 
> Just as a side note, a while ago there was a discussion about variable-sized
> memory blocks -- essentially removing memory_block_size_bytes().
>

If you have any links, happy to do some reading up on it.  Was going to look
into some more memblock behavior in the future so it's worth looking at.

> 
> The main issue is that this would change /sys/devices/system/memory/ in ways
> that could break existing user space. I believe there are other corner cases
> that are a bit nasty to handle (e.g., removing parts of a larger memory
> block), but likely it could be handled.
> 

This is why I wanted to avoid a new interface in the first place and just
piggyback on set_memory_block_size_order - now there are two interfaces that
do the same thing, and more hurdles.  But I suppose the suggestive nature of
this one makes it far less offensive, since it can be completely ignored.

~Gregory



* Re: [PATCH v2 2/3] x86: probe memblock size advisement value during mm init
  2024-10-21 14:46     ` Gregory Price
@ 2024-10-21 15:57       ` David Hildenbrand
  2024-10-21 16:17         ` Gregory Price
  2024-10-21 16:04       ` Gregory Price
  1 sibling, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2024-10-21 15:57 UTC (permalink / raw)
  To: Gregory Price
  Cc: x86, linux-kernel, linux-acpi, linux-mm, dan.j.williams,
	ira.weiny, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	rafael, lenb, rppt, akpm, alison.schofield, Jonathan.Cameron,
	rrichter, ytcoode, haibo1.xu, dave.jiang

On 21.10.24 16:46, Gregory Price wrote:
> On Mon, Oct 21, 2024 at 01:12:26PM +0200, David Hildenbrand wrote:
>>
>>
>> Am 16.10.24 um 21:24 schrieb Gregory Price:
>>> Systems with hotplug may provide an advisement value on what the
>>> memblock size should be.  Probe this value when the rest of the
>>> configuration values are considered.
>>>
>>> The new heuristic is as follows
>>>
>>> 1) set_memory_block_size_order value if already set (cmdline param)
>>> 2) minimum block size if memory is less than large block limit
>>> 3) [new] hotplug advise: lesser of advise value or memory alignment
>>> 4) Max block size if system is bare-metal
>>> 5) Largest size that aligns to end of memory.
>>>
>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Gregory Price <gourry@gourry.net>
>>> ---
>>>    arch/x86/mm/init_64.c | 16 ++++++++++++++++
>>>    1 file changed, 16 insertions(+)
>>>
>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>> index ff253648706f..b72923b12d99 100644
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -1439,6 +1439,7 @@ static unsigned long probe_memory_block_size(void)
>>>    {
>>>    	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
>>>    	unsigned long bz;
>>> +	int order;
>>>    	/* If memory block size has been set, then use it */
>>>    	bz = set_memory_block_size;
>>> @@ -1451,6 +1452,21 @@ static unsigned long probe_memory_block_size(void)
>>>    		goto done;
>>>    	}
>>> +	/* Consider hotplug advisement value (if set) */
>>> +	order = memblock_probe_size_order();
>>
>> "size_order" is a very weird name. Just return a size?
>>
>> memory_block_advised_max_size()
>>
>> or sth like that?
>>
> 
> There isn't technically an overall "max block size", nor any alignment
> requirements - so order was a nice way of enforcing 2-order alignment
> while also having the ability to get a -1/-EBUSY/whatever out.

I see. But we (MM) just call it "order" then, like pageblock_order,
max_order, compound_order ... but here we use "size" everywhere, so I
prefer to just stick to that.

> 
> I can change it if it's a big sticking point - but that's my reasoning.

Simply enforce it when setting the size. We call it "memory_block_size"
everywhere, and it's also a power of 2, etc.; we sanity-check that in
memory_dev_init().


-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 2/3] x86: probe memblock size advisement value during mm init
  2024-10-21 14:46     ` Gregory Price
  2024-10-21 15:57       ` David Hildenbrand
@ 2024-10-21 16:04       ` Gregory Price
  1 sibling, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-21 16:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: x86, linux-kernel, linux-acpi, linux-mm, dan.j.williams,
	ira.weiny, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	rafael, lenb, rppt, akpm, alison.schofield, Jonathan.Cameron,
	rrichter, ytcoode, haibo1.xu, dave.jiang

On Mon, Oct 21, 2024 at 10:46:38AM -0400, Gregory Price wrote:
> On Mon, Oct 21, 2024 at 01:12:26PM +0200, David Hildenbrand wrote:
> > 
> > > +	/* Consider hotplug advisement value (if set) */
> > > +	order = memblock_probe_size_order();
> > 
> > "size_order" is a very weird name. Just return a size?
> > 
> > memory_block_advised_max_size()
> > 
> > or sth like that?
> > 
> 
> There isn't technically an overall "max block size", nor any alignment
> requirements - so order was a nice way of enforcing 2-order alignment
> while also having the ability to get a -1/-EBUSY/whatever out.
> 
> I can change it if it's a big sticking point - but that's my reasoning.
> 

maybe change to

memory_block_advise_max_size
memory_block_probe_max_size

but still take in / return an order?

~Gregory



* Re: [PATCH v2 2/3] x86: probe memblock size advisement value during mm init
  2024-10-21 15:57       ` David Hildenbrand
@ 2024-10-21 16:17         ` Gregory Price
  0 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-21 16:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: x86, linux-kernel, linux-acpi, linux-mm, dan.j.williams,
	ira.weiny, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	rafael, lenb, rppt, akpm, alison.schofield, Jonathan.Cameron,
	rrichter, ytcoode, haibo1.xu, dave.jiang

On Mon, Oct 21, 2024 at 05:57:28PM +0200, David Hildenbrand wrote:
> On 21.10.24 16:46, Gregory Price wrote:
> > On Mon, Oct 21, 2024 at 01:12:26PM +0200, David Hildenbrand wrote:
> > > 
> > > 
> > > Am 16.10.24 um 21:24 schrieb Gregory Price:
> > > > Systems with hotplug may provide an advisement value on what the
> > > > memblock size should be.  Probe this value when the rest of the
> > > > configuration values are considered.
> > > > 
> > > > The new heuristic is as follows
> > > > 
> > > > 1) set_memory_block_size_order value if already set (cmdline param)
> > > > 2) minimum block size if memory is less than large block limit
> > > > 3) [new] hotplug advise: lesser of advise value or memory alignment
> > > > 4) Max block size if system is bare-metal
> > > > 5) Largest size that aligns to end of memory.
> > > > 
> > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > Signed-off-by: Gregory Price <gourry@gourry.net>
> > > > ---
> > > >    arch/x86/mm/init_64.c | 16 ++++++++++++++++
> > > >    1 file changed, 16 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > > > index ff253648706f..b72923b12d99 100644
> > > > --- a/arch/x86/mm/init_64.c
> > > > +++ b/arch/x86/mm/init_64.c
> > > > @@ -1439,6 +1439,7 @@ static unsigned long probe_memory_block_size(void)
> > > >    {
> > > >    	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
> > > >    	unsigned long bz;
> > > > +	int order;
> > > >    	/* If memory block size has been set, then use it */
> > > >    	bz = set_memory_block_size;
> > > > @@ -1451,6 +1452,21 @@ static unsigned long probe_memory_block_size(void)
> > > >    		goto done;
> > > >    	}
> > > > +	/* Consider hotplug advisement value (if set) */
> > > > +	order = memblock_probe_size_order();
> > > 
> > > "size_order" is a very weird name. Just return a size?
> > > 
> > > memory_block_advised_max_size()
> > > 
> > > or sth like that?
> > > 
> > 
> > There isn't technically an overall "max block size", nor any alignment
> > requirements - so order was a nice way of enforcing 2-order alignment
> > while also having the ability to get a -1/-EBUSY/whatever out.
> 
> I see. But we (MM) just call it "order" then, like pageblock_order,
> max_order, compound_order ... but here we use "size" everywhere, so I prefer
> to just stick to that.
> 
> > 
> > I can change it if it's a big sticking point - but that's my reasoning.
> 
> Simply enforce it when setting the size. We call it "memory_block_size"
> everywhere, and it's also a power of 2, etc.; we sanity-check that in
> memory_dev_init().
> 
>

Disregard my other email.  Didn't see this one come through.

I'll switch to a size and check alignment. I probably need to play
with the locking mechanism to avoid changing the value after it's been
probed the first time, but I'll poke at it.

So I'll probably change to an ssize_t for the arg and return value.
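
A rough user-space sketch of what that size-based interface might look
like, purely as an illustration.  The function names are the ones floated
earlier in this thread; the bodies, the "keep the smallest advice" policy,
and the error codes are assumptions rather than the actual patch.  There
is no locking because the calls are expected to happen single-threaded,
before smp_init.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

static ssize_t advised_size;	/* 0 means no advice recorded */
static bool probed;		/* latched by the first probe */

static bool is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Record an advised max block size.  Assumed policy: reject sizes that
 * are not a power of two, refuse advice after the probe has happened,
 * and keep the smallest value advised so far.
 */
static int memory_block_advise_max_size(ssize_t size)
{
	if (size <= 0 || !is_power_of_2((size_t)size))
		return -EINVAL;
	if (probed)
		return -EBUSY;
	if (!advised_size || size < advised_size)
		advised_size = size;
	return 0;
}

/*
 * Return the advised size (0 if none) exactly once; later calls get
 * -EBUSY so that callers cannot observe inconsistent values.
 */
static ssize_t memory_block_probe_max_size(void)
{
	if (probed)
		return -EBUSY;
	probed = true;
	return advised_size;
}
```

With this shape the architecture code would probe once, and anything that
needs the result afterwards would read it via memory_block_size_bytes()
instead.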

~Gregory



end of thread, other threads:[~2024-10-21 16:17 UTC | newest]

Thread overview: 13+ messages
2024-10-16 19:24 [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement Gregory Price
2024-10-16 19:24 ` [PATCH v2 1/3] mm/memblock: implement memblock_advise_size_order and probe functions Gregory Price
2024-10-20  8:57   ` Mike Rapoport
2024-10-21 14:39     ` Gregory Price
2024-10-16 19:24 ` [PATCH v2 2/3] x86: probe memblock size advisement value during mm init Gregory Price
2024-10-21 11:12   ` David Hildenbrand
2024-10-21 14:46     ` Gregory Price
2024-10-21 15:57       ` David Hildenbrand
2024-10-21 16:17         ` Gregory Price
2024-10-21 16:04       ` Gregory Price
2024-10-16 19:24 ` [PATCH v2 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment Gregory Price
2024-10-21  9:51 ` [PATCH v2 0/3] mm/memblock,x86,acpi: hotplug memory alignment advisement David Hildenbrand
2024-10-21 14:51   ` Gregory Price
