* [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation
@ 2025-01-19 18:06 Jiaqi Yan
2025-01-19 18:06 ` [RFC PATCH v1 1/2] selftest/mm: test HWPoison hugetlb truncation behavior Jiaqi Yan
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Jiaqi Yan @ 2025-01-19 18:06 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe, sidhartha.kumar, muchun.song
Cc: jane.chu, akpm, osalvador, rientjes, jthoughton, linux-mm,
linux-kernel, Jiaqi Yan
While I was working on userspace MFR via memfd [1], I spent some time
understanding what the current kernel does when a HugeTLB-backing memfd
is truncated. My expectation is that if a HWPoison HugeTLB folio is
mapped via the memfd to userspace, it will be unmapped right away but
still kept in the page cache [2]; however, once the memfd is truncated
to zero or closed, the kernel should dissolve the HWPoison folio in the
page cache and free only the clean raw pages to the buddy allocator,
excluding the poisoned raw page.
So I wrote a hugetlb-mfr-base.c selftest (a minimal sketch of the
sequence it drives follows this list) and expect:
0. Say nr_hugepages is initially 64 per system configuration.
1. After MADV_HWPOISON, nr_hugepages should still be 64, since even the
   HWPoison huge folio is kept in the page cache. free_hugepages should
   be nr_hugepages minus however many pages are in use.
2. After the memfd is truncated to zero, nr_hugepages should be reduced
   to 63, as the kernel dissolves and frees the HWPoison huge folio.
   free_hugepages should also be 63.
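In short, the sequence is the following (a minimal sketch, not the full
selftest below; it assumes a configured 1G hugepage pool and
CAP_SYS_ADMIN for MADV_HWPOISON, and omits error handling and the sysfs
counter reads):

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/memfd.h>

static void sketch(void)
{
	size_t len = 1UL << 30;	/* one 1G huge page */
	int fd = memfd_create("hugetlb_tmp", MFD_HUGETLB | MFD_HUGE_1GB);
	char *map;

	ftruncate(fd, len);	/* back the memfd with a huge folio */
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memset(map, 0xab, len);	/* fault the folio in */

	/* Poison one raw page inside the huge folio. */
	madvise(map + 0x200000, getpagesize(), MADV_HWPOISON);
	/* Expectation 1: nr_hugepages unchanged, folio kept in page cache. */

	ftruncate(fd, 0);	/* end of the memfd's useful life */
	munmap(map, len);
	close(fd);
	/* Expectation 2: folio dissolved, nr_hugepages drops by 1. */
}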
However, when testing at the head of mm-stable commit 2877a83e4a0a
("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I found
that although free_hugepages is reduced to 63, nr_hugepages is not
reduced and stays at 64.
Is my expectation outdated? Or is this some kind of bug?
I assume this is a bug and dug a little bit more. It seems there are two
issues, or two things I don't really understand.
1. During try_memory_failure_hugetlb, we increase the target in-use
   folio's refcount via get_hwpoison_hugetlb_folio. However, by the end
   of try_memory_failure_hugetlb, this refcount is not put. That makes
   sense to me given we keep the in-use huge folio in the page cache.
   However, I could not find where this refcount is put on the
   remove_inode_hugepages path. Is the refcount decrement missing? At
   least my test case suggests so. In folios_put_refs, I added a
   dump_page:
	if (!folio_ref_sub_and_test(folio, nr_refs)) {
		dump_page(&folio->page, "track hwpoison folio's ref");
		continue;
	}
[ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
[ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
[ 1069.320982] page_type: f4(hugetlb)
[ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
[ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
[ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
[ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
[ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
[ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
[ 1069.320992] page dumped because: track hwpoison folio's ref
2. Even if the folio's refcount does drop to zero and we get into
   free_huge_folio, it is not clear to me which part of free_huge_folio
   handles the case where the folio is HWPoison. In my test what I
   observed is that eventually the folio is enqueue_hugetlb_folio()-ed.
I tried to fix both issues with a very immature patch, and
hugetlb-mfr-base.c then passes. The patch shows the two things I think
are currently missing.
I want to use this RFC to better understand what behavior I should
expect and, if this is indeed an issue, to discuss fixes. Thanks.
[1] https://lore.kernel.org/linux-mm/20250118231549.1652825-1-jiaqiyan@google.com/T
[2] https://lore.kernel.org/all/20221018200125.848471-1-jthoughton@google.com/T/#u
Jiaqi Yan (2):
selftest/mm: test HWPoison hugetlb truncation behavior
mm/hugetlb: immature fix to handle HWPoisoned folio
mm/hugetlb.c | 6 +
mm/swap.c | 9 +-
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/hugetlb-mfr-base.c | 240 ++++++++++++++++++
4 files changed, 255 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/mm/hugetlb-mfr-base.c
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v1 1/2] selftest/mm: test HWPoison hugetlb truncation behavior
2025-01-19 18:06 [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation Jiaqi Yan
@ 2025-01-19 18:06 ` Jiaqi Yan
2025-01-19 20:18 ` Pedro Falcato
2025-01-19 18:06 ` [RFC PATCH v1 2/2] mm/hugetlb: immature fix to handle HWPoisoned folio Jiaqi Yan
2025-01-20 10:59 ` [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation David Hildenbrand
2 siblings, 1 reply; 10+ messages in thread
From: Jiaqi Yan @ 2025-01-19 18:06 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe, sidhartha.kumar, muchun.song
Cc: jane.chu, akpm, osalvador, rientjes, jthoughton, linux-mm,
linux-kernel, Jiaqi Yan
Add a test based on my understanding of the memory failure recovery
behavior for the HugeTLB filesystem, especially after the file is
truncated or closed.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/hugetlb-mfr-base.c | 240 ++++++++++++++++++
2 files changed, 241 insertions(+)
create mode 100644 tools/testing/selftests/mm/hugetlb-mfr-base.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 63ce39d024bb5..576626c93ccab 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -62,6 +62,7 @@ TEST_GEN_FILES += hmm-tests
TEST_GEN_FILES += hugetlb-madvise
TEST_GEN_FILES += hugetlb-read-hwpoison
TEST_GEN_FILES += hugetlb-soft-offline
+TEST_GEN_FILES += hugetlb-mfr-base
TEST_GEN_FILES += hugepage-mmap
TEST_GEN_FILES += hugepage-mremap
TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/mm/hugetlb-mfr-base.c b/tools/testing/selftests/mm/hugetlb-mfr-base.c
new file mode 100644
index 0000000000000..b8eee071babe6
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb-mfr-base.c
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <linux/magic.h>
+#include <linux/memfd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/statfs.h>
+#include <sys/types.h>
+
+#include "../kselftest.h"
+#include "vm_util.h"
+
+#define EPREFIX " !!! "
+#define BYTE_LENTH_IN_1G 0x40000000
+#define HUGETLB_FILL 0xab
+
+static void *sigbus_addr;
+static int sigbus_addr_lsb;
+static bool expecting_sigbus;
+static bool got_sigbus;
+static bool was_mceerr;
+
+static int create_hugetlbfs_file(struct statfs *file_stat)
+{
+ int fd;
+ int flags = MFD_HUGETLB | MFD_HUGE_1GB;
+
+ fd = memfd_create("hugetlb_tmp", flags);
+ if (fd < 0)
+ ksft_exit_fail_perror("Failed to memfd_create");
+
+ memset(file_stat, 0, sizeof(*file_stat));
+ if (fstatfs(fd, file_stat)) {
+ close(fd);
+ ksft_exit_fail_perror("Failed to fstatfs");
+ }
+ if (file_stat->f_type != HUGETLBFS_MAGIC) {
+ close(fd);
+ ksft_exit_fail_msg("Not hugetlbfs file");
+ }
+
+ ksft_print_msg("Created hugetlb_tmp file\n");
+ ksft_print_msg("hugepagesize=%#lx\n", file_stat->f_bsize);
+ if (file_stat->f_bsize != BYTE_LENTH_IN_1G)
+ ksft_exit_fail_msg("Hugepage size is not 1G");
+
+ return fd;
+}
+
+/*
+ * SIGBUS handler for "do_hwpoison" thread that mapped and MADV_HWPOISON
+ */
+static void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+ if (!expecting_sigbus)
+ ksft_exit_fail_msg("unexpected sigbus with addr=%p",
+ info->si_addr);
+
+ got_sigbus = true;
+ was_mceerr = (info->si_code == BUS_MCEERR_AO ||
+ info->si_code == BUS_MCEERR_AR);
+ sigbus_addr = info->si_addr;
+ sigbus_addr_lsb = info->si_addr_lsb;
+}
+
+static void *do_hwpoison(void *hwpoison_addr)
+{
+ int hwpoison_size = getpagesize();
+
+ ksft_print_msg("MADV_HWPOISON hwpoison_addr=%p, len=%d\n",
+ hwpoison_addr, hwpoison_size);
+ if (madvise(hwpoison_addr, hwpoison_size, MADV_HWPOISON) < 0)
+ ksft_exit_fail_perror("Failed to MADV_HWPOISON");
+
+ pthread_exit(NULL);
+}
+
+static void test_hwpoison_multiple_pages(unsigned char *start_addr)
+{
+ pthread_t pthread;
+ int ret;
+ unsigned char *hwpoison_addr;
+ unsigned long offsets[] = {0x200000};
+
+ for (size_t i = 0; i < ARRAY_SIZE(offsets); ++i) {
+ sigbus_addr = (void *)0xBADBADBAD;
+ sigbus_addr_lsb = 0;
+ was_mceerr = false;
+ got_sigbus = false;
+ expecting_sigbus = true;
+ hwpoison_addr = start_addr + offsets[i];
+
+ ret = pthread_create(&pthread, NULL, &do_hwpoison, hwpoison_addr);
+ if (ret)
+ ksft_exit_fail_perror("Failed to create hwpoison thread");
+
+ ksft_print_msg("Created thread to hwpoison and access hwpoison_addr=%p\n",
+ hwpoison_addr);
+
+ pthread_join(pthread, NULL);
+
+ if (!got_sigbus)
+ ksft_test_result_fail("Didn't get a SIGBUS\n");
+ if (!was_mceerr)
+ ksft_test_result_fail("Didn't get a BUS_MCEERR_A(R|O)\n");
+ if (sigbus_addr != hwpoison_addr)
+ ksft_test_result_fail("Incorrect address: got=%p, expected=%p\n",
+ sigbus_addr, hwpoison_addr);
+ if (sigbus_addr_lsb != 30)
+ ksft_test_result_fail("Incorrect address LSB: got=%d, expected=%d\n",
+ sigbus_addr_lsb, pshift());
+
+ ksft_print_msg("Received expected and correct SIGBUS\n");
+ }
+}
+
+static int read_nr_hugepages(unsigned long hugepage_size,
+ unsigned long *nr_hugepages)
+{
+ char buffer[256] = {0};
+ char cmd[256] = {0};
+
+ sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
+ hugepage_size);
+ FILE *cmdfile = popen(cmd, "r");
+
+ if (cmdfile == NULL) {
+ ksft_perror(EPREFIX "failed to popen nr_hugepages");
+ return -1;
+ }
+
+ if (!fgets(buffer, sizeof(buffer), cmdfile)) {
+ ksft_perror(EPREFIX "failed to read nr_hugepages");
+ pclose(cmdfile);
+ return -1;
+ }
+
+ *nr_hugepages = atoll(buffer);
+ pclose(cmdfile);
+ return 0;
+}
+
+/*
+ * Main thread that drives the test.
+ */
+static void test_main(int fd, size_t len)
+{
+ unsigned char *map;
+ struct sigaction new, old;
+ const unsigned long hugepagesize_kb = BYTE_LENTH_IN_1G / 1024;
+ unsigned long nr_hugepages_before = 0;
+ unsigned long nr_hugepages_after = 0;
+
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_before) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+ ksft_print_msg("NR hugepages before MADV_HWPOISON is %ld\n", nr_hugepages_before);
+
+ if (ftruncate(fd, len) < 0)
+ ksft_exit_fail_perror("Failed to ftruncate");
+
+ ksft_print_msg("Allocated %#lx bytes to HugeTLB file\n", len);
+
+ map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (map == MAP_FAILED)
+ ksft_exit_fail_msg("Failed to mmap");
+
+ ksft_print_msg("Created HugeTLB mapping: %p\n", map);
+
+ memset(map, HUGETLB_FILL, len);
+ ksft_print_msg("Memset every byte to 0xab\n");
+
+ new.sa_sigaction = &sigbus_handler;
+ new.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGBUS, &new, &old) < 0)
+ ksft_exit_fail_msg("Failed to setup SIGBUS handler");
+
+ ksft_print_msg("Setup SIGBUS handler successfully\n");
+
+ test_hwpoison_multiple_pages(map);
+
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+
+ /*
+ * After MADV_HWPOISON, hugepage should still be in HugeTLB pool.
+ */
+ ksft_print_msg("NR hugepages after MADV_HWPOISON is %ld\n", nr_hugepages_after);
+ if (nr_hugepages_before != nr_hugepages_after)
+ ksft_test_result_fail("NR hugepages reduced by %ld after MADV_HWPOISON\n",
+ nr_hugepages_before - nr_hugepages_after);
+
+ /* End of the lifetime of the created HugeTLB memfd. */
+ if (ftruncate(fd, 0) < 0)
+ ksft_exit_fail_perror("Failed to ftruncate to 0");
+ munmap(map, len);
+ close(fd);
+
+ /*
+ * After freed by userspace, MADV_HWPOISON-ed hugepage should be
+ * dissolved into raw pages and removed from HugeTLB pool.
+ */
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+ ksft_print_msg("NR hugepages after truncate is %ld\n", nr_hugepages_after);
+ if (nr_hugepages_before != nr_hugepages_after + 1)
+ ksft_test_result_fail("NR hugepages is not reduced after truncate memfd\n");
+
+ ksft_test_result_pass("All done\n");
+}
+
+int main(int argc, char **argv)
+{
+ int fd;
+ struct statfs file_stat;
+ size_t len = BYTE_LENTH_IN_1G;
+
+ ksft_print_header();
+ ksft_set_plan(1);
+
+ fd = create_hugetlbfs_file(&file_stat);
+ test_main(fd, len);
+
+ ksft_finished();
+}
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v1 2/2] mm/hugetlb: immature fix to handle HWPoisoned folio
2025-01-19 18:06 [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation Jiaqi Yan
2025-01-19 18:06 ` [RFC PATCH v1 1/2] selftest/mm: test HWPoison hugetlb truncation behavior Jiaqi Yan
@ 2025-01-19 18:06 ` Jiaqi Yan
2025-01-20 10:59 ` [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation David Hildenbrand
2 siblings, 0 replies; 10+ messages in thread
From: Jiaqi Yan @ 2025-01-19 18:06 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe, sidhartha.kumar, muchun.song
Cc: jane.chu, akpm, osalvador, rientjes, jthoughton, linux-mm,
linux-kernel, Jiaqi Yan
The fix has two steps:
1. Decrement the HWPoison HugeTLB folio's refcount when it is 1. This is
   done in folios_put_refs while truncating or evicting the HugeTLB file.
2. Dissolve the HWPoison HugeTLB folio in free_huge_folio.
Again, this is just for demo purposes and not a proper fix, especially
step 1.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
mm/hugetlb.c | 6 ++++++
mm/swap.c | 9 ++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87761b042ed04..b28e0bc7f199f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1818,6 +1818,7 @@ void free_huge_folio(struct folio *folio)
int nid = folio_nid(folio);
struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
bool restore_reserve;
+ bool hwpoison = folio_test_hwpoison(folio);
unsigned long flags;
VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
@@ -1869,6 +1870,11 @@ void free_huge_folio(struct folio *folio)
remove_hugetlb_folio(h, folio, true);
spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_hugetlb_folio(h, folio, true);
+ } else if (hwpoison) {
+ remove_hugetlb_folio(h, folio, false);
+ h->max_huge_pages--;
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
+ update_and_free_hugetlb_folio(h, folio, true);
} else {
arch_clear_hugetlb_flags(folio);
enqueue_hugetlb_folio(h, folio);
diff --git a/mm/swap.c b/mm/swap.c
index 746a5ceba42c9..d6b4d4cb4004f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -961,7 +961,14 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
continue;
}
- if (!folio_ref_sub_and_test(folio, nr_refs))
+ folio_ref_sub(folio, nr_refs);
+
+ if (folio_test_hugetlb(folio) &&
+ folio_ref_count(folio) == 1 &&
+ folio_test_hwpoison(folio))
+ folio_ref_dec(folio);
+
+ if (folio_ref_count(folio) > 0)
continue;
/* hugetlb has its own memcg */
--
2.48.0.rc2.279.g1de40edade-goog
* Re: [RFC PATCH v1 1/2] selftest/mm: test HWPoison hugetlb truncation behavior
2025-01-19 18:06 ` [RFC PATCH v1 1/2] selftest/mm: test HWPoison hugetlb truncation behavior Jiaqi Yan
@ 2025-01-19 20:18 ` Pedro Falcato
0 siblings, 0 replies; 10+ messages in thread
From: Pedro Falcato @ 2025-01-19 20:18 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, linmiaohe, sidhartha.kumar, muchun.song, jane.chu,
akpm, osalvador, rientjes, jthoughton, linux-mm, linux-kernel
On Sun, Jan 19, 2025 at 6:06 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> Test based on my understanding of the memory failure recovery behavior
> for HugeTLB file system, especially after file is truncated/closed.
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
> tools/testing/selftests/mm/Makefile | 1 +
> tools/testing/selftests/mm/hugetlb-mfr-base.c | 240 ++++++++++++++++++
> 2 files changed, 241 insertions(+)
> create mode 100644 tools/testing/selftests/mm/hugetlb-mfr-base.c
>
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index 63ce39d024bb5..576626c93ccab 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -62,6 +62,7 @@ TEST_GEN_FILES += hmm-tests
> TEST_GEN_FILES += hugetlb-madvise
> TEST_GEN_FILES += hugetlb-read-hwpoison
> TEST_GEN_FILES += hugetlb-soft-offline
> +TEST_GEN_FILES += hugetlb-mfr-base
> TEST_GEN_FILES += hugepage-mmap
> TEST_GEN_FILES += hugepage-mremap
> TEST_GEN_FILES += hugepage-shm
> diff --git a/tools/testing/selftests/mm/hugetlb-mfr-base.c b/tools/testing/selftests/mm/hugetlb-mfr-base.c
> new file mode 100644
> index 0000000000000..b8eee071babe6
> --- /dev/null
> +++ b/tools/testing/selftests/mm/hugetlb-mfr-base.c
> @@ -0,0 +1,240 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +#include <linux/magic.h>
> +#include <linux/memfd.h>
> +#include <sys/mman.h>
> +#include <sys/prctl.h>
> +#include <sys/statfs.h>
> +#include <sys/types.h>
> +
> +#include "../kselftest.h"
> +#include "vm_util.h"
> +
> +#define EPREFIX " !!! "
> +#define BYTE_LENTH_IN_1G 0x40000000
LENGTH, but the macro name itself is also a little weird-sounding
> +#define HUGETLB_FILL 0xab
> +
> +static void *sigbus_addr;
> +static int sigbus_addr_lsb;
> +static bool expecting_sigbus;
> +static bool got_sigbus;
> +static bool was_mceerr;
> +
> +static int create_hugetlbfs_file(struct statfs *file_stat)
> +{
> + int fd;
> + int flags = MFD_HUGETLB | MFD_HUGE_1GB;
> +
> + fd = memfd_create("hugetlb_tmp", flags);
> + if (fd < 0)
> + ksft_exit_fail_perror("Failed to memfd_create");
> +
> + memset(file_stat, 0, sizeof(*file_stat));
> + if (fstatfs(fd, file_stat)) {
> + close(fd);
> + ksft_exit_fail_perror("Failed to fstatfs");
> + }
> + if (file_stat->f_type != HUGETLBFS_MAGIC) {
> + close(fd);
> + ksft_exit_fail_msg("Not hugetlbfs file");
> + }
> +
> + ksft_print_msg("Created hugetlb_tmp file\n");
> + ksft_print_msg("hugepagesize=%#lx\n", file_stat->f_bsize);
> + if (file_stat->f_bsize != BYTE_LENTH_IN_1G)
> + ksft_exit_fail_msg("Hugepage size is not 1G");
> +
> + return fd;
> +}
> +
> +/*
> + * SIGBUS handler for "do_hwpoison" thread that mapped and MADV_HWPOISON
> + */
> +static void sigbus_handler(int signo, siginfo_t *info, void *context)
> +{
> + if (!expecting_sigbus)
> + ksft_exit_fail_msg("unexpected sigbus with addr=%p",
> + info->si_addr);
> +
> + got_sigbus = true;
> + was_mceerr = (info->si_code == BUS_MCEERR_AO ||
> + info->si_code == BUS_MCEERR_AR);
> + sigbus_addr = info->si_addr;
> + sigbus_addr_lsb = info->si_addr_lsb;
> +}
> +
> +static void *do_hwpoison(void *hwpoison_addr)
> +{
> + int hwpoison_size = getpagesize();
> +
> + ksft_print_msg("MADV_HWPOISON hwpoison_addr=%p, len=%d\n",
> + hwpoison_addr, hwpoison_size);
> + if (madvise(hwpoison_addr, hwpoison_size, MADV_HWPOISON) < 0)
> + ksft_exit_fail_perror("Failed to MADV_HWPOISON");
> +
> + pthread_exit(NULL);
> +}
> +
> +static void test_hwpoison_multiple_pages(unsigned char *start_addr)
> +{
> + pthread_t pthread;
> + int ret;
> + unsigned char *hwpoison_addr;
> + unsigned long offsets[] = {0x200000};
> +
> + for (size_t i = 0; i < ARRAY_SIZE(offsets); ++i) {
> + sigbus_addr = (void *)0xBADBADBAD;
> + sigbus_addr_lsb = 0;
> + was_mceerr = false;
> + got_sigbus = false;
> + expecting_sigbus = true;
> + hwpoison_addr = start_addr + offsets[i];
> +
> + ret = pthread_create(&pthread, NULL, &do_hwpoison, hwpoison_addr);
> + if (ret)
> + ksft_exit_fail_perror("Failed to create hwpoison thread");
> +
> + ksft_print_msg("Created thread to hwpoison and access hwpoison_addr=%p\n",
> + hwpoison_addr);
> +
> + pthread_join(pthread, NULL);
> +
> + if (!got_sigbus)
> + ksft_test_result_fail("Didn't get a SIGBUS\n");
> + if (!was_mceerr)
> + ksft_test_result_fail("Didn't get a BUS_MCEERR_A(R|O)\n");
> + if (sigbus_addr != hwpoison_addr)
> + ksft_test_result_fail("Incorrect address: got=%p, expected=%p\n",
> + sigbus_addr, hwpoison_addr);
> + if (sigbus_addr_lsb != 30)
> + ksft_test_result_fail("Incorrect address LSB: got=%d, expected=%d\n",
> + sigbus_addr_lsb, pshift());
> +
> + ksft_print_msg("Received expected and correct SIGBUS\n");
> + }
> +}
> +
> +static int read_nr_hugepages(unsigned long hugepage_size,
> + unsigned long *nr_hugepages)
> +{
> + char buffer[256] = {0};
> + char cmd[256] = {0};
> +
> + sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
> + hugepage_size);
You'll notice this is just reading a file through an external command
> + FILE *cmdfile = popen(cmd, "r");
> +
> + if (cmdfile == NULL) {
> + ksft_perror(EPREFIX "failed to popen nr_hugepages");
> + return -1;
> + }
> +
> + if (!fgets(buffer, sizeof(buffer), cmdfile)) {
> + ksft_perror(EPREFIX "failed to read nr_hugepages");
> + pclose(cmdfile);
> + return -1;
> + }
and this is just reading a "file" (pipe) directly, soo....
static int read_nr_hugepages(...)
{
	char path[256];

	snprintf(path, 256,
		 "/sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
		 hugepage_size);
	FILE *file = fopen(path, "r");

	fscanf(file, "%lu", nr_hugepages);
}
(error handling omitted)
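With the error handling kept, a sketch of that approach (reusing the
helper's existing signature, untested) could look like:

static int read_nr_hugepages(unsigned long hugepage_size,
			     unsigned long *nr_hugepages)
{
	char path[256];
	FILE *file;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
		 hugepage_size);

	file = fopen(path, "r");
	if (!file) {
		ksft_perror(EPREFIX "failed to open nr_hugepages");
		return -1;
	}
	/* The sysfs file holds a single counter value. */
	if (fscanf(file, "%lu", nr_hugepages) != 1) {
		ksft_perror(EPREFIX "failed to read nr_hugepages");
		fclose(file);
		return -1;
	}
	fclose(file);
	return 0;
}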
--
Pedro
* Re: [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation
2025-01-19 18:06 [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation Jiaqi Yan
2025-01-19 18:06 ` [RFC PATCH v1 1/2] selftest/mm: test HWPoison hugetlb truncation behavior Jiaqi Yan
2025-01-19 18:06 ` [RFC PATCH v1 2/2] mm/hugetlb: immature fix to handle HWPoisoned folio Jiaqi Yan
@ 2025-01-20 10:59 ` David Hildenbrand
2025-01-21 1:21 ` Jiaqi Yan
2 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-01-20 10:59 UTC (permalink / raw)
To: Jiaqi Yan, nao.horiguchi, linmiaohe, sidhartha.kumar, muchun.song
Cc: jane.chu, akpm, osalvador, rientjes, jthoughton, linux-mm, linux-kernel
On 19.01.25 19:06, Jiaqi Yan wrote:
> While I was working on userspace MFR via memfd [1], I spend some time to
> understand what current kernel does when a HugeTLB-backing memfd is
> truncated. My expectation is, if there is a HWPoison HugeTLB folio
> mapped via the memfd to userspace, it will be unmapped right away but
> still be kept in page cache [2]; however when the memfd is truncated to
> zero or after the memfd is closed, kernel should dissolve the HWPoison
> folio in the page cache, and free only the clean raw pages to buddy
> allocator, excluding the poisoned raw page.
>
> So I wrote a hugetlb-mfr-base.c selftest and expect
> 0. say nr_hugepages initially is 64 as system configuration.
> 1. after MADV_HWPOISON, nr_hugepages should still be 64 as we kept even
> HWPoison huge folio in page cache. free_hugepages should be
> nr_hugepages minus whatever the amount in use.
> 2. after truncated memfd to zero, nr_hugepages should reduced to 63 as
> kernel dissolved and freed the HWPoison huge folio. free_hugepages
> should also be 63.
>
> However, when testing at the head of mm-stable commit 2877a83e4a0a
> ("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I found
> although free_hugepages is reduced to 63, nr_hugepages is not reduced
> and stay at 64.
>
> Is my expectation outdated? Or is this some kind of bug?
>
> I assume this is a bug and then digged a little bit more. It seems there
> are two issues, or two things I don't really understand.
>
> 1. During try_memory_failure_hugetlb, we should increased the target
> in-use folio's refcount via get_hwpoison_hugetlb_folio. However,
> until the end of try_memory_failure_hugetlb, this refcout is not put.
> I can make sense of this given we keep in-use huge folio in page
> cache.
Isn't the general rule that hwpoisoned folios have a raised refcount
such that they won't get freed + reused? At least that's how the buddy
deals with them, and I suspect also hugetlb?
> [ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
> [ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> [ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
> [ 1069.320982] page_type: f4(hugetlb)
> [ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
> [ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
> [ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
> [ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
> [ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
> [ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
> [ 1069.320992] page dumped because: track hwpoison folio's ref
>
> 2. Even if folio's refcount do drop to zero and we get into
> free_huge_folio, it is not clear to me which part of free_huge_folio
> is handling the case that folio is HWPoison. In my test what I
> observed is that evantually the folio is enqueue_hugetlb_folio()-ed.
How would we get a refcount of 0 if we assume the raised refcount on a
hwpoisoned hugetlb folio?
I'm probably missing something: are you saying that you can trigger a
hwpoisoned hugetlb folio to get reallocated again, in upstream code?
--
Cheers,
David / dhildenb
* Re: [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation
2025-01-20 10:59 ` [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation David Hildenbrand
@ 2025-01-21 1:21 ` Jiaqi Yan
2025-01-21 5:00 ` jane.chu
2025-01-21 8:02 ` David Hildenbrand
0 siblings, 2 replies; 10+ messages in thread
From: Jiaqi Yan @ 2025-01-21 1:21 UTC (permalink / raw)
To: David Hildenbrand
Cc: nao.horiguchi, linmiaohe, sidhartha.kumar, muchun.song, jane.chu,
akpm, osalvador, rientjes, jthoughton, linux-mm, linux-kernel
On Mon, Jan 20, 2025 at 2:59 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.01.25 19:06, Jiaqi Yan wrote:
> > While I was working on userspace MFR via memfd [1], I spend some time to
> > understand what current kernel does when a HugeTLB-backing memfd is
> > truncated. My expectation is, if there is a HWPoison HugeTLB folio
> > mapped via the memfd to userspace, it will be unmapped right away but
> > still be kept in page cache [2]; however when the memfd is truncated to
> > zero or after the memfd is closed, kernel should dissolve the HWPoison
> > folio in the page cache, and free only the clean raw pages to buddy
> > allocator, excluding the poisoned raw page.
> >
> > So I wrote a hugetlb-mfr-base.c selftest and expect
> > 0. say nr_hugepages initially is 64 as system configuration.
> > 1. after MADV_HWPOISON, nr_hugepages should still be 64 as we kept even
> > HWPoison huge folio in page cache. free_hugepages should be
> > nr_hugepages minus whatever the amount in use.
> > 2. after truncated memfd to zero, nr_hugepages should reduced to 63 as
> > kernel dissolved and freed the HWPoison huge folio. free_hugepages
> > should also be 63.
> >
> > However, when testing at the head of mm-stable commit 2877a83e4a0a
> > ("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I found
> > although free_hugepages is reduced to 63, nr_hugepages is not reduced
> > and stay at 64.
> >
> > Is my expectation outdated? Or is this some kind of bug?
> >
> > I assume this is a bug and then digged a little bit more. It seems there
> > are two issues, or two things I don't really understand.
> >
> > 1. During try_memory_failure_hugetlb, we should increased the target
> > in-use folio's refcount via get_hwpoison_hugetlb_folio. However,
> > until the end of try_memory_failure_hugetlb, this refcout is not put.
> > I can make sense of this given we keep in-use huge folio in page
> > cache.
>
> Isn't the general rule that hwpoisoned folios have a raised refcount
> such that they won't get freed + reused? At least that's how the buddy
> deals with them, and I suspect also hugetlb?
Thanks, David.
I see, so it is expected that the _entire_ huge folio will always have
at least a refcount of 1, even when the folio can become "free".
For a *free* huge folio, try_memory_failure_hugetlb dissolves it and
frees the clean pages (a lot of them) to the buddy allocator. This made
me think the same thing would happen for an *in-use* huge folio
_eventually_ (i.e. that the refcount taken for HWPoison can somehow be
put). I feel this is a little unfortunate for the clean pages, but if it
is what it is, that's fair, as it is not a bug.
>
> > [ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
> > [ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > [ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
> > [ 1069.320982] page_type: f4(hugetlb)
> > [ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
> > [ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
> > [ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
> > [ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
> > [ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
> > [ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
> > [ 1069.320992] page dumped because: track hwpoison folio's ref
> >
> > 2. Even if folio's refcount do drop to zero and we get into
> > free_huge_folio, it is not clear to me which part of free_huge_folio
> > is handling the case that folio is HWPoison. In my test what I
> > observed is that evantually the folio is enqueue_hugetlb_folio()-ed.
>
> How would we get a refcount of 0 if we assume the raised refcount on a
> hwpoisoned hugetlb folio?
>
> I'm probably missing something: are you saying that you can trigger a
> hwpoisoned hugetlb folio to get reallocated again, in upstream code?
No, I think it is just my misunderstanding. From what you said, the
expectation for a HWPoison hugetlb folio is just that it won't get
reallocated again, which is true.
My (wrong) expectation was that, in addition to the "won't be
reallocated again" part, some (large) portion of the huge folio would be
freed to the buddy allocator. On the other hand, is that something worth
having / improving? (1G - some_single_digit * 4KB) seems valuable to the
system, even though they are all 4K pages. #1 and #2 above are then what
needs to be done, if the improvement is worth chasing.
>
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation
2025-01-21 1:21 ` Jiaqi Yan
@ 2025-01-21 5:00 ` jane.chu
2025-01-21 5:08 ` Jiaqi Yan
2025-01-21 8:02 ` David Hildenbrand
1 sibling, 1 reply; 10+ messages in thread
From: jane.chu @ 2025-01-21 5:00 UTC (permalink / raw)
To: Jiaqi Yan, David Hildenbrand
Cc: nao.horiguchi, linmiaohe, sidhartha.kumar, muchun.song, akpm,
osalvador, rientjes, jthoughton, linux-mm, linux-kernel
On 1/20/2025 5:21 PM, Jiaqi Yan wrote:
> On Mon, Jan 20, 2025 at 2:59 AM David Hildenbrand <david@redhat.com> wrote:
>> On 19.01.25 19:06, Jiaqi Yan wrote:
>>> While I was working on userspace MFR via memfd [1], I spend some time to
>>> understand what current kernel does when a HugeTLB-backing memfd is
>>> truncated. My expectation is, if there is a HWPoison HugeTLB folio
>>> mapped via the memfd to userspace, it will be unmapped right away but
>>> still be kept in page cache [2]; however when the memfd is truncated to
>>> zero or after the memfd is closed, kernel should dissolve the HWPoison
>>> folio in the page cache, and free only the clean raw pages to buddy
>>> allocator, excluding the poisoned raw page.
>>>
>>> So I wrote a hugetlb-mfr-base.c selftest and expect
>>> 0. say nr_hugepages initially is 64 as system configuration.
>>> 1. after MADV_HWPOISON, nr_hugepages should still be 64 as we kept even
>>> HWPoison huge folio in page cache. free_hugepages should be
>>> nr_hugepages minus whatever the amount in use.
>>> 2. after truncated memfd to zero, nr_hugepages should reduced to 63 as
>>> kernel dissolved and freed the HWPoison huge folio. free_hugepages
>>> should also be 63.
>>>
>>> However, when testing at the head of mm-stable commit 2877a83e4a0a
>>> ("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I found
>>> although free_hugepages is reduced to 63, nr_hugepages is not reduced
>>> and stay at 64.
>>>
>>> Is my expectation outdated? Or is this some kind of bug?
>>>
>>> I assume this is a bug and then digged a little bit more. It seems there
>>> are two issues, or two things I don't really understand.
>>>
>>> 1. During try_memory_failure_hugetlb, we should increased the target
>>> in-use folio's refcount via get_hwpoison_hugetlb_folio. However,
>>> until the end of try_memory_failure_hugetlb, this refcout is not put.
>>> I can make sense of this given we keep in-use huge folio in page
>>> cache.
>> Isn't the general rule that hwpoisoned folios have a raised refcount
>> such that they won't get freed + reused? At least that's how the buddy
>> deals with them, and I suspect also hugetlb?
> Thanks, David.
>
> I see, so it is expected that the _entire_ huge folio will always have
> at least a refcount of 1, even when the folio can become "free".
>
> For *free* huge folio, try_memory_failure_hugetlb dissolves it and
> frees the clean pages (a lot) to the buddy allocator. This made me
> think the same thing will happen for *in-use* huge folio _eventually_
> (i.e. somehow the refcount due to HWPoison can be put). I feel this is
> a little bit unfortunate for the clean pages, but if it is what it is,
> that's fair as it is not a bug.
Agreed with David. For *in-use* hugetlb pages, including unused shmget
pages, hugetlb shouldn't dissolve the page, not until an explicit
freeing action is taken, like RMID and echo 0 > nr_hugepages.
-jane
>
>>> [ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
>>> [ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>> [ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
>>> [ 1069.320982] page_type: f4(hugetlb)
>>> [ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
>>> [ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
>>> [ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
>>> [ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
>>> [ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
>>> [ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
>>> [ 1069.320992] page dumped because: track hwpoison folio's ref
>>>
>>> 2. Even if folio's refcount do drop to zero and we get into
>>> free_huge_folio, it is not clear to me which part of free_huge_folio
>>> is handling the case that folio is HWPoison. In my test what I
>>> observed is that evantually the folio is enqueue_hugetlb_folio()-ed.
>> How would we get a refcount of 0 if we assume the raised refcount on a
>> hwpoisoned hugetlb folio?
>>
>> I'm probably missing something: are you saying that you can trigger a
>> hwpoisoned hugetlb folio to get reallocated again, in upstream code?
> No, I think it is just my misunderstanding. From what you said, the
> expectation of HWPoison hugetlb folio is just it won't get reallocated
> again, which is true.
>
> My (wrong) expectation is, in addition to the "won't reallocated
> again" part, some (large) portion of the huge folio will be freed to
> the buddy allocator. On the other hand, is it something worth having /
> improving? (1G - some_single_digit * 4KB) seems to be valuable to the
> system, though they are all 4K. #1 and #2 above are then what needs to
> be done if the improvement is worth chasing.
>
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
* Re: [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation
2025-01-21 5:00 ` jane.chu
@ 2025-01-21 5:08 ` Jiaqi Yan
2025-01-21 5:22 ` jane.chu
0 siblings, 1 reply; 10+ messages in thread
From: Jiaqi Yan @ 2025-01-21 5:08 UTC (permalink / raw)
To: jane.chu
Cc: David Hildenbrand, nao.horiguchi, linmiaohe, sidhartha.kumar,
muchun.song, akpm, osalvador, rientjes, jthoughton, linux-mm,
linux-kernel
On Mon, Jan 20, 2025 at 9:01 PM <jane.chu@oracle.com> wrote:
>
>
> On 1/20/2025 5:21 PM, Jiaqi Yan wrote:
> > On Mon, Jan 20, 2025 at 2:59 AM David Hildenbrand <david@redhat.com> wrote:
> >> On 19.01.25 19:06, Jiaqi Yan wrote:
> >>> While I was working on userspace MFR via memfd [1], I spend some time to
> >>> understand what current kernel does when a HugeTLB-backing memfd is
> >>> truncated. My expectation is, if there is a HWPoison HugeTLB folio
> >>> mapped via the memfd to userspace, it will be unmapped right away but
> >>> still be kept in page cache [2]; however when the memfd is truncated to
> >>> zero or after the memfd is closed, kernel should dissolve the HWPoison
> >>> folio in the page cache, and free only the clean raw pages to buddy
> >>> allocator, excluding the poisoned raw page.
> >>>
> >>> So I wrote a hugetlb-mfr-base.c selftest and expect
> >>> 0. say nr_hugepages initially is 64 as system configuration.
> >>> 1. after MADV_HWPOISON, nr_hugepages should still be 64 as we kept even
> >>> HWPoison huge folio in page cache. free_hugepages should be
> >>> nr_hugepages minus whatever the amount in use.
> >>> 2. after truncated memfd to zero, nr_hugepages should reduced to 63 as
> >>> kernel dissolved and freed the HWPoison huge folio. free_hugepages
> >>> should also be 63.
> >>>
> >>> However, when testing at the head of mm-stable commit 2877a83e4a0a
> >>> ("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I found
> >>> although free_hugepages is reduced to 63, nr_hugepages is not reduced
> >>> and stay at 64.
> >>>
> >>> Is my expectation outdated? Or is this some kind of bug?
> >>>
> >>> I assume this is a bug and then digged a little bit more. It seems there
> >>> are two issues, or two things I don't really understand.
> >>>
> >>> 1. During try_memory_failure_hugetlb, we should increased the target
> >>> in-use folio's refcount via get_hwpoison_hugetlb_folio. However,
> >>> until the end of try_memory_failure_hugetlb, this refcout is not put.
> >>> I can make sense of this given we keep in-use huge folio in page
> >>> cache.
> >> Isn't the general rule that hwpoisoned folios have a raised refcount
> >> such that they won't get freed + reused? At least that's how the buddy
> >> deals with them, and I suspect also hugetlb?
> > Thanks, David.
> >
> > I see, so it is expected that the _entire_ huge folio will always have
> > at least a refcount of 1, even when the folio can become "free".
> >
> > For *free* huge folio, try_memory_failure_hugetlb dissolves it and
> > frees the clean pages (a lot) to the buddy allocator. This made me
> > think the same thing will happen for *in-use* huge folio _eventually_
> > (i.e. somehow the refcount due to HWPoison can be put). I feel this is
> > a little bit unfortunate for the clean pages, but if it is what it is,
> > that's fair as it is not a bug.
>
> Agreed with David. For *in use* hugetlb pages, including unused shmget
> pages, hugetlb shouldn't dissvolve the page, not until an explicit freeing action is taken like
> RMID and echo 0 > nr_hugepages.
To clarify, I am not asking memory-failure.c to dissolve the hugepage
while it is in use, but rather when it becomes free (truncated, or the
process exited).
>
> -jane
>
> >
> >>> [ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
> >>> [ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> >>> [ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
> >>> [ 1069.320982] page_type: f4(hugetlb)
> >>> [ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
> >>> [ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
> >>> [ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
> >>> [ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
> >>> [ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
> >>> [ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
> >>> [ 1069.320992] page dumped because: track hwpoison folio's ref
> >>>
> >>> 2. Even if folio's refcount do drop to zero and we get into
> >>> free_huge_folio, it is not clear to me which part of free_huge_folio
> >>> is handling the case that folio is HWPoison. In my test what I
> >>> observed is that evantually the folio is enqueue_hugetlb_folio()-ed.
> >> How would we get a refcount of 0 if we assume the raised refcount on a
> >> hwpoisoned hugetlb folio?
> >>
> >> I'm probably missing something: are you saying that you can trigger a
> >> hwpoisoned hugetlb folio to get reallocated again, in upstream code?
> > No, I think it is just my misunderstanding. From what you said, the
> > expectation of HWPoison hugetlb folio is just it won't get reallocated
> > again, which is true.
> >
> > My (wrong) expectation is, in addition to the "won't reallocated
> > again" part, some (large) portion of the huge folio will be freed to
> > the buddy allocator. On the other hand, is it something worth having /
> > improving? (1G - some_single_digit * 4KB) seems to be valuable to the
> > system, though they are all 4K. #1 and #2 above are then what needs to
> > be done if the improvement is worth chasing.
> >
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >>
* Re: [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation
2025-01-21 5:08 ` Jiaqi Yan
@ 2025-01-21 5:22 ` jane.chu
0 siblings, 0 replies; 10+ messages in thread
From: jane.chu @ 2025-01-21 5:22 UTC (permalink / raw)
To: Jiaqi Yan
Cc: David Hildenbrand, nao.horiguchi, linmiaohe, sidhartha.kumar,
muchun.song, akpm, osalvador, rientjes, jthoughton, linux-mm,
linux-kernel
On 1/20/2025 9:08 PM, Jiaqi Yan wrote:
> On Mon, Jan 20, 2025 at 9:01 PM <jane.chu@oracle.com> wrote:
>>
>> On 1/20/2025 5:21 PM, Jiaqi Yan wrote:
>>> On Mon, Jan 20, 2025 at 2:59 AM David Hildenbrand <david@redhat.com> wrote:
>>>> On 19.01.25 19:06, Jiaqi Yan wrote:
>>>>> While I was working on userspace MFR via memfd [1], I spend some time to
>>>>> understand what current kernel does when a HugeTLB-backing memfd is
>>>>> truncated. My expectation is, if there is a HWPoison HugeTLB folio
>>>>> mapped via the memfd to userspace, it will be unmapped right away but
>>>>> still be kept in page cache [2]; however when the memfd is truncated to
>>>>> zero or after the memfd is closed, kernel should dissolve the HWPoison
>>>>> folio in the page cache, and free only the clean raw pages to buddy
>>>>> allocator, excluding the poisoned raw page.
>>>>>
>>>>> So I wrote a hugetlb-mfr-base.c selftest and expect
>>>>> 0. say nr_hugepages initially is 64 as system configuration.
>>>>> 1. after MADV_HWPOISON, nr_hugepages should still be 64 as we kept even
>>>>> HWPoison huge folio in page cache. free_hugepages should be
>>>>> nr_hugepages minus whatever the amount in use.
>>>>> 2. after truncated memfd to zero, nr_hugepages should reduced to 63 as
>>>>> kernel dissolved and freed the HWPoison huge folio. free_hugepages
>>>>> should also be 63.
>>>>>
>>>>> However, when testing at the head of mm-stable commit 2877a83e4a0a
>>>>> ("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I found
>>>>> although free_hugepages is reduced to 63, nr_hugepages is not reduced
>>>>> and stay at 64.
>>>>>
>>>>> Is my expectation outdated? Or is this some kind of bug?
>>>>>
>>>>> I assume this is a bug and then digged a little bit more. It seems there
>>>>> are two issues, or two things I don't really understand.
>>>>>
>>>>> 1. During try_memory_failure_hugetlb, we should increased the target
>>>>> in-use folio's refcount via get_hwpoison_hugetlb_folio. However,
>>>>> until the end of try_memory_failure_hugetlb, this refcout is not put.
>>>>> I can make sense of this given we keep in-use huge folio in page
>>>>> cache.
>>>> Isn't the general rule that hwpoisoned folios have a raised refcount
>>>> such that they won't get freed + reused? At least that's how the buddy
>>>> deals with them, and I suspect also hugetlb?
>>> Thanks, David.
>>>
>>> I see, so it is expected that the _entire_ huge folio will always have
>>> at least a refcount of 1, even when the folio can become "free".
>>>
>>> For *free* huge folio, try_memory_failure_hugetlb dissolves it and
>>> frees the clean pages (a lot) to the buddy allocator. This made me
>>> think the same thing will happen for *in-use* huge folio _eventually_
>>> (i.e. somehow the refcount due to HWPoison can be put). I feel this is
>>> a little bit unfortunate for the clean pages, but if it is what it is,
>>> that's fair as it is not a bug.
>> Agreed with David. For *in use* hugetlb pages, including unused shmget
>> pages, hugetlb shouldn't dissvolve the page, not until an explicit freeing action is taken like
>> RMID and echo 0 > nr_hugepages.
> To clarify myself, I am not asking memory-failure.c to dissolve the
> hugepage at the time it is in-use, but rather when it becomes free
> (truncated or process exited).
Understood, a free hugetlb page in the pool should have refcount 1 though.
-jane
>
>> -jane
>>
>>>>> [ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
>>>>> [ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>> [ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
>>>>> [ 1069.320982] page_type: f4(hugetlb)
>>>>> [ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
>>>>> [ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
>>>>> [ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
>>>>> [ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
>>>>> [ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
>>>>> [ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
>>>>> [ 1069.320992] page dumped because: track hwpoison folio's ref
>>>>>
>>>>> 2. Even if folio's refcount do drop to zero and we get into
>>>>> free_huge_folio, it is not clear to me which part of free_huge_folio
>>>>> is handling the case that folio is HWPoison. In my test what I
>>>>> observed is that evantually the folio is enqueue_hugetlb_folio()-ed.
>>>> How would we get a refcount of 0 if we assume the raised refcount on a
>>>> hwpoisoned hugetlb folio?
>>>>
>>>> I'm probably missing something: are you saying that you can trigger a
>>>> hwpoisoned hugetlb folio to get reallocated again, in upstream code?
>>> No, I think it is just my misunderstanding. From what you said, the
>>> expectation of HWPoison hugetlb folio is just it won't get reallocated
>>> again, which is true.
>>>
>>> My (wrong) expectation is, in addition to the "won't reallocated
>>> again" part, some (large) portion of the huge folio will be freed to
>>> the buddy allocator. On the other hand, is it something worth having /
>>> improving? (1G - some_single_digit * 4KB) seems to be valuable to the
>>> system, though they are all 4K. #1 and #2 above are then what needs to
>>> be done if the improvement is worth chasing.
>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>>
* Re: [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation
2025-01-21 1:21 ` Jiaqi Yan
2025-01-21 5:00 ` jane.chu
@ 2025-01-21 8:02 ` David Hildenbrand
1 sibling, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2025-01-21 8:02 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, linmiaohe, sidhartha.kumar, muchun.song, jane.chu,
akpm, osalvador, rientjes, jthoughton, linux-mm, linux-kernel
On 21.01.25 02:21, Jiaqi Yan wrote:
> On Mon, Jan 20, 2025 at 2:59 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.01.25 19:06, Jiaqi Yan wrote:
>>> While I was working on userspace MFR via memfd [1], I spend some time to
>>> understand what current kernel does when a HugeTLB-backing memfd is
>>> truncated. My expectation is, if there is a HWPoison HugeTLB folio
>>> mapped via the memfd to userspace, it will be unmapped right away but
>>> still be kept in page cache [2]; however when the memfd is truncated to
>>> zero or after the memfd is closed, kernel should dissolve the HWPoison
>>> folio in the page cache, and free only the clean raw pages to buddy
>>> allocator, excluding the poisoned raw page.
>>>
>>> So I wrote a hugetlb-mfr-base.c selftest and expect
>>> 0. say nr_hugepages initially is 64 as system configuration.
>>> 1. after MADV_HWPOISON, nr_hugepages should still be 64 as we kept even
>>> HWPoison huge folio in page cache. free_hugepages should be
>>> nr_hugepages minus whatever the amount in use.
>>> 2. after truncated memfd to zero, nr_hugepages should reduced to 63 as
>>> kernel dissolved and freed the HWPoison huge folio. free_hugepages
>>> should also be 63.
>>>
>>> However, when testing at the head of mm-stable commit 2877a83e4a0a
>>> ("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I found
>>> although free_hugepages is reduced to 63, nr_hugepages is not reduced
>>> and stay at 64.
>>>
>>> Is my expectation outdated? Or is this some kind of bug?
>>>
>>> I assume this is a bug and then digged a little bit more. It seems there
>>> are two issues, or two things I don't really understand.
>>>
>>> 1. During try_memory_failure_hugetlb, we should increased the target
>>> in-use folio's refcount via get_hwpoison_hugetlb_folio. However,
>>> until the end of try_memory_failure_hugetlb, this refcout is not put.
>>> I can make sense of this given we keep in-use huge folio in page
>>> cache.
>>
>> Isn't the general rule that hwpoisoned folios have a raised refcount
>> such that they won't get freed + reused? At least that's how the buddy
>> deals with them, and I suspect also hugetlb?
>
> Thanks, David.
>
> I see, so it is expected that the _entire_ huge folio will always have
> at least a refcount of 1, even when the folio can become "free".
>
> For *free* huge folio, try_memory_failure_hugetlb dissolves it and
> frees the clean pages (a lot) to the buddy allocator. This made me
> think the same thing will happen for *in-use* huge folio _eventually_
> (i.e. somehow the refcount due to HWPoison can be put). I feel this is
> a little bit unfortunate for the clean pages, but if it is what it is,
> that's fair as it is not a bug.
Yes, that's my understanding. Free pages are a lot easier to handle
because we can just reliably dissolve and free them. For in-use pages,
it's a lot more tricky.
Similar to ordinary free buddy vs. allocated pages.
>
>>
>>> [ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
>>> [ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>> [ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
>>> [ 1069.320982] page_type: f4(hugetlb)
>>> [ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
>>> [ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
>>> [ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
>>> [ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
>>> [ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
>>> [ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
>>> [ 1069.320992] page dumped because: track hwpoison folio's ref
>>>
>>> 2. Even if folio's refcount do drop to zero and we get into
>>> free_huge_folio, it is not clear to me which part of free_huge_folio
>>> is handling the case that folio is HWPoison. In my test what I
>>> observed is that evantually the folio is enqueue_hugetlb_folio()-ed.
>>
>> How would we get a refcount of 0 if we assume the raised refcount on a
>> hwpoisoned hugetlb folio?
>>
>> I'm probably missing something: are you saying that you can trigger a
>> hwpoisoned hugetlb folio to get reallocated again, in upstream code?
>
> No, I think it is just my misunderstanding. From what you said, the
> expectation of HWPoison hugetlb folio is just it won't get reallocated
> again, which is true.
Right.
>
> My (wrong) expectation is, in addition to the "won't reallocated
> again" part, some (large) portion of the huge folio will be freed to
> the buddy allocator. On the other hand, is it something worth having /
> improving? (1G - some_single_digit * 4KB) seems to be valuable to the
> system, though they are all 4K. #1 and #2 above are then what needs to
> be done if the improvement is worth chasing.
I think one challenge is making sure that the page won't accidentally
get reallocated again -- and in contrast to free hugetlb pages, we
cannot handle this split synchronously.
I recall that we might not remember which exact raw page of a hugetlb
folio was poisoned (HVO optimization), but maybe we changed that in the
meantime.
Likely it would be worth having, but probably not very easy to implement.
--
Cheers,
David / dhildenb