* [PATCH v2 0/2] mm/hugetlb: Restore the reservation
@ 2024-02-05 19:18 Breno Leitao
From: Breno Leitao @ 2024-02-05 19:18 UTC
  To: mike.kravetz, linux-mm, akpm, muchun.song
  Cc: lstoakes, willy, hannes, mhocko, roman.gushchin, linux-kernel

This is a fix for a case where a backing huge page could be stolen
after madvise(MADV_DONTNEED).

A full reproducer is in the selftest; see
https://lore.kernel.org/all/20240105155419.1939484-1-leitao@debian.org/

v1:
  * https://lore.kernel.org/all/20240117171058.2192286-1-leitao@debian.org/
v2:
  * In version 1, syzbot detected a lockdep dependency:
    https://lore.kernel.org/all/00000000000050a2fb060fdc478c@google.com/
    It was caused by vma_add_reservation() being called while holding
    the pte lock.
  * Version 2 fixes the problem by setting the restore_reserve bit
    inside the ptl, but calling vma_add_reservation() only after the
    lock is released.
  * The issue was also reported by a test done by Ryan Roberts.

In order to test this patch, I instrumented the kernel with LOCKDEP
and KASAN and ran the following tests without seeing any regression:
  * The selftest that reproduces the problem
  * All mm hugetlb selftests
	SUMMARY: PASS=9 SKIP=0 FAIL=0
  * All libhugetlbfs tests
	PASS:     0     86
	FAIL:     0      0

Breno Leitao (2):
  mm/hugetlb: Restore the reservation if needed
  selftests/mm: run_vmtests.sh: add hugetlb_madv_vs_map

 mm/hugetlb.c                              | 25 +++++++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh |  1 +
 2 files changed, 26 insertions(+)

-- 
2.34.1




* [PATCH v2 1/2] mm/hugetlb: Restore the reservation if needed
From: Breno Leitao @ 2024-02-05 19:18 UTC
  To: mike.kravetz, linux-mm, akpm, muchun.song
  Cc: lstoakes, willy, hannes, mhocko, roman.gushchin, linux-kernel,
	Rik van Riel

Currently there is a bug whereby a huge page can be stolen, and when
the original owner tries to fault it in later, the fault fails with a
SIGBUS.

You can reproduce it as follows (a minimal C sketch follows the
steps):
  1) Create a single huge page:
	echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

  2) mmap() the page above with MAP_HUGETLB into (void *ptr1).
	* This will mark the page as reserved.
  3) Touch the page, which causes a page fault and allocates the page.
	* This will move the page out of the free list.
	* It will also unreserve the page, since there is no free page
	  left.
  4) madvise(MADV_DONTNEED) the page.
	* This will free the page, but not mark it as reserved again.
  5) Allocate a second page with mmap(MAP_HUGETLB) into (void *ptr2).
	* This should fail, since there is no available page left.
	* But, since the page freed above was not re-reserved, this
	  mmap() succeeds.
  6) Faulting at ptr1 will cause a SIGBUS.
	* It will try to allocate a huge page, but there is none
	  available.
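
For illustration, here is a minimal C sketch of the sequence above.
It is a sketch under stated assumptions, not the actual selftest: it
hardcodes a 2MB huge page size, assumes step 1 has already set
nr_hugepages to 1, and the file name and HPAGE_SIZE constant are made
up for this example:

	/* hugetlb_steal_sketch.c -- illustrative only */
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumes 2MB huge pages */

	int main(void)
	{
		char *ptr1, *ptr2;

		/* Step 2: map one huge page; this takes the reservation */
		ptr1 = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (ptr1 == MAP_FAILED) {
			perror("mmap ptr1");
			return 1;
		}

		/*
		 * Step 3: fault the page in; it leaves the free list
		 * and is no longer reserved.
		 */
		memset(ptr1, 0, HPAGE_SIZE);

		/*
		 * Step 4: free the page; on a buggy kernel the
		 * reservation is not restored.
		 */
		madvise(ptr1, HPAGE_SIZE, MADV_DONTNEED);

		/*
		 * Step 5: should fail for lack of free pages, but
		 * succeeds because nothing is reserved anymore.
		 */
		ptr2 = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (ptr2 != MAP_FAILED)
			memset(ptr2, 0, HPAGE_SIZE);	/* steals the page */

		/*
		 * Step 6: on a buggy kernel this raises SIGBUS, since
		 * there is no huge page left to allocate.
		 */
		ptr1[0] = 'x';

		printf("no SIGBUS: the reservation was restored\n");
		return 0;
	}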

A full reproducer is in the selftest; see
https://lore.kernel.org/all/20240105155419.1939484-1-leitao@debian.org/

Fix this by restoring the reserved page if necessary.

These are the conditions for restoring the reservation:

 * The system is not using surplus pages. The goal is to reduce the
   surplus usage for this case.
 * The VMA has the HPAGE_RESV_OWNER flag set and is PRIVATE; this is
   safely checked using __vma_private_lock().
 * The page is anonymous.
Once this scenario is found, set the `hugetlb_restore_reserve` bit in
the folio. Then check whether the resv reservations need to be
adjusted; that is done later, after the spinlock is released, since
the vma_xxxx_reservation() helpers might touch the file system lock.

Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/hugetlb.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d4..44f1e6366d04 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5585,6 +5585,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	bool adjust_reservation = false;
 	unsigned long last_addr_mask;
 	bool force_flush = false;
 
@@ -5677,7 +5678,31 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
 		hugetlb_remove_rmap(page_folio(page));
 
+		/*
+		 * Restore the reservation for an anonymous page, otherwise
+		 * the backing page could be stolen by someone.
+		 * If we are freeing a surplus page, do not set the restore
+		 * reservation bit.
+		 */
+		if (!h->surplus_huge_pages && __vma_private_lock(vma) &&
+		    folio_test_anon(page_folio(page))) {
+			folio_set_hugetlb_restore_reserve(page_folio(page));
+			/* Reservation to be adjusted after the spin lock */
+			adjust_reservation = true;
+		}
+
 		spin_unlock(ptl);
+
+		/*
+		 * Adjust the reservation for the region that will have the
+		 * reserve restored. Keep in mind that vma_needs_reservation() changes
+		 * resv->adds_in_progress if it succeeds. If this is not done,
+		 * do_exit() will not see it, and will keep the reservation
+		 * forever.
+		 */
+		if (adjust_reservation && vma_needs_reservation(h, vma, address))
+			vma_add_reservation(h, vma, address);
+
 		tlb_remove_page_size(tlb, page, huge_page_size(h));
 		/*
 		 * Bail out after unmapping reference page if supplied
-- 
2.34.1




* [PATCH v2 2/2] selftests/mm: run_vmtests.sh: add hugetlb_madv_vs_map
From: Breno Leitao @ 2024-02-05 19:18 UTC
  To: mike.kravetz, linux-mm, akpm, muchun.song, Shuah Khan
  Cc: lstoakes, willy, hannes, mhocko, roman.gushchin, linux-kernel,
	open list:KERNEL SELFTEST FRAMEWORK

The hugetlb_madv_vs_map selftest was not part of the mm test suite,
since we did not have a fix for the problem it found.

Now that the problem is fixed (see the previous commit), let's enable
this selftest in the default test suite.
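
For reference, a hedged sketch of how to exercise it (this assumes
run_vmtests.sh's -t option, which selects test categories, and that
the tests are run as root):

	# Build the mm selftests, then run only the hugetlb category,
	# which now includes hugetlb_madv_vs_map.
	cd tools/testing/selftests/mm
	make
	sudo ./run_vmtests.sh -t hugetlb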

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/mm/run_vmtests.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 246d53a5d7f2..50e2094ed761 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -253,6 +253,7 @@ nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
 # For this test, we need one and just one huge page
 echo 1 > /proc/sys/vm/nr_hugepages
 CATEGORY="hugetlb" run_test ./hugetlb_fault_after_madv
+CATEGORY="hugetlb" run_test ./hugetlb_madv_vs_map
 # Restore the previous number of huge pages, since further tests rely on it
 echo "$nr_hugepages_tmp" > /proc/sys/vm/nr_hugepages
 
-- 
2.34.1


