[PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm)

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm)
@ 2025-01-29 11:53 David Hildenbrand
  2025-01-29 11:53 ` [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
                   ` (11 more replies)
  0 siblings, 12 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Discussing the PageTail() call in make_device_exclusive_range() with
Willy, I recently discovered [1] that device-exclusive handling does
not properly work with THP, making the hmm-tests selftests fail if THPs
are enabled on the system.

Looking into more details, I found that hugetlb is not properly fenced,
and I realized that something that was bugging me for longer -- how
device-exclusive entries interact with mapcounts -- completely breaks
migration and swapout of these folios while they have device-exclusive
PTEs.

The program below can be used to allocate 1 GiB worth of pages and
making them device-exclusive on a kernel with CONFIG_TEST_HMM.

Once they are device-exclusive, these folios cannot get swapped out
(/proc/$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
much one forces memory reclaim), and when having a memory block onlined
to ZONE_MOVABLE, trying to offline it will loop forever and complain about
failed migration of a page that should be movable.

# echo offline > /sys/devices/system/memory/memory136/state
# echo online_movable > /sys/devices/system/memory/memory136/state
# ./hmm-swap &
... wait until everything is device-exclusive
# echo offline > /sys/devices/system/memory/memory136/state
[  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
  index:0x7f20671f7 pfn:0x442b6a
[  285.196618][T14882] memcg:ffff888179298000
[  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
  dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
[  285.201734][T14882] raw: ...
[  285.204464][T14882] raw: ...
[  285.207196][T14882] page dumped because: migration failure
[  285.209072][T14882] page_owner tracks the page as allocated
[  285.210915][T14882] page last allocated via order 0, migratetype
  Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
  id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
[  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
[  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
[  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
[  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
[  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
[  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
[  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
[  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
...

This series fixes all issues I found so far. There is no easy way to fix
without a bigger rework/cleanup. I'll send out some additional cleanups
that are not strictly required separately on top.

I wish we could just use some special present PROT_NONE PTEs instead of
these (non-present, non-none) fake-swap entries; but that just results in
the same problem we keep having (lack of spare PTE bits), and staring at
other similar fake-swap entries, that ship has sailed.

With this series, make_device_exclusive() doesn't actually belong into
mm/rmap.c anymore, but I'll leave moving that for another day.

I only tested this series with the hmm-tests selftests due to lack of HW,
so I'd appreciate some testing, especially if the interaction between
two GPUs wanting a device-exclusive entry works as expected.

I was able to trigger the -EBUSY in the hmm-tests a couple of times; but
it doesn't retry yet. We should look into converting the folio_try_lock()
into a folio_lock(), and just retry GUP immediately if the folio_walk
fails in loop.

<program>
#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/ioctl.h>

#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)

struct hmm_dmirror_cmd {
	__u64 addr;
	__u64 ptr;
	__u64 npages;
	__u64 cpages;
	__u64 faults;
};

const size_t size = 1 * 1024 * 1024 * 1024ul;
const size_t chunk_size = 2 * 1024 * 1024ul;

int main(void)
{
	struct hmm_dmirror_cmd cmd;
	size_t cur_size;
	int fd, ret;
	char *addr, *mirror;

	fd = open("/dev/hmm_dmirror1", O_RDWR, 0);
	if (fd < 0) {
		perror("open failed\n");
		exit(1);
	}

	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap failed\n");
		exit(1);
	}
	madvise(addr, size, MADV_NOHUGEPAGE);
	memset(addr, 1, size);

	mirror = malloc(chunk_size);

	for (cur_size = 0; cur_size < size; cur_size += chunk_size) {
		cmd.addr = (uintptr_t)addr + cur_size;
		cmd.ptr = (uintptr_t)mirror;
		cmd.npages = chunk_size / getpagesize();
		ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd);
		if (ret) {
			perror("ioctl failed\n");
			exit(1);
		}
	}
	pause();
	return 0;
}
</program>

[1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2@redhat.com

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alex Shi <alexs@kernel.org>
Cc: Yanteng Si <si.yanteng@linux.dev>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>

David Hildenbrand (12):
  mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
  mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
  mm/rmap: convert make_device_exclusive_range() to
    make_device_exclusive()
  mm/rmap: implement make_device_exclusive() using folio_walk instead of
    rmap walk
  mm/memory: detect writability in restore_exclusive_pte() through
    can_change_pte_writable()
  mm: use single SWP_DEVICE_EXCLUSIVE entry type
  mm/page_vma_mapped: device-private entries are not migration entries
  mm/rmap: handle device-exclusive entries correctly in
    try_to_unmap_one()
  mm/rmap: handle device-exclusive entries correctly in
    try_to_migrate_one()
  mm/rmap: handle device-exclusive entries correctly in
    folio_referenced_one()
  mm/rmap: handle device-exclusive entries correctly in
    page_vma_mkclean_one()
  mm/rmap: keep mapcount untouched for device-exclusive entries

 Documentation/mm/hmm.rst                    |   2 +-
 Documentation/translations/zh_CN/mm/hmm.rst |   2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c       |   5 +-
 include/linux/mmu_notifier.h                |   2 +-
 include/linux/rmap.h                        |   5 +-
 include/linux/swap.h                        |   7 +-
 include/linux/swapops.h                     |  27 +-
 lib/test_hmm.c                              |  45 +-
 mm/gup.c                                    |   3 +
 mm/memory.c                                 |  28 +-
 mm/mprotect.c                               |   8 -
 mm/page_table_check.c                       |   5 +-
 mm/page_vma_mapped.c                        |   3 +-
 mm/rmap.c                                   | 457 +++++++++-----------
 14 files changed, 246 insertions(+), 353 deletions(-)

base-commit: 4845035bec80db8d716a94b80bb5593bf1d69270
-- 
2.48.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
@ 2025-01-29 11:53 ` David Hildenbrand
  2025-01-29 21:42   ` John Hubbard
  2025-01-30  5:46   ` Alistair Popple
  2025-01-29 11:54 ` [PATCH v1 02/12] mm/rmap: reject hugetlb folios in folio_make_device_exclusive() David Hildenbrand
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe, stable

We only have two FOLL_SPLIT_PMD users. While uprobe refuses hugetlb
early, make_device_exclusive_range() can end up getting called on
hugetlb VMAs.

Right now, this means that with a PMD-sized hugetlb page, we can end
up calling split_huge_pmd(), because pmd_trans_huge() also succeeds
with hugetlb PMDs.

For example, using a modified hmm-test selftest one can trigger:

[  207.017134][T14945] ------------[ cut here ]------------
[  207.018614][T14945] kernel BUG at mm/page_table_check.c:87!
[  207.019716][T14945] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[  207.021072][T14945] CPU: 3 UID: 0 PID: ...
[  207.023036][T14945] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[  207.024834][T14945] RIP: 0010:page_table_check_clear.part.0+0x488/0x510
[  207.026128][T14945] Code: ...
[  207.029965][T14945] RSP: 0018:ffffc9000cb8f348 EFLAGS: 00010293
[  207.031139][T14945] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff8249a0cd
[  207.032649][T14945] RDX: ffff88811e883c80 RSI: ffffffff8249a357 RDI: ffff88811e883c80
[  207.034183][T14945] RBP: ffff888105c0a050 R08: 0000000000000005 R09: 0000000000000000
[  207.035688][T14945] R10: 00000000ffffffff R11: 0000000000000003 R12: 0000000000000001
[  207.037203][T14945] R13: 0000000000000200 R14: 0000000000000001 R15: dffffc0000000000
[  207.038711][T14945] FS:  00007f2783275740(0000) GS:ffff8881f4980000(0000) knlGS:0000000000000000
[  207.040407][T14945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  207.041660][T14945] CR2: 00007f2782c00000 CR3: 0000000132356000 CR4: 0000000000750ef0
[  207.043196][T14945] PKRU: 55555554
[  207.043880][T14945] Call Trace:
[  207.044506][T14945]  <TASK>
[  207.045086][T14945]  ? __die+0x51/0x92
[  207.045864][T14945]  ? die+0x29/0x50
[  207.046596][T14945]  ? do_trap+0x250/0x320
[  207.047430][T14945]  ? do_error_trap+0xe7/0x220
[  207.048346][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.049535][T14945]  ? handle_invalid_op+0x34/0x40
[  207.050494][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.051681][T14945]  ? exc_invalid_op+0x2e/0x50
[  207.052589][T14945]  ? asm_exc_invalid_op+0x1a/0x20
[  207.053596][T14945]  ? page_table_check_clear.part.0+0x1fd/0x510
[  207.054790][T14945]  ? page_table_check_clear.part.0+0x487/0x510
[  207.055993][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.057195][T14945]  ? page_table_check_clear.part.0+0x487/0x510
[  207.058384][T14945]  __page_table_check_pmd_clear+0x34b/0x5a0
[  207.059524][T14945]  ? __pfx___page_table_check_pmd_clear+0x10/0x10
[  207.060775][T14945]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
[  207.061940][T14945]  ? __pfx___lock_acquire+0x10/0x10
[  207.062967][T14945]  pmdp_huge_clear_flush+0x279/0x360
[  207.064024][T14945]  split_huge_pmd_locked+0x82b/0x3750
...

Before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic
follow_page_mask code"), we would have ignored the flag; instead, let's
simply refuse the combination completely in check_vma_flags(): the
caller is likely not prepared to handle any hugetlb folios.

We'll teach make_device_exclusive_range() separately to ignore any hugetlb
folios as a future-proof safety net.

Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Cc: <stable@vger.kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 3883b307780e..61e751baf862 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1283,6 +1283,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if ((gup_flags & FOLL_SPLIT_PMD) && is_vm_hugetlb_page(vma))
+		return -EOPNOTSUPP;
+
 	if (vma_is_secretmem(vma))
 		return -EFAULT;
 
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
  2025-01-29 11:53 ` [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
@ 2025-01-29 21:42   ` John Hubbard
  2025-01-30  8:56     ` David Hildenbrand
  2025-01-30  5:46   ` Alistair Popple
  1 sibling, 1 reply; 56+ messages in thread
From: John Hubbard @ 2025-01-29 21:42 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, stable

On 1/29/25 3:53 AM, David Hildenbrand wrote:
> We only have two FOLL_SPLIT_PMD users. While uprobe refuses hugetlb
> early, make_device_exclusive_range() can end up getting called on
> hugetlb VMAs.
> 
> Right now, this means that with a PMD-sized hugetlb page, we can end
> up calling split_huge_pmd(), because pmd_trans_huge() also succeeds
> with hugetlb PMDs.
> 
> For example, using a modified hmm-test selftest one can trigger:
> 
> [  207.017134][T14945] ------------[ cut here ]------------
> [  207.018614][T14945] kernel BUG at mm/page_table_check.c:87!
> [  207.019716][T14945] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> [  207.021072][T14945] CPU: 3 UID: 0 PID: ...
> [  207.023036][T14945] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> [  207.024834][T14945] RIP: 0010:page_table_check_clear.part.0+0x488/0x510
> [  207.026128][T14945] Code: ...
> [  207.029965][T14945] RSP: 0018:ffffc9000cb8f348 EFLAGS: 00010293
> [  207.031139][T14945] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff8249a0cd
> [  207.032649][T14945] RDX: ffff88811e883c80 RSI: ffffffff8249a357 RDI: ffff88811e883c80
> [  207.034183][T14945] RBP: ffff888105c0a050 R08: 0000000000000005 R09: 0000000000000000
> [  207.035688][T14945] R10: 00000000ffffffff R11: 0000000000000003 R12: 0000000000000001
> [  207.037203][T14945] R13: 0000000000000200 R14: 0000000000000001 R15: dffffc0000000000
> [  207.038711][T14945] FS:  00007f2783275740(0000) GS:ffff8881f4980000(0000) knlGS:0000000000000000
> [  207.040407][T14945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  207.041660][T14945] CR2: 00007f2782c00000 CR3: 0000000132356000 CR4: 0000000000750ef0
> [  207.043196][T14945] PKRU: 55555554
> [  207.043880][T14945] Call Trace:
> [  207.044506][T14945]  <TASK>
> [  207.045086][T14945]  ? __die+0x51/0x92
> [  207.045864][T14945]  ? die+0x29/0x50
> [  207.046596][T14945]  ? do_trap+0x250/0x320
> [  207.047430][T14945]  ? do_error_trap+0xe7/0x220
> [  207.048346][T14945]  ? page_table_check_clear.part.0+0x488/0x510
> [  207.049535][T14945]  ? handle_invalid_op+0x34/0x40
> [  207.050494][T14945]  ? page_table_check_clear.part.0+0x488/0x510
> [  207.051681][T14945]  ? exc_invalid_op+0x2e/0x50
> [  207.052589][T14945]  ? asm_exc_invalid_op+0x1a/0x20
> [  207.053596][T14945]  ? page_table_check_clear.part.0+0x1fd/0x510
> [  207.054790][T14945]  ? page_table_check_clear.part.0+0x487/0x510
> [  207.055993][T14945]  ? page_table_check_clear.part.0+0x488/0x510
> [  207.057195][T14945]  ? page_table_check_clear.part.0+0x487/0x510
> [  207.058384][T14945]  __page_table_check_pmd_clear+0x34b/0x5a0
> [  207.059524][T14945]  ? __pfx___page_table_check_pmd_clear+0x10/0x10
> [  207.060775][T14945]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
> [  207.061940][T14945]  ? __pfx___lock_acquire+0x10/0x10
> [  207.062967][T14945]  pmdp_huge_clear_flush+0x279/0x360
> [  207.064024][T14945]  split_huge_pmd_locked+0x82b/0x3750
> ...
> 
> Before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic
> follow_page_mask code"), we would have ignored the flag; instead, let's

...and so after that commit (which doesn't touch FOLL_SPLIT_PMD, we no
longer ignore the flag? At a first look at that commit, I don't quite
understand the connection, can you clarify just a bit for me?

> simply refuse the combination completely in check_vma_flags(): the
> caller is likely not prepared to handle any hugetlb folios.

Yes.

> 
> We'll teach make_device_exclusive_range() separately to ignore any hugetlb
> folios as a future-proof safety net.
> 
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/gup.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 3883b307780e..61e751baf862 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1283,6 +1283,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
>   		return -EOPNOTSUPP;
>   
> +	if ((gup_flags & FOLL_SPLIT_PMD) && is_vm_hugetlb_page(vma))
> +		return -EOPNOTSUPP;
> +

This seems correct by inspection, as one cannot split a hugetlbfs page, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard

>   	if (vma_is_secretmem(vma))
>   		return -EFAULT;
>   




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
  2025-01-29 21:42   ` John Hubbard
@ 2025-01-30  8:56     ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30  8:56 UTC (permalink / raw)
  To: John Hubbard, linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, Andrew Morton,
	Jérôme Glisse, Jonathan Corbet, Alex Shi, Yanteng Si,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
	Jason Gunthorpe, stable

On 29.01.25 22:42, John Hubbard wrote:
> On 1/29/25 3:53 AM, David Hildenbrand wrote:
>> We only have two FOLL_SPLIT_PMD users. While uprobe refuses hugetlb
>> early, make_device_exclusive_range() can end up getting called on
>> hugetlb VMAs.
>>
>> Right now, this means that with a PMD-sized hugetlb page, we can end
>> up calling split_huge_pmd(), because pmd_trans_huge() also succeeds
>> with hugetlb PMDs.
>>
>> For example, using a modified hmm-test selftest one can trigger:
>>
>> [  207.017134][T14945] ------------[ cut here ]------------
>> [  207.018614][T14945] kernel BUG at mm/page_table_check.c:87!
>> [  207.019716][T14945] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
>> [  207.021072][T14945] CPU: 3 UID: 0 PID: ...
>> [  207.023036][T14945] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
>> [  207.024834][T14945] RIP: 0010:page_table_check_clear.part.0+0x488/0x510
>> [  207.026128][T14945] Code: ...
>> [  207.029965][T14945] RSP: 0018:ffffc9000cb8f348 EFLAGS: 00010293
>> [  207.031139][T14945] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff8249a0cd
>> [  207.032649][T14945] RDX: ffff88811e883c80 RSI: ffffffff8249a357 RDI: ffff88811e883c80
>> [  207.034183][T14945] RBP: ffff888105c0a050 R08: 0000000000000005 R09: 0000000000000000
>> [  207.035688][T14945] R10: 00000000ffffffff R11: 0000000000000003 R12: 0000000000000001
>> [  207.037203][T14945] R13: 0000000000000200 R14: 0000000000000001 R15: dffffc0000000000
>> [  207.038711][T14945] FS:  00007f2783275740(0000) GS:ffff8881f4980000(0000) knlGS:0000000000000000
>> [  207.040407][T14945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  207.041660][T14945] CR2: 00007f2782c00000 CR3: 0000000132356000 CR4: 0000000000750ef0
>> [  207.043196][T14945] PKRU: 55555554
>> [  207.043880][T14945] Call Trace:
>> [  207.044506][T14945]  <TASK>
>> [  207.045086][T14945]  ? __die+0x51/0x92
>> [  207.045864][T14945]  ? die+0x29/0x50
>> [  207.046596][T14945]  ? do_trap+0x250/0x320
>> [  207.047430][T14945]  ? do_error_trap+0xe7/0x220
>> [  207.048346][T14945]  ? page_table_check_clear.part.0+0x488/0x510
>> [  207.049535][T14945]  ? handle_invalid_op+0x34/0x40
>> [  207.050494][T14945]  ? page_table_check_clear.part.0+0x488/0x510
>> [  207.051681][T14945]  ? exc_invalid_op+0x2e/0x50
>> [  207.052589][T14945]  ? asm_exc_invalid_op+0x1a/0x20
>> [  207.053596][T14945]  ? page_table_check_clear.part.0+0x1fd/0x510
>> [  207.054790][T14945]  ? page_table_check_clear.part.0+0x487/0x510
>> [  207.055993][T14945]  ? page_table_check_clear.part.0+0x488/0x510
>> [  207.057195][T14945]  ? page_table_check_clear.part.0+0x487/0x510
>> [  207.058384][T14945]  __page_table_check_pmd_clear+0x34b/0x5a0
>> [  207.059524][T14945]  ? __pfx___page_table_check_pmd_clear+0x10/0x10
>> [  207.060775][T14945]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
>> [  207.061940][T14945]  ? __pfx___lock_acquire+0x10/0x10
>> [  207.062967][T14945]  pmdp_huge_clear_flush+0x279/0x360
>> [  207.064024][T14945]  split_huge_pmd_locked+0x82b/0x3750
>> ...
>>
>> Before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic
>> follow_page_mask code"), we would have ignored the flag; instead, let's
> 
> ...and so after that commit (which doesn't touch FOLL_SPLIT_PMD, we no
> longer ignore the flag? At a first look at that commit, I don't quite
> understand the connection, can you clarify just a bit for me?

Sure! Before that commit we always went via hugetlb_follow_page_mask(), 
so we never ended up in follow_pmd_mask().

hugetlb_follow_page_mask() didn't check for the flag ("ignored it"), so 
we would not have crashed in GUP.

Thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
  2025-01-29 11:53 ` [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
  2025-01-29 21:42   ` John Hubbard
@ 2025-01-30  5:46   ` Alistair Popple
  1 sibling, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2025-01-30  5:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe, stable

On Wed, Jan 29, 2025 at 12:53:59PM +0100, David Hildenbrand wrote:
> We only have two FOLL_SPLIT_PMD users. While uprobe refuses hugetlb
> early, make_device_exclusive_range() can end up getting called on
> hugetlb VMAs.
> 
> Right now, this means that with a PMD-sized hugetlb page, we can end
> up calling split_huge_pmd(), because pmd_trans_huge() also succeeds
> with hugetlb PMDs.
> 
> For example, using a modified hmm-test selftest one can trigger:

I haven't explicitly tested this with nouveau but I can't see anything that
would prevent this being called on a hugetlb vma there either.

Reviewed-by: Alistair Popple <apopple@nvidia.com>

> [  207.017134][T14945] ------------[ cut here ]------------
> [  207.018614][T14945] kernel BUG at mm/page_table_check.c:87!
> [  207.019716][T14945] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> [  207.021072][T14945] CPU: 3 UID: 0 PID: ...
> [  207.023036][T14945] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> [  207.024834][T14945] RIP: 0010:page_table_check_clear.part.0+0x488/0x510
> [  207.026128][T14945] Code: ...
> [  207.029965][T14945] RSP: 0018:ffffc9000cb8f348 EFLAGS: 00010293
> [  207.031139][T14945] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff8249a0cd
> [  207.032649][T14945] RDX: ffff88811e883c80 RSI: ffffffff8249a357 RDI: ffff88811e883c80
> [  207.034183][T14945] RBP: ffff888105c0a050 R08: 0000000000000005 R09: 0000000000000000
> [  207.035688][T14945] R10: 00000000ffffffff R11: 0000000000000003 R12: 0000000000000001
> [  207.037203][T14945] R13: 0000000000000200 R14: 0000000000000001 R15: dffffc0000000000
> [  207.038711][T14945] FS:  00007f2783275740(0000) GS:ffff8881f4980000(0000) knlGS:0000000000000000
> [  207.040407][T14945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  207.041660][T14945] CR2: 00007f2782c00000 CR3: 0000000132356000 CR4: 0000000000750ef0
> [  207.043196][T14945] PKRU: 55555554
> [  207.043880][T14945] Call Trace:
> [  207.044506][T14945]  <TASK>
> [  207.045086][T14945]  ? __die+0x51/0x92
> [  207.045864][T14945]  ? die+0x29/0x50
> [  207.046596][T14945]  ? do_trap+0x250/0x320
> [  207.047430][T14945]  ? do_error_trap+0xe7/0x220
> [  207.048346][T14945]  ? page_table_check_clear.part.0+0x488/0x510
> [  207.049535][T14945]  ? handle_invalid_op+0x34/0x40
> [  207.050494][T14945]  ? page_table_check_clear.part.0+0x488/0x510
> [  207.051681][T14945]  ? exc_invalid_op+0x2e/0x50
> [  207.052589][T14945]  ? asm_exc_invalid_op+0x1a/0x20
> [  207.053596][T14945]  ? page_table_check_clear.part.0+0x1fd/0x510
> [  207.054790][T14945]  ? page_table_check_clear.part.0+0x487/0x510
> [  207.055993][T14945]  ? page_table_check_clear.part.0+0x488/0x510
> [  207.057195][T14945]  ? page_table_check_clear.part.0+0x487/0x510
> [  207.058384][T14945]  __page_table_check_pmd_clear+0x34b/0x5a0
> [  207.059524][T14945]  ? __pfx___page_table_check_pmd_clear+0x10/0x10
> [  207.060775][T14945]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
> [  207.061940][T14945]  ? __pfx___lock_acquire+0x10/0x10
> [  207.062967][T14945]  pmdp_huge_clear_flush+0x279/0x360
> [  207.064024][T14945]  split_huge_pmd_locked+0x82b/0x3750
> ...
> 
> Before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic
> follow_page_mask code"), we would have ignored the flag; instead, let's
> simply refuse the combination completely in check_vma_flags(): the
> caller is likely not prepared to handle any hugetlb folios.
> 
> We'll teach make_device_exclusive_range() separately to ignore any hugetlb
> folios as a future-proof safety net.
> 
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/gup.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 3883b307780e..61e751baf862 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1283,6 +1283,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
>  		return -EOPNOTSUPP;
>  
> +	if ((gup_flags & FOLL_SPLIT_PMD) && is_vm_hugetlb_page(vma))
> +		return -EOPNOTSUPP;
> +
>  	if (vma_is_secretmem(vma))
>  		return -EFAULT;
>  
> -- 
> 2.48.1
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 02/12] mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
  2025-01-29 11:53 ` [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-30  5:47   ` Alistair Popple
  2025-01-29 11:54 ` [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe, stable

Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP,
let's add a safety net in case FOLL_SPLIT_PMD usage would ever be reworked.

In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the
generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have
returned a page. In particular, hugetlb folios that are not PMD-sized
would never have been prone to FOLL_SPLIT_PMD.

hugetlb folios can be anonymous, and page_make_device_exclusive_one() is
not really prepared for handling them at all. So let's spell that out.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Cc: <stable@vger.kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7..17fbfa61f7ef 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2499,7 +2499,7 @@ static bool folio_make_device_exclusive(struct folio *folio,
 	 * Restrict to anonymous folios for now to avoid potential writeback
 	 * issues.
 	 */
-	if (!folio_test_anon(folio))
+	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
 		return false;
 
 	rmap_walk(folio, &rwc);
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 02/12] mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
  2025-01-29 11:54 ` [PATCH v1 02/12] mm/rmap: reject hugetlb folios in folio_make_device_exclusive() David Hildenbrand
@ 2025-01-30  5:47   ` Alistair Popple
  0 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2025-01-30  5:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe, stable

On Wed, Jan 29, 2025 at 12:54:00PM +0100, David Hildenbrand wrote:
> Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP,
> let's add a safety net in case FOLL_SPLIT_PMD usage would ever be reworked.
> 
> In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the
> generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have
> returned a page. In particular, hugetlb folios that are not PMD-sized
> would never have been prone to FOLL_SPLIT_PMD.
> 
> hugetlb folios can be anonymous, and page_make_device_exclusive_one() is
> not really prepared for handling them at all. So let's spell that out.

Agreed, thanks for fixing.

Reviewed-by: Alistair Popple <apopple@nvidia.com>

> Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/rmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c6c4d4ea29a7..17fbfa61f7ef 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2499,7 +2499,7 @@ static bool folio_make_device_exclusive(struct folio *folio,
>  	 * Restrict to anonymous folios for now to avoid potential writeback
>  	 * issues.
>  	 */
> -	if (!folio_test_anon(folio))
> +	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
>  		return false;
>  
>  	rmap_walk(folio, &rwc);
> -- 
> 2.48.1
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
  2025-01-29 11:53 ` [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
  2025-01-29 11:54 ` [PATCH v1 02/12] mm/rmap: reject hugetlb folios in folio_make_device_exclusive() David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-30  5:57   ` Alistair Popple
  2025-01-30 13:46   ` Simona Vetter
  2025-01-29 11:54 ` [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk David Hildenbrand
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

The single "real" user in the tree of make_device_exclusive_range() always
requests making only a single address exclusive. The current implementation
is hard to fix for properly supporting anonymous THP / large folios and
for avoiding messing with rmap walks in weird ways.

So let's always process a single address/page and return folio + page to
minimize page -> folio lookups. This is a preparation for further
changes.

Reject any non-anonymous or hugetlb folios early, directly after GUP.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/mm/hmm.rst                    |  2 +-
 Documentation/translations/zh_CN/mm/hmm.rst |  2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c       |  5 +-
 include/linux/mmu_notifier.h                |  2 +-
 include/linux/rmap.h                        |  5 +-
 lib/test_hmm.c                              | 45 +++++------
 mm/rmap.c                                   | 90 +++++++++++----------
 7 files changed, 75 insertions(+), 76 deletions(-)

diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
index f6d53c37a2ca..7d61b7a8b65b 100644
--- a/Documentation/mm/hmm.rst
+++ b/Documentation/mm/hmm.rst
@@ -400,7 +400,7 @@ Exclusive access memory
 Some devices have features such as atomic PTE bits that can be used to implement
 atomic access to system memory. To support atomic operations to a shared virtual
 memory page such a device needs access to that page which is exclusive of any
-userspace access from the CPU. The ``make_device_exclusive_range()`` function
+userspace access from the CPU. The ``make_device_exclusive()`` function
 can be used to make a memory range inaccessible from userspace.
 
 This replaces all mappings for pages in the given range with special swap
diff --git a/Documentation/translations/zh_CN/mm/hmm.rst b/Documentation/translations/zh_CN/mm/hmm.rst
index 0669f947d0bc..22c210f4e94f 100644
--- a/Documentation/translations/zh_CN/mm/hmm.rst
+++ b/Documentation/translations/zh_CN/mm/hmm.rst
@@ -326,7 +326,7 @@ devm_memunmap_pages() 和 devm_release_mem_region() 当资源可以绑定到 ``s
 
 一些设备具有诸如原子PTE位的功能，可以用来实现对系统内存的原子访问。为了支持对一
 个共享的虚拟内存页的原子操作，这样的设备需要对该页的访问是排他的，而不是来自CPU
-的任何用户空间访问。  ``make_device_exclusive_range()`` 函数可以用来使一
+的任何用户空间访问。  ``make_device_exclusive()`` 函数可以用来使一
 个内存范围不能从用户空间访问。
 
 这将用特殊的交换条目替换给定范围内的所有页的映射。任何试图访问交换条目的行为都会
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index b4da82ddbb6b..39e3740980bb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -609,10 +609,9 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
 
 		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
 		mmap_read_lock(mm);
-		ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
-					    &page, drm->dev);
+		page = make_device_exclusive(mm, start, drm->dev, &folio);
 		mmap_read_unlock(mm);
-		if (ret <= 0 || !page) {
+		if (IS_ERR(page)) {
 			ret = -EINVAL;
 			goto out;
 		}
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index e2dd57ca368b..d4e714661826 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -46,7 +46,7 @@ struct mmu_interval_notifier;
  * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
  * longer have exclusive access to the page. When sent during creation of an
  * exclusive range the owner will be initialised to the value provided by the
- * caller of make_device_exclusive_range(), otherwise the owner will be NULL.
+ * caller of make_device_exclusive(), otherwise the owner will be NULL.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 683a04088f3f..86425d42c1a9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -663,9 +663,8 @@ int folio_referenced(struct folio *, int is_locked,
 void try_to_migrate(struct folio *folio, enum ttu_flags flags);
 void try_to_unmap(struct folio *, enum ttu_flags flags);
 
-int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, struct page **pages,
-				void *arg);
+struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
+		void *owner, struct folio **foliop);
 
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 056f2e411d7b..9e1b07a227a3 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -780,10 +780,8 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 	unsigned long start, end, addr;
 	unsigned long size = cmd->npages << PAGE_SHIFT;
 	struct mm_struct *mm = dmirror->notifier.mm;
-	struct page *pages[64];
 	struct dmirror_bounce bounce;
-	unsigned long next;
-	int ret;
+	int ret = 0;
 
 	start = cmd->addr;
 	end = start + size;
@@ -795,36 +793,31 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 		return -EINVAL;
 
 	mmap_read_lock(mm);
-	for (addr = start; addr < end; addr = next) {
-		unsigned long mapped = 0;
-		int i;
-
-		next = min(end, addr + (ARRAY_SIZE(pages) << PAGE_SHIFT));
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		struct folio *folio;
+		struct page *page;
 
-		ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
-		/*
-		 * Do dmirror_atomic_map() iff all pages are marked for
-		 * exclusive access to avoid accessing uninitialized
-		 * fields of pages.
-		 */
-		if (ret == (next - addr) >> PAGE_SHIFT)
-			mapped = dmirror_atomic_map(addr, next, pages, dmirror);
-		for (i = 0; i < ret; i++) {
-			if (pages[i]) {
-				unlock_page(pages[i]);
-				put_page(pages[i]);
-			}
+		page = make_device_exclusive(mm, addr, &folio, NULL);
+		if (IS_ERR(page)) {
+			ret = PTR_ERR(page);
+			break;
 		}
 
-		if (addr + (mapped << PAGE_SHIFT) < next) {
-			mmap_read_unlock(mm);
-			mmput(mm);
-			return -EBUSY;
-		}
+		ret = dmirror_atomic_map(addr, addr + PAGE_SIZE, &page, dmirror);
+		if (!ret)
+			ret = -EBUSY;
+		folio_unlock(folio);
+		folio_put(folio);
+
+		if (ret)
+			break;
 	}
 	mmap_read_unlock(mm);
 	mmput(mm);
 
+	if (ret)
+		return -EBUSY;
+
 	/* Return the migrated data for verification. */
 	ret = dmirror_bounce_init(&bounce, start, size);
 	if (ret)
diff --git a/mm/rmap.c b/mm/rmap.c
index 17fbfa61f7ef..676df4fba5b0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2495,70 +2495,78 @@ static bool folio_make_device_exclusive(struct folio *folio,
 		.arg = &args,
 	};
 
-	/*
-	 * Restrict to anonymous folios for now to avoid potential writeback
-	 * issues.
-	 */
-	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
-		return false;
-
 	rmap_walk(folio, &rwc);
 
 	return args.valid && !folio_mapcount(folio);
 }
 
 /**
- * make_device_exclusive_range() - Mark a range for exclusive use by a device
+ * make_device_exclusive() - Mark an address for exclusive use by a device
  * @mm: mm_struct of associated target process
- * @start: start of the region to mark for exclusive device access
- * @end: end address of region
- * @pages: returns the pages which were successfully marked for exclusive access
+ * @addr: the virtual address to mark for exclusive device access
  * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
+ * @foliop: folio pointer will be stored here on success.
+ *
+ * This function looks up the page mapped at the given address, grabs a
+ * folio reference, locks the folio and replaces the PTE with special
+ * device-exclusive non-swap entry, preventing userspace CPU access. The
+ * function will return with the folio locked and referenced.
  *
- * Returns: number of pages found in the range by GUP. A page is marked for
- * exclusive access only if the page pointer is non-NULL.
+ * On fault these special device-exclusive entries are replaced with the
+ * original PTE under folio lock, after calling MMU notifiers.
  *
- * This function finds ptes mapping page(s) to the given address range, locks
- * them and replaces mappings with special swap entries preventing userspace CPU
- * access. On fault these entries are replaced with the original mapping after
- * calling MMU notifiers.
+ * Only anonymous non-hugetlb folios are supported and the VMA must have
+ * write permissions such that we can fault in the anonymous page writable
+ * in order to mark it exclusive. The caller must hold the mmap_lock in read
+ * mode.
  *
  * A driver using this to program access from a device must use a mmu notifier
  * critical section to hold a device specific lock during programming. Once
  * programming is complete it should drop the page lock and reference after
  * which point CPU access to the page will revoke the exclusive access.
+ *
+ * Returns: pointer to mapped page on success, otherwise a negative error.
  */
-int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, struct page **pages,
-				void *owner)
+struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
+		void *owner, struct folio **foliop)
 {
-	long npages = (end - start) >> PAGE_SHIFT;
-	long i;
+	struct folio *folio;
+	struct page *page;
+	long npages;
+
+	mmap_assert_locked(mm);
 
-	npages = get_user_pages_remote(mm, start, npages,
+	/*
+	 * Fault in the page writable and try to lock it; note that if the
+	 * address would already be marked for exclusive use by the device,
+	 * the GUP call would undo that first by triggering a fault.
+	 */
+	npages = get_user_pages_remote(mm, addr, 1,
 				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
-				       pages, NULL);
-	if (npages < 0)
-		return npages;
-
-	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
-		struct folio *folio = page_folio(pages[i]);
-		if (PageTail(pages[i]) || !folio_trylock(folio)) {
-			folio_put(folio);
-			pages[i] = NULL;
-			continue;
-		}
+				       &page, NULL);
+	if (npages != 1)
+		return ERR_PTR(npages);
+	folio = page_folio(page);
 
-		if (!folio_make_device_exclusive(folio, mm, start, owner)) {
-			folio_unlock(folio);
-			folio_put(folio);
-			pages[i] = NULL;
-		}
+	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
+		folio_put(folio);
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	if (!folio_trylock(folio)) {
+		folio_put(folio);
+		return ERR_PTR(-EBUSY);
 	}
 
-	return npages;
+	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return ERR_PTR(-EBUSY);
+	}
+	*foliop = folio;
+	return page;
 }
-EXPORT_SYMBOL_GPL(make_device_exclusive_range);
+EXPORT_SYMBOL_GPL(make_device_exclusive);
 #endif
 
 void __put_anon_vma(struct anon_vma *anon_vma)
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-01-29 11:54 ` [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
@ 2025-01-30  5:57   ` Alistair Popple
  2025-01-30  9:04     ` David Hildenbrand
  2025-01-31  0:28     ` Alistair Popple
  2025-01-30 13:46   ` Simona Vetter
  1 sibling, 2 replies; 56+ messages in thread
From: Alistair Popple @ 2025-01-30  5:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:01PM +0100, David Hildenbrand wrote:
> The single "real" user in the tree of make_device_exclusive_range() always
> requests making only a single address exclusive. The current implementation
> is hard to fix for properly supporting anonymous THP / large folios and
> for avoiding messing with rmap walks in weird ways.
> 
> So let's always process a single address/page and return folio + page to
> minimize page -> folio lookups. This is a preparation for further
> changes.
> 
> Reject any non-anonymous or hugetlb folios early, directly after GUP.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  Documentation/mm/hmm.rst                    |  2 +-
>  Documentation/translations/zh_CN/mm/hmm.rst |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.c       |  5 +-
>  include/linux/mmu_notifier.h                |  2 +-
>  include/linux/rmap.h                        |  5 +-
>  lib/test_hmm.c                              | 45 +++++------
>  mm/rmap.c                                   | 90 +++++++++++----------
>  7 files changed, 75 insertions(+), 76 deletions(-)
> 
> diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
> index f6d53c37a2ca..7d61b7a8b65b 100644
> --- a/Documentation/mm/hmm.rst
> +++ b/Documentation/mm/hmm.rst
> @@ -400,7 +400,7 @@ Exclusive access memory
>  Some devices have features such as atomic PTE bits that can be used to implement
>  atomic access to system memory. To support atomic operations to a shared virtual
>  memory page such a device needs access to that page which is exclusive of any
> -userspace access from the CPU. The ``make_device_exclusive_range()`` function
> +userspace access from the CPU. The ``make_device_exclusive()`` function
>  can be used to make a memory range inaccessible from userspace.
>  
>  This replaces all mappings for pages in the given range with special swap
> diff --git a/Documentation/translations/zh_CN/mm/hmm.rst b/Documentation/translations/zh_CN/mm/hmm.rst
> index 0669f947d0bc..22c210f4e94f 100644
> --- a/Documentation/translations/zh_CN/mm/hmm.rst
> +++ b/Documentation/translations/zh_CN/mm/hmm.rst
> @@ -326,7 +326,7 @@ devm_memunmap_pages() 和 devm_release_mem_region() 当资源可以绑定到 ``s
>  
>  一些设备具有诸如原子PTE位的功能，可以用来实现对系统内存的原子访问。为了支持对一
>  个共享的虚拟内存页的原子操作，这样的设备需要对该页的访问是排他的，而不是来自CPU
> -的任何用户空间访问。  ``make_device_exclusive_range()`` 函数可以用来使一
> +的任何用户空间访问。  ``make_device_exclusive()`` 函数可以用来使一
>  个内存范围不能从用户空间访问。
>  
>  这将用特殊的交换条目替换给定范围内的所有页的映射。任何试图访问交换条目的行为都会
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index b4da82ddbb6b..39e3740980bb 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -609,10 +609,9 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
>  
>  		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
>  		mmap_read_lock(mm);
> -		ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
> -					    &page, drm->dev);
> +		page = make_device_exclusive(mm, start, drm->dev, &folio);
>  		mmap_read_unlock(mm);
> -		if (ret <= 0 || !page) {
> +		if (IS_ERR(page)) {
>  			ret = -EINVAL;
>  			goto out;
>  		}
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index e2dd57ca368b..d4e714661826 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -46,7 +46,7 @@ struct mmu_interval_notifier;
>   * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
>   * longer have exclusive access to the page. When sent during creation of an
>   * exclusive range the owner will be initialised to the value provided by the
> - * caller of make_device_exclusive_range(), otherwise the owner will be NULL.
> + * caller of make_device_exclusive(), otherwise the owner will be NULL.
>   */
>  enum mmu_notifier_event {
>  	MMU_NOTIFY_UNMAP = 0,
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 683a04088f3f..86425d42c1a9 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -663,9 +663,8 @@ int folio_referenced(struct folio *, int is_locked,
>  void try_to_migrate(struct folio *folio, enum ttu_flags flags);
>  void try_to_unmap(struct folio *, enum ttu_flags flags);
>  
> -int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> -				unsigned long end, struct page **pages,
> -				void *arg);
> +struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> +		void *owner, struct folio **foliop);
>  
>  /* Avoid racy checks */
>  #define PVMW_SYNC		(1 << 0)
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 056f2e411d7b..9e1b07a227a3 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -780,10 +780,8 @@ static int dmirror_exclusive(struct dmirror *dmirror,
>  	unsigned long start, end, addr;
>  	unsigned long size = cmd->npages << PAGE_SHIFT;
>  	struct mm_struct *mm = dmirror->notifier.mm;
> -	struct page *pages[64];
>  	struct dmirror_bounce bounce;
> -	unsigned long next;
> -	int ret;
> +	int ret = 0;
>  
>  	start = cmd->addr;
>  	end = start + size;
> @@ -795,36 +793,31 @@ static int dmirror_exclusive(struct dmirror *dmirror,
>  		return -EINVAL;
>  
>  	mmap_read_lock(mm);
> -	for (addr = start; addr < end; addr = next) {
> -		unsigned long mapped = 0;
> -		int i;
> -
> -		next = min(end, addr + (ARRAY_SIZE(pages) << PAGE_SHIFT));
> +	for (addr = start; addr < end; addr += PAGE_SIZE) {
> +		struct folio *folio;
> +		struct page *page;
>  
> -		ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
> -		/*
> -		 * Do dmirror_atomic_map() iff all pages are marked for
> -		 * exclusive access to avoid accessing uninitialized
> -		 * fields of pages.
> -		 */
> -		if (ret == (next - addr) >> PAGE_SHIFT)
> -			mapped = dmirror_atomic_map(addr, next, pages, dmirror);
> -		for (i = 0; i < ret; i++) {
> -			if (pages[i]) {
> -				unlock_page(pages[i]);
> -				put_page(pages[i]);
> -			}
> +		page = make_device_exclusive(mm, addr, &folio, NULL);
> +		if (IS_ERR(page)) {
> +			ret = PTR_ERR(page);
> +			break;
>  		}
>  
> -		if (addr + (mapped << PAGE_SHIFT) < next) {
> -			mmap_read_unlock(mm);
> -			mmput(mm);
> -			return -EBUSY;
> -		}
> +		ret = dmirror_atomic_map(addr, addr + PAGE_SIZE, &page, dmirror);
> +		if (!ret)
> +			ret = -EBUSY;
> +		folio_unlock(folio);
> +		folio_put(folio);
> +
> +		if (ret)
> +			break;
>  	}
>  	mmap_read_unlock(mm);
>  	mmput(mm);
>  
> +	if (ret)
> +		return -EBUSY;
> +
>  	/* Return the migrated data for verification. */
>  	ret = dmirror_bounce_init(&bounce, start, size);
>  	if (ret)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 17fbfa61f7ef..676df4fba5b0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2495,70 +2495,78 @@ static bool folio_make_device_exclusive(struct folio *folio,
>  		.arg = &args,
>  	};
>  
> -	/*
> -	 * Restrict to anonymous folios for now to avoid potential writeback
> -	 * issues.
> -	 */
> -	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
> -		return false;

A bit of churn with the previous patch but I suppose it makes a stable backport
of the previous patch easier. The rest looks good so:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

>  	rmap_walk(folio, &rwc);
>  
>  	return args.valid && !folio_mapcount(folio);
>  }
>  
>  /**
> - * make_device_exclusive_range() - Mark a range for exclusive use by a device
> + * make_device_exclusive() - Mark an address for exclusive use by a device
>   * @mm: mm_struct of associated target process
> - * @start: start of the region to mark for exclusive device access
> - * @end: end address of region
> - * @pages: returns the pages which were successfully marked for exclusive access
> + * @addr: the virtual address to mark for exclusive device access
>   * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
> + * @foliop: folio pointer will be stored here on success.
> + *
> + * This function looks up the page mapped at the given address, grabs a
> + * folio reference, locks the folio and replaces the PTE with special
> + * device-exclusive non-swap entry, preventing userspace CPU access. The
> + * function will return with the folio locked and referenced.
>   *
> - * Returns: number of pages found in the range by GUP. A page is marked for
> - * exclusive access only if the page pointer is non-NULL.
> + * On fault these special device-exclusive entries are replaced with the
> + * original PTE under folio lock, after calling MMU notifiers.
>   *
> - * This function finds ptes mapping page(s) to the given address range, locks
> - * them and replaces mappings with special swap entries preventing userspace CPU
> - * access. On fault these entries are replaced with the original mapping after
> - * calling MMU notifiers.
> + * Only anonymous non-hugetlb folios are supported and the VMA must have
> + * write permissions such that we can fault in the anonymous page writable
> + * in order to mark it exclusive. The caller must hold the mmap_lock in read
> + * mode.
>   *
>   * A driver using this to program access from a device must use a mmu notifier
>   * critical section to hold a device specific lock during programming. Once
>   * programming is complete it should drop the page lock and reference after
>   * which point CPU access to the page will revoke the exclusive access.
> + *
> + * Returns: pointer to mapped page on success, otherwise a negative error.
>   */
> -int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> -				unsigned long end, struct page **pages,
> -				void *owner)
> +struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> +		void *owner, struct folio **foliop)
>  {
> -	long npages = (end - start) >> PAGE_SHIFT;
> -	long i;
> +	struct folio *folio;
> +	struct page *page;
> +	long npages;
> +
> +	mmap_assert_locked(mm);
>  
> -	npages = get_user_pages_remote(mm, start, npages,
> +	/*
> +	 * Fault in the page writable and try to lock it; note that if the
> +	 * address would already be marked for exclusive use by the device,
> +	 * the GUP call would undo that first by triggering a fault.
> +	 */
> +	npages = get_user_pages_remote(mm, addr, 1,
>  				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> -				       pages, NULL);
> -	if (npages < 0)
> -		return npages;
> -
> -	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> -		struct folio *folio = page_folio(pages[i]);
> -		if (PageTail(pages[i]) || !folio_trylock(folio)) {
> -			folio_put(folio);
> -			pages[i] = NULL;
> -			continue;
> -		}
> +				       &page, NULL);
> +	if (npages != 1)
> +		return ERR_PTR(npages);
> +	folio = page_folio(page);
>  
> -		if (!folio_make_device_exclusive(folio, mm, start, owner)) {
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			pages[i] = NULL;
> -		}
> +	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
> +		folio_put(folio);
> +		return ERR_PTR(-EOPNOTSUPP);
> +	}
> +
> +	if (!folio_trylock(folio)) {
> +		folio_put(folio);
> +		return ERR_PTR(-EBUSY);
>  	}
>  
> -	return npages;
> +	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		return ERR_PTR(-EBUSY);
> +	}
> +	*foliop = folio;
> +	return page;
>  }
> -EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> +EXPORT_SYMBOL_GPL(make_device_exclusive);
>  #endif
>  
>  void __put_anon_vma(struct anon_vma *anon_vma)
> -- 
> 2.48.1
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-01-30  5:57   ` Alistair Popple
@ 2025-01-30  9:04     ` David Hildenbrand
  2025-01-31  0:28     ` Alistair Popple
  1 sibling, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30  9:04 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 17fbfa61f7ef..676df4fba5b0 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2495,70 +2495,78 @@ static bool folio_make_device_exclusive(struct folio *folio,
>>   		.arg = &args,
>>   	};
>>   
>> -	/*
>> -	 * Restrict to anonymous folios for now to avoid potential writeback
>> -	 * issues.
>> -	 */
>> -	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
>> -		return false;
> 
> A bit of churn with the previous patch but I suppose it makes a stable backport
> of the previous patch easier.

Yes, that's the idea.

Thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-01-30  5:57   ` Alistair Popple
  2025-01-30  9:04     ` David Hildenbrand
@ 2025-01-31  0:28     ` Alistair Popple
  2025-01-31  9:29       ` David Hildenbrand
  1 sibling, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2025-01-31  0:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 04:57:39PM +1100, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:54:01PM +0100, David Hildenbrand wrote:

[...]

> > -int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> > -				unsigned long end, struct page **pages,
> > -				void *owner)
> > +struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> > +		void *owner, struct folio **foliop)
> >  {
> > -	long npages = (end - start) >> PAGE_SHIFT;
> > -	long i;
> > +	struct folio *folio;
> > +	struct page *page;
> > +	long npages;
> > +
> > +	mmap_assert_locked(mm);
> >  
> > -	npages = get_user_pages_remote(mm, start, npages,
> > +	/*
> > +	 * Fault in the page writable and try to lock it; note that if the
> > +	 * address would already be marked for exclusive use by the device,
> > +	 * the GUP call would undo that first by triggering a fault.
> > +	 */
> > +	npages = get_user_pages_remote(mm, addr, 1,
> >  				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> > -				       pages, NULL);
> > -	if (npages < 0)
> > -		return npages;
> > -
> > -	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> > -		struct folio *folio = page_folio(pages[i]);
> > -		if (PageTail(pages[i]) || !folio_trylock(folio)) {
> > -			folio_put(folio);
> > -			pages[i] = NULL;
> > -			continue;
> > -		}
> > +				       &page, NULL);
> > +	if (npages != 1)
> > +		return ERR_PTR(npages);
> > +	folio = page_folio(page);
> >  
> > -		if (!folio_make_device_exclusive(folio, mm, start, owner)) {
> > -			folio_unlock(folio);
> > -			folio_put(folio);
> > -			pages[i] = NULL;
> > -		}
> > +	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
> > +		folio_put(folio);
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +	}
> > +
> > +	if (!folio_trylock(folio)) {

Actually I think we can make this folio_lock(folio) now. The only reason for
the trylock was to avoid deadlock between other threads looping over a range
of folios while holding folio locks which is something the migration code also
does.

 - Alistair

> > +		folio_put(folio);
> > +		return ERR_PTR(-EBUSY);
> >  	}
> >  
> > -	return npages;
> > +	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
> > +		folio_unlock(folio);
> > +		folio_put(folio);
> > +		return ERR_PTR(-EBUSY);
> > +	}
> > +	*foliop = folio;
> > +	return page;
> >  }
> > -EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> > +EXPORT_SYMBOL_GPL(make_device_exclusive);
> >  #endif
> >  
> >  void __put_anon_vma(struct anon_vma *anon_vma)
> > -- 
> > 2.48.1
> > 
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-01-31  0:28     ` Alistair Popple
@ 2025-01-31  9:29       ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-31  9:29 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 31.01.25 01:28, Alistair Popple wrote:
> On Thu, Jan 30, 2025 at 04:57:39PM +1100, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:01PM +0100, David Hildenbrand wrote:
> 
> [...]
> 
>>> -int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
>>> -				unsigned long end, struct page **pages,
>>> -				void *owner)
>>> +struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>>> +		void *owner, struct folio **foliop)
>>>   {
>>> -	long npages = (end - start) >> PAGE_SHIFT;
>>> -	long i;
>>> +	struct folio *folio;
>>> +	struct page *page;
>>> +	long npages;
>>> +
>>> +	mmap_assert_locked(mm);
>>>   
>>> -	npages = get_user_pages_remote(mm, start, npages,
>>> +	/*
>>> +	 * Fault in the page writable and try to lock it; note that if the
>>> +	 * address would already be marked for exclusive use by the device,
>>> +	 * the GUP call would undo that first by triggering a fault.
>>> +	 */
>>> +	npages = get_user_pages_remote(mm, addr, 1,
>>>   				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
>>> -				       pages, NULL);
>>> -	if (npages < 0)
>>> -		return npages;
>>> -
>>> -	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
>>> -		struct folio *folio = page_folio(pages[i]);
>>> -		if (PageTail(pages[i]) || !folio_trylock(folio)) {
>>> -			folio_put(folio);
>>> -			pages[i] = NULL;
>>> -			continue;
>>> -		}
>>> +				       &page, NULL);
>>> +	if (npages != 1)
>>> +		return ERR_PTR(npages);
>>> +	folio = page_folio(page);
>>>   
>>> -		if (!folio_make_device_exclusive(folio, mm, start, owner)) {
>>> -			folio_unlock(folio);
>>> -			folio_put(folio);
>>> -			pages[i] = NULL;
>>> -		}
>>> +	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
>>> +		folio_put(folio);
>>> +		return ERR_PTR(-EOPNOTSUPP);
>>> +	}
>>> +
>>> +	if (!folio_trylock(folio)) {
> 
> Actually I think we can make this folio_lock(folio) now. The only reason for
> the trylock was to avoid deadlock between other threads looping over a range
> of folios while holding folio locks which is something the migration code also
> does.

Okay, let me do that in a separate patch. Thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-01-29 11:54 ` [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
  2025-01-30  5:57   ` Alistair Popple
@ 2025-01-30 13:46   ` Simona Vetter
  2025-01-30 15:56     ` David Hildenbrand
  1 sibling, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 13:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:01PM +0100, David Hildenbrand wrote:
> The single "real" user in the tree of make_device_exclusive_range() always
> requests making only a single address exclusive. The current implementation
> is hard to fix for properly supporting anonymous THP / large folios and
> for avoiding messing with rmap walks in weird ways.
> 
> So let's always process a single address/page and return folio + page to
> minimize page -> folio lookups. This is a preparation for further
> changes.
> 
> Reject any non-anonymous or hugetlb folios early, directly after GUP.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Yeah this makes sense. Even for pmd entries I think we want to make this
very explicit with an explicit hugetlb opt-in I think.

Acked-by: Simona Vetter <simona.vetter@ffwll.ch>


> ---
>  Documentation/mm/hmm.rst                    |  2 +-
>  Documentation/translations/zh_CN/mm/hmm.rst |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.c       |  5 +-
>  include/linux/mmu_notifier.h                |  2 +-
>  include/linux/rmap.h                        |  5 +-
>  lib/test_hmm.c                              | 45 +++++------
>  mm/rmap.c                                   | 90 +++++++++++----------
>  7 files changed, 75 insertions(+), 76 deletions(-)
> 
> diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
> index f6d53c37a2ca..7d61b7a8b65b 100644
> --- a/Documentation/mm/hmm.rst
> +++ b/Documentation/mm/hmm.rst
> @@ -400,7 +400,7 @@ Exclusive access memory
>  Some devices have features such as atomic PTE bits that can be used to implement
>  atomic access to system memory. To support atomic operations to a shared virtual
>  memory page such a device needs access to that page which is exclusive of any
> -userspace access from the CPU. The ``make_device_exclusive_range()`` function
> +userspace access from the CPU. The ``make_device_exclusive()`` function
>  can be used to make a memory range inaccessible from userspace.
>  
>  This replaces all mappings for pages in the given range with special swap
> diff --git a/Documentation/translations/zh_CN/mm/hmm.rst b/Documentation/translations/zh_CN/mm/hmm.rst
> index 0669f947d0bc..22c210f4e94f 100644
> --- a/Documentation/translations/zh_CN/mm/hmm.rst
> +++ b/Documentation/translations/zh_CN/mm/hmm.rst
> @@ -326,7 +326,7 @@ devm_memunmap_pages() 和 devm_release_mem_region() 当资源可以绑定到 ``s
>  
>  一些设备具有诸如原子PTE位的功能，可以用来实现对系统内存的原子访问。为了支持对一
>  个共享的虚拟内存页的原子操作，这样的设备需要对该页的访问是排他的，而不是来自CPU
> -的任何用户空间访问。  ``make_device_exclusive_range()`` 函数可以用来使一
> +的任何用户空间访问。  ``make_device_exclusive()`` 函数可以用来使一
>  个内存范围不能从用户空间访问。
>  
>  这将用特殊的交换条目替换给定范围内的所有页的映射。任何试图访问交换条目的行为都会
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index b4da82ddbb6b..39e3740980bb 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -609,10 +609,9 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
>  
>  		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
>  		mmap_read_lock(mm);
> -		ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
> -					    &page, drm->dev);
> +		page = make_device_exclusive(mm, start, drm->dev, &folio);
>  		mmap_read_unlock(mm);
> -		if (ret <= 0 || !page) {
> +		if (IS_ERR(page)) {
>  			ret = -EINVAL;
>  			goto out;
>  		}
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index e2dd57ca368b..d4e714661826 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -46,7 +46,7 @@ struct mmu_interval_notifier;
>   * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
>   * longer have exclusive access to the page. When sent during creation of an
>   * exclusive range the owner will be initialised to the value provided by the
> - * caller of make_device_exclusive_range(), otherwise the owner will be NULL.
> + * caller of make_device_exclusive(), otherwise the owner will be NULL.
>   */
>  enum mmu_notifier_event {
>  	MMU_NOTIFY_UNMAP = 0,
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 683a04088f3f..86425d42c1a9 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -663,9 +663,8 @@ int folio_referenced(struct folio *, int is_locked,
>  void try_to_migrate(struct folio *folio, enum ttu_flags flags);
>  void try_to_unmap(struct folio *, enum ttu_flags flags);
>  
> -int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> -				unsigned long end, struct page **pages,
> -				void *arg);
> +struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> +		void *owner, struct folio **foliop);
>  
>  /* Avoid racy checks */
>  #define PVMW_SYNC		(1 << 0)
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 056f2e411d7b..9e1b07a227a3 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -780,10 +780,8 @@ static int dmirror_exclusive(struct dmirror *dmirror,
>  	unsigned long start, end, addr;
>  	unsigned long size = cmd->npages << PAGE_SHIFT;
>  	struct mm_struct *mm = dmirror->notifier.mm;
> -	struct page *pages[64];
>  	struct dmirror_bounce bounce;
> -	unsigned long next;
> -	int ret;
> +	int ret = 0;
>  
>  	start = cmd->addr;
>  	end = start + size;
> @@ -795,36 +793,31 @@ static int dmirror_exclusive(struct dmirror *dmirror,
>  		return -EINVAL;
>  
>  	mmap_read_lock(mm);
> -	for (addr = start; addr < end; addr = next) {
> -		unsigned long mapped = 0;
> -		int i;
> -
> -		next = min(end, addr + (ARRAY_SIZE(pages) << PAGE_SHIFT));
> +	for (addr = start; addr < end; addr += PAGE_SIZE) {
> +		struct folio *folio;
> +		struct page *page;
>  
> -		ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
> -		/*
> -		 * Do dmirror_atomic_map() iff all pages are marked for
> -		 * exclusive access to avoid accessing uninitialized
> -		 * fields of pages.
> -		 */
> -		if (ret == (next - addr) >> PAGE_SHIFT)
> -			mapped = dmirror_atomic_map(addr, next, pages, dmirror);
> -		for (i = 0; i < ret; i++) {
> -			if (pages[i]) {
> -				unlock_page(pages[i]);
> -				put_page(pages[i]);
> -			}
> +		page = make_device_exclusive(mm, addr, &folio, NULL);
> +		if (IS_ERR(page)) {
> +			ret = PTR_ERR(page);
> +			break;
>  		}
>  
> -		if (addr + (mapped << PAGE_SHIFT) < next) {
> -			mmap_read_unlock(mm);
> -			mmput(mm);
> -			return -EBUSY;
> -		}
> +		ret = dmirror_atomic_map(addr, addr + PAGE_SIZE, &page, dmirror);
> +		if (!ret)
> +			ret = -EBUSY;
> +		folio_unlock(folio);
> +		folio_put(folio);
> +
> +		if (ret)
> +			break;
>  	}
>  	mmap_read_unlock(mm);
>  	mmput(mm);
>  
> +	if (ret)
> +		return -EBUSY;
> +
>  	/* Return the migrated data for verification. */
>  	ret = dmirror_bounce_init(&bounce, start, size);
>  	if (ret)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 17fbfa61f7ef..676df4fba5b0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2495,70 +2495,78 @@ static bool folio_make_device_exclusive(struct folio *folio,
>  		.arg = &args,
>  	};
>  
> -	/*
> -	 * Restrict to anonymous folios for now to avoid potential writeback
> -	 * issues.
> -	 */
> -	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
> -		return false;
> -
>  	rmap_walk(folio, &rwc);
>  
>  	return args.valid && !folio_mapcount(folio);
>  }
>  
>  /**
> - * make_device_exclusive_range() - Mark a range for exclusive use by a device
> + * make_device_exclusive() - Mark an address for exclusive use by a device
>   * @mm: mm_struct of associated target process
> - * @start: start of the region to mark for exclusive device access
> - * @end: end address of region
> - * @pages: returns the pages which were successfully marked for exclusive access
> + * @addr: the virtual address to mark for exclusive device access
>   * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
> + * @foliop: folio pointer will be stored here on success.
> + *
> + * This function looks up the page mapped at the given address, grabs a
> + * folio reference, locks the folio and replaces the PTE with special
> + * device-exclusive non-swap entry, preventing userspace CPU access. The
> + * function will return with the folio locked and referenced.
>   *
> - * Returns: number of pages found in the range by GUP. A page is marked for
> - * exclusive access only if the page pointer is non-NULL.
> + * On fault these special device-exclusive entries are replaced with the
> + * original PTE under folio lock, after calling MMU notifiers.
>   *
> - * This function finds ptes mapping page(s) to the given address range, locks
> - * them and replaces mappings with special swap entries preventing userspace CPU
> - * access. On fault these entries are replaced with the original mapping after
> - * calling MMU notifiers.
> + * Only anonymous non-hugetlb folios are supported and the VMA must have
> + * write permissions such that we can fault in the anonymous page writable
> + * in order to mark it exclusive. The caller must hold the mmap_lock in read
> + * mode.
>   *
>   * A driver using this to program access from a device must use a mmu notifier
>   * critical section to hold a device specific lock during programming. Once
>   * programming is complete it should drop the page lock and reference after
>   * which point CPU access to the page will revoke the exclusive access.
> + *
> + * Returns: pointer to mapped page on success, otherwise a negative error.
>   */
> -int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> -				unsigned long end, struct page **pages,
> -				void *owner)
> +struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> +		void *owner, struct folio **foliop)
>  {
> -	long npages = (end - start) >> PAGE_SHIFT;
> -	long i;
> +	struct folio *folio;
> +	struct page *page;
> +	long npages;
> +
> +	mmap_assert_locked(mm);
>  
> -	npages = get_user_pages_remote(mm, start, npages,
> +	/*
> +	 * Fault in the page writable and try to lock it; note that if the
> +	 * address would already be marked for exclusive use by the device,
> +	 * the GUP call would undo that first by triggering a fault.
> +	 */
> +	npages = get_user_pages_remote(mm, addr, 1,
>  				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> -				       pages, NULL);
> -	if (npages < 0)
> -		return npages;
> -
> -	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> -		struct folio *folio = page_folio(pages[i]);
> -		if (PageTail(pages[i]) || !folio_trylock(folio)) {
> -			folio_put(folio);
> -			pages[i] = NULL;
> -			continue;
> -		}
> +				       &page, NULL);
> +	if (npages != 1)
> +		return ERR_PTR(npages);
> +	folio = page_folio(page);
>  
> -		if (!folio_make_device_exclusive(folio, mm, start, owner)) {
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			pages[i] = NULL;
> -		}
> +	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
> +		folio_put(folio);
> +		return ERR_PTR(-EOPNOTSUPP);
> +	}
> +
> +	if (!folio_trylock(folio)) {
> +		folio_put(folio);
> +		return ERR_PTR(-EBUSY);
>  	}
>  
> -	return npages;
> +	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		return ERR_PTR(-EBUSY);
> +	}
> +	*foliop = folio;
> +	return page;
>  }
> -EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> +EXPORT_SYMBOL_GPL(make_device_exclusive);
>  #endif
>  
>  void __put_anon_vma(struct anon_vma *anon_vma)
> -- 
> 2.48.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()
  2025-01-30 13:46   ` Simona Vetter
@ 2025-01-30 15:56     ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30 15:56 UTC (permalink / raw)
  To: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On 30.01.25 14:46, Simona Vetter wrote:
> On Wed, Jan 29, 2025 at 12:54:01PM +0100, David Hildenbrand wrote:
>> The single "real" user in the tree of make_device_exclusive_range() always
>> requests making only a single address exclusive. The current implementation
>> is hard to fix for properly supporting anonymous THP / large folios and
>> for avoiding messing with rmap walks in weird ways.
>>
>> So let's always process a single address/page and return folio + page to
>> minimize page -> folio lookups. This is a preparation for further
>> changes.
>>
>> Reject any non-anonymous or hugetlb folios early, directly after GUP.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Yeah this makes sense. Even for pmd entries I think we want to make this
> very explicit with an explicit hugetlb opt-in I think.
> 
> Acked-by: Simona Vetter <simona.vetter@ffwll.ch>

Thanks, I'll fold in the following:

diff --git a/mm/rmap.c b/mm/rmap.c
index 676df4fba5b0..94256925682d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2525,6 +2525,10 @@ static bool folio_make_device_exclusive(struct folio *folio,
   * programming is complete it should drop the page lock and reference after
   * which point CPU access to the page will revoke the exclusive access.
   *
+ * Note: This function always operates on individual PTEs mapping individual
+ * pages. PMD-sized THPs are first remapped to be mapped by PTEs before the
+ * conversion happens on a single PTE corresponding to @addr.
+ *
   * Returns: pointer to mapped page on success, otherwise a negative error.
   */
  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-30  6:11   ` Alistair Popple
  2025-01-29 11:54 ` [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable() David Hildenbrand
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

We require a writable PTE and only support anonymous folio: we can only
have exactly one PTE pointing at that page, which we can just lookup
using a folio walk, avoiding the rmap walk and the anon VMA lock.

So let's stop doing an rmap walk and perform a folio walk instead, so we
can easily just modify a single PTE and avoid relying on rmap/mapcounts.

We now effectively work on a single PTE instead of multiple PTEs of
a large folio, allowing for conversion of individual PTEs from
non-exclusive to device-exclusive -- note that the other way always
worked on single PTEs.

We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
that is not required: GUP will already take care of the
MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
entry) when not finding a present PTE and having to trigger a fault and
ending up in remove_device_exclusive_entry(). Note that the PTE is
always writable, and we can always create a writable-device-exclusive
entry.

With this change, device-exclusive is fully compatible with THPs /
large folios. We still require PMD-sized THPs to get PTE-mapped, and
supporting PMD-mapped THP (without the PTE-remapping) is a different
endeavour that might not be worth it at this point.

This gets rid of the "folio_mapcount()" usage and let's us fix ordinary
rmap walks (migration/swapout) next. Spell out that messing with the
mapcount is wrong and must be fixed.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 188 ++++++++++++++++--------------------------------------
 1 file changed, 55 insertions(+), 133 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 676df4fba5b0..49ffac6d27f8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2375,131 +2375,6 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 }
 
 #ifdef CONFIG_DEVICE_PRIVATE
-struct make_exclusive_args {
-	struct mm_struct *mm;
-	unsigned long address;
-	void *owner;
-	bool valid;
-};
-
-static bool page_make_device_exclusive_one(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long address, void *priv)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
-	struct make_exclusive_args *args = priv;
-	pte_t pteval;
-	struct page *subpage;
-	bool ret = true;
-	struct mmu_notifier_range range;
-	swp_entry_t entry;
-	pte_t swp_pte;
-	pte_t ptent;
-
-	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
-				      vma->vm_mm, address, min(vma->vm_end,
-				      address + folio_size(folio)),
-				      args->owner);
-	mmu_notifier_invalidate_range_start(&range);
-
-	while (page_vma_mapped_walk(&pvmw)) {
-		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
-
-		ptent = ptep_get(pvmw.pte);
-		if (!pte_present(ptent)) {
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			break;
-		}
-
-		subpage = folio_page(folio,
-				pte_pfn(ptent) - folio_pfn(folio));
-		address = pvmw.address;
-
-		/* Nuke the page table entry. */
-		flush_cache_page(vma, address, pte_pfn(ptent));
-		pteval = ptep_clear_flush(vma, address, pvmw.pte);
-
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
-
-		/*
-		 * Check that our target page is still mapped at the expected
-		 * address.
-		 */
-		if (args->mm == mm && args->address == address &&
-		    pte_write(pteval))
-			args->valid = true;
-
-		/*
-		 * Store the pfn of the page in a special migration
-		 * pte. do_swap_page() will wait until the migration
-		 * pte is removed and then restart fault handling.
-		 */
-		if (pte_write(pteval))
-			entry = make_writable_device_exclusive_entry(
-							page_to_pfn(subpage));
-		else
-			entry = make_readable_device_exclusive_entry(
-							page_to_pfn(subpage));
-		swp_pte = swp_entry_to_pte(entry);
-		if (pte_soft_dirty(pteval))
-			swp_pte = pte_swp_mksoft_dirty(swp_pte);
-		if (pte_uffd_wp(pteval))
-			swp_pte = pte_swp_mkuffd_wp(swp_pte);
-
-		set_pte_at(mm, address, pvmw.pte, swp_pte);
-
-		/*
-		 * There is a reference on the page for the swap entry which has
-		 * been removed, so shouldn't take another.
-		 */
-		folio_remove_rmap_pte(folio, subpage, vma);
-	}
-
-	mmu_notifier_invalidate_range_end(&range);
-
-	return ret;
-}
-
-/**
- * folio_make_device_exclusive - Mark the folio exclusively owned by a device.
- * @folio: The folio to replace page table entries for.
- * @mm: The mm_struct where the folio is expected to be mapped.
- * @address: Address where the folio is expected to be mapped.
- * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
- *
- * Tries to remove all the page table entries which are mapping this
- * folio and replace them with special device exclusive swap entries to
- * grant a device exclusive access to the folio.
- *
- * Context: Caller must hold the folio lock.
- * Return: false if the page is still mapped, or if it could not be unmapped
- * from the expected address. Otherwise returns true (success).
- */
-static bool folio_make_device_exclusive(struct folio *folio,
-		struct mm_struct *mm, unsigned long address, void *owner)
-{
-	struct make_exclusive_args args = {
-		.mm = mm,
-		.address = address,
-		.owner = owner,
-		.valid = false,
-	};
-	struct rmap_walk_control rwc = {
-		.rmap_one = page_make_device_exclusive_one,
-		.done = folio_not_mapped,
-		.anon_lock = folio_lock_anon_vma_read,
-		.arg = &args,
-	};
-
-	rmap_walk(folio, &rwc);
-
-	return args.valid && !folio_mapcount(folio);
-}
-
 /**
  * make_device_exclusive() - Mark an address for exclusive use by a device
  * @mm: mm_struct of associated target process
@@ -2530,9 +2405,12 @@ static bool folio_make_device_exclusive(struct folio *folio,
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		void *owner, struct folio **foliop)
 {
-	struct folio *folio;
+	struct folio *folio, *fw_folio;
+	struct vm_area_struct *vma;
+	struct folio_walk fw;
 	struct page *page;
-	long npages;
+	swp_entry_t entry;
+	pte_t swp_pte;
 
 	mmap_assert_locked(mm);
 
@@ -2540,12 +2418,16 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	 * Fault in the page writable and try to lock it; note that if the
 	 * address would already be marked for exclusive use by the device,
 	 * the GUP call would undo that first by triggering a fault.
+	 *
+	 * If any other device would already map this page exclusively, the
+	 * fault will trigger a conversion to an ordinary
+	 * (non-device-exclusive) PTE and issue a MMU_NOTIFY_EXCLUSIVE.
 	 */
-	npages = get_user_pages_remote(mm, addr, 1,
-				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
-				       &page, NULL);
-	if (npages != 1)
-		return ERR_PTR(npages);
+	page = get_user_page_vma_remote(mm, addr,
+					FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+					&vma);
+	if (IS_ERR(page))
+		return page;
 	folio = page_folio(page);
 
 	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
@@ -2558,11 +2440,51 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
+	/*
+	 * Let's do a second walk and make sure we still find the same page
+	 * mapped writable. If we don't find what we expect, we will trigger
+	 * GUP again to fix it up. Note that a page of an anonymous folio can
+	 * only be mapped writable using exactly one page table mapping
+	 * ("exclusive"), so there cannot be other mappings.
+	 */
+	fw_folio = folio_walk_start(&fw, vma, addr, 0);
+	if (fw_folio != folio || fw.page != page ||
+	    fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
+		if (fw_folio)
+			folio_walk_end(&fw, vma);
 		folio_unlock(folio);
 		folio_put(folio);
 		return ERR_PTR(-EBUSY);
 	}
+
+	/* Nuke the page table entry so we get the uptodate dirty bit. */
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	fw.pte = ptep_clear_flush(vma, addr, fw.ptep);
+
+	/* Set the dirty flag on the folio now the pte is gone. */
+	if (pte_dirty(fw.pte))
+		folio_mark_dirty(folio);
+
+	/*
+	 * Store the pfn of the page in a special device-exclusive non-swap pte.
+	 * do_swap_page() will trigger the conversion back while holding the
+	 * folio lock.
+	 */
+	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
+	swp_pte = swp_entry_to_pte(entry);
+	if (pte_soft_dirty(fw.pte))
+		swp_pte = pte_swp_mksoft_dirty(swp_pte);
+	/* The pte is writable, uffd-wp does not apply. */
+	set_pte_at(mm, addr, fw.ptep, swp_pte);
+
+	/*
+	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
+	 * does not count as a mapping (mapcount), which is wrong and must be
+	 * fixed, otherwise RMAP walks don't behave as expected.
+	 */
+	folio_remove_rmap_pte(folio, page, vma);
+
+	folio_walk_end(&fw, vma);
 	*foliop = folio;
 	return page;
 }
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-29 11:54 ` [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk David Hildenbrand
@ 2025-01-30  6:11   ` Alistair Popple
  2025-01-30  9:01     ` David Hildenbrand
  2025-01-30  9:40     ` Simona Vetter
  0 siblings, 2 replies; 56+ messages in thread
From: Alistair Popple @ 2025-01-30  6:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
> We require a writable PTE and only support anonymous folio: we can only
> have exactly one PTE pointing at that page, which we can just lookup
> using a folio walk, avoiding the rmap walk and the anon VMA lock.
> 
> So let's stop doing an rmap walk and perform a folio walk instead, so we
> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
> 
> We now effectively work on a single PTE instead of multiple PTEs of
> a large folio, allowing for conversion of individual PTEs from
> non-exclusive to device-exclusive -- note that the other way always
> worked on single PTEs.
> 
> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
> that is not required: GUP will already take care of the
> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
> entry) when not finding a present PTE and having to trigger a fault and
> ending up in remove_device_exclusive_entry().

I will have to look at this a bit more closely tomorrow but this doesn't seem
right to me. We may be transitioning from a present PTE (ie. a writable
anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
update their copies of the PTE. So I think the notifier call is needed.

> Note that the PTE is
> always writable, and we can always create a writable-device-exclusive
> entry.
> 
> With this change, device-exclusive is fully compatible with THPs /
> large folios. We still require PMD-sized THPs to get PTE-mapped, and
> supporting PMD-mapped THP (without the PTE-remapping) is a different
> endeavour that might not be worth it at this point.
> 
> This gets rid of the "folio_mapcount()" usage and let's us fix ordinary
> rmap walks (migration/swapout) next. Spell out that messing with the
> mapcount is wrong and must be fixed.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/rmap.c | 188 ++++++++++++++++--------------------------------------
>  1 file changed, 55 insertions(+), 133 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 676df4fba5b0..49ffac6d27f8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2375,131 +2375,6 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>  }
>  
>  #ifdef CONFIG_DEVICE_PRIVATE
> -struct make_exclusive_args {
> -	struct mm_struct *mm;
> -	unsigned long address;
> -	void *owner;
> -	bool valid;
> -};
> -
> -static bool page_make_device_exclusive_one(struct folio *folio,
> -		struct vm_area_struct *vma, unsigned long address, void *priv)
> -{
> -	struct mm_struct *mm = vma->vm_mm;
> -	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> -	struct make_exclusive_args *args = priv;
> -	pte_t pteval;
> -	struct page *subpage;
> -	bool ret = true;
> -	struct mmu_notifier_range range;
> -	swp_entry_t entry;
> -	pte_t swp_pte;
> -	pte_t ptent;
> -
> -	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
> -				      vma->vm_mm, address, min(vma->vm_end,
> -				      address + folio_size(folio)),
> -				      args->owner);
> -	mmu_notifier_invalidate_range_start(&range);
> -
> -	while (page_vma_mapped_walk(&pvmw)) {
> -		/* Unexpected PMD-mapped THP? */
> -		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> -
> -		ptent = ptep_get(pvmw.pte);
> -		if (!pte_present(ptent)) {
> -			ret = false;
> -			page_vma_mapped_walk_done(&pvmw);
> -			break;
> -		}
> -
> -		subpage = folio_page(folio,
> -				pte_pfn(ptent) - folio_pfn(folio));
> -		address = pvmw.address;
> -
> -		/* Nuke the page table entry. */
> -		flush_cache_page(vma, address, pte_pfn(ptent));
> -		pteval = ptep_clear_flush(vma, address, pvmw.pte);
> -
> -		/* Set the dirty flag on the folio now the pte is gone. */
> -		if (pte_dirty(pteval))
> -			folio_mark_dirty(folio);
> -
> -		/*
> -		 * Check that our target page is still mapped at the expected
> -		 * address.
> -		 */
> -		if (args->mm == mm && args->address == address &&
> -		    pte_write(pteval))
> -			args->valid = true;
> -
> -		/*
> -		 * Store the pfn of the page in a special migration
> -		 * pte. do_swap_page() will wait until the migration
> -		 * pte is removed and then restart fault handling.
> -		 */
> -		if (pte_write(pteval))
> -			entry = make_writable_device_exclusive_entry(
> -							page_to_pfn(subpage));
> -		else
> -			entry = make_readable_device_exclusive_entry(
> -							page_to_pfn(subpage));
> -		swp_pte = swp_entry_to_pte(entry);
> -		if (pte_soft_dirty(pteval))
> -			swp_pte = pte_swp_mksoft_dirty(swp_pte);
> -		if (pte_uffd_wp(pteval))
> -			swp_pte = pte_swp_mkuffd_wp(swp_pte);
> -
> -		set_pte_at(mm, address, pvmw.pte, swp_pte);
> -
> -		/*
> -		 * There is a reference on the page for the swap entry which has
> -		 * been removed, so shouldn't take another.
> -		 */
> -		folio_remove_rmap_pte(folio, subpage, vma);
> -	}
> -
> -	mmu_notifier_invalidate_range_end(&range);
> -
> -	return ret;
> -}
> -
> -/**
> - * folio_make_device_exclusive - Mark the folio exclusively owned by a device.
> - * @folio: The folio to replace page table entries for.
> - * @mm: The mm_struct where the folio is expected to be mapped.
> - * @address: Address where the folio is expected to be mapped.
> - * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> - *
> - * Tries to remove all the page table entries which are mapping this
> - * folio and replace them with special device exclusive swap entries to
> - * grant a device exclusive access to the folio.
> - *
> - * Context: Caller must hold the folio lock.
> - * Return: false if the page is still mapped, or if it could not be unmapped
> - * from the expected address. Otherwise returns true (success).
> - */
> -static bool folio_make_device_exclusive(struct folio *folio,
> -		struct mm_struct *mm, unsigned long address, void *owner)
> -{
> -	struct make_exclusive_args args = {
> -		.mm = mm,
> -		.address = address,
> -		.owner = owner,
> -		.valid = false,
> -	};
> -	struct rmap_walk_control rwc = {
> -		.rmap_one = page_make_device_exclusive_one,
> -		.done = folio_not_mapped,
> -		.anon_lock = folio_lock_anon_vma_read,
> -		.arg = &args,
> -	};
> -
> -	rmap_walk(folio, &rwc);
> -
> -	return args.valid && !folio_mapcount(folio);
> -}
> -
>  /**
>   * make_device_exclusive() - Mark an address for exclusive use by a device
>   * @mm: mm_struct of associated target process
> @@ -2530,9 +2405,12 @@ static bool folio_make_device_exclusive(struct folio *folio,
>  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  		void *owner, struct folio **foliop)
>  {
> -	struct folio *folio;
> +	struct folio *folio, *fw_folio;
> +	struct vm_area_struct *vma;
> +	struct folio_walk fw;
>  	struct page *page;
> -	long npages;
> +	swp_entry_t entry;
> +	pte_t swp_pte;
>  
>  	mmap_assert_locked(mm);
>  
> @@ -2540,12 +2418,16 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  	 * Fault in the page writable and try to lock it; note that if the
>  	 * address would already be marked for exclusive use by the device,
>  	 * the GUP call would undo that first by triggering a fault.
> +	 *
> +	 * If any other device would already map this page exclusively, the
> +	 * fault will trigger a conversion to an ordinary
> +	 * (non-device-exclusive) PTE and issue a MMU_NOTIFY_EXCLUSIVE.
>  	 */
> -	npages = get_user_pages_remote(mm, addr, 1,
> -				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> -				       &page, NULL);
> -	if (npages != 1)
> -		return ERR_PTR(npages);
> +	page = get_user_page_vma_remote(mm, addr,
> +					FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +					&vma);
> +	if (IS_ERR(page))
> +		return page;
>  	folio = page_folio(page);
>  
>  	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
> @@ -2558,11 +2440,51 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  		return ERR_PTR(-EBUSY);
>  	}
>  
> -	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
> +	/*
> +	 * Let's do a second walk and make sure we still find the same page
> +	 * mapped writable. If we don't find what we expect, we will trigger
> +	 * GUP again to fix it up. Note that a page of an anonymous folio can
> +	 * only be mapped writable using exactly one page table mapping
> +	 * ("exclusive"), so there cannot be other mappings.
> +	 */
> +	fw_folio = folio_walk_start(&fw, vma, addr, 0);
> +	if (fw_folio != folio || fw.page != page ||
> +	    fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
> +		if (fw_folio)
> +			folio_walk_end(&fw, vma);
>  		folio_unlock(folio);
>  		folio_put(folio);
>  		return ERR_PTR(-EBUSY);
>  	}
> +
> +	/* Nuke the page table entry so we get the uptodate dirty bit. */
> +	flush_cache_page(vma, addr, page_to_pfn(page));
> +	fw.pte = ptep_clear_flush(vma, addr, fw.ptep);
> +
> +	/* Set the dirty flag on the folio now the pte is gone. */
> +	if (pte_dirty(fw.pte))
> +		folio_mark_dirty(folio);
> +
> +	/*
> +	 * Store the pfn of the page in a special device-exclusive non-swap pte.
> +	 * do_swap_page() will trigger the conversion back while holding the
> +	 * folio lock.
> +	 */
> +	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
> +	swp_pte = swp_entry_to_pte(entry);
> +	if (pte_soft_dirty(fw.pte))
> +		swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +	/* The pte is writable, uffd-wp does not apply. */
> +	set_pte_at(mm, addr, fw.ptep, swp_pte);
> +
> +	/*
> +	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
> +	 * does not count as a mapping (mapcount), which is wrong and must be
> +	 * fixed, otherwise RMAP walks don't behave as expected.
> +	 */
> +	folio_remove_rmap_pte(folio, page, vma);
> +
> +	folio_walk_end(&fw, vma);
>  	*foliop = folio;
>  	return page;
>  }
> -- 
> 2.48.1
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30  6:11   ` Alistair Popple
@ 2025-01-30  9:01     ` David Hildenbrand
  2025-01-30  9:12       ` David Hildenbrand
  2025-01-30  9:24       ` David Hildenbrand
  2025-01-30  9:40     ` Simona Vetter
  1 sibling, 2 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30  9:01 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 07:11, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>> We require a writable PTE and only support anonymous folio: we can only
>> have exactly one PTE pointing at that page, which we can just lookup
>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>
>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>
>> We now effectively work on a single PTE instead of multiple PTEs of
>> a large folio, allowing for conversion of individual PTEs from
>> non-exclusive to device-exclusive -- note that the other way always
>> worked on single PTEs.
>>
>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>> that is not required: GUP will already take care of the
>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>> entry) when not finding a present PTE and having to trigger a fault and
>> ending up in remove_device_exclusive_entry().
> 
> I will have to look at this a bit more closely tomorrow but this doesn't seem
> right to me. We may be transitioning from a present PTE (ie. a writable
> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> update their copies of the PTE. So I think the notifier call is needed.

Then it is all very confusing:

"MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no 
longer have exclusive access to the page."

That's simply not true in the scenario you describe, because nobody had 
exclusive access.

But what you are saying is, that we need to inform others (e.g., KVM) 
that we are converting it to a device-exclusive entry, such that they 
stop accessing it.

That makes sense to me (and the cleanup patch in the cleanup series 
would have to go as well to prevent the livelock).

So we would have to update the documentation of MMU_NOTIFY_EXCLUSIVE 
that it is also trigger on conversion from non-exclusive to exclusive.

Does that make sense?

Thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30  9:01     ` David Hildenbrand
@ 2025-01-30  9:12       ` David Hildenbrand
  2025-01-30  9:24       ` David Hildenbrand
  1 sibling, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30  9:12 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 10:01, David Hildenbrand wrote:
> On 30.01.25 07:11, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folio: we can only
>>> have exactly one PTE pointing at that page, which we can just lookup
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
> 
> Then it is all very confusing:
> 
> "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> longer have exclusive access to the page."
> 
> That's simply not true in the scenario you describe, because nobody had
> exclusive access.
> 
> But what you are saying is, that we need to inform others (e.g., KVM)
> that we are converting it to a device-exclusive entry, such that they
> stop accessing it.
> 
> That makes sense to me (and the cleanup patch in the cleanup series
> would have to go as well to prevent the livelock).
> 
> So we would have to update the documentation of MMU_NOTIFY_EXCLUSIVE
> that it is also trigger on conversion from non-exclusive to exclusive.
> 
> Does that make sense?

Something like that on top:

diff --git a/mm/rmap.c b/mm/rmap.c
index 49ffac6d27f8..fd6dfe67ce7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2405,6 +2405,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
                 void *owner, struct folio **foliop)
  {
+       struct mmu_notifier_range range;
         struct folio *folio, *fw_folio;
         struct vm_area_struct *vma;
         struct folio_walk fw;
@@ -2413,6 +2414,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
         pte_t swp_pte;
  
         mmap_assert_locked(mm);
+       addr = PAGE_ALIGN_DOWN(addr);
  
         /*
          * Fault in the page writable and try to lock it; note that if the
@@ -2440,6 +2442,14 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
                 return ERR_PTR(-EBUSY);
         }
  
+       /*
+        * Inform secondary MMUs that we are going to convert this PTE to
+        * device-exclusive, such that they unmap it now.
+        */
+       mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
+                                     mm, addr, addr + PAGE_SIZE, owner);
+       mmu_notifier_invalidate_range_start(&range);
+
         /*
          * Let's do a second walk and make sure we still find the same page
          * mapped writable. If we don't find what we expect, we will trigger
@@ -2452,6 +2462,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
             fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
                 if (fw_folio)
                         folio_walk_end(&fw, vma);
+               mmu_notifier_invalidate_range_end(&range);
                 folio_unlock(folio);
                 folio_put(folio);
                 return ERR_PTR(-EBUSY);
@@ -2485,6 +2496,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
         folio_remove_rmap_pte(folio, page, vma);
  
         folio_walk_end(&fw, vma);
+       mmu_notifier_invalidate_range_end(&range);
         *foliop = folio;
         return page;
  }


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30  9:01     ` David Hildenbrand
  2025-01-30  9:12       ` David Hildenbrand
@ 2025-01-30  9:24       ` David Hildenbrand
  2025-01-30 22:31         ` Alistair Popple
  1 sibling, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30  9:24 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 10:01, David Hildenbrand wrote:
> On 30.01.25 07:11, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folio: we can only
>>> have exactly one PTE pointing at that page, which we can just lookup
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
> 
> Then it is all very confusing:
> 
> "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> longer have exclusive access to the page."

So the second sentence actually describes the other condition. Likely we
should make that clearer:

--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -43,10 +43,11 @@ struct mmu_interval_notifier;
   * a device driver to possibly ignore the invalidation if the
   * owner field matches the driver's device private pgmap owner.
   *
- * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
- * longer have exclusive access to the page. When sent during creation of an
- * exclusive range the owner will be initialised to the value provided by the
- * caller of make_device_exclusive(), otherwise the owner will be NULL.
+ * @MMU_NOTIFY_EXCLUSIVE: (1) to signal a device driver that the device will no
+ * longer have exclusive access to the page; and (2) to signal that a page will
+ * be made exclusive to a device. During (1), the owner will be NULL, during
+ * (2), the owner will be initialised to the value provided by the caller of
+ * make_device_exclusive().
   */
  enum mmu_notifier_event {
         MMU_NOTIFY_UNMAP = 0,


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30  9:24       ` David Hildenbrand
@ 2025-01-30 22:31         ` Alistair Popple
  2025-02-04 10:56           ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2025-01-30 22:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 10:24:37AM +0100, David Hildenbrand wrote:
> On 30.01.25 10:01, David Hildenbrand wrote:
> > On 30.01.25 07:11, Alistair Popple wrote:
> > > On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
> > > > We require a writable PTE and only support anonymous folio: we can only
> > > > have exactly one PTE pointing at that page, which we can just lookup
> > > > using a folio walk, avoiding the rmap walk and the anon VMA lock.
> > > > 
> > > > So let's stop doing an rmap walk and perform a folio walk instead, so we
> > > > can easily just modify a single PTE and avoid relying on rmap/mapcounts.
> > > > 
> > > > We now effectively work on a single PTE instead of multiple PTEs of
> > > > a large folio, allowing for conversion of individual PTEs from
> > > > non-exclusive to device-exclusive -- note that the other way always
> > > > worked on single PTEs.
> > > > 
> > > > We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
> > > > that is not required: GUP will already take care of the
> > > > MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
> > > > entry) when not finding a present PTE and having to trigger a fault and
> > > > ending up in remove_device_exclusive_entry().
> > > 
> > > I will have to look at this a bit more closely tomorrow but this doesn't seem
> > > right to me. We may be transitioning from a present PTE (ie. a writable
> > > anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> > > therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> > > update their copies of the PTE. So I think the notifier call is needed.
> > 
> > Then it is all very confusing:

Can't argue with that in hindsight :-)

> > "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> > longer have exclusive access to the page."
> 
> So the second sentence actually describes the other condition. Likely we
> should make that clearer:
>
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -43,10 +43,11 @@ struct mmu_interval_notifier;
>   * a device driver to possibly ignore the invalidation if the
>   * owner field matches the driver's device private pgmap owner.
>   *
> - * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> - * longer have exclusive access to the page. When sent during creation of an
> - * exclusive range the owner will be initialised to the value provided by the
> - * caller of make_device_exclusive(), otherwise the owner will be NULL.
> + * @MMU_NOTIFY_EXCLUSIVE: (1) to signal a device driver that the device will no
> + * longer have exclusive access to the page; and (2) to signal that a page will
> + * be made exclusive to a device. During (1), the owner will be NULL, during
> + * (2), the owner will be initialised to the value provided by the caller of
> + * make_device_exclusive().

Yes, I think that makes things clearer. Logically these are really two different
events though - I guess I didn't want to add another one at the time but I
wonder if we should just make them separate events rather than overloading them?

>   */
>  enum mmu_notifier_event {
>         MMU_NOTIFY_UNMAP = 0,
> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30 22:31         ` Alistair Popple
@ 2025-02-04 10:56           ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-02-04 10:56 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 23:31, Alistair Popple wrote:
> On Thu, Jan 30, 2025 at 10:24:37AM +0100, David Hildenbrand wrote:
>> On 30.01.25 10:01, David Hildenbrand wrote:
>>> On 30.01.25 07:11, Alistair Popple wrote:
>>>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>>>> We require a writable PTE and only support anonymous folio: we can only
>>>>> have exactly one PTE pointing at that page, which we can just lookup
>>>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>>>
>>>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>>>
>>>>> We now effectively work on a single PTE instead of multiple PTEs of
>>>>> a large folio, allowing for conversion of individual PTEs from
>>>>> non-exclusive to device-exclusive -- note that the other way always
>>>>> worked on single PTEs.
>>>>>
>>>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>>>> that is not required: GUP will already take care of the
>>>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>>>> entry) when not finding a present PTE and having to trigger a fault and
>>>>> ending up in remove_device_exclusive_entry().
>>>>
>>>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>>>> right to me. We may be transitioning from a present PTE (ie. a writable
>>>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>>>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>>>> update their copies of the PTE. So I think the notifier call is needed.
>>>
>>> Then it is all very confusing:
> 
> Can't argue with that in hindsight :-)
> 
>>> "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
>>> longer have exclusive access to the page."
>>
>> So the second sentence actually describes the other condition. Likely we
>> should make that clearer:
>>
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -43,10 +43,11 @@ struct mmu_interval_notifier;
>>    * a device driver to possibly ignore the invalidation if the
>>    * owner field matches the driver's device private pgmap owner.
>>    *
>> - * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
>> - * longer have exclusive access to the page. When sent during creation of an
>> - * exclusive range the owner will be initialised to the value provided by the
>> - * caller of make_device_exclusive(), otherwise the owner will be NULL.
>> + * @MMU_NOTIFY_EXCLUSIVE: (1) to signal a device driver that the device will no
>> + * longer have exclusive access to the page; and (2) to signal that a page will
>> + * be made exclusive to a device. During (1), the owner will be NULL, during
>> + * (2), the owner will be initialised to the value provided by the caller of
>> + * make_device_exclusive().
> 
> Yes, I think that makes things clearer. Logically these are really two different
> events though - I guess I didn't want to add another one at the time but I
> wonder if we should just make them separate events rather than overloading them?

I had the same thought and then I wondered: can't we simply use 
MMU_NOTIFY_CLEAR for the exclusive->ordinary path?

I mean, it's essentially a zap+flush followed by a re-insertion of the 
PFN swap entry. Similar to page migration ...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30  6:11   ` Alistair Popple
  2025-01-30  9:01     ` David Hildenbrand
@ 2025-01-30  9:40     ` Simona Vetter
  2025-01-30  9:47       ` David Hildenbrand
  1 sibling, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-30  9:40 UTC (permalink / raw)
  To: Alistair Popple
  Cc: David Hildenbrand, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 05:11:49PM +1100, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
> > We require a writable PTE and only support anonymous folio: we can only
> > have exactly one PTE pointing at that page, which we can just lookup
> > using a folio walk, avoiding the rmap walk and the anon VMA lock.
> > 
> > So let's stop doing an rmap walk and perform a folio walk instead, so we
> > can easily just modify a single PTE and avoid relying on rmap/mapcounts.
> > 
> > We now effectively work on a single PTE instead of multiple PTEs of
> > a large folio, allowing for conversion of individual PTEs from
> > non-exclusive to device-exclusive -- note that the other way always
> > worked on single PTEs.
> > 
> > We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
> > that is not required: GUP will already take care of the
> > MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
> > entry) when not finding a present PTE and having to trigger a fault and
> > ending up in remove_device_exclusive_entry().
> 
> I will have to look at this a bit more closely tomorrow but this doesn't seem
> right to me. We may be transitioning from a present PTE (ie. a writable
> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> update their copies of the PTE. So I think the notifier call is needed.

I guess this is a question of semantics we want, for multiple gpus do we
require that device-exclusive also excludes other gpus or not. I'm leaning
towards agreeing with you here.

> > Note that the PTE is
> > always writable, and we can always create a writable-device-exclusive
> > entry.
> > 
> > With this change, device-exclusive is fully compatible with THPs /
> > large folios. We still require PMD-sized THPs to get PTE-mapped, and
> > supporting PMD-mapped THP (without the PTE-remapping) is a different
> > endeavour that might not be worth it at this point.

I'm not sure we actually want hugepages for device exclusive, since it has
an impact on what's allowed and what not. If we only ever do 4k entries
then userspace can assume that as long atomics are separated by a 4k page
there's no issue when both the gpu and cpu hammer on them. If we try to
keep thp entries then suddenly a workload that worked before will result
in endless ping-pong between gpu and cpu because the separate atomic
counters (or whatever) now all sit in the same 2m page.

So going with thp might result in userspace having to spread out atomics
even more, which is just wasting memory and not saving any tlb entries
since often you don't need that many.

tldr; I think not supporting thp entries for device exclusive is a
feature, not a bug.

Cheers, Sima

> > This gets rid of the "folio_mapcount()" usage and let's us fix ordinary
> > rmap walks (migration/swapout) next. Spell out that messing with the
> > mapcount is wrong and must be fixed.
> > 
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/rmap.c | 188 ++++++++++++++++--------------------------------------
> >  1 file changed, 55 insertions(+), 133 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 676df4fba5b0..49ffac6d27f8 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -2375,131 +2375,6 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
> >  }
> >  
> >  #ifdef CONFIG_DEVICE_PRIVATE
> > -struct make_exclusive_args {
> > -	struct mm_struct *mm;
> > -	unsigned long address;
> > -	void *owner;
> > -	bool valid;
> > -};
> > -
> > -static bool page_make_device_exclusive_one(struct folio *folio,
> > -		struct vm_area_struct *vma, unsigned long address, void *priv)
> > -{
> > -	struct mm_struct *mm = vma->vm_mm;
> > -	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> > -	struct make_exclusive_args *args = priv;
> > -	pte_t pteval;
> > -	struct page *subpage;
> > -	bool ret = true;
> > -	struct mmu_notifier_range range;
> > -	swp_entry_t entry;
> > -	pte_t swp_pte;
> > -	pte_t ptent;
> > -
> > -	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
> > -				      vma->vm_mm, address, min(vma->vm_end,
> > -				      address + folio_size(folio)),
> > -				      args->owner);
> > -	mmu_notifier_invalidate_range_start(&range);
> > -
> > -	while (page_vma_mapped_walk(&pvmw)) {
> > -		/* Unexpected PMD-mapped THP? */
> > -		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> > -
> > -		ptent = ptep_get(pvmw.pte);
> > -		if (!pte_present(ptent)) {
> > -			ret = false;
> > -			page_vma_mapped_walk_done(&pvmw);
> > -			break;
> > -		}
> > -
> > -		subpage = folio_page(folio,
> > -				pte_pfn(ptent) - folio_pfn(folio));
> > -		address = pvmw.address;
> > -
> > -		/* Nuke the page table entry. */
> > -		flush_cache_page(vma, address, pte_pfn(ptent));
> > -		pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > -
> > -		/* Set the dirty flag on the folio now the pte is gone. */
> > -		if (pte_dirty(pteval))
> > -			folio_mark_dirty(folio);
> > -
> > -		/*
> > -		 * Check that our target page is still mapped at the expected
> > -		 * address.
> > -		 */
> > -		if (args->mm == mm && args->address == address &&
> > -		    pte_write(pteval))
> > -			args->valid = true;
> > -
> > -		/*
> > -		 * Store the pfn of the page in a special migration
> > -		 * pte. do_swap_page() will wait until the migration
> > -		 * pte is removed and then restart fault handling.
> > -		 */
> > -		if (pte_write(pteval))
> > -			entry = make_writable_device_exclusive_entry(
> > -							page_to_pfn(subpage));
> > -		else
> > -			entry = make_readable_device_exclusive_entry(
> > -							page_to_pfn(subpage));
> > -		swp_pte = swp_entry_to_pte(entry);
> > -		if (pte_soft_dirty(pteval))
> > -			swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > -		if (pte_uffd_wp(pteval))
> > -			swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > -
> > -		set_pte_at(mm, address, pvmw.pte, swp_pte);
> > -
> > -		/*
> > -		 * There is a reference on the page for the swap entry which has
> > -		 * been removed, so shouldn't take another.
> > -		 */
> > -		folio_remove_rmap_pte(folio, subpage, vma);
> > -	}
> > -
> > -	mmu_notifier_invalidate_range_end(&range);
> > -
> > -	return ret;
> > -}
> > -
> > -/**
> > - * folio_make_device_exclusive - Mark the folio exclusively owned by a device.
> > - * @folio: The folio to replace page table entries for.
> > - * @mm: The mm_struct where the folio is expected to be mapped.
> > - * @address: Address where the folio is expected to be mapped.
> > - * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> > - *
> > - * Tries to remove all the page table entries which are mapping this
> > - * folio and replace them with special device exclusive swap entries to
> > - * grant a device exclusive access to the folio.
> > - *
> > - * Context: Caller must hold the folio lock.
> > - * Return: false if the page is still mapped, or if it could not be unmapped
> > - * from the expected address. Otherwise returns true (success).
> > - */
> > -static bool folio_make_device_exclusive(struct folio *folio,
> > -		struct mm_struct *mm, unsigned long address, void *owner)
> > -{
> > -	struct make_exclusive_args args = {
> > -		.mm = mm,
> > -		.address = address,
> > -		.owner = owner,
> > -		.valid = false,
> > -	};
> > -	struct rmap_walk_control rwc = {
> > -		.rmap_one = page_make_device_exclusive_one,
> > -		.done = folio_not_mapped,
> > -		.anon_lock = folio_lock_anon_vma_read,
> > -		.arg = &args,
> > -	};
> > -
> > -	rmap_walk(folio, &rwc);
> > -
> > -	return args.valid && !folio_mapcount(folio);
> > -}
> > -
> >  /**
> >   * make_device_exclusive() - Mark an address for exclusive use by a device
> >   * @mm: mm_struct of associated target process
> > @@ -2530,9 +2405,12 @@ static bool folio_make_device_exclusive(struct folio *folio,
> >  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> >  		void *owner, struct folio **foliop)
> >  {
> > -	struct folio *folio;
> > +	struct folio *folio, *fw_folio;
> > +	struct vm_area_struct *vma;
> > +	struct folio_walk fw;
> >  	struct page *page;
> > -	long npages;
> > +	swp_entry_t entry;
> > +	pte_t swp_pte;
> >  
> >  	mmap_assert_locked(mm);
> >  
> > @@ -2540,12 +2418,16 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> >  	 * Fault in the page writable and try to lock it; note that if the
> >  	 * address would already be marked for exclusive use by the device,
> >  	 * the GUP call would undo that first by triggering a fault.
> > +	 *
> > +	 * If any other device would already map this page exclusively, the
> > +	 * fault will trigger a conversion to an ordinary
> > +	 * (non-device-exclusive) PTE and issue a MMU_NOTIFY_EXCLUSIVE.
> >  	 */
> > -	npages = get_user_pages_remote(mm, addr, 1,
> > -				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> > -				       &page, NULL);
> > -	if (npages != 1)
> > -		return ERR_PTR(npages);
> > +	page = get_user_page_vma_remote(mm, addr,
> > +					FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> > +					&vma);
> > +	if (IS_ERR(page))
> > +		return page;
> >  	folio = page_folio(page);
> >  
> >  	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
> > @@ -2558,11 +2440,51 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> >  		return ERR_PTR(-EBUSY);
> >  	}
> >  
> > -	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
> > +	/*
> > +	 * Let's do a second walk and make sure we still find the same page
> > +	 * mapped writable. If we don't find what we expect, we will trigger
> > +	 * GUP again to fix it up. Note that a page of an anonymous folio can
> > +	 * only be mapped writable using exactly one page table mapping
> > +	 * ("exclusive"), so there cannot be other mappings.
> > +	 */
> > +	fw_folio = folio_walk_start(&fw, vma, addr, 0);
> > +	if (fw_folio != folio || fw.page != page ||
> > +	    fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
> > +		if (fw_folio)
> > +			folio_walk_end(&fw, vma);
> >  		folio_unlock(folio);
> >  		folio_put(folio);
> >  		return ERR_PTR(-EBUSY);
> >  	}
> > +
> > +	/* Nuke the page table entry so we get the uptodate dirty bit. */
> > +	flush_cache_page(vma, addr, page_to_pfn(page));
> > +	fw.pte = ptep_clear_flush(vma, addr, fw.ptep);
> > +
> > +	/* Set the dirty flag on the folio now the pte is gone. */
> > +	if (pte_dirty(fw.pte))
> > +		folio_mark_dirty(folio);
> > +
> > +	/*
> > +	 * Store the pfn of the page in a special device-exclusive non-swap pte.
> > +	 * do_swap_page() will trigger the conversion back while holding the
> > +	 * folio lock.
> > +	 */
> > +	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
> > +	swp_pte = swp_entry_to_pte(entry);
> > +	if (pte_soft_dirty(fw.pte))
> > +		swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > +	/* The pte is writable, uffd-wp does not apply. */
> > +	set_pte_at(mm, addr, fw.ptep, swp_pte);
> > +
> > +	/*
> > +	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
> > +	 * does not count as a mapping (mapcount), which is wrong and must be
> > +	 * fixed, otherwise RMAP walks don't behave as expected.
> > +	 */
> > +	folio_remove_rmap_pte(folio, page, vma);
> > +
> > +	folio_walk_end(&fw, vma);
> >  	*foliop = folio;
> >  	return page;
> >  }
> > -- 
> > 2.48.1
> > 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30  9:40     ` Simona Vetter
@ 2025-01-30  9:47       ` David Hildenbrand
  2025-01-30 13:00         ` Simona Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30  9:47 UTC (permalink / raw)
  To: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 30.01.25 10:40, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 05:11:49PM +1100, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folio: we can only
>>> have exactly one PTE pointing at that page, which we can just lookup
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
> 
> I guess this is a question of semantics we want, for multiple gpus do we
> require that device-exclusive also excludes other gpus or not. I'm leaning
> towards agreeing with you here.

See my reply, it's also relevant for non-device, such as KVM. So it's 
the right thing to do.

> 
>>> Note that the PTE is
>>> always writable, and we can always create a writable-device-exclusive
>>> entry.
>>>
>>> With this change, device-exclusive is fully compatible with THPs /
>>> large folios. We still require PMD-sized THPs to get PTE-mapped, and
>>> supporting PMD-mapped THP (without the PTE-remapping) is a different
>>> endeavour that might not be worth it at this point.
> 
> I'm not sure we actually want hugepages for device exclusive, since it has
> an impact on what's allowed and what not. If we only ever do 4k entries
> then userspace can assume that as long atomics are separated by a 4k page
> there's no issue when both the gpu and cpu hammer on them. If we try to
> keep thp entries then suddenly a workload that worked before will result
> in endless ping-pong between gpu and cpu because the separate atomic
> counters (or whatever) now all sit in the same 2m page.

Agreed. And the conversion + mapping into the device gets trickier.

> 
> So going with thp might result in userspace having to spread out atomics
> even more, which is just wasting memory and not saving any tlb entries
> since often you don't need that many.
> 
> tldr; I think not supporting thp entries for device exclusive is a
> feature, not a bug.

So, you agree with my "different endeavour that might not be worth it" 
statement?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30  9:47       ` David Hildenbrand
@ 2025-01-30 13:00         ` Simona Vetter
  2025-01-30 15:59           ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 13:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 10:47:29AM +0100, David Hildenbrand wrote:
> On 30.01.25 10:40, Simona Vetter wrote:
> > On Thu, Jan 30, 2025 at 05:11:49PM +1100, Alistair Popple wrote:
> > > On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
> > > > We require a writable PTE and only support anonymous folio: we can only
> > > > have exactly one PTE pointing at that page, which we can just lookup
> > > > using a folio walk, avoiding the rmap walk and the anon VMA lock.
> > > > 
> > > > So let's stop doing an rmap walk and perform a folio walk instead, so we
> > > > can easily just modify a single PTE and avoid relying on rmap/mapcounts.
> > > > 
> > > > We now effectively work on a single PTE instead of multiple PTEs of
> > > > a large folio, allowing for conversion of individual PTEs from
> > > > non-exclusive to device-exclusive -- note that the other way always
> > > > worked on single PTEs.
> > > > 
> > > > We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
> > > > that is not required: GUP will already take care of the
> > > > MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
> > > > entry) when not finding a present PTE and having to trigger a fault and
> > > > ending up in remove_device_exclusive_entry().
> > > 
> > > I will have to look at this a bit more closely tomorrow but this doesn't seem
> > > right to me. We may be transitioning from a present PTE (ie. a writable
> > > anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> > > therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> > > update their copies of the PTE. So I think the notifier call is needed.
> > 
> > I guess this is a question of semantics we want, for multiple gpus do we
> > require that device-exclusive also excludes other gpus or not. I'm leaning
> > towards agreeing with you here.
> 
> See my reply, it's also relevant for non-device, such as KVM. So it's the
> right thing to do.

Yeah sounds good.

> > > > Note that the PTE is
> > > > always writable, and we can always create a writable-device-exclusive
> > > > entry.
> > > > 
> > > > With this change, device-exclusive is fully compatible with THPs /
> > > > large folios. We still require PMD-sized THPs to get PTE-mapped, and
> > > > supporting PMD-mapped THP (without the PTE-remapping) is a different
> > > > endeavour that might not be worth it at this point.
> > 
> > I'm not sure we actually want hugepages for device exclusive, since it has
> > an impact on what's allowed and what not. If we only ever do 4k entries
> > then userspace can assume that as long atomics are separated by a 4k page
> > there's no issue when both the gpu and cpu hammer on them. If we try to
> > keep thp entries then suddenly a workload that worked before will result
> > in endless ping-pong between gpu and cpu because the separate atomic
> > counters (or whatever) now all sit in the same 2m page.
> 
> Agreed. And the conversion + mapping into the device gets trickier.
> 
> > 
> > So going with thp might result in userspace having to spread out atomics
> > even more, which is just wasting memory and not saving any tlb entries
> > since often you don't need that many.
> > 
> > tldr; I think not supporting thp entries for device exclusive is a
> > feature, not a bug.
> 
> So, you agree with my "different endeavour that might not be worth it"
> statement?

Yes.

Well I think we should go further and clearly document that we
intentionally return split pages. Because it's part of the uapi contract
with users of all this.

And if someone needs pmd entries for performance or whatever, we need two
things:

a) userspace must mmap that memory as hugepage memory, to clearly signal
the promise that atomics are split up on hugepage sizes and not just page
size

b) we need to extend make_device_exclusive and drivers to handle the
hugetlb folio case

I think thp is simply not going to work here, it's impossible (without
potentially causing fault storms) to figure out what userspace might want.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30 13:00         ` Simona Vetter
@ 2025-01-30 15:59           ` David Hildenbrand
  2025-01-31 17:00             ` Simona Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30 15:59 UTC (permalink / raw)
  To: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

>>>>> Note that the PTE is
>>>>> always writable, and we can always create a writable-device-exclusive
>>>>> entry.
>>>>>
>>>>> With this change, device-exclusive is fully compatible with THPs /
>>>>> large folios. We still require PMD-sized THPs to get PTE-mapped, and
>>>>> supporting PMD-mapped THP (without the PTE-remapping) is a different
>>>>> endeavour that might not be worth it at this point.
>>>
>>> I'm not sure we actually want hugepages for device exclusive, since it has
>>> an impact on what's allowed and what not. If we only ever do 4k entries
>>> then userspace can assume that as long atomics are separated by a 4k page
>>> there's no issue when both the gpu and cpu hammer on them. If we try to
>>> keep thp entries then suddenly a workload that worked before will result
>>> in endless ping-pong between gpu and cpu because the separate atomic
>>> counters (or whatever) now all sit in the same 2m page.
>>
>> Agreed. And the conversion + mapping into the device gets trickier.
>>
>>>
>>> So going with thp might result in userspace having to spread out atomics
>>> even more, which is just wasting memory and not saving any tlb entries
>>> since often you don't need that many.
>>>
>>> tldr; I think not supporting thp entries for device exclusive is a
>>> feature, not a bug.
>>
>> So, you agree with my "different endeavour that might not be worth it"
>> statement?
> 
> Yes.
> 
> Well I think we should go further and clearly document that we
> intentionally return split pages. Because it's part of the uapi contract
> with users of all this.

Yes, see my reply to patch #3/

> 
> And if someone needs pmd entries for performance or whatever, we need two
> things:
> 
> a) userspace must mmap that memory as hugepage memory, to clearly signal
> the promise that atomics are split up on hugepage sizes and not just page
> size
> 
> b) we need to extend make_device_exclusive and drivers to handle the
> hugetlb folio case
> 
> I think thp is simply not going to work here, it's impossible (without
> potentially causing fault storms) to figure out what userspace might want.

Right, I added a link to this discussion in the patch.

Thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
  2025-01-30 15:59           ` David Hildenbrand
@ 2025-01-31 17:00             ` Simona Vetter
  0 siblings, 0 replies; 56+ messages in thread
From: Simona Vetter @ 2025-01-31 17:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 04:59:16PM +0100, David Hildenbrand wrote:
> > > > > > Note that the PTE is
> > > > > > always writable, and we can always create a writable-device-exclusive
> > > > > > entry.
> > > > > > 
> > > > > > With this change, device-exclusive is fully compatible with THPs /
> > > > > > large folios. We still require PMD-sized THPs to get PTE-mapped, and
> > > > > > supporting PMD-mapped THP (without the PTE-remapping) is a different
> > > > > > endeavour that might not be worth it at this point.
> > > > 
> > > > I'm not sure we actually want hugepages for device exclusive, since it has
> > > > an impact on what's allowed and what not. If we only ever do 4k entries
> > > > then userspace can assume that as long atomics are separated by a 4k page
> > > > there's no issue when both the gpu and cpu hammer on them. If we try to
> > > > keep thp entries then suddenly a workload that worked before will result
> > > > in endless ping-pong between gpu and cpu because the separate atomic
> > > > counters (or whatever) now all sit in the same 2m page.
> > > 
> > > Agreed. And the conversion + mapping into the device gets trickier.
> > > 
> > > > 
> > > > So going with thp might result in userspace having to spread out atomics
> > > > even more, which is just wasting memory and not saving any tlb entries
> > > > since often you don't need that many.
> > > > 
> > > > tldr; I think not supporting thp entries for device exclusive is a
> > > > feature, not a bug.
> > > 
> > > So, you agree with my "different endeavour that might not be worth it"
> > > statement?
> > 
> > Yes.
> > 
> > Well I think we should go further and clearly document that we
> > intentionally return split pages. Because it's part of the uapi contract
> > with users of all this.
> 
> Yes, see my reply to patch #3/

Ack.

> > And if someone needs pmd entries for performance or whatever, we need two
> > things:
> > 
> > a) userspace must mmap that memory as hugepage memory, to clearly signal
> > the promise that atomics are split up on hugepage sizes and not just page
> > size
> > 
> > b) we need to extend make_device_exclusive and drivers to handle the
> > hugetlb folio case
> > 
> > I think thp is simply not going to work here, it's impossible (without
> > potentially causing fault storms) to figure out what userspace might want.
> 
> Right, I added a link to this discussion in the patch.

Thanks, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-30  9:51   ` Simona Vetter
  2025-01-29 11:54 ` [PATCH v1 06/12] mm: use single SWP_DEVICE_EXCLUSIVE entry type David Hildenbrand
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Let's do it just like mprotect write-upgrade or during NUMA-hinting
faults on PROT_NONE PTEs: detect if the PTE can be writable by using
can_change_pte_writable().

Set the PTE only dirty if the folio is dirty: we might not
necessarily have a write access, and setting the PTE writable doesn't
require setting the PTE dirty.

With this change in place, there is no need to have separate
readable and writable device-exclusive entry types, and we'll merge
them next separately.

Note that, during fork(), we first convert the device-exclusive entries
back to ordinary PTEs, and we only ever allow conversion of writable
PTEs to device-exclusive -- only mprotect can currently change them to
readable-device-exclusive. Consequently, we always expect
PageAnonExclusive(page)==true and can_change_pte_writable()==true,
unless we are dealing with soft-dirty tracking or uffd-wp. But reusing
can_change_pte_writable() for now is cleaner.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 03efeeef895a..db38d6ae4e74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -725,18 +725,21 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 	struct folio *folio = page_folio(page);
 	pte_t orig_pte;
 	pte_t pte;
-	swp_entry_t entry;
 
 	orig_pte = ptep_get(ptep);
 	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
 
-	entry = pte_to_swp_entry(orig_pte);
 	if (pte_swp_uffd_wp(orig_pte))
 		pte = pte_mkuffd_wp(pte);
-	else if (is_writable_device_exclusive_entry(entry))
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+	if ((vma->vm_flags & VM_WRITE) &&
+	    can_change_pte_writable(vma, address, pte)) {
+		if (folio_test_dirty(folio))
+			pte = pte_mkdirty(pte);
+		pte = pte_mkwrite(pte, vma);
+	}
 
 	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
 					   PageAnonExclusive(page)), folio);
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-01-29 11:54 ` [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable() David Hildenbrand
@ 2025-01-30  9:51   ` Simona Vetter
  2025-01-30  9:58     ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-30  9:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
> Let's do it just like mprotect write-upgrade or during NUMA-hinting
> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
> can_change_pte_writable().
> 
> Set the PTE only dirty if the folio is dirty: we might not
> necessarily have a write access, and setting the PTE writable doesn't
> require setting the PTE dirty.

Not sure whether there's much difference in practice, since a device
exclusive access means a write, so the folio better be dirty (unless we
aborted halfway through). But then I couldn't find the code in nouveau to
do that, so now I'm confused.
-Sima

> With this change in place, there is no need to have separate
> readable and writable device-exclusive entry types, and we'll merge
> them next separately.
> 
> Note that, during fork(), we first convert the device-exclusive entries
> back to ordinary PTEs, and we only ever allow conversion of writable
> PTEs to device-exclusive -- only mprotect can currently change them to
> readable-device-exclusive. Consequently, we always expect
> PageAnonExclusive(page)==true and can_change_pte_writable()==true,
> unless we are dealing with soft-dirty tracking or uffd-wp. But reusing
> can_change_pte_writable() for now is cleaner.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 03efeeef895a..db38d6ae4e74 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -725,18 +725,21 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
>  	struct folio *folio = page_folio(page);
>  	pte_t orig_pte;
>  	pte_t pte;
> -	swp_entry_t entry;
>  
>  	orig_pte = ptep_get(ptep);
>  	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
>  	if (pte_swp_soft_dirty(orig_pte))
>  		pte = pte_mksoft_dirty(pte);
>  
> -	entry = pte_to_swp_entry(orig_pte);
>  	if (pte_swp_uffd_wp(orig_pte))
>  		pte = pte_mkuffd_wp(pte);
> -	else if (is_writable_device_exclusive_entry(entry))
> -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +
> +	if ((vma->vm_flags & VM_WRITE) &&
> +	    can_change_pte_writable(vma, address, pte)) {
> +		if (folio_test_dirty(folio))
> +			pte = pte_mkdirty(pte);
> +		pte = pte_mkwrite(pte, vma);
> +	}
>  
>  	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
>  					   PageAnonExclusive(page)), folio);
> -- 
> 2.48.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-01-30  9:51   ` Simona Vetter
@ 2025-01-30  9:58     ` David Hildenbrand
  2025-01-30 13:03       ` Simona Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30  9:58 UTC (permalink / raw)
  To: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On 30.01.25 10:51, Simona Vetter wrote:
> On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
>> Let's do it just like mprotect write-upgrade or during NUMA-hinting
>> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
>> can_change_pte_writable().
>>
>> Set the PTE only dirty if the folio is dirty: we might not
>> necessarily have a write access, and setting the PTE writable doesn't
>> require setting the PTE dirty.
> 
> Not sure whether there's much difference in practice, since a device
> exclusive access means a write, so the folio better be dirty (unless we
> aborted halfway through). But then I couldn't find the code in nouveau to
> do that, so now I'm confused.

That confused me as well. Requiring the PTE to be writable does not 
imply that it is dirty.

So something must either set the PTE or the folio dirty.

( In practice, most anonymous folios are dirty most of the time ... )

If we assume that "device-exclusive entries" are always dirty, then it 
doesn't make sense to set the folio dirty when creating device-exclusive 
entries. We'd always have to set the PTE dirty when restoring the 
exclusive pte.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-01-30  9:58     ` David Hildenbrand
@ 2025-01-30 13:03       ` Simona Vetter
  2025-01-30 23:06         ` Alistair Popple
  0 siblings, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 13:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
> On 30.01.25 10:51, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
> > > Let's do it just like mprotect write-upgrade or during NUMA-hinting
> > > faults on PROT_NONE PTEs: detect if the PTE can be writable by using
> > > can_change_pte_writable().
> > > 
> > > Set the PTE only dirty if the folio is dirty: we might not
> > > necessarily have a write access, and setting the PTE writable doesn't
> > > require setting the PTE dirty.
> > 
> > Not sure whether there's much difference in practice, since a device
> > exclusive access means a write, so the folio better be dirty (unless we
> > aborted halfway through). But then I couldn't find the code in nouveau to
> > do that, so now I'm confused.
> 
> That confused me as well. Requiring the PTE to be writable does not imply
> that it is dirty.
> 
> So something must either set the PTE or the folio dirty.

Yeah I'm not finding that something.

> ( In practice, most anonymous folios are dirty most of the time ... )

And yup that's why I think it hasn't blown up yet.

> If we assume that "device-exclusive entries" are always dirty, then it
> doesn't make sense to set the folio dirty when creating device-exclusive
> entries. We'd always have to set the PTE dirty when restoring the exclusive
> pte.

I do agree with your change, I think it's correct to put this
responsibility onto drivers. It's just that nouveau seems to not be
entirely correct.

And thinking about this I have vague memories that I've discussed the case
of the missing folio_set_dirty in noveau hmm code before, maybe with
Alistair. But quick search in archives didn't turn up anything.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-01-30 13:03       ` Simona Vetter
@ 2025-01-30 23:06         ` Alistair Popple
  2025-01-31 10:55           ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2025-01-30 23:06 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
> > On 30.01.25 10:51, Simona Vetter wrote:
> > > On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
> > > > Let's do it just like mprotect write-upgrade or during NUMA-hinting
> > > > faults on PROT_NONE PTEs: detect if the PTE can be writable by using
> > > > can_change_pte_writable().
> > > > 
> > > > Set the PTE only dirty if the folio is dirty: we might not
> > > > necessarily have a write access, and setting the PTE writable doesn't
> > > > require setting the PTE dirty.
> > > 
> > > Not sure whether there's much difference in practice, since a device
> > > exclusive access means a write, so the folio better be dirty (unless we
> > > aborted halfway through). But then I couldn't find the code in nouveau to
> > > do that, so now I'm confused.
> > 
> > That confused me as well. Requiring the PTE to be writable does not imply
> > that it is dirty.
> > 
> > So something must either set the PTE or the folio dirty.
> 
> Yeah I'm not finding that something.
> 
> > ( In practice, most anonymous folios are dirty most of the time ... )
> 
> And yup that's why I think it hasn't blown up yet.
> 
> > If we assume that "device-exclusive entries" are always dirty, then it
> > doesn't make sense to set the folio dirty when creating device-exclusive
> > entries. We'd always have to set the PTE dirty when restoring the exclusive
> > pte.
> 
> I do agree with your change, I think it's correct to put this
> responsibility onto drivers. It's just that nouveau seems to not be
> entirely correct.

Yeah, agree it should be a driver responsibility but also can't see how nouveau
is correct there either. I might see if I can get it to blow up...

> And thinking about this I have vague memories that I've discussed the case
> of the missing folio_set_dirty in noveau hmm code before, maybe with
> Alistair. But quick search in archives didn't turn up anything.

I have vague recollections of that, but I could be confusing it with some of the
migrate_vma_*() issues we had dropping dirty bits (see
https://lkml.kernel.org/r/dd48e4882ce859c295c1a77612f66d198b0403f9.1662078528.git-series.apopple@nvidia.com)

> -Sima
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-01-30 23:06         ` Alistair Popple
@ 2025-01-31 10:55           ` David Hildenbrand
  2025-01-31 17:05             ` Simona Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-31 10:55 UTC (permalink / raw)
  To: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 31.01.25 00:06, Alistair Popple wrote:
> On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
>> On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
>>> On 30.01.25 10:51, Simona Vetter wrote:
>>>> On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
>>>>> Let's do it just like mprotect write-upgrade or during NUMA-hinting
>>>>> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
>>>>> can_change_pte_writable().
>>>>>
>>>>> Set the PTE only dirty if the folio is dirty: we might not
>>>>> necessarily have a write access, and setting the PTE writable doesn't
>>>>> require setting the PTE dirty.
>>>>
>>>> Not sure whether there's much difference in practice, since a device
>>>> exclusive access means a write, so the folio better be dirty (unless we
>>>> aborted halfway through). But then I couldn't find the code in nouveau to
>>>> do that, so now I'm confused.
>>>
>>> That confused me as well. Requiring the PTE to be writable does not imply
>>> that it is dirty.
>>>
>>> So something must either set the PTE or the folio dirty.
>>
>> Yeah I'm not finding that something.
>>
>>> ( In practice, most anonymous folios are dirty most of the time ... )
>>
>> And yup that's why I think it hasn't blown up yet.
>>
>>> If we assume that "device-exclusive entries" are always dirty, then it
>>> doesn't make sense to set the folio dirty when creating device-exclusive
>>> entries. We'd always have to set the PTE dirty when restoring the exclusive
>>> pte.
>>
>> I do agree with your change, I think it's correct to put this
>> responsibility onto drivers. It's just that nouveau seems to not be
>> entirely correct.
> 
> Yeah, agree it should be a driver responsibility but also can't see how nouveau
> is correct there either. I might see if I can get it to blow up...

(in context of the rmap walkers) The question is, how do we consider 
device-exclusive entries:

(1) dirty? Not from a CPU perspective.
(2) referenced? Not from a CPU perspective.

If the answer is always "no" to all questions, then memory notifiers 
must handle it, because we'd be answering the question from the CPU 
point of view.

If the answer is always "yes", there is a problem: we can only make it 
clean/young by converting it to an ordinary PTE first (requiring MMU 
notifiers etc.), which makes it quite nasty.

Mixed answers are not possible, because we don't know just from staring 
at the entry.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-01-31 10:55           ` David Hildenbrand
@ 2025-01-31 17:05             ` Simona Vetter
  2025-02-04 10:58               ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-31 17:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Fri, Jan 31, 2025 at 11:55:55AM +0100, David Hildenbrand wrote:
> On 31.01.25 00:06, Alistair Popple wrote:
> > On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
> > > On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
> > > > On 30.01.25 10:51, Simona Vetter wrote:
> > > > > On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
> > > > > > Let's do it just like mprotect write-upgrade or during NUMA-hinting
> > > > > > faults on PROT_NONE PTEs: detect if the PTE can be writable by using
> > > > > > can_change_pte_writable().
> > > > > > 
> > > > > > Set the PTE only dirty if the folio is dirty: we might not
> > > > > > necessarily have a write access, and setting the PTE writable doesn't
> > > > > > require setting the PTE dirty.
> > > > > 
> > > > > Not sure whether there's much difference in practice, since a device
> > > > > exclusive access means a write, so the folio better be dirty (unless we
> > > > > aborted halfway through). But then I couldn't find the code in nouveau to
> > > > > do that, so now I'm confused.
> > > > 
> > > > That confused me as well. Requiring the PTE to be writable does not imply
> > > > that it is dirty.
> > > > 
> > > > So something must either set the PTE or the folio dirty.
> > > 
> > > Yeah I'm not finding that something.
> > > 
> > > > ( In practice, most anonymous folios are dirty most of the time ... )
> > > 
> > > And yup that's why I think it hasn't blown up yet.
> > > 
> > > > If we assume that "device-exclusive entries" are always dirty, then it
> > > > doesn't make sense to set the folio dirty when creating device-exclusive
> > > > entries. We'd always have to set the PTE dirty when restoring the exclusive
> > > > pte.
> > > 
> > > I do agree with your change, I think it's correct to put this
> > > responsibility onto drivers. It's just that nouveau seems to not be
> > > entirely correct.
> > 
> > Yeah, agree it should be a driver responsibility but also can't see how nouveau
> > is correct there either. I might see if I can get it to blow up...
> 
> (in context of the rmap walkers) The question is, how do we consider
> device-exclusive entries:
> 
> (1) dirty? Not from a CPU perspective.
> (2) referenced? Not from a CPU perspective.
> 
> If the answer is always "no" to all questions, then memory notifiers must
> handle it, because we'd be answering the question from the CPU point of
> view.
> 
> If the answer is always "yes", there is a problem: we can only make it
> clean/young by converting it to an ordinary PTE first (requiring MMU
> notifiers etc.), which makes it quite nasty.
> 
> Mixed answers are not possible, because we don't know just from staring at
> the entry.

I think it's the gpu's (or whatever is using it) responsibility to update
folio state while it has ptes pointing at memory. Whether that's
device-exclusive system memory or device-private migrated memory. Anything
else doesn't make sense to me conceptually.

And I don't think we can just blindly assume even for device-exclusive
mappings that they will be dirty when we convert them back to a real pte,
because we might have raced trying to set up the gpu mapping and restarted
before we even put the pte into place. Or maybe someone was real quick at
writing it back after the gpu already dropped it's pte.

I guess maybe some clear documentation in all these functions
(make_device_exclusive, hmm_range_fault, migration helpers) that it's the
drivers job to dirty pages correctly would help?

But nouveau definitely does not look very correct here, pretty sure on
that.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()
  2025-01-31 17:05             ` Simona Vetter
@ 2025-02-04 10:58               ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-02-04 10:58 UTC (permalink / raw)
  To: Alistair Popple, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 31.01.25 18:05, Simona Vetter wrote:
> On Fri, Jan 31, 2025 at 11:55:55AM +0100, David Hildenbrand wrote:
>> On 31.01.25 00:06, Alistair Popple wrote:
>>> On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
>>>> On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
>>>>> On 30.01.25 10:51, Simona Vetter wrote:
>>>>>> On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
>>>>>>> Let's do it just like mprotect write-upgrade or during NUMA-hinting
>>>>>>> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
>>>>>>> can_change_pte_writable().
>>>>>>>
>>>>>>> Set the PTE only dirty if the folio is dirty: we might not
>>>>>>> necessarily have a write access, and setting the PTE writable doesn't
>>>>>>> require setting the PTE dirty.
>>>>>>
>>>>>> Not sure whether there's much difference in practice, since a device
>>>>>> exclusive access means a write, so the folio better be dirty (unless we
>>>>>> aborted halfway through). But then I couldn't find the code in nouveau to
>>>>>> do that, so now I'm confused.
>>>>>
>>>>> That confused me as well. Requiring the PTE to be writable does not imply
>>>>> that it is dirty.
>>>>>
>>>>> So something must either set the PTE or the folio dirty.
>>>>
>>>> Yeah I'm not finding that something.
>>>>
>>>>> ( In practice, most anonymous folios are dirty most of the time ... )
>>>>
>>>> And yup that's why I think it hasn't blown up yet.
>>>>
>>>>> If we assume that "device-exclusive entries" are always dirty, then it
>>>>> doesn't make sense to set the folio dirty when creating device-exclusive
>>>>> entries. We'd always have to set the PTE dirty when restoring the exclusive
>>>>> pte.
>>>>
>>>> I do agree with your change, I think it's correct to put this
>>>> responsibility onto drivers. It's just that nouveau seems to not be
>>>> entirely correct.
>>>
>>> Yeah, agree it should be a driver responsibility but also can't see how nouveau
>>> is correct there either. I might see if I can get it to blow up...
>>
>> (in context of the rmap walkers) The question is, how do we consider
>> device-exclusive entries:
>>
>> (1) dirty? Not from a CPU perspective.
>> (2) referenced? Not from a CPU perspective.
>>
>> If the answer is always "no" to all questions, then memory notifiers must
>> handle it, because we'd be answering the question from the CPU point of
>> view.
>>
>> If the answer is always "yes", there is a problem: we can only make it
>> clean/young by converting it to an ordinary PTE first (requiring MMU
>> notifiers etc.), which makes it quite nasty.
>>
>> Mixed answers are not possible, because we don't know just from staring at
>> the entry.
> 
> I think it's the gpu's (or whatever is using it) responsibility to update
> folio state while it has ptes pointing at memory. Whether that's
> device-exclusive system memory or device-private migrated memory. Anything
> else doesn't make sense to me conceptually.
> 
> And I don't think we can just blindly assume even for device-exclusive
> mappings that they will be dirty when we convert them back to a real pte,
> because we might have raced trying to set up the gpu mapping and restarted
> before we even put the pte into place. Or maybe someone was real quick at
> writing it back after the gpu already dropped it's pte.
> 
> I guess maybe some clear documentation in all these functions
> (make_device_exclusive, hmm_range_fault, migration helpers) that it's the
> drivers job to dirty pages correctly would help?


I'll add a comment to make_device_exclusive(), stating that these 
entries are considered clean+old from a MM perspective, and that the 
driver must update the folio when notified by MMU notifiers.

We should probably document that somewhere else as well as you suggested 
separately.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 06/12] mm: use single SWP_DEVICE_EXCLUSIVE entry type
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (4 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable() David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-30 13:43   ` Simona Vetter
  2025-01-30 23:28   ` Alistair Popple
  2025-01-29 11:54 ` [PATCH v1 07/12] mm/page_vma_mapped: device-private entries are not migration entries David Hildenbrand
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

There is no need for the distinction anymore; let's merge the readable
and writable device-exclusive entries into a single device-exclusive
entry type.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swap.h    |  7 +++----
 include/linux/swapops.h | 27 ++++-----------------------
 mm/mprotect.c           |  8 --------
 mm/page_table_check.c   |  5 ++---
 mm/rmap.c               |  2 +-
 5 files changed, 10 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 91b30701274e..9a48e79a0a52 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -74,14 +74,13 @@ static inline int current_is_kswapd(void)
  * to a special SWP_DEVICE_{READ|WRITE} entry.
  *
  * When a page is mapped by the device for exclusive access we set the CPU page
- * table entries to special SWP_DEVICE_EXCLUSIVE_* entries.
+ * table entries to a special SWP_DEVICE_EXCLUSIVE entry.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 4
+#define SWP_DEVICE_NUM 3
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
-#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
-#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
+#define SWP_DEVICE_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 96f26e29fefe..64ea151a7ae3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -186,26 +186,16 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 
-static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
 {
-	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
-}
-
-static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
-{
-	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+	return swp_entry(SWP_DEVICE_EXCLUSIVE, offset);
 }
 
 static inline bool is_device_exclusive_entry(swp_entry_t entry)
 {
-	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
-		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
 }
 
-static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
-{
-	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
-}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
@@ -227,12 +217,7 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 	return false;
 }
 
-static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
-{
-	return swp_entry(0, 0);
-}
-
-static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
 {
 	return swp_entry(0, 0);
 }
@@ -242,10 +227,6 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
 	return false;
 }
 
-static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
-{
-	return false;
-}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 516b1d847e2c..9cb6ab7c4048 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,14 +225,6 @@ static long change_pte_range(struct mmu_gather *tlb,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
-			} else if (is_writable_device_exclusive_entry(entry)) {
-				entry = make_readable_device_exclusive_entry(
-							swp_offset(entry));
-				newpte = swp_entry_to_pte(entry);
-				if (pte_swp_soft_dirty(oldpte))
-					newpte = pte_swp_mksoft_dirty(newpte);
-				if (pte_swp_uffd_wp(oldpte))
-					newpte = pte_swp_mkuffd_wp(newpte);
 			} else if (is_pte_marker_entry(entry)) {
 				/*
 				 * Ignore error swap entries unconditionally,
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 509c6ef8de40..c2b3600429a0 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -196,9 +196,8 @@ EXPORT_SYMBOL(__page_table_check_pud_clear);
 /* Whether the swap entry cached writable information */
 static inline bool swap_cached_writable(swp_entry_t entry)
 {
-	return is_writable_device_exclusive_entry(entry) ||
-	    is_writable_device_private_entry(entry) ||
-	    is_writable_migration_entry(entry);
+	return is_writable_device_private_entry(entry) ||
+	       is_writable_migration_entry(entry);
 }
 
 static inline void page_table_check_pte_flags(pte_t pte)
diff --git a/mm/rmap.c b/mm/rmap.c
index 49ffac6d27f8..65d9bbea16d0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2470,7 +2470,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	 * do_swap_page() will trigger the conversion back while holding the
 	 * folio lock.
 	 */
-	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
+	entry = make_device_exclusive_entry(page_to_pfn(page));
 	swp_pte = swp_entry_to_pte(entry);
 	if (pte_soft_dirty(fw.pte))
 		swp_pte = pte_swp_mksoft_dirty(swp_pte);
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 06/12] mm: use single SWP_DEVICE_EXCLUSIVE entry type
  2025-01-29 11:54 ` [PATCH v1 06/12] mm: use single SWP_DEVICE_EXCLUSIVE entry type David Hildenbrand
@ 2025-01-30 13:43   ` Simona Vetter
  2025-01-30 23:28   ` Alistair Popple
  1 sibling, 0 replies; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 13:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:04PM +0100, David Hildenbrand wrote:
> There is no need for the distinction anymore; let's merge the readable
> and writable device-exclusive entries into a single device-exclusive
> entry type.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Yeah I really don't think there's a need to track that on the cpu ptes.
Device should have write/dirty bits (which nouveau really should forward)
or defensively assume the page is always dirtied when clearing a pte.

Plus the entire point of device exclusive access is to support device
atomics, so in practice it's always a write access that dirties anyway.

Acked-by: Simona Vetter <simona.vetter@ffwll.ch>

> ---
>  include/linux/swap.h    |  7 +++----
>  include/linux/swapops.h | 27 ++++-----------------------
>  mm/mprotect.c           |  8 --------
>  mm/page_table_check.c   |  5 ++---
>  mm/rmap.c               |  2 +-
>  5 files changed, 10 insertions(+), 39 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 91b30701274e..9a48e79a0a52 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -74,14 +74,13 @@ static inline int current_is_kswapd(void)
>   * to a special SWP_DEVICE_{READ|WRITE} entry.
>   *
>   * When a page is mapped by the device for exclusive access we set the CPU page
> - * table entries to special SWP_DEVICE_EXCLUSIVE_* entries.
> + * table entries to a special SWP_DEVICE_EXCLUSIVE entry.
>   */
>  #ifdef CONFIG_DEVICE_PRIVATE
> -#define SWP_DEVICE_NUM 4
> +#define SWP_DEVICE_NUM 3
>  #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
>  #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
> -#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
> -#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
> +#define SWP_DEVICE_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
>  #else
>  #define SWP_DEVICE_NUM 0
>  #endif
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 96f26e29fefe..64ea151a7ae3 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -186,26 +186,16 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
>  	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
>  }
>  
> -static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
> +static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
>  {
> -	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
> -}
> -
> -static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
> -{
> -	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
> +	return swp_entry(SWP_DEVICE_EXCLUSIVE, offset);
>  }
>  
>  static inline bool is_device_exclusive_entry(swp_entry_t entry)
>  {
> -	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
> -		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
> +	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
>  }
>  
> -static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
> -{
> -	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
> -}
>  #else /* CONFIG_DEVICE_PRIVATE */
>  static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
>  {
> @@ -227,12 +217,7 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
>  	return false;
>  }
>  
> -static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
> -{
> -	return swp_entry(0, 0);
> -}
> -
> -static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
> +static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
>  {
>  	return swp_entry(0, 0);
>  }
> @@ -242,10 +227,6 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
>  	return false;
>  }
>  
> -static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
> -{
> -	return false;
> -}
>  #endif /* CONFIG_DEVICE_PRIVATE */
>  
>  #ifdef CONFIG_MIGRATION
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 516b1d847e2c..9cb6ab7c4048 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -225,14 +225,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				newpte = swp_entry_to_pte(entry);
>  				if (pte_swp_uffd_wp(oldpte))
>  					newpte = pte_swp_mkuffd_wp(newpte);
> -			} else if (is_writable_device_exclusive_entry(entry)) {
> -				entry = make_readable_device_exclusive_entry(
> -							swp_offset(entry));
> -				newpte = swp_entry_to_pte(entry);
> -				if (pte_swp_soft_dirty(oldpte))
> -					newpte = pte_swp_mksoft_dirty(newpte);
> -				if (pte_swp_uffd_wp(oldpte))
> -					newpte = pte_swp_mkuffd_wp(newpte);
>  			} else if (is_pte_marker_entry(entry)) {
>  				/*
>  				 * Ignore error swap entries unconditionally,
> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> index 509c6ef8de40..c2b3600429a0 100644
> --- a/mm/page_table_check.c
> +++ b/mm/page_table_check.c
> @@ -196,9 +196,8 @@ EXPORT_SYMBOL(__page_table_check_pud_clear);
>  /* Whether the swap entry cached writable information */
>  static inline bool swap_cached_writable(swp_entry_t entry)
>  {
> -	return is_writable_device_exclusive_entry(entry) ||
> -	    is_writable_device_private_entry(entry) ||
> -	    is_writable_migration_entry(entry);
> +	return is_writable_device_private_entry(entry) ||
> +	       is_writable_migration_entry(entry);
>  }
>  
>  static inline void page_table_check_pte_flags(pte_t pte)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 49ffac6d27f8..65d9bbea16d0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2470,7 +2470,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  	 * do_swap_page() will trigger the conversion back while holding the
>  	 * folio lock.
>  	 */
> -	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
> +	entry = make_device_exclusive_entry(page_to_pfn(page));
>  	swp_pte = swp_entry_to_pte(entry);
>  	if (pte_soft_dirty(fw.pte))
>  		swp_pte = pte_swp_mksoft_dirty(swp_pte);
> -- 
> 2.48.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 06/12] mm: use single SWP_DEVICE_EXCLUSIVE entry type
  2025-01-29 11:54 ` [PATCH v1 06/12] mm: use single SWP_DEVICE_EXCLUSIVE entry type David Hildenbrand
  2025-01-30 13:43   ` Simona Vetter
@ 2025-01-30 23:28   ` Alistair Popple
  1 sibling, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2025-01-30 23:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:04PM +0100, David Hildenbrand wrote:
> There is no need for the distinction anymore; let's merge the readable
> and writable device-exclusive entries into a single device-exclusive
> entry type.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/swap.h    |  7 +++----
>  include/linux/swapops.h | 27 ++++-----------------------
>  mm/mprotect.c           |  8 --------
>  mm/page_table_check.c   |  5 ++---
>  mm/rmap.c               |  2 +-
>  5 files changed, 10 insertions(+), 39 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 91b30701274e..9a48e79a0a52 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -74,14 +74,13 @@ static inline int current_is_kswapd(void)
>   * to a special SWP_DEVICE_{READ|WRITE} entry.
>   *
>   * When a page is mapped by the device for exclusive access we set the CPU page
> - * table entries to special SWP_DEVICE_EXCLUSIVE_* entries.
> + * table entries to a special SWP_DEVICE_EXCLUSIVE entry.
>   */
>  #ifdef CONFIG_DEVICE_PRIVATE
> -#define SWP_DEVICE_NUM 4
> +#define SWP_DEVICE_NUM 3
>  #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
>  #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
> -#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
> -#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
> +#define SWP_DEVICE_EXCLUSIVE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
>  #else
>  #define SWP_DEVICE_NUM 0
>  #endif
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 96f26e29fefe..64ea151a7ae3 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -186,26 +186,16 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
>  	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
>  }
>  
> -static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
> +static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
>  {
> -	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
> -}
> -
> -static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
> -{
> -	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
> +	return swp_entry(SWP_DEVICE_EXCLUSIVE, offset);
>  }
>  
>  static inline bool is_device_exclusive_entry(swp_entry_t entry)
>  {
> -	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
> -		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
> +	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
>  }
>  
> -static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
> -{
> -	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
> -}
>  #else /* CONFIG_DEVICE_PRIVATE */
>  static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
>  {
> @@ -227,12 +217,7 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
>  	return false;
>  }
>  
> -static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
> -{
> -	return swp_entry(0, 0);
> -}
> -
> -static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
> +static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
>  {
>  	return swp_entry(0, 0);
>  }
> @@ -242,10 +227,6 @@ static inline bool is_device_exclusive_entry(swp_entry_t entry)
>  	return false;
>  }
>  
> -static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
> -{
> -	return false;
> -}
>  #endif /* CONFIG_DEVICE_PRIVATE */
>  
>  #ifdef CONFIG_MIGRATION
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 516b1d847e2c..9cb6ab7c4048 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -225,14 +225,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				newpte = swp_entry_to_pte(entry);
>  				if (pte_swp_uffd_wp(oldpte))
>  					newpte = pte_swp_mkuffd_wp(newpte);
> -			} else if (is_writable_device_exclusive_entry(entry)) {
> -				entry = make_readable_device_exclusive_entry(
> -							swp_offset(entry));
> -				newpte = swp_entry_to_pte(entry);
> -				if (pte_swp_soft_dirty(oldpte))
> -					newpte = pte_swp_mksoft_dirty(newpte);
> -				if (pte_swp_uffd_wp(oldpte))
> -					newpte = pte_swp_mkuffd_wp(newpte);

So just to check my understanding the idea is we now check vma->vm_flags in
restore_exclusive_pte() to restore them as read-only in the case of mprotect()
write protecting the range? That makes sense to me, so assuming that's true:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

>  			} else if (is_pte_marker_entry(entry)) {
>  				/*
>  				 * Ignore error swap entries unconditionally,
> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> index 509c6ef8de40..c2b3600429a0 100644
> --- a/mm/page_table_check.c
> +++ b/mm/page_table_check.c
> @@ -196,9 +196,8 @@ EXPORT_SYMBOL(__page_table_check_pud_clear);
>  /* Whether the swap entry cached writable information */
>  static inline bool swap_cached_writable(swp_entry_t entry)
>  {
> -	return is_writable_device_exclusive_entry(entry) ||
> -	    is_writable_device_private_entry(entry) ||
> -	    is_writable_migration_entry(entry);
> +	return is_writable_device_private_entry(entry) ||
> +	       is_writable_migration_entry(entry);
>  }
>  
>  static inline void page_table_check_pte_flags(pte_t pte)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 49ffac6d27f8..65d9bbea16d0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2470,7 +2470,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  	 * do_swap_page() will trigger the conversion back while holding the
>  	 * folio lock.
>  	 */
> -	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
> +	entry = make_device_exclusive_entry(page_to_pfn(page));
>  	swp_pte = swp_entry_to_pte(entry);
>  	if (pte_soft_dirty(fw.pte))
>  		swp_pte = pte_swp_mksoft_dirty(swp_pte);
> -- 
> 2.48.1
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 07/12] mm/page_vma_mapped: device-private entries are not migration entries
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (5 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 06/12] mm: use single SWP_DEVICE_EXCLUSIVE entry type David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-30 23:36   ` Alistair Popple
  2025-01-29 11:54 ` [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one() David Hildenbrand
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

It's unclear why they would be considered migration entries; they are
not.

Likely we'll never really trigger that case in practice, because
migration (including folio split) of a folio that has device-private
entries is never started, as we would detect "additional references":
device-private entries adjust the mapcount, but not the refcount.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_vma_mapped.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 81839a9e74f1..32679be22d30 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -111,8 +111,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 			return false;
 		entry = pte_to_swp_entry(ptent);
 
-		if (!is_migration_entry(entry) &&
-		    !is_device_exclusive_entry(entry))
+		if (!is_migration_entry(entry))
 			return false;
 
 		pfn = swp_offset_pfn(entry);
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 07/12] mm/page_vma_mapped: device-private entries are not migration entries
  2025-01-29 11:54 ` [PATCH v1 07/12] mm/page_vma_mapped: device-private entries are not migration entries David Hildenbrand
@ 2025-01-30 23:36   ` Alistair Popple
  2025-01-31 11:06     ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2025-01-30 23:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:05PM +0100, David Hildenbrand wrote:
> It's unclear why they would be considered migration entries; they are
> not.

Yeah, I agree that doesn't seem right. I suspect I was initially modelling
device exclusive entries similar to migration entries but obviously went too
far. So thanks for fixing:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

> Likely we'll never really trigger that case in practice, because
> migration (including folio split) of a folio that has device-private
> entries is never started, as we would detect "additional references":
> device-private entries adjust the mapcount, but not the refcount.
> 
> Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/page_vma_mapped.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 81839a9e74f1..32679be22d30 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -111,8 +111,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  			return false;
>  		entry = pte_to_swp_entry(ptent);
>  
> -		if (!is_migration_entry(entry) &&
> -		    !is_device_exclusive_entry(entry))
> +		if (!is_migration_entry(entry))
>  			return false;
>  
>  		pfn = swp_offset_pfn(entry);
> -- 
> 2.48.1
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 07/12] mm/page_vma_mapped: device-private entries are not migration entries
  2025-01-30 23:36   ` Alistair Popple
@ 2025-01-31 11:06     ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-31 11:06 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Jason Gunthorpe

On 31.01.25 00:36, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:54:05PM +0100, David Hildenbrand wrote:
>> It's unclear why they would be considered migration entries; they are
>> not.
> 
> Yeah, I agree that doesn't seem right. I suspect I was initially modelling
> device exclusive entries similar to migration entries but obviously went too
> far. So thanks for fixing:
> 
> Reviewed-by: Alistair Popple <apopple@nvidia.com>

Thanks ... fixing all the wrong use of "device-private" in the 
subject+description ... not sure what my mind was doing there.

It's all about "device-exclusive" of course.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (6 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 07/12] mm/page_vma_mapped: device-private entries are not migration entries David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-30 10:10   ` Simona Vetter
  2025-01-29 11:54 ` [PATCH v1 09/12] mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one() David Hildenbrand
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

try_to_unmap_one() is not prepared for that, so teach it about these
non-present nonswap PTEs.

Before that, could we also have triggered this case with device-private
entries? Unlikely.

Note that we could currently only run into this case with
device-exclusive entries on THPs. For order-0 folios, we still adjust
the mapcount on conversion to device-exclusive, making the rmap walk
abort early (folio_mapcount() == 0 and breaking swapout). We'll fix
that next, now that try_to_unmap_one() can handle it.

Further note that try_to_unmap() calls MMU notifiers and holds the
folio lock, so any device-exclusive users should be properly prepared
for this device-exclusive PTE to "vanish".

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 65d9bbea16d0..12900f367a2a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1648,9 +1648,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	bool anon_exclusive, ret = true;
 	pte_t pteval;
 	struct page *subpage;
-	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
@@ -1722,7 +1722,19 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
 
-		pfn = pte_pfn(ptep_get(pvmw.pte));
+		/*
+		 * We can end up here with selected non-swap entries that
+		 * actually map pages similar to PROT_NONE; see
+		 * page_vma_mapped_walk()->check_pte().
+		 */
+		pteval = ptep_get(pvmw.pte);
+		if (likely(pte_present(pteval))) {
+			pfn = pte_pfn(pteval);
+		} else {
+			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
+			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+		}
+
 		subpage = folio_page(folio, pfn - folio_pfn(folio));
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
@@ -1778,7 +1790,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				hugetlb_vma_unlock_write(vma);
 			}
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-		} else {
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+		} else if (likely(pte_present(pteval))) {
 			flush_cache_page(vma, address, pfn);
 			/* Nuke the page table entry. */
 			if (should_defer_flush(mm, flags)) {
@@ -1796,6 +1810,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+		} else {
+			pte_clear(mm, address, pvmw.pte);
 		}
 
 		/*
@@ -1805,10 +1823,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 */
 		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
 
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
-
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
@@ -1822,8 +1836,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				dec_mm_counter(mm, mm_counter(folio));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
-
-		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
+			   !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
 			 * interest anymore. Simply discard the pte, vmscan
@@ -1902,6 +1916,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
+
+			/*
+			 * arch_unmap_one() is expected to be a NOP on
+			 * architectures where we could have non-swp entries
+			 * here, so we'll not check/care.
+			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
 				swap_free(entry);
 				set_pte_at(mm, address, pvmw.pte, pteval);
@@ -1926,10 +1946,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (anon_exclusive)
 				swp_pte = pte_swp_mkexclusive(swp_pte);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			if (likely(pte_present(pteval))) {
+				if (pte_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			} else {
+				if (pte_swp_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_swp_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			}
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 		} else {
 			/*
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
  2025-01-29 11:54 ` [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one() David Hildenbrand
@ 2025-01-30 10:10   ` Simona Vetter
  2025-01-30 11:08     ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 10:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:06PM +0100, David Hildenbrand wrote:
> Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
> we can return with a device-exclusive entry from page_vma_mapped_walk().
> 
> try_to_unmap_one() is not prepared for that, so teach it about these
> non-present nonswap PTEs.
> 
> Before that, could we also have triggered this case with device-private
> entries? Unlikely.

Just quick comment on this, I'm still pondering all the other aspects.

device-private memory is entirely owned by the driver, the core mm isn't
supposed to touch these beyond migrating it back to system memory in
do_swap_page. Plus using rmap when the driver asks for invalidating
mappings as needed.

So no lru, thp, migration or anything initiated by core mm should ever
happen on these device private pages. If it does, it'd be a bug.

device-exclusive is entirely different ofc since that's just normal system
memory managed by core mm/.
-Sima
> 
> Note that we could currently only run into this case with
> device-exclusive entries on THPs. For order-0 folios, we still adjust
> the mapcount on conversion to device-exclusive, making the rmap walk
> abort early (folio_mapcount() == 0 and breaking swapout). We'll fix
> that next, now that try_to_unmap_one() can handle it.
> 
> Further note that try_to_unmap() calls MMU notifiers and holds the
> folio lock, so any device-exclusive users should be properly prepared
> for this device-exclusive PTE to "vanish".
> 
> Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/rmap.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 40 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 65d9bbea16d0..12900f367a2a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1648,9 +1648,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> +	bool anon_exclusive, ret = true;
>  	pte_t pteval;
>  	struct page *subpage;
> -	bool anon_exclusive, ret = true;
>  	struct mmu_notifier_range range;
>  	enum ttu_flags flags = (enum ttu_flags)(long)arg;
>  	unsigned long pfn;
> @@ -1722,7 +1722,19 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		/* Unexpected PMD-mapped THP? */
>  		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>  
> -		pfn = pte_pfn(ptep_get(pvmw.pte));
> +		/*
> +		 * We can end up here with selected non-swap entries that
> +		 * actually map pages similar to PROT_NONE; see
> +		 * page_vma_mapped_walk()->check_pte().
> +		 */
> +		pteval = ptep_get(pvmw.pte);
> +		if (likely(pte_present(pteval))) {
> +			pfn = pte_pfn(pteval);
> +		} else {
> +			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
> +			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
> +		}
> +
>  		subpage = folio_page(folio, pfn - folio_pfn(folio));
>  		address = pvmw.address;
>  		anon_exclusive = folio_test_anon(folio) &&
> @@ -1778,7 +1790,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				hugetlb_vma_unlock_write(vma);
>  			}
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> -		} else {
> +			if (pte_dirty(pteval))
> +				folio_mark_dirty(folio);
> +		} else if (likely(pte_present(pteval))) {
>  			flush_cache_page(vma, address, pfn);
>  			/* Nuke the page table entry. */
>  			if (should_defer_flush(mm, flags)) {
> @@ -1796,6 +1810,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			} else {
>  				pteval = ptep_clear_flush(vma, address, pvmw.pte);
>  			}
> +			if (pte_dirty(pteval))
> +				folio_mark_dirty(folio);
> +		} else {
> +			pte_clear(mm, address, pvmw.pte);
>  		}
>  
>  		/*
> @@ -1805,10 +1823,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		 */
>  		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
>  
> -		/* Set the dirty flag on the folio now the pte is gone. */
> -		if (pte_dirty(pteval))
> -			folio_mark_dirty(folio);
> -
>  		/* Update high watermark before we lower rss */
>  		update_hiwater_rss(mm);
>  
> @@ -1822,8 +1836,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				dec_mm_counter(mm, mm_counter(folio));
>  				set_pte_at(mm, address, pvmw.pte, pteval);
>  			}
> -
> -		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
> +		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
> +			   !userfaultfd_armed(vma)) {
>  			/*
>  			 * The guest indicated that the page content is of no
>  			 * interest anymore. Simply discard the pte, vmscan
> @@ -1902,6 +1916,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				set_pte_at(mm, address, pvmw.pte, pteval);
>  				goto walk_abort;
>  			}
> +
> +			/*
> +			 * arch_unmap_one() is expected to be a NOP on
> +			 * architectures where we could have non-swp entries
> +			 * here, so we'll not check/care.
> +			 */
>  			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>  				swap_free(entry);
>  				set_pte_at(mm, address, pvmw.pte, pteval);
> @@ -1926,10 +1946,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (anon_exclusive)
>  				swp_pte = pte_swp_mkexclusive(swp_pte);
> -			if (pte_soft_dirty(pteval))
> -				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> -			if (pte_uffd_wp(pteval))
> -				swp_pte = pte_swp_mkuffd_wp(swp_pte);
> +			if (likely(pte_present(pteval))) {
> +				if (pte_soft_dirty(pteval))
> +					swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +				if (pte_uffd_wp(pteval))
> +					swp_pte = pte_swp_mkuffd_wp(swp_pte);
> +			} else {
> +				if (pte_swp_soft_dirty(pteval))
> +					swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +				if (pte_swp_uffd_wp(pteval))
> +					swp_pte = pte_swp_mkuffd_wp(swp_pte);
> +			}
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
>  		} else {
>  			/*
> -- 
> 2.48.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
  2025-01-30 10:10   ` Simona Vetter
@ 2025-01-30 11:08     ` David Hildenbrand
  2025-01-30 13:06       ` Simona Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30 11:08 UTC (permalink / raw)
  To: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On 30.01.25 11:10, Simona Vetter wrote:
> On Wed, Jan 29, 2025 at 12:54:06PM +0100, David Hildenbrand wrote:
>> Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
>> we can return with a device-exclusive entry from page_vma_mapped_walk().
>>
>> try_to_unmap_one() is not prepared for that, so teach it about these
>> non-present nonswap PTEs.
>>
>> Before that, could we also have triggered this case with device-private
>> entries? Unlikely.
> 
> Just quick comment on this, I'm still pondering all the other aspects.
> 
> device-private memory is entirely owned by the driver, the core mm isn't
> supposed to touch these beyond migrating it back to system memory in
> do_swap_page. Plus using rmap when the driver asks for invalidating
> mappings as needed.
> 
> So no lru, thp, migration or anything initiated by core mm should ever
> happen on these device private pages. If it does, it'd be a bug.

I was not 100% sure about HWPoison handling, that's why I added that 
comment. In other regards I agree: reclaim etc. does not apply.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
  2025-01-30 11:08     ` David Hildenbrand
@ 2025-01-30 13:06       ` Simona Vetter
  2025-01-30 14:08         ` Jason Gunthorpe
  2025-01-30 15:52         ` David Hildenbrand
  0 siblings, 2 replies; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 13:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Thu, Jan 30, 2025 at 12:08:42PM +0100, David Hildenbrand wrote:
> On 30.01.25 11:10, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 12:54:06PM +0100, David Hildenbrand wrote:
> > > Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
> > > we can return with a device-exclusive entry from page_vma_mapped_walk().
> > > 
> > > try_to_unmap_one() is not prepared for that, so teach it about these
> > > non-present nonswap PTEs.
> > > 
> > > Before that, could we also have triggered this case with device-private
> > > entries? Unlikely.
> > 
> > Just quick comment on this, I'm still pondering all the other aspects.
> > 
> > device-private memory is entirely owned by the driver, the core mm isn't
> > supposed to touch these beyond migrating it back to system memory in
> > do_swap_page. Plus using rmap when the driver asks for invalidating
> > mappings as needed.
> > 
> > So no lru, thp, migration or anything initiated by core mm should ever
> > happen on these device private pages. If it does, it'd be a bug.
> 
> I was not 100% sure about HWPoison handling, that's why I added that
> comment. In other regards I agree: reclaim etc. does not apply.

So maybe I'm just entirely lost, but unless you have a coherent
interconnect I don't think hwpoisin should get involved with device
private memory? And for a coherent interconnect it's just device memory,
which isn't treated very special.

Also to clarify, I meant this as a general comment for all subsequent
patches that have the same paragraph.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
  2025-01-30 13:06       ` Simona Vetter
@ 2025-01-30 14:08         ` Jason Gunthorpe
  2025-01-30 16:10           ` Simona Vetter
  2025-01-30 15:52         ` David Hildenbrand
  1 sibling, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2025-01-30 14:08 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple

On Thu, Jan 30, 2025 at 02:06:12PM +0100, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 12:08:42PM +0100, David Hildenbrand wrote:
> > On 30.01.25 11:10, Simona Vetter wrote:
> > > On Wed, Jan 29, 2025 at 12:54:06PM +0100, David Hildenbrand wrote:
> > > > Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
> > > > we can return with a device-exclusive entry from page_vma_mapped_walk().
> > > > 
> > > > try_to_unmap_one() is not prepared for that, so teach it about these
> > > > non-present nonswap PTEs.
> > > > 
> > > > Before that, could we also have triggered this case with device-private
> > > > entries? Unlikely.
> > > 
> > > Just quick comment on this, I'm still pondering all the other aspects.
> > > 
> > > device-private memory is entirely owned by the driver, the core mm isn't
> > > supposed to touch these beyond migrating it back to system memory in
> > > do_swap_page. Plus using rmap when the driver asks for invalidating
> > > mappings as needed.
> > > 
> > > So no lru, thp, migration or anything initiated by core mm should ever
> > > happen on these device private pages. If it does, it'd be a bug.
> > 
> > I was not 100% sure about HWPoison handling, that's why I added that
> > comment. In other regards I agree: reclaim etc. does not apply.
> 
> So maybe I'm just entirely lost, but unless you have a coherent
> interconnect I don't think hwpoisin should get involved with device
> private memory? And for a coherent interconnect it's just device memory,
> which isn't treated very special.

I'm not sure it is meaningful, but in principle a driver could keep
track of the poisoned private memory using that struct page
bit. Perhaps in that sense it is more of a driver private flag than
something the core MM would touch.

If you have a coherent interconnect then you should not be using
device private.

Jason


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
  2025-01-30 14:08         ` Jason Gunthorpe
@ 2025-01-30 16:10           ` Simona Vetter
  0 siblings, 0 replies; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 16:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, linux-kernel, linux-doc, dri-devel, linux-mm,
	nouveau, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple

On Thu, Jan 30, 2025 at 10:08:32AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 30, 2025 at 02:06:12PM +0100, Simona Vetter wrote:
> > On Thu, Jan 30, 2025 at 12:08:42PM +0100, David Hildenbrand wrote:
> > > On 30.01.25 11:10, Simona Vetter wrote:
> > > > On Wed, Jan 29, 2025 at 12:54:06PM +0100, David Hildenbrand wrote:
> > > > > Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
> > > > > we can return with a device-exclusive entry from page_vma_mapped_walk().
> > > > > 
> > > > > try_to_unmap_one() is not prepared for that, so teach it about these
> > > > > non-present nonswap PTEs.
> > > > > 
> > > > > Before that, could we also have triggered this case with device-private
> > > > > entries? Unlikely.
> > > > 
> > > > Just quick comment on this, I'm still pondering all the other aspects.
> > > > 
> > > > device-private memory is entirely owned by the driver, the core mm isn't
> > > > supposed to touch these beyond migrating it back to system memory in
> > > > do_swap_page. Plus using rmap when the driver asks for invalidating
> > > > mappings as needed.
> > > > 
> > > > So no lru, thp, migration or anything initiated by core mm should ever
> > > > happen on these device private pages. If it does, it'd be a bug.
> > > 
> > > I was not 100% sure about HWPoison handling, that's why I added that
> > > comment. In other regards I agree: reclaim etc. does not apply.
> > 
> > So maybe I'm just entirely lost, but unless you have a coherent
> > interconnect I don't think hwpoisin should get involved with device
> > private memory? And for a coherent interconnect it's just device memory,
> > which isn't treated very special.
> 
> I'm not sure it is meaningful, but in principle a driver could keep
> track of the poisoned private memory using that struct page
> bit. Perhaps in that sense it is more of a driver private flag than
> something the core MM would touch.
> 
> If you have a coherent interconnect then you should not be using
> device private.

Yes on both, that's what I meant.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()
  2025-01-30 13:06       ` Simona Vetter
  2025-01-30 14:08         ` Jason Gunthorpe
@ 2025-01-30 15:52         ` David Hildenbrand
  1 sibling, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30 15:52 UTC (permalink / raw)
  To: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On 30.01.25 14:06, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 12:08:42PM +0100, David Hildenbrand wrote:
>> On 30.01.25 11:10, Simona Vetter wrote:
>>> On Wed, Jan 29, 2025 at 12:54:06PM +0100, David Hildenbrand wrote:
>>>> Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
>>>> we can return with a device-exclusive entry from page_vma_mapped_walk().
>>>>
>>>> try_to_unmap_one() is not prepared for that, so teach it about these
>>>> non-present nonswap PTEs.
>>>>
>>>> Before that, could we also have triggered this case with device-private
>>>> entries? Unlikely.
>>>
>>> Just quick comment on this, I'm still pondering all the other aspects.
>>>
>>> device-private memory is entirely owned by the driver, the core mm isn't
>>> supposed to touch these beyond migrating it back to system memory in
>>> do_swap_page. Plus using rmap when the driver asks for invalidating
>>> mappings as needed.
>>>
>>> So no lru, thp, migration or anything initiated by core mm should ever
>>> happen on these device private pages. If it does, it'd be a bug.
>>
>> I was not 100% sure about HWPoison handling, that's why I added that
>> comment. In other regards I agree: reclaim etc. does not apply.
> 
> So maybe I'm just entirely lost, but unless you have a coherent
> interconnect I don't think hwpoisin should get involved with device
> private memory? And for a coherent interconnect it's just device memory,
> which isn't treated very special.

I would have thought that in a scenario Jason describes, that you would 
still want to zap the page from the page table (try_to_unmap()) and 
install a hwpoison entry instead.

But yes, right now this should never ever happen: memory_failure() does 
some ZONE_DEVICE specific things, but likely doesn't call try_to_unmap() 
on these folios.

> 
> Also to clarify, I meant this as a general comment for all subsequent
> patches that have the same paragraph.

Yeah, I'll rephrase that to "We'll never hit that case for special 
device-private pages."

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 09/12] mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (7 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one() David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-29 11:54 ` [PATCH v1 10/12] mm/rmap: handle device-exclusive entries correctly in folio_referenced_one() David Hildenbrand
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

try_to_migrate_one() is not prepared for that, so teach it about these
non-present nonswap PTEs. We already handle device-private entries by
specializing on the folio, so we can reshuffle that code to make it
work on the non-present nonswap PTEs instead.

Get rid of most folio_is_device_private() handling, except when handling
HWPoison. It's unclear what the right thing to do here is.

Note that we could currently only run into this case with
device-exclusive entries on THPs; but as we have a refcount vs. mapcount
inbalance, folio splitting etc. will just bail out early and not even
try migrating. For order-0 folios, we still adjust the mapcount on
conversion to device-exclusive, making the rmap walk abort early
(folio_mapcount() == 0 and breaking swapout). We'll fix
that next, now that try_to_migrate_one() can handle it.

Further note that try_to_migrate() calls MMU notifiers and holds the
folio lock, so any device-exclusive users should be properly prepared
for this device-exclusive PTE to "vanish".

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 125 ++++++++++++++++++++++--------------------------------
 1 file changed, 51 insertions(+), 74 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 12900f367a2a..903a78e60781 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2040,9 +2040,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	bool anon_exclusive, writable, ret = true;
 	pte_t pteval;
 	struct page *subpage;
-	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
@@ -2109,24 +2109,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
 
-		pfn = pte_pfn(ptep_get(pvmw.pte));
-
-		if (folio_is_zone_device(folio)) {
-			/*
-			 * Our PTE is a non-present device exclusive entry and
-			 * calculating the subpage as for the common case would
-			 * result in an invalid pointer.
-			 *
-			 * Since only PAGE_SIZE pages can currently be
-			 * migrated, just set it to page. This will need to be
-			 * changed when hugepage migrations to device private
-			 * memory are supported.
-			 */
-			VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
-			subpage = &folio->page;
+		/*
+		 * We can end up here with selected non-swap entries that
+		 * actually map pages similar to PROT_NONE; see
+		 * page_vma_mapped_walk()->check_pte().
+		 */
+		pteval = ptep_get(pvmw.pte);
+		if (likely(pte_present(pteval))) {
+			pfn = pte_pfn(pteval);
 		} else {
-			subpage = folio_page(folio, pfn - folio_pfn(folio));
+			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
+			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 		}
+
+		subpage = folio_page(folio, pfn - folio_pfn(folio));
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
@@ -2182,7 +2178,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			}
 			/* Nuke the hugetlb page table entry */
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-		} else {
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+			writable = pte_write(pteval);
+		} else if (likely(pte_present(pteval))) {
 			flush_cache_page(vma, address, pfn);
 			/* Nuke the page table entry. */
 			if (should_defer_flush(mm, flags)) {
@@ -2200,54 +2199,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+			writable = pte_write(pteval);
+		} else {
+			pte_clear(mm, address, pvmw.pte);
+			writable = is_writable_device_private_entry(pte_to_swp_entry(pteval));
 		}
 
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
+		VM_WARN_ON_FOLIO(writable && folio_test_anon(folio) &&
+				!anon_exclusive, folio);
 
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
-		if (folio_is_device_private(folio)) {
-			unsigned long pfn = folio_pfn(folio);
-			swp_entry_t entry;
-			pte_t swp_pte;
-
-			if (anon_exclusive)
-				WARN_ON_ONCE(folio_try_share_anon_rmap_pte(folio,
-									   subpage));
-
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			entry = pte_to_swp_entry(pteval);
-			if (is_writable_device_private_entry(entry))
-				entry = make_writable_migration_entry(pfn);
-			else if (anon_exclusive)
-				entry = make_readable_exclusive_migration_entry(pfn);
-			else
-				entry = make_readable_migration_entry(pfn);
-			swp_pte = swp_entry_to_pte(entry);
-
-			/*
-			 * pteval maps a zone device page and is therefore
-			 * a swap pte.
-			 */
-			if (pte_swp_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_swp_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
-			trace_set_migration_pte(pvmw.address, pte_val(swp_pte),
-						folio_order(folio));
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 */
-		} else if (PageHWPoison(subpage)) {
+		if (PageHWPoison(subpage) && !folio_is_device_private(folio)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
 				hugetlb_count_sub(folio_nr_pages(folio), mm);
@@ -2257,8 +2223,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				dec_mm_counter(mm, mm_counter(folio));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
-
-		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
+			   !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
 			 * interest anymore. Simply discard the pte, vmscan
@@ -2274,6 +2240,11 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			swp_entry_t entry;
 			pte_t swp_pte;
 
+			/*
+			 * arch_unmap_one() is expected to be a NOP on
+			 * architectures where we could have non-swp entries
+			 * here.
+			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
 				if (folio_test_hugetlb(folio))
 					set_huge_pte_at(mm, address, pvmw.pte,
@@ -2284,8 +2255,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
-			VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
-				       !anon_exclusive, subpage);
 
 			/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
 			if (folio_test_hugetlb(folio)) {
@@ -2310,7 +2279,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			if (pte_write(pteval))
+			if (writable)
 				entry = make_writable_migration_entry(
 							page_to_pfn(subpage));
 			else if (anon_exclusive)
@@ -2319,15 +2288,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-			if (pte_young(pteval))
-				entry = make_migration_entry_young(entry);
-			if (pte_dirty(pteval))
-				entry = make_migration_entry_dirty(entry);
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			if (likely(pte_present(pteval))) {
+				if (pte_young(pteval))
+					entry = make_migration_entry_young(entry);
+				if (pte_dirty(pteval))
+					entry = make_migration_entry_dirty(entry);
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			} else {
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_swp_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			}
 			if (folio_test_hugetlb(folio))
 				set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
 						hsz);
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 10/12] mm/rmap: handle device-exclusive entries correctly in folio_referenced_one()
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (8 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 09/12] mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one() David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-29 11:54 ` [PATCH v1 11/12] mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one() David Hildenbrand
  2025-01-29 11:54 ` [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries David Hildenbrand
  11 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

folio_referenced_one() is not prepared for that, so teach it about these
non-present nonswap PTEs.

We'll likely never hit that path with device-private entries, but we
could with device-exclusive ones.

It's not really clear what to do: the device could be accessing this
PTE, but we don't have that information in the PTE. Likely MMU notifiers
should be taking care of that, and we can just assume "not referenced by
the CPU".

Note that we could currently only run into this case with
device-exclusive entries on THPs. For order-0 folios, we still adjust
the mapcount on conversion to device-exclusive, making the rmap walk
abort early (folio_mapcount() == 0). We'll fix that next, now that
folio_referenced_one() can handle it.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 903a78e60781..77b063e9aec4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -899,8 +899,14 @@ static bool folio_referenced_one(struct folio *folio,
 			if (lru_gen_look_around(&pvmw))
 				referenced++;
 		} else if (pvmw.pte) {
-			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte))
+			/*
+			 * We can end up here with selected non-swap entries
+			 * that actually map pages similar to PROT_NONE; see
+			 * page_vma_mapped_walk()->check_pte(). From a CPU
+			 * perspective, these PTEs are old.
+			 */
+			if (pte_present(ptep_get(pvmw.pte)) &&
+			    ptep_clear_flush_young_notify(vma, address, pvmw.pte))
 				referenced++;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			if (pmdp_clear_flush_young_notify(vma, address,
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 11/12] mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one()
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (9 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 10/12] mm/rmap: handle device-exclusive entries correctly in folio_referenced_one() David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-29 11:54 ` [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries David Hildenbrand
  11 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access")
we can return with a device-exclusive entry from page_vma_mapped_walk().

page_vma_mkclean_one() is not prepared for that, so teach it about these
non-present nonswap PTEs.

We'll likely never hit that path with device-private entries, but we
could with device-exclusive ones.

It's not really clear what to do: the device could be accessing this
PTE, but we don't have that information in the PTE. Likely MMU notifiers
should be taking care of that, and we can just assume "not writable and
not dirty from CPU perspective".

Note that we could currently only run into this case with
device-exclusive entries on THPs. We still adjust the mapcount on
conversion to device-exclusive, making the rmap walk
abort early (folio_mapcount() == 0) for order-0 folios. We'll fix that
next, now that page_vma_mkclean_one() can handle it.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index 77b063e9aec4..9e2002d97d6f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1050,6 +1050,14 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 			pte_t *pte = pvmw->pte;
 			pte_t entry = ptep_get(pte);

+			/*
+			 * We can end up here with selected non-swap entries
+			 * that actually map pages similar to PROT_NONE; see
+			 * page_vma_mapped_walk()->check_pte(). From a CPU
+			 * perspective, these PTEs are clean and not writable.
+			 */
+			if (!pte_present(entry))
+				continue;
 			if (!pte_dirty(entry) && !pte_write(entry))
 				continue;

-- 
2.48.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries
  2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
                   ` (10 preceding siblings ...)
  2025-01-29 11:54 ` [PATCH v1 11/12] mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one() David Hildenbrand
@ 2025-01-29 11:54 ` David Hildenbrand
  2025-01-30 10:37   ` Simona Vetter
  11 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-29 11:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-doc, dri-devel, linux-mm, nouveau, David Hildenbrand,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

Now that conversion to device-exclusive does no longer perform an
rmap walk and the main page_vma_mapped_walk() users were taught to
properly handle nonswap entries, let's treat device-exclusive entries just
as if they would be present, similar to how we handle device-private
entries already.

This fixes swapout/migration of folios with device-exclusive entries.

Likely there are still some page_vma_mapped_walk() callers that are not
fully prepared for these entries, and where we simply want to refuse
!pte_present() entries. They have to be fixed independently; the ones in
mm/rmap.c are prepared.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 17 +----------------
 mm/rmap.c   |  7 -------
 2 files changed, 1 insertion(+), 23 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index db38d6ae4e74..cd689cd8a7c8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -743,20 +743,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 
 	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
 					   PageAnonExclusive(page)), folio);
-
-	/*
-	 * No need to take a page reference as one was already
-	 * created when the swap entry was made.
-	 */
-	if (folio_test_anon(folio))
-		folio_add_anon_rmap_pte(folio, page, vma, address, RMAP_NONE);
-	else
-		/*
-		 * Currently device exclusive access only supports anonymous
-		 * memory so the entry shouldn't point to a filebacked page.
-		 */
-		WARN_ON_ONCE(1);
-
 	set_pte_at(vma->vm_mm, address, ptep, pte);
 
 	/*
@@ -1628,8 +1614,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		 */
 		WARN_ON_ONCE(!vma_is_anonymous(vma));
 		rss[mm_counter(folio)]--;
-		if (is_device_private_entry(entry))
-			folio_remove_rmap_pte(folio, page, vma);
+		folio_remove_rmap_pte(folio, page, vma);
 		folio_put(folio);
 	} else if (!non_swap_entry(entry)) {
 		/* Genuine swap entries, hence a private anon pages */
diff --git a/mm/rmap.c b/mm/rmap.c
index 9e2002d97d6f..4acc9f6d743a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2495,13 +2495,6 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	/* The pte is writable, uffd-wp does not apply. */
 	set_pte_at(mm, addr, fw.ptep, swp_pte);
 
-	/*
-	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
-	 * does not count as a mapping (mapcount), which is wrong and must be
-	 * fixed, otherwise RMAP walks don't behave as expected.
-	 */
-	folio_remove_rmap_pte(folio, page, vma);
-
 	folio_walk_end(&fw, vma);
 	*foliop = folio;
 	return page;
-- 
2.48.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries
  2025-01-29 11:54 ` [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries David Hildenbrand
@ 2025-01-30 10:37   ` Simona Vetter
  2025-01-30 11:42     ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 10:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Wed, Jan 29, 2025 at 12:54:10PM +0100, David Hildenbrand wrote:
> Now that conversion to device-exclusive does no longer perform an
> rmap walk and the main page_vma_mapped_walk() users were taught to
> properly handle nonswap entries, let's treat device-exclusive entries just
> as if they would be present, similar to how we handle device-private
> entries already.

So the reason for handling device-private entries in rmap is so that
drivers can rely on try_to_migrate and related code to invalidate all the
various ptes even for device private memory. Otherwise no one should hit
this path, at least if my understanding is correct.

So I'm very much worried about opening a can of worms here because I think
this adds a genuine new case to all the various callers.

> This fixes swapout/migration of folios with device-exclusive entries.
> 
> Likely there are still some page_vma_mapped_walk() callers that are not
> fully prepared for these entries, and where we simply want to refuse
> !pte_present() entries. They have to be fixed independently; the ones in
> mm/rmap.c are prepared.

The other worry is that maybe breaking migration is a feature, at least in
parts. If thp constantly reassembles a pmd entry because hey all the
memory is contig and userspace allocated a chunk of memory to place
atomics that alternate between cpu and gpu nicely separated by 4k pages,
then we'll thrash around invalidating ptes to no end. So might be more
fallout here.
-Sima

> 
> Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 17 +----------------
>  mm/rmap.c   |  7 -------
>  2 files changed, 1 insertion(+), 23 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index db38d6ae4e74..cd689cd8a7c8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -743,20 +743,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
>  
>  	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
>  					   PageAnonExclusive(page)), folio);
> -
> -	/*
> -	 * No need to take a page reference as one was already
> -	 * created when the swap entry was made.
> -	 */
> -	if (folio_test_anon(folio))
> -		folio_add_anon_rmap_pte(folio, page, vma, address, RMAP_NONE);
> -	else
> -		/*
> -		 * Currently device exclusive access only supports anonymous
> -		 * memory so the entry shouldn't point to a filebacked page.
> -		 */
> -		WARN_ON_ONCE(1);
> -
>  	set_pte_at(vma->vm_mm, address, ptep, pte);
>  
>  	/*
> @@ -1628,8 +1614,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
>  		 */
>  		WARN_ON_ONCE(!vma_is_anonymous(vma));
>  		rss[mm_counter(folio)]--;
> -		if (is_device_private_entry(entry))
> -			folio_remove_rmap_pte(folio, page, vma);
> +		folio_remove_rmap_pte(folio, page, vma);
>  		folio_put(folio);
>  	} else if (!non_swap_entry(entry)) {
>  		/* Genuine swap entries, hence a private anon pages */
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9e2002d97d6f..4acc9f6d743a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2495,13 +2495,6 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  	/* The pte is writable, uffd-wp does not apply. */
>  	set_pte_at(mm, addr, fw.ptep, swp_pte);
>  
> -	/*
> -	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
> -	 * does not count as a mapping (mapcount), which is wrong and must be
> -	 * fixed, otherwise RMAP walks don't behave as expected.
> -	 */
> -	folio_remove_rmap_pte(folio, page, vma);
> -
>  	folio_walk_end(&fw, vma);
>  	*foliop = folio;
>  	return page;
> -- 
> 2.48.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries
  2025-01-30 10:37   ` Simona Vetter
@ 2025-01-30 11:42     ` David Hildenbrand
  2025-01-30 13:19       ` Simona Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30 11:42 UTC (permalink / raw)
  To: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On 30.01.25 11:37, Simona Vetter wrote:
> On Wed, Jan 29, 2025 at 12:54:10PM +0100, David Hildenbrand wrote:
>> Now that conversion to device-exclusive does no longer perform an
>> rmap walk and the main page_vma_mapped_walk() users were taught to
>> properly handle nonswap entries, let's treat device-exclusive entries just
>> as if they would be present, similar to how we handle device-private
>> entries already.
> 
> So the reason for handling device-private entries in rmap is so that
> drivers can rely on try_to_migrate and related code to invalidate all the
> various ptes even for device private memory. Otherwise no one should hit
> this path, at least if my understanding is correct.

Right, device-private probably only happen to be seen on the migration 
path so far.

> 
> So I'm very much worried about opening a can of worms here because I think
> this adds a genuine new case to all the various callers.

To be clear: it can all already happen.

Assume you have a THP (or any mTHP today). You can easily trigger the 
scenario that folio_mapcount() != 0 with active device-exclusive 
entries, and you start doing rmap walks and stumble over these 
device-exclusive entries and *not* handle them properly. Note that more 
and more systems are configured to just give you THP unless you 
explicitly opted-out using MADV_NOHUGEPAGE early.

Note that b756a3b5e7ea added that hunk that still walks these 
device-exclusive entries in rmap code, but didn't actually update the 
rmap walkers:

@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)

                 /* Handle un-addressable ZONE_DEVICE memory */
                 entry = pte_to_swp_entry(*pvmw->pte);
-               if (!is_device_private_entry(entry))
+               if (!is_device_private_entry(entry) &&
+                   !is_device_exclusive_entry(entry))
                         return false;

                 pfn = swp_offset(entry);

That was the right thing to do, because they resemble PROT_NONE entries 
and not migration entries or anything else that doesn't hold a folio 
reference).

Fortunately, it's only the page_vma_mapped_walk() callers that need care.

mm/rmap.c is handled with this series.

mm/page_vma_mapped.c should work already.

mm/migrate.c: does not apply

mm/page_idle.c: likely should just skip !pte_present().

mm/ksm.c might be fine, but likely we should just reject !pte_present().

kernel/events/uprobes.c likely should reject !pte_present().

mm/damon/paddr.c likely should reject !pte_present().

I briefly though about a flag to indicate if a page_vma_mapped_walk() 
supports these non-present entries, but likely just fixing them up is 
easier+cleaner.

Now that I looked at all, I might just write patches for them.

> 
>> This fixes swapout/migration of folios with device-exclusive entries.
>>
>> Likely there are still some page_vma_mapped_walk() callers that are not
>> fully prepared for these entries, and where we simply want to refuse
>> !pte_present() entries. They have to be fixed independently; the ones in
>> mm/rmap.c are prepared.
> 
> The other worry is that maybe breaking migration is a feature, at least in
> parts. 

Maybe breaking swap and migration is a feature in some reality, in this 
reality it's a BUG :)

If thp constantly reassembles a pmd entry because hey all the
> memory is contig and userspace allocated a chunk of memory to place
> atomics that alternate between cpu and gpu nicely separated by 4k pages,
> then we'll thrash around invalidating ptes to no end. So might be more
> fallout here.

khugepaged will back off once it sees an exclusive entry, so collapsing 
could only happen once everything is non-exclusive. See 
__collapse_huge_page_isolate() as an example.

It's really only page_vma_mapped_walk() callers that are affected by 
this change, not any other page table walkers.

It's unfortunate that we now have to fix it all up, that original series 
should have never been merged that way.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries
  2025-01-30 11:42     ` David Hildenbrand
@ 2025-01-30 13:19       ` Simona Vetter
  2025-01-30 15:43         ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Simona Vetter @ 2025-01-30 13:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Thu, Jan 30, 2025 at 12:42:26PM +0100, David Hildenbrand wrote:
> On 30.01.25 11:37, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 12:54:10PM +0100, David Hildenbrand wrote:
> > > Now that conversion to device-exclusive does no longer perform an
> > > rmap walk and the main page_vma_mapped_walk() users were taught to
> > > properly handle nonswap entries, let's treat device-exclusive entries just
> > > as if they would be present, similar to how we handle device-private
> > > entries already.
> > 
> > So the reason for handling device-private entries in rmap is so that
> > drivers can rely on try_to_migrate and related code to invalidate all the
> > various ptes even for device private memory. Otherwise no one should hit
> > this path, at least if my understanding is correct.
> 
> Right, device-private probably only happen to be seen on the migration path
> so far.
> 
> > 
> > So I'm very much worried about opening a can of worms here because I think
> > this adds a genuine new case to all the various callers.
> 
> To be clear: it can all already happen.
> 
> Assume you have a THP (or any mTHP today). You can easily trigger the
> scenario that folio_mapcount() != 0 with active device-exclusive entries,
> and you start doing rmap walks and stumble over these device-exclusive
> entries and *not* handle them properly. Note that more and more systems are
> configured to just give you THP unless you explicitly opted-out using
> MADV_NOHUGEPAGE early.
> 
> Note that b756a3b5e7ea added that hunk that still walks these
> device-exclusive entries in rmap code, but didn't actually update the rmap
> walkers:
> 
> @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> 
>                 /* Handle un-addressable ZONE_DEVICE memory */
>                 entry = pte_to_swp_entry(*pvmw->pte);
> -               if (!is_device_private_entry(entry))
> +               if (!is_device_private_entry(entry) &&
> +                   !is_device_exclusive_entry(entry))
>                         return false;
> 
>                 pfn = swp_offset(entry);
> 
> That was the right thing to do, because they resemble PROT_NONE entries and
> not migration entries or anything else that doesn't hold a folio reference).

Yeah I got that part. What I meant is that doubling down on this needs a
full audit and cannot rely on "we already have device private entries
going through these paths for much longer", which was the impression I
got. I guess it worked, thanks for doing that below :-)

And at least from my very rough understanding of mm, at least around all
this gpu stuff, tracking device exclusive mappings like real cpu mappings
makes sense, they do indeed act like PROT_NONE with some magic to restore
access on fault.

I do wonder a bit though what else is all not properly tracked because
they should be like prot_none except arent. I guess we'll find those as we
hit them :-/

> Fortunately, it's only the page_vma_mapped_walk() callers that need care.
> 
> mm/rmap.c is handled with this series.
> 
> mm/page_vma_mapped.c should work already.
> 
> mm/migrate.c: does not apply
> 
> mm/page_idle.c: likely should just skip !pte_present().
> 
> mm/ksm.c might be fine, but likely we should just reject !pte_present().
> 
> kernel/events/uprobes.c likely should reject !pte_present().
> 
> mm/damon/paddr.c likely should reject !pte_present().
> 
> 
> I briefly though about a flag to indicate if a page_vma_mapped_walk()
> supports these non-present entries, but likely just fixing them up is
> easier+cleaner.
> 
> Now that I looked at all, I might just write patches for them.
> 
> > 
> > > This fixes swapout/migration of folios with device-exclusive entries.
> > > 
> > > Likely there are still some page_vma_mapped_walk() callers that are not
> > > fully prepared for these entries, and where we simply want to refuse
> > > !pte_present() entries. They have to be fixed independently; the ones in
> > > mm/rmap.c are prepared.
> > 
> > The other worry is that maybe breaking migration is a feature, at least in
> > parts.
> 
> Maybe breaking swap and migration is a feature in some reality, in this
> reality it's a BUG :)

Oh yeah I agree on those :-)

> If thp constantly reassembles a pmd entry because hey all the
> > memory is contig and userspace allocated a chunk of memory to place
> > atomics that alternate between cpu and gpu nicely separated by 4k pages,
> > then we'll thrash around invalidating ptes to no end. So might be more
> > fallout here.
> 
> khugepaged will back off once it sees an exclusive entry, so collapsing
> could only happen once everything is non-exclusive. See
> __collapse_huge_page_isolate() as an example.

Ah ok. I think might be good to add that to the commit message, so that
people who don't understand mm deeply (like me) aren't worried when they
stumble over this change in the future again when digging around.

> It's really only page_vma_mapped_walk() callers that are affected by this
> change, not any other page table walkers.

I guess my mm understanding is just not up to that, but I couldn't figure
out why just looking at page_vma_mapped_walk() only is good enough?

> It's unfortunate that we now have to fix it all up, that original series
> should have never been merged that way.

Yeah looks like a rather big miss.
-Sima

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries
  2025-01-30 13:19       ` Simona Vetter
@ 2025-01-30 15:43         ` David Hildenbrand
  2025-01-31 17:13           ` Simona Vetter
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2025-01-30 15:43 UTC (permalink / raw)
  To: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

>> Assume you have a THP (or any mTHP today). You can easily trigger the
>> scenario that folio_mapcount() != 0 with active device-exclusive entries,
>> and you start doing rmap walks and stumble over these device-exclusive
>> entries and *not* handle them properly. Note that more and more systems are
>> configured to just give you THP unless you explicitly opted-out using
>> MADV_NOHUGEPAGE early.
>>
>> Note that b756a3b5e7ea added that hunk that still walks these
>> device-exclusive entries in rmap code, but didn't actually update the rmap
>> walkers:
>>
>> @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>>
>>                  /* Handle un-addressable ZONE_DEVICE memory */
>>                  entry = pte_to_swp_entry(*pvmw->pte);
>> -               if (!is_device_private_entry(entry))
>> +               if (!is_device_private_entry(entry) &&
>> +                   !is_device_exclusive_entry(entry))
>>                          return false;
>>
>>                  pfn = swp_offset(entry);
>>
>> That was the right thing to do, because they resemble PROT_NONE entries and
>> not migration entries or anything else that doesn't hold a folio reference).
> 
> Yeah I got that part. What I meant is that doubling down on this needs a
> full audit and cannot rely on "we already have device private entries
> going through these paths for much longer", which was the impression I
> got. I guess it worked, thanks for doing that below :-)

I know I know, I shouldn't have touched it ... :)

So yeah, I'll spend some extra work on sorting out the other cases.

> 
> And at least from my very rough understanding of mm, at least around all
> this gpu stuff, tracking device exclusive mappings like real cpu mappings
> makes sense, they do indeed act like PROT_NONE with some magic to restore
> access on fault.
> 
> I do wonder a bit though what else is all not properly tracked because
> they should be like prot_none except arent. I guess we'll find those as we
> hit them :-/

Likely a lot of stuff. But more in a "entry gets ignored -- 
functionality not implemented, move along" way, because all page table 
walkers have to care about !pte_present() already; it's just RMAP code 
that so far never required it.

[...]

> 
>> If thp constantly reassembles a pmd entry because hey all the
>>> memory is contig and userspace allocated a chunk of memory to place
>>> atomics that alternate between cpu and gpu nicely separated by 4k pages,
>>> then we'll thrash around invalidating ptes to no end. So might be more
>>> fallout here.
>>
>> khugepaged will back off once it sees an exclusive entry, so collapsing
>> could only happen once everything is non-exclusive. See
>> __collapse_huge_page_isolate() as an example.
> 
> Ah ok. I think might be good to add that to the commit message, so that
> people who don't understand mm deeply (like me) aren't worried when they
> stumble over this change in the future again when digging around.

Will do, thanks for raising that concern!

> 
>> It's really only page_vma_mapped_walk() callers that are affected by this
>> change, not any other page table walkers.
> 
> I guess my mm understanding is just not up to that, but I couldn't figure
> out why just looking at page_vma_mapped_walk() only is good enough?

See above: these never had to handle !page_present() before -- in 
contrast to the other page table walkers.

So nothing bad happens when these page table walkers traverse these 
PTEs, it's just that the functionality will usually be implemented.

Take MADV_PAGEOUT as an example: madvise_cold_or_pageout_pte_range() 
will simply skip "!pte_present()", because it wouldn't know what to do 
in that case.

Of course, there could be page table walkers that check all cases and 
bail out if they find something unexpected: do_swap_page() cannot make 
forward progress and will inject a VM_FAULT_SIGBUS if it doesn't 
recognize the entry. But these are rather rare.

We could enlighten selected page table walkers to handle 
device-exclusive where it really makes sense later.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries
  2025-01-30 15:43         ` David Hildenbrand
@ 2025-01-31 17:13           ` Simona Vetter
  0 siblings, 0 replies; 56+ messages in thread
From: Simona Vetter @ 2025-01-31 17:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, dri-devel, linux-mm, nouveau,
	Andrew Morton, Jérôme Glisse, Jonathan Corbet,
	Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
	Alistair Popple, Jason Gunthorpe

On Thu, Jan 30, 2025 at 04:43:08PM +0100, David Hildenbrand wrote:
> > > Assume you have a THP (or any mTHP today). You can easily trigger the
> > > scenario that folio_mapcount() != 0 with active device-exclusive entries,
> > > and you start doing rmap walks and stumble over these device-exclusive
> > > entries and *not* handle them properly. Note that more and more systems are
> > > configured to just give you THP unless you explicitly opted-out using
> > > MADV_NOHUGEPAGE early.
> > > 
> > > Note that b756a3b5e7ea added that hunk that still walks these
> > > device-exclusive entries in rmap code, but didn't actually update the rmap
> > > walkers:
> > > 
> > > @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> > > 
> > >                  /* Handle un-addressable ZONE_DEVICE memory */
> > >                  entry = pte_to_swp_entry(*pvmw->pte);
> > > -               if (!is_device_private_entry(entry))
> > > +               if (!is_device_private_entry(entry) &&
> > > +                   !is_device_exclusive_entry(entry))
> > >                          return false;
> > > 
> > >                  pfn = swp_offset(entry);
> > > 
> > > That was the right thing to do, because they resemble PROT_NONE entries and
> > > not migration entries or anything else that doesn't hold a folio reference).
> > 
> > Yeah I got that part. What I meant is that doubling down on this needs a
> > full audit and cannot rely on "we already have device private entries
> > going through these paths for much longer", which was the impression I
> > got. I guess it worked, thanks for doing that below :-)
> 
> I know I know, I shouldn't have touched it ... :)
> 
> So yeah, I'll spend some extra work on sorting out the other cases.

Thanks :-)

> > And at least from my very rough understanding of mm, at least around all
> > this gpu stuff, tracking device exclusive mappings like real cpu mappings
> > makes sense, they do indeed act like PROT_NONE with some magic to restore
> > access on fault.
> > 
> > I do wonder a bit though what else is all not properly tracked because
> > they should be like prot_none except arent. I guess we'll find those as we
> > hit them :-/
> 
> Likely a lot of stuff. But more in a "entry gets ignored -- functionality
> not implemented, move along" way, because all page table walkers have to
> care about !pte_present() already; it's just RMAP code that so far never
> required it.

I think it'd be good to include a tersion summary of this in the commit
messages - I'd expect this is code other gpu folks will need to crawl
through in the future again, and I had no idea where I should even start
looking to figure this out.

> 
> [...]
> 
> > 
> > > If thp constantly reassembles a pmd entry because hey all the
> > > > memory is contig and userspace allocated a chunk of memory to place
> > > > atomics that alternate between cpu and gpu nicely separated by 4k pages,
> > > > then we'll thrash around invalidating ptes to no end. So might be more
> > > > fallout here.
> > > 
> > > khugepaged will back off once it sees an exclusive entry, so collapsing
> > > could only happen once everything is non-exclusive. See
> > > __collapse_huge_page_isolate() as an example.
> > 
> > Ah ok. I think might be good to add that to the commit message, so that
> > people who don't understand mm deeply (like me) aren't worried when they
> > stumble over this change in the future again when digging around.
> 
> Will do, thanks for raising that concern!
> 
> > 
> > > It's really only page_vma_mapped_walk() callers that are affected by this
> > > change, not any other page table walkers.
> > 
> > I guess my mm understanding is just not up to that, but I couldn't figure
> > out why just looking at page_vma_mapped_walk() only is good enough?
> 
> See above: these never had to handle !page_present() before -- in contrast
> to the other page table walkers.
>
> So nothing bad happens when these page table walkers traverse these PTEs,
> it's just that the functionality will usually be implemented.
> 
> Take MADV_PAGEOUT as an example: madvise_cold_or_pageout_pte_range() will
> simply skip "!pte_present()", because it wouldn't know what to do in that
> case.
> 
> Of course, there could be page table walkers that check all cases and bail
> out if they find something unexpected: do_swap_page() cannot make forward
> progress and will inject a VM_FAULT_SIGBUS if it doesn't recognize the
> entry. But these are rather rare.

Yeah this all makes sense to me now. Thanks a lot for your explanation,
I'll try to pay it back by trying to review the next version of the series
a bit.

> We could enlighten selected page table walkers to handle device-exclusive
> where it really makes sense later.

I think rmap for eviction/migration is really the big one that obviously
should be fixed. All the other cases I could think of I think just end up
in handle_mm_fault() to sort out the situation and then retry.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2025-02-04 10:59 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-29 11:53 [PATCH v1 00/12] mm: fixes for device-exclusive entries (hmm) David Hildenbrand
2025-01-29 11:53 ` [PATCH v1 01/12] mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs David Hildenbrand
2025-01-29 21:42   ` John Hubbard
2025-01-30  8:56     ` David Hildenbrand
2025-01-30  5:46   ` Alistair Popple
2025-01-29 11:54 ` [PATCH v1 02/12] mm/rmap: reject hugetlb folios in folio_make_device_exclusive() David Hildenbrand
2025-01-30  5:47   ` Alistair Popple
2025-01-29 11:54 ` [PATCH v1 03/12] mm/rmap: convert make_device_exclusive_range() to make_device_exclusive() David Hildenbrand
2025-01-30  5:57   ` Alistair Popple
2025-01-30  9:04     ` David Hildenbrand
2025-01-31  0:28     ` Alistair Popple
2025-01-31  9:29       ` David Hildenbrand
2025-01-30 13:46   ` Simona Vetter
2025-01-30 15:56     ` David Hildenbrand
2025-01-29 11:54 ` [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk David Hildenbrand
2025-01-30  6:11   ` Alistair Popple
2025-01-30  9:01     ` David Hildenbrand
2025-01-30  9:12       ` David Hildenbrand
2025-01-30  9:24       ` David Hildenbrand
2025-01-30 22:31         ` Alistair Popple
2025-02-04 10:56           ` David Hildenbrand
2025-01-30  9:40     ` Simona Vetter
2025-01-30  9:47       ` David Hildenbrand
2025-01-30 13:00         ` Simona Vetter
2025-01-30 15:59           ` David Hildenbrand
2025-01-31 17:00             ` Simona Vetter
2025-01-29 11:54 ` [PATCH v1 05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable() David Hildenbrand
2025-01-30  9:51   ` Simona Vetter
2025-01-30  9:58     ` David Hildenbrand
2025-01-30 13:03       ` Simona Vetter
2025-01-30 23:06         ` Alistair Popple
2025-01-31 10:55           ` David Hildenbrand
2025-01-31 17:05             ` Simona Vetter
2025-02-04 10:58               ` David Hildenbrand
2025-01-29 11:54 ` [PATCH v1 06/12] mm: use single SWP_DEVICE_EXCLUSIVE entry type David Hildenbrand
2025-01-30 13:43   ` Simona Vetter
2025-01-30 23:28   ` Alistair Popple
2025-01-29 11:54 ` [PATCH v1 07/12] mm/page_vma_mapped: device-private entries are not migration entries David Hildenbrand
2025-01-30 23:36   ` Alistair Popple
2025-01-31 11:06     ` David Hildenbrand
2025-01-29 11:54 ` [PATCH v1 08/12] mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one() David Hildenbrand
2025-01-30 10:10   ` Simona Vetter
2025-01-30 11:08     ` David Hildenbrand
2025-01-30 13:06       ` Simona Vetter
2025-01-30 14:08         ` Jason Gunthorpe
2025-01-30 16:10           ` Simona Vetter
2025-01-30 15:52         ` David Hildenbrand
2025-01-29 11:54 ` [PATCH v1 09/12] mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one() David Hildenbrand
2025-01-29 11:54 ` [PATCH v1 10/12] mm/rmap: handle device-exclusive entries correctly in folio_referenced_one() David Hildenbrand
2025-01-29 11:54 ` [PATCH v1 11/12] mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one() David Hildenbrand
2025-01-29 11:54 ` [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries David Hildenbrand
2025-01-30 10:37   ` Simona Vetter
2025-01-30 11:42     ` David Hildenbrand
2025-01-30 13:19       ` Simona Vetter
2025-01-30 15:43         ` David Hildenbrand
2025-01-31 17:13           ` Simona Vetter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox