Subject: Re: [PATCH 1/3] mm: unified hmm fault and migrate device pagewalk paths
From: Mika Penttilä <mpenttil@redhat.com>
To: Matthew Brost
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 David Hildenbrand, Jason Gunthorpe, Leon Romanovsky,
 Alistair Popple, Balbir Singh, Zi Yan
Date: Sun, 18 Jan 2026 19:54:43 +0200
Message-ID: <5b396f1e-b25c-494c-8286-4791eb542510@redhat.com>
References: <20260114091923.3950465-1-mpenttil@redhat.com>
 <20260114091923.3950465-2-mpenttil@redhat.com>
 <151ee03d-423b-4ac2-9dd4-11b1ceaadbaf@redhat.com>

On 1/18/26 06:02, Matthew Brost wrote:
> On Sat, Jan 17, 2026 at 09:07:36AM +0200, Mika Penttilä wrote:
>> On 1/16/26 22:05, Matthew Brost wrote:
>>
>>> On Fri, Jan 16, 2026 at 12:39:30PM +0200, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 1/16/26 04:06, Matthew Brost wrote:
>>>>
>>>>> On Wed, Jan 14, 2026 at 11:19:21AM +0200, mpenttil@redhat.com wrote:
>>>>>> From: Mika Penttilä <mpenttil@redhat.com>
>>>>>>
>>>>>> Currently, the way device page faulting and migration works
>>>>>> is not optimal if you want to do both fault handling and
>>>>>> migration at once.
>>>>>>
>>>>>> Being able to migrate not-present pages (or pages mapped with incorrect
>>>>>> permissions, e.g. COW) to the GPU requires doing either of the
>>>>>> following sequences:
>>>>>>
>>>>>> 1. hmm_range_fault() - fault in non-present pages with correct permissions, etc.
>>>>>> 2. migrate_vma_*() - migrate the pages
>>>>>>
>>>>>> Or:
>>>>>>
>>>>>> 1. migrate_vma_*() - migrate present pages
>>>>>> 2. If non-present pages are detected by migrate_vma_*():
>>>>>>    a) call hmm_range_fault() to fault the pages in
>>>>>>    b) call migrate_vma_*() again to migrate the now-present pages
>>>>>>
>>>>>> The problem with the first sequence is that you always have to do two
>>>>>> page walks, even though most of the time the pages are present or
>>>>>> zero-page mappings, so the common case takes a performance hit.
>>>>>>
>>>>>> The second sequence is better for the common case, but far worse if
>>>>>> pages aren't present, because now you have to walk the page tables
>>>>>> three times (once to find the page is not present, once so
>>>>>> hmm_range_fault() can find a non-present page to fault in, and once
>>>>>> again to set up the migration). It is also tricky to code correctly.
>>>>>>
>>>>>> We should be able to walk the page table once, faulting pages in as
>>>>>> required and replacing them with migration entries if requested.
>>>>>>
>>>>>> Add a new flag to the HMM API, HMM_PFN_REQ_MIGRATE, which tells
>>>>>> hmm_range_fault() to also prepare for migration during fault
>>>>>> handling. For the migrate_vma_setup() call paths, a flag,
>>>>>> MIGRATE_VMA_FAULT, is added to tell the migrate path to also do
>>>>>> fault handling.
>>>>>>
>>>>>> Cc: David Hildenbrand
>>>>>> Cc: Jason Gunthorpe
>>>>>> Cc: Leon Romanovsky
>>>>>> Cc: Alistair Popple
>>>>>> Cc: Balbir Singh
>>>>>> Cc: Zi Yan
>>>>>> Cc: Matthew Brost
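[ To make the intended usage concrete, a sketch of how a driver could
drive the unified path. Illustrative only: the drv_* context and
helpers, the fixed-size PFN array, and the retry policy are
hypothetical, and mmap locking / notifier sequence handling is elided. ]

/*
 * Sketch: fault and migration preparation in a single walk using the
 * proposed HMM_PFN_REQ_MIGRATE flag. Assumes (end - start) / PAGE_SIZE
 * does not exceed DRV_NPAGES.
 */
#include <linux/hmm.h>
#include <linux/migrate.h>

#define DRV_NPAGES 512

static int drv_fault_and_migrate(struct drv_ctx *ctx,
				 unsigned long start, unsigned long end)
{
	unsigned long pfns[DRV_NPAGES] = {};
	struct hmm_range range = {
		.notifier = &ctx->notifier,
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
		/* One walk: fault pages in and install migration entries. */
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_MIGRATE,
		.dev_private_owner = ctx->pgmap_owner,
	};
	int ret;

	ret = hmm_range_fault(&range);
	if (ret)
		return ret;	/* e.g. -EBUSY: caller retries the walk */

	/*
	 * Entries with HMM_PFN_MIGRATE set now carry migration entries;
	 * copy them to device memory and finalize as with migrate_vma.
	 */
	return drv_copy_and_finalize(ctx, pfns, DRV_NPAGES);
}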
>>>>> I'll try to test this when I can, but I'm horribly behind at the moment.
>>>>>
>>>>> You can use Intel's CI system to test SVM too. I can get you authorized
>>>>> to use this. The list to trigger is intel-xe@lists.freedesktop.org and
>>>>> patches must apply to drm-tip. I'll let you know when you are
>>>>> authorized.
>>>> Thanks, appreciated, will do that also!
>>>>
>>> Working on enabling this for you in CI.
>>>
>>> I did a quick test by running our complete test suite and got a kernel
>>> hang in this section:
>>>
>>> xe_exec_system_allocator.threads-shared-vm-many-stride-malloc-prefetch
>>>
>>> Stack trace:
>>>
>>> [  182.915763] INFO: task xe_exec_system_:5357 blocked for more than 30 seconds.
>>> [  182.922866]       Tainted: G     U  W          6.19.0-rc4-xe+ #2549
>>> [  182.929183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [  182.936912] task:xe_exec_system_ state:D stack:0 pid:5357 tgid:1862 ppid:1853 task_flags:0x400040 flags:0x00080000
>>> [  182.936916] Call Trace:
>>> [  182.936918]  <TASK>
>>> [  182.936919]  __schedule+0x4df/0xc20
>>> [  182.936924]  schedule+0x22/0xd0
>>> [  182.936925]  io_schedule+0x41/0x60
>>> [  182.936926]  migration_entry_wait_on_locked+0x21c/0x2a0
>>> [  182.936929]  ? __pfx_wake_page_function+0x10/0x10
>>> [  182.936931]  migration_entry_wait+0xad/0xf0
>>> [  182.936933]  hmm_vma_walk_pmd+0xd5f/0x19b0
>>> [  182.936935]  walk_pgd_range+0x51d/0xa60
>>> [  182.936938]  __walk_page_range+0x75/0x1e0
>>> [  182.936940]  walk_page_range_mm_unsafe+0x138/0x1f0
>>> [  182.936941]  hmm_range_fault+0x8f/0x160
>>> [  182.936945]  drm_gpusvm_get_pages+0x1ae/0x8a0 [drm_gpusvm_helper]
>>> [  182.936949]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
>>> [  182.936951]  xe_svm_range_get_pages+0x1b/0x50 [xe]
>>> [  182.936979]  xe_vm_bind_ioctl+0x15c3/0x17e0 [xe]
>>> [  182.937001]  ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
>>> [  182.937021]  ? drm_ioctl_kernel+0xa3/0x100
>>> [  182.937024]  drm_ioctl_kernel+0xa3/0x100
>>> [  182.937026]  drm_ioctl+0x213/0x440
>>> [  182.937028]  ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
>>> [  182.937061]  xe_drm_ioctl+0x5a/0xa0 [xe]
>>> [  182.937083]  __x64_sys_ioctl+0x7f/0xd0
>>> [  182.937085]  do_syscall_64+0x50/0x290
>>> [  182.937088]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> [  182.937091] RIP: 0033:0x7ff00f724ded
>>> [  182.937092] RSP: 002b:00007ff00b9fa640 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [  182.937094] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007ff00f724ded
>>> [  182.937095] RDX: 00007ff00b9fa6d0 RSI: 0000000040886445 RDI: 0000000000000003
>>> [  182.937096] RBP: 00007ff00b9fa690 R08: 0000000000000000 R09: 0000000000000000
>>> [  182.937097] R10: 0000000000000001 R11: 0000000000000246 R12: 00007ff00b9fa6d0
>>> [  182.937098] R13: 0000000040886445 R14: 0000000000000003 R15: 00007ff00f8a9000
>>> [  182.937099]  </TASK>
>>>
>>> This section is a racy test with parallel CPU and device access that is
>>> likely causing the migration process to abort and retry. From the stack
>>> trace, it looks like a migration PMD didn't get properly removed, and a
>>> subsequent call to hmm_range_fault hangs on a migration entry that was
>>> not removed during the migration abort.
>>>
>>> IIRC, some of the last bits in Balbir's large device pages series had a
>>> similar bug, which I sent to Andrew with fixup patches. I suspect you
>>> have a similar bug. If I can find the time, I'll see if I can track it
>>> down.
>> Thanks for your efforts Matthew!
>>
> Happy to help.
>
>> I remember those discussions, and this looks similar. If I recall
>> correctly, it was about failed splits and the continuation after that.
>> The code bails out of the pmd loop in this case, but maybe this part is
>> missing, and it is equivalent to a collect skip:
>>
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index 39a07d895043..c7a7bb923a37 100644
>> --- a/mm/hmm.c
>> +++ b/mm/hmm.c
>> @@ -988,8 +988,12 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>  		}
>>  
>>  		r = hmm_vma_handle_migrate_prepare(walk, pmdp, addr, hmm_pfns);
>> -		if (r)
>> +		if (r) {
>> +			/* Split has failed, skip to end. */
>> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++)
>> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
>>  			break;
>> +		}
>>  	}
>>  	pte_unmap(ptep - 1);
>>
>> With that it should be quite similar to the current flow (in effect).
>>
> I think the above code is conceptually correct and needed, but I don't
> think it is the actual problem.
>
> I traced the issue to the handle and prepare functions racing with other
> migrations. The handle function doesn't find a valid PTE/PMD and
> populates the HMM PFN with zero. The prepare function then finds a valid
> PTE/PMD and installs a migration entry. My code aborts the migration
> because we only migrate if all pages in the requested range are found.
> migrate_vma_pages/migrate_vma_finalize are called, but since some PFNs
> are zero, the migration entries installed during the prepare step are
> not removed.
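[ Spelling out the interleaving described above, schematically; this is
an illustration of the described race, not a captured trace: ]

  hmm walk (fault + migrate)         concurrent migration / CPU access
  --------------------------         ---------------------------------
  handle: PTE not present,
          hmm_pfn = 0
                                     PTE becomes present (page faulted
                                     in or migrated back)
  prepare: PTE present, installs
           a migration entry, sets
           HMM_PFN_MIGRATE on a
           PFN that stayed zero

  caller: sees a zero PFN in the range and aborts the migration;
  migrate_vma_pages()/migrate_vma_finalize() skip zero PFNs, so the
  installed migration entry is never removed, and the next
  hmm_range_fault() blocks forever in migration_entry_wait().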
>
> This is actually very easy to reproduce on my end if I disable compound
> page collection in my migrations and then run a test that migrates 2M
> pages with racing CPU/GPU access. I suspect we could write a selftest to
> replicate this.
>
> I have a hacky patch that fixes this — among other things I spotted
> while looking at this — but I don't think that's the right approach. I
> believe the right approach is to unify the handle and prepare functions
> into a single one. If you look at what these two functions do, they are
> actually quite similar; e.g., both inspect PTE/PMD state to make
> decisions. Of course, if you want a migration you need the PTE/PMD
> locks, and those would need to be dropped if you want to do something
> like fault in a page. This still seems doable. I'll let you and the HMM
> maintainers duke this out regarding whether this is an acceptable
> approach :).
>
> FWIW, here are the changes I had to make to get my code stable. Tested
> with compound page collection both on and off.

Thanks a lot, you spotted the root cause!

Seems the issue is that, if migrating, you have to set both HMM_PFN_VALID
and HMM_PFN_MIGRATE while holding the pmd/pte lock (without releasing it
in between).

My original idea with the separate handle and prepare steps was that
handle does the faulting and pfn population step, i.e. the
HMM_PFN_VALID-establishing step. Prepare then decides about migration,
by possibly adding HMM_PFN_MIGRATE. Currently this isn't race free.

One way to fix it is like you do, but I admit that brings some redundancy
between the paths. I think having those two functions separate is good
for manageability, instead of scattering migrate-related fragments in
many places. The migration part brings quite a lot of complexity with
the splitting and such, and we would likely end up encapsulating some of
it in its own functions anyway.

I am experimenting with modified locking, so that while migrating we keep
the pte/pmd locked for the walk (and unlock when handling faults), so
that the handle and prepare steps happen logically under the same locking
region. Let's see how that looks. I think that should solve the observed
issues as well. Will send v2 with this and other fixes so far.
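Roughly the shape I have in mind, as an untested sketch; hmm_do_fault()
and hmm_install_migration_pte() below are placeholder names, not
existing functions:

/*
 * Untested sketch of the v2 direction: take the PTE lock once and set
 * HMM_PFN_VALID and HMM_PFN_MIGRATE in the same critical section,
 * dropping the lock only when we actually need to fault.
 */
static int hmm_handle_and_prepare_pte(struct mm_struct *mm, pmd_t *pmdp,
				      unsigned long addr, bool migrate,
				      unsigned long *hmm_pfn)
{
	spinlock_t *ptl;
	pte_t *ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
	pte_t pte;

	if (!ptep)
		return -EBUSY;
	pte = ptep_get(ptep);

	if (!pte_present(pte)) {
		pte_unmap_unlock(ptep, ptl);
		/* Fault with the lock dropped, then retry this entry. */
		return hmm_do_fault(mm, addr);
	}

	*hmm_pfn = pte_pfn(pte) | HMM_PFN_VALID;
	if (migrate) {
		/*
		 * Still under the same ptl: no window for a concurrent
		 * migration between the handle and prepare steps.
		 */
		hmm_install_migration_pte(mm, addr, ptep, pte);
		*hmm_pfn |= HMM_PFN_MIGRATE;
	}
	pte_unmap_unlock(ptep, ptl);
	return 0;
}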
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 39a07d895043..5e24cd82b393 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -509,8 +509,10 @@ static int hmm_vma_handle_migrate_prepare_pmd(const struct mm_walk *walk,
>  	}
>  
>  	if (pmd_trans_huge(*pmdp)) {
> -		if (!(minfo & MIGRATE_VMA_SELECT_SYSTEM))
> +		if (!(minfo & MIGRATE_VMA_SELECT_SYSTEM)) {
> +			hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>  			goto out;
> +		}
>  
>  		folio = pmd_folio(*pmdp);
>  		if (is_huge_zero_folio(folio)) {
> @@ -523,16 +525,22 @@ static int hmm_vma_handle_migrate_prepare_pmd(const struct mm_walk *walk,
>  
>  		folio = softleaf_to_folio(entry);
>  
> -		if (!softleaf_is_device_private(entry))
> -			goto out;
> -
> -		if (!(minfo & MIGRATE_VMA_SELECT_DEVICE_PRIVATE))
> +		if (!softleaf_is_device_private(entry)) {
> +			hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>  			goto out;
> -		if (folio->pgmap->owner != migrate->pgmap_owner)
> +		}
> +		if (!(minfo & MIGRATE_VMA_SELECT_DEVICE_PRIVATE)) {
> +			hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>  			goto out;
> +		}
>  
> +		if (folio->pgmap->owner != migrate->pgmap_owner) {
> +			hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
> +			goto out;
> +		}
>  	} else {
>  		spin_unlock(ptl);
> +		hmm_vma_walk->last = start;
>  		return -EBUSY;
>  	}
>  
> @@ -541,6 +549,7 @@ static int hmm_vma_handle_migrate_prepare_pmd(const struct mm_walk *walk,
>  	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
>  		spin_unlock(ptl);
>  		folio_put(folio);
> +		hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>  		return 0;
>  	}
>  
> @@ -555,8 +564,10 @@ static int hmm_vma_handle_migrate_prepare_pmd(const struct mm_walk *walk,
>  			.pmd = pmdp,
>  			.vma = walk->vma,
>  		};
> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
>  
> -		hmm_pfn[0] |= HMM_PFN_MIGRATE | HMM_PFN_COMPOUND;
> +		hmm_pfn[0] |= HMM_PFN_MIGRATE | HMM_PFN_COMPOUND |
> +			      HMM_PFN_VALID;
>  
>  		r = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
>  		if (r) {
> @@ -564,6 +575,8 @@ static int hmm_vma_handle_migrate_prepare_pmd(const struct mm_walk *walk,
>  			r = -ENOENT; // fallback
>  			goto unlock_out;
>  		}
> +		hmm_pfn[0] &= HMM_PFN_FLAGS;
> +		hmm_pfn[0] |= pfn;
>  		for (i = 1, start += PAGE_SIZE; start < end; start += PAGE_SIZE, i++)
>  			hmm_pfn[i] &= HMM_PFN_INOUT_FLAGS;
>  
> @@ -604,7 +617,7 @@ static int hmm_vma_handle_migrate_prepare(const struct mm_walk *walk,
>  	struct dev_pagemap *pgmap;
>  	bool anon_exclusive;
>  	struct folio *folio;
> -	unsigned long pfn;
> +	unsigned long pfn = 0;
>  	struct page *page;
>  	softleaf_t entry;
>  	pte_t pte, swp_pte;
> @@ -688,8 +701,8 @@ static int hmm_vma_handle_migrate_prepare(const struct mm_walk *walk,
>  		goto out;
>  	}
>  
> -	folio = page_folio(page);
> -	if (folio_test_large(folio)) {
> +	folio = page ? page_folio(page) : NULL;
> +	if (folio && folio_test_large(folio)) {
>  		int ret;
>  
>  		pte_unmap_unlock(ptep, ptl);
> @@ -745,12 +758,15 @@ static int hmm_vma_handle_migrate_prepare(const struct mm_walk *walk,
>  			set_pte_at(mm, addr, ptep, pte);
>  			folio_unlock(folio);
>  			folio_put(folio);
> +			*hmm_pfn &= HMM_PFN_INOUT_FLAGS;
>  			goto out;
>  		}
>  	} else {
>  		pte = ptep_get_and_clear(mm, addr, ptep);
>  	}
>  
> +	/* XXX: Migrate layer calls folio_mark_dirty if pte_dirty */
> +
>  	/* Setup special migration page table entry */
>  	if (writable)
>  		entry = make_writable_migration_entry(pfn);
> @@ -759,6 +775,8 @@ static int hmm_vma_handle_migrate_prepare(const struct mm_walk *walk,
>  	else
>  		entry = make_readable_migration_entry(pfn);
>  
> +	/* XXX: Migrate layer makes entry young / dirty based on PTE */
> +
>  	swp_pte = swp_entry_to_pte(entry);
>  	if (pte_present(pte)) {
>  		if (pte_soft_dirty(pte))
> @@ -775,8 +793,18 @@ static int hmm_vma_handle_migrate_prepare(const struct mm_walk *walk,
>  	set_pte_at(mm, addr, ptep, swp_pte);
>  	folio_remove_rmap_pte(folio, page, walk->vma);
>  	folio_put(folio);
> -	*hmm_pfn |= HMM_PFN_MIGRATE;
> +	/*
> +	 * XXX: It is possible the PTE wasn't present in the first part
> +	 * of the HMM walk, repopulate it...
> +	 */
> +	*hmm_pfn &= HMM_PFN_FLAGS;
> +	*hmm_pfn |= HMM_PFN_MIGRATE | pfn | HMM_PFN_VALID |
> +		    (writable ? HMM_PFN_WRITE : 0);
>  
> +	/*
> +	 * XXX: Is there a perf impact of calling flush_tlb_range on
> +	 * each PTE vs. range like migrate_vma layer?
> +	 */
>  	if (pte_present(pte))
>  		flush_tlb_range(walk->vma, addr, addr + PAGE_SIZE);
>  	} else
> @@ -988,8 +1016,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		}
>  
>  		r = hmm_vma_handle_migrate_prepare(walk, pmdp, addr, hmm_pfns);
> -		if (r)
> +		if (r) {
> +			hmm_pfns_fill(addr, end, hmm_vma_walk, HMM_PFN_ERROR);
>  			break;
> +		}
>  	}
>  	pte_unmap(ptep - 1);
>  
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index c8f5a0615a5e..58d42d3ab673 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -268,6 +268,12 @@ int migrate_vma_setup(struct migrate_vma *args)
>  
>  	migrate_hmm_range_setup(&range);
>  
> +	/* Remove migration PTEs */
> +	if (ret) {
> +		migrate_vma_pages(args);
> +		migrate_vma_finalize(args);
> +	}
> +
>  	/*
>  	 * At this point pages are locked and unmapped, and thus they have
>  	 * stable content and can safely be copied to destination memory that
>
> Matt

--Mika