Subject: Re: [PATCH v2 1/3] mm: unified hmm fault and migrate device pagewalk paths
From: Mika Penttilä <mpenttil@redhat.com>
To: Matthew Brost
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, David Hildenbrand,
 Jason Gunthorpe, Leon Romanovsky, Alistair Popple, Balbir Singh, Zi Yan
Date: Wed, 21 Jan 2026 10:05:00 +0200
References: <20260119112502.645059-1-mpenttil@redhat.com> <20260119112502.645059-2-mpenttil@redhat.com>

On 1/20/26 18:03, Matthew Brost wrote:
> On Mon, Jan 19, 2026 at 01:25:00PM +0200, mpenttil@redhat.com wrote:
>> From: Mika Penttilä
>>
>> Currently, the way device page faulting and migration works is not
>> optimal if you want to do both fault handling and migration at once.
>>
>> Being able to migrate non-present pages (or pages mapped with incorrect
>> permissions, e.g. COW) to the GPU requires doing either of the
>> following sequences:
>>
>> 1. hmm_range_fault() - fault in non-present pages with correct permissions, etc.
>> 2. migrate_vma_*() - migrate the pages
>>
>> Or:
>>
>> 1. migrate_vma_*() - migrate present pages
>> 2. If non-present pages are detected by migrate_vma_*():
>>    a) call hmm_range_fault() to fault pages in
>>    b) call migrate_vma_*() again to migrate the now-present pages
>>
>> The problem with the first sequence is that you always have to do two
>> page walks, even though most of the time the pages are present or zero-page
>> mappings, so the common case takes a performance hit.
>>
>> The second sequence is better for the common case, but far worse if
>> pages aren't present, because now you have to walk the page tables three
>> times (once to find the page is not present, once so hmm_range_fault()
>> can find a non-present page to fault in, and once again to set up the
>> migration). It is also tricky to code correctly.
>>
>> We should be able to walk the page table once, faulting pages in as
>> required and replacing them with migration entries if requested.
>>
>> Add a new flag to the HMM API, HMM_PFN_REQ_MIGRATE, which tells
>> hmm_range_fault() to also prepare for migration during fault handling.
>> Also, for the migrate_vma_setup() call paths, a flag, MIGRATE_VMA_FAULT,
>> is added to tell the migrate path to also handle faults.
>>
>> Cc: David Hildenbrand
>> Cc: Jason Gunthorpe
>> Cc: Leon Romanovsky
>> Cc: Alistair Popple
>> Cc: Balbir Singh
>> Cc: Zi Yan
>> Cc: Matthew Brost

> A couple of comments/questions around the locking. Personally, I like
> the approach, but it's up to the maintainers whether they like it. I also
> haven't pulled or tested this yet and likely won't have time for at least
> a few days, so all comments are based on inspection.

Thanks, your comments are valid. I will prepare a v3 with them addressed.

>> Suggested-by: Alistair Popple
>> Signed-off-by: Mika Penttilä
>> ---
>>  include/linux/hmm.h     |  19 +-
>>  include/linux/migrate.h |  27 +-
>>  mm/hmm.c                | 770 +++++++++++++++++++++++++++++++++++++---
>>  mm/migrate_device.c     |  86 ++++-
>>  4 files changed, 839 insertions(+), 63 deletions(-)
>>
>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>> index db75ffc949a7..e2f53e155af2 100644
>> --- a/include/linux/hmm.h
>> +++ b/include/linux/hmm.h
>> @@ -12,7 +12,7 @@
>>  #include
>>
>>  struct mmu_interval_notifier;
>> -
>> +struct migrate_vma;
>>  /*
>>   * On output:
>>   * 0 - The page is faultable and a future call with
>> @@ -27,6 +27,7 @@ struct mmu_interval_notifier;
>>   * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
>>   * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
>>   *                      to mark that page is already DMA mapped
>> + * HMM_PFN_MIGRATE - Migrate PTE installed
>>   *
>>   * On input:
>>   * 0 - Return the current state of the page, do not fault it.
>>   * HMM_PFN_REQ_FAULT - The output must have HMM_PFN_VALID or hmm_range_fault()
>>   *                     will fail
>>   * HMM_PFN_REQ_WRITE - The output must have HMM_PFN_WRITE or hmm_range_fault()
>>   *                     will fail. Must be combined with HMM_PFN_REQ_FAULT.
>> + * HMM_PFN_REQ_MIGRATE - For default_flags, request to migrate to device >> */ >> enum hmm_pfn_flags { >> /* Output fields and flags */ >> @@ -48,15 +50,25 @@ enum hmm_pfn_flags { >> HMM_PFN_P2PDMA = 1UL << (BITS_PER_LONG - 5), >> HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6), >> >> - HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 11), >> + /* Migrate request */ >> + HMM_PFN_MIGRATE = 1UL << (BITS_PER_LONG - 7), >> + HMM_PFN_COMPOUND = 1UL << (BITS_PER_LONG - 8), >> + HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 13), >> >> /* Input flags */ >> HMM_PFN_REQ_FAULT = HMM_PFN_VALID, >> HMM_PFN_REQ_WRITE = HMM_PFN_WRITE, >> + HMM_PFN_REQ_MIGRATE = HMM_PFN_MIGRATE, >> >> HMM_PFN_FLAGS = ~((1UL << HMM_PFN_ORDER_SHIFT) - 1), >> }; >> >> +enum { >> + /* These flags are carried from input-to-output */ >> + HMM_PFN_INOUT_FLAGS = HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA | >> + HMM_PFN_P2PDMA_BUS, >> +}; >> + >> /* >> * hmm_pfn_to_page() - return struct page pointed to by a device entry >> * >> @@ -107,6 +119,7 @@ static inline unsigned int hmm_pfn_to_map_order(unsigned long hmm_pfn) >> * @default_flags: default flags for the range (write, read, ... see hmm doc) >> * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter >> * @dev_private_owner: owner of device private pages >> + * @migrate: structure for migrating the associated vma >> */ >> struct hmm_range { >> struct mmu_interval_notifier *notifier; >> @@ -117,12 +130,14 @@ struct hmm_range { >> unsigned long default_flags; >> unsigned long pfn_flags_mask; >> void *dev_private_owner; >> + struct migrate_vma *migrate; >> }; >> >> /* >> * Please see Documentation/mm/hmm.rst for how to use the range API. >> */ >> int hmm_range_fault(struct hmm_range *range); >> +int hmm_range_migrate_prepare(struct hmm_range *range, struct migrate_vma **pargs); >> >> /* >> * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range >> diff --git a/include/linux/migrate.h b/include/linux/migrate.h >> index 26ca00c325d9..104eda2dd881 100644 >> --- a/include/linux/migrate.h >> +++ b/include/linux/migrate.h >> @@ -3,6 +3,7 @@ >> #define _LINUX_MIGRATE_H >> >> #include >> +#include >> #include >> #include >> #include >> @@ -97,6 +98,16 @@ static inline int set_movable_ops(const struct movable_operations *ops, enum pag >> return -ENOSYS; >> } >> >> +enum migrate_vma_info { >> + MIGRATE_VMA_SELECT_NONE = 0, >> + MIGRATE_VMA_SELECT_COMPOUND = MIGRATE_VMA_SELECT_NONE, >> +}; >> + >> +static inline enum migrate_vma_info hmm_select_migrate(struct hmm_range *range) >> +{ >> + return MIGRATE_VMA_SELECT_NONE; >> +} >> + >> #endif /* CONFIG_MIGRATION */ >> >> #ifdef CONFIG_NUMA_BALANCING >> @@ -140,11 +151,12 @@ static inline unsigned long migrate_pfn(unsigned long pfn) >> return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID; >> } >> >> -enum migrate_vma_direction { >> +enum migrate_vma_info { >> MIGRATE_VMA_SELECT_SYSTEM = 1 << 0, >> MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1, >> MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2, >> MIGRATE_VMA_SELECT_COMPOUND = 1 << 3, >> + MIGRATE_VMA_FAULT = 1 << 4, >> }; >> >> struct migrate_vma { >> @@ -182,6 +194,17 @@ struct migrate_vma { >> struct page *fault_page; >> }; >> >> +static inline enum migrate_vma_info hmm_select_migrate(struct hmm_range *range) >> +{ >> + enum migrate_vma_info minfo; >> + >> + minfo = range->migrate ? range->migrate->flags : 0; >> + minfo |= (range->default_flags & HMM_PFN_REQ_MIGRATE) ? 
>> + MIGRATE_VMA_SELECT_SYSTEM : 0; >> + >> + return minfo; >> +} >> + >> int migrate_vma_setup(struct migrate_vma *args); >> void migrate_vma_pages(struct migrate_vma *migrate); >> void migrate_vma_finalize(struct migrate_vma *migrate); >> @@ -192,7 +215,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns, >> unsigned long npages); >> void migrate_device_finalize(unsigned long *src_pfns, >> unsigned long *dst_pfns, unsigned long npages); >> - >> +void migrate_hmm_range_setup(struct hmm_range *range); >> #endif /* CONFIG_MIGRATION */ >> >> #endif /* _LINUX_MIGRATE_H */ >> diff --git a/mm/hmm.c b/mm/hmm.c >> index 4ec74c18bef6..1fdb8665eeec 100644 >> --- a/mm/hmm.c >> +++ b/mm/hmm.c >> @@ -20,6 +20,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -27,12 +28,20 @@ >> #include >> #include >> #include >> +#include >> >> #include "internal.h" >> >> struct hmm_vma_walk { >> - struct hmm_range *range; >> - unsigned long last; >> + struct mmu_notifier_range mmu_range; >> + struct vm_area_struct *vma; >> + struct hmm_range *range; >> + unsigned long start; >> + unsigned long end; >> + unsigned long last; >> + bool locked; >> + bool pmdlocked; >> + spinlock_t *ptl; >> }; >> >> enum { >> @@ -41,21 +50,38 @@ enum { >> HMM_NEED_ALL_BITS = HMM_NEED_FAULT | HMM_NEED_WRITE_FAULT, >> }; >> >> -enum { >> - /* These flags are carried from input-to-output */ >> - HMM_PFN_INOUT_FLAGS = HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA | >> - HMM_PFN_P2PDMA_BUS, >> -}; >> - >> static int hmm_pfns_fill(unsigned long addr, unsigned long end, >> - struct hmm_range *range, unsigned long cpu_flags) >> + struct hmm_vma_walk *hmm_vma_walk, unsigned long cpu_flags) >> { >> + struct hmm_range *range = hmm_vma_walk->range; >> unsigned long i = (addr - range->start) >> PAGE_SHIFT; >> + enum migrate_vma_info minfo; >> + bool migrate = false; >> + >> + minfo = hmm_select_migrate(range); >> + if (cpu_flags != HMM_PFN_ERROR) { >> + if (minfo && (vma_is_anonymous(hmm_vma_walk->vma))) { >> + cpu_flags |= (HMM_PFN_VALID | HMM_PFN_MIGRATE); >> + migrate = true; >> + } >> + } >> + >> + if (migrate && thp_migration_supported() && >> + (minfo & MIGRATE_VMA_SELECT_COMPOUND) && >> + IS_ALIGNED(addr, HPAGE_PMD_SIZE) && >> + IS_ALIGNED(end, HPAGE_PMD_SIZE)) { >> + range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; >> + range->hmm_pfns[i] |= cpu_flags | HMM_PFN_COMPOUND; >> + addr += PAGE_SIZE; >> + i++; >> + cpu_flags = 0; >> + } >> >> for (; addr < end; addr += PAGE_SIZE, i++) { >> range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; >> range->hmm_pfns[i] |= cpu_flags; >> } >> + >> return 0; >> } >> >> @@ -171,11 +197,11 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end, >> if (!walk->vma) { >> if (required_fault) >> return -EFAULT; >> - return hmm_pfns_fill(addr, end, range, HMM_PFN_ERROR); >> + return hmm_pfns_fill(addr, end, hmm_vma_walk, HMM_PFN_ERROR); >> } >> if (required_fault) >> return hmm_vma_fault(addr, end, required_fault, walk); > Can we assert in hmm_vma_fault that neither the PMD nor the PTE lock is > held? That would be a quick sanity check to ensure we haven’t screwed > anything up in the state-machine walk. > > We could add marco like HMM_ASSERT or HMM_WARN, etc... Good idea, will add. 
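
Roughly what I have in mind for v3, as an untested sketch. The macro name
and the CONFIG_DEBUG_VM gating are just placeholders; it only checks the
locked/pmdlocked state this patch already tracks in struct hmm_vma_walk:

  #ifdef CONFIG_DEBUG_VM
  #define HMM_ASSERT(cond)        WARN_ON_ONCE(!(cond))
  #else
  #define HMM_ASSERT(cond)        do { } while (0)
  #endif

  static int hmm_vma_fault(unsigned long addr, unsigned long end,
                           unsigned int required_fault, struct mm_walk *walk)
  {
          struct hmm_vma_walk *hmm_vma_walk = walk->private;

          /* Faulting may sleep, so neither the PMD nor the PTE lock may be held. */
          HMM_ASSERT(!hmm_vma_walk->pmdlocked);
          HMM_ASSERT(!hmm_vma_walk->locked);

          /* ... existing fault handling unchanged ... */
  }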
> >> - return hmm_pfns_fill(addr, end, range, 0); >> + return hmm_pfns_fill(addr, end, hmm_vma_walk, 0); >> } >> >> static inline unsigned long hmm_pfn_flags_order(unsigned long order) >> @@ -208,8 +234,13 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr, >> cpu_flags = pmd_to_hmm_pfn_flags(range, pmd); >> required_fault = >> hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, cpu_flags); >> - if (required_fault) >> + if (required_fault) { >> + if (hmm_vma_walk->pmdlocked) { >> + spin_unlock(hmm_vma_walk->ptl); >> + hmm_vma_walk->pmdlocked = false; >> + } >> return hmm_vma_fault(addr, end, required_fault, walk); >> + } >> >> pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); >> for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) { >> @@ -289,14 +320,28 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, >> goto fault; >> >> if (softleaf_is_migration(entry)) { >> - pte_unmap(ptep); >> - hmm_vma_walk->last = addr; >> - migration_entry_wait(walk->mm, pmdp, addr); >> - return -EBUSY; >> + if (!hmm_select_migrate(range)) { >> + if (hmm_vma_walk->locked) { >> + pte_unmap_unlock(ptep, hmm_vma_walk->ptl); >> + hmm_vma_walk->locked = false; > I don’t think it should be possible for the lock to be held here, given > that we only take it when selecting migration. So maybe we should assert > that it is not locked. Correct. Will fix. > >> + } else >> + pte_unmap(ptep); >> + >> + hmm_vma_walk->last = addr; >> + migration_entry_wait(walk->mm, pmdp, addr); >> + return -EBUSY; >> + } else >> + goto out; >> } >> >> /* Report error for everything else */ >> - pte_unmap(ptep); >> + >> + if (hmm_vma_walk->locked) { >> + pte_unmap_unlock(ptep, hmm_vma_walk->ptl); >> + hmm_vma_walk->locked = false; >> + } else >> + pte_unmap(ptep); >> + >> return -EFAULT; >> } >> >> @@ -313,7 +358,12 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, >> if (!vm_normal_page(walk->vma, addr, pte) && >> !is_zero_pfn(pte_pfn(pte))) { >> if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) { >> - pte_unmap(ptep); >> + if (hmm_vma_walk->locked) { >> + pte_unmap_unlock(ptep, hmm_vma_walk->ptl); >> + hmm_vma_walk->locked = false; >> + } else >> + pte_unmap(ptep); >> + >> return -EFAULT; >> } >> new_pfn_flags = HMM_PFN_ERROR; >> @@ -326,7 +376,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, >> return 0; >> >> fault: >> - pte_unmap(ptep); >> + if (hmm_vma_walk->locked) { >> + pte_unmap_unlock(ptep, hmm_vma_walk->ptl); >> + hmm_vma_walk->locked = false; >> + } else >> + pte_unmap(ptep); >> /* Fault any virtual address we were asked to fault */ >> return hmm_vma_fault(addr, end, required_fault, walk); >> } >> @@ -370,13 +424,18 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start, >> required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns, >> npages, 0); >> if (required_fault) { >> - if (softleaf_is_device_private(entry)) >> + if (softleaf_is_device_private(entry)) { >> + if (hmm_vma_walk->pmdlocked) { >> + spin_unlock(hmm_vma_walk->ptl); >> + hmm_vma_walk->pmdlocked = false; >> + } >> return hmm_vma_fault(addr, end, required_fault, walk); >> + } >> else >> return -EFAULT; >> } >> >> - return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); >> + return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR); >> } >> #else >> static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start, >> @@ -384,15 +443,486 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned 
long start, >> pmd_t pmd) >> { >> struct hmm_vma_walk *hmm_vma_walk = walk->private; >> - struct hmm_range *range = hmm_vma_walk->range; >> unsigned long npages = (end - start) >> PAGE_SHIFT; >> >> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) >> return -EFAULT; >> - return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); >> + return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR); >> } >> #endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ >> >> +#ifdef CONFIG_DEVICE_MIGRATION >> +/** >> + * migrate_vma_split_folio() - Helper function to split a THP folio >> + * @folio: the folio to split >> + * @fault_page: struct page associated with the fault if any >> + * >> + * Returns 0 on success >> + */ >> +static int migrate_vma_split_folio(struct folio *folio, >> + struct page *fault_page) >> +{ >> + int ret; >> + struct folio *fault_folio = fault_page ? page_folio(fault_page) : NULL; >> + struct folio *new_fault_folio = NULL; >> + >> + if (folio != fault_folio) { >> + folio_get(folio); >> + folio_lock(folio); >> + } >> + >> + ret = split_folio(folio); >> + if (ret) { >> + if (folio != fault_folio) { >> + folio_unlock(folio); >> + folio_put(folio); >> + } >> + return ret; >> + } >> + >> + new_fault_folio = fault_page ? page_folio(fault_page) : NULL; >> + >> + /* >> + * Ensure the lock is held on the correct >> + * folio after the split >> + */ >> + if (!new_fault_folio) { >> + folio_unlock(folio); >> + folio_put(folio); >> + } else if (folio != new_fault_folio) { >> + if (new_fault_folio != fault_folio) { >> + folio_get(new_fault_folio); >> + folio_lock(new_fault_folio); >> + } >> + folio_unlock(folio); >> + folio_put(folio); >> + } >> + >> + return 0; >> +} >> + >> +static int hmm_vma_handle_migrate_prepare_pmd(const struct mm_walk *walk, >> + pmd_t *pmdp, >> + unsigned long start, >> + unsigned long end, >> + unsigned long *hmm_pfn) >> +{ >> + struct hmm_vma_walk *hmm_vma_walk = walk->private; >> + struct hmm_range *range = hmm_vma_walk->range; >> + struct migrate_vma *migrate = range->migrate; >> + struct folio *fault_folio = NULL; >> + struct folio *folio; >> + enum migrate_vma_info minfo; >> + unsigned long i; >> + int r = 0; >> + >> + minfo = hmm_select_migrate(range); >> + if (!minfo) >> + return r; >> + > Can we assert the PMD is locked here? ack > >> + fault_folio = (migrate && migrate->fault_page) ? 
>> + page_folio(migrate->fault_page) : NULL; >> + >> + if (pmd_none(*pmdp)) >> + return hmm_pfns_fill(start, end, hmm_vma_walk, 0); >> + >> + if (!(hmm_pfn[0] & HMM_PFN_VALID)) >> + goto out; >> + >> + if (pmd_trans_huge(*pmdp)) { >> + if (!(minfo & MIGRATE_VMA_SELECT_SYSTEM)) >> + goto out; >> + >> + folio = pmd_folio(*pmdp); >> + if (is_huge_zero_folio(folio)) >> + return hmm_pfns_fill(start, end, hmm_vma_walk, 0); >> + >> + } else if (!pmd_present(*pmdp)) { >> + const softleaf_t entry = softleaf_from_pmd(*pmdp); >> + >> + folio = softleaf_to_folio(entry); >> + >> + if (!softleaf_is_device_private(entry)) >> + goto out; >> + >> + if (!(minfo & MIGRATE_VMA_SELECT_DEVICE_PRIVATE)) >> + goto out; >> + if (folio->pgmap->owner != migrate->pgmap_owner) >> + goto out; >> + >> + } else { >> + hmm_vma_walk->last = start; >> + return -EBUSY; >> + } >> + >> + folio_get(folio); >> + >> + if (folio != fault_folio && unlikely(!folio_trylock(folio))) { >> + folio_put(folio); >> + hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR); >> + return 0; >> + } >> + >> + if (thp_migration_supported() && >> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) && >> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) && >> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) { >> + >> + struct page_vma_mapped_walk pvmw = { >> + .ptl = hmm_vma_walk->ptl, >> + .address = start, >> + .pmd = pmdp, >> + .vma = walk->vma, >> + }; >> + >> + hmm_pfn[0] |= HMM_PFN_MIGRATE | HMM_PFN_COMPOUND; >> + >> + r = set_pmd_migration_entry(&pvmw, folio_page(folio, 0)); >> + if (r) { >> + hmm_pfn[0] &= ~(HMM_PFN_MIGRATE | HMM_PFN_COMPOUND); >> + r = -ENOENT; // fallback >> + goto unlock_out; >> + } >> + for (i = 1, start += PAGE_SIZE; start < end; start += PAGE_SIZE, i++) >> + hmm_pfn[i] &= HMM_PFN_INOUT_FLAGS; >> + >> + } else { >> + r = -ENOENT; // fallback >> + goto unlock_out; >> + } >> + >> + >> +out: >> + return r; >> + >> +unlock_out: >> + if (folio != fault_folio) >> + folio_unlock(folio); >> + folio_put(folio); >> + goto out; >> + >> +} >> + >> +/* >> + * Install migration entries if migration requested, either from fault >> + * or migrate paths. >> + * >> + */ >> +static int hmm_vma_handle_migrate_prepare(const struct mm_walk *walk, >> + pmd_t *pmdp, >> + pte_t *ptep, >> + unsigned long addr, >> + unsigned long *hmm_pfn) >> +{ >> + struct hmm_vma_walk *hmm_vma_walk = walk->private; >> + struct hmm_range *range = hmm_vma_walk->range; >> + struct migrate_vma *migrate = range->migrate; >> + struct mm_struct *mm = walk->vma->vm_mm; >> + struct folio *fault_folio = NULL; >> + enum migrate_vma_info minfo; >> + struct dev_pagemap *pgmap; >> + bool anon_exclusive; >> + struct folio *folio; >> + unsigned long pfn; >> + struct page *page; >> + softleaf_t entry; >> + pte_t pte, swp_pte; >> + bool writable = false; >> + >> + // Do we want to migrate at all? >> + minfo = hmm_select_migrate(range); >> + if (!minfo) >> + return 0; >> + > Can we assert the PTE lock is held here? ack > >> + fault_folio = (migrate && migrate->fault_page) ? >> + page_folio(migrate->fault_page) : NULL; >> + >> + if (!hmm_vma_walk->locked) { >> + ptep = pte_offset_map_lock(mm, pmdp, addr, &hmm_vma_walk->ptl); >> + hmm_vma_walk->locked = true; >> + } >> + pte = ptep_get(ptep); >> + > How would we get without PTE lock being held? Shouldn't the caller take > the lock? This was leftover from previous version when there was a jump to the beginning. Will remove. 
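
To spell out the contract for v3 (sketch only, using the hypothetical
HMM_ASSERT helper mentioned earlier): hmm_vma_walk_pmd() takes the PTE lock
whenever migration is selected, and this helper only asserts it:

  /* caller, hmm_vma_walk_pmd() */
  ptep = pte_offset_map_lock(mm, pmdp, addr, &hmm_vma_walk->ptl);
  if (!ptep)
          goto again;
  hmm_vma_walk->locked = true;

  /* callee, hmm_vma_handle_migrate_prepare() */
  HMM_ASSERT(hmm_vma_walk->locked);
  pte = ptep_get(ptep);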
> >> + if (pte_none(pte)) { >> + // migrate without faulting case >> + if (vma_is_anonymous(walk->vma)) { >> + *hmm_pfn &= HMM_PFN_INOUT_FLAGS; >> + *hmm_pfn |= HMM_PFN_MIGRATE | HMM_PFN_VALID; >> + goto out; >> + } >> + } >> + >> + if (!(hmm_pfn[0] & HMM_PFN_VALID)) >> + goto out; >> + >> + if (!pte_present(pte)) { >> + /* >> + * Only care about unaddressable device page special >> + * page table entry. Other special swap entries are not >> + * migratable, and we ignore regular swapped page. >> + */ >> + entry = softleaf_from_pte(pte); >> + if (!softleaf_is_device_private(entry)) >> + goto out; >> + >> + if (!(minfo & MIGRATE_VMA_SELECT_DEVICE_PRIVATE)) >> + goto out; >> + >> + page = softleaf_to_page(entry); >> + folio = page_folio(page); >> + if (folio->pgmap->owner != migrate->pgmap_owner) >> + goto out; >> + >> + if (folio_test_large(folio)) { >> + int ret; >> + >> + pte_unmap_unlock(ptep, hmm_vma_walk->ptl); >> + hmm_vma_walk->locked = false; >> + ret = migrate_vma_split_folio(folio, >> + migrate->fault_page); >> + if (ret) >> + goto out_error; >> + return -EAGAIN; >> + } >> + >> + pfn = page_to_pfn(page); >> + if (softleaf_is_device_private_write(entry)) >> + writable = true; >> + } else { >> + pfn = pte_pfn(pte); >> + if (is_zero_pfn(pfn) && >> + (minfo & MIGRATE_VMA_SELECT_SYSTEM)) { >> + *hmm_pfn = HMM_PFN_MIGRATE|HMM_PFN_VALID; >> + goto out; >> + } >> + page = vm_normal_page(walk->vma, addr, pte); >> + if (page && !is_zone_device_page(page) && >> + !(minfo & MIGRATE_VMA_SELECT_SYSTEM)) { >> + goto out; >> + } else if (page && is_device_coherent_page(page)) { >> + pgmap = page_pgmap(page); >> + >> + if (!(minfo & >> + MIGRATE_VMA_SELECT_DEVICE_COHERENT) || >> + pgmap->owner != migrate->pgmap_owner) >> + goto out; >> + } >> + >> + folio = page ? page_folio(page) : NULL; >> + if (folio && folio_test_large(folio)) { >> + int ret; >> + >> + pte_unmap_unlock(ptep, hmm_vma_walk->ptl); >> + hmm_vma_walk->locked = false; >> + >> + ret = migrate_vma_split_folio(folio, >> + migrate->fault_page); >> + if (ret) >> + goto out_error; >> + return -EAGAIN; >> + } >> + >> + writable = pte_write(pte); >> + } >> + >> + if (!page || !page->mapping) >> + goto out; >> + >> + /* >> + * By getting a reference on the folio we pin it and that blocks >> + * any kind of migration. Side effect is that it "freezes" the >> + * pte. >> + * >> + * We drop this reference after isolating the folio from the lru >> + * for non device folio (device folio are not on the lru and thus >> + * can't be dropped from it). >> + */ >> + folio = page_folio(page); >> + folio_get(folio); >> + >> + /* >> + * We rely on folio_trylock() to avoid deadlock between >> + * concurrent migrations where each is waiting on the others >> + * folio lock. If we can't immediately lock the folio we fail this >> + * migration as it is only best effort anyway. >> + * >> + * If we can lock the folio it's safe to set up a migration entry >> + * now. In the common case where the folio is mapped once in a >> + * single process setting up the migration entry now is an >> + * optimisation to avoid walking the rmap later with >> + * try_to_migrate(). 
>> + */ >> + >> + if (fault_folio == folio || folio_trylock(folio)) { >> + anon_exclusive = folio_test_anon(folio) && >> + PageAnonExclusive(page); >> + >> + flush_cache_page(walk->vma, addr, pfn); >> + >> + if (anon_exclusive) { >> + pte = ptep_clear_flush(walk->vma, addr, ptep); >> + >> + if (folio_try_share_anon_rmap_pte(folio, page)) { >> + set_pte_at(mm, addr, ptep, pte); >> + folio_unlock(folio); >> + folio_put(folio); >> + goto out; >> + } >> + } else { >> + pte = ptep_get_and_clear(mm, addr, ptep); >> + } >> + >> + if (pte_dirty(pte)) >> + folio_mark_dirty(folio); >> + >> + /* Setup special migration page table entry */ >> + if (writable) >> + entry = make_writable_migration_entry(pfn); >> + else if (anon_exclusive) >> + entry = make_readable_exclusive_migration_entry(pfn); >> + else >> + entry = make_readable_migration_entry(pfn); >> + >> + if (pte_present(pte)) { >> + if (pte_young(pte)) >> + entry = make_migration_entry_young(entry); >> + if (pte_dirty(pte)) >> + entry = make_migration_entry_dirty(entry); >> + } >> + >> + swp_pte = swp_entry_to_pte(entry); >> + if (pte_present(pte)) { >> + if (pte_soft_dirty(pte)) >> + swp_pte = pte_swp_mksoft_dirty(swp_pte); >> + if (pte_uffd_wp(pte)) >> + swp_pte = pte_swp_mkuffd_wp(swp_pte); >> + } else { >> + if (pte_swp_soft_dirty(pte)) >> + swp_pte = pte_swp_mksoft_dirty(swp_pte); >> + if (pte_swp_uffd_wp(pte)) >> + swp_pte = pte_swp_mkuffd_wp(swp_pte); >> + } >> + >> + set_pte_at(mm, addr, ptep, swp_pte); >> + folio_remove_rmap_pte(folio, page, walk->vma); >> + folio_put(folio); >> + *hmm_pfn |= HMM_PFN_MIGRATE; >> + >> + if (pte_present(pte)) >> + flush_tlb_range(walk->vma, addr, addr + PAGE_SIZE); >> + } else >> + folio_put(folio); >> +out: >> + return 0; >> +out_error: >> + return -EFAULT; >> + >> +} >> + >> +static int hmm_vma_walk_split(pmd_t *pmdp, >> + unsigned long addr, >> + struct mm_walk *walk) >> +{ >> + struct hmm_vma_walk *hmm_vma_walk = walk->private; >> + struct hmm_range *range = hmm_vma_walk->range; >> + struct migrate_vma *migrate = range->migrate; >> + struct folio *folio, *fault_folio; >> + spinlock_t *ptl; >> + int ret = 0; >> + >> + fault_folio = (migrate && migrate->fault_page) ? >> + page_folio(migrate->fault_page) : NULL; >> + > Assert the PMD lock isn't held? 
ack > >> + ptl = pmd_lock(walk->mm, pmdp); >> + if (unlikely(!pmd_trans_huge(*pmdp))) { >> + spin_unlock(ptl); >> + goto out; >> + } >> + >> + folio = pmd_folio(*pmdp); >> + if (is_huge_zero_folio(folio)) { >> + spin_unlock(ptl); >> + split_huge_pmd(walk->vma, pmdp, addr); >> + } else { >> + folio_get(folio); >> + spin_unlock(ptl); >> + >> + if (folio != fault_folio) { >> + if (unlikely(!folio_trylock(folio))) { >> + folio_put(folio); >> + ret = -EBUSY; >> + goto out; >> + } >> + } else >> + folio_put(folio); >> + >> + ret = split_folio(folio); >> + if (fault_folio != folio) { >> + folio_unlock(folio); >> + folio_put(folio); >> + } >> + >> + } >> +out: >> + return ret; >> +} >> +#else >> +static int hmm_vma_handle_migrate_prepare_pmd(const struct mm_walk *walk, >> + pmd_t *pmdp, >> + unsigned long start, >> + unsigned long end, >> + unsigned long *hmm_pfn) >> +{ >> + return 0; >> +} >> + >> +static int hmm_vma_handle_migrate_prepare(const struct mm_walk *walk, >> + pmd_t *pmdp, >> + pte_t *pte, >> + unsigned long addr, >> + unsigned long *hmm_pfn) >> +{ >> + return 0; >> +} >> + >> +static int hmm_vma_walk_split(pmd_t *pmdp, >> + unsigned long addr, >> + struct mm_walk *walk) >> +{ >> + return 0; >> +} >> +#endif >> + >> +static int hmm_vma_capture_migrate_range(unsigned long start, >> + unsigned long end, >> + struct mm_walk *walk) >> +{ >> + struct hmm_vma_walk *hmm_vma_walk = walk->private; >> + struct hmm_range *range = hmm_vma_walk->range; >> + >> + if (!hmm_select_migrate(range)) >> + return 0; >> + >> + if (hmm_vma_walk->vma && (hmm_vma_walk->vma != walk->vma)) >> + return -ERANGE; >> + >> + hmm_vma_walk->vma = walk->vma; >> + hmm_vma_walk->start = start; >> + hmm_vma_walk->end = end; >> + >> + if (end - start > range->end - range->start) >> + return -ERANGE; >> + >> + if (!hmm_vma_walk->mmu_range.owner) { >> + mmu_notifier_range_init_owner(&hmm_vma_walk->mmu_range, MMU_NOTIFY_MIGRATE, 0, >> + walk->vma->vm_mm, start, end, >> + range->dev_private_owner); >> + mmu_notifier_invalidate_range_start(&hmm_vma_walk->mmu_range); >> + } >> + >> + return 0; >> +} >> + >> static int hmm_vma_walk_pmd(pmd_t *pmdp, >> unsigned long start, >> unsigned long end, >> @@ -403,43 +933,112 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, >> unsigned long *hmm_pfns = >> &range->hmm_pfns[(start - range->start) >> PAGE_SHIFT]; >> unsigned long npages = (end - start) >> PAGE_SHIFT; >> + struct mm_struct *mm = walk->vma->vm_mm; >> unsigned long addr = start; >> + enum migrate_vma_info minfo; >> + unsigned long i; >> + spinlock_t *ptl; >> pte_t *ptep; >> pmd_t pmd; >> + int r; >> + >> + minfo = hmm_select_migrate(range); >> >> again: >> + hmm_vma_walk->locked = false; >> + hmm_vma_walk->pmdlocked = false; >> pmd = pmdp_get_lockless(pmdp); >> - if (pmd_none(pmd)) >> - return hmm_vma_walk_hole(start, end, -1, walk); >> + if (pmd_none(pmd)) { >> + r = hmm_vma_walk_hole(start, end, -1, walk); >> + if (r || !minfo) >> + return r; >> + >> + ptl = pmd_lock(walk->mm, pmdp); >> + if (pmd_none(*pmdp)) { >> + // hmm_vma_walk_hole() filled migration needs >> + spin_unlock(ptl); >> + return r; >> + } >> + spin_unlock(ptl); >> + } >> >> if (thp_migration_supported() && pmd_is_migration_entry(pmd)) { >> - if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) { >> + if (!minfo) { >> + if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) { >> + hmm_vma_walk->last = addr; >> + pmd_migration_entry_wait(walk->mm, pmdp); >> + return -EBUSY; >> + } >> + } >> + for (i = 0; addr < end; addr += PAGE_SIZE, i++) >> + 
hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS; >> + >> + return 0; >> + } >> + >> + if (minfo) { >> + hmm_vma_walk->ptl = pmd_lock(mm, pmdp); >> + hmm_vma_walk->pmdlocked = true; >> + pmd = pmdp_get(pmdp); >> + } else >> + pmd = pmdp_get_lockless(pmdp); >> + >> + if (pmd_trans_huge(pmd) || !pmd_present(pmd)) { >> + >> + if (!pmd_present(pmd)) { >> + r = hmm_vma_handle_absent_pmd(walk, start, end, hmm_pfns, >> + pmd); >> + if (r || !minfo) > Do we need to drop the PMD lock here upon error? Yes, will fix. > >> + return r; >> + } else { >> + >> + /* >> + * No need to take pmd_lock here if not migrating, >> + * even if some other thread is splitting the huge >> + * pmd we will get that event through mmu_notifier callback. >> + * >> + * So just read pmd value and check again it's a transparent >> + * huge or device mapping one and compute corresponding pfn >> + * values. >> + */ >> + >> + if (!pmd_trans_huge(pmd)) { >> + // must be lockless >> + goto again; > How can '!pmd_trans_huge' be true here? Seems impossible based on outer if statement. Good point. Will fix and refactor this. > >> + } >> + >> + r = hmm_vma_handle_pmd(walk, addr, end, hmm_pfns, pmd); >> + >> + if (r || !minfo) > Do we need to drop the PMD lock here upon error? Yes, will fix. > >> + return r; >> + } >> + >> + r = hmm_vma_handle_migrate_prepare_pmd(walk, pmdp, start, end, hmm_pfns); >> + >> + if (hmm_vma_walk->pmdlocked) { >> + spin_unlock(hmm_vma_walk->ptl); >> + hmm_vma_walk->pmdlocked = false; >> + } >> + >> + if (r == -ENOENT) { >> + r = hmm_vma_walk_split(pmdp, addr, walk); >> + if (r) { >> + /* Split not successful, skip */ >> + return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR); >> + } >> + >> + /* Split successful or "again", reloop */ >> hmm_vma_walk->last = addr; >> - pmd_migration_entry_wait(walk->mm, pmdp); >> return -EBUSY; >> } >> - return hmm_pfns_fill(start, end, range, 0); >> - } >> >> - if (!pmd_present(pmd)) >> - return hmm_vma_handle_absent_pmd(walk, start, end, hmm_pfns, >> - pmd); >> + return r; >> >> - if (pmd_trans_huge(pmd)) { >> - /* >> - * No need to take pmd_lock here, even if some other thread >> - * is splitting the huge pmd we will get that event through >> - * mmu_notifier callback. >> - * >> - * So just read pmd value and check again it's a transparent >> - * huge or device mapping one and compute corresponding pfn >> - * values. >> - */ >> - pmd = pmdp_get_lockless(pmdp); >> - if (!pmd_trans_huge(pmd)) >> - goto again; >> + } >> >> - return hmm_vma_handle_pmd(walk, addr, end, hmm_pfns, pmd); >> + if (hmm_vma_walk->pmdlocked) { >> + spin_unlock(hmm_vma_walk->ptl); >> + hmm_vma_walk->pmdlocked = false; >> } >> >> /* >> @@ -451,22 +1050,41 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, >> if (pmd_bad(pmd)) { >> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) >> return -EFAULT; >> - return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); >> + return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR); >> } >> >> - ptep = pte_offset_map(pmdp, addr); >> + if (minfo) { >> + ptep = pte_offset_map_lock(mm, pmdp, addr, &hmm_vma_walk->ptl); >> + if (ptep) >> + hmm_vma_walk->locked = true; >> + } else >> + ptep = pte_offset_map(pmdp, addr); >> if (!ptep) >> goto again; >> + >> for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) { >> - int r; >> >> r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, hmm_pfns); >> if (r) { >> /* hmm_vma_handle_pte() did pte_unmap() */ > Drop the PTE lock if held? hmm_vma_handle_pte() does unmap/unlock on error. 
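
(Side note for v3: the "unlock if we took the PTE lock, else just unmap"
dance is now repeated in several error paths, so it might be worth folding
into a small helper so the handlers can't forget to clear ->locked. An
untested sketch, with a placeholder name:

  static void hmm_pte_unmap_maybe_unlock(struct hmm_vma_walk *hmm_vma_walk,
                                         pte_t *ptep)
  {
          /* The PTE lock is only taken on the migrate path. */
          if (hmm_vma_walk->locked) {
                  pte_unmap_unlock(ptep, hmm_vma_walk->ptl);
                  hmm_vma_walk->locked = false;
          } else {
                  pte_unmap(ptep);
          }
  }
)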
> >> return r; >> } >> + >> + r = hmm_vma_handle_migrate_prepare(walk, pmdp, ptep, addr, hmm_pfns); >> + if (r == -EAGAIN) { > Assert the callee dropped the PTE lock? ack > > Matt --Mika
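
P.S. To summarize the two new opt-ins for anyone skimming the series, they
fit together roughly as below. These are illustrative fragments only; the
real driver-side wiring comes with the later patches:

  /* migrate_vma_setup() side: also fault non-present pages while walking */
  migrate->flags |= MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_FAULT;

  /* hmm_range_fault() side: also prepare migration while faulting */
  range->default_flags |= HMM_PFN_REQ_FAULT | HMM_PFN_REQ_MIGRATE;
  range->migrate = migrate;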