From: Balbir Singh <balbirs@nvidia.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
intel-xe@lists.freedesktop.org
Cc: Alistair Popple <apopple@nvidia.com>,
Ralph Campbell <rcampbell@nvidia.com>,
Christoph Hellwig <hch@lst.de>,
Jason Gunthorpe <jgg@mellanox.com>,
Jason Gunthorpe <jgg@ziepe.ca>, Leon Romanovsky <leon@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Brost <matthew.brost@intel.com>,
John Hubbard <jhubbard@nvidia.com>,
linux-mm@kvack.org, dri-devel@lists.freedesktop.org,
stable@vger.kernel.org
Subject: Re: [PATCH v4] mm: Fix a hmm_range_fault() livelock / starvation problem
Date: Tue, 10 Feb 2026 13:47:56 +1100
Message-ID: <7e1649dd-d673-4398-bc3c-2fd6626c9177@nvidia.com>
In-Reply-To: <050af3658287690c9f9b29a49bb3e31ecb4c273e.camel@linux.intel.com>
On 2/5/26 23:41, Thomas Hellström wrote:
> On Thu, 2026-02-05 at 22:20 +1100, Balbir Singh wrote:
>> On 2/5/26 22:10, Thomas Hellström wrote:
>>> If hmm_range_fault() fails a folio_trylock() in do_swap_page()
>>> while trying to acquire the lock of a device-private folio for
>>> migration to RAM, the function will spin until it succeeds in
>>> grabbing the lock.
>>>
>>> However, if the process holding the lock depends on the completion
>>> of a work item that is scheduled on the same CPU as the spinning
>>> hmm_range_fault(), that work item may be starved and we end up in a
>>> livelock / starvation situation that is never resolved.
>>>
>>> This can happen, for example, if the process holding the
>>> device-private folio lock is stuck in
>>> migrate_device_unmap()->lru_add_drain_all().
>>> The lru_add_drain_all() function requires a short work item
>>> to be run on all online CPUs before it can complete.
>>>
>>> The prerequisites for this to happen are:
>>> a) Both zone device and system memory folios are considered in
>>> migrate_device_unmap(), so that there is a reason to call
>>> lru_add_drain_all() for a system memory folio while a
>>> folio lock is held on a zone device folio.
>>> b) The zone device folio has an initial mapcount > 1, which causes
>>> at least one migration PTE entry insertion to be deferred to
>>> try_to_migrate(), which can happen after the call to
>>> lru_add_drain_all().
>>> c) No preemption, or voluntary-only preemption.
>>>
>>> This all seems pretty unlikely to happen, but it is indeed hit by
>>> the "xe_exec_system_allocator" igt test.
>>>
>>
>> Do you have a stack trace from the test? I am trying to visualize the
>> livelock/starvation, but I can't from the description.
>
> The spinning thread (the backtrace varies slightly from time to time):
>
> [ 805.201476] watchdog: BUG: soft lockup - CPU#139 stuck for 52s!
> [kworker/u900:1:9985]
> [ 805.201477] Modules linked in: xt_conntrack nft_chain_nat
> xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge
> stp llc xfrm_user xfrm_algo xt_addrtype nft_compat x_tables nf_tables
> mei_gsc_proxy pmt_crashlog mtd_intel_dg mei_gsc overlay qrtr
> snd_hda_codec_intelhdmi snd_hda_codec_hdmi intel_rapl_msr
> intel_rapl_common cfg80211 intel_uncore_frequency
> intel_uncore_frequency_common intel_ifs i10nm_edac sunrpc binfmt_misc
> skx_edac_common nfit xe x86_pkg_temp_thermal intel_powerclamp coretemp
> nls_iso8859_1 kvm_intel kvm drm_ttm_helper drm_suballoc_helper
> gpu_sched snd_hda_intel cmdlinepart drm_gpuvm snd_intel_dspcfg drm_exec
> spi_nor drm_gpusvm_helper snd_hda_codec drm_buddy pmt_telemetry
> dax_hmem snd_hwdep pmt_discovery mtd video irqbypass cxl_acpi qat_4xxx
> iaa_crypto snd_hda_core pmt_class ttm rapl ses cxl_port snd_pcm
> intel_cstate enclosure cxl_core intel_qat isst_if_mmio isst_if_mbox_pci
> drm_display_helper snd_timer snd cec idxd crc8 einj ast mei_me
> spi_intel_pci rc_core soundcore isst_if_common
> [ 805.201496] ipmi_ssif authenc i2c_i801 intel_vsec idxd_bus
> spi_intel i2c_algo_bit mei i2c_ismt i2c_smbus wmi joydev input_leds
> ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler
> acpi_pad mac_hid pfr_telemetry pfr_update sch_fq_codel msr efi_pstore
> dm_multipath nfnetlink dmi_sysfs autofs4 btrfs blake2b libblake2b
> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
> async_tx xor raid6_pq raid1 raid0 linear rndis_host cdc_ether usbnet
> mii nvme hid_generic mpt3sas i40e nvme_core usbhid ahci
> ghash_clmulni_intel raid_class nvme_keyring scsi_transport_sas hid
> libahci nvme_auth libie hkdf libie_adminq pinctrl_emmitsburg
> aesni_intel
> [ 805.201510] CPU: 139 UID: 0 PID: 9985 Comm: kworker/u900:1 Tainted:
> G S W L 6.19.0-rc7+ #18 PREEMPT(voluntary)
> [ 805.201512] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [L]=SOFTLOCKUP
> [ 805.201512] Hardware name: Supermicro SYS-421GE-TNRT/X13DEG-OA, BIOS
> 2.5a 02/21/2025
> [ 805.201513] Workqueue: xe_page_fault_work_queue
> xe_pagefault_queue_work [xe]
> [ 805.201599] RIP: 0010:_raw_spin_unlock+0x16/0x40
> [ 805.201602] Code: cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90
> 90 90 90 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 00 65 ff 0d fa a6 40
> 01 <74> 10 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44 00
> [ 805.201603] RSP: 0018:ffffd2a663a4f678 EFLAGS: 00000247
> [ 805.201603] RAX: fffff85c67e35080 RBX: ffffd2a663a4f7b8 RCX:
> 0000000000000000
> [ 805.201604] RDX: ffff8b88fdd31a00 RSI: 0000000000000000 RDI:
> fffff75c86ff5928
> [ 805.201605] RBP: ffffd2a663a4f678 R08: 0000000000000000 R09:
> 0000000000000000
> [ 805.201605] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000631d10d42000
> [ 805.201606] R13: ffffd2a663a4f7b8 R14: 00000001a4ca4067 R15:
> 74000003ff9f8d42
> [ 805.201606] FS: 0000000000000000(0000) GS:ffff8bc76202b000(0000)
> knlGS:0000000000000000
> [ 805.201607] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 805.201608] CR2: 0000631d10c00088 CR3: 0000003de3040004 CR4:
> 0000000000f72ef0
> [ 805.201609] PKRU: 55555554
> [ 805.201609] Call Trace:
> [ 805.201610] <TASK>
> [ 805.201610] do_swap_page+0x17c6/0x1b70
> [ 805.201612] ? sysvec_apic_timer_interrupt+0x57/0xc0
> [ 805.201614] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [ 805.201615] ? __pfx_default_wake_function+0x10/0x10
> [ 805.201617] ? ___pte_offset_map+0x1c/0x130
> [ 805.201619] __handle_mm_fault+0xa75/0x1020
> [ 805.201621] handle_mm_fault+0xeb/0x2f0
> [ 805.201622] ? handle_mm_fault+0x11a/0x2f0
> [ 805.201623] hmm_vma_fault.isra.0+0x5b/0xb0
> [ 805.201625] hmm_vma_walk_pmd+0x5c7/0xc40
> [ 805.201627] ? sysvec_apic_timer_interrupt+0x57/0xc0
> [ 805.201629] walk_pgd_range+0x5ba/0xbf0
> [ 805.201631] __walk_page_range+0x8e/0x220
> [ 805.201633] walk_page_range_mm_unsafe+0x149/0x210
> [ 805.201635] walk_page_range+0x2a/0x40
> [ 805.201636] hmm_range_fault+0x5c/0xb0
> [ 805.201638] drm_gpusvm_range_evict+0x11a/0x1d0 [drm_gpusvm_helper]
> [ 805.201641] __xe_svm_handle_pagefault+0x5fa/0xf00 [xe]
> [ 805.201736] ? select_task_rq_fair+0x9bc/0x2970
> [ 805.201738] xe_svm_handle_pagefault+0x3d/0xb0 [xe]
> [ 805.201827] xe_pagefault_queue_work+0x233/0x370 [xe]
> [ 805.201905] process_one_work+0x18d/0x370
> [ 805.201907] worker_thread+0x31a/0x460
> [ 805.201908] ? __pfx_worker_thread+0x10/0x10
> [ 805.201909] kthread+0x10b/0x220
> [ 805.201910] ? __pfx_kthread+0x10/0x10
> [ 805.201912] ret_from_fork+0x289/0x2c0
> [ 805.201913] ? __pfx_kthread+0x10/0x10
> [ 805.201915] ret_from_fork_asm+0x1a/0x30
> [ 805.201917] </TASK>
>
> The thread holding the page-lock:
>
> [ 1629.938195] Workqueue: xe_page_fault_work_queue
> xe_pagefault_queue_work [xe]
> [ 1629.938340] Call Trace:
> [ 1629.938341] <TASK>
> [ 1629.938342] __schedule+0x47f/0x1890
> [ 1629.938346] ? psi_group_change+0x1bd/0x4d0
> [ 1629.938350] ? __pick_eevdf+0x70/0x180
> [ 1629.938353] schedule+0x27/0xf0
> [ 1629.938357] schedule_timeout+0xcf/0x110
> [ 1629.938361] __wait_for_common+0x98/0x180
> [ 1629.938364] ? __pfx_schedule_timeout+0x10/0x10
> [ 1629.938368] wait_for_completion+0x24/0x40
> [ 1629.938370] __flush_work+0x2b6/0x400
> [ 1629.938373] ? kick_pool+0x77/0x1b0
> [ 1629.938377] ? __pfx_wq_barrier_func+0x10/0x10
> [ 1629.938382] flush_work+0x1c/0x30
> [ 1629.938384] __lru_add_drain_all+0x19f/0x2a0
> [ 1629.938390] lru_add_drain_all+0x10/0x20
> [ 1629.938392] migrate_device_unmap+0x433/0x480
> [ 1629.938398] migrate_vma_setup+0x245/0x300
> [ 1629.938403] drm_pagemap_migrate_to_devmem+0x2a8/0xc00
> [drm_gpusvm_helper]
> [ 1629.938410] ? krealloc_node_align_noprof+0x12f/0x3a0
> [ 1629.938413] ? __xe_bo_create_locked+0x376/0x840 [xe]
> [ 1629.938529] xe_drm_pagemap_populate_mm+0x25f/0x3a0 [xe]
> [ 1629.938721] drm_pagemap_populate_mm+0x74/0xe0 [drm_gpusvm_helper]
> [ 1629.938731] xe_svm_alloc_vram+0xad/0x270 [xe]
> [ 1629.938933] ? xe_tile_local_pagemap+0x41/0x170 [xe]
> [ 1629.939095] ? ktime_get+0x41/0x100
> [ 1629.939098] __xe_svm_handle_pagefault+0xa90/0xf00 [xe]
> [ 1629.939279] xe_svm_handle_pagefault+0x3d/0xb0 [xe]
> [ 1629.939460] xe_pagefault_queue_work+0x233/0x370 [xe]
> [ 1629.939620] process_one_work+0x18d/0x370
> [ 1629.939623] worker_thread+0x31a/0x460
> [ 1629.939626] ? __pfx_worker_thread+0x10/0x10
> [ 1629.939629] kthread+0x10b/0x220
> [ 1629.939632] ? __pfx_kthread+0x10/0x10
> [ 1629.939636] ret_from_fork+0x289/0x2c0
> [ 1629.939639] ? __pfx_kthread+0x10/0x10
> [ 1629.939642] ret_from_fork_asm+0x1a/0x30
> [ 1629.939648] </TASK>
>
> The worker that this thread waits on in flush_work() is, most likely,
> the one starved of CPU time on CPU #139.
>
Thanks, makes sense!
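
For the archives, here is how I picture the dependency cycle, as a small
self-contained userspace analogy. All of the names below are mine, a
pthread mutex stands in for the device-private folio lock, and the
"spinner_yields" flag stands in for the voluntary-preemption point where
the shared CPU would get to run its queued kworker. It is purely an
illustration of the cycle described above, not the kernel code and not
the fix. Run without arguments it spins forever (the livelock, kill with
Ctrl-C); run with any argument the spinner "yields" each iteration and
both threads finish.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t folio_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool cpu_work_pending; /* queued "lru drain" work for the CPU */
static atomic_bool cpu_work_done;    /* completion the lock holder waits on */
static bool spinner_yields;          /* does the spinner ever give up the CPU? */

/* Stand-in for the kworker running the queued per-CPU drain work. */
static void run_pending_cpu_work(void)
{
	if (atomic_load(&cpu_work_pending)) {
		atomic_store(&cpu_work_done, true);
		atomic_store(&cpu_work_pending, false);
	}
}

/* Thread A: takes the "folio lock", then needs the per-CPU work flushed
 * (the migrate_device_unmap() -> lru_add_drain_all() side). */
static void *lock_holder(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&folio_lock);
	atomic_store(&cpu_work_pending, true);   /* queue the drain work */
	while (!atomic_load(&cpu_work_done))     /* flush_work() analogue */
		usleep(1000);
	pthread_mutex_unlock(&folio_lock);
	return NULL;
}

/* Thread B: the hmm_range_fault()-like side, retrying a trylock forever. */
static void *spinner(void *arg)
{
	(void)arg;
	for (;;) {
		if (pthread_mutex_trylock(&folio_lock) == 0) {
			pthread_mutex_unlock(&folio_lock);
			return NULL;                 /* "fault" resolved */
		}
		/*
		 * The queued per-CPU work only runs if this thread stops
		 * monopolising the shared CPU. With spinner_yields == false
		 * we retry immediately, so the work never runs: livelock.
		 */
		if (spinner_yields)
			run_pending_cpu_work();
	}
}

int main(int argc, char **argv)
{
	pthread_t a, b;

	(void)argv;
	spinner_yields = argc > 1;  /* any argument: let the spinner "yield" */

	pthread_create(&a, NULL, lock_holder, NULL);
	while (!atomic_load(&cpu_work_pending))  /* wait until A holds the lock */
		usleep(1000);
	pthread_create(&b, NULL, spinner, NULL);

	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("no livelock: both threads finished");
	return 0;
}

The cycle is the same as in the traces above: B needs the lock A holds,
A needs the drain work to complete, and the drain work needs CPU time
that B never gives up.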
Balbir