From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 20817C021BB for ; Tue, 25 Feb 2025 22:48:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A56D2280003; Tue, 25 Feb 2025 17:48:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9B828280001; Tue, 25 Feb 2025 17:48:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7E3C8280003; Tue, 25 Feb 2025 17:48:51 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 5AC9D280001 for ; Tue, 25 Feb 2025 17:48:51 -0500 (EST) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id B936C1C7660 for ; Tue, 25 Feb 2025 22:48:50 +0000 (UTC) X-FDA: 83159958420.26.2F4F8E8 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf05.hostedemail.com (Postfix) with ESMTP id 68C4F10001C for ; Tue, 25 Feb 2025 22:48:48 +0000 (UTC) Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=gWiTqVGq; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf05.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1740523728; a=rsa-sha256; cv=none; b=yc18CdhXICjfEE3xW8++H43zanGx/DByBviagTdSiLcQloI4s9ojxjtFS7ktTRyu1P6Bdg aNS1o0Dz6zTTQjPFNR+OaEMAs7zhFDnSonkDb7210PiCL+1VB3wEeuvEXUlwHmC1f5BIWY XQIlrg056H89EfE0sTODPDStnMuPFPI= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=gWiTqVGq; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf05.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1740523728; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=F9USmteYBOsJlbnz2c74ZK2ivD7zFdGWY3i53bXovAE=; b=d1YBeFgR/CxD14P50Qww54xh1RAZdfSI48Fj/6e5Kauv4wd08Papbs3CvCA3+PZbPCBJas Zbf77evjdeCxEuQP6rUiuNKZJAjOWcYa9j8Dt8aSgWu9T1zgkWLll17FRCv9XHsT1WK8D/ 5q/FKSkHwe3uBq2OZeL0Ai6JMXrcPR8= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1740523727; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=F9USmteYBOsJlbnz2c74ZK2ivD7zFdGWY3i53bXovAE=; b=gWiTqVGqjodaUGaZpAQsPcxgzjJD6f9DWPWU5f9StYxZ3fdny3xsvYrzbULo75oxtdF5WP V8pf0kcNZRGPXpFyrl/VQDgreiCKtMp7m2cNMJrCslB6FkQR2x1kPH0wJXuQoQ8jBPJjaB KPin0hZtpOdSv7ku9z15ropdFudHqbo= Received: from mail-il1-f197.google.com (mail-il1-f197.google.com [209.85.166.197]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-55-2J4fTbLOOnGZpLM3dAb8Tg-1; Tue, 25 Feb 2025 17:48:46 -0500 X-MC-Unique: 2J4fTbLOOnGZpLM3dAb8Tg-1 X-Mimecast-MFC-AGG-ID: 2J4fTbLOOnGZpLM3dAb8Tg_1740523726 Received: by mail-il1-f197.google.com with SMTP id e9e14a558f8ab-3d2b6d933baso132247465ab.3 for ; Tue, 25 Feb 2025 14:48:46 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1740523725; x=1741128525; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=F9USmteYBOsJlbnz2c74ZK2ivD7zFdGWY3i53bXovAE=; b=U1eiwWRPxa+cLnxkYaoYUDmIfCDOtFEWQU2H3m1hfm6KOXLMG2DEyCfRWntZiG9VQh SW8MjwHbAfMAr9TDR3U0nNFeX8+oKqCwgZYdmbojDVOHry6pEjC1xlnl31DCBPWHZ9NU eC8JTUuz28xXze+t4CGY7/cYggeoS/RgEQG8hAajDgkYo3nEvqX2gWIcYPpbgmbtG5du D8/xcv9sjje71haOqBDlMD6FET8URe6eGspjFIEyujCknpetrhC4AeeIK8ULhNXL+oxF kTIblDV+XqIZ1vlmWKW3ZmDjtv8ur5C4M/+tR8Jy2WYZMY9BqZGB8TIEUGwYLk002/J9 W4Fw== X-Forwarded-Encrypted: i=1; AJvYcCUMtjjw7+KIevKnpQ6Myk815qKCddkS/IzCj8IEk8sfP6yKUHBblNzDpJyRitsdTr16ymdVudxSXg==@kvack.org X-Gm-Message-State: AOJu0YwXZ/+enrnQXhASlcR/pIV9QAoNf5vNLAkFkhdG7JhVNPSSWkxz 5fBbuZReRCuxE7gqxy+B4gpa+roeghc4AXRRXBGkxvWnvjfCIK77moFL/wUKx88su/tzRkU7Cwq bOyPoC9g3AY8Ne/4MAPJuoqDxCHJdzkSnPC6SJMauzgsdRhBY X-Gm-Gg: ASbGnctN1ucrrub9h2ESAPfGycMX21p48jliuA/1hXnbgYm55M0FzXWGEpJ88S4h7mM fY0YcX7NwqwKW2JkqXeLkVKaQcW2rtMsQ1337tTRkx4x+DG77cbfK/j+9akMt+rpe+oKmP6e/OU Duyzmnw3QEKIzk4YQTIPCydv5OEhVHV2fbGHPP0lQXpURp8QjwbU73l7P+UiPQxLZ4Zr+sHxT7V Rz6TJyi4nhP7cgRaath8rZkYdq//Oyjp8VlclOPqz5Xdrc46wYHoxzeMITfpVBqsOlO+A== X-Received: by 2002:a05:6e02:1fec:b0:3d0:239a:c46a with SMTP id e9e14a558f8ab-3d2cae6c31cmr155256595ab.9.1740523725683; Tue, 25 Feb 2025 14:48:45 -0800 (PST) X-Google-Smtp-Source: AGHT+IEQ7EnFgM3wQVUi5kFJpCaifI3VyCXEtwhDnx/oGpR/znVxALQ5e6RThgXAJT1QmJArmFzIlA== X-Received: by 2002:a05:6e02:1fec:b0:3d0:239a:c46a with SMTP id e9e14a558f8ab-3d2cae6c31cmr155256395ab.9.1740523725319; Tue, 25 Feb 2025 14:48:45 -0800 (PST) Received: from x1.local ([2604:7a40:2041:2b00::1000]) by smtp.gmail.com with ESMTPSA id 8926c6da1cb9f-4f0475309c9sm609890173.118.2025.02.25.14.48.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 25 Feb 2025 14:48:44 -0800 (PST) Date: Tue, 25 Feb 2025 17:48:39 -0500 From: Peter Xu To: Suren Baghdasaryan Cc: akpm@linux-foundation.org, lokeshgidra@google.com, aarcange@redhat.com, 21cnbao@gmail.com, v-songbaohua@oppo.com, david@redhat.com, willy@infradead.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, hughd@google.com, jannh@google.com, kaleshsingh@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/1] userfaultfd: do not block on locking a large folio with raised refcount Message-ID: References: <20250225204613.2316092-1-surenb@google.com> MIME-Version: 1.0 In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: FBLOhW8qeBGoCMIq4oE3lrJ_sw5ZG7-baycUFxhrsrk_1740523726 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Queue-Id: 68C4F10001C X-Rspamd-Server: rspam12 X-Stat-Signature: agaqhttyng3cwzcuokpox8qdn99wj38z X-HE-Tag: 1740523728-278659 X-HE-Meta: U2FsdGVkX180biisiYRUm/eEzdstW32gJRQf76c4E+bZgAKy1iyKDzjsuGzx//FZHg/HiCXbNOYNX0rilkSDoMse4Spj1xpccBMRZWKdOVFM//rMw67s5pACg2vvamPL5nNMUiqIIxOEH4ZrRDiz0Whs7eXYlJjv1ZPgWEUTL+mdwzRoQ3CFlYS+0qATNaj9BKAuueuUJzML/TJ0JKGrPfvZSpeyZ/mbju0s3T49TVdXxRl4+/YsBW3di8xYDiDAxW8WzoHiPlNiQ7E/2W/1q3aAfHOfkbjTqjdiEgMX7AA8mj7vusvWIS8hmxkGjXm25C4MdMPv9EOyaH/Yj5xrJ7n2LzdBJBPnZnVJSdACr+vNq0qOLfB8JpPJ/7/Pvs5zt62mylO2MkmVyNQZQIYOI6X734xx0tkxjUTCsF4NNvhrWoH4EXDLk48UOSjT+XmhpX3PVAVmW9cK7cl6P1vH7MSWV88ro88WKZM0puc+UffXbHU8OOqHdx7VTgEvJ47sIFP0rTt89h9KkY3pQG4WN1kEV8T1Jrg7lINRW2MfbujBHqwhx7EeWLv372fhhv7AvEHwvyqob+qoDR1Ou3RYrMKbI8BYoKb7uGvFzYV5XybxXjI2gmHL7l6SNnBLl1D6wROo4n/KYi6t1eo5h/++QPZqahReFKwbH4QUkPpzb9tsz/yLXc+ZGDa3um3CkB0NXZZHMhQD47t2gyEQl0zTo2OYGVq41O1LJ69ITIABFiKPE53gJpSNRGu1UyfpyvkIKy/kfixmePfRI8rsQ8vzyLoxaE3WxQuZhyrjfpS1C25pjoJ+4hsRejUlbLYWRn1zumpGamfZXjOJzfvh4pQxrVpu0z8WztlRCjZQ/4wQXwq7RgrkrHXqKB7B0BhYur3AqlAT+y5avMRJ0zaDJ3ABqtoDF9e2EUv6t0HIho7J3wxeukAAJHIrmBjTb4lBwUc7MjLnNrDRVWqUypAxWed GguIjPve MWTCBZJJQE91Ck4isAhbvx/iqtn0dSejl/UuGqnDmsQ9VXSTdZeFGoWiop449Z5qkPDl0g1WS0icJiEbc1SvPt7iTqzoiLQiYqUEIT6BC107VrkOkf1QX2J78WU7IAGa4WWQpQoMmPUgg084NBtmwyqZfp9ulBQp7LfzIfU9fmjil8v7n+jMzkl0dnphVHVSGsno29TEV9YRBmhAW45s/pNH5rOz7ctUdtbD0KbgMl9qs/rpOF2wtdYhw6htWWrh6I1wALBXB32b/HQ7IPxU3okEqmiLVMs2i5azk+imWGK2873wN9R8YevcQ2SkCCbMn2Ao7tXJ1993hR1j03e34vzb/OBW6Y6JsPkdbQTG97uLdnEApCR/yXp2GJuA8Jl08bmUQWqt/dWJZaCD08yslsoAc/ohPOPAqGAcJ3Lv4lPSrXr8j913MJAXgvg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000009, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Feb 25, 2025 at 02:21:39PM -0800, Suren Baghdasaryan wrote: > On Tue, Feb 25, 2025 at 2:12 PM Suren Baghdasaryan wrote: > > > > On Tue, Feb 25, 2025 at 1:32 PM Peter Xu wrote: > > > > > > On Tue, Feb 25, 2025 at 12:46:13PM -0800, Suren Baghdasaryan wrote: > > > > Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock > > > > state when it goes into split_folio() with raised folio refcount. > > > > split_folio() expects the reference count to be exactly > > > > mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails with > > > > EAGAIN otherwise. If multiple processes are trying to move the same > > > > large folio, they raise the refcount (all tasks succeed in that) then > > > > one of them succeeds in locking the folio, while others will block in > > > > folio_lock() while keeping the refcount raised. The winner of this > > > > race will proceed with calling split_folio() and will fail returning > > > > EAGAIN to the caller and unlocking the folio. The next competing process > > > > will get the folio locked and will go through the same flow. In the > > > > meantime the original winner will be retried and will block in > > > > folio_lock(), getting into the queue of waiting processes only to repeat > > > > the same path. All this results in a livelock. > > > > An easy fix would be to avoid waiting for the folio lock while holding > > > > folio refcount, similar to madvise_free_huge_pmd() where folio lock is > > > > acquired before raising the folio refcount. > > > > Modify move_pages_pte() to try locking the folio first and if that fails > > > > and the folio is large then return EAGAIN without touching the folio > > > > refcount. If the folio is single-page then split_folio() is not called, > > > > so we don't have this issue. > > > > Lokesh has a reproducer [1] and I verified that this change fixes the > > > > issue. > > > > > > > > [1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock > > > > > > > > Reported-by: Lokesh Gidra > > > > Signed-off-by: Suren Baghdasaryan > > > > > > Reviewed-by: Peter Xu > > > > > > One question irrelevant of this change below.. > > > > > > > --- > > > > mm/userfaultfd.c | 17 ++++++++++++++++- > > > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > > > index 867898c4e30b..f17f8290c523 100644 > > > > --- a/mm/userfaultfd.c > > > > +++ b/mm/userfaultfd.c > > > > @@ -1236,6 +1236,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, > > > > */ > > > > if (!src_folio) { > > > > struct folio *folio; > > > > + bool locked; > > > > > > > > /* > > > > * Pin the page while holding the lock to be sure the > > > > @@ -1255,12 +1256,26 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, > > > > goto out; > > > > } > > > > > > > > + locked = folio_trylock(folio); > > > > + /* > > > > + * We avoid waiting for folio lock with a raised refcount > > > > + * for large folios because extra refcounts will result in > > > > + * split_folio() failing later and retrying. If multiple > > > > + * tasks are trying to move a large folio we can end > > > > + * livelocking. > > > > + */ > > > > + if (!locked && folio_test_large(folio)) { > > > > + spin_unlock(src_ptl); > > > > + err = -EAGAIN; > > > > + goto out; > > > > + } > > > > + > > > > folio_get(folio); > > > > src_folio = folio; > > > > src_folio_pte = orig_src_pte; > > > > spin_unlock(src_ptl); > > > > > > > > - if (!folio_trylock(src_folio)) { > > > > + if (!locked) { > > > > pte_unmap(&orig_src_pte); > > > > pte_unmap(&orig_dst_pte); > > > > > > .. just notice this. Are these problematic? I mean, orig_*_pte are stack > > > variables, afaict. I'd expect these things blow on HIGHPTE.. > > > > Ugh! Yes, I think so. From a quick look, move_pages_pte() is the only > > place we have this issue and I don't see a reason for copying src_pte > > and dst_pte values. I'll spend some more time trying to understand if > > we really need these local copies. > > Ah, we copy the values to later check if PTEs changed from under us. > But I see no reason we need to use orig_{src|dst}_pte instead of > {src|dst}_pte when doing pte_unmap(). I think we can safely replace That looks like something we just overlooked before, meanwhile it's undetectable on !HIGHPTE anyway.. in which case the addr ignored, and that turns always into an rcu unlock. > them with the original ones. WDYT? Agreed. Maybe not "the original ones" if we're looking for words to put into the commit message: it could be "we should kunmap() whatever we kmap()ed before", or something better. Thanks, -- Peter Xu