From: Suren Baghdasaryan
Date: Tue, 25 Feb 2025 14:12:30 -0800
Subject: Re: [PATCH 1/1] userfaultfd: do not block on locking a large folio with raised refcount
To: Peter Xu
Cc: akpm@linux-foundation.org, lokeshgidra@google.com, aarcange@redhat.com, 21cnbao@gmail.com, v-songbaohua@oppo.com, david@redhat.com, willy@infradead.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, hughd@google.com, jannh@google.com, kaleshsingh@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20250225204613.2316092-1-surenb@google.com>

On Tue, Feb 25, 2025 at 1:32 PM Peter Xu wrote:
>
> On Tue, Feb 25, 2025 at 12:46:13PM -0800, Suren Baghdasaryan wrote:
> > Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock
> > state when it goes into split_folio() with raised folio refcount.
> > split_folio() expects the reference count to be exactly
> > mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails with
> > EAGAIN otherwise.
> > If multiple processes are trying to move the same
> > large folio, they raise the refcount (all tasks succeed in that) then
> > one of them succeeds in locking the folio, while others will block in
> > folio_lock() while keeping the refcount raised. The winner of this
> > race will proceed with calling split_folio() and will fail returning
> > EAGAIN to the caller and unlocking the folio. The next competing process
> > will get the folio locked and will go through the same flow. In the
> > meantime the original winner will be retried and will block in
> > folio_lock(), getting into the queue of waiting processes only to repeat
> > the same path. All this results in a livelock.
> > An easy fix would be to avoid waiting for the folio lock while holding
> > folio refcount, similar to madvise_free_huge_pmd() where folio lock is
> > acquired before raising the folio refcount.
> > Modify move_pages_pte() to try locking the folio first and if that fails
> > and the folio is large then return EAGAIN without touching the folio
> > refcount. If the folio is single-page then split_folio() is not called,
> > so we don't have this issue.
> > Lokesh has a reproducer [1] and I verified that this change fixes the
> > issue.
> >
> > [1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock
> >
> > Reported-by: Lokesh Gidra
> > Signed-off-by: Suren Baghdasaryan
>
> Reviewed-by: Peter Xu
>
> One question irrelevant of this change below..
>
> > ---
> >  mm/userfaultfd.c | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 867898c4e30b..f17f8290c523 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1236,6 +1236,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                  */
> >                 if (!src_folio) {
> >                         struct folio *folio;
> > +                       bool locked;
> >
> >                         /*
> >                          * Pin the page while holding the lock to be sure the
> > @@ -1255,12 +1256,26 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                                 goto out;
> >                         }
> >
> > +                       locked = folio_trylock(folio);
> > +                       /*
> > +                        * We avoid waiting for folio lock with a raised refcount
> > +                        * for large folios because extra refcounts will result in
> > +                        * split_folio() failing later and retrying. If multiple
> > +                        * tasks are trying to move a large folio we can end
> > +                        * livelocking.
> > +                        */
> > +                       if (!locked && folio_test_large(folio)) {
> > +                               spin_unlock(src_ptl);
> > +                               err = -EAGAIN;
> > +                               goto out;
> > +                       }
> > +
> >                         folio_get(folio);
> >                         src_folio = folio;
> >                         src_folio_pte = orig_src_pte;
> >                         spin_unlock(src_ptl);
> >
> > -                       if (!folio_trylock(src_folio)) {
> > +                       if (!locked) {
> >                                 pte_unmap(&orig_src_pte);
> >                                 pte_unmap(&orig_dst_pte);
>
> .. just notice this.  Are these problematic?  I mean, orig_*_pte are stack
> variables, afaict.  I'd expect these things blow on HIGHPTE..

Ugh! Yes, I think so. From a quick look, move_pages_pte() is the only
place we have this issue and I don't see a reason for copying src_pte
and dst_pte values. I'll spend some more time trying to understand if
we really need these local copies.

>
> >                                 src_pte = dst_pte = NULL;
> >
> > base-commit: 801d47bd96ce22acd43809bc09e004679f707c39
> > --
> > 2.48.1.658.g4767266eb4-goog
> >
>
> --
> Peter Xu
>
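
For readers following the thread, the ordering change in the patch can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct toy_folio and the toy_* helpers are hypothetical stand-ins invented for illustration, not kernel APIs, and the split check simply mirrors the "mapcount + num_pages_in_folio + 1" rule quoted in the patch description.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in; NOT the kernel's struct folio. */
struct toy_folio {
	int refcount;	/* total references currently held */
	int mapcount;	/* number of mappings */
	int nr_pages;	/* 1 for a small folio, >1 for a large one */
	bool locked;	/* models the folio lock */
};

/* Non-blocking lock attempt, in the spirit of folio_trylock(). */
static bool toy_trylock(struct toy_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

static void toy_unlock(struct toy_folio *f)
{
	assert(f->locked);
	f->locked = false;
}

/* Split succeeds only when refcount == mapcount + nr_pages + 1. */
static int toy_split(const struct toy_folio *f)
{
	return f->refcount == f->mapcount + f->nr_pages + 1 ? 0 : -EAGAIN;
}

/*
 * Patched ordering: try the lock first and, for a large folio, back off
 * with -EAGAIN before taking a reference. Returns 0 when the caller holds
 * both lock and reference, and 1 when a small folio was pinned but the
 * caller still has to wait for the lock.
 */
static int toy_move_begin(struct toy_folio *f)
{
	bool locked = toy_trylock(f);

	if (!locked && f->nr_pages > 1)
		return -EAGAIN;	/* refcount left untouched */

	f->refcount++;
	return locked ? 0 : 1;
}
```

In this model a second contender on a large folio returns -EAGAIN without raising the refcount, so the lock holder's toy_split() still sees exactly one extra reference. With the old ordering each contender would have bumped the refcount before blocking on the lock, and the split check would fail for every lock holder in turn.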