From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AAE41C021B2 for ; Tue, 25 Feb 2025 23:02:06 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1D544280002; Tue, 25 Feb 2025 18:02:06 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 15E3C280003; Tue, 25 Feb 2025 18:02:06 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F1929280002; Tue, 25 Feb 2025 18:02:05 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id D1328280001 for ; Tue, 25 Feb 2025 18:02:05 -0500 (EST) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 70193A0C41 for ; Tue, 25 Feb 2025 23:02:05 +0000 (UTC) X-FDA: 83159991810.15.60C7F54 Received: from mail-qt1-f182.google.com (mail-qt1-f182.google.com [209.85.160.182]) by imf21.hostedemail.com (Postfix) with ESMTP id 8A3F31C0009 for ; Tue, 25 Feb 2025 23:02:03 +0000 (UTC) Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=ZL9i66fQ; spf=pass (imf21.hostedemail.com: domain of surenb@google.com designates 209.85.160.182 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1740524523; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Kd1Nij5gJveAu0p82tAMdHKM3qEfEzzG4Cog3zXWzXA=; b=WUGEkY4TbCQvUCrBrHqZrviYM0fFwysS3uF4vTr3uCRM4jAxs0CIaiqhzz/jFhLHY8MweX hETnnffhGPFjHW6/EHcK89s9C2CuK9rv+2o3U9DIgjSzNmoIXtqS4CKXC1S0Yg+eC+r6LC gBzARZlG0np0sFybB14QYfKSsiQ092M= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=ZL9i66fQ; spf=pass (imf21.hostedemail.com: domain of surenb@google.com designates 209.85.160.182 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1740524523; a=rsa-sha256; cv=none; b=2BOm5B619E6LXOS1cZNArnkoew8bg+JeWX+4eiWW+RQRImzxlVJyggq002Bdj7NgW0Y9Co TkJPWa8nXCgyAwXN1kQUrPRyHRi+eGVWwSo2YNNGRzXAStU9q1clqx90zhTOLkR0irL0VQ Bz9GJtDQWZi0MH+3LFjcjVsA8RndbRk= Received: by mail-qt1-f182.google.com with SMTP id d75a77b69052e-471fbfe8b89so123081cf.0 for ; Tue, 25 Feb 2025 15:02:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1740524523; x=1741129323; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Kd1Nij5gJveAu0p82tAMdHKM3qEfEzzG4Cog3zXWzXA=; b=ZL9i66fQKM8hCrwsKtmBbbN+gsoDcqjGYIG8Rb6WMxjKDn+frnbP9QMGxD9lF9UFVC KW1ImvuqHv1Lpyw37N5SDx9x3NFrNmCu7JLPBb//txW0MQwhl/UQ8rKbvxbtKnBocux5 HkM9w1gRAWOIzIBKqs2rEXLn3WXcrJm2SEtgTdrHRv84clTTwYQBos7JEtmVujsS8MkQ HZ5cUduaG7FnSXTmFvAWF2H8Uqgpmy+NHmUgOQViEKcSjcmUGERQAM7b4338NbwgssHu uakdTYwEMzVDDGOCAnS2V7jRdNOGw7TvZ+ynZYFKcaqOP4z+CiqX4SOfYrB/2lk5dczL TliQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1740524523; x=1741129323; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Kd1Nij5gJveAu0p82tAMdHKM3qEfEzzG4Cog3zXWzXA=; b=cafzwAYM2zwycuFTyxhQ8zXt9QzlqfjqSJnukBOHo3DW/zRIoFpvW6YMPSWiTcBnty H1rUg152wvv1XJSOSG6ICyjs9SWIM1q/PEsKDqebJ3PihBt/W7g4XSiCXCQriKX8sXvQ wJMYulSgOImTELK27gay3/qex3O6PrERruZ63ruM9fnwT4oFzzX4ONqncJg+CL9ZxBMT tKAkGUNgtQLzNsuKMUN61IBOK4tmVAQBZFQazh93L/pTibfCLDH1/kfNShVt1DSjtYaW Jd8Jp3Saw4FJZFRWPtRhc442dIQ3kfCuSRH1Hyo7MYJQEdUjBqE63RMOIs0CJWeJ+591 iBow== X-Forwarded-Encrypted: i=1; AJvYcCUGH3+CllEhHnocrTFSApjnO0DdBqHkIDNg7seuq3lUMZ+i+P1tQd1ZYAoZKpkBeumYuo/FDf7Icw==@kvack.org X-Gm-Message-State: AOJu0Yx0HBMXnR9psC1uofjYuDK5+zBShD7govaCgcHsrOoO44j9ZtsZ yCh38iG0Drp1NC6YYlVaA6IhrIGLzizAxTWAJrSqkguSwsQHQ0kn2BXX2cfr9jPNj5dZWQh79xg fxnHrLM2brfX5JYu5FaqsncwHkJ1fovTPilPh X-Gm-Gg: ASbGncvx0QvOC8AUkbJmPLNTgXu7MdaCULwMEWVyOabEggUbjkKylvklSUJNYsdBpxO I8JPSEXhDm70+Vs/d7Pid372n5Zwxh4pMkmXZYFwz4DuS1ym7eaboMdrlypLHWnVzNhCbRgWdiW TExq5ZKXk= X-Google-Smtp-Source: AGHT+IHNcfZONBUsPiWpz7/dLowm3QuSIStBlzrnCQaqMToc2xzjWKryrZw7PgO4lZLHv3ILKXVJghpiQDpmbyaVgKQ= X-Received: by 2002:a05:622a:148c:b0:466:8887:6751 with SMTP id d75a77b69052e-47376fd259emr6843411cf.23.1740524522204; Tue, 25 Feb 2025 15:02:02 -0800 (PST) MIME-Version: 1.0 References: <20250225204613.2316092-1-surenb@google.com> In-Reply-To: From: Suren Baghdasaryan Date: Tue, 25 Feb 2025 15:01:51 -0800 X-Gm-Features: AWEUYZn2vSga4IwLqX-ZbjEm0R09IHHwmNRv7QsIRiRNt58lbeTeDFxK9a6YzoQ Message-ID: Subject: Re: [PATCH 1/1] userfaultfd: do not block on locking a large folio with raised refcount To: Peter Xu Cc: akpm@linux-foundation.org, lokeshgidra@google.com, aarcange@redhat.com, 21cnbao@gmail.com, v-songbaohua@oppo.com, david@redhat.com, willy@infradead.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, hughd@google.com, jannh@google.com, kaleshsingh@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Queue-Id: 8A3F31C0009 X-Stat-Signature: 6moyzmtnonjgwxr959diahrmhunokwrr X-Rspamd-Server: rspam03 X-HE-Tag: 1740524523-219610 X-HE-Meta: U2FsdGVkX18VLVA0lVDfYSN8lnHZ5VqsznSWnm8TP++mYqyEN8kkkGENMyu6QCFGg+fwgvw000jdmZ0xSY8P6xHVmAGS+T/yRGVpXtZjj1E/wBAFgTHCs3yk+A/u0pgppPNtjHZum3UiPCT7gfy/HX8qnf79WDdcSn9QIo3E04E/x419EPvOSOhhaFggPbbqoYYHdY2AEaJ2sgKdYLldlBUi4Hs8A+HZB4G2k0f6Qe0r2BnvNULoUhfUNf71w8KtcmyaedMyUTzNDSsCDPkD6eh0Tt+yPBRjkVPTkjh2EhMSRD16s5tU3gQu/8SDjOLDuUq0MgF+c203dT14rWg0kUNKEXbPf9djYo8UKMp2NnhmPBaSZIoIfizJZmGqn92EPqXLVzIGleQxItGNyhIHHHU/VO276ccAlEHh6kO6tcaPyny2GxPF7BHXuSs2JVAtbU5BqWFrLZSxR357hZEXqupXh8hT9eodfQSqMnU8pPsSXFFwOUFkjHQTwRI2MI+rikl7Y8KDFznDukyQziqXdLkTr4QsGPzjbel1tmEtrjrYgmnpNYYRrWMl8eBhqCBjUZKXdnmEVKtuGGA34gHccf7K7IASjW640qZqBg+SdE26uSHIQJvh33mkmFi5NtS9lIN/gcDFCDnOVKwvx7s+uICrnGuHbCcTHGeQViX6smyz1rbfiUGLYSeDBDJUcDnydusbz7aCgb0G6PxWHmZe/wDiR99mz8R/3fM1Kr/LlDwUrO5LO9DLRnCQwJr2ihhaaoOioXzJsBxic+WpAcNSKUfUpV/xMLzsvT5sh8Y8RpFkWg/Q+Sx9qX/qp9roK5Cq57JRGeb2K/gYJCmzBIqw5AZ8X22Te2Xbo6NIaknEYk8e+T8LHx4nWJKc6XIMmbgQ6Yp1qg7rEL/HH6cKmpZuIwdLVnTMP/j68hlcuw4flcU2fAqH804TVJt9S/V7juTPtIugGRPV96gDftzFDVx /n58HPy9 HURMZMoprqgQMLwmvipuGIJ9jM6Bnfoip/O+Kdvp3soqLVPwx11yiQYK6/ruO91IG8uuY4U/LaZB49VvstYLFpZgFtdVx9QjYNJddDZYzframYAY82jMgT38LO+u6AlCy6FeuYlLReIHG57/Y4zl0MNU8rGTAw/M+NstasriVcjQCPWPvzojvBKwiYUKArtb+R0BSMmk6IgHrwhJxkAtVNQs7L/91CjYWLaJYD+uxpfPBL/d2x6qi2fYBdA/PTfG8jYiPLX/MCz5Wnmo= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Feb 25, 2025 at 2:48=E2=80=AFPM Peter Xu wrote: > > On Tue, Feb 25, 2025 at 02:21:39PM -0800, Suren Baghdasaryan wrote: > > On Tue, Feb 25, 2025 at 2:12=E2=80=AFPM Suren Baghdasaryan wrote: > > > > > > On Tue, Feb 25, 2025 at 1:32=E2=80=AFPM Peter Xu = wrote: > > > > > > > > On Tue, Feb 25, 2025 at 12:46:13PM -0800, Suren Baghdasaryan wrote: > > > > > Lokesh recently raised an issue about UFFDIO_MOVE getting into a = deadlock > > > > > state when it goes into split_folio() with raised folio refcount. > > > > > split_folio() expects the reference count to be exactly > > > > > mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fai= ls with > > > > > EAGAIN otherwise. If multiple processes are trying to move the sa= me > > > > > large folio, they raise the refcount (all tasks succeed in that) = then > > > > > one of them succeeds in locking the folio, while others will bloc= k in > > > > > folio_lock() while keeping the refcount raised. The winner of thi= s > > > > > race will proceed with calling split_folio() and will fail return= ing > > > > > EAGAIN to the caller and unlocking the folio. The next competing = process > > > > > will get the folio locked and will go through the same flow. In t= he > > > > > meantime the original winner will be retried and will block in > > > > > folio_lock(), getting into the queue of waiting processes only to= repeat > > > > > the same path. All this results in a livelock. > > > > > An easy fix would be to avoid waiting for the folio lock while ho= lding > > > > > folio refcount, similar to madvise_free_huge_pmd() where folio lo= ck is > > > > > acquired before raising the folio refcount. > > > > > Modify move_pages_pte() to try locking the folio first and if tha= t fails > > > > > and the folio is large then return EAGAIN without touching the fo= lio > > > > > refcount. If the folio is single-page then split_folio() is not c= alled, > > > > > so we don't have this issue. > > > > > Lokesh has a reproducer [1] and I verified that this change fixes= the > > > > > issue. > > > > > > > > > > [1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock > > > > > > > > > > Reported-by: Lokesh Gidra > > > > > Signed-off-by: Suren Baghdasaryan > > > > > > > > Reviewed-by: Peter Xu > > > > > > > > One question irrelevant of this change below.. > > > > > > > > > --- > > > > > mm/userfaultfd.c | 17 ++++++++++++++++- > > > > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > > > > index 867898c4e30b..f17f8290c523 100644 > > > > > --- a/mm/userfaultfd.c > > > > > +++ b/mm/userfaultfd.c > > > > > @@ -1236,6 +1236,7 @@ static int move_pages_pte(struct mm_struct = *mm, pmd_t *dst_pmd, pmd_t *src_pmd, > > > > > */ > > > > > if (!src_folio) { > > > > > struct folio *folio; > > > > > + bool locked; > > > > > > > > > > /* > > > > > * Pin the page while holding the lock to b= e sure the > > > > > @@ -1255,12 +1256,26 @@ static int move_pages_pte(struct mm_struc= t *mm, pmd_t *dst_pmd, pmd_t *src_pmd, > > > > > goto out; > > > > > } > > > > > > > > > > + locked =3D folio_trylock(folio); > > > > > + /* > > > > > + * We avoid waiting for folio lock with a r= aised refcount > > > > > + * for large folios because extra refcounts= will result in > > > > > + * split_folio() failing later and retrying= . If multiple > > > > > + * tasks are trying to move a large folio w= e can end > > > > > + * livelocking. > > > > > + */ > > > > > + if (!locked && folio_test_large(folio)) { > > > > > + spin_unlock(src_ptl); > > > > > + err =3D -EAGAIN; > > > > > + goto out; > > > > > + } > > > > > + > > > > > folio_get(folio); > > > > > src_folio =3D folio; > > > > > src_folio_pte =3D orig_src_pte; > > > > > spin_unlock(src_ptl); > > > > > > > > > > - if (!folio_trylock(src_folio)) { > > > > > + if (!locked) { > > > > > pte_unmap(&orig_src_pte); > > > > > pte_unmap(&orig_dst_pte); > > > > > > > > .. just notice this. Are these problematic? I mean, orig_*_pte ar= e stack > > > > variables, afaict. I'd expect these things blow on HIGHPTE.. > > > > > > Ugh! Yes, I think so. From a quick look, move_pages_pte() is the only > > > place we have this issue and I don't see a reason for copying src_pte > > > and dst_pte values. I'll spend some more time trying to understand if > > > we really need these local copies. > > > > Ah, we copy the values to later check if PTEs changed from under us. > > But I see no reason we need to use orig_{src|dst}_pte instead of > > {src|dst}_pte when doing pte_unmap(). I think we can safely replace > > That looks like something we just overlooked before, meanwhile it's > undetectable on !HIGHPTE anyway.. in which case the addr ignored, and tha= t > turns always into an rcu unlock. > > > them with the original ones. WDYT? > > Agreed. Maybe not "the original ones" if we're looking for words to put > into the commit message: it could be "we should kunmap() whatever we > kmap()ed before", or something better. Sounds good to me. I'll give others one day to provide their input and will post a fix. Thanks! > > Thanks, > > -- > Peter Xu >