Date: Wed, 16 Feb 2022 16:27:07 +0800
From: Peter Xu <peterx@redhat.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>, David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
Message-ID: <Ygy1Ww7HMAlxP7ea@xz-m1.local>

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
>
>
> > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote:
> >
> > Thanks for explaining.
> >
> > I also digged out the discussion threads between you and Mike and that's a good
> > one too summarizing the problems:
> >
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> >
> > Scenario 4 is kind of special imho along all those, because that's the only one
> > that can be workarounded by user application by only copying pages one by one.
> > I know you were even leveraging iouring in your local tree, so that's probably
> > not a solution at all for you. But I'm just trying to start thinking without
> > that scenario for now.
> >
> > Per my understanding, a major issue regarding the rest of the scenarios is
> > ordering of uffd messages may not match with how things are happening.  This
> > actually contains two problems.
> >
> > First of all, mmap_sem is mostly held read for all page faults and most of the
> > mm changes except e.g. fork, then we can never serialize them.  Not to mention
> > uffd events releases mmap_sem within prep and completion.  Let's call it
> > problem 1.
> >
> > The other problem 2 is we can never serialize faults against events.
> >
> > For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> > scenario. Say, we grant concurrent with most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency.  I'm wondering whether it means uffd may need its own semaphore to
> > achieve this.  So for all events that uffd cares we take write lock on a new
> > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> > events, not until completion (the message is read).  It'll slow down uffd
> > tracked systems but guarantees ordering.
>
> Peter,
>
> Thanks for finding the time and looking into the issues that I encountered.
>
> Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).

Right, it's unsafe to take it after mmap_sem.  IIUC, to make this work we
would need to take it before mmap_sem, so that we can release mmap_sem while
still holding it.

In my mind that could be a new feature bit, say UFFD_FEATURE_STRICT_ORDERING.
When it's set, the mm bound to the userfaultfd file gets a flag set in
mm->flags, let's say MMF_UFFD_STRICT_ORDER.

Then for uffd-related syscalls like fork(), mremap() and so on, we
conditionally take that uffd_sem, and we need to do that before mmap_sem.  We
take it for write in all the uffd event contexts, and for read in all the uffd
page faults.
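
To make the intended ordering concrete, here is a rough single-threaded toy
model (not kernel code).  The toy_rwsem type and all function names are made
up for illustration; uffd_sem and MMF_UFFD_STRICT_ORDER are only the
hypothetical names from above.  The "locks" just track state and never block:

```c
#include <stdbool.h>

/* Toy reader/writer lock: records state only, never blocks. */
struct toy_rwsem {
    int readers;
    bool writer;
};

static bool toy_try_read(struct toy_rwsem *s)
{
    if (s->writer)
        return false;
    s->readers++;
    return true;
}

static bool toy_try_write(struct toy_rwsem *s)
{
    if (s->writer || s->readers > 0)
        return false;
    s->writer = true;
    return true;
}

static void toy_up_read(struct toy_rwsem *s)  { s->readers--; }
static void toy_up_write(struct toy_rwsem *s) { s->writer = false; }

struct mm_model {
    struct toy_rwsem uffd_sem;  /* hypothetical outer lock */
    struct toy_rwsem mmap_sem;  /* inner lock */
    bool strict_order;          /* models MMF_UFFD_STRICT_ORDER */
};

/* Page fault path: uffd_sem for read first, then mmap_sem for read. */
static bool fault_begin(struct mm_model *mm)
{
    if (mm->strict_order && !toy_try_read(&mm->uffd_sem))
        return false;
    if (!toy_try_read(&mm->mmap_sem)) {
        if (mm->strict_order)
            toy_up_read(&mm->uffd_sem);
        return false;
    }
    return true;
}

static void fault_end(struct mm_model *mm)
{
    toy_up_read(&mm->mmap_sem);
    if (mm->strict_order)
        toy_up_read(&mm->uffd_sem);
}

/*
 * Event path (fork, mremap, ...): uffd_sem for write first, then mmap_sem
 * (write mode here for simplicity).  mmap_sem is dropped once the mm change
 * is done, but uffd_sem stays held until the message has been read.
 */
static bool event_prep(struct mm_model *mm)
{
    if (mm->strict_order && !toy_try_write(&mm->uffd_sem))
        return false;
    if (!toy_try_write(&mm->mmap_sem)) {
        if (mm->strict_order)
            toy_up_write(&mm->uffd_sem);
        return false;
    }
    /* ... the mm change itself would happen here ... */
    toy_up_write(&mm->mmap_sem);
    return true;
}

/* Called once the monitor has read the event message. */
static void event_complete(struct mm_model *mm)
{
    if (mm->strict_order)
        toy_up_write(&mm->uffd_sem);
}
```

With strict_order set, a fault attempted between event_prep() and
event_complete() is refused even though mmap_sem is already free, which is
the serialization being discussed.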

But even if the above could work in theory, I have little confidence that
it'll work in reality.  Firstly, it already looks odd that a uffd lock needs
to be taken before the whole mm's lock, which starts to affect common
workloads that don't even use uffd (even the flag lookup could touch another
cacheline, I think, but I'm not sure how much slower it would be).  Not to
mention that it should greatly slow down the tracee process.  It definitely
needs more thought anyway.

>
> >
> > At the meantime, I'm wildly thinking whether we can tackle with the other
> > problem by merging the page fault queue with the event queue, aka, event_wqh
> > and fault_pending_wqh.  Obviously we'll need to identify the messages when
> > read() and conditionally move then into fault_wqh only if they come from page
> > faults, but that seems doable?
>
> This, I guess is necessary in addition to your aforementioned proposal to have
> some semaphore protecting, can do the trick.
>
> While I got your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)

Heh.

>
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
>
> I think it is not too hard to address by changing the API. For instance, if
> uffd-ctx had a uffd-generation that would increase on each event, the user
> could have provided an ioctl-generation as part of copy/zero/etc ioctls, and
> the kernel would only fail the operation if ioctl copy/zero/etc operation
> only succeeds if the uffd-generation is lower/equal than the one provided by
> the user.

Assuming that gen_id is copied over from the uffd message, and that the
counter only ever increases, I don't understand how it could be lower than the
one the user provided.

I also don't quite get how that solves your problem, since -EAGAIN can still
trigger.  I must have missed something.
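
For reference, this is how I read the quoted proposal, as a toy sketch in
plain C.  Nothing here exists in the uffd API; struct uffd_ctx_model, its
generation field and both functions are hypothetical, and the comparison is
only my interpretation of "lower/equal than the one provided by the user":

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a uffd context carrying a generation counter. */
struct uffd_ctx_model {
    uint64_t generation;
};

/* Every event (fork, mremap, ...) bumps the generation. */
static void model_event(struct uffd_ctx_model *ctx)
{
    ctx->generation++;
}

/*
 * Model of a copy/zero ioctl that carries the generation the monitor last
 * observed.  The operation goes through only if the context generation is
 * lower than or equal to the user's value, i.e. no event happened since.
 */
static int model_ioctl(const struct uffd_ctx_model *ctx, uint64_t user_gen)
{
    return ctx->generation <= user_gen ? 0 : -EAGAIN;
}
```

If that reading is right, an ioctl issued with a stale generation still gets
-EAGAIN after an event, which is the part I can't square with the problem
being solved.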

>
> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let’s put
> aside for a second why). Tracking these events can be done with ptrace or
> perf_event_open() but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.
>
> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems as a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support as adding it in a later stage can be
> tricky.

2) and 3) definitely need more thought.

PS: I think I first came across your name in a paper on nested virt. :-)  But
I forgot the details.

>
> 4. Missing state. It would be useful to provide the TID of the faulting
> thread. I will send a patch for this one once I get the necessary
> internal approvals.

Before I fully digest your reply and the problems, I want to make sure you are
aware of UFFD_FEATURE_THREAD_ID.  I don't know how you missed it, but it does
sound like what you wanted.

Thanks,

-- 
Peter Xu