Date: Wed, 16 Feb 2022 16:27:07 +0800
From: Peter Xu
To: Nadav Amit
Cc: Mike Rapoport, David Hildenbrand, Andrea Arcangeli, Linux-MM
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
>
>
> > On Feb 13, 2022, at 8:02 PM, Peter Xu wrote:
> >
> > Thanks for explaining.
> >
> > I also digged out the discussion threads between you and Mike and that's a good
> > one too summarizing the problems:
> >
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> >
> > Scenario 4 is kind of special imho along all those, because that's the only one
> > that can be workarounded by user application by only copying pages one by one.
> > I know you were even leveraging iouring in your local tree, so that's probably
> > not a solution at all for you. But I'm just trying to start thinking without
> > that scenario for now.
> >
> > Per my understanding, a major issue regarding the rest of the scenarios is
> > ordering of uffd messages may not match with how things are happening. This
> > actually contains two problems.
> >
> > First of all, mmap_sem is mostly held read for all page faults and most of the
> > mm changes except e.g.
> > fork, then we can never serialize them.  Not to mention
> > uffd events releases mmap_sem within prep and completion.  Let's call it
> > problem 1.
> >
> > The other problem 2 is we can never serialize faults against events.
> >
> > For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> > scenario. Say, we grant concurrent with most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency.  I'm wondering whether it means uffd may need its own semaphore to
> > achieve this.  So for all events that uffd cares we take write lock on a new
> > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> > events, not until completion (the message is read).  It'll slow down uffd
> > tracked systems but guarantees ordering.
>
> Peter,
>
> Thanks for finding the time and looking into the issues that I encountered.
>
> Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).

Right, it's unsafe if it is taken after mmap_sem.  To do this, IIUC, we need to
take it before mmap_sem, so that we can release mmap_sem under it.

In my mind that could be a feature bit, UFFD_FEATURE_STRICT_ORDERING; when it's
set, the mm bound to the userfaultfd file gets a flag set in mm->flags, let's
say MMF_UFFD_STRICT_ORDER.

Then for uffd-related syscalls like fork(), mremap() and so on we conditionally
take that uffd_sem, and we need to do that before taking mmap_sem.  We take it
for write in all the uffd event contexts, and for read in all the uffd page
faults.

But even if the above would work, I have little confidence that it'll work in
reality.
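
A toy, single-threaded C model of the ordering described above may help make
it concrete.  It is only a sketch under assumptions: uffd_sem,
MMF_UFFD_STRICT_ORDER and UFFD_FEATURE_STRICT_ORDERING are the names proposed
in this mail and do not exist in the kernel; struct rwsem_model, struct
mm_model and every helper below are made up for illustration; and real
rw_semaphores block, whereas the try-locks here return false so the ordering
can be checked without threads.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reader/writer lock: try-lock only, meant for a single-threaded model. */
struct rwsem_model {
	int readers;
	bool writer;
};

bool down_read_try(struct rwsem_model *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

void up_read_model(struct rwsem_model *s)
{
	assert(s->readers > 0);
	s->readers--;
}

bool down_write_try(struct rwsem_model *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

void up_write_model(struct rwsem_model *s)
{
	assert(s->writer);
	s->writer = false;
}

struct mm_model {
	bool strict_order;           /* stands in for the proposed MMF_UFFD_STRICT_ORDER */
	struct rwsem_model uffd_sem; /* proposed lock, always taken before mmap_sem */
	struct rwsem_model mmap_sem;
};

/* Event side (fork(), mremap(), ...): prepare and queue the event message. */
bool uffd_event_prep(struct mm_model *mm)
{
	if (mm->strict_order && !down_write_try(&mm->uffd_sem))
		return false;
	/* Simplification: the model always takes mmap_sem for write here. */
	if (!down_write_try(&mm->mmap_sem)) {
		if (mm->strict_order)
			up_write_model(&mm->uffd_sem);
		return false;
	}
	/* ... change the address space and queue the event message ... */
	up_write_model(&mm->mmap_sem); /* mmap_sem is dropped under uffd_sem */
	return true;                   /* uffd_sem stays held until completion */
}

/* Called once the monitor has read the event message. */
void uffd_event_complete(struct mm_model *mm)
{
	if (mm->strict_order)
		up_write_model(&mm->uffd_sem);
}

/* Fault side: returns true if the fault could be queued right now. */
bool uffd_fault_try(struct mm_model *mm)
{
	bool ok;

	if (mm->strict_order && !down_read_try(&mm->uffd_sem))
		return false;
	ok = down_read_try(&mm->mmap_sem);
	if (ok) {
		/* ... queue the page fault message ... */
		up_read_model(&mm->mmap_sem);
	}
	if (mm->strict_order)
		up_read_model(&mm->uffd_sem);
	return ok;
}
```

With strict_order set, uffd_fault_try() fails between uffd_event_prep() and
uffd_event_complete(), so a fault cannot be queued ahead of an unread event.
With it clear, the same call succeeds as soon as mmap_sem is dropped, which is
the reordering window discussed in this thread.
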
Firstly, it already looks odd that a uffd lock needs to be taken before the
whole mm's lock, which starts to affect common workloads even when they don't
use uffd (even the flag lookup could affect the cacheline, I think, though I'm
not sure how much slower it would be).  Not to mention that it should greatly
slow down the tracee process.

It definitely needs more thought anyway.

>
> >
> > At the meantime, I'm wildly thinking whether we can tackle with the other
> > problem by merging the page fault queue with the event queue, aka, event_wqh
> > and fault_pending_wqh.  Obviously we'll need to identify the messages when
> > read() and conditionally move then into fault_wqh only if they come from page
> > faults, but that seems doable?
>
> This, I guess is necessary in addition to your aforementioned proposal to have
> some semaphore protecting, can do the trick.
>
> While I got your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)

Heh.

>
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
>
> I think it is not too hard to address by changing the API. For instance, if
> uffd-ctx had a uffd-generation that would increase on each event, the user
> could have provided an ioctl-generation as part of copy/zero/etc ioctls, and
> the kernel would only fail the operation if ioctl copy/zero/etc operation
> only succeeds if the uffd-generation is lower/equal than the one provided by
> the user.

Assuming that gen_id is copied over from the uffd message, and that the counter
only increases, I don't understand how it could be lower than the one the user
provided.

I don't quite get how that solves your problem either, since -EAGAIN can still
trigger.  I must have missed something.

>
> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let’s put
> aside for a second why). Tracking these events can be done with ptrace or
> perf_event_open() but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.
>
> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems as a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support as adding it in a later stage can be
> tricky.

2) and 3) definitely need more thought.

PS: I think I first read your name in a paper on nested virt. :-)  But I
forgot the details.

>
> 4. Missing state. It would be useful to provide the TID of the faulting
> thread. I will send a patch for this one once I get the necessary
> internal approvals.

Before I fully digest your reply and the problems, I want to make sure you are
aware of UFFD_FEATURE_THREAD_ID.  I don't know how you missed it, but it does
sound like what you wanted.

Thanks,

-- 
Peter Xu
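
For reference, a minimal sketch of how a monitor might use
UFFD_FEATURE_THREAD_ID, as mentioned above.  The structures and constants come
from <linux/userfaultfd.h>; the two helper functions, uffd_api_request_tid()
and uffd_fault_tid(), are hypothetical names introduced only for this
illustration, and the sketch assumes the monitor remembers which features it
requested.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/userfaultfd.h>

/*
 * Fill in a UFFDIO_API request that asks the kernel to report the
 * faulting thread's ID in page fault messages.
 */
void uffd_api_request_tid(struct uffdio_api *api)
{
	memset(api, 0, sizeof(*api));
	api->api = UFFD_API;
	api->features = UFFD_FEATURE_THREAD_ID;
}

/*
 * Return the faulting thread ID carried by a page fault message, or -1
 * if the message is not a page fault or the feature was not requested.
 * requested_features is whatever the monitor passed in a successful
 * UFFDIO_API call; keeping track of it is the caller's job.
 */
int64_t uffd_fault_tid(const struct uffd_msg *msg, uint64_t requested_features)
{
	if (msg->event != UFFD_EVENT_PAGEFAULT)
		return -1;
	if (!(requested_features & UFFD_FEATURE_THREAD_ID))
		return -1;
	return msg->arg.pagefault.feat.ptid;
}
```

In a real monitor the request would be passed to ioctl(uffd, UFFDIO_API, &api)
on a descriptor obtained from userfaultfd(2), and each struct uffd_msg
returned by read() on that descriptor would be handed to uffd_fault_tid().
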