Date: Mon, 14 Feb 2022 12:02:30 +0800
From: Peter Xu
To: Nadav Amit
Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
References: <11831b20-0b46-92df-885a-1220430f9257@redhat.com>
 <63a8a665-4431-a13c-c320-1b46e5f62005@redhat.com>

Nadav,

On Thu, Feb 10, 2022 at 10:42:43AM -0800, Nadav Amit wrote:
> I think that the MADV_DONTNEED/PF-resolution “race” only affects usage models
> that handle the page fault concurrently with UFFD monitoring (using multiple
> monitor threads or IO-uring as I try to do), at least for use cases such as
> live migration.
>
> I think the scenario you have in mind is the following, which is resolved
> with mmap_changing that Mike introduced some time ago:
>
>  UFFD monitor           App thread #0            App thread #1
>  ------------           -------------            -------------
>                                                  #PF
>  UFFD Read
>   [#PF]
>                         MADV_DONTNEED
>                          mmap_changing = 1
>
>                          userfaultfd_event_wait_completion()
>                          [queue event, wait]
>  UFFD-copy
>   -EAGAIN since mmap_changing > 0
>
> mmap_changing will keep being elevated, and UFFD-copy not served (it fails)
> until the monitor reads the UFFD event. The monitor, in this scenario, is
> single threaded and therefore orders UFFD-read and UFFD-copy, preventing them
> from racing.
>
> Assuming the monitor is smart enough to reevaluate the course of action after
> MADV_DONTNEED is handled, it should be safe. Personally, I do not like the
> burden that this scheme puts on the monitor, the fact that it needs to retry,
> or even the return value [I think it should be EBUSY, since an immediate retry
> would fail. With IO-uring, EAGAIN triggers an immediate retry, which is
> useless.] Yet, concurrent UFFD-event/#PF can be handled properly by a smart
> monitor.
>
> *However*, userfaultfd events seem very hard to use (to say the least) in
> the following cases:
>
> 1. The UFFD-copy is issued by one thread and the UFFD-read is performed by
>    another. For me this is the most painful, even if you may consider it
>    “unorthodox”. It is very useful for performance, especially if the
>    UFFD-copy is large.

This is definitely a valid use case for uffd, and IMHO that's a good base
model when the uffd app is performance critical.
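For concreteness, here is roughly what that mmap_changing retry scheme looks
like from the monitor side. This is only a sketch: uffd_copy_retry() and
drain_one_event() are made-up names, not code from any real monitor, and it
assumes a blocking uffd where -EAGAIN only ever means "an event is queued":

#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: consume one queued event so that mmap_changing can
 * drop back once the monitor has seen the change. */
static int drain_one_event(int uffd)
{
        struct uffd_msg msg;

        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                return -1;
        /* A real monitor would dispatch on msg.event here (e.g.
         * UFFD_EVENT_REMOVE for MADV_DONTNEED) and re-evaluate whether the
         * pending copy still makes sense. */
        return 0;
}

static int uffd_copy_retry(int uffd, __u64 dst, __u64 src, __u64 len)
{
        struct uffdio_copy copy;

        for (;;) {
                memset(&copy, 0, sizeof(copy));
                copy.dst = dst;
                copy.src = src;
                copy.len = len;
                copy.mode = 0;

                if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
                        return 0;
                if (errno != EAGAIN)
                        return -errno;

                /* mmap_changing is elevated: an event is queued and the copy
                 * was refused.  Read the event, then retry. */
                if (drain_one_event(uffd) < 0)
                        return -1;
        }
}

This is also exactly where the EAGAIN-versus-EBUSY point above bites: with
io_uring driving the UFFDIO_COPY, the automatic immediate retry happens before
any event has been read, so it cannot succeed.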

> 2. If the race is between two userfaultfd *events*. The events might not be
>    properly ordered (i.e., the order in which they are read does not reflect
>    the order in which they occurred) despite the use of *_userfaultfd_prep(),
>    since they are only queued (to be reported and trigger wake) by
>    userfaultfd_event_wait_completion(), after the VMA and PTEs were updated
>    and, more importantly, after the mmap-lock was dropped.
>
>    This means that if you have fork and MADV_DONTNEED, the monitor might see
>    their order inverted, and won't be able to know whether the child has the
>    pages zapped or not.
>
>    Other races are possible too, for instance between mremap() and munmap().
>    In most cases the monitor might be able (with quite some work) to figure
>    out that the order of the events it received does not make sense and that
>    the events must have been reordered. Yet, implementing something like that
>    is far from trivial, and there are some cases that are probably impossible
>    to resolve based on the UFFD read events alone.
>
>    I personally addressed this issue with seccomp+ptrace to trap on
>    entry/exit to the relevant syscalls (e.g., munmap, mmap, fork) and prevent
>    concurrent calls, so as to obtain the correct order of the events. It is
>    far from trivial and introduces some overhead, but I did not find a better
>    solution.

Thanks for explaining.

I also dug out the discussion thread between you and Mike, which is a good one
too for summarizing the problems:

https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/

Scenario 4 is kind of special, IMHO, among all of those, because it is the only
one that the application can work around by copying pages one by one. I know
you were even leveraging io_uring in your local tree, so that is probably not a
solution for you at all, but I am just trying to start thinking without that
scenario for now.

Per my understanding, the major issue in the rest of the scenarios is that the
ordering of uffd messages may not match the order in which things actually
happened. This actually contains two problems.

First of all, mmap_sem is mostly taken for read for all page faults and for
most of the mm changes (except e.g. fork), so we can never serialize them with
it. Not to mention that uffd events release mmap_sem between prep and
completion. Let's call this problem 1.

The other one, problem 2, is that we can never serialize faults against events.

For problem 1, I do sense that mmap_sem is just not suitable for the uffd
scenario. Say, we currently allow concurrency for most of the events like
DONTNEED and mremap, but when uffd ordering is a concern we may not want to
grant that concurrency. I am wondering whether that means uffd needs its own
semaphore to achieve this: for all the events uffd cares about, we take the
write lock on a new uffd_sem after mmap_sem, and we do not release that
uffd_sem after the prep of the event but only at completion (when the message
is read). It will slow down uffd-tracked processes, but it guarantees the
ordering.

Meanwhile, I am wildly wondering whether we can tackle the other problem by
merging the page fault queue with the event queue, aka event_wqh and
fault_pending_wqh. Obviously we will need to identify the messages in read()
and conditionally move them into fault_wqh only if they came from page faults,
but that seems doable?

Not sure whether the above makes any sense, as I could have missed something.
Meanwhile, I think that even if we order all the messages to match the facts,
there are still some other issues that are outliers of this, but let's see how
it sounds so far.

Thanks,

-- 
Peter Xu
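
P.S. Whatever happens on the kernel side, userspace already demultiplexes
faults and events coming out of the same read() via msg.event, so merging
event_wqh and fault_pending_wqh presumably would not need a new message
format. Below is a rough sketch of such a dispatch loop, just to show the
shape; it assumes the uffd was opened with the relevant UFFD_FEATURE_EVENT_*
features, and the fprintf() calls stand in for real handlers:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Rough sketch: both page-fault messages and non-fault events arrive as
 * struct uffd_msg from the same descriptor, distinguished by msg.event.
 * A real monitor would resolve faults with UFFDIO_COPY/UFFDIO_ZEROPAGE and
 * update its own bookkeeping for the events. */
static void monitor_loop(int uffd)
{
        struct pollfd pfd = { .fd = uffd, .events = POLLIN };
        struct uffd_msg msg;

        for (;;) {
                if (poll(&pfd, 1, -1) <= 0)
                        continue;
                if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                        continue;

                switch (msg.event) {
                case UFFD_EVENT_PAGEFAULT:
                        fprintf(stderr, "fault at %llx flags %llx\n",
                                (unsigned long long)msg.arg.pagefault.address,
                                (unsigned long long)msg.arg.pagefault.flags);
                        break;
                case UFFD_EVENT_FORK:
                        /* msg.arg.fork.ufd is the uffd of the child. */
                        fprintf(stderr, "fork, child uffd %u\n",
                                msg.arg.fork.ufd);
                        break;
                case UFFD_EVENT_REMOVE:        /* e.g. MADV_DONTNEED */
                case UFFD_EVENT_UNMAP:
                        fprintf(stderr, "remove/unmap %llx-%llx\n",
                                (unsigned long long)msg.arg.remove.start,
                                (unsigned long long)msg.arg.remove.end);
                        break;
                case UFFD_EVENT_REMAP:
                        fprintf(stderr, "remap %llx -> %llx len %llx\n",
                                (unsigned long long)msg.arg.remap.from,
                                (unsigned long long)msg.arg.remap.to,
                                (unsigned long long)msg.arg.remap.len);
                        break;
                default:
                        fprintf(stderr, "unknown event %u\n",
                                (unsigned)msg.event);
                }
        }
}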