Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
From: Nadav Amit
Date: Thu, 10 Feb 2022 10:42:43 -0800
To: Peter Xu
Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM
References: <11831b20-0b46-92df-885a-1220430f9257@redhat.com> <63a8a665-4431-a13c-c320-1b46e5f62005@redhat.com>

> On Feb 9, 2022, at 11:48 PM, Peter Xu wrote:
>
> Hi, Nadav & all,
>
> On Mon, Jan 31, 2022 at 02:39:01PM -0800, Nadav Amit wrote:
>> There are use-cases in which you do need to know the order between
>> user-initiated MADV_DONTNEED and page-faults. For instance, if you
>> build a userspace paging mechanism, you need to know whether the
>> page content is zero or whatever is held in the disk.
>
> When there's no uffd monitor, concurrent page faults with MADV_DONTNEED can
> already result in undefined behavior, right?
> Assuming the page fault is a write
> with non-zero data, then the page can either contain zero or non-zero data
> at last, IIUC.
>
> If above is true, I'm wondering whether it's already impossible to do it right
> when there is an uffd monitor?

I think that the MADV_DONTNEED/PF-resolution "race" only affects usage models
that handle the page fault concurrently with UFFD monitoring (using multiple
monitor threads or io_uring, as I try to do), at least for use-cases such as
live migration.

I think the scenario you have in mind is the following, which is resolved
with mmap_changing that Mike introduced some time ago:

UFFD monitor            App thread #0           App thread #1
------------            -------------           -------------
                        #PF
UFFD Read
 [#PF]
                                                MADV_DONTNEED
                                                 mmap_changing = 1

                                                 userfaultfd_event_wait_completion()
                                                  [queue event, wait]
UFFD-copy
 -EAGAIN since mmap_changing > 0

mmap_changing stays elevated, and UFFD-copy is not served (it fails), until
the monitor reads the UFFD event. The monitor, in this scenario, is
single-threaded and therefore orders UFFD-read and UFFD-copy, preventing them
from racing.

Assuming the monitor is smart enough to reevaluate the course of action after
MADV_DONTNEED is handled, it should be safe. Personally, I do not like the
burden that this scheme puts on the monitor, the fact that it needs to retry,
or even the return value [I think it should be EBUSY, since an immediate
retry would fail. With io_uring, EAGAIN triggers an immediate retry, which is
useless.] Yet, concurrent UFFD-event/#PF can be handled properly by a smart
monitor.

*However*, userfaultfd events seem very hard to use (to say the least) in
the following cases:

1. The UFFD-copy is issued by one thread and the UFFD-read is performed by
   another. For me this is the most painful case, even if you may consider it
   "unorthodox". It is very useful for performance, especially if the
   UFFD-copy is large.

2. The race is between two userfaultfd *events*. The events might not be
   properly ordered (i.e., the order in which they are read does not reflect
   the order in which they occurred) despite the use of
   *_userfaultfd_prep(), since they are only queued (to be reported and to
   trigger a wake) by userfaultfd_event_wait_completion(), after the VMA and
   PTEs were updated and, more importantly, after mmap-lock was dropped.

   This means that if you have fork and MADV_DONTNEED, the monitor might see
   their order inverted, and won't be able to know whether the child has the
   pages zapped or not.

   Other races are possible too, for instance between mremap() and munmap().
   In most cases the monitor might be able (with quite some work) to figure
   out that the order of the events it received does not make sense and that
   the events must have been reordered. Yet, implementing something like that
   is far from trivial, and there are some cases that are probably impossible
   to resolve just based on the UFFD read events.

I personally addressed this issue with seccomp+ptrace to trap on entry/exit
to relevant syscalls (e.g., munmap, mmap, fork), and prevent concurrent
calls to obtain correct order of the events. It is far from trivial and
introduces some overheads, but I did not find a better solution.

Let me know if I am missing anything.

Thanks,
Nadav
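
For illustration, the single-threaded monitor scenario from the diagram can
be replayed with a toy simulation. This is only a sketch under stated
assumptions: ToyUffd, resolve_fault and the 4-byte "page" are hypothetical
names invented for the example, it does not call the real userfaultfd API,
and it simply assumes (as described in the message) that the copy fails with
EAGAIN until the pending event has been read. Resolving with a zero page
after MADV_DONTNEED is likewise an assumed monitor policy.

```python
import errno
from collections import deque

ZERO_PAGE = bytes(4)  # hypothetical 4-byte "page" to keep the example small


class ToyUffd:
    """Toy model of the mmap_changing gate (an assumption, not kernel code)."""

    def __init__(self):
        self.mmap_changing = 0  # elevated while an unread event is pending
        self.events = deque()   # events waiting for the monitor to read them
        self.pages = {}         # addr -> contents installed by copy()

    def page_fault(self, addr):
        self.events.append(("pagefault", addr))

    def madv_dontneed(self, addr):
        # App thread: elevate mmap_changing, zap the page, queue the event.
        self.mmap_changing += 1
        self.pages.pop(addr, None)
        self.events.append(("remove", addr))

    def read_event(self):
        # Monitor: reading the "remove" event lowers mmap_changing again.
        if not self.events:
            return None
        event = self.events.popleft()
        if event[0] == "remove":
            self.mmap_changing -= 1
        return event

    def copy(self, addr, data):
        # Models UFFD-copy: refused while mmap_changing is elevated.
        if self.mmap_changing > 0:
            return -errno.EAGAIN
        self.pages[addr] = data
        return len(data)


def resolve_fault(uffd, addr, backing_store):
    """Single-threaded monitor: on EAGAIN, read events, then re-evaluate."""
    zapped = set()
    while True:
        data = ZERO_PAGE if addr in zapped else backing_store[addr]
        if uffd.copy(addr, data) != -errno.EAGAIN:
            return data
        # Retrying right away would fail again; drain the events first.
        event = uffd.read_event()
        while event is not None:
            if event[0] == "remove":
                zapped.add(event[1])
            event = uffd.read_event()


# Replay the scenario from the diagram.
uffd = ToyUffd()
store = {0x1000: b"disk"}
uffd.page_fault(0x1000)          # app thread #0 faults
first_event = uffd.read_event()  # monitor reads the #PF
uffd.madv_dontneed(0x1000)       # app thread #1 zaps the page
resolved = resolve_fault(uffd, 0x1000, store)
```

The point of the sketch is the ordering: because one thread both reads
events and issues the copy, the retry only happens after the event has been
consumed. It says nothing about the multi-threaded or event-vs-event cases
listed above, which this simple scheme does not cover.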