Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
From: Nadav Amit
Date: Mon, 31 Jan 2022 14:39:01 -0800
Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM
To: Mike Rapoport
References: <11831b20-0b46-92df-885a-1220430f9257@redhat.com> <63a8a665-4431-a13c-c320-1b46e5f62005@redhat.com>

> On Jan 31, 2022, at 10:47 AM, Mike Rapoport wrote:
>
> On Mon, Jan 31, 2022 at 03:41:05PM +0100, David Hildenbrand wrote:
>> On 31.01.22 15:28, Mike Rapoport wrote:
>>> On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote:
>>>> On 31.01.22 15:05, Mike Rapoport wrote:
>>>>> On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote:
>>>>>> On 31.01.22 11:42, Mike Rapoport wrote:
>>>>>>> Hi Nadav,
>>>>>>>
>>>>>>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
>>>>>>>> Using userfaultfd and looking at the kernel code, I encountered a usability
>>>>>>>> issue that complicates userspace
>>>>>>>> UFFD-monitor implementation. I obviously
>>>>>>>> might be wrong, so I would appreciate (polite?) feedback. I do have a
>>>>>>>> userspace workaround, but I thought it worth sharing and hearing your
>>>>>>>> opinion, as well as feedback from other UFFD users.
>>>>>>>>
>>>>>>>> The issue I encountered concerns the ordering of UFFD events, which might not
>>>>>>>> reflect the actual order in which the events took place.
>>>>>>>>
>>>>>>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
>>>>>>>> themselves [*]. The mm-lock is dropped before notifying the userspace
>>>>>>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
>>>>>>>> the events actually reflects the order in which the events took place.
>>>>>>>> This can prevent a UFFD-monitor from using the events to track which
>>>>>>>> ranges are mapped. Specifically, a UFFD_EVENT_FORK message and a
>>>>>>>> UFFD_EVENT_UNMAP message (which reflects an unmap in the parent process) can
>>>>>>>> be reordered if the events are triggered by two different threads. In
>>>>>>>> this case the UFFD-monitor cannot figure out from the events whether the
>>>>>>>> child process still has the unmapped memory range mapped (because fork
>>>>>>>> happened first) or not.
>>>>>>>
>>>>>>> Yeah, it seems that something like this is possible:
>>>>>>>
>>>>>>>
>>>>>>> fork()                          munmap()
>>>>>>>   mmap_write_unlock();
>>>>>>>                                   mmap_write_lock_killable();
>>>>>>>                                   do_things();
>>>>>>>                                   mmap_{read,write}_unlock();
>>>>>>>                                   userfaultfd_unmap_complete();
>>>>>>>   dup_userfaultfd_complete();
>>>>>>>
>>>>>>
>>>>>> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE
>>>>>> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in
>>>>>> read mode. But not sure if they apply.
>>>>>
>>>>> The userspace can live with these, at least for uffd missing page faults.
>>>>> If the monitor tries to resolve a page fault for a removed area, the
>>>>> errno from UFFDIO_COPY/ZERO can be used to detect such races.
>>>>
>>>> I was wondering if the monitor could get confused if it just resolved a
>>>> page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event.
>>>
>>> And why would it be confused?
>>
>> My thinking was that the monitor might use REMOVE events to track which
>> pages are actually populated. If you receive REMOVE after
>> UFFDIO_COPY/ZERO the monitor would conclude that the page is not
>> populated, just like if we'd get the MADV_DONTNEED/MADV_REMOVE
>> immediately after placing a page.
>
> I still don't follow your usecase.
>
> In CRIU we simply discard whatever content we had to fill when there is
> a REMOVE event. If a page fault occurs in that region we use UFFDIO_ZEROPAGE,
> just as it would happen in "normal" page fault processing
> (note, CRIU does not support uffd with hugetlb or shmem)

I think that the point David makes is valid. There are use-cases in
which you do need to know the order between a user-initiated MADV_DONTNEED
and page-faults. For instance, if you build a userspace paging mechanism,
you need to know whether the page content is zero or whatever is held
on disk.

I presume mmap_changing was designed for a similar purpose, assuming that
if you had a page-fault that started before MADV_DONTNEED, and you
try to serve it using a copy ioctl, the copy would fail.

I think that this works only if you assume that there is a single UFFD
monitor thread (that reads the uffd and issues the appropriate ioctls),
and that all operations are performed synchronously (which I am trying to
avoid by using io_uring). Otherwise, a copy ioctl that is initiated before
MADV_DONTNEED (to resolve a page-fault) can take place after userspace was
already notified of UFFD_EVENT_REMOVE (i.e., mmap_changing==0), and
there is no way to cancel the copy that was initiated.
As a result, following MADV_DONTNEED, the memory would not be zeroed.

As for me, I decided that due to the lack of ordering, I just cannot
use the UFFD events, and I have to rely on ptrace to obtain the order of
these events. I might be wrong, but any solution is non-trivial and is
likely to require API changes.
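
The copy-after-REMOVE race described above can be sketched as a tiny
userspace model. This is only an illustration under stated assumptions: it
does not call the userfaultfd API at all, every name in it (sim_page,
sim_dontneed, sim_copy, ...) is hypothetical, and mmap_changing is reduced
to a single flag that is set by the MADV_DONTNEED model and cleared once
the monitor has read the REMOVE event.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model only: nothing below calls the real userfaultfd API. */
enum { SIM_PAGE_SIZE = 16 };

struct sim_page {
	unsigned char data[SIM_PAGE_SIZE]; /* all zero == freshly zeroed page */
	bool mmap_changing;                /* copies are refused while set */
};

/* Model of MADV_DONTNEED: zap the page and open the "changing" window. */
void sim_dontneed(struct sim_page *p)
{
	memset(p->data, 0, sizeof(p->data));
	p->mmap_changing = true;
}

/* Model of the monitor reading UFFD_EVENT_REMOVE: the window closes. */
void sim_read_remove_event(struct sim_page *p)
{
	p->mmap_changing = false;
}

/* Model of a copy ioctl: returns -1 if refused, 0 if content was installed. */
int sim_copy(struct sim_page *p, const unsigned char *src)
{
	if (p->mmap_changing)
		return -1;
	memcpy(p->data, src, sizeof(p->data));
	return 0;
}

/* Synchronous monitor: the copy runs inside the window and is refused. */
unsigned char sim_scenario_sync(void)
{
	struct sim_page page = { .mmap_changing = false };
	unsigned char disk[SIM_PAGE_SIZE];
	int ret;

	memset(disk, 0xAB, sizeof(disk)); /* stale on-disk content */
	sim_dontneed(&page);
	ret = sim_copy(&page, disk);      /* refused: window still open */
	assert(ret == -1);
	sim_read_remove_event(&page);
	return page.data[0];
}

/*
 * Asynchronous monitor: a copy queued before MADV_DONTNEED executes only
 * after the REMOVE event was read, so nothing refuses it any more.
 */
unsigned char sim_scenario_async(void)
{
	struct sim_page page = { .mmap_changing = false };
	unsigned char disk[SIM_PAGE_SIZE];
	int ret;

	memset(disk, 0xAB, sizeof(disk)); /* stale on-disk content */
	sim_dontneed(&page);
	sim_read_remove_event(&page);
	ret = sim_copy(&page, disk);      /* the queued copy finally runs */
	assert(ret == 0);
	return page.data[0];
}
```

In this model sim_scenario_sync() leaves the page zeroed, while
sim_scenario_async() leaves the stale 0xAB content in place after
MADV_DONTNEED, which is the outcome the text above warns about.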