Subject: Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races
From: Nadav Amit
Date: Sun, 6 Dec 2020 20:31:39 -0800
To: Mike Rapoport
Cc: Mike Rapoport, Andrew Morton, linux-mm, lkml, Andrea Arcangeli,
    Mike Kravetz, Pavel Emelyanov, Andrei Vagin
Message-Id: <5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com>
In-Reply-To: <20201206093703.GY123287@linux.ibm.com>
References: <1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com>
    <31DA12CC-E9CC-497D-A2EE-B83549D95CE8@gmail.com>
    <20201206093703.GY123287@linux.ibm.com>

Thanks for the detailed answer, Mike. Things are clearer in regard to
your intention.

> On Dec 6, 2020, at 1:37 AM, Mike Rapoport wrote:
>
> The uffd monitor should *know* what is the state of child's memory and
> without this patch it could only guess.

I see - so mmap_changing is not just about graceful handling of copy
ioctls (which I think monitors could handle before mmap_changing was
introduced) but also about letting the monitor know which pages are
mapped in each process. Makes sense, but I have strong doubts it really
works (see below).

>> 2. How is memory ordering supposed to work here? IIUC, mmap_changing is not
>> protected by any lock and there are no memory barriers that are associated
>> with the assignment. Indeed, the code calls WRITE_ONCE()/READ_ONCE(), but
>> AFAIK this does not guarantee ordering with non-volatile reads/writes.
>
> There is also mmap_lock involved, so I don't see how copy can start in
> parallel with fork processing. Fork sets mmap_changing to true while
> holding mmap_lock, so copy cannot start in parallel. When mmap_lock is
> released, mmap_changing remains true until fork event is pushed to
> userspace and when this is done there is no issue with
> userfaultfd_copy.

Whenever I run into a non-standard and non-trivial synchronization
algorithm in the kernel (and elsewhere), I become very confused and
concerned. I raised my question since I wanted to modify the code and
could not figure out how to properly do so. Based on your input that the
monitor is expected to know the child mappings according to userfaultfd
events, I now think that the kernel does not provide this ability and
the locking scheme is broken.

Here are some scenarios that I think are broken - please correct me if I
am wrong:

* Scenario 1: MADV_DONTNEED racing with userfaultfd page-faults
userfaultfd_remove() only holds the mmap_lock for read, so these events
cannot be ordered with userfaultfd page-faults.

* Scenario 2: MADV_DONTNEED racing with fork()
As userfaultfd_remove() releases mmap_lock after the user notification
and before the actual unmapping, a concurrent fork() might happen before
or after the actual unmapping in MADV_DONTNEED, and the user therefore
has no way of knowing whether the actual unmapping took place before or
after the fork().

* Scenario 3: Concurrent MADV_DONTNEED can cause userfaultfd_remove() to
clear mmap_changing before all the notifications are completed.
As mmap_lock is only taken for read, the first thread that completes
userfaultfd_remove() would clear the indication that was set by the
other one.

* Scenario 4: Fork starts and ends between copying of two pages.
As mmap_lock might be released during ioctl_copy() (inside
__mcopy_atomic()), some pages might be mapped in the child and others
not:

CPU0                                                      CPU1
----                                                      ----
ioctl_copy():
 __mcopy_atomic()
  mmap_read_lock()
  !mmap_changing [ok]
  mfill_atomic_pte() == 0 [page0 copied]
  mfill_atomic_pte() == -ENOENT [page1 will be retried]
  mmap_read_unlock()
  goto retry
                                                          fork():
                                                           dup_userfaultfd()
                                                           -> mmap_changing=true
                                                           userfaultfd_event_wait_completion()
                                                           -> mmap_changing=false
  mmap_read_lock()
  !mmap_changing [ok]
  mfill_atomic_pte() == 0 [page1 copied]
  mmap_read_unlock()

 return: 2 pages were mapped, while the first is present in the child
 and the second one is non-present.

Bottom line: it seems to me that mmap_changing should be a counter (not
a boolean) that is protected by mmap_lock. This counter should be kept
elevated throughout the entire operation (in regard to MADV_DONTNEED).
Perhaps mmap_lock does not have to be taken to decrease the counter, but
then an smp_wmb() would be needed before the counter is decreased.

Let me know whether I am completely off or missing something.

Thanks,
Nadav
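
P.S. To make the counter suggestion a bit more concrete, here is a
minimal userspace sketch in C11. It is not kernel code and not a patch:
every name in it (struct uffd_ctx_model, event_begin(), event_end(),
copy_allowed_flag(), copy_allowed_count()) is hypothetical, mmap_lock is
left out entirely, and a release-ordered decrement merely stands in for
the smp_wmb() mentioned above. It only models Scenario 3, where two
overlapping events share one indication.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-userfaultfd context. */
struct uffd_ctx_model {
	bool mmap_changing_flag;        /* boolean scheme */
	atomic_int mmap_changing_count; /* counter scheme */
};

/*
 * An event (fork, MADV_DONTNEED, ...) starts. In the proposal this
 * would run with mmap_lock held; the lock is not modeled here.
 */
static void event_begin(struct uffd_ctx_model *ctx)
{
	ctx->mmap_changing_flag = true;
	atomic_fetch_add_explicit(&ctx->mmap_changing_count, 1,
				  memory_order_relaxed);
}

/*
 * The event has been fully processed. The release ordering is an
 * assumption standing in for "smp_wmb() before the decrement".
 */
static void event_end(struct uffd_ctx_model *ctx)
{
	int old;

	/* Boolean scheme: unconditionally clears, even if others remain. */
	ctx->mmap_changing_flag = false;

	/* Counter scheme: only this event's contribution is dropped. */
	old = atomic_fetch_sub_explicit(&ctx->mmap_changing_count, 1,
					memory_order_release);
	assert(old > 0);
}

/* Would a copy be allowed to proceed under the boolean scheme? */
static bool copy_allowed_flag(struct uffd_ctx_model *ctx)
{
	return !ctx->mmap_changing_flag;
}

/* Would a copy be allowed to proceed under the counter scheme? */
static bool copy_allowed_count(struct uffd_ctx_model *ctx)
{
	return atomic_load_explicit(&ctx->mmap_changing_count,
				    memory_order_acquire) == 0;
}
```

Calling event_begin() twice and event_end() once leaves the boolean
cleared while one event is still in flight, whereas the counter stays at
1 until the second event_end(). Whether a release decrement is the right
primitive in the real code is exactly the kind of thing I would like to
have reviewed.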