From: David Stevens <stevensd@chromium.org>
Date: Fri, 3 Feb 2023 15:09:55 +0900
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
To: Peter Xu
Cc: Yang Shi, David Hildenbrand, "Kirill A. Shutemov", linux-mm@kvack.org,
    Andrew Morton, linux-kernel@vger.kernel.org, Hugh Dickins
References: <20230201034137.2463113-1-stevensd@google.com>

> > > I don't know if it's necessary to go that far. Userfaultfd plus shmem
> > > is inherently brittle. It's possible for userspace to bypass
> > > userfaultfd on a shmem mapping by accessing the shmem through a
> > > different mapping or simply by using the write syscall.
>
> Yes this is possible, but this is user-visible operation - no matter it was
> a read()/write() from another process, or mmap()ed memory accesses.
> Khugepaged merges ptes in a way that is out of control of users. That's
> something the user can hardly control.
>
> AFAICT currently file-based uffd missing mode all works in that way. IOW
> the user should have full control of the file/inode under the hood to make
> sure there will be nothing surprising. Otherwise I don't really see how
> the missing mode can work solidly since it's page cache based.
>
> > > It might be sufficient to say that the kernel won't directly bypass a
> > > VMA's userfaultfd to collapse the underlying shmem's pages. Although on
> > > the other hand, I guess it's not great for the presence of an unused
> > > shmem mapping lying around to cause khugepaged to have user-visible
> > > side effects.
>
> Maybe it works for your use case already, for example, if in your app the
> shmem is only and always be mapped once? However that doesn't seem like a
> complete solution to me.

We're using userfaultfd for guest memory for a VM. We do have sandboxed
device processes. However, thinking about it a bit more, this approach
would probably cause issues with device hotplug.

> There's nothing that will prevent another mapping being established, and
> right after that happens it'll stop working, because khugepaged can notice
> that new mm/vma which doesn't register with uffd at all, and thinks it a
> good idea to collapse the shmem page cache again. Uffd will silently fail
> in another case even if not immediately in your current app/reproducer.
>
> Again, I don't think what I propose above is anything close to good.. It'll
> literally disable any collapsing possibility for a shmem node as long as
> any small portion of the inode mapping address space got registered by any
> process with uffd. I just don't see any easier approach so far.

Maybe we can make things easier by being more precise about what bug
we're trying to fix. Strictly speaking, I don't think what we're
concerned about is whether or not userfaultfd is registered on a
particular VMA at a particular point in time. I think what we're
actually concerned about is that when userspace has a page with an
armed userfaultfd that it knows is missing, that page should not be
filled by khugepaged. If userspace doesn't know that a userfaultfd
armed page is missing, then even if khugepaged fills that page, as far
as userspace is concerned, the page was filled by khugepaged before
userfaultfd was armed.

If that's a valid way to look at it, then I think the fact that
collapse_file locks hpage provides most of the necessary locking. From
there, we need to check whether there are any VMAs with armed
userfaultfds that might have observed a missing page. I think that can
be done while iterating over VMAs in retract_page_tables without
acquiring any mmap_lock by adding some memory barriers to
userfaultfd_set_vm_flags and userfaultfd_armed. It is possible that a
userfaultfd gets registered on a particular VMA after we check its
flags but before the collapse finishes. I think the only observability
hole left would be operations on the shmem file descriptor that don't
actually lock pages (e.g. SEEK_DATA/SEEK_HOLE), which are hopefully
solvable with some more thought.

-David

> Thanks,
>
> --
> Peter Xu
>