From: Yang Shi
Date: Thu, 2 Feb 2023 09:40:12 -0800
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
To: David Stevens
Cc: Peter Xu, David Hildenbrand, "Kirill A. Shutemov",
    linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org,
    Hugh Dickins
References: <20230201034137.2463113-1-stevensd@google.com>

On Thu, Feb 2, 2023 at 1:56 AM David Stevens wrote:
>
> On Thu, Feb 2, 2023 at 5:52 AM Peter Xu wrote:
> >
> > On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote:
> > > On Tue, Jan 31, 2023 at 7:42 PM David Stevens wrote:
> > > >
> > > > From: David Stevens
> > > >
> > > > Collapsing memory in a vma that has an armed userfaultfd results in
> > > > zero-filling any missing
> > > > pages, which breaks user-space paging for those
> > > > filled pages. Avoid khugepage bypassing userfaultfd by not collapsing
> > > > pages in shmem reached via scanning a vma with an armed userfaultfd if
> > > > doing so would zero-fill any pages.
> > > >
> > > > Signed-off-by: David Stevens
> > > > ---
> > > >  mm/khugepaged.c | 35 ++++++++++++++++++++++++-----------
> > > >  1 file changed, 24 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 79be13133322..48e944fb8972 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > >   *    + restore gaps in the page cache;
> > > >   *    + unlock and free huge page;
> > > >   */
> > > > -static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > -                        struct file *file, pgoff_t start,
> > > > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma,
> > > > +                        unsigned long addr, struct file *file, pgoff_t start,
> > > >                          struct collapse_control *cc)
> > > >  {
> > > >         struct address_space *mapping = file->f_mapping;
> > > > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > >          * be able to map it or use it in another way until we unlock it.
> > > >          */
> > > >
> > > > +       if (is_shmem)
> > > > +               mmap_read_lock(mm);
> > >
> > > If you release mmap_lock before then reacquire it here, the vma is not
> > > trusted anymore. It is not safe to use the vma anymore.
> > >
> > > Since you already read uffd_was_armed before releasing mmap_lock, so
> > > you could pass it directly to collapse_file w/o dereferencing vma
> > > again. The problem may be false positive (not userfaultfd armed
> > > anymore), but it should be fine. Khugepaged could collapse this area
> > > in the next round.
>
> I didn't notice this race condition.
> It should be possible to adapt
> hugepage_vma_revalidate for this situation, or at least to create an
> analogous situation.

But once you release mmap_lock, the vma may still change. Revalidation
only guarantees that the vma is valid while you hold mmap_lock, so it
helps only if mmap_lock is held for the whole collapse, or at least up
to the point where the collapse can no longer affect userfaultfd. We
definitely don't want to hold mmap_lock for the whole collapse, and I
don't know whether we could release it earlier, given my limited
knowledge of userfaultfd.

>
> > Unfortunately that may not be enough.. because it's also possible that it
> > reads uffd_armed==false, released mmap_sem, passed it over to the scanner,
> > but then when scanning the file uffd got armed in parallel.
> >
> > There's another problem where the current vma may not have uffd armed,
> > khugepaged may think it has nothing to do with uffd and moved on with
> > collapsing, but actually it's armed in another vma of either the current mm
> > or just another mm's.
> >
> > It seems non-trivial too to safely check this across all the vmas, let's
> > say, by a reverse walk - the only safe way is to walk all the vmas and take
> > the write lock for every mm, but that's not only too heavy but also merely
> > impossible to always make it right because of deadlock issues and on the
> > order of mmap write lock to take..
> >
> > So far what I can still think of is, if we can extend shmem_inode_info and
> > have a counter showing how many uffd has been armed. It can be a generic
> > counter too (e.g. shmem_inode_info.collapse_guard_counter) just to avoid
> > the page cache being collapsed under the hood, but I am also not aware of
> > whether it can be reused by other things besides uffd.
> >
> > Then when we do the real collapsing, say, when:
> >
> >         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> >         xas_store(&xas, hpage);
> >         xas_unlock_irq(&xas);
> >
> > We may need to make sure that counter keeps static (probably by holding
> > some locks during the process) and we only do that last phase collapse if
> > counter==0.
> >
> > Similar checks in this patch can still be done, but that'll only service as
> > a role of failing faster before the ultimate check on the uffd_armed
> > counter. Otherwise I just don't quickly see how to avoid race conditions.
>
> I don't know if it's necessary to go that far. Userfaultfd plus shmem
> is inherently brittle. It's possible for userspace to bypass
> userfaultfd on a shmem mapping by accessing the shmem through a
> different mapping or simply by using the write syscall. It might be
> sufficient to say that the kernel won't directly bypass a VMA's
> userfaultfd to collapse the underlying shmem's pages. Although on the
> other hand, I guess it's not great for the presence of an unused shmem
> mapping lying around to cause khugepaged to have user-visible side
> effects.
>
> -David
>
> > It'll be great if someone can come up with something better than above..
> > Copy Hugh too.
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
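
For illustration only, here is a rough userspace model of the guard
counter idea Peter describes above. It is a sketch under assumptions,
not kernel code: struct inode_model, uffd_arm(), uffd_disarm() and
try_collapse() are hypothetical names, a pthread mutex stands in for
whatever lock would keep the counter stable, and the "collapsed" flag
stands in for the final xas_store() step. The point it tries to show is
that the armed state read earlier (the snapshot) can only be a fail-fast
hint, while the decision that matters is the counter check made under
the lock right before the commit.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a per-inode guard; not kernel code. */
struct inode_model {
	pthread_mutex_t lock;        /* serializes arming against the final commit */
	int collapse_guard_counter;  /* number of armed userfaultfd registrations */
	bool collapsed;              /* models the huge page being installed */
};

#define INODE_MODEL_INIT { PTHREAD_MUTEX_INITIALIZER, 0, false }

/* Model of arming userfaultfd on some mapping of the inode. */
static void uffd_arm(struct inode_model *ino)
{
	pthread_mutex_lock(&ino->lock);
	ino->collapse_guard_counter++;
	pthread_mutex_unlock(&ino->lock);
}

/* Model of disarming userfaultfd on some mapping of the inode. */
static void uffd_disarm(struct inode_model *ino)
{
	pthread_mutex_lock(&ino->lock);
	ino->collapse_guard_counter--;
	pthread_mutex_unlock(&ino->lock);
}

/*
 * armed_snapshot is the value read earlier (e.g. while mmap_lock was
 * still held). It may be stale, so it is only used to give up early.
 * The authoritative check is the counter test under the lock.
 */
static bool try_collapse(struct inode_model *ino, bool armed_snapshot)
{
	bool done = false;

	if (armed_snapshot)
		return false;	/* fail fast; retry in a later round */

	pthread_mutex_lock(&ino->lock);
	if (ino->collapse_guard_counter == 0) {
		ino->collapsed = true;	/* stands in for the final store */
		done = true;
	}
	pthread_mutex_unlock(&ino->lock);

	return done;
}
```

With this model, a collapse attempted with a stale "not armed" snapshot
still fails while the counter is non-zero, and succeeds once the last
registration is gone. The model says nothing about what should happen
when userfaultfd is armed after a collapse has already been committed,
and it ignores the lock ordering concerns raised in the thread.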