Date: Thu, 2 Feb 2023 15:22:50 -0500
From: Peter Xu
To: Yang Shi
Cc: David Stevens, David Hildenbrand, "Kirill A. Shutemov",
    linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org,
    Hugh Dickins
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
References: <20230201034137.2463113-1-stevensd@google.com>

On Thu, Feb 02, 2023 at 09:40:12AM -0800, Yang Shi wrote:
> On Thu, Feb 2, 2023 at 1:56 AM David Stevens wrote:
> >
> > On Thu, Feb 2, 2023 at 5:52 AM Peter Xu wrote:
> > >
> > > On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote:
> > > > On Tue, Jan 31, 2023 at 7:42 PM David Stevens wrote:
> > > > >
> > > > > From: David Stevens
> > > > >
> > > > > Collapsing memory in a vma that has an armed userfaultfd results in
> > > > > zero-filling any missing pages, which breaks user-space paging for those
> > > > > filled pages. Avoid khugepage bypassing userfaultfd by not collapsing
> > > > > pages in shmem reached via scanning a vma with an armed userfaultfd if
> > > > > doing so would zero-fill any pages.
> > > > >
> > > > > Signed-off-by: David Stevens
> > > > > ---
> > > > >  mm/khugepaged.c | 35 ++++++++++++++++++++++++-----------
> > > > >  1 file changed, 24 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index 79be13133322..48e944fb8972 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > > >   *    + restore gaps in the page cache;
> > > > >   *    + unlock and free huge page;
> > > > >   */
> > > > > -static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > > -			 struct file *file, pgoff_t start,
> > > > > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma,
> > > > > +			 unsigned long addr, struct file *file, pgoff_t start,
> > > > >  			 struct collapse_control *cc)
> > > > >  {
> > > > >  	struct address_space *mapping = file->f_mapping;
> > > > > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > >  	 * be able to map it or use it in another way until we unlock it.
> > > > >  	 */
> > > > >
> > > > > +	if (is_shmem)
> > > > > +		mmap_read_lock(mm);
> > > >
> > > > If you release mmap_lock before then reacquire it here, the vma is not
> > > > trusted anymore. It is not safe to use the vma anymore.
> > > >
> > > > Since you already read uffd_was_armed before releasing mmap_lock, so
> > > > you could pass it directly to collapse_file w/o dereferencing vma
> > > > again. The problem may be false positive (not userfaultfd armed
> > > > anymore), but it should be fine. Khugepaged could collapse this area
> > > > in the next round.
> >
> > I didn't notice this race condition. It should be possible to adapt
> > hugepage_vma_revalidate for this situation, or at least to create an
> > analogous situation.
>
> But once you release mmap_lock, the vma still may be changed,
> revalidation just can guarantee the vma is valid when you hold the
> mmap_lock unless mmap_lock is held for the whole collapse or at some
> point that the collapse doesn't have impact on userfaultfd anymore. We
> definitely don't want to hold mmap_lock for the whole collapse, but I
> don't know whether we could release it earlier or not due to my
> limited knowledge of userfaultfd.

I agree with Yang; I don't quickly see how that'll resolve the issue.

> > >
> > > Unfortunately that may not be enough.. because it's also possible that it
> > > reads uffd_armed==false, released mmap_sem, passed it over to the scanner,
> > > but then when scanning the file uffd got armed in parallel.
> > >
> > > There's another problem where the current vma may not have uffd armed,
> > > khugepaged may think it has nothing to do with uffd and moved on with
> > > collapsing, but actually it's armed in another vma of either the current mm
> > > or just another mm's.
> > >
> > > It seems non-trivial too to safely check this across all the vmas, let's
> > > say, by a reverse walk - the only safe way is to walk all the vmas and take
> > > the write lock for every mm, but that's not only too heavy but also merely
> > > impossible to always make it right because of deadlock issues and on the
> > > order of mmap write lock to take..
> > >
> > > So far what I can still think of is, if we can extend shmem_inode_info and
> > > have a counter showing how many uffd has been armed. It can be a generic
> > > counter too (e.g. shmem_inode_info.collapse_guard_counter) just to avoid
> > > the page cache being collapsed under the hood, but I am also not aware of
> > > whether it can be reused by other things besides uffd.
> > >
> > > Then when we do the real collapsing, say, when:
> > >
> > >         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > >         xas_store(&xas, hpage);
> > >         xas_unlock_irq(&xas);
> > >
> > > We may need to make sure that counter keeps static (probably by holding
> > > some locks during the process) and we only do that last phase collapse if
> > > counter==0.
> > >
> > > Similar checks in this patch can still be done, but that'll only service as
> > > a role of failing faster before the ultimate check on the uffd_armed
> > > counter. Otherwise I just don't quickly see how to avoid race conditions.
> >
> > I don't know if it's necessary to go that far. Userfaultfd plus shmem
> > is inherently brittle. It's possible for userspace to bypass
> > userfaultfd on a shmem mapping by accessing the shmem through a
> > different mapping or simply by using the write syscall.

Yes this is possible, but this is a user-visible operation - no matter
whether it was a read()/write() from another process or mmap()ed memory
accesses. Khugepaged merges ptes in a way that is out of the user's
control.

AFAICT currently file-based uffd missing mode all works in that way. IOW
the user should have full control of the file/inode under the hood to make
sure there will be nothing surprising. Otherwise I don't really see how
the missing mode can work solidly since it's page cache based.

> > It might be sufficient to say that the kernel won't directly bypass a
> > VMA's userfaultfd to collapse the underlying shmem's pages. Although on
> > the other hand, I guess it's not great for the presence of an unused
> > shmem mapping lying around to cause khugepaged to have user-visible
> > side effects.

Maybe it works for your use case already, for example, if in your app the
shmem is only ever mapped once? However that doesn't seem like a
complete solution to me.
There's nothing that will prevent another mapping from being established,
and right after that happens it'll stop working, because khugepaged can
notice that new mm/vma which doesn't register with uffd at all, and think
it a good idea to collapse the shmem page cache again. Uffd will silently
fail in another case even if not immediately in your current
app/reproducer.

Again, I don't think what I propose above is anything close to good..
It'll literally disable any collapsing possibility for a shmem node as long
as any small portion of the inode mapping address space got registered by
any process with uffd. I just don't see any easier approach so far.

Thanks,

-- 
Peter Xu
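
For illustration only, the per-inode counter idea discussed in this message
could be modeled in userspace roughly as below. This is a sketch under
assumptions, not kernel code: struct inode_model, guard_arm(),
guard_disarm() and try_collapse() are hypothetical names, a pthread mutex
stands in for whatever lock would keep the counter stable, and a boolean
stands in for the final xas_store() step. Only the name
collapse_guard_counter comes from the message itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the per-inode state. */
struct inode_model {
	pthread_mutex_t lock;       /* assumed lock that keeps the counter stable */
	int collapse_guard_counter; /* how many uffd registrations are armed */
	bool collapsed;             /* models "huge page stored in page cache" */
};

static void inode_model_init(struct inode_model *inode)
{
	pthread_mutex_init(&inode->lock, NULL);
	inode->collapse_guard_counter = 0;
	inode->collapsed = false;
}

/* Called when any mapping of the inode arms uffd (hypothetical hook). */
static void guard_arm(struct inode_model *inode)
{
	pthread_mutex_lock(&inode->lock);
	inode->collapse_guard_counter++;
	pthread_mutex_unlock(&inode->lock);
}

/* Called when such a registration goes away (hypothetical hook). */
static void guard_disarm(struct inode_model *inode)
{
	pthread_mutex_lock(&inode->lock);
	assert(inode->collapse_guard_counter > 0);
	inode->collapse_guard_counter--;
	pthread_mutex_unlock(&inode->lock);
}

/*
 * Final collapse phase: check the counter and commit under the same lock,
 * so the counter cannot change between the check and the commit.
 */
static bool try_collapse(struct inode_model *inode)
{
	bool ok;

	pthread_mutex_lock(&inode->lock);
	ok = (inode->collapse_guard_counter == 0);
	if (ok)
		inode->collapsed = true;
	pthread_mutex_unlock(&inode->lock);
	return ok;
}
```

In this model any armed registration on the inode, from any mapping, blocks
the collapse, which is exactly the coarse behaviour described above; an
earlier per-vma check would only serve to fail faster.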