Date: Wed, 1 Feb 2023 15:52:24 -0500
From: Peter Xu
To: Yang Shi
Cc: David Stevens, David Hildenbrand, "Kirill A. Shutemov", linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Hugh Dickins
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
References: <20230201034137.2463113-1-stevensd@google.com>

On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote:
> On Tue, Jan 31, 2023 at 7:42 PM David Stevens wrote:
> >
> > From: David Stevens
> >
> > Collapsing memory in a vma that has an armed userfaultfd results in
> > zero-filling any missing pages, which breaks user-space paging for those
> > filled pages. Avoid khugepage bypassing userfaultfd by not collapsing
> > pages in shmem reached via scanning a vma with an armed userfaultfd if
> > doing so would zero-fill any pages.
> >
> > Signed-off-by: David Stevens
> > ---
> >  mm/khugepaged.c | 35 ++++++++++++++++++++++++-----------
> >  1 file changed, 24 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 79be13133322..48e944fb8972 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >   *    + restore gaps in the page cache;
> >   *    + unlock and free huge page;
> >   */
> > -static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > -                        struct file *file, pgoff_t start,
> > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma,
> > +                        unsigned long addr, struct file *file, pgoff_t start,
> >                          struct collapse_control *cc)
> >  {
> >         struct address_space *mapping = file->f_mapping;
> > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >          * be able to map it or use it in another way until we unlock it.
> >          */
> >
> > +       if (is_shmem)
> > +               mmap_read_lock(mm);
>
> If you release mmap_lock before then reacquire it here, the vma is not
> trusted anymore. It is not safe to use the vma anymore.
>
> Since you already read uffd_was_armed before releasing mmap_lock, so
> you could pass it directly to collapse_file w/o dereferencing vma
> again.
> The problem may be false positive (not userfaultfd armed
> anymore), but it should be fine. Khugepaged could collapse this area
> in the next round.

Unfortunately that may not be enough.. because it's also possible that it
reads uffd_armed==false, releases mmap_lock, passes the value over to the
scanner, and then uffd gets armed in parallel while the file is being
scanned.

There's another problem: the current vma may not have uffd armed, so
khugepaged may think it has nothing to do with uffd and move on with
collapsing, while uffd is actually armed in another vma, either of the
current mm or of another mm.

It also seems non-trivial to safely check this across all the vmas, say,
by a reverse walk - the only safe way is to walk all the vmas and take the
mmap write lock for every mm, but that's not only too heavy but also
nearly impossible to always get right, because of deadlock issues and the
order in which the mmap write locks would have to be taken..

So far what I can still think of is to extend shmem_inode_info with a
counter showing how many uffds have been armed. It could be a generic
counter too (e.g. shmem_inode_info.collapse_guard_counter) that simply
keeps the page cache from being collapsed under the hood, but I am not
aware of whether it could be reused by anything besides uffd.

Then when we do the real collapsing, say, when:

	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
	xas_store(&xas, hpage);
	xas_unlock_irq(&xas);

We may need to make sure that the counter stays stable (probably by
holding some locks during the process), and we only do that last phase of
the collapse if counter==0.

Similar checks to the ones in this patch can still be done, but they would
only serve to fail faster, before the ultimate check on the uffd_armed
counter. Otherwise I just don't quickly see how to avoid the race
conditions.

It'll be great if someone can come up with something better than the
above.. Copy Hugh too.

Thanks,

-- 
Peter Xu
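
For illustration only, the counter idea above could be modelled in
userspace roughly like this. It is a sketch under assumptions, not kernel
code: struct inode_model, uffd_arm(), uffd_disarm(), collapse_precheck()
and collapse_commit() are all hypothetical names, a pthread mutex stands
in for whatever lock would keep the counter stable, and setting a flag
stands in for the xas_store() of the huge page. Only the field name
collapse_guard_counter comes from the proposal itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for shmem_inode_info; not kernel code. */
struct inode_model {
	pthread_mutex_t lock;		/* assumed lock keeping the counter stable */
	int collapse_guard_counter;	/* how many uffd registrations are armed */
	bool collapsed;			/* stands in for the stored huge page */
};

static void inode_model_init(struct inode_model *im)
{
	pthread_mutex_init(&im->lock, NULL);
	im->collapse_guard_counter = 0;
	im->collapsed = false;
}

/* Arming uffd on any vma of any mm mapping this inode bumps the counter. */
static void uffd_arm(struct inode_model *im)
{
	pthread_mutex_lock(&im->lock);
	im->collapse_guard_counter++;
	pthread_mutex_unlock(&im->lock);
}

static void uffd_disarm(struct inode_model *im)
{
	pthread_mutex_lock(&im->lock);
	assert(im->collapse_guard_counter > 0);
	im->collapse_guard_counter--;
	pthread_mutex_unlock(&im->lock);
}

/*
 * Early check, like the per-vma checks in the patch: it only helps to
 * fail faster, because its result can be stale by the time of the commit.
 */
static bool collapse_precheck(struct inode_model *im)
{
	bool ok;

	pthread_mutex_lock(&im->lock);
	ok = (im->collapse_guard_counter == 0);
	pthread_mutex_unlock(&im->lock);
	return ok;
}

/*
 * Last phase: the check and the store happen under the same lock, so
 * nobody can arm uffd between the two.
 */
static bool collapse_commit(struct inode_model *im)
{
	bool ok;

	pthread_mutex_lock(&im->lock);
	ok = (im->collapse_guard_counter == 0);
	if (ok)
		im->collapsed = true;	/* models xas_store(&xas, hpage) */
	pthread_mutex_unlock(&im->lock);
	return ok;
}
```

The point of the model is that collapse_precheck() alone is a
check-then-act race, while collapse_commit() closes the window by
checking the counter under the same lock that arming has to take. Whether
a suitable lock exists in the real collapse path, and what it would cost,
is exactly the open question.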