From: David Stevens
Date: Thu, 2 Feb 2023 18:56:04 +0900
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
To: Peter Xu
Cc: Yang Shi, David Hildenbrand, "Kirill A. Shutemov", linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Hugh Dickins
References: <20230201034137.2463113-1-stevensd@google.com>

On Thu, Feb 2, 2023 at 5:52 AM Peter Xu wrote:
>
> On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote:
> > On Tue, Jan 31, 2023 at 7:42 PM David Stevens wrote:
> > >
> > > From: David Stevens
> > >
> > > Collapsing memory in a vma that has an armed userfaultfd results in
> > > zero-filling any missing pages, which breaks user-space paging for those
> > > filled pages. Avoid khugepage bypassing userfaultfd by not collapsing
> > > pages in shmem reached via scanning a vma with an armed userfaultfd if
> > > doing so would zero-fill any pages.
> > >
> > > Signed-off-by: David Stevens
> > > ---
> > >  mm/khugepaged.c | 35 ++++++++++++++++++++++++-----------
> > >  1 file changed, 24 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 79be13133322..48e944fb8972 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > >   *    + restore gaps in the page cache;
> > >   *    + unlock and free huge page;
> > >   */
> > > -static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > -                        struct file *file, pgoff_t start,
> > > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma,
> > > +                        unsigned long addr, struct file *file, pgoff_t start,
> > >                          struct collapse_control *cc)
> > >  {
> > >         struct address_space *mapping = file->f_mapping;
> > > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >          * be able to map it or use it in another way until we unlock it.
> > >          */
> > >
> > > +       if (is_shmem)
> > > +               mmap_read_lock(mm);
> >
> > If you release mmap_lock before then reacquire it here, the vma is not
> > trusted anymore. It is not safe to use the vma anymore.
> >
> > Since you already read uffd_was_armed before releasing mmap_lock, so
> > you could pass it directly to collapse_file w/o dereferencing vma
> > again. The problem may be false positive (not userfaultfd armed
> > anymore), but it should be fine. Khugepaged could collapse this area
> > in the next round.

I didn't notice this race condition. It should be possible to adapt
hugepage_vma_revalidate for this situation, or at least to create an
analogous situation.

> Unfortunately that may not be enough.. because it's also possible that it
> reads uffd_armed==false, released mmap_sem, passed it over to the scanner,
> but then when scanning the file uffd got armed in parallel.
>
> There's another problem where the current vma may not have uffd armed,
> khugepaged may think it has nothing to do with uffd and moved on with
> collapsing, but actually it's armed in another vma of either the current mm
> or just another mm's.
>
> It seems non-trivial too to safely check this across all the vmas, let's
> say, by a reverse walk - the only safe way is to walk all the vmas and take
> the write lock for every mm, but that's not only too heavy but also merely
> impossible to always make it right because of deadlock issues and on the
> order of mmap write lock to take..
>
> So far what I can still think of is, if we can extend shmem_inode_info and
> have a counter showing how many uffd has been armed. It can be a generic
> counter too (e.g. shmem_inode_info.collapse_guard_counter) just to avoid
> the page cache being collapsed under the hood, but I am also not aware of
> whether it can be reused by other things besides uffd.
>
> Then when we do the real collapsing, say, when:
>
>         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
>         xas_store(&xas, hpage);
>         xas_unlock_irq(&xas);
>
> We may need to make sure that counter keeps static (probably by holding
> some locks during the process) and we only do that last phase collapse if
> counter==0.
>
> Similar checks in this patch can still be done, but that'll only service as
> a role of failing faster before the ultimate check on the uffd_armed
> counter. Otherwise I just don't quickly see how to avoid race conditions.

I don't know if it's necessary to go that far. Userfaultfd plus shmem
is inherently brittle. It's possible for userspace to bypass
userfaultfd on a shmem mapping by accessing the shmem through a
different mapping or simply by using the write syscall. It might be
sufficient to say that the kernel won't directly bypass a VMA's
userfaultfd to collapse the underlying shmem's pages.
Although on the other hand, I guess it's not great for the presence of
an unused shmem mapping lying around to cause khugepaged to have
user-visible side effects.

-David

> It'll be great if someone can come up with something better than above..
> Copy Hugh too.
>
> Thanks,
>
> --
> Peter Xu
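
For reference, the counter-based guard Peter describes above could be
modeled in userspace roughly as follows. This is only a sketch under
stated assumptions, not kernel code and not a proposed patch: struct
inode_guard, guard_arm(), guard_disarm() and try_collapse() are
hypothetical names, and a pthread mutex stands in for whatever lock
would actually keep the counter stable during the final collapse step.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for a per-inode collapse guard.
 * None of these names exist in the kernel; they only illustrate the idea.
 */
struct inode_guard {
	pthread_mutex_t lock;     /* serializes arming against the final collapse step */
	unsigned int uffd_armed;  /* number of armed userfaultfd registrations */
	bool collapsed;           /* set once the "huge page" has been installed */
};

static void guard_init(struct inode_guard *g)
{
	pthread_mutex_init(&g->lock, NULL);
	g->uffd_armed = 0;
	g->collapsed = false;
}

/* Called when a userfaultfd is armed on any mapping of the object. */
static void guard_arm(struct inode_guard *g)
{
	pthread_mutex_lock(&g->lock);
	g->uffd_armed++;
	pthread_mutex_unlock(&g->lock);
}

/* Called when a previously armed userfaultfd goes away. */
static void guard_disarm(struct inode_guard *g)
{
	pthread_mutex_lock(&g->lock);
	assert(g->uffd_armed > 0);
	g->uffd_armed--;
	pthread_mutex_unlock(&g->lock);
}

/*
 * Final collapse step: the counter is checked and the result committed
 * under the same lock, so arming cannot slip in between the two.
 * Returns true if the collapse was committed.
 */
static bool try_collapse(struct inode_guard *g)
{
	bool ok;

	pthread_mutex_lock(&g->lock);
	ok = (g->uffd_armed == 0);
	if (ok)
		g->collapsed = true;
	pthread_mutex_unlock(&g->lock);
	return ok;
}
```

The point of the model is that an earlier, unlocked check of the
counter can only serve as a fast-fail path; the decision that counts is
the one made under the lock in try_collapse(). It does not address the
lock-ordering and cost questions raised in the thread.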