From: Yang Shi
Date: Wed, 1 Feb 2023 15:57:33 -0800
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
To: Peter Xu
Cc: David Stevens, David Hildenbrand, "Kirill A. Shutemov", linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Hugh Dickins
References: <20230201034137.2463113-1-stevensd@google.com>

On Wed, Feb 1, 2023 at 12:52 PM Peter Xu wrote:
>
> On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote:
> > On Tue, Jan 31, 2023 at 7:42 PM David Stevens wrote:
> > >
> > > From: David Stevens
> > >
> > > Collapsing memory in a vma that has an armed userfaultfd results in
> > > zero-filling any missing pages, which breaks user-space paging for those
> > > filled pages.
> > > Avoid khugepage bypassing userfaultfd by not collapsing
> > > pages in shmem reached via scanning a vma with an armed userfaultfd if
> > > doing so would zero-fill any pages.
> > >
> > > Signed-off-by: David Stevens
> > > ---
> > >  mm/khugepaged.c | 35 ++++++++++++++++++++++++-----------
> > >  1 file changed, 24 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 79be13133322..48e944fb8972 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > >   *    + restore gaps in the page cache;
> > >   *    + unlock and free huge page;
> > >   */
> > > -static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > -                        struct file *file, pgoff_t start,
> > > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma,
> > > +                        unsigned long addr, struct file *file, pgoff_t start,
> > >                          struct collapse_control *cc)
> > >  {
> > >         struct address_space *mapping = file->f_mapping;
> > > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > >          * be able to map it or use it in another way until we unlock it.
> > >          */
> > >
> > > +       if (is_shmem)
> > > +               mmap_read_lock(mm);
> >
> > If you release mmap_lock before then reacquire it here, the vma is not
> > trusted anymore. It is not safe to use the vma anymore.
> >
> > Since you already read uffd_was_armed before releasing mmap_lock, so
> > you could pass it directly to collapse_file w/o dereferencing vma
> > again. The problem may be false positive (not userfaultfd armed
> > anymore), but it should be fine. Khugepaged could collapse this area
> > in the next round.
>
> Unfortunately that may not be enough.. because it's also possible that it
> reads uffd_armed==false, released mmap_sem, passed it over to the scanner,
> but then when scanning the file uffd got armed in parallel.

Aha, yeah, I missed that part...
thanks for pointing it out.

>
> There's another problem where the current vma may not have uffd armed,
> khugepaged may think it has nothing to do with uffd and moved on with
> collapsing, but actually it's armed in another vma of either the current mm
> or just another mm's.

Out of curiosity, could you please elaborate on how another vma armed
with userfaultfd could have an impact on the vmas that are not armed?

>
> It seems non-trivial too to safely check this across all the vmas, let's
> say, by a reverse walk - the only safe way is to walk all the vmas and take
> the write lock for every mm, but that's not only too heavy but also merely
> impossible to always make it right because of deadlock issues and on the
> order of mmap write lock to take..
>
> So far what I can still think of is, if we can extend shmem_inode_info and
> have a counter showing how many uffd has been armed. It can be a generic
> counter too (e.g. shmem_inode_info.collapse_guard_counter) just to avoid
> the page cache being collapsed under the hood, but I am also not aware of
> whether it can be reused by other things besides uffd.
>
> Then when we do the real collapsing, say, when:
>
>         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
>         xas_store(&xas, hpage);
>         xas_unlock_irq(&xas);
>
> We may need to make sure that counter keeps static (probably by holding
> some locks during the process) and we only do that last phase collapse if
> counter==0.
>
> Similar checks in this patch can still be done, but that'll only service as
> a role of failing faster before the ultimate check on the uffd_armed
> counter. Otherwise I just don't quickly see how to avoid race conditions.
>
> It'll be great if someone can come up with something better than above..
> Copy Hugh too.
>
> Thanks,
>
> --
> Peter Xu
>
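
To make the counter idea quoted above a bit more concrete, here is a rough
user-space sketch of the "check the counter and commit under the same lock"
pattern. It is only a model under stated assumptions, not kernel code and
not anything proposed in the thread beyond the counter itself:
collapse_guard_counter is the name floated in the quoted text, while
struct inode_model, guard_arm(), guard_disarm(), try_collapse() and the
pthread mutex are hypothetical stand-ins.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model; this is NOT the kernel's shmem_inode_info. */
struct inode_model {
	pthread_mutex_t lock;       /* stands in for "some locks" held over the last phase */
	int collapse_guard_counter; /* how many uffd registrations are armed on the file */
	bool collapsed;             /* stands in for the huge page being stored */
};

/* Arming uffd on any vma that maps the file bumps the per-file counter. */
static void guard_arm(struct inode_model *m)
{
	pthread_mutex_lock(&m->lock);
	m->collapse_guard_counter++;
	pthread_mutex_unlock(&m->lock);
}

/* Disarming drops the counter again; it must never go negative. */
static void guard_disarm(struct inode_model *m)
{
	pthread_mutex_lock(&m->lock);
	assert(m->collapse_guard_counter > 0);
	m->collapse_guard_counter--;
	pthread_mutex_unlock(&m->lock);
}

/*
 * Last-phase collapse: the counter is checked and the result committed
 * while holding the same lock, so arming cannot slip in between the
 * check and the store. Returns true if the collapse went ahead.
 */
static bool try_collapse(struct inode_model *m)
{
	bool ok;

	pthread_mutex_lock(&m->lock);
	ok = (m->collapse_guard_counter == 0);
	if (ok)
		m->collapsed = true; /* models the xas_store() step */
	pthread_mutex_unlock(&m->lock);
	return ok;
}
```

In this model, calling try_collapse() while any registration is armed
returns false and leaves the state untouched; once every guard_arm() has
been matched by guard_disarm(), the next call goes through. Which lock
would play this role in the kernel, and whether it is cheap enough on the
uffd registration path, is left open here just as it is in the thread.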