From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 35668C52D7C for ; Mon, 12 Aug 2024 18:29:57 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B35156B009E; Mon, 12 Aug 2024 14:29:56 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id AE0456B00B0; Mon, 12 Aug 2024 14:29:56 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9331D6B00BA; Mon, 12 Aug 2024 14:29:56 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 718456B009E for ; Mon, 12 Aug 2024 14:29:56 -0400 (EDT) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 21D0E1A0366 for ; Mon, 12 Aug 2024 18:29:56 +0000 (UTC) X-FDA: 82444432392.11.28FBADC Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf30.hostedemail.com (Postfix) with ESMTP id 132468001A for ; Mon, 12 Aug 2024 18:29:53 +0000 (UTC) Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b="d6S/LNJt"; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf30.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1723487382; a=rsa-sha256; cv=none; b=4hME/Ef0uAtb1+jvlrw5Iw8i2EUBZvT2AqoM5cu6wP51wxR2m917/yfpr9or9ji/FOtj0f ff8Qnhmct3mWnbgnq615pWlVJsCrKmrCatSpnlZ1/RFV8XM/kmQ4mYkizvzpq6l4sHOfFB tkX/cwR5RL3HouIvjmhGJtPEZfnScbQ= ARC-Authentication-Results: i=1; imf30.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b="d6S/LNJt"; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf30.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1723487382; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=3lzHa+4oPgNh0JbtSfXdnixp3cbcHkW2yBVpxgt+hHU=; b=1aEkIDZhPCpjByM0Kz7ldyMuHJTRpsk+lDe5qT73z9ihQKiUD7Ii/THo8S+BR0SolZ3uZ/ dSd3OJfUluIO6f7p+mCuMPvfCTbxVTQc/iNmbZ03hJTQvJy9gQbUJvEesaDZWk6qxVfgRg rai6YAF01PwcniXR3rMU/amF28g5v+w= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1723487393; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=3lzHa+4oPgNh0JbtSfXdnixp3cbcHkW2yBVpxgt+hHU=; b=d6S/LNJtRekFufYdziRvWY8aKh1M7uicfp99pvAEZpoE3P69UjbGIjl3UmMgI41DIA+5/8 yK/UyBBp8ib/DSCP4NVL/EbWGev3AUvIKcBXEo+aPZDmdBqtZ644LsyWK5AhLalU17BHnw qicYVLt3rHJFVVrxvQi7hye+Hcu8nH4= Received: from mail-vs1-f72.google.com (mail-vs1-f72.google.com [209.85.217.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-115-V-JYFxT9Mz2eNfHo3SCBwA-1; Mon, 12 Aug 2024 14:29:52 -0400 X-MC-Unique: V-JYFxT9Mz2eNfHo3SCBwA-1 Received: by mail-vs1-f72.google.com with SMTP id ada2fe7eead31-492a6b7ce80so178816137.3 for ; Mon, 12 Aug 2024 11:29:52 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1723487391; x=1724092191; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=3lzHa+4oPgNh0JbtSfXdnixp3cbcHkW2yBVpxgt+hHU=; b=Yz/0YdlK+m7SM0SC2IuCNkXPROgal+DL0xBKxU4uho+EfgNJP6KchanHDAqK7NKTpy mEKP9vMsIvaWgfYUsYdQsepd7sQGlYJzw+zrgASkLlYt7rYeVNArGEvEabTl1r83zlkk 8ikLJGo1VQHLYNjU1G8N75nI78fiCaimhERiFV3ST8rNJUuAtRsqPQXaCXQIIQdCO57n /Fkn0wllKScyZjsPc/8iDIArp9XksGD5rqlNQkmvjAC3Uf7gDZJIDr8POdKOIUWwg4JJ Oc4ck53TO6nkU6vOYMwCaP+leOiPRYxp2hTXhC4Lg0rmiuSbOq9X2ZDHGKSypUWRUr7z VCNQ== X-Gm-Message-State: AOJu0YzuQ6R7gH6TsK59k6iJwKQvZXDg9TmuBVP3vXz3fNy0D+9yxHEm KPiOkOdkCSkLsn1PYRwEun+qHSCAfwjdx5GHEZfIZgG7qwpw/E5nc6QH5iOkWjZXKmi2AK66QVC 2QVweBn4FO9N9931+JiGYPMdzmGhHHOfWwdHOFVfXZTzhcm7F X-Received: by 2002:a05:6102:38d1:b0:493:31f9:d14e with SMTP id ada2fe7eead31-4974398cf6amr814434137.2.1723487391393; Mon, 12 Aug 2024 11:29:51 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEgRvFG1AyF51iCR4QUK+iwha9x9ZVtk0r/+P1oFdTeH6YoJu+ZSYhqn/ax3TJ2lpn0JjOZGQ== X-Received: by 2002:a05:6102:38d1:b0:493:31f9:d14e with SMTP id ada2fe7eead31-4974398cf6amr814391137.2.1723487390933; Mon, 12 Aug 2024 11:29:50 -0700 (PDT) Received: from x1n (pool-99-254-121-117.cpe.net.cable.rogers.com. [99.254.121.117]) by smtp.gmail.com with ESMTPSA id d75a77b69052e-4531c26dc23sm25342021cf.75.2024.08.12.11.29.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 12 Aug 2024 11:29:50 -0700 (PDT) Date: Mon, 12 Aug 2024 14:29:47 -0400 From: Peter Xu To: David Hildenbrand Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Sean Christopherson , Oscar Salvador , Jason Gunthorpe , Axel Rasmussen , linux-arm-kernel@lists.infradead.org, x86@kernel.org, Will Deacon , Gavin Shan , Paolo Bonzini , Zi Yan , Andrew Morton , Catalin Marinas , Ingo Molnar , Alistair Popple , Borislav Petkov , Thomas Gleixner , kvm@vger.kernel.org, Dave Hansen , Alex Williamson , Yan Zhao Subject: Re: [PATCH 07/19] mm/fork: Accept huge pfnmap entries Message-ID: References: <20240809160909.1023470-1-peterx@redhat.com> <20240809160909.1023470-8-peterx@redhat.com> <8ef394e6-a964-41c4-b33c-0e940b6b9bd8@redhat.com> MIME-Version: 1.0 In-Reply-To: <8ef394e6-a964-41c4-b33c-0e940b6b9bd8@redhat.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline X-Rspam-User: X-Rspamd-Queue-Id: 132468001A X-Rspamd-Server: rspam01 X-Stat-Signature: 5sqanrs8mjc4m61twxg7ew3dmx6igdd4 X-HE-Tag: 1723487393-516573 X-HE-Meta: U2FsdGVkX1/S+y7hCQeRPiYoHatuAobOfTq8gFvh/kpOqv+r5J41l7yumJZJ2/yWHy+lbZUgtCyd2PKilNxw7KwfA+naUMdy7gRwVz+KUESX6mHm/agPhUzSNcmpa8yUjo7vWuyCR1y8b6Ce+AofSS3s6ScWRwogsIvqdLinuOuxRI1kc1anPXYsq/3KNONffqtFerppPVhCRp3vxWpdQBRlxSimWof5AdX+FD8ojblANXbg/5WWqigJZNZsHnD2Td737Fd9W9oIsUesP19beA0VeM80jn5wYln+80Qoikg5YT5A+Tv81Kn2lu0BOSeD5gk4wuwe8rHeQ/k0oHrrjNnjHq1SIrBW+bG5of5uN27MC1wR0ZDsudIIBR+APhUeMV0hnNA0dyc8HcpnxRyLRC2tCMS4oD/Z0axppFXuTOJW/tZhUtdfUKezy1IKM3LEjjypcr7K+fYRH0ndRLYTYGOSGOX9sNExyZedgEAmEp810ZBCoNS1qjO0XUyGFOWDjiJ4Qy9uOiK/ozBELLDLMgppTkumhaFEY59LSUN+Dtu0sztTnNexZOgw4EocYqV6ynumDvKIfmiwjxMSaqRMrPNMnlreczZHJa9rZZXMVDGntTdBIpwFbvYECt5k48qhRLoNfRNMgrs41o9ebfMocDNM1jkYbHXOyjRL0cz6s9A0sag5cTnxOBxMGJIyDOCelCjIvGL9VnGV4Wn0SPme/rjlJzVsJGX6ONaFOXIQMzNn6nc25+LFOaqoIKPCVdj5GoEzSMzOodaO8+7Hf/nnsRb2LDDl121jHpEW1hY+f0gWfrW/+EY7f9XXZFi8YUdEQrZxpyXRT5jxiVqeaZkg0QWKu+ApDX7/2I4fDVPAK4RDM/NEvYa89uJOz983CQ7SHnSXuLa8NcnaRxXgDSSyif/CzsHY+ER7rJfOnzBxWSm6JxZXYRDVHSCfviqnaBTDjR19YiUb735D+hMCBuV vSqpWgKr wYo3CI+zt8yMqXcnIamKDCylIaWcMUQAq3znNpWQBtge+kNE9EpLJN3a9z4Kx0A29CR4BXAaZkz505uqsloH3Et9MB5EbUwZEkPsbPDVnB0Vrs4OX4hEK/FPBIuH2VzGRHsBhRflmMs4okK+YPWX6d8nT7rAW8yqH/qlWEOg+Efaip9F08L82Xk0gS0nNalGIpi+sNLfXhZrF1cy1LYqk6R4ETQg9sUhhluC0ZT0cropr/Qd9Aw5DgkvmnLLgprMM4KPWatW2Zzgd4pFgTkGge5gj0oV0X+eGCralLZ1lnbJ/sc598qQt2aXv7zJJ1CztwJHW2aSuURWyA63DU/JviWowqXXoSCE5flPWx5iBv1zk9C1K/+Lmmc5Tktd+e1PFxWskVc+r6Hkw/ONlOPmykO1waCu8Bj31X5iS8MEsT6bNeGBacRW6RFWQFA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Aug 09, 2024 at 07:59:58PM +0200, David Hildenbrand wrote: > On 09.08.24 19:15, Peter Xu wrote: > > On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote: > > > On 09.08.24 18:08, Peter Xu wrote: > > > > Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is > > > > much easier, the write bit needs to be persisted though for writable and > > > > shared pud mappings like PFNMAP ones, otherwise a follow up write in either > > > > parent or child process will trigger a write fault. > > > > > > > > Do the same for pmd level. > > > > > > > > Signed-off-by: Peter Xu > > > > --- > > > > mm/huge_memory.c | 27 ++++++++++++++++++++++++--- > > > > 1 file changed, 24 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > index 6568586b21ab..015c9468eed5 100644 > > > > --- a/mm/huge_memory.c > > > > +++ b/mm/huge_memory.c > > > > @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > > > pgtable_t pgtable = NULL; > > > > int ret = -ENOMEM; > > > > + pmd = pmdp_get_lockless(src_pmd); > > > > + if (unlikely(pmd_special(pmd))) { > > > > + dst_ptl = pmd_lock(dst_mm, dst_pmd); > > > > + src_ptl = pmd_lockptr(src_mm, src_pmd); > > > > + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); > > > > + /* > > > > + * No need to recheck the pmd, it can't change with write > > > > + * mmap lock held here. > > > > + */ > > > > + if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) { > > > > + pmdp_set_wrprotect(src_mm, addr, src_pmd); > > > > + pmd = pmd_wrprotect(pmd); > > > > + } > > > > + goto set_pmd; > > > > + } > > > > + > > > > > > I strongly assume we should be using using vm_normal_page_pmd() instead of > > > pmd_page() further below. pmd_special() should be mostly limited to GUP-fast > > > and vm_normal_page_pmd(). > > > > One thing to mention that it has this: > > > > if (!vma_is_anonymous(dst_vma)) > > return 0; > > Another obscure thing in this function. It's not the job of copy_huge_pmd() > to make the decision whether to copy, it's the job of vma_needs_copy() in > copy_page_range(). > > And now I have to suspect that uffd-wp is broken with this function, because > as vma_needs_copy() clearly states, we must copy, and we don't do that for > PMDs. Ugh. > > What a mess, we should just do what we do for PTEs and we will be fine ;) IIUC it's not a problem: file uffd-wp is different from anonymous, in that it pushes everything down to ptes. It means if we skipped one huge pmd here for file, then it's destined to have nothing to do with uffd-wp, otherwise it should have already been split at the first attempt to wr-protect. > > Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) || > pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)" > > Would that even be the case with PFNMAP? I suspect that pmd_trans_huge() > would return "true" for special pfnmap, which is rather "surprising", but > fortunate for us. It's definitely not surprising to me as that's the plan.. and I thought it shoulidn't be surprising to you - if you remember before I sent this one, I tried to decouple that here with the "thp agnostic" series: https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com in which you reviewed it (which I appreciated). So yes, pfnmap on pmd so far will report pmd_trans_huge==true. > > Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup for > another day. Yes, ultimately it should really be a pmd_leaf(), but since I didn't get much feedback there, and that can further postpone this series from being posted I'm afraid, then I decided to just move on with "taking pfnmap as THPs". The corresponding change on this path is here in that series: https://lore.kernel.org/all/20240717220219.3743374-7-peterx@redhat.com/ @@ -1235,8 +1235,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, src_pmd = pmd_offset(src_pud, addr); do { next = pmd_addr_end(addr, end); - if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) - || pmd_devmap(*src_pmd)) { + if (is_swap_pmd(*src_pmd) || pmd_is_leaf(*src_pmd)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma); err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, > > > > > So it's only about anonymous below that. In that case I feel like the > > pmd_page() is benign, and actually good. > > Yes, it would likely currently work. > > > > > Though what you're saying here made me notice my above check doesn't seem > > to be necessary, I mean, "(is_cow_mapping(src_vma->vm_flags) && > > pmd_write(pmd))" can't be true when special bit is set, aka, pfnmaps.. and > > if it's writable for CoW it means it's already an anon. > > > > I think I can probably drop that line there, perhaps with a > > VM_WARN_ON_ONCE() making sure it won't happen. > > > > > > > > Again, we should be doing this similar to how we handle PTEs. > > > > > > I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd)" check, below: > > > what else should we have here if it's not a migration entry but a present > > > entry? > > > > I had a feeling that it was just a safety belt since the 1st day of thp > > when Andrea worked that out, so that it'll work with e.g. file truncation > > races. > > > > But with current code it looks like it's only anonymous indeed, so looks > > not possible at least from that pov. > > Yes, as stated above, likely broken with UFFD-WP ... > > I really think we should make this code just behave like it would with PTEs, > instead of throwing in more "different" handling. So it could simply because file / anon uffd-wp work very differently. Let me know if you still spot something that is suspicious, but in all cases I guess we can move on with this series, but maybe if you can find something I can tackle them together when I decide to go back to the mremap() issues in the other thread. Thanks, -- Peter Xu