From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Peter Xu, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Jason Gunthorpe, Mike Kravetz, Alistair Popple, Matthew Wilcox, "Kirill A. Shutemov", Hugh Dickins, Tiberiu Georgescu, Andrea Arcangeli, Axel Rasmussen, Nadav Amit, Mike Rapoport, Jerome Glisse, Andrew Morton, Miaohe Lin
Subject: Re: [PATCH v5 00/26] userfaultfd-wp: Support shmem and hugetlbfs
Date: Mon, 19 Jul 2021 21:21:18 +0200
Message-ID: <251ed5e3-d898-efdc-ca5c-7b047dc80cb4@redhat.com>
In-Reply-To: <20210715201422.211004-1-peterx@redhat.com>
References: <20210715201422.211004-1-peterx@redhat.com>

On 15.07.21 22:13, Peter Xu wrote:
> This is v5 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> full feature. It's based on v5.14-rc1.
> 
> I reposted the whole series mainly to trigger the syzbot tests again; sorry if
> it brings a bit of noise. Please let me know if there's an easier way to
> trigger the syzbot tests instead of reposting the whole series.
> 
> Meanwhile, recent discussion around soft-dirty shows that soft-dirty may have
> a similar requirement as uffd-wp on persisting the dirty information:
> 
> https://lore.kernel.org/lkml/20210714152426.216217-1-tiberiu.georgescu@nutanix.com/
> 
> The mechanism provided in this patchset may therefore be suitable for
> soft-dirty too.
> 
> The whole series can also be found online [1].
> 
> v5 changelog:
> - Fix two issues spotted by syzbot
> - Compile test with (1) !USERFAULTFD, (2) USERFAULTFD && !USERFAULTFD_WP
> 
> Previous versions:
> 
> RFC: https://lore.kernel.org/lkml/20210115170907.24498-1-peterx@redhat.com/
> v1:  https://lore.kernel.org/lkml/20210323004912.35132-1-peterx@redhat.com/
> v2:  https://lore.kernel.org/lkml/20210427161317.50682-1-peterx@redhat.com/
> v3:  https://lore.kernel.org/lkml/20210527201927.29586-1-peterx@redhat.com/
> v4:  https://lore.kernel.org/lkml/20210714222117.47648-1-peterx@redhat.com/
> 
> About Swap Special PTE
> ======================
> 
> In short, the so-called "swap special pte" in this patchset is a new type of
> pte that didn't exist before; this series initially uses it for file-backed
> memory. It persists information even if the ptes get dropped while the page
> cache still exists. For example, when splitting a file-backed huge pmd, we
> can simply drop the pmd entry and wait until another fault comes. That was
> fine in the past, since all information in the pte could be recovered from
> the page cache when the next page fault triggers. However, uffd-wp is
> per-pte information which cannot be kept in the page cache, so that
> information still needs to be maintained somehow in the pgtable entry, even
> if the pgtable entry is going to be dropped. Here, instead of replacing the
> pte with a none entry, we use the "swap special pte". Then, when the next
> page fault triggers, we can observe orig_pte to recover this information.
> 
> I'm copy-pasting some of the commit message from the patch "mm/swap:
> Introduce the idea of special swap ptes", where it explains this pte from
> another angle:
> 
>     We used to have special swap entries, like migration entries, hw-poison
>     entries, device private entries, etc.
> 
>     Those "special swap entries" reside in the range that they need to be at
>     least swap entries first, and their types are decided by swp_type(entry).
> 
>     This patch introduces another idea called "special swap ptes".
> 
>     It's very easy to confuse them with "special swap entries", but a special
>     swap pte should never contain a swap entry at all. That means it's
>     illegal to call pte_to_swp_entry() upon a special swap pte.
> 
>     Make the uffd-wp special pte the first special swap pte.
> 
>     Before this patch, is_swap_pte()==true means one of the below:
> 
>       (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
>             example, when an anonymous page got swapped out.
> 
>       (a.2) The pte has a special swap entry (non_swap_entry()==true). For
>             example, a migration entry, a hw-poison entry, etc.
> 
>     After this patch, is_swap_pte()==true means one of the below, where case
>     (b) is added:
> 
>      (a) The pte contains a swap entry.
> 
>       (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
>             example, when an anonymous page got swapped out.
> 
>       (a.2) The pte has a special swap entry (non_swap_entry()==true). For
>             example, a migration entry, a hw-poison entry, etc.
> 
>      (b) The pte does not contain a swap entry at all (so it cannot be
>          passed into pte_to_swp_entry()). For example, the uffd-wp special
>          swap pte.
> 
> Hugetlbfs needs a similar thing because it's also file-backed. I directly
> reused the same special pte there, though the shmem and hugetlb changes for
> supporting this new pte differ, since they don't share much of the code path.
> 
> Patch layout
> ============
> 
> Part (1): Shmem support; this is where the special swap pte is introduced.
> Some zap rework is needed in the process:
> 
>   mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
>   shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
>   mm: Clear vmf->pte after pte_unmap_same() returns
>   mm/userfaultfd: Introduce special pte for unmapped file-backed mem
>   mm/swap: Introduce the idea of special swap ptes
>   shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
>   mm: Drop first_index/last_index in zap_details
>   mm: Introduce zap_details.zap_flags
>   mm: Introduce ZAP_FLAG_SKIP_SWAP
>   shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
>   shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
>   shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
>   shmem/userfaultfd: Handle the left-overed special swap ptes
>   shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
> 
> Part (2): Hugetlb support; the "disable huge pmd sharing for uffd-wp"
> patches have been merged. The rest is the changes required to teach
> hugetlbfs to understand the special swap pte introduced with the uffd-wp
> change:
> 
>   mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h
>   mm/hugetlb: Introduce huge pte version of uffd-wp helpers
>   hugetlb/userfaultfd: Hook page faults for uffd write protection
>   hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
>   hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
>   mm/hugetlb: Introduce huge version of special swap pte helpers
>   hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
>   hugetlb/userfaultfd: Allow wr-protect none ptes
>   hugetlb/userfaultfd: Only drop uffd-wp special pte if required
> 
> Part (3): Enable both features in code and tests (plus pagemap support):
> 
>   mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
>   userfaultfd: Enable write protection for shmem & hugetlbfs
>   userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs
> 
> Tests
> =====
> 
> I've tested it using the userfaultfd kselftest program, but also with
> umapsort [2], which should be even stricter. Tested page swapping in/out
> during umapsort.
> 
> If anyone would like to try umapsort, you need to use an extremely hacked
> version of the umap library [3], because by default umap only supports
> anonymous memory. So to test it we need to build [3], then [2].
> 
> Any comment would be greatly welcomed. Thanks,

Hi Peter,

I just stumbled over the copy_page_range() optimization

	/*
	 * Don't copy ptes where a page fault will fill them correctly.
	 * Fork becomes much lighter when there are big shared or private
	 * readonly mappings. The tradeoff is that copy_page_range is more
	 * efficient than faulting.
	 */
	if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
	    !src_vma->anon_vma)
		return 0;

IIUC, that means you'll not copy the WP bits for shmem and, therefore,
lose them during fork.

-- 
Thanks,

David / dhildenb