Date: Wed, 21 Jul 2021 18:28:03 -0400
From: Peter Xu
To: Ivan Teterevkov
Cc: David Hildenbrand, Tiberiu Georgescu, "linux-kernel@vger.kernel.org",
    "linux-mm@kvack.org", Axel Rasmussen, Nadav Amit, Jerome Glisse,
    "Kirill A . Shutemov", Jason Gunthorpe, Alistair Popple, Andrew Morton,
    Andrea Arcangeli, Matthew Wilcox, Mike Kravetz, Hugh Dickins, Miaohe Lin,
    Mike Rapoport, "Carl Waldspurger [C]", Florian Schmidt, "ovzxemul@gmail.com"
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
References: <20210715201422.211004-1-peterx@redhat.com>
    <20210715201651.212134-1-peterx@redhat.com>
    <5c3c84ee-02f6-a2af-13b8-5dcf70676641@redhat.com>

Hi, Ivan,

On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
> On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> > On 21.07.21 16:38, Ivan Teterevkov wrote:
> > > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> > >> I'm also curious what would be the real use to have an accurate
> > >> PM_SWAP accounting. To me current implementation may not provide
> > >> accurate value but should be good enough for most cases. However not
> > >> sure whether it's also true for your use case.
> > >
> > > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > > interface) to enhance the live migration for some fraction of the
> > > guest VMs that have their pages swapped out to the host swap. Once
> > > those pages are paged in and transferred over network, we then want to
> > > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > > of the guest VMs to reduce the thrashing of the host swap.
> >
> > There are 3 possibilities I think (swap is just another variant of the page cache):
> >
> > 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
> >    pte_none().
> > 2) The page is in the page cache and is not mapped into the page table.
> >    pte_none().
> > 3) The page is in the page cache and mapped into the page table.
> >    !pte_none().
> >
> > Do I understand correctly that you want to identify 1) and indicate it via
> > PM_SWAP?
>
> Yes, and I also want to outline the context so we're on the same page.
>
> This series introduces the support for userfaultfd-wp for shared memory
> because once a shared page is swapped, its PTE is cleared. Upon retrieval
> from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
> because unlike private memory it's not kept in PTE or elsewhere.
>
> We came across the same issue with PM_SWAP in the pagemap interface, but
> fortunately, there's the place that we could query: the i_pages field of
> the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
> we do it similarly to what shmem_fault() does when it handles #PF.
>
> Now, in the context of this series, we were exploring whether it makes
> any practical sense to introduce more brand new flags to the special
> PTE to populate the pagemap flags "on the spot" from the given PTE.
>
> However, I can't see how (and why) to achieve that specifically for
> PM_SWAP even with an extra bit: the XArray is precisely what we need for
> the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
> problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
> need it at the moment.
>
> Hope that clarification makes sense?

Yes, it helps, thanks. So now I understand where that patch came from: even
if it may not work for PM_SOFT_DIRTY, it does seem to work for PM_SWAP.

However, I have a concern that I also raised in the other thread: I think
there'll be an extra and meaningless xa_load() for every real pte_none() that
isn't swapped out but simply never had a page behind it. That happens much
more frequently when the memory being observed by pagemap is mapped as one
huge chunk but sparsely populated. With the old code we simply skip those
ptes, but now I have no idea how much overhead an xa_load() would bring.

Btw, I think there's a way to implement this idea similarly to the swap
special uffd-wp pte: when page reclaim swaps out a shmem page, instead of
leaving a none pte there, maybe we can set one bit in the none pte showing
that this pte is swapped out. When the page is faulted back in, we just drop
that bit. The pagemap code could also check that bit to know that the page
was swapped out. That should be much lighter than xa_load(), and it
distinguishes such a pte from a real none pte immediately, just by reading
the value.

Do you think this would work?

>
> The only outstanding note I have is about the compatibility of our
> patches around pte_to_pagemap_entry(). I think the resulting code
> should look like this:
>
>     static pagemap_entry_t pte_to_pagemap_entry(...)
>     {
>         if (pte_present(pte)) {
>             ...
>         } else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>             ...
>             if (pte_swp_uffd_wp_special(pte)) {
>                 flags |= PM_UFFD_WP;
>             }
>         }
>     }
>
> The is_swap_pte() branch will be taken for the swapped out shared pages,
> thanks to shmem_file(), so the pte_swp_uffd_wp_special() can be checked
> inside.
>
> Alternatively, we could just remove "else" statement:
>
>     static pagemap_entry_t pte_to_pagemap_entry(...)
>     {
>         if (pte_present(pte)) {
>             ...
>         } else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>             ...
>         }
>
>         if (pte_swp_uffd_wp_special(pte)) {
>             flags |= PM_UFFD_WP;
>         }
>     }
>
> What do you reckon?

I'm not too worried yet about how we implement those in detail. Both look
fine to me.

Thanks,

--
Peter Xu
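
For reference, here is a minimal userspace sketch of how the pagemap flags discussed in this thread (PM_SWAP, PM_UFFD_WP, PM_SOFT_DIRTY) can be read back from a 64-bit pagemap entry. It assumes the bit layout described in Documentation/admin-guide/mm/pagemap.rst. The struct pm_flags type and the pm_decode() and pm_read_entry() helpers are hypothetical names made up for this illustration. They are not part of the kernel, of libc, or of any patch in this series.

```c
#define _XOPEN_SOURCE 700        /* for pread() under strict -std= modes */
#define _FILE_OFFSET_BITS 64     /* pagemap offsets can exceed 32 bits */

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Flag bits of one pagemap entry (see Documentation/admin-guide/mm/pagemap.rst). */
#define PM_SOFT_DIRTY (1ULL << 55)   /* pte is soft-dirty */
#define PM_UFFD_WP    (1ULL << 57)   /* pte is uffd-wp write-protected */
#define PM_FILE       (1ULL << 61)   /* file-mapped page or shared anon */
#define PM_SWAP       (1ULL << 62)   /* page is swapped out */
#define PM_PRESENT    (1ULL << 63)   /* page is present in RAM */

/* Hypothetical helper type for this sketch only. */
struct pm_flags {
    bool present;
    bool swapped;
    bool file_or_shared;
    bool uffd_wp;
    bool soft_dirty;
};

/* Split the flag bits of a raw pagemap entry into booleans. */
static struct pm_flags pm_decode(uint64_t entry)
{
    struct pm_flags f = {
        .present        = (entry & PM_PRESENT) != 0,
        .swapped        = (entry & PM_SWAP) != 0,
        .file_or_shared = (entry & PM_FILE) != 0,
        .uffd_wp        = (entry & PM_UFFD_WP) != 0,
        .soft_dirty     = (entry & PM_SOFT_DIRTY) != 0,
    };
    return f;
}

/*
 * Read the pagemap entry that covers vaddr from an already opened
 * pagemap file descriptor. Each virtual page has one 8-byte entry,
 * indexed by virtual page number. Returns 0 on success, -1 on error.
 */
static int pm_read_entry(int fd, uintptr_t vaddr, uint64_t *entry)
{
    long page_size = sysconf(_SC_PAGESIZE);
    off_t offset;

    assert(entry != NULL);
    if (page_size <= 0)
        return -1;

    offset = (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(*entry);
    if (pread(fd, entry, sizeof(*entry), offset) != (ssize_t)sizeof(*entry))
        return -1;
    return 0;
}
```

A caller would open /proc/self/pagemap read-only, pass the descriptor and an address to pm_read_entry(), and then hand the result to pm_decode(). Whether the swapped and uffd_wp fields come back accurate for a swapped-out shmem page is exactly the open question in this thread, so the sketch only shows how the bits are consumed, not how the kernel should fill them.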