From: Dan Williams
Date: Tue, 5 Nov 2019 16:08:32 -0800
Subject: Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
To: Sean Christopherson
Cc: David Hildenbrand, Linux Kernel Mailing List, Linux MM, Michal Hocko,
 Andrew Morton, kvm-ppc@vger.kernel.org, linuxppc-dev, KVM list,
 linux-hyperv@vger.kernel.org, devel@driverdev.osuosl.org, xen-devel,
 X86 ML, Alexander Duyck, Alex Williamson, Allison Randal, Andy Lutomirski,
 "Aneesh Kumar K.V", Anshuman Khandual, Anthony Yznaga,
 Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky, Christophe Leroy,
 Cornelia Huck, Dave Hansen, Haiyang Zhang, "H. Peter Anvin", Ingo Molnar,
 "Isaac J. Manjarres", Jim Mattson, Joerg Roedel, Johannes Weiner,
 Juergen Gross, KarimAllah Ahmed, Kees Cook, "K. Y. Srinivasan",
 "Matthew Wilcox (Oracle)", Matt Sickler, Mel Gorman, Michael Ellerman,
 Mike Rapoport, Nicholas Piggin, Oscar Salvador, Paolo Bonzini,
 Paul Mackerras, Pavel Tatashin, Peter Zijlstra, Qian Cai, Radim Krčmář,
 Sasha Levin, Stefano Stabellini, Stephen Hemminger, Thomas Gleixner,
 Vitaly Kuznetsov, Vlastimil Babka, Wanpeng Li, YueHaibing, Adam Borowski
In-Reply-To: <20191106000315.GI23297@linux.intel.com>
References: <01adb4cb-6092-638c-0bab-e61322be7cf5@redhat.com>
 <613f3606-748b-0e56-a3ad-1efaffa1a67b@redhat.com>
 <20191105160000.GC8128@linux.intel.com>
 <20191105231316.GE23297@linux.intel.com>
 <20191106000315.GI23297@linux.intel.com>

On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson wrote:
>
> On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote:
> > On Tue, Nov 5, 2019 at 3:30 PM Dan Williams wrote:
> > >
> > > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson wrote:
> > > >
> > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> > > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand wrote:
> > > > > > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > > > > > interaction between THP and _PAGE_DEVMAP.
> > > > > >
> > > > > > The x86 KVM MMU code is some of the ugliest code I know (sorry, but
> > > > > > it had to be said :/ ). Luckily, this should be independent of the
> > > > > > PG_reserved thingy AFAIKs.
> > > > >
> > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> > > > > page count gets mismanaged and leads to the reported hang.
> > > >
> > > > When mapping pages into the guest, KVM gets the page via gup(), which
> > > > increments the page count for ZONE_DEVICE pages. But KVM puts the page
> > > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> > > > and so never puts its reference to ZONE_DEVICE pages.
> > >
> > > Oh, yeah, that's busted.
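(To spell out the leak being described, here is a condensed sketch of the two
helpers involved as things stand in this thread, i.e. while ZONE_DEVICE pages
still carry PG_reserved. This is paraphrased for illustration, not verbatim
kvm_main.c:)

/* ZONE_DEVICE pages are currently PageReserved(), so this returns true. */
bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
{
	if (pfn_valid(pfn))
		return PageReserved(pfn_to_page(pfn));

	return true;
}

void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
	/*
	 * The reference taken by gup() at fault time is only dropped for
	 * non-reserved pages; a ZONE_DEVICE pfn takes the "reserved" branch,
	 * put_page() is skipped, and the gup() reference is leaked.
	 */
	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
		put_page(pfn_to_page(pfn));
}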
> >
> > Ugh, it's extra busted because every other gup user in the kernel
> > tracks the pages resulting from gup and puts them (put_page()) when
> > they are done. KVM wants to forget about whether it did a gup to get
> > the page and optionally trigger put_page() based purely on the pfn.
> > Outside of VFIO device assignment that needs pages pinned for DMA, why
> > does KVM itself need to pin pages? If pages are pinned over a return
> > to userspace, that needs to be a FOLL_LONGTERM gup.
>
> Short answer, KVM pins the page to ensure correctness with respect to the
> primary MMU invalidating the associated host virtual address, e.g. when
> the page is being migrated or unmapped from host userspace.
>
> The main use of gup() is to handle guest page faults and map pages into
> the guest, i.e. into KVM's secondary MMU. KVM uses gup() to both get the
> PFN and to temporarily pin the page. The pin is held just long enough to
> guarantee that any invalidation via the mmu_notifier will be stalled
> until after KVM finishes installing the page into the secondary MMU, i.e.
> the pin is short-term and not held across a return to userspace or entry
> into the guest. When a subsequent mmu_notifier invalidation occurs, KVM
> pulls the PFN from the secondary MMU and uses that to update accessed
> and dirty bits in the host.
>
> There are a few other KVM flows that eventually call into gup(), but those
> are "traditional" short-term pins and use put_page() directly.

Ok, I was misinterpreting the effect of the bug with respect to what KVM
is using the reference to do.

To your other point:

> But David's proposed fix for the above refcount bug is to omit the patch
> so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems
> like the right thing to do, including for thp_adjust(), e.g. it would
> naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
> mapped with a huge page (2mb or above) in the host. The only hiccup is
> figuring out how to correctly transfer the reference.

That might not be the only hiccup. There's currently no such thing as
huge pages for ZONE_DEVICE; there are huge *mappings* (pmd and pud),
but the result of pfn_to_page() on such a mapping does not yield a
huge 'struct page'. It also seems there are other paths in KVM that
assume the typical page machinery is active, like SetPageDirty() gated
on kvm_is_reserved_pfn(). While I told David that I did not want to
see more usage of is_zone_device_page(), the patch below (untested)
seems a cleaner path with fewer surprises:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4df0aa6b8e5c..fbea17c1810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
+	if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
+	    (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
 		put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

This is safe because the reference that KVM took earlier protects the
is_zone_device_page() lookup from racing device teardown. Otherwise,
if KVM does not hold a reference, the lookup is unsafe, but that case
is already even more broken because KVM would be releasing a page that
it never referenced. Every other KVM operation that assumes page
allocator pages would continue to honor kvm_is_reserved_pfn().
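(If that check ends up being needed in more than one place, e.g. for the
SetPageDirty() path mentioned above, it could be pulled into a small helper.
The sketch below shows that factoring; kvm_is_zone_device_pfn() is a
hypothetical name used only for illustration, and the caveat that the caller
must already hold a reference still applies:)

/*
 * Hypothetical helper wrapping the pfn_valid() + is_zone_device_page()
 * check from the patch above; only meaningful while the caller already
 * holds a reference on the page, otherwise the lookup can race device
 * teardown.
 */
static bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
{
	return pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn));
}

void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
	if (is_error_noslot_pfn(pfn))
		return;

	/* Drop the gup() reference for normal pages and ZONE_DEVICE pages. */
	if (!kvm_is_reserved_pfn(pfn) || kvm_is_zone_device_pfn(pfn))
		put_page(pfn_to_page(pfn));
}

(Again just a sketch: whether the explicit early return for error/noslot
pfns matches every caller's expectation would need to be checked against
the actual callers.)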