Date: Tue, 8 Aug 2023 16:56:11 -0700
Subject: Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration
From: Sean Christopherson
To: Jason Gunthorpe
Cc: Yan Zhao, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, pbonzini@redhat.com, mike.kravetz@oracle.com, apopple@nvidia.com, rppt@kernel.org, akpm@linux-foundation.org, kevin.tian@intel.com
References: <20230808071329.19995-1-yan.y.zhao@intel.com> <20230808071702.20269-1-yan.y.zhao@intel.com>

On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > >  		    !is_last_spte(iter.old_spte, iter.level))
> > > >  			continue;
> > > >  
> > > > +		if (skip_pinned) {
> > > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > +			struct folio *folio;
> > > > +
> > > > +			if (!page)
> > > > +				continue;
> > > > +
> > > > +			folio = page_folio(page);
> > > > +
> > > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > +			    folio_maybe_dma_pinned(folio))
> > > > +				continue;
> > > > +		}
> > > > +
> > > 
> > > I don't get it..
> > > 
> > > The last patch made it so that the NUMA balancing code doesn't change
> > > page_maybe_dma_pinned() pages to PROT_NONE
> > > 
> > > So why doesn't KVM just check if the current and new SPTE are the same
> > > and refrain from invalidating if nothing changed?
> > 
> > Because KVM doesn't have visibility into the current and new PTEs when the
> > zapping occurs.  The contract for invalidate_range_start() requires that KVM
> > drop all references before returning, and so the zapping occurs before
> > change_pte_range() or change_huge_pmd() have done anything.
> > 
> > > Duplicating the checks here seems very frail to me.
> > 
> > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > can yield different results purely based on refcounts, i.e. KVM could skip
> > pages that the primary MMU does not, and thus violate the mmu_notifier
> > contract.  And in general, I am steadfastly against adding any kind of
> > heuristic to KVM's zapping logic.
> > 
> > This really needs to be fixed in the primary MMU and not require any direct
> > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself
> > needs to be skipped.
> 
> This likely has the same issue you just described, we don't know if it
> can be skipped until we iterate over the PTEs and by then it is too
> late to invoke the notifier. Maybe some kind of abort and restart
> scheme could work?
> 
> Or maybe treat this as a userspace config problem?

Pinning DMA pages in a VM, having a fair amount of remote memory, *and* expecting
NUMA balancing to do anything useful for that VM seems like a userspace problem.
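
For reference, this is roughly why folio_maybe_dma_pinned() is only a heuristic;
a paraphrased sketch of the include/linux/mm.h implementation, not the verbatim
code (details vary by kernel version):

	static inline bool folio_maybe_dma_pinned(struct folio *folio)
	{
		/* Large folios track pins explicitly, so this half is exact. */
		if (folio_test_large(folio))
			return atomic_read(&folio->_pincount) > 0;

		/*
		 * Small folios only bias the refcount by GUP_PIN_COUNTING_BIAS
		 * (1024) per pin, so ~1024 transient references (GUP, etc.)
		 * produce a false positive, and two callers racing with such
		 * references can see different answers.
		 */
		return ((unsigned int)folio_ref_count(folio)) >=
		       GUP_PIN_COUNTING_BIAS;
	}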

Actually, does NUMA balancing even support this particular scenario?  I see this
in do_numa_page()

	/* TODO: handle PTE-mapped THP */
	if (PageCompound(page))
		goto out_map;

and then for PG_anon_exclusive

	 * ... For now, we only expect it to be
	 * set on tail pages for PTE-mapped THP.
	 */
	PG_anon_exclusive = PG_mappedtodisk,

which IIUC means zapping these pages to do migrate-on-fault will never succeed.

Can we just tell userspace to mbind() the pinned region to explicitly exclude
the VMA(s) from NUMA balancing?
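
Something like this minimal userspace sketch (names and the policy/node choice
are hypothetical, up to the VMM) is what I have in mind; any explicit mempolicy
set via mbind() makes the VMA non-default, and NUMA balancing leaves it alone:

	#include <numaif.h>	/* mbind(), MPOL_BIND; link with -lnuma */
	#include <stdio.h>
	#include <stdlib.h>

	/*
	 * pin_buf/pin_len are stand-ins for the page-aligned guest-memory
	 * region that will be DMA-pinned.
	 */
	static void exclude_from_numa_balancing(void *pin_buf, size_t pin_len,
						int node)
	{
		unsigned long nodemask = 1UL << node;

		if (mbind(pin_buf, pin_len, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, 0)) {
			perror("mbind");
			exit(EXIT_FAILURE);
		}
	}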