Date: Thu, 8 Aug 2024 17:21:20 -0400
From: Peter Xu
To: Sean Christopherson
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Aneesh Kumar K . V", Michael Ellerman, Oscar Salvador,
	Dan Williams, James Houghton, Matthew Wilcox, Nicholas Piggin,
	Rik van Riel, Dave Jiang, Andrew Morton, x86@kernel.org,
	Ingo Molnar, Rick P Edgecombe, "Kirill A . Shutemov",
	linuxppc-dev@lists.ozlabs.org, Mel Gorman, Hugh Dickins,
	Borislav Petkov, David Hildenbrand, Thomas Gleixner,
	Vlastimil Babka, Dave Hansen, Christophe Leroy, Huang Ying,
	kvm@vger.kernel.org, Paolo Bonzini, David Rientjes
Subject: Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
References: <20240807194812.819412-1-peterx@redhat.com>
	<20240807194812.819412-3-peterx@redhat.com>

Hi, Sean,

On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> On Wed, Aug 07, 2024, Peter Xu wrote:
> > mprotect() does mmu notifiers in PMD levels. It's there since 2014 of
> > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > change_pmd_range").
> >
> > At that time, the issue was that NUMA balancing can be applied on a huge
> > range of VM memory, even if nothing was populated. The notification can be
> > avoided in this case if no valid pmd detected, which includes either THP or
> > a PTE pgtable page.
> >
> > Now to pave way for PUD handling, this isn't enough. We need to generate
> > mmu notifications even on PUD entries properly. mprotect() is currently
> > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > mappings already), this is the start to fix it.
> >
> > To fix that, this patch proposes to push such notifications to the PUD
> > layers.
> >
> > There is risk on regressing the problem Rik wanted to resolve before, but I
> > think it shouldn't really happen, and I still chose this solution because
> > of a few reasons:
> >
> > 1) Consider a large VM that should definitely contain more than GBs of
> >    memory, it's highly likely that PUDs are also none. In this case there
>
> I don't follow this. Did you mean to say it's highly likely that PUDs are *NOT*
> none?

I did mean the original wording.
Note that the previous case Rik worked on was about a mostly empty VM that
got a NUMA hint applied. So I did mean "PUDs are also none" here, with the
hope that when the NUMA hint applies to any part of the unpopulated guest
memory, it'll find nothing in the PUDs. Here it's mostly not about a huge
PUD mapping, as long as the guest memory is not backed by DAX (since only
DAX supports 1G huge PUDs so far, while hugetlb has its own path here in
mprotect, so it must be things like anon or shmem), but about a PUD entry
that contains pmd pgtables. For that part, I was trying to justify "no pmd
pgtable installed" with the assumption of "a large VM that should
definitely contain more than GBs of memory": it means the PUD range should
hopefully never have been accessed, so even the pmd pgtable entry should be
missing.

With that, we should hopefully keep avoiding mmu notifications after this
patch, just like it used to be when done in pmd layers.

>
> >    will have no regression.
> >
> > 2) KVM has evolved a lot over the years to get rid of rmap walks, which
> >    might be the major cause of the previous soft-lockup. At least TDP MMU
> >    already got rid of rmap as long as not nested (which should be the major
> >    use case, IIUC), then the TDP MMU pgtable walker will simply see empty VM
> >    pgtable (e.g. EPT on x86), the invalidation of a full empty region in
> >    most cases could be pretty fast now, comparing to 2014.
>
> The TDP MMU will indeed be a-ok. It only zaps leaf SPTEs in response to
> mmu_notifier invalidations, and checks NEED_RESCHED after processing each SPTE,
> i.e. KVM won't zap an entire PUD and get stuck processing all its children.
>
> I doubt the shadow MMU will fair much better than it did years ago though, AFAICT
> the relevant code hasn't changed. E.g. when zapping a large range in response to
> an mmu_notifier invalidation, KVM never yields even if blocking is allowed. That
> said, it is stupidly easy to fix the soft lockup problem in the shadow MMU.
> KVM already has an rmap walk path that plays nice with NEED_RESCHED *and* zaps rmaps,
> but because of how things grew organically over the years, KVM never adopted the
> cond_resched() logic for the mmu_notifier path.
>
> As a bonus, now the .change_pte() is gone, the only other usage of x86's
> kvm_handle_gfn_range() is for the aging mmu_notifiers, and I want to move those
> to their own flow too[*], i.e. kvm_handle_gfn_range() in its current form can
> be removed entirely.
>
> I'll post a separate series, I don't think it needs to block this work, and I'm
> fairly certain I can get this done for 6.12 (shouldn't be a large or scary series,
> though I may tack on my lockless aging idea as an RFC).

Great, and thanks for all this information! Glad to know.

I guess it makes me feel more confident that this patch shouldn't have any
major side effect, at least on the KVM side.

Thanks,

>
> [*] https://lore.kernel.org/all/Zo137P7BFSxAutL2@google.com
>

-- 
Peter Xu