Date: Mon, 27 Jan 2025 15:51:47 +0000
From: Will Deacon
To: Jann Horn
Cc: Valentin Schneider, linux-kernel@vger.kernel.org, x86@kernel.org,
    virtualization@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
    loongarch@lists.linux.dev, linux-riscv@lists.infradead.org,
    linux-perf-users@vger.kernel.org, xen-devel@lists.xenproject.org,
    kvm@vger.kernel.org, linux-arch@vger.kernel.org, rcu@vger.kernel.org,
    linux-hardening@vger.kernel.org, linux-mm@kvack.org,
    linux-kselftest@vger.kernel.org, bpf@vger.kernel.org,
    bcm-kernel-feedback-list@broadcom.com, Juergen Gross, Ajay Kaher,
    Alexey Makhalov, Russell King, Catalin Marinas, Huacai Chen,
    WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou,
    Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
    "H. Peter Anvin", Peter Zijlstra, Arnaldo Carvalho de Melo,
    Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
    Ian Rogers, Adrian Hunter, "Liang, Kan", Boris Ostrovsky,
    Josh Poimboeuf, Pawan Gupta, Sean Christopherson, Paolo Bonzini,
    Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker,
    "Paul E. McKenney", Jason Baron, Steven Rostedt, Ard Biesheuvel,
    Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
    Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
    Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
    Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
    Kees Cook, Andrew Morton, Christoph Hellwig, Shuah Khan,
    Sami Tolvanen, Miguel Ojeda, Alice Ryhl, "Mike Rapoport (Microsoft)",
    Samuel Holland, Rong Xu, Nicolas Saenz Julienne, Geert Uytterhoeven,
    Yosry Ahmed, "Kirill A. Shutemov", "Masami Hiramatsu (Google)",
    Jinghao Jia, Luis Chamberlain, Randy Dunlap, Tiezhu Yang
Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
Message-ID: <20250127155146.GB25757@willie-the-truck>
References: <20250114175143.81438-1-vschneid@redhat.com>
    <20250114175143.81438-30-vschneid@redhat.com>

On Fri, Jan 17, 2025 at 04:52:19PM +0100, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider wrote:
> > On 14/01/25 19:16, Jann Horn wrote:
> > > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider wrote:
> > >> vunmap()s issued from housekeeping CPUs are a relatively common source of
> > >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> > >> flush_tlb_kernel_range() IPIs.
> > >>
> > >> Given that CPUs executing in userspace do not access data in the vmalloc
> > >> range, these IPIs could be deferred until their next kernel entry.
> > >>
> > >> Deferral vs early entry danger zone
> > >> ===================================
> > >>
> > >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> > >> and then accessed in early entry code.
> > >
> > > In other words, it needs a guarantee that no vmalloc allocations that
> > > have been created in the vmalloc region while the CPU was idle can
> > > then be accessed during early entry, right?
> >
> > I'm not sure if that would be a problem (not an mm expert, please do
> > correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> > deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
>
> > So after vmapping something, I wouldn't expect isolated CPUs to have
> > invalid TLB entries for the newly vmapped page.
> >
> > However, upon vunmap'ing something, the TLB flush is deferred, and thus
> > stale TLB entries can and will remain on isolated CPUs, up until they
> > execute the deferred flush themselves (IOW for the entire duration of the
> > "danger zone").
> >
> > Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
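
To spell that out for anyone skimming the thread, the lifecycle
described above boils down to something like the sketch below. This is
illustrative pseudo-code only -- the helper names
(unmap_kernel_page_tables(), recycle_va_space()) are made up and it is
not the real mm/vmalloc.c code -- but flush_tlb_kernel_range() is the
real interface whose IPIs the series wants to defer:

/* Illustrative sketch only; not the real mm/vmalloc.c implementation. */
struct lazy_range {
	unsigned long start, end;
	struct lazy_range *next;
};

static struct lazy_range *lazy_list;	/* unmapped, but not yet flushed */

/* vunmap() path: tear down the page tables, defer the TLB flush. */
static void vunmap_lazy(struct lazy_range *r)
{
	unmap_kernel_page_tables(r->start, r->end);	/* made-up helper */
	r->next = lazy_list;
	lazy_list = r;		/* stale entries may still sit in TLBs */
}

/* Batched purge: one flush for everything, then the VA space is reusable. */
static void purge_lazy_ranges(void)
{
	unsigned long start = -1UL, end = 0;
	struct lazy_range *r;

	if (!lazy_list)
		return;

	for (r = lazy_list; r; r = r->next) {
		start = min(start, r->start);
		end = max(end, r->end);
	}

	flush_tlb_kernel_range(start, end);	/* the IPIs being deferred */
	recycle_va_space(lazy_list);		/* made-up helper: VA handed out again */
	lazy_list = NULL;
}

If an isolated CPU skips that flush_tlb_kernel_range() because it
happens to be in userspace, it can still hold translations for a range
that the last step has already handed back out for reuse, which is
exactly the incoherence being described here.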

> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.

Yes, that would definitely be a problem. There's also the more obvious
issue that the CnP ("Common not Private") feature of some Arm CPUs
means that TLB entries can be shared between cores, so the whole idea
of using a CPU's exception level to predicate invalidation is flawed on
such a system.

Will
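
P.S. To make the CnP point concrete: the kind of predication being
discussed amounts to something like the sketch below (again,
illustrative pseudo-code rather than the code from this series; the
cpu_running_in_userspace(), defer_flush_until_kernel_entry() and
send_flush_ipi() helpers are made up). The assumption it encodes --
"this CPU is in userspace, so its stale kernel translations cannot
affect anyone" -- is what CnP undermines, because the deferring CPU's
TLB entries may be shared with a sibling core that is executing in the
kernel right now:

/* Illustrative sketch only; these helpers do not exist. */
static void flush_tlb_kernel_range_deferrable(unsigned long start,
					      unsigned long end)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (cpu_running_in_userspace(cpu))	/* made-up predicate */
			defer_flush_until_kernel_entry(cpu, start, end);
		else
			send_flush_ipi(cpu, start, end);
	}
}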