From: Valentin Schneider <vschneid@redhat.com>
To: Jann Horn
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, virtualization@lists.linux.dev,
 linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
 linux-riscv@lists.infradead.org, linux-perf-users@vger.kernel.org,
 xen-devel@lists.xenproject.org, kvm@vger.kernel.org, linux-arch@vger.kernel.org,
 rcu@vger.kernel.org, linux-hardening@vger.kernel.org, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org, bpf@vger.kernel.org,
 bcm-kernel-feedback-list@broadcom.com, Juergen Gross, Ajay Kaher,
 Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
 WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin", Peter Zijlstra,
 Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin,
 Jiri Olsa, Ian Rogers, Adrian Hunter, "Liang, Kan", Boris Ostrovsky,
 Josh Poimboeuf, Pawan Gupta, Sean Christopherson, Paolo Bonzini,
 Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker, "Paul E. McKenney",
 Jason Baron, Steven Rostedt, Ard Biesheuvel, Neeraj Upadhyay, Joel Fernandes,
 Josh Triplett, Boqun Feng, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan,
 Zqiang, Juri Lelli, Clark Williams, Yair Podemsky, Tomas Glozar,
 Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman, Kees Cook,
 Andrew Morton, Christoph Hellwig, Shuah Khan, Sami Tolvanen, Miguel Ojeda,
 Alice Ryhl, "Mike Rapoport (Microsoft)", Samuel Holland, Rong Xu,
 Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed, "Kirill A. Shutemov",
 "Masami Hiramatsu (Google)", Jinghao Jia, Luis Chamberlain, Randy Dunlap,
 Tiezhu Yang
Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
References: <20250114175143.81438-1-vschneid@redhat.com> <20250114175143.81438-30-vschneid@redhat.com>
Date: Mon, 10 Feb 2025 19:36:25 +0100
On 17/01/25 16:52, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation, which is simply an empty
> inline function.
>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
>
> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break-before-make" rule that forbids conflicting writable
> address translations or something like that.
>
> (I said "until you run out of virtual memory in the vmap region", but
> that's not actually true - see the comment above lazy_max_pages() for
> an explanation of the actual heuristic. You might be able to tune that
> a bit if you'd be significantly happier with less frequent
> interruptions, or something along those lines.)

I've been thinking some more (this is your cue to grab a brown paper bag)...

Experimentation (unmapping the whole VMALLOC range upon return to
userspace and seeing what explodes upon entry into the kernel) suggests
that the early entry "danger zone" only accesses the vmapped stack, which
itself isn't an issue.

That is obviously just a test on one system configuration. The problem
I'm facing is putting in place /some/ form of instrumentation that would
at the very least cause a warning for any future patch that introduces a
vmap'd access in early entry code - that, or a complete mitigation that
prevents those accesses altogether.

What if isolated CPUs unconditionally did a TLBi as late as possible in
the stack right before returning to userspace? This would mean that upon
re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
range translations - with the exception of whatever lies between the
last-minute flush and the actual userspace entry, which should be
feasible to vet. Then AFAICT there wouldn't be any work/flush to defer,
and the IPI could be entirely silenced if it targets an isolated CPU.