From: Valentin Schneider
To: Uladzislau Rezki
Cc: Uladzislau Rezki, Jann Horn, linux-kernel@vger.kernel.org,
 x86@kernel.org, virtualization@lists.linux.dev,
 linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
 linux-riscv@lists.infradead.org, linux-perf-users@vger.kernel.org,
 xen-devel@lists.xenproject.org, kvm@vger.kernel.org,
 linux-arch@vger.kernel.org, rcu@vger.kernel.org,
 linux-hardening@vger.kernel.org, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org, bpf@vger.kernel.org,
 bcm-kernel-feedback-list@broadcom.com, Juergen Gross, Ajay Kaher,
 Alexey Makhalov, Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
 WANG Xuerui, Paul Walmsley, Palmer Dabbelt, Albert Ou, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin",
 Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
 Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, "Liang, Kan",
 Boris Ostrovsky, Josh Poimboeuf, Pawan Gupta, Sean Christopherson,
 Paolo Bonzini, Andy Lutomirski, Arnd Bergmann, Frederic Weisbecker,
 "Paul E. McKenney", Jason Baron, Steven Rostedt, Ard Biesheuvel,
 Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
 Mathieu Desnoyers, Lai Jiangshan, Zqiang, Juri Lelli, Clark Williams,
 Yair Podemsky, Tomas Glozar, Vincent Guittot, Dietmar Eggemann,
 Ben Segall, Mel Gorman, Kees Cook, Andrew Morton, Christoph Hellwig,
 Shuah Khan, Sami Tolvanen, Miguel Ojeda, Alice Ryhl,
 "Mike Rapoport (Microsoft)", Samuel Holland, Rong Xu,
 Nicolas Saenz Julienne, Geert Uytterhoeven, Yosry Ahmed,
 "Kirill A. Shutemov", "Masami Hiramatsu (Google)", Jinghao Jia,
 Luis Chamberlain, Randy Dunlap, Tiezhu Yang
Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
References: <20250114175143.81438-1-vschneid@redhat.com> <20250114175143.81438-30-vschneid@redhat.com>
Date: Mon, 20 Jan 2025 17:09:34 +0100

On 20/01/25 12:15, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
>> On 17/01/25 17:11, Uladzislau Rezki wrote:
>> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >>
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >>
>> > Probably i am missing something and need to have a look at your patches,
>> > but how do you guarantee that no-one map same are that you defer for TLB
>> > flushing?
>> >
>>
>> That's the cool part: I don't :')
>>
> Indeed, sounds unsafe :) Then we just do not need to free areas.
>
>> For deferring instruction patching IPIs, I (well Josh really) managed to
>> get instrumentation to back me up and catch any problematic area.
>>
>> I looked into getting something similar for vmalloc region access in
>> .noinstr code, but I didn't get anywhere. I even tried using emulated
>> watchpoints on QEMU to watch the whole vmalloc range, but that went about
>> as well as you could expect.
>>
>> That left me with staring at code. AFAICT the only vmap'd thing that is
>> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
>> itself cannot be freed until the task exits - thus can't be subject to
>> invalidation when a task is entering kernelspace.
>>
>> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
>>
> As noted before, we defer flushing for vmalloc. We have a lazy-threshold
> which can be exposed(if you need it) over sysfs for tuning. So, we can add it.
>

In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
single userspace application that will never enter the kernel, unless
forced to by some interference (e.g. IPI sent from a housekeeping CPU).

Increasing the lazy threshold would unfortunately only delay the
interference - housekeeping CPUs are free to run whatever, and so they will
eventually cause the lazy threshold to be hit and IPI all the CPUs,
including the isolated/NOHZ_FULL ones.

I was thinking maybe we could subdivide the vmap space into two regions
with their own thresholds, but a task may allocate/vmap stuff while on a HK
CPU and be moved to an isolated CPU afterwards, and also I still don't have
any strong guarantee about what accesses an isolated CPU can do in its
early entry code :(

> --
> Uladzislau Rezki
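
As a rough illustration of the lazy-threshold argument above, here is a toy
userspace model in C. It is not kernel code: count_flush_ipis(), its
parameters and the "one broadcast IPI per purge" accounting are all
assumptions made for this sketch, and it ignores everything else the real
vmalloc purge path does.

```c
#include <assert.h>

/*
 * Toy model (hypothetical, NOT kernel code).
 *
 * Housekeeping CPUs lazily free `pages_per_tick` pages on every tick.
 * Once the backlog of lazily-freed pages reaches `lazy_threshold`, the
 * backlog is purged and one TLB-flush IPI is broadcast to every CPU,
 * isolated/NOHZ_FULL ones included.
 *
 * Returns how many broadcast IPIs occur over `ticks` ticks.
 */
static unsigned long count_flush_ipis(unsigned long ticks,
                                      unsigned long pages_per_tick,
                                      unsigned long lazy_threshold)
{
    unsigned long backlog = 0;
    unsigned long ipis = 0;

    assert(lazy_threshold > 0);

    for (unsigned long t = 0; t < ticks; t++) {
        backlog += pages_per_tick;      /* housekeeping activity */
        if (backlog >= lazy_threshold) {
            backlog = 0;                /* purge the lazy backlog */
            ipis++;                     /* broadcast hits isolated CPUs too */
        }
    }
    return ipis;
}
```

In this model, with one page freed per tick over 1000 ticks, a threshold of
10 gives 100 broadcasts and a threshold of 100 gives 10. Raising the
threshold makes the interference rarer, but it still happens once the
housekeeping CPUs have run for long enough.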