Date: Tue, 25 Jul 2023 10:12:21 -0700
From: Dave Hansen
To: Marcelo Tosatti
Cc: Valentin Schneider, Nadav Amit, Linux Kernel Mailing List,
 "linux-trace-kernel@vger.kernel.org", "linux-doc@vger.kernel.org",
 "kvm@vger.kernel.org", linux-mm, bpf, the arch/x86 maintainers,
 "rcu@vger.kernel.org", "linux-kselftest@vger.kernel.org",
 Steven Rostedt, Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin",
 Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
 Peter Zijlstra, Frederic Weisbecker, "Paul E. McKenney",
 Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
 Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
 Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf,
 Jason Baron, Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
 Juerg Haefliger, Nicolas Saenz Julienne, "Kirill A. Shutemov",
 Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
 "Jason A. Donenfeld", Song Liu, Julian Pidancet, Tom Lendacky,
 Dionna Glaze, Thomas Weißschuh, Juri Lelli,
 Daniel Bristot de Oliveira, Yair Podemsky
Subject: Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer
 flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
References: <20230720163056.2564824-1-vschneid@redhat.com>
 <20230720163056.2564824-21-vschneid@redhat.com>
 <188AEA79-10E6-4DFF-86F4-FE624FD1880F@vmware.com>
 <2284d0db-f94a-e059-7bd0-bab4f112ed35@intel.com>

On 7/25/23 09:37, Marcelo Tosatti wrote:
>> TLB flushes for freed page tables are another game entirely. The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
> Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes TLB (and page
> table caches) on user->kernel and kernel->user context switches ?

Well, first of all, CONFIG_PAGE_TABLE_ISOLATION doesn't flush the TLB at
all on user<->kernel switches when PCIDs are enabled.

Second, even if it did, the CPU is still free to cache any portion of
the paging hierarchy at any time.  Without LASS[1], userspace can even
_compel_ walks of the kernel portion of the address space, and we don't
have any infrastructure to tell if a freed kernel page is exposed in the
user copy of the page tables with PTI.

Third, (also ignoring PCIDs) there are plenty of instructions between
kernel entry and the MOV-to-CR3 that can flush the TLB.  All those
instructions are architecturally permitted to speculatively set Accessed
or Dirty bits in any part of the address space.  If they run into a free
page table page, things get ugly.

These accesses are not _likely_.  There probably isn't a predictor out
there that's going to see a:

	movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)

and go off trying to dirty memory in the vmalloc() area.
But we'd need some backward *and* forward-looking guarantees from our
intrepid CPU designers to promise that this kind of thing is safe
yesterday, today and tomorrow.  I suspect such a guarantee is going to
be hard to obtain.

1. https://lkml.kernel.org/r/20230110055204.3227669-1-yian.chen@intel.com
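
As background for the first point about PCIDs, here is a minimal
userspace sketch of the architectural CR3 layout when CR4.PCIDE=1.  It
is an illustration only, not kernel code and not part of the patch under
discussion: make_cr3() and cr3_write_may_keep_tlb() are hypothetical
helper names, and the two constants simply restate the documented bit
positions (PCID in bits 11:0, "no flush" in bit 63 of the MOV-to-CR3
source operand).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* With CR4.PCIDE=1, CR3 bits 11:0 carry the PCID. */
#define CR3_PCID_MASK 0xFFFULL
/* Bit 63 of the MOV-to-CR3 operand: the write need not invalidate the TLB. */
#define CR3_NOFLUSH   (1ULL << 63)

/*
 * Hypothetical helper: compose a CR3 value from a page-aligned top-level
 * page table address, a PCID, and the no-flush request.
 */
static uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid, bool noflush)
{
	assert((pgd_phys & CR3_PCID_MASK) == 0);	/* must be 4K-aligned */
	assert(pcid <= CR3_PCID_MASK);			/* PCID is 12 bits */

	uint64_t cr3 = pgd_phys | pcid;

	if (noflush)
		cr3 |= CR3_NOFLUSH;
	return cr3;
}

/* Hypothetical helper: would loading this value leave cached entries alone? */
static bool cr3_write_may_keep_tlb(uint64_t cr3)
{
	return (cr3 & CR3_NOFLUSH) != 0;
}
```

A CR3 value built with noflush set leaves entries tagged with that PCID
in place across the switch, which is why a user<->kernel transition under
PTI cannot be relied on as a flush of stale translations.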