From: Suren Baghdasaryan
Date: Fri, 9 Aug 2024 16:55:17 +0000
Subject: Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
To: Andrii Nakryiko
Cc: Jann Horn, akpm@linux-foundation.org, peterz@infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, Matthew Wilcox,
    Vlastimil Babka, Michal Hocko
References: <20240807182325.2585582-1-surenb@google.com>

On Fri, Aug 9, 2024 at 4:39 PM Andrii Nakryiko wrote:
>
> On Fri, Aug 9, 2024 at 8:21 AM Jann Horn wrote:
> >
> > On Fri, Aug 9, 2024 at 12:36 AM Andrii Nakryiko wrote:
> > > On Thu, Aug 8, 2024 at
> > > 3:16 PM Jann Horn wrote:
> > > >
> > > > On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko wrote:
> > > > > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn wrote:
> > > > > >
> > > > > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko wrote:
> > > > > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan wrote:
> > > > > > > >
> > > > > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan wrote:
> > > > > > > > > >
> > > > > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > > > > write-locked and mm is not modified from under us.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Suren Baghdasaryan
> > > > > > > > > > Suggested-by: Peter Zijlstra
> > > > > > > > > > Cc: Andrii Nakryiko
> > > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > > > > multi-hour stress test run so far. Thanks!
> > > > > > > >
> > > > > > > > Thanks for testing and feel free to include this patch into your set.
> > > > > > >
> > > > > > > Will do!
> > > > > > >
> > > > > > > >
> > > > > > > > I've been thinking about this some more and there is a very unlikely
> > > > > > > > corner case if between mmap_lock_speculation_start() and
> > > > > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > > > > times that mm->mm_lock_seq (int) overflows and just happens to reach
> > > > > > > > the same value as we recorded in mmap_lock_speculation_start().
> > > > > > > > This would generate a false positive, which would show up as if the
> > > > > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > > > > but they are not critical because a false result would simply lead to
> > > > > > > > a retry under mmap_lock. However for your case this would be a
> > > > > > > > critical issue. This is an extremely low probability scenario but
> > > > > > > > should we still try to handle it?
> > > > > > > >
> > > > > > >
> > > > > > > No, I think it's fine.
> > > > > >
> > > > > > Modern computers don't take *that* long to count to 2^32, even when
> > > > > > every step involves one or more syscalls. I've seen bugs where, for
> > > > > > example, a 32-bit refcount is not decremented where it should, making
> > > > > > it possible to overflow the refcount with 2^32 operations of some
> > > > > > kind, and those have taken something like 3 hours to trigger in one
> > > > > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > > > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > > > > you can create 2^32 legitimate references to an object and overflow a
> > > > > > refcount that way
> > > > > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > > > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > > > > 32-bit counter - and that is with every step allocating memory).
> > > > > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > > > > writing to be faster than 25 minutes on a modern desktop machine.
> > > > > >
> > > > > > So for a reader of some kinda 32-bit sequence count, if it is
> > > > > > conceivably possible for the reader to take at least maybe a couple
> > > > > > minutes or so between the sequence count reads (also counting time
> > > > > > during which the reader is preempted or something like that), there
> > > > > > could be a problem. At that point in the analysis, if you wanted to
> > > > > > know whether it's actually exploitable, I guess you'd have to look at
> > > > > > what kinda context you're running in, and what kinda events can
> > > > > > interrupt/preempt you (like whether someone can send a sufficiently
> > > > > > dense flood of IPIs to completely prevent you making forward progress,
> > > > > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > > > > those things can delay you (maybe including what the pessimal
> > > > > > scheduler behavior looks like if you're in preemptible context, or how
> > > > > > long clock interrupts can take to execute when processing a giant pile
> > > > > > of epoll watches), and so on...
> > > > > >
> > > > >
> > > > > And here we are talking about *lockless* *speculative* VMA usage that
> > > > > will last what, at most on the order of a few microseconds?
> > > >
> > > > Are you talking about time spent in task context, or time spent while
> > > > the task is on the CPU (including time in interrupt context), or about
> > > > wall clock time?
> > >
> > > We are doing, roughly:
> > >
> > > mmap_lock_speculation_start();
> > > rcu_read_lock();
> > > vma_lookup();
> > > rb_find();
> > > rcu_read_unlock();
> > > mmap_lock_speculation_end();
> > >
> > >
> > > On non-RT kernel this can be prolonged only by having an NMI somewhere
> > > in the middle.
> >
> > I don't think you're running with interrupts off here?
> > Even on kernels without any preemption support, normal interrupts (like
> > timers, incoming network traffic, TLB flush IPIs) should still be able to
> > interrupt here. And in CONFIG_PREEMPT kernels (which enable
> > CONFIG_PREEMPT_RCU by default), rcu_read_lock() doesn't block
> > preemption, so you can even get preempted here - I don't think you
> > need RT for that.
>
> Fair enough, normal interrupts can happen as well. Still, we are
> talking about the above fast sequence running long enough (for
> whatever reason) for the rest of the system to update mm (and not just
> plain increment counters) for 2 billion times with mmap_write_lock() +
> actual work + vma_end_write_all() logic. All kinds of bad things will
> start happening before that: RCU stall warnings, lots of accumulated
> memory waiting for RCU grace period, blocked threads on
> synchronize_rcu(), etc.
> >
> > My understanding is that the main difference between normal
> > CONFIG_PREEMPT and RT is whether spin_lock() blocks preemption.
> >
> > > On RT it can get preempted even within RCU locked
> > > region, if I understand correctly. If you manage to make this part run
> > > sufficiently long to overflow 31-bit counter, it's probably a bigger
> > > problem than mmap's sequence wrapping over, no?
> >
> > From the perspective of security, I don't consider it to be
> > particularly severe by itself if a local process can make the system
> > stall for very long amounts of time. And from the perspective of
> > reliability, I think scenarios where someone has to very explicitly go
> > out of their way to destabilize the system don't matter so much?
>
> So just to be clear. u64 counter is a no-brainer and I have nothing
> against that. What I do worry about, though, is that this 64-bit
> counter will be objected to due to it being potentially slower on
> 32-bit architectures.
> So I'd rather have mmap_lock_speculation_{start,end}() with a 32-bit
> mm_lock_seq counter than not have a way to speculate against VMA/mm at all.

IMHO the probability that the 32-bit counter will wrap around and end up
at exactly the same value out of 2^32 possible ones is so minuscule that
we could ignore that possibility.
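
To make the corner case concrete, here is a minimal userspace sketch of
the wraparound. It is not the kernel code: struct toy_mm and all toy_*
helpers are hypothetical names, the "odd value means write-locked"
convention is an assumption of this model, and memory ordering is left
out because the model is single-threaded. It only shows that a 32-bit
sequence compares equal again after exactly 2^32 increments, while a
64-bit one does not.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model, NOT the kernel's struct mm_struct. */
struct toy_mm {
	uint32_t seq32;	/* models a 32-bit mm_lock_seq */
	uint64_t seq64;	/* models the u64 alternative */
};

/*
 * Model `cycles` write-lock + write-unlock pairs. Assumption: each pair
 * bumps the sequence twice (odd while "locked", even once "unlocked").
 */
static inline void toy_write_cycles(struct toy_mm *mm, uint64_t cycles)
{
	mm->seq32 += (uint32_t)(2 * cycles);	/* wraps modulo 2^32 */
	mm->seq64 += 2 * cycles;
}

/* Record the sequence; speculation is allowed only if it is even. */
static inline bool toy_spec_start32(const struct toy_mm *mm, uint32_t *seq)
{
	*seq = mm->seq32;
	return (*seq & 1) == 0;
}

static inline bool toy_spec_start64(const struct toy_mm *mm, uint64_t *seq)
{
	*seq = mm->seq64;
	return (*seq & 1) == 0;
}

/* Speculation "succeeded" if the sequence looks unchanged. */
static inline bool toy_spec_end32(const struct toy_mm *mm, uint32_t seq)
{
	return mm->seq32 == seq;
}

static inline bool toy_spec_end64(const struct toy_mm *mm, uint64_t seq)
{
	return mm->seq64 == seq;
}
```

In this model, calling toy_write_cycles(&mm, 1ULL << 31) between start
and end leaves seq32 unchanged, so toy_spec_end32() reports success even
though the "mm" was modified 2^31 times, while toy_spec_end64() correctly
reports failure. Any smaller nonzero number of cycles is detected by both.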