From: Andrii Nakryiko
Date: Fri, 9 Aug 2024 09:39:35 -0700
Subject: Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
To: Jann Horn
Cc: Suren Baghdasaryan, akpm@linux-foundation.org, peterz@infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Matthew Wilcox,
 Vlastimil Babka, Michal Hocko
References: <20240807182325.2585582-1-surenb@google.com>

On Fri, Aug 9, 2024 at 8:21 AM Jann Horn wrote:
>
> On Fri, Aug 9, 2024 at 12:36 AM Andrii Nakryiko wrote:
> > On Thu, Aug 8, 2024 at 3:16 PM Jann Horn wrote:
> > >
> > > On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko wrote:
> > > > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn wrote:
> > > > >
> > > > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko wrote:
> > > > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan wrote:
> > > > > > >
> > > > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko wrote:
> > > > > > > >
> > > > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan wrote:
> > > > > > > > >
> > > > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > > > write-locked and mm is not modified from under us.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Suren Baghdasaryan
> > > > > > > > > Suggested-by: Peter Zijlstra
> > > > > > > > > Cc: Andrii Nakryiko
> > > > > > > > > ---
> > > > > > > >
> > > > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > > > multi-hour stress test run so far. Thanks!
> > > > > > >
> > > > > > > Thanks for testing and feel free to include this patch into your set.
> > > > > >
> > > > > > Will do!
> > > > > >
> > > > > > >
> > > > > > > I've been thinking about this some more and there is a very unlikely
> > > > > > > corner case if between mmap_lock_speculation_start() and
> > > > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > > > > > the same value as we recorded in mmap_lock_speculation_start().
> > > > > > > This would generate a false positive, which would show up as if the
> > > > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > > > but they are not critical because a false result would simply lead to
> > > > > > > a retry under mmap_lock. However for your case this would be a
> > > > > > > critical issue. This is an extremely low probability scenario but
> > > > > > > should we still try to handle it?
> > > > > > >
> > > > > >
> > > > > > No, I think it's fine.
> > > > >
> > > > > Modern computers don't take *that* long to count to 2^32, even when
> > > > > every step involves one or more syscalls. I've seen bugs where, for
> > > > > example, a 32-bit refcount is not decremented where it should, making
> > > > > it possible to overflow the refcount with 2^32 operations of some
> > > > > kind, and those have taken something like 3 hours to trigger in one
> > > > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > > > you can create 2^32 legitimate references to an object and overflow a
> > > > > refcount that way
> > > > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > > > 32-bit counter - and that is with every step allocating memory).
> > > > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > > > writing to be faster than 25 minutes on a modern desktop machine.
> > > > >
> > > > > So for a reader of some kinda 32-bit sequence count, if it is
> > > > > conceivably possible for the reader to take at least maybe a couple
> > > > > minutes or so between the sequence count reads (also counting time
> > > > > during which the reader is preempted or something like that), there
> > > > > could be a problem. At that point in the analysis, if you wanted to
> > > > > know whether it's actually exploitable, I guess you'd have to look at
> > > > > what kinda context you're running in, and what kinda events can
> > > > > interrupt/preempt you (like whether someone can send a sufficiently
> > > > > dense flood of IPIs to completely prevent you making forward progress,
> > > > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > > > those things can delay you (maybe including what the pessimal
> > > > > scheduler behavior looks like if you're in preemptible context, or how
> > > > > long clock interrupts can take to execute when processing a giant pile
> > > > > of epoll watches), and so on...
> > > > >
> > > >
> > > > And here we are talking about *lockless* *speculative* VMA usage that
> > > > will last what, at most on the order of a few microseconds?
> > >
> > > Are you talking about time spent in task context, or time spent while
> > > the task is on the CPU (including time in interrupt context), or about
> > > wall clock time?
> >
> > We are doing, roughly:
> >
> > mmap_lock_speculation_start();
> > rcu_read_lock();
> > vma_lookup();
> > rb_find();
> > rcu_read_unlock();
> > mmap_lock_speculation_end();
> >
> >
> > On non-RT kernel this can be prolonged only by having an NMI somewhere
> > in the middle.
>
> I don't think you're running with interrupts off here? Even on kernels
> without any preemption support, normal interrupts (like timers,
> incoming network traffic, TLB flush IPIs) should still be able to
> interrupt here.
> And in CONFIG_PREEMPT kernels (which enable
> CONFIG_PREEMPT_RCU by default), rcu_read_lock() doesn't block
> preemption, so you can even get preempted here - I don't think you
> need RT for that.

Fair enough, normal interrupts can happen as well. Still, we are
talking about the above fast sequence running long enough (for
whatever reason) for the rest of the system to update mm (and not
just increment counters) 2 billion times with mmap_write_lock() +
actual work + vma_end_write_all() logic. All kinds of bad things will
start happening before that: RCU stall warnings, lots of accumulated
memory waiting for an RCU grace period, threads blocked on
synchronize_rcu(), etc.

>
> My understanding is that the main difference between normal
> CONFIG_PREEMPT and RT is whether spin_lock() blocks preemption.
>
> > On RT it can get preempted even within RCU locked
> > region, if I understand correctly. If you manage to make this part run
> > sufficiently long to overflow 31-bit counter, it's probably a bigger
> > problem than mmap's sequence wrapping over, no?
>
> From the perspective of security, I don't consider it to be
> particularly severe by itself if a local process can make the system
> stall for very long amounts of time. And from the perspective of
> reliability, I think scenarios where someone has to very explicitly go
> out of their way to destabilize the system don't matter so much?

So just to be clear: a u64 counter is a no-brainer and I have nothing
against it. What I do worry about, though, is that a 64-bit counter
will be objected to because it is potentially slower on 32-bit
architectures. So I'd rather have mmap_lock_speculation_{start,end}()
with a 32-bit mm_lock_seq counter than not have a way to speculate
against VMA/mm at all.
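
For illustration only, the wraparound case discussed above can be
sketched in plain userspace C. This is not the kernel code: struct
toy_mm, struct toy_snap and all toy_* helpers are hypothetical names
made up for this example, the even/odd convention (one bump on
write-lock, one on unlock) is an assumption, and the memory barriers
that real lockless code needs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the mm sequence count (assumption, not kernel code). */
struct toy_mm {
	uint32_t seq32;	/* 32-bit count, wraps after 2^32 bumps */
	uint64_t seq64;	/* same count kept in 64 bits for comparison */
};

/* Values sampled by the reader when speculation starts. */
struct toy_snap {
	uint32_t seq32;
	uint64_t seq64;
};

/* Simulate n complete write-lock + unlock cycles: two bumps per cycle. */
static inline void toy_write_cycles(struct toy_mm *mm, uint64_t n)
{
	mm->seq32 += (uint32_t)(2 * n);	/* unsigned math wraps mod 2^32 */
	mm->seq64 += 2 * n;
}

/* Sample the counters; an odd value would mean a writer is active. */
static inline bool toy_speculation_start(const struct toy_mm *mm,
					 struct toy_snap *snap)
{
	snap->seq32 = mm->seq32;
	snap->seq64 = mm->seq64;
	return (snap->seq32 & 1) == 0;
}

/* True if the 32-bit count looks untouched since the sample. */
static inline bool toy_speculation_end32(const struct toy_mm *mm,
					 const struct toy_snap *snap)
{
	return mm->seq32 == snap->seq32;
}

/* Same check against the 64-bit count. */
static inline bool toy_speculation_end64(const struct toy_mm *mm,
					 const struct toy_snap *snap)
{
	return mm->seq64 == snap->seq64;
}

/* True if exactly 2^31 writer cycles fool the 32-bit check only. */
static inline bool toy_wraparound_fools_32bit(void)
{
	struct toy_mm mm = { 0, 0 };
	struct toy_snap snap;
	bool ok = toy_speculation_start(&mm, &snap);

	assert(ok);	/* counter is even, so no writer is active */
	(void)ok;
	toy_write_cycles(&mm, UINT64_C(1) << 31);
	return toy_speculation_end32(&mm, &snap) &&
	       !toy_speculation_end64(&mm, &snap);
}
```

Under these assumptions, 2^31 back-to-back writer cycles bring the
32-bit count back to the value the reader sampled, so the 32-bit check
reports "nothing changed", while the 64-bit check still sees the
writers. Any other number of cycles in between is caught by both.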