From: Andrii Nakryiko
Date: Thu, 8 Aug 2024 15:36:07 -0700
Subject: Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
To: Jann Horn
Cc: Suren Baghdasaryan, akpm@linux-foundation.org, peterz@infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Matthew Wilcox, Vlastimil Babka, Michal Hocko
References: <20240807182325.2585582-1-surenb@google.com>

On Thu, Aug 8, 2024 at 3:16 PM Jann Horn wrote:
>
> On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko wrote:
> > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn wrote:
> > >
> > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko wrote:
> > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan wrote:
> > > > >
> > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko wrote:
> > > > > >
> > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan wrote:
> > > > > > >
> > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > write-locked and mm is not modified from under us.
> > > > > > >
> > > > > > > Signed-off-by: Suren Baghdasaryan
> > > > > > > Suggested-by: Peter Zijlstra
> > > > > > > Cc: Andrii Nakryiko
> > > > > > > ---
> > > > > >
> > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > multi-hour stress test run so far. Thanks!
> > > > >
> > > > > Thanks for testing and feel free to include this patch into your set.
> > > >
> > > > Will do!
> > > >
> > > > >
> > > > > I've been thinking about this some more and there is a very unlikely
> > > > > corner case if between mmap_lock_speculation_start() and
> > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > > > would generate a false positive, which would show up as if the
> > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > but they are not critical because a false result would simply lead to
> > > > > a retry under mmap_lock.
> > > > > However for your case this would be a
> > > > > critical issue. This is an extremely low probability scenario but
> > > > > should we still try to handle it?
> > > > >
> > > >
> > > > No, I think it's fine.
> > >
> > > Modern computers don't take *that* long to count to 2^32, even when
> > > every step involves one or more syscalls. I've seen bugs where, for
> > > example, a 32-bit refcount is not decremented where it should, making
> > > it possible to overflow the refcount with 2^32 operations of some
> > > kind, and those have taken something like 3 hours to trigger in one
> > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > you can create 2^32 legitimate references to an object and overflow a
> > > refcount that way
> > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > 32-bit counter - and that is with every step allocating memory).
> > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > writing to be faster than 25 minutes on a modern desktop machine.
> > >
> > > So for a reader of some kinda 32-bit sequence count, if it is
> > > conceivably possible for the reader to take at least maybe a couple
> > > minutes or so between the sequence count reads (also counting time
> > > during which the reader is preempted or something like that), there
> > > could be a problem.
> > > At that point in the analysis, if you wanted to
> > > know whether it's actually exploitable, I guess you'd have to look at
> > > what kinda context you're running in, and what kinda events can
> > > interrupt/preempt you (like whether someone can send a sufficiently
> > > dense flood of IPIs to completely prevent you making forward progress,
> > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > those things can delay you (maybe including what the pessimal
> > > scheduler behavior looks like if you're in preemptible context, or how
> > > long clock interrupts can take to execute when processing a giant pile
> > > of epoll watches), and so on...
> >
> >
> > And here we are talking about *lockless* *speculative* VMA usage that
> > will last what, at most on the order of a few microseconds?
>
> Are you talking about time spent in task context, or time spent while
> the task is on the CPU (including time in interrupt context), or about
> wall clock time?

We are doing, roughly:

    mmap_lock_speculation_start();
    rcu_read_lock();
    vma_lookup();
    rb_find();
    rcu_read_unlock();
    mmap_lock_speculation_end();

On a non-RT kernel this can be prolonged only by an NMI landing
somewhere in the middle. On RT it can be preempted even within the
RCU-locked region, if I understand correctly. If you manage to make
this part run long enough to overflow a 31-bit counter, that's probably
a bigger problem than mmap's sequence count wrapping around, no?

>
> https://www.vusec.net/projects/ghostrace/ is pretty amazing - when you
> look at the paper
> https://download.vusec.net/papers/ghostrace_sec24.pdf you can see in
> Figure 4 how they managed to turn a race window that's 8 instructions
> wide into a window they can stretch "indefinitely", and they didn't
> even have to reschedule to pull it off. If I understand correctly,
> they stretched the race window to something like 35 seconds and could
> have stretched it even wider if they had wanted to?
>
> (And yes, Linux fixed the specific trick they used for doing that, but
> it still shows that this kinda thing is possible in principle.)
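
To make the wraparound concern discussed in this thread concrete, here
is a minimal userspace sketch in C. It is not the kernel code: struct
model_mm, model_write_cycles(), model_speculation_start() and
model_speculation_end() are hypothetical names invented for
illustration. It assumes a plain unsigned 32-bit counter that is bumped
once per write-lock/unlock cycle, with no locking, no memory barriers,
and no check for a writer currently holding the lock. It only shows why
a reader that compares two snapshots of a 32-bit counter cannot tell
"no writer cycles" apart from "exactly 2^32 writer cycles".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; NOT the kernel's struct mm_struct. */
struct model_mm {
	uint32_t lock_seq;	/* stands in for mm->mm_lock_seq */
};

/*
 * Writer side (assumption): each write-lock/unlock cycle bumps the
 * counter once. Applying n cycles is the same as adding n modulo 2^32,
 * because unsigned arithmetic wraps around by definition in C.
 */
static inline void model_write_cycles(struct model_mm *mm, uint64_t n)
{
	mm->lock_seq += (uint32_t)n;
}

/* Reader side: snapshot the counter before the speculative section. */
static inline uint32_t model_speculation_start(const struct model_mm *mm)
{
	return mm->lock_seq;
}

/*
 * Reader side: the speculation is treated as valid only if the counter
 * still equals the snapshot. A full wrap of 2^32 cycles also satisfies
 * this check, which is the false positive discussed in the thread.
 */
static inline bool model_speculation_end(const struct model_mm *mm,
					 uint32_t seq)
{
	return mm->lock_seq == seq;
}
```

With these helpers, a reader that takes a snapshot, then observes one
model_write_cycles(&mm, 1), gets false from model_speculation_end() and
would fall back to the locked path. If instead exactly 2^32 cycles
elapse between the two reads, the comparison succeeds even though the
modeled state changed in between.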