From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 18DCBC3DA4D for ; Thu, 17 Aug 2023 16:22:10 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4B170940040; Thu, 17 Aug 2023 12:22:10 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 45FD0940009; Thu, 17 Aug 2023 12:22:10 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3008A940040; Thu, 17 Aug 2023 12:22:10 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 21400940009 for ; Thu, 17 Aug 2023 12:22:10 -0400 (EDT) Received: from smtpin20.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id BDA47806DC for ; Thu, 17 Aug 2023 16:22:09 +0000 (UTC) X-FDA: 81134113578.20.5163821 Received: from mail-ed1-f42.google.com (mail-ed1-f42.google.com [209.85.208.42]) by imf01.hostedemail.com (Postfix) with ESMTP id D49F74001D for ; Thu, 17 Aug 2023 16:22:07 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b="SbZ/dxJ8"; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf01.hostedemail.com: domain of jannh@google.com designates 209.85.208.42 as permitted sender) smtp.mailfrom=jannh@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1692289327; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=ena0p3NSggEqk+vk2x8D1QzEktiYk6xgHB3uHWeBhi8=; b=dUQza6DQBR9x5tskYE7PeQrMiBxF5zSohZOaKU3retbUmNAXdpKokFNTO4ZdBQXY+v0hNy 5NY2gRbsNeH+S8WnSCX6+CGLHwSNV5s5zDrpADAN7Ua80ikzk04RRi/VvY8Q5ZZMKP2Oc5 p/X32hG/fRebUL1ll9egVXSEtU+kSf8= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b="SbZ/dxJ8"; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf01.hostedemail.com: domain of jannh@google.com designates 209.85.208.42 as permitted sender) smtp.mailfrom=jannh@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1692289328; a=rsa-sha256; cv=none; b=bFg+JZFDi6D/rxHRzH5/eKeb5t+JoKXda4cCIBedCS7NOpeef93D+HVQ6uoSlIBcHD4uE/ +V+Z2Bve/EJCw1k99MeBqUC/+m2tnj7jLZpTBXe9eItWrRPJVaPGjEFLuf96XpDB7pNT5l iC0ec7EqbJ2ZTHKe/U1Tu8lUJZQySZ0= Received: by mail-ed1-f42.google.com with SMTP id 4fb4d7f45d1cf-521e046f6c7so12265a12.1 for ; Thu, 17 Aug 2023 09:22:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1692289326; x=1692894126; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=ena0p3NSggEqk+vk2x8D1QzEktiYk6xgHB3uHWeBhi8=; b=SbZ/dxJ8cwYpJxoocaE7f2Go7xFhUfdGxmUCICEpIF+LUY1Sw5dmsA3cla8FSDaDBs cHd0/Qe8qHP/TJ4P+LHVcofX7+U+Sj4HVIuTCGPRP5E52jXO5CJerPNNX5TB7MwrOS14 KPyFphqn/szOAEgUDOagIUHLHyUw0jvdSrvxGV8Wcgr+LrVc5ZGY1EQbgUKVnYG49z1n +HgjyvrSDyk6xjycgAUIzOm2CN918EO2580QD4lC72pvbDawOoovKWaeRwzu/tXuLAWz O9OstIcjqa3MmzRrAshYvrGaK8Gb2y6or/yOmmcSprjj+PgZKyXApx0ri+GMdPG7o4GO Fb/A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1692289326; x=1692894126; h=content-transfer-encoding:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=ena0p3NSggEqk+vk2x8D1QzEktiYk6xgHB3uHWeBhi8=; b=E72eWk2Z1Yp6p6kfoOFr5MWT7dsHUfuIeYWzNmo+2VP+7V2f8kQclsrpx56J8p1qgx nnSU2mOLdnMT8CK+16HzzxU3mkynaZ33waZ5P2ToPYNrE+6ecI4Q8qb5oH1949NnSCJv b5Lxbbg+b32py3gIHlI1LvwO/ipmN3NemFHpkWUfxmMQxekqriC16x2wN+DXmSsuig75 dqmJF+WCyZ/U2pdm1o7vjowND/5fVvjxZ/IMpZ/wjcrO/h7MBYz7PGONHpcxYxKWpAd9 v+NG0MrKPVkhIrByfny+3TPTxxB9ZKoHqzV2mv6aKAGhIyOhpYTpRouzY9LVIls9+FP0 SiiQ== X-Gm-Message-State: AOJu0YxzNS6HvZDtvZyAMeDybKHMtKb447IdoOyMfX/ji7hcq49+kDt0 wbMS0c1OtaXT81bkBetNFOR9/TkefxOfr8I2i8vU8w== X-Google-Smtp-Source: AGHT+IEvc1Iar/eaWxd/s2yjVp8hczH15UMlGe3dZN9WEzzOh/gJ+w0h2ykqzhK03+HThMRgHhE58Px4a4PA5KN1KM0= X-Received: by 2002:a50:9ee2:0:b0:51a:1ffd:10e with SMTP id a89-20020a509ee2000000b0051a1ffd010emr87129edf.3.1692289326180; Thu, 17 Aug 2023 09:22:06 -0700 (PDT) MIME-Version: 1.0 References: <20230817134231.xji3i6w7ev3rbs52@revolver> In-Reply-To: <20230817134231.xji3i6w7ev3rbs52@revolver> From: Jann Horn Date: Thu, 17 Aug 2023 18:21:29 +0200 Message-ID: Subject: Re: [mm] VMA merging behavior wrt anon_vma has been slightly broken since Linux 3.0 (in a non-dangerous way) To: "Liam R. Howlett" , Jann Horn , Linux-MM , Andrew Morton , Shaohua Li , kernel list , Peter Zijlstra Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: D49F74001D X-Rspam-User: X-Rspamd-Server: rspam04 X-Stat-Signature: bf6787a1kzjbjzx3fm4ynioift77yz8n X-HE-Tag: 1692289327-226222 X-HE-Meta: U2FsdGVkX18HWgTXkXq7tQkdhsuW/EWd3Fm4v8tt7pQBBBwC9VhS04y3I3tiIHDbo17zvnJKgy727Kx7cTvm9QPn3laugYBall4uD+OqFQeP5pngKamqp4+NtIFkkenkz9pC8NLXBklM+VNhWvtJSZVHBi+IQ0EXXRgHT8I03xXdIurTegeyc2hx+Oq4IxCBY7tzyVt99YyHOaLelENAdGWhWo4Vm/DLbGZfXHg16721jdHnn7HJ4IYieUtZQsC9hnNQAbUXGyms/g9WCzp2exxPiZTewcBdoPbfg7o57qZMoQMyQqA+T6O8100K3lFYlxZzM3ShrTdGWlcKxrY76ZPUKMfd//rJ2lxq9/RbcB2+MrDHELaPpHdHTbPDyU0O+1KKi82FVNYq8Y7jerpUbSdImJhqAVfMX4qhx3MwnXimgV2o3JbgtkjYtk3XNm9nFVZ6Tkz/dgOTve8ydY6qbh+QNm4LbQRPXLsd9CUmuzrQKkDYIqyQDdHugsmeHRyzyWQs5EROtFIq4YTm5BGKklio+5Kl3xvZDb1beKWpIYEmPhksYq3yKOTILNqWReBCYIp93umnQ9LT9ytHtQnPXfUzihoUv9Kfj8lGb2uyMoo7WuQes5XFh/hF9wfq5ivEm4tSbS7n//M2eREHKUMyEXSEPOsAnlaFyb+Bk1yWM0xOtqFJepUtVOT2XVmTsH6hqzGNY/PXsCbx8LLehhswoXWkT4WpExjouQElIOU4wvBPKV7JLb5+tfMn0UD+k0oAL/89xfSxPP2wLtRcQxN+HDrtYuRCZxJaFXdyinmY5Q5U4/Yqqw5KwzYyUSh8QZu5IAoD2WZm+jTTZBRc9SPao1oUuIhXjffn80JiZi9hLWa+5tbhWzuFLnfWmkwI16Z5j0Cwq07iJavUaONIVTwl2jdKXx536AeftoJPIkwR/crMntXajYMo+0BViFEb0GeQoo810H2afLre/gnzV13 bDQBgLJU zCUKuXVH3HSeTumLsNJcUwKpFUWi+OuCx4vawSogiUbllsbfzhFFoRgTiP0xQy5/u8Sw/3AP/J1VUKfUWZP6mIFCfg+Cglx9Lt8rxNO9e1xO8WZfMD8gKoDkuyjx3opgdFfIb12xaRhZMEomOjo98eKNLeALL8vc/hR2kOVwLAZg0SvUTCt+3sDcHU3DW1byhpbSRE3QCcUylzXJh+xN7KY1D0pJ/PZ+uaEAr852fE3h9lRuidpBDdHLlF3wLJmVEBNoELomWm4L8pxnm80oMMIpkkDBBWpSOvFetQUNIWDYJNioe1P5mj9WWxgKR11G1H7Gl X-Bogosity: Ham, tests=bogofilter, spamicity=0.000004, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Aug 17, 2023 at 3:42=E2=80=AFPM Liam R. Howlett wrote: > * Jann Horn [230815 15:44]: > > Hi! > > > > I think VMA merging was accidentally nerfed a bit by commit > > 965f55dea0e3 ("mmap: avoid merging cloned VMAs"), which landed in > > Linux 3.0 - essentially, that commit makes it impossible to merge a > > VMA with an anon_vma into an adjacent VMA that does not have an > > anon_vma. (But the other direction works.) > > > > > > is_mergeable_anon_vma() is defined as: > > > > ``` > > static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1, > > struct anon_vma *anon_vma2, struct vm_area_struct *vma= ) > > { > > /* > > * The list_is_singular() test is to avoid merging VMA cloned f= rom > > * parents. This can improve scalability caused by anon_vma loc= k. > > */ > > if ((!anon_vma1 || !anon_vma2) && (!vma || > > list_is_singular(&vma->anon_vma_chain))) > > return true; > > return anon_vma1 =3D=3D anon_vma2; > > } > > ``` > > > > If this function is called with a non-NULL vma pointer (which is > > almost always the case, except when checking for whether it's possible > > to merge in both directions at the same time), > > You are talking about case 1 & 6 here? Yeah, about the third call to is_mergeable_anon_vma() in those cases. > To get here merge_prev and > merge_next must be set.. which means can_vma_merge_after() and > can_vma_merge_before() must succeed.. which means > is_mergeable_anon_vma() returned true with both prev and next being > passed through as "vma". So, I think, even that case suffers the same > issue? Right, my intent was to talk about individual callers of vma_merge(). > That is, we won't have merge_prev =3D=3D true if prev has an empty > anon_vma_chain. The same is true for merge_next. > > >and one of the two > > anon_vmas is non-NULL, this returns > > list_is_singular(&vma->anon_vma_chain). I believe that > > list_is_singular() call is supposed to check whether the > > anon_vma_chain contains *more than one* element, but it actually also > > fails if the anon_vma_chain contains zero elements. > > > > This means that the dup_anon_vma() calls in vma_merge() are > > effectively all no-ops because they are never called with a target > > that does not have an anon_vma and a source that has an anon_vma. > > > > I think this is unintentional - though I guess this unintentional > > refusal to merge VMAs this way also lowers the complexity of what can > > happen in the VMA merging logic. So I think the right fix here is to > > make this kind of merging possible again by changing > > "list_is_singular(&vma->anon_vma_chain)" to > > "list_empty(&vma->anon_vma_chain) || > > list_is_singular(&vma->anon_vma_chain)", but my security hat makes me > > say that I'd also be happy if the unintentional breakage stayed this > > way it is now. > > The commit message for the offending change says > find_mergeable_anon_vma() already considers merging these, so it may not > be as nerfed as it looks? Ah, right, that should reduce the impact a lot. > From what I understand the merging is an optimisation and, from the > commit message, the change was to increase scalability Yes, my understanding is that it was to increase scalability by avoiding merging VMAs when that would cause more use of an anon_vma that is shared with other processes. - An empty anon_vma_chain means we have no anon_vma. - An anon_vma_chain with a single element means we have an anon_vma that is private to our process. - An anon_vma_chain with more than one element is likely shared with other processes. My understanding is that the scalability issue probably comes mainly from having multiple processes hitting the same anon_vma lock during operations under the mmap lock in write mode; that wouldn't be a problem for same-process sharing. There might also be some scalability impact from this kind of situation, although probably less because AFAIU rmap walks aren't needed for much hotpath stuff: - Process P1 creates a VMA V1, covering range A1-A2, with a new anon_vma A= V1 - Process P1 forks a child process P2 - P2 inherits a copy of V1 called V1', with two anon_vmas: - AV1 (inherited) - AV2 (used for new pages) - P1 extends VMA V1, now it covers range A1-A3 - let's hypothetically assume that P2 extends VMA V1', now also covering range A1-A3 - P1 inserts a page somewhere in the range A2-A3, it gets attached to AV1 - now if we have to do an rmap walk on the page, this unnecessarily requires walking the page tables of both P1 and P2 Basically merging VMAs with anon_vma_chains of length >1 could easily lead to some kind of false sharing for rmap walks of pages mapped in the parent, which is probably undesirable especially in cases where a process forks off a lot of children. This sort of situation is probably harder to trigger in a single-process scenario with a length-1 AVC, since you'd have to be splitting VMAs and moving them around with mremap() and extending them with mremap() to have overlapping ranges in the interval tree. My understanding is that essentially, in a process that is not going to fork (or constantly move VMAs around), it would be fine to have more or less a single anon_vma for all anonymous VMAs; that just means rmap walks will have to do O(log(number_of_vmas)) steps through the anon_vma interval tree. In a process that forks, doing so would mean that operations like VMA expansion and unmapping on inherited VMAs in all the children would involve taking a shared lock - that'd probably still be mostly fine, I guess? But if children kept merging new VMAs into the parent's anon_vma, all their VMA modification operations would all hit the same lock, which would be bad. (By the way, we also depend on not merging VMAs with more than one anon_vma_chain entry for correctness/security, see commit 2555283eb40d "mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse".) > so this shifts > to using more memory to gain scalability on the anon_vma lock. Making > this change will shift back to (maybe?) less memory for more pressure on > that lock then? I am hesitant to suggest un-nerfing it, but it > shouldn't be left as it is since the code is unclear on what is > happening.