From: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Date: Fri, 11 Oct 2024 10:37:51 +0500
Subject: Re: [PATCH hotfix 6.12 v3 1/2] maple_tree: correct tree corruption on spanning store
To: Lorenzo Stoakes
Cc: Andrew Morton, "Liam R . Howlett", Matthew Wilcox, Vlastimil Babka,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Sidhartha Kumar,
 Bert Karwatzki, maple-tree@lists.infradead.org
In-Reply-To: <48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@oracle.com>

On Mon, Oct 7, 2024 at 8:28 PM Lorenzo Stoakes wrote:
>
> There has been a subtle bug present in the maple tree implementation from
> its inception.
>
> This arises from how stores are performed - when a store occurs, it will
> overwrite overlapping ranges and adjust the tree as necessary to
> accommodate this.
>
> A range may always ultimately span two leaf nodes. In this instance we walk
> the two leaf nodes, determine which elements are not overwritten to the
> left and to the right of the start and end of the ranges respectively and
> then rebalance the tree to contain these entries and the newly inserted
> one.
>
> This kind of store is dubbed a 'spanning store' and is implemented by
> mas_wr_spanning_store().
>
> In order to reach this stage, mas_store_gfp() invokes mas_wr_preallocate(),
> mas_wr_store_type() and mas_wr_walk() in turn to walk the tree and update
> the object (mas) to traverse to the location where the write should be
> performed, determining its store type.
>
> When a spanning store is required, this function returns false stopping at
> the parent node which contains the target range, and mas_wr_store_type()
> marks the mas->store_type as wr_spanning_store to denote this fact.
>
> When we go to perform the store in mas_wr_spanning_store(), we first
> determine the elements AFTER the END of the range we wish to store (that
> is, to the right of the entry to be inserted) - we do this by walking to
> the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we
> have just determined contains the range over which we intend to write.
>
> We then turn our attention to the entries to the left of the entry we are
> inserting, whose state is represented by l_mas, and copy these into a 'big
> node', which is a special node which contains enough slots to contain two
> leaf node's worth of data.
>
> We then copy the entry we wish to store immediately after this - the copy
> and the insertion of the new entry is performed by mas_store_b_node().
>
> After this we copy the elements to the right of the end of the range which
> we are inserting, if we have not exceeded the length of the node
> (i.e. r_mas.offset <= r_mas.end).
>
> Herein lies the bug - under very specific circumstances, this logic can
> break and corrupt the maple tree.
>
> Consider the following tree:
>
> Height
>  0                              Root Node
>                                  /      \
>                  pivot = 0xffff /        \ pivot = ULONG_MAX
>                                /          \
>  1                       A [-----]         ...
>                            /     \
>            pivot = 0x4fff /       \ pivot = 0xffff
>                          /         \
>  2 (LEAVES)        B [-----]     [-----] C
>                                       ^--- Last pivot 0xffff.
>
> Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
> that all ranges expressed in maple tree code are inclusive):
>
> 1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
>    determines that this is a spanning store across nodes B and C.
>    The mas state is set such that the current node from which we traverse
>    further is node A.
>
> 2. In mas_wr_spanning_store() we try to find elements to the right of pivot
>    0xffff by searching for an index of 0x10000:
>
>    - mas_wr_walk_index() invokes mas_wr_walk_descend() and
>      mas_wr_node_walk() in turn.
>
>    - mas_wr_node_walk() loops over entries in node A until EITHER it
>      finds an entry whose pivot equals or exceeds 0x10000 OR it
>      reaches the final entry.
>
>    - Since no entry has a pivot equal to or exceeding 0x10000, pivot
>      0xffff is selected, leading to node C.
>
>    - mas_wr_walk_traverse() resets the mas state to traverse node C. We
>      loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
>      in turn once again.
>
>    - Again, we reach the last entry in node C, which has a pivot of
>      0xffff.
>
> 3. We then copy the elements to the left of 0x4000 in node B to the big
>    node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
>    too.
>
> 4. We determine whether we have any entries to copy from the right of the
>    end of the range - with r_mas set up at the entry at pivot 0xffff,
>    r_mas.offset <= r_mas.end, and so we DUPLICATE the entry at pivot
>    0xffff.
>
> 5. BUG! The maple tree is corrupted with a duplicate entry.
>
> This requires a very specific set of circumstances - we must be spanning
> the last element in a leaf node, which is the last element in the parent
> node: a spanning store across two leaf nodes with a range that ends at
> that shared pivot.
>
> A potential solution to this problem would simply be to reset the walk each
> time we traverse r_mas; however, given the rarity of this situation it seems
> that would be rather inefficient.
>
> Instead, this patch detects if the right hand node is populated, i.e. has
> anything we need to copy.
>
> We do so by only copying elements from the right of the entry being inserted
> when the maximum value present exceeds the last, rather than basing this on
> offset position.
>
> The patch also updates some comments and eliminates the unused bool return
> value in mas_wr_walk_index().
>
> The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
> in mmap_region()") seems to have made the probability of this event much
> more likely, which is the point at which reports started to be submitted
> concerning this bug.
>
> The motivation for this change arose from Bert Karwatzki's report of
> encountering mm instability after the release of kernel v6.12-rc1 which,
> after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
> options, was identified as maple tree corruption.
>
> After Bert very generously provided his time and ability to reproduce this
> event consistently, I was able to finally identify that the issue discussed
> in this commit message was occurring for him.
>
> Reported-and-tested-by: Bert Karwatzki
> Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
> Reported-by: Mikhail Gavrilov
> Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
> Fixes: 54a611b60590 ("Maple Tree: add new data structure")
> Signed-off-by: Lorenzo Stoakes
> Acked-by: Vlastimil Babka
> ---
>  lib/maple_tree.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 20990ecba2dd..2fe59c534328 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -2196,6 +2196,8 @@ static inline void mas_node_or_none(struct ma_state *mas,
>
>  /*
>   * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
> + *                      If @mas->index cannot be found within the containing
> + *                      node, we traverse to the last entry in the node.
>   * @wr_mas: The maple write state
>   *
>   * Uses mas_slot_locked() and does not need to worry about dead nodes.
> @@ -3532,7 +3534,7 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
>  	return true;
>  }
>
> -static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
> +static void mas_wr_walk_index(struct ma_wr_state *wr_mas)
>  {
>  	struct ma_state *mas = wr_mas->mas;
>
> @@ -3541,11 +3543,9 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
>  		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
>  						  mas->offset);
>  		if (ma_is_leaf(wr_mas->type))
> -			return true;
> +			return;
>  		mas_wr_walk_traverse(wr_mas);
> -
>  	}
> -	return true;
>  }
>  /*
>   * mas_extend_spanning_null() - Extend a store of a %NULL to include surrounding %NULLs.
> @@ -3765,8 +3765,8 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  	memset(&b_node, 0, sizeof(struct maple_big_node));
>  	/* Copy l_mas and store the value in b_node. */
>  	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
> -	/* Copy r_mas into b_node. */
> -	if (r_mas.offset <= r_mas.end)
> +	/* Copy r_mas into b_node if there is anything to copy. */
> +	if (r_mas.max > r_mas.last)
>  		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
>  			   &b_node, b_node.b_end + 1);
>  	else
> --
> 2.46.2

Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>

--
Best Regards,
Mike Gavrilov.
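
For anyone who wants to poke at the failing shape outside of a full mmap
workload, below is a rough, hypothetical in-kernel sketch (not taken from
the patch or from lib/test_maple_tree.c) of the kind of spanning store the
commit message describes - a range that ends exactly on a pivot which is
the last entry of both a leaf and its parent. The population loop is an
assumption: whether it actually produces that node layout depends on how
the nodes happen to split, so treat it as an illustration rather than a
verified reproducer.

/*
 * Hypothetical sketch only - mtree_store_range(), mt_validate() and
 * mtree_destroy() are the real in-kernel maple tree API, but the store
 * pattern here is assumed, not verified to recreate the failing layout.
 */
#include <linux/maple_tree.h>
#include <linux/xarray.h>
#include <linux/gfp.h>

static DEFINE_MTREE(tree);

static void spanning_store_sketch(void)
{
	unsigned long i;

	/*
	 * Insert contiguous 0x1000-sized ranges so the tree grows past
	 * height 2 and (with luck) a leaf ends exactly at pivot 0xffff,
	 * with 0xffff also the last pivot of its parent (node A above).
	 */
	for (i = 0; i < 0x100000; i += 0x1000)
		mtree_store_range(&tree, i, i + 0xfff,
				  xa_mk_value(i), GFP_KERNEL);

	/*
	 * The spanning store: [0x4000, 0xffff] starts in one leaf (B) and
	 * ends exactly on the shared pivot 0xffff of the rightmost leaf
	 * (C). Before the fix, the entry ending at 0xffff could be copied
	 * into the big node a second time.
	 */
	mtree_store_range(&tree, 0x4000, 0xffff,
			  xa_mk_value(0x4000), GFP_KERNEL);

	/* With CONFIG_DEBUG_MAPLE_TREE this flags a duplicated entry. */
	mt_validate(&tree);

	mtree_destroy(&tree);
}

Running something along these lines with CONFIG_DEBUG_MAPLE_TREE enabled,
before and after the patch, is the quickest way to see whether the
duplicate entry from step 5 actually appears.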