From: Yu Kuai <yukuai1@huaweicloud.com>
To: stable@vger.kernel.org, gregkh@linuxfoundation.org,
harry.wentland@amd.com, sunpeng.li@amd.com,
Rodrigo.Siqueira@amd.com, alexander.deucher@amd.com,
christian.koenig@amd.com, Xinhui.Pan@amd.com, airlied@gmail.com,
daniel@ffwll.ch, viro@zeniv.linux.org.uk, brauner@kernel.org,
Liam.Howlett@oracle.com, akpm@linux-foundation.org,
hughd@google.com, willy@infradead.org, sashal@kernel.org,
srinivasan.shanmugam@amd.com, chiahsuan.chung@amd.com,
mingo@kernel.org, mgorman@techsingularity.net,
yukuai3@huawei.com, chengming.zhou@linux.dev,
zhangpeng.00@bytedance.com, chuck.lever@oracle.com
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
maple-tree@lists.infradead.org, linux-mm@kvack.org,
yukuai1@huaweicloud.com, yi.zhang@huawei.com,
yangerkun@huawei.com
Subject: [PATCH 6.6 28/28] maple_tree: correct tree corruption on spanning store
Date: Thu, 24 Oct 2024 21:22:25 +0800 [thread overview]
Message-ID: <20241024132225.2271667-13-yukuai1@huaweicloud.com> (raw)
In-Reply-To: <20241024132225.2271667-1-yukuai1@huaweicloud.com>
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
commit bea07fd63192b61209d48cbb81ef474cc3ee4c62 upstream.
Patch series "maple_tree: correct tree corruption on spanning store", v3.
There has been a nasty yet subtle maple tree corruption bug that appears
to have been in existence since the inception of the algorithm.
This bug seems far more likely to happen since commit f8d112a4e657
("mm/mmap: avoid zeroing vma tree in mmap_region()"), which is the point
at which reports started to be submitted concerning this bug.
We were made definitely aware of the bug thanks to the kind efforts of
Bert Karwatzki who helped enormously in my being able to track this down
and identify the cause of it.
The bug arises when an attempt is made to perform a spanning store across
two leaf nodes, where the right leaf node is the rightmost child of the
shared parent, AND the store completely consumes the right-mode node.
This results in mas_wr_spanning_store() mitakenly duplicating the new and
existing entries at the maximum pivot within the range, and thus maple
tree corruption.
The fix patch corrects this by detecting this scenario and disallowing the
mistaken duplicate copy.
The fix patch commit message goes into great detail as to how this occurs.
This series also includes a test which reliably reproduces the issue, and
asserts that the fix works correctly.
Bert has kindly tested the fix and confirmed it resolved his issues. Also
Mikhail Gavrilov kindly reported what appears to be precisely the same
bug, which this fix should also resolve.
This patch (of 2):
There has been a subtle bug present in the maple tree implementation from
its inception.
This arises from how stores are performed - when a store occurs, it will
overwrite overlapping ranges and adjust the tree as necessary to
accommodate this.
A range may always ultimately span two leaf nodes. In this instance we
walk the two leaf nodes, determine which elements are not overwritten to
the left and to the right of the start and end of the ranges respectively
and then rebalance the tree to contain these entries and the newly
inserted one.
This kind of store is dubbed a 'spanning store' and is implemented by
mas_wr_spanning_store().
In order to reach this stage, mas_store_gfp() invokes
mas_wr_preallocate(), mas_wr_store_type() and mas_wr_walk() in turn to
walk the tree and update the object (mas) to traverse to the location
where the write should be performed, determining its store type.
When a spanning store is required, this function returns false stopping at
the parent node which contains the target range, and mas_wr_store_type()
marks the mas->store_type as wr_spanning_store to denote this fact.
When we go to perform the store in mas_wr_spanning_store(), we first
determine the elements AFTER the END of the range we wish to store (that
is, to the right of the entry to be inserted) - we do this by walking to
the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we
have just determined contains the range over which we intend to write.
We then turn our attention to the entries to the left of the entry we are
inserting, whose state is represented by l_mas, and copy these into a 'big
node', which is a special node which contains enough slots to contain two
leaf node's worth of data.
We then copy the entry we wish to store immediately after this - the copy
and the insertion of the new entry is performed by mas_store_b_node().
After this we copy the elements to the right of the end of the range which
we are inserting, if we have not exceeded the length of the node (i.e.
r_mas.offset <= r_mas.end).
Herein lies the bug - under very specific circumstances, this logic can
break and corrupt the maple tree.
Consider the following tree:
Height
0 Root Node
/ \
pivot = 0xffff / \ pivot = ULONG_MAX
/ \
1 A [-----] ...
/ \
pivot = 0x4fff / \ pivot = 0xffff
/ \
2 (LEAVES) B [-----] [-----] C
^--- Last pivot 0xffff.
Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
that all ranges expressed in maple tree code are inclusive):
1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
determines that this is a spanning store across nodes B and C. The mas
state is set such that the current node from which we traverse further
is node A.
2. In mas_wr_spanning_store() we try to find elements to the right of pivot
0xffff by searching for an index of 0x10000:
- mas_wr_walk_index() invokes mas_wr_walk_descend() and
mas_wr_node_walk() in turn.
- mas_wr_node_walk() loops over entries in node A until EITHER it
finds an entry whose pivot equals or exceeds 0x10000 OR it
reaches the final entry.
- Since no entry has a pivot equal to or exceeding 0x10000, pivot
0xffff is selected, leading to node C.
- mas_wr_walk_traverse() resets the mas state to traverse node C. We
loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
in turn once again.
- Again, we reach the last entry in node C, which has a pivot of
0xffff.
3. We then copy the elements to the left of 0x4000 in node B to the big
node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
too.
4. We determine whether we have any entries to copy from the right of the
end of the range via - and with r_mas set up at the entry at pivot
0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at
pivot 0xffff.
5. BUG! The maple tree is corrupted with a duplicate entry.
This requires a very specific set of circumstances - we must be spanning
the last element in a leaf node, which is the last element in the parent
node.
spanning store across two leaf nodes with a range that ends at that shared
pivot.
A potential solution to this problem would simply be to reset the walk
each time we traverse r_mas, however given the rarity of this situation it
seems that would be rather inefficient.
Instead, this patch detects if the right hand node is populated, i.e. has
anything we need to copy.
We do so by only copying elements from the right of the entry being
inserted when the maximum value present exceeds the last, rather than
basing this on offset position.
The patch also updates some comments and eliminates the unused bool return
value in mas_wr_walk_index().
The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma
tree in mmap_region()") seems to have made the probability of this event
much more likely, which is the point at which reports started to be
submitted concerning this bug.
The motivation for this change arose from Bert Karwatzki's report of
encountering mm instability after the release of kernel v6.12-rc1 which,
after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
options, was identified as maple tree corruption.
After Bert very generously provided his time and ability to reproduce this
event consistently, I was able to finally identify that the issue
discussed in this commit message was occurring for him.
Link: https://lkml.kernel.org/r/cover.1728314402.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
Tested-by: Bert Karwatzki <spasswolf@web.de>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
lib/maple_tree.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 5328e08723d7..c57b6fc4db2e 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -2239,6 +2239,8 @@ static inline void mas_node_or_none(struct ma_state *mas,
/*
* mas_wr_node_walk() - Find the correct offset for the index in the @mas.
+ * If @mas->index cannot be found within the containing
+ * node, we traverse to the last entry in the node.
* @wr_mas: The maple write state
*
* Uses mas_slot_locked() and does not need to worry about dead nodes.
@@ -3655,7 +3657,7 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
return true;
}
-static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
+static void mas_wr_walk_index(struct ma_wr_state *wr_mas)
{
struct ma_state *mas = wr_mas->mas;
@@ -3664,11 +3666,9 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
mas->offset);
if (ma_is_leaf(wr_mas->type))
- return true;
+ return;
mas_wr_walk_traverse(wr_mas);
-
}
- return true;
}
/*
* mas_extend_spanning_null() - Extend a store of a %NULL to include surrounding %NULLs.
@@ -3899,8 +3899,8 @@ static inline int mas_wr_spanning_store(struct ma_wr_state *wr_mas)
memset(&b_node, 0, sizeof(struct maple_big_node));
/* Copy l_mas and store the value in b_node. */
mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
- /* Copy r_mas into b_node. */
- if (r_mas.offset <= r_mas.end)
+ /* Copy r_mas into b_node if there is anything to copy. */
+ if (r_mas.max > r_mas.last)
mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
&b_node, b_node.b_end + 1);
else
--
2.39.2
next prev parent reply other threads:[~2024-10-24 13:25 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-24 13:19 [PATCH 6.6 00/28] fix CVE-2024-46701 Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 01/28] maple_tree: add mt_free_one() and mt_attr() helpers Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 02/28] maple_tree: introduce {mtree,mas}_lock_nested() Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 03/28] maple_tree: introduce interfaces __mt_dup() and mtree_dup() Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 04/28] maple_tree: skip other tests when BENCH is enabled Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 05/28] maple_tree: preserve the tree attributes when destroying maple tree Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 06/28] maple_tree: remove unnecessary default labels from switch statements Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 07/28] maple_tree: make mas_erase() more robust Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 08/28] maple_tree: move debug check to __mas_set_range() Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 09/28] maple_tree: add end of node tracking to the maple state Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 10/28] maple_tree: use cached node end in mas_next() Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 11/28] maple_tree: use cached node end in mas_destroy() Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 12/28] maple_tree: clean up inlines for some functions Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 13/28] maple_tree: add test for mtree_dup() Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 14/28] maple_tree: separate ma_state node from status Yu Kuai
2024-10-24 13:19 ` [PATCH 6.6 15/28] maple_tree: remove mas_searchable() Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 16/28] Revert "maple_tree: correct tree corruption on spanning store" Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 17/28] maple_tree: use maple state end for write operations Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 18/28] maple_tree: don't find node end in mtree_lookup_walk() Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 19/28] maple_tree: mtree_range_walk() clean up Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 20/28] lib/maple_tree.c: fix build error due to hotfix alteration Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 21/28] maple_tree: avoid checking other gaps after getting the largest gap Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 22/28] libfs: Re-arrange locking in offset_iterate_dir() Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 23/28] libfs: Define a minimum directory offset Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 24/28] libfs: Add simple_offset_empty() Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 25/28] maple_tree: Add mtree_alloc_cyclic() Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 26/28] libfs: Convert simple directory offsets to use a Maple Tree Yu Kuai
2024-10-24 13:22 ` [PATCH 6.6 27/28] libfs: fix infinite directory reads for offset dir Yu Kuai
2024-10-24 13:22 ` Yu Kuai [this message]
2024-11-06 15:02 ` [PATCH 6.6 28/28] maple_tree: correct tree corruption on spanning store Lorenzo Stoakes
2024-11-07 1:22 ` Yu Kuai
2024-11-06 6:16 ` [PATCH 6.6 00/28] fix CVE-2024-46701 Greg KH
2024-11-06 14:44 ` Liam R. Howlett
2024-11-06 15:19 ` Chuck Lever III
2024-11-06 16:21 ` James Bottomley
2024-11-07 0:57 ` Yu Kuai
2024-11-07 14:41 ` Chuck Lever
2024-11-08 1:19 ` Yu Kuai
2024-11-08 13:23 ` Chuck Lever III
2024-11-08 17:03 ` Liam R. Howlett
2024-11-09 1:38 ` Yu Kuai
2024-11-09 1:30 ` Yu Kuai
2024-11-09 16:58 ` Chuck Lever III
2024-11-11 0:56 ` Yu Kuai
2024-11-07 14:44 ` Liam R. Howlett
2024-11-06 14:43 ` Lorenzo Stoakes
2024-11-07 1:43 ` Yu Kuai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241024132225.2271667-13-yukuai1@huaweicloud.com \
--to=yukuai1@huaweicloud.com \
--cc=Liam.Howlett@oracle.com \
--cc=Rodrigo.Siqueira@amd.com \
--cc=Xinhui.Pan@amd.com \
--cc=airlied@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=brauner@kernel.org \
--cc=chengming.zhou@linux.dev \
--cc=chiahsuan.chung@amd.com \
--cc=christian.koenig@amd.com \
--cc=chuck.lever@oracle.com \
--cc=daniel@ffwll.ch \
--cc=dri-devel@lists.freedesktop.org \
--cc=gregkh@linuxfoundation.org \
--cc=harry.wentland@amd.com \
--cc=hughd@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=maple-tree@lists.infradead.org \
--cc=mgorman@techsingularity.net \
--cc=mingo@kernel.org \
--cc=sashal@kernel.org \
--cc=srinivasan.shanmugam@amd.com \
--cc=stable@vger.kernel.org \
--cc=sunpeng.li@amd.com \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=yangerkun@huawei.com \
--cc=yi.zhang@huawei.com \
--cc=yukuai3@huawei.com \
--cc=zhangpeng.00@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox