From: Wei Yang <richard.weiyang@gmail.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Matthew Wilcox <willy@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Sidhartha Kumar <sidhartha.kumar@oracle.com>,
Bert Karwatzki <spasswolf@web.de>,
Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>,
maple-tree@lists.infradead.org
Subject: Re: [PATCH hotfix 6.12 v3 1/2] maple_tree: correct tree corruption on spanning store
Date: Fri, 11 Oct 2024 08:39:45 +0000 [thread overview]
Message-ID: <20241011083945.z6a3y22pszyq6hse@master> (raw)
In-Reply-To: <48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@oracle.com>
On Mon, Oct 07, 2024 at 04:28:32PM +0100, Lorenzo Stoakes wrote:
>There has been a subtle bug present in the maple tree implementation from
>its inception.
>
>This arises from how stores are performed - when a store occurs, it will
>overwrite overlapping ranges and adjust the tree as necessary to
>accommodate this.
>
>A range may always ultimately span two leaf nodes. In this instance we walk
>the two leaf nodes, determine which elements are not overwritten to the
>left and to the right of the start and end of the ranges respectively and
>then rebalance the tree to contain these entries and the newly inserted
>one.
>
>This kind of store is dubbed a 'spanning store' and is implemented by
>mas_wr_spanning_store().
>
>In order to reach this stage, mas_store_gfp() invokes mas_wr_preallocate(),
>mas_wr_store_type() and mas_wr_walk() in turn to walk the tree and update
>the object (mas) to traverse to the location where the write should be
>performed, determining its store type.
>
>When a spanning store is required, this function returns false stopping at
>the parent node which contains the target range, and mas_wr_store_type()
>marks the mas->store_type as wr_spanning_store to denote this fact.
>
>When we go to perform the store in mas_wr_spanning_store(), we first
>determine the elements AFTER the END of the range we wish to store (that
>is, to the right of the entry to be inserted) - we do this by walking to
>the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we
>have just determined contains the range over which we intend to write.
>
>We then turn our attention to the entries to the left of the entry we are
>inserting, whose state is represented by l_mas, and copy these into a 'big
>node', which is a special node which contains enough slots to contain two
>leaf node's worth of data.
>
>We then copy the entry we wish to store immediately after this - the copy
>and the insertion of the new entry is performed by mas_store_b_node().
>
>After this we copy the elements to the right of the end of the range which
>we are inserting, if we have not exceeded the length of the node
>(i.e. r_mas.offset <= r_mas.end).
>
>Herein lies the bug - under very specific circumstances, this logic can
>break and corrupt the maple tree.
>
>Consider the following tree:
>
>Height
> 0 Root Node
> / \
> pivot = 0xffff / \ pivot = ULONG_MAX
> / \
> 1 A [-----] ...
> / \
> pivot = 0x4fff / \ pivot = 0xffff
> / \
> 2 (LEAVES) B [-----] [-----] C
> ^--- Last pivot 0xffff.
>
>Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
>that all ranges expressed in maple tree code are inclusive):
>
>1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
> determines that this is a spanning store across nodes B and C. The mas
> state is set such that the current node from which we traverse further
> is node A.
>
>2. In mas_wr_spanning_store() we try to find elements to the right of pivot
> 0xffff by searching for an index of 0x10000:
>
> - mas_wr_walk_index() invokes mas_wr_walk_descend() and
> mas_wr_node_walk() in turn.
>
> - mas_wr_node_walk() loops over entries in node A until EITHER it
> finds an entry whose pivot equals or exceeds 0x10000 OR it
> reaches the final entry.
>
> - Since no entry has a pivot equal to or exceeding 0x10000, pivot
> 0xffff is selected, leading to node C.
>
> - mas_wr_walk_traverse() resets the mas state to traverse node C. We
> loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
> in turn once again.
>
> - Again, we reach the last entry in node C, which has a pivot of
> 0xffff.
>
>3. We then copy the elements to the left of 0x4000 in node B to the big
> node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
> too.
>
>4. We determine whether we have any entries to copy from the right of the
> end of the range via - and with r_mas set up at the entry at pivot
> 0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at
> pivot 0xffff.
>
>5. BUG! The maple tree is corrupted with a duplicate entry.
>
>This requires a very specific set of circumstances - we must be spanning
>the last element in a leaf node, which is the last element in the parent
>node.
>
>spanning store across two leaf nodes with a range that ends at that shared
>pivot.
>
>A potential solution to this problem would simply be to reset the walk each
>time we traverse r_mas, however given the rarity of this situation it seems
>that would be rather inefficient.
>
>Instead, this patch detects if the right hand node is populated, i.e. has
>anything we need to copy.
>
>We do so by only copying elements from the right of the entry being inserted
>when the maximum value present exceeds the last, rather than basing this on
>offset position.
>
>The patch also updates some comments and eliminates the unused bool return
>value in mas_wr_walk_index().
>
>The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
>in mmap_region()") seems to have made the probability of this event much
>more likely, which is the point at which reports started to be submitted
>concerning this bug.
>
>The motivation for this change arose from Bert Karwatzki's report of
>encountering mm instability after the release of kernel v6.12-rc1 which,
>after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
>options, was identified as maple tree corruption.
>
>After Bert very generously provided his time and ability to reproduce this
>event consistently, I was able to finally identify that the issue discussed
>in this commit message was occurring for him.
>
>Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de>
>Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
>Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
>Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
>Fixes: 54a611b60590 ("Maple Tree: add new data structure")
>Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
--
Wei Yang
Help you, Help me
next prev parent reply other threads:[~2024-10-11 8:39 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-07 15:28 [PATCH hotfix 6.12 v3 0/2] " Lorenzo Stoakes
2024-10-07 15:28 ` [PATCH hotfix 6.12 v3 1/2] " Lorenzo Stoakes
2024-10-07 20:39 ` Liam R. Howlett
2024-10-11 5:37 ` Mikhail Gavrilov
2024-10-11 8:39 ` Wei Yang [this message]
2024-10-07 15:28 ` [PATCH hotfix 6.12 v3 2/2] maple_tree: add regression test for spanning store bug Lorenzo Stoakes
2024-10-07 20:39 ` Liam R. Howlett
2024-10-11 8:40 ` Wei Yang
2024-10-17 7:17 ` [PATCH hotfix 6.12 v3 0/2] maple_tree: correct tree corruption on spanning store Andrew Morton
2024-10-17 7:38 ` Lorenzo Stoakes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241011083945.z6a3y22pszyq6hse@master \
--to=richard.weiyang@gmail.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=maple-tree@lists.infradead.org \
--cc=mikhail.v.gavrilov@gmail.com \
--cc=sidhartha.kumar@oracle.com \
--cc=spasswolf@web.de \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox