From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 06797CF8868 for ; Sat, 5 Oct 2024 00:56:10 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2351B6B0280; Fri, 4 Oct 2024 20:56:10 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1BF046B0282; Fri, 4 Oct 2024 20:56:10 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 038BD6B0284; Fri, 4 Oct 2024 20:56:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id C25A66B0280 for ; Fri, 4 Oct 2024 20:56:09 -0400 (EDT) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 8F64516057F for ; Sat, 5 Oct 2024 00:56:07 +0000 (UTC) X-FDA: 82637731974.07.E93B54C Received: from mout.web.de (mout.web.de [212.227.15.3]) by imf17.hostedemail.com (Postfix) with ESMTP id 3565440004 for ; Sat, 5 Oct 2024 00:56:04 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=web.de header.s=s29768273 header.b=OAA9saWj; spf=pass (imf17.hostedemail.com: domain of spasswolf@web.de designates 212.227.15.3 as permitted sender) smtp.mailfrom=spasswolf@web.de; dmarc=pass (policy=quarantine) header.from=web.de ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1728089699; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=US4ELvZveioeKd+zRDh9paghb6EwdjvF4+GRpVTb5Ik=; b=R6fPbPATVzb9FlxC68jysm62ytmv3vKM8VYBiPQ8rhQdgTd95tbj++JcRz4BhCMgPi+lH5 8UbVWt9rKeZvmZaXJ0nfXfBJNB2xVmxC4oVhYoPZmpEFOBAKhDhxRK23t82SwL0aCvQMWQ 93ml3lEOr8+3tmrTgkg9gCoAcnardtw= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=web.de header.s=s29768273 header.b=OAA9saWj; spf=pass (imf17.hostedemail.com: domain of spasswolf@web.de designates 212.227.15.3 as permitted sender) smtp.mailfrom=spasswolf@web.de; dmarc=pass (policy=quarantine) header.from=web.de ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1728089699; a=rsa-sha256; cv=none; b=vy7isf4SF4Wre8guV83iAjdCk4HevmmfTjgluZFFhMwaqeoGmdGKw0NnYY0F6/Szq28GYT jG62HIMW9X/KTPImgBDjLGmWNaM5NaAk9j5cj5JOASQoe3YV49/mK5UEnfesMBDbIrnlBD wUpZhlpMzadNFLYhQjfjKzoeyV86NCI= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=web.de; s=s29768273; t=1728089763; x=1728694563; i=spasswolf@web.de; bh=US4ELvZveioeKd+zRDh9paghb6EwdjvF4+GRpVTb5Ik=; h=X-UI-Sender-Class:Message-ID:Subject:From:To:Cc:Date:In-Reply-To: References:Content-Type:MIME-Version:Content-Transfer-Encoding:cc: content-transfer-encoding:content-type:date:from:message-id: mime-version:reply-to:subject:to; b=OAA9saWjs4IKwAFINUkKn7HVmsFOVWbrrlclbei7+guqHIUzuUpXlQtYmLqoZfRf 9jBAl1k6OlDCXHty8hIw0UhfioH8I9J9TYRB4JIWGJrzBrpCTMgKw6m7k6VW8G4Y0 RTeCKY+nKBIyKxp/5Zqa/Arh65AmTxkdjMaMCF1twoHiP4EXEmSmycOMcDiAbW2kV bXXgkUU/jA8f8KEj3qJ59pOXm48vnyxrcAPFKezUwBtHiVBIeH9jYK6kBUHmrbLR9 t5060Gq4u6aE5cSTsETMPlWCLxCFps55y5X6powfL1aJI/lOZbVA5N2aktPSo8TrH pHjhuk+H9sM/IH944g== X-UI-Sender-Class: 814a7b36-bfc1-4dae-8640-3722d8ec6cd6 Received: from [192.168.0.101] ([84.119.92.193]) by smtp.web.de (mrweb005 [213.165.67.108]) with ESMTPSA (Nemesis) id 1M3Ez3-1t0Rzi093l-0039Jg; Sat, 05 Oct 2024 02:56:03 +0200 Message-ID: <088a3541b85b783ef68337bd4bb790d62f200dfa.camel@web.de> Subject: Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() From: Bert Karwatzki To: Lorenzo Stoakes Cc: "Liam R . Howlett" , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, spassowlf@web.de Date: Sat, 05 Oct 2024 02:56:01 +0200 In-Reply-To: <3b83746d-b73e-4b89-9b74-5aba5a827f45@lucifer.local> References: <20241004093546.3232-1-spasswolf@web.de> <3b83746d-b73e-4b89-9b74-5aba5a827f45@lucifer.local> Content-Type: text/plain; charset="UTF-8" User-Agent: Evolution 3.54.0-1+b1 MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Provags-ID: V03:K1:929eX6I7xC8zQKm+5s+TEpMT82t07Ecx/j0r5bMZS9EGFilHiaF HOte4WyhkPa35WiMmffNicT5j132ZUSAqssLM6C7HQd2rVwanSHRuTT5Tl1do007xo/sNgg cQ5W28HpdsIeD7eAaZgslDx/rBGq2G6MRKwhsqx/wNG3eOqtoG22Z0HIYgkBgWjLvpOiEma m6XOPaXXH++daxp3kXvhw== UI-OutboundReport: notjunk:1;M01:P0:IttrPLwLCR4=;DiJ9K+Kwjle3Etl4W/BTec6yfUY qSfN0JlEPy/V6wVQYbcc1CSqePFDxar0ShTlnhg+QdVymrYBd3SaSaafn5rpLdFBh7GJMt7Z0 HZe86C2WDiF9Hyyqhf7+Z7iaLAjAfxSISntCBTGk7su6RjJdgcTVsC/G5ARVaNpFyAKTeDtJA RXxUv8ARduPpGhe4qXfYPjeCarqucnJh13h91suU44oBZL5JsQz1b+v0eaXhVvYcndOvf8Dcj tuiUHn3rhsB8mpaGBDnx3c2EFjzX2111VUGMrm47IMbJ97TtdGI8OC6uwFeCR/+WEvep6xDJE 5mu5CWOkFtcMrv1H7xVysbC/4TNXZX6SU0i98IphYl52/ntiKgEXwBmtnqWpSoM+0Yn6GA5pa ZHBpRPhYAxQDpumoXn+E1F4w1u0E/hLNwmrXzzlxyg34BKqMBBgi75jhk9XZJKZWsf1T5Xmle /ap0jyCsmOIJ0EjRlNPpS586abThE+RunQbTyHHIesphSatXiIf8XFh8UXOL/WtHgaeJAf+wH mnUB4ZliXkQ70aZkx2b2dpL5fj0b9rzWvPKso/CgBAC5C5kjkJ5I6KpI/HyvhjKWhJPv5qZ3c Ii4M8wfwYjY74VrbcTdakQ2b5aL8IW6bUbGxBNq3b289gjzp5JdKiHUnxlhQS3wHUWN4Zvyb3 2yUBok+QMmJkUtQeTO7SEpUfl4Js4FZJnbj/rFuyHcONyCGkrjKN18dW1PzvFS0+ZIoZiD39Q L+8zjhuX6iCG5ajIYPjCtXcs4Jw2zNDluG/836YorTtACBESC9QStNfY2X1ZB/EazahaM7kDz 0muFD42lSRCORhgvlLDGLB7g== X-Rspamd-Server: rspam03 X-Rspam-User: X-Rspamd-Queue-Id: 3565440004 X-Stat-Signature: m9ce1iiaidk9ppni6r8ki5fp8nyjcq9z X-HE-Tag: 1728089764-434465 X-HE-Meta: U2FsdGVkX1/bexiROv3QOU65q/lFfrzty+p9Djhgv0/d2xq4e/tkuf3YJdjAbJJSsgRz5e3uH2ZKXOEIyLc3HbRGJghk5fLWRHETR68LxRtI1kYDWUHwvaJuxMbzy0zMjWND+GpEA2PPopC60646XlexdnEWQtI4GjIXaJmAgcp3kE68be1gnynGyc+yDvX/GGO9jv1DONxURJVpakEui93MNZbaiC4MtgUh8yhr/kr2q1DL67d9q9G7Y7SkYhqB1amLii7R3raXaGYlY3T6xAfbqxrU3KHDqaY+u3z+YUY1hdpS24b1wPwa/PEqSOAmFmFRZyejbmBHPrtj3LFEHuyDGrAyl1vu+yz3eye/wU33CIqpZZMbjGBkSPFwt14p50G7eXHj59tXhom+gDajPy5NVa437MoEtvHB+h0MsrluyqmDlX+hEUjjIW8bnUpyjhSgzBLFw9cLqJsLMfn3rnafVAPm0U5P03XVzGG8qL/Fgdgaj/wlpIh4JDDa4azZVW8LPvtIwuYcFZeXx1HZjPvd+M39+PBPc9OvpYn/4a9HczWhQWziqCSPu8S3lhbRCb6TDyZHXxgAAvdMHN+iEJxxmrav6kJLsu/S2kOqATA2qCLRDr6TvI7S6VLSh7116px07Ug9HQDieGj3jQ8l5Fz9gfriOZhvbxcliBzfn8tsNMT189tCxEfxfl0ynSs020Tpl9PCLAVTheexCE4kjr9MVRbbnWSxTlIw0h7fO1jtn1NkQgMLnhtaPxKn115MlxRxWkkZAbPQsPtFBkq/OBF3ITzLdsFw2MvF1E1N9vj8WSkQLUCVohVq2rAlI0nRU3RLPKwfiKo/xvgzXs4IWHcjLJYsrU9Lkj9fHiw5acL4jcfjo4IOJsgTW0SSRFra596WK8vTvWCV8CCjY7q7B3YPKMYXw2OHGaoNUL9Blfws+80Ij8BQcQbS+k0VQFpNPHts/Tygv5aAZ2xRevs YK0I2+E3 UAlGn25AO/IrCC9XlnTcbCQs183ANabKFpWy1Zu2N3mZqTqrhk/Qxl/fzVCuwXBe0EosWxTJM4q8lEdS9KgCsWfkstinlRus6dI4cUKApopCz7U6tsGuTyayOc/oLuImNE8lxhrEgXQO3g2gZMHSU+dAvZhR+w+37cIST1kXq/124hY5Erpp7qDrEz5+DSmtuBPNW/OfJGur45onXxokqztJZb8cPzVrTTU8R0h3b5k5nkuAIWFsZzKisUv+NfTvDFDleIl2Jhb8H9C8+KuZxdLpFmdkxyOkEAYqP/4p2O1oU/ekIhu2pGgeecE/Pt/nyZhTF+U8/omW1lEX0/WNqI6KJMp+2ApJ4pnB7 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Am Freitag, dem 04.10.2024 um 23:41 +0100 schrieb Lorenzo Stoakes: > On Fri, Oct 04, 2024 at 11:35:44AM +0200, Bert Karwatzki wrote: > > Here's the log procduced by this kernel: > > > > c9e7f76815d3 (HEAD -> maple_tree_debug_4) hack: set of info stuff v5 > > 7e3bb072761a mm: correct error handling in mmap_region() > > 77df9e4bb222 (tag: next-20241001, origin/master, origin/HEAD, master) = Add linux-next specific files for 20241001 > > > > Again it took two attempts to trigger the bug. > > > > Bert Karwatzki > > > > Sending an updated, cleaned up version of the patch with a lot of > explanation. This is functionally identical to the v3 fix I already sent= so > you can try that or this to confirm it resolves your issue. > > If you are able to do so, I can submit this to upstream for a hotfix. If > not, well then back to the drawing board and I'd be very shocked :) > > I have been able to reproduce the issue locally in our userland testing > suite entirely consistently, and this patch resolves the issue and also > continues to pass all maple tree unit tests. > > Again thank you so much for all your help - I hope you are able to find = a > spare moment to quickly give this one a try and confirm whether it does > indeed address the problem you've reported. > > Thanks, Lorenzo > > ----8<---- > From 126d65bd9839cd3ec941007872b357e27fd56066 Mon Sep 17 00:00:00 2001 > From: Lorenzo Stoakes > Date: Fri, 4 Oct 2024 15:18:58 +0100 > Subject: [PATCH] maple_tree: correct tree corruption on spanning store > > Writing a data range into a maple tree may involve overwriting a number = of > existing entries that span across more than one node. Doing so invokes a > 'spanning' store. > > Performing a spanning store across two leaf nodes in a maple tree in whi= ch > entries are overwritten is achieved by first initialising a 'big' node, > which will store the coalesced entries between the two nodes comprising > entries prior to the newly stored entry, the newly stored entry, and > subsequent entries. > > This 'big node' is then merged back into the tree and the tree is > rebalanced, replacing the entries across the spanned nodes with those > contained in the big node. > > The operation is performed in mas_wr_spanning_store() which starts by > establishing two maple tree state objects ('mas' objects) on the left of > the range and on the right (l_mas and r_mas respectively). > > l_mas traverses to the beginning of the range to be stored in order to c= opy > the data BEFORE the requested store into the big node. > > We then insert our new entry immediately afterwards (both the left copy = and > the storing of the new entry are combined and performed by > mas_store_b_node()). > > r_mas traverses to the populated slot immediately after, in order to cop= y > the data AFTER the requested store into the big node. > > This copy of the right-hand node is performed by mas_mab_cp() as long as > r_mas indicates that there's data to copy, i.e. r_mas.offset <=3D r_mas.= end. > > We traverse r_mas to this position in mas_wr_node_walk() using a simple > loop: > > while (offset < count && mas->index > wr_mas->pivots[offset]) > offset++; > > Note here that count is determined to be the (inclusive) index of the la= st > node containing data in the node as determined by ma_data_end(). > > This means that even in searching for mas->index, which will have been s= et > to one plus the end of the target range in order to traverse to the next > slot in mas_wr_spanning_store(), we will terminate the iteration at the = end > of the node range even if this condition is not met due to the offset < > count condition. > > The fact this right hand node contains the end of the range being stored= is > why we are traversing it, and this loop is why we appear to discover a > viable range within the right node to copy to the big one. > > However, if the node that r_mas traverses contains a pivot EQUAL to the = end > of the range being stored, and this is the LAST pivot contained within t= he > node, something unexpected happens: > > 1. The l_mas traversal copy and insertion of the new entry in the big no= de > is performed via mas_store_b_node() correctly. > > 2. The traversal performed by mas_wr_node_walk() means our r_mas.offset = is > set to the offset of the entry equal to the end of the range we store= . > > 3. We therefore copy this DUPLICATE of the final pivot into the big node= , > and insert this DUPLICATE entry, alongside its invalid slot entry > immediately after the newly inserted entry. > > 4. The big node containing this duplicated is inserted into the tree whi= ch > is rebalanced, and therefore the maple tree becomes corrupted. > > Note that if the right hand node had one or more entries with pivots of > greater value than the end of the stored range, this would not happen. I= f > it contained entries with pivots of lesser value it would not be the rig= ht > node in this spanning store. > > This appears to have been at risk of happening throughout the maple tree= 's > history, however it seemed significantly less likely to occur until > recently. > > The balancing of the tree seems to have made it unlikely that you would > happen to perform a store that both spans two nodes AND would overwrite > precisely the entry with the largest pivot in the right-hand node which > contains no further larger pivots. > > The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma t= ree > in mmap_region()") seems to have made the probability of this event much > more likely. > > Previous to this change, MAP_FIXED mappings which were overwritten would > first be cleared before any subsequent store or importantly - merge of > surrounding entries - would be performed. > > After this change, this is no longer the case, and this means that, in t= he > worst case, a number of entries might be overwritten in combination with= a > merge (and subsequent overwriting expansion) between both the prior entr= y > AND a subsequent entry. > > The motivation for this change arose from Bert Karwatzki's report of > encountering mm instability after the release of kernel v6.12-rc1 which, > after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration > options, was identified as maple tree corruption. > > After Bert very generously provided his time and ability to reproduce th= is > event consistently, I was able to finally identify that the issue discus= sed > in this commit message was occurring for him. > > The solution implemented in this patch is: > > 1. Adjust mas_wr_walk_index() to return a boolean value indicating wheth= er > the containing node is actually populated with entries possessing piv= ots > equal to or greater than mas->index. > > 2. When traversing the right node in mas_wr_spanning_store(), use this > value to determine whether to try to copy from the right node - if it= is > not populated, then do not do so. > > This passes all maple tree unit tests and resolves the reported bug. > --- > lib/maple_tree.c | 20 ++++++++++++++++---- > 1 file changed, 16 insertions(+), 4 deletions(-) > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > index 37abf0fe380b..e6f0da908ba7 100644 > --- a/lib/maple_tree.c > +++ b/lib/maple_tree.c > @@ -2194,6 +2194,8 @@ static inline void mas_node_or_none(struct ma_stat= e *mas, > > /* > * mas_wr_node_walk() - Find the correct offset for the index in the @m= as. > + * If @mas->index cannot be found within the conta= ining > + * node, we traverse to the last entry in the node= . > * @wr_mas: The maple write state > * > * Uses mas_slot_locked() and does not need to worry about dead nodes. > @@ -3527,6 +3529,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_ma= s) > return true; > } > > +/* > + * Traverse the maple tree until the offset of mas->index is reached. > + * > + * Return: Is this node actually populated with entries possessing pivo= ts equal > + * to or greater than mas->index? > + */ > static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > { > struct ma_state *mas =3D wr_mas->mas; > @@ -3535,8 +3543,11 @@ static bool mas_wr_walk_index(struct ma_wr_state = *wr_mas) > mas_wr_walk_descend(wr_mas); > wr_mas->content =3D mas_slot_locked(mas, wr_mas->slots, > mas->offset); > - if (ma_is_leaf(wr_mas->type)) > - return true; > + if (ma_is_leaf(wr_mas->type)) { > + unsigned long pivot =3D wr_mas->pivots[mas->offset]; > + > + return pivot =3D=3D 0 || mas->index <=3D pivot; > + } > mas_wr_walk_traverse(wr_mas); > > } > @@ -3696,6 +3707,7 @@ static noinline void mas_wr_spanning_store(struct = ma_wr_state *wr_mas) > struct maple_big_node b_node; > struct ma_state *mas; > unsigned char height; > + bool r_populated; > > /* Left and Right side of spanning store */ > MA_STATE(l_mas, NULL, 0, 0); > @@ -3737,7 +3749,7 @@ static noinline void mas_wr_spanning_store(struct = ma_wr_state *wr_mas) > r_mas.last++; > > r_mas.index =3D r_mas.last; > - mas_wr_walk_index(&r_wr_mas); > + r_populated =3D mas_wr_walk_index(&r_wr_mas); > r_mas.last =3D r_mas.index =3D mas->last; > > /* Set up left side. */ > @@ -3761,7 +3773,7 @@ static noinline void mas_wr_spanning_store(struct = ma_wr_state *wr_mas) > /* Copy l_mas and store the value in b_node. */ > mas_store_b_node(&l_wr_mas, &b_node, l_mas.end); > /* Copy r_mas into b_node. */ > - if (r_mas.offset <=3D r_mas.end) > + if (r_populated && r_mas.offset <=3D r_mas.end) > mas_mab_cp(&r_mas, r_mas.offset, r_mas.end, > &b_node, b_node.b_end + 1); > else > -- > 2.46.2 I just tested this and it passed ten tests (i.e. upgrading the proton vers= ion i steam) in a row. Bert Karwatzki