Message-ID: <152f65f1-5b6d-1795-1bc9-3ea318f0bae0@bytedance.com>
Date: Fri, 14 Jul 2023 20:32:55 +0800
Subject: Re: Maple Tree Work
From: Peng Zhang
To: "Liam R. Howlett"
Cc: Peng Zhang, maple-tree@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <20230713153657.b2xzf6z3yxi6nojn@revolver>
References: <20230707163815.ns4kdz7iut5octjv@revolver> <463899aa-6cbd-f08e-0aca-077b0e4e4475@bytedance.com> <20230712152738.ld7rn5zmmhtc6wkg@revolver> <20230713153657.b2xzf6z3yxi6nojn@revolver>
On 2023/7/13 23:36, Liam R. Howlett wrote:
> * Peng Zhang [230713 04:05]:
>>
>> On 2023/7/12 23:27, Liam R. Howlett wrote:
>>> Dropping Danilo from the Cc..
>>>
>>> I was asked to add linux-mm to the list, so I did that as well.
>>>
>>> If anyone else is interested in seeing the full list, it's on lkml [1]
>>> and the maple-tree list [2].
>>>
>>> Thank you Peng for looking at the list and taking time to think about
>>> the items.
>>>
>>> * Peng Zhang [230712 07:49]:
>>>>
>>>> On 2023/7/8 00:38, Liam R. Howlett wrote:
>>>>> - Fork & Dup tree + Delete DONT_COPY
>>>>>   This is to optimize dup_mmap() in kernel/fork.c, but other
>>>>>   users may want faster duplications of a tree.
>>>>>   This should be faster than building a tree in bulk mode. The
>>>>>   idea is to just copy each node and replace the pointers in,
>>>>>   probably, a BFS order. Once the leaves are reached, the VMAs
>>>>>   will be replaced by the copies made in fork, unless DONT_COPY
>>>>>   is set, in which case the VMA will be deleted from the copied
>>>>>   tree. DONT_COPY is not common, and since the tree isn't
>>>>>   visible, all nodes should be available for reuse (no RCU
>>>>>   worries).
>>>> If DONT_COPY is set, this method becomes complicated, because the
>>>> gaps adjacent to the deleted entry need to be merged, and the gaps
>>>> of all ancestor nodes need to be updated.
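The gap bookkeeping Peng describes can be pictured with a toy model (plain Python, not kernel code; the `Leaf` class and entry layout are invented for the sketch): deleting an entry merges the NULL ranges on either side of it, and if the node's maximum gap grows as a result, that new maximum would have to be propagated to every ancestor in the real tree.

```python
# Toy model of gap bookkeeping in a maple-tree-style leaf (plain Python,
# not kernel code). Entries cover a contiguous index range; value None
# marks a gap (an unallocated range).

class Leaf:
    def __init__(self, entries):
        # entries: sorted list of (first, last, value) covering the range
        self.entries = entries

    def max_gap(self):
        return max((last - first + 1
                    for first, last, value in self.entries if value is None),
                   default=0)

    def delete(self, value):
        """Delete an entry, merging it with any adjacent gaps.

        Returns the leaf's new max gap; in the real tree, if this grew,
        it would have to be propagated to every ancestor node."""
        i = next(i for i, e in enumerate(self.entries) if e[2] == value)
        first, last, _ = self.entries[i]
        if i > 0 and self.entries[i - 1][2] is None:   # absorb gap on the left
            first = self.entries[i - 1][0]
            i -= 1
            del self.entries[i]
        if i + 1 < len(self.entries) and self.entries[i + 1][2] is None:
            last = self.entries[i + 1][1]              # absorb gap on the right
            del self.entries[i + 1]
        self.entries[i] = (first, last, None)
        return self.max_gap()

leaf = Leaf([(0, 2, "VMA1"), (3, 5, "VMA2"), (6, 6, None), (7, 9, "VMA3")])
new_gap = leaf.delete("VMA2")   # the DONT_COPY-style removal
print(leaf.entries)             # [(0, 2, 'VMA1'), (3, 6, None), (7, 9, 'VMA3')]
print(new_gap)                  # 4
```

This reproduces the [VMA1][VMA2][NULL][VMA3] -> [VMA1'][NULL][VMA3'] example discussed below; what it deliberately leaves out is the rebalance that a real node may need once its entry count drops.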
>>>
>>> My understanding is that this is a rare event; there aren't many VMAs
>>> marked this way. The store operation already does all the necessary
>>> work for deleting an entry. The gap tracking is also updated, and that
>>> would only happen if the new gap is larger. Are you concerned about
>>> the performance/runtime of handling the DONT_COPY in this way?
>> Yes, if no DONT_COPY is set, copying all the nodes and replacing the
>> pointers must be the fastest way. I was just wondering whether there is
>> a faster way when DONT_COPY entries exist. Using the store operation to
>> delete unnecessary entries sometimes causes additional overhead.
>> To give an example:
>>
>> Suppose a leaf node layout of the maple tree is as follows:
>> [VMA1][VMA2][NULL][VMA3]
>>
>> If VMA2 has DONT_COPY set, we need to change the node layout as
>> follows to delete it:
>> [VMA1'][NULL][VMA3']
>>
>> At a minimum we need to copy this node and replace it to perform the
>> delete operation, and we may even need to rebalance. However, this is
>> a rare case. In most cases there is no gap between VMAs, so deletion
>> will not change the node layout.
>
> Remember that at this point there are no readers, so we could edit the
> node without a copy. It would require new code, but it's just moving
> data left.. The bigger worry is a rebalance, as you said, and that can
> get complicated. We know we have (more than) enough room to store the
> data, but editing in place isn't done in this version of the code. We
> do allow for node re-use by pushing back onto mas->alloc, so the
> data requirements won't be too high. Without any readers, the parent
> pivots could be changed directly between two leaves.
>
>>>
>>>>
>>>> I have another idea for building a tree: if entries are inserted in
>>>> order, we only insert at the leaf nodes. All leaf nodes are connected
>>>> in a linked list, so in the end we get a linked list containing only
>>>> leaf nodes. Then we construct the non-leaf nodes layer by layer from
>>>> the bottom up.
>>>> I think this is also faster than bulk mode. Another advantage of
>>>> this method is that it applies to more scenarios: we do not need the
>>>> original tree, only the ranges to be inserted, in order. I don't
>>>> know how fast this method is, so we can discuss it.
>>>
>>> What is the advantage of a linked list over just building the tree as
>>> you go? Considering the non-leaf nodes are just a list of nodes
>>> already, and you will have to do the same work of setting pivots,
>>> allocating nodes, and filling them after you have the linked list.
>>>
>>> What work do you avoid that would make a linked list faster than bulk
>>> insert or a tree copy/replace of the VMAs? I was thinking that we
>>> could avoid a lot of the work involved with splitting/pushing and the
>>> parent construction by using memcpy of each node, replacing each slot
>>> (and parent) with a new memcpy of the mirrored tree, and then making
>>> a minimal amount of modifications to delete the DONT_COPY entries
>>> during the VMA replacement. A BFS copy would ensure we don't modify
>>> the source tree during VMA replacement and deletion (DONT_COPY). So
>>> the rebalancing (in bulk insert), pivot calculations, metadata, and
>>> gaps are (mostly) saved by using memcpy.
>> Your analysis is correct.
>>>
>>> From what I understand of your linked list idea, we would need to
>>> construct the child node by examining each entry and knowing that a
>>> certain entry is a DONT_COPY (ie: VMA logic would be needed in the
>>> maple tree code, or a callback?). We really can't have VMA logic in
>>> the maple tree code, so we could do some sort of loop through the
>>> VMAs to add each entry to the list if desired.
>>>
>>> Once we have a linked list, we still need to figure out each pivot
>>> for the parent (I guess we won't use the last slot so there is a
>>> pivot to check?), and each max gap in each child, to construct the
>>> upper layers of the tree. Is this correct?
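The bottom-up build being discussed can be sketched in miniature (plain Python; the node shape, fanout, and rebalance rule are invented for illustration, not the maple tree's real layout): fill leaves strictly left to right from the pre-sorted input, fix up the last two leaves if the final one ends underfull, then derive each parent layer from the maximum pivot of each child until one root remains.

```python
# Simplified bottom-up B-tree-style build from pre-sorted ranges
# (illustration only; real maple-tree nodes also track gaps and metadata).

FANOUT = 4  # invented small fanout for the sketch

def chunk(items, n):
    return [items[i:i + n] for i in range(0, len(items), n)]

def max_pivot(node):
    return node["entries"][-1][0] if node["leaf"] else node["pivots"][-1]

def build_bottom_up(pairs):
    """pairs: pre-sorted list of (last_index_of_range, value)."""
    # 1. Fill leaves strictly left to right.
    level = [{"leaf": True, "entries": c} for c in chunk(pairs, FANOUT)]
    # Rebalance the last two leaves if the final one is underfull.
    if len(level) > 1 and len(level[-1]["entries"]) < FANOUT // 2:
        merged = level[-2]["entries"] + level[-1]["entries"]
        half = len(merged) // 2
        level[-2]["entries"], level[-1]["entries"] = merged[:half], merged[half:]
    # 2. Build parent layers from child max pivots until one root remains.
    while len(level) > 1:
        parents = []
        for group in chunk(level, FANOUT):
            pivots = [max_pivot(child) for child in group]
            parents.append({"leaf": False, "pivots": pivots, "slots": group})
        level = parents
    return level[0]

pairs = [(i * 10 + 9, f"vma{i}") for i in range(10)]  # ten ordered ranges
root = build_bottom_up(pairs)
print(max_pivot(root))  # 99
```

The sketch sidesteps the open questions in the thread: it has no DONT_COPY filtering (that would be a caller-side loop or callback, as Liam notes) and no gap propagation, which the real implementation would need at each layer.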
>>>
>>> I guess we would still need to adjust the last node to ensure
>>> sufficient entries as well, so as we add items we may need to
>>> rebalance the last leaf with the second-to-last leaf.
>> Yes, the last two leaves need to be checked to make sure they have
>> enough entries.
>>>
>>> The bulk store currently adjusts the split to favour splitting left;
>>> could you get the same result by strictly filling the nodes? This
>>> would need special handling to rebalance the last one - which we
>>> have a pretty good idea of when it's going to happen, as we have a
>>> count (and DONT_COPY is rare).
>>>
>>> Are you thinking you could compact the tree at the same time to gain
>>> efficiency?
>>>
>>> What would you consider a sufficiently packed tree? It's best to keep
>>> some space around for expansion/contraction. This works out since, I
>>> think, we would need to keep that last slot free so we have a pivot
>>> to check in your linked list plan. Initial development with strict
>>> split/join rules resulted in a 'jittering' of nodes as the number of
>>> entries in a node shifted just above/below the threshold, so we
>>> relaxed the merging of nodes to avoid such a scenario.
>> We can control the number of entries per node, for example making it
>> (min + max)/2, so as to avoid making a node too 'dense' or too
>> 'loose'.
>
> By the way, in the VMA case we also know the number of VMAs in the
> tree. Unfortunately, we don't know how many of them are DONT_COPY
> VMAs. I wonder if it would be worthwhile to balance each VMA with its
> neighbour during this operation, at least within one tree level? It
> would reduce the possibility of a larger rebalance on a DONT_COPY.
> It's probably not worth it, because it would slow down our fast path.
>
>>>
>>> Let me know if you would like me to put your name beside the Fork &
>>> Dup Tree item in the list of work.
>> You can put my name on this one and I'll do it.
>
> Sounds good, thanks!
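The (min + max)/2 fill factor mentioned above can be made concrete with a small sketch (invented occupancy bounds; the real slot counts differ): packing every node midway between minimum occupancy and capacity leaves headroom in both directions, so a single insert or delete cannot immediately force a split or a merge, avoiding the 'jittering' that strict split/join rules produced.

```python
# Sketch: choose a per-node fill target midway between the minimum
# occupancy and the capacity (bounds are invented for illustration).

MIN_OCC, MAX_OCC = 5, 10           # hypothetical node occupancy bounds
TARGET = (MIN_OCC + MAX_OCC) // 2  # 7: room both to grow and to shrink

def pack(n_entries):
    """Split n_entries into per-node fills near TARGET."""
    nodes = max(1, round(n_entries / TARGET))
    base, extra = divmod(n_entries, nodes)
    fills = [base + (1 if i < extra else 0) for i in range(nodes)]
    # Every fill stays inside the bounds (except a single tiny root).
    assert nodes == 1 or all(MIN_OCC <= f <= MAX_OCC for f in fills)
    return fills

print(pack(23))  # [8, 8, 7] - each node can take or lose entries freely
```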
>
>> I will use the method of copying all the nodes, so I will implement
>> an interface to get a mirrored tree.
>>
>> But I can't think of a good way to replace the old VMAs; it can't be
>> done during the copy-tree process, because some VMAs may have the
>> DONT_COPY flag set.
>
> I think it could be done during the copy; instead of a 'replace' it
> would be a 'delete'. I think this is why we need to use a BFS-like
> duplication. Once we reach the leaves, we can modify the tree knowing
> that the state above has already been copied, so we would be altering
> a copy of the data and are at a point where it can be mutated. You
> could detect that the lock-step next/insert is out of sync and delete
> the range between the old_mas and the new_mas.
>
>> It seems that we can only scan all the VMAs in the source tree again
>> to update the new tree. We have already traversed the source tree
>> once in the process of copying it. Is there any way to avoid
>> traversing it again?
>
> Well, we haven't visited the VMAs in the copy, but we have visited the
> leaf nodes with all the pointers to the VMAs. I get what you are
> saying though: we will have to duplicate the leaves, then re-visit the
> leaves to replace the VMAs. I am not sure we can avoid this, since a
> rebalance may occur, and it would be very tricky to rebalance with old
> and new data in the tree - it's probably best to just revisit, at
> least to begin with.
>
> Depending on how you implement it, you could make the copying of the
> tree end on the first leaf node by using the height of the tree to
> figure out which way to sweep (left to right or right to left) on the
> first level. Not strictly BFS, but you could end the maple state in
> the correct location to start replacing the VMAs. Would that work?
>
> We also have a reverse iterator, so we could just run through the tree
> from the right-most node to the start.
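The BFS copy-then-mutate shape of this plan can be sketched on a toy tree (plain Python, generic node objects, not the maple tree's slot/pivot layout; `copy_value` stands in for the fork-time VMA duplication and returns None for the DONT_COPY case). Because the copy is not yet visible to readers, leaves of the new tree can be edited freely once the levels above them have been duplicated.

```python
# Sketch of BFS tree duplication with delete-during-copy (plain Python;
# no RCU concerns since the new tree is not yet visible to readers).
from collections import deque

class Node:
    def __init__(self, leaf, slots):
        self.leaf = leaf   # True: slots hold values; False: slots hold Nodes
        self.slots = slots

def dup_tree(root, copy_value):
    """Duplicate a tree breadth-first; copy_value returns None for
    entries that should be dropped (the DONT_COPY case)."""
    new_root = Node(root.leaf, list(root.slots))
    queue = deque([new_root])
    while queue:
        node = queue.popleft()
        if node.leaf:
            # Leaves reached: swap in copied values, dropping DONT_COPY
            # ones. (A real rebalance on underfull leaves is omitted.)
            node.slots = [v for v in (copy_value(s) for s in node.slots)
                          if v is not None]
        else:
            # Replace each child pointer with a fresh copy of the child,
            # so the source tree is never modified.
            node.slots = [Node(c.leaf, list(c.slots)) for c in node.slots]
            queue.extend(node.slots)
    return new_root

src = Node(False, [Node(True, ["a", "b"]), Node(True, ["skip", "c"])])
dup = dup_tree(src, lambda v: None if v == "skip" else v + "'")
print([leaf.slots for leaf in dup.slots])  # [["a'", "b'"], ["c'"]]
print([leaf.slots for leaf in src.slots])  # source untouched
```

The simplification to note: here the leaves are mutated in the same BFS pass, whereas the thread discusses ending the copy positioned on the first leaf and doing the VMA replacement as a second sweep, precisely because a deletion can force a rebalance that this toy ignores.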
>
> I was thinking that we would make the first 'duplication store' detect
> an empty tree and duplicate the tree, ending on the left-most leaf and
> then replacing the first entry (and possibly deleting up to the first
> store). Each subsequent store would do the same. We would need a
> 'duplication complete' that would remove any entries beyond the last
> store and rebalance, if necessary.
>
> Feel free to use some or none of the ideas. I hope some of this helps
> with what you are trying to work out. Let me know if you have any more
> questions.

Thank you for providing so much information; I am working on this
feature.

Thanks,
Peng

>
> Thanks,
> Liam
>