From mboxrd@z Thu Jan 1 00:00:00 1970
From: Joel Fernandes <joel@joelfernandes.org>
Date: Tue, 20 Jun 2023 17:16:57 -0400
Subject: Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
To: Lorenzo Stoakes
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, Shuah Khan, Vlastimil Babka, Michal Hocko, Kirill A Shutemov, "Liam R. Howlett", "Paul E. McKenney", Suren Baghdasaryan, Kalesh Singh, Lokesh Gidra, Vineeth Pillai
References: <20230531220807.2048037-1-joel@joelfernandes.org> <20230531220807.2048037-2-joel@joelfernandes.org>
Hi Lorenzo,

On 6/20/23 07:02, Lorenzo Stoakes wrote:
> On Mon, Jun 19, 2023 at 11:55:08AM -0400, Joel Fernandes wrote:
>> Hi Lorenzo,
>> Thanks for the review! I replied below:
>>
>> On 6/17/23 18:49, Lorenzo Stoakes wrote:
>>> On Wed, May 31, 2023 at 10:08:01PM +0000, Joel Fernandes (Google) wrote:
>>>> Recently, we see reports [1] of a warning that triggers due to
>>>> move_page_tables() doing a downward and overlapping move on a
>>>> mutually-aligned offset within a PMD. By mutual alignment, I
>>>> mean the source and destination addresses of the mremap are at
>>>> the same offset within a PMD.
>>>>
>>>> This mutual alignment, along with the fact that the move is downward, is
>>>> sufficient to cause a warning related to having an allocated PMD that
>>>> does not have PTEs in it.
>>>>
>>>> This warning will only trigger when there is mutual alignment in the
>>>> move operation. A solution, as suggested by Linus Torvalds [2], is to
>>>> initiate the copy process at the PMD level whenever such alignment is
>>>> present. Implementing this approach will not only prevent the warning
>>>> from being triggered, but it will also optimize the operation, as this
>>>> method should enhance the speed of the copy process whenever there's a
>>
>> [...]
>>
>>>> Suggested-by: Linus Torvalds
>>>> Signed-off-by: Joel Fernandes (Google)
>>>> ---
>>>>  mm/mremap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 61 insertions(+)
>>>>
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 411a85682b58..bf355e4d6bd4 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -478,6 +478,51 @@ static bool move_pgt_entry(enum pgt_entry entry, struct
>>>>  	return moved;
>>>>  }
>>>>
>>>> +/*
>>>> + * A helper to check if a previous mapping exists. Required for
>>>> + * move_page_tables() and realign_addr() to determine if a previous mapping
>>>> + * exists before we can do realignment optimizations.
>>>> + */
>>>> +static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
>>>> +			   unsigned long mask)
>>>> +{
>>>> +	unsigned long addr_masked = addr_to_align & mask;
>>>> +	struct vm_area_struct *prev = NULL, *cur = NULL;
>>>> +
>>>> +	/*
>>>> +	 * If @addr_to_align of either source or destination is not the beginning
>>>> +	 * of the corresponding VMA, we can't align down or we will destroy part
>>>> +	 * of the current mapping.
>>>> +	 */
>>>> +	if (vma->vm_start != addr_to_align)
>>>> +		return false;
>>>
>>> See below, I think we can eliminate this check.
>>>
>>>> +
>>>> +	/*
>>>> +	 * Find the VMA before @vma to see if it subsumes the masked address.
>>>> +	 * The mmap write lock is held here so the lookup is safe.
>>>> +	 */
>>>> +	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
>>>> +	if (WARN_ON_ONCE(cur != vma))
>>>> +		return false;
>>>> +
>>>> +	return !prev || prev->vm_end <= addr_masked;
>>>
>>> This is a bit clunky, and I don't think we need the WARN_ON_ONCE() check if
>>> we're under the mmap_lock.
>>>
>>> How about something like:-
>>>
>>> 	return find_vma_intersection(vma->vm_mm, addr_masked, vma->vm_start) == NULL;
>>>
>>> Which explicitly asserts that the range in [addr_masked, vma->vm_start) is
>>> empty.
>>>
>>> But actually, we should be able to go further and replace the previous
>>> check with:-
>>>
>>> 	return find_vma_intersection(vma->vm_mm, addr_masked, addr_to_align) == NULL;
>>>
>>> Which will fail if addr_to_align is offset within the VMA.
>>
>> Your suggestion would mean that we do a full VMA search starting from the
>> root. That would not be a nice thing if, say, we have 1000s of VMAs.
>>
>> Actually, Liam told me to use find_vma_prev() because, given a VMA, the
>> maple tree would not have to work that hard for the common case to find
>> the previous VMA. Per conversing with him, there is a chance we may have
>> to go one step above in the tree if we hit the edge of a node, but that's
>> not supposed to be the common case. In previous code, the previous VMA
>> could just be obtained using the "previous VMA" pointer; however, that
>> pointer has been removed since the maple tree changes, and given a VMA,
>> going to the previous one using the maple tree is just as fast (as I'm
>> told).
>
> As far as I can tell, find_vma_prev() already does a walk? I mean, this is
> equivalent to find_vma() only also retrieving the previous VMA, right? I
> defer to Liam, but I'm not sure this would be that much more involved?
> Perhaps he can comment.
>
> An alternative is to create an iterator and use vma_prev(). I find it
> extremely clunky that we search for a VMA we already possess (and its
> previous one) while not needing the former.
>
> I'm not hugely familiar with the maple tree (perhaps Liam can comment),
> but I suspect that'd be more performant if that's the concern. Either way,
> I would be surprised if this is the correct approach.

I see your point. I am not sure myself; the maple tree functions for both
APIs are indeed similar. We have already looked up the VMA being aligned
down. If there is a way to get the previous VMA quickly, given an existing
VMA, I can incorporate that change.

Ideally, if I had access to the ma_state used for the lookup of the VMA
being aligned down, I could perhaps reuse that somehow. But when I checked,
passing that state down to these align functions seemed a lot more
invasive. But there is merit to your suggestion itself, in the sense that
it cuts down a few more lines of code.

>> Considering this, I would keep the code as-is, and perhaps you/we could
>> consider the replacement with another API in a subsequent patch, as it
>> does the job for this patch.
>
> See above. I don't think this kind of comment is helpful in code
> review.
> Your disagreement above suffices, I've responded to it, and of course if
> there is no other way this is fine.
>
> But I'd be surprised, and re-looking up a VMA we already have is just
> horrid. It's not really a nitpick, it's a code quality issue in my view.
>
> In any case, let's please try to avoid 'if you are bothered, write a
> follow-up patch' style responses. If you disagree with something just say
> so, it's fine! :)

I wasn't disagreeing :) Just saying that the find_vma_prev() suggested in a
previous conversation with Liam fixes the issue (and has been tested a lot
in this series, on my side), so I was hoping to stick with that, and we
could iterate more on it in the future. However, after taking a deeper look
at the maple tree, I'd like to give the find_vma_intersection() option at
least a try (with appropriate attribution to you).

Apologies if the response style in my previous email came across badly.
That wasn't my intent, and I will try to improve.

[..]
>>>> +		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
>>>> +	}
>>>> +
>>>>  	if (is_vm_hugetlb_page(vma))
>>>>  		return move_hugetlb_page_tables(vma, new_vma, old_addr,
>>>>  						new_addr, len);
>>>> @@ -565,6 +619,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>>>>
>>>>  	mmu_notifier_invalidate_range_end(&range);
>>>>
>>>> +	/*
>>>> +	 * Prevent negative return values when {old,new}_addr was realigned
>>>> +	 * but we broke out of the above loop for the first PMD itself.
>>>> +	 */
>>>> +	if (len + old_addr < old_end)
>>>> +		return 0;
>>>> +
>>>
>>> I find this a little iffy; I mean, I see that if you align [old,new]_addr
>>> to PMD, then from then on you're relying on the fact that the loop is
>>> just going from old_addr (now aligned) -> old_end and thus has the
>>> correct length.
>>>
>>> Can't we just fix this issue by correcting len? If you take my review
>>> above, which checks len in [maybe_]realign_addr(), you could take that as
>>> a pointer and equally update that.
>>>
>>> Then you can drop this check.
>>
>> The drawback of adjusting len is that it changes what move_page_tables()
>> users were previously expecting.
>>
>> I think we should look at the return value of move_page_tables() as well,
>> not just len independently.
>>
>> len is what the user requested.
>>
>> "len + old_addr - old_end" is how much was actually copied and is the
>> return value.
>>
>> If everything was copied, old_addr == old_end and len is unchanged.
>
> Ah yeah, I see; sorry, I missed the fact that we're returning a value —
> that does complicate things...
>
> If we retain the hugetlb logic, then we could work around the issue with
> that instance of len by storing the 'actual length' of the range in a new
> var actual_len and passing that.
>
> If we choose to instead just not do this for hugetlb (I wonder if the
> hugetlb handling code actually does the equivalent of this, since surely
> these pages have to be handled a PMD at a time?) then we can drop the
> whole actual_len idea [see below on response to hugetlb thing].

Thanks. Yes, you are right. We should already be good with hugetlb
handling, as it does appear that move_hugetlb_page_tables() copies by
huge_page_size(h), so old_addr should already be PMD-aligned for it to be
able to do that.

[..]
>>>>  	return len + old_addr - old_end;	/* how much done */
>>>>  }
>>>
>>> Also, I am concerned in the hugetlb case -> len is passed to
>>> move_hugetlb_page_tables(), which is now strictly incorrect; I wonder if
>>> this could cause an issue?
>>>
>>> Correcting len seems the neat way of addressing this.
>>
>> That's a good point.
>> I am wondering if we can just change that from:
>>
>> 	if (is_vm_hugetlb_page(vma))
>> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
>> 						new_addr, len);
>>
>> to:
>>
>> 	if (is_vm_hugetlb_page(vma))
>> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
>> 						new_addr, old_end - old_addr);
>>
>> Or, another option is to turn it off for hugetlb by just moving:
>>
>> 	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
>> 		realign_addr(...);
>>
>> to after:
>>
>> 	if (is_vm_hugetlb_page(vma))
>> 		return move_hugetlb_page_tables(...);
>>
>> thanks,
>
> I think the actual_len solution should sort this, right? If not, maybe
> better to be conservative and disable it for the hugetlb case (I'm not
> sure if this would help, given you'd need to be PMD-aligned anyway,
> right?), so as not to hold up the series.
>
> If we do decide not to include hugetlb (the endless 'special case' for so
> much code...) in this, then we can drop the actual_len idea altogether.
>
> (Yes, I realise it's ironic that I'm suggesting deferring to a later patch
> here, but there you go ;)

;-). Considering our discussion above that hugetlb mremap addresses should
always start at a PMD boundary, maybe I can just add a warning to the if()
like so, to detect any potential issue:

	if (is_vm_hugetlb_page(vma)) {
		WARN_ON_ONCE(old_end - old_addr != len);
		return move_hugetlb_page_tables(vma, new_vma, old_addr,
						new_addr, len);
	}

Thank you so much, and I have learnt a lot from you and others in the -mm
community.

- Joel