From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4FCC9C6FD1C for ; Sat, 25 Mar 2023 16:47:20 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 89EEE6B0075; Sat, 25 Mar 2023 12:47:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 84E9A6B0078; Sat, 25 Mar 2023 12:47:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 73DB36B007B; Sat, 25 Mar 2023 12:47:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 65F0F6B0075 for ; Sat, 25 Mar 2023 12:47:19 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 31B5EC022B for ; Sat, 25 Mar 2023 16:47:19 +0000 (UTC) X-FDA: 80608000998.28.5ABBEC0 Received: from mail-yw1-f177.google.com (mail-yw1-f177.google.com [209.85.128.177]) by imf27.hostedemail.com (Postfix) with ESMTP id 775F140019 for ; Sat, 25 Mar 2023 16:47:17 +0000 (UTC) Authentication-Results: imf27.hostedemail.com; dkim=pass header.d=joelfernandes.org header.s=google header.b=tNGKWix9; spf=pass (imf27.hostedemail.com: domain of joel@joelfernandes.org designates 209.85.128.177 as permitted sender) smtp.mailfrom=joel@joelfernandes.org; dmarc=none ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1679762837; a=rsa-sha256; cv=none; b=6695YAfTM8HSBo+6L6O/U/BuSL8vVC067OZWFrbb7oNHHhO7SluvL2yKY5MKaCF/YQHSyW PCYgqenBvQJ1hfUBx+ywhapwGV1I1JkBFiDre2DiCE5LuZZqi7yXw7gd9qB/7EKtLwH7zp QoYPLupeM2qTC5pyCQrnMo8QdQxHz00= ARC-Authentication-Results: i=1; imf27.hostedemail.com; dkim=pass header.d=joelfernandes.org header.s=google header.b=tNGKWix9; spf=pass (imf27.hostedemail.com: domain of joel@joelfernandes.org designates 209.85.128.177 as permitted sender) smtp.mailfrom=joel@joelfernandes.org; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1679762837; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=yuinZIw6Yor0K6X5bMFGjb+r0iCgiWpAUiHlBpjx5S4=; b=R+fgJmnqVk6g5NIBA6tmi5k7DEwRedghKYDtMfx+ZgeVTjNTU7H7Y5tLUdiB3+3YxEI34h jO2zPQQnqUMwM90XLVVYf7njYwSTlRs89MRggQAgRhT+M6xXOul5eOTdBCNNaHlGaokZRw S+/gQ54UFhtoR+q9DOykJuxJAeBXHJY= Received: by mail-yw1-f177.google.com with SMTP id 00721157ae682-536af432ee5so92771827b3.0 for ; Sat, 25 Mar 2023 09:47:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=joelfernandes.org; s=google; t=1679762836; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=yuinZIw6Yor0K6X5bMFGjb+r0iCgiWpAUiHlBpjx5S4=; b=tNGKWix9w4Ros/0fBXDlveKlSF9FdbpZJ1Dp89Z8JTU7YaU3W/pu5VUnT1DTQCvU9u p/h8Q/KsMaT4YHx4fwreDlA5KCqakQasOKbUKfeVQ27qpFT9mc4QY3nTv0+hZU2WbGk0 Ji4o/izZ8S7JILa2vpNvZfnklM4CozlwaB/qE= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1679762836; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=yuinZIw6Yor0K6X5bMFGjb+r0iCgiWpAUiHlBpjx5S4=; b=5wxmkrcegqvLaAhfbxVvfE1/nIA/wgBr5fpgNBLrjgFJEQdgNmym/Ba3dz5P/iX89C RgNpkcpbjEOawQoimWXLtlWjnQIP+YYOwFx8WeXqKsA+1SdTfkYvLByBjSeXtKenOGa9 V+amt+fAOVU558KyM2v4+wvL10gqr+ABNNDQpHFWZ+FsLNsy7Ep21JlXYzM+8jInyJjg i6wt0rX2AqNymdUTENOzd2ZmJGAij0OG+Z5rVwi93G9+NhXUtUBO+bvnkywxOQM8UUSU U8UTmUkt98ovDLVhHnElgvBuczBXzvBXYG9hjl0NljIpjhgmOFf6qA9UPh3mLVmRMvbv +M4A== X-Gm-Message-State: AAQBX9e1f/2cFzY7yWbxQEfIGIJ0jHiZBlrWcMjSYgGlGcxmlgGBPm7c 7F0Gm5Nz7oY6n8J/acQ6lje7Q02Rh/kcWdkdLdJTcQ== X-Google-Smtp-Source: AKy350ZYw1WANBO4EDCyYnPXd89I5FSJCYTgiymkFeQE8UuNELTYzA5K0tFwLLrI7AcuMDNicjm/odmWzRC498sKYVI= X-Received: by 2002:a81:a707:0:b0:545:6225:4536 with SMTP id e7-20020a81a707000000b0054562254536mr2752308ywh.9.1679762836444; Sat, 25 Mar 2023 09:47:16 -0700 (PDT) MIME-Version: 1.0 References: <20230324130530.xsmqcxapy4j2aaik@box.shutemov.name> <20230325163323.GA3088525@google.com> In-Reply-To: <20230325163323.GA3088525@google.com> From: Joel Fernandes Date: Sat, 25 Mar 2023 12:47:05 -0400 Message-ID: Subject: Re: WARN_ON in move_normal_pmd To: Linus Torvalds Cc: "Kirill A. Shutemov" , Michal Hocko , Naresh Kamboju , Andrew Morton , linux-mm@kvack.org, LKML Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Queue-Id: 775F140019 X-Rspamd-Server: rspam01 X-Stat-Signature: xct7szhupashc4nu7xddn6bj5j9jck6p X-HE-Tag: 1679762837-428562 X-HE-Meta: U2FsdGVkX1+m9jy0kdHL3zS8DhykhWQ2l7ntSLDbMf6mAsMY/KNW8e8eVpcS5mSLbr6TZsw7AYulMEba5FPPMeN4bTeWNGs09knPQMxV9aJ8jX6cXY+sZk41zgBzHAQkaCVDA3Sb99IW3kbv8TbRsbUV4b79Ql1M6U6WNsdgBLRm/nPJFmfiNKS8FD3VIftfW/rD7wzTQeIGzN23pX1XKutKR4Ew0RbGsW4VN4xeg6dlR9KWVvwCQ7dCrarKkXmK/Jl5JAGvPPsL8XjI57UYGsZ3WbYg7xQYgiCoA9QX1QNP/m2KftVy4TArn7nvKSn36zeSKuMCwEYBwaPk2EX6urhQg2LDZGYvKf4Db8Lvy2vSDUfDpc0Ch/Oiz52tW0x/MlCbeK4nZWdEf/6fPQgo20uQNhzTji06+/Wz1bsHz+XpfkBDTA2HnMm+Qltcyu+6E2D4Cb0Ud3aLYyLE4i/U19dYnTWnakO4VPl4ZQUzKWAA2DXah9MJ8sFvSAt2EkYdkWZZXUxNvE24LK0kbKn9VfU29UXsR+E9xdsvBJT4NdIml/tp2d7N9mgOzcjNYIsfgdX2XIG7aHQ0Dh+HpLGf2xkjuyjM6BspDLVPjXvaSxxhjyOstFMC2Y1RsdQVJwAHnM2poZoE2wZaHFfCUP1NzTMBfBOQAhP9ED8F+P9lRGeLvg0MIYCMqPDq8DwrVTjPQJkxJPBrX8Yz2ZOR2f395HVoehfbPDL+uPTyoAR+dsEgKKNyh7QhzfYC3fWLt91DUArQ7R1BCYHstvZ2zOMK/0gcPviP3/gihXOHPOeOuWUgtRYLyLcRk9KJ92UGZyG/91QdOD9c3BTQkWJA7UUpp056ovNKQ5j26P94Jr+4UClcVu/HZEOrkwgtImEuS4uLac83jbHLFBuyLxF7XRuUCrXUEcMkV1vukSJAdUMSGx/X6L8D6yjShNsj7meRaMhs0nyGaLljNFEXnlKWlKY OXz4Awgz qEAhljtAGj/K1rg5chM+acFV9zWDE/PyaJ/3relk6hz1Y/uhnC7vA7xYZqYBJbrrUPhe1acqFz8kb2mMIQF53KD01sVaMWvl6O2mNVwadOkKjZVw4SIIFpttzMwM9iW2nzkTEn6Mqquht2Su26JP/29nhRaWbHehn62VGGOonnsTcGCMh7QWEGA+DqzScmYTBSnWos0ZPVnT6oMCpvspHjMWXllpxeBwY95HjIDI8+JZyJFHEcSoCYJcQVVamufAGTzJ0U9brwBzBuApdhvC5KxfUy2LGEs5iAYTZHaCuS7+NhQ1JehB9FaOpE9dF5aucxHiskVjGEENzC5lvPiViHoNmoQWiMrsVaetX X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Sat, Mar 25, 2023 at 12:33=E2=80=AFPM Joel Fernandes wrote: > > Hi Linus, > > On Fri, Mar 24, 2023 at 04:38:03PM -0700, Linus Torvalds wrote: > > On Fri, Mar 24, 2023 at 6:43=E2=80=AFAM Joel Fernandes wrote: > > > > > > Wouldn't it be better to instead fix it from the caller side? Like > > > making it non-overlapping. > > > > I wonder if we could just do something like this in mremap() instead > > > > - if old/new are mutually PMD_ALIGNED > > > > - *and* there is no vma below new within the same PMD > > > > - then just expand the mremap to be PMD-aligned downwards > > > > IOW, the problem with the exec stack moving case isn't really that > > it's overlapping: that part is fine. We're moving downwards, and we > > start from the bottom, so the moving part works fine. > > > > No, the problem is that we *start* by moving individual pages, and > > then by the time we've a few pages down by a whole PMD, we finish the > > source PMD (and we've cleared all the contents of it), but it still > > exists. > > > > And at *that* point, when we go and start copying the next page, we're > > suddenly fully PMD-aligned, and now we try to copy a whole PMD, and > > then that code is unhappy about the fact that the old (empty) PMD is > > there in the target. > > > > You are very right. I am able to also trigger the warning with a syntheti= c > program that just mremaps a range and moves it in the same way that trigg= ers > this issue, however I had to hack the kernel to prevent mremap from error= ing > out if ranges overlap (unlike exec, mremap does some initial checks for > that). Also had to do other hacks but I did reproduce it consistently. > > The issue is that even though the PMD is empty, it is allocated. So > pmd_none() is kind of a lie in some sense, it is pointing to empty PTEs w= hen > the range is really empty. > > How about we replace the warning with something like the following? > > + if (unlikely(!pmd_none(*new_pmd))) { > + // Check if any ptes in the pmd are non-empty. Doing this= here > + // is ok since this is not a fast path. > + bool pmd_empty =3D true; > + unsigned long tmp_addr =3D new_addr; > + pte_t* check_pte =3D pte_offset_map(new_pmd, new_addr); > + > + new_ptl =3D pte_lockptr(mm, new_pmd); > + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); > + for (; tmp_addr < old_addr + PMD_SIZE; check_pte++, tmp_a= ddr +=3D PAGE_SIZE) { Apologies, here I was going for "tmp_addr < new_addr + PMD_SIZE". I made the change and it still works (This is just to show the basic idea, I am still testing it). thanks, - Joel > + if (!pte_none(*check_pte)) { > + pmd_empty =3D false; > + break; > + } > + } > > + WARN_ON_ONCE(!pmd_empty); > + spin_unlock(new_ptl); > + } > + > > > And for all of this to happen, we need to move things by an exact > > multiple of PMD size, because otherwise we'd never get to that aligned > > situation at all, and we'd always do all the movement in individual > > pages, and everything would be just fine. > > > > And more importantly, if we had just *started* with moving a whole > > PMD, this also wouldn't have happened. But we didn't. We started > > moving individual pages. > > > > So you could see the warning not as a "this range overlaps" warning > > (it's fine, and happens all the time, and we do individual pages that > > way quite happily), but really as a "hey, this was very inefficient - > > you shouldn't have done those individual pages as several small > > independent invidual pages in the first place" warning. > > > > Exactly. > > > So some kind of > > > > /* Is the movement mutually PMD-aligned? */ > > if ((old_addr ^ new_addr) & ~PMD_MASK =3D=3D 0) { > > .. try to extend the move_vma() down to the *aligned* > > PMD case .. > > } > > > > I actually didn't follow what you meant by "mutually PMD-aligned". Could = you > provide some example address numbers to explain? > > AFAIK, only 2MB aligned memory addresses can be directly addressed by a P= MD. > Otherwise how will you index the bytes in the 2MB page? You need bits in = the > address for that. > > > logic in move_page_tables() would get rid of the warning, and would > > make the move more efficient since you'd skip the "move individual > > pages and allocate a new PMD" case entirely. > > > > This is all fairly com,plicated, and the "try to extend the move > > range" would also have to depend on CONFIG_HAVE_MOVE_PMD etc, so I'm > > not saying it's trivial. > > > > But it would seem to be a really nice optimization, in addition to > > getting rid of the warning. > > > > It could even help real world cases outside of this odd stack > > remapping case if users ever end up moving vma's by multiples of > > PMD_SIZE, and there aren't other vma's around the source/target that > > disable the optimization. > > > > Hmm? Anybody want to look into that? It looks hairy enough that I > > think that "you could test this with mutually aligned mremap() > > source/targets in some test program" would be a good thing. Because > > the pure execve() case is rare enough that using *that* as a test-case > > seems like a fool's errand. > > > > Just to mention, mremap errors out if you try to move overlapping ranges > because this in mremap_to(): > > /* Ensure the old/new locations do not overlap > if (addr + old_len > new_addr && new_addr + new_len > addr) { > pr_err("%s: (%s) (%d)", __func__, __FILE__, __LINE__); > goto out; > } > > Or is there an mremap trick I might be missing which actually allows > overlapping range moves? > > thanks, > > - Joel > >