From mboxrd@z Thu Jan  1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Fri, 15 Sep 2023 11:05:14 -0700
Subject: Re: [syzbot] [mm?] kernel BUG in vma_replace_policy
To: Hugh Dickins
Cc: Matthew Wilcox, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot,
 akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, syzkaller-bugs@googlegroups.com
Content-Type: text/plain; charset="UTF-8"
On Fri, Sep 15, 2023 at 9:09 AM Suren Baghdasaryan wrote:
>
> On Thu, Sep 14, 2023 at 9:26 PM Hugh Dickins wrote:
> >
> > On Thu, 14 Sep 2023, Suren Baghdasaryan wrote:
> > > On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox wrote:
> > > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote:
> > > > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan wrote:
> > > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox wrote:
> > > > > > >
> > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote:
> > > > > > > > I think I found the problem and the explanation is much simpler. While
> > > > > > > > walking the page range, queue_folios_pte_range() encounters an
> > > > > > > > unmovable page and queue_folios_pte_range() returns 1. That causes a
> > > > > > > > break from the loop inside walk_page_range() and no more VMAs get
> > > > > > > > locked. After that the loop calling mbind_range() walks over all VMAs,
> > > > > > > > even the ones which were skipped by queue_folios_pte_range() and that
> > > > > > > > causes this BUG assertion.
> > > > > > > >
> > > > > > > > Thinking what's the right way to handle this situation (what's the
> > > > > > > > expected behavior here)...
> > > > > > > > I think the safest way would be to modify walk_page_range() and make
> > > > > > > > it continue calling process_vma_walk_lock() for all VMAs in the range
> > > > > > > > even when __walk_page_range() returns a positive err. Any objection or
> > > > > > > > alternative suggestions?
> > > > > > >
> > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were
> > > > > > > specified.  That means we're going to return an error, no matter what,
> > > > > > > and there's no point in calling mbind_range().  Right?
> > > > > > >
> > > > > > > +++ b/mm/mempolicy.c
> > > > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len,
> > > > > > >  	ret = queue_pages_range(mm, start, end, nmask,
> > > > > > >  			  flags | MPOL_MF_INVERT, &pagelist, true);
> > > > > > >
> > > > > > > +	if (ret == 1)
> > > > > > > +		ret = -EIO;
> > > > > > >  	if (ret < 0) {
> > > > > > >  		err = ret;
> > > > > > >  		goto up_out;
> > > > > > >
> > > > > > > (I don't really understand this code, so it can't be this simple, can
> > > > > > > it?  Why don't we just return -EIO from queue_folios_pte_range() if
> > > > > > > this is the right answer?)
> > > > > >
> > > > > > Yeah, I'm trying to understand the expected behavior of this function
> > > > > > to make sure we are not missing anything. I tried a simple fix that I
> > > > > > suggested in my previous email and it works but I want to understand a
> > > > > > bit more about this function's logic before posting the fix.
> > > > >
> > > > > So, current functionality is that after queue_pages_range() encounters
> > > > > an unmovable page, terminates the loop and returns 1, mbind_range()
> > > > > will still be called for the whole range
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345),
> > > > > all pages in the pagelist will be migrated
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355)
> > > > > and only after that the -EIO code will be returned
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362).
> > > > > So, if we follow Matthew's suggestion we will be altering the current
> > > > > behavior which I assume is not what we want to do.
> > > >
> > > > Right, I'm intentionally changing the behaviour.  My thinking is
> > > > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail.  Should
> > > > such a failure actually move the movable pages before reporting that
> > > > it failed?  I don't know.
> > > > >
> > > > > The simple fix I was thinking about that would not alter this behavior
> > > > > is something like this:
> > > >
> > > > I don't like it, but can we run it past syzbot to be sure it solves the
> > > > issue and we're not chasing a ghost here?
> > >
> > > Yes, I just finished running the reproducer on both upstream and
> > > linux-next builds listed in
> > > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the
> > > problem does not happen anymore.
> > > I'm fine with your suggestion too, just wanted to point out it would
> > > introduce a change in the behavior. Let me know how you want to proceed.
> >
> > Well done, identifying the mysterious cause of this problem:
> > I'm glad to hear that you've now verified that hypothesis.
> >
> > You're right, it would be a regression to follow Matthew's suggestion.
> >
> > Traditionally, modulo bugs and inconsistencies, the queue_pages_range()
> > phase of do_mbind() has done the best it can, gathering all the pages it
> > can that need migration, even if some were missed; and proceeds to do the
> > mbind_range() phase if there was nothing "seriously" wrong (a gap causing
> > -EFAULT).  Then at the end, if MPOL_MF_STRICT was set, and not all the
> > pages could be migrated (or MOVE was not specified and not all pages
> > were well placed), it returns -EIO rather than 0 to inform the caller
> > that not all could be done.
> >
> > There have been numerous tweaks, but I think most importantly
> > 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when
> > MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s
> > which stop the pagewalk early.  In my opinion, not an improvement - makes
> > it harder to get mbind() to do the best job it can (or is it justified as
> > what you're asking for if you say STRICT?).
> >
> > But whatever, it would be a further regression for mbind() not to have
> > done the mbind_range(), even though it goes on to return -EIO.
> >
> > I had a bad first reaction to your walk_page_range() patch (was expecting
> > to see vma_start_write()s in mbind_range()), but perhaps your patch is
> > exactly what process_mm_walk_lock() does now demand.
> >
> > [Why is Hugh responding on this?  Because I have some long-standing
> > mm/mempolicy.c patches to submit next week, but in reviewing what I
> > could or could not afford to get into at this time, had decided I'd
> > better stay out of queue_pages_range() for now - beyond the trivial
> > preferring an MPOL_MF_WRLOCK flag to your bool lock_vma.]
>
> Thanks for the feedback, Hugh!
> Yeah, this positive err handling is kinda weird. If this behavior (do
> as much as possible even if we fail eventually) is specific to mbind()
> then we could keep walk_page_range() as is and lock the VMAs inside
> the loop that calls mbind_range() with a condition that ret is
> positive. That would be the simplest solution IMHO. But if we expect
> walk_page_range() to always apply the requested page_walk_lock policy
> to all VMAs even if some mm_walk_ops callback returns a positive error
> somewhere in the middle of the walk, then my fix would work for that.
> So, to me the important question is how we want walk_page_range() to
> behave in these conditions. I think we should answer that first and
> document it. Then the fix will be easy.

I looked at all the cases where we perform a page walk while locking
VMAs, and mbind() seems to be the only one that would require
walk_page_range() to lock all VMAs even for a failed walk. So, I suggest
this fix instead, and I can also document that if walk_page_range()
fails it might not apply the page_walk_lock policy to all of the VMAs:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..cbc584e9b6ca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1342,6 +1342,9 @@ static long do_mbind(unsigned long start, unsigned long len,
 		vma_iter_init(&vmi, mm, start);
 		prev = vma_prev(&vmi);
 		for_each_vma_range(vmi, vma, end) {
+			/* If queue_pages_range failed then not all VMAs might be locked */
+			if (ret)
+				vma_start_write(vma);
 			err = mbind_range(&vmi, vma, &prev, start, end, new);
 			if (err)
 				break;

If this looks good I'll post the patch. Matthew, Hugh, anyone else?

> >
> > Hugh