From: Suren Baghdasaryan <surenb@google.com>
Date: Fri, 15 Sep 2023 09:09:37 -0700
Subject: Re: [syzbot] [mm?] kernel BUG in vma_replace_policy
To: Hugh Dickins
Cc: Matthew Wilcox, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot,
 akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, syzkaller-bugs@googlegroups.com

On Thu, Sep 14, 2023 at 9:26 PM Hugh Dickins wrote:
>
> On Thu, 14 Sep 2023, Suren Baghdasaryan wrote:
> > On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox wrote:
> > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote:
> > > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan wrote:
> > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox wrote:
> > > > > >
> > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote:
> > > > > > > I think I found the problem and the explanation is much simpler.
> > > > > > > While walking the page range, queue_folios_pte_range() encounters
> > > > > > > an unmovable page and returns 1. That causes a break from the
> > > > > > > loop inside walk_page_range() and no more VMAs get locked. After
> > > > > > > that, the loop calling mbind_range() walks over all VMAs, even
> > > > > > > the ones which were skipped by queue_folios_pte_range(), and that
> > > > > > > causes this BUG assertion.
> > > > > > >
> > > > > > > Thinking about the right way to handle this situation (what's the
> > > > > > > expected behavior here)...
> > > > > > > I think the safest way would be to modify walk_page_range() and
> > > > > > > make it continue calling process_vma_walk_lock() for all VMAs in
> > > > > > > the range even when __walk_page_range() returns a positive err.
> > > > > > > Any objection or alternative suggestions?
> > > > > >
> > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were
> > > > > > specified. That means we're going to return an error, no matter
> > > > > > what, and there's no point in calling mbind_range(). Right?
> > > > > >
> > > > > > +++ b/mm/mempolicy.c
> > > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len,
> > > > > >         ret = queue_pages_range(mm, start, end, nmask,
> > > > > >                                 flags | MPOL_MF_INVERT, &pagelist, true);
> > > > > >
> > > > > > +       if (ret == 1)
> > > > > > +               ret = -EIO;
> > > > > >         if (ret < 0) {
> > > > > >                 err = ret;
> > > > > >                 goto up_out;
> > > > > >
> > > > > > (I don't really understand this code, so it can't be this simple,
> > > > > > can it? Why don't we just return -EIO from queue_folios_pte_range()
> > > > > > if this is the right answer?)
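
(Side note for anyone following along from the syzbot report: the path
under discussion is reached from userspace with an ordinary mbind()
call along these lines -- an illustrative sketch with made-up sizes and
node mask, not the actual reproducer. For queue_folios_pte_range() to
return 1, the range must contain an unmovable page:)

#include <numaif.h>     /* mbind(); link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1;	/* bind to node 0 only */

	if (p == MAP_FAILED)
		return 1;

	/* MPOL_MF_MOVE | MPOL_MF_STRICT is the combination being
	 * discussed: migrate misplaced pages, and fail with EIO if
	 * any page in the range could not be moved. */
	if (mbind(p, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_MOVE | MPOL_MF_STRICT))
		perror("mbind");
	return 0;
}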
> > > > > Yeah, I'm trying to understand the expected behavior of this
> > > > > function to make sure we are not missing anything. I tried a simple
> > > > > fix that I suggested in my previous email and it works, but I want
> > > > > to understand a bit more about this function's logic before posting
> > > > > the fix.
> > > >
> > > > So, current functionality is that after queue_pages_range() encounters
> > > > an unmovable page, terminates the loop and returns 1, mbind_range()
> > > > will still be called for the whole range
> > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345),
> > > > all pages in the pagelist will be migrated
> > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355)
> > > > and only after that the -EIO code will be returned
> > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362).
> > > > So, if we follow Matthew's suggestion we will be altering the current
> > > > behavior, which I assume is not what we want to do.
> > >
> > > Right, I'm intentionally changing the behaviour. My thinking is
> > > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail. Should
> > > such a failure actually move the movable pages before reporting that
> > > it failed? I don't know.
> > >
> > > > The simple fix I was thinking about that would not alter this
> > > > behavior is something like this:
> > >
> > > I don't like it, but can we run it past syzbot to be sure it solves the
> > > issue and we're not chasing a ghost here?
> >
> > Yes, I just finished running the reproducer on both the upstream and
> > linux-next builds listed in
> > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the
> > problem does not happen anymore.
> > I'm fine with your suggestion too, just wanted to point out it would
> > introduce a change in the behavior. Let me know how you want to proceed.
>
> Well done, identifying the mysterious cause of this problem:
> I'm glad to hear that you've now verified that hypothesis.
>
> You're right, it would be a regression to follow Matthew's suggestion.
>
> Traditionally, modulo bugs and inconsistencies, the queue_pages_range()
> phase of do_mbind() has done the best it can, gathering all the pages it
> can that need migration, even if some were missed; and proceeds to do the
> mbind_range() phase if there was nothing "seriously" wrong (a gap causing
> -EFAULT). Then at the end, if MPOL_MF_STRICT was set, and not all the
> pages could be migrated (or MOVE was not specified and not all pages
> were well placed), it returns -EIO rather than 0 to inform the caller
> that not all could be done.
>
> There have been numerous tweaks, but I think most importantly
> 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when
> MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s
> which stop the pagewalk early. In my opinion, not an improvement - it
> makes it harder to get mbind() to do the best job it can (or is it
> justified as what you're asking for if you say STRICT?).
>
> But whatever, it would be a further regression for mbind() not to have
> done the mbind_range(), even though it goes on to return -EIO.
>
> I had a bad first reaction to your walk_page_range() patch (was expecting
> to see vma_start_write()s in mbind_range()), but perhaps your patch is
> exactly what process_mm_walk_lock() does now demand.
>
> [Why is Hugh responding on this? Because I have some long-standing
> mm/mempolicy.c patches to submit next week, but in reviewing what I
> could or could not afford to get into at this time, had decided I'd
> better stay out of queue_pages_range() for now - beyond trivially
> preferring an MPOL_MF_WRLOCK flag to your bool lock_vma.]

Thanks for the feedback, Hugh!

Yeah, this positive err handling is kinda weird. If this behavior (do
as much as possible even if we fail eventually) is specific to mbind(),
then we could keep walk_page_range() as is and lock the VMAs inside
the loop that calls mbind_range(), with the condition that ret is
positive. That would be the simplest solution IMHO.
But if we expect walk_page_range() to always apply the requested
page_walk_lock policy to all VMAs, even if some mm_walk_ops returns a
positive error somewhere in the middle of the walk, then my fix would
work for that.
So, to me the important question is how we want walk_page_range() to
behave in these conditions. I think we should answer that first and
document it. Then the fix will be easy.

>
> Hugh
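
P.S. To make the "lock the VMAs inside the loop that calls
mbind_range()" option concrete, an untested sketch against a roughly
6.5-era do_mbind() -- the vma iterator and mbind_range() signatures
differ between trees, so this only illustrates where the
vma_start_write() calls would go, not an actual patch:

	vma_iter_init(&vmi, mm, start);
	prev = vma_prev(&vmi);
	for_each_vma_range(vmi, vma, end) {
		/*
		 * queue_pages_range() may have stopped the walk early
		 * after hitting an unmovable page (ret == 1), leaving
		 * the remaining VMAs without the write lock that
		 * vma_replace_policy() asserts.  Take it here before
		 * applying the new policy.
		 */
		if (ret > 0)
			vma_start_write(vma);
		err = mbind_range(&vmi, vma, &prev, start, end, new);
		if (err)
			break;
	}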