From: Suren Baghdasaryan <surenb@google.com>
Date: Mon, 31 Jul 2023 12:30:46 -0700
Subject: Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
To: Linus Torvalds
Cc: akpm@linux-foundation.org, jannh@google.com, willy@infradead.org, liam.howlett@oracle.com, david@redhat.com, peterx@redhat.com, ldufour@linux.ibm.com, vbabka@suse.cz, michel@lespinasse.org, jglisse@google.com, mhocko@suse.com, hannes@cmpxchg.org, dave@stgolabs.net, hughd@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, stable@vger.kernel.org
References: <20230731171233.1098105-1-surenb@google.com> <20230731171233.1098105-2-surenb@google.com>
On Mon, Jul 31, 2023 at 11:02 AM Linus Torvalds wrote:
>
> On Mon, 31 Jul 2023 at 10:12, Suren Baghdasaryan wrote:
> >
> > -       walk_page_vma(vma, &subpage_walk_ops, NULL);
> > +       walk_page_vma(vma, &subpage_walk_ops, true, NULL);
>
> Rather than add a new argument to the walk_page_*() functions, I
> really think you should just add the locking rule to the 'const struct
> mm_walk_ops' structure.
>
> The locking rule goes along with the rules for what you are actually
> doing, after all.
> Plus it would actually make it all much more legible
> when it's not just some random "true/false" argument, but an actual
>
>         .write_lock = 1
>
> in the ops definition.

Yeah, I was thinking about that but thought a flag like this in a pure
"ops" struct would be frowned upon. If this is acceptable then it makes
it much cleaner.

> Yes, yes, that might mean that some ops might need duplicating in case
> you really have a walk that sometimes takes the lock, and sometimes
> doesn't, but that is odd to begin with.
>
> The only such case I found from a quick look was the very strange
> queue_pages_range() case. Is it really true that do_mbind() needs the
> write-lock, but do_migrate_pages() does not?
>
> And if they really are that different maybe they should have different walk_ops?

Makes sense to me.

> Maybe there were other cases that I didn't notice.
>
> >                 error = walk_page_range(current->mm, start, end,
> > -                                       &prot_none_walk_ops, &new_pgprot);
> > +                                       &prot_none_walk_ops, true, &new_pgprot);
>
> This looks odd. You're adding vma locking to a place that didn't do it before.
>
> Yes, the mmap semaphore is held for writing, but this particular walk
> doesn't need it as far as I can tell.

Yes, you are correct. Locking a vma in this case seems unnecessary.

> In fact, this feels like that walker should maybe *verify* that it's
> held for writing, but not try to write it again?

In this particular case, does this walk even require the vma to be
write-locked? It looks like it's simply checking the ptes. And if so,
walk_page_range() already has mmap_assert_locked(walk.mm) at the
beginning to ensure the tree is stable. Do we need anything else here?
> Maybe the "lock_vma" flag should be a tri-state:
>
>  - lock for reading (no-op per vma), verify that the mmap sem is held
>    for reading
>
>  - lock for reading (no-op per vma), but with mmap sem held for
>    writing (this kind of "check before doing changes" walker)
>
>  - lock for writing (with mmap sem obviously needs to be held for writing)
>
> >                 mmap_assert_locked(walk.mm);
> > +               if (lock_vma)
> > +                       vma_start_write(vma);
>
> So I think this should also be tightened up, and something like
>
>         switch (ops->locking) {
>         case WRLOCK:
>                 vma_start_write(vma);
>                 fallthrough;
>         case WRLOCK_VERIFY:
>                 mmap_assert_write_locked(mm);
>                 break;
>         case RDLOCK:
>                 mmap_assert_locked(walk.mm);
>         }

I got the idea, but a couple of modifications, if I may.
walk_page_range() already does mmap_assert_locked() at the beginning, so
we can change it to:

        if (ops->locking == RDLOCK)
                mmap_assert_locked(walk.mm);
        else
                mmap_assert_write_locked(mm);

and during the walk:

        switch (ops->locking) {
        case WRLOCK:
                vma_start_write(vma);
                break;
#ifdef CONFIG_PER_VMA_LOCK
        case WRLOCK_VERIFY:
                vma_assert_write_locked(vma);
                break;
#endif
        }

The above CONFIG_PER_VMA_LOCK is ugly, but with !CONFIG_PER_VMA_LOCK
vma_assert_write_locked() becomes mmap_assert_write_locked() and we
already checked that, so it's unnecessary.

> because we shouldn't have a 'vma_start_write()' without holding the
> mmap sem for *writing*, and the above would also allow that
> mprotect_fixup() "walk to see if we can merge, verify that it was
> already locked" thing.
>
> Hmm?
>
> NOTE! The above names are just completely made up. I don't think it
> should actually be some "WRLOCK" enum. There are probably much better
> names. Take the above as a "maybe something kind of in this direction"
> rather than "do it exactly like this".

I'm not great with names... Maybe just add a PGWALK_ prefix like this:

        PGWALK_RDLOCK
        PGWALK_WRLOCK
        PGWALK_WRLOCK_VERIFY
?

> Linus