From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 7F771D18138
	for <linux-mm@archiver.kernel.org>; Mon, 14 Oct 2024 18:15:09 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 12BA66B0088; Mon, 14 Oct 2024 14:15:09 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 0DB326B0089; Mon, 14 Oct 2024 14:15:09 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id EBD116B008A; Mon, 14 Oct 2024 14:15:08 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id CC0326B0088
	for <linux-mm@kvack.org>; Mon, 14 Oct 2024 14:15:08 -0400 (EDT)
Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay04.hostedemail.com (Postfix) with ESMTP id 63DAD1A1002
	for <linux-mm@kvack.org>; Mon, 14 Oct 2024 18:14:54 +0000 (UTC)
X-FDA: 82673009328.10.1B71F3F
Received: from mail-ed1-f51.google.com (mail-ed1-f51.google.com [209.85.208.51])
	by imf11.hostedemail.com (Postfix) with ESMTP id 6448B4001D
	for <linux-mm@kvack.org>; Mon, 14 Oct 2024 18:14:59 +0000 (UTC)
Authentication-Results: imf11.hostedemail.com;
	dkim=pass header.d=google.com header.s=20230601 header.b=CvCs4ans;
	spf=pass (imf11.hostedemail.com: domain of jannh@google.com designates 209.85.208.51 as permitted sender) smtp.mailfrom=jannh@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1728929660;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=mNEJX4kqr9odUybTaFkTf23XVWJozLbeDB372EuR6j4=;
	b=Hu0mxfdiMI3kEtMIuViljtruI0XO5LbVcEw26H5qCq4ZPpIORpKvUN5wxgRNu8xmZtvAyz
	qDQQ2P6pY6gdbSAf4I0vAx3ekDh7CBeQHBigVS6CR/e32QRmBvR+cwewlvcqlrHeDRcwjt
	wtw14C8nmPl/XyVif1vOQMW59Yz1b08=
ARC-Authentication-Results: i=1;
	imf11.hostedemail.com;
	dkim=pass header.d=google.com header.s=20230601 header.b=CvCs4ans;
	spf=pass (imf11.hostedemail.com: domain of jannh@google.com designates 209.85.208.51 as permitted sender) smtp.mailfrom=jannh@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1728929660; a=rsa-sha256;
	cv=none;
	b=JVUoeNooi9GGGUvKx2gNY8roQw9u+aRPc8V76LrE/9z6O9Ibko08yw888KsT5gQGO5mMHc
	LjfNpYH3afuSuNRdLFhVrmEFdCJa+MelWKpmBGPvjiYAZMNL6Gohv2yO8tEE0EsFDLgSm4
	0y27vKOzBlCUQQgscS3XUbn+g9pCqOk=
Received: by mail-ed1-f51.google.com with SMTP id 4fb4d7f45d1cf-5c932b47552so25424a12.0
        for <linux-mm@kvack.org>; Mon, 14 Oct 2024 11:15:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1728929705; x=1729534505; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=mNEJX4kqr9odUybTaFkTf23XVWJozLbeDB372EuR6j4=;
        b=CvCs4ansOm8D9vjscZLDE2Id13LLzb1XSP353JEYEp9N4pU6iFFrQBqhz+Cve/lbZD
         d3rhJHwauwMQXW1TDBUsNlVViSUpWcLWhnYFiPas1Wukc7QdCrumJK9zFwdvWDx9t/2X
         ZBgl6hX/zrYNvCV4dSmzEFb4xccFKYZyfHyN3p4wlcJUiD45VJQe5fQZ689BMnXummeZ
         xX+m9BrEEcoeqEwTmzofpk+JohGhPvBiM9drrU7Ld7jMKChkniTlQc+kF0ILKhCZcN+u
         SHNQMI74Ri3zH3u2G46QfSB0Ykbn5/8QJGD93ZijNugsg0f8450thBjWqLBR/z/wcv4V
         kMEA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1728929705; x=1729534505;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=mNEJX4kqr9odUybTaFkTf23XVWJozLbeDB372EuR6j4=;
        b=A+3bwQVPd2MFRKFEb03Oq/4P8imB9dc9J+YM4YHNz1PSF0rlihpPtOgLqV5wzBxoU0
         P0SOAnG4GXDav1Xy9gHhvEc1TtWheuFKj0j5+M4TAJl+m3Ep9ZwTznynFZwnU5EzbTPB
         xbHR1W0iu/wnVc7YbNRBVxH8zI6iQ80hvIanqd521Jr8v+4mvd00qLKFfNBw23F2eFbg
         KhnV3o3xmG87BllWgLa2y6zttn9MERS1wqokb4OtGjVqwV4SLm7KxKOIK1K4Ryv58Bpf
         4ZRHs4RfNX6Uj0bJb4VCPGMkHIRoIVFq9yMkkRWZL5v8zANe1QyhY5UPSK+wLi04Ienn
         +WlA==
X-Forwarded-Encrypted: i=1; AJvYcCX+I+YwzD6u5whTA8WZsO43k9H75XO3yL1qAlqXGz+PmWAEsV+Vre3GzhXdOW4oozuaWyIHK2d3dQ==@kvack.org
X-Gm-Message-State: AOJu0YzQ3c7Mb7ibkjJk0iQPSXpZLvy2pQZ5UA8PdAZL/vtq0MFHdU/p
	jRV2f4AwFivLOUSqnRNoEGk5BLtMiRbfjbk6Net0ZVF8gswyBTykRERtiD2MBBbNMAgu3Q2n0aD
	6a//BLayRYYP1EHcfNwzjnc7GFoyUkwRMaezx
X-Google-Smtp-Source: AGHT+IEbkDzR+ThZoGvgYtCS7Y5gwEV3PkXHeMMP5MKP0m0Oi8H/8Bb2XOPnZQdihqJsksuw7ZM7VJ44A8HdzyHg+Ps=
X-Received: by 2002:a05:6402:26c1:b0:5c2:5641:af79 with SMTP id
 4fb4d7f45d1cf-5c95b0bd39amr402722a12.0.1728929704710; Mon, 14 Oct 2024
 11:15:04 -0700 (PDT)
MIME-Version: 1.0
References: <cover.1727440966.git.lorenzo.stoakes@oracle.com>
 <a578ee9bb656234d3a19bf9e97c3012378d31a19.1727440966.git.lorenzo.stoakes@oracle.com>
 <CAG48ez3ursoL-f=mYpV79Do18XPPt+MPPHNUBv6uFE1GhpOwSA@mail.gmail.com>
 <868739d2-0869-462f-ac86-1a8d1dccb0a4@lucifer.local> <CAG48ez3vqbqyWb4bLdpqSUnhwqGo2OQetecNhEGPdCGDr94nbQ@mail.gmail.com>
 <b483187b-5caa-4581-86b1-895184301cbb@lucifer.local>
In-Reply-To: <b483187b-5caa-4581-86b1-895184301cbb@lucifer.local>
From: Jann Horn <jannh@google.com>
Date: Mon, 14 Oct 2024 20:14:26 +0200
Message-ID: <CAG48ez0pCasCQ_=ALog7nseKnGci0o0LyQehV42J==gNAiXBVg@mail.gmail.com>
Subject: Re: [RFC PATCH 3/4] mm: madvise: implement lightweight guard page mechanism
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Suren Baghdasaryan <surenb@google.com>, 
	"Liam R . Howlett" <Liam.Howlett@oracle.com>, Matthew Wilcox <willy@infradead.org>, 
	Vlastimil Babka <vbabka@suse.cz>, "Paul E . McKenney" <paulmck@kernel.org>, 
	David Hildenbrand <david@redhat.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, 
	Muchun Song <muchun.song@linux.dev>, Richard Henderson <richard.henderson@linaro.org>, 
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>, Matt Turner <mattst88@gmail.com>, 
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>, 
	"James E . J . Bottomley" <James.Bottomley@hansenpartnership.com>, Helge Deller <deller@gmx.de>, 
	Chris Zankel <chris@zankel.net>, Max Filippov <jcmvbkbc@gmail.com>, Arnd Bergmann <arnd@arndb.de>, 
	linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, 
	linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org, 
	Shuah Khan <shuah@kernel.org>, Christian Brauner <brauner@kernel.org>, linux-kselftest@vger.kernel.org, 
	Sidhartha Kumar <sidhartha.kumar@oracle.com>, Vlastimil Babka <vbabka@suze.cz>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Rspam-User: 
X-Stat-Signature: t474szcgi1xt613sj1q8kp3abyrxw57y
X-Rspamd-Queue-Id: 6448B4001D
X-Rspamd-Server: rspam11
X-HE-Tag: 1728929699-496986
X-HE-Meta: U2FsdGVkX1+LNE1v/Wc4i4t7wtgDHeHMbj9k2bKlKXKlhB/FjpH+AZIpDnkhQFWIREqeD3LXGUo0+fUrh5nAVst/ivOMZ1y+bJp4XHIs90jQFW2QRpwAV3Yra6LoSIzsM9tTeZydE5i/mkh95SJ+FtJeLwzkQC+Z2MTfpSXPd9iiD/ga/YclUCzIvV/31ktzhimh+0w5/1i01652xA4eZ/ACLkhIKBsXHHF+zyTv388ch7p0yQ1IMh48tWVzS6/Ej8T+FIG3F7Hd/iC5obzvgs+w53/W/al5GMZe+NgRb/g2YyUeAB7knbeCV4eDEYaDWu+vDY/QAl1URSeaqhDYXDhwhAwIAq9lck6A9PP0J6vv/HZd05uJLVATBj7bNfFmQlAl+UM3CTL0I/c+XyDyvkiaGdhB3TG4HKn810eBBUdCMmnhr7BKCD+Ja73GStkIA3EJnu9e2E1TNzgJcSTa4L3mVHAGxWfqOcv18fe9TUpzyPZwkPU/HBdTZqPrml/uF3a6xpu+G2AyBiRg4vjmFflhs80UrTsF+ebXhIf735Gpu/TK+qVvCg3C7IF5OrdKn2niO283dRDrbaRbPZGVn+EwKpoOIhLAX87i04s/0GVhJ81DCgsqD1ibx3OuKSOOBtY5pfn2OFeqdDSsKFLeXboaJQmxGlezTHIf26Jn9EkiQGrlEozJ+HIMfHwUleYvcOBGz3r/PmX3gnbFS9Ury1efX2BEGLaYSvG55YWVGp2jmLOqvubISObvC53kulwbkTnn3s9Nl1HlVZaISRd5smkoJUZ7eBgVVkd+j+gh1r+3YAdAZtIaUxPJp/jfPV/2cltsq3kxtK7WrCoFfVG22suHLzoMQFucZWZnltwFS4KpMzH7AZlp2JDHfXzLDMaxmsKEuyaT8yjwCfS1LvL9vInHOK8HRd/UFSk/pqFaV97mIlZ28GStI99syViw7QVv09GMRG5yXKakTqg1/Lp
 lQSrCpEm
 ngRhxEbDgQ3il1T2pMysrU5ABar4eAdEgfEhfElKiLd1M062d0BAz+fAj1OfRBldiR+50HeGTS722WDon3adciszWySsRtxeK3USPL0Awty9z5f2wOcnowzTRH3dyLad61+SMT6DikSoQaxV2iwIP9SUdKFwd8GIFuyRaz/puriOObFOINziv9DzFb7F1gd8GCoTO3lwcgh+DexB2ou0wqFJ7jvofyIZYogl7ofyuq1qGSupg5BT3wUAsltNFHIDXmWAYNii2Y8MvFsrYL0VK7VvSCn4MVrjWGoPuTJP1M2SL2hU=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Mon, Oct 14, 2024 at 7:02=E2=80=AFPM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Mon, Oct 14, 2024 at 05:56:50PM +0200, Jann Horn wrote:
> > On Mon, Oct 14, 2024 at 1:09=E2=80=AFPM Lorenzo Stoakes <lorenzo.stoake=
s@oracle.com> wrote:
> > > On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote:
> > > > On Fri, Sep 27, 2024 at 2:51=E2=80=AFPM Lorenzo Stoakes <lorenzo.st=
oakes@oracle.com> wrote:
> > > By being optimistic and simply having the user having to handle loopi=
ng
> > > which seems reasonable (again, it's weird if you're installing poison
> > > markers and another thread could be racing you) we avoid all of that.
> >
> > I guess one case in which that could happen legitimately is if you
> > race a MADV_POISON on the area 0x1ff000-0x200100 (first page is
> > populated, second page is not, pmd entry corresponding to 0x200000 is
> > clear) with a page fault at 0x200200? So you could have a scenario
> > like:
> >
> > 1. MADV_POISON starts walk_page_range()
> > 2. MADV_POISON sees non-zero, non-poison PTE at 0x1ff000, stops the wal=
k
> > 3. MADV_POISON does zap_page_range_single()
> > 4. pagefault at 0x200200 happens and populates with a hugepage
> > 5. MADV_POISON enters walk_page_range()
> > 6. MADV_POISON splits the THP
> > 7. MADV_POISON sees a populated PTE
>
> You really shouldn't be seeing page faults in the range you are setting u=
p
> poison markers for _at all_ :) it's something you'd do ahead of time.

But that's not what happens in my example - the address where the
fault happens (0x200200) *is not* in the address range that
MADV_POISON is called on (0x1ff000-0x200100). The fault and the
MADV_POISON are in different 4KiB pages. What causes the conflict is
that the fault and the MADV_POISON overlap the same *2MiB region*
(both are in the region 0x200000-0x400000), and so THP stuff can
effectively cause "page faults in the range you are setting up poison
markers for".

> But of course it's possible some scenario could arise like that, that's
> what the EAGAIN is for.
>
> I just really don't want to get into a realm of trying to prove absolutel=
y
> under all circumstances that we can't go on forever in a loop like that.

We can have a bailout on signal_pending() or something like that, and
a cond_resched(). Then as far as I know, it won't really make a
difference in behavior whether the loop is in the kernel or in
userspace code that's following what the manpage tells it to do -
either way, the program will loop until it either finishes its work or
is interrupted by a signal, and either way it can get preempted.
(Well, except under PREEMPT_NONE, but that is basically asking for
long scheduling delays.)

And we do have other codepaths that have to loop endlessly if they
keep racing with page table updates the wrong way, though I guess
those loops are not going to always scan over a large address range
over and over again...

Maybe something like this would be good enough, and mirror what you'd
otherwise tell userspace to do?


@@ -1598,6 +1598,7 @@ int do_madvise(struct mm_struct *mm, unsigned
long start, size_t len_in, int beh
                return madvise_inject_error(behavior, start, start + len_in=
);
 #endif

+retry:
        write =3D madvise_need_mmap_write(behavior);
        if (write) {
                if (mmap_write_lock_killable(mm))
@@ -1627,6 +1628,12 @@ int do_madvise(struct mm_struct *mm, unsigned
long start, size_t len_in, int beh
        else
                mmap_read_unlock(mm);

+       if (error =3D=3D <<<some special value>>>) {
+               if (!signal_pending(current))
+                       goto retry;
+               error =3D -ERESTARTNOINTR;
+       }
+
        return error;
 }

Buuut, heh, actually, I just realized: You could even omit this and
simply replace -EINTR with -ERESTARTNOINTR in your code as the error
value, and then the kernel would automatically go back into the
syscall for you after going through signal handing and such, without
userspace noticing.
https://lore.kernel.org/all/20121206220955.GZ4939@ZenIV.linux.org.uk/
has some explanation on how this works. Basically it tells the
architecture's syscall entry code to move the userspace instruction
pointer back to the syscall instruction, so as soon as execution
returns to userspace, the first userspace instruction that executes
will immediately re-do the syscall. That might be the easiest way,
even if it is maybe a *little* bit of an API abuse to use this thing
without having a pending signal...


> If you drop the lock for contention then you up the risk of that, it just
> feels dangerous.
>
> A userland program can however live with a 'if EAGAIN try again' situatio=
n.
>
> An alternative approach to this might be to try to take the VMA lock, but
> given the fraught situation with locking elsewhere I wonder if we should.
>
> Also, you have to be realy unlucky with timing for this to happen, even i=
n
> the scenario you mention (where you'd have to be unlucky with alignment
> too), unless you're _heavily_ page faulting in the range, either way a
> userland loop checking EAGAIN doesn't seem unreasonable.

Yes, we could do -EINTR and document that for userspace, and as long
as everyone using this properly reads the documentation, it will be
fine. Though I imagine that from the userspace programmer perspective
that's a weird API design - as in, if this error code always means I
have to try again, why can't the kernel do that internally. It's kind
of leaking an implementation detail into the UAPI.