From: Jann Horn <jannh@google.com>
Date: Tue, 22 Oct 2024 21:57:39 +0200
Subject: Re: [PATCH v2 3/5] mm: madvise: implement lightweight guard page mechanism
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan, Liam R. Howlett,
 Matthew Wilcox, Paul E. McKenney, David Hildenbrand, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Muchun Song, Richard Henderson,
 Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
Bottomley" , Helge Deller , Chris Zankel , Max Filippov , Arnd Bergmann , linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org, Shuah Khan , Christian Brauner , linux-kselftest@vger.kernel.org, Sidhartha Kumar , Jeff Xu , Christoph Hellwig , linux-api@vger.kernel.org, John Hubbard Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: iqozqeepgkatp3zitn7in1bwykk94eus X-Rspamd-Queue-Id: A4971C001C X-Rspam-User: X-Rspamd-Server: rspam10 X-HE-Tag: 1729627077-942565 X-HE-Meta: U2FsdGVkX1/Od+anjZMwA1zzyWZT+1qID/sWI53UZ8EU9sCmUnfcPeYkBliqGtYxz3pYQp+yVYVKIue6Rz9rwhqOGHhfrYVj4/JTq32Bqko13w/QUPvWqDrcZMVdOmkG6rzddGpplhdYBVHX2LKg1YKBRQeV+4vftooRj4whHXNsNOcQa3dylCPvADKJLauI5NGehx8VU6gs4eS3ESkNNbzl19FEb2NttMzYCVYRSlmo3nzUbW7xQvPBwWqjTC9In/ntKMoxdzGYZDHQafHIDE0cMgOlhux3Vbk1+mt5IBYHvWsBlywRnAkEoIS9oP70e5cQPMGekd5NwBaA2WQ+yfZEWWr87mlAqbLxnLS5mYYj4spF3H/jSLA9dt1PSkHCDByrf7RLUpxOrj7mvrQn1X/EIO+HgHoxTaJQyMog+gmVodnI9PZEGDGITK081Iiz+zfJ4hC3ci577U1JOO9WlZnSIHSXu0wmFrmXidkREMdJZq7kPgxehaHzhr6aOB2fiYtZOj9vyy0ylqTQ9Y8IPT6dWs/cCKxp/KnYYaz8Df0RSHw31vXYKcheZhab1+nxESDuSyNyPQqd4qbDuSOmS/5R0RWEKsYBROrVQN3Xz0wex3yelnPhw7ZWZ8BGEhAHRV1vCATvqOBu2l4JE4cVrwXvDk3RIiuj8R3nBy65UwSqrVJKyKkvtOy8w0OeWQi/gUOzU0UPhQcqj567w/6+KSbP1vomjxKeSTWnmz8JyXzRj9DHFzgLr/FtSOrFin+AKtbL3dOnERtLZ/GjW60mkkN9crI6eDVVgi6ifGyy3+rZdkHKpaGZTBXFEndnO5rs9JdSdKAul94M0HYP55H4vP+9hsdXLhl1urSieforQvHVNhVq+F+GO7QqyfCeJP6ciMleMLQQwoRxGMqsv+nsh86DkNMk8Zmt057FPr+2ZsAXUHQ+eDjl8nLvd8dm+VPwwEJCzZKloIJ3PCLgnd9 HceSDkPg G7RxfAPGPZqmyCOM7AfvlsG9NSVDygmnYt1Ra4AtJs+J7HEv8ICaISknop8zMd7MMXViQ5UVIMuklDV6FCUL9yV85xvppRmn8mc3WvfMsbU2I+MvSdztbqTpJ/ljiI4fMqY4yZJPY1n9lRcyTpbPBiSE9J3b78+vtm+m+h2y62wrho1Nh3F/LsmhHH4pHtklhDdii/lD0IPL6lICtjy54kQQdX0gYybE3g5Cc X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Oct 22, 2024 at 9:35=E2=80=AFPM Lorenzo Stoakes wrote: > On Tue, Oct 22, 2024 at 09:08:53PM +0200, Jann Horn wrote: > > On Mon, Oct 21, 2024 at 10:46=E2=80=AFPM Vlastimil Babka wrote: > > > On 10/21/24 22:27, Lorenzo Stoakes wrote: > > > > On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote: > > > >> On 10/20/24 18:20, Lorenzo Stoakes wrote: > > > >> > + while (true) { > > > >> > + /* Returns < 0 on error, =3D=3D 0 if success, > 0 if = zap needed. */ > > > >> > + err =3D walk_page_range_mm(vma->vm_mm, start, end, > > > >> > + &guard_poison_walk_ops, NULL= ); > > > >> > + if (err <=3D 0) > > > >> > + return err; > > > >> > + > > > >> > + /* > > > >> > + * OK some of the range have non-guard pages mapped, = zap > > > >> > + * them. This leaves existing guard pages in place. > > > >> > + */ > > > >> > + zap_page_range_single(vma, start, end - start, NULL); > > > >> > > > >> ... however the potentially endless loop doesn't seem great. Could= a > > > >> malicious program keep refaulting the range (ignoring any segfault= s if it > > > >> loses a race) with one thread while failing to make progress here = with > > > >> another thread? Is that ok because it would only punish itself? > > > > > > > > Sigh. Again, I don't think you've read the previous series have you= ? Or > > > > even the changelog... I added this as Jann asked for it. Originally= we'd > > > > -EAGAIN if we got raced. See the discussion over in v1 for details. 
> > > >
> > > > I did it that way specifically to avoid such things, but Jann didn't
> > > > appear to think it was a problem.
> > >
> > > If Jann is fine with this then it must be secure enough.
> >
> > My thinking there was:
> >
> > We can legitimately race with adjacent faults populating the area
> > we're operating on with THP pages; as long as the zapping and
> > poison-marker-setting are separate, *someone* will have to do the
> > retry. Either we do it in the kernel, or we tell userspace to handle
> > it, but having the kernel take care of it is preferable because it
> > makes the stable UAPI less messy.
> >
> > One easy way to do it in the kernel would be to return -ERESTARTNOINTR
> > after the zap_page_range_single() instead of jumping back up, which in
> > terms of locking and signal handling and such would be equivalent to
> > looping in userspace (because really that's what -ERESTARTNOINTR does
> > - it returns out to userspace and moves the instruction pointer back
> > to restart the syscall). Though if we do that immediately, it might
> > make MADV_POISON unnecessarily slow, so we should probably retry once
> > before doing that. The other easy way is to just loop here.
>
> Yes we should definitely retry probably a few times to cover the rare
> situation of a THP race as you describe under non-abusive circumstances.
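Right - concretely, the bounded-retry variant could look roughly like
this sketch (MAX_GUARD_RETRIES and the wrapper function are invented
names for illustration; only the walk/zap calls are the ones from the
quoted patch, this is not actual patch code):

#define MAX_GUARD_RETRIES 3     /* invented bound, for this sketch only */

static long guard_poison_range(struct vm_area_struct *vma,
                               unsigned long start, unsigned long end)
{
        int i;

        for (i = 0; i < MAX_GUARD_RETRIES; i++) {
                /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
                long err = walk_page_range_mm(vma->vm_mm, start, end,
                                              &guard_poison_walk_ops, NULL);
                if (err <= 0)
                        return err;

                /*
                 * Part of the range was populated behind our back (e.g.
                 * by a racing fault); zap the non-guard pages and walk
                 * again. Existing guard pages are left in place.
                 */
                zap_page_range_single(vma, start, end - start, NULL);
        }

        /*
         * Still racing after several attempts: instead of looping forever
         * under mmap_lock, return out to userspace with the instruction
         * pointer rewound so the madvise() call is transparently restarted,
         * dropping and re-taking the lock in between.
         */
        return -ERESTARTNOINTR;
}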
> > The cond_resched() and pending fatal signal check mean that (except on
> > CONFIG_PREEMPT_NONE) the only differences between the current
> > implementation and looping in userspace are that we don't handle
> > non-fatal signals in between iterations and that we keep hogging the
> > mmap_lock in read mode. We do already have a bunch of codepaths that
> > retry on concurrent page table changes, like when zap_pte_range()
> > encounters a pte_offset_map_lock() failure; though I guess the
> > difference is that the retry on those is just a couple of instructions,
> > which would be harder to race consistently, while here we redo walks
> > across the entire range, which should be fairly easy to race
> > repeatedly.
> >
> > So I guess you have a point that this might be the easiest way to
> > stall other tasks that are trying to take mmap_lock for an extended
> > amount of time; I did not fully consider that... and then I guess you
> > could use that to slow down usercopy fault handling (once the lock
> > switches to handoff mode because of a stalled writer?) or slow down
> > other processes trying to read /proc/$pid/cmdline?
>
> Hm, does that need a write lock?

No, but if you have one reader that is hogging the rwsem and a writer
is queued up on the rwsem behind it, I think new readers will sometimes
be queued up behind the writer. So even though the rwsem is only
actually held by a reader, new readers can't immediately take it,
because the rwsem code thinks that would be unfair to a pending writer
who just wants to make some quick change.

I'm not super familiar with this code, but basically I think it works
roughly like this: if the rwsem code notices that a bunch of readers
are preventing a writer from taking the lock, it will start queuing new
readers behind the queued writer. You can see in rwsem_read_trylock()
that the trylock fastpath is skipped if anyone is waiting on the rwsem
or the handoff flag is set, and in rwsem_down_read_slowpath() the
"reader optimistic lock stealing" path is skipped if the lock is
currently held by multiple readers or if the handoff bit is set.

The handoff bit can be set in rwsem_try_write_lock() by a writer "if
it is an RT task or wait in the wait queue for too long". Basically I
think it means something like "other users of the lock are hogging it
more than they should, stop stealing the lock from me". And the timeout
for triggering handoff mode is pretty short: RWSEM_WAIT_TIMEOUT is
defined to something like 4ms, so I think that's how long writers
tolerate the lock being hogged by readers before they prevent new
readers from stealing the lock.
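If it helps, here's how I picture the reader-side rules, as a toy model
(deliberately simplified, with invented names - not the real rwsem code,
which packs this state into an atomic count word in kernel/locking/rwsem.c):

#include <stdbool.h>

/* Toy model only; field and function names are made up. */
struct toy_rwsem {
        int  readers;           /* lock currently held by this many readers */
        bool writer_queued;     /* a writer is waiting in the queue */
        bool handoff;           /* a starved waiter set the handoff bit */
};

/* Roughly the effect of rwsem_read_trylock() plus the "reader optimistic
 * lock stealing" check in rwsem_down_read_slowpath(), as described above. */
static bool toy_read_trylock(struct toy_rwsem *sem)
{
        /*
         * Even though only readers hold the lock, a new reader doesn't
         * steal it once a writer is queued or the handoff bit is set -
         * it queues up behind the writer instead. That's why one reader
         * hogging mmap_lock behind a queued writer can also stall
         * unrelated readers, e.g. of /proc/$pid/cmdline.
         */
        if (sem->writer_queued || sem->handoff)
                return false;   /* slowpath: wait in the queue */

        sem->readers++;
        return true;
}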