From: Chris Wilson
To: Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 intel-gfx@lists.freedesktop.org, Andrew Morton, Jan Kara, Jérôme Glisse,
 John Hubbard, Claudio Imbrenda, "Kirill A. Shutemov", Jason Gunthorpe
Subject: Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
Date: Thu, 25 Jun 2020 16:48:24 +0100
Message-ID: <159310010426.31486.12265756858860973914@build.alporthouse.com>
In-Reply-To: <20200625151227.GP1320@dhcp22.suse.cz>
References: <20200624191417.16735-1-chris@chris-wilson.co.uk>
 <20200625075725.GC1320@dhcp22.suse.cz>
 <159308284703.4527.16058577374955415124@build.alporthouse.com>
 <20200625151227.GP1320@dhcp22.suse.cz>
User-Agent: alot/0.8.1

Quoting Michal Hocko (2020-06-25 16:12:27)
> On Thu 25-06-20 12:00:47, Chris Wilson wrote:
> > Quoting Michal Hocko (2020-06-25 08:57:25)
> > > On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > > > A general rule of thumb is that shrinkers should be fast and effective.
> > > > They are called from direct reclaim at the most inconvenient of times,
> > > > when the caller is waiting for a page.
> > > > If we attempt to reclaim a page being
> > > > pinned for active dma [pin_user_pages()], we will incur far greater
> > > > latency than for a normal anonymous page mapped multiple times. Worse,
> > > > the page may be in use indefinitely by the HW and unable to be reclaimed
> > > > in a timely manner.
> > > >
> > > > A side effect of the LRU shrinker not being dma aware is that we will
> > > > often attempt to perform direct reclaim on the persistent group of dma
> > > > pages while continuing to use the dma HW (an issue as the HW may already
> > > > be actively waiting for the next user request), and even attempt to
> > > > reclaim a partially allocated dma object in order to satisfy pinning
> > > > the next user page for that object.
> > >
> > > You are talking about direct reclaim, but this path is shared with the
> > > background reclaim. This is a bit confusing. Maybe you just want to
> > > outline the reclaim latency, which is more noticeable to userspace in
> > > direct reclaim. This would be good to clarify.
> > >
> > > How much memory are we talking about here, btw?
> >
> > It depends. In theory, it is used sparingly. But it is under userspace
> > control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If
> > all goes to plan, the application memory is only pinned for as long as the
> > HW is using it, but that is an indefinite period of time and an indefinite
> > amount of memory. There are provisions in place to impose upper limits
> > on how long an operation can last on the HW, and the mmu-notifier is
> > there to ensure we do unpin the memory on demand. However, cancelling a
> > HW operation (which will result in data loss, and often process
> > termination due to an unfortunate sequence of events when userspace
> > fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not
> > a good choice.
>
> OK, thanks for the clarification.
> What and when should MM intervene to
> prevent potential OOM?

Open to any and all suggestions. At the moment we have the shrinkers for
direct reclaim and kswapd [though for direct reclaim we don't like
touching active dma pages, for the same reason] and the oom-notifier as a
last resort.

At the moment, we start trying to flush work from the gpu inside kswapd,
but that tbh is not particularly effective. If we used kswapd as a signal
for low memory, we could/should be more proactive in releasing work and
returning dirty pages to shmemfs as soon as it is complete [normally
retiring completed work is lazy and does not immediately drop dirty
pages]. That would also suggest a complementary signal for when we can go
back to being lazy again.

We are also a bit too lax in applying backpressure. The focus has been on
making sure one hog does not prevent an innocent client from accessing
the HW. We use cond_synchronize_rcu() on a per-client basis if it appears
we are allocating new requests too fast when allocations start failing.
That is very probably too little, too late. It does achieve the original
goal of preventing a single client from submitting requests faster than
we can retire them. But we have nothing for page allocations, other than
trying to reclaim shmemfs allocations after a NORETRY failure, hoping to
stave off the shrinker. [That fails miserably, as we are never alone on
the system, ofc.] A cgroups interface to restrict memory allocations on a
per-client basis has never quite materialised. Even then we would have to
assume the worst, and overcommitting is the norm.

> [...]
> > > Btw. the overall intention of the patch is not really clear to me. Do
> > > I get it right that this is going to reduce the latency of reclaim for
> > > pages that are not reclaimable anyway because they are pinned? If yes,
> > > do we have any numbers for that?
> >
> > I can plug it into a microbenchmark ala cycletest to show the impact...
> > Memory filled with 64M gup objects, random utilisation of those with
> > the GPU; background process filling the pagecache with find /; reporting
> > the time difference from the expected expiry of a timer against the actual:
> > [On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each
> > measuring mean latency for 20s -- mean is probably a really bad choice
> > here, we need 50/90/95/99]
> >
> > Direct reclaim calling mmu-notifier:
> > gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us
> >
> > Skipping try_to_unmap_one with page_maybe_dma_pinned:
> > gem_syslatency: cycles=1965, latency mean=597.971us max=28462us
> >
> > Baseline (background find /; application touched all memory, but no HW
> > ops):
> > gem_syslatency: cycles=0, latency mean=6.695us max=77us
> >
> > Compare with the time to allocate a single THP against load:
> >
> > Baseline:
> > gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
> > Direct reclaim calling mmu-notifier:
> > gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
> > page_maybe_dma_pinned skip:
> > gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us
> >
> > Take with a massive pinch of salt. I expect, once I find the right
> > sequence, to reliably control the induced latency on the RT thread.
> >
> > But first, I have to look at why there's a correlation between HW load
> > and timer latency, even with steady-state usage. That's quite
> > surprising -- ah, I had it left at PREEMPT_VOLUNTARY and this machine
> > has to scan every request submitted to the HW. Just great.
> >
> > With PREEMPT:
> > Timer:
> > Base:    gem_syslatency: cycles=0,    latency mean=8.823us    max=83us
> > Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us   max=4805us
> > Skip:    gem_syslatency: cycles=2677, latency mean=70.306us   max=4720us
> >
> > THP:
> > Base:    gem_syslatency: cycles=0,    latency mean=1993.693us max=201958us
> > Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
> > Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us
> >
> > Earlier caveats notwithstanding; confidence in these results is still
> > low. I need to refine the testing somewhat, and at the very least gather
> > enough samples for credible statistics.
>
> OK, so my understanding is that the overall impact is very low. So what
> is the primary motivation for the patch? Preventing pointless work --
> i.e., not invoking the notifier?

I'm working on capturing the worst cases :) I just need to find a way to
consistently steer reclaim into hitting active gup objects, and then wrap
that inside a timer, without it appearing too contrived.

Does populating a [THP] mmap seem a reasonable way to measure page
allocation latency?
-Chris