From: Chris Wilson
To: Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 intel-gfx@lists.freedesktop.org, Andrew Morton, Jan Kara, Jérôme Glisse,
 John Hubbard, Claudio Imbrenda, "Kirill A. Shutemov", Jason Gunthorpe
Subject: Re: [PATCH] mm: Skip opportunistic reclaim for dma pinned pages
Date: Thu, 25 Jun 2020 16:48:24 +0100
Message-ID: <159310010426.31486.12265756858860973914@build.alporthouse.com>
In-Reply-To: <20200625151227.GP1320@dhcp22.suse.cz>
References: <20200624191417.16735-1-chris@chris-wilson.co.uk>
 <20200625075725.GC1320@dhcp22.suse.cz>
 <159308284703.4527.16058577374955415124@build.alporthouse.com>
 <20200625151227.GP1320@dhcp22.suse.cz>
User-Agent: alot/0.8.1

Quoting Michal Hocko (2020-06-25 16:12:27)
> On Thu 25-06-20 12:00:47, Chris Wilson wrote:
> > Quoting Michal Hocko (2020-06-25 08:57:25)
> > > On Wed 24-06-20 20:14:17, Chris Wilson wrote:
> > > > A general rule of thumb is that shrinkers should be fast and effective.
> > > > They are called from direct reclaim at the most inconvenient of times,
> > > > when the caller is waiting for a page.
> > > > If we attempt to reclaim a page being
> > > > pinned for active dma [pin_user_pages()], we will incur far greater
> > > > latency than for a normal anonymous page mapped multiple times. Worse,
> > > > the page may be in use indefinitely by the HW and unable to be reclaimed
> > > > in a timely manner.
> > > >
> > > > A side effect of the LRU shrinker not being dma aware is that we will
> > > > often attempt to perform direct reclaim on the persistent group of dma
> > > > pages while continuing to use the dma HW (an issue as the HW may already
> > > > be actively waiting for the next user request), and even attempt to
> > > > reclaim a partially allocated dma object in order to satisfy pinning
> > > > the next user page for that object.
> > >
> > > You are talking about direct reclaim, but this path is shared with the
> > > background reclaim. This is a bit confusing. Maybe you just want to
> > > outline the reclaim latency, which is more noticeable to userspace in
> > > direct reclaim. This would be good to clarify.
> > >
> > > How much memory are we talking about here, btw?
> >
> > It depends. In theory, it is used sparingly. But it is under userspace
> > control, exposed via Vulkan, OpenGL, OpenCL, media and even old XShm. If
> > all goes to plan, the application memory is only pinned for as long as the
> > HW is using it, but that is an indefinite period of time and an indefinite
> > amount of memory. There are provisions in place to impose upper limits
> > on how long an operation can last on the HW, and the mmu-notifier is
> > there to ensure we do unpin the memory on demand. However, cancelling a
> > HW operation (which will result in data loss, and often process
> > termination due to an unfortunate sequence of events when userspace
> > fails to recover) for a try_to_unmap on behalf of the LRU shrinker is not
> > a good choice.
>
> OK, thanks for the clarification.
> What and when should MM intervene to
> prevent potential OOM?

Open to any and all suggestions. At the moment we have the shrinkers for
direct reclaim and kswapd [though for direct reclaim we don't like
touching active dma pages, for the same reason] and the oom-notifier as a
last resort.

At the moment, we start trying to flush work from the gpu inside kswapd,
but that tbh is not particularly effective. If we used kswapd as a signal
for low memory, we could/should be more proactive in releasing work and
returning dirty pages to shmemfs as soon as it is complete [normally
retiring completed work is lazy and does not immediately drop dirty
pages]. That would also suggest a complementary signal for when we can go
back to being lazy again.

We are also a bit too lax in applying backpressure. The focus has been on
making sure one hog does not prevent an innocent client from accessing
the HW. We use cond_synchronize_rcu() on a per-client basis if it appears
we are allocating new requests too fast when allocations start failing.
That is very probably too little, too late. It does achieve the original
goal of preventing a single client from submitting requests faster than
we can retire them. But we have nothing for page allocations, other than
trying to reclaim shmemfs allocations after a NORETRY failure, hoping to
stave off the shrinker. [That fails miserably, as we are never alone on
the system, ofc.] A cgroups interface to restrict memory allocations on a
per-client basis has never quite materialised. Even then we would have to
assume the worst, and overcommitting is the norm.

> [...]
> > > Btw. the overall intention of the patch is not really clear to me. Do
> > > I get it right that this is going to reduce the latency of reclaim for
> > > pages that are not reclaimable anyway because they are pinned? If yes,
> > > do we have any numbers for that?
> >
> > I can plug it into a microbenchmark ala cycletest to show the impact...
> > Memory filled with 64M gup objects, random utilisation of those with
> > the GPU; background process filling the pagecache with find /; reporting
> > the time difference from the expected expiry of a timer against the actual:
> > [On a Geminilake Atom-class processor with 8GiB, average of 5 runs, each
> > measuring mean latency for 20s -- mean is probably a really bad choice
> > here, we need 50/90/95/99]
> >
> > Direct reclaim calling mmu-notifier:
> > gem_syslatency: cycles=2122, latency mean=1601.185us max=33572us
> >
> > Skipping try_to_unmap_one with page_maybe_dma_pinned:
> > gem_syslatency: cycles=1965, latency mean=597.971us max=28462us
> >
> > Baseline (background find /; application touched all memory, but no HW
> > ops):
> > gem_syslatency: cycles=0, latency mean=6.695us max=77us
> >
> > Compare with the time to allocate a single THP against load:
> >
> > Baseline:
> > gem_syslatency: cycles=0, latency mean=1541.562us max=52196us
> > Direct reclaim calling mmu-notifier:
> > gem_syslatency: cycles=2115, latency mean=9050.930us max=396986us
> > page_maybe_dma_pinned skip:
> > gem_syslatency: cycles=2325, latency mean=7431.633us max=187960us
> >
> > Take with a massive pinch of salt. I expect, once I find the right
> > sequence, to reliably control the induced latency on the RT thread.
> >
> > But first, I have to look at why there's a correlation between HW load
> > and timer latency, even with steady-state usage. That's quite
> > surprising -- ah, I had it left at PREEMPT_VOLUNTARY and this machine
> > has to scan every request submitted to the HW. Just great.
> >
> > With PREEMPT:
> > Timer:
> > Base:    gem_syslatency: cycles=0,    latency mean=8.823us    max=83us
> > Reclaim: gem_syslatency: cycles=2224, latency mean=79.308us   max=4805us
> > Skip:    gem_syslatency: cycles=2677, latency mean=70.306us   max=4720us
> >
> > THP:
> > Base:    gem_syslatency: cycles=0,    latency mean=1993.693us max=201958us
> > Reclaim: gem_syslatency: cycles=1284, latency mean=2873.633us max=295962us
> > Skip:    gem_syslatency: cycles=1809, latency mean=1991.509us max=261050us
> >
> > Earlier caveats notwithstanding; confidence in these results is still
> > low. I need to refine the testing somewhat, and at the very least gather
> > enough samples for credible statistics.
>
> OK, so my understanding is that the overall impact is very low. So what
> is the primary motivation for the patch? Preventing pointless work --
> i.e., not invoking the notifier?

I'm working on capturing the worst cases :) I just need to find a way to
consistently steer reclaim into hitting active gup objects, and then wrap
that inside a timer, without it appearing too contrived.

Does populating a [THP] mmap seem a reasonable way to measure page
allocation latency?
-Chris