From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 267ABC25B6B
	for <linux-mm@archiver.kernel.org>; Thu, 26 Oct 2023 23:48:23 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 8AE4E8D0035; Thu, 26 Oct 2023 19:48:22 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 84DA28D0034; Thu, 26 Oct 2023 19:48:22 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 6B0978D0035; Thu, 26 Oct 2023 19:48:22 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16])
	by kanga.kvack.org (Postfix) with ESMTP id 54C078D0034
	for <linux-mm@kvack.org>; Thu, 26 Oct 2023 19:48:22 -0400 (EDT)
Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id 2CB1EA0DF2
	for <linux-mm@kvack.org>; Thu, 26 Oct 2023 23:48:22 +0000 (UTC)
X-FDA: 81389254044.03.3219F23
Received: from mail-vs1-f51.google.com (mail-vs1-f51.google.com [209.85.217.51])
	by imf29.hostedemail.com (Postfix) with ESMTP id 74AD4120019
	for <linux-mm@kvack.org>; Thu, 26 Oct 2023 23:48:20 +0000 (UTC)
Authentication-Results: imf29.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=gYUvOdkG;
	spf=pass (imf29.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.217.51 as permitted sender) smtp.mailfrom=21cnbao@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
Received: by mail-vs1-f51.google.com with SMTP id ada2fe7eead31-45853ab5556so667550137.2
        for <linux-mm@kvack.org>; Thu, 26 Oct 2023 16:48:20 -0700 (PDT)
X-Received: by 2002:a67:b80c:0:b0:458:1c00:c32f with SMTP id
 i12-20020a67b80c000000b004581c00c32fmr1422072vsf.34.1698364099116; Thu, 26
 Oct 2023 16:48:19 -0700 (PDT)
MIME-Version: 1.0
References: <ae3115778a3fa10ec77152e18beed54fafe0f6e7.1698151516.git.baolin.wang@linux.alibaba.com>
 <CAGsJ_4zgdAmyU-075jd8KfXn=CdAVC8Rs481sCOd5N2a68yPUg@mail.gmail.com>
 <CAGsJ_4z1u4-_JXUM9GG2cqc4Nwrx1v69uHsbff5jDQZHQgWP+w@mail.gmail.com>
 <87y1frqz2u.fsf@nvdebian.thelocal> <CAGsJ_4wFiz-obaoXqfU9p-YqgFwExyXpGjpDKMOUt7mnenD-ew@mail.gmail.com>
 <87ttqfqw8f.fsf@nvdebian.thelocal> <d2d8062c-8248-b710-ccd6-e5359d15c385@linux.alibaba.com>
 <87bkcn1j5k.fsf@nvdebian.thelocal> <CAOUHufZ84aDmiW3Efh87q1oMJr-zk5cyaebucCFzevFHx77ngQ@mail.gmail.com>
 <CAGsJ_4zueK32KMHM0=EYjB3spYvh-yJU=buorG+6+Stnu=cypw@mail.gmail.com> <877cnb0zyk.fsf@nvdebian.thelocal>
In-Reply-To: <877cnb0zyk.fsf@nvdebian.thelocal>
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 27 Oct 2023 07:48:07 +0800
Message-ID: <CAGsJ_4wTmA0FsPe_+EfA4wrxiWO_q1AkrJt0UyfHZaU0EJAQdw@mail.gmail.com>
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the
 access bit
To: Alistair Popple <apopple@nvidia.com>
Cc: Yu Zhao <yuzhao@google.com>, Baolin Wang <baolin.wang@linux.alibaba.com>, 
	catalin.marinas@arm.com, will@kernel.org, akpm@linux-foundation.org, 
	v-songbaohua@oppo.com, linux-mm@kvack.org, 
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Wed, Oct 25, 2023 at 6:16 PM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
> >> >
> >> >
> >> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> >> >
> >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>> Barry Song <21cnbao@gmail.com> writes:
> >> > >>>>
> >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >> > >>>>>>
> >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >> > >>>>>> <baolin.wang@linux.alibaba.com> wrote:
> >> > >> [...]
> >> > >>
> >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >> > >>>>>> page,  B might
> >> > >>>>>> be a correct choice.
> >> > >>>>>
> >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >> > >>>>> the head of lru list or new generation, it needs a long time to slide back
> >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >> > >>>>> should have been large enough for tlb to be flushed. If the page is really
> >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >> > >>>>> access flag in the long time in which the page is re-moved to the tail
> >> > >>>>> as the page can be accessed multiple times if it is really hot.
> >> > >>>>
> >> > >>>> This might not be true if you have external hardware sharing the page
> >> > >>>> tables with software through either HMM or hardware supported ATS
> >> > >>>> though.
> >> > >>>>
> >> > >>>> In those cases I think it's much more likely hardware can still be
> >> > >>>> accessing the page even after a context switch on the CPU say. So those
> >> > >>>> pages will tend to get reclaimed even though hardware is still actively
> >> > >>>> using them which would be quite expensive and I guess could lead to
> >> > >>>> thrashing as each page is reclaimed and then immediately faulted back
> >> > >>>> in.
> >> > >
> >> > > That's possible, but the chance should be relatively low. At least on
> >> > > x86, I have not heard of this issue.
> >> >
> >> > Personally I've never seen any x86 system that shares page tables with
> >> > external devices, other than with HMM. More on that below.
> >> >
> >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >> > >>> pagetable the ability to set access flag in page table entries by
> >> > >>> itself? if yes,
> >> > >>> I don't see how our approach will hurt as folio_referenced can notify the
> >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >> > >>> either as the external hardware can't set access flags, that means we
> >> > >>> have ignored its reference and only knows cpu's access even in the current
> >> > >>> mainline code. so we are not getting worse.
> >> > >>>
> >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >> > >>> also broadcast to external hardware, then external hardware sees the
> >> > >>> cleared access flag, thus, it can set access flag in page table when the
> >> > >>> hardware access it?  If this is the case, I feel what you said is true.
> >> > >> Perhaps it would help if I gave a concrete example. Take for example
> >> > >> the
> >> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
> >> > >> two ways depending on the specific HW implementation.
> >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
> >> > >> TLB
> >> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> > >> forward TLB invalidations via MMU notifiers.
> >> > >
> >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >> >
> >> > Lucky you :-)
> >> >
> >> > ARM64 SMMU architecture specification supports the possibilty of both,
> >> > as does the driver. Not that I think whether or not BTM is supported has
> >> > much relevance to this issue.
> >> >
> >> > >> Now consider the case where some external device is accessing mappings
> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> > >> clear the access flag without a TLB invalidate the access flag in the
> >> > >> CPU page table will not get updated because it's already set in the SMMU
> >> > >> TLB.
> >> > >> As an aside access flag updates happen in one of two ways. If the
> >> > >> SMMU
> >> > >> HW supports hardware translation table updates (HTTU) then hardware will
> >> > >> manage updating access/dirty flags as required. If this is not supported
> >> > >> then SW is relied on to update these flags which in practice means
> >> > >> taking a minor fault. But I don't think that is relevant here - in
> >> > >> either case without a TLB invalidate neither of those things will
> >> > >> happen.
> >> > >> I suppose drivers could implement the clear_flush_young() MMU
> >> > >> notifier
> >> > >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> > >> the opposite problem - that every page ever used by an external device
> >> > >> remains active and unavailable for reclaim because the access flag never
> >> > >> gets cleared? I suppose they could do the flush then which would ensure
> >> > >
> >> > > Yes, I think so too. The reason there is currently no problem, perhaps
> >> > > I think, there are no actual use cases at the moment? At least on our
> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >> >
> >> > We have systems that do.
> >>
> >> Just curious: do those systems run the Linux kernel? If so, are pages
> >> shared with SMMU pinned? If not, then how are IO PFs handled after
> >> pages are reclaimed?
>
> Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
> and fault handling works as Barry notes below - a driver is notified of
> a fault and calls handle_mm_fault() in response.
>
> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> > I/O PF, so finally
> > it runs the same codes to get page back just like CPU's PF.
> >
> > years ago, we recommended a pin solution, but obviously there were lots of
> > push backs:
> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>
> Right. Having to pin pages defeats the whole point of having hardware
> that can handle page faults.

Right. But sometimes I feel that the purpose and the technology built
to serve it contradict each other.
People want userspace DMA to deliver high bandwidth and performance by
dropping user/kernel copying and the complicated syscalls that manage
those buffers, but IO PF somehow defeats all the effort put into
improving performance.

AFAIK, IO PF is much slower than a CPU PF, as it depends on interrupts
and event queues in hardware. Even a minor page fault due to page
migration or demand paging can significantly decrease the speed of
userspace DMA. So in the end, people have to rely on mlock-like and
pre-paging approaches if performance is the key concern.
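
To make the "mlock-like and pre-paging" idea concrete, here is a minimal
userspace sketch. It is only an illustration under stated assumptions:
dma_buf_alloc() and dma_buf_free() are hypothetical helper names, not an
existing API, and the buffer is plain anonymous memory rather than
anything tied to a particular device or driver.

```c
#define _GNU_SOURCE          /* for MAP_ANONYMOUS and MAP_POPULATE on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from any existing API): allocate a buffer meant
 * for userspace DMA, pre-fault it, and try to lock it in RAM.
 * Returns NULL if the mapping fails. *locked is set to 1 if mlock()
 * succeeded and to 0 otherwise.
 */
static void *dma_buf_alloc(size_t len, int *locked)
{
    assert(len > 0 && locked != NULL);

    /* MAP_POPULATE asks the kernel to fault the pages in up front
     * (pre-paging), so first-touch faults do not happen during DMA. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* mlock() keeps the pages resident so reclaim does not evict them.
     * It can fail (for example because of RLIMIT_MEMLOCK); the buffer is
     * still usable in that case, just not locked. */
    *locked = (mlock(buf, len) == 0);
    return buf;
}

/* Release the buffer; munmap() also drops any lock on the range. */
static int dma_buf_free(void *buf, size_t len)
{
    return munmap(buf, len);
}
```

A caller would allocate the buffer once at setup time, check the locked
flag, and only then hand the address to the device. Note that, as far as
I know, mlock() guarantees residency rather than a fixed physical page,
so on its own it may not rule out the migration-related minor faults
mentioned above.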

Thanks
Barry