From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <ae3115778a3fa10ec77152e18beed54fafe0f6e7.1698151516.git.baolin.wang@linux.alibaba.com>
 <CAGsJ_4zgdAmyU-075jd8KfXn=CdAVC8Rs481sCOd5N2a68yPUg@mail.gmail.com>
 <CAGsJ_4z1u4-_JXUM9GG2cqc4Nwrx1v69uHsbff5jDQZHQgWP+w@mail.gmail.com>
 <87y1frqz2u.fsf@nvdebian.thelocal> <CAGsJ_4wFiz-obaoXqfU9p-YqgFwExyXpGjpDKMOUt7mnenD-ew@mail.gmail.com>
 <87ttqfqw8f.fsf@nvdebian.thelocal> <d2d8062c-8248-b710-ccd6-e5359d15c385@linux.alibaba.com>
 <87bkcn1j5k.fsf@nvdebian.thelocal> <CAOUHufZ84aDmiW3Efh87q1oMJr-zk5cyaebucCFzevFHx77ngQ@mail.gmail.com>
 <CAGsJ_4zueK32KMHM0=EYjB3spYvh-yJU=buorG+6+Stnu=cypw@mail.gmail.com> <877cnb0zyk.fsf@nvdebian.thelocal>
In-Reply-To: <877cnb0zyk.fsf@nvdebian.thelocal>
From: Yu Zhao <yuzhao@google.com>
Date: Wed, 25 Oct 2023 12:22:37 -0600
Message-ID: <CAOUHufbpyR_wFARsCZ-wVqN0w3ieW2RVfVaJkbikY_O8WGwF1g@mail.gmail.com>
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the
 access bit
To: Alistair Popple <apopple@nvidia.com>
Cc: Barry Song <21cnbao@gmail.com>, Baolin Wang <baolin.wang@linux.alibaba.com>, 
	catalin.marinas@arm.com, will@kernel.org, akpm@linux-foundation.org, 
	v-songbaohua@oppo.com, linux-mm@kvack.org, 
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Wed, Oct 25, 2023 at 4:16 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
> >> >
> >> >
> >> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> >> >
> >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>> Barry Song <21cnbao@gmail.com> writes:
> >> > >>>>
> >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >> > >>>>>>
> >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >> > >>>>>> <baolin.wang@linux.alibaba.com> wrote:
> >> > >> [...]
> >> > >>
> >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >> > >>>>>> page,  B might
> >> > >>>>>> be a correct choice.
> >> > >>>>>
> >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >> > >>>>> the head of lru list or new generation, it needs a long time to slide back
> >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >> > >>>>> should have been large enough for tlb to be flushed. If the page is really
> >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >> > >>>>> access flag in the long time in which the page is re-moved to the tail
> >> > >>>>> as the page can be accessed multiple times if it is really hot.
> >> > >>>>
> >> > >>>> This might not be true if you have external hardware sharing the page
> >> > >>>> tables with software through either HMM or hardware supported ATS
> >> > >>>> though.
> >> > >>>>
> >> > >>>> In those cases I think it's much more likely hardware can still be
> >> > >>>> accessing the page even after a context switch on the CPU say. So those
> >> > >>>> pages will tend to get reclaimed even though hardware is still actively
> >> > >>>> using them which would be quite expensive and I guess could lead to
> >> > >>>> thrashing as each page is reclaimed and then immediately faulted back
> >> > >>>> in.
> >> > >
> >> > > That's possible, but the chance should be relatively low. At least on
> >> > > x86, I have not heard of this issue.
> >> >
> >> > Personally I've never seen any x86 system that shares page tables with
> >> > external devices, other than with HMM. More on that below.
> >> >
> >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >> > >>> pagetable the ability to set access flag in page table entries by
> >> > >>> itself? if yes,
> >> > >>> I don't see how our approach will hurt as folio_referenced can notify the
> >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >> > >>> either as the external hardware can't set access flags, that means we
> >> > >>> have ignored its reference and only knows cpu's access even in the current
> >> > >>> mainline code. so we are not getting worse.
> >> > >>>
> >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >> > >>> also broadcast to external hardware, then external hardware sees the
> >> > >>> cleared access flag, thus, it can set access flag in page table when the
> >> > >>> hardware access it?  If this is the case, I feel what you said is true.
> >> > >> Perhaps it would help if I gave a concrete example. Take for example
> >> > >> the
> >> > >> ARM SMMU. It has its own TLB. Invalidating this TLB is done in one of
> >> > >> two ways depending on the specific HW implementation.
> >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
> >> > >> TLB
> >> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> > >> forward TLB invalidations via MMU notifiers.
> >> > >
> >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >> >
> >> > Lucky you :-)
> >> >
> >> > ARM64 SMMU architecture specification supports the possibility of both,
> >> > as does the driver. Not that I think whether or not BTM is supported has
> >> > much relevance to this issue.
> >> >
> >> > >> Now consider the case where some external device is accessing mappings
> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> > >> clear the access flag without a TLB invalidate the access flag in the
> >> > >> CPU page table will not get updated because it's already set in the SMMU
> >> > >> TLB.
> >> > >> As an aside access flag updates happen in one of two ways. If the
> >> > >> SMMU
> >> > >> HW supports hardware translation table updates (HTTU) then hardware will
> >> > >> manage updating access/dirty flags as required. If this is not supported
> >> > >> then SW is relied on to update these flags which in practice means
> >> > >> taking a minor fault. But I don't think that is relevant here - in
> >> > >> either case without a TLB invalidate neither of those things will
> >> > >> happen.
> >> > >> I suppose drivers could implement the clear_flush_young() MMU
> >> > >> notifier
> >> > >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> > >> the opposite problem - that every page ever used by an external device
> >> > >> remains active and unavailable for reclaim because the access flag never
> >> > >> gets cleared? I suppose they could do the flush then which would ensure
> >> > >
> >> > > Yes, I think so too. The reason there is currently no problem, perhaps
> >> > > I think, there are no actual use cases at the moment? At least on our
> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >> >
> >> > We have systems that do.
> >>
> >> Just curious: do those systems run the Linux kernel? If so, are pages
> >> shared with SMMU pinned? If not, then how are IO PFs handled after
> >> pages are reclaimed?
>
> Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
> and fault handling works as Barry notes below - a driver is notified of
> a fault and calls handle_mm_fault() in response.
>
> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> > I/O PF, so finally
> > it runs the same codes to get page back just like CPU's PF.
> >
> > years ago, we recommended a pin solution, but obviously there were lots of
> > push backs:
> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>
> Right. Having to pin pages defeats the whole point of having hardware
> that can handle page faults.

Thanks. How would a DMA transaction be retried after the kernel
resolves an IO PF? I.e., does the h/w (PCIe spec, etc.) support auto
retries or is the s/w responsible for doing so? At least when I worked
on the PCI subsystem, I didn't know of any device that was capable of
doing auto retries. (Pasha and I will have a talk on IOMMU at the
upcoming LPC, so this might be an interesting intersection between IOMMU
and MM to discuss.)