From: Yu Zhao <yuzhao@google.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>,
Barry Song <21cnbao@gmail.com>,
catalin.marinas@arm.com, will@kernel.org,
akpm@linux-foundation.org, v-songbaohua@oppo.com,
linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit
Date: Wed, 25 Oct 2023 00:17:14 -0600
Message-ID: <CAOUHufZ84aDmiW3Efh87q1oMJr-zk5cyaebucCFzevFHx77ngQ@mail.gmail.com>
In-Reply-To: <87bkcn1j5k.fsf@nvdebian.thelocal>
On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>
> > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
> >>>>
> >>>>
> >>>> Barry Song <21cnbao@gmail.com> writes:
> >>>>
> >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>
> >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >>>>>> <baolin.wang@linux.alibaba.com> wrote:
> >> [...]
> >>
> >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >>>>>> page, B might
> >>>>>> be a correct choice.
> >>>>>
> >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >>>>> the head of lru list or new generation, it needs a long time to slide back
> >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >>>>> should have been large enough for tlb to be flushed. If the page is really
> >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >>>>> access flag in the long time in which the page is re-moved to the tail
> >>>>> as the page can be accessed multiple times if it is really hot.
> >>>>
> >>>> This might not be true if you have external hardware sharing the page
> >>>> tables with software through either HMM or hardware supported ATS
> >>>> though.
> >>>>
> >>>> In those cases I think it's much more likely hardware can still be
> >>>> accessing the page even after a context switch on the CPU say. So those
> >>>> pages will tend to get reclaimed even though hardware is still actively
> >>>> using them which would be quite expensive and I guess could lead to
> >>>> thrashing as each page is reclaimed and then immediately faulted back
> >>>> in.
> >
> > That's possible, but the chance should be relatively low. At least on
> > x86, I have not heard of this issue.
>
> Personally I've never seen any x86 system that shares page tables with
> external devices, other than with HMM. More on that below.
>
> >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >>> pagetable the ability to set access flag in page table entries by
> >>> itself? if yes,
> >>> I don't see how our approach will hurt as folio_referenced can notify the
> >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >>> either as the external hardware can't set access flags, that means we
> >>> have ignored its reference and only knows cpu's access even in the current
> >>> mainline code. so we are not getting worse.
> >>>
> >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >>> also broadcast to external hardware, then external hardware sees the
> >>> cleared access flag, thus, it can set access flag in page table when the
> >>> hardware access it? If this is the case, I feel what you said is true.
> >> Perhaps it would help if I gave a concrete example. Take for example
> >> the ARM SMMU. It has its own TLB. Invalidating this TLB is done in one of
> >> two ways depending on the specific HW implementation.
> >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
> >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> forward TLB invalidations via MMU notifiers.
> >
> > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
>
> Lucky you :-)
>
> ARM64 SMMU architecture specification supports the possibility of both,
> as does the driver. Not that I think whether or not BTM is supported has
> much relevance to this issue.
>
> >> Now consider the case where some external device is accessing mappings
> >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> clear the access flag without a TLB invalidate the access flag in the
> >> CPU page table will not get updated because it's already set in the SMMU
> >> TLB.
> >> As an aside access flag updates happen in one of two ways. If the SMMU
> >> HW supports hardware translation table updates (HTTU) then hardware will
> >> manage updating access/dirty flags as required. If this is not supported
> >> then SW is relied on to update these flags which in practice means
> >> taking a minor fault. But I don't think that is relevant here - in
> >> either case without a TLB invalidate neither of those things will
> >> happen.
> >> I suppose drivers could implement the clear_flush_young() MMU notifier
> >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> the opposite problem - that every page ever used by an external device
> >> remains active and unavailable for reclaim because the access flag never
> >> gets cleared? I suppose they could do the flush then which would ensure
> >
> > Yes, I think so too. The reason there is currently no problem, perhaps
> > I think, there are no actual use cases at the moment? At least on our
> > Alibaba's fleet, SMMU and MMU do not share page tables now.
>
> We have systems that do.
Just curious: do those systems run the Linux kernel? If so, are the pages
shared with the SMMU pinned? If not, how are I/O page faults handled
after those pages are reclaimed?