From: Yu Zhao <yuzhao@google.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: Barry Song <21cnbao@gmail.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
catalin.marinas@arm.com, will@kernel.org,
akpm@linux-foundation.org, v-songbaohua@oppo.com,
linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit
Date: Wed, 25 Oct 2023 12:22:37 -0600
Message-ID: <CAOUHufbpyR_wFARsCZ-wVqN0w3ieW2RVfVaJkbikY_O8WGwF1g@mail.gmail.com>
In-Reply-To: <877cnb0zyk.fsf@nvdebian.thelocal>

On Wed, Oct 25, 2023 at 4:16 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@nvidia.com> wrote:
> >> >
> >> >
> >> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> >> >
> >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@nvidia.com> wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>> Barry Song <21cnbao@gmail.com> writes:
> >> > >>>>
> >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >> > >>>>>>
> >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >> > >>>>>> <baolin.wang@linux.alibaba.com> wrote:
> >> > >> [...]
> >> > >>
> >> > >>>>>> (A) constant flush cost vs. (B) very occasionally reclaiming a hot
> >> > >>>>>> page: B might be the correct choice.
> >> > >>>>>
> >> > >>>>> Plus, I doubt B is really going to happen. After a page is promoted to
> >> > >>>>> the head of the LRU list or to a new generation, it needs a long time to
> >> > >>>>> slide back to the inactive list tail or to the candidate generation of
> >> > >>>>> MGLRU. That time should be long enough for the TLB to be flushed. If the
> >> > >>>>> page is really hot, the hardware will get a second, third, fourth, etc.
> >> > >>>>> opportunity to set the access flag during the long time it takes the page
> >> > >>>>> to move back to the tail, since a really hot page is accessed many times.
> >> > >>>>
> >> > >>>> This might not be true if you have external hardware sharing the page
> >> > >>>> tables with software through either HMM or hardware supported ATS
> >> > >>>> though.
> >> > >>>>
> >> > >>>> In those cases I think it's much more likely hardware can still be
> >> > >>>> accessing the page even after a context switch on the CPU say. So those
> >> > >>>> pages will tend to get reclaimed even though hardware is still actively
> >> > >>>> using them which would be quite expensive and I guess could lead to
> >> > >>>> thrashing as each page is reclaimed and then immediately faulted back
> >> > >>>> in.
> >> > >
> >> > > That's possible, but the chance should be relatively low. At least on
> >> > > x86, I have not heard of this issue.
> >> >
> >> > Personally I've never seen any x86 system that shares page tables with
> >> > external devices, other than with HMM. More on that below.
> >> >
> >> > >>> I am not quite sure I got your point. Does the external hardware sharing
> >> > >>> the CPU's page table have the ability to set the access flag in page table
> >> > >>> entries by itself? If yes, I don't see how our approach will hurt, as
> >> > >>> folio_referenced can notify the hardware driver and the driver can flush
> >> > >>> its own TLB. If no, I don't see it either: the external hardware can't set
> >> > >>> access flags, which means we already ignore its references and only know
> >> > >>> about the CPU's accesses even in the current mainline code, so we are not
> >> > >>> getting worse.
> >> > >>>
> >> > >>> Or can the external hardware also see the CPU's TLB? Or is the CPU's TLB
> >> > >>> flush also broadcast to the external hardware, so that the external
> >> > >>> hardware sees the cleared access flag and can set the access flag in the
> >> > >>> page table when it accesses the page? If this is the case, I feel what you
> >> > >>> said is true.
> >> > >>
> >> > >> Perhaps it would help if I gave a concrete example. Take for example
> >> > >> the ARM SMMU. It has its own TLB. Invalidating this TLB is done in one of
> >> > >> two ways depending on the specific HW implementation.
> >> > >>
> >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
> >> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> > >> forward TLB invalidations via MMU notifiers.
> >> > >
> >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >> >
> >> > Lucky you :-)
> >> >
> >> > The ARM64 SMMU architecture specification supports the possibility of
> >> > both, as does the driver. Not that I think whether or not BTM is
> >> > supported has much relevance to this issue.
> >> >
> >> > >> Now consider the case where some external device is accessing mappings
> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> > >> clear the access flag without a TLB invalidate the access flag in the
> >> > >> CPU page table will not get updated because it's already set in the SMMU
> >> > >> TLB.
> >> > >>
> >> > >> As an aside, access flag updates happen in one of two ways. If the SMMU
> >> > >> HW supports hardware translation table updates (HTTU) then hardware will
> >> > >> manage updating access/dirty flags as required. If this is not supported
> >> > >> then SW is relied on to update these flags which in practice means
> >> > >> taking a minor fault. But I don't think that is relevant here - in
> >> > >> either case without a TLB invalidate neither of those things will
> >> > >> happen.
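
To make the scenario above concrete, here is a rough user-space sketch of
it. This is a toy model, not kernel or SMMU driver code: every name in it
(toy_pte, toy_tlb, toy_device_access, toy_clear_young,
toy_count_referenced) is hypothetical, and it assumes a single page with a
single device-side TLB entry that caches the access flag once it is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types: one page table entry and one device TLB entry. */
struct toy_pte { bool young; };              /* access flag in the page table */
struct toy_tlb { bool valid; bool young; };  /* cached translation + flag */

/*
 * Model a device access. On a TLB miss the table is walked, the access
 * flag is set (via HTTU or a minor fault, the model does not care which)
 * and the entry is cached. On a hit the page table is not touched.
 */
static void toy_device_access(struct toy_pte *pte, struct toy_tlb *tlb)
{
	if (tlb->valid && tlb->young)
		return;		/* hit: flag already cached as set */
	pte->young = true;
	tlb->valid = true;
	tlb->young = true;
}

/*
 * Model the reclaim side: test and clear the access flag, optionally
 * invalidating the device TLB entry as well. Returns the old flag.
 */
static bool toy_clear_young(struct toy_pte *pte, struct toy_tlb *tlb,
			    bool flush)
{
	bool was_young = pte->young;

	pte->young = false;
	if (flush)
		tlb->valid = false;
	return was_young;
}

/*
 * The device touches the page before every scan. Count how many of the
 * scans actually observe the page as referenced.
 */
static int toy_count_referenced(int scans, bool flush)
{
	struct toy_pte pte = { false };
	struct toy_tlb tlb = { false, false };
	int seen = 0;

	assert(scans >= 0);
	for (int i = 0; i < scans; i++) {
		toy_device_access(&pte, &tlb);
		if (toy_clear_young(&pte, &tlb, flush))
			seen++;
	}
	return seen;
}
```

In this model, with the invalidate every scan sees the page as referenced.
Without it only the first scan does, even though the device keeps using
the page, which is the behaviour described in the quoted text. Real
hardware will also evict TLB entries on its own, so treat this as the
worst case of the model rather than a measurement.
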
> >> > >>
> >> > >> I suppose drivers could implement the clear_flush_young() MMU notifier
> >> > >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> > >> the opposite problem - that every page ever used by an external device
> >> > >> remains active and unavailable for reclaim because the access flag never
> >> > >> gets cleared? I suppose they could do the flush then which would ensure
> >> > >
> >> > > Yes, I think so too. Perhaps the reason there is currently no problem
> >> > > is that there are no actual use cases at the moment? At least in
> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >> >
> >> > We have systems that do.
> >>
> >> Just curious: do those systems run the Linux kernel? If so, are pages
> >> shared with SMMU pinned? If not, then how are IO PFs handled after
> >> pages are reclaimed?
>
> Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
> and fault handling works as Barry notes below - a driver is notified of
> a fault and calls handle_mm_fault() in response.
>
> > It will call handle_mm_fault(vma, prm->addr, fault_flags, NULL) in the
> > I/O PF path, so it ends up running the same code to get the page back, just
> > like a CPU PF.
> >
> > Years ago, we recommended a pinning solution, but obviously there was a lot
> > of pushback:
> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>
> Right. Having to pin pages defeats the whole point of having hardware
> that can handle page faults.

Thanks. How would a DMA transaction be retried after the kernel
resolves an IO PF? I.e., does the h/w (PCIe spec, etc.) support auto
retries or is the s/w responsible for doing so? At least when I worked
on the PCI subsystem, I didn't know of any device that was capable of
doing auto retries. (Pasha and I will have a talk on IOMMU at the
upcoming LPC, so this might be an interesting intersection between IOMMU
and MM to discuss.)