* Questions about TLB flushing and lru_gen_look_around
@ 2024-09-12 13:03 Phil Elwell
2024-09-13 3:59 ` Yu Zhao
0 siblings, 1 reply; 4+ messages in thread
From: Phil Elwell @ 2024-09-12 13:03 UTC (permalink / raw)
To: Andrew Morton, linux-mm, linux-rpi-kernel
Hi,
I've spent many hours recently trying to diagnose a problem that
manifests as a CPU spin, under load and memory pressure, that can last
for many seconds. The problem can be seen on our downstream kernels
from 6.5 onwards, when built for ARCH=arm, running on a Pi 3B (BCM2837
- quad A53). I've not tested a pure Linux 6.5, but this is not a bug
report.
Pi 3B has limited RAM (1GB), and it was discovered that restricting
this further to 512MB made the spins more frequent, as did adding
other processes. Running an ARM64 kernel in the same configuration
leads to normal OOM behaviour.
I traced the spin to a loop in __copy_to_user_memcpy where
pin_page_for_write fails repeatedly, sometimes for hundreds of
thousands of times. The pin is failing because the user page in
question is marked as being old (L_PTE_YOUNG is unset). When this
happens, the code tries to freshen the page using __put_user, but in
this case it is not triggering the required page fault. Digging
deeper, it can be seen that the PTE in the ARM's shadow hardware PTE
is 0 as expected, but clearly the MMU is not seeing this otherwise it
would be faulting; a TLB flush for that PTE fixes it.
The TLB non-coherency for that PTE can be attributed to a call to
ptep_test_and_clear_young from lru_gen_look_around, which clears the
L_PTE_YOUNG bit in the Linux PTE and zeroes the hardware PTE but
doesn't call flush_tlb_cache. Two possible "fixes" are:
a. Replace ptep_test_and_clear_young with ptep_clear_flush_young,
which includes the TLB flush.
b. After the loop over the page range from "start" to "end", include a
call to flush_tlb_range from "start" to "end" if the "young" count is
non-zero.
My questions are:
1. Which bit of code is meant to take care of TLB coherency where
lru_gen_look_around has made changes?
2. Between the two patches a) and b), which is preferable? b) would
seem better if IPIs are needed to broadcast the TLB flushes, but it
seems that BCM2837 has new enough CPU cores not to require such
broadcasts.
3. walk_pte_range has a similar loop, but it seems it doesn't need to
be patched to fix my spin, possibly because it isn't called. If a
patch to lru_gen_look_around is needed, might one be needed here as
well?
Thanks for your time,
Phil
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Questions about TLB flushing and lru_gen_look_around
2024-09-12 13:03 Questions about TLB flushing and lru_gen_look_around Phil Elwell
@ 2024-09-13 3:59 ` Yu Zhao
2024-09-13 8:50 ` Phil Elwell
0 siblings, 1 reply; 4+ messages in thread
From: Yu Zhao @ 2024-09-13 3:59 UTC (permalink / raw)
To: Phil Elwell
Cc: Andrew Morton, linux-mm, linux-rpi-kernel, Linux ARM, Will Deacon
Hi Phil,
On Thu, Sep 12, 2024 at 7:03 AM Phil Elwell <phil@raspberrypi.com> wrote:
>
> Hi,
>
> I've spent many hours recently trying to diagnose a problem that
> manifests as a CPU spin, under load and memory pressure, that can last
> for many seconds. The problem can be seen on our downstream kernels
> from 6.5 onwards, when built for ARCH=arm, running on a Pi 3B (BCM2837
> - quad A53). I've not tested a pure Linux 6.5, but this is not a bug
> report.
>
> Pi 3B has limited RAM (1GB), and it was discovered that restricting
> this further to 512MB made the spins more frequent, as did adding
> other processes. Running an ARM64 kernel in the same configuration
> leads to normal OOM behaviour.
>
> I traced the spin to a loop in __copy_to_user_memcpy where
> pin_page_for_write fails repeatedly, sometimes for hundreds of
> thousands of times. The pin is failing because the user page in
> question is marked as being old (L_PTE_YOUNG is unset). When this
> happens, the code tries to freshen the page using __put_user, but in
> this case it is not triggering the required page fault. Digging
> deeper, it can be seen that the PTE in the ARM's shadow hardware PTE
> is 0 as expected, but clearly the MMU is not seeing this otherwise it
> would be faulting; a TLB flush for that PTE fixes it.
>
> The TLB non-coherency for that PTE can be attributed to a call to
> ptep_test_and_clear_young from lru_gen_look_around, which clears the
> L_PTE_YOUNG bit in the Linux PTE
Yes, it does that.
> and zeroes the hardware PTE
I don't see how it can happen, or why it's needed. Could you explain?
> but doesn't call flush_tlb_cache.
Correct, and this is because that arch-specific API currently doesn't
require TLB flushes, from the MM's POV. None of the current callers
does, I doubt they were used on arm (32 bit) at all, except MGLRU.
> Two possible "fixes" are:
>
> a. Replace ptep_test_and_clear_young with ptep_clear_flush_young,
> which includes the TLB flush.
> b. After the loop over the page range from "start" to "end", include a
> call to flush_tlb_range from "start" to "end" if the "young" count is
> non-zero.
>
> My questions are:
>
> 1. Which bit of code is meant to take care of TLB coherency where
> lru_gen_look_around has made changes?
None, since the API doesn't explicitly require it (or at least the MM
assumes), as I mentioned above.
> 2. Between the two patches a) and b), which is preferable? b) would
> seem better if IPIs are needed to broadcast the TLB flushes, but it
> seems that BCM2837 has new enough CPU cores not to require such
> broadcasts.
Could this be fixed within arm? If not, we would have to update the
requirement of that arch-specific API. This would affect other archs
that don't require TLB flushes, assuming they exist. And we would need
to fix all callers of ptep_test_and_clear_young() in MM.
> 3. walk_pte_range has a similar loop, but it seems it doesn't need to
> be patched to fix my spin, possibly because it isn't called.
Correct.
> If a
> patch to lru_gen_look_around is needed, might one be needed here as
> well?
No, because that code is disabled, unless hardware can set A-bit,
e.g., arm64 v8.2.
Thanks.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Questions about TLB flushing and lru_gen_look_around
2024-09-13 3:59 ` Yu Zhao
@ 2024-09-13 8:50 ` Phil Elwell
2024-09-26 18:34 ` Yu Zhao
0 siblings, 1 reply; 4+ messages in thread
From: Phil Elwell @ 2024-09-13 8:50 UTC (permalink / raw)
To: Yu Zhao; +Cc: Andrew Morton, linux-mm, linux-rpi-kernel, Linux ARM, Will Deacon
Hi Yu,
Thanks for getting back to me so quickly.
On Fri, 13 Sept 2024 at 04:59, Yu Zhao <yuzhao@google.com> wrote:
>
> Hi Phil,
>
> On Thu, Sep 12, 2024 at 7:03 AM Phil Elwell <phil@raspberrypi.com> wrote:
> >
> > Hi,
> >
> > I've spent many hours recently trying to diagnose a problem that
> > manifests as a CPU spin, under load and memory pressure, that can last
> > for many seconds. The problem can be seen on our downstream kernels
> > from 6.5 onwards, when built for ARCH=arm, running on a Pi 3B (BCM2837
> > - quad A53). I've not tested a pure Linux 6.5, but this is not a bug
> > report.
> >
> > Pi 3B has limited RAM (1GB), and it was discovered that restricting
> > this further to 512MB made the spins more frequent, as did adding
> > other processes. Running an ARM64 kernel in the same configuration
> > leads to normal OOM behaviour.
> >
> > I traced the spin to a loop in __copy_to_user_memcpy where
> > pin_page_for_write fails repeatedly, sometimes for hundreds of
> > thousands of times. The pin is failing because the user page in
> > question is marked as being old (L_PTE_YOUNG is unset). When this
> > happens, the code tries to freshen the page using __put_user, but in
> > this case it is not triggering the required page fault. Digging
> > deeper, it can be seen that the PTE in the ARM's shadow hardware PTE
> > is 0 as expected, but clearly the MMU is not seeing this otherwise it
> > would be faulting; a TLB flush for that PTE fixes it.
> >
> > The TLB non-coherency for that PTE can be attributed to a call to
> > ptep_test_and_clear_young from lru_gen_look_around, which clears the
> > L_PTE_YOUNG bit in the Linux PTE
>
> Yes, it does that.
>
> > and zeroes the hardware PTE
>
> I don't see how it can happen, or why it's needed. Could you explain?
The ARM V7 MMU lacks many features, including the YOUNG flag. To work
around this lack of features, Linux maintains two PTEs per page - one
an idealised Linux PTE and one for use by the MMU. For situations
where the MMU needs additional software support, a zero is written to
the hardware PTE to force a page fault. The maintenance of the shadow
PTEs is handled by the ARM-specific set_pte_ext method.
> > but doesn't call flush_tlb_cache.
>
> Correct, and this is because that arch-specific API currently doesn't
> require TLB flushes, from the MM's POV. None of the current callers
> does, I doubt they were used on arm (32 bit) at all, except MGLRU.
>
> > Two possible "fixes" are:
> >
> > a. Replace ptep_test_and_clear_young with ptep_clear_flush_young,
> > which includes the TLB flush.
> > b. After the loop over the page range from "start" to "end", include a
> > call to flush_tlb_range from "start" to "end" if the "young" count is
> > non-zero.
> >
> > My questions are:
> >
> > 1. Which bit of code is meant to take care of TLB coherency where
> > lru_gen_look_around has made changes?
>
> None, since the API doesn't explicitly require it (or at least the MM
> assumes), as I mentioned above.
I'm new to this area, but I think this statement is wrong, as I'll explain next.
> > 2. Between the two patches a) and b), which is preferable? b) would
> > seem better if IPIs are needed to broadcast the TLB flushes, but it
> > seems that BCM2837 has new enough CPU cores not to require such
> > broadcasts.
>
> Could this be fixed within arm? If not, we would have to update the
> requirement of that arch-specific API. This would affect other archs
> that don't require TLB flushes, assuming they exist. And we would need
> to fix all callers of ptep_test_and_clear_young() in MM.
I think it has already been "fixed". Both ptep_test_and_clear_young
and ptep_clear_flush_young have optional architecture-specific
implementations. The fact that both functions exist - one with flush
and one without - says to me that the non-flush version is intended to
be used when coherency is not required (yet). There is evidence to
support this in the X86 implementation of ptep_clear_flush_young [1],
where the comment says:
/*
* On x86 CPUs, clearing the accessed bit without a TLB flush
* doesn't cause data corruption. [ It could cause incorrect
* page aging and the (mistaken) reclaim of hot pages, but the
* chance of that should be relatively low. ]
*
* So as a performance optimization don't flush the TLB when
* clearing the accessed bit, it will eventually be flushed by
* a context switch or a VM operation anyway. [ In the rare
* event of it not getting flushed for a long time the delay
* shouldn't really matter because there's no real memory
* pressure for swapout to react to. ]
*/
I think functions that care about coherency should be using
ptep_clear_flush_young, trusting the architectures to not perform
unnecessary flushes when coherency is already guaranteed.
> > 3. walk_pte_range has a similar loop, but it seems it doesn't need to
> > be patched to fix my spin, possibly because it isn't called.
>
> Correct.
>
> > If a
> > patch to lru_gen_look_around is needed, might one be needed here as
> > well?
>
> No, because that code is disabled, unless hardware can set A-bit,
> e.g., arm64 v8.2.
Thanks - that makes sense.
Phil
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/mm/pgtable.c?h=v6.11-rc7#n597
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Questions about TLB flushing and lru_gen_look_around
2024-09-13 8:50 ` Phil Elwell
@ 2024-09-26 18:34 ` Yu Zhao
0 siblings, 0 replies; 4+ messages in thread
From: Yu Zhao @ 2024-09-26 18:34 UTC (permalink / raw)
To: Phil Elwell, Russell King
Cc: Andrew Morton, linux-mm, linux-rpi-kernel, Linux ARM, Will Deacon
Hi Phil,
On Fri, Sep 13, 2024 at 2:51 AM Phil Elwell <phil@raspberrypi.com> wrote:
>
> Hi Yu,
>
> Thanks for getting back to me so quickly.
>
> On Fri, 13 Sept 2024 at 04:59, Yu Zhao <yuzhao@google.com> wrote:
> >
> > Hi Phil,
> >
> > On Thu, Sep 12, 2024 at 7:03 AM Phil Elwell <phil@raspberrypi.com> wrote:
> > >
> > > Hi,
> > >
> > > I've spent many hours recently trying to diagnose a problem that
> > > manifests as a CPU spin, under load and memory pressure, that can last
> > > for many seconds. The problem can be seen on our downstream kernels
> > > from 6.5 onwards, when built for ARCH=arm, running on a Pi 3B (BCM2837
> > > - quad A53). I've not tested a pure Linux 6.5, but this is not a bug
> > > report.
> > >
> > > Pi 3B has limited RAM (1GB), and it was discovered that restricting
> > > this further to 512MB made the spins more frequent, as did adding
> > > other processes. Running an ARM64 kernel in the same configuration
> > > leads to normal OOM behaviour.
> > >
> > > I traced the spin to a loop in __copy_to_user_memcpy where
> > > pin_page_for_write fails repeatedly, sometimes for hundreds of
> > > thousands of times. The pin is failing because the user page in
> > > question is marked as being old (L_PTE_YOUNG is unset). When this
> > > happens, the code tries to freshen the page using __put_user, but in
> > > this case it is not triggering the required page fault. Digging
> > > deeper, it can be seen that the PTE in the ARM's shadow hardware PTE
> > > is 0 as expected, but clearly the MMU is not seeing this otherwise it
> > > would be faulting; a TLB flush for that PTE fixes it.
> > >
> > > The TLB non-coherency for that PTE can be attributed to a call to
> > > ptep_test_and_clear_young from lru_gen_look_around, which clears the
> > > L_PTE_YOUNG bit in the Linux PTE
> >
> > Yes, it does that.
> >
> > > and zeroes the hardware PTE
> >
> > I don't see how it can happen, or why it's needed. Could you explain?
>
> The ARM V7 MMU lacks many features, including the YOUNG flag. To work
> around this lack of features, Linux maintains two PTEs per page - one
> an idealised Linux PTE and one for use by the MMU. For situations
> where the MMU needs additional software support, a zero is written to
> the hardware PTE to force a page fault. The maintenance of the shadow
> PTEs is handled by the ARM-specific set_pte_ext method.
Thanks for the explanation!
> > > but doesn't call flush_tlb_cache.
> >
> > Correct, and this is because that arch-specific API currently doesn't
> > require TLB flushes, from the MM's POV. None of the current callers
> > does, I doubt they were used on arm (32 bit) at all, except MGLRU.
> >
> > > Two possible "fixes" are:
> > >
> > > a. Replace ptep_test_and_clear_young with ptep_clear_flush_young,
> > > which includes the TLB flush.
> > > b. After the loop over the page range from "start" to "end", include a
> > > call to flush_tlb_range from "start" to "end" if the "young" count is
> > > non-zero.
> > >
> > > My questions are:
> > >
> > > 1. Which bit of code is meant to take care of TLB coherency where
> > > lru_gen_look_around has made changes?
> >
> > None, since the API doesn't explicitly require it (or at least the MM
> > assumes), as I mentioned above.
>
> I'm new to this area, but I think this statement is wrong, as I'll explain next.
>
> > > 2. Between the two patches a) and b), which is preferable? b) would
> > > seem better if IPIs are needed to broadcast the TLB flushes, but it
> > > seems that BCM2837 has new enough CPU cores not to require such
> > > broadcasts.
> >
> > Could this be fixed within arm? If not, we would have to update the
> > requirement of that arch-specific API. This would affect other archs
> > that don't require TLB flushes, assuming they exist. And we would need
> > to fix all callers of ptep_test_and_clear_young() in MM.
>
> I think it has already been "fixed". Both ptep_test_and_clear_young
> and ptep_clear_flush_young have optional architecture-specific
> implementations. The fact that both functions exist - one with flush
> and one without - says to me that the non-flush version is intended to
> be used when coherency is not required (yet).
I couldn't find any documents on this, but I don't think that's what
it's intended based on the MM code and previous discussions.
I think the difference between the two versions is not about
correctness but performance. IOW, both should work correctly without
requiring TLB flushes from their callers, and both can optionally
flush or not flush TLB based on the preference of specific
architectures. Normally, the version that flushes TLB has better
accuracy but worse performance; and vice versa for the other version.
The above is what I inferred from the following:
* None of the callers of the non-flush version flushes TLB, and they
are MGLRU, DAMON, /sys/kernel/mm/page_idle/bitmap and
/proc/pid/clear_refs.
* The earliest discussions I could find from two decades ago indicate
that it was expected for the caller of ptep_test_and_clear_young() not
to flush TLB [1][2].
In fact, [2] below actually discussed making
ptep_test_and_clear_young() an alias of ptep_clear_flush_young().
[1] https://lore.kernel.org/lkml/200101040628.WAA01899@pizda.ninka.net/
[2] https://lore.kernel.org/linux-mm/20040417211506.C21974@flint.arm.linux.org.uk/
> There is evidence to
> support this in the X86 implementation of ptep_clear_flush_young [1],
> where the comment says:
>
> /*
> * On x86 CPUs, clearing the accessed bit without a TLB flush
> * doesn't cause data corruption. [ It could cause incorrect
> * page aging and the (mistaken) reclaim of hot pages, but the
> * chance of that should be relatively low. ]
> *
> * So as a performance optimization don't flush the TLB when
> * clearing the accessed bit, it will eventually be flushed by
> * a context switch or a VM operation anyway. [ In the rare
> * event of it not getting flushed for a long time the delay
> * shouldn't really matter because there's no real memory
> * pressure for swapout to react to. ]
> */
>
> I think functions that care about coherency should be using
> ptep_clear_flush_young, trusting the architectures to not perform
> unnecessary flushes when coherency is already guaranteed.
My reading of this is that x86 omitted flush in the flush version as a
performance optimization. Similarly, for arm 32-bit, it should include
flush in the non-flush version because otherwise it'd cause
correctness problems, and this is exactly how PPC32 includes TLB flush
in the non-flush version:
From arch/powerpc/include/asm/book3s/32/pgtable.h:
/*
* 2.6 calls this without flushing the TLB entry; this is wrong
* for our hash-based implementation, we fix that up here.
*/
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
unsigned long old;
old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
if (old & _PAGE_HASHPTE)
flush_hash_entry(mm, ptep, addr);
return (old & _PAGE_ACCESSED) != 0;
}
#define ptep_test_and_clear_young(__vma, __addr, __ptep) \
__ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)
> > > 3. walk_pte_range has a similar loop, but it seems it doesn't need to
> > > be patched to fix my spin, possibly because it isn't called.
> >
> > Correct.
> >
> > > If a
> > > patch to lru_gen_look_around is needed, might one be needed here as
> > > well?
> >
> > No, because that code is disabled, unless hardware can set A-bit,
> > e.g., arm64 v8.2.
>
> Thanks - that makes sense.
>
> Phil
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/mm/pgtable.c?h=v6.11-rc7#n597
I'm hoping someone who has a more historical background could jump in
and clarify. Regardless how it was intended, I would recommend
interpreting the difference between the flush and non-flush versions
as performance, rather than correctness, related. Otherwise the core
MM would have to flush TLB for all architectures even though some of
them don't need it (x86, arm64, PPC64 at least).
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-09-26 18:34 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-12 13:03 Questions about TLB flushing and lru_gen_look_around Phil Elwell
2024-09-13 3:59 ` Yu Zhao
2024-09-13 8:50 ` Phil Elwell
2024-09-26 18:34 ` Yu Zhao
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox