* LPA2 on non-LPA2 hardware broken with 16K pages
From: Asahi Lina @ 2024-07-18 9:39 UTC
To: linux-mm, linux-kernel, asahi, linux-arm-kernel
Cc: Catalin Marinas, Will Deacon
Hi,
I ran into this with the Asahi Linux downstream kernel, based on v6.9.9,
but I believe the problem is also still upstream. The issue seems to be
an interaction between folding one page table level at compile time and
another one at runtime.
With this config, we have:
CONFIG_PGTABLE_LEVELS=4
PAGE_SHIFT=14
PMD_SHIFT=25
PUD_SHIFT=36
PGDIR_SHIFT=47
pgtable_l5_enabled() == false (compile time)
pgtable_l4_enabled() == false (runtime, due to no LPA2)
So the p4d is folded at compile time, and the pud is folded at runtime
when LPA2 is not supported.
With this setup, pgd_offset() is broken since the pgd is actually
supposed to become a pud but the shift is wrong, as it is set at compile
time:
#define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address)
{
return (pgd + pgd_index(address));
};
Then we follow the gup logic (abbreviated):
gup_pgd_range:
pgdp = pgd_offset(current->mm, addr);
pgd_t pgd = READ_ONCE(*pgdp);
At this point, pgd is just the 0th entry of the top level page table
(since those extra address bits will always be 0 for valid 47-bit user
addresses).
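To make the shift mismatch concrete (illustrative arithmetic only,
assuming PTRS_PER_PUD is 2048 for this config):

	unsigned long addr = 0x7ffffffff000UL;          /* valid 47-bit user address */
	unsigned long idx  = addr >> 47;                /* pgd_index(): always 0 */
	unsigned long want = (addr >> 36) & (2048 - 1); /* 0x7ff: the index the
	                                                   3-level walk actually
	                                                   needs (PUD_SHIFT) */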
p4d then gets folded via pgtable-nop4d.h:
gup_p4d_range:
p4dp = p4d_offset_lockless(pgdp, pgd, addr);
= p4d_offset(&(pgd), address)
= &pgd
p4d_t p4d = READ_ONCE(*p4dp);
Now we have p4dp = stack address of pgd, and p4d = pgd.
gup_pud_range:
pudp = pud_offset_lockless(p4dp, p4d, addr);
-> if (!pgtable_l4_enabled())
= p4d_to_folded_pud(p4dp, addr);
= (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
pud_t pud = READ_ONCE(*pudp);
Which is bad pointer math because it only works if p4dp points to a real
page table entry inside a page table, not a single u64 stack address.
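Spelled out, with p4dp pointing at an on-stack copy, the computation
does something like this (illustration only; p4d_local is a stand-in
for gup's local variable):

	p4d_t p4d_local;	/* the local copy on the stack */
	/* PTR_ALIGN_DOWN() lands at the base of whatever stack page the
	 * local happens to live in, and pud_index() then offsets into
	 * that page, so the subsequent READ_ONCE() reads unrelated
	 * stack memory: */
	pud_t *pudp = (pud_t *)PTR_ALIGN_DOWN(&p4d_local, PAGE_SIZE)
			+ pud_index(addr);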
This causes random oopses in internal_get_user_pages_fast and related
codepaths.
~~ Lina
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Will Deacon @ 2024-07-18 13:14 UTC
To: Asahi Lina
Cc: linux-mm, linux-kernel, asahi, linux-arm-kernel, Catalin Marinas,
ryan.roberts, mark.rutland, ardb
Hi Lina, [+Ard, Mark and Ryan],
On Thu, Jul 18, 2024 at 06:39:10PM +0900, Asahi Lina wrote:
> I ran into this with the Asahi Linux downstream kernel, based on v6.9.9,
> but I believe the problem is also still upstream. The issue seems to be
> an interaction between folding one page table level at compile time and
> another one at runtime.
Thanks for reporting this!
> With this config, we have:
>
> CONFIG_PGTABLE_LEVELS=4
> PAGE_SHIFT=14
> PMD_SHIFT=25
> PUD_SHIFT=36
> PGDIR_SHIFT=47
> pgtable_l5_enabled() == false (compile time)
> pgtable_l4_enabled() == false (runtime, due to no LPA2)
I think this is 'defconfig' w/ 16k pages, although I wasn't able to
trigger the issue quickly under QEMU with that. Your analysis looks
correct, however.
> So the p4d is folded at compile time, and the pud is folded at runtime
> when LPA2 is not supported.
>
> With this setup, pgd_offset() is broken since the pgd is actually
> supposed to become a pud but the shift is wrong, as it is set at compile
> time:
>
> #define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
>
> static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address)
> {
> return (pgd + pgd_index(address));
> };
>
> Then we follow the gup logic (abbreviated):
>
> gup_pgd_range:
> pgdp = pgd_offset(current->mm, addr);
> pgd_t pgd = READ_ONCE(*pgdp);
>
> At this point, pgd is just the 0th entry of the top level page table
> (since those extra address bits will always be 0 for valid 47-bit user
> addresses).
>
> p4d then gets folded via pgtable-nop4d.h:
>
> gup_p4d_range:
> p4dp = p4d_offset_lockless(pgdp, pgd, addr);
> = p4d_offset(&(pgd), address)
> = &pgd
> p4d_t p4d = READ_ONCE(*p4dp);
>
> Now we have p4dp = stack address of pgd, and p4d = pgd.
>
> gup_pud_range:
> pudp = pud_offset_lockless(p4dp, p4d, addr);
> -> if (!pgtable_l4_enabled())
> = p4d_to_folded_pud(p4dp, addr);
> = (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
> pud_t pud = READ_ONCE(*pudp);
>
> Which is bad pointer math because it only works if p4dp points to a real
> page table entry inside a page table, not a single u64 stack address.
Cheers for the explanation; I agree that 6.10 looks like it's affected
in the same way, even though I couldn't reproduce the crash. I think the
root of the problem is that p4d_offset_lockless() returns a stack
address when the p4d level is folded. I wondered about changing the
dummy pXd_offset_lockless() definitions in linux/pgtable.h to pass the
real pointer through instead of the address of the local, but then I
suppose _most_ pXd_offset() implementations are going to dereference
that and it would defeat the whole point of having _lockless routines
to start with.
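For reference, the generic fallbacks in linux/pgtable.h are along these
lines (quoting from memory, so double-check the details):

#ifndef p4d_offset_lockless
#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
#endif
#ifndef pud_offset_lockless
#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
#endif

i.e. the next-level lookup is always done against the caller's local
copy rather than the live entry.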
What if we provided our own implementation of p4d_offset_lockless()
for the folding case, which could just propagate the page-table pointer?
Diff below.
> This causes random oopses in internal_get_user_pages_fast and related
> codepaths.
Do you have a reliable way to trigger those? I tried doing some GUPpy
things like strace (access_process_vm()) but it all seemed fine.
Thanks,
Will
--->8
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f8efbc128446..3afe624a39e1 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1065,6 +1065,13 @@ static inline bool pgtable_l5_enabled(void) { return false; }
#define p4d_offset_kimg(dir,addr) ((p4d_t *)dir)
+static inline
+p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
+{
+ return p4d_offset(pgdp, addr);
+}
+#define p4d_offset_lockless p4d_offset_lockless
+
#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#define pgd_ERROR(e) \
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Dev Jain @ 2024-07-18 13:21 UTC
To: Will Deacon, Asahi Lina
Cc: linux-mm, linux-kernel, asahi, linux-arm-kernel, Catalin Marinas,
ryan.roberts, mark.rutland, ardb
On 7/18/24 18:44, Will Deacon wrote:
> Hi Lina, [+Ard, Mark and Ryan],
>
> On Thu, Jul 18, 2024 at 06:39:10PM +0900, Asahi Lina wrote:
>> I ran into this with the Asahi Linux downstream kernel, based on v6.9.9,
>> but I believe the problem is also still upstream. The issue seems to be
>> an interaction between folding one page table level at compile time and
>> another one at runtime.
> Thanks for reporting this!
>
>> With this config, we have:
>>
>> CONFIG_PGTABLE_LEVELS=4
>> PAGE_SHIFT=14
>> PMD_SHIFT=25
>> PUD_SHIFT=36
>> PGDIR_SHIFT=47
>> pgtable_l5_enabled() == false (compile time)
>> pgtable_l4_enabled() == false (runtime, due to no LPA2)
> I think this is 'defconfig' w/ 16k pages, although I wasn't able to
> trigger the issue quickly under QEMU with that. Your analysis looks
> correct, however.
Hi Will,
I was also trying to debug this; indeed, this is the 16K defconfig, and
pgtable_l4_enabled() is returning false on non-LPA2 hardware. Is this
the intended behaviour? Don't we require a 4-level page table to
resolve virtual addresses with 16K pages?
[snip]
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Asahi Lina @ 2024-07-18 14:34 UTC
To: Will Deacon
Cc: linux-mm, linux-kernel, asahi, linux-arm-kernel, Catalin Marinas,
ryan.roberts, mark.rutland, ardb
Hi,
On 7/18/24 10:14 PM, Will Deacon wrote:
> Hi Lina, [+Ard, Mark and Ryan],
>
> On Thu, Jul 18, 2024 at 06:39:10PM +0900, Asahi Lina wrote:
>> [snip]
>
> I think this is 'defconfig' w/ 16k pages, although I wasn't able to
> trigger the issue quickly under QEMU with that. Your analysis looks
> correct, however.
Yes, it should be. I first ran into the issue with a .config, derived
from defconfig, that someone sent me while trying to debug a different
problem.
[snip]
> Cheers for the explanation; I agree that 6.10 looks like it's affected
> in the same way, even though I couldn't reproduce the crash. I think the
> root of the problem is that p4d_offset_lockless() returns a stack
> address when the p4d level is folded. I wondered about changing the
> dummy pXd_offset_lockless() definitions in linux/pgtable.h to pass the
> real pointer through instead of the address of the local, but then I
> suppose _most_ pXd_offset() implementations are going to dereference
> that and it would defeat the whole point of having _lockless routines
> to start with.
>
> What if we provided our own implementation of p4d_offset_lockless()
> for the folding case, which could just propagate the page-table pointer?
> Diff below.
That seems to work: it neither reproduces the oopses outright nor
triggers any of the random sanity checks I sprinkled around gup.c while
debugging this to try to make it fail early ^^
>
>> This causes random oopses in internal_get_user_pages_fast and related
>> codepaths.
>
> Do you have a reliable way to trigger those? I tried doing some GUPpy
> things like strace (access_process_vm()) but it all seemed fine.
It's a bit weird: I had kernel builds where it didn't obviously happen
most of the time. During my latest round of tracking this down, though,
it triggered almost 100% reliably with a simple boot of a Fedora
system, usually in an `lvm` process during boot (lvm2-monitor.service).
The lvm crashes look like this, and I also got this trace in
systemd-journald once:
internal_get_user_pages_fast+0x420/0x1728
pin_user_pages_fast+0x9c/0xc4
iov_iter_extract_pages+0x234/0x1044
bio_iov_iter_get_pages+0x248/0xa90
blkdev_direct_IO.part.0+0x3a0/0x143c
blkdev_read_iter+0x1cc/0x388
aio_read.constprop.0+0x1e0/0x324
io_submit_one.constprop.0+0x378/0x1470
__arm64_sys_io_submit+0x198/0x2d0
invoke_syscall.constprop.0+0xd8/0x1e0
do_el0_svc+0xc4/0x1e0
el0_svc+0x48/0xc0
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x190/0x194
Right now with my kernel build, this happens basically every boot.
However, it wasn't always like that, and I'm not sure what other
environmental differences affect the outcome. I guess since it's reading
random stack memory, it depends on what's there...
I disabled lvm2-monitor.service and tried to boot again (to unwedge
things) and this time it oopsed in udisksd like this (I recall seeing
this one before too):
internal_get_user_pages_fast+0x420/0x1728
pin_user_pages_fast+0x9c/0xc4
iov_iter_extract_pages+0x234/0x1044
bio_map_user_iov+0x214/0x724
blk_rq_map_user_iov+0x8b0/0x1080
blk_rq_map_user_io+0x138/0x17c
nvme_map_user_request.isra.0+0x2b4/0x3d8
nvme_submit_user_cmd.isra.0+0x21c/0x300
nvme_user_cmd.isra.0+0x200/0x3fc
nvme_dev_ioctl+0x284/0x480
__arm64_sys_ioctl+0x550/0x1b80
invoke_syscall.constprop.0+0xd8/0x1e0
do_el0_svc+0xc4/0x1e0
el0_svc+0x48/0xc0
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x190/0x194
Prior to this I was testing with glmark2 (this whole thing started with
me trying to debug a downstream GPU driver issue, which I now realize
is a completely unrelated bug, but it led me down this rabbit hole
because it was first reported by someone who happened to be compiling
with LPA2 on). I use the process-termination torture test below, and I
just confirmed that it still oopses even with just a simpledrm
framebuffer and llvmpipe (software rendering), with no GPU/"full" KMS
drivers compiled in, so it should hopefully be reproducible on other
platforms. The repro is reliable within a few seconds. This is running
on KDE Plasma, for whatever that's worth:
while true; do WAYLAND_DISPLAY=wayland-0 timeout -s TERM -k 0 0.5 \
    glmark2-es2-wayland & sleep 0.02; done
The trace looks like:
[ 301.830742] internal_get_user_pages_fast+0x248/0xcd0
[ 301.831343] get_user_pages_fast+0x48/0x60
[ 301.831897] get_futex_key+0xa4/0x3d0
[ 301.832281] futex_wait_setup+0x6c/0x164
[ 301.832777] __futex_wait+0xbc/0x15c
[ 301.833200] futex_wait+0x88/0x110
[ 301.833596] do_futex+0xf8/0x1a0
[ 301.833926] __arm64_sys_futex+0xec/0x188
[ 301.834417] invoke_syscall.constprop.0+0x50/0xe4
[ 301.835029] do_el0_svc+0x40/0xdc
[ 301.835378] el0_svc+0x3c/0x140
[ 301.835796] el0t_64_sync_handler+0x120/0x12c
[ 301.836365] el0t_64_sync+0x190/0x194
So it looks like aio, nvme ioctls, and futexes are things that tend to
trigger it. However, it's definitely not always obvious. The person
reporting the GPU issues with an LPA2 kernel build clearly wasn't having
their system crash on every boot, and I myself ended up with kernels
where the only repro was the glmark2 invocation and things weren't just
oopsing on boot.
Going down the aio path, I tried fio, and it turns out it fairly
reliably reproduces the futex oops when configured to fail, just like
glmark2 (I couldn't trivially get it to oops on real aio usage...).
while true; do fio --filename=nonexist --size=1M --rw=read --bs=16k \
    --ioengine=libaio --numjobs=10 --name=a --readonly; done
(This will complain with "fio: refusing extend of file due to read-only"
and generally not actually run a proper test, but it still repros the
futex problem)
It's not just any futex usage though. I tried writing a trivial futex
test app, and also running the tools/testing/selftests/futex selftests,
and those don't seem to trigger it.
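For reference, the aio path in the traces above boils down to something
like this (an untested sketch, so no guarantee it hits the bad path;
the device path and sizes are arbitrary, and it needs -laio to build):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

	if (fd < 0 || io_setup(1, &ctx) < 0)
		return 1;
	if (posix_memalign(&buf, 16384, 16384))
		return 1;

	/* O_DIRECT read into a user buffer: this pins the buffer via
	 * pin_user_pages_fast(), as in the traces above. */
	io_prep_pread(&cb, fd, buf, 16384, 0);
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;
	io_getevents(ctx, 1, 1, &ev, NULL);
	return 0;
}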
[snip]
~~ Lina
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Ard Biesheuvel @ 2024-07-19 18:02 UTC
To: Will Deacon
Cc: Asahi Lina, linux-mm, linux-kernel, asahi, linux-arm-kernel,
Catalin Marinas, ryan.roberts, mark.rutland
On Thu, 18 Jul 2024 at 06:14, Will Deacon <will@kernel.org> wrote:
> [snip]
>
Thanks for the cc, and thanks to Lina for the excellent diagnosis -
this is really helpful.
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f8efbc128446..3afe624a39e1 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1065,6 +1065,13 @@ static inline bool pgtable_l5_enabled(void) { return false; }
>
> #define p4d_offset_kimg(dir,addr) ((p4d_t *)dir)
>
> +static inline
> +p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
This is in the wrong place, I think - we already define this for the
5-level case (around line 1760).
We'll need to introduce another version for the 4-level case, so
perhaps, to reduce the risk of confusion, we might define it as
static inline
p4d_t *p4d_offset_lockless_folded(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
{
...
}
#ifdef __PAGETABLE_P4D_FOLDED
#define p4d_offset_lockless p4d_offset_lockless_folded
#endif
> +{
We might add
if (pgtable_l4_enabled())
pgdp = &pgd;
here to preserve the existing 'lockless' behavior when PUDs are not folded.
> + return p4d_offset(pgdp, addr);
> +}
> +#define p4d_offset_lockless p4d_offset_lockless
> +
> #endif /* CONFIG_PGTABLE_LEVELS > 4 */
>
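Putting the two pieces together, the folded helper might end up looking
something like this (sketch, untested):

static inline
p4d_t *p4d_offset_lockless_folded(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
{
	/* Keep using the caller's local copy when the PUD level is real */
	if (pgtable_l4_enabled())
		pgdp = &pgd;
	return p4d_offset(pgdp, addr);
}
#ifdef __PAGETABLE_P4D_FOLDED
#define p4d_offset_lockless p4d_offset_lockless_folded
#endif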
I suggest we also add something like the below so we can catch these
issues more easily:
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -874,9 +874,26 @@ static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
{
+ /*
+ * The transformation below does not work correctly for descriptors
+ * copied to the stack.
+ */
+ VM_WARN_ON((u64)p4dp >= VMALLOC_START && !__is_kernel((u64)p4dp));
+
return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
}
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Will Deacon @ 2024-07-23 14:52 UTC
To: Ard Biesheuvel
Cc: Asahi Lina, linux-mm, linux-kernel, asahi, linux-arm-kernel,
Catalin Marinas, ryan.roberts, mark.rutland
Hey Ard,
On Fri, Jul 19, 2024 at 11:02:29AM -0700, Ard Biesheuvel wrote:
> Thanks for the cc, and thanks to Lina for the excellent diagnosis -
> this is really helpful.
>
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index f8efbc128446..3afe624a39e1 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -1065,6 +1065,13 @@ static inline bool pgtable_l5_enabled(void) { return false; }
> >
> > #define p4d_offset_kimg(dir,addr) ((p4d_t *)dir)
> >
> > +static inline
> > +p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
>
> This is in the wrong place, I think - we already define this for the
> 5-level case (around line 1760).
Hmm, I'm a bit confused. In my tree, we have one definition at line 1012,
which is for the 5-level case (i.e. guarded by
'#if CONFIG_PGTABLE_LEVELS > 4'). I'm adding a new one at line 1065,
which puts it in the '#else' block and means we use an override instead
of the problematic generic version when we're folding.
> We'll need to introduce another version for the 4-level case, so
> perhaps, to reduce the risk of confusion, we might define it as
>
> static inline
> p4d_t *p4d_offset_lockless_folded(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
> {
> ...
> }
> #ifdef __PAGETABLE_P4D_FOLDED
> #define p4d_offset_lockless p4d_offset_lockless_folded
> #endif
Renaming will definitely make this easier on the eye, so I'll do that.
I don't think I need the 'ifdef' though.
> > +{
>
> We might add
>
> if (pgtable_l4_enabled())
> pgdp = &pgd;
>
> here to preserve the existing 'lockless' behavior when PUDs are not
> folded.
The code still needs to be 'lockless' for the 5-level case, so I don't
think this is necessary. Yes, we'll load the same entry multiple times,
but it should be fine because they're in the context of a different
(albeit folded) level.
> > + return p4d_offset(pgdp, addr);
> > +}
> > +#define p4d_offset_lockless p4d_offset_lockless
> > +
> > #endif /* CONFIG_PGTABLE_LEVELS > 4 */
> >
>
> I suggest we also add something like the below so we can catch these
> issues more easily
>
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -874,9 +874,26 @@ static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
>
> static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
> {
> + /*
> + * The transformation below does not work correctly for descriptors
> + * copied to the stack.
> + */
> + VM_WARN_ON((u64)p4dp >= VMALLOC_START && !__is_kernel((u64)p4dp));
Hmm, this is a bit coarse. Does it work properly with the fixmap?
Will
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Ard Biesheuvel @ 2024-07-23 15:02 UTC
To: Will Deacon
Cc: Asahi Lina, linux-mm, linux-kernel, asahi, linux-arm-kernel,
Catalin Marinas, ryan.roberts, mark.rutland
On Tue, 23 Jul 2024 at 16:52, Will Deacon <will@kernel.org> wrote:
>
> Hey Ard,
>
> On Fri, Jul 19, 2024 at 11:02:29AM -0700, Ard Biesheuvel wrote:
> > Thanks for the cc, and thanks to Lina for the excellent diagnosis -
> > this is really helpful.
> >
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index f8efbc128446..3afe624a39e1 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -1065,6 +1065,13 @@ static inline bool pgtable_l5_enabled(void) { return false; }
> > >
> > > #define p4d_offset_kimg(dir,addr) ((p4d_t *)dir)
> > >
> > > +static inline
> > > +p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
> >
> > This is in the wrong place, I think - we already define this for the
> > 5-level case (around line 1760).
>
> Hmm, I'm a bit confused. In my tree, we have one definition at line 1012,
> which is for the 5-level case (i.e. guarded by
> '#if CONFIG_PGTABLE_LEVELS > 4'). I'm adding a new one at line 1065,
> which puts it in the '#else' block and means we use an override instead
> of the problematic generic version when we're folding.
>
Indeed. I failed to spot from the context (which is there in the diff)
that this is in the else branch.
> > We'll need to introduce another version for the 4-level case, so
> > perhaps, to reduce the risk of confusion, we might define it as
> >
> > static inline
> > p4d_t *p4d_offset_lockless_folded(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
> > {
> > ...
> > }
> > #ifdef __PAGETABLE_P4D_FOLDED
> > #define p4d_offset_lockless p4d_offset_lockless_folded
> > #endif
>
> Renaming will definitely make this easier on the eye, so I'll do that.
> I don't think I need the 'ifdef' though.
>
Indeed.
> > > +{
> >
> > We might add
> >
> > if (pgtable_l4_enabled())
> > pgdp = &pgd;
> >
> > here to preserve the existing 'lockless' behavior when PUDs are not
> > folded.
>
> The code still needs to be 'lockless' for the 5-level case, so I don't
> think this is necessary.
The 5-level case is never handled here.
There is the 3-level case, where the runtime PUD folding needs the
actual address in order to recalculate the descriptor address using
the correct shift. In this case, we don't dereference the pointer
anyway so the 'lockless' thing doesn't matter (afaict)
In the 4-level case, we want to preserve the original behavior, where
pgd is not reloaded from pgdp. Setting pgdp to &pgd achieves that.
> Yes, we'll load the same entry multiple times,
> but it should be fine because they're in the context of a different
> (albeit folded) level.
>
I don't understand what you are saying here. Why is that fine?
> > > + return p4d_offset(pgdp, addr);
> > > +}
> > > +#define p4d_offset_lockless p4d_offset_lockless
> > > +
> > > #endif /* CONFIG_PGTABLE_LEVELS > 4 */
> > >
> >
> > I suggest we also add something like the below so we can catch these
> > issues more easily
> >
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -874,9 +874,26 @@ static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
> >
> > static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
> > {
> > + /*
> > + * The transformation below does not work correctly for descriptors
> > + * copied to the stack.
> > + */
> > + VM_WARN_ON((u64)p4dp >= VMALLOC_START && !__is_kernel((u64)p4dp));
>
> Hmm, this is a bit coarse. Does it work properly with the fixmap?
>
Good point. I did some boot tests with this but I'm not sure if it is
100% safe with the fixmap.
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Will Deacon @ 2024-07-23 16:05 UTC
To: Ard Biesheuvel
Cc: Asahi Lina, linux-mm, linux-kernel, asahi, linux-arm-kernel,
Catalin Marinas, ryan.roberts, mark.rutland
On Tue, Jul 23, 2024 at 05:02:15PM +0200, Ard Biesheuvel wrote:
> On Tue, 23 Jul 2024 at 16:52, Will Deacon <will@kernel.org> wrote:
> > On Fri, Jul 19, 2024 at 11:02:29AM -0700, Ard Biesheuvel wrote:
> > > [snip]
> > > > +static inline
> > > > +p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
> > >
> > > This is in the wrong place, I think - we already define this for the
> > > 5-level case (around line 1760).
> >
> > Hmm, I'm a bit confused. In my tree, we have one definition at line 1012,
> > which is for the 5-level case (i.e. guarded by
> > '#if CONFIG_PGTABLE_LEVELS > 4'). I'm adding a new one at line 1065,
> > which puts it in the '#else' block and means we use an override instead
> > of the problematic generic version when we're folding.
> >
>
> Indeed. I failed to spot from the context (which is there in the diff)
> that this is in the else branch.
No worries.
> > > > +{
> > >
> > > We might add
> > >
> > > if (pgtable_l4_enabled())
> > > pgdp = &pgd;
> > >
> > > here to preserve the existing 'lockless' behavior when PUDs are not
> > > folded.
> >
> > The code still needs to be 'lockless' for the 5-level case, so I don't
> > think this is necessary.
>
> The 5-level case is never handled here.
Urgh, yes, sorry. I've done a fantastically bad job of explaining myself.
> There is the 3-level case, where the runtime PUD folding needs the
> actual address in order to recalculate the descriptor address using
> the correct shift. In this case, we don't dereference the pointer
> anyway so the 'lockless' thing doesn't matter (afaict)
>
> In the 4-level case, we want to preserve the original behavior, where
> pgd is not reloaded from pgdp. Setting pgdp to &pgd achieves that.
Right. What I'm trying to get at is the case where we have folding. For
example, with my patch applied, if we have 3 levels then the lockless
GUP walk looks like:
pgd_t pgd = READ_ONCE(*pgdp);
p4dp = p4d_offset_lockless(pgdp, pgd, addr);
=> Returns pgdp
p4d_t p4d = READ_ONCE(*p4dp);
pudp = pud_offset_lockless(p4dp, p4d, addr);
=> Returns &p4d, which is again the pgdp
pud_t pud = READ_ONCE(*pudp);
So here we're reloading the same pointer multiple times and my argument
is that if we need to add logic to avoid this for the
pgtable_l4_enabled() case, then we have bigger problems.
> > Yes, we'll load the same entry multiple times,
> > but it should be fine because they're in the context of a different
> > (albeit folded) level.
> >
>
> I don't understand what you are saying here. Why is that fine?
I think it's fine because (a) the CPU guarantees same-address
read-after-read ordering, and (b) we only evaluate the most recently
read value. It would be a problem if we mixed data from different reads
but, because the use is confined to that 'level', we don't end up doing
that.
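To spell that out for the folded walk (sketch):

	pgd_t pgd = READ_ONCE(*pgdp);	/* read #1 of the top-level entry */
	p4dp = pgdp;			/* p4d folded: same pointer comes back */
	p4d_t p4d = READ_ONCE(*p4dp);	/* read #2 of the same entry */

Only p4d, the most recent value, is consumed from there on, so the two
reads are never mixed.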
Dunno, am I making any sense?
Will
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Ard Biesheuvel @ 2024-07-23 16:28 UTC
To: Will Deacon
Cc: Asahi Lina, linux-mm, linux-kernel, asahi, linux-arm-kernel,
Catalin Marinas, ryan.roberts, mark.rutland
On Tue, 23 Jul 2024 at 18:05, Will Deacon <will@kernel.org> wrote:
>
> On Tue, Jul 23, 2024 at 05:02:15PM +0200, Ard Biesheuvel wrote:
> > On Tue, 23 Jul 2024 at 16:52, Will Deacon <will@kernel.org> wrote:
> > > On Fri, Jul 19, 2024 at 11:02:29AM -0700, Ard Biesheuvel wrote:
...
> > > >
> > > > We might add
> > > >
> > > > if (pgtable_l4_enabled())
> > > > pgdp = &pgd;
> > > >
> > > > here to preserve the existing 'lockless' behavior when PUDs are not
> > > > folded.
> > >
> > > The code still needs to be 'lockless' for the 5-level case, so I don't
> > > think this is necessary.
> >
> > The 5-level case is never handled here.
>
> Urgh, yes, sorry. I've done a fantastically bad job of explaining myself.
>
> > There is the 3-level case, where the runtime PUD folding needs the
> > actual address in order to recalculate the descriptor address using
> > the correct shift. In this case, we don't dereference the pointer
> > anyway so the 'lockless' thing doesn't matter (afaict)
> >
> > In the 4-level case, we want to preserve the original behavior, where
> > pgd is not reloaded from pgdp. Setting pgdp to &pgd achieves that.
>
> Right. What I'm trying to get at is the case where we have folding. For
> example, with my patch applied, if we have 3 levels then the lockless
> GUP walk looks like:
>
>
> pgd_t pgd = READ_ONCE(*pgdp);
>
> p4dp = p4d_offset_lockless(pgdp, pgd, addr);
> => Returns pgdp
> p4d_t p4d = READ_ONCE(*p4dp);
>
> pudp = pud_offset_lockless(p4dp, p4d, addr);
> => Returns &p4d, which is again the pgdp
> pud_t pud = READ_ONCE(*pudp);
>
>
> So here we're reloading the same pointer multiple times and my argument
> is that if we need to add logic to avoid this for the
> pgtable_l4_enabled() case, then we have bigger problems.
>
The 3-level case is not relevant here. My suggestion only affects the
4-level case:
if (pgtable_l4_enabled())
pgdp = &pgd;
which prevents us from evaluating *pgdp twice; avoiding that seems to
me to be the reason these routines exist in the first place. Given that
the 3-level runtime-folded case is the one we are trying to fix here,
I'd argue that keeping the 4-level case the same as before is
important.
> > > Yes, we'll load the same entry multiple times,
> > > but it should be fine because they're in the context of a different
> > > (albeit folded) level.
> > >
> >
> > I don't understand what you are saying here. Why is that fine?
>
> I think it's fine because (a) the CPU guarantees same address
> read-after-read ordering and (b) We only evaluate the most recently read
> value. It would be a problem if we mixed data from different reads but,
> because the use is confined to that 'level', we don't end up doing that.
>
> Dunno, am I making any sense?
>
So what is the point of p?d_offset_lockless()? Is it a performance
optimization that we don't care about on arm64? Or does this reasoning
still only apply to the folded case?
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Will Deacon @ 2024-07-24 11:33 UTC
To: Ard Biesheuvel
Cc: Asahi Lina, linux-mm, linux-kernel, asahi, linux-arm-kernel,
Catalin Marinas, ryan.roberts, mark.rutland
On Tue, Jul 23, 2024 at 06:28:16PM +0200, Ard Biesheuvel wrote:
> On Tue, 23 Jul 2024 at 18:05, Will Deacon <will@kernel.org> wrote:
> > [snip]
> > Right. What I'm trying to get at is the case where we have folding. For
> > example, with my patch applied, if we have 3 levels then the lockless
> > GUP walk looks like:
> >
> >
> > pgd_t pgd = READ_ONCE(*pgdp);
> >
> > p4dp = p4d_offset_lockless(pgdp, pgd, addr);
> > => Returns pgdp
> > p4d_t p4d = READ_ONCE(*p4dp);
> >
> > pudp = pud_offset_lockless(p4dp, p4d, addr);
> > => Returns &p4d, which is again the pgdp
> > pud_t pud = READ_ONCE(*pudp);
> >
> >
> > So here we're reloading the same pointer multiple times and my argument
> > is that if we need to add logic to avoid this for the
> > pgtable_l4_enabled() case, then we have bigger problems.
> >
>
> The 3-level case is not relevant here. My suggestion only affects the
> 4-level case:
That's exactly what I'm uneasy about :/
> if (pgtable_l4_enabled())
> pgdp = &pgd;
>
> which prevents us from evaluating *pgdp twice, which seems to me to be
> the reason these routines exist in the first place. Given that the
> 3-level runtime-folded case is the one we are trying to fix here, I'd
> argue that keeping the 4-level case the same as before is important.
I think consistency between 4-level and 3-level is far more important.
Adding code to avoid reloading the entry for one specific case (the
pgtable_l4_enabled() case), whilst requiring other cases (e.g. the
3-level compile-time folded case) to reload from the pointer is
inconsistent. Either they both need it or neither of them needs it, no?
> > > > Yes, we'll load the same entry multiple times,
> > > > but it should be fine because they're in the context of a different
> > > > (albeit folded) level.
> > > >
> > >
> > > I don't understand what you are saying here. Why is that fine?
> >
> > I think it's fine because (a) the CPU guarantees same address
> > read-after-read ordering and (b) We only evaluate the most recently read
> > value. It would be a problem if we mixed data from different reads but,
> > because the use is confined to that 'level', we don't end up doing that.
> >
> > Dunno, am I making any sense?
> >
>
> So what is the point of p?d_offset_lockless()? Is it a performance
> optimization that we don't care about on arm64? Or does this reasoning
> still only apply to the folded case?
Well, naturally it's all undocumented. S390 added the macros, but it
looks like that was because they hit a similar problem to the one
reported by Lina (see d3f7b1bb2040 ("mm/gup: fix gup_fast with dynamic
page table folding")) and, really, that change is just moving the logic
out of the fast GUP code and into some new macros.
If you think about trying to do fast GUP using the regular pXd_offset()
macros with 5 levels, the problem is a little more obvious. For example,
it would look something like:
// gup_fast_pgd_range()
pgd_t pgd = READ_ONCE(*pgdp);
if (pgd_none(pgd))
return;
...
// gup_fast_p4d_range()
p4dp = p4d_offset(pgdp, addr);
=> READ_ONCE(*pgdp);
A concurrent writer could e.g. clear *pgdp between the two reads and
we'd end up with junk in p4dp because of a ToCToU-type bug.
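Whereas with the _lockless variant, the second lookup is derived from
the local copy instead (sketch):

	pgd_t pgd = READ_ONCE(*pgdp);			/* one read of the live entry */
	p4dp = p4d_offset_lockless(pgdp, pgd, addr);	/* p4d_offset(&pgd, addr) */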
But I don't think that applies to the case we're discussing, because the
reload of the entry happens in the following READ_ONCE() rather than
in the _offset macro:
p4d_t p4d = READ_ONCE(*p4dp);
and, as this is in the context of a new level, everything is then
rechecked so that a concurrent modification won't cause us to crash.
Will
* Re: LPA2 on non-LPA2 hardware broken with 16K pages
From: Ard Biesheuvel @ 2024-07-24 12:10 UTC
To: Will Deacon
Cc: Asahi Lina, linux-mm, linux-kernel, asahi, linux-arm-kernel,
Catalin Marinas, ryan.roberts, mark.rutland
On Wed, 24 Jul 2024 at 13:33, Will Deacon <will@kernel.org> wrote:
>
> On Tue, Jul 23, 2024 at 06:28:16PM +0200, Ard Biesheuvel wrote:
> > [snip]
> > The 3-level case is not relevant here. My suggestion only affects the
> > 4-level case:
>
> That's exactly what I'm uneasy about :/
>
Right.
> > if (pgtable_l4_enabled())
> > pgdp = &pgd;
> >
> > which prevents us from evaluating *pgdp twice, which seems to me to be
> > the reason these routines exist in the first place. Given that the
> > 3-level runtime-folded case is the one we are trying to fix here, I'd
> > argue that keeping the 4-level case the same as before is important.
>
> I think consistency between 4-level and 3-level is far more important.
> Adding code to avoid reloading the entry for one specific case (the
> pgtable_l4_enabled() case), whilst requiring other cases (e.g. the
> 3-level compile-time folded case) to reload from the pointer is
> inconsistent. Either they both need it or neither of them needs it, no?
>
The thing to keep in mind here is that the path via
p4d_to_folded_pud() does not dereference the same pointer either. It
just converts a p4d_t* to a pud_t* by deriving the page table's address
from the p4d_t* and applying PUD_SHIFT rather than the P4D_SHIFT that
was applied one level up. So a) it does not dereference anything, and
b) it refers to a different entry, so the value loaded one level up was
for the wrong entry anyway.
OTOH, the 4-level path is not only used by 16k+lpa2 (or 16k/48bits); it
is also used by 4k/48bits, where pgtable_l4_enabled() is a compile-time
constant TRUE, and that case will no longer be 'lockless' as before.
However, if I am understanding you correctly, you are saying that:
a) p4d_offset_lockless() for <5 levels cannot race in the way that
these macros are intended to address, as the folding implies that the
next-level load is in reality a reload of the same entry, and so we
will be using the latest value; and
b) reloading the same value is not an issue, because this is not a
performance optimization but a concurrency/correctness thing.
But I agree with your analysis, and the additional bits I suggested
are not needed. I suppose it would still be good to clarify this in a
comment, though, as doing the reload in the implementation of a helper
that exists to omit it looks rather dodgy.
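Something along these lines, perhaps (rough wording only, wherever the
folded helper ends up):

/*
 * With runtime folding, the pointer returned here may alias the live
 * entry we were passed rather than the caller's local copy. Re-reading
 * through it is still safe for lockless walkers: the folded level only
 * ever consumes the most recently loaded value, so data from the two
 * reads is never mixed.
 */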