From: Joao Martins <joao.m.martins@oracle.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Muchun Song <songmuchun@bytedance.com>,
Dan Williams <dan.j.williams@intel.com>,
Tarun Sahu <tsahu@linux.ibm.com>,
linux-mm@kvack.org, akpm@linux-foundation.org
Subject: Re: [PATCH] mm/vmemmap/devdax: Fix kernel crash when probing devdax devices
Date: Tue, 11 Apr 2023 11:33:45 +0100
Message-ID: <0a69f3b8-88c9-db31-ddb6-5e372f339459@oracle.com>
In-Reply-To: <871qkqdepe.fsf@linux.ibm.com>
On 11/04/2023 09:07, Aneesh Kumar K.V wrote:
> Joao Martins <joao.m.martins@oracle.com> writes:
>
>> On 07/04/2023 13:23, Aneesh Kumar K.V wrote:
>>> commit c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>>> added support for using optimized vmemmap for devdax devices.
>>
>> Not really. It added *compound pages* for devdax ZONE_DEVICE, which is what the
>> commit describes. It is meant to represent the non-base-page metadata. Should
>> fsdax support the equivalent, we could get GUP performance on the dax path and
>> remove today's special ZONE_DEVICE case of the ref-per-base-page GUP-fast path.
>>
>> The two were tied together when this was brought in, but in theory[*] we didn't
>> require anything from the arches to do this (contrary to vmemmap_populate()).
>> The commit you should refer to instead is this:
>
> ok the compound page dependency came via
> commit 6fd3620b3428 ("mm/page_alloc: reuse tail struct pages for compound devmaps")
>
It's an optimization to remove unnecessary work, not exactly a real dependency.
But I understand what you mean.
> I have now reworked it such that we do
>
> static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
> 					      unsigned long nr_pages)
> {
> #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
> 	/*
> 	 * This depends on how the vmemmap is populated by
> 	 * vmemmap_populate_compound_pages().
> 	 */
> 	return is_power_of_2(sizeof(struct page)) &&
> 		!altmap ? 2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
> #else
> 	return nr_pages;
> #endif
> }
>
I would instead just simplify it to:
static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
	return IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP) &&
		is_power_of_2(sizeof(struct page)) &&
		!altmap ? 2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
}
As the ternary false condition already handles the non-optimized-vmemmap case.
With your vmemmap_can_optimize() helper perhaps:
static inline unsigned long compound_nr_pages(struct dev_pagemap *pgmap,
					      struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
	return vmemmap_can_optimize(altmap, pgmap) ?
		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
}
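For reference, the savings arithmetic behind the `2 * (PAGE_SIZE / sizeof(struct page))` return value can be modelled in userspace. This is only a sketch, not kernel code: `MODEL_PAGE_SIZE`, `struct model_page`, and `model_compound_nr_pages()` are illustrative stand-ins for the real definitions, assuming 4K pages and a 64-byte struct page.

```c
#include <stdbool.h>

/* Stand-ins for the kernel's PAGE_SIZE and struct page. */
#define MODEL_PAGE_SIZE 4096UL
struct model_page { char pad[64]; };

static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Model of compound_nr_pages(): with the vmemmap optimization, only two
 * vmemmap pages per compound page are unique (the one holding the head
 * struct page, plus one tail page whose mapping is reused), so only
 * 2 * (PAGE_SIZE / sizeof(struct page)) struct pages need initializing,
 * regardless of how large nr_pages is.
 */
static unsigned long model_compound_nr_pages(bool can_optimize,
					     unsigned long nr_pages)
{
	return (can_optimize && is_power_of_2(sizeof(struct model_page))) ?
		2 * (MODEL_PAGE_SIZE / sizeof(struct model_page)) : nr_pages;
}
```

With these assumptions, a 2M compound page (512 base pages) initializes only 128 struct pages when the optimization applies.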
>>
>> 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
>>
>> [*] or so I thought before next paragraph
>>
>>> But how vmemmap
>>> mappings are created is architecture-specific. For example, powerpc with hash
>>> translation doesn't have vmemmap mappings in the init_mm page table; instead,
>>> they are bolted entries in the hardware page table.
>>>
>>
>> Very interesting, I didn't know this (no init_mm) was an option. And my
>> apologies for regressing s390 :/ So, s390 does not support
>> vmemmap_populate_basepages() then, and always needs to go through the arch
>> vmemmap_populate() to create the vmemmap entries.
>>
>>> vmemmap_populate_compound_pages() used by the vmemmap optimization code is not
>>> aware of these architecture-specific mappings. Hence allow architectures to opt
>>> in to this feature. I selected architectures supporting the
>>> HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting this feature. I added
>>> vmemmap_can_optimize() even though the pgmap_vmemmap_nr(pgmap) > 1 check should
>>> filter architectures not supporting this. IMHO that brings clarity to the code
>>> where we are populating vmemmap.
>>>
>> I am not sure the last two sentences are right. I would remove them; see the
>> comments above and at the end below on why.
>
> That is true because I had vmemmap_shift = 0 if the arch didn't support
> vmemmap optimization. Based on your reply above, I have now moved that
> out and kept vmemmap_can_optimize().
>
/me nods
> modified mm/page_alloc.c
> @@ -6846,8 +6846,16 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
> unsigned long nr_pages)
> {
> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
> + /*
> + * this depends on how vmemmap is populated by
> + * vmemmap_populate_compound_pages()
> + */
> return is_power_of_2(sizeof(struct page)) &&
> !altmap ? 2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
> +#else
> + return nr_pages;
> +#endif
> }
>
> static void __ref memmap_init_compound(struct page *head,
> modified mm/sparse-vmemmap.c
> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
> !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
> return NULL;
>
> - if (is_power_of_2(sizeof(struct page)) &&
> - pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
> + if (vmemmap_can_optimize(altmap, pgmap))
> r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> else
> r = vmemmap_populate(start, end, nid, altmap);
>
> ....
>
>>
>>> Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>>
>> It should be:
>>
>> Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound
>> devmaps")
>
> updated.
>
>>
>> But if what you included in your patch was what led to the crash, then your
>> problem is not the vmemmap optimization, as the latter came after. If it's
>> the commit hash I included above, it would reduce the affected trees to v5.19+.
>>
>>> Cc: Joao Martins <joao.m.martins@oracle.com>
>>> Cc: Muchun Song <songmuchun@bytedance.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Reported-by: Tarun Sahu <tsahu@linux.ibm.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>> arch/arm64/Kconfig | 1 +
>>> arch/loongarch/Kconfig | 1 +
>>> arch/s390/Kconfig | 1 +
>>> arch/x86/Kconfig | 1 +
>>> drivers/dax/device.c | 3 ++-
>>> include/linux/mm.h | 16 ++++++++++++++++
>>> mm/Kconfig | 3 +++
>>> mm/sparse-vmemmap.c | 3 +--
>>> 8 files changed, 26 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 27b2592698b0..d3f5945f0aff 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -103,6 +103,7 @@ config ARM64
>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>> select ARCH_WANT_LD_ORPHAN_WARN
>>> select ARCH_WANTS_NO_INSTR
>>> + select ARCH_WANT_OPTIMIZE_VMEMMAP
>>> select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>>> select ARCH_HAS_UBSAN_SANITIZE_ALL
>>> select ARM_AMBA
>>> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
>>> index 9cc8b84f7eb0..ce5802066d0e 100644
>>> --- a/arch/loongarch/Kconfig
>>> +++ b/arch/loongarch/Kconfig
>>> @@ -56,6 +56,7 @@ config LOONGARCH
>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>> select ARCH_WANT_LD_ORPHAN_WARN
>>> select ARCH_WANTS_NO_INSTR
>>> + select ARCH_WANT_OPTIMIZE_VMEMMAP
>>> select BUILDTIME_TABLE_SORT
>>> select COMMON_CLK
>>> select CPU_PM
>>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>>> index 933771b0b07a..abffccd937b2 100644
>>> --- a/arch/s390/Kconfig
>>> +++ b/arch/s390/Kconfig
>>> @@ -127,6 +127,7 @@ config S390
>>> select ARCH_WANT_DEFAULT_BPF_JIT
>>> select ARCH_WANT_IPC_PARSE_VERSION
>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>> + select ARCH_WANT_OPTIMIZE_VMEMMAP
>>> select BUILDTIME_TABLE_SORT
>>> select CLONE_BACKWARDS2
>>> select DMA_OPS if PCI
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index a825bf031f49..e8d66d834b4f 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -127,6 +127,7 @@ config X86
>>> select ARCH_WANT_HUGE_PMD_SHARE
>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if X86_64
>>> select ARCH_WANT_LD_ORPHAN_WARN
>>> + select ARCH_WANT_OPTIMIZE_VMEMMAP if X86_64
>>> select ARCH_WANTS_THP_SWAP if X86_64
>>> select ARCH_HAS_PARANOID_L1D_FLUSH
>>> select BUILDTIME_TABLE_SORT
>>
>> I would remove these and instead use
>> ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, and have your other snippet be a
>> second patch that drops the 'HUGETLB_' part.
>>
>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>> index 5494d745ced5..05be8e79d64b 100644
>>> --- a/drivers/dax/device.c
>>> +++ b/drivers/dax/device.c
>>> @@ -448,7 +448,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>> }
>>>
>>> pgmap->type = MEMORY_DEVICE_GENERIC;
>>> - if (dev_dax->align > PAGE_SIZE)
>>> + if (dev_dax->align > PAGE_SIZE &&
>>> + IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>>> pgmap->vmemmap_shift =
>>> order_base_2(dev_dax->align >> PAGE_SHIFT);
>>
>> vmemmap_shift relates to the compound page size we will use in
>> memmap_init_zone_device(). Otherwise we are back to base struct pages for the
>> PMD/PUD case. Instead, move the:
>>
>> IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP)
>>
>> check further below, to __populate_section_memmap() ...
>>
>>> addr = devm_memremap_pages(dev, pgmap);
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 716d30d93616..fb71e21df23d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3442,6 +3442,22 @@ void vmemmap_populate_print_last(void);
>>> void vmemmap_free(unsigned long start, unsigned long end,
>>> struct vmem_altmap *altmap);
>>> #endif
>>> +
>>> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
>>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>>> + struct dev_pagemap *pgmap)
>>> +{
>>> + return is_power_of_2(sizeof(struct page)) &&
>>> + pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
>>> +}
>>> +#else
>>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>>> + struct dev_pagemap *pgmap)
>>> +{
>>> + return false;
>>> +}
>>> +#endif
>>> +
>>> void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>> unsigned long nr_pages);
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index ff7b209dec05..99f87c1be1e8 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -461,6 +461,9 @@ config SPARSEMEM_VMEMMAP
>>> pfn_to_page and page_to_pfn operations. This is the most
>>> efficient option when sufficient kernel resources are available.
>>>
>>> +config ARCH_WANT_OPTIMIZE_VMEMMAP
>>> + bool
>>> +
>>> config HAVE_MEMBLOCK_PHYS_MAP
>>> bool
>>>
>> As you mentioned in your other followup email, it probably makes sense that you
>> just use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if we switch this to per-arch,
>> and then have that Kconfig name renamed into the above. It simplifies the fix in
>> case it gets picked up by stable trees (the dax vmemmap deduplication came after
>> hugetlb's). The head-page reuse case for hugetlb is anyway gated by a hugetlb
>> static key.
>
> will do
>
>>
>>
>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>>> index c5398a5960d0..10d73a0dfcec 100644
>>> --- a/mm/sparse-vmemmap.c
>>> +++ b/mm/sparse-vmemmap.c
>>> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>> !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>> return NULL;
>>>
>>> - if (is_power_of_2(sizeof(struct page)) &&
>>> - pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
>>> + if (vmemmap_can_optimize(altmap, pgmap))
>>> r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>>> else
>>> r = vmemmap_populate(start, end, nid, altmap);
>>
>> Same comment(s) as further above. I think it should be:
>>
>> 	if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>> 		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>> 	else
>> 		r = vmemmap_populate(start, end, nid, altmap);
>
> Even if arch want vmemmap optimization the ability to use compound
> vmemmap_populate is further restricted by
>
> if (is_power_of_2(sizeof(struct page)) &&
> pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
>
> Hence we still want if (vmemmap_can_optimize(..)) right?
>
Correct. It would become:

	if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP) &&
	    is_power_of_2(sizeof(struct page)) &&
	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);

I suppose a helper like your vmemmap_can_optimize() could hold this, as the
condition is getting unreadable. It is probably worth using it in
compound_nr_pages() too, but you would need to add a pgmap argument to that
function and change its caller.
vmemmap_can_optimize() is not really the greatest name, but I can't think of
anything better that aligns with the rest of the naming of this feature.
>>
>> In principle all the arches below can support
>> vmemmap_populate_compound_pages(), but now it would require them to enable it
>> explicitly, which I guess makes sense considering the bug you are fixing. But
>> the arch requirement is not really whether we support the optimized vmemmap,
>> but whether we support the init_mm at all. OTOH, tying hugetlb and dax together
>> additionally eliminates confusion around an optimization shared by both, even
>> if the dax one could encompass more. I am not sure what the arch requirement
>> for the hugetlb vmemmap optimization is these days.
>>
>> arch/arm64/mm/mmu.c:      return vmemmap_populate_basepages(start, end, node, altmap);
>> arch/ia64/mm/discontig.c: return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/loongarch/mm/init.c: return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/riscv/mm/init.c:     return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/x86/mm/init_64.c:    err = vmemmap_populate_basepages(start, end, node, NULL);
>> arch/x86/mm/init_64.c:    err = vmemmap_populate_basepages(start, end, node, NULL);