linux-mm.kvack.org archive mirror
From: John Hubbard <jhubbard@nvidia.com>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Kefeng Wang <wangkefeng.wang@huawei.com>,
	Feiyang Chen <chenfeiyang@loongson.cn>,
	Alistair Popple <apopple@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	linux-arm-kernel@lists.infradead.org,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, stable@vger.kernel.org
Subject: Re: [PATCH] arm64/mm: don't WARN when alloc/free-ing device private pages
Date: Mon, 10 Apr 2023 00:39:17 -0700	[thread overview]
Message-ID: <8dd0e252-8d8b-a62d-8836-f9f26bc12bc7@nvidia.com> (raw)
In-Reply-To: <CAMj1kXGtFyugzi9MZW=4_oVTy==eAF6283fwvX9fdZhO98ZA3g@mail.gmail.com>

On 4/7/23 03:45, Ard Biesheuvel wrote:
...
> That is what I am talking about - the struct pages are allocated in a
> region that is reserved for something else.
> 
> Maybe an example helps here:

It does! After poking around quite a lot myself, and comparing to x86,
this is starting to become clearer now.

> 
> When running the 39-bit VA kernel build on a AMD Seatte board, we will
> have (assuming sizeof(struct page) == 64)
> 
> memstart_addr := 0x80_0000_0000
> 
> PAGE_OFFSET := 0xffff_ff80_0000_0000
> 
> VMEMMAP_SHIFT := 6
> VMEMMAP_START := 0xffff_fffe_0000_0000
> VMEMMAP_SIZE := 0x1_0000_0000
> 
> pfn_to_page() conversions are based on ordinary array indexing
> involving vmemmap[], where vmemmap is defined as
> 
> #define vmemmap \
>     ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
> 
> So the PFN associated with the first usable DRAM address is
> 0x800_0000, and pfn_to_page(0x800_0000) will return VMEMMAP_START.

OK, I see how that's set up, yes.

> 
> pfn_to_page(x) for any x < 0x800_0000 will produce a kernel VA that
> points into the vmalloc area, and may conflict with the kernel
> mapping, modules mappings, per-CPU mappings, IO mappings, etc etc.
> 
> pfn_to_page(x) for values 0xc00_0000 < x < 0x1000_0000 will produce a
> kernel VA that points outside the region set aside for the vmemmap.
> This region is currently unused, but that will likely change soon.
> 

I tentatively think I'm in this case right now, because there is no
wraparound happening in my particular config: CONFIG_ARM64_VA_BITS == 48,
PAGE_SIZE == 4 KB, and sizeof(struct page) == 64 (details below).

It occurs to me that ZONE_DEVICE, and (within that category) device
private page support, only needs to work on rather large setups. On x86,
it requires 64-bit. And on arm64, from what I'm learning after a day or
so of looking around and comparing, I think we must require at least
48-bit VA support. Otherwise there's just no room for things.

And on smaller systems, everyone disables this fancy automatic handling
(hmm_range_fault()-based page migration) anyway, partly because of the
small VA and PA ranges, but also because of size and performance
constraints.

> pfn_to_page(x) for any x >= 0x1000_0000 will wrap around and produce a
> bogus address in the user range.
> 
> The bottom line is that the VMEMMAP region is dimensioned to cover
> system memory only, i.e., what can be covered by the kernel direct
> map. If you want to allocate struct pages for thing that are not
> system memory, you will need to enlarge the VMEMMAP region, and ensure
> that request_mem_region() produces a region that is covered by it.
> 
> This is going to be tricky with LPA2, because there, the 4k pages
> configuration already uses up half of the vmalloc region to cover the
> linear map, so we have to consider this carefully.

Things are interlocked a little differently on arm64 than on x86, and
the layout is also different. One other interesting thing jumps out at
me: on arm64, the (VMALLOC_END - VMALLOC_START) size is *huge*: 123 TB
in my config. And it seems to cover the kernel mapping. On x86, those
are separate. This still confuses me a bit, and I wonder if I'm reading
it wrong?

Also, below are the values on my 48-bit VA setup. I'm listing these in
order to help jumpstart thinking about how exactly to extend
VMEMMAP_SIZE. GPUs have on the order of GBs of memory these days, so
that's the order of magnitude that's needed.

PAGE_OFFSET:                      0xffff000000000000
PAGE_END:                         0xffff800000000000
high_memory:                      0xffff087f80000000 (8 TB)

VMALLOC_START:                    0xffff800008000000
VMALLOC_END:                      0xfffffbfff0000000 (123 TB)

vmemmap:                          0xfffffbfffe000000
VMEMMAP_START:                    0xfffffc0000000000
VMEMMAP_END:                      0xfffffe0000000000

Typical device private struct page
that is causing warnings:         0xffffffffaee00000

VMEMMAP_SIZE:                     0x0000020000000000 (2 TB)
VMEMMAP_SHIFT:                    6

PHYS_OFFSET:                      0x0000000080000000
memstart_addr (signed 64-bit):    0x0000000080000000

MODULES_VADDR:                    0xffff800000000000
MODULES_END:                      0xffff800008000000 (128 MB)

PAGE_SHIFT:                       12
PAGE_SIZE:                        0x0000000000001000 (4 KB)
PAGE_MASK:                        0xfffffffffffff000

PMD_SIZE:                         0x0000000000200000 (2 MB)
PMD_MASK:                         0xffffffffffe00000

PUD_SIZE:                         0x0000000040000000 (1 GB)
PUD_MASK:                         0xffffffffc0000000

PGDIR_SIZE:                       0x0000008000000000 (512 GB)

PTE_ADDR_MASK:                    0x0000fffffffff000
sizeof(struct page):              64

thanks,
-- 
John Hubbard
NVIDIA



