From: Nicholas Piggin <npiggin@gmail.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>,
"Torvalds, Linus" <torvalds@linux-foundation.org>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"ast@kernel.org" <ast@kernel.org>, "bp@alien8.de" <bp@alien8.de>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
"daniel@iogearbox.net" <daniel@iogearbox.net>,
"dborkman@redhat.com" <dborkman@redhat.com>,
"edumazet@google.com" <edumazet@google.com>,
"hch@infradead.org" <hch@infradead.org>,
"hpa@zytor.com" <hpa@zytor.com>,
"imbrenda@linux.ibm.com" <imbrenda@linux.ibm.com>,
"Kernel-team@fb.com" <Kernel-team@fb.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"mbenes@suse.cz" <mbenes@suse.cz>,
"mcgrof@kernel.org" <mcgrof@kernel.org>,
"pmladek@suse.com" <pmladek@suse.com>,
"rppt@kernel.org" <rppt@kernel.org>,
"song@kernel.org" <song@kernel.org>,
"songliubraving@fb.com" <songliubraving@fb.com>
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
Date: Fri, 22 Apr 2022 14:31:33 +1000
Message-ID: <1650601109.vb3owbt14k.astroid@bobo.none>
In-Reply-To: <1650596505.bxrmjmgjur.astroid@bobo.none>
Excerpts from Nicholas Piggin's message of April 22, 2022 1:08 pm:
> Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm:
>> On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote:
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index e163372d3967..70933f4ed069 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>> if (nr != nr_pages_request)
>>> break;
>>> }
>>> - } else
>>> - /*
>>> - * Compound pages required for remap_vmalloc_page if
>>> - * high-order pages.
>>> - */
>>> - gfp |= __GFP_COMP;
>>> + }
>>>
>>> /* High-order pages or fallback path if "bulk" fails. */
>>>
>>> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>> page = alloc_pages_node(nid, gfp, order);
>>> if (unlikely(!page))
>>> break;
>>> + /*
>>> + * Higher order allocations must be able to be
>>> treated as
>>> + * indepdenent small pages by callers (as they can
>>> with
>>> + * small page allocs).
>>> + */
>>> + if (order)
>>> + split_page(page, order);
>>>
>>> /*
>>> * Careful, we allocate and map page-order pages, but
>>
>> FWIW, I like this direction. I think it needs to free them differently
>> though? Since currently assumes they are high order pages in that path.
>
> Yeah I got a bit excited there, but fairly sure that's the bug.
> I'll do a proper patch.
So here's the patch on top of the revert. Only tested on a lowly
powerpc machine, but it does fix this simple test case that does
what the drm driver is obviously doing:
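	size_t sz = PMD_SIZE;
	void *mem = vmalloc(sz);
	/* Pick a small page in the middle of the (possibly huge) backing */
	struct page *p = vmalloc_to_page(mem + PAGE_SIZE*3);

	/* Use struct page fields the way a driver might */
	p->mapping = NULL;
	p->index = 0;
	INIT_LIST_HEAD(&p->lru);
	vfree(mem);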
Without the below fix the same exact problem reproduces:
BUG: Bad page state in process swapper/0 pfn:00743
page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: corrupted mapping in tail page
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
Call Trace:
[c000000002383940] [c0000000006ebb00] dump_stack_lvl+0x74/0xa8 (unreliable)
[c000000002383980] [c0000000003dabdc] bad_page+0x12c/0x170
[c000000002383a00] [c0000000003dad08] free_tail_pages_check+0xe8/0x190
[c000000002383a30] [c0000000003dc45c] free_pcp_prepare+0x31c/0x4e0
[c000000002383a90] [c0000000003df9f0] free_unref_page+0x40/0x1b0
[c000000002383ad0] [c0000000003d7fc8] __vunmap+0x1d8/0x420
[c000000002383b70] [c00000000102e0d8] proc_vmalloc_init+0xdc/0x108
[c000000002383bf0] [c000000000011f80] do_one_initcall+0x60/0x2c0
[c000000002383cc0] [c000000001001658] kernel_init_freeable+0x32c/0x3cc
[c000000002383da0] [c000000000012564] kernel_init+0x34/0x1a0
[c000000002383e10] [c00000000000ce64] ret_from_kernel_thread+0x5c/0x64
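As a rough illustration of why the check in the trace fires, here is a
userspace toy model (not kernel code). The names toy_page,
TOY_TAIL_MAPPING, toy_prep_compound() and toy_count_bad_tails() are made
up for this sketch, and it assumes the free path only compares each tail
page's mapping field against a poison value; the real check has more
cases than this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's tail-page poison value. */
#define TOY_TAIL_MAPPING ((void *)0x400)

/* Toy model of the few struct page fields that matter here. */
struct toy_page {
	void *mapping;
	bool tail;
};

/* Model a __GFP_COMP allocation: every page after the head is a tail. */
static void toy_prep_compound(struct toy_page *pages, size_t nr)
{
	pages[0].mapping = NULL;
	pages[0].tail = false;
	for (size_t i = 1; i < nr; i++) {
		pages[i].mapping = TOY_TAIL_MAPPING;
		pages[i].tail = true;
	}
}

/* Model the free-time check: count tails whose mapping was overwritten. */
static size_t toy_count_bad_tails(const struct toy_page *pages, size_t nr)
{
	size_t bad = 0;

	for (size_t i = 1; i < nr; i++)
		if (pages[i].tail && pages[i].mapping != TOY_TAIL_MAPPING)
			bad++;
	return bad;
}
```

In this model, a caller that sets pages[3].mapping = NULL (as the test
case does with p->mapping) makes toy_count_bad_tails() report one bad
tail, which corresponds to the "corrupted mapping in tail page" report.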
Any other concerns with the fix?
Thanks,
Nick
--
mm/vmalloc: huge vmalloc backing pages should be split rather than compound

Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
in order to allow the sub-pages to be refcounted by callers such as
"remap_vmalloc_page [sic]" (remap_vmalloc_range).

However, a similar problem exists for other struct page fields that
callers use. For example, fb_deferred_io_fault() takes a vmalloc'ed page
and not only refcounts it but also uses ->lru, ->mapping and ->index.
This is not compatible with compound sub-pages.

The correct approach is to use split high-order pages for the huge
vmalloc backing. These allow callers to treat them in exactly the same
way as individually-allocated order-0 pages.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
mm/vmalloc.c | 36 +++++++++++++++++++++---------------
1 file changed, 21 insertions(+), 15 deletions(-)
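
For illustration only, a userspace toy model (not kernel code) of the
idea behind the change: after a high-order allocation is split, every
sub-page carries its own reference, so the area can be freed one order-0
page at a time and a caller holding an extra reference on one sub-page
only delays that page. The names model_page, model_alloc_high_order(),
model_split_page() and model_free_each() are hypothetical and exist only
in this sketch.

```c
#include <stddef.h>

/* Toy model of a page: a reference count and a "returned to allocator" flag. */
struct model_page {
	int refcount;
	int freed;
};

/* Model a non-compound high-order allocation: only page 0 holds a reference. */
static void model_alloc_high_order(struct model_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		pages[i].refcount = (i == 0) ? 1 : 0;
		pages[i].freed = 0;
	}
}

/* Model split_page(): give every remaining sub-page its own reference. */
static void model_split_page(struct model_page *pages, size_t nr)
{
	for (size_t i = 1; i < nr; i++)
		pages[i].refcount = 1;
}

/*
 * Model the per-page free loop: drop one reference on each page and free
 * it only if that was the last one. Returns how many pages were freed.
 */
static size_t model_free_each(struct model_page *pages, size_t nr)
{
	size_t freed = 0;

	for (size_t i = 0; i < nr; i++) {
		if (--pages[i].refcount == 0) {
			pages[i].freed = 1;
			freed++;
		}
	}
	return freed;
}
```

With 8 pages, splitting and then taking one extra reference on page 3
makes model_free_each() free the other 7 pages and leave page 3 alive
until its holder drops the last reference.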
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0b17498a34f1..09470361dc03 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2653,15 +2653,18 @@ static void __vunmap(const void *addr, int deallocate_pages)
 	vm_remove_mappings(area, deallocate_pages);
 
 	if (deallocate_pages) {
-		unsigned int page_order = vm_area_page_order(area);
-		int i, step = 1U << page_order;
+		int i;
 
-		for (i = 0; i < area->nr_pages; i += step) {
+		for (i = 0; i < area->nr_pages; i++) {
 			struct page *page = area->pages[i];
 
 			BUG_ON(!page);
-			mod_memcg_page_state(page, MEMCG_VMALLOC, -step);
-			__free_pages(page, page_order);
+			mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
+			/*
+			 * High-order allocs for huge vmallocs are split, so
+			 * can be freed as an array of order-0 allocations
+			 */
+			__free_pages(page, 0);
 			cond_resched();
 		}
 		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
@@ -2914,12 +2917,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			if (nr != nr_pages_request)
 				break;
 		}
-	} else
-		/*
-		 * Compound pages required for remap_vmalloc_page if
-		 * high-order pages.
-		 */
-		gfp |= __GFP_COMP;
+	}
 
 	/* High-order pages or fallback path if "bulk" fails. */
 
@@ -2933,6 +2931,15 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;
+		/*
+		 * Higher order allocations must be able to be treated as
+		 * independent small pages by callers (as they can with
+		 * small-page vmallocs). Some drivers do their own refcounting
+		 * on vmalloc_to_page() pages, some use page->mapping,
+		 * page->lru, etc.
+		 */
+		if (order)
+			split_page(page, order);
 
 		/*
 		 * Careful, we allocate and map page-order pages, but
@@ -2992,11 +2999,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 
 	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
 	if (gfp_mask & __GFP_ACCOUNT) {
-		int i, step = 1U << page_order;
+		int i;
 
-		for (i = 0; i < area->nr_pages; i += step)
-			mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC,
-					     step);
+		for (i = 0; i < area->nr_pages; i++)
+			mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);
 	}
 
 	/*
--
2.35.1