From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk1-f199.google.com (mail-qk1-f199.google.com [209.85.222.199]) by kanga.kvack.org (Postfix) with ESMTP id 885068E0018 for ; Mon, 10 Dec 2018 11:54:13 -0500 (EST) Received: by mail-qk1-f199.google.com with SMTP id k66so10831192qkf.1 for ; Mon, 10 Dec 2018 08:54:13 -0800 (PST) Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. [209.85.220.65]) by mx.google.com with SMTPS id w144sor6340419qka.90.2018.12.10.08.54.12 for (Google Transport Security); Mon, 10 Dec 2018 08:54:12 -0800 (PST) Subject: Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support References: <20181210142105.6750-1-rafael.tinoco@linaro.org> <4da655ec-a1ac-c524-1eb4-5cbd18b265ef@arm.com> <20181210150551.GE30658@n2100.armlinux.org.uk> From: Rafael David Tinoco Message-ID: <27fb9417-fc55-f01e-c575-8c18862a282e@linaro.org> Date: Mon, 10 Dec 2018 14:53:59 -0200 MIME-Version: 1.0 In-Reply-To: <20181210150551.GE30658@n2100.armlinux.org.uk> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Russell King - ARM Linux , Robin Murphy Cc: Rafael David Tinoco , Rich Felker , linux-ia64@vger.kernel.org, Sergey Senozhatsky , linux-sh@vger.kernel.org, Benjamin Herrenschmidt , Heiko Carstens , Ram Pai , linux-mips@vger.kernel.org, linux-mm@kvack.org, Khalid Aziz , Paul Mackerras , "H . Peter Anvin" , sparclinux@vger.kernel.org, linux-s390@vger.kernel.org, Yoshinori Sato , Michael Ellerman , x86@kernel.org, Ingo Molnar , Catalin Marinas , James Hogan , Anthony Yznaga , Nitin Gupta , Fenghua Yu , Joerg Roedel , Juergen Gross , Vasily Gorbik , Will Deacon , Nicholas Piggin , Borislav Petkov , Andy Lutomirski , Thomas Gleixner , linux-arm-kernel@lists.infradead.org, Christophe Leroy , Tony Luck , Jiri Kosina , linux-kernel@vger.kernel.org, Ralf Baechle , Minchan Kim , Paul Burton , "Aneesh Kumar K . V" , Martin Schwidefsky , linuxppc-dev@lists.ozlabs.org, "David S . Miller" , "Kirill A . Shutemov" On 12/10/18 1:05 PM, Russell King - ARM Linux wrote: > On Mon, Dec 10, 2018 at 02:35:55PM +0000, Robin Murphy wrote: >> On 10/12/2018 14:21, Rafael David Tinoco wrote: >>> On 32-bit systems, zsmalloc uses HIGHMEM and, when PAE is enabled, the >>> physical frame number might be so big that zsmalloc obj encoding (to >>> location) will break, causing: >>> >>> BUG: KASAN: null-ptr-deref in zs_map_object+0xa4/0x2bc >>> Read of size 4 at addr 00000000 by task mkfs.ext4/623 >>> CPU: 2 PID: 623 Comm: mkfs.ext4 Not tainted 4.19.0-rc8-00017-g8239bc6d3307-dirty #15 >>> Hardware name: Generic DT based system >>> [] (unwind_backtrace) from [] (show_stack+0x20/0x24) >>> [] (show_stack) from [] (dump_stack+0xbc/0xe8) >>> [] (dump_stack) from [] (kasan_report+0x248/0x390) >>> [] (kasan_report) from [] (__asan_load4+0x78/0xb4) >>> [] (__asan_load4) from [] (zs_map_object+0xa4/0x2bc) >>> [] (zs_map_object) from [] (zram_bvec_rw.constprop.2+0x324/0x8e4 [zram]) >>> [] (zram_bvec_rw.constprop.2 [zram]) from [] (zram_make_request+0x234/0x46c [zram]) >>> [] (zram_make_request [zram]) from [] (generic_make_request+0x304/0x63c) >>> [] (generic_make_request) from [] (submit_bio+0x4c/0x1c8) >>> [] (submit_bio) from [] (submit_bh_wbc.constprop.15+0x238/0x26c) >>> [] (submit_bh_wbc.constprop.15) from [] (__block_write_full_page+0x524/0x76c) >>> [] (__block_write_full_page) from [] (block_write_full_page+0x1bc/0x1d4) >>> [] (block_write_full_page) from [] (blkdev_writepage+0x24/0x28) >>> [] (blkdev_writepage) from [] (__writepage+0x44/0x78) >>> [] (__writepage) from [] (write_cache_pages+0x3b8/0x800) >>> [] (write_cache_pages) from [] (generic_writepages+0x74/0xa0) >>> [] (generic_writepages) from [] (blkdev_writepages+0x18/0x1c) >>> [] (blkdev_writepages) from [] (do_writepages+0x68/0x134) >>> [] (do_writepages) from [] (__filemap_fdatawrite_range+0xb0/0xf4) >>> [] (__filemap_fdatawrite_range) from [] (file_write_and_wait_range+0x64/0xd0) >>> [] (file_write_and_wait_range) from [] (blkdev_fsync+0x54/0x84) >>> [] (blkdev_fsync) from [] (vfs_fsync_range+0x70/0xd4) >>> [] (vfs_fsync_range) from [] (do_fsync+0x4c/0x80) >>> [] (do_fsync) from [] (sys_fsync+0x1c/0x20) >>> [] (sys_fsync) from [] (ret_fast_syscall+0x0/0x2c) >>> >>> when trying to decode (the pfn) and map the object. >>> >>> That happens because one architecture might not re-define >>> MAX_PHYSMEM_BITS, like in this ARM 32-bit w/ LPAE enabled example. For >>> 32-bit systems, if not re-defined, MAX_POSSIBLE_PHYSMEM_BITS will >>> default to BITS_PER_LONG (32) in most cases, and, with PAE enabled, >>> _PFN_BITS might be wrong: which may cause obj variable to overflow if >>> frame number is in HIGHMEM and referencing a page above the 4GB >>> watermark. >>> >>> commit 6e00ec00b1a7 ("staging: zsmalloc: calculate MAX_PHYSMEM_BITS if >>> not defined") realized MAX_PHYSMEM_BITS depended on SPARSEMEM headers >>> and "fixed" it by calculating it using BITS_PER_LONG if SPARSEMEM wasn't >>> used, like in the example given above. >>> >>> Systems with potential for PAE exist for a long time and assuming >>> BITS_PER_LONG seems inadequate. Defining MAX_PHYSMEM_BITS looks better, >>> however it is NOT a constant anymore for x86. >>> >>> SO, instead, MAX_POSSIBLE_PHYSMEM_BITS should be defined by every >>> architecture using zsmalloc, together with a sanity check for >>> MAX_POSSIBLE_PHYSMEM_BITS being too big on 32-bit systems. >>> >>> Link: https://bugs.linaro.org/show_bug.cgi?id=3765#c17 >>> Signed-off-by: Rafael David Tinoco >>> --- >>> arch/arm/include/asm/pgtable-2level-types.h | 2 ++ >>> arch/arm/include/asm/pgtable-3level-types.h | 2 ++ >>> arch/arm64/include/asm/pgtable-types.h | 2 ++ >>> arch/ia64/include/asm/page.h | 2 ++ >>> arch/mips/include/asm/page.h | 2 ++ >>> arch/powerpc/include/asm/mmu.h | 2 ++ >>> arch/s390/include/asm/page.h | 2 ++ >>> arch/sh/include/asm/page.h | 2 ++ >>> arch/sparc/include/asm/page_32.h | 2 ++ >>> arch/sparc/include/asm/page_64.h | 2 ++ >>> arch/x86/include/asm/pgtable-2level_types.h | 2 ++ >>> arch/x86/include/asm/pgtable-3level_types.h | 3 +- >>> arch/x86/include/asm/pgtable_64_types.h | 4 +-- >>> mm/zsmalloc.c | 35 +++++++++++---------- >>> 14 files changed, 45 insertions(+), 19 deletions(-) >>> >>> diff --git a/arch/arm/include/asm/pgtable-2level-types.h b/arch/arm/include/asm/pgtable-2level-types.h >>> index 66cb5b0e89c5..552dba411324 100644 >>> --- a/arch/arm/include/asm/pgtable-2level-types.h >>> +++ b/arch/arm/include/asm/pgtable-2level-types.h >>> @@ -64,4 +64,6 @@ typedef pteval_t pgprot_t; >>> #endif /* STRICT_MM_TYPECHECKS */ >>> +#define MAX_POSSIBLE_PHYSMEM_BITS 32 >>> + >>> #endif /* _ASM_PGTABLE_2LEVEL_TYPES_H */ >>> diff --git a/arch/arm/include/asm/pgtable-3level-types.h b/arch/arm/include/asm/pgtable-3level-types.h >>> index 921aa30259c4..664c39e6517c 100644 >>> --- a/arch/arm/include/asm/pgtable-3level-types.h >>> +++ b/arch/arm/include/asm/pgtable-3level-types.h >>> @@ -67,4 +67,6 @@ typedef pteval_t pgprot_t; >>> #endif /* STRICT_MM_TYPECHECKS */ >>> +#define MAX_POSSIBLE_PHYSMEM_BITS 36 >> >> Nit: with LPAE, physical addresses go up to 40 bits, not just 36. Hum, I got it from where it was being defined when having SPARSEMEM enabled (#define MAX_PHYSMEM_BITS 36 in arch/arm/include/asm/sparsemem.h), since without SPARSEMEM it would default to BITS_PER_LONG. > Right, with that set at 40, we get: > > #define _PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) > > == 28 > > #define OBJ_TAG_BITS 1 > #define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS) > > == 3 > > #define ZS_MAX_ZSPAGE_ORDER 2 > #define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER) > > == 4 > > #define ZS_MIN_ALLOC_SIZE \ > MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) > > == 4 << 12 >> 3 = 2048 > > or half-page allocations. > > Given this in Documentation/vm/zsmalloc.rst: > > On the other hand, if we just use single > (0-order) pages, it would suffer from very high fragmentation -- > any object of size PAGE_SIZE/2 or larger would occupy an entire page. > This was one of the major issues with its predecessor (xvmalloc). > > It seems that would not be acceptable, so I have to ask whether > using an unsigned long to store PFN + object ID is really acceptable. > Maybe the real solution to this problem is to stop using an > unsigned long to contain both the PFN and object ID? >