Subject: Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
To: Palmer Dabbelt, benh@kernel.crashing.org
Cc: mpe@ellerman.id.au, paulus@samba.org, Paul Walmsley,
 aou@eecs.berkeley.edu, Anup Patel, Atish Patra, zong.li@sifive.com,
 linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 linux-riscv@lists.infradead.org, linux-mm@kvack.org
From: Alex Ghiti
Message-ID: <970adad4-6eec-dffe-ad1c-bf74646229ad@ghiti.fr>
Date: Thu, 23 Jul 2020 01:36:45 -0400

On 7/21/20 at 7:36 PM, Palmer Dabbelt wrote:
> On Tue, 21 Jul 2020 16:11:02 PDT (-0700), benh@kernel.crashing.org wrote:
>> On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:
>>> > > I guess I don't understand why this is necessary at all.
>>> > > Specifically: why can't we just relocate the kernel within the
>>> > > linear map?  That would let the bootloader put the kernel
>>> > > wherever it wants, modulo the physical memory size we support.
>>> > > We'd need to handle the regions that are coupled to the kernel's
>>> > > execution address, but we could just put them in an explicit
>>> > > memory region which is what we should probably be doing anyway.
>>> >
>>> > Virtual relocation in the linear mapping requires moving the kernel
>>> > physically too. Zong implemented this physical move in his KASLR RFC
>>> > patchset, which is cumbersome since finding an available physical
>>> > spot is harder than just selecting a virtual range in the vmalloc
>>> > range.
>>> >
>>> > In addition, having the kernel mapping in the linear mapping
>>> > prevents the use of hugepages for the linear mapping, resulting in
>>> > performance loss (at least for the GB that encompasses the kernel).
>>> >
>>> > Why do you find this "ugly"? The vmalloc region is just a bunch of
>>> > available virtual addresses for whatever purpose we want, and as
>>> > noted by Zong, arm64 uses the same scheme.
>>
>> I don't get it :-)
>>
>> At least on powerpc we move the kernel in the linear mapping and it
>> works fine with huge pages, what is your problem there? You rely on
>> punching small-page size holes in there?
>
> That was my original suggestion, and I'm not actually sure it's
> invalid.  It would mean that both the kernel's physical and virtual
> addresses are set by the bootloader, which may or may not be workable
> if we want to have an sv48+sv39 kernel.  My initial approach to
> sv48+sv39 kernels would be to just throw away the sv39 memory on sv48
> kernels, which would preserve the linear map but mean that there is no
> single physical address that's accessible for both.  That would
> require some coordination between the bootloader and the kernel as to
> where it should be loaded, but maybe there's a better way to design
> the linear map.  Right now we have a bunch of unwritten rules about
> where things need to be loaded, which is a recipe for disaster.
>
> We could copy the kernel around, but I'm not sure I really like that
> idea.  We do zero the BSS right now, so it's not like we entirely rely
> on the bootloader to set up the kernel image, but with the hart race
> boot scheme we have right now we'd at least need to leave a stub
> sitting around.  Maybe we just throw away SBI v0.1, though, that's why
> we called it all legacy in the first place.
>
> My bigger worry is that anything that involves running the kernel at
> arbitrary virtual addresses means we need a PIC kernel, which means
> every global symbol needs an indirection.  That's probably not so bad
> for shared libraries, but the kernel has a lot of global symbols.  PLT
> references probably aren't so scary, as we have an incoherent
> instruction cache so the virtual function predictor isn't that hard to
> build, but making all global data accesses GOT-relative seems like a
> disaster for performance.  This fixed-VA thing really just exists so
> we don't have to be full-on PIC.
>
> In theory I think we could just get away with pretending that medany
> is PIC, which I believe works as long as the data and text offset
> stays constant, you don't have any symbols between 2GiB and -2GiB (as
> those may stay fixed, even in medany), and you deal with GP
> accordingly (which should work itself out in the current startup
> code).  We rely on this for some of the early boot code (and will soon
> for kexec), but that's a very controlled code base and we've already
> had some issues.  I'd be much more comfortable adding an explicit
> semi-PIC code model, as I tend to miss something when doing these
> sorts of things and then we could at least add it to the GCC test runs
> and guarantee it actually works.  Not really sure I want to deal with
> that, though.  It would, however, be the only way to get random
> virtual addresses during kernel execution.
>
>> At least in the old days, there were a number of assumptions that
>> the kernel text/data/bss resides in the linear mapping.
>
> Ya, it terrified me as well.  Alex says arm64 puts the kernel in the
> vmalloc region, so assuming that's the case it must be possible.  I
> didn't get that from reading the arm64 port (I guess it's no secret
> that pretty much all I do is copy their code)

See https://elixir.bootlin.com/linux/latest/source/arch/arm64/mm/mmu.c#L615.

>
>> If you change that you need to ensure that it's still physically
>> contiguous and you'll have to tweak __va and __pa, which might induce
>> extra overhead.
>
> I'm operating under the assumption that we don't want to add an
> additional load to virt2phys conversions.  arm64 bends over backwards
> to avoid the load, and I'm assuming they have a reason for doing so.
> Of course, if we're PIC then maybe performance just doesn't matter,
> but I'm not sure I want to just give up.
> Distros will probably build the sv48+sv39 kernels as soon as they show
> up, even if there's no sv48 hardware for a while.
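
To make the __va/__pa concern above concrete, here is a minimal
user-space sketch of what a two-range virt2phys conversion could look
like once the kernel image lives outside the linear mapping. It is not
the riscv or arm64 implementation: every constant and every function
name below (PAGE_OFFSET value, KERNEL_VIRT_BASE, va_to_pa, ...) is a
hypothetical placeholder chosen for illustration. With compile-time
constants the extra cost is one range check; in a real kernel the
offsets are only known at boot, so each path would presumably also
need the load Palmer mentions.

```c
#include <assert.h>
#include <stdint.h>

/* All values are hypothetical and only serve the illustration. */
#define PAGE_OFFSET      0xffffffe000000000ULL /* start of the linear map (assumed) */
#define PHYS_RAM_BASE    0x0000000080000000ULL /* first byte of RAM (assumed) */
#define KERNEL_VIRT_BASE 0xffffffff80000000ULL /* kernel image VA, outside the linear map (assumed) */
#define KERNEL_PHYS_BASE 0x0000000080200000ULL /* where the bootloader placed the image (assumed) */
#define KERNEL_SIZE      0x0000000001000000ULL /* 16 MiB image (assumed) */

/* True if va falls inside the (hypothetical) kernel image mapping. */
static inline int is_kernel_image_va(uint64_t va)
{
    return va >= KERNEL_VIRT_BASE && va - KERNEL_VIRT_BASE < KERNEL_SIZE;
}

/* Linear-map-only conversion: a single constant offset. */
static inline uint64_t linear_va_to_pa(uint64_t va)
{
    return va - PAGE_OFFSET + PHYS_RAM_BASE;
}

/* Two-range conversion: the range check is the added overhead. */
static inline uint64_t va_to_pa(uint64_t va)
{
    if (is_kernel_image_va(va))
        return va - KERNEL_VIRT_BASE + KERNEL_PHYS_BASE;
    assert(va >= PAGE_OFFSET); /* anything else must be a linear-map address */
    return linear_va_to_pa(va);
}

/* Physical addresses always translate back through the linear map. */
static inline uint64_t pa_to_va(uint64_t pa)
{
    return pa - PHYS_RAM_BASE + PAGE_OFFSET;
}
```

Note that pa_to_va() stays a single offset because every physical page
keeps its linear-map alias; only the VA-to-PA direction has to tell
the two ranges apart.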