Date: Fri, 7 Feb 2020 11:40:06 -0800
From: Matthew Wilcox
To: "Kirill A. Shutemov"
Cc: Mike Rapoport, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] Page table manipulation primitives
Message-ID: <20200207194006.GF8731@bombadil.infradead.org>
References: <20200206165741.GC17499@linux.ibm.com> <20200206173410.GW8731@bombadil.infradead.org> <20200207174553.mx6onurbvhgn7w5p@box>
In-Reply-To: <20200207174553.mx6onurbvhgn7w5p@box>

On Fri, Feb 07, 2020 at 08:45:53PM +0300, Kirill A.
Shutemov wrote:
> On Thu, Feb 06, 2020 at 09:34:10AM -0800, Matthew Wilcox wrote:
> > On Thu, Feb 06, 2020 at 06:57:41PM +0200, Mike Rapoport wrote:
> > > While updating the architectures to properly use 5-level folded page tables
> > > without and
> > > I wondered if we can do better than explicitly name each and every level of
> > > the page table, open-code traversal of all the layers numerous times and
> > > have copied do_something_pXd_range().
> > >
> > > Then I've come across Kirill's "Proof-of-concept: better(?) page-table
> > > manipulation API" [1], but as far as I could see there was no progress
> > > since then.
> > >
> > > I'd like to resurrect the topic and try to see if we can come up with
> > > actually better page table manipulation API.
> > >
> > > [1] https://lore.kernel.org/lkml/20180424154355.mfjgkf47kdp2by4e@black.fi.intel.com/
>
> I played a bit more with it after that, but got distracted to other stuff.
> I'll see if I'll be able to come up with an update.
>
> > I don't think this approach helps support 64k pages on ARM
>
> Could you specify what such support would require?

For 64kB pages with a base 4kB page size, you set a special bit in 16
adjacent aligned PTEs.  When the MMU sees that bit set, it uses a 64k
TLB entry.  So I think what we want for a fully generic interface is:

void set_vpte_at(struct mm_struct *, unsigned long addr, vpte_iter *, vpte_t,
		unsigned int order);

(maybe we don't need an 'order' here; perhaps it's embedded in the vpte_iter)

> > , for example,
> > so it doesn't solve enough problems to be worth doing.  I'd favour
> > an interface which looked more like this:
> >
> > 	vpte_iter iter;
> > 	vpte_t vpte;
> >
> > 	vpte_iter_for_each(vpte, iter, start, end, flags) {
> > 		unsigned char order = vpte_order(&iter);
> > 		... do things based on vpte and order ...
> > 	}
>
> It looks like just an higher level API that can be provided over my
> approach. Maybe it should be the default go-to. But I find it useful to be
> able go into low-level details where it is matters.

I think the key difference is that I would not embed the 'order' in the
vpte, but keep it in the iter.  I don't know that every architecture has
the ability to tell from a union { pte_t, pmd_t, pud_t, p4d_t, pgd_t }
which of the levels it is.

Looking at the code you provided, another difference is that your method
involves a recursive call for each level of the page tables.  I'd rather
express these kinds of things as "I would like to iterate over each page
table entry in this range" than "Have I got to the bottom?  If not,
recursively call myself".  IOW vpte_iter_for_each() would work its way
down to the lowest level, and keep track of where it is in the iter,
so when moving to the next entry in the tree, it knows whether to go up
before going sideways, and then down as far as it needs to.

Whatever we come up with, we should be able to collapse away the levels
which aren't needed, and support whatever non-PTE-level TLB orders the
hardware supports without forcing support for those orders on x86 code.

I don't have a good solution for how to express the 'copy_pt_range'
in your example, where we need to iterate two mms at the same time.
Maybe that's a special iterator which does exactly that.