Date: Tue, 29 Oct 2019 09:43:18 +0300
From: "Kirill A. Shutemov"
To: Dan Williams
Cc: Mike Rapoport, Linux Kernel Mailing List, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Linux API, linux-mm,
	the arch/x86 maintainers, Mike Rapoport
Subject: Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
Message-ID: <20191029064318.s4n4gidlfjun3d47@box>
References: <1572171452-7958-1-git-send-email-rppt@kernel.org>
	<1572171452-7958-2-git-send-email-rppt@kernel.org>
	<20191028123124.ogkk5ogjlamvwc2s@box>
	<20191028130018.GA7192@rapoport-lnx>
	<20191028131623.zwuwguhm4v4s5imh@box>

On Mon, Oct 28, 2019 at 10:43:51PM -0700, Dan Williams wrote:
> On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov wrote:
> >
> > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > > From: Mike Rapoport
> > > > >
> > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > > the owning process and can be used by applications to store secret
> > > > > information that will not be visible not only to other processes but to the
> > > > > kernel as well.
> > > > >
> > > > > The pages in these mappings are removed from the kernel direct map and
> > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > > the pages are mapped back into the direct map.
> > > >
> > > > I probably blind, but I don't see where you manipulate direct map...
> > >
> > > __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> > > set_direct_map_invalid_noflush() that makes the page not present.
> >
> > Ah. okay.
> >
> > I think active use of this feature will lead to performance degradation of
> > the system with time.
> >
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> >
> > It might be okay if all these pages cluster together, but I don't think we
> > have a way to achieve it easily.
>
> Still, it would be worth exploring what that would look like if not
> for MAP_EXCLUSIVE then set_mce_nospec() that wants to punch out poison
> pages from the direct map. In the case of pmem, where those pages are
> able to be repaired, it would be nice to also repair the mapping
> granularity of the direct map.

The solution has to consist of two parts: finding a range to collapse and
actually collapsing the range into a huge page.

Finding the collapsible range will likely require background scanning of
the direct mapping, as we do for THP with khugepaged. It should not be too
hard, but it will likely require long and tedious tuning to be effective
without being too disruptive for the system.

Alternatively, after any change to the direct mapping, we can initiate a
check of whether the surrounding range is collapsible: up to 1G around the
changed 4k. It might be more taxing than scanning if the direct mapping
changes often.

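
To make the "check around the changed 4k" idea a bit more concrete, here
is a rough userspace model of the 2M case. It is only a sketch and not
kernel code: struct pte_model and both function names are made up for
illustration, and the conditions checked (all entries present, same
protection, physically contiguous, aligned base) are my assumption of what
"collapsible" has to mean at minimum. The 1G case would be the same check
one level up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_2M 512u	/* 2M / 4k on x86-64 */

/* Hypothetical, simplified model of one 4k direct-map entry. */
struct pte_model {
	uint64_t pfn;	/* physical frame number the entry points at */
	uint32_t prot;	/* protection bits, treated as opaque here */
	bool present;
};

/*
 * Return true if the 512 entries starting at index 'first' could be
 * replaced by a single 2M mapping: every entry is present, all share
 * the same protection, and the frames are contiguous starting from a
 * 2M-aligned frame.
 */
bool range_collapsible_2m(const struct pte_model *ptes, size_t first)
{
	const struct pte_model *base = &ptes[first];

	/* The caller must pass the index of a 2M-aligned block. */
	assert(first % PTES_PER_2M == 0);

	if (!base->present || base->pfn % PTES_PER_2M != 0)
		return false;

	for (size_t i = 1; i < PTES_PER_2M; i++) {
		const struct pte_model *p = &base[i];

		if (!p->present || p->prot != base->prot ||
		    p->pfn != base->pfn + i)
			return false;
	}
	return true;
}

/* After entry 'changed' was modified, re-check only its 2M block. */
bool recheck_after_change(const struct pte_model *ptes, size_t changed)
{
	size_t first = changed - (changed % PTES_PER_2M);

	return range_collapsible_2m(ptes, first);
}
```

A real implementation would have to do this walk under the relevant lock
and repeat it just before collapsing, since the result can go stale as
soon as the lock is dropped.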
Collapsing itself appears to be simple: re-check that the range is
collapsible under the lock, replace the page table with the huge page and
flush the TLB.

But some CPUs don't like to have two TLB entries for the same memory with
different sizes at the same time. See for instance AMD erratum 383.

Getting it right would require making the range not present, flushing the
TLB and only then installing the huge page. That's what we do for
userspace.

It will not fly for the direct mapping. There is no reasonable way to
exclude other CPUs from accessing the range while it's not present (call
stop_machine()? :P). Moreover, the range may contain the code that is
doing the collapse or data required for it...

BTW, it looks like the current __split_large_page() in pageattr.c is
susceptible to the erratum. Maybe we can get away with the easy way...

-- 
 Kirill A. Shutemov