From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 29E36C43334
	for <linux-mm@archiver.kernel.org>; Tue, 28 Jun 2022 14:17:43 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 56ED48E0001; Tue, 28 Jun 2022 10:17:42 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 51EE86B0072; Tue, 28 Jun 2022 10:17:42 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 3BFE18E0001; Tue, 28 Jun 2022 10:17:42 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id 27B676B0071
	for <linux-mm@kvack.org>; Tue, 28 Jun 2022 10:17:42 -0400 (EDT)
Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay12.hostedemail.com (Postfix) with ESMTP id E60CC1206BD
	for <linux-mm@kvack.org>; Tue, 28 Jun 2022 14:17:41 +0000 (UTC)
X-FDA: 79627847922.27.DD8B255
Received: from mail-pj1-f45.google.com (mail-pj1-f45.google.com [209.85.216.45])
	by imf23.hostedemail.com (Postfix) with ESMTP id 36AFA140034
	for <linux-mm@kvack.org>; Tue, 28 Jun 2022 14:17:39 +0000 (UTC)
Received: by mail-pj1-f45.google.com with SMTP id n16-20020a17090ade9000b001ed15b37424so12771490pjv.3
        for <linux-mm@kvack.org>; Tue, 28 Jun 2022 07:17:39 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=bytedance-com.20210112.gappssmtp.com; s=20210112;
        h=date:from:to:cc:subject:message-id:references:mime-version
         :content-disposition:in-reply-to;
        bh=dsdJAZGMU86DRVBeneY3mM/jA9/FuyyN3hzFaqyDicg=;
        b=nyHdWDKz+JUVQGY+RpZ+evo4B+BP7bogCl/7Bxg6RgTsYeWsEtmQfcL2uzO2sAVLhp
         KMluGL+xWxhTiJWzKOECH7su0ssGe3M51J1MbnbCDCV+8we0uJY1a69XkcBhSghf98v1
         luZ2Z1x2f9lFOy71eyJEsRmgz0O+NyYB4y+3kcinbJy0P87uwcpCAiDNg5vJKpVNoQI0
         grKby+ch8a+Rt48BkwT29mfCmCWnf3DhkwVSscQ/qzschIVq+qWifMTJhzrHjqlJGQUP
         G9rmYh00uwCVQyaIgJiWW00rULLWqlpX6PTmS2q7FugnoGPfQTo3NjNBH8IGGcYtRmNo
         HPhg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to;
        bh=dsdJAZGMU86DRVBeneY3mM/jA9/FuyyN3hzFaqyDicg=;
        b=F801CjsmQLV4HVVJ1Wz+vRCT+qvCdrnKO9giOet9AMQTYPdPyQF+lS8uRIf6+GchlZ
         Ph/YAKIQVwBD0O/t3Zv3lpd+ZKqnzh9Xe3JAgl+yguDweyVEkWoag/xpniqdAgQ8VGr8
         6YukkgzcKZ+xknOAvm7z5KUa7mJVSPUTYcxwkFFzE1QGPt+yyjKilWxXnV0bDiY/JJzt
         ns/jMz8D1r2utfl/uMSxkFcamnqMlf9v0FUfuNCtOfPYh2bguJJQLqiESiKbk7nRMp7N
         23tshn9+5eKtsntG409eVOYj/Vc4X9c/mn1hv6M2qqb93zceXYiE3GlPV2e3LNjwyp2o
         B6cg==
X-Gm-Message-State: AJIora8PZCayHD/sRLXpQR+SI7ru4TtOVWOB8rIrmGZkOic97LPR9hMr
	NKYbSeqipAkI3M2dhR9KL/5jfA==
X-Google-Smtp-Source: AGRyM1sMmomGX/L3/p59uYhsnJwv5HPrcu9SmmgqfQ8O0pGZjvE5nMm0EL8KXv9eGmkAFUb2aZuhdg==
X-Received: by 2002:a17:90a:fa8c:b0:1ec:9f5c:846d with SMTP id cu12-20020a17090afa8c00b001ec9f5c846dmr22319832pjb.73.1656425858807;
        Tue, 28 Jun 2022 07:17:38 -0700 (PDT)
Received: from localhost ([2408:8207:18da:2310:e153:cfbc:e790:5935])
        by smtp.gmail.com with ESMTPSA id 200-20020a6214d1000000b00524f29903e0sm9483622pfu.56.2022.06.28.07.17.36
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 28 Jun 2022 07:17:38 -0700 (PDT)
Date: Tue, 28 Jun 2022 22:17:34 +0800
From: Muchun Song <songmuchun@bytedance.com>
To: James Houghton <jthoughton@google.com>
Cc: Mina Almasry <almasrymina@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Peter Xu <peterx@redhat.com>, David Hildenbrand <david@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Jue Wang <juew@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity
 mapping
Message-ID: <YrsNfjm+S0KIKn2k@FVFYT0MHHV2J>
References: <20220624173656.2033256-1-jthoughton@google.com>
 <CAHS8izPnJd5EQjUi9cOk=03u3X1rk0PexTQZi+bEE4VMtFfksQ@mail.gmail.com>
 <CADrL8HWse7-=1Z=1_d8szwdkhFH1t8L4pOBO7E7yxgCYF-gc8w@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CADrL8HWse7-=1Z=1_d8szwdkhFH1t8L4pOBO7E7yxgCYF-gc8w@mail.gmail.com>
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656425861; a=rsa-sha256;
	cv=none;
	b=5DivY/1sQmmL0KjGho6Nu1iVK7UKpygUfDuOYq8DIDq4KzfMVA4iOJmNY4W5XkGp3GyQw7
	d66pbhqWXmTvKLXYq2gbWBo0wZWyPywPwT/dNtCtDC76I88ON1HUL7v/vCcOtijKIWuJJW
	UHZHKgST2j561D/Pqjp0Qd7hB7nofJI=
ARC-Authentication-Results: i=1;
	imf23.hostedemail.com;
	dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=nyHdWDKz;
	spf=pass (imf23.hostedemail.com: domain of songmuchun@bytedance.com designates 209.85.216.45 as permitted sender) smtp.mailfrom=songmuchun@bytedance.com;
	dmarc=pass (policy=none) header.from=bytedance.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1656425861;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=dsdJAZGMU86DRVBeneY3mM/jA9/FuyyN3hzFaqyDicg=;
	b=g3RimBZPPFDhvDjYvVHki2UOFbJTesm0hq2YFfdokEAw2irasbuR290TPGBc4ReQRXKMUT
	r5/H5t6OfEw2oXw0FDXDzmcShCTzkzDJ0E0ZHSoH5Ojl5QzskEQH94ZgwrBLGBf8xuDLON
	HTighpOdisJAgP6GHt9hfQch1UKhuzg=
X-Rspamd-Server: rspam09
X-Rspamd-Queue-Id: 36AFA140034
Authentication-Results: imf23.hostedemail.com;
	dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=nyHdWDKz;
	spf=pass (imf23.hostedemail.com: domain of songmuchun@bytedance.com designates 209.85.216.45 as permitted sender) smtp.mailfrom=songmuchun@bytedance.com;
	dmarc=pass (policy=none) header.from=bytedance.com
X-Rspam-User: 
X-Stat-Signature: a5qnoccbdspqpf7etj9ggdhsp76z6mmi
X-HE-Tag: 1656425859-444663
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Mon, Jun 27, 2022 at 09:27:38AM -0700, James Houghton wrote:
> On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > [trimmed...]
> > > ---- Userspace API ----
> > >
> > > This patch series introduces a single way to take advantage of
> > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > userspace to resolve MINOR page faults on shared VMAs.
> > >
> > > To collapse a HugeTLB address range that has been mapped with several
> > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > userspace to know when all pages (that they care about) have been fetched.
> > >
> >
> > Thanks James! Cover letter looks good. A few questions:
> >
> > Why not have the kernel collapse the hugepage once all the 4K pages
> > have been fetched automatically? It would remove the need for a new
> > userspace API, and AFAICT there aren't really any cases where it is
> > beneficial to have a hugepage sharded into 4K mappings when those
> > mappings can be collapsed.
> 
> The reason that we don't automatically collapse mappings is because it
> would take additional complexity, and it is less flexible. Consider
> the case of 1G pages on x86: currently, userspace can collapse the
> whole page when it's all ready, but they can also choose to collapse a
> 2M piece of it. On architectures with more supported hugepage sizes
> (e.g., arm64), userspace has even more possibilities for when to
> collapse. This likely further complicates a potential
> automatic-collapse solution. Userspace may also want to collapse the
> mapping for an entire hugepage without completely mapping the hugepage
> first (this would also be possible by issuing UFFDIO_CONTINUE on all
> the holes, though).
> 
> >
> > > ---- HugeTLB Changes ----
> > >
> > > - Mapcount
> > > The way mapcount is handled is different from the way that it was handled
> > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > high granularity, their mapcounts will remain the same as what they would
> > > have been pre-HGM.
> > >
> >
> > Sorry, I didn't quite follow this. It says mapcount is handled

+1

> > differently, but the same if the page is not mapped at high
> > granularity. Can you elaborate on how the mapcount handling will be
> > different when the page is mapped at high granularity?
> 
> I guess I didn't phrase this very well. For the sake of simplicity,
> consider 1G pages on x86, typically mapped with leaf-level PUDs.
> Previously, there were two possibilities for how a hugepage was
> mapped, either it was (1) completely mapped (PUD is present and a
> leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> case, where the PUD is not none but also not a leaf (this usually
> means that the page is partially mapped). We handle this case as if
> the whole page was mapped. That is, if we partially map a hugepage
> that was previously unmapped (making the PUD point to PMDs), we
> increment its mapcount, and if we completely unmap a partially mapped
> hugepage (making the PUD none), we decrement its mapcount. If we
> collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> 
> It is possible for a PUD to be present and not a leaf (mapcount has
> been incremented) but for the page to still be unmapped: if the PMDs
> (or PTEs) underneath are all none. This case is atypical, and as of
> this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> think it would be very difficult to get this to happen.
> 

That is a good explanation. I think it would be better to put it in the
cover letter.
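
For reference, the mapcount rule described above can be written down as a
small toy model. This is only a userspace sketch, not kernel code: every
name in it (toy_pud_state, toy_hugepage, toy_set_pud) is made up for
illustration. It also assumes that splitting a leaf PUD into a non-leaf PUD
leaves mapcount unchanged, which seems to follow from the "PUD is not none"
rule but is not stated explicitly in the explanation.

```c
#include <assert.h>

/* Toy model only: hypothetical names, not kernel identifiers. */
enum toy_pud_state {
	TOY_PUD_NONE,	/* hugepage is not mapped through this PUD */
	TOY_PUD_LEAF,	/* hugepage is mapped by a single leaf PUD */
	TOY_PUD_TABLE,	/* PUD points to PMDs (high-granularity mapping) */
};

struct toy_hugepage {
	int mapcount;
};

/* A mapping contributes to mapcount whenever its PUD is not none. */
int toy_pud_counts(enum toy_pud_state state)
{
	return state != TOY_PUD_NONE;
}

/*
 * Move one mapping's PUD to a new state and adjust the hugepage's
 * mapcount by the change in "PUD is not none":
 *   none  -> leaf or table : +1
 *   leaf or table -> none  : -1
 *   table -> leaf (collapse), leaf -> table (split, assumed) : 0
 */
void toy_set_pud(enum toy_pud_state *pud, struct toy_hugepage *page,
		 enum toy_pud_state new_state)
{
	page->mapcount += toy_pud_counts(new_state) - toy_pud_counts(*pud);
	*pud = new_state;
}

/* Partially map, collapse, then unmap; returns the final mapcount. */
int toy_walkthrough(void)
{
	struct toy_hugepage page = { 0 };
	enum toy_pud_state pud = TOY_PUD_NONE;

	toy_set_pud(&pud, &page, TOY_PUD_TABLE);	/* partial map */
	assert(page.mapcount == 1);
	toy_set_pud(&pud, &page, TOY_PUD_LEAF);		/* collapse */
	assert(page.mapcount == 1);
	toy_set_pud(&pud, &page, TOY_PUD_NONE);		/* full unmap */
	return page.mapcount;
}
```

The model deliberately ignores the atypical case mentioned above, where the
PUD is a non-leaf but everything underneath is none: there the mapcount
stays incremented even though nothing is actually mapped.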

Thanks.

> >
> > > - Page table walking and manipulation
> > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > high-granularity mappings. Eventually, it's possible to merge
> > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > >
> > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > This is because we generally need to know the "size" of a PTE (previously
> > > always just huge_page_size(hstate)).
> > >
> > > For every page table manipulation function that has a huge version (e.g.
> > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > PTE really is "huge".
> > >
> > > - Synchronization
> > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > writing, and for doing high-granularity page table walks, we require it to
> > > be held for reading.
> > >
> > > ---- Limitations & Future Changes ----
> > >
> > > This patch series only implements high-granularity mapping for VM_SHARED
> > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > failure recovery for both shared and private mappings.
> > >
> > > The memory failure use case poses its own challenges that can be
> > > addressed, but I will do so in a separate RFC.
> > >
> > > Performance has not been heavily scrutinized with this patch series. There
> > > are places where lock contention can significantly reduce performance. This
> > > will be addressed later.
> > >
> > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > page struct optimization[3], as we do not need to modify data contained
> > > in the subpage page structs.
> > >
> > > Other omissions:
> > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > >  - Support for mremap() (will be included in v1). This looks a lot like
> > >    the support we have for fork().
> > >  - Documentation changes (will be included in v1).
> > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > >    in v1).
> > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > >    than arm64.
> > >
> > > ---- Patch Breakdown ----
> > >
> > > Patch 1     - Preliminary changes
> > > Patch 2-10  - HugeTLB HGM core changes
> > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > Patch 20-23 - Userfaultfd and collapse changes
> > > Patch 24-26 - arm64 support and selftests
> > >
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > >     name. "High-granularity mapping" is not a great name either. I am open
> > >     to better names.
> >
> > I would drop 1 extra word and do "granular mapping", as in the mapping
> > is more granular than what it normally is (2MB/1G, etc).
> 
> Noted. :)
>