From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 20 Jan 2023 12:23:28 -0500
From: Peter Xu <peterx@redhat.com>
To: James Houghton <jthoughton@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Zach O'Keefe <zokeefe@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for
 walk_hugetlb_range
Message-ID: <Y8rOEKimvUi0noT6@x1n>
References: <CADrL8HVGMTowH4trJhS+eM_EwZKoUgu7LmfwyTGyGRnNnwL3Zg@mail.gmail.com>
 <Y8hITxr/BBMuO6WX@monkey>
 <CADrL8HUggALQET-09Zw3BhFjZdw_G9+v6CU=qtGtK=KZ_DeAsw@mail.gmail.com>
 <Y8l+f2wNp2gAjvYg@monkey>
 <CADrL8HVdL_NMdNq2mEemNCfwkYBAWnbqwyjsAYdQ2fF0iz34Hw@mail.gmail.com>
 <Y8m9gJX4PNoIrpjE@monkey>
 <Y8nCyqLF71g88Idv@x1n>
 <CADrL8HXkdxDdixWRKNw6RFdbiBX-Cb1Lk7qxg6LdeNywbMOaOA@mail.gmail.com>
 <Y8nNHKW0sTnrq8hw@x1n>
 <CADrL8HUvcXn5rjaS+WNt0Gz=1YV7273VVy-o-EdQHSQObuGNkA@mail.gmail.com>
MIME-Version: 1.0
In-Reply-To: <CADrL8HUvcXn5rjaS+WNt0Gz=1YV7273VVy-o-EdQHSQObuGNkA@mail.gmail.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu, Jan 19, 2023 at 03:26:14PM -0800, James Houghton wrote:
> On Thu, Jan 19, 2023 at 3:07 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote:
> > > On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > > > > I do not know much about the (primary) live migration use case.  My
> > > > > guess is that page table lock contention may be an issue?  In this use
> > > > > case, HGM is only enabled for the duration of the live migration operation,
> > > > > then a MADV_COLLAPSE is performed.  If contention is likely to be an
> > > > > issue during this time, then yes we would need to pass around with
> > > > > something like hugetlb_pte.
> > > >
> > > > I'm not aware of any such contention issue.  IMHO the migration problem is
> > > > mainly that transferring such a large page is too slow.  Shrinking
> > > > the page size should already resolve the major problem here IIUC.
> > >
> > > This will be problematic if you scale up VMs to be quite large.
> >
> > Do you mean that for the postcopy use case one can leverage e.g. 2M
> > mappings (over 1G) to avoid lock contention when the VM is large?  I agree
> > it should be more efficient than having 512 4K pages installed, but I think
> > it'll make the page fault resolution slower too if some thread is only
> > looking for a 4K portion of it.
> 
> No, that's not what I meant. Sorry. If you can use the PTL that is
> normally used for 4K PTEs, then you're right, there is no contention
> problem. However, this PTL is determined by the value of the PMD, so
> you need a pointer to the PMD to determine what the PTL should be (or
> a pointer to the PTL itself).
> 
> In hugetlb, we only ever pass around the PTE pointer, and we rely on
> huge_pte_lockptr() to find the PTL for us (and it does so
> appropriately for everything except 4K PTEs). We would need to add the
> complexity of passing around a PMD or PTL everywhere, and that's what
> hugetlb_pte does for us. So that complexity is basically unavoidable,
> unless you're ok with 4K PTEs with taking mm->page_table_lock (I'm
> not).
> 
> >
> > > Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take
> > > the MMU lock for writing in the EPT violation path. We found that this
> > > change is required for VMs >200 or so vCPUs to consistently avoid CPU
> > > soft lockups in the guest.
> >
> > After the kvm mmu rwlock conversion, it'll allow concurrent page faults
> > even if only 4K pages are used, so it seems not directly relevant to what
> > we're discussing here, no?
> 
> Right. I was just bringing it up to say that if 4K PTLs were
> mm->page_table_lock, we would have a problem.

Ah I see what you meant.  We definitely don't want to use the
page_table_lock.

So if it's about keeping hugetlb_pte I'm fine with it, no matter what the
final version will look like.
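
To make the PTL point above concrete for myself, here is a rough userspace
sketch.  It is a toy model, not kernel code: struct toy_mm, toy_pte_lockptr()
and toy_mm_lockptr() are made-up names, and I'm assuming x86-64 style sizes
(4K base pages, 2M covered by each PTE table, 512 PTE tables under a 1G
hugepage).  It only shows that the per-table lock has to be found through the
PMD level, and that two faults in different 2M ranges would then not share a
lock, while the mm-wide lock is shared by everyone.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PMD_SHIFT      21          /* one PTE table maps 2M (assumed) */
#define TOY_HUGE_SHIFT     30          /* 1G hugepage (assumed) */
#define TOY_NR_PTE_TABLES  512         /* PTE tables under one 1G hugepage */

/* Toy stand-in for a spinlock; only its identity matters here. */
struct toy_lock { int unused; };

struct toy_mm {
	struct toy_lock page_table_lock;                    /* one per mm */
	struct toy_lock pte_table_lock[TOY_NR_PTE_TABLES];  /* one per PTE table */
};

/* Which PTE table (i.e. which PMD entry) covers this offset in the 1G page? */
static unsigned int toy_pmd_index(uint64_t offset)
{
	assert(offset < (1ULL << TOY_HUGE_SHIFT));
	return (unsigned int)(offset >> TOY_PMD_SHIFT);
}

/* Split-PTL case: the lock is looked up via the PMD level. */
static struct toy_lock *toy_pte_lockptr(struct toy_mm *mm, uint64_t offset)
{
	return &mm->pte_table_lock[toy_pmd_index(offset)];
}

/* Fallback case: every 4K PTE would serialize on the same mm-wide lock. */
static struct toy_lock *toy_mm_lockptr(struct toy_mm *mm, uint64_t offset)
{
	(void)offset;
	return &mm->page_table_lock;
}
```

With this model, offsets 0x0 and 0x1ff000 map to the same toy lock, 0x200000
maps to a different one, and toy_mm_lockptr() returns the same lock for all
of them.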

> 
> >
> > >
> > > Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
> > > the same PTL would be problematic in the same way.
> >
> > Pte-level pgtable lock only covers 2M range, so I think it depends on which
> > is the address that the vcpu is faulted on?  IIUC the major case should be
> > that the faulted threads are not falling upon the same 2M range.
> 
> Right. I think my comment should make more sense with the above clarification.
> 
> >
> > >
> > > >
> > > > AFAIU 4K-only solution should only reduce any lock contention because locks
> > > > will always be pte-level if VM_HUGETLB_HGM set.  When walking and creating
> > > > the intermediate pgtable entries we can use atomic ops just like generic
> > > > mm, so no lock needed at all.  With uncertainty on the size of mappings,
> > > > we'll need to take any of the multiple layers of locks.
> > > >
> > >
> > > Other than taking the HugeTLB VMA lock for reading, walking/allocating
> > > page tables won't need any additional locking.
> >
> > Actually when revisiting the locks I'm getting a bit confused on whether
> > the vma lock is needed if pmd sharing is anyway forbidden for HGM.  I
> > raised a question in the other patch of MADV_COLLAPSE, maybe they're
> > related questions so we can keep it there.
> 
> We can discuss there. :) I take both the VMA lock and mapping lock so
> that it can stay in sync with huge_pmd_unshare(), and so HGM walks
> have the same synchronization as regular hugetlb PT walks.

Sure. :)

On second thought, I don't think it's unsafe to take the vma write lock
here, especially for VM_SHARED; I can't think of anything that would go
wrong.  The reason, I think, is that we need the vma lock anywhere we walk
the pgtables while holding mmap_sem for read, because pmd sharing may be
possible there.

But I'm not sure whether this is the cleanest way to do it.

IMHO this is the major special part of hugetlb compared to generic mm
regarding pgtable thread safety.  I worry that complicating this lock can
potentially make the hugetlb code even more specific, which is not good for
the long term if we still hope to merge more hugetlb code into the generic
paths.

Since pmd sharing is impossible for HGM, the original vma lock is not
needed here.  Meanwhile, what we want to guard is the pgtable walkers.
They're logically being protected by either mmap lock or the mapping lock
(for rmap walkers).  Fast-gup is another thing but so far I think it's all
safe when you're following the mmu gather facilities.

I have a feeling that the hugetlb vma lock will keep evolving in the future
(along with the ongoing pgtable sharing explorations in the hugetlb world),
so it should be helpful to keep its semantics simple too.

So to summarize: I wonder whether we can use mmap write lock and
i_mmap_rwsem write lock to protect collapsing for hugetlb, just like what
we do with THP collapsing (after Jann's fix).
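
A rough sketch of what I have in mind, as a single-threaded userspace toy
(not kernel code, no real blocking): struct toy_rwsem and all the toy_*()
helpers are made up for illustration.  The only things taken from the
discussion are that collapse would hold both the mmap lock and i_mmap_rwsem
for write, taken in that order, and that walkers hold at least one of them
for read.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy rwsem: counts readers, flags a writer.  Trylock semantics only. */
struct toy_rwsem {
	int readers;
	bool writer;
};

static bool toy_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

static bool toy_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

static void toy_write_unlock(struct toy_rwsem *sem)
{
	assert(sem->writer);
	sem->writer = false;
}

/* Collapse needs both locks for write: mmap lock first, then i_mmap_rwsem. */
static bool toy_collapse_trylock(struct toy_rwsem *mmap_lock,
				 struct toy_rwsem *i_mmap_rwsem)
{
	if (!toy_write_trylock(mmap_lock))
		return false;
	if (!toy_write_trylock(i_mmap_rwsem)) {
		toy_write_unlock(mmap_lock);	/* back out in reverse order */
		return false;
	}
	return true;
}

static void toy_collapse_unlock(struct toy_rwsem *mmap_lock,
				struct toy_rwsem *i_mmap_rwsem)
{
	toy_write_unlock(i_mmap_rwsem);
	toy_write_unlock(mmap_lock);
}
```

In this model a walker holding either lock for read makes
toy_collapse_trylock() fail, and once collapse holds both locks no walker
can get in through either path.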

Using madvise_need_mmap_write() is not easily feasible because it runs
before the vma scanning, so we can't conditionally take the write lock only
for hugetlb.  But that's the next question to ask, and only once we reach a
consensus on the lock scheme for HGM in general.

> 
> >
> > >
> > > We take the PTL to allocate the next level down, but so does generic
> > > mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
> > > misunderstanding.
> >
> > Sorry you're right, please ignore that.  I don't know why I had that
> > impression that spinlocks are not needed in that process.
> >
> > Actually I am also curious why atomics won't work (by holding mmap read
> > lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries).  I
> > think it's possible I just missed something else.
> 
> I think there are cases where we need to make sure the value of a PTE
> isn't going to change from under us while we're doing some kind of
> other operation, and so a compare-and-swap won't really be what we
> need.

Currently the pgtable spinlock is only taken while populating the
pgtables.  If the entry can change under us, then it can equally change
right after we release the spinlock in e.g. __pmd_alloc().

One thing I can think of is that we may need to do more than just install
the pgtable entry, in which case atomics alone would stop working.  E.g. on
x86 we have paravirt_alloc_pmd().  But I'm not sure whether that's the only
reason.
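
For reference, the atomic variant I was imagining looks roughly like the
userspace sketch below, using C11 atomics rather than the kernel's cmpxchg().
toy_entry_t and toy_populate() are made-up names, and the "table" is just
calloc()ed memory.  It covers only the install step: nothing like the
paravirt_alloc_pmd() hook is modeled, which is exactly the part where a bare
cmpxchg would not be enough.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy page-table slot: 0 means "none", otherwise a pointer to the next level. */
typedef _Atomic uintptr_t toy_entry_t;

/*
 * Populate *slot with a freshly allocated next-level table if it is empty.
 * Returns the table that ends up installed, whether ours or a racer's,
 * or NULL if the slot was empty and the allocation failed.
 */
static void *toy_populate(toy_entry_t *slot, size_t table_size)
{
	uintptr_t old = atomic_load(slot);
	void *new_table;

	if (old)
		return (void *)old;		/* already populated */

	new_table = calloc(1, table_size);
	if (!new_table)
		return NULL;

	/* cmpxchg(old_entry=0, new_entry): install only if still empty. */
	old = 0;
	if (atomic_compare_exchange_strong(slot, &old, (uintptr_t)new_table))
		return new_table;		/* we won the race */

	free(new_table);			/* lost; 'old' now holds the winner */
	return (void *)old;
}
```

Calling toy_populate() twice on the same slot returns the same table both
times; the second call sees the slot already populated and allocates nothing.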

-- 
Peter Xu