Subject: Re: [RFC PATCH 3/5] mm: thp: split huge page to any lower order pages.
From: Miaohe Lin
To: Zi Yan
CC: Roman Gushchin, Shuah Khan, Yang Shi, Hugh Dickins, "Kirill A. Shutemov",
 Matthew Wilcox, Yu Zhao
Date: Wed, 23 Mar 2022 10:31:06 +0800
References: <20220321142128.2471199-1-zi.yan@sent.com>
 <20220321142128.2471199-4-zi.yan@sent.com>
 <165ec1a8-2b35-f6fb-82d3-b94613dd437a@huawei.com>

On 2022/3/22 22:30, Zi Yan wrote:
> On 21 Mar 2022, at 23:21, Miaohe Lin wrote:
>
>> On 2022/3/21 22:21, Zi Yan wrote:
>>> From: Zi Yan
>>>
>>> To split a THP to any lower order pages, we need to reform THPs on
>>> subpages at given order and add page refcount based on the new page
>>> order.
>>> Also we need to reinitialize page_deferred_list after removing
>>> the page from the split_queue, otherwise a subsequent split will see
>>> list corruption when checking the page_deferred_list again.
>>>
>>> It has many uses, like minimizing the number of pages after
>>> truncating a pagecache THP. For anonymous THPs, we can only split them
>>> to order-0 like before until we add support for any size anonymous THPs.
>>>
>>> Signed-off-by: Zi Yan
>>> ---
>>>  include/linux/huge_mm.h |   8 +++
>>>  mm/huge_memory.c        | 111 ++++++++++++++++++++++++++++++----------
>>>  2 files changed, 91 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2999190adc22..c7153cd7e9e4 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -186,6 +186,8 @@ void free_transhuge_page(struct page *page);
>>>
>>>  bool can_split_folio(struct folio *folio, int *pextra_pins);
>>>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> +		unsigned int new_order);
>>>  static inline int split_huge_page(struct page *page)
>>>  {
>>>  	return split_huge_page_to_list(page, NULL);
>>> @@ -355,6 +357,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list)
>>>  {
>>>  	return 0;
>>>  }
>>> +static inline int
>>> +split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> +		unsigned int new_order)
>>> +{
>>> +	return 0;
>>> +}
>>>  static inline int split_huge_page(struct page *page)
>>>  {
>>>  	return 0;
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index fcfa46af6c4c..3617aa3ad0b1 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2236,11 +2236,13 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>>  static void unmap_page(struct page *page)
>>>  {
>>>  	struct folio *folio = page_folio(page);
>>> -	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>>> -		TTU_SYNC;
>>> +	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SYNC;
>>>
>>>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>>>
>>> +	if (folio_order(folio) >= HPAGE_PMD_ORDER)
>>> +		ttu_flags |= TTU_SPLIT_HUGE_PMD;
>>> +
>>>  	/*
>>>  	 * Anon pages need migration entries to preserve them, but file
>>>  	 * pages can simply be left unmapped, then faulted back on demand.
>>> @@ -2254,9 +2256,9 @@ static void unmap_page(struct page *page)
>>>  	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
>>>  }
>>>
>>> -static void remap_page(struct folio *folio, unsigned long nr)
>>> +static void remap_page(struct folio *folio, unsigned short nr)
>>>  {
>>> -	int i = 0;
>>> +	unsigned int i;
>>>
>>>  	/* If unmap_page() uses try_to_migrate() on file, remove this check */
>>>  	if (!folio_test_anon(folio))
>>> @@ -2274,7 +2276,6 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
>>>  		struct lruvec *lruvec, struct list_head *list)
>>>  {
>>>  	VM_BUG_ON_PAGE(!PageHead(head), head);
>>> -	VM_BUG_ON_PAGE(PageCompound(tail), head);
>>>  	VM_BUG_ON_PAGE(PageLRU(tail), head);
>>>  	lockdep_assert_held(&lruvec->lru_lock);
>>>
>>> @@ -2295,9 +2296,10 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
>>>  }
>>>
>>>  static void __split_huge_page_tail(struct page *head, int tail,
>>> -		struct lruvec *lruvec, struct list_head *list)
>>> +		struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
>>>  {
>>>  	struct page *page_tail = head + tail;
>>> +	unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>>>
>>>  	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
>>>
>>> @@ -2321,6 +2323,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>>  #ifdef CONFIG_64BIT
>>>  			 (1L << PG_arch_2) |
>>>  #endif
>>> +			 compound_head_flag |
>>>  			 (1L << PG_dirty)));
>>>
>>>  	/* ->mapping in first tail page is compound_mapcount */
>>> @@ -2329,7 +2332,10 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>>  	page_tail->mapping = head->mapping;
>>>  	page_tail->index = head->index + tail;
>>>
>>> -	/* Page flags must be visible before we make the page non-compound. */
>>> +	/*
>>> +	 * Page flags must be visible before we make the page non-compound or
>>> +	 * a compound page in new_order.
>>> +	 */
>>>  	smp_wmb();
>>>
>>>  	/*
>>> @@ -2339,10 +2345,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>>  	 * which needs correct compound_head().
>>>  	 */
>>>  	clear_compound_head(page_tail);
>>> +	if (new_order) {
>>> +		prep_compound_page(page_tail, new_order);
>>> +		prep_transhuge_page(page_tail);
>>> +	}
>>
>> Many thanks for your series. It looks really good. One question:
>> IIUC, there seems to be an assumption that LRU compound pages are
>> always PageTransHuge, and PageTransHuge just checks PageHead:
>>
>> static inline int PageTransHuge(struct page *page)
>> {
>> 	VM_BUG_ON_PAGE(PageTail(page), page);
>> 	return PageHead(page);
>> }
>>
>> So LRU pages of any order (> 0) might be wrongly treated as THPs of
>> order HPAGE_PMD_ORDER. We should ensure thp_nr_pages() is used instead
>> of the hard-coded HPAGE_PMD_ORDER.
>>
>> Look at the code snippet below:
>> mm/mempolicy.c:
>> static struct page *new_page(struct page *page, unsigned long start)
>> {
>> 	...
>> 	} else if (PageTransHuge(page)) {
>> 		struct page *thp;
>>
>> 		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
>> 					 HPAGE_PMD_ORDER);
>> 					 ^^^^^^^^^^^^^^^^
>> 		if (!thp)
>> 			return NULL;
>> 		prep_transhuge_page(thp);
>> 		return thp;
>> 	}
>> 	...
>> }
>>
>> HPAGE_PMD_ORDER is used here instead of a size derived from thp_nr_pages(),
>> so lower-order pages might be handled as if their order were
>> HPAGE_PMD_ORDER. All such usages might need to be fixed.
>> Or am I missing something?
>>
>> Thanks again for your work. :)
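(To make the suggestion above concrete: the adjustment to new_page() would
look roughly like the fragment below. This is an untested illustration, not
part of the posted series; compound_order() is used here as the order-based
counterpart of the thp_nr_pages() sizing mentioned above.)

-		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
-					 HPAGE_PMD_ORDER);
+		/* allocate at the source page's actual order, not PMD order */
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 compound_order(page));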
>
> THP will still only have HPAGE_PMD_ORDER and will not be split into any
> order other than 0. This series only allows splitting a huge page cache
> folio (added by Matthew) into any lower order. I have an explicit
> VM_BUG_ON() to ensure new_order is only 0 when a non-page-cache page is
> the input, since there is still a non-trivial amount of work left to add
> any-order THP support to the kernel. IIRC, Yu Zhao (cc'd) was planning
> to work on that.
>

Many thanks for clarifying. I'm sorry, but I haven't followed Matthew's
patches. I am wondering: can a huge page cache folio be treated as a THP?
If so, how do we ensure the correctness of the huge page cache?

Thanks again!

> Thanks for checking the patches.

BTW: I like your patches. They're really interesting. :)
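(As a closing illustration of how the new interface from the patch
description is meant to be called: a hypothetical, untested caller sketch,
not from this series. Everything other than split_huge_page_to_list_to_order()
is an existing kernel helper.)

	/*
	 * Hypothetical caller: after a partial truncate, split a pagecache
	 * THP down to order-2 folios instead of all the way to base pages.
	 * Anonymous THPs must still be split to order 0, per the VM_BUG_ON()
	 * mentioned above.  The folio is assumed to be locked, as
	 * split_huge_page() already requires.
	 */
	int err;

	if (folio_test_large(folio) && !folio_test_anon(folio))
		err = split_huge_page_to_list_to_order(&folio->page, NULL, 2);
	else
		err = split_huge_page(&folio->page);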