Re: [PATCH 3/5] mm: thp: split huge page to any lower order pages.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Zi Yan <ziy@nvidia.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yang Shi <shy828301@gmail.com>, Yu Zhao <yuzhao@google.com>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH 3/5] mm: thp: split huge page to any lower order pages.
Date: Wed, 22 Mar 2023 10:27:59 -0400	[thread overview]
Message-ID: <A75E3B03-C00C-45B5-8D8E-688F091B1F4D@nvidia.com> (raw)
In-Reply-To: <e8352d76-29e4-5e4a-063b-60e279e0b070@arm.com>

[-- Attachment #1: Type: text/plain, Size: 8957 bytes --]

On 22 Mar 2023, at 3:55, Ryan Roberts wrote:

> Hi,
>
> I'm working to enable large, variable-order folios for anonymous memory (see
> RFC, replete with bugs at [1]). This patch set is going to be very useful to me.
> But I have a few questions that I wonder if you can answer, below? I wonder if
> they might relate to the bugs I'm seeing at [1].
>
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>
>
>
> On 21/03/2023 00:48, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> To split a THP to any lower order pages, we need to reform THPs on
>> subpages at given order and add page refcount based on the new page
>> order. Also we need to reinitialize page_deferred_list after removing
>> the page from the split_queue, otherwise a subsequent split will see
>> list corruption when checking the page_deferred_list again.
>>
>> It has many uses, like minimizing the number of pages after
>> truncating a huge pagecache page. For anonymous THPs, we can only split
>> them to order-0 like before until we add support for any size anonymous
>> THPs.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  include/linux/huge_mm.h |  10 ++--
>>  mm/huge_memory.c        | 103 +++++++++++++++++++++++++++++-----------
>>  2 files changed, 82 insertions(+), 31 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 20284387b841..32c91e1b59cd 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -147,10 +147,11 @@ void prep_transhuge_page(struct page *page);
>>  void free_transhuge_page(struct page *page);
>>
>>  bool can_split_folio(struct folio *folio, int *pextra_pins);
>> -int split_huge_page_to_list(struct page *page, struct list_head *list);
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order);
>>  static inline int split_huge_page(struct page *page)
>>  {
>> -	return split_huge_page_to_list(page, NULL);
>> +	return split_huge_page_to_list_to_order(page, NULL, 0);
>>  }
>>  void deferred_split_folio(struct folio *folio);
>>
>> @@ -297,7 +298,8 @@ can_split_folio(struct folio *folio, int *pextra_pins)
>>  	return false;
>>  }
>>  static inline int
>> -split_huge_page_to_list(struct page *page, struct list_head *list)
>> +split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order)
>>  {
>>  	return 0;
>>  }
>> @@ -397,7 +399,7 @@ static inline bool thp_migration_supported(void)
>>  static inline int split_folio_to_list(struct folio *folio,
>>  		struct list_head *list)
>>  {
>> -	return split_huge_page_to_list(&folio->page, list);
>> +	return split_huge_page_to_list_to_order(&folio->page, list, 0);
>>  }
>>
>>  static inline int split_folio(struct folio *folio)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 710189885402..f119b9be33f2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2359,11 +2359,13 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>
>>  static void unmap_folio(struct folio *folio)
>>  {
>> -	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>> -		TTU_SYNC;
>> +	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SYNC;
>>
>>  	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>
>> +	if (folio_order(folio) >= HPAGE_PMD_ORDER)
>> +		ttu_flags |= TTU_SPLIT_HUGE_PMD;
>> +
>
> Why have you changed the code so that this flag is added conditionally on the
> folio being large enough? I've previously looked at this in the context of my
> bug, and concluded that the consumer would ignore the flag if the folio wasn't
> PMD mapped. Did I conclude incorrectly?

Since if folio order is not larger than PMD order, there is no way of mapping
a PMD to the folio. Thus, TTU_SPLIT_HUGE_PMD does not make sense. Yes, the consumer
will not split any PMD, but will still do page table locks and mmu notifier
work, which cost unnecessary overheads.

I think I better change the if condition to folio_test_pmd_mappable().

>
>
>>  	/*
>>  	 * Anon pages need migration entries to preserve them, but file
>>  	 * pages can simply be left unmapped, then faulted back on demand.
>> @@ -2395,7 +2397,6 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
>>  		struct lruvec *lruvec, struct list_head *list)
>>  {
>>  	VM_BUG_ON_PAGE(!PageHead(head), head);
>> -	VM_BUG_ON_PAGE(PageCompound(tail), head);
>>  	VM_BUG_ON_PAGE(PageLRU(tail), head);
>>  	lockdep_assert_held(&lruvec->lru_lock);
>>
>> @@ -2416,9 +2417,10 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
>>  }
>>
>
> [...]
>
>> -int split_huge_page_to_list(struct page *page, struct list_head *list)
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +				     unsigned int new_order)
>>  {
>>  	struct folio *folio = page_folio(page);
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> -	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> +	/* reset xarray order to new order after split */
>> +	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
>>  	struct anon_vma *anon_vma = NULL;
>>  	struct address_space *mapping = NULL;
>>  	int extra_pins, ret;
>> @@ -2649,6 +2676,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>  	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>
>> +	/* Cannot split THP to order-1 (no order-1 THPs) */
>> +	if (new_order == 1) {
>> +		VM_WARN_ONCE(1, "Cannot split to order-1 folio");
>> +		return -EINVAL;
>> +	}
>
> Why can't you split to order-1? I vaguely understand that some data is kept in
> the first 3 struct pages, but I would naively expect the allocator to fail to
> allocate compound pages of order-1 if it was a problem? My large anon folios
> patch is currently allocating order-1 in some circumstances. Perhaps its related
> to my bug?
>

Yes, some data is kept in first 3 struct pages, so order-1 THP is not possible.
The page allocator does not know this restriction, but still allocate an order-1
page. That might be related to your bug. You can have order-1 compound pages,
but it does not mean you can use them for THPs. AFAIK, slab uses order-1 compound
pages, but it does not store slab information on the 3rd struct page.

Basically, page allocator can allocate an order-N page, and it can be:
1. 2^N consecutive physical pages (not a compound page),
2. an order-N compound page,
3. an order-N THP (also an order-N compound page),
4. an order-N hugetlb page (also an order-N compound page).

For THP and hugetlb page, there are prep_transhuge_page() and
prep_new_hugetlb_folio() are called respectively after the page is allocated.
That makes them kinda subclasses of a compound page.

>
>> +
>> +	/* Split anonymous folio to non-zero order not support */
>> +	if (folio_test_anon(folio) && new_order) {
>> +		VM_WARN_ONCE(1, "Split anon folio to non-0 order not support");
>> +		return -EINVAL;
>> +	}
>
> Why don't you support this? What is special about anon folios that means this
> code doesn't work for them?

split_huge_page() code can split to non-0 order anon folios, but the rest of
the mm code might not have proper support yet.
That is why we need your patchset. :)

>
>
>> +
>>  	is_hzp = is_huge_zero_page(&folio->page);
>>  	VM_WARN_ON_ONCE_FOLIO(is_hzp, folio);
>>  	if (is_hzp)
>> @@ -2744,7 +2783,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  	if (folio_ref_freeze(folio, 1 + extra_pins)) {
>>  		if (!list_empty(&folio->_deferred_list)) {
>>  			ds_queue->split_queue_len--;
>> -			list_del(&folio->_deferred_list);
>> +			/*
>> +			 * Reinitialize page_deferred_list after removing the
>> +			 * page from the split_queue, otherwise a subsequent
>> +			 * split will see list corruption when checking the
>> +			 * page_deferred_list.
>> +			 */
>> +			list_del_init(&folio->_deferred_list);
>>  		}
>>  		spin_unlock(&ds_queue->split_queue_lock);
>>  		if (mapping) {
>> @@ -2754,14 +2799,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  			if (folio_test_swapbacked(folio)) {
>>  				__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS,
>>  							-nr);
>> -			} else {
>> +			} else if (!new_order) {
>> +				/*
>> +				 * Decrease THP stats only if split to normal
>> +				 * pages
>> +				 */
>>  				__lruvec_stat_mod_folio(folio, NR_FILE_THPS,
>>  							-nr);
>>  				filemap_nr_thps_dec(mapping);
>>  			}
>>  		}
>>
>> -		__split_huge_page(page, list, end);
>> +		__split_huge_page(page, list, end, new_order);
>>  		ret = 0;
>>  	} else {
>>  		spin_unlock(&ds_queue->split_queue_lock);

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

next prev parent reply	other threads:[~2023-03-22 14:28 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-21  0:48 [PATCH 0/5] Split a folio to any lower order folios Zi Yan
2023-03-21  0:48 ` [PATCH 1/5] mm: memcg: make memcg huge page split support any order split Zi Yan
2023-03-21  0:48 ` [PATCH 2/5] mm: page_owner: add support for splitting to any order in split page_owner Zi Yan
2023-03-24 15:17   ` Michal Koutný
2023-03-24 15:22     ` Zi Yan
2023-03-21  0:48 ` [PATCH 3/5] mm: thp: split huge page to any lower order pages Zi Yan
2023-03-22  7:55   ` Ryan Roberts
2023-03-22 14:27     ` Zi Yan [this message]
2023-03-22 14:48       ` Ryan Roberts
2023-03-21  0:48 ` [PATCH 4/5] mm: truncate: split huge page cache page to a non-zero order if possible Zi Yan
2023-03-21  0:48 ` [PATCH 5/5] mm: huge_memory: enable debugfs to split huge pages to any order Zi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=A75E3B03-C00C-45B5-8D8E-688F091B1F4D@nvidia.com \
    --to=ziy@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ryan.roberts@arm.com \
    --cc=shy828301@gmail.com \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox