* [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
@ 2024-04-25 21:11 Zi Yan
From: Zi Yan @ 2024-04-25 21:11 UTC
To: Andrew Morton, linux-mm
Cc: Zi Yan, Matthew Wilcox (Oracle),
Yang Shi, Ryan Roberts, Barry Song, David Hildenbrand,
Lance Yang, linux-kernel
From: Zi Yan <ziy@nvidia.com>
In __folio_remove_rmap(), a large folio is added to deferred split list
if any page in a folio loses its final mapping. But it is possible that
the folio is fully unmapped and adding it to deferred split list is
unnecessary.
For PMD-mapped THPs, that was not really an issue, because removing the
last PMD mapping in the absence of PTE mappings would not have added the
folio to the deferred split queue.
However, PTE-mapped THPs, which are now more prominent due to mTHP,
are always added to the deferred split queue. One side effect
is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
unintentionally increased, making it look like there are many partially
mapped folios -- although the whole folio is fully unmapped stepwise.
Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
folio is unmapped in one go and can avoid being added to deferred split
list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
noise when we cannot batch-unmap a complete PTE-mapped folio in one go
-- or where this type of batching is not implemented yet, e.g., migration.
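For illustration, the difference between the two paths looks roughly like
this (a sketch, not part of the patch; the helpers are the ones introduced
by the commit above, with folio, page, vma, and i assumed from the caller's
context):

	/* batched: all PTEs of the folio are removed in one call, so rmap
	 * sees the folio go from fully mapped to fully unmapped at once */
	folio_remove_rmap_ptes(folio, page, folio_nr_pages(folio), vma);

	/* unbatched (e.g., migration today): one PTE at a time, so the
	 * folio transiently looks partially mapped between the calls */
	for (i = 0; i < folio_nr_pages(folio); i++)
		folio_remove_rmap_pte(folio, page + i, vma);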
To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
to tell if the whole folio is unmapped. If the folio is already on
deferred split list, it will be skipped, too.
Note: commit 98046944a159 ("mm: huge_memory: add the missing
folio_test_pmd_mappable() for THP split statistics") tried to exclude
mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
since nr is 512 (non-zero), level is RMAP_LEVEL_PTE, and inside
deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
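Spelled out with those values (an illustrative evaluation, not code from the
patch; mapped points at folio->_nr_pages_mapped, which reads 0 once the folio
is fully unmapped):

	/* fully unmapped PTE-mapped order-9 THP:
	 * level == RMAP_LEVEL_PTE, nr == 512, nr_pmdmapped == 0, mapped == 0 */

	/* old check: */
	level == RMAP_LEVEL_PTE || nr < nr_pmdmapped
		/* true || (512 < 0)             -> true  -> queued + counted */

	/* new check: */
	(level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
	(level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)
		/* (true && 0) || (false && ...) -> false -> skipped */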
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
mm/rmap.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index a7913a454028..220ad8a83589 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 	 * page of the folio is unmapped and at least one page
 	 * is still mapped.
 	 */
-	if (folio_test_large(folio) && folio_test_anon(folio))
-		if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
-			deferred_split_folio(folio);
+	if (folio_test_large(folio) && folio_test_anon(folio) &&
+	    list_empty(&folio->_deferred_list) &&
+	    ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
+	     (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))
+		deferred_split_folio(folio);
 }
 
 /*
base-commit: 66313c66dd90e8711a8b63fc047ddfc69c53636a
--
2.43.0
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Barry Song @ 2024-04-26  1:45 UTC
  To: Zi Yan
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, David Hildenbrand, Lance Yang, linux-kernel

On Fri, Apr 26, 2024 at 5:11 AM Zi Yan <zi.yan@sent.com> wrote:
>
> From: Zi Yan <ziy@nvidia.com>
>
> In __folio_remove_rmap(), a large folio is added to deferred split list
> if any page in a folio loses its final mapping. But it is possible that
> the folio is fully unmapped and adding it to deferred split list is
> unnecessary.
[...]
> +	if (folio_test_large(folio) && folio_test_anon(folio) &&
> +	    list_empty(&folio->_deferred_list) &&
> +	    ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
> +	     (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))
> +		deferred_split_folio(folio);

Hi Zi Yan,

In case an mTHP is mapped by two processes (forked but not CoWed yet), and
we unmap the whole folio at PTE level in one process only, are we still
adding this folio to the deferred split list?

Thanks
Barry
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Zi Yan @ 2024-04-26  1:55 UTC
  To: Barry Song
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, David Hildenbrand, Lance Yang, linux-kernel

On 25 Apr 2024, at 21:45, Barry Song wrote:
[...]
> Hi Zi Yan,
>
> In case an mTHP is mapped by two processes (forked but not CoWed yet), and
> we unmap the whole folio at PTE level in one process only, are we still
> adding this folio to the deferred split list?

No, because the mTHP is still fully mapped by the other process. In terms of
code, nr will be 0 in that case and this if condition is skipped. nr is only
increased from 0 when one of the subpages in the mTHP has no mapping left,
namely when page->_mapcount becomes negative and last is true in the
RMAP_LEVEL_PTE case.

--
Best Regards,
Yan, Zi
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Barry Song @ 2024-04-26  2:23 UTC
  To: Zi Yan
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, David Hildenbrand, Lance Yang, linux-kernel

On Fri, Apr 26, 2024 at 9:55 AM Zi Yan <ziy@nvidia.com> wrote:
[...]
> No, because the mTHP is still fully mapped by the other process. In terms of
> code, nr will be 0 in that case and this if condition is skipped. nr is only
> increased from 0 when one of the subpages in the mTHP has no mapping left,
> namely when page->_mapcount becomes negative and last is true in the
> RMAP_LEVEL_PTE case.

OK, I see -- so "last" won't be true?

	case RMAP_LEVEL_PTE:
		do {
			last = atomic_add_negative(-1, &page->_mapcount);
			if (last && folio_test_large(folio)) {
				last = atomic_dec_return_relaxed(mapped);
				last = (last < ENTIRELY_MAPPED);
			}

			if (last)
				nr++;
		} while (page++, --nr_pages > 0);
		break;
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Zi Yan @ 2024-04-26  2:50 UTC
  To: Barry Song
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, David Hildenbrand, Lance Yang, linux-kernel

On 25 Apr 2024, at 22:23, Barry Song wrote:
[...]
> OK, I see -- so "last" won't be true?
>
> 	case RMAP_LEVEL_PTE:
> 		do {
> 			last = atomic_add_negative(-1, &page->_mapcount);
> 			if (last && folio_test_large(folio)) {
> 				last = atomic_dec_return_relaxed(mapped);
> 				last = (last < ENTIRELY_MAPPED);
> 			}
[...]

Right, because for every subpage its corresponding
last = atomic_add_negative(-1, &page->_mapcount); is not true after the
unmapping.

--
Best Regards,
Yan, Zi
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Barry Song @ 2024-04-26  3:28 UTC
  To: Zi Yan
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, David Hildenbrand, Lance Yang, linux-kernel

On Fri, Apr 26, 2024 at 10:50 AM Zi Yan <ziy@nvidia.com> wrote:
[...]
> Right, because for every subpage its corresponding
> last = atomic_add_negative(-1, &page->_mapcount); is not true after the
> unmapping.

If an mTHP is mapped only by one process and we unmap it entirely, we will
get nr > 0 -- so we would then be adding it to the deferred list? It seems
atomic_read(mapped) is what prevents this case from being added to the
deferred list.

I wonder if it is possible to fix up nr to 0 in the first place, for example:

	/* we are doing an entire unmapping */
	if (page == &folio->page && nr_pages == folio_nr_pages(folio))
		...
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Barry Song @ 2024-04-26  3:36 UTC
  To: Zi Yan
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, David Hildenbrand, Lance Yang, linux-kernel

On Fri, Apr 26, 2024 at 11:28 AM Barry Song <21cnbao@gmail.com> wrote:
[...]
> I wonder if it is possible to fix up nr to 0 in the first place, for example:
>
> 	/* we are doing an entire unmapping */
> 	if (page == &folio->page && nr_pages == folio_nr_pages(folio))
> 		...

Or maybe:

	case RMAP_LEVEL_PTE:
		...
+		if (!atomic_read(mapped))
+			nr = 0;
		break;
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Zi Yan @ 2024-04-26  3:37 UTC
  To: Barry Song
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, David Hildenbrand, Lance Yang, linux-kernel

On 25 Apr 2024, at 23:28, Barry Song wrote:
[...]
> If an mTHP is mapped only by one process and we unmap it entirely, we will
> get nr > 0 -- so we would then be adding it to the deferred list? It seems
> atomic_read(mapped) is what prevents this case from being added to the
> deferred list.

Yes, that is what this patch is doing. When an mTHP is mapped by one process
and later unmapped fully, there is no need to add it to the deferred list;
the mTHP will be freed right afterwards.

--
Best Regards,
Yan, Zi
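Both cases in this exchange can be reproduced with a small user-space model
of the counters (a sketch only: plain ints stand in for the kernel's atomics,
the folio_test_large() test is taken as given since the folio is large, and
the ENTIRELY_MAPPED value is assumed from mm/internal.h of this era):

	#include <stdio.h>

	#define NR_PAGES	4		/* an order-2 mTHP */
	#define ENTIRELY_MAPPED	0x800000	/* set in _nr_pages_mapped while PMD-mapped */

	/* Models the RMAP_LEVEL_PTE loop quoted earlier; page->_mapcount is
	 * biased by -1: -1 means unmapped, 0 means mapped exactly once. */
	static int remove_rmap_ptes(int mapcount[], int *mapped)
	{
		int nr = 0;

		for (int i = 0; i < NR_PAGES; i++) {
			/* atomic_add_negative(-1, &page->_mapcount) */
			int last = (--mapcount[i] < 0);

			if (last)	/* atomic_dec_return_relaxed(mapped) */
				last = (--(*mapped) < ENTIRELY_MAPPED);
			if (last)
				nr++;
		}
		return nr;
	}

	int main(void)
	{
		/* case 1: PTE-mapped by parent and child after fork() */
		int mc_forked[NR_PAGES] = { 1, 1, 1, 1 };	/* each page mapped twice */
		int mapped_forked = NR_PAGES;			/* _nr_pages_mapped */

		/* case 2: PTE-mapped by a single process */
		int mc_single[NR_PAGES] = { 0, 0, 0, 0 };	/* each page mapped once */
		int mapped_single = NR_PAGES;

		printf("forked: nr=%d\n", remove_rmap_ptes(mc_forked, &mapped_forked));
		printf("single: nr=%d mapped=%d\n",
		       remove_rmap_ptes(mc_single, &mapped_single), mapped_single);
		return 0;
	}

Running it prints nr=0 for the forked case (the deferred-split branch is
never reached) and nr=4 with mapped=0 for the single-process case (where
atomic_read(mapped) gates the folio out), matching the two answers above.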
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Barry Song @ 2024-04-26  3:44 UTC
  To: Zi Yan
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, David Hildenbrand, Lance Yang, linux-kernel

On Fri, Apr 26, 2024 at 11:37 AM Zi Yan <ziy@nvidia.com> wrote:
[...]
> Yes, that is what this patch is doing. When an mTHP is mapped by one process
> and later unmapped fully, there is no need to add it to the deferred list;
> the mTHP will be freed right afterwards.

Thanks, I understand now. I feel that fixing up nr earlier could make the
code more readable:

	case RMAP_LEVEL_PTE:
		...
+		if (!atomic_read(mapped))
+			nr = 0;
		break;

as I have been struggling for a long time to understand code with so many
conditions in one "if" :-)

+	if (folio_test_large(folio) && folio_test_anon(folio) &&
+	    list_empty(&folio->_deferred_list) &&
+	    ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
+	     (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))
+		deferred_split_folio(folio);
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Lance Yang @ 2024-04-26  3:45 UTC
  To: Zi Yan
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, Barry Song, David Hildenbrand, linux-kernel

On Fri, Apr 26, 2024 at 5:11 AM Zi Yan <zi.yan@sent.com> wrote:
[...]
> +	if (folio_test_large(folio) && folio_test_anon(folio) &&
> +	    list_empty(&folio->_deferred_list) &&
> +	    ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
> +	     (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))
> +		deferred_split_folio(folio);

FWIW, perhaps the following would achieve the same check, ensuring that at
least one page of the folio is unmapped while at least one page remains
mapped:

+	    atomic_read(mapped) && nr < folio_nr_pages(folio))
-	    ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
-	     (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))

Thanks,
Lance
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: Lance Yang @ 2024-04-26  5:36 UTC
  To: Zi Yan
  Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
      Ryan Roberts, Barry Song, David Hildenbrand, linux-kernel

On Fri, Apr 26, 2024 at 11:45 AM Lance Yang <ioworker0@gmail.com> wrote:
[...]
> FWIW, perhaps the following would achieve the same check, ensuring that at
> least one page of the folio is unmapped while at least one page remains
> mapped:
>
> +	    atomic_read(mapped) && nr < folio_nr_pages(folio))
> -	    ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
> -	     (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))

On second thought, it's probably best to leave it as is. The compiler should
optimize based on the level enum, which is what I overlooked.

Thanks,
Lance
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
  From: David Hildenbrand @ 2024-04-26  8:19 UTC
  To: Zi Yan, Andrew Morton, linux-mm
  Cc: Matthew Wilcox (Oracle), Yang Shi, Ryan Roberts, Barry Song,
      Lance Yang, linux-kernel

On 25.04.24 23:11, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> In __folio_remove_rmap(), a large folio is added to deferred split list
> if any page in a folio loses its final mapping. But it is possible that
> the folio is fully unmapped and adding it to deferred split list is
> unnecessary.
[...]

Reviewed-by: David Hildenbrand <david@redhat.com>

But maybe we can really improve the code:

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffade..e310b6c4221d7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 {
 	atomic_t *mapped = &folio->_nr_pages_mapped;
 	int last, nr = 0, nr_pmdmapped = 0;
+	bool partially_mapped = false;
 	enum node_stat_item idx;
 
 	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 				nr++;
 			}
 		} while (page++, --nr_pages > 0);
+
+		partially_mapped = nr && atomic_read(mapped);
 		break;
 	case RMAP_LEVEL_PMD:
 		atomic_dec(&folio->_large_mapcount);
@@ -1532,6 +1535,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 				nr = 0;
 			}
 		}
+		partially_mapped = nr < nr_pmdmapped;
 		break;
 	}
 
@@ -1553,9 +1557,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 	 * page of the folio is unmapped and at least one page
 	 * is still mapped.
 	 */
-	if (folio_test_large(folio) && folio_test_anon(folio))
-		if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
-			deferred_split_folio(folio);
+	if (folio_test_large(folio) && folio_test_anon(folio) &&
+	    list_empty(&folio->_deferred_list) && partially_mapped)
+		deferred_split_folio(folio);
 }
 
 /*

The compiler should be smart enough to optimize it all -- most likely ;)

--
Cheers,

David / dhildenb
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
2024-04-26 8:19 ` David Hildenbrand
@ 2024-04-26 8:26 ` David Hildenbrand
2024-04-26 9:33 ` Lance Yang
2024-04-26 18:41 ` Yang Shi
1 sibling, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2024-04-26 8:26 UTC (permalink / raw)
To: Zi Yan, Andrew Morton, linux-mm
Cc: Matthew Wilcox (Oracle), Yang Shi, Ryan Roberts, Barry Song,
Lance Yang, linux-kernel

> @@ -1553,9 +1557,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  	 * page of the folio is unmapped and at least one page
>  	 * is still mapped.
>  	 */
> -	if (folio_test_large(folio) && folio_test_anon(folio))
> -		if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -			deferred_split_folio(folio);
> +	if (folio_test_large(folio) && folio_test_anon(folio) &&
> +	    list_empty(&folio->_deferred_list) && partially_mapped)
> +		deferred_split_folio(folio);

And now I realize that we can then even drop the folio_test_large(folio)
check here!

--
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 17+ messages in thread
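Dropping folio_test_large() works because partially_mapped starts out false
and is only ever assigned on the large-folio paths, so it can never be true
for a small folio. One subtlety for the respin: folio->_deferred_list is
only a valid field on large folios, so it is cleaner to test
partially_mapped first and let && short-circuit before list_empty() is ever
reached for a small folio. A sketch of how the check could then read
(hypothetical v5 form, assuming the partially_mapped flag from the previous
mail is kept; not necessarily the committed code):

	/*
	 * The flag can only be true for large anon folios, so testing it
	 * first makes the explicit folio_test_large() check redundant and
	 * keeps _deferred_list from being inspected on small folios.
	 */
	if (partially_mapped && folio_test_anon(folio) &&
	    list_empty(&folio->_deferred_list))
		deferred_split_folio(folio);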
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
2024-04-26 8:26 ` David Hildenbrand
@ 2024-04-26 9:33 ` Lance Yang
2024-04-26 18:41 ` Yang Shi
1 sibling, 0 replies; 17+ messages in thread
From: Lance Yang @ 2024-04-26 9:33 UTC (permalink / raw)
To: David Hildenbrand
Cc: Zi Yan, Andrew Morton, linux-mm, Matthew Wilcox (Oracle),
Yang Shi, Ryan Roberts, Barry Song, linux-kernel

On Fri, Apr 26, 2024 at 4:26 PM David Hildenbrand <david@redhat.com> wrote:
>
> > @@ -1553,9 +1557,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >  	 * page of the folio is unmapped and at least one page
> >  	 * is still mapped.
> >  	 */
> > -	if (folio_test_large(folio) && folio_test_anon(folio))
> > -		if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> > -			deferred_split_folio(folio);
> > +	if (folio_test_large(folio) && folio_test_anon(folio) &&
> > +	    list_empty(&folio->_deferred_list) && partially_mapped)
> > +		deferred_split_folio(folio);
>
> And now I realize that we can then even drop the folio_test_large(folio)
> check here!

+1

Thanks,
Lance

> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
2024-04-26 8:26 ` David Hildenbrand
2024-04-26 9:33 ` Lance Yang
@ 2024-04-26 18:41 ` Yang Shi
1 sibling, 0 replies; 17+ messages in thread
From: Yang Shi @ 2024-04-26 18:41 UTC (permalink / raw)
To: David Hildenbrand
Cc: Zi Yan, Andrew Morton, linux-mm, Matthew Wilcox (Oracle),
Ryan Roberts, Barry Song, Lance Yang, linux-kernel

On Fri, Apr 26, 2024 at 1:26 AM David Hildenbrand <david@redhat.com> wrote:
>
> > @@ -1553,9 +1557,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >  	 * page of the folio is unmapped and at least one page
> >  	 * is still mapped.
> >  	 */
> > -	if (folio_test_large(folio) && folio_test_anon(folio))
> > -		if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> > -			deferred_split_folio(folio);
> > +	if (folio_test_large(folio) && folio_test_anon(folio) &&
> > +	    list_empty(&folio->_deferred_list) && partially_mapped)
> > +		deferred_split_folio(folio);
>
> And now I realize that we can then even drop the folio_test_large(folio)
> check here!

Good idea. This is more understandable.

> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
2024-04-26 8:19 ` David Hildenbrand
2024-04-26 8:26 ` David Hildenbrand
@ 2024-04-26 9:46 ` Barry Song
2024-04-26 13:11 ` Zi Yan
2 siblings, 0 replies; 17+ messages in thread
From: Barry Song @ 2024-04-26 9:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: Zi Yan, Andrew Morton, linux-mm, Matthew Wilcox (Oracle),
Yang Shi, Ryan Roberts, Lance Yang, linux-kernel

On Fri, Apr 26, 2024 at 4:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 25.04.24 23:11, Zi Yan wrote:
> > From: Zi Yan <ziy@nvidia.com>
> >
> > In __folio_remove_rmap(), a large folio is added to deferred split list
> > if any page in a folio loses its final mapping. But it is possible that
> > the folio is fully unmapped and adding it to deferred split list is
> > unnecessary.
> >
> > [...]
>
> Reviewed-by: David Hildenbrand <david@redhat.com>
>
> But maybe we can really improve the code:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffade..e310b6c4221d7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  {
>  	atomic_t *mapped = &folio->_nr_pages_mapped;
>  	int last, nr = 0, nr_pmdmapped = 0;
> +	bool partially_mapped = false;
>  	enum node_stat_item idx;
>
>  	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
> @@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  				nr++;
>  			}
>  		} while (page++, --nr_pages > 0);
> +
> +		partially_mapped = nr && atomic_read(mapped);

nice!

>  		break;
>  	case RMAP_LEVEL_PMD:
>  		atomic_dec(&folio->_large_mapcount);
> @@ -1532,6 +1535,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  				nr = 0;
>  			}
>  		}
> +		partially_mapped = nr < nr_pmdmapped;
>  		break;
>  	}
>
> @@ -1553,9 +1557,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  	 * page of the folio is unmapped and at least one page
>  	 * is still mapped.
>  	 */
> -	if (folio_test_large(folio) && folio_test_anon(folio))
> -		if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -			deferred_split_folio(folio);
> +	if (folio_test_large(folio) && folio_test_anon(folio) &&
> +	    list_empty(&folio->_deferred_list) && partially_mapped)
> +		deferred_split_folio(folio);
>  }
>
>  /*
>
> The compiler should be smart enough to optimize it all -- most likely ;)
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list
2024-04-26 8:19 ` David Hildenbrand
2024-04-26 8:26 ` David Hildenbrand
2024-04-26 9:46 ` Barry Song
@ 2024-04-26 13:11 ` Zi Yan
2 siblings, 0 replies; 17+ messages in thread
From: Zi Yan @ 2024-04-26 13:11 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, Matthew Wilcox (Oracle), Yang Shi,
Ryan Roberts, Barry Song, Lance Yang, linux-kernel

On 26 Apr 2024, at 4:19, David Hildenbrand wrote:

> On 25.04.24 23:11, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> In __folio_remove_rmap(), a large folio is added to deferred split list
>> if any page in a folio loses its final mapping. But it is possible that
>> the folio is fully unmapped and adding it to deferred split list is
>> unnecessary.
>>
>> [...]
>
> Reviewed-by: David Hildenbrand <david@redhat.com>
>
> But maybe we can really improve the code:
>
> [...]
>
> @@ -1553,9 +1557,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  	 * page of the folio is unmapped and at least one page
>  	 * is still mapped.
>  	 */
> -	if (folio_test_large(folio) && folio_test_anon(folio))
> -		if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -			deferred_split_folio(folio);
> +	if (folio_test_large(folio) && folio_test_anon(folio) &&
> +	    list_empty(&folio->_deferred_list) && partially_mapped)
> +		deferred_split_folio(folio);
>
> The compiler should be smart enough to optimize it all -- most likely ;)

Sure. Let me send a new one using your changes, with folio_test_large(folio)
dropped like you said. Yours is easier to understand. Thank you for helping.

--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-04-26 18:41 UTC | newest]

Thread overview: 17+ messages
2024-04-25 21:11 [PATCH v4] mm/rmap: do not add fully unmapped large folio to deferred split list Zi Yan
2024-04-26 1:45 ` Barry Song
2024-04-26 1:55 ` Zi Yan
2024-04-26 2:23 ` Barry Song
2024-04-26 2:50 ` Zi Yan
2024-04-26 3:28 ` Barry Song
2024-04-26 3:36 ` Barry Song
2024-04-26 3:37 ` Zi Yan
2024-04-26 3:44 ` Barry Song
2024-04-26 3:45 ` Lance Yang
2024-04-26 5:36 ` Lance Yang
2024-04-26 8:19 ` David Hildenbrand
2024-04-26 8:26 ` David Hildenbrand
2024-04-26 9:33 ` Lance Yang
2024-04-26 18:41 ` Yang Shi
2024-04-26 9:46 ` Barry Song
2024-04-26 13:11 ` Zi Yan