* Re: Folio mapcount
2023-01-24 18:13 Folio mapcount Matthew Wilcox
@ 2023-01-24 18:35 ` David Hildenbrand
2023-01-24 18:37 ` David Hildenbrand
2023-01-24 18:35 ` Yang Shi
` (3 subsequent siblings)
4 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2023-01-24 18:35 UTC (permalink / raw)
To: Matthew Wilcox, linux-mm
Cc: Vishal Moola, Hugh Dickins, Rik van Riel, Yin, Fengwei
On 24.01.23 19:13, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
>
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
>
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
>
> Examples
> --------
>
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1. After taking a page fault on it in the other page table, its
> mapcount increases to 2.
>
> For a PMD-sized THP naturally aligned, mapcount is 1. Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it still has a reference from only one page table (a different page
> table, but still just one).
>
> Implementation sketch
> ---------------------
>
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table. We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each. We should have a:
>
> do_set_pte_range(vmf, folio, addr, first_page, n);
>
> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page. That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
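A minimal sketch of how that might look (a hedged illustration: the
do_set_pte_range() body below, and the folio_add_file_rmap_range() name
and signature, are assumptions for this discussion, not mainline code):

	/*
	 * Sketch only: assumes vmf->pte points at the PTE for 'addr' and
	 * that all 'nr' pages live in the same page table.
	 */
	static void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
			unsigned long addr, unsigned long first_page,
			unsigned int nr)
	{
		struct vm_area_struct *vma = vmf->vma;
		struct page *page = folio_page(folio, first_page);
		unsigned int i;

		/* One rmap/mapcount update for the whole range ... */
		folio_add_file_rmap_range(folio, first_page, nr, vma);

		/* ... then one PTE per page, all in this page table. */
		for (i = 0; i < nr; i++) {
			pte_t entry = mk_pte(page + i, vma->vm_page_prot);

			set_pte_at(vma->vm_mm, addr + i * PAGE_SIZE,
				   vmf->pte + i, entry);
		}
	}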
>
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn. We'll want a function like
> page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
>
> Concerns
> --------
>
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that). But does this ever happen? I think
> we always unmap the entire folio, like in try_to_unmap_one().
Not sure about file THP, but for anon ... it's very common to partially
MADV_DONTNEED anon THP. Or to have a wild mixture of two (or more) anon
THP fragments after fork() when COW'ing on the PTE-mapped THP ...
>
> I haven't got my head around SetPageAnonExclusive() yet. I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.
I tried hard (very hard!) to make that work but reality caught up. And
the history of why that handling is required goes back to the old days
where we had per-subpage refcounts to then have per-subpage mapcounts to
now have only a single bit to get COW handling right.
There are very (very!) ugly corner cases of partial mremap, partial
MADV_WILLNEED ... some are included in the cow selftest for that reason.
One bit per subpage is certainly "not perfect" but not the end of the
world for now. 512/8 -> 64 bytes for a 2 MiB folio ... For now I would
focus on the mapcount ... that will be a challenge on its own and a
bigger improvement :P
--
Thanks,
David / dhildenb
* Re: Folio mapcount
2023-01-24 18:35 ` David Hildenbrand
@ 2023-01-24 18:37 ` David Hildenbrand
0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-01-24 18:37 UTC (permalink / raw)
To: Matthew Wilcox, linux-mm
Cc: Vishal Moola, Hugh Dickins, Rik van Riel, Yin, Fengwei
On 24.01.23 19:35, David Hildenbrand wrote:
> On 24.01.23 19:13, Matthew Wilcox wrote:
>> Once we get to the part of the folio journey where we have
>> one-pointer-per-page, we can't afford to maintain per-page state.
>> Currently we maintain a per-page mapcount, and that will have to go.
>> We can maintain extra state for a multi-page folio, but it has to be a
>> constant amount of extra state no matter how many pages are in the folio.
>>
>> My proposal is that we maintain a single mapcount per folio, and its
>> definition is the number of (vma, page table) tuples which have a
>> reference to any pages in this folio.
>>
>> I think there's a good performance win and simplification to be had
>> here, so I think it's worth doing for 6.4.
>>
>> Examples
>> --------
>>
>> In the simple and common case where every page in a folio is mapped
>> once by a single vma and single page table, mapcount would be 1 [1].
>> If the folio is mapped across a page table boundary by a single VMA,
>> after we take a page fault on it in one page table, it gets a mapcount
>> of 1. After taking a page fault on it in the other page table, its
>> mapcount increases to 2.
>>
>> For a PMD-sized THP naturally aligned, mapcount is 1. Splitting the
>> PMD into PTEs would not change the mapcount; the folio remains order-9
>> but it still has a reference from only one page table (a different page
>> table, but still just one).
>>
>> Implementation sketch
>> ---------------------
>>
>> When we take a page fault, we can/should map every page in the folio
>> that fits in this VMA and this page table. We do this at present in
>> filemap_map_pages() by looping over each page in the folio and calling
>> do_set_pte() on each. We should have a:
>>
>> do_set_pte_range(vmf, folio, addr, first_page, n);
>>
>> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
>> to pass in (folio, first, n) instead of page. That gives us one call to
>> page_add_*_rmap() per (vma, page table) tuple.
>>
>> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
>> each pfn. We'll want a function like
>> page_vma_mapped_walk_skip_to_end_of_ptable()
>> in order to persuade it to only call us once or twice if the folio
>> is mapped across a page table boundary.
>>
>> Concerns
>> --------
>>
>> We'll have to be careful to always zap all the PTEs for a given (vma,
>> pt) tuple at the same time, otherwise mapcount will get out of sync
>> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
>> but I don't think we can know that). But does this ever happen? I think
>> we always unmap the entire folio, like in try_to_unmap_one().
>
> Not sure about file THP, but for anon ... it's very common to partially
> MADV_DONTNEED anon THP. Or to have a wild mixture of two (or more) anon
> THP fragments after fork() when COW'ing on the PTE-mapped THP ...
>
>>
>> I haven't got my head around SetPageAnonExclusive() yet. I think it can
>> be a per-folio bit, but handling a folio split across two page tables
>> may be tricky.
>
> I tried hard (very hard!) to make that work but reality caught up. And
> the history of why that handling is required goes back to the old days
> where we had per-subpage refcounts to then have per-subpage mapcounts to
> now have only a single bit to get COW handling right.
>
> There are very (very!) ugly corner cases of partial mremap, partial
> MADV_WILLNEED ... some are included in the cow selftest for that reason.
s/MADV_WILLNEED/MADV_DONTFORK/
--
Thanks,
David / dhildenb
* Re: Folio mapcount
2023-01-24 18:13 Folio mapcount Matthew Wilcox
2023-01-24 18:35 ` David Hildenbrand
@ 2023-01-24 18:35 ` Yang Shi
2023-02-02 3:45 ` Mike Kravetz
` (2 subsequent siblings)
4 siblings, 0 replies; 52+ messages in thread
From: Yang Shi @ 2023-01-24 18:35 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Jan 24, 2023 at 10:13 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
>
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
>
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
>
> Examples
> --------
>
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1. After taking a page fault on it in the other page table, its
> mapcount increases to 2.
>
> For a PMD-sized THP naturally aligned, mapcount is 1. Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it still has a reference from only one page table (a different page
> table, but still just one).
>
> Implementation sketch
> ---------------------
>
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table. We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each. We should have a:
>
> do_set_pte_range(vmf, folio, addr, first_page, n);
>
> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page. That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
>
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn. We'll want a function like
> page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
>
> Concerns
> --------
>
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that). But does this ever happen? I think
> we always unmap the entire folio, like in try_to_unmap_one().
Off the top of my head, MADV_DONTNEED may unmap the folio partially
but keep the folio unsplit until some later point, for example, under
memory pressure.
munmap() should be able to unmap a folio partially as well.
>
> I haven't got my head around SetPageAnonExclusive() yet. I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.
>
> Notes
> -----
>
> [1] Ignoring the bias by -1 to let us detect transitions that we care
> about more efficiently; I'm talking about the value returned from
> page_mapcount(), not the value stored in page->_mapcount.
>
>
* Re: Folio mapcount
2023-01-24 18:13 Folio mapcount Matthew Wilcox
2023-01-24 18:35 ` David Hildenbrand
2023-01-24 18:35 ` Yang Shi
@ 2023-02-02 3:45 ` Mike Kravetz
2023-02-02 15:31 ` Matthew Wilcox
2023-02-06 20:34 ` Matthew Wilcox
2023-02-07 16:23 ` Zi Yan
4 siblings, 1 reply; 52+ messages in thread
From: Mike Kravetz @ 2023-02-02 3:45 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On 01/24/23 18:13, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
>
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
Hi Matthew, finally took a look at this. Can you clarify your definition of
'page table' here? I think you are talking about all the entries within
one page table page? Is that correct? It certainly makes sense in this
context.
I have always thought of page table as the entire tree structure starting at
*pgd in the mm_struct. So, I was a bit confused. But, I now see elsewhere
that 'page table' may refer to either.
--
Mike Kravetz
* Re: Folio mapcount
2023-02-02 3:45 ` Mike Kravetz
@ 2023-02-02 15:31 ` Matthew Wilcox
2023-02-07 16:19 ` Zi Yan
0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-02 15:31 UTC (permalink / raw)
To: Mike Kravetz
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Wed, Feb 01, 2023 at 07:45:17PM -0800, Mike Kravetz wrote:
> On 01/24/23 18:13, Matthew Wilcox wrote:
> > Once we get to the part of the folio journey where we have
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go.
> > We can maintain extra state for a multi-page folio, but it has to be a
> > constant amount of extra state no matter how many pages are in the folio.
> >
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
>
> Hi Matthew, finally took a look at this. Can you clarify your definition of
> 'page table' here? I think you are talking about all the entries within
> one page table page? Is that correct? It certainly makes sense in this
> context.
>
> I have always thought of page table as the entire tree structure starting at
> *pgd in the mm_struct. So, I was a bit confused. But, I now see elsewhere
> that 'page table' may refer to either.
Yes, we're pretty sloppy about that. What I had in mind was:
We have a large folio which is mapped at, say, (1.9MB - 2.1MB) in the
user address space. There are thus multiple PTEs which map it and some
of those PTEs belong to one PMD and the rest belong to a second PMD.
It has a mapcount of 2 due to being mapped by PTE entries belonging to
two PMD tables. If it were mapped at (2.1-2.3MB), it would have a
mapcount of 1 due to all its PTEs belonging to a single PMD table.
[ Mike & I spoke earlier this week about what should happen with mapcount
and a theoretical aligned 1GB THP that has its PUD mapping split into
PTE mappings. Splitting a PMD to PTEs does not affect the mapcount
since all of the PTEs are now referenced from a single PMD table instead
of from a PMD entry. But splitting a PUD to PTEs should increment the
mapcount by 511 since the folio is now referenced from 512 PMD tables. ]
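To make the counting rule above concrete, the mapcount contribution of a
single VMA is just the number of PMD tables its mapping touches. A hedged
helper (an illustration of the arithmetic, not mainline code):

	/*
	 * Sketch only: how many PMD tables does a mapping of
	 * [start, end) touch?  1.9MB-2.1MB -> 2, 2.1MB-2.3MB -> 1,
	 * and a fully PTE-mapped, aligned 1GB folio -> 512.
	 */
	static unsigned long nr_pmd_tables(unsigned long start,
					   unsigned long end)
	{
		return ((end - 1) >> PMD_SHIFT) - (start >> PMD_SHIFT) + 1;
	}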
* Re: Folio mapcount
2023-02-02 15:31 ` Matthew Wilcox
@ 2023-02-07 16:19 ` Zi Yan
2023-02-07 16:44 ` Matthew Wilcox
0 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2023-02-07 16:19 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Mike Kravetz, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On 2 Feb 2023, at 10:31, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 07:45:17PM -0800, Mike Kravetz wrote:
>> On 01/24/23 18:13, Matthew Wilcox wrote:
>>> Once we get to the part of the folio journey where we have
>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>> Currently we maintain a per-page mapcount, and that will have to go.
>>> We can maintain extra state for a multi-page folio, but it has to be a
>>> constant amount of extra state no matter how many pages are in the folio.
>>>
>>> My proposal is that we maintain a single mapcount per folio, and its
>>> definition is the number of (vma, page table) tuples which have a
>>> reference to any pages in this folio.
>>
>> Hi Matthew, finally took a look at this. Can you clarify your definition of
>> 'page table' here? I think you are talking about all the entries within
>> one page table page? Is that correct? It certainly makes sense in this
>> context.
>>
>> I have always thought of page table as the entire tree structure starting at
>> *pgd in the mm_struct. So, I was a bit confused. But, I now see elsewhere
>> that 'page table' may refer to either.
>
> Yes, we're pretty sloppy about that. What I had in mind was:
>
> We have a large folio which is mapped at, say, (1.9MB - 2.1MB) in the
> user address space. There are thus multiple PTEs which map it and some
> of those PTEs belong to one PMD and the rest belong to a second PMD.
> It has a mapcount of 2 due to being mapped by PTE entries belonging to
> two PMD tables. If it were mapped at (2.1-2.3MB), it would have a
> mapcount of 1 due to all its PTEs belonging to a single PMD table.
What is the logic of using PMD as the basic counting unit? Why not use
PTE or PUD? I just cannot understand the goal of doing this.
>
> [ Mike & I spoke earlier this week about what should happen with mapcount
> and a theoretical aligned 1GB THP that has its PUD mapping split into
> PTE mappings. Splitting a PMD to PTEs does not affect the mapcount
> since all of the PTEs are now referenced from a single PMD table instead
> of from a PMD entry. But splitting a PUD to PTEs should increment the
> mapcount by 511 since the folio is now referenced from 512 PMD tables. ]
--
Best Regards,
Yan, Zi
* Re: Folio mapcount
2023-02-07 16:19 ` Zi Yan
@ 2023-02-07 16:44 ` Matthew Wilcox
0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 16:44 UTC (permalink / raw)
To: Zi Yan
Cc: Mike Kravetz, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 07, 2023 at 11:19:40AM -0500, Zi Yan wrote:
> On 2 Feb 2023, at 10:31, Matthew Wilcox wrote:
> > On Wed, Feb 01, 2023 at 07:45:17PM -0800, Mike Kravetz wrote:
> >> On 01/24/23 18:13, Matthew Wilcox wrote:
> >>> Once we get to the part of the folio journey where we have
> >>> one-pointer-per-page, we can't afford to maintain per-page state.
> >>> Currently we maintain a per-page mapcount, and that will have to go.
> >>> We can maintain extra state for a multi-page folio, but it has to be a
> >>> constant amount of extra state no matter how many pages are in the folio.
> >>>
> >>> My proposal is that we maintain a single mapcount per folio, and its
> >>> definition is the number of (vma, page table) tuples which have a
> >>> reference to any pages in this folio.
> >>
> >> Hi Matthew, finally took a look at this. Can you clarify your definition of
> >> 'page table' here? I think you are talking about all the entries within
> >> one page table page? Is that correct? It certainly makes sense in this
> >> context.
> >>
> >> I have always thought of page table as the entire tree structure starting at
> >> *pgd in the mm_struct. So, I was a bit confused. But, I now see elsewhere
> >> that 'page table' may refer to either.
> >
> > Yes, we're pretty sloppy about that. What I had in mind was:
> >
> > We have a large folio which is mapped at, say, (1.9MB - 2.1MB) in the
> > user address space. There are thus multiple PTEs which map it and some
> > of those PTEs belong to one PMD and the rest belong to a second PMD.
> > It has a mapcount of 2 due to being mapped by PTE entries belonging to
> > two PMD tables. If it were mapped at (2.1-2.3MB), it would have a
> > mapcount of 1 due to all its PTEs belonging to a single PMD table.
>
> What is the logic of using PMD as the basic counting unit? Why not use
> PTE or PUD? I just cannot understand the goal of doing this.
Locking and contiguity. If we try to map a folio across a PMD boundary,
we have to have the PTL on both PMDs at the same time (or all PMDs if we
support folios larger than PMD_SIZE). Then we have to make two (or more)
calls to set_ptes() to populate all the PTEs (so that arches don't have
to handle "Oh, I reached the end of the PMD, move to the next one").
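As a hedged sketch of those two calls (set_ptes() was still a proposal at
this point, and the helper below is an assumption for illustration, not
mainline code):

	/*
	 * Sketch only: map 'nr' pages of a folio whose mapping straddles
	 * one PMD boundary -- one batched set_ptes() call per PMD table,
	 * each made while holding that table's PTL.
	 */
	static void map_folio_across_pmd(struct vm_area_struct *vma,
			unsigned long addr, pte_t *pte_lo, pte_t *pte_hi,
			struct page *page, unsigned int nr)
	{
		/* Pages that fit below the PMD boundary being straddled. */
		unsigned int nr_lo = (ALIGN(addr + 1, PMD_SIZE) - addr) >> PAGE_SHIFT;

		set_ptes(vma->vm_mm, addr, pte_lo,
			 mk_pte(page, vma->vm_page_prot), nr_lo);
		set_ptes(vma->vm_mm, addr + nr_lo * PAGE_SIZE, pte_hi,
			 mk_pte(page + nr_lo, vma->vm_page_prot), nr - nr_lo);
	}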
Note that I've decided this approach doesn't work because it can't
easily tell us "Am I the only VMA which has this folio mapped?"
But this was the reason.
* Re: Folio mapcount
2023-01-24 18:13 Folio mapcount Matthew Wilcox
` (2 preceding siblings ...)
2023-02-02 3:45 ` Mike Kravetz
@ 2023-02-06 20:34 ` Matthew Wilcox
2023-02-06 22:55 ` Yang Shi
` (3 more replies)
2023-02-07 16:23 ` Zi Yan
4 siblings, 4 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-06 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand, Yin,
Fengwei
On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
>
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
I've been thinking about this a lot more, and I have changed my
mind. It works fine to answer the question "Is any page in this
folio mapped", but it's now hard to answer the question "I have it
mapped, does anybody else?" That question is asked, for example,
in madvise_cold_or_pageout_pte_range().
With this definition, if the mapcount is 1, it's definitely only mapped
by us. If it's more than 2, it's definitely mapped by somebody else (*).
If it's 2, maybe we have the folio mapped twice, and maybe we have it
mapped once and somebody else has it mapped once, so we have to consult
the rmap to find out. Not fun times.
(*) If we support folios larger than PMD size, then the answer is more
complex.
I now think the mapcount has to be defined as "How many VMAs have
one-or-more pages of this folio mapped".
That means that our future folio_add_file_rmap_range() looks a bit
like this:
{
	bool add_mapcount = true;

	if (nr < folio_nr_pages(folio))
		add_mapcount = !folio_has_ptes(folio, vma);
	if (add_mapcount)
		atomic_inc(&folio->_mapcount);

	__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
	if (nr == HPAGE_PMD_NR)
		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
				NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);

	mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
}

bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
{
	unsigned long address = vma_address(&folio->page, vma);
	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

	if (!page_vma_mapped_walk(&pvmw))
		return false;
	page_vma_mapped_walk_done(&pvmw);
	return true;
}
... some details to be fixed here; particularly this will currently
deadlock on the PTL, so we'd need not only to exclude the current
PMD from being examined, but also avoid a deadly embrace between
two threads (do we currently have a locking order defined for
page table locks at the same height of the tree?)
* Re: Folio mapcount
2023-02-06 20:34 ` Matthew Wilcox
@ 2023-02-06 22:55 ` Yang Shi
2023-02-06 23:09 ` Matthew Wilcox
2023-02-07 3:06 ` Yin, Fengwei
` (2 subsequent siblings)
3 siblings, 1 reply; 52+ messages in thread
From: Yang Shi @ 2023-02-06 22:55 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Mon, Feb 6, 2023 at 12:34 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> > Once we get to the part of the folio journey where we have
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go.
> > We can maintain extra state for a multi-page folio, but it has to be a
> > constant amount of extra state no matter how many pages are in the folio.
> >
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
>
> I've been thinking about this a lot more, and I have changed my
> mind. It works fine to answer the question "Is any page in this
> folio mapped", but it's now hard to answer the question "I have it
> mapped, does anybody else?" That question is asked, for example,
> in madvise_cold_or_pageout_pte_range().
>
> With this definition, if the mapcount is 1, it's definitely only mapped
> by us. If it's more than 2, it's definitely mapped by somebody else (*).
> If it's 2, maybe we have the folio mapped twice, and maybe we have it
> mapped once and somebody else has it mapped once, so we have to consult
> the rmap to find out. Not fun times.
>
> (*) If we support folios larger than PMD size, then the answer is more
> complex.
>
> I now think the mapcount has to be defined as "How many VMAs have
> one-or-more pages of this folio mapped".
IIRC it may still be possible that the folio's mapcount is two, but it is
only mapped by us (for example, by two VMAs from the same process).
>
> That means that our future folio_add_file_rmap_range() looks a bit
> like this:
>
> {
> bool add_mapcount = true;
>
> if (nr < folio_nr_pages(folio))
> add_mapcount = !folio_has_ptes(folio, vma);
> if (add_mapcount)
> atomic_inc(&folio->_mapcount);
>
> __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> if (nr == HPAGE_PMD_NR)
> __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
>
> mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> }
>
> bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> {
> unsigned long address = vma_address(&folio->page, vma);
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>
> if (!page_vma_mapped_walk(&pvmw))
> return false;
> page_vma_mapped_walk_done(&pvmw);
> return true;
> }
>
> ... some details to be fixed here; particularly this will currently
> deadlock on the PTL, so we'd need not only to exclude the current
> PMD from being examined, but also avoid a deadly embrace between
> two threads (do we currently have a locking order defined for
> page table locks at the same height of the tree?)
>
>
* Re: Folio mapcount
2023-02-06 22:55 ` Yang Shi
@ 2023-02-06 23:09 ` Matthew Wilcox
0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-06 23:09 UTC (permalink / raw)
To: Yang Shi
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Mon, Feb 06, 2023 at 02:55:09PM -0800, Yang Shi wrote:
> > I've been thinking about this a lot more, and I have changed my
> > mind. It works fine to answer the question "Is any page in this
> > folio mapped", but it's now hard to answer the question "I have it
> > mapped, does anybody else?" That question is asked, for example,
> > in madvise_cold_or_pageout_pte_range().
> >
> > With this definition, if the mapcount is 1, it's definitely only mapped
> > by us. If it's more than 2, it's definitely mapped by somebody else (*).
> > If it's 2, maybe we have the folio mapped twice, and maybe we have it
> > mapped once and somebody else has it mapped once, so we have to consult
> > the rmap to find out. Not fun times.
> >
> > (*) If we support folios larger than PMD size, then the answer is more
> > complex.
> >
> > I now think the mapcount has to be defined as "How many VMAs have
> > one-or-more pages of this folio mapped".
>
> IIRC it may still be possible that the folio's mapcount is two, but it is
> only mapped by us (for example, by two VMAs from the same process).
That's true, but it's also true for single-page folios that are mapped
by two VMAs from the same process. I don't want to change the definition
that much!
* Re: Folio mapcount
2023-02-06 20:34 ` Matthew Wilcox
2023-02-06 22:55 ` Yang Shi
@ 2023-02-07 3:06 ` Yin, Fengwei
2023-02-07 4:08 ` Matthew Wilcox
2023-02-07 22:39 ` Peter Xu
2023-02-07 22:56 ` James Houghton
3 siblings, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-02-07 3:06 UTC (permalink / raw)
To: Matthew Wilcox, linux-mm
Cc: Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand
On 2/7/2023 4:34 AM, Matthew Wilcox wrote:
> On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
>> Once we get to the part of the folio journey where we have
>> one-pointer-per-page, we can't afford to maintain per-page state.
>> Currently we maintain a per-page mapcount, and that will have to go.
>> We can maintain extra state for a multi-page folio, but it has to be a
>> constant amount of extra state no matter how many pages are in the folio.
>>
>> My proposal is that we maintain a single mapcount per folio, and its
>> definition is the number of (vma, page table) tuples which have a
>> reference to any pages in this folio.
>
> I've been thinking about this a lot more, and I have changed my
> mind. It works fine to answer the question "Is any page in this
> folio mapped", but it's now hard to answer the question "I have it
> mapped, does anybody else?" That question is asked, for example,
> in madvise_cold_or_pageout_pte_range().
>
> With this definition, if the mapcount is 1, it's definitely only mapped
> by us. If it's more than 2, it's definitely mapped by somebody else (*).
> If it's 2, maybe we have the folio mapped twice, and maybe we have it
> mapped once and somebody else has it mapped once, so we have to consult
> the rmap to find out. Not fun times.
Naive question: If it's 2, why do we need to know whether it's mapped twice
by us or once by us and once by somebody else? Does that mean we allow some
operations if only we mapped it, even though it's 2? Thanks.
Regards
Yin, Fengwei
>
> (*) If we support folios larger than PMD size, then the answer is more
> complex.
>
> I now think the mapcount has to be defined as "How many VMAs have
> one-or-more pages of this folio mapped".
>
> That means that our future folio_add_file_rmap_range() looks a bit
> like this:
>
> {
> bool add_mapcount = true;
>
> if (nr < folio_nr_pages(folio))
> add_mapcount = !folio_has_ptes(folio, vma);
> if (add_mapcount)
> atomic_inc(&folio->_mapcount);
>
> __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> if (nr == HPAGE_PMD_NR)
> __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
>
> mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> }
>
> bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> {
> unsigned long address = vma_address(&folio->page, vma);
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>
> if (!page_vma_mapped_walk(&pvmw))
> return false;
> page_vma_mapped_walk_done(&pvmw);
> return true;
> }
>
> ... some details to be fixed here; particularly this will currently
> deadlock on the PTL, so we'd need not only to exclude the current
> PMD from being examined, but also avoid a deadly embrace between
> two threads (do we currently have a locking order defined for
> page table locks at the same height of the tree?)
>
* Re: Folio mapcount
2023-02-07 3:06 ` Yin, Fengwei
@ 2023-02-07 4:08 ` Matthew Wilcox
0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 4:08 UTC (permalink / raw)
To: Yin, Fengwei
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand
On Tue, Feb 07, 2023 at 11:06:13AM +0800, Yin, Fengwei wrote:
> On 2/7/2023 4:34 AM, Matthew Wilcox wrote:
> > On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> >> Once we get to the part of the folio journey where we have
> >> one-pointer-per-page, we can't afford to maintain per-page state.
> >> Currently we maintain a per-page mapcount, and that will have to go.
> >> We can maintain extra state for a multi-page folio, but it has to be a
> >> constant amount of extra state no matter how many pages are in the folio.
> >>
> >> My proposal is that we maintain a single mapcount per folio, and its
> >> definition is the number of (vma, page table) tuples which have a
> >> reference to any pages in this folio.
> >
> > I've been thinking about this a lot more, and I have changed my
> > mind. It works fine to answer the question "Is any page in this
> > folio mapped", but it's now hard to answer the question "I have it
> > mapped, does anybody else?" That question is asked, for example,
> > in madvise_cold_or_pageout_pte_range().
> >
> > With this definition, if the mapcount is 1, it's definitely only mapped
> > by us. If it's more than 2, it's definitely mapped by somebody else (*).
> > If it's 2, maybe we have the folio mapped twice, and maybe we have it
> > mapped once and somebody else has it mapped once, so we have to consult
> > the rmap to find out. Not fun times.
> Naive question: If it's 2, why do we need to know whether it's mapped twice
> by us or once by us and once by somebody else? Does that mean we allow some
> operations if only we mapped it, even though it's 2? Thanks.
Yes, see the aforementioned madvise_cold_or_pageout_pte_range()
(in mainline; it's different in -next):

	/* Do not interfere with other mappings of this page */
	if (page_mapcount(page) != 1)
		goto huge_unlock;
* Re: Folio mapcount
2023-02-06 20:34 ` Matthew Wilcox
2023-02-06 22:55 ` Yang Shi
2023-02-07 3:06 ` Yin, Fengwei
@ 2023-02-07 22:39 ` Peter Xu
2023-02-07 23:27 ` Matthew Wilcox
2023-02-07 22:56 ` James Houghton
3 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2023-02-07 22:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Mon, Feb 06, 2023 at 08:34:31PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> > Once we get to the part of the folio journey where we have
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go.
> > We can maintain extra state for a multi-page folio, but it has to be a
> > constant amount of extra state no matter how many pages are in the folio.
> >
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
>
> I've been thinking about this a lot more, and I have changed my
> mind. It works fine to answer the question "Is any page in this
> folio mapped", but it's now hard to answer the question "I have it
> mapped, does anybody else?" That question is asked, for example,
> in madvise_cold_or_pageout_pte_range().
I'm curious whether it is still fine in rare cases - IMHO it's a matter of
when it'll go severely wrong if the mapcount should be exactly 1 (it's
privately owned by a vma) but we reported 2.
In this MADV_COLD/MADV_PAGEOUT case we'll skip COLD or PAGEOUT on some pages
even if we can, but is it a deal breaker (if the benefit of the change can
be proved worthwhile)? Especially, this only happens with unaligned
folios being mapped.
Is unaligned mapping for a folio common? Are there any other use cases that
can go worse than this one?
(E.g., IIUC superfluous but occasional CoW seems fine)
OTOH...
>
> With this definition, if the mapcount is 1, it's definitely only mapped
> by us. If it's more than 2, it's definitely mapped by somebody else (*).
> If it's 2, maybe we have the folio mapped twice, and maybe we have it
> mapped once and somebody else has it mapped once, so we have to consult
> the rmap to find out. Not fun times.
>
> (*) If we support folios larger than PMD size, then the answer is more
> complex.
>
> I now think the mapcount has to be defined as "How many VMAs have
> one-or-more pages of this folio mapped".
>
> That means that our future folio_add_file_rmap_range() looks a bit
> like this:
>
> {
> bool add_mapcount = true;
>
> if (nr < folio_nr_pages(folio))
> add_mapcount = !folio_has_ptes(folio, vma);
> if (add_mapcount)
> atomic_inc(&folio->_mapcount);
>
> __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> if (nr == HPAGE_PMD_NR)
> __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
>
> mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> }
>
> bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> {
> unsigned long address = vma_address(&folio->page, vma);
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>
> if (!page_vma_mapped_walk(&pvmw))
> return false;
> page_vma_mapped_walk_done(&pvmw);
> return true;
> }
>
> ... some details to be fixed here; particularly this will currently
> deadlock on the PTL, so we'd need not only to exclude the current
> PMD from being examined, but also avoid a deadly embrace between
> two threads (do we currently have a locking order defined for
> page table locks at the same height of the tree?)
... it starts to sound scary if it needs to take >1 pgtable locks.
Thanks,
--
Peter Xu
* Re: Folio mapcount
2023-02-07 22:39 ` Peter Xu
@ 2023-02-07 23:27 ` Matthew Wilcox
2023-02-08 19:40 ` Peter Xu
0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 23:27 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 07, 2023 at 05:39:07PM -0500, Peter Xu wrote:
> On Mon, Feb 06, 2023 at 08:34:31PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> > > Once we get to the part of the folio journey where we have
> > > one-pointer-per-page, we can't afford to maintain per-page state.
> > > Currently we maintain a per-page mapcount, and that will have to go.
> > > We can maintain extra state for a multi-page folio, but it has to be a
> > > constant amount of extra state no matter how many pages are in the folio.
> > >
> > > My proposal is that we maintain a single mapcount per folio, and its
> > > definition is the number of (vma, page table) tuples which have a
> > > reference to any pages in this folio.
> >
> > I've been thinking about this a lot more, and I have changed my
> > mind. It works fine to answer the question "Is any page in this
> > folio mapped", but it's now hard to answer the question "I have it
> > mapped, does anybody else?" That question is asked, for example,
> > in madvise_cold_or_pageout_pte_range().
>
> I'm curious whether it is still fine in rare cases - IMHO it's a matter of
> when it'll go severely wrong if the mapcount should be exactly 1 (it's
> privately owned by a vma) but we reported 2.
>
> In this MADV_COLD/MADV_PAGEOUT case we'll skip COLD or PAGEOUT on some pages
> even if we can, but is it a deal breaker (if the benefit of the change can
> be proved worthwhile)? Especially, this only happens with unaligned
> folios being mapped.
>
> Is unaligned mapping for a folio common? Are there any other use cases that
> can go worse than this one?
For file pages, I think it can go wrong rather more often than we might
like. I think for anon memory, we'll tend to allocate it to be aligned,
and then it takes some weirdness like mremap() to make it unaligned.
But I'm just waving my hands wildly. I don't really know.
> (E.g., IIUC superfluous but occasional CoW seems fine)
>
> OTOH...
>
> >
> > With this definition, if the mapcount is 1, it's definitely only mapped
> > by us. If it's more than 2, it's definitely mapped by somebody else (*).
> > If it's 2, maybe we have the folio mapped twice, and maybe we have it
> > mapped once and somebody else has it mapped once, so we have to consult
> > the rmap to find out. Not fun times.
> >
> > (*) If we support folios larger than PMD size, then the answer is more
> > complex.
> >
> > I now think the mapcount has to be defined as "How many VMAs have
> > one-or-more pages of this folio mapped".
> >
> > That means that our future folio_add_file_rmap_range() looks a bit
> > like this:
> >
> > {
> > bool add_mapcount = true;
> >
> > if (nr < folio_nr_pages(folio))
> > add_mapcount = !folio_has_ptes(folio, vma);
> > if (add_mapcount)
> > atomic_inc(&folio->_mapcount);
> >
> > __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> > if (nr == HPAGE_PMD_NR)
> > __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> > NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
> >
> > mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> > }
> >
> > bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> > {
> > unsigned long address = vma_address(&folio->page, vma);
> > DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> >
> > if (!page_vma_mapped_walk(&pvmw))
> > return false;
> > page_vma_mapped_walk_done(&pvmw);
> > return true;
> > }
> >
> > ... some details to be fixed here; particularly this will currently
> > deadlock on the PTL, so we'd need not only to exclude the current
> > PMD from being examined, but also avoid a deadly embrace between
> > two threads (do we currently have a locking order defined for
> > page table locks at the same height of the tree?)
>
> ... it starts to sound scary if it needs to take >1 pgtable locks.
I've been thinking about this one, and I wonder if we can do it
without taking any pgtable locks. The locking environment we're in
is the page fault handler, so we have the mmap_lock for read (for now
anyway ...). We also hold the folio lock, so _if_ the folio is mapped,
those entries can't disappear under us. They also can't appear under
us. We hold the PTL on one PMD, but not necessarily on any other PMD
we examine.
I appreciate that PTEs can _change_ under us if we do not hold the PTL,
but by virtue of holding the folio lock, they can't change from or to
our PFNs. I also think the PMD table cannot disappear under us
since we're holding the mmap_lock for read, and anyone removing page
tables has to take the mmap_lock for write.
Am I missing anything important?
* Re: Folio mapcount
2023-02-07 23:27 ` Matthew Wilcox
@ 2023-02-08 19:40 ` Peter Xu
2023-02-08 20:25 ` Matthew Wilcox
0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2023-02-08 19:40 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> I've been thinking about this one, and I wonder if we can do it
> without taking any pgtable locks. The locking environment we're in
> is the page fault handler, so we have the mmap_lock for read (for now
> anyway ...). We also hold the folio lock, so _if_ the folio is mapped,
> those entries can't disappear under us.
Could MADV_DONTNEED do that from another pgtable that we don't hold the
pgtable lock?
> They also can't appear under us.
This seems right.
> We hold the PTL on one PMD, but not necessarily on any other PMD we
> examine.
>
> I appreciate that PTEs can _change_ under us if we do not hold the PTL,
> but by virtue of holding the folio lock, they can't change from or to
> our PFNs. I also think the PMD table cannot disappear under us
> since we're holding the mmap_lock for read, and anyone removing page
> tables has to take the mmap_lock for write.
Seems right to me too.
--
Peter Xu
* Re: Folio mapcount
2023-02-08 19:40 ` Peter Xu
@ 2023-02-08 20:25 ` Matthew Wilcox
2023-02-08 20:58 ` Peter Xu
0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-08 20:25 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > I've been thinking about this one, and I wonder if we can do it
> > without taking any pgtable locks. The locking environment we're in
> > is the page fault handler, so we have the mmap_lock for read (for now
> > anyway ...). We also hold the folio lock, so _if_ the folio is mapped,
> > those entries can't disappear under us.
>
> Could MADV_DONTNEED do that from another pgtable that we don't hold the
> pgtable lock?
Oh, ugh, yes. And zap_pte_range() has the PTL first, so we can't sleep
to get the folio lock. And we can't decline to zap the pte on a failed
folio_trylock() (well, we could for MADV_DONTNEED, but not in general).
So ... how about this for a solution:
- If the folio overlaps into the next PMD table, spin_lock it.
- If the folio overlaps into the previous PMD table, unlock our
PTL, lock the previous PTL, re-lock our PTL.
- Do the pvmw, telling it we already have the PTLs held (new PVMW flag).
[explanation simplified; if there is no prior PMD table or if the VMA
limits how far to search, we can skip this]
We have prior art for taking two PTLs in copy_page_range(). There,
the hierarchy is clear; one VMA belongs to the process parent and one
to the child. I don't believe we have precedent for taking two PTLs
in the same VMA, but I think my proposal (order by ascending address in
the process) is the obvious order to choose.
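A sketch of that ordering rule (pte_lockptr() is the existing split-PTL
helper; the function itself and its callers are assumptions, not mainline
code):

	/*
	 * Sketch only: take the PTLs of two PMD tables in the same VMA
	 * in ascending address order; pmd_lo must cover the lower
	 * virtual addresses.
	 */
	static void lock_two_ptls(struct mm_struct *mm,
				  pmd_t *pmd_lo, pmd_t *pmd_hi)
	{
		spinlock_t *ptl_lo = pte_lockptr(mm, pmd_lo);
		spinlock_t *ptl_hi = pte_lockptr(mm, pmd_hi);

		spin_lock(ptl_lo);
		/* With split PTL disabled both may be mm->page_table_lock. */
		if (ptl_hi != ptl_lo)
			spin_lock_nested(ptl_hi, SINGLE_DEPTH_NESTING);
	}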
* Re: Folio mapcount
2023-02-08 20:25 ` Matthew Wilcox
@ 2023-02-08 20:58 ` Peter Xu
2023-02-09 15:10 ` Chih-En Lin
0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2023-02-08 20:58 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Wed, Feb 08, 2023 at 08:25:10PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> > On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > > I've been thinking about this one, and I wonder if we can do it
> > > without taking any pgtable locks. The locking environment we're in
> > > is the page fault handler, so we have the mmap_lock for read (for now
> > > anyway ...). We also hold the folio lock, so _if_ the folio is mapped,
> > > those entries can't disappear under us.
> >
> > Could MADV_DONTNEED do that from another pgtable that we don't hold the
> > pgtable lock?
>
> Oh, ugh, yes. And zap_pte_range() has the PTL first, so we can't sleep
> to get the folio lock. And we can't decline to zap the pte on a failed
> folio_trylock() (well, we could for MADV_DONTNEED, but not in general).
>
> So ... how about this for a solution:
>
> - If the folio overlaps into the next PMD table, spin_lock it.
> - If the folio overlaps into the previous PMD table, unlock our
> PTL, lock the previous PTL, re-lock our PTL.
> - Do the pvmw, telling it we already have the PTLs held (new PVMW flag).
>
> [explanation simplified; if there is no prior PMD table or if the VMA
> limits how far to search, we can skip this]
>
> We have prior art for taking two PTLs in copy_page_range(). There,
> the hierarchy is clear; one VMA belongs to the process parent and one
> to the child. I don't believe we have precedent for taking two PTLs
> in the same VMA, but I think my proposal (order by ascending address in
> the process) is the obvious order to choose.
Maybe it'll work? Not sure, but it seems to be something we'd be extremely
careful with. Having a single mmap read lock covering both seems to
guarantee that the order of the lock is stable, which is a good start..
But I have no good idea on other implications across the whole kernel.
IMHO copy_page_range() is not a great example for proving deadlocks,
because the dst_mm should not be exposed to the whole world yet at all when
copying. Say, I don't see any case some thread can try to take the dst mm
pgtable lock at all until it's all set. I'm even wondering whether it's
safe to not take the dst mm pgtable lock at all during a fork()..
--
Peter Xu
* Re: Folio mapcount
2023-02-08 20:58 ` Peter Xu
@ 2023-02-09 15:10 ` Chih-En Lin
2023-02-09 15:43 ` Peter Xu
0 siblings, 1 reply; 52+ messages in thread
From: Chih-En Lin @ 2023-02-09 15:10 UTC (permalink / raw)
To: Peter Xu
Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins,
Rik van Riel, David Hildenbrand, Yin, Fengwei
On Thu, Feb 9, 2023 at 4:59 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Feb 08, 2023 at 08:25:10PM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> > > On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > > > I've been thinking about this one, and I wonder if we can do it
> > > > without taking any pgtable locks. The locking environment we're in
> > > > is the page fault handler, so we have the mmap_lock for read (for now
> > > > anyway ...). We also hold the folio lock, so _if_ the folio is mapped,
> > > > those entries can't disappear under us.
> > >
> > > Could MADV_DONTNEED do that from another pgtable that we don't hold the
> > > pgtable lock?
> >
> > Oh, ugh, yes. And zap_pte_range() has the PTL first, so we can't sleep
> > to get the folio lock. And we can't decline to zap the pte on a failed
> > folio_trylock() (well, we could for MADV_DONTNEED, but not in general).
> >
> > So ... how about this for a solution:
> >
> > - If the folio overlaps into the next PMD table, spin_lock it.
> > - If the folio overlaps into the previous PMD table, unlock our
> > PTL, lock the previous PTL, re-lock our PTL.
> > - Do the pvmw, telling it we already have the PTLs held (new PVMW flag).
> >
> > [explanation simplified; if there is no prior PMD table or if the VMA
> > limits how far to search, we can skip this]
> >
> > We have prior art for taking two PTLs in copy_page_range(). There,
> > the hierarchy is clear; one VMA belongs to the process parent and one
> > to the child. I don't believe we have precedent for taking two PTLs
> > in the same VMA, but I think my proposal (order by ascending address in
> > the process) is the obvious order to choose.
>
> Maybe it'll work? Not sure, but it seems to be something we'd be extremely
> careful with. Having a single mmap read lock covering both seems to
> guarantee that the order of the lock is stable, which is a good start..
> But I have no good idea on other implications across the whole kernel.
>
> IMHO copy_page_range() is not a great example for proving deadlocks,
> because the dst_mm should not be exposed to the whole world yet at all when
> copying. Say, I don't see any case some thread can try to take the dst mm
> pgtable lock at all until it's all set. I'm even wondering whether it's
> safe to not take the dst mm pgtable lock at all during a fork()..
I don't think it's safe without taking the dst mm pgtable lock during a fork().
Since copy_present_page() will add the page to the anon_vma, the page
can be found by rmap. So even before the fork has finished duplicating
the pgtable, we can still use the existing (COW-mapped) page to access
the dst pgtable via rmap + page_vma_mapped_walk().
But I didn't consider the mmap_write_lock() here, so I might be wrong.
Just providing some thoughts.
Thanks,
Chih-En Lin
* Re: Folio mapcount
2023-02-09 15:10 ` Chih-En Lin
@ 2023-02-09 15:43 ` Peter Xu
0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2023-02-09 15:43 UTC (permalink / raw)
To: Chih-En Lin
Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins,
Rik van Riel, David Hildenbrand, Yin, Fengwei
On Thu, Feb 09, 2023 at 11:10:12PM +0800, Chih-En Lin wrote:
> On Thu, Feb 9, 2023 at 4:59 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Feb 08, 2023 at 08:25:10PM +0000, Matthew Wilcox wrote:
> > > On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> > > > On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > > > > I've been thinking about this one, and I wonder if we can do it
> > > > > without taking any pgtable locks. The locking environment we're in
> > > > > is the page fault handler, so we have the mmap_lock for read (for now
> > > > > anyway ...). We also hold the folio lock, so _if_ the folio is mapped,
> > > > > those entries can't disappear under us.
> > > >
> > > > Could MADV_DONTNEED do that from another pgtable that we don't hold the
> > > > pgtable lock?
> > >
> > > Oh, ugh, yes. And zap_pte_range() has the PTL first, so we can't sleep
> > > to get the folio lock. And we can't decline to zap the pte on a failed
> > > folio_trylock() (well, we could for MADV_DONTNEED, but not in general).
> > >
> > > So ... how about this for a solution:
> > >
> > > - If the folio overlaps into the next PMD table, spin_lock it.
> > > - If the folio overlaps into the previous PMD table, unlock our
> > > PTL, lock the previous PTL, re-lock our PTL.
> > > - Do the pvmw, telling it we already have the PTLs held (new PVMW flag).
> > >
> > > [explanation simplified; if there is no prior PMD table or if the VMA
> > > limits how far to search, we can skip this]
> > >
> > > We have prior art for taking two PTLs in copy_page_range(). There,
> > > the hierarchy is clear; one VMA belongs to the process parent and one
> > > to the child. I don't believe we have precedent for taking two PTLs
> > > in the same VMA, but I think my proposal (order by ascending address in
> > > the process) is the obvious order to choose.
> >
> > Maybe it'll work? Not sure, but it seems to be something we'd be extremely
> > careful with. Having a single mmap read lock covering both seems to
> > guarantee that the order of the lock is stable, which is a good start..
> > But I have no good idea on other implications across the whole kernel.
> >
> > IMHO copy_page_range() is not a great example for proving deadlocks,
> > because the dst_mm should not be exposed to the whole world yet at all when
> > copying. Say, I don't see any case some thread can try to take the dst mm
> > pgtable lock at all until it's all set. I'm even wondering whether it's
> > safe to not take the dst mm pgtable lock at all during a fork()..
>
> I don't think it's safe without taking the dst mm pgtable lock during a fork().
> Since copy_present_page() will add the page to the anon_vma, the page
> can be found by rmap. So even before the fork has finished duplicating
> the pgtable, we can still use the existing (COW-mapped) page to access
> the dst pgtable via rmap + page_vma_mapped_walk().
> But I didn't consider the mmap_write_lock() here, so I might be wrong.
> Just providing some thoughts.
Yes, I think you're right - I thought the rmap locks were held but after I
rechecked they're not.
It's safe because AFAICT any rmap can only take either of the pgtable locks
but not both. To trigger any potential deadlock on the two spinlocks we
need another concurrent thread trying to take the same two locks, but it
will not happen in this case.
Thanks,
--
Peter Xu
* Re: Folio mapcount
2023-02-06 20:34 ` Matthew Wilcox
` (2 preceding siblings ...)
2023-02-07 22:39 ` Peter Xu
@ 2023-02-07 22:56 ` James Houghton
2023-02-07 23:08 ` Matthew Wilcox
3 siblings, 1 reply; 52+ messages in thread
From: James Houghton @ 2023-02-07 22:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
Hi Matthew,
On Mon, Feb 6, 2023 at 12:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> I now think the mapcount has to be defined as "How many VMAs have
> one-or-more pages of this folio mapped".
>
> That means that our future folio_add_file_rmap_range() looks a bit
> like this:
>
> {
> bool add_mapcount = true;
>
> if (nr < folio_nr_pages(folio))
> add_mapcount = !folio_has_ptes(folio, vma);
Can you elaborate on how folio_has_ptes() might be implemented?
It seems like for a naturally aligned THP, we can see if the PMD is
pmd_present(), otherwise, do we need to check (potentially) all the
PTEs? Does this introduce an ordering requirement, where we have to
update the page table always before/after we call
folio_add_file_rmap_range()?
Thanks!
- James
* Re: Folio mapcount
2023-02-07 22:56 ` James Houghton
@ 2023-02-07 23:08 ` Matthew Wilcox
2023-02-07 23:27 ` James Houghton
0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 23:08 UTC (permalink / raw)
To: James Houghton
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 07, 2023 at 02:56:52PM -0800, James Houghton wrote:
> Hi Matthew,
>
> On Mon, Feb 6, 2023 at 12:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> > I now think the mapcount has to be defined as "How many VMAs have
> > one-or-more pages of this folio mapped".
> >
> > That means that our future folio_add_file_rmap_range() looks a bit
> > like this:
> >
> > {
> > 	bool add_mapcount = true;
> >
> > 	if (nr < folio_nr_pages(folio))
> > 		add_mapcount = !folio_has_ptes(folio, vma);
>
> Can you elaborate on how folio_has_ptes() might be implemented?
... oh. First I called it folio_has_ptes() and then I went looking
for inspiration on how to implement it, and I found page_mapped_in_vma()
and so I thought folio_mapped_in_vma() would be a better name, but
I forgot to change it here. Sorry.
> It seems like for a naturally aligned THP, we can see if the PMD is
> pmd_present(), otherwise, do we need to check (potentially) all the
> PTEs? Does this introduce an ordering requirement, where we have to
> update the page table always before/after we call
> folio_add_file_rmap_range()?
Actually, you _mustn't_ update the page table first, or it will see
the PTEs and say "Ah, the folio is already mapped, I should not
increment mapcount". So we keep the calling order as it is now;
folio_add_X_rmap() followed by a call to set_ptes().
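To make the required order concrete, a minimal sketch (my illustration;
folio_mapped_in_vma() per the rename above, set_ptes() and a single
folio-wide _mapcount field as assumed by this proposal):

	bool add_mapcount = true;

	if (nr < folio_nr_pages(folio))
		add_mapcount = !folio_mapped_in_vma(folio, vma);
	if (add_mapcount)
		atomic_inc(&folio->_mapcount);	/* rmap accounting first... */
	/* ...and only then make the PTEs visible to the page table walkers */
	set_ptes(vma->vm_mm, addr, pte, mk_pte(page, vma->vm_page_prot), nr);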
* Re: Folio mapcount
2023-02-07 23:08 ` Matthew Wilcox
@ 2023-02-07 23:27 ` James Houghton
2023-02-07 23:35 ` Matthew Wilcox
0 siblings, 1 reply; 52+ messages in thread
From: James Houghton @ 2023-02-07 23:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 7, 2023 at 3:08 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 07, 2023 at 02:56:52PM -0800, James Houghton wrote:
> > Hi Matthew,
> >
> > On Mon, Feb 6, 2023 at 12:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > I now think the mapcount has to be defined as "How many VMAs have
> > > one-or-more pages of this folio mapped".
> > >
> > > That means that our future folio_add_file_rmap_range() looks a bit
> > > like this:
> > >
> > > {
> > > 	bool add_mapcount = true;
> > >
> > > 	if (nr < folio_nr_pages(folio))
> > > 		add_mapcount = !folio_has_ptes(folio, vma);
> >
> > Can you elaborate on how folio_has_ptes() might be implemented?
>
> ... oh. First I called it folio_has_ptes() and then I went looking
> for inspiration on how to implement it, and I found page_mapped_in_vma()
> and so I thought folio_mapped_in_vma() would be a better name, but
> I forgot to change it here. Sorry.
Oh I see. Thanks. :)
So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
be any slower than what we currently do (like, incrementing up to
HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?
>
> > It seems like for a naturally aligned THP, we can see if the PMD is
> > pmd_present(), otherwise, do we need to check (potentially) all the
> > PTEs? Does this introduce an ordering requirement, where we have to
> > update the page table always before/after we call
> > folio_add_file_rmap_range()?
>
> Actually, you _mustn't_ update the page table first, or it will see
> the PTEs and say "Ah, the folio is already mapped, I should not
> increment mapcount". So we keep the calling order as it is now;
> folio_add_X_rmap() followed by a call to set_ptes().
Ok, thanks for clarifying.
- James
* Re: Folio mapcount
2023-02-07 23:27 ` James Houghton
@ 2023-02-07 23:35 ` Matthew Wilcox
2023-02-08 0:35 ` James Houghton
0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 23:35 UTC (permalink / raw)
To: James Houghton
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 07, 2023 at 03:27:07PM -0800, James Houghton wrote:
> So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
> PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
> be any slower than what we currently do (like, incrementing up to
> HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?
I think it's faster. Both of these operations work on folio_nr_pages()
entries ... but a page table is 8 bytes and a struct page is 64 bytes.
From a CPU prefetching point of view, they're both linear scans, but
PTEs are 8 times denser.
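To put numbers on that (my arithmetic, assuming 4KiB pages, 8-byte PTEs and
a 64-byte struct page): scanning the 512 PTEs of a PMD-sized folio touches
512 * 8 = 4KiB -- a single page-table page -- while touching 512 struct
pages walks 512 * 64 = 32KiB of memmap.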
The other factor to consider is how often we do each of these operations.
Mapping a folio happens ~once per call to mmap() (even though it's delayed
until page fault time). Querying folio_total_mapcount() happens ... less
often, I think? Both are going to be quite rare since generally we map
the entire folio at once.
* Re: Folio mapcount
2023-02-07 23:35 ` Matthew Wilcox
@ 2023-02-08 0:35 ` James Houghton
2023-02-08 2:26 ` Matthew Wilcox
0 siblings, 1 reply; 52+ messages in thread
From: James Houghton @ 2023-02-08 0:35 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 7, 2023 at 3:35 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 07, 2023 at 03:27:07PM -0800, James Houghton wrote:
> > So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
> > PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
> > be any slower than what we currently do (like, incrementing up to
> > HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?
>
> I think it's faster. Both of these operations work on folio_nr_pages()
> entries ... but a page table is 8 bytes and a struct page is 64 bytes.
> From a CPU prefetching point of view, they're both linear scans, but
> PTEs are 8 times denser.
>
> The other factor to consider is how often we do each of these operations.
> Mapping a folio happens ~once per call to mmap() (even though it's delayed
> until page fault time). Querying folio_total_mapcount() happens ... less
> often, I think? Both are going to be quite rare since generally we map
> the entire folio at once.
Maybe this is a case where we would see a regression: doing PAGE_SIZE
UFFDIO_CONTINUEs on a THP. Worst case, go from the end of the THP to
the beginning (ending up with a PTE-mapped THP at the end).
For the i'th PTE we map / i'th UFFDIO_CONTINUE, we have to check
`folio_nr_pages() - i` PTEs (for most of the iterations anyway). Seems
like this scales with the square of the size of the folio, so this
approach would be kind of a non-starter for HugeTLB (with
high-granularity mapping), I think.
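To put rough numbers on that (my arithmetic, assuming HPAGE_PMD_NR = 512):
the worst case scans about 511 + 510 + ... + 1 = 512 * 511 / 2 ~= 131k PTEs
in total for one THP; for a 1GiB hugetlb folio mapped at 4KiB granularity
(n = 262144), n^2 / 2 is on the order of 3.4e10 PTE visits.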
This example isn't completely contrived: if we did post-copy live
migration with userfaultfd, we might end up doing something like this.
I'm curious what you think. :)
- James
* Re: Folio mapcount
2023-02-08 0:35 ` James Houghton
@ 2023-02-08 2:26 ` Matthew Wilcox
0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-08 2:26 UTC (permalink / raw)
To: James Houghton
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 07, 2023 at 04:35:30PM -0800, James Houghton wrote:
> On Tue, Feb 7, 2023 at 3:35 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Feb 07, 2023 at 03:27:07PM -0800, James Houghton wrote:
> > > So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
> > > PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
> > > be any slower than what we currently do (like, incrementing up to
> > > HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?
> >
> > I think it's faster. Both of these operations work on folio_nr_pages()
> > entries ... but a page table is 8 bytes and a struct page is 64 bytes.
> > From a CPU prefetching point of view, they're both linear scans, but
> > PTEs are 8 times denser.
>
> >
> > The other factor to consider is how often we do each of these operations.
> > Mapping a folio happens ~once per call to mmap() (even though it's delayed
> > until page fault time). Querying folio_total_mapcount() happens ... less
> > often, I think? Both are going to be quite rare since generally we map
> > the entire folio at once.
>
> Maybe this is a case where we would see a regression: doing PAGE_SIZE
> UFFDIO_CONTINUEs on a THP. Worst case, go from the end of the THP to
> the beginning (ending up with a PTE-mapped THP at the end).
>
> For the i'th PTE we map / i'th UFFDIO_CONTINUE, we have to check
> `folio_nr_pages() - i` PTEs (for most of the iterations anyway). Seems
> like this scales with the square of the size of the folio, so this
> approach would be kind of a non-starter for HugeTLB (with
> high-granularity mapping), I think.
>
> This example isn't completely contrived: if we did post-copy live
> migration with userfaultfd, we might end up doing something like this.
> I'm curious what you think. :)
I think that's a great corner-case to consider. For hugetlb pages,
we know they're PMD/PUD aligned, so _if_ there's a page table present,
at least one page from the folio is already mapped, and we don't need
to look in the page table to find which one. Similarly, if the folio
is going to occupy the entire PMD/PUD if it's mapped in part, we don't
need to iterate within it. And contrariwise, if it's p*d_none(), then
definitely none of the pages are mapped.
That perhaps calls for using a different implementation than
page_vma_mapped_walk(), which should be worth it to optimise this case.
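A sketch of that special case (my illustration, hypothetical helper, not an
implementation): for a folio which, if mapped at all, must occupy the whole
PMD range, the per-PTE scan collapses into a single presence check:

	/* Hypothetical fast path for folio_mapped_in_vma() on hugetlb. */
	static bool hugetlb_folio_mapped_in_vma(struct folio *folio, pmd_t *pmd)
	{
		/*
		 * The folio exactly covers this PMD range, so the mere
		 * presence of a page table means at least one of its
		 * pages is mapped; pmd_none() means none are.
		 */
		return !pmd_none(*pmd);
	}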
* Re: Folio mapcount
2023-01-24 18:13 Folio mapcount Matthew Wilcox
` (3 preceding siblings ...)
2023-02-06 20:34 ` Matthew Wilcox
@ 2023-02-07 16:23 ` Zi Yan
2023-02-07 16:51 ` Matthew Wilcox
4 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2023-02-07 16:23 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
>
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
How about having two, full_folio_mapcount and partial_folio_mapcount?
If partial_folio_mapcount is 0, we can have a fast path without doing
anything at page level.
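A minimal sketch of what that fast path could look like at the top of an
rmap-add path (my illustration, with the suggested counters as invented
field names):

	if (atomic_read(&folio->_partial_folio_mapcount) == 0) {
		/* Folio is only ever mapped as a whole: no per-page work. */
		atomic_inc(&folio->_full_folio_mapcount);
		return;
	}
	/* Slow path: some subpages are mapped individually. */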
>
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
>
> Examples
> --------
>
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1. After taking a page fault on it in the other page table, its
> mapcount increases to 2.
>
> For a PMD-sized THP naturally aligned, mapcount is 1. Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it still has a reference from only one page table (a different page
> table, but still just one).
>
> Implementation sketch
> ---------------------
>
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table. We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each. We should have a:
>
> do_set_pte_range(vmf, folio, addr, first_page, n);
>
> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page. That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
>
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn. We'll want a function like
> page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
>
> Concerns
> --------
>
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that). But does this ever happen? I think
> we always unmap the entire folio, like in try_to_unmap_one().
>
> I haven't got my head around SetPageAnonExclusive() yet. I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.
>
> Notes
> -----
>
> [1] Ignoring the bias by -1 to let us detect transitions that we care
> about more efficiently; I'm talking about the value returned from
> page_mapcount(), not the value stored in page->_mapcount.
--
Best Regards,
Yan, Zi
* Re: Folio mapcount
2023-02-07 16:23 ` Zi Yan
@ 2023-02-07 16:51 ` Matthew Wilcox
2023-02-08 19:36 ` Zi Yan
0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 16:51 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>
> > Once we get to the part of the folio journey where we have
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go.
> > We can maintain extra state for a multi-page folio, but it has to be a
> > constant amount of extra state no matter how many pages are in the folio.
> >
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
>
> How about having two, full_folio_mapcount and partial_folio_mapcount?
> If partial_folio_mapcount is 0, we can have a fast path without doing
> anything at page level.
A fast path for what? I don't understand your vision; can you spell it
out for me? My current proposal is here:
https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
The three questions we need to be able to answer (in my current
understanding) are laid out here:
https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
Of course, the vision also needs to include how we account in
folio_add_(anon|file|new_anon)_rmap() and folio_remove_rmap().
> > I think there's a good performance win and simplification to be had
> > here, so I think it's worth doing for 6.4.
> >
> > Examples
> > --------
> >
> > In the simple and common case where every page in a folio is mapped
> > once by a single vma and single page table, mapcount would be 1 [1].
> > If the folio is mapped across a page table boundary by a single VMA,
> > after we take a page fault on it in one page table, it gets a mapcount
> > of 1. After taking a page fault on it in the other page table, its
> > mapcount increases to 2.
> >
> > For a PMD-sized THP naturally aligned, mapcount is 1. Splitting the
> > PMD into PTEs would not change the mapcount; the folio remains order-9
> > but it still has a reference from only one page table (a different page
> > table, but still just one).
> >
> > Implementation sketch
> > ---------------------
> >
> > When we take a page fault, we can/should map every page in the folio
> > that fits in this VMA and this page table. We do this at present in
> > filemap_map_pages() by looping over each page in the folio and calling
> > do_set_pte() on each. We should have a:
> >
> > do_set_pte_range(vmf, folio, addr, first_page, n);
> >
> > and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> > to pass in (folio, first, n) instead of page. That gives us one call to
> > page_add_*_rmap() per (vma, page table) tuple.
> >
> > In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> > each pfn. We'll want a function like
> > page_vma_mapped_walk_skip_to_end_of_ptable()
> > in order to persuade it to only call us once or twice if the folio
> > is mapped across a page table boundary.
> >
> > Concerns
> > --------
> >
> > We'll have to be careful to always zap all the PTEs for a given (vma,
> > pt) tuple at the same time, otherwise mapcount will get out of sync
> > (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> > but I don't think we can know that). But does this ever happen? I think
> > we always unmap the entire folio, like in try_to_unmap_one().
> >
> > I haven't got my head around SetPageAnonExclusive() yet. I think it can
> > be a per-folio bit, but handling a folio split across two page tables
> > may be tricky.
> >
> > Notes
> > -----
> >
> > [1] Ignoring the bias by -1 to let us detect transitions that we care
> > about more efficiently; I'm talking about the value returned from
> > page_mapcount(), not the value stored in page->_mapcount.
>
>
> --
> Best Regards,
> Yan, Zi
* Re: Folio mapcount
2023-02-07 16:51 ` Matthew Wilcox
@ 2023-02-08 19:36 ` Zi Yan
2023-02-08 19:54 ` Matthew Wilcox
0 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2023-02-08 19:36 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>
>>> Once we get to the part of the folio journey where we have
>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>> Currently we maintain a per-page mapcount, and that will have to go.
>>> We can maintain extra state for a multi-page folio, but it has to be a
>>> constant amount of extra state no matter how many pages are in the folio.
>>>
>>> My proposal is that we maintain a single mapcount per folio, and its
>>> definition is the number of (vma, page table) tuples which have a
>>> reference to any pages in this folio.
>>
>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>> If partial_folio_mapcount is 0, we can have a fast path without doing
>> anything at page level.
>
> A fast path for what? I don't understand your vision; can you spell it
> out for me? My current proposal is here:
A fast code path for handling folios only as a whole. For cases where
subpages of a folio are mapped individually, traversing the subpages might
be needed and will be slow. Separating the code might be cleaner and would
make whole-folio handling quicker.
For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
should be the responsibility of rmap. We could add a counter to rmap
instead. It seems that you are mixing page table mapping with virtual
address space (VMA) mapping together.
>
> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
>
> The three questions we need to be able to answer (in my current
> understanding) are laid out here:
>
> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
I think we probably need to clarify the definition of "map" in your
questions. Does it mean mapped by page tables or VMAs? When a page
is mapped into a VMA, it can be mapped by one or more page table entries,
but not the other way around, right? Or are shared page tables merged
now, so that more than one VMA can use a single page table entry to map
a folio?
>
> Of course, the vision also needs to include how we account in
> folio_add_(anon|file|new_anon)_rmap() and folio_remove_rmap().
>
>>> I think there's a good performance win and simplification to be had
>>> here, so I think it's worth doing for 6.4.
>>>
>>> Examples
>>> --------
>>>
>>> In the simple and common case where every page in a folio is mapped
>>> once by a single vma and single page table, mapcount would be 1 [1].
>>> If the folio is mapped across a page table boundary by a single VMA,
>>> after we take a page fault on it in one page table, it gets a mapcount
>>> of 1. After taking a page fault on it in the other page table, its
>>> mapcount increases to 2.
>>>
>>> For a PMD-sized THP naturally aligned, mapcount is 1. Splitting the
>>> PMD into PTEs would not change the mapcount; the folio remains order-9
>>> but it still has a reference from only one page table (a different page
>>> table, but still just one).
>>>
>>> Implementation sketch
>>> ---------------------
>>>
>>> When we take a page fault, we can/should map every page in the folio
>>> that fits in this VMA and this page table. We do this at present in
>>> filemap_map_pages() by looping over each page in the folio and calling
>>> do_set_pte() on each. We should have a:
>>>
>>> do_set_pte_range(vmf, folio, addr, first_page, n);
>>>
>>> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
>>> to pass in (folio, first, n) instead of page. That gives us one call to
>>> page_add_*_rmap() per (vma, page table) tuple.
>>>
>>> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
>>> each pfn. We'll want a function like
>>> page_vma_mapped_walk_skip_to_end_of_ptable()
>>> in order to persuade it to only call us once or twice if the folio
>>> is mapped across a page table boundary.
>>>
>>> Concerns
>>> --------
>>>
>>> We'll have to be careful to always zap all the PTEs for a given (vma,
>>> pt) tuple at the same time, otherwise mapcount will get out of sync
>>> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
>>> but I don't think we can know that). But does this ever happen? I think
>>> we always unmap the entire folio, like in try_to_unmap_one().
>>>
>>> I haven't got my head around SetPageAnonExclusive() yet. I think it can
>>> be a per-folio bit, but handling a folio split across two page tables
>>> may be tricky.
>>>
>>> Notes
>>> -----
>>>
>>> [1] Ignoring the bias by -1 to let us detect transitions that we care
>>> about more efficiently; I'm talking about the value returned from
>>> page_mapcount(), not the value stored in page->_mapcount.
>>
>>
>> --
>> Best Regards,
>> Yan, Zi
--
Best Regards,
Yan, Zi
* Re: Folio mapcount
2023-02-08 19:36 ` Zi Yan
@ 2023-02-08 19:54 ` Matthew Wilcox
2023-02-10 15:15 ` Zi Yan
2023-03-29 14:02 ` Yin, Fengwei
0 siblings, 2 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-08 19:54 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>
> > On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
> >> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
> >>
> >>> Once we get to the part of the folio journey where we have
> >>> one-pointer-per-page, we can't afford to maintain per-page state.
> >>> Currently we maintain a per-page mapcount, and that will have to go.
> >>> We can maintain extra state for a multi-page folio, but it has to be a
> >>> constant amount of extra state no matter how many pages are in the folio.
> >>>
> >>> My proposal is that we maintain a single mapcount per folio, and its
> >>> definition is the number of (vma, page table) tuples which have a
> >>> reference to any pages in this folio.
> >>
> >> How about having two, full_folio_mapcount and partial_folio_mapcount?
> >> If partial_folio_mapcount is 0, we can have a fast path without doing
> >> anything at page level.
> >
> > A fast path for what? I don't understand your vision; can you spell it
> > out for me? My current proposal is here:
>
> A fast code path for handling folios only as a whole. For cases where
> subpages of a folio are mapped individually, traversing the subpages might
> be needed and will be slow. Separating the code might be cleaner and would
> make whole-folio handling quicker.
To be clear, in this proposal, there is no subpage mapcount. I've got
my eye on one struct folio per allocation, so there will be no more
tail pages. The proposal has one mapcount, and that's it. I'd be
open to saying "OK, we need two mapcounts", but not to anything that
needs to scale per number of pages in the folio.
> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
> should be the responsibility of rmap. We could add a counter to rmap
> instead. It seems that you are mixing page table mapping with virtual
> address space (VMA) mapping together.
rmap tells you how many VMAs cover this folio. It doesn't tell you
how many of those VMAs have actually got any pages from it mapped.
It's also rather slower than a simple atomic_read(), so I think
you'll have an uphill battle trying to convince people to use rmap
for this purpose.
I'm not sure what you mean by "add a counter to rmap"? One count
per mapped page in the vma?
> >
> > https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
> >
> > The three questions we need to be able to answer (in my current
> > understanding) are laid out here:
> >
> > https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
>
> I think we probably need to clarify the definition of "map" in your
> questions. Does it mean mapped by page tables or VMAs? When a page
> is mapped into a VMA, it can be mapped by one or more page table entries,
> but not the other way around, right? Or is shared page table entry merged
> now so that more than one VMAs can use a single page table entry to map
> a folio?
Mapped by page tables, just like today. It'd be quite the change to
figure out the mapcount of a page newly brought into the page cache;
we'd have to do an rmap walk to see how many mapcounts to give it.
I don't think this is a great idea.
As far as I know, shared page tables are only supported by hugetlbfs,
and I prefer to stick cheese in my ears and pretend they don't exist.
To be absolutely concrete about this, my proposal is:
Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
that cover it)
When we take a page fault on one of the pages in it, its mapcount
increases from 0 to 1.
When we take another page fault on a page in it, we do a pvmw to
determine if any pages from this folio are already mapped by this VMA;
we see that there is one and we do not increment the mapcount.
We partially munmap() so that we need to unmap one of the pages.
We remove it from the page tables and call page_remove_rmap().
That does another pvmw and sees there's still a page in this folio
mapped by this VMA, so it does not decrement the mapcount.
We truncate() the file smaller than the position of the folio, which
causes us to unmap the rest of the folio. The pvmw walk detects no
more pages from this folio mapped and we decrement the mapcount.
Clear enough?
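Sketched as code (my reading of the sequence above; helper names assumed,
not existing kernel API):

	/* Called after this VMA's PTEs for some folio pages were cleared. */
	static void folio_remove_rmap_range(struct folio *folio,
					    struct vm_area_struct *vma)
	{
		/* pvmw: does this VMA still map any page of the folio? */
		if (!folio_mapped_in_vma(folio, vma))
			atomic_dec(&folio->_mapcount);
	}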
* Re: Folio mapcount
2023-02-08 19:54 ` Matthew Wilcox
@ 2023-02-10 15:15 ` Zi Yan
2023-03-29 14:02 ` Yin, Fengwei
1 sibling, 0 replies; 52+ messages in thread
From: Zi Yan @ 2023-02-10 15:15 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
David Hildenbrand, Yin, Fengwei
On 8 Feb 2023, at 14:54, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
>> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>>
>>> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>>>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>>>
>>>>> Once we get to the part of the folio journey where we have
>>>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>>>> Currently we maintain a per-page mapcount, and that will have to go.
>>>>> We can maintain extra state for a multi-page folio, but it has to be a
>>>>> constant amount of extra state no matter how many pages are in the folio.
>>>>>
>>>>> My proposal is that we maintain a single mapcount per folio, and its
>>>>> definition is the number of (vma, page table) tuples which have a
>>>>> reference to any pages in this folio.
>>>>
>>>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>>>> If partial_folio_mapcount is 0, we can have a fast path without doing
>>>> anything at page level.
>>>
>>> A fast path for what? I don't understand your vision; can you spell it
>>> out for me? My current proposal is here:
>>
>> A fast code path for handling folios only as a whole. For cases where
>> subpages of a folio are mapped individually, traversing the subpages might
>> be needed and will be slow. Separating the code might be cleaner and would
>> make whole-folio handling quicker.
>
> To be clear, in this proposal, there is no subpage mapcount. I've got
> my eye on one struct folio per allocation, so there will be no more
> tail pages. The proposal has one mapcount, and that's it. I'd be
> open to saying "OK, we need two mapcounts", but not to anything that
> needs to scale per number of pages in the folio.
>
>> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
>> should be the responsibility of rmap. We could add a counter to rmap
>> instead. It seems that you are mixing page table mapping with virtual
>> address space (VMA) mapping together.
>
> rmap tells you how many VMAs cover this folio. It doesn't tell you
> how many of those VMAs have actually got any pages from it mapped.
> It's also rather slower than a simple atomic_read(), so I think
> you'll have an uphill battle trying to convince people to use rmap
> for this purpose.
>
> I'm not sure what you mean by "add a counter to rmap"? One count
> per mapped page in the vma?
>
>>>
>>> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
>>>
>>> The three questions we need to be able to answer (in my current
>>> understanding) are laid out here:
>>>
>>> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
>>
>> I think we probably need to clarify the definition of "map" in your
>> questions. Does it mean mapped by page tables or VMAs? When a page
>> is mapped into a VMA, it can be mapped by one or more page table entries,
>> but not the other way around, right? Or is shared page table entry merged
>> now so that more than one VMAs can use a single page table entry to map
>> a folio?
>
> Mapped by page tables, just like today. It'd be quite the change to
> figure out the mapcount of a page newly brought into the page cache;
> we'd have to do an rmap walk to see how many mapcounts to give it.
> I don't think this is a great idea.
>
> As far as I know, shared page tables are only supported by hugetlbfs,
> and I prefer to stick cheese in my ears and pretend they don't exist.
>
> To be absolutely concrete about this, my proposal is:
>
> Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
> that cover it)
> When we take a page fault on one of the pages in it, its mapcount
> increases from 0 to 1.
> When we take another page fault on a page in it, we do a pvmw to
> determine if any pages from this folio are already mapped by this VMA;
> we see that there is one and we do not increment the mapcount.
> We partially munmap() so that we need to unmap one of the pages.
> We remove it from the page tables and call page_remove_rmap().
> That does another pvmw and sees there's still a page in this folio
> mapped by this VMA, so it does not decrement the mapcount.
> We truncate() the file smaller than the position of the folio, which
> causes us to unmap the rest of the folio. The pvmw walk detects no
> more pages from this folio mapped and we decrement the mapcount.
>
> Clear enough?
Yes. Thanks.
--
Best Regards,
Yan, Zi
* Re: Folio mapcount
2023-02-08 19:54 ` Matthew Wilcox
2023-02-10 15:15 ` Zi Yan
@ 2023-03-29 14:02 ` Yin, Fengwei
2023-07-01 1:17 ` Zi Yan
1 sibling, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-03-29 14:02 UTC (permalink / raw)
To: Matthew Wilcox, Zi Yan
Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand
Hi Matthew,
On 2/9/2023 3:54 AM, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
>> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>>
>>> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>>>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>>>
>>>>> Once we get to the part of the folio journey where we have
>>>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>>>> Currently we maintain a per-page mapcount, and that will have to go.
>>>>> We can maintain extra state for a multi-page folio, but it has to be a
>>>>> constant amount of extra state no matter how many pages are in the folio.
>>>>>
>>>>> My proposal is that we maintain a single mapcount per folio, and its
>>>>> definition is the number of (vma, page table) tuples which have a
>>>>> reference to any pages in this folio.
>>>>
>>>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>>>> If partial_folio_mapcount is 0, we can have a fast path without doing
>>>> anything at page level.
>>>
>>> A fast path for what? I don't understand your vision; can you spell it
>>> out for me? My current proposal is here:
>>
>> A fast code path for handling folios only as a whole. For cases where
>> subpages of a folio are mapped individually, traversing the subpages might
>> be needed and will be slow. Separating the code might be cleaner and would
>> make whole-folio handling quicker.
>
> To be clear, in this proposal, there is no subpage mapcount. I've got
> my eye on one struct folio per allocation, so there will be no more
> tail pages. The proposal has one mapcount, and that's it. I'd be
> open to saying "OK, we need two mapcounts", but not to anything that
> needs to scale per number of pages in the folio.
>
>> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
>> should be the responsibility of rmap. We could add a counter to rmap
>> instead. It seems that you are mixing page table mapping with virtual
>> address space (VMA) mapping together.
>
> rmap tells you how many VMAs cover this folio. It doesn't tell you
> how many of those VMAs have actually got any pages from it mapped.
> It's also rather slower than a simple atomic_read(), so I think
> you'll have an uphill battle trying to convince people to use rmap
> for this purpose.
>
> I'm not sure what you mean by "add a counter to rmap"? One count
> per mapped page in the vma?
>
>>>
>>> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
>>>
>>> The three questions we need to be able to answer (in my current
>>> understanding) are laid out here:
>>>
>>> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
>>
>> I think we probably need to clarify the definition of "map" in your
>> questions. Does it mean mapped by page tables or VMAs? When a page
>> is mapped into a VMA, it can be mapped by one or more page table entries,
>> but not the other way around, right? Or is shared page table entry merged
>> now so that more than one VMAs can use a single page table entry to map
>> a folio?
>
> Mapped by page tables, just like today. It'd be quite the change to
> figure out the mapcount of a page newly brought into the page cache;
> we'd have to do an rmap walk to see how many mapcounts to give it.
> I don't think this is a great idea.
>
> As far as I know, shared page tables are only supported by hugetlbfs,
> and I prefer to stick cheese in my ears and pretend they don't exist.
>
> To be absolutely concrete about this, my proposal is:
>
> Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
> that cover it)
> When we take a page fault on one of the pages in it, its mapcount
> increases from 0 to 1.
> When we take another page fault on a page in it, we do a pvmw to
> determine if any pages from this folio are already mapped by this VMA;
> we see that there is one and we do not increment the mapcount.
> We partially munmap() so that we need to unmap one of the pages.
> We remove it from the page tables and call page_remove_rmap().
> That does another pvmw and sees there's still a page in this folio
> mapped by this VMA, so it does not decrement the mapcount.
> We truncate() the file smaller than the position of the folio, which
> causes us to unmap the rest of the folio. The pvmw walk detects no
> more pages from this folio mapped and we decrement the mapcount.
>
> Clear enough?
I thought about this proposal for some time and would like to give it
a try.
I did a test comparing getting the mapcount via a pvmw walk vs. a
folio_mapcount() call:

1.
	while (page_vma_mapped_walk(&pvmw)) {
		mapcount++;
	}

2.
	mapcount = folio_mapcount(folio);

The pvmw walk is 3X slower than the folio_mapcount() call on an Ice Lake
platform.
Also, I noticed the following things when I read the related code:
1. If the entire folio is mapped to a VMA, it's not necessary to do a
pvmw walk. We can just increase the mapcount (or decrease it if the
folio is unmapped from the VMA).
2. The folio refcount update needs to be changed to match the mapcount
change. Otherwise, the #3 question in
https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
can't be answered.
3. The meaning of the lruvec stat NR_FILE_MAPPED will change, as we no
longer track each page's mapcount. This info is exposed to user space
through the meminfo interface.
4. The new mapcount represents how many VMAs the folio is mapped into. So
during a split_vma/merge_vma operation, we need to update the mapcount if
the split/merge happens in the middle of a folio.
Consider the following case:
A large folio with two COW pages in the middle of it.
|-----------------VMA---------------------------|
|---folio--|cow page1|cow page2|---folio|
And the split_vma happens between cow page1 and cow page2:
|----------VMA1----------| |-----------VMA2-----|
|---folio--|cow page1| |cow page2|---folio|
| split_vma here
How do we detect that we should update the folio mapcount in this case?
Or am I worrying about something that cannot happen?
Regards
Yin, Fengwei
* Re: Folio mapcount
2023-03-29 14:02 ` Yin, Fengwei
@ 2023-07-01 1:17 ` Zi Yan
2023-07-02 9:50 ` Yin, Fengwei
0 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2023-07-01 1:17 UTC (permalink / raw)
To: Yin, Fengwei
Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins,
Rik van Riel, David Hildenbrand
On 29 Mar 2023, at 10:02, Yin, Fengwei wrote:
> Hi Matthew,
>
> On 2/9/2023 3:54 AM, Matthew Wilcox wrote:
>> On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
>>> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>>>
>>>> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>>>>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>>>>
>>>>>> Once we get to the part of the folio journey where we have
>>>>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>>>>> Currently we maintain a per-page mapcount, and that will have to go.
>>>>>> We can maintain extra state for a multi-page folio, but it has to be a
>>>>>> constant amount of extra state no matter how many pages are in the folio.
>>>>>>
>>>>>> My proposal is that we maintain a single mapcount per folio, and its
>>>>>> definition is the number of (vma, page table) tuples which have a
>>>>>> reference to any pages in this folio.
>>>>>
>>>>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>>>>> If partial_folio_mapcount is 0, we can have a fast path without doing
>>>>> anything at page level.
>>>>
>>>> A fast path for what? I don't understand your vision; can you spell it
>>>> out for me? My current proposal is here:
>>>
>>> A fast code path for handling folios only as a whole. For cases where
>>> subpages of a folio are mapped individually, traversing the subpages might
>>> be needed and will be slow. Separating the code might be cleaner and would
>>> make whole-folio handling quicker.
>>
>> To be clear, in this proposal, there is no subpage mapcount. I've got
>> my eye on one struct folio per allocation, so there will be no more
>> tail pages. The proposal has one mapcount, and that's it. I'd be
>> open to saying "OK, we need two mapcounts", but not to anything that
>> needs to scale per number of pages in the folio.
>>
>>> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
>>> should be the responsibility of rmap. We could add a counter to rmap
>>> instead. It seems that you are mixing page table mapping with virtual
>>> address space (VMA) mapping together.
>>
>> rmap tells you how many VMAs cover this folio. It doesn't tell you
>> how many of those VMAs have actually got any pages from it mapped.
>> It's also rather slower than a simple atomic_read(), so I think
>> you'll have an uphill battle trying to convince people to use rmap
>> for this purpose.
>>
>> I'm not sure what you mean by "add a counter to rmap"? One count
>> per mapped page in the vma?
>>
>>>>
>>>> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
>>>>
>>>> The three questions we need to be able to answer (in my current
>>>> understanding) are laid out here:
>>>>
>>>> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
>>>
>>> I think we probably need to clarify the definition of "map" in your
>>> questions. Does it mean mapped by page tables or VMAs? When a page
>>> is mapped into a VMA, it can be mapped by one or more page table entries,
>>> but not the other way around, right? Or is shared page table entry merged
>>> now so that more than one VMAs can use a single page table entry to map
>>> a folio?
>>
>> Mapped by page tables, just like today. It'd be quite the change to
>> figure out the mapcount of a page newly brought into the page cache;
>> we'd have to do an rmap walk to see how many mapcounts to give it.
>> I don't think this is a great idea.
>>
>> As far as I know, shared page tables are only supported by hugetlbfs,
>> and I prefer to stick cheese in my ears and pretend they don't exist.
>>
>> To be absolutely concrete about this, my proposal is:
>>
>> Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
>> that cover it)
>> When we take a page fault on one of the pages in it, its mapcount
>> increases from 0 to 1.
>> When we take another page fault on a page in it, we do a pvmw to
>> determine if any pages from this folio are already mapped by this VMA;
>> we see that there is one and we do not increment the mapcount.
>> We partially munmap() so that we need to unmap one of the pages.
>> We remove it from the page tables and call page_remove_rmap().
>> That does another pvmw and sees there's still a page in this folio
>> mapped by this VMA, so it does not decrement the mapcount.
>> We truncate() the file smaller than the position of the folio, which
>> causes us to unmap the rest of the folio. The pvmw walk detects no
>> more pages from this folio mapped and we decrement the mapcount.
>>
>> Clear enough?
>
> I thought about this proposal for some time and would like to give it
> a try.
>
> I did a test comparing getting the mapcount via a pvmw walk vs. a
> folio_mapcount() call:
> 1.
> 	while (page_vma_mapped_walk(&pvmw)) {
> 		mapcount++;
> 	}
>
> 2.
> 	mapcount = folio_mapcount(folio);
>
> The pvmw walk is 3X slower than the folio_mapcount() call on an Ice Lake
> platform.
>
>
> Also, I noticed the following things when I read the related code:
> 1. If the entire folio is mapped to a VMA, it's not necessary to do a
> pvmw walk. We can just increase the mapcount (or decrease it if the
> folio is unmapped from the VMA).
>
> 2. The folio refcount update needs to be changed to match the mapcount
> change. Otherwise, the #3 question in
> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
> can't be answered.
>
> 3. The meaning of the lruvec stat NR_FILE_MAPPED will change, as we no
> longer track each page's mapcount. This info is exposed to user space
> through the meminfo interface.
>
> 4. The new mapcount represents how many VMAs the folio is mapped into. So
> during a split_vma/merge_vma operation, we need to update the mapcount if
> the split/merge happens in the middle of a folio.
>
> Consider the following case:
> A large folio with two COW pages in the middle of it.
> |-----------------VMA---------------------------|
> |---folio--|cow page1|cow page2|---folio|
>
> And the split_vma happens between cow page1 and cow page2:
> |----------VMA1----------| |-----------VMA2-----|
> |---folio--|cow page1| |cow page2|---folio|
> | split_vma here
>
> How do we detect that we should update the folio mapcount in this case?
> Or am I worrying about something that cannot happen?
I also did some study on mapcounts and tried to use a single mapcount
instead of the existing various mapcounts. My conclusion is that from the
kernel's perspective, a single mapcount is enough, but we will need the
per-page mapcount and entire_mapcount for the userspace stats
NR_{ANON,FILE}_MAPPED and NR_ANON_THPS.
In the kernel, almost all code only cares about: 1) whether a page/folio
has extra pins, checked by comparing the refcount against the mapcount
plus an expected bias, and 2) whether a page/folio is mapped multiple
times. A single mapcount can meet these two needs.
But in userspace, to maintain the accuracy of NR_{ANON,FILE}_MAPPED
and NR_ANON_THPS, the kernel needs to know when the corresponding mapcount
goes from 0 to 1 (increase the counter) and 1 to 0 (decrease the counter).
For NR_{ANON,FILE}_MAPPED, the counter is increased when a page is first
mapped, either by a PTE or by being covered by a PMD, and decreased when a
page loses its last mapping from PTE or PMD. This means that without the
per-page mapcount and entire_mapcount, we cannot get them right. For
NR_ANON_THPS, entire_mapcount is needed. A single mapcount is a mix of the
per-page mapcount and entire_mapcount, and the kernel is not able to
recover the necessary information for NR_*.
I wonder if userspace can live without these stats, or with different
counters. NR_ANON_MAPPED is "AnonPages", NR_FILE_MAPPED is "Mapped" or
"file", and NR_ANON_THPS is "AnonHugePages"/"anon_thp". Can we just count
anonymous pages and file pages regardless of whether they are mapped or
not instead? Does userspace really want to know about mapped pages? If
that change can be made, we can probably have a single mapcount.
BTW, I am not sure pvmw would work to check per-page or entire mapcounts,
since that means for every rmap removal, pvmw is needed to decide
whether to decrease NR_* counters. That seems to be expensive.
Let me know if I miss anything. Thanks.
--
Best Regards,
Yan, Zi
* Re: Folio mapcount
2023-07-01 1:17 ` Zi Yan
@ 2023-07-02 9:50 ` Yin, Fengwei
2023-07-02 11:45 ` David Hildenbrand
0 siblings, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-07-02 9:50 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins,
Rik van Riel, David Hildenbrand
On 7/1/2023 9:17 AM, Zi Yan wrote:
> In the kernel, almost all code only cares about: 1) whether a page/folio
> has extra pins, checked by comparing the refcount against the mapcount
> plus an expected bias, and 2) whether a page/folio is mapped multiple
> times. A single mapcount can meet these two needs.
For 2, how can we know whether a page/folio is mapped multiple times from
a single mapcount? My understanding is that we need two counts, as a folio
could be partially mapped.
Regards
Yin, Fengwei
* Re: Folio mapcount
2023-07-02 9:50 ` Yin, Fengwei
@ 2023-07-02 11:45 ` David Hildenbrand
2023-07-02 12:26 ` Matthew Wilcox
2023-07-02 19:51 ` Zi Yan
0 siblings, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-07-02 11:45 UTC (permalink / raw)
To: Yin, Fengwei, Zi Yan
Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel
On 02.07.23 11:50, Yin, Fengwei wrote:
>
>
> On 7/1/2023 9:17 AM, Zi Yan wrote:
>> In the kernel, almost all code only cares about: 1) whether a page/folio
>> has extra pins, checked by comparing the refcount against the mapcount
>> plus an expected bias, and 2) whether a page/folio is mapped multiple
>> times. A single mapcount can meet these two needs.
> For 2, how can we know whether a page/folio is mapped multiple times from
> a single mapcount? My understanding is that we need two counts, as a folio
> could be partially mapped.
Yes, a single mapcount is most probably insufficient. I started
analyzing all existing users and use cases, trying to avoid walking page
tables.
If we want to get rid of all (or most) sub-page mapcounts, we'd probably
want:
(1) Total mapcount (compound + any sub-page): page_mapped(), pagecount
vs. refcount games, ...
(2) Compound mapcount (for PMD/PUD-mappable THP only): (2) - (1) tells
you if it's only PMD mapped or also PTE-mapped. For example, for
statistics but also swapout code.
(3) Mapcount of first (or any other) subpage (compound+subpage): for
folio_estimated_sharers().
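Spelled out as per-folio state, that taxonomy might look like the following
sketch (field names invented for illustration):

	struct folio {
		/* ... */
		atomic64_t _total_mapcount;	/* (1) compound + any sub-page */
		atomic_t   _entire_mapcount;	/* (2) PMD/PUD mappings only */
		atomic_t   _first_mapcount;	/* (3) subpage 0, for
						 * folio_estimated_sharers() */
		/* ... */
	};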
For anon pages, I'm thinking about remembering an additional
(1) Page/folio creator (MM pointer/identification)
(2) Page/folio creator mapcount
When optimizing a PTE-mapped THP (especially a not-PMD-mappable one) for the
fork()+exec() case, we'd have to walk page tables to see if all folio
references come from this MM. The page/folio creator exactly avoids that
completely. We might need a mechanism to synchronize against
mapping/unmapping of this folio from the creator concurrently (relevant
when mapped into multiple page tables).
Further, for (1) we'd want a 64bit mapcount for large folios, which
implies a 64bit refcount. For smallish folios, we don't really care.
We should most probably use a bi-weekly MM meeting to discuss that.
Hopefully, I'll have a full understanding of all use cases and requirements
by then. I don't have sufficient time to look into all the nasty
details right now.
--
Cheers,
David / dhildenb
* Re: Folio mapcount
2023-07-02 11:45 ` David Hildenbrand
@ 2023-07-02 12:26 ` Matthew Wilcox
2023-07-03 20:54 ` David Hildenbrand
2023-07-02 19:51 ` Zi Yan
1 sibling, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-07-02 12:26 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin, Fengwei, Zi Yan, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel
On Sun, Jul 02, 2023 at 01:45:48PM +0200, David Hildenbrand wrote:
> Further, for (1) we'd want a 64bit mapcount for large folios, which implies
> a 64bit refcount. For smallish folios, we don't really care.
>
>
> We should most probably use a bi-weekly MM meeting to discuss that.
We have a bi-weekly meeting to discuss all these things; it's the same
time/day as the MM meeting, but the other weeks in-between.
Last one, we discussed the idea of having a 64-bit mapcount stored in
a tail page, but having mapcount only contribute 1 to refcount instead
of refcount being incremented for every mapcount. We do this trick with
mm_users and mm_count (for different reasons, but it's not
unprecedented).
eg we could do this as:
page_add_anon_rmap:

	if (folio_test_large(folio))
		first = atomic64_inc_and_test(&folio->_mapcount64);
	else
		first = atomic_inc_and_test(&page->_mapcount);
	if (!first)
		folio_put(folio);
which is substantially simpler than what's there now. The accounting
needs a bit of extra work.
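Presumably the unmap side would mirror it, dropping the single reference
the mappings collectively held only on the last unmap (my extrapolation,
not from the meeting, same caveats as above):

page_remove_anon_rmap:

	if (folio_test_large(folio))
		last = atomic64_dec_and_test(&folio->_mapcount64);
	else
		last = atomic_dec_and_test(&page->_mapcount);
	if (last)
		folio_put(folio);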
* Re: Folio mapcount
2023-07-02 12:26 ` Matthew Wilcox
@ 2023-07-03 20:54 ` David Hildenbrand
0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-07-03 20:54 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yin, Fengwei, Zi Yan, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel
On 02.07.23 14:26, Matthew Wilcox wrote:
> On Sun, Jul 02, 2023 at 01:45:48PM +0200, David Hildenbrand wrote:
>> Further, for (1) we'd want a 64bit mapcount for large folios, which implies
>> a 64bit refcount. For smallish folios, we don't really care.
>>
>>
>> We should most probably use a bi-weekly MM meeting to discuss that.
>
> We have a bi-weekly meeting to discuss all these things; it's the same
> time/day as the MM meeting, but the other weeks in-between.
Is there some kind of official invitation with meeting details that I
missed?
In any case, I'd appreciate a pointer, so I can join.
>
> Last one, we discussed the idea of having a 64-bit mapcount stored in
> a tail page, but having mapcount only contribute 1 to refcount instead
> of refcount being incremented for every mapcount. We do this trick with
> mm_users and mm_count (for different reasons, but it's not
> unprecedented).
>
Okay, to avoid a 64bit refcount for higher-order folios.
> eg we could do this as:
>
> page_add_anon_rmap:
>
> 	if (folio_test_large(folio))
> 		first = atomic64_inc_and_test(&folio->_mapcount64);
> 	else
> 		first = atomic_inc_and_test(&page->_mapcount);
> 	if (!first)
> 		folio_put(folio);
>
> which is substantially simpler than what's there now. The accounting
> needs a bit of extra work.
Some places that check for page_count() == 1 have to be taught about
that as well.
Also, the rmap vs. refcount handling for __tlb_remove_page() e.g., in
zap_pte_range() has to be considered: just because we dropped the
mapcount doesn't mean we want to drop the refcount. Maybe we could take
an extra ref for that.
It surely doesn't sound completely crazy, but this needs more thought
(and you discussed that in the meeting most probably already).
--
Cheers,
David / dhildenb
* Re: Folio mapcount
2023-07-02 11:45 ` David Hildenbrand
2023-07-02 12:26 ` Matthew Wilcox
@ 2023-07-02 19:51 ` Zi Yan
2023-07-03 1:09 ` Yin, Fengwei
2023-07-03 21:09 ` David Hildenbrand
1 sibling, 2 replies; 52+ messages in thread
From: Zi Yan @ 2023-07-02 19:51 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin, Fengwei, Matthew Wilcox, linux-mm, Vishal Moola,
Hugh Dickins, Rik van Riel
On 2 Jul 2023, at 7:45, David Hildenbrand wrote:
> On 02.07.23 11:50, Yin, Fengwei wrote:
>>
>>
>> On 7/1/2023 9:17 AM, Zi Yan wrote:
>>> In the kernel, almost all code only cares about: 1) whether a page/folio
>>> has extra pins, checked by comparing the refcount against the mapcount
>>> plus an expected bias, and 2) whether a page/folio is mapped multiple
>>> times. A single mapcount can meet these two needs.
>> For 2, how can we know whether a page/folio is mapped multiple times from
>> a single mapcount? My understanding is that we need two counts, as a folio
>> could be partially mapped.
>
> Yes, a single mapcount is most probably insufficient. I started analyzing all existing users and use cases, trying to avoid walking page tables.
From my understanding, a single mapcount is sufficient for the kernel users
that call page_mapcount(), because they either check the mapcount against
the refcount to see if a page has extra pins, or check the mapcount to see
if a page is mapped more than once.
>
> If we want to get rid of all (or most) sub-page mapcounts, we'd probably want:
>
> (1) Total mapcount (compound + any sub-page): page_mapped(), pagecount
> vs. refcount games, ...
a single mapcount is sufficient in this case.
>
> (2) Compound mapcount (for PMD/PUD-mappable THP only): (2) - (1) tells
> you if it's only PMD mapped or also PTE-mapped. For example, for
> statistics but also swapout code.
For statistics, this is for NR_{ANON,FILE}_MAPPED and NR_ANON_THP. I wonder
if we can use the number of anonymous/file pages and THPs instead, without
caring about whether they are mapped or not.
For swapout, folio_entire_mapcount() is used to estimate if a THP is fully
mapped or not. I wonder if we can get away with another estimation like
total_mapcount() > folio_nr_pages().
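As a sketch (the helper name is made up, and arguably it should be >=
rather than >, since a folio with every page mapped exactly once has
total_mapcount() == folio_nr_pages()):

	/* heuristic only: can be fooled by one page mapped many times */
	static inline bool folio_maybe_fully_mapped(struct folio *folio)
	{
		return folio_mapcount(folio) >= folio_nr_pages(folio);
	}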
>
> (3) Mapcount of first (or any other) subpage (compound+subpage): for
> folio_estimated_sharers().
This is another estimation. I wonder if we can use a different estimation
like total_mapcount() > folio_nr_pages() instead.
>
> For anon pages, I'm thinking about remembering an additional
>
> (1) Page/folio creator (MM pointer/identification)
> (2) Page/folio creator mapcount
>
> When optimizing a PTE-mapped THP (especially one that's not PMD-mappable) for the fork()+exec() case, we'd have to walk page tables to see if all folio references come from this MM. The page/folio creator information avoids exactly that. We might need a mechanism to synchronize against concurrent mapping/unmapping of this folio by the creator (relevant when it is mapped into multiple page tables).
creator_mapcount < total_mapcount means multiple MMs map this folio? And this is for
the page-exclusive check? Sorry, I have not checked the code in detail yet. Syncing
creator_mapcount with total_mapcount might have some extra cost. I wonder if
this can be solved by checking num_active_vmas in the anon_vma of a folio.
>
>
> Further, for (1) we'd want a 64bit mapcount for large folios, which implies a 64bit refcount. For smallish folios, we don't really care.
>
>
> We should most probably use a bi-weekly MM meeting to discuss that.
>
> Hopefully, I will have a full understanding of all use cases and requirements by then. I don't have sufficient time to look into all the nasty details right now.
I agree that we should discuss this more and come up with a detailed list of all use
cases, to make sure we do not miss any and, hopefully, to simplify the use
of the various mapcounts where possible.
--
Best Regards,
Yan, Zi
* Re: Folio mapcount
2023-07-02 19:51 ` Zi Yan
@ 2023-07-03 1:09 ` Yin, Fengwei
2023-07-03 13:24 ` Zi Yan
2023-07-03 21:09 ` David Hildenbrand
1 sibling, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-07-03 1:09 UTC (permalink / raw)
To: Zi Yan, David Hildenbrand
Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel
On 7/3/2023 3:51 AM, Zi Yan wrote:
>> (3) Mapcount of first (or any other) subpage (compound+subpage): for
>> folio_estimated_sharers().
> This is another estimation. I wonder if we can use a different estimation
> like total_mapcount() > folio_nr_pages() instead.
Considering the partial folio mapping, I suppose we should use
total_mapcount() > folio->_nr_pages_mapped
as folio_estimated_sharers().
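As a sketch (the helper name is made up, and this glosses over the fact
that _nr_pages_mapped also encodes PMD mappings in a high bit):

	/* more mappings than distinct mapped pages => likely shared */
	static inline bool folio_maybe_shared(struct folio *folio)
	{
		return total_mapcount(folio) >
		       atomic_read(&folio->_nr_pages_mapped);
	}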
Regards
Yin, Fengwei
* Re: Folio mapcount
2023-07-03 1:09 ` Yin, Fengwei
@ 2023-07-03 13:24 ` Zi Yan
2023-07-03 20:46 ` David Hildenbrand
2023-07-04 1:22 ` Yin, Fengwei
0 siblings, 2 replies; 52+ messages in thread
From: Zi Yan @ 2023-07-03 13:24 UTC (permalink / raw)
To: Yin, Fengwei
Cc: David Hildenbrand, Matthew Wilcox, linux-mm, Vishal Moola,
Hugh Dickins, Rik van Riel
On 2 Jul 2023, at 21:09, Yin, Fengwei wrote:
> On 7/3/2023 3:51 AM, Zi Yan wrote:
>>> (3) Mapcount of first (or any other) subpage (compound+subpage): for
>>> folio_estimated_sharers().
>> This is another estimation. I wonder if we can use a different estimation
>> like total_mapcount() > folio_nr_pages() instead.
> Considering the partial folio mapping, I suppose we should use
> total_mapcount() > folio->_nr_pages_mapped
> as folio_estimated_sharers().
What you propose is to get a precise check instead of an estimate, and you assume no PMD mapping and still require a per-page mapcount.
What I am proposing is to get rid of the per-page mapcount, which is my goal, and use a single mapcount to do a rough estimate.
--
Best Regards,
Yan, Zi
* Re: Folio mapcount
2023-07-03 13:24 ` Zi Yan
@ 2023-07-03 20:46 ` David Hildenbrand
2023-07-04 1:22 ` Yin, Fengwei
1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-07-03 20:46 UTC (permalink / raw)
To: Zi Yan, Yin, Fengwei
Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel
On 03.07.23 15:24, Zi Yan wrote:
> On 2 Jul 2023, at 21:09, Yin, Fengwei wrote:
>
>> On 7/3/2023 3:51 AM, Zi Yan wrote:
>>>> (3) Mapcount of first (or any other) subpage (compound+subpage): for
>>>> folio_estimated_sharers().
>>> This is another estimation. I wonder if we can use a different estimation
>>> like total_mapcount() > folio_nr_pages() instead.
>> Considering the partial folio mapping, I suppose we should use
>> total_mapcount() > folio->_nr_pages_mapped
>> as folio_estimated_sharers().
>
> What you propose is to get a precise check instead of an estimate, and you assume no PMD mapping and still require a per-page mapcount.
>
> What I am proposing is to get rid of the per-page mapcount, which is my goal, and use a single mapcount to do a rough estimate.
... and I am (a) not sure it's desirable to have only a single mapcount
just for the sake of it, and (b) for something that's PMD-mappable (IOW,
a 2 MiB THP), the overhead of having 1 vs. 2 vs. 3 mapcounts is pretty
insignificant. 512 vs. 3 is still a big win.
... but again, I have to finish my homework first to make sure I haven't
missed any corner cases.
--
Cheers,
David / dhildenb
* Re: Folio mapcount
2023-07-03 13:24 ` Zi Yan
2023-07-03 20:46 ` David Hildenbrand
@ 2023-07-04 1:22 ` Yin, Fengwei
2023-07-04 2:25 ` Matthew Wilcox
1 sibling, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-07-04 1:22 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Matthew Wilcox, linux-mm, Vishal Moola,
Hugh Dickins, Rik van Riel
On 7/3/2023 9:24 PM, Zi Yan wrote:
> On 2 Jul 2023, at 21:09, Yin, Fengwei wrote:
>
>> On 7/3/2023 3:51 AM, Zi Yan wrote:
>>>> (3) Mapcount of first (or any other) subpage (compound+subpage): for
>>>> folio_estimated_sharers().
>>> This is another estimation. I wonder if we can use a different estimation
>>> like total_mapcount() > folio_nr_pages() instead.
>> Considering the partial folio mapping, I suppose we should use
>> total_mapcount() > folio->_nr_pages_mapped
>> as folio_estimated_sharers().
>
> What you propose is to get a precise check instead of an estimate, and you assume no PMD mapping and still require a per-page mapcount.
>
> What I am proposing is to get rid of the per-page mapcount, which is my goal, and use a single mapcount to do a rough estimate.
Oh, sorry, I didn't notice your goal. So if a rough estimate is enough,
total_mapcount() > folio_nr_pages() works.
Do we need to check all the cases to make sure the rough estimate is
enough for all of them?
Regards
Yin, Fengwei
>
> --
> Best Regards,
> Yan, Zi
* Re: Folio mapcount
2023-07-04 1:22 ` Yin, Fengwei
@ 2023-07-04 2:25 ` Matthew Wilcox
0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-07-04 2:25 UTC (permalink / raw)
To: Yin, Fengwei
Cc: Zi Yan, David Hildenbrand, linux-mm, Vishal Moola, Hugh Dickins,
Rik van Riel
On Tue, Jul 04, 2023 at 09:22:07AM +0800, Yin, Fengwei wrote:
>
>
> On 7/3/2023 9:24 PM, Zi Yan wrote:
> > On 2 Jul 2023, at 21:09, Yin, Fengwei wrote:
> >
> >> On 7/3/2023 3:51 AM, Zi Yan wrote:
> >>>> (3) Mapcount of first (or any other) subpage (compound+subpage): for
> >>>> folio_estimated_sharers().
> >>> This is another estimation. I wonder if we can use a different estimation
> >>> like total_mapcount() > folio_nr_pages() instead.
> >> Considering the partial folio mapping, I suppose we should use
> >> total_mapcount() > folio->_nr_pages_mapped
> >> as folio_estimated_sharers().
> >
> > What you propose is to get a precise check instead of an estimate, and you assume no PMD mapping and still require a per-page mapcount.
> >
> > What I am proposing is to get rid of the per-page mapcount, which is my goal, and use a single mapcount to do a rough estimate.
> Oh, sorry, I didn't notice your goal. So if a rough estimate is enough,
> total_mapcount() > folio_nr_pages() works.
>
> Do we need to check all the cases to make sure the rough estimate is
> enough for all of them?
That's definitely not enough for deciding whether we need to COW a page or
not. We need to be able to tell the difference between an order-4 folio
with page 1 mapped in two tasks and an order-4 folio with page 0 & page
1 mapped in a single task.
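Concretely, for an order-4 folio (16 pages), both cases end up with the
same folio-wide total:

	Case A: page 1 mapped by task T1 and by task T2
		-> total mapcount 2; the page is shared, we must COW.
	Case B: page 0 and page 1 each mapped once by task T1
		-> total mapcount 2; the pages are exclusive, we may reuse.

A single folio-wide count cannot tell these two apart.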
* Re: Folio mapcount
2023-07-02 19:51 ` Zi Yan
2023-07-03 1:09 ` Yin, Fengwei
@ 2023-07-03 21:09 ` David Hildenbrand
1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-07-03 21:09 UTC (permalink / raw)
To: Zi Yan
Cc: Yin, Fengwei, Matthew Wilcox, linux-mm, Vishal Moola,
Hugh Dickins, Rik van Riel
On 02.07.23 21:51, Zi Yan wrote:
> On 2 Jul 2023, at 7:45, David Hildenbrand wrote:
>
>> On 02.07.23 11:50, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/1/2023 9:17 AM, Zi Yan wrote:
>>>> In the kernel, almost all code only cares about: 1) whether a page/folio has
>>>> extra pins, by checking if mapcount is equal to refcount + extra, and 2)
>>>> whether a page/folio is mapped multiple times. A single mapcount can meet
>>>> these two needs.
>>> For 2, how can we know whether a page/folio is mapped multiple times from
>>> a single mapcount? My understanding is we need two counts, as a folio could be
>>> partially mapped.
>>
>> Yes, a single mapcount is most probably insufficient. I started analyzing all existing users and use cases, trying to avoid walking page tables.
>
> From my understanding, a single mapcount is sufficient for kernel users, which
> call page_mapcount(): they either check mapcount against refcount to
> see if a page has extra pins, or check mapcount to see if a page is mapped more
> than once.
>
There are cases where we want to know "do we have PTE mappings?", but I
have yet to write it all up.
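For example, something like this (sketch; made-up helper name, assuming we
keep both a total and an entire/compound mapcount):

	/* any PTE mappings at all, or only PMD/PUD mappings? */
	static inline bool folio_has_pte_mappings(struct folio *folio)
	{
		return folio_mapcount(folio) > folio_entire_mapcount(folio);
	}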
>>
>> If we want to get rid of all (or most) sub-page mapcounts, we'd probably want:
>>
>> (1) Total mapcount (compound + any sub-page): page_mapped(), pagecount
>> vs. refcount games, ...
>
> A single mapcount is sufficient in this case.
Well, that's what I describe here: 1) covers exactly these cases.
>
>>
>> (2) Compound mapcount (for PMD/PUD-mappable THP only): (1) - (2) tells
>> you if it's only PMD-mapped or also PTE-mapped. For example, for
>> statistics but also swapout code.
>
> For statistics, this covers NR_{ANON,FILE}_MAPPED and NR_ANON_THP. I wonder
> if we can use the number of anonymous/file pages and THPs instead, without
> caring about whether they are mapped or not.
>
> For swapout, folio_entire_mapcount() is used to estimate whether a THP is fully
> mapped. I wonder if we can get away with another estimation, like
> total_mapcount() > folio_nr_pages().
What do we gain by that? Again, I don't see a reason to degrade the current
state just by trying to achieve 1 mapcount when it barely matters
if we have 2 or 3 instead. Right now we have 513, and with any larger
folios significantly more ... than 2 or 3.
>
>>
>> (3) Mapcount of first (or any other) subpage (compound+subpage): for
>> folio_estimated_sharers().
>
> This is another estimation. I wonder if we can use a different estimation
> like total_mapcount() > folio_nr_pages() instead.
At least not for PMD-mapped THP. Maybe we could make do with (2). But I
recall some cases where it got ugly; I will try to remember them.
>
>>
>> For anon pages, I'm thinking about remembering an additional
>>
>> (1) Page/folio creator (MM pointer/identification)
>> (2) Page/folio creator mapcount
>>
>> When optimizing a PTE-mapped THP (especially one that's not PMD-mappable) for the fork()+exec() case, we'd have to walk page tables to see if all folio references come from this MM. The page/folio creator information avoids exactly that. We might need a mechanism to synchronize against concurrent mapping/unmapping of this folio by the creator (relevant when it is mapped into multiple page tables).
>
> creator_mapcount < total_mapcount means multiple MMs map this folio? And this is for
> the page-exclusive check? Sorry, I have not checked the code in detail yet. Syncing
Right now, if !PageAnonExclusive, we essentially do:

	if (page_count() != 1)
		copy;
	reuse;

to see if we really hold the only reference to that folio.
If we could stabilize the creator's mapcount, it would be something like:

	if (f->creator != mm || page_count(f) != f->creators_mapcount)
		copy;
	reuse;
So we wouldn't have to scan page tables to identify whether we're responsible
for all of the page references via our page tables.
But so far that's only an idea I had when thinking about how to avoid
page table scans for the simple fork()+exec() case; it hasn't matured yet.
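Purely illustrative (all names made up), the folio would carry something
like:

	struct folio {
		/* ... existing fields ... */
		struct mm_struct *creator;	/* who faulted the folio in */
		atomic_t creators_mapcount;	/* creator's own mappings */
	};

and the wp-fault test above would then compare page_count() against
creators_mapcount without any page table walk.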
> creator_mapcount with total_mapcount might have some extra cost. I wonder if
> this can be solved by checking num_active_vmas in the anon_vma of a folio.
As we nowadays match the actual references (i.e., page_count() != 1),
that's most probably insufficient and, from what I recall, easily less precise.
--
Cheers,
David / dhildenb