From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>, linux-mm@kvack.org
Cc: Vishal Moola <vishal.moola@gmail.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@surriel.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>
Subject: Re: Folio mapcount
Date: Tue, 24 Jan 2023 19:35:35 +0100
Message-ID: <3cc8f142-a69d-ae84-6a33-50bdc9aade21@redhat.com>
In-Reply-To: <Y9Afwds/Jl39UjEp@casper.infradead.org>

On 24.01.23 19:13, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
> 
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
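
A rough sketch of the counting rule, just to make it concrete -- the helper
names and the _total_mapcount field below are made up, not existing kernel
API: the count goes up by one when the first page of the folio is mapped by a
given (vma, page table) tuple, and down by one when the last page mapped by
that tuple is unmapped.

  /* Hypothetical: one counter per folio, not one per page. */
  static inline void folio_map_tuple(struct folio *folio)
  {
          /* First PTE for this (vma, page table) tuple was installed. */
          atomic_inc(&folio->_total_mapcount);
  }

  static inline void folio_unmap_tuple(struct folio *folio)
  {
          /* Last PTE for this (vma, page table) tuple was zapped. */
          atomic_dec(&folio->_total_mapcount);
  }
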
> 
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
> 
> Examples
> --------
> 
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1.  After taking a page fault on it in the other page table, its
> mapcount increases to 2.
> 
> For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it still has a reference from only one page table (a different page
> table, but still just one).
> 
> Implementation sketch
> ---------------------
> 
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table.  We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each.  We should have a:
> 
>                  do_set_pte_range(vmf, folio, addr, first_page, n);
> 
> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page.  That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
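
Roughly, and with folio_add_file_rmap_range() being hypothetical at this
point, the fault path could look something like the sketch below (dirty /
write / uffd handling omitted; vmf->pte is assumed to already point at the
PTE for 'addr'):

  static void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
                               unsigned long addr, unsigned long first_page,
                               unsigned int n)
  {
          struct vm_area_struct *vma = vmf->vma;
          struct page *page = folio_page(folio, first_page);
          unsigned int i;

          /* One rmap call per (vma, page table) tuple: mapcount += 1. */
          folio_add_file_rmap_range(folio, first_page, n, vma);

          /* Install one PTE per page of the range. */
          for (i = 0; i < n; i++, page++, addr += PAGE_SIZE) {
                  pte_t entry = mk_pte(page, vma->vm_page_prot);

                  set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
          }
  }
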
> 
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn.  We'll want a function like
>          page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
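
Something like the following in the rmap walk, where the range-zap helper,
the skip helper and the _total_mapcount field are the hypothetical parts:

  while (page_vma_mapped_walk(&pvmw)) {
          /* Zap every PTE the folio has in this page table ... */
          zap_folio_ptes_in_this_page_table(&pvmw);

          /* ... account a single mapcount decrement for this
           * (vma, page table) tuple ... */
          atomic_dec(&folio->_total_mapcount);

          /* ... and don't get called back for the PTEs just handled. */
          page_vma_mapped_walk_skip_to_end_of_ptable(&pvmw);
  }
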
> 
> Concerns
> --------
> 
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that.  But does this ever happen?  I think
> we always unmap the entire folio, like in try_to_unmap_one().

Not sure about file THP, but for anon ... it's very common to partially
MADV_DONTNEED an anon THP, or to end up with a wild mixture of two (or more)
anon THP fragments after fork() when COW'ing the PTE-mapped THP ...
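
For instance, something as simple as the following userspace pattern leaves
half of an anon THP mapped and half unmapped (illustrative only, and whether
a THP is actually used depends on alignment and THP settings):

  #include <string.h>
  #include <sys/mman.h>

  #define SZ_2M   (2UL * 1024 * 1024)

  int main(void)
  {
          char *p = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          memset(p, 0, SZ_2M);            /* may fault in a 2 MiB THP */
          madvise(p + SZ_2M / 2, SZ_2M / 2, MADV_DONTNEED);
          /* First half of the THP stays mapped, second half is gone. */
          return 0;
  }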

> 
> I haven't got my head around SetPageAnonExclusive() yet.  I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.

I tried hard (very hard!) to make that work, but reality caught up. The
history of why that handling is required goes back to the old days: we went
from per-subpage refcounts, to per-subpage mapcounts, to now only a single
bit to get COW handling right.

There are very (very!) ugly corner cases of partial mremap, partial
MADV_WILLNEED ... some are included in the cow selftest for that reason.

One bit per subpage is certainly "not perfect", but it's not the end of the
world for now: 512 bits / 8 -> 64 bytes for a 2 MiB folio ... For now I would
focus on the mapcount ... that will be a challenge on its own and a bigger
improvement :P
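
Back-of-the-envelope, for an order-9 folio with 4 KiB base pages:

  unsigned int order = 9;                     /* 2 MiB folio */
  unsigned long subpages = 1UL << order;      /* 512 subpages */
  unsigned long bitmap_bytes = subpages / 8;  /* 512 bits -> 64 bytes */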


-- 
Thanks,

David / dhildenb


