Re: folio mapcount - David Hildenbrand

From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>, Jason Gunthorpe <jgg@ziepe.ca>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	linux-mm@kvack.org, Hugh Dickins <hughd@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: folio mapcount
Date: Thu, 16 Dec 2021 17:45:04 +0100
Message-ID: <54338c9c-0985-04a8-5d96-8dd3b15f5709@redhat.com>
In-Reply-To: <YbthPNtXs1kyls10@casper.infradead.org>

On 16.12.21 16:54, Matthew Wilcox wrote:
> On Thu, Dec 16, 2021 at 11:19:17AM -0400, Jason Gunthorpe wrote:
>> On Thu, Dec 16, 2021 at 01:56:57PM +0000, Matthew Wilcox wrote:
>>> p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
>>> mprotect(p, 4KB, PROT_READ): THP split.
>>>
>>> And in that case, I would say the THP now has mapcount of 2 because
>>> there are 2 VMAs mapping it.
>>
>> At least today mapcount is only loosely connected to VMAs. It really
>> counts the number of PUD/PTEs that point at parts of the memory. 
> 
> Careful.  Currently, you need to distinguish between total_mapcount(),
> page_trans_huge_mapcount() and page_mapcount().  Take a look at
> __page_mapcount() to be sure you really know what the mapcount "really"
> counts today ...

Yes, and the documentation above page_trans_huge_mapcount() tries to
bring some clarity. Tries :)

> 
> (also I'm going to assume that when you said PUD you really mean
> PMD throughout)
> 
>> If, under the PTL, you observe a mapcount of 1 then you know that the
>> PUD/PTE you have under lock is the ONLY PUD/PTE that refers to this
>> page and will remain so while the lock is held.
>>
>> So, today the above ends up with a mapcount of 1 and when we take a
>> COW fault we can re-use the page.
>>
>> If the above ends up with a mapcount of 2 then COW will copy not
>> re-use, which will cause unexpected data corruption in all those
>> annoying side cases.
> 
> As I understood David's presentation yesterday, we actually have
> data corruption issues in all the annoying side cases with THPs
> in current upstream, so that's no worse than we have now.  But let's
> see if we can avoid them.

Right, because the refcount is even more shaky ...

> 
> It feels like what we want from a COW perspective is a count of the
> number of MMs mapping a page, not the number of VMAs, PTEs or PMDs
> mapping the page.  Right?
> 
> So here's a corner case ...
> 
> p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
> mremap(p + 128K, 128K, 128K, MREMAP_MAYMOVE | MREMAP_FIXED, p + 2MB):
> PMD split
> 

(busy preparing and testing related patches, so I only skimmed over the
discussion)

Whenever we have to go through an internal munmap (mmap, munmap,
mremap), we split the PMD and map the remainder using PTEs. We
place the huge page on the deferred split queue, where the actual
compound page will get split later ("THP split").

In move_page_tables() we perform the split_huge_pmd() as well, which
would trigger in your example I think.

For anon pages, IIRC, there is no way to get more than one mapping per
process for a single base page. "sharing" as in "shared anonymous pages"
only applies between processes, not VMAs.

One anon base page can only be mapped once into a process ever. An anon
base page can be mapped shared into multiple processes.

"The function returns the highest mapcount any one of the subpages
has. If the return value is one, even if different processes are
mapping different subpages of the transparent hugepage, they can all
reuse it, because each process is reusing a different subpage."

So if you see "at least one subpage is mapped by more than one
process" and the page is anon shared, you have to split the PMD and
trigger unsharing for exactly that subpage.
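
To make the counting rule quoted above concrete, here is a minimal
userspace sketch. It is a toy model, not the kernel implementation:
struct toy_thp and all toy_* helpers are hypothetical names, and the
model assumes a 2MB THP made of 512 4KB base pages. It only illustrates
"highest mapcount any one subpage has", that a PMD split leaves that
value unchanged, and how one would find the subpage that needs
unsharing.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for illustration: a 2MB THP made of 4KB base pages. */
#define TOY_NR_SUBPAGES 512

/* Hypothetical userspace model; this is not the kernel's struct page. */
struct toy_thp {
	int pmd_mappings;                  /* PMDs mapping the whole THP */
	int pte_mappings[TOY_NR_SUBPAGES]; /* PTEs mapping each subpage */
};

/* Highest mapcount any one subpage has; a PMD mapping counts for every subpage. */
static int toy_trans_huge_mapcount(const struct toy_thp *thp)
{
	int max = 0;

	assert(thp != NULL);
	for (size_t i = 0; i < TOY_NR_SUBPAGES; i++) {
		int count = thp->pmd_mappings + thp->pte_mappings[i];

		if (count > max)
			max = count;
	}
	return max;
}

/* Model a PMD split: one PMD mapping becomes one PTE mapping per subpage. */
static void toy_split_pmd(struct toy_thp *thp)
{
	assert(thp != NULL && thp->pmd_mappings > 0);
	thp->pmd_mappings--;
	for (size_t i = 0; i < TOY_NR_SUBPAGES; i++)
		thp->pte_mappings[i]++;
}

/* Index of the first subpage mapped more than once, or -1 if there is none. */
static int toy_first_shared_subpage(const struct toy_thp *thp)
{
	assert(thp != NULL);
	for (size_t i = 0; i < TOY_NR_SUBPAGES; i++)
		if (thp->pmd_mappings + thp->pte_mappings[i] > 1)
			return (int)i;
	return -1;
}
```

In this model, a THP mapped by a single PMD reports 1 both before and
after toy_split_pmd(). Only an additional PTE mapping of some subpage
(for example from a second process) raises the value, and
toy_first_shared_subpage() then names the subpage to unshare.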

But it is indeed confusing ...

> Should mapcount be 1 or 2 at this point?  Does the answer change if it's

The PMD was split. Each subpage is mapped exactly once.

page_trans_huge_mapcount() is supposed to return 1 because there is no
sharing.

(Famous last words)

-- 
Thanks,

David / dhildenb
