linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Chris Mason" <clm@fb.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: <linux-fsdevel@vger.kernel.org>, <linux-mm@kvack.org>,
	Guoqing Jiang <guoqing.jiang@cloud.ionos.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	David Howells <dhowells@redhat.com>
Subject: Re: PagePrivate handling
Date: Wed, 14 Oct 2020 12:28:06 -0400	[thread overview]
Message-ID: <1EFC2FB8-954F-490F-BE8A-D216ADA1C5E9@fb.com> (raw)
In-Reply-To: <20201014153836.GM20115@casper.infradead.org>

On 14 Oct 2020, at 11:38, Matthew Wilcox wrote:

> On Wed, Oct 14, 2020 at 10:50:51AM -0400, Chris Mason wrote:
>> On 14 Oct 2020, at 9:49, Matthew Wilcox wrote:
>>> Our handling of PagePrivate, page->private and PagePrivate2 is a 
>>> giant
>>> mess.  Let's recap.
>>>
>>> Filesystems which use bufferheads (ie most of them) set 
>>> page->private
>>> to point to a buffer_head, set the PagePrivate bit and increment the
>>> refcount on the page.
>>>
>>> The vmscan pageout code (*) needs to know whether a page is 
>>> freeable:
>>>         if (!is_page_cache_freeable(page))
>>>                 return PAGE_KEEP;
>>> ... where is_page_cache_freeable() contains ...
>>>         return page_count(page) - page_has_private(page) == 1 +
>>> page_cache_pins;
>>>
>>> That's a little inscrutable, but the important thing is that if
>>> page_has_private() is true, then the page's reference count is 
>>> supposed
>>> to be one higher than it would be otherwise.  And that makes sense 
>>> given
>>> how "having bufferheads" means "page refcount ges incremented".
>>>
>>> But page_has_private() doesn't actually mean "PagePrivate is set".
>>> It means "PagePrivate or PagePrivate2 is set".  And I don't 
>>> understand
>>> how filesystems are supposed to keep that straight -- if we're 
>>> setting
>>> PagePrivate2, and PagePrivate is clear, increment the refcount?
>>> If we're clearing PagePrivate, decrement the refcount if 
>>> PagePrivate2
>>> is also clear?
>>
>> At least for btrfs, only PagePrivate elevates the refcount on the 
>> page.
>> PagePrivate2 means:
>>
>> This page has been properly setup for COW’d IO, and it went through 
>> the
>> normal path of page_mkwrite() or file_write() instead of being 
>> silently
>> dirtied by a deep corner of the MM.
>
> What's not clear to me is whether btrfs can be in the situation where
> PagePrivate2 is set and PagePrivate is clear.  If so, then we have a 
> bug
> to fix.

I don’t think it’ll happen.  Everyone in btrfs setting PagePrivate2 
seems to have already called attach_page_private().  It’s possible I 
missed a corner but I don’t think so.

>
>>> We introduced attach_page_private() and detach_page_private() 
>>> earlier
>>> this year to help filesystems get the refcount right.  But we still
>>> have a few filesystems using PagePrivate themselves (afs, btrfs, 
>>> ceph,
>>> crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not 
>>> convinced
>>> they're all getting it right.
>>>
>>> Here's a bug I happened on while looking into this:
>>>
>>>         if (page_has_private(page))
>>>                 attach_page_private(newpage, 
>>> detach_page_private(page));
>>>
>>>         if (PagePrivate2(page)) {
>>>                 ClearPagePrivate2(page);
>>>                 SetPagePrivate2(newpage);
>>>         }
>>>
>>> The aggravating thing is that this doesn't even look like a bug.
>>> You have to be in the kind of mood where you're thinking "What if 
>>> page
>>> has Private2 set and Private clear?" and the answer is that newpage
>>> ends up with PagePrivate set, but page->private set to NULL.
>>
>> Do you mean PagePrivate2 set but page->private NULL?
>
> Sorry, I skipped a step of the explanation.
>
> page_has_private returns true if Private or Private2 is set.  So if
> page has PagePrivate clear and PagePrivate2 set, newpage will end up
> with both PagePrivate and PagePrivate2 set -- attach_page_private()
> doesn't check whether the pointer is NULL (and IMO, it shouldn't).
>

I see what you mean, given how often we end up calling 
btrfs_migratepage() in the fleet, I’m willing to say this is probably 
fine.  But it’s easy enough to toss in a warning.

> Given our current macros, what was _meant_ here was:
>
>          if (PagePrivate(page))
>                  attach_page_private(newpage, 
> detach_page_private(page));
>
> but that's not obviously right.
>
>> Btrfs should only hage
>> PagePrivate2 set on pages that are formally in our writeback state 
>> machine,
>> so it’ll get cleared as we unwind through normal IO or truncate 
>> etc.  For
>> data pages, btrfs page->private is simply set to 1 so the MM will 
>> kindly
>> call releasepage for us.
>
> That's not what I'm seeing here:
>
> static void attach_extent_buffer_page(struct extent_buffer *eb,
>                                       struct page *page)
> {
>         if (!PagePrivate(page))
>                 attach_page_private(page, eb);
>         else
>                 WARN_ON(page->private != (unsigned long)eb);
> }
>
> Or is that not a data page?

Correct, extent_buffers are metadata only.  They wouldn’t be Private2.

>
>>> So what shold we do about all this?  First, I want to make the code
>>> snippet above correct, because it looks right.  So 
>>> page_has_private()
>>> needs to test just PagePrivate and not PagePrivate2.  Now we need a
>>> new function to call to determine whether the filesystem needs its
>>> invalidatepage callback invoked.  Not sure what that should be 
>>> named.
>>
>> I haven’t checked all the page_has_private() callers, but maybe
>> page_has_private() should stay the same and add page_private_count() 
>> for
>> times where we need to get out our fingers and toes for the refcount 
>> math.
>
> I was thinking about page_expected_count() which returns the number of
> references from the page cache plus the number of references from
> the various page privates.  So is_page_cache_freeable() becomes:
>
> 	return page_count(page) == page_expected_count(page) + 1;
>
> can_split_huge_page() becomes:
>
> 	if (page_has_private(page))
> 		return false;
> 	return page_count(page) == page_expected_count(page) +
> 			total_mapcount(page) + 1;
>
>>> I think I also want to rename PG_private_2 to PG_owner_priv_2.
>>> There's a clear relationship between PG_private and page->private.
>>> There is no relationship between PG_private_2 and page->private, so 
>>> it's
>>> a misleading name.  Or maybe it should just be PG_fscache and btrfs 
>>> can
>>> find some other way to mark the pages?
>>
>> Btrfs should be able to flip bits in page->private to cover our 
>> current
>> usage of PG_private_2.  If we ever attach something real to 
>> page->private,
>> we can flip bits in that instead.  It’s kinda messy though and 
>> we’d have to
>> change attach_page_private a little to reflect its new life as a bit 
>> setting
>> machine.
>
> It's not great, but with David wanting to change how PageFsCache is 
> used,
> it may be unavoidable (I'm not sure if he's discussed that with you 
> yet)
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=6f10fd7766ed6d87c3f696bb7931281557b389f5 
> shows part of it
> -- essentially he wants to make PagePrivate2 mean that I/O is 
> currently
> ongoing to an fscache, and so truncate needs to wait on it being 
> finished.
>

It’s fine-ish for btrfs since we’re setting Private2 only on pages 
that are either writeback or will be writeback really soon.  But, I’d 
prefer this end up in a callback so that we don’t have magic meaning 
for magic flags in magic places.

-chris


  parent reply	other threads:[~2020-10-14 16:28 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-14 13:49 Matthew Wilcox
2020-10-14 14:50 ` Chris Mason
2020-10-14 15:38   ` Matthew Wilcox
2020-10-14 16:01     ` Gao Xiang
2020-10-14 16:28     ` Chris Mason [this message]
2020-10-14 16:34     ` Yang Shi
2020-10-14 16:05   ` David Howells

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1EFC2FB8-954F-490F-BE8A-D216ADA1C5E9@fb.com \
    --to=clm@fb.com \
    --cc=dhowells@redhat.com \
    --cc=guoqing.jiang@cloud.ionos.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox