linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Chris Mason" <clm@fb.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: <linux-fsdevel@vger.kernel.org>, <linux-mm@kvack.org>,
	Guoqing Jiang <guoqing.jiang@cloud.ionos.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	David Howells <dhowells@redhat.com>
Subject: Re: PagePrivate handling
Date: Wed, 14 Oct 2020 10:50:51 -0400	[thread overview]
Message-ID: <B60A55DB-6AB7-48BF-8F11-68FF6FF46C4E@fb.com> (raw)
In-Reply-To: <20201014134909.GL20115@casper.infradead.org>

On 14 Oct 2020, at 9:49, Matthew Wilcox wrote:

> Our handling of PagePrivate, page->private and PagePrivate2 is a giant
> mess.  Let's recap.
>
> Filesystems which use bufferheads (ie most of them) set page->private
> to point to a buffer_head, set the PagePrivate bit and increment the
> refcount on the page.
>
> The vmscan pageout code (*) needs to know whether a page is freeable:
>         if (!is_page_cache_freeable(page))
>                 return PAGE_KEEP;
> ... where is_page_cache_freeable() contains ...
>         return page_count(page) - page_has_private(page) == 1 + 
> page_cache_pins;
>
> That's a little inscrutable, but the important thing is that if
> page_has_private() is true, then the page's reference count is 
> supposed
> to be one higher than it would be otherwise.  And that makes sense 
> given
> how "having bufferheads" means "page refcount ges incremented".
>
> But page_has_private() doesn't actually mean "PagePrivate is set".
> It means "PagePrivate or PagePrivate2 is set".  And I don't understand
> how filesystems are supposed to keep that straight -- if we're setting
> PagePrivate2, and PagePrivate is clear, increment the refcount?
> If we're clearing PagePrivate, decrement the refcount if PagePrivate2
> is also clear?

At least for btrfs, only PagePrivate elevates the refcount on the page.  
PagePrivate2 means:

This page has been properly setup for COW’d IO, and it went through 
the normal path of page_mkwrite() or file_write() instead of being 
silently dirtied by a deep corner of the MM.

>
> We introduced attach_page_private() and detach_page_private() earlier
> this year to help filesystems get the refcount right.  But we still
> have a few filesystems using PagePrivate themselves (afs, btrfs, ceph,
> crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not convinced
> they're all getting it right.
>
> Here's a bug I happened on while looking into this:
>
>         if (page_has_private(page))
>                 attach_page_private(newpage, 
> detach_page_private(page));
>
>         if (PagePrivate2(page)) {
>                 ClearPagePrivate2(page);
>                 SetPagePrivate2(newpage);
>         }
>
> The aggravating thing is that this doesn't even look like a bug.
> You have to be in the kind of mood where you're thinking "What if page
> has Private2 set and Private clear?" and the answer is that newpage
> ends up with PagePrivate set, but page->private set to NULL.

Do you mean PagePrivate2 set but page->private NULL?  Btrfs should only 
hage PagePrivate2 set on pages that are formally in our writeback state 
machine, so it’ll get cleared as we unwind through normal IO or 
truncate etc.  For data pages, btrfs page->private is simply set to 1 so 
the MM will kindly call releasepage for us.

> And I
> don't know whether this is a situation that can ever happen with 
> btrfs,
> but we shouldn't have code like this lying around in the tree because
> it _looks_ right and somebody else might copy it.
>
> So what shold we do about all this?  First, I want to make the code
> snippet above correct, because it looks right.  So page_has_private()
> needs to test just PagePrivate and not PagePrivate2.  Now we need a
> new function to call to determine whether the filesystem needs its
> invalidatepage callback invoked.  Not sure what that should be named.
>

I haven’t checked all the page_has_private() callers, but maybe 
page_has_private() should stay the same and add page_private_count() for 
times where we need to get out our fingers and toes for the refcount 
math.

> I think I also want to rename PG_private_2 to PG_owner_priv_2.
> There's a clear relationship between PG_private and page->private.
> There is no relationship between PG_private_2 and page->private, so 
> it's
> a misleading name.  Or maybe it should just be PG_fscache and btrfs 
> can
> find some other way to mark the pages?

Btrfs should be able to flip bits in page->private to cover our current 
usage of PG_private_2.  If we ever attach something real to 
page->private, we can flip bits in that instead.  It’s kinda messy 
though and we’d have to change attach_page_private a little to reflect 
its new life as a bit setting machine.

>
> Also ... do we really need to increment the page refcount if we have
> PagePrivate set?  I'm not awfully familiar with the buffercache -- is
> it possible we end up in a situation where a buffer, perhaps under 
> I/O,
> has the last reference to a struct page?  It seems like that reference 
> is
> always put from drop_buffers() which is called from 
> try_to_free_buffers()
> which is always called by someone who has a reference to a struct page
> that they got from the pagecache.  So what is this reference count 
> for?

I’m not sure what we gain by avoiding the refcount bump?  Many 
filesystems use the pattern of: “put something in page->private, free 
that thing in releasepage.”  Without the refcount bump it feels like 
we’d have more magic to avoid freeing the page without leaking things 
in page->private.  I think the extra ref lets the FS crowd keep our 
noses out of the MM more often, so it seems like a net positive to me.

>
> (*) Also THP split and page migration.  But maybe you don't care about
> those things ... you definitely care about pageout!

-chris


  reply	other threads:[~2020-10-14 14:51 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-14 13:49 Matthew Wilcox
2020-10-14 14:50 ` Chris Mason [this message]
2020-10-14 15:38   ` Matthew Wilcox
2020-10-14 16:01     ` Gao Xiang
2020-10-14 16:28     ` Chris Mason
2020-10-14 16:34     ` Yang Shi
2020-10-14 16:05   ` David Howells

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=B60A55DB-6AB7-48BF-8F11-68FF6FF46C4E@fb.com \
    --to=clm@fb.com \
    --cc=dhowells@redhat.com \
    --cc=guoqing.jiang@cloud.ionos.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox