* PagePrivate handling
@ 2020-10-14 13:49 Matthew Wilcox
2020-10-14 14:50 ` Chris Mason
0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2020-10-14 13:49 UTC (permalink / raw)
To: linux-fsdevel, linux-mm; +Cc: Guoqing Jiang, Miaohe Lin, David Howells
Our handling of PagePrivate, page->private and PagePrivate2 is a giant
mess. Let's recap.
Filesystems which use bufferheads (ie most of them) set page->private
to point to a buffer_head, set the PagePrivate bit and increment the
refcount on the page.
The vmscan pageout code (*) needs to know whether a page is freeable:
if (!is_page_cache_freeable(page))
return PAGE_KEEP;
... where is_page_cache_freeable() contains ...
return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
That's a little inscrutable, but the important thing is that if
page_has_private() is true, then the page's reference count is supposed
to be one higher than it would be otherwise. And that makes sense given
how "having bufferheads" means "page refcount ges incremented".
But page_has_private() doesn't actually mean "PagePrivate is set".
It means "PagePrivate or PagePrivate2 is set". And I don't understand
how filesystems are supposed to keep that straight -- if we're setting
PagePrivate2, and PagePrivate is clear, increment the refcount?
If we're clearing PagePrivate, decrement the refcount if PagePrivate2
is also clear?
We introduced attach_page_private() and detach_page_private() earlier
this year to help filesystems get the refcount right. But we still
have a few filesystems using PagePrivate themselves (afs, btrfs, ceph,
crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not convinced
they're all getting it right.
Here's a bug I happened on while looking into this:
if (page_has_private(page))
attach_page_private(newpage, detach_page_private(page));
if (PagePrivate2(page)) {
ClearPagePrivate2(page);
SetPagePrivate2(newpage);
}
The aggravating thing is that this doesn't even look like a bug.
You have to be in the kind of mood where you're thinking "What if page
has Private2 set and Private clear?" and the answer is that newpage
ends up with PagePrivate set, but page->private set to NULL. And I
don't know whether this is a situation that can ever happen with btrfs,
but we shouldn't have code like this lying around in the tree because
it _looks_ right and somebody else might copy it.
So what shold we do about all this? First, I want to make the code
snippet above correct, because it looks right. So page_has_private()
needs to test just PagePrivate and not PagePrivate2. Now we need a
new function to call to determine whether the filesystem needs its
invalidatepage callback invoked. Not sure what that should be named.
I think I also want to rename PG_private_2 to PG_owner_priv_2.
There's a clear relationship between PG_private and page->private.
There is no relationship between PG_private_2 and page->private, so it's
a misleading name. Or maybe it should just be PG_fscache and btrfs can
find some other way to mark the pages?
Also ... do we really need to increment the page refcount if we have
PagePrivate set? I'm not awfully familiar with the buffercache -- is
it possible we end up in a situation where a buffer, perhaps under I/O,
has the last reference to a struct page? It seems like that reference is
always put from drop_buffers() which is called from try_to_free_buffers()
which is always called by someone who has a reference to a struct page
that they got from the pagecache. So what is this reference count for?
(*) Also THP split and page migration. But maybe you don't care about
those things ... you definitely care about pageout!
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: PagePrivate handling
2020-10-14 13:49 PagePrivate handling Matthew Wilcox
@ 2020-10-14 14:50 ` Chris Mason
2020-10-14 15:38 ` Matthew Wilcox
2020-10-14 16:05 ` David Howells
0 siblings, 2 replies; 7+ messages in thread
From: Chris Mason @ 2020-10-14 14:50 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-fsdevel, linux-mm, Guoqing Jiang, Miaohe Lin, David Howells
On 14 Oct 2020, at 9:49, Matthew Wilcox wrote:
> Our handling of PagePrivate, page->private and PagePrivate2 is a giant
> mess. Let's recap.
>
> Filesystems which use bufferheads (ie most of them) set page->private
> to point to a buffer_head, set the PagePrivate bit and increment the
> refcount on the page.
>
> The vmscan pageout code (*) needs to know whether a page is freeable:
> if (!is_page_cache_freeable(page))
> return PAGE_KEEP;
> ... where is_page_cache_freeable() contains ...
> return page_count(page) - page_has_private(page) == 1 +
> page_cache_pins;
>
> That's a little inscrutable, but the important thing is that if
> page_has_private() is true, then the page's reference count is
> supposed
> to be one higher than it would be otherwise. And that makes sense
> given
> how "having bufferheads" means "page refcount ges incremented".
>
> But page_has_private() doesn't actually mean "PagePrivate is set".
> It means "PagePrivate or PagePrivate2 is set". And I don't understand
> how filesystems are supposed to keep that straight -- if we're setting
> PagePrivate2, and PagePrivate is clear, increment the refcount?
> If we're clearing PagePrivate, decrement the refcount if PagePrivate2
> is also clear?
At least for btrfs, only PagePrivate elevates the refcount on the page.
PagePrivate2 means:
This page has been properly setup for COW’d IO, and it went through
the normal path of page_mkwrite() or file_write() instead of being
silently dirtied by a deep corner of the MM.
>
> We introduced attach_page_private() and detach_page_private() earlier
> this year to help filesystems get the refcount right. But we still
> have a few filesystems using PagePrivate themselves (afs, btrfs, ceph,
> crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not convinced
> they're all getting it right.
>
> Here's a bug I happened on while looking into this:
>
> if (page_has_private(page))
> attach_page_private(newpage,
> detach_page_private(page));
>
> if (PagePrivate2(page)) {
> ClearPagePrivate2(page);
> SetPagePrivate2(newpage);
> }
>
> The aggravating thing is that this doesn't even look like a bug.
> You have to be in the kind of mood where you're thinking "What if page
> has Private2 set and Private clear?" and the answer is that newpage
> ends up with PagePrivate set, but page->private set to NULL.
Do you mean PagePrivate2 set but page->private NULL? Btrfs should only
hage PagePrivate2 set on pages that are formally in our writeback state
machine, so it’ll get cleared as we unwind through normal IO or
truncate etc. For data pages, btrfs page->private is simply set to 1 so
the MM will kindly call releasepage for us.
> And I
> don't know whether this is a situation that can ever happen with
> btrfs,
> but we shouldn't have code like this lying around in the tree because
> it _looks_ right and somebody else might copy it.
>
> So what shold we do about all this? First, I want to make the code
> snippet above correct, because it looks right. So page_has_private()
> needs to test just PagePrivate and not PagePrivate2. Now we need a
> new function to call to determine whether the filesystem needs its
> invalidatepage callback invoked. Not sure what that should be named.
>
I haven’t checked all the page_has_private() callers, but maybe
page_has_private() should stay the same and add page_private_count() for
times where we need to get out our fingers and toes for the refcount
math.
> I think I also want to rename PG_private_2 to PG_owner_priv_2.
> There's a clear relationship between PG_private and page->private.
> There is no relationship between PG_private_2 and page->private, so
> it's
> a misleading name. Or maybe it should just be PG_fscache and btrfs
> can
> find some other way to mark the pages?
Btrfs should be able to flip bits in page->private to cover our current
usage of PG_private_2. If we ever attach something real to
page->private, we can flip bits in that instead. It’s kinda messy
though and we’d have to change attach_page_private a little to reflect
its new life as a bit setting machine.
>
> Also ... do we really need to increment the page refcount if we have
> PagePrivate set? I'm not awfully familiar with the buffercache -- is
> it possible we end up in a situation where a buffer, perhaps under
> I/O,
> has the last reference to a struct page? It seems like that reference
> is
> always put from drop_buffers() which is called from
> try_to_free_buffers()
> which is always called by someone who has a reference to a struct page
> that they got from the pagecache. So what is this reference count
> for?
I’m not sure what we gain by avoiding the refcount bump? Many
filesystems use the pattern of: “put something in page->private, free
that thing in releasepage.” Without the refcount bump it feels like
we’d have more magic to avoid freeing the page without leaking things
in page->private. I think the extra ref lets the FS crowd keep our
noses out of the MM more often, so it seems like a net positive to me.
>
> (*) Also THP split and page migration. But maybe you don't care about
> those things ... you definitely care about pageout!
-chris
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: PagePrivate handling
2020-10-14 14:50 ` Chris Mason
@ 2020-10-14 15:38 ` Matthew Wilcox
2020-10-14 16:01 ` Gao Xiang
` (2 more replies)
2020-10-14 16:05 ` David Howells
1 sibling, 3 replies; 7+ messages in thread
From: Matthew Wilcox @ 2020-10-14 15:38 UTC (permalink / raw)
To: Chris Mason
Cc: linux-fsdevel, linux-mm, Guoqing Jiang, Miaohe Lin, David Howells
On Wed, Oct 14, 2020 at 10:50:51AM -0400, Chris Mason wrote:
> On 14 Oct 2020, at 9:49, Matthew Wilcox wrote:
> > Our handling of PagePrivate, page->private and PagePrivate2 is a giant
> > mess. Let's recap.
> >
> > Filesystems which use bufferheads (ie most of them) set page->private
> > to point to a buffer_head, set the PagePrivate bit and increment the
> > refcount on the page.
> >
> > The vmscan pageout code (*) needs to know whether a page is freeable:
> > if (!is_page_cache_freeable(page))
> > return PAGE_KEEP;
> > ... where is_page_cache_freeable() contains ...
> > return page_count(page) - page_has_private(page) == 1 +
> > page_cache_pins;
> >
> > That's a little inscrutable, but the important thing is that if
> > page_has_private() is true, then the page's reference count is supposed
> > to be one higher than it would be otherwise. And that makes sense given
> > how "having bufferheads" means "page refcount ges incremented".
> >
> > But page_has_private() doesn't actually mean "PagePrivate is set".
> > It means "PagePrivate or PagePrivate2 is set". And I don't understand
> > how filesystems are supposed to keep that straight -- if we're setting
> > PagePrivate2, and PagePrivate is clear, increment the refcount?
> > If we're clearing PagePrivate, decrement the refcount if PagePrivate2
> > is also clear?
>
> At least for btrfs, only PagePrivate elevates the refcount on the page.
> PagePrivate2 means:
>
> This page has been properly setup for COW’d IO, and it went through the
> normal path of page_mkwrite() or file_write() instead of being silently
> dirtied by a deep corner of the MM.
What's not clear to me is whether btrfs can be in the situation where
PagePrivate2 is set and PagePrivate is clear. If so, then we have a bug
to fix.
> > We introduced attach_page_private() and detach_page_private() earlier
> > this year to help filesystems get the refcount right. But we still
> > have a few filesystems using PagePrivate themselves (afs, btrfs, ceph,
> > crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not convinced
> > they're all getting it right.
> >
> > Here's a bug I happened on while looking into this:
> >
> > if (page_has_private(page))
> > attach_page_private(newpage, detach_page_private(page));
> >
> > if (PagePrivate2(page)) {
> > ClearPagePrivate2(page);
> > SetPagePrivate2(newpage);
> > }
> >
> > The aggravating thing is that this doesn't even look like a bug.
> > You have to be in the kind of mood where you're thinking "What if page
> > has Private2 set and Private clear?" and the answer is that newpage
> > ends up with PagePrivate set, but page->private set to NULL.
>
> Do you mean PagePrivate2 set but page->private NULL?
Sorry, I skipped a step of the explanation.
page_has_private returns true if Private or Private2 is set. So if
page has PagePrivate clear and PagePrivate2 set, newpage will end up
with both PagePrivate and PagePrivate2 set -- attach_page_private()
doesn't check whether the pointer is NULL (and IMO, it shouldn't).
Given our current macros, what was _meant_ here was:
if (PagePrivate(page))
attach_page_private(newpage, detach_page_private(page));
but that's not obviously right.
> Btrfs should only hage
> PagePrivate2 set on pages that are formally in our writeback state machine,
> so it’ll get cleared as we unwind through normal IO or truncate etc. For
> data pages, btrfs page->private is simply set to 1 so the MM will kindly
> call releasepage for us.
That's not what I'm seeing here:
static void attach_extent_buffer_page(struct extent_buffer *eb,
struct page *page)
{
if (!PagePrivate(page))
attach_page_private(page, eb);
else
WARN_ON(page->private != (unsigned long)eb);
}
Or is that not a data page?
> > So what shold we do about all this? First, I want to make the code
> > snippet above correct, because it looks right. So page_has_private()
> > needs to test just PagePrivate and not PagePrivate2. Now we need a
> > new function to call to determine whether the filesystem needs its
> > invalidatepage callback invoked. Not sure what that should be named.
>
> I haven’t checked all the page_has_private() callers, but maybe
> page_has_private() should stay the same and add page_private_count() for
> times where we need to get out our fingers and toes for the refcount math.
I was thinking about page_expected_count() which returns the number of
references from the page cache plus the number of references from
the various page privates. So is_page_cache_freeable() becomes:
return page_count(page) == page_expected_count(page) + 1;
can_split_huge_page() becomes:
if (page_has_private(page))
return false;
return page_count(page) == page_expected_count(page) +
total_mapcount(page) + 1;
> > I think I also want to rename PG_private_2 to PG_owner_priv_2.
> > There's a clear relationship between PG_private and page->private.
> > There is no relationship between PG_private_2 and page->private, so it's
> > a misleading name. Or maybe it should just be PG_fscache and btrfs can
> > find some other way to mark the pages?
>
> Btrfs should be able to flip bits in page->private to cover our current
> usage of PG_private_2. If we ever attach something real to page->private,
> we can flip bits in that instead. It’s kinda messy though and we’d have to
> change attach_page_private a little to reflect its new life as a bit setting
> machine.
It's not great, but with David wanting to change how PageFsCache is used,
it may be unavoidable (I'm not sure if he's discussed that with you yet)
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=6f10fd7766ed6d87c3f696bb7931281557b389f5 shows part of it
-- essentially he wants to make PagePrivate2 mean that I/O is currently
ongoing to an fscache, and so truncate needs to wait on it being finished.
> >
> > Also ... do we really need to increment the page refcount if we have
> > PagePrivate set? I'm not awfully familiar with the buffercache -- is
> > it possible we end up in a situation where a buffer, perhaps under I/O,
> > has the last reference to a struct page? It seems like that reference
> > is
> > always put from drop_buffers() which is called from
> > try_to_free_buffers()
> > which is always called by someone who has a reference to a struct page
> > that they got from the pagecache. So what is this reference count for?
>
> I’m not sure what we gain by avoiding the refcount bump? Many filesystems
> use the pattern of: “put something in page->private, free that thing in
> releasepage.” Without the refcount bump it feels like we’d have more magic
> to avoid freeing the page without leaking things in page->private. I think
> the extra ref lets the FS crowd keep our noses out of the MM more often, so
> it seems like a net positive to me.
The question is whether the "thing" in page->private can ever have the
last reference on a struct page. Gao says erofs can be in that situation,
so never mind this change.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: PagePrivate handling
2020-10-14 15:38 ` Matthew Wilcox
@ 2020-10-14 16:01 ` Gao Xiang
2020-10-14 16:28 ` Chris Mason
2020-10-14 16:34 ` Yang Shi
2 siblings, 0 replies; 7+ messages in thread
From: Gao Xiang @ 2020-10-14 16:01 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Chris Mason, linux-fsdevel, linux-mm, Guoqing Jiang, Miaohe Lin,
David Howells
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8\"", Size: 2355 bytes --]
On Wed, Oct 14, 2020 at 04:38:36PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 14, 2020 at 10:50:51AM -0400, Chris Mason wrote:
> > On 14 Oct 2020, at 9:49, Matthew Wilcox wrote:
...
> > >
> > > Also ... do we really need to increment the page refcount if we have
> > > PagePrivate set? I'm not awfully familiar with the buffercache -- is
> > > it possible we end up in a situation where a buffer, perhaps under I/O,
> > > has the last reference to a struct page? It seems like that reference
> > > is
> > > always put from drop_buffers() which is called from
> > > try_to_free_buffers()
> > > which is always called by someone who has a reference to a struct page
> > > that they got from the pagecache. So what is this reference count for?
> >
> > I’m not sure what we gain by avoiding the refcount bump? Many filesystems
> > use the pattern of: “put something in page->private, free that thing in
> > releasepage.†Without the refcount bump it feels like we’d have more magic
> > to avoid freeing the page without leaking things in page->private. I think
> > the extra ref lets the FS crowd keep our noses out of the MM more often, so
> > it seems like a net positive to me.
>
> The question is whether the "thing" in page->private can ever have the
> last reference on a struct page. Gao says erofs can be in that situation,
> so never mind this change.
Add some words, just my thought... we have a management structure which could
store PagePrivate page cache pages, !PagePrivate page cache pages, and non-page
cache pages which are directly from buddy system.
and I knew the extra refcount rule for PagePrivate from the beginning (since
the rule is quite stable for many many years (I remembered from 200x introduced
by akpm?) so I designed the whole workflow to handle these different types of
pages in the management structure based on this rule and to make reclaim &
migrate work for all page cache pages properly.). I think many modules think
the rule is stable as well ... anyway, I think there's always be another way
to handle the same thing if the refcount rule is changed, yet I need to
revisit all current logic and do proper changes. And I think many modules
(including out-of-tree modules) could be impacted as well... anyway...
Thanks,
Gao Xiang
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: PagePrivate handling
2020-10-14 14:50 ` Chris Mason
2020-10-14 15:38 ` Matthew Wilcox
@ 2020-10-14 16:05 ` David Howells
1 sibling, 0 replies; 7+ messages in thread
From: David Howells @ 2020-10-14 16:05 UTC (permalink / raw)
To: Matthew Wilcox
Cc: dhowells, Chris Mason, linux-fsdevel, linux-mm, Guoqing Jiang,
Miaohe Lin
Matthew Wilcox <willy@infradead.org> wrote:
> It's not great, but with David wanting to change how PageFsCache is used,
> it may be unavoidable (I'm not sure if he's discussed that with you yet)
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=6f10fd7766ed6d87c3f696bb7931281557b389f5 shows part of it
> -- essentially he wants to make PagePrivate2 mean that I/O is currently
> ongoing to an fscache, and so truncate needs to wait on it being finished.
->invalidatepage() and ->releasepage() had to wait anyway.
PG_fscache used to mean that the cache might have some knowledge of the page
and it might have I/O in progress on it - entirely as and when the cache felt
like doing it.
Now it just means that there's write I/O in progress on it at the netfs's
behest (though it might be issued by a cache helper). The main part of the
cache doesn't know about the page and doesn't care about the page flag, it
just sees an iov_iter.
In a sense, it's now a second PG_writeback flag. A page can be being written
to two locations at the same time (the server and the cache). Each write is
independent, though they may be started from the same place, and one of the
writes may be being handled by another filesystem entirely (the cache may have
started a DIO write to ext4, for example).
It's not needed for reading, since the PG_locked flag entirely suffices for
that. The read helper doesn't do parallel reads from the server and the cache
to the same chunk of pagecache.
Willy has asked that I either make PG_writeback cover both cases or that
PG_fscache can only be set if PG_writeback is also set. However, both of
these require extra state to be attached to page->private to be able to
coordinate this as there may be (as mentioned above) two parallel writes from
the same data that may finish at different times.
David
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: PagePrivate handling
2020-10-14 15:38 ` Matthew Wilcox
2020-10-14 16:01 ` Gao Xiang
@ 2020-10-14 16:28 ` Chris Mason
2020-10-14 16:34 ` Yang Shi
2 siblings, 0 replies; 7+ messages in thread
From: Chris Mason @ 2020-10-14 16:28 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-fsdevel, linux-mm, Guoqing Jiang, Miaohe Lin, David Howells
On 14 Oct 2020, at 11:38, Matthew Wilcox wrote:
> On Wed, Oct 14, 2020 at 10:50:51AM -0400, Chris Mason wrote:
>> On 14 Oct 2020, at 9:49, Matthew Wilcox wrote:
>>> Our handling of PagePrivate, page->private and PagePrivate2 is a
>>> giant
>>> mess. Let's recap.
>>>
>>> Filesystems which use bufferheads (ie most of them) set
>>> page->private
>>> to point to a buffer_head, set the PagePrivate bit and increment the
>>> refcount on the page.
>>>
>>> The vmscan pageout code (*) needs to know whether a page is
>>> freeable:
>>> if (!is_page_cache_freeable(page))
>>> return PAGE_KEEP;
>>> ... where is_page_cache_freeable() contains ...
>>> return page_count(page) - page_has_private(page) == 1 +
>>> page_cache_pins;
>>>
>>> That's a little inscrutable, but the important thing is that if
>>> page_has_private() is true, then the page's reference count is
>>> supposed
>>> to be one higher than it would be otherwise. And that makes sense
>>> given
>>> how "having bufferheads" means "page refcount ges incremented".
>>>
>>> But page_has_private() doesn't actually mean "PagePrivate is set".
>>> It means "PagePrivate or PagePrivate2 is set". And I don't
>>> understand
>>> how filesystems are supposed to keep that straight -- if we're
>>> setting
>>> PagePrivate2, and PagePrivate is clear, increment the refcount?
>>> If we're clearing PagePrivate, decrement the refcount if
>>> PagePrivate2
>>> is also clear?
>>
>> At least for btrfs, only PagePrivate elevates the refcount on the
>> page.
>> PagePrivate2 means:
>>
>> This page has been properly setup for COW’d IO, and it went through
>> the
>> normal path of page_mkwrite() or file_write() instead of being
>> silently
>> dirtied by a deep corner of the MM.
>
> What's not clear to me is whether btrfs can be in the situation where
> PagePrivate2 is set and PagePrivate is clear. If so, then we have a
> bug
> to fix.
I don’t think it’ll happen. Everyone in btrfs setting PagePrivate2
seems to have already called attach_page_private(). It’s possible I
missed a corner but I don’t think so.
>
>>> We introduced attach_page_private() and detach_page_private()
>>> earlier
>>> this year to help filesystems get the refcount right. But we still
>>> have a few filesystems using PagePrivate themselves (afs, btrfs,
>>> ceph,
>>> crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not
>>> convinced
>>> they're all getting it right.
>>>
>>> Here's a bug I happened on while looking into this:
>>>
>>> if (page_has_private(page))
>>> attach_page_private(newpage,
>>> detach_page_private(page));
>>>
>>> if (PagePrivate2(page)) {
>>> ClearPagePrivate2(page);
>>> SetPagePrivate2(newpage);
>>> }
>>>
>>> The aggravating thing is that this doesn't even look like a bug.
>>> You have to be in the kind of mood where you're thinking "What if
>>> page
>>> has Private2 set and Private clear?" and the answer is that newpage
>>> ends up with PagePrivate set, but page->private set to NULL.
>>
>> Do you mean PagePrivate2 set but page->private NULL?
>
> Sorry, I skipped a step of the explanation.
>
> page_has_private returns true if Private or Private2 is set. So if
> page has PagePrivate clear and PagePrivate2 set, newpage will end up
> with both PagePrivate and PagePrivate2 set -- attach_page_private()
> doesn't check whether the pointer is NULL (and IMO, it shouldn't).
>
I see what you mean, given how often we end up calling
btrfs_migratepage() in the fleet, I’m willing to say this is probably
fine. But it’s easy enough to toss in a warning.
> Given our current macros, what was _meant_ here was:
>
> if (PagePrivate(page))
> attach_page_private(newpage,
> detach_page_private(page));
>
> but that's not obviously right.
>
>> Btrfs should only hage
>> PagePrivate2 set on pages that are formally in our writeback state
>> machine,
>> so it’ll get cleared as we unwind through normal IO or truncate
>> etc. For
>> data pages, btrfs page->private is simply set to 1 so the MM will
>> kindly
>> call releasepage for us.
>
> That's not what I'm seeing here:
>
> static void attach_extent_buffer_page(struct extent_buffer *eb,
> struct page *page)
> {
> if (!PagePrivate(page))
> attach_page_private(page, eb);
> else
> WARN_ON(page->private != (unsigned long)eb);
> }
>
> Or is that not a data page?
Correct, extent_buffers are metadata only. They wouldn’t be Private2.
>
>>> So what shold we do about all this? First, I want to make the code
>>> snippet above correct, because it looks right. So
>>> page_has_private()
>>> needs to test just PagePrivate and not PagePrivate2. Now we need a
>>> new function to call to determine whether the filesystem needs its
>>> invalidatepage callback invoked. Not sure what that should be
>>> named.
>>
>> I haven’t checked all the page_has_private() callers, but maybe
>> page_has_private() should stay the same and add page_private_count()
>> for
>> times where we need to get out our fingers and toes for the refcount
>> math.
>
> I was thinking about page_expected_count() which returns the number of
> references from the page cache plus the number of references from
> the various page privates. So is_page_cache_freeable() becomes:
>
> return page_count(page) == page_expected_count(page) + 1;
>
> can_split_huge_page() becomes:
>
> if (page_has_private(page))
> return false;
> return page_count(page) == page_expected_count(page) +
> total_mapcount(page) + 1;
>
>>> I think I also want to rename PG_private_2 to PG_owner_priv_2.
>>> There's a clear relationship between PG_private and page->private.
>>> There is no relationship between PG_private_2 and page->private, so
>>> it's
>>> a misleading name. Or maybe it should just be PG_fscache and btrfs
>>> can
>>> find some other way to mark the pages?
>>
>> Btrfs should be able to flip bits in page->private to cover our
>> current
>> usage of PG_private_2. If we ever attach something real to
>> page->private,
>> we can flip bits in that instead. It’s kinda messy though and
>> we’d have to
>> change attach_page_private a little to reflect its new life as a bit
>> setting
>> machine.
>
> It's not great, but with David wanting to change how PageFsCache is
> used,
> it may be unavoidable (I'm not sure if he's discussed that with you
> yet)
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=6f10fd7766ed6d87c3f696bb7931281557b389f5
> shows part of it
> -- essentially he wants to make PagePrivate2 mean that I/O is
> currently
> ongoing to an fscache, and so truncate needs to wait on it being
> finished.
>
It’s fine-ish for btrfs since we’re setting Private2 only on pages
that are either writeback or will be writeback really soon. But, I’d
prefer this end up in a callback so that we don’t have magic meaning
for magic flags in magic places.
-chris
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: PagePrivate handling
2020-10-14 15:38 ` Matthew Wilcox
2020-10-14 16:01 ` Gao Xiang
2020-10-14 16:28 ` Chris Mason
@ 2020-10-14 16:34 ` Yang Shi
2 siblings, 0 replies; 7+ messages in thread
From: Yang Shi @ 2020-10-14 16:34 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Chris Mason, Linux FS-devel Mailing List, Linux MM,
Guoqing Jiang, Miaohe Lin, David Howells
On Wed, Oct 14, 2020 at 8:38 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 14, 2020 at 10:50:51AM -0400, Chris Mason wrote:
> > On 14 Oct 2020, at 9:49, Matthew Wilcox wrote:
> > > Our handling of PagePrivate, page->private and PagePrivate2 is a giant
> > > mess. Let's recap.
> > >
> > > Filesystems which use bufferheads (ie most of them) set page->private
> > > to point to a buffer_head, set the PagePrivate bit and increment the
> > > refcount on the page.
> > >
> > > The vmscan pageout code (*) needs to know whether a page is freeable:
> > > if (!is_page_cache_freeable(page))
> > > return PAGE_KEEP;
> > > ... where is_page_cache_freeable() contains ...
> > > return page_count(page) - page_has_private(page) == 1 +
> > > page_cache_pins;
> > >
> > > That's a little inscrutable, but the important thing is that if
> > > page_has_private() is true, then the page's reference count is supposed
> > > to be one higher than it would be otherwise. And that makes sense given
> > > how "having bufferheads" means "page refcount ges incremented".
> > >
> > > But page_has_private() doesn't actually mean "PagePrivate is set".
> > > It means "PagePrivate or PagePrivate2 is set". And I don't understand
> > > how filesystems are supposed to keep that straight -- if we're setting
> > > PagePrivate2, and PagePrivate is clear, increment the refcount?
> > > If we're clearing PagePrivate, decrement the refcount if PagePrivate2
> > > is also clear?
> >
> > At least for btrfs, only PagePrivate elevates the refcount on the page.
> > PagePrivate2 means:
> >
> > This page has been properly setup for COW’d IO, and it went through the
> > normal path of page_mkwrite() or file_write() instead of being silently
> > dirtied by a deep corner of the MM.
>
> What's not clear to me is whether btrfs can be in the situation where
> PagePrivate2 is set and PagePrivate is clear. If so, then we have a bug
> to fix.
>
> > > We introduced attach_page_private() and detach_page_private() earlier
> > > this year to help filesystems get the refcount right. But we still
> > > have a few filesystems using PagePrivate themselves (afs, btrfs, ceph,
> > > crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not convinced
> > > they're all getting it right.
> > >
> > > Here's a bug I happened on while looking into this:
> > >
> > > if (page_has_private(page))
> > > attach_page_private(newpage, detach_page_private(page));
> > >
> > > if (PagePrivate2(page)) {
> > > ClearPagePrivate2(page);
> > > SetPagePrivate2(newpage);
> > > }
> > >
> > > The aggravating thing is that this doesn't even look like a bug.
> > > You have to be in the kind of mood where you're thinking "What if page
> > > has Private2 set and Private clear?" and the answer is that newpage
> > > ends up with PagePrivate set, but page->private set to NULL.
> >
> > Do you mean PagePrivate2 set but page->private NULL?
>
> Sorry, I skipped a step of the explanation.
>
> page_has_private returns true if Private or Private2 is set. So if
> page has PagePrivate clear and PagePrivate2 set, newpage will end up
> with both PagePrivate and PagePrivate2 set -- attach_page_private()
> doesn't check whether the pointer is NULL (and IMO, it shouldn't).
>
> Given our current macros, what was _meant_ here was:
>
> if (PagePrivate(page))
> attach_page_private(newpage, detach_page_private(page));
>
> but that's not obviously right.
>
> > Btrfs should only hage
> > PagePrivate2 set on pages that are formally in our writeback state machine,
> > so it’ll get cleared as we unwind through normal IO or truncate etc. For
> > data pages, btrfs page->private is simply set to 1 so the MM will kindly
> > call releasepage for us.
>
> That's not what I'm seeing here:
>
> static void attach_extent_buffer_page(struct extent_buffer *eb,
> struct page *page)
> {
> if (!PagePrivate(page))
> attach_page_private(page, eb);
> else
> WARN_ON(page->private != (unsigned long)eb);
> }
>
> Or is that not a data page?
>
> > > So what shold we do about all this? First, I want to make the code
> > > snippet above correct, because it looks right. So page_has_private()
> > > needs to test just PagePrivate and not PagePrivate2. Now we need a
> > > new function to call to determine whether the filesystem needs its
> > > invalidatepage callback invoked. Not sure what that should be named.
> >
> > I haven’t checked all the page_has_private() callers, but maybe
> > page_has_private() should stay the same and add page_private_count() for
> > times where we need to get out our fingers and toes for the refcount math.
>
> I was thinking about page_expected_count() which returns the number of
> references from the page cache plus the number of references from
> the various page privates. So is_page_cache_freeable() becomes:
>
> return page_count(page) == page_expected_count(page) + 1;
>
> can_split_huge_page() becomes:
>
> if (page_has_private(page))
> return false;
> return page_count(page) == page_expected_count(page) +
> total_mapcount(page) + 1;
>
> > > I think I also want to rename PG_private_2 to PG_owner_priv_2.
> > > There's a clear relationship between PG_private and page->private.
> > > There is no relationship between PG_private_2 and page->private, so it's
> > > a misleading name. Or maybe it should just be PG_fscache and btrfs can
> > > find some other way to mark the pages?
> >
> > Btrfs should be able to flip bits in page->private to cover our current
> > usage of PG_private_2. If we ever attach something real to page->private,
> > we can flip bits in that instead. It’s kinda messy though and we’d have to
> > change attach_page_private a little to reflect its new life as a bit setting
> > machine.
>
> It's not great, but with David wanting to change how PageFsCache is used,
> it may be unavoidable (I'm not sure if he's discussed that with you yet)
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=6f10fd7766ed6d87c3f696bb7931281557b389f5 shows part of it
> -- essentially he wants to make PagePrivate2 mean that I/O is currently
> ongoing to an fscache, and so truncate needs to wait on it being finished.
>
> > >
> > > Also ... do we really need to increment the page refcount if we have
> > > PagePrivate set? I'm not awfully familiar with the buffercache -- is
> > > it possible we end up in a situation where a buffer, perhaps under I/O,
> > > has the last reference to a struct page? It seems like that reference
> > > is
> > > always put from drop_buffers() which is called from
> > > try_to_free_buffers()
> > > which is always called by someone who has a reference to a struct page
> > > that they got from the pagecache. So what is this reference count for?
> >
> > I’m not sure what we gain by avoiding the refcount bump? Many filesystems
> > use the pattern of: “put something in page->private, free that thing in
> > releasepage.” Without the refcount bump it feels like we’d have more magic
> > to avoid freeing the page without leaking things in page->private. I think
> > the extra ref lets the FS crowd keep our noses out of the MM more often, so
> > it seems like a net positive to me.
>
> The question is whether the "thing" in page->private can ever have the
> last reference on a struct page. Gao says erofs can be in that situation,
> so never mind this change.
I recall when truncation is failed to remove private data, we may end
up having page->private as last reference. Anyway vmscan can handle
such case.
>
>
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2020-10-14 16:34 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-14 13:49 PagePrivate handling Matthew Wilcox
2020-10-14 14:50 ` Chris Mason
2020-10-14 15:38 ` Matthew Wilcox
2020-10-14 16:01 ` Gao Xiang
2020-10-14 16:28 ` Chris Mason
2020-10-14 16:34 ` Yang Shi
2020-10-14 16:05 ` David Howells
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox