[ISSUE] split_folio() and dirty IOMAP folios

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [ISSUE] split_folio() and dirty IOMAP folios
@ 2024-11-07 15:07 David Hildenbrand
  2024-11-07 16:09 ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-11-07 15:07 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, kvm
  Cc: Zi Yan, Matthew Wilcox, Christian Brauner, Darrick J. Wong,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Thomas Huth

Hi,

I'm debugging an interesting problem: split_folio() will fail on dirty 
folios on XFS, and I am not sure who will trigger the writeback in a 
timely manner so code relying on the split to work at some point (in 
sane setups where page pinning is not applicable) can make progress.

https://issues.redhat.com/browse/RHEL-58218

s390x PV ("Protected Virtualization" / "Secure virtualization") does not 
support large folios. So when we want to convert an individual 4k page 
to "secure" and we hit a large folio, we have to split it.

In gmap_make_secure(), we call split_folio() if we hit a large folio, 
and essentially retry forever (after dropping the folio reference).

Starting a "protected VM" (similar to encrypted VMs) will not make 
progress when trying to load the initial encrypted VM state into memory 
("unpack").

I assume other split_folio() users might similarly be affected: 
split_folio() will frequently just fail without any obvious way to "fix 
that up" to make progress.

Looking into the details, it seems to be an IOMAP limitation: 
split_folio() will keep failing in filemap_release_folio() because 
iomap_release_folio() fails on dirty folios. I would have expected 
background writeback to "fix that", but it's either not happening or 
because it's just happening too slowly.

I can see that migration code manually triggers writeback, using 
folio_clear_dirty_for_io() and mapping->a_ops->writepages) when it 
stumbles over a dirty folio.

Should we do the same in split_folio() directly? Or offer callers 
(gmap_make_secure()) a way to trigger this conditionally, similarly to 
how we have ways for waiting for a folio that is under writeback to finish?

... or is there a feasible way forward to make iomap_release_folio() not 
bail out on dirty folios?

The comment there says:

"If the folio is dirty, we refuse to release our metadata because it may 
be partially dirty.  Once we track per-block dirty state, we can release 
the metadata if every block is dirty."

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ISSUE] split_folio() and dirty IOMAP folios
  2024-11-07 15:07 [ISSUE] split_folio() and dirty IOMAP folios David Hildenbrand
@ 2024-11-07 16:09 ` Matthew Wilcox
  2024-11-07 16:34   ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2024-11-07 16:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-fsdevel, linux-mm, kvm, Zi Yan, Christian Brauner,
	Darrick J. Wong, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Thomas Huth

On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
> I'm debugging an interesting problem: split_folio() will fail on dirty
> folios on XFS, and I am not sure who will trigger the writeback in a timely
> manner so code relying on the split to work at some point (in sane setups
> where page pinning is not applicable) can make progress.

You could call something like filemap_write_and_wait_range()?

> ... or is there a feasible way forward to make iomap_release_folio() not
> bail out on dirty folios?
> 
> The comment there says:
> 
> "If the folio is dirty, we refuse to release our metadata because it may be
> partially dirty.  Once we track per-block dirty state, we can release the
> metadata if every block is dirty."

With the data structures and callbacks we have, it's hard to do.
Let's see if getting writeback kicked off will be enough to solve the
problem you're working on.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ISSUE] split_folio() and dirty IOMAP folios
  2024-11-07 16:09 ` Matthew Wilcox
@ 2024-11-07 16:34   ` David Hildenbrand
  2024-11-07 20:20     ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-11-07 16:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, kvm, Zi Yan, Christian Brauner,
	Darrick J. Wong, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Thomas Huth

On 07.11.24 17:09, Matthew Wilcox wrote:
> On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
>> I'm debugging an interesting problem: split_folio() will fail on dirty
>> folios on XFS, and I am not sure who will trigger the writeback in a timely
>> manner so code relying on the split to work at some point (in sane setups
>> where page pinning is not applicable) can make progress.
> 
> You could call something like filemap_write_and_wait_range()?

Thanks, have to look into some details of that.

Looks like the folio_clear_dirty_for_io() is buried in 
folio_prepare_writeback(), so that part is taken care of.

Guess I have to fo from folio to "mapping,lstart,lend" such that 
__filemap_fdatawrite_range() would look up the folio again. Sounds doable.

(I assume I have to drop the folio lock+reference before calling that)


It's a bit suboptimal that the split_folio() caller has to take care of 
that. But it's similar to waiting for writeback ... now I wonder if we 
should have a helper function that takes care of "simple" cases of -EBUSY.

> 
>> ... or is there a feasible way forward to make iomap_release_folio() not
>> bail out on dirty folios?
>>
>> The comment there says:
>>
>> "If the folio is dirty, we refuse to release our metadata because it may be
>> partially dirty.  Once we track per-block dirty state, we can release the
>> metadata if every block is dirty."
> 
> With the data structures and callbacks we have, it's hard to do.
> Let's see if getting writeback kicked off will be enough to solve the
> problem you're working on.

Let me find some time tomorrow/next week to play with this.

Thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ISSUE] split_folio() and dirty IOMAP folios
  2024-11-07 16:34   ` David Hildenbrand
@ 2024-11-07 20:20     ` Matthew Wilcox
  2024-11-08  9:11       ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2024-11-07 20:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-fsdevel, linux-mm, kvm, Zi Yan, Christian Brauner,
	Darrick J. Wong, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Thomas Huth

On Thu, Nov 07, 2024 at 05:34:40PM +0100, David Hildenbrand wrote:
> On 07.11.24 17:09, Matthew Wilcox wrote:
> > On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
> > > I'm debugging an interesting problem: split_folio() will fail on dirty
> > > folios on XFS, and I am not sure who will trigger the writeback in a timely
> > > manner so code relying on the split to work at some point (in sane setups
> > > where page pinning is not applicable) can make progress.
> > 
> > You could call something like filemap_write_and_wait_range()?
> 
> Thanks, have to look into some details of that.
> 
> Looks like the folio_clear_dirty_for_io() is buried in
> folio_prepare_writeback(), so that part is taken care of.
> 
> Guess I have to fo from folio to "mapping,lstart,lend" such that
> __filemap_fdatawrite_range() would look up the folio again. Sounds doable.
> 
> (I assume I have to drop the folio lock+reference before calling that)

I was thinking you'd do it higher in the callchain than
gmap_make_secure().  Presumably userspace says "I want to make this
256MB range secure" and we can start by writing back that entire
256MB chunk of address space.

That doesn't prevent anybody from dirtying it in-between, of course,
so you can still get -EBUSY and have to loop round again.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ISSUE] split_folio() and dirty IOMAP folios
  2024-11-07 20:20     ` Matthew Wilcox
@ 2024-11-08  9:11       ` David Hildenbrand
  2024-11-11 15:19         ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-11-08  9:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, kvm, Zi Yan, Christian Brauner,
	Darrick J. Wong, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Thomas Huth

On 07.11.24 21:20, Matthew Wilcox wrote:
> On Thu, Nov 07, 2024 at 05:34:40PM +0100, David Hildenbrand wrote:
>> On 07.11.24 17:09, Matthew Wilcox wrote:
>>> On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
>>>> I'm debugging an interesting problem: split_folio() will fail on dirty
>>>> folios on XFS, and I am not sure who will trigger the writeback in a timely
>>>> manner so code relying on the split to work at some point (in sane setups
>>>> where page pinning is not applicable) can make progress.
>>>
>>> You could call something like filemap_write_and_wait_range()?
>>
>> Thanks, have to look into some details of that.
>>
>> Looks like the folio_clear_dirty_for_io() is buried in
>> folio_prepare_writeback(), so that part is taken care of.
>>
>> Guess I have to fo from folio to "mapping,lstart,lend" such that
>> __filemap_fdatawrite_range() would look up the folio again. Sounds doable.
>>
>> (I assume I have to drop the folio lock+reference before calling that)
> 
> I was thinking you'd do it higher in the callchain than
> gmap_make_secure().  Presumably userspace says "I want to make this
> 256MB range secure" and we can start by writing back that entire
> 256MB chunk of address space.
> 
> That doesn't prevent anybody from dirtying it in-between, of course,
> so you can still get -EBUSY and have to loop round again.

I'm afraid that won't really work.

On the one hand, we might be allocating these pages (+disk blocks) 
during the unpack operation -- where we essentially trigger page faults 
first using gmap_fault() -- so the pages might not even exist before the 
gmap_make_secure() during unpack. One work around would be to 
preallocate+writeback from user space, but it doesn't sound quite right.

But the bigger problem I see is that the initial "unpack" operation is 
not the only case where we trigger this conversion to "secure" state. 
Once the VM is running, we can see calls on arbitrary guest memory even 
during page faults, when gmap_make_secure() is called via 
gmap_convert_to_secure().

I'm still not sure why we see essentially no progress being made, even 
though we temporarily drop the PTL, mmap lock, folio lock, folio ref ... 
maybe related to us triggering a write fault that somehow ends up 
setting the folio dirty :/ Or because writeback is simply too slow / 
backs off.

I'll play with handling -EBUSY from split_folio() differently: if the 
folio is under writeback, wait on that. If the folio is dirty, trigger 
writeback. And I'll look into whether we really need a writable PTE, I 
suspect not, because we are not actually "modifying" page content.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ISSUE] split_folio() and dirty IOMAP folios
  2024-11-08  9:11       ` David Hildenbrand
@ 2024-11-11 15:19         ` David Hildenbrand
  2024-11-21 12:15           ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-11-11 15:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, kvm, Zi Yan, Christian Brauner,
	Darrick J. Wong, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Thomas Huth

On 08.11.24 10:11, David Hildenbrand wrote:
> On 07.11.24 21:20, Matthew Wilcox wrote:
>> On Thu, Nov 07, 2024 at 05:34:40PM +0100, David Hildenbrand wrote:
>>> On 07.11.24 17:09, Matthew Wilcox wrote:
>>>> On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
>>>>> I'm debugging an interesting problem: split_folio() will fail on dirty
>>>>> folios on XFS, and I am not sure who will trigger the writeback in a timely
>>>>> manner so code relying on the split to work at some point (in sane setups
>>>>> where page pinning is not applicable) can make progress.
>>>>
>>>> You could call something like filemap_write_and_wait_range()?
>>>
>>> Thanks, have to look into some details of that.
>>>
>>> Looks like the folio_clear_dirty_for_io() is buried in
>>> folio_prepare_writeback(), so that part is taken care of.
>>>
>>> Guess I have to fo from folio to "mapping,lstart,lend" such that
>>> __filemap_fdatawrite_range() would look up the folio again. Sounds doable.
>>>
>>> (I assume I have to drop the folio lock+reference before calling that)
>>
>> I was thinking you'd do it higher in the callchain than
>> gmap_make_secure().  Presumably userspace says "I want to make this
>> 256MB range secure" and we can start by writing back that entire
>> 256MB chunk of address space.
>>
>> That doesn't prevent anybody from dirtying it in-between, of course,
>> so you can still get -EBUSY and have to loop round again.
> 
> I'm afraid that won't really work.
> 
> On the one hand, we might be allocating these pages (+disk blocks)
> during the unpack operation -- where we essentially trigger page faults
> first using gmap_fault() -- so the pages might not even exist before the
> gmap_make_secure() during unpack. One work around would be to
> preallocate+writeback from user space, but it doesn't sound quite right.
> 
> But the bigger problem I see is that the initial "unpack" operation is
> not the only case where we trigger this conversion to "secure" state.
> Once the VM is running, we can see calls on arbitrary guest memory even
> during page faults, when gmap_make_secure() is called via
> gmap_convert_to_secure().
> 
> 
> I'm still not sure why we see essentially no progress being made, even
> though we temporarily drop the PTL, mmap lock, folio lock, folio ref ...
> maybe related to us triggering a write fault that somehow ends up
> setting the folio dirty :/ Or because writeback is simply too slow /
> backs off.
> 
> I'll play with handling -EBUSY from split_folio() differently: if the
> folio is under writeback, wait on that. If the folio is dirty, trigger
> writeback. And I'll look into whether we really need a writable PTE, I
> suspect not, because we are not actually "modifying" page content.

The following hack makes it fly:

         case -E2BIG:
                 folio_lock(folio);
                 rc = split_folio(folio);
+               if (rc == -EBUSY) {
+                       if (folio_test_dirty(folio) && !folio_test_anon(folio) &&
+                           folio->mapping) {
+                               struct address_space *mapping = folio->mapping;
+                               loff_t lstart = folio_pos(folio);
+                               loff_t lend = lstart + folio_size(folio);
+
+                               folio_unlock(folio);
+                               /* Mapping can go away ... */
+                               filemap_write_and_wait_range(mapping, lstart, lend);
+                       } else {
+                               folio_unlock(folio);
+                       }
+                       folio_wait_writeback(folio);
+                       folio_lock(folio);
+                       split_folio(folio);
+                       folio_unlock(folio);
+                       folio_put(folio);
+                       return -EAGAIN;
+               }
                 folio_unlock(folio);
                 folio_put(folio);


I think the reason why we don't make any progress on s390x is that the writeback will
mark the folio clean and turn the folio read-only in the page tables as well. So when we
lookup the folio again in the page table, we see that the PTE is not writable and
trigger a write fault ...

... the write fault will mark the folio dirty again, so the split will never succeed.

In above diff, we really must try the split_folio() a second time after waiting, otherwise we
run into the same endless loop.


I'm still not 100% sure if we need a writable PTE; after all we are not modifying page content.
But that's just a side effect of not being able to wait for the split_folio() to make progress
in the writeback case so we can retry the split again.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ISSUE] split_folio() and dirty IOMAP folios
  2024-11-11 15:19         ` David Hildenbrand
@ 2024-11-21 12:15           ` David Hildenbrand
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2024-11-21 12:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, kvm, Zi Yan, Christian Brauner,
	Darrick J. Wong, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Thomas Huth

On 11.11.24 16:19, David Hildenbrand wrote:
> On 08.11.24 10:11, David Hildenbrand wrote:
>> On 07.11.24 21:20, Matthew Wilcox wrote:
>>> On Thu, Nov 07, 2024 at 05:34:40PM +0100, David Hildenbrand wrote:
>>>> On 07.11.24 17:09, Matthew Wilcox wrote:
>>>>> On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
>>>>>> I'm debugging an interesting problem: split_folio() will fail on dirty
>>>>>> folios on XFS, and I am not sure who will trigger the writeback in a timely
>>>>>> manner so code relying on the split to work at some point (in sane setups
>>>>>> where page pinning is not applicable) can make progress.
>>>>>
>>>>> You could call something like filemap_write_and_wait_range()?
>>>>
>>>> Thanks, have to look into some details of that.
>>>>
>>>> Looks like the folio_clear_dirty_for_io() is buried in
>>>> folio_prepare_writeback(), so that part is taken care of.
>>>>
>>>> Guess I have to fo from folio to "mapping,lstart,lend" such that
>>>> __filemap_fdatawrite_range() would look up the folio again. Sounds doable.
>>>>
>>>> (I assume I have to drop the folio lock+reference before calling that)
>>>
>>> I was thinking you'd do it higher in the callchain than
>>> gmap_make_secure().  Presumably userspace says "I want to make this
>>> 256MB range secure" and we can start by writing back that entire
>>> 256MB chunk of address space.
>>>
>>> That doesn't prevent anybody from dirtying it in-between, of course,
>>> so you can still get -EBUSY and have to loop round again.
>>
>> I'm afraid that won't really work.
>>
>> On the one hand, we might be allocating these pages (+disk blocks)
>> during the unpack operation -- where we essentially trigger page faults
>> first using gmap_fault() -- so the pages might not even exist before the
>> gmap_make_secure() during unpack. One work around would be to
>> preallocate+writeback from user space, but it doesn't sound quite right.
>>
>> But the bigger problem I see is that the initial "unpack" operation is
>> not the only case where we trigger this conversion to "secure" state.
>> Once the VM is running, we can see calls on arbitrary guest memory even
>> during page faults, when gmap_make_secure() is called via
>> gmap_convert_to_secure().
>>
>>
>> I'm still not sure why we see essentially no progress being made, even
>> though we temporarily drop the PTL, mmap lock, folio lock, folio ref ...
>> maybe related to us triggering a write fault that somehow ends up
>> setting the folio dirty :/ Or because writeback is simply too slow /
>> backs off.
>>
>> I'll play with handling -EBUSY from split_folio() differently: if the
>> folio is under writeback, wait on that. If the folio is dirty, trigger
>> writeback. And I'll look into whether we really need a writable PTE, I
>> suspect not, because we are not actually "modifying" page content.
> 
> The following hack makes it fly:
> 
>           case -E2BIG:
>                   folio_lock(folio);
>                   rc = split_folio(folio);
> +               if (rc == -EBUSY) {
> +                       if (folio_test_dirty(folio) && !folio_test_anon(folio) &&
> +                           folio->mapping) {
> +                               struct address_space *mapping = folio->mapping;
> +                               loff_t lstart = folio_pos(folio);
> +                               loff_t lend = lstart + folio_size(folio);
> +
> +                               folio_unlock(folio);
> +                               /* Mapping can go away ... */
> +                               filemap_write_and_wait_range(mapping, lstart, lend);
> +                       } else {
> +                               folio_unlock(folio);
> +                       }
> +                       folio_wait_writeback(folio);
> +                       folio_lock(folio);
> +                       split_folio(folio);
> +                       folio_unlock(folio);
> +                       folio_put(folio);
> +                       return -EAGAIN;
> +               }
>                   folio_unlock(folio);
>                   folio_put(folio);
> 
> 
> I think the reason why we don't make any progress on s390x is that the writeback will
> mark the folio clean and turn the folio read-only in the page tables as well. So when we
> lookup the folio again in the page table, we see that the PTE is not writable and
> trigger a write fault ...
> 
> ... the write fault will mark the folio dirty again, so the split will never succeed.
> 
> In above diff, we really must try the split_folio() a second time after waiting, otherwise we
> run into the same endless loop.
> 
> 
> I'm still not 100% sure if we need a writable PTE; after all we are not modifying page content.
> But that's just a side effect of not being able to wait for the split_folio() to make progress
> in the writeback case so we can retry the split again.

After discussing this with Darrick and Willy yesterday, I think the 
reason we need a writable PTE is because we *might* modify page content:

"Requests the Ultravisor to make a page accessible to a guest. If it's 
brought in the first time, it will be cleared. If it has been exported 
before, it will be decrypted and integrity checked."

So we'll be effectively modifying the page content we will read when the 
(now secure) page is in the unprotected/exported state.

That makes things more complicated, unfortunately :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-11-21 12:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-07 15:07 [ISSUE] split_folio() and dirty IOMAP folios David Hildenbrand
2024-11-07 16:09 ` Matthew Wilcox
2024-11-07 16:34   ` David Hildenbrand
2024-11-07 20:20     ` Matthew Wilcox
2024-11-08  9:11       ` David Hildenbrand
2024-11-11 15:19         ` David Hildenbrand
2024-11-21 12:15           ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox