From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
kvm@vger.kernel.org, Zi Yan <ziy@nvidia.com>,
Christian Brauner <brauner@kernel.org>,
"Darrick J. Wong" <djwong@kernel.org>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Janosch Frank <frankja@linux.ibm.com>,
Claudio Imbrenda <imbrenda@linux.ibm.com>,
Thomas Huth <thuth@redhat.com>
Subject: Re: [ISSUE] split_folio() and dirty IOMAP folios
Date: Mon, 11 Nov 2024 16:19:31 +0100
Message-ID: <d3600a33-a481-4c4c-bda6-a446f1c965c6@redhat.com>
In-Reply-To: <6099e202-ef0a-4d21-958c-2c42db43a5bb@redhat.com>
On 08.11.24 10:11, David Hildenbrand wrote:
> On 07.11.24 21:20, Matthew Wilcox wrote:
>> On Thu, Nov 07, 2024 at 05:34:40PM +0100, David Hildenbrand wrote:
>>> On 07.11.24 17:09, Matthew Wilcox wrote:
>>>> On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
>>>>> I'm debugging an interesting problem: split_folio() will fail on dirty
>>>>> folios on XFS, and I am not sure who will trigger the writeback in a timely
>>>>> manner so code relying on the split to work at some point (in sane setups
>>>>> where page pinning is not applicable) can make progress.
>>>>
>>>> You could call something like filemap_write_and_wait_range()?
>>>
>>> Thanks, have to look into some details of that.
>>>
>>> Looks like the folio_clear_dirty_for_io() is buried in
>>> folio_prepare_writeback(), so that part is taken care of.
>>>
>>> Guess I have to go from folio to "mapping,lstart,lend" such that
>>> __filemap_fdatawrite_range() would look up the folio again. Sounds doable.
>>>
>>> (I assume I have to drop the folio lock+reference before calling that)
>>
>> I was thinking you'd do it higher in the callchain than
>> gmap_make_secure(). Presumably userspace says "I want to make this
>> 256MB range secure" and we can start by writing back that entire
>> 256MB chunk of address space.
>>
>> That doesn't prevent anybody from dirtying it in-between, of course,
>> so you can still get -EBUSY and have to loop round again.
>
> I'm afraid that won't really work.
>
> On the one hand, we might be allocating these pages (+disk blocks)
> during the unpack operation -- where we essentially trigger page faults
> first using gmap_fault() -- so the pages might not even exist before the
> gmap_make_secure() during unpack. One work around would be to
> preallocate+writeback from user space, but it doesn't sound quite right.
>
> But the bigger problem I see is that the initial "unpack" operation is
> not the only case where we trigger this conversion to "secure" state.
> Once the VM is running, we can see calls on arbitrary guest memory even
> during page faults, when gmap_make_secure() is called via
> gmap_convert_to_secure().
>
>
> I'm still not sure why we see essentially no progress being made, even
> though we temporarily drop the PTL, mmap lock, folio lock, folio ref ...
> maybe related to us triggering a write fault that somehow ends up
> setting the folio dirty :/ Or because writeback is simply too slow /
> backs off.
>
> I'll play with handling -EBUSY from split_folio() differently: if the
> folio is under writeback, wait on that. If the folio is dirty, trigger
> writeback. And I'll look into whether we really need a writable PTE, I
> suspect not, because we are not actually "modifying" page content.
The following hack makes it fly:

 	case -E2BIG:
 		folio_lock(folio);
 		rc = split_folio(folio);
+		if (rc == -EBUSY) {
+			if (folio_test_dirty(folio) && !folio_test_anon(folio) &&
+			    folio->mapping) {
+				struct address_space *mapping = folio->mapping;
+				loff_t lstart = folio_pos(folio);
+				loff_t lend = lstart + folio_size(folio);
+
+				folio_unlock(folio);
+				/* Mapping can go away ... */
+				filemap_write_and_wait_range(mapping, lstart, lend);
+			} else {
+				folio_unlock(folio);
+			}
+			folio_wait_writeback(folio);
+			folio_lock(folio);
+			split_folio(folio);
+			folio_unlock(folio);
+			folio_put(folio);
+			return -EAGAIN;
+		}
 		folio_unlock(folio);
 		folio_put(folio);
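
A side note on the byte range in the diff above: as far as I can tell from
the kernel-doc, filemap_write_and_wait_range() treats lend as inclusive, so
lstart + folio_size(folio) reaches one byte into the next folio. That should
be harmless for a hack, but a cleaned-up version would probably subtract one.
A minimal user-space sketch of that arithmetic, where struct toy_range and
toy_folio_range() are made-up names for illustration and not kernel APIs:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type: a byte range with an inclusive end. */
struct toy_range {
	int64_t start;	/* first byte of the range */
	int64_t end;	/* last byte of the range (inclusive) */
};

/*
 * Compute the inclusive byte range covered by an object that starts at
 * file offset pos and is size bytes long (think folio_pos() and
 * folio_size() in the diff above).
 */
static struct toy_range toy_folio_range(int64_t pos, int64_t size)
{
	struct toy_range r;

	assert(size > 0);
	r.start = pos;
	r.end = pos + size - 1;	/* inclusive, so subtract one */
	return r;
}
```

For a 2 MiB folio at file offset 2 MiB, this yields start = 2097152 and
end = 4194303.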

I think the reason why we don't make any progress on s390x is that the
writeback will mark the folio clean and turn the folio read-only in the
page tables as well. So when we look up the folio again in the page table,
we see that the PTE is not writable and trigger a write fault ...

... the write fault will mark the folio dirty again, so the split will
never succeed.

In the above diff, we really must try the split_folio() a second time after
waiting; otherwise we run into the same endless loop.

I'm still not 100% sure if we need a writable PTE; after all, we are not
modifying page content. But that's just a side effect of not being able to
wait for the split_folio() to make progress in the writeback case so we can
retry the split again.
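
To make the suspected loop easier to reason about, here is a toy user-space
model of it. This is only a sketch of my understanding, not kernel code:
struct toy_folio and all toy_*() functions are invented for illustration,
and the model assumes the split fails exactly when the folio is dirty and
that every lookup insists on a writable PTE.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state; these are not kernel structures. */
struct toy_folio {
	bool dirty;		/* folio has data that was not written back */
	bool pte_writable;	/* the PTE mapping the folio allows writes */
};

/* Models split_folio(): assumed to fail while the folio is dirty. */
static bool toy_split(const struct toy_folio *f)
{
	return !f->dirty;
}

/* Models writeback: cleans the folio and write-protects the PTE. */
static void toy_writeback(struct toy_folio *f)
{
	f->dirty = false;
	f->pte_writable = false;
}

/*
 * Models the lookup that wants a writable PTE: if the PTE is read-only,
 * the resulting write fault makes it writable and dirties the folio.
 */
static void toy_lookup_writable(struct toy_folio *f)
{
	if (!f->pte_writable) {
		f->pte_writable = true;
		f->dirty = true;
	}
}

/*
 * Returns the attempt on which the split succeeded, or -1 if it never
 * did within max_tries. With split_after_wait set, the split is retried
 * right after writeback, before the next lookup can re-dirty the folio.
 */
static int toy_make_secure(struct toy_folio *f, bool split_after_wait,
			   int max_tries)
{
	assert(max_tries > 0);
	for (int attempt = 1; attempt <= max_tries; attempt++) {
		toy_lookup_writable(f);
		if (toy_split(f))
			return attempt;
		toy_writeback(f);
		if (split_after_wait && toy_split(f))
			return attempt;
	}
	return -1;
}
```

Starting from a dirty folio with a writable PTE, the model never succeeds
without the second split attempt (it returns -1 for any max_tries), while
with it the split goes through on the first attempt.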
--
Cheers,
David / dhildenb