Re: [ISSUE] split_folio() and dirty IOMAP folios

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	kvm@vger.kernel.org, Zi Yan <ziy@nvidia.com>,
	Christian Brauner <brauner@kernel.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Thomas Huth <thuth@redhat.com>
Subject: Re: [ISSUE] split_folio() and dirty IOMAP folios
Date: Mon, 11 Nov 2024 16:19:31 +0100	[thread overview]
Message-ID: <d3600a33-a481-4c4c-bda6-a446f1c965c6@redhat.com> (raw)
In-Reply-To: <6099e202-ef0a-4d21-958c-2c42db43a5bb@redhat.com>

On 08.11.24 10:11, David Hildenbrand wrote:
> On 07.11.24 21:20, Matthew Wilcox wrote:
>> On Thu, Nov 07, 2024 at 05:34:40PM +0100, David Hildenbrand wrote:
>>> On 07.11.24 17:09, Matthew Wilcox wrote:
>>>> On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
>>>>> I'm debugging an interesting problem: split_folio() will fail on dirty
>>>>> folios on XFS, and I am not sure who will trigger the writeback in a timely
>>>>> manner so code relying on the split to work at some point (in sane setups
>>>>> where page pinning is not applicable) can make progress.
>>>>
>>>> You could call something like filemap_write_and_wait_range()?
>>>
>>> Thanks, have to look into some details of that.
>>>
>>> Looks like the folio_clear_dirty_for_io() is buried in
>>> folio_prepare_writeback(), so that part is taken care of.
>>>
>>> Guess I have to fo from folio to "mapping,lstart,lend" such that
>>> __filemap_fdatawrite_range() would look up the folio again. Sounds doable.
>>>
>>> (I assume I have to drop the folio lock+reference before calling that)
>>
>> I was thinking you'd do it higher in the callchain than
>> gmap_make_secure().  Presumably userspace says "I want to make this
>> 256MB range secure" and we can start by writing back that entire
>> 256MB chunk of address space.
>>
>> That doesn't prevent anybody from dirtying it in-between, of course,
>> so you can still get -EBUSY and have to loop round again.
> 
> I'm afraid that won't really work.
> 
> On the one hand, we might be allocating these pages (+disk blocks)
> during the unpack operation -- where we essentially trigger page faults
> first using gmap_fault() -- so the pages might not even exist before the
> gmap_make_secure() during unpack. One work around would be to
> preallocate+writeback from user space, but it doesn't sound quite right.
> 
> But the bigger problem I see is that the initial "unpack" operation is
> not the only case where we trigger this conversion to "secure" state.
> Once the VM is running, we can see calls on arbitrary guest memory even
> during page faults, when gmap_make_secure() is called via
> gmap_convert_to_secure().
> 
> 
> I'm still not sure why we see essentially no progress being made, even
> though we temporarily drop the PTL, mmap lock, folio lock, folio ref ...
> maybe related to us triggering a write fault that somehow ends up
> setting the folio dirty :/ Or because writeback is simply too slow /
> backs off.
> 
> I'll play with handling -EBUSY from split_folio() differently: if the
> folio is under writeback, wait on that. If the folio is dirty, trigger
> writeback. And I'll look into whether we really need a writable PTE, I
> suspect not, because we are not actually "modifying" page content.

The following hack makes it fly:

         case -E2BIG:
                 folio_lock(folio);
                 rc = split_folio(folio);
+               if (rc == -EBUSY) {
+                       if (folio_test_dirty(folio) && !folio_test_anon(folio) &&
+                           folio->mapping) {
+                               struct address_space *mapping = folio->mapping;
+                               loff_t lstart = folio_pos(folio);
+                               loff_t lend = lstart + folio_size(folio);
+
+                               folio_unlock(folio);
+                               /* Mapping can go away ... */
+                               filemap_write_and_wait_range(mapping, lstart, lend);
+                       } else {
+                               folio_unlock(folio);
+                       }
+                       folio_wait_writeback(folio);
+                       folio_lock(folio);
+                       split_folio(folio);
+                       folio_unlock(folio);
+                       folio_put(folio);
+                       return -EAGAIN;
+               }
                 folio_unlock(folio);
                 folio_put(folio);


I think the reason why we don't make any progress on s390x is that the writeback will
mark the folio clean and turn the folio read-only in the page tables as well. So when we
lookup the folio again in the page table, we see that the PTE is not writable and
trigger a write fault ...

... the write fault will mark the folio dirty again, so the split will never succeed.

In above diff, we really must try the split_folio() a second time after waiting, otherwise we
run into the same endless loop.


I'm still not 100% sure if we need a writable PTE; after all we are not modifying page content.
But that's just a side effect of not being able to wait for the split_folio() to make progress
in the writeback case so we can retry the split again.

-- 
Cheers,

David / dhildenb

next prev parent reply	other threads:[~2024-11-11 15:19 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-07 15:07 David Hildenbrand
2024-11-07 16:09 ` Matthew Wilcox
2024-11-07 16:34   ` David Hildenbrand
2024-11-07 20:20     ` Matthew Wilcox
2024-11-08  9:11       ` David Hildenbrand
2024-11-11 15:19         ` David Hildenbrand [this message]
2024-11-21 12:15           ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d3600a33-a481-4c4c-bda6-a446f1c965c6@redhat.com \
    --to=david@redhat.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=brauner@kernel.org \
    --cc=djwong@kernel.org \
    --cc=frankja@linux.ibm.com \
    --cc=imbrenda@linux.ibm.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=thuth@redhat.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox