From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	kvm@vger.kernel.org, Zi Yan <ziy@nvidia.com>,
	Christian Brauner <brauner@kernel.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Thomas Huth <thuth@redhat.com>
Subject: Re: [ISSUE] split_folio() and dirty IOMAP folios
Date: Thu, 21 Nov 2024 13:15:21 +0100
Message-ID: <fbb59ba8-7d8c-4d64-ab46-d4950c073018@redhat.com>
In-Reply-To: <d3600a33-a481-4c4c-bda6-a446f1c965c6@redhat.com>

On 11.11.24 16:19, David Hildenbrand wrote:
> On 08.11.24 10:11, David Hildenbrand wrote:
>> On 07.11.24 21:20, Matthew Wilcox wrote:
>>> On Thu, Nov 07, 2024 at 05:34:40PM +0100, David Hildenbrand wrote:
>>>> On 07.11.24 17:09, Matthew Wilcox wrote:
>>>>> On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
>>>>>> I'm debugging an interesting problem: split_folio() will fail on dirty
>>>>>> folios on XFS, and I am not sure who will trigger the writeback in a timely
>>>>>> manner so code relying on the split to work at some point (in sane setups
>>>>>> where page pinning is not applicable) can make progress.
>>>>>
>>>>> You could call something like filemap_write_and_wait_range()?
>>>>
>>>> Thanks, have to look into some details of that.
>>>>
>>>> Looks like the folio_clear_dirty_for_io() is buried in
>>>> folio_prepare_writeback(), so that part is taken care of.
>>>>
>>>> Guess I have to go from folio to "mapping,lstart,lend" such that
>>>> __filemap_fdatawrite_range() would look up the folio again. Sounds doable.
>>>>
>>>> (I assume I have to drop the folio lock+reference before calling that)
>>>
>>> I was thinking you'd do it higher in the callchain than
>>> gmap_make_secure().  Presumably userspace says "I want to make this
>>> 256MB range secure" and we can start by writing back that entire
>>> 256MB chunk of address space.
>>>
>>> That doesn't prevent anybody from dirtying it in-between, of course,
>>> so you can still get -EBUSY and have to loop round again.
>>
>> I'm afraid that won't really work.
>>
>> On the one hand, we might be allocating these pages (+disk blocks)
>> during the unpack operation -- where we essentially trigger page faults
>> first using gmap_fault() -- so the pages might not even exist before the
>> gmap_make_secure() during unpack. One workaround would be to
>> preallocate+writeback from user space, but it doesn't sound quite right.
>>
>> But the bigger problem I see is that the initial "unpack" operation is
>> not the only case where we trigger this conversion to "secure" state.
>> Once the VM is running, we can see calls on arbitrary guest memory even
>> during page faults, when gmap_make_secure() is called via
>> gmap_convert_to_secure().
>>
>>
>> I'm still not sure why we see essentially no progress being made, even
>> though we temporarily drop the PTL, mmap lock, folio lock, folio ref ...
>> maybe related to us triggering a write fault that somehow ends up
>> setting the folio dirty :/ Or because writeback is simply too slow /
>> backs off.
>>
>> I'll play with handling -EBUSY from split_folio() differently: if the
>> folio is under writeback, wait on that. If the folio is dirty, trigger
>> writeback. And I'll look into whether we really need a writable PTE, I
>> suspect not, because we are not actually "modifying" page content.
> 
> The following hack makes it fly:
> 
>           case -E2BIG:
>                   folio_lock(folio);
>                   rc = split_folio(folio);
> +               if (rc == -EBUSY) {
> +                       if (folio_test_dirty(folio) && !folio_test_anon(folio) &&
> +                           folio->mapping) {
> +                               struct address_space *mapping = folio->mapping;
> +                               loff_t lstart = folio_pos(folio);
> +                               /* lend is inclusive for filemap_write_and_wait_range() */
> +                               loff_t lend = lstart + folio_size(folio) - 1;
> +
> +                               folio_unlock(folio);
> +                               /* Mapping can go away ... */
> +                               filemap_write_and_wait_range(mapping, lstart, lend);
> +                       } else {
> +                               folio_unlock(folio);
> +                       }
> +                       folio_wait_writeback(folio);
> +                       folio_lock(folio);
> +                       split_folio(folio);
> +                       folio_unlock(folio);
> +                       folio_put(folio);
> +                       return -EAGAIN;
> +               }
>                   folio_unlock(folio);
>                   folio_put(folio);
> 
> 
> I think the reason why we don't make any progress on s390x is that the writeback will
> mark the folio clean and turn the folio read-only in the page tables as well. So when we
> look up the folio again in the page table, we see that the PTE is not writable and
> trigger a write fault ...
> 
> ... the write fault will mark the folio dirty again, so the split will never succeed.
> 
> In the above diff, we really must try the split_folio() a second time after waiting, otherwise we
> run into the same endless loop.
> 
> 
> I'm still not 100% sure if we need a writable PTE; after all we are not modifying page content.
> But that's just a side effect of not being able to wait for the split_folio() to make progress
> in the writeback case so we can retry the split again.

After discussing this with Darrick and Willy yesterday, I think the 
reason we need a writable PTE is that we *might* modify page content:

"Requests the Ultravisor to make a page accessible to a guest. If it's 
brought in the first time, it will be cleared. If it has been exported 
before, it will be decrypted and integrity checked."

So we'll be effectively modifying the page content we will read when the 
(now secure) page is in the unprotected/exported state.

That makes things more complicated, unfortunately :)
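
To make the loop from the quoted analysis concrete, here is a toy userspace
model in C. It is a sketch and not kernel code: every toy_* name is
hypothetical, and the only behaviour it assumes is what is described above
(the split fails on a dirty folio, writeback cleans the folio and
write-protects the PTE, and the conversion path needs a writable PTE, so the
write fault dirties the folio again).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */
struct toy_folio {
	bool dirty;        /* stands in for folio_test_dirty() */
	bool pte_writable; /* state of the PTE mapping the folio */
	bool large;        /* true until the folio has been split */
};

/* Models split_folio(): refuses to split a dirty folio. */
static int toy_split(struct toy_folio *f)
{
	if (f->dirty)
		return -1; /* stands in for -EBUSY */
	f->large = false;
	return 0;
}

/* Models writeback: the folio becomes clean and the PTE read-only. */
static void toy_writeback(struct toy_folio *f)
{
	f->dirty = false;
	f->pte_writable = false;
}

/* Models the page table lookup: a read-only PTE triggers a write fault,
 * which makes the PTE writable and dirties the folio again. */
static void toy_lookup_writable(struct toy_folio *f)
{
	if (!f->pte_writable) {
		f->pte_writable = true;
		f->dirty = true;
	}
}

/* One conversion attempt: 0 on success, -1 if the caller has to retry. */
static int toy_make_secure(struct toy_folio *f, bool split_after_wait)
{
	toy_lookup_writable(f);
	if (!f->large)
		return 0;
	if (toy_split(f) != 0) {
		toy_writeback(f);
		if (split_after_wait)
			toy_split(f); /* the folio is clean right now */
	}
	return -1; /* stands in for -EAGAIN */
}

/* Returns the number of attempts needed, or -1 if no progress is made. */
static int toy_attempts(bool split_after_wait, int max_attempts)
{
	struct toy_folio f = { .dirty = true, .pte_writable = true, .large = true };

	assert(max_attempts > 0);
	for (int i = 1; i <= max_attempts; i++)
		if (toy_make_secure(&f, split_after_wait) == 0)
			return i;
	return -1;
}
```

In this model, toy_attempts(false, 1000) never succeeds, because every retry
re-dirties the folio before the split is attempted, while
toy_attempts(true, 1000) succeeds on the second attempt because the split
runs while the folio is still clean.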

-- 
Cheers,

David / dhildenb


