Re: [PATCH/RFC] mm: do not drop unused pages when userfaultd is running

From: Christian Borntraeger <borntraeger@de.ibm.com>
To: David Hildenbrand <david@redhat.com>,
	linux-mm@kvack.org, linux-s390@vger.kernel.org
Cc: kvm@vger.kernel.org, Janosch Frank <frankja@linux.ibm.com>,
	Cornelia Huck <cohuck@redhat.com>,
	linux-kernel@vger.kernel.org,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH/RFC] mm: do not drop unused pages when userfaultd is running
Date: Thu, 28 Jun 2018 16:51:14 +0200
Message-ID: <eaa540cb-2249-72d1-d4b8-a54c2869d7a3@de.ibm.com>
In-Reply-To: <eca5be20-6c3a-5d9b-152a-f4e8b61b810e@redhat.com>



On 06/28/2018 04:49 PM, David Hildenbrand wrote:
> On 28.06.2018 16:39, Christian Borntraeger wrote:
>>
>>
>> On 06/28/2018 03:18 PM, David Hildenbrand wrote:
>>> On 28.06.2018 14:39, Christian Borntraeger wrote:
>>>> KVM guests on s390 can notify the host of unused pages. This can result
>>>> in pte_unused callbacks returning true for KVM guest memory.
>>>>
>>>> If a page is unused (checked with pte_unused) we might drop this page
>>>> instead of paging it out. This can have side effects on userfaultfd when
>>>> the page in question was already migrated:
>>>>
>>>> The next access of that page will trigger a fault and a user fault
>>>> instead of faulting in a new and empty zero page. As QEMU does not
>>>> expect a userfault on an already migrated page, this migration will fail.
>>>>
>>>> The most straightforward solution is to ignore the pte_unused hint if a
>>>> userfault context is active for this VMA.
>>>>
>>>> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>> ---
>>>>  mm/rmap.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 6db729dc4c50..3f3a72aa99f2 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1481,7 +1481,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>  				set_pte_at(mm, address, pvmw.pte, pteval);
>>>>  			}
>>>>  
>>>> -		} else if (pte_unused(pteval)) {
>>>> +		} else if (pte_unused(pteval) && !vma->vm_userfaultfd_ctx.ctx) {
>>>>  			/*
>>>>  			 * The guest indicated that the page content is of no
>>>>  			 * interest anymore. Simply discard the pte, vmscan
>>>>
>>>
>>> To understand the implications better:
>>>
>>> This is like a MADV_DONTNEED from user space while a userfaultfd
>>> notifier is registered for this vma range.
>>>
>>> While we can block such calls in QEMU ("we registered it, we know it
>>> best"), we can't do the same in the kernel.
>>>
>>> These "internal MADV_DONTNEED" calls can actually take effect "deferred",
>>> so e.g. if the pte_unused() was set before userfaultfd has been registered,
>>> we can still get the same result, right?
>> Not sure I understand your last sentence.
> 
> Rephrased: Trying to stop somebody from setting pte_unused will not
> work, as we might get a userfaultfd registration at some point and
> find a previously set pte_unused afterwards.

Yes, exactly. The unused value can be set before the migration.


> 
>> This place here is called on the unmap (e.g. when the host tries to page out).
>> The value was transferred before (and always before) during the page table invalidation.
>> So pte_unused was always set before. This is the place where we decide if we page
>> out (and establish a swap pte) or just drop this page table entry. So if
>> no userfaultfd is registered at that point in time we are good.
> 
> This certainly applies to ordinary userfaultfd we have right now.
> userfaultfd WP (write-protect) or other features to come might be
> different, but it does not seem to do any harm in case we page out
> instead of dropping it. This way we are on the safe side.

Yes.
> 
> In other words: I think this is the right approach.
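The decision discussed above can be sketched as a small userspace model.
This is only an illustration, not kernel code: the enum, the struct and
the function name are hypothetical, and the model collapses all other
branches of try_to_unmap_one() into a single "preserve the content"
outcome. The only thing taken from the patch is the condition itself.

```c
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in the kernel. */
enum unmap_action {
	UNMAP_DROP,	/* discard the pte, the page content is lost */
	UNMAP_PRESERVE	/* keep the content, e.g. by establishing a swap pte */
};

struct model_vma {
	/* Stands in for vma->vm_userfaultfd_ctx.ctx being non-NULL. */
	bool uffd_registered;
};

/*
 * Mirrors the patched condition:
 *   pte_unused(pteval) && !vma->vm_userfaultfd_ctx.ctx
 * The unused hint is honoured only if no userfaultfd context is
 * attached to the VMA at unmap time.
 */
static enum unmap_action model_unmap_decision(bool pte_unused,
					      const struct model_vma *vma)
{
	if (pte_unused && !vma->uffd_registered)
		return UNMAP_DROP;
	return UNMAP_PRESERVE;
}
```

Because the check runs at unmap time, it does not matter when the
unused hint was set: a hint recorded before the userfaultfd
registration still ends up in the UNMAP_PRESERVE case once a context
is attached to the VMA.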

Thread overview: 7+ messages
2018-06-28 12:39 Christian Borntraeger
2018-06-28 13:18 ` David Hildenbrand
2018-06-28 14:39   ` Christian Borntraeger
2018-06-28 14:49     ` David Hildenbrand
2018-06-28 14:51       ` Christian Borntraeger [this message]
2018-06-29  6:51 ` Christian Borntraeger
2018-06-29 20:46   ` Mike Rapoport