From: Claudio Imbrenda <imbrenda@linux.ibm.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: <linux-next@vger.kernel.org>, <akpm@linux-foundation.org>,
<jack@suse.cz>, <kirill@shutemov.name>, <borntraeger@de.ibm.com>,
<david@redhat.com>, <aarcange@redhat.com>, <linux-mm@kvack.org>,
<frankja@linux.ibm.com>, <sfr@canb.auug.org.au>,
<linux-kernel@vger.kernel.org>, <linux-s390@vger.kernel.org>,
Will Deacon <will@kernel.org>
Subject: Re: [PATCH v3 2/2] mm/gup/writeback: add callbacks for inaccessible pages
Date: Fri, 6 Mar 2020 12:18:23 +0100 [thread overview]
Message-ID: <20200306121823.50d253ac@p-imbrenda> (raw)
In-Reply-To: <f58b6839-5233-5ccf-1f1d-60b3b8aaf417@nvidia.com>
On Thu, 5 Mar 2020 14:30:03 -0800
John Hubbard <jhubbard@nvidia.com> wrote:
> On 3/4/20 5:06 AM, Claudio Imbrenda wrote:
> > With the introduction of protected KVM guests on s390 there is now a
> > concept of inaccessible pages. These pages need to be made
> > accessible before the host can access them.
> >
> > While CPU accesses will trigger a fault that can be resolved, I/O
> > accesses will just fail. We need to add a callback into
> > architecture code for places that will do I/O, namely when
> > writeback is started or when a page reference is taken.
> >
> > This is not only to enable paging, file backing, etc.; it is also
> > necessary to protect the host against a malicious user space. For
> > example, a bad QEMU could simply start direct I/O on such protected
> > memory. We do not want userspace to be able to trigger I/O errors,
> > and thus the logic is "whenever somebody accesses that page (gup)
> > or does I/O, make sure that this page can be accessed". When the
> > guest tries to access that page we will wait in the page fault
> > handler for writeback to have finished and for the page_ref to be
> > the expected value.
> >
> > On s390x the function is not supposed to fail, so it is OK to use a
> > WARN_ON on failure. If we ever need more fine-grained handling, we
> > can tackle this when we know the details.
> >
> > Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > Acked-by: Will Deacon <will@kernel.org>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > ---
> > include/linux/gfp.h | 6 ++++++
> > mm/gup.c | 30 +++++++++++++++++++++++++++---
> > mm/page-writeback.c | 5 +++++
> > 3 files changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index e5b817cb86e7..be2754841369 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
> > #ifndef HAVE_ARCH_ALLOC_PAGE
> > static inline void arch_alloc_page(struct page *page, int order) { }
> > #endif
> > +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> > +static inline int arch_make_page_accessible(struct page *page)
> > +{
> > + return 0;
> > +}
> > +#endif
> >
> > struct page *
> > __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 81a95fbe9901..d0c4c6f336bb 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> > struct page *page;
> > spinlock_t *ptl;
> > pte_t *ptep, pte;
> > + int ret;
> >
> > /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> > if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> > @@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> > if (is_zero_pfn(pte_pfn(pte))) {
> > page = pte_page(pte);
> > } else {
> > - int ret;
> > -
> > ret = follow_pfn_pte(vma, address, ptep, flags);
> > page = ERR_PTR(ret);
> > goto out;
> > @@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> > }
> >
> > if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> > - int ret;
> > get_page(page);
> > pte_unmap_unlock(ptep, ptl);
> > lock_page(page);
> > @@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> > page = ERR_PTR(-ENOMEM);
> > goto out;
> > }
> > + /*
> > + * We need to make the page accessible if and only if we are going
> > + * to access its content (the FOLL_PIN case). Please see
> > + * Documentation/core-api/pin_user_pages.rst for details.
> > + */
> > + if (flags & FOLL_PIN) {
> > + ret = arch_make_page_accessible(page);
> > + if (ret) {
> > + unpin_user_page(page);
> > + page = ERR_PTR(ret);
> > + goto out;
> > + }
> > + }
> > if (flags & FOLL_TOUCH) {
> > if ((flags & FOLL_WRITE) &&
> > !pte_dirty(pte) && !PageDirty(page))
> > @@ -2162,6 +2173,19 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > VM_BUG_ON_PAGE(compound_head(page) != head, page);
> >
> > + /*
> > + * We need to make the page accessible if and only if we are
> > + * going to access its content (the FOLL_PIN case). Please
> > + * see Documentation/core-api/pin_user_pages.rst for
> > + * details.
> > + */
> > + if (flags & FOLL_PIN) {
> > + ret = arch_make_page_accessible(page);
> > + if (ret) {
> > + unpin_user_page(page);
> > + goto pte_unmap;
> > + }
> > + }
> > SetPageReferenced(page);
> > pages[*nr] = page;
> > (*nr)++;
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index ab5a3cee8ad3..8384be5a2758 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> > inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> > }
> > unlock_page_memcg(page);
> > + /*
> > + * If writeback has been triggered on a page that cannot be made
> > + * accessible, it is too late.
> > + */
> > + WARN_ON(arch_make_page_accessible(page));
>
> Hi,
>
> Sorry for not commenting on this earlier. After looking at this a
> bit, I think a tiny tweak would be helpful, because:
>
> a) WARN_ON() is a big problem for per-page issues, because, like
> ants, pages are prone to show up in large groups. And a warning and
> backtrace for each such page can easily bring a system to a crawl.
>
> b) Based on your explanation of how this works, what your situation
> really seems to call for is the standard "crash hard in DEBUG builds,
> in order to keep developers out of trouble, but continue on in
> non-DEBUG builds".
>
> So maybe you'd be better protected with this instead:
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ab5a3cee8ad3..b7f3d0766a5f 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page)
> int __test_set_page_writeback(struct page *page, bool keep_write)
> {
> struct address_space *mapping = page_mapping(page);
> - int ret;
> + int ret, access_ret;
>
> lock_page_memcg(page);
> if (mapping && mapping_use_writeback_tags(mapping)) {
> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> }
> unlock_page_memcg(page);
> + access_ret = arch_make_page_accessible(page);
> + /*
> + * If writeback has been triggered on a page that cannot be made
> + * accessible, it is too late to recover here.
> + */
> + VM_BUG_ON_PAGE(access_ret != 0, page);
> +
> return ret;
>
> }
>
> Assuming that's acceptable, you can add:
>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
>
> to the updated patch.
I will send an updated patch, thanks a lot for the feedback!