Inaccessible pages & folios

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Inaccessible pages & folios
@ 2021-04-09 19:40 Matthew Wilcox
  2021-04-12 12:18 ` Claudio Imbrenda
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2021-04-09 19:40 UTC (permalink / raw)
  To: Claudio Imbrenda; +Cc: linux-mm, linux-s390

Hi Claudio,

I'm working on making the page cache manage memory in larger chunks than
PAGE_SIZE [1] [2].  In doing so, I came across this call that you added:

@@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
                inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
        }
        unlock_page_memcg(page);
+       access_ret = arch_make_page_accessible(page);

I'm going to change __test_set_page_writeback() to take a folio [3] and
now I'm wondering what interface you'd like to use.  My preference would
be to rename arch_make_page_accessible() to arch_make_folio_accessible()
and pass a folio, at which time you would make the entire folio (however
many pages might be in it) accessible.  If you would rather, we can
leave the interface as arch_make_page_accessible(), in which case we'll
just call it N times in __test_set_page_writeback() (and I won't need
to touch gup.c).

Let me know what you want.

[1] https://lwn.net/Articles/849538/
[2] https://lore.kernel.org/linux-mm/20210409185105.188284-1-willy@infradead.org/
[3] https://git.infradead.org/users/willy/pagecache.git/commitdiff/85297eb08f1b034b9652ea63dd053e3be4d7de7f

PS: The prototype is in gfp.h.  That's not really appropriate; gfp.h
is about allocating memory, and this call really has nothing to do with
memory allocation.  I think mm.h is a better place for it, if you can't
find a better header file than that.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Inaccessible pages & folios
  2021-04-09 19:40 Inaccessible pages & folios Matthew Wilcox
@ 2021-04-12 12:18 ` Claudio Imbrenda
  2021-04-12 12:43   ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Claudio Imbrenda @ 2021-04-12 12:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-s390

On Fri, 9 Apr 2021 20:40:59 +0100
Matthew Wilcox <willy@infradead.org> wrote:

> Hi Claudio,
> 
> I'm working on making the page cache manage memory in larger chunks
> than PAGE_SIZE [1] [2].  In doing so, I came across this call that

yes, I had read the LWN article with great interest

> you added:
> 
> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page
> *page, bool keep_write) inc_zone_page_state(page,
> NR_ZONE_WRITE_PENDING); }
>         unlock_page_memcg(page);
> +       access_ret = arch_make_page_accessible(page);
> 
> I'm going to change __test_set_page_writeback() to take a folio [3]
> and now I'm wondering what interface you'd like to use.  My
> preference would be to rename arch_make_page_accessible() to
> arch_make_folio_accessible() and pass a folio, at which time you
> would make the entire folio (however many pages might be in it)
> accessible.  If you would rather, we can leave the interface as
> arch_make_page_accessible(), in which case we'll just call it N times
> in __test_set_page_writeback() (and I won't need to touch gup.c).

For the rename case, how would you handle gup.c?

Consider that arch_make_page_accessible deals (typically) with KVM
guest pages. Once you bundle up the pages in folios, you can have
different pages in the same folio with different properties.

In case of failure, you could end up with a folio with some pages
processed and some not processed. Would you stop at the first error?
What would the state of the folio be? On s390x we use the PG_arch_1 bit
to mark secure pages, how would that work with folios?

and how are fault handlers affected by this folio conversion? would
they still work on pages, or would that also work on folios? on s390x
we use the arch_make_page_accessible function in some fault handlers.

a possible approach maybe would be to keep the _page variant, and add a
_folio wrapper around it

for s390x the PG_arch_1 is very important to prevent protected pages
from being fed to I/O, as in that case Very Bad Things™ would happen.

sorry for the wall of questions, but I actually like your folio
approach and I want to understand it better, so we can find a way to
make everything work well together

> Let me know what you want.
> 
> [1] https://lwn.net/Articles/849538/
> [2]
> https://lore.kernel.org/linux-mm/20210409185105.188284-1-willy@infradead.org/
> [3]
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/85297eb08f1b034b9652ea63dd053e3be4d7de7f
> 
> PS: The prototype is in gfp.h.  That's not really appropriate; gfp.h
> is about allocating memory, and this call really has nothing to do
> with memory allocation.  I think mm.h is a better place for it, if
> you can't find a better header file than that.

I had put it there because arch_alloc_page and arch_free_page are also
there, and the behaviour, from a kernel point of view, is similar
(unaccessible/unallocated pages will trigger a fault). 

I actually do not have a preference regarding where the prototype
lives, as long as everything works. If you think mm.h is more
appropriate, go for it :)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Inaccessible pages & folios
  2021-04-12 12:18 ` Claudio Imbrenda
@ 2021-04-12 12:43   ` Matthew Wilcox
  2021-04-12 13:37     ` Claudio Imbrenda
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2021-04-12 12:43 UTC (permalink / raw)
  To: Claudio Imbrenda; +Cc: linux-mm, linux-s390

On Mon, Apr 12, 2021 at 02:18:09PM +0200, Claudio Imbrenda wrote:
> On Fri, 9 Apr 2021 20:40:59 +0100
> Matthew Wilcox <willy@infradead.org> wrote:
> > I'm going to change __test_set_page_writeback() to take a folio [3]
> > and now I'm wondering what interface you'd like to use.  My
> > preference would be to rename arch_make_page_accessible() to
> > arch_make_folio_accessible() and pass a folio, at which time you
> > would make the entire folio (however many pages might be in it)
> > accessible.  If you would rather, we can leave the interface as
> > arch_make_page_accessible(), in which case we'll just call it N times
> > in __test_set_page_writeback() (and I won't need to touch gup.c).
> 
> For the rename case, how would you handle gup.c?

At first, I'd turn it into arch_make_folio_accessible(page_folio(page));

Eventually, gup.c needs to become folio-aware.  I haven't spent too much
time thinking about it, but code written like this:

                page = pte_page(pte);
                head = try_grab_compound_head(page, 1, flags);
                if (!head)
                        goto pte_unmap;
                if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                        put_compound_head(head, 1, flags);
                        goto pte_unmap;
                }
                VM_BUG_ON_PAGE(compound_head(page) != head, page);

is just crying out for use of folios.  Also, some of the gup callers
would much prefer to work in terms of folios than individual struct pages
(imagine an RDMA adapter that wants to pin several gigabytes of memory
that's allocated using hugetlbfs for example).

> Consider that arch_make_page_accessible deals (typically) with KVM
> guest pages. Once you bundle up the pages in folios, you can have
> different pages in the same folio with different properties.

So what you're saying is that the host might allocate, eg a 1GB folio
for a guest, then the guest splits that up into smaller chunks (eg 1MB),
and would only want one of those small chunks accessible to the hypervisor?

> In case of failure, you could end up with a folio with some pages
> processed and some not processed. Would you stop at the first error?
> What would the state of the folio be? On s390x we use the PG_arch_1 bit
> to mark secure pages, how would that work with folios?
> 
> and how are fault handlers affected by this folio conversion? would
> they still work on pages, or would that also work on folios? on s390x
> we use the arch_make_page_accessible function in some fault handlers.

Folios can be mapped into userspace at an unaligned offset.  So we still
have to work in pages, at least for now.  We might have some optimised
path for aligned folios later.

> a possible approach maybe would be to keep the _page variant, and add a
> _folio wrapper around it

Yes, we can do that.  It's what I'm currently doing for
flush_dcache_folio().

> for s390x the PG_arch_1 is very important to prevent protected pages
> from being fed to I/O, as in that case Very Bad Things™ would happen.
> 
> sorry for the wall of questions, but I actually like your folio
> approach and I want to understand it better, so we can find a way to
> make everything work well together

Great!

> > PS: The prototype is in gfp.h.  That's not really appropriate; gfp.h
> > is about allocating memory, and this call really has nothing to do
> > with memory allocation.  I think mm.h is a better place for it, if
> > you can't find a better header file than that.
> 
> I had put it there because arch_alloc_page and arch_free_page are also
> there, and the behaviour, from a kernel point of view, is similar
> (unaccessible/unallocated pages will trigger a fault). 
> 
> I actually do not have a preference regarding where the prototype
> lives, as long as everything works. If you think mm.h is more
> appropriate, go for it :)

Heh, I see how you got there from the implementors point of view ;-)
I'll move it ...


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Inaccessible pages & folios
  2021-04-12 12:43   ` Matthew Wilcox
@ 2021-04-12 13:37     ` Claudio Imbrenda
  2021-04-12 13:55       ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Claudio Imbrenda @ 2021-04-12 13:37 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-s390

On Mon, 12 Apr 2021 13:43:41 +0100
Matthew Wilcox <willy@infradead.org> wrote:

[...]

> > For the rename case, how would you handle gup.c?  
> 
> At first, I'd turn it into
> arch_make_folio_accessible(page_folio(page));

that's tricky. what happens if some pages in the folio cannot be made
accessible?

also, I assume you keep the semantic difference between get_page and
pin_page? that's also very important for us

> Eventually, gup.c needs to become folio-aware.  I haven't spent too
> much time thinking about it, but code written like this:
> 
>                 page = pte_page(pte);
>                 head = try_grab_compound_head(page, 1, flags);
>                 if (!head)
>                         goto pte_unmap;
>                 if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>                         put_compound_head(head, 1, flags);
>                         goto pte_unmap;
>                 }
>                 VM_BUG_ON_PAGE(compound_head(page) != head, page);
> 
> is just crying out for use of folios.  Also, some of the gup callers
> would much prefer to work in terms of folios than individual struct
> pages (imagine an RDMA adapter that wants to pin several gigabytes of
> memory that's allocated using hugetlbfs for example).
> 
> > Consider that arch_make_page_accessible deals (typically) with KVM
> > guest pages. Once you bundle up the pages in folios, you can have
> > different pages in the same folio with different properties.  
> 
> So what you're saying is that the host might allocate, eg a 1GB folio
> for a guest, then the guest splits that up into smaller chunks (eg
> 1MB), and would only want one of those small chunks accessible to the
> hypervisor?

qemu will allocate a big chunk of memory, and I/O would happen only on
small chunks (depending on what the guest does). I don't know how swap
and pagecache would behave in the folio scenario.

Also consider that currently we need 4k hardware pages for protected
guests (so folios would be ok, as long as they are backed by small
pages)

How and when are folios created actually?

is there a way to prevent creation of multi-page folios?

> > In case of failure, you could end up with a folio with some pages
> > processed and some not processed. Would you stop at the first error?
> > What would the state of the folio be? On s390x we use the PG_arch_1
> > bit to mark secure pages, how would that work with folios?
> > 
> > and how are fault handlers affected by this folio conversion? would
> > they still work on pages, or would that also work on folios? on
> > s390x we use the arch_make_page_accessible function in some fault
> > handlers.  
> 
> Folios can be mapped into userspace at an unaligned offset.  So we
> still have to work in pages, at least for now.  We might have some
> optimised path for aligned folios later.
> 
> > a possible approach maybe would be to keep the _page variant, and
> > add a _folio wrapper around it  
> 
> Yes, we can do that.  It's what I'm currently doing for
> flush_dcache_folio().

where would the page flags be stored? as I said, we really depend on
that bit to be set correctly to prevent potentially disruptive I/O
errors. It's ok if the bit overindicates protection (non-protected
pages can be marked as protected), but protected pages must at all
times have the bit set.

the reason why this hook exists at all, is to prevent secure pages from
being accidentally (or maliciously) fed into I/O

> > for s390x the PG_arch_1 is very important to prevent protected pages
> > from being fed to I/O, as in that case Very Bad Things™ would
> > happen.
> > 
> > sorry for the wall of questions, but I actually like your folio
> > approach and I want to understand it better, so we can find a way to
> > make everything work well together  
> 
> Great!
> 
> > > PS: The prototype is in gfp.h.  That's not really appropriate;
> > > gfp.h is about allocating memory, and this call really has
> > > nothing to do with memory allocation.  I think mm.h is a better
> > > place for it, if you can't find a better header file than that.  
> > 
> > I had put it there because arch_alloc_page and arch_free_page are
> > also there, and the behaviour, from a kernel point of view, is
> > similar (unaccessible/unallocated pages will trigger a fault). 
> > 
> > I actually do not have a preference regarding where the prototype
> > lives, as long as everything works. If you think mm.h is more
> > appropriate, go for it :)  
> 
> Heh, I see how you got there from the implementors point of view ;-)
> I'll move it ...



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Inaccessible pages & folios
  2021-04-12 13:37     ` Claudio Imbrenda
@ 2021-04-12 13:55       ` Matthew Wilcox
  2021-04-15  9:28         ` Claudio Imbrenda
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2021-04-12 13:55 UTC (permalink / raw)
  To: Claudio Imbrenda; +Cc: linux-mm, linux-s390

On Mon, Apr 12, 2021 at 03:37:18PM +0200, Claudio Imbrenda wrote:
> On Mon, 12 Apr 2021 13:43:41 +0100
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> [...]
> 
> > > For the rename case, how would you handle gup.c?  
> > 
> > At first, I'd turn it into
> > arch_make_folio_accessible(page_folio(page));
> 
> that's tricky. what happens if some pages in the folio cannot be made
> accessible?

I was only thinking about the page cache case ...

        access_ret = arch_make_page_accessible(page);
        /*
         * If writeback has been triggered on a page that cannot be made
         * accessible, it is too late to recover here.
         */
        VM_BUG_ON_PAGE(access_ret != 0, page);

... where it seems all pages _can_ be made accessible.

> also, I assume you keep the semantic difference between get_page and
> pin_page? that's also very important for us

I haven't changed anything in gup.c yet.  Just trying to get the page
cache to suck less right now.

> > So what you're saying is that the host might allocate, eg a 1GB folio
> > for a guest, then the guest splits that up into smaller chunks (eg
> > 1MB), and would only want one of those small chunks accessible to the
> > hypervisor?
> 
> qemu will allocate a big chunk of memory, and I/O would happen only on
> small chunks (depending on what the guest does). I don't know how swap
> and pagecache would behave in the folio scenario.
> 
> Also consider that currently we need 4k hardware pages for protected
> guests (so folios would be ok, as long as they are backed by small
> pages)
> 
> How and when are folios created actually?
> 
> is there a way to prevent creation of multi-page folios?

Today there's no way to create multi-page folios because I haven't submitted
the patch to add alloc_folio() and friends:

https://git.infradead.org/users/willy/pagecache.git/commitdiff/4fe26f7a28ffdc850cd016cdaaa74974c59c5f53

We do have a way to allocate compound pages and add them to the page cache,
but that's only in use by tmpfs/shmem.

What will happen is that (for filesystems which support multipage folios),
they'll be allocated by the page cache.  I expect other places will start
to use folios after that (eg anonymous memory), but I don't know where
all those places will be.  I hope not to be involved in that!

The general principle, though, is that the overhead of tracking memory in
page-sized units is too high, and we need to use larger units by default.
There are occasions when we need to do things to memory in smaller units,
and for those, we can choose to either handle sub-folio things, or we
can split a folio apart into smaller folios.

> > > a possible approach maybe would be to keep the _page variant, and
> > > add a _folio wrapper around it  
> > 
> > Yes, we can do that.  It's what I'm currently doing for
> > flush_dcache_folio().
> 
> where would the page flags be stored? as I said, we really depend on
> that bit to be set correctly to prevent potentially disruptive I/O
> errors. It's ok if the bit overindicates protection (non-protected
> pages can be marked as protected), but protected pages must at all
> times have the bit set.
> 
> the reason why this hook exists at all, is to prevent secure pages from
> being accidentally (or maliciously) fed into I/O

You can still use PG_arch_1 on the sub-pages of a folio.  It's one of the
things you'll have to decide, actually.  Does setting PG_arch_1 on
the head page of the folio indicate that the entire page is accessible,
or just that the head page is accessible?  Different page flags have
made different decisions here.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Inaccessible pages & folios
  2021-04-12 13:55       ` Matthew Wilcox
@ 2021-04-15  9:28         ` Claudio Imbrenda
  0 siblings, 0 replies; 6+ messages in thread
From: Claudio Imbrenda @ 2021-04-15  9:28 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-s390

On Mon, 12 Apr 2021 14:55:14 +0100
Matthew Wilcox <willy@infradead.org> wrote:

[...]

> 
> I was only thinking about the page cache case ...
> 
>         access_ret = arch_make_page_accessible(page);
>         /*
>          * If writeback has been triggered on a page that cannot be
> made
>          * accessible, it is too late to recover here.
>          */
>         VM_BUG_ON_PAGE(access_ret != 0, page);
> 
> ... where it seems all pages _can_ be made accessible.

yes, for that case it is straightforward

> > also, I assume you keep the semantic difference between get_page and
> > pin_page? that's also very important for us  
> 
> I haven't changed anything in gup.c yet.  Just trying to get the page
> cache to suck less right now.

fair enough :)
 
> > > So what you're saying is that the host might allocate, eg a 1GB
> > > folio for a guest, then the guest splits that up into smaller
> > > chunks (eg 1MB), and would only want one of those small chunks
> > > accessible to the hypervisor?  
> > 
> > qemu will allocate a big chunk of memory, and I/O would happen only
> > on small chunks (depending on what the guest does). I don't know
> > how swap and pagecache would behave in the folio scenario.
> > 
> > Also consider that currently we need 4k hardware pages for protected
> > guests (so folios would be ok, as long as they are backed by small
> > pages)
> > 
> > How and when are folios created actually?
> > 
> > is there a way to prevent creation of multi-page folios?  
> 
> Today there's no way to create multi-page folios because I haven't
> submitted the patch to add alloc_folio() and friends:
> 
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/4fe26f7a28ffdc850cd016cdaaa74974c59c5f53
> 
> We do have a way to allocate compound pages and add them to the page
> cache, but that's only in use by tmpfs/shmem.
> 
> What will happen is that (for filesystems which support multipage
> folios), they'll be allocated by the page cache.  I expect other
> places will start to use folios after that (eg anonymous memory), but
> I don't know where all those places will be.  I hope not to be
> involved in that!
> 
> The general principle, though, is that the overhead of tracking
> memory in page-sized units is too high, and we need to use larger
> units by default. There are occasions when we need to do things to
> memory in smaller units, and for those, we can choose to either
> handle sub-folio things, or we can split a folio apart into smaller
> folios.
> 
> > > > a possible approach maybe would be to keep the _page variant,
> > > > and add a _folio wrapper around it    
> > > 
> > > Yes, we can do that.  It's what I'm currently doing for
> > > flush_dcache_folio().  
> > 
> > where would the page flags be stored? as I said, we really depend on
> > that bit to be set correctly to prevent potentially disruptive I/O
> > errors. It's ok if the bit overindicates protection (non-protected
> > pages can be marked as protected), but protected pages must at all
> > times have the bit set.
> > 
> > the reason why this hook exists at all, is to prevent secure pages
> > from being accidentally (or maliciously) fed into I/O  
> 
> You can still use PG_arch_1 on the sub-pages of a folio.  It's one of
> the things you'll have to decide, actually.  Does setting PG_arch_1 on
> the head page of the folio indicate that the entire page is
> accessible, or just that the head page is accessible?  Different page
> flags have made different decisions here.

ok then, I think the simplest and safest thing to do right now is to
keep the flag on each page


in short:
* pagecache -> you can put a loop or introduce a _folio wrapper for
  arch_make_page_accessible
* gup.c -> won't be touched for now, but when the time comes, the
  PG_arch_1 bit should be set for each page



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-04-15  9:28 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-09 19:40 Inaccessible pages & folios Matthew Wilcox
2021-04-12 12:18 ` Claudio Imbrenda
2021-04-12 12:43   ` Matthew Wilcox
2021-04-12 13:37     ` Claudio Imbrenda
2021-04-12 13:55       ` Matthew Wilcox
2021-04-15  9:28         ` Claudio Imbrenda

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox