* [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
[not found] <CGME20250306074101epcas1p4b24ac546f93df2c7fe3176607b20e47f@epcas1p4.samsung.com>
@ 2025-03-06 7:40 ` Sooyong Suk
2025-03-06 15:26 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Sooyong Suk @ 2025-03-06 7:40 UTC (permalink / raw)
To: viro, linux-kernel, akpm, linux-mm; +Cc: jaewon31.kim, spssyr, Sooyong Suk
There are GUP references to pages that are serving as direct IO buffers.
Those pages can be allocated from CMA pageblocks even though they can be
pinned until the DIO is completed.
Generally, pinning for each DIO can be considered a transient operation,
as described in the documentation. But if a large amount of direct IO is
requested constantly, pages in CMA pageblocks can stay pinned and unable
to migrate out of their pageblock, which can result in CMA allocation
failure.
In Android devices, on first boot after OTA, snapuserd requests a huge
amount of direct IO reads which might occasionally disturb CMA
allocations.
To prevent this, set FOLL_LONGTERM in gup_flags by default for direct IO
requests via blkdev_direct_IO or __iomap_dio_rw, so that buffer pages are
not allocated from CMA pageblocks.
Signed-off-by: Sooyong Suk <s.suk@samsung.com>
---
block/bio.c | 2 +-
include/linux/uio.h | 2 ++
lib/iov_iter.c | 2 ++
3 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index d5bdc31d88d3..683113b3e35a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1247,7 +1247,7 @@ static unsigned int get_contig_folio_len(unsigned int *num_pages,
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
- iov_iter_extraction_t extraction_flags = 0;
+ iov_iter_extraction_t extraction_flags = ITER_ALLOW_LONGTERM;
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 853f9de5aa05..d1e9174ee29a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -377,6 +377,8 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
/* Flags for iov_iter_get/extract_pages*() */
/* Allow P2PDMA on the extracted pages */
#define ITER_ALLOW_P2PDMA ((__force iov_iter_extraction_t)0x01)
+/* Allow LONGTERM on the extracted pages */
+#define ITER_ALLOW_LONGTERM ((__force iov_iter_extraction_t)0x02)
ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
size_t maxsize, unsigned int maxpages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9ec806f989f2..4b5c7c30cd4d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1832,6 +1832,8 @@ static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
gup_flags |= FOLL_WRITE;
if (extraction_flags & ITER_ALLOW_P2PDMA)
gup_flags |= FOLL_PCI_P2PDMA;
+ if (extraction_flags & ITER_ALLOW_LONGTERM)
+ gup_flags |= FOLL_LONGTERM;
if (i->nofault)
gup_flags |= FOLL_NOFAULT;
--
2.25.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-06 7:40 ` [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO Sooyong Suk
@ 2025-03-06 15:26 ` Christoph Hellwig
2025-03-06 23:28 ` Jaewon Kim
2025-03-07 20:23 ` Matthew Wilcox
0 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-03-06 15:26 UTC (permalink / raw)
To: Sooyong Suk; +Cc: viro, linux-kernel, akpm, linux-mm, jaewon31.kim, spssyr
On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> There are GUP references to pages that are serving as direct IO buffers.
> Those pages can be allocated from CMA pageblocks even though they can be
> pinned until the DIO is completed.
direct I/O is exactly the case that is not FOLL_LONGTERM and one of
the reasons to even have the flag. So big fat no to this.
You also completely failed to address the relevant mailing list and
maintainers.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-06 15:26 ` Christoph Hellwig
@ 2025-03-06 23:28 ` Jaewon Kim
2025-03-07 2:07 ` Sooyong Suk
2025-03-07 20:23 ` Matthew Wilcox
1 sibling, 1 reply; 21+ messages in thread
From: Jaewon Kim @ 2025-03-06 23:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sooyong Suk, viro, linux-kernel, akpm, linux-mm, spssyr, axboe,
linux-block, dhavale, surenb
On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > There are GUP references to pages that are serving as direct IO buffers.
> > Those pages can be allocated from CMA pageblocks even though they can be
> > pinned until the DIO is completed.
>
> direct I/O is exactly the case that is not FOLL_LONGTERM and one of
> the reasons to even have the flag. So big fat no to this.
>
Hello, thank you for your comment.
Sooyong and I wanted to get some opinions on using FOLL_LONGTERM for
direct I/O, since we have seen CMA memory end up with pages that were
pinned by direct IO.
> You also completely failed to address the relevant mailing list and
> maintainers.
I added the block maintainer Jens Axboe and the block layer mailing list
here, and added Suren and Sandeep, too.
^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-06 23:28 ` Jaewon Kim
@ 2025-03-07 2:07 ` Sooyong Suk
2025-03-07 2:28 ` Suren Baghdasaryan
0 siblings, 1 reply; 21+ messages in thread
From: Sooyong Suk @ 2025-03-07 2:07 UTC (permalink / raw)
To: 'Jaewon Kim', 'Christoph Hellwig'
Cc: viro, linux-kernel, akpm, linux-mm, spssyr, axboe, linux-block,
dhavale, surenb
> On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig <hch@infradead.org>
> wrote:
> >
> > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > There are GUP references to pages that are serving as direct IO
> buffers.
> > > Those pages can be allocated from CMA pageblocks even though they can be
> > > pinned until the DIO is completed.
> >
> > direct I/O is exactly the case that is not FOLL_LONGTERM and one of the
> > reasons to even have the flag. So big fat no to this.
> >
>
Understood.
> Hello, thank you for your comment.
> We, Sooyong and I, wanted to get some opinions about this FOLL_LONGTERM
> for direct I/O as CMA memory got pinned pages which had been pinned from
> direct io.
>
> > You also completely failed to address the relevant mailing list and
> > maintainers.
>
> I added block maintainer Jens Axboe and the block layer mailing list here,
> and added Suren and Sandeep, too.
Then, what do you think of using PF_MEMALLOC_PIN for this context, as below?
This will only remove __GFP_MOVABLE from the allocation flags.
Since __bio_iov_iter_get_pages() indicates that it will pin user or kernel
pages, there seems to be no reason not to use this process flag.
block/bio.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index 65c796ecb..671e28966 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
unsigned len, i = 0;
size_t offset;
int ret = 0;
+ unsigned int flags;
/*
* Move page array up in the allocated memory for the bio vecs as far as
@@ -1267,9 +1268,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
* result to ensure the bio's total size is correct. The remainder of
* the iov data will be picked up in the next bio iteration.
*/
+ flags = memalloc_pin_save();
size = iov_iter_extract_pages(iter, &pages,
UINT_MAX - bio->bi_iter.bi_size,
nr_pages, extraction_flags, &offset);
+ memalloc_pin_restore(flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-07 2:07 ` Sooyong Suk
@ 2025-03-07 2:28 ` Suren Baghdasaryan
2025-03-07 6:38 ` Sooyong Suk
2025-03-12 15:17 ` Christoph Hellwig
0 siblings, 2 replies; 21+ messages in thread
From: Suren Baghdasaryan @ 2025-03-07 2:28 UTC (permalink / raw)
To: Sooyong Suk
Cc: Jaewon Kim, Christoph Hellwig, viro, linux-kernel, akpm,
linux-mm, spssyr, axboe, linux-block, dhavale
On Thu, Mar 6, 2025 at 6:07 PM Sooyong Suk <s.suk@samsung.com> wrote:
>
> > On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig <hch@infradead.org>
> > wrote:
> > >
> > > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > > There are GUP references to pages that are serving as direct IO
> > buffers.
> > > > Those pages can be allocated from CMA pageblocks even though they
> > > > can be pinned until the DIO is completed.
> > >
> > > direct I/O is exactly the case that is not FOLL_LONGTERM and one of the
> > > reasons to even have the flag. So big fat no to this.
> > >
> >
>
> Understood.
>
> > Hello, thank you for your comment.
> > We, Sooyong and I, wanted to get some opinions about this FOLL_LONGTERM
> > for direct I/O as CMA memory got pinned pages which had been pinned from
> > direct io.
> >
> > > You also completely failed to address the relevant mailing list and
> > > maintainers.
> >
> > I added block maintainer Jens Axboe and the block layer mailing list here,
> > and added Suren and Sandeep, too.
I'm very far from being a block layer expert :)
>
> Then, what do you think of using PF_MEMALLOC_PIN for this context as below?
> This will only remove __GFP_MOVABLE from its allocation flag.
> Since __bio_iov_iter_get_pages() indicates that it will pin user or kernel pages,
> there seems to be no reason not to use this process flag.
I think this will help you only when the pages are faulted in but if
__get_user_pages() finds an already mapped page which happens to be
allocated from CMA, it will not migrate it. So, you might still end up
with unmovable pages inside CMA.
>
> block/bio.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index 65c796ecb..671e28966 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> unsigned len, i = 0;
> size_t offset;
> int ret = 0;
> + unsigned int flags;
>
> /*
> * Move page array up in the allocated memory for the bio vecs as far as
> @@ -1267,9 +1268,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> * result to ensure the bio's total size is correct. The remainder of
> * the iov data will be picked up in the next bio iteration.
> */
> + flags = memalloc_pin_save();
> size = iov_iter_extract_pages(iter, &pages,
> UINT_MAX - bio->bi_iter.bi_size,
> nr_pages, extraction_flags, &offset);
> + memalloc_pin_restore(flags);
> if (unlikely(size <= 0))
> return size ? size : -EFAULT;
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-07 2:28 ` Suren Baghdasaryan
@ 2025-03-07 6:38 ` Sooyong Suk
2025-03-12 15:17 ` Christoph Hellwig
1 sibling, 0 replies; 21+ messages in thread
From: Sooyong Suk @ 2025-03-07 6:38 UTC (permalink / raw)
To: 'Suren Baghdasaryan'
Cc: 'Jaewon Kim', 'Christoph Hellwig',
viro, linux-kernel, akpm, linux-mm, spssyr, axboe, linux-block,
dhavale
> On Thu, Mar 6, 2025 at 6:07 PM Sooyong Suk <s.suk@samsung.com> wrote:
> >
> > > On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig
> > > <hch@infradead.org>
> > > wrote:
> > > >
> > > > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > > > There are GUP references to pages that are serving as direct IO
> > > buffers.
> > > > > Those pages can be allocated from CMA pageblocks even though
> > > > > they can be pinned until the DIO is completed.
> > > >
> > > > direct I/O is exactly the case that is not FOLL_LONGTERM and one of
> > > > the reasons to even have the flag. So big fat no to this.
> > > >
> > >
> >
> > Understood.
> >
> > > Hello, thank you for your comment.
> > > We, Sooyong and I, wanted to get some opinions about this
> > > FOLL_LONGTERM for direct I/O as CMA memory got pinned pages which
> > > had been pinned from direct io.
> > >
> > > > You also completely failed to address the relevant mailing list and
> > > > maintainers.
> > >
> > > I added block maintainer Jens Axboe and the block layer mailing list
> > > here, and added Suren and Sandeep, too.
>
> I'm very far from being a block layer expert :)
>
> >
> > Then, what do you think of using PF_MEMALLOC_PIN for this context as
> below?
> > This will only remove __GFP_MOVABLE from its allocation flag.
> > Since __bio_iov_iter_get_pages() indicates that it will pin user or
> > kernel pages, there seems to be no reason not to use this process flag.
>
> I think this will help you only when the pages are faulted in but if
> __get_user_pages() finds an already mapped page which happens to be
> allocated from CMA, it will not migrate it. So, you might still end up
> with unmovable pages inside CMA.
>
Yes, you're right.
However, we can at least prevent issues from fault-in cases and mitigate
the overall probability of CMA allocation failure. And the pinned pages that
we observed from snapuserd were also allocated via fault-in.
> >
> > block/bio.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/block/bio.c b/block/bio.c index 65c796ecb..671e28966
> > 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio
> *bio, struct iov_iter *iter)
> > unsigned len, i = 0;
> > size_t offset;
> > int ret = 0;
> > + unsigned int flags;
> >
> > /*
> > * Move page array up in the allocated memory for the bio vecs
> > as far as @@ -1267,9 +1268,11 @@ static int
> __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > * result to ensure the bio's total size is correct. The remainder
> of
> > * the iov data will be picked up in the next bio iteration.
> > */
> > + flags = memalloc_pin_save();
> > size = iov_iter_extract_pages(iter, &pages,
> > UINT_MAX - bio->bi_iter.bi_size,
> > nr_pages, extraction_flags,
> > &offset);
> > + memalloc_pin_restore(flags);
> > if (unlikely(size <= 0))
> > return size ? size : -EFAULT;
> >
> >
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-06 15:26 ` Christoph Hellwig
2025-03-06 23:28 ` Jaewon Kim
@ 2025-03-07 20:23 ` Matthew Wilcox
2025-03-07 21:37 ` Suren Baghdasaryan
2025-03-12 15:21 ` Christoph Hellwig
1 sibling, 2 replies; 21+ messages in thread
From: Matthew Wilcox @ 2025-03-07 20:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sooyong Suk, viro, linux-kernel, akpm, linux-mm, jaewon31.kim, spssyr
On Thu, Mar 06, 2025 at 07:26:52AM -0800, Christoph Hellwig wrote:
> On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > There are GUP references to pages that are serving as direct IO buffers.
> > Those pages can be allocated from CMA pageblocks even though they can be
> > pinned until the DIO is completed.
>
> direct I/O is exactly the case that is not FOLL_LONGTERM and one of
> the reasons to even have the flag. So big fat no to this.
>
> You also completely failed to address the relevant mailing list and
> maintainers.
You're right; this patch is so bad that it's insulting.
However, the problem is real. And the alternative "solution" being
proposed is worse -- reintroducing cleancache and frontswap.
What I've been asking for and don't have the answer to yet is:
- What latency is acceptable to reclaim the pages allocated from CMA
pageblocks?
- Can we afford a TLB shootdown? An rmap walk?
- Is the problem with anonymous or pagecache memory?
I have vaguely been wondering about creating a separate (fake) NUMA node
for the CMA memory so that userspace can control "none of this memory is
in the CMA blocks". But that's not a great solution either.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-07 20:23 ` Matthew Wilcox
@ 2025-03-07 21:37 ` Suren Baghdasaryan
2025-03-12 15:21 ` Christoph Hellwig
1 sibling, 0 replies; 21+ messages in thread
From: Suren Baghdasaryan @ 2025-03-07 21:37 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Sooyong Suk, viro, linux-kernel, akpm,
linux-mm, jaewon31.kim, spssyr
On Fri, Mar 7, 2025 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 06, 2025 at 07:26:52AM -0800, Christoph Hellwig wrote:
> > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > There are GUP references to pages that are serving as direct IO buffers.
> > > Those pages can be allocated from CMA pageblocks even though they can be
> > > pinned until the DIO is completed.
> >
> > direct I/O is exactly the case that is not FOLL_LONGTERM and one of
> > the reasons to even have the flag. So big fat no to this.
> >
> > You also completely failed to address the relevant mailing list and
> > maintainers.
>
> You're right; this patch is so bad that it's insulting.
>
> However, the problem is real. And the alternative "solution" being
> proposed is worse -- reintroducing cleancache and frontswap.
Matthew, if you are referring to the GCMA proposal I'm working on,
that's not to address this problem. My goal with GCMA is to reuse
memory carveouts (when they are not used) for extending pagecache.
The way I understand this particular problem is that we know direct
I/O will allocate pages and make them unmovable and we do nothing to
prevent these allocations from using CMA.
>
> What I've been asking for and don't have the answer to yet is:
I'll send my findings related to GCMA usecases separately since I
don't want to mix that with the problem discussed here.
>
> - What latency is acceptable to reclaim the pages allocated from CMA
> pageblocks?
> - Can we afford a TLB shootdown? An rmap walk?
> - Is the problem with anonymous or pagecache memory?
>
> I have vaguely been wondering about creating a separate (fake) NUMA node
> for the CMA memory so that userspace can control "none of this memory is
> in the CMA blocks". But that's not a great solution either.
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-07 2:28 ` Suren Baghdasaryan
2025-03-07 6:38 ` Sooyong Suk
@ 2025-03-12 15:17 ` Christoph Hellwig
2025-03-12 15:20 ` Suren Baghdasaryan
1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-03-12 15:17 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Sooyong Suk, Jaewon Kim, Christoph Hellwig, viro, linux-kernel,
akpm, linux-mm, spssyr, axboe, linux-block, dhavale
On Thu, Mar 06, 2025 at 06:28:40PM -0800, Suren Baghdasaryan wrote:
> I think this will help you only when the pages are faulted in but if
> __get_user_pages() finds an already mapped page which happens to be
> allocated from CMA, it will not migrate it. So, you might still end up
> with unmovable pages inside CMA.
Direct I/O pages are not unmovable. They are temporarily pinned for
the duration of the direct I/O.
I really don't understand what problem you're trying to fix here.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-12 15:17 ` Christoph Hellwig
@ 2025-03-12 15:20 ` Suren Baghdasaryan
2025-03-12 15:25 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2025-03-12 15:20 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sooyong Suk, Jaewon Kim, viro, linux-kernel, akpm, linux-mm,
spssyr, axboe, linux-block, dhavale
On Wed, Mar 12, 2025 at 8:17 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Mar 06, 2025 at 06:28:40PM -0800, Suren Baghdasaryan wrote:
> > I think this will help you only when the pages are faulted in but if
> > __get_user_pages() finds an already mapped page which happens to be
> > allocated from CMA, it will not migrate it. So, you might still end up
> > with unmovable pages inside CMA.
>
> Direct I/O pages are not unmovable. They are temporarily pinned for
> the duration of the direct I/O.
Yes but even temporarily pinned pages can cause CMA allocation
failure. My point is that if we know beforehand that the pages will be
pinned we could avoid using CMA and these failures would go away.
>
> I really don't understand what problem you're trying to fix here.
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-07 20:23 ` Matthew Wilcox
2025-03-07 21:37 ` Suren Baghdasaryan
@ 2025-03-12 15:21 ` Christoph Hellwig
2025-03-13 22:49 ` David Hildenbrand
1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-03-12 15:21 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Sooyong Suk, viro, linux-kernel, akpm,
linux-mm, jaewon31.kim, spssyr
On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
> However, the problem is real.
What is the problem?
> What I've been asking for and don't have the answer to yet is:
>
> - What latency is acceptable to reclaim the pages allocated from CMA
> pageblocks?
> - Can we afford a TLB shootdown? An rmap walk?
> - Is the problem with anonymous or pagecache memory?
>
> I have vaguely been wondering about creating a separate (fake) NUMA node
> for the CMA memory so that userspace can control "none of this memory is
> in the CMA blocks". But that's not a great solution either.
Maybe I'm misunderstanding things, but CMA basically provides a region
that allows for large contiguous allocations from it, but otherwise
is used as bog-standard kernel memory. But anyone who wants to allocate
from it needs to move all that memory. Which to me implies that:
- latency can be expected to be horrible because a lot of individual
allocations need to possibly be moved, and all of them could
be temporarily pinned for I/O
- any driver using CMA better do this during early boot time, or
at least under the expectation that doing a CMA allocation
temporarily causes a huge performance degradation.
If a caller can't cope with that it better don't use CMA.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-12 15:20 ` Suren Baghdasaryan
@ 2025-03-12 15:25 ` Christoph Hellwig
2025-03-12 15:38 ` Suren Baghdasaryan
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-03-12 15:25 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Christoph Hellwig, Sooyong Suk, Jaewon Kim, viro, linux-kernel,
akpm, linux-mm, spssyr, axboe, linux-block, dhavale
On Wed, Mar 12, 2025 at 08:20:36AM -0700, Suren Baghdasaryan wrote:
> > Direct I/O pages are not unmovable. They are temporarily pinned for
> > the duration of the direct I/O.
>
> Yes but even temporarily pinned pages can cause CMA allocation
> failure. My point is that if we know beforehand that the pages will be
> pinned we could avoid using CMA and these failures would go away.
Direct I/O (and other users of pin_user_pages) are designed to work
on all anonymous and file backed pages, which is kinda the point.
If your CMA user can't wait for the time of an I/O, something is wrong
with that caller and it really should not use CMA.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-12 15:25 ` Christoph Hellwig
@ 2025-03-12 15:38 ` Suren Baghdasaryan
2025-03-12 15:52 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2025-03-12 15:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sooyong Suk, Jaewon Kim, viro, linux-kernel, akpm, linux-mm,
spssyr, axboe, linux-block, dhavale
On Wed, Mar 12, 2025 at 8:25 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Mar 12, 2025 at 08:20:36AM -0700, Suren Baghdasaryan wrote:
> > > Direct I/O pages are not unmovable. They are temporarily pinned for
> > > the duration of the direct I/O.
> >
> > Yes but even temporarily pinned pages can cause CMA allocation
> > failure. My point is that if we know beforehand that the pages will be
> > pinned we could avoid using CMA and these failures would go away.
>
> Direct I/O (and other users of pin_user_pages) are designed to work
> on all anonymous and file backed pages, which is kinda the point.
> If your CMA user can't wait for the time of an I/O something is wrong
> with that caller and it really should not use CMA.
I might be wrong but my understanding is that we should try to
allocate from CMA when the allocation is movable (not pinned), so that
CMA can move those pages if necessary. I understand that in some cases
a movable allocation can be pinned and we don't know beforehand
whether it will be pinned or not. But in this case we know it will
happen and could avoid this situation.
Yeah, low latency usecases for CMA are problematic and I think the
only current alternative (apart from solutions involving HW changes) is
to use memory carveouts. Device vendors hate that since carved-out
memory ends up poorly utilized. I'm working on a GCMA proposal which
hopefully can address that.
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-12 15:38 ` Suren Baghdasaryan
@ 2025-03-12 15:52 ` Christoph Hellwig
2025-03-12 16:03 ` Bart Van Assche
2025-03-12 16:06 ` Suren Baghdasaryan
0 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-03-12 15:52 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Christoph Hellwig, Sooyong Suk, Jaewon Kim, viro, linux-kernel,
akpm, linux-mm, spssyr, axboe, linux-block, dhavale
On Wed, Mar 12, 2025 at 08:38:07AM -0700, Suren Baghdasaryan wrote:
> I might be wrong but my understanding is that we should try to
> allocate from CMA when the allocation is movable (not pinned), so that
> CMA can move those pages if necessary. I understand that in some cases
> a movable allocation can be pinned and we don't know beforehand
> whether it will be pinned or not. But in this case we know it will
> happen and could avoid this situation.
Any file or anonymous folio can be temporarily pinned for I/O and only
moved once that completes. Direct I/O is one use case for that but there
are plenty others. I'm not sure how you define "beforehand", but the
pinning is visible in the _pincount field.
> Yeah, low latency usecases for CMA are problematic and I think the
> only current alternative (apart from solutions involving HW change) is
> > to use memory carveouts. Device vendors hate that since carved-out
> memory ends up poorly utilized. I'm working on a GCMA proposal which
> hopefully can address that.
I'd still like to understand what the use case is. Who does CMA
allocation at a time where heavy direct I/O is in progress?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-12 15:52 ` Christoph Hellwig
@ 2025-03-12 16:03 ` Bart Van Assche
2025-03-12 16:06 ` Suren Baghdasaryan
1 sibling, 0 replies; 21+ messages in thread
From: Bart Van Assche @ 2025-03-12 16:03 UTC (permalink / raw)
To: Christoph Hellwig, Suren Baghdasaryan
Cc: Sooyong Suk, Jaewon Kim, viro, linux-kernel, akpm, linux-mm,
spssyr, axboe, linux-block, dhavale
On 3/12/25 8:52 AM, Christoph Hellwig wrote:
> I'd still like to understand what the use case is. Who does CMA
> allocation at a time where heavy direct I/O is in progress?
An additional question: why is contiguous memory allocated? Is this
perhaps because the allocated memory will be used for DMA? If so,
can the SMMU be used to make it appear contiguous to DMA clients?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-12 15:52 ` Christoph Hellwig
2025-03-12 16:03 ` Bart Van Assche
@ 2025-03-12 16:06 ` Suren Baghdasaryan
2025-03-12 16:21 ` Christoph Hellwig
1 sibling, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2025-03-12 16:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sooyong Suk, Jaewon Kim, viro, linux-kernel, akpm, linux-mm,
spssyr, axboe, linux-block, dhavale
On Wed, Mar 12, 2025 at 8:52 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Mar 12, 2025 at 08:38:07AM -0700, Suren Baghdasaryan wrote:
> > I might be wrong but my understanding is that we should try to
> > allocate from CMA when the allocation is movable (not pinned), so that
> > CMA can move those pages if necessary. I understand that in some cases
> > a movable allocation can be pinned and we don't know beforehand
> > whether it will be pinned or not. But in this case we know it will
> > happen and could avoid this situation.
>
> Any file or anonymous folio can be temporarily pinned for I/O and only
> moved once that completes. Direct I/O is one use case for that but there
> are plenty others. I'm not sure how you define "beforehand", but the
> pinning is visible in the _pincount field.
Well, by "beforehand" I mean that when allocating for Direct I/O
operation we know this memory will be pinned, so we could tell the
allocator to avoid CMA. However I agree that FOLL_LONGTERM is a wrong
way to accomplish that.
>
> > Yeah, low latency usecases for CMA are problematic and I think the
> > only current alternative (apart from solutions involving HW change) is
> > to use memory carveouts. Device vendors hate that since carved-out
> > memory ends up poorly utilized. I'm working on a GCMA proposal which
> > hopefully can address that.
>
> I'd still like to understand what the use case is. Who does CMA
> allocation at a time where heavy direct I/O is in progress?
I'll let Samsung folks clarify their usecase.
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-12 16:06 ` Suren Baghdasaryan
@ 2025-03-12 16:21 ` Christoph Hellwig
0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-03-12 16:21 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Christoph Hellwig, Sooyong Suk, Jaewon Kim, viro, linux-kernel,
akpm, linux-mm, spssyr, axboe, linux-block, dhavale
On Wed, Mar 12, 2025 at 09:06:02AM -0700, Suren Baghdasaryan wrote:
> > Any file or anonymous folio can be temporarily pinned for I/O and only
> > moved once that completes. Direct I/O is one use case for that but there
> > are plenty others. I'm not sure how you define "beforehand", but the
> > pinning is visible in the _pincount field.
>
> Well, by "beforehand" I mean that when allocating for Direct I/O
> operation we know this memory will be pinned,
Direct I/O is performed on anonymous (or more rarely) file backed pages
that are allocated from the normal allocators. Some callers might know
that they are eventually going to perform direct I/O on them, but most
won't as that information is a few layers removed from them or totally
hidden in libraries.
The same is true for other pin_user_pages operations. If you want memory
that is easily available for CMA allocations it better not be given out
as anonymous memory, and probably also not as file backed memory. Which
just leaves you with easily migratable kernel allocations, i.e. not much.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-12 15:21 ` Christoph Hellwig
@ 2025-03-13 22:49 ` David Hildenbrand
2025-03-15 1:04 ` John Hubbard
0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2025-03-13 22:49 UTC (permalink / raw)
To: Christoph Hellwig, Matthew Wilcox
Cc: Sooyong Suk, viro, linux-kernel, akpm, linux-mm, jaewon31.kim, spssyr
On 12.03.25 16:21, Christoph Hellwig wrote:
> On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
>> However, the problem is real.
>
> What is the problem?
I think the problem is the CMA allocation failure, not the latency.
"if a large amount of direct IO is requested constantly, this can make
pages in CMA pageblocks pinned and unable to migrate outside of the
pageblock"
We'd need a more reliable way to make CMA allocation -> page migration
make progress. For example, after we isolated the pageblocks and
migration starts doing its thing, we could disallow any further GUP
pins. (e.g., make GUP spin or wait for migration to end)
We could detect in GUP code that a folio is soon expected to be migrated
by checking the pageblock (isolated) and/or whether the folio is locked.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-13 22:49 ` David Hildenbrand
@ 2025-03-15 1:04 ` John Hubbard
2025-03-15 23:00 ` David Hildenbrand
0 siblings, 1 reply; 21+ messages in thread
From: John Hubbard @ 2025-03-15 1:04 UTC (permalink / raw)
To: David Hildenbrand, Christoph Hellwig, Matthew Wilcox, Jason Gunthorpe
Cc: Sooyong Suk, viro, linux-kernel, akpm, linux-mm, jaewon31.kim, spssyr
On 3/13/25 3:49 PM, David Hildenbrand wrote:
> On 12.03.25 16:21, Christoph Hellwig wrote:
>> On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
>>> However, the problem is real.
>>
>> What is the problem?
>
> I think the problem is the CMA allocation failure, not the latency.
>
> "if a large amount of direct IO is requested constantly, this can make
> pages in CMA pageblocks pinned and unable to migrate outside of the
> pageblock"
>
> We'd need a more reliable way to make CMA allocation -> page migration
> make progress. For example, after we isolated the pageblocks and
> migration starts doing its thing, we could disallow any further GUP
> pins. (e.g., make GUP spin or wait for migration to end)
>
> We could detect in GUP code that a folio is soon expected to be migrated
> by checking the pageblock (isolated) and/or whether the folio is locked.
>
Jason Gunthorpe and Matthew both had some ideas about how to fix this [1],
which were very close (maybe the same) to what you're saying here: sleep
and spin in an killable loop.
It turns out to be a little difficult to do this--I had trouble making
the folio's "has waiters" bit work for this, for example. And then...squirrel!
However, I still believe, so far, this is the right approach. I'm just not
sure which thing to wait on, exactly.
[1] https://lore.kernel.org/20240502183408.GC3341011@nvidia.com
thanks,
--
John Hubbard
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-15 1:04 ` John Hubbard
@ 2025-03-15 23:00 ` David Hildenbrand
2025-03-15 23:09 ` Zi Yan
0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2025-03-15 23:00 UTC (permalink / raw)
To: John Hubbard, Christoph Hellwig, Matthew Wilcox, Jason Gunthorpe
Cc: Sooyong Suk, viro, linux-kernel, akpm, linux-mm, jaewon31.kim,
spssyr, Zi Yan
On 15.03.25 02:04, John Hubbard wrote:
> On 3/13/25 3:49 PM, David Hildenbrand wrote:
>> On 12.03.25 16:21, Christoph Hellwig wrote:
>>> On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
>>>> However, the problem is real.
>>>
>>> What is the problem?
>>
>> I think the problem is the CMA allocation failure, not the latency.
>>
>> "if a large amount of direct IO is requested constantly, this can make
>> pages in CMA pageblocks pinned and unable to migrate outside of the
>> pageblock"
>>
>> We'd need a more reliable way to make CMA allocation -> page migration
>> make progress. For example, after we isolated the pageblocks and
>> migration starts doing its thing, we could disallow any further GUP
>> pins. (e.g., make GUP spin or wait for migration to end)
>>
>> We could detect in GUP code that a folio is soon expected to be migrated
>> by checking the pageblock (isolated) and/or whether the folio is locked.
>>
>
> Jason Gunthorpe and Matthew both had some ideas about how to fix this [1],
> which were very close (maybe the same) to what you're saying here: sleep
> and spin in an killable loop.
>
> It turns out to be a little difficult to do this--I had trouble making
> the folio's "has waiters" bit work for this, for example. And then...squirrel!
>
> However, I still believe, so far, this is the right approach. I'm just not
> sure which thing to wait on, exactly.
Zi Yan has a series to convert the "isolate" state of pageblocks to a
separate pageblock bit; it could be considered a lock-bit. Currently,
it's essentially the migratetype being MIGRATE_ISOLATE.
As soon as a pageblock is isolated, one must be prepared for contained
pages/folios to get migrated. The folio lock will only be grabbed once
actually trying to migrate a folio IIRC, so it might not be the best
choice: especially considering allocations that span many pageblocks.
So maybe one would need a "has waiters" bit per pageblock, so relevant
users (e.g., GUP) could wait on the isolate bit getting cleared.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
2025-03-15 23:00 ` David Hildenbrand
@ 2025-03-15 23:09 ` Zi Yan
0 siblings, 0 replies; 21+ messages in thread
From: Zi Yan @ 2025-03-15 23:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: John Hubbard, Christoph Hellwig, Matthew Wilcox, Jason Gunthorpe,
Sooyong Suk, viro, linux-kernel, akpm, linux-mm, jaewon31.kim,
spssyr
On 15 Mar 2025, at 19:00, David Hildenbrand wrote:
> On 15.03.25 02:04, John Hubbard wrote:
>> On 3/13/25 3:49 PM, David Hildenbrand wrote:
>>> On 12.03.25 16:21, Christoph Hellwig wrote:
>>>> On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
>>>>> However, the problem is real.
>>>>
>>>> What is the problem?
>>>
>>> I think the problem is the CMA allocation failure, not the latency.
>>>
>>> "if a large amount of direct IO is requested constantly, this can make
>>> pages in CMA pageblocks pinned and unable to migrate outside of the
>>> pageblock"
>>>
>>> We'd need a more reliable way to make CMA allocation -> page migration
>>> make progress. For example, after we isolated the pageblocks and
>>> migration starts doing its thing, we could disallow any further GUP
>>> pins. (e.g., make GUP spin or wait for migration to end)
>>>
>>> We could detect in GUP code that a folio is soon expected to be migrated
>>> by checking the pageblock (isolated) and/or whether the folio is locked.
>>>
>>
>> Jason Gunthorpe and Matthew both had some ideas about how to fix this [1],
>> which were very close (maybe the same) to what you're saying here: sleep
>> and spin in an killable loop.
>>
>> It turns out to be a little difficult to do this--I had trouble making
>> the folio's "has waiters" bit work for this, for example. And then...squirrel!
>>
>> However, I still believe, so far, this is the right approach. I'm just not
>> sure which thing to wait on, exactly.
>
> Zi Yan has a series to convert the "isolate" state of pageblocks to a separate pageblock bit; it could be considered a lock-bit. Currently, it's essentially the migratetype being MIGRATE_ISOLATE.
>
> As soon as a pageblock is isolated, one must be prepared for contained pages/folios to get migrated. The folio lock will only be grabbed once actually trying to migrate a folio IIRC, so it might not be the best choice: especially considering allocations that span many pageblocks.
>
> So maybe one would need a "has waiters" bit per pageblock, so relevant users (e.g., GUP) could wait on the isolate bit getting cleared.
The patchset is at: https://lore.kernel.org/linux-mm/20250214154215.717537-1-ziy@nvidia.com/. I should be able to work on it soon, as I have been busy with
the folio_split() patchset recently. My patchset extends the migratetype bits from 4 to 8
and uses bit 7 for MIGRATE_ISOLATE.
--
Best Regards,
Yan, Zi
end of thread, other threads:[~2025-03-15 23:10 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CGME20250306074101epcas1p4b24ac546f93df2c7fe3176607b20e47f@epcas1p4.samsung.com>
2025-03-06 7:40 ` [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO Sooyong Suk
2025-03-06 15:26 ` Christoph Hellwig
2025-03-06 23:28 ` Jaewon Kim
2025-03-07 2:07 ` Sooyong Suk
2025-03-07 2:28 ` Suren Baghdasaryan
2025-03-07 6:38 ` Sooyong Suk
2025-03-12 15:17 ` Christoph Hellwig
2025-03-12 15:20 ` Suren Baghdasaryan
2025-03-12 15:25 ` Christoph Hellwig
2025-03-12 15:38 ` Suren Baghdasaryan
2025-03-12 15:52 ` Christoph Hellwig
2025-03-12 16:03 ` Bart Van Assche
2025-03-12 16:06 ` Suren Baghdasaryan
2025-03-12 16:21 ` Christoph Hellwig
2025-03-07 20:23 ` Matthew Wilcox
2025-03-07 21:37 ` Suren Baghdasaryan
2025-03-12 15:21 ` Christoph Hellwig
2025-03-13 22:49 ` David Hildenbrand
2025-03-15 1:04 ` John Hubbard
2025-03-15 23:00 ` David Hildenbrand
2025-03-15 23:09 ` Zi Yan