* Possible race with page_maybe_dma_pinned?
@ 2021-09-29 19:57 Peter Xu
2021-09-29 22:47 ` Linus Torvalds
0 siblings, 1 reply; 4+ messages in thread
From: Peter Xu @ 2021-09-29 19:57 UTC (permalink / raw)
To: Linux MM Mailing List
Cc: Jason Gunthorpe, Linus Torvalds, John Hubbard, Jan Kara,
Andrew Morton, Andrea Arcangeli
Hi, all,
It seems racy to call page_maybe_dma_pinned() without properly taking the
mm->write_protect_seq lock for write; fast-gup takes the read side of that
seqlock.
Now we have 3 callers of page_maybe_dma_pinned():
1. page_needs_cow_for_dma
2. pte_is_pinned
3. shrink_page_list
The 1st one is good as it takes the seqlock for write properly. The 2nd
and 3rd are missing it; we may need to add it there too.
The race can trigger when a FOLL_PIN fast-gup happens right after a call to
page_maybe_dma_pinned() that returned false. One example, for the page
reclaim in case 3 above:
  fast-gup thread                 page reclaim thread
  ---------------                 -------------------
                                  page_maybe_dma_pinned --> false
                                  put the page into swap cache
  fast-gup with FOLL_PIN
                                  unmap page in pgtables
  ...
So commit feb889fb40fa ("mm: don't put pinned pages into the swap cache",
2021-01-17) still has a small window in which it stops working. The same
applies to pte_is_pinned() for clear_refs, which is case 2 above.
If anyone agrees, and would like to fix this, please add:

Reported-by: Andrea Arcangeli <aarcange@redhat.com>

as this was originally spotted and reported by Andrea.
Thanks,
--
Peter Xu
^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Possible race with page_maybe_dma_pinned?

From: Linus Torvalds @ 2021-09-29 22:47 UTC (permalink / raw)
To: Peter Xu
Cc: Linux MM Mailing List, Jason Gunthorpe, John Hubbard, Jan Kara,
    Andrew Morton, Andrea Arcangeli

On Wed, Sep 29, 2021 at 12:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> Now we have 3 callers of page_maybe_dma_pinned():
>
> 1. page_needs_cow_for_dma
> 2. pte_is_pinned
> 3. shrink_page_list
>
> The 1st one is good as it takes the seqlock for write properly. The 2nd
> and 3rd are missing it; we may need to add it there too.

Well, the pte_is_pinned() case at least could do the seqlock in
clear_soft_dirty() - it has the vma and mm available.

The page shrinker has always been problematic since it doesn't have
the vma (and by "always" I mean "modern times" - long ago we used to
scan virtually, in the days before rmap).

One option might be for fast-gup to give up on locked pages. I think
the page lock is the only thing that shrink_page_list() serializes
with.

           Linus
* Re: Possible race with page_maybe_dma_pinned?

From: John Hubbard @ 2021-09-30  1:57 UTC (permalink / raw)
To: Linus Torvalds, Peter Xu
Cc: Linux MM Mailing List, Jason Gunthorpe, Jan Kara, Andrew Morton,
    Andrea Arcangeli

On 9/29/21 15:47, Linus Torvalds wrote:
> On Wed, Sep 29, 2021 at 12:57 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> Now we have 3 callers of page_maybe_dma_pinned():
>>
>> 1. page_needs_cow_for_dma
>> 2. pte_is_pinned
>> 3. shrink_page_list
>>
>> The 1st one is good as it takes the seqlock for write properly. The 2nd
>> and 3rd are missing it; we may need to add it there too.
>
> Well, the pte_is_pinned() case at least could do the seqlock in
> clear_soft_dirty() - it has the vma and mm available.

That part seems straightforward, OK.

> The page shrinker has always been problematic since it doesn't have
> the vma (and by "always" I mean "modern times" - long ago we used to
> scan virtually, in the days before rmap).
>
> One option might be for fast-gup to give up on locked pages. I think
> the page lock is the only thing that shrink_page_list() serializes
> with.

In order to avoid locked pages in gup fast, it is easiest to do a
check for locked pages *after* fast-pinning them, and unpin them before
returning to the caller. This makes the change much smaller.

However, doing so leaves a window of time during which the pages are
still marked as maybe-dma-pinned, although those pages are never
returned to the caller as such. There is already code that is subject to
this in lockless_pages_from_mm(), for the case of a failed seqlock read.
I'm thinking it's probably OK, because the pages are not going to be
held long-term. They will be unpinned before returning from
lockless_pages_from_mm().
The counter-argument is that this is merely making the race window
smaller, which is usually something that I argue against because it just
leads to harder-to-find bugs...

To be specific, here's what I mean:

diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d03..8ba871a927668 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2657,6 +2657,7 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
 {
 	unsigned long flags;
 	int nr_pinned = 0;
+	int i;
 	unsigned seq;
 
 	if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
@@ -2693,7 +2694,23 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
 			unpin_user_pages(pages, nr_pinned);
 			return 0;
 		}
+
+		/*
+		 * Avoiding locked pages, in this fast/lockless context, will
+		 * avoid interfering with shrink_page_list(), in particular.
+		 * Give up upon finding the first locked page, but keep the
+		 * earlier pages, so that slow gup does not have to re-pin them.
+		 */
+		for (i = 0; i < nr_pinned; i++) {
+			if (PageLocked(pages[i])) {
+				unpin_user_pages(&pages[i], nr_pinned - i);
+				nr_pinned = i;
+				break;
+			}
+		}
 	}
+
 	return nr_pinned;
 }

thanks,
-- 
John Hubbard
NVIDIA
* Re: Possible race with page_maybe_dma_pinned?

From: Jan Kara @ 2021-09-30 11:11 UTC (permalink / raw)
To: John Hubbard
Cc: Linus Torvalds, Peter Xu, Linux MM Mailing List, Jason Gunthorpe,
    Jan Kara, Andrew Morton, Andrea Arcangeli

On Wed 29-09-21 18:57:33, John Hubbard wrote:
> On 9/29/21 15:47, Linus Torvalds wrote:
> > On Wed, Sep 29, 2021 at 12:57 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Now we have 3 callers of page_maybe_dma_pinned():
> > >
> > > 1. page_needs_cow_for_dma
> > > 2. pte_is_pinned
> > > 3. shrink_page_list
> > >
> > > The 1st one is good as it takes the seqlock for write properly. The
> > > 2nd and 3rd are missing it; we may need to add it there too.
> >
> > Well, the pte_is_pinned() case at least could do the seqlock in
> > clear_soft_dirty() - it has the vma and mm available.
>
> That part seems straightforward, OK.
>
> > The page shrinker has always been problematic since it doesn't have
> > the vma (and by "always" I mean "modern times" - long ago we used to
> > scan virtually, in the days before rmap).
> >
> > One option might be for fast-gup to give up on locked pages. I think
> > the page lock is the only thing that shrink_page_list() serializes
> > with.
>
> In order to avoid locked pages in gup fast, it is easiest to do a
> check for locked pages *after* fast-pinning them, and unpin them before
> returning to the caller. This makes the change much smaller.
>
> However, doing so leaves a window of time during which the pages are
> still marked as maybe-dma-pinned, although those pages are never
> returned to the caller as such. There is already code that is subject to
> this in lockless_pages_from_mm(), for the case of a failed seqlock read.
> I'm thinking it's probably OK, because the pages are not going to be
> held long-term. They will be unpinned before returning from
> lockless_pages_from_mm().
>
> The counter-argument is that this is merely making the race window
> smaller, which is usually something that I argue against because it just
> leads to harder-to-find bugs...

Yeah, what you propose actually does not guarantee that reclaim and page
pinning cannot race in a way that the page gets unmapped while gup_fast()
returns it pinned - which is what the code in shrink_page_list() tries to
prevent, AFAIU.

The only place where I can see us doing a sanely race-free check in the
shrink_page_list() path is inside try_to_unmap() - we could implement
unmap_if_not_pinned semantics, and inside the rmap walk we can bump the
seqcount (it even conceptually makes some sense) to make sure the
page_maybe_dma_pinned() check we do during the rmap walk is not racing
with pup_fast(). By the time we leave the seqcount critical section the
page will be unmapped, so we can be sure there will be no new pins of
the page.

								Honza

> To be specific, here's what I mean:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 886d6148d3d03..8ba871a927668 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2657,6 +2657,7 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
>  {
>  	unsigned long flags;
>  	int nr_pinned = 0;
> +	int i;
>  	unsigned seq;
>  
>  	if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
> @@ -2693,7 +2694,23 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
>  			unpin_user_pages(pages, nr_pinned);
>  			return 0;
>  		}
> +
> +		/*
> +		 * Avoiding locked pages, in this fast/lockless context, will
> +		 * avoid interfering with shrink_page_list(), in particular.
> +		 * Give up upon finding the first locked page, but keep the
> +		 * earlier pages, so that slow gup does not have to re-pin them.
> +		 */
> +		for (i = 0; i < nr_pinned; i++) {
> +			if (PageLocked(pages[i])) {
> +				unpin_user_pages(&pages[i], nr_pinned - i);
> +				nr_pinned = i;
> +				break;
> +			}
> +		}
>  	}
> +
>  	return nr_pinned;
>  }
>
> thanks,
> -- 
> John Hubbard
> NVIDIA

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Thread overview: 4+ messages (newest: 2021-09-30 11:11 UTC)
2021-09-29 19:57 Possible race with page_maybe_dma_pinned? Peter Xu
2021-09-29 22:47 ` Linus Torvalds
2021-09-30  1:57   ` John Hubbard
2021-09-30 11:11     ` Jan Kara