From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
"Yin, Fengwei" <fengwei.yin@intel.com>,
Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
Date: Wed, 22 Mar 2023 12:03:31 +0000 [thread overview]
Message-ID: <a2861b0a-be8f-0911-e551-af9d3b549f6d@arm.com> (raw)
In-Reply-To: <20230317105802.2634004-1-ryan.roberts@arm.com>

Hi Matthew,

On 17/03/2023 10:57, Ryan Roberts wrote:
> Hi All,
>
> [...]
>
> Bug(s)
> ======
>
> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these relate
> to invalid kernel addresses (which usually look like either NULL + a small
> offset, or mostly zeros with a few mid-order bits set + a small offset), or to
> lockdep complaining about a bad unlock balance. Call stacks are often in
> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
> email example oopses out separately if anyone wants to review them.) My hunch
> is that struct pages adjacent to the folio are being corrupted, but I don't
> have hard evidence.
>
> When adding the workaround patch, which prevents madvise_free_pte_range() from
> attempting to split a large folio, I never see any issues. However, I'm not
> putting the system under memory pressure, so I guess the same types of problem
> might still crop up under swap, etc.
>
> I've reviewed most of the code within split_folio() and can't find any smoking
> gun, but I wonder if there are implicit assumptions about the large folio being
> PMD-sized that I'm now breaking?
>
> The code in madvise_free_pte_range():
>
> 	if (folio_test_large(folio)) {
> 		if (folio_mapcount(folio) != 1)
> 			goto out;
> 		folio_get(folio);
> 		if (!folio_trylock(folio)) {
> 			folio_put(folio);
> 			goto out;
> 		}
> 		pte_unmap_unlock(orig_pte, ptl);
> 		if (split_folio(folio)) {
> 			folio_unlock(folio);
> 			folio_put(folio);
> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 			goto out;
> 		}
> 		...
> 	}

I've noticed that it's split_folio() on a folio of order 1 that causes my
problems. I also see that the page cache code explicitly never allocates
order-1 folios:

void page_cache_ra_order(struct readahead_control *ractl,
		struct file_ra_state *ra, unsigned int new_order)
{
	...
	while (index <= limit) {
		unsigned int order = new_order;

		/* Align with smaller pages if needed */
		if (index & ((1UL << order) - 1)) {
			order = __ffs(index);
			if (order == 1)
				order = 0;
		}
		/* Don't allocate pages past EOF */
		while (index + (1UL << order) - 1 > limit) {
			if (--order == 1)
				order = 0;
		}
		err = ra_alloc_folio(ractl, index, mark, order, gfp);
		if (err)
			break;
		index += 1UL << order;
	}
	...
}
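
If I'm reading that right: with, say, new_order = 4 and a misaligned index such
as 0x6, __ffs() returns 1, which the first check then forces down to 0; and the
EOF loop likewise steps straight from order 2 to order 0. So neither path ever
allocates an order-1 folio.
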
Matthew, what is the reason for this? I suspect it's guarding against the same
problem I'm seeing.

If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
any oops/panic/etc. I'd just like to understand the root cause.
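
For reference, the guard I've been testing is roughly along the lines of the
sketch below. This is illustrative only: the series actually picks the order
elsewhere, alloc_anon_folio_no_order1() is a made-up name, and vma_alloc_folio()
just stands in for whichever allocation helper ends up being used; the only
point is that order 1 is skipped when walking the order down.

/*
 * Sketch: fall back through progressively smaller orders on allocation
 * failure, but never allocate an order-1 anon folio.
 */
static struct folio *alloc_anon_folio_no_order1(struct vm_area_struct *vma,
						unsigned long addr,
						unsigned int order)
{
	struct folio *folio;

	while (1) {
		if (order == 1)		/* skip order-1 entirely */
			order = 0;
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma,
					addr, false);
		if (folio || order == 0)
			return folio;
		order--;
	}
}
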
Thanks,
Ryan
>
> This will normally skip my large folios because they have a mapcount > 1, due
> to incrementing the mapcount for each pte, unlike PMD-mapped pages. But on
> occasion it will see a mapcount of 1 and proceed. So I guess this is racing
> against reclaim or CoW in this case?
>
> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps
> my large anon folio is not using the folio lock in the same way as a THP would,
> and we are therefore not getting the expected serialization?
>
> I'd really appreciate any suggestions for how to progress here!
>