linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: Memory hotplug locking issue: Useless (?) zone span seqlock
Date: Thu, 8 May 2025 09:11:50 -0400	[thread overview]
Message-ID: <8a4a5d6e-d2cf-420d-91f3-2797618e7255@efficios.com> (raw)
In-Reply-To: <a5cde237-0dcf-4e85-b763-7a38e9f9c563@redhat.com>

On 2025-05-08 06:45, David Hildenbrand wrote:
> On 07.03.25 21:22, Mathieu Desnoyers wrote:
>> I'm currently perfecting my understanding of the mm code and reviewing
>> pieces of it as I go, and stumbled on this:
>>
>> commit 27cacaad16c5 ("mm,memory_hotplug: drop unneeded locking")
>>
>> This commit removes all users of zone_span_writelock(), thus making
>> the inline useless, but leaves the now useless
>> zone_span_seqbegin()/zone_span_seqretry() in place within
>> page_outside_zone_boundaries().
>>
>> So I'm confused. What's going on ?
>>
>> And if this commit got things very wrong when removing the
>> seqlock, I wonder if there are cases where its partial
>> pgdat_resize_lock() removal can be an issue as well.
> 
> I stumbled over that myself recently as well. I think I mentioned in the 
> past that we should just store
> 
> start_pfn + end_pfn
> 
> instead of
> 
> start_pfn + nr_pages
> 
> 
> Then, concurrent resizing could happen (and we could atomically read 
> start_pfn / end_pfn).
> 
> Right now, when adjusting start_pfn, we always also have to adjust 
> nr_pages. A concurrent reader calculating end_pfn manually could see 
> some crappy result.
> 
> Having that said, I am not aware of issues in that area, but it all 
> looks like only a partial cleanup to me.

I wonder if all callers to zone_spans_pfn() prevent concurrent modification
of the zone start_pfn and nr_pages ?

For instance set_pfnblock_flags_mask() (called by set_pageblock_migratetype)
does:

        VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

If we look at zone_spans_pfn():

static inline unsigned long zone_end_pfn(const struct zone *zone)
{
         return zone->zone_start_pfn + zone->spanned_pages;
}

static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
         return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}

A concurrent updater which shrinks a zone by moving its start would have
to increment zone_start_pfn *and* decrement spanned_pages. If this happens
concurrently with the loads from zone_spans_pfn(), then it can observe
an intermediate state (only nr pages reduced or only zone start moved).
The first scenario at least can lead to false positive VM_BUG_ON().

Likewise if the updater expands the zone by moving its start left. If
the observer loads an updated start pfn without observing the nr pages
update, it can lead to false positive VM_BUG_ON().

I notice that zone_intersects() also uses zone_end_pfn(). It is used for
instance by default_kernel_zone_for_pfn() without locks. In this case,
reading both nr pages and start pfn concurrently with update could cause
a false-positive match, for instance if the start of the range is moved
but the nr pages prior value is loaded (concurrent shrink). This could
match a zone outside of the function parameter range.

Another reader of those fields is update_pgdat_span(), which appears to
be called only from remove_pfn_range_from_zone with mem_hotplug_lock
held in write mode. So that one should be fine.

AFAIU, updates to nr pages and zone start pfn are done by:

- remove_pfn_range_from_zone (AFAIU always called with mem_hotplug_lock
   in write mode),
- shrink_zone_span (called from remove_pfn_range_from_zone),
- resize_zone_range (__meminit function), called from move_pfn_range_to_zone()
   also called with mem_hotplug_lock in write mode.

So I think your idea of keeping track of both zone_start_pfn and zone_end_pfn
would solve this specific issue, however we'd have to carefully consider what
happens to users of spanned_pages (e.g. zone_is_empty()) callers, because it
would then require to calculate the spanned_pages from end - start, which
then can have similar issues wrt atomicity against concurrent updates.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


  reply	other threads:[~2025-05-08 13:11 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-07 20:22 Mathieu Desnoyers
2025-05-08 10:45 ` David Hildenbrand
2025-05-08 13:11   ` Mathieu Desnoyers [this message]
2025-05-09  9:38     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8a4a5d6e-d2cf-420d-91f3-2797618e7255@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=david@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=osalvador@suse.de \
    --cc=pasha.tatashin@soleen.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox