linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org, Kaiyang Zhao <kaiyang2@cs.cmu.edu>,
	Mel Gorman <mgorman@techsingularity.net>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [RFC PATCH 00/26] mm: reliable huge page allocator
Date: Wed, 19 Apr 2023 02:54:02 +0300	[thread overview]
Message-ID: <20230418235402.lq7mxrrre2kl6vsf@box.shutemov.name> (raw)
In-Reply-To: <20230418191313.268131-1-hannes@cmpxchg.org>

On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> As memory capacity continues to grow, 4k TLB coverage has not been
> able to keep up. On Meta's 64G webservers, close to 20% of execution
> cycles are observed to be handling TLB misses when using 4k pages
> only. Huge pages are shifting from being a nice-to-have optimization
> for HPC workloads to becoming a necessity for common applications.
> 
> However, while trying to deploy THP more universally, we observe a
> fragmentation problem in the page allocator that often prevents larger
> requests from being met quickly, or met at all, at runtime. Since we
> have to provision hardware capacity for worst case performance,
> unreliable huge page coverage isn't of much help.
> 
> Drilling into the allocator, we find that existing defrag efforts,
> such as mobility grouping and watermark boosting, help, but are
> insufficient by themselves. We still observe a high number of blocks
> being routinely shared by allocations of different migratetypes. This
> in turn results in inefficient or ineffective reclaim/compaction runs.
> 
> In a broad sample of Meta servers, we find that unmovable allocations
> make up less than 7% of total memory on average, yet occupy 34% of the
> 2M blocks in the system. We also found that this effect isn't
> correlated with high uptimes, and that servers can get heavily
> fragmented within the first hour of running a workload.
> 
> The following experiment shows that only 20min of build load under
> moderate memory pressure already results in a significant number of
> typemixed blocks (block analysis run after system is back to idle):
> 
> vanilla:
> unmovable 50
> movable 701
> reclaimable 149
> unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
> movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
> reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)
> 
> patched:
> unmovable 65
> movable 457
> reclaimable 159
> free 219
> unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
> movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
> reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)
> 
> [ The remaining "mixed blocks" in the patched kernel are false
>   positives: LRU pages without migrate callbacks (empty_aops), and
>   i915 shmem that is pinned until reclaimed through shrinkers. ]
> 
> Root causes
> 
> One of the behaviors that sabotage the page allocator's mobility
> grouping is the fact that requests of one migratetype are allowed to
> fall back into blocks of another type before reclaim and compaction
> occur. This is a design decision to prioritize memory utilization over
> block fragmentation - especially considering the history of lumpy
> reclaim and its tendency to overreclaim. However, with compaction
> available, these two goals are no longer in conflict: the scratch
> space of free pages for compaction to work is only twice the size of
> the allocation request; in most cases, only small amounts of
> proactive, coordinated reclaim and compaction is required to prevent a
> fallback which may fragment a pageblock indefinitely.
> 
> Another problem lies in how the page allocator drives reclaim and
> compaction when it does invoke it. While the page allocator targets
> migratetype grouping at the pageblock level, it calls reclaim and
> compaction with the order of the allocation request. As most requests
> are smaller than a pageblock, this results in partial block freeing
> and subsequent fallbacks and type mixing.
> 
> Note that in combination, these two design decisions have a
> self-reinforcing effect on fragmentation: 1. Partially used unmovable
> blocks are filled up with fallback movable pages. 2. A subsequent
> unmovable allocation, instead of grouping up, will then need to enter
> reclaim, which most likely results in a partially freed movable block
> that it falls back into. Over time, unmovable allocations are sparsely
> scattered throughout the address space and poison many pageblocks.
> 
> Note that block fragmentation is driven by lower-order requests. It is
> not reliably mitigated by the mere presence of higher-order requests.
> 
> Proposal
> 
> This series proposes to make THP allocations reliable by enforcing
> pageblock hygiene, and aligning the allocator, reclaim and compaction
> on the pageblock as the base unit for managing free memory. All orders
> up to and including the pageblock are made first-class requests that
> (outside of OOM situations) are expected to succeed without
> exceptional investment by the allocating thread.
> 
> A neutral pageblock type is introduced, MIGRATE_FREE. The first
> allocation to be placed into such a block claims it exclusively for
> the allocation's migratetype. Fallbacks from a different type are no
> longer allowed, and the block is "kept open" for more allocations of
> the same type to ensure tight grouping. A pageblock becomes neutral
> again only once all its pages have been freed.

Sounds like this will cause earlier OOM, no?

I guess with 2M pageblock on 64G server it shouldn't matter much. But how
about smaller machines?

> Reclaim and compaction are changed from partial block reclaim to
> producing whole neutral page blocks.

How does it affect allocation latencies? I see direct compact stall grew
substantially. Hm?

> The watermark logic is adjusted
> to apply to neutral blocks, ensuring that background and direct
> reclaim always maintain a readily-available reserve of them.
> 
> The defragmentation effort changes from reactive to proactive. In
> turn, this makes defragmentation actually more efficient: compaction
> only has to scan movable blocks and can skip other blocks entirely;
> since movable blocks aren't poisoned by unmovable pages, the chances
> of successful compaction in each block are greatly improved as well.
> 
> Defragmentation becomes an ongoing responsibility of all allocations,
> rather than being the burden of only higher-order asks. This prevents
> sub-block allocations - which cause block fragmentation in the first
> place - from starving the increasingly important larger requests.
> 
> There is a slight increase in worst-case memory overhead by requiring
> the watermarks to be met against neutral blocks even when there might
> be free pages in typed blocks. However, the high watermarks are less
> than 1% of the zone, so the increase is relatively small.
> 
> These changes only apply to CONFIG_COMPACTION kernels. Without
> compaction, fallbacks and partial block reclaim remain the best
> trade-off between memory utilization and fragmentation.
> 
> Initial Test Results
> 
> The following is purely an allocation reliability test. Achieving full
> THP benefits in practice is tied to other pending changes, such as the
> THP shrinker to avoid memory pressure from excessive internal
> fragmentation, and tweaks to the kernel's THP allocation strategy.
> 
> The test is a kernel build under moderate-to-high memory pressure,
> with a concurrent process trying to repeatedly fault THPs (madvise):
> 
>                                               HUGEALLOC-VANILLA       HUGEALLOC-PATCHED
> Real time                                   265.04 (    +0.00%)     268.12 (    +1.16%)
> User time                                  1131.05 (    +0.00%)    1131.13 (    +0.01%)
> System time                                 474.66 (    +0.00%)     478.97 (    +0.91%)
> THP fault alloc                           17913.24 (    +0.00%)   19647.50 (    +9.68%)
> THP fault fallback                         1947.12 (    +0.00%)     223.40 (   -88.48%)
> THP fault fail rate %                         9.80 (    +0.00%)       1.12 (   -80.34%)
> Direct compact stall                        282.44 (    +0.00%)     543.90 (   +92.25%)
> Direct compact fail                         262.44 (    +0.00%)     239.90 (    -8.56%)
> Direct compact success                       20.00 (    +0.00%)     304.00 ( +1352.38%)
> Direct compact success rate %                 7.15 (    +0.00%)      57.10 (  +612.90%)
> Compact daemon scanned migrate            21643.80 (    +0.00%)  387479.80 ( +1690.18%)
> Compact daemon scanned free              188462.36 (    +0.00%) 2842824.10 ( +1408.42%)
> Compact direct scanned migrate          1601294.84 (    +0.00%)  275670.70 (   -82.78%)
> Compact direct scanned free             4476155.60 (    +0.00%) 2438835.00 (   -45.51%)
> Compact migrate scanned daemon %              1.32 (    +0.00%)      59.18 ( +2499.00%)
> Compact free scanned daemon %                 3.95 (    +0.00%)      54.31 ( +1018.20%)
> Alloc stall                                2425.00 (    +0.00%)     992.00 (   -59.07%)
> Pages kswapd scanned                     586756.68 (    +0.00%)  975390.20 (   +66.23%)
> Pages kswapd reclaimed                   385468.20 (    +0.00%)  437767.50 (   +13.57%)
> Pages direct scanned                     335199.56 (    +0.00%)  501824.20 (   +49.71%)
> Pages direct reclaimed                   127953.72 (    +0.00%)  151880.70 (   +18.70%)
> Pages scanned kswapd %                       64.43 (    +0.00%)      66.39 (    +2.99%)
> Swap out                                  14083.88 (    +0.00%)   45034.60 (  +219.74%)
> Swap in                                    3395.08 (    +0.00%)    7767.50 (  +128.75%)
> File refaults                             93546.68 (    +0.00%)  129648.30 (   +38.59%)
> 
> The THP fault success rate is drastically improved. A bigger share of
> the work is done by the background threads, as they now proactively
> maintain MIGRATE_FREE block reserves. The increase in memory pressure
> is shown by the uptick in swap activity.
> 
> Status
> 
> Initial test results look promising, but production testing has been
> lagging behind the effort to generalize this code for upstream, and
> putting all the pieces together to make THP work. I'll follow up as I
> gather more data.
> 
> Sending this out now as an RFC to get input on the overall direction.
> 
> The patches are based on v6.2.
> 
>  Documentation/admin-guide/sysctl/vm.rst |  21 -
>  block/bdev.c                            |   2 +-
>  include/linux/compaction.h              | 100 +---
>  include/linux/gfp.h                     |   2 -
>  include/linux/mm.h                      |   1 -
>  include/linux/mmzone.h                  |  30 +-
>  include/linux/page-isolation.h          |  28 +-
>  include/linux/pageblock-flags.h         |   4 +-
>  include/linux/vmstat.h                  |   8 -
>  include/trace/events/mmflags.h          |   4 +-
>  kernel/sysctl.c                         |   8 -
>  mm/compaction.c                         | 242 +++-----
>  mm/internal.h                           |  14 +-
>  mm/memory_hotplug.c                     |   4 +-
>  mm/page_alloc.c                         | 930 +++++++++++++-----------------
>  mm/page_isolation.c                     |  42 +-
>  mm/vmscan.c                             | 251 ++------
>  mm/vmstat.c                             |   6 +-
>  18 files changed, 629 insertions(+), 1068 deletions(-)
> 
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


  parent reply	other threads:[~2023-04-18 23:54 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-18 19:12 Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 01/26] block: bdev: blockdev page cache is movable Johannes Weiner
2023-04-19  4:07   ` Matthew Wilcox
2023-04-21 12:25   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 02/26] mm: compaction: avoid GFP_NOFS deadlocks Johannes Weiner
2023-04-21 12:27   ` Mel Gorman
2023-04-21 14:17     ` Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 03/26] mm: make pageblock_order 2M per default Johannes Weiner
2023-04-19  0:01   ` Kirill A. Shutemov
2023-04-19  2:55     ` Johannes Weiner
2023-04-19  3:44       ` Johannes Weiner
2023-04-19 11:10     ` David Hildenbrand
2023-04-19 10:36   ` Vlastimil Babka
2023-04-19 11:09     ` David Hildenbrand
2023-04-21 12:37   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 04/26] mm: page_isolation: write proper kerneldoc Johannes Weiner
2023-04-21 12:39   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 05/26] mm: page_alloc: per-migratetype pcplist for THPs Johannes Weiner
2023-04-21 12:47   ` Mel Gorman
2023-04-21 15:06     ` Johannes Weiner
2023-04-28 10:29       ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 06/26] mm: page_alloc: consolidate free page accounting Johannes Weiner
2023-04-21 12:54   ` Mel Gorman
2023-04-21 15:08     ` Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 07/26] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
2023-04-21 12:59   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 08/26] mm: page_alloc: claim blocks during compaction capturing Johannes Weiner
2023-04-21 13:12   ` Mel Gorman
2023-04-25 14:39     ` Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 09/26] mm: page_alloc: move expand() above compaction_capture() Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 10/26] mm: page_alloc: allow compaction capturing from larger blocks Johannes Weiner
2023-04-21 14:14   ` Mel Gorman
2023-04-25 15:40     ` Johannes Weiner
2023-04-28 10:41       ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 11/26] mm: page_alloc: introduce MIGRATE_FREE Johannes Weiner
2023-04-21 14:25   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 12/26] mm: page_alloc: per-migratetype free counts Johannes Weiner
2023-04-21 14:28   ` Mel Gorman
2023-04-21 15:35     ` Johannes Weiner
2023-04-21 16:03       ` Mel Gorman
2023-04-21 16:32         ` Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 13/26] mm: compaction: remove compaction result helpers Johannes Weiner
2023-04-21 14:32   ` Mel Gorman
2023-04-18 19:13 ` [RFC PATCH 14/26] mm: compaction: simplify should_compact_retry() Johannes Weiner
2023-04-21 14:36   ` Mel Gorman
2023-04-25  2:15     ` Johannes Weiner
2023-04-25  0:56   ` Huang, Ying
2023-04-25  2:11     ` Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 15/26] mm: compaction: simplify free block check in suitable_migration_target() Johannes Weiner
2023-04-21 14:39   ` Mel Gorman
2023-04-18 19:13 ` [RFC PATCH 16/26] mm: compaction: improve compaction_suitable() accuracy Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 17/26] mm: compaction: refactor __compaction_suitable() Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 18/26] mm: compaction: remove unnecessary is_via_compact_memory() checks Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 19/26] mm: compaction: drop redundant watermark check in compaction_zonelist_suitable() Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 20/26] mm: vmscan: use compaction_suitable() check in kswapd Johannes Weiner
2023-04-25  3:12   ` Huang, Ying
2023-04-25 14:26     ` Johannes Weiner
2023-04-26  1:30       ` Huang, Ying
2023-04-26 15:22         ` Johannes Weiner
2023-04-27  5:41           ` Huang, Ying
2023-04-18 19:13 ` [RFC PATCH 21/26] mm: compaction: align compaction goals with reclaim goals Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 22/26] mm: page_alloc: manage free memory in whole pageblocks Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 23/26] mm: page_alloc: kill highatomic Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 24/26] mm: page_alloc: kill watermark boosting Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 25/26] mm: page_alloc: disallow fallbacks when 2M defrag is enabled Johannes Weiner
2023-04-21 14:56   ` Mel Gorman
2023-04-21 15:24     ` Johannes Weiner
2023-04-21 15:55       ` Mel Gorman
2023-04-18 19:13 ` [RFC PATCH 26/26] mm: page_alloc: add sanity checks for migratetypes Johannes Weiner
2023-04-18 23:54 ` Kirill A. Shutemov [this message]
2023-04-19  2:08   ` [RFC PATCH 00/26] mm: reliable huge page allocator Johannes Weiner
2023-04-19 10:56     ` Vlastimil Babka
2023-04-19  4:11 ` Matthew Wilcox
2023-04-21 16:11   ` Mel Gorman
2023-04-21 17:14     ` Matthew Wilcox
2023-05-02 15:21       ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230418235402.lq7mxrrre2kl6vsf@box.shutemov.name \
    --to=kirill@shutemov.name \
    --cc=hannes@cmpxchg.org \
    --cc=kaiyang2@cs.cmu.edu \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=rientjes@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox