From: Wanpeng Li <liwanp@linux.vnet.ibm.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@surriel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Zlatko Calusic <zcalusic@bitsync.net>,
Minchan Kim <minchan@kernel.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch v2 3/3] mm: page_alloc: fair zone allocator policy
Date: Mon, 5 Aug 2013 18:34:56 +0800
Message-ID: <20130805103456.GB1039@hacker.(null)>
In-Reply-To: <1375457846-21521-4-git-send-email-hannes@cmpxchg.org>

On Fri, Aug 02, 2013 at 11:37:26AM -0400, Johannes Weiner wrote:
>Each zone that holds userspace pages of one workload must be aged at a
>speed proportional to the zone size. Otherwise, the time an
>individual page gets to stay in memory depends on the zone it happened
>to be allocated in. Asymmetry in the zone aging creates rather
>unpredictable aging behavior and results in the wrong pages being
>reclaimed, activated etc.
>
>But exactly this happens right now because of the way the page
>allocator and kswapd interact. The page allocator uses per-node lists
>of all zones in the system, ordered by preference, when allocating a
>new page. When the first iteration does not yield any results, kswapd
>is woken up and the allocator retries. Due to the way kswapd reclaims
>zones below the high watermark while a zone can be allocated from when
>it is above the low watermark, the allocator may keep kswapd running
>while kswapd reclaim ensures that the page allocator can keep
>allocating from the first zone in the zonelist for extended periods of
>time. Meanwhile the other zones rarely see new allocations and thus
>get aged much slower in comparison.
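
To make the interaction described above concrete, here is a toy model, not kernel code. It assumes a steady state where every zone starts at its low watermark, a background reclaimer that frees a fixed number of pages per step in any zone below its high watermark, and a first-fit allocator that takes one page from the first zone above its low watermark. All names (toy_zone, toy_kswapd_step, toy_alloc_first_fit, toy_run) and the reclaim rate are made up for illustration.

```c
#include <stddef.h>

struct toy_zone {
	long free;	/* free pages in the zone */
	long low;	/* allocation is allowed while free > low */
	long high;	/* background reclaim continues while free < high */
	long allocs;	/* pages handed out from this zone so far */
};

/* One background reclaim step: free up to `rate` pages in each zone below high. */
static void toy_kswapd_step(struct toy_zone *z, size_t n, long rate)
{
	for (size_t i = 0; i < n; i++) {
		if (z[i].free < z[i].high) {
			long room = z[i].high - z[i].free;

			z[i].free += room < rate ? room : rate;
		}
	}
}

/* First-fit allocation in zonelist order; returns the zone index or -1. */
static int toy_alloc_first_fit(struct toy_zone *z, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (z[i].free > z[i].low) {
			z[i].free--;
			z[i].allocs++;
			return (int)i;
		}
	}
	return -1;
}

/* Alternate one reclaim step and one allocation for `steps` rounds. */
static void toy_run(struct toy_zone *z, size_t n, long steps, long rate)
{
	for (long s = 0; s < steps; s++) {
		toy_kswapd_step(z, n, rate);
		toy_alloc_first_fit(z, n);
	}
}
```

With two identical zones and a reclaim rate at least as high as the allocation rate, every allocation in this model is served by the first zone. The second zone climbs to its high watermark and then sits idle, which is the asymmetry the commit message describes.
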
>
>The result is that the occasional page placed in lower zones gets
>relatively more time in memory, even gets promoted to the active list
>after its peers have long been evicted. Meanwhile, the bulk of the
>working set may be thrashing on the preferred zone even though there
>may be significant amounts of memory available in the lower zones.
>
>Even the most basic test -- repeatedly reading a file slightly bigger
>than memory -- shows how broken the zone aging is. In this scenario,
>no single page should be able to stay in memory long enough to get
>referenced twice and activated, but activation happens in spades:
>
> $ grep active_file /proc/zoneinfo
> nr_inactive_file 0
> nr_active_file 0
> nr_inactive_file 0
> nr_active_file 8
> nr_inactive_file 1582
> nr_active_file 11994
> $ cat data data data data >/dev/null
> $ grep active_file /proc/zoneinfo
> nr_inactive_file 0
> nr_active_file 70
> nr_inactive_file 258753
> nr_active_file 443214
> nr_inactive_file 149793
> nr_active_file 12021
>
>Fix this with a very simple round robin allocator. Each zone is
>allowed a batch of allocations that is proportional to the zone's
>size, after which it is treated as full. The batch counters are reset
>when all zones have been tried and the allocator enters the slowpath
>and kicks off kswapd reclaim. Allocation and reclaim are now fairly
>spread out to all available/allowable zones:
>
> $ grep active_file /proc/zoneinfo
> nr_inactive_file 0
> nr_active_file 0
> nr_inactive_file 174
> nr_active_file 4865
> nr_inactive_file 53
> nr_active_file 860
> $ cat data data data data >/dev/null
> $ grep active_file /proc/zoneinfo
> nr_inactive_file 0
> nr_active_file 0
> nr_inactive_file 666622
> nr_active_file 4988
> nr_inactive_file 190969
> nr_active_file 937
>

Why doesn't the round-robin allocator consume ZONE_DMA? The first zone
in the zoneinfo output above still shows zero file pages after the test
run.

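
For reference, this is how I read the batching scheme from the changelog, as a rough sketch rather than the patch itself. It assumes order-0 allocations only and replaces the slowpath with a plain reset of every batch. It deliberately leaves out the watermark and lowmem reserve checks, which in the real allocator can still exclude a zone no matter how much batch it has left. The names rr_zone, rr_reset and rr_alloc are hypothetical.

```c
#include <stddef.h>

struct rr_zone {
	long batch_size;	/* per-cycle quota, assumed proportional to zone size */
	long alloc_batch;	/* quota remaining in the current cycle */
	long allocated;		/* pages handed out from this zone so far */
};

/* Stand-in for the slowpath: refill every zone's batch. */
static void rr_reset(struct rr_zone *z, size_t n)
{
	for (size_t i = 0; i < n; i++)
		z[i].alloc_batch = z[i].batch_size;
}

/*
 * Take one page from the first zone with batch left. When every zone
 * is exhausted, reset the batches once and retry. Returns the zone
 * index, or -1 if no zone has any quota even after a reset.
 */
static int rr_alloc(struct rr_zone *z, size_t n)
{
	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < n; i++) {
			if (z[i].alloc_batch <= 0)
				continue;	/* treated as full for this cycle */
			z[i].alloc_batch--;
			z[i].allocated++;
			return (int)i;
		}
		rr_reset(z, n);
	}
	return -1;
}
```

With batches of 300 and 100 pages, 4000 calls to rr_alloc() end up as 3000 and 1000 pages respectively, so the split follows the batch ratio.
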
>When zone_reclaim_mode is enabled, allocations will now spread out to
>all zones on the local node, not just the first preferred zone (which
>on a 4G node might be a tiny Normal zone).
>
>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
>---
> include/linux/mmzone.h | 1 +
> mm/page_alloc.c | 69 ++++++++++++++++++++++++++++++++++++++++++--------
> 2 files changed, 60 insertions(+), 10 deletions(-)
>
>diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>index af4a3b7..dcad2ab 100644
>--- a/include/linux/mmzone.h
>+++ b/include/linux/mmzone.h
>@@ -352,6 +352,7 @@ struct zone {
> * free areas of different sizes
> */
> spinlock_t lock;
>+ int alloc_batch;
> int all_unreclaimable; /* All pages pinned */
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> /* Set to true when the PG_migrate_skip bits should be cleared */
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 3b27d3e..b2cdfd0 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -1817,6 +1817,11 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
> }
>
>+static bool zone_local(struct zone *local_zone, struct zone *zone)
>+{
>+ return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
>+}
>+
> static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> {
> return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
>@@ -1854,6 +1859,11 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> {
> }
>
>+static bool zone_local(struct zone *local_zone, struct zone *zone)
>+{
>+ return true;
>+}
>+
> static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> {
> return true;
>@@ -1901,6 +1911,26 @@ zonelist_scan:
> if (alloc_flags & ALLOC_NO_WATERMARKS)
> goto try_this_zone;
> /*
>+ * Distribute pages in proportion to the individual
>+ * zone size to ensure fair page aging. The zone a
>+ * page was allocated in should have no effect on the
>+ * time the page has in memory before being reclaimed.
>+ *
>+ * When zone_reclaim_mode is enabled, try to stay in
>+ * local zones in the fastpath. If that fails, the
>+ * slowpath is entered, which will do another pass
>+ * starting with the local zones, but ultimately fall
>+ * back to remote zones that do not partake in the
>+ * fairness round-robin cycle of this zonelist.
>+ */
>+ if (alloc_flags & ALLOC_WMARK_LOW) {
>+ if (zone->alloc_batch <= 0)
>+ continue;
>+ if (zone_reclaim_mode &&
>+ !zone_local(preferred_zone, zone))
>+ continue;
>+ }
>+ /*
> * When allocating a page cache page for writing, we
> * want to get it from a zone that is within its dirty
> * limit, such that no single zone holds more than its
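
The fastpath check added in this hunk can be restated as a single predicate, which may make the cases easier to review. This is only a sketch: toy_skip_in_fastpath and its boolean parameters are hypothetical stand-ins for the ALLOC_WMARK_LOW test, zone->alloc_batch, zone_reclaim_mode and zone_local() in the hunk.

```c
#include <stdbool.h>

/*
 * Returns true when the fairness logic would skip the zone.
 * wmark_low_pass stands in for (alloc_flags & ALLOC_WMARK_LOW).
 */
static bool toy_skip_in_fastpath(bool wmark_low_pass, long alloc_batch,
				 bool reclaim_mode, bool zone_is_local)
{
	if (!wmark_low_pass)
		return false;	/* fairness only applies to the low-watermark pass */
	if (alloc_batch <= 0)
		return true;	/* batch used up, zone treated as full */
	if (reclaim_mode && !zone_is_local)
		return true;	/* stay on local zones when zone reclaim is on */
	return false;
}
```

Note that a pass without ALLOC_WMARK_LOW never skips a zone here, whatever the batch and locality inputs say.
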
>@@ -2006,7 +2036,8 @@ this_zone_full:
> goto zonelist_scan;
> }
>
>- if (page)
>+ if (page) {
>+ zone->alloc_batch -= 1U << order;
> /*
> * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> * necessary to allocate the page. The expectation is
>@@ -2015,6 +2046,7 @@ this_zone_full:
> * for !PFMEMALLOC purposes.
> */
> page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
>+ }
>
> return page;
> }
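
A small observation on the charge above: a higher-order allocation can take more pages than the batch has left, so alloc_batch can go negative, which is presumably why the fastpath tests for <= 0 rather than == 0. The sketch below shows the arithmetic with an explicit cast to int. The helper name toy_charge_batch is hypothetical, and it assumes int is at least 32 bits wide.

```c
#include <assert.h>

/* Charge an allocation of 2^order pages against the remaining batch. */
static int toy_charge_batch(int alloc_batch, unsigned int order)
{
	assert(order < 31);	/* keep 1U << order representable as a positive int */
	return alloc_batch - (int)(1U << order);
}
```

For example, charging an order-2 allocation against a batch of 3 leaves -1.
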
>@@ -2346,16 +2378,28 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> return page;
> }
>
>-static inline
>-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
>- enum zone_type high_zoneidx,
>- enum zone_type classzone_idx)
>+static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
>+ struct zonelist *zonelist,
>+ enum zone_type high_zoneidx,
>+ struct zone *preferred_zone)
> {
> struct zoneref *z;
> struct zone *zone;
>
>- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
>- wakeup_kswapd(zone, order, classzone_idx);
>+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>+ if (!(gfp_mask & __GFP_NO_KSWAPD))
>+ wakeup_kswapd(zone, order, zone_idx(preferred_zone));
>+ /*
>+ * Only reset the batches of zones that were actually
>+ * considered in the fast path, we don't want to
>+ * thrash fairness information for zones that are not
>+ * actually part of this zonelist's round-robin cycle.
>+ */
>+ if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
>+ continue;
>+ zone->alloc_batch = high_wmark_pages(zone) -
>+ low_wmark_pages(zone);
>+ }
> }
>
> static inline int
>@@ -2451,9 +2495,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> goto nopage;
>
> restart:
>- if (!(gfp_mask & __GFP_NO_KSWAPD))
>- wake_all_kswapd(order, zonelist, high_zoneidx,
>- zone_idx(preferred_zone));
>+ prepare_slowpath(gfp_mask, order, zonelist,
>+ high_zoneidx, preferred_zone);
>
> /*
> * OK, we're below the kswapd watermark and have kicked background
>@@ -4754,6 +4797,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> zone_seqlock_init(zone);
> zone->zone_pgdat = pgdat;
>
>+ /* For bootup, initialized properly in watermark setup */
>+ zone->alloc_batch = zone->managed_pages;
>+
> zone_pcp_init(zone);
> lruvec_init(&zone->lruvec);
> if (!size)
>@@ -5525,6 +5571,9 @@ static void __setup_per_zone_wmarks(void)
> zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + (tmp >> 2);
> zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
>
>+ zone->alloc_batch = high_wmark_pages(zone) -
>+ low_wmark_pages(zone);
>+
> setup_zone_migrate_reserve(zone);
> spin_unlock_irqrestore(&zone->lock, flags);
> }
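
Working through the arithmetic in this hunk: since low is min + (tmp >> 2) and high is min + (tmp >> 1), the min term cancels and the batch comes out as (tmp >> 1) - (tmp >> 2). A minimal sketch follows, treating tmp as an opaque input because its computation is outside the quoted context; toy_alloc_batch is a hypothetical name.

```c
/* Batch size as derived from the watermark formulas in the hunk above. */
static unsigned long toy_alloc_batch(unsigned long min, unsigned long tmp)
{
	unsigned long low = min + (tmp >> 2);
	unsigned long high = min + (tmp >> 1);

	return high - low;	/* min cancels: (tmp >> 1) - (tmp >> 2) */
}
```

So with min = 1000 and tmp = 4000 the batch is 1000 pages, and it stays 1000 for any other min. If tmp scales with the zone size, as the changelog implies, the batch does too.
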
>--
>1.8.3.2
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org. For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>