* [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
@ 2014-01-11 12:36 Vladimir Davydov
2014-01-11 12:36 ` [PATCH 2/5] mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask Vladimir Davydov
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-11 12:36 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
When reclaiming kmem, we currently don't scan slabs that have less than
batch_size objects (see shrink_slab_node()):
	while (total_scan >= batch_size) {
		shrinkctl->nr_to_scan = batch_size;
		shrinker->scan_objects(shrinker, shrinkctl);
		total_scan -= batch_size;
	}
If there are only a few shrinkers available, such behavior won't cause
any problems, because the batch_size is usually small, but if we have a
lot of slab shrinkers, which is perfectly possible since FS shrinkers
are now per-superblock, we can end up with hundreds of megabytes of
practically unreclaimable kmem objects. For instance, mounting a
thousand ext2 FS images with a hundred files in each and iterating
over all the files using du(1) will result in about 200 MB of FS caches
that cannot be dropped even with the aid of the vm.drop_caches sysctl!
This problem was initially pointed out by Glauber Costa [*]. Glauber
proposed to fix it by making shrink_slab() always take at least one
pass - to put it simply, by turning the scan loop above into a do {} while ()
loop. However, this proposal was rejected, because it could result in
more aggressive and frequent slab shrinking even under low memory
pressure when total_scan is naturally very small.
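For reference, the rejected variant boils down to something like this (a
rough sketch of the idea, not Glauber's actual patch):

	/* always take at least one pass, even if total_scan < batch_size */
	do {
		shrinkctl->nr_to_scan = min(total_scan, batch_size);
		shrinker->scan_objects(shrinker, shrinkctl);
		total_scan -= shrinkctl->nr_to_scan;
	} while (total_scan > 0);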
This patch is a slightly modified version of Glauber's approach.
Similarly to Glauber's patch, it allows shrink_slab() to scan fewer than
batch_size objects, but only if the total number of objects we want to
scan (total_scan) is greater than the total number of objects available
(max_pass). Since total_scan is capped at half of max_pass when the
current delta is small:
	if (delta < max_pass / 4)
		total_scan = min(total_scan, max_pass / 2);
this is only possible if we are scanning at high prio. That said, this
patch shouldn't change the vmscan behaviour if the memory pressure is
low, but if we are tight on memory, we will do our best by trying to
reclaim all available objects, which sounds reasonable.
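To illustrate with assumed numbers: take a cache with max_pass = 20 and
batch_size = 32. Under light pressure the bias above caps total_scan at
max_pass / 2 = 10, neither loop condition can hold, and the shrinker is
skipped exactly as before. Only when the delta gets large enough for
total_scan to reach 20 does the new total_scan >= max_pass condition fire,
allowing the whole cache to be scanned.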
[*] http://www.spinics.net/lists/cgroups/msg06913.html
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
---
mm/vmscan.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d..0b3fc0a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -281,17 +281,34 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
nr_pages_scanned, lru_pages,
max_pass, delta, total_scan);
- while (total_scan >= batch_size) {
+ /*
+ * Normally, we should not scan less than batch_size objects in one
+ * pass to avoid too frequent shrinker calls, but if the slab has less
+ * than batch_size objects in total and we are really tight on memory,
+ * we will try to reclaim all available objects, otherwise we can end
+ * up failing allocations although there are plenty of reclaimable
+ * objects spread over several slabs with usage less than the
+ * batch_size.
+ *
+ * We detect the "tight on memory" situations by looking at the total
+ * number of objects we want to scan (total_scan). If it is greater
+ * than the total number of objects on slab (max_pass), we must be
+ * scanning at high prio and therefore should try to reclaim as much as
+ * possible.
+ */
+ while (total_scan >= batch_size ||
+ total_scan >= max_pass) {
unsigned long ret;
+ unsigned long nr_to_scan = min(batch_size, total_scan);
- shrinkctl->nr_to_scan = batch_size;
+ shrinkctl->nr_to_scan = nr_to_scan;
ret = shrinker->scan_objects(shrinker, shrinkctl);
if (ret == SHRINK_STOP)
break;
freed += ret;
- count_vm_events(SLABS_SCANNED, batch_size);
- total_scan -= batch_size;
+ count_vm_events(SLABS_SCANNED, nr_to_scan);
+ total_scan -= nr_to_scan;
cond_resched();
}
--
1.7.10.4
* [PATCH 2/5] mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask
2014-01-11 12:36 [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Vladimir Davydov
@ 2014-01-11 12:36 ` Vladimir Davydov
2014-01-11 12:36 ` [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim Vladimir Davydov
` (3 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-11 12:36 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Dave Chinner, Mel Gorman,
Michal Hocko, Johannes Weiner, Rik van Riel, Glauber Costa
If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
once with nid=0, but currently this is not the case: if node 0 is not set
in the nodemask or if it is not online, we will not call such shrinkers
at all. As a result, some slabs will be left untouched under some
circumstances. Let us fix it.
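For context, whether a shrinker is NUMA-aware is declared at registration
time via the SHRINKER_NUMA_AWARE flag; a minimal sketch (the names and
callbacks here are hypothetical, only the flag matters for this patch):

	static struct shrinker my_shrinker = {
		.count_objects	= my_count,	/* hypothetical callbacks */
		.scan_objects	= my_scan,
		.seeks		= DEFAULT_SEEKS,
		/* set for per-node shrinkers; NUMA-unaware shrinkers leave
		 * flags clear and must be called exactly once with nid=0 */
		.flags		= SHRINKER_NUMA_AWARE,
	};

	register_shrinker(&my_shrinker);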
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reported-by: Dave Chinner <dchinner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
---
mm/vmscan.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b3fc0a..adc3ecb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -369,16 +369,17 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
}
list_for_each_entry(shrinker, &shrinker_list, list) {
- for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
- if (!node_online(shrinkctl->nid))
- continue;
-
- if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
- (shrinkctl->nid != 0))
- break;
-
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
+ shrinkctl->nid = 0;
freed += shrink_slab_node(shrinkctl, shrinker,
- nr_pages_scanned, lru_pages);
+ nr_pages_scanned, lru_pages);
+ continue;
+ }
+
+ for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+ if (node_online(shrinkctl->nid))
+ freed += shrink_slab_node(shrinkctl, shrinker,
+ nr_pages_scanned, lru_pages);
}
}
--
1.7.10.4
* [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
2014-01-11 12:36 [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Vladimir Davydov
2014-01-11 12:36 ` [PATCH 2/5] mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask Vladimir Davydov
@ 2014-01-11 12:36 ` Vladimir Davydov
2014-01-13 23:11 ` Andrew Morton
2014-01-11 12:36 ` [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
` (2 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-11 12:36 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
When direct reclaim is executed by a process bound to a set of NUMA
nodes, we should scan only those nodes when possible, but currently we
will scan kmem from all online nodes even if the kmem shrinker is NUMA
aware. In other words, binding a process to a particular NUMA node won't
prevent it from shrinking inode/dentry caches from other nodes, which is
not good. Fix this.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index adc3ecb..ae6518d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2425,8 +2425,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
unsigned long lru_pages = 0;
nodes_clear(shrink->nodes_to_scan);
- for_each_zone_zonelist(zone, z, zonelist,
- gfp_zone(sc->gfp_mask)) {
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ gfp_zone(sc->gfp_mask), sc->nodemask) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
--
1.7.10.4
* [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones()
2014-01-11 12:36 [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Vladimir Davydov
2014-01-11 12:36 ` [PATCH 2/5] mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask Vladimir Davydov
2014-01-11 12:36 ` [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim Vladimir Davydov
@ 2014-01-11 12:36 ` Vladimir Davydov
2014-01-13 23:13 ` Andrew Morton
2014-01-11 12:36 ` [PATCH 5/5] mm: vmscan: remove shrink_control arg from do_try_to_free_pages() Vladimir Davydov
2014-01-13 23:05 ` [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Andrew Morton
4 siblings, 1 reply; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-11 12:36 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Glauber Costa, Mel Gorman,
Michal Hocko, Johannes Weiner, Rik van Riel, Dave Chinner
This reduces the indentation level of do_try_to_free_pages() and removes
the extra loop over all eligible zones that counts the number of on-LRU pages.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Glauber Costa <glommer@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
---
mm/vmscan.c | 56 +++++++++++++++++++++++++-------------------------------
1 file changed, 25 insertions(+), 31 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ae6518d..b364f3c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2291,13 +2291,16 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
* the caller that it should consider retrying the allocation instead of
* further reclaim.
*/
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
+ struct shrink_control *shrink)
{
struct zoneref *z;
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
+ unsigned long lru_pages = 0;
bool aborted_reclaim = false;
+ struct reclaim_state *reclaim_state = current->reclaim_state;
/*
* If the number of buffer_heads in the machine exceeds the maximum
@@ -2307,6 +2310,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (buffer_heads_over_limit)
sc->gfp_mask |= __GFP_HIGHMEM;
+ nodes_clear(shrink->nodes_to_scan);
+
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
if (!populated_zone(zone))
@@ -2318,6 +2323,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (global_reclaim(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
+
+ lru_pages += zone_reclaimable_pages(zone);
+ node_set(zone_to_nid(zone), shrink->nodes_to_scan);
+
if (sc->priority != DEF_PRIORITY &&
!zone_reclaimable(zone))
continue; /* Let kswapd poll it */
@@ -2354,6 +2363,20 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
shrink_zone(zone, sc);
}
+ /*
+ * Don't shrink slabs when reclaiming memory from over limit cgroups
+ * but do shrink slab at least once when aborting reclaim for
+ * compaction to avoid unevenly scanning file/anon LRU pages over slab
+ * pages.
+ */
+ if (global_reclaim(sc)) {
+ shrink_slab(shrink, sc->nr_scanned, lru_pages);
+ if (reclaim_state) {
+ sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+ reclaim_state->reclaimed_slab = 0;
+ }
+ }
+
return aborted_reclaim;
}
@@ -2398,9 +2421,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct shrink_control *shrink)
{
unsigned long total_scanned = 0;
- struct reclaim_state *reclaim_state = current->reclaim_state;
- struct zoneref *z;
- struct zone *zone;
unsigned long writeback_threshold;
bool aborted_reclaim;
@@ -2413,34 +2433,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
sc->priority);
sc->nr_scanned = 0;
- aborted_reclaim = shrink_zones(zonelist, sc);
-
- /*
- * Don't shrink slabs when reclaiming memory from over limit
- * cgroups but do shrink slab at least once when aborting
- * reclaim for compaction to avoid unevenly scanning file/anon
- * LRU pages over slab pages.
- */
- if (global_reclaim(sc)) {
- unsigned long lru_pages = 0;
-
- nodes_clear(shrink->nodes_to_scan);
- for_each_zone_zonelist_nodemask(zone, z, zonelist,
- gfp_zone(sc->gfp_mask), sc->nodemask) {
- if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
- continue;
-
- lru_pages += zone_reclaimable_pages(zone);
- node_set(zone_to_nid(zone),
- shrink->nodes_to_scan);
- }
+ aborted_reclaim = shrink_zones(zonelist, sc, shrink);
- shrink_slab(shrink, sc->nr_scanned, lru_pages);
- if (reclaim_state) {
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
- reclaim_state->reclaimed_slab = 0;
- }
- }
total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
goto out;
--
1.7.10.4
* [PATCH 5/5] mm: vmscan: remove shrink_control arg from do_try_to_free_pages()
2014-01-11 12:36 [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Vladimir Davydov
` (2 preceding siblings ...)
2014-01-11 12:36 ` [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
@ 2014-01-11 12:36 ` Vladimir Davydov
2014-01-13 23:05 ` [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Andrew Morton
4 siblings, 0 replies; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-11 12:36 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
There is no need to pass a shrink_control struct from
try_to_free_pages() and friends to do_try_to_free_pages() and then to
shrink_zones(), because it is only used in shrink_zones(), and the only
field initialized at the top level is gfp_mask, which is always equal to
scan_control.gfp_mask. So let's move the shrink_control initialization to
shrink_zones().
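After the change, the call chain can be sketched as follows (illustrative
only):

	try_to_free_pages()			/* and friends */
	    do_try_to_free_pages(zonelist, &sc)	/* no shrink_control argument */
		shrink_zones(zonelist, sc)	/* shrink_control is a local here,
						   .gfp_mask = sc->gfp_mask */
		    shrink_slab(&shrink, ...)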
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
---
mm/vmscan.c | 32 ++++++++++++--------------------
1 file changed, 12 insertions(+), 20 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b364f3c..b09e1cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2291,8 +2291,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
* the caller that it should consider retrying the allocation instead of
* further reclaim.
*/
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
- struct shrink_control *shrink)
+static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
@@ -2301,6 +2300,9 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
unsigned long lru_pages = 0;
bool aborted_reclaim = false;
struct reclaim_state *reclaim_state = current->reclaim_state;
+ struct shrink_control shrink = {
+ .gfp_mask = sc->gfp_mask,
+ };
/*
* If the number of buffer_heads in the machine exceeds the maximum
@@ -2310,7 +2312,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
if (buffer_heads_over_limit)
sc->gfp_mask |= __GFP_HIGHMEM;
- nodes_clear(shrink->nodes_to_scan);
+ nodes_clear(shrink.nodes_to_scan);
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2325,7 +2327,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
continue;
lru_pages += zone_reclaimable_pages(zone);
- node_set(zone_to_nid(zone), shrink->nodes_to_scan);
+ node_set(zone_to_nid(zone), shrink.nodes_to_scan);
if (sc->priority != DEF_PRIORITY &&
!zone_reclaimable(zone))
@@ -2370,7 +2372,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
* pages.
*/
if (global_reclaim(sc)) {
- shrink_slab(shrink, sc->nr_scanned, lru_pages);
+ shrink_slab(&shrink, sc->nr_scanned, lru_pages);
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -2417,8 +2419,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
* else, the number of pages reclaimed
*/
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
- struct scan_control *sc,
- struct shrink_control *shrink)
+ struct scan_control *sc)
{
unsigned long total_scanned = 0;
unsigned long writeback_threshold;
@@ -2433,7 +2434,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
sc->priority);
sc->nr_scanned = 0;
- aborted_reclaim = shrink_zones(zonelist, sc, shrink);
+ aborted_reclaim = shrink_zones(zonelist, sc);
total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2596,9 +2597,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.target_mem_cgroup = NULL,
.nodemask = nodemask,
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
/*
* Do not enter reclaim if fatal signal was delivered while throttled.
@@ -2612,7 +2610,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
sc.may_writepage,
gfp_mask);
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+ nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
@@ -2679,9 +2677,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
/*
* Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
@@ -2696,7 +2691,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
sc.may_writepage,
sc.gfp_mask);
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+ nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -3352,9 +3347,6 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.order = 0,
.priority = DEF_PRIORITY,
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
struct task_struct *p = current;
unsigned long nr_reclaimed;
@@ -3364,7 +3356,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+ nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
--
1.7.10.4
* Re: [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
2014-01-11 12:36 [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Vladimir Davydov
` (3 preceding siblings ...)
2014-01-11 12:36 ` [PATCH 5/5] mm: vmscan: remove shrink_control arg from do_try_to_free_pages() Vladimir Davydov
@ 2014-01-13 23:05 ` Andrew Morton
2014-01-14 7:23 ` Vladimir Davydov
4 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2014-01-13 23:05 UTC (permalink / raw)
To: Vladimir Davydov
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On Sat, 11 Jan 2014 16:36:31 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> When reclaiming kmem, we currently don't scan slabs that have less than
> batch_size objects (see shrink_slab_node()):
>
> while (total_scan >= batch_size) {
> shrinkctl->nr_to_scan = batch_size;
> shrinker->scan_objects(shrinker, shrinkctl);
> total_scan -= batch_size;
> }
>
> If there are only a few shrinkers available, such a behavior won't cause
> any problems, because the batch_size is usually small, but if we have a
> lot of slab shrinkers, which is perfectly possible since FS shrinkers
> are now per-superblock, we can end up with hundreds of megabytes of
> practically unreclaimable kmem objects. For instance, mounting a
> thousand of ext2 FS images with a hundred of files in each and iterating
> over all the files using du(1) will result in about 200 Mb of FS caches
> that cannot be dropped even with the aid of the vm.drop_caches sysctl!
True. I suspect this was an accidental consequence of the chosen
implementation. As you mentioned, I was thinking that the caches would
all be large, and the remaining 1 .. SHRINK_BATCH-1 objects just
didn't matter.
> This problem was initially pointed out by Glauber Costa [*]. Glauber
> proposed to fix it by making the shrink_slab() always take at least one
> pass, to put it simply, turning the scan loop above to a do{}while()
> loop. However, this proposal was rejected, because it could result in
> more aggressive and frequent slab shrinking even under low memory
> pressure when total_scan is naturally very small.
Well, it wasn't "rejected" - Mel pointed out that Glauber's change
could potentially trigger problems which already exist in shrinkers.
The potential issues seem pretty unlikely to me, and they're things we
can fix up if they eventuate.
So I'm thinking we should at least try Glauber's approach - it's a bit
weird that we should treat the final 0 .. batch_size-1 objects in a
different manner from all the others.
That being said, I think I'll schedule this patch as-is for 3.14. Can
you please take a look at implementing the simpler approach, send me
something for 3.15-rc1?
* Re: [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
2014-01-11 12:36 ` [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim Vladimir Davydov
@ 2014-01-13 23:11 ` Andrew Morton
2014-01-14 6:56 ` Vladimir Davydov
0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2014-01-13 23:11 UTC (permalink / raw)
To: Vladimir Davydov
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On Sat, 11 Jan 2014 16:36:33 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> When direct reclaim is executed by a process bound to a set of NUMA
> nodes, we should scan only those nodes when possible, but currently we
> will scan kmem from all online nodes even if the kmem shrinker is NUMA
> aware. That said, binding a process to a particular NUMA node won't
> prevent it from shrinking inode/dentry caches from other nodes, which is
> not good. Fix this.
Seems right. I worry that reducing the amount of shrinking which
node-bound processes perform might affect workloads in unexpected ways.
I think I'll save this one for 3.15-rc1, OK?
* Re: [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones()
2014-01-11 12:36 ` [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
@ 2014-01-13 23:13 ` Andrew Morton
2014-01-14 6:53 ` Vladimir Davydov
0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2014-01-13 23:13 UTC (permalink / raw)
To: Vladimir Davydov
Cc: linux-kernel, linux-mm, devel, Glauber Costa, Mel Gorman,
Michal Hocko, Johannes Weiner, Rik van Riel, Dave Chinner
On Sat, 11 Jan 2014 16:36:34 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> This reduces the indentation level of do_try_to_free_pages() and removes
> extra loop over all eligible zones counting the number of on-LRU pages.
So this should cause no functional change, yes?
* Re: [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones()
2014-01-13 23:13 ` Andrew Morton
@ 2014-01-14 6:53 ` Vladimir Davydov
0 siblings, 0 replies; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-14 6:53 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Glauber Costa, Mel Gorman,
Michal Hocko, Johannes Weiner, Rik van Riel, Dave Chinner
On 01/14/2014 03:13 AM, Andrew Morton wrote:
> On Sat, 11 Jan 2014 16:36:34 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
>
>> This reduces the indentation level of do_try_to_free_pages() and removes
>> extra loop over all eligible zones counting the number of on-LRU pages.
> So this should cause no functional change, yes?
Yes. This patch merely moves a piece of code from one function to another.
Thanks.
* Re: [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
2014-01-13 23:11 ` Andrew Morton
@ 2014-01-14 6:56 ` Vladimir Davydov
0 siblings, 0 replies; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-14 6:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On 01/14/2014 03:11 AM, Andrew Morton wrote:
> On Sat, 11 Jan 2014 16:36:33 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
>
>> When direct reclaim is executed by a process bound to a set of NUMA
>> nodes, we should scan only those nodes when possible, but currently we
>> will scan kmem from all online nodes even if the kmem shrinker is NUMA
>> aware. That said, binding a process to a particular NUMA node won't
>> prevent it from shrinking inode/dentry caches from other nodes, which is
>> not good. Fix this.
> Seems right. I worry that reducing the amount of shrinking which
> node-bound processes perform might affect workloads in unexpected ways.
Theoretically, it might, especially for NUMA unaware shrinkers. But
that's how it works for cpusets right now - we do not count pages from
nodes that are not allowed for the current process. Besides, when
counting lru pages for kswapd_shrink_zones(), we also consider only the
node this kswapd runs on so that NUMA unaware shrinkers will be scanned
more aggressively on NUMA-enabled setups than NUMA-aware ones. So, in
fact, this patch makes policy mask handling consistent with the rest of
the vmscan code.
> I think I'll save this one for 3.15-rc1, OK?
OK, thanks.
* Re: [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
2014-01-13 23:05 ` [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Andrew Morton
@ 2014-01-14 7:23 ` Vladimir Davydov
2014-01-14 22:14 ` Andrew Morton
0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-14 7:23 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On 01/14/2014 03:05 AM, Andrew Morton wrote:
> On Sat, 11 Jan 2014 16:36:31 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
>
>> When reclaiming kmem, we currently don't scan slabs that have less than
>> batch_size objects (see shrink_slab_node()):
>>
>> while (total_scan >= batch_size) {
>> shrinkctl->nr_to_scan = batch_size;
>> shrinker->scan_objects(shrinker, shrinkctl);
>> total_scan -= batch_size;
>> }
>>
>> If there are only a few shrinkers available, such a behavior won't cause
>> any problems, because the batch_size is usually small, but if we have a
>> lot of slab shrinkers, which is perfectly possible since FS shrinkers
>> are now per-superblock, we can end up with hundreds of megabytes of
>> practically unreclaimable kmem objects. For instance, mounting a
>> thousand of ext2 FS images with a hundred of files in each and iterating
>> over all the files using du(1) will result in about 200 Mb of FS caches
>> that cannot be dropped even with the aid of the vm.drop_caches sysctl!
> True. I suspect this was an accidental consequence of the chosen
> implementation. As you mentioned, I was thinking that the caches would
> all be large, and the remaining 1 .. SHRINK_BATCH-1 objects just
> didn't matter.
>
>> This problem was initially pointed out by Glauber Costa [*]. Glauber
>> proposed to fix it by making the shrink_slab() always take at least one
>> pass, to put it simply, turning the scan loop above to a do{}while()
>> loop. However, this proposal was rejected, because it could result in
>> more aggressive and frequent slab shrinking even under low memory
>> pressure when total_scan is naturally very small.
> Well, it wasn't "rejected" - Mel pointed out that Glauber's change
> could potentially trigger problems which already exist in shrinkers.
>
> The potential issues seem pretty unlikely to me, and they're things we
> can fix up if they eventuate.
When preparing this patch, I considered not the problems that
potentially exist in some shrinkers, but the issues that unconditional
scan of < batch_size objects might trigger for any shrinker:
1) We would call shrinkers more frequently, which could possibly
increase contention on shrinker-internal locks. The point is that under
very light memory pressure when we can fulfill the allocation request
after a few low-prio scans, we would not call slab shrinkers at all,
instead we would only add the delta to nr_deferred in order to keep
slab-vs-pagecache reclaim balanced. Glauber's original patch changes
this behavior - it makes shrink_slab() always call the shrinker at least
once, even if the current delta is negligible. I'm afraid this might
affect performance. Note that this is irrespective of how many objects the
shrinker has to reclaim (< or > batch_size).
2) As Mel Gorman pointed out
(http://thread.gmane.org/gmane.linux.kernel.mm/99059):
> It's possible for caches to shrink to zero where before the last
> SHRINK_SLAB objects would often be protected for any slab. If this is
> an inode or dentry cache and there are very few objects then it's
> possible that objects will be reclaimed before they can be used by the
> process allocating them.
> So I'm thinking we should at least try Glauber's approach - it's a bit
> weird that we should treat the final 0 .. batch_size-1 objects in a
> different manner from all the others.
It's not exactly that we treat the final 0 .. batch_size-1 objects
differently from others. We rather try to accumulate at least batch_size
objects before calling ->scan().
> That being said, I think I'll schedule this patch as-is for 3.14. Can
> you please take a look at implementing the simpler approach, send me
> something for 3.15-rc1?
IMHO the simpler approach (Glauber's patch) is not suitable as is,
because it effectively neglects the notion of batch_size when doing
low-prio scans: it calls ->scan() for < batch_size objects even if
the slab has >= batch_size objects, while AFAIU it should accumulate a
sufficient number of objects to scan in nr_deferred instead.
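To illustrate the nr_deferred bookkeeping I mean, here is roughly what
shrink_slab_node() does today (heavily simplified, details omitted):

	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
	delta = <share of max_pass proportional to the LRU pages scanned>;
	total_scan = nr + delta;

	while (total_scan >= batch_size) {
		/* call ->scan_objects() for batch_size objects */
		total_scan -= batch_size;
	}

	/* whatever was not scanned is remembered for the next call */
	if (total_scan > 0)
		atomic_long_add(total_scan, &shrinker->nr_deferred[nid]);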
Thanks.
* Re: [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
2014-01-14 7:23 ` Vladimir Davydov
@ 2014-01-14 22:14 ` Andrew Morton
2014-01-15 8:47 ` Vladimir Davydov
0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2014-01-14 22:14 UTC (permalink / raw)
To: Vladimir Davydov
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On Tue, 14 Jan 2014 11:23:30 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> On 01/14/2014 03:05 AM, Andrew Morton wrote:
> > On Sat, 11 Jan 2014 16:36:31 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> >
> >> When reclaiming kmem, we currently don't scan slabs that have less than
> >> batch_size objects (see shrink_slab_node()):
> >>
> >> while (total_scan >= batch_size) {
> >> shrinkctl->nr_to_scan = batch_size;
> >> shrinker->scan_objects(shrinker, shrinkctl);
> >> total_scan -= batch_size;
> >> }
> >>
> >> If there are only a few shrinkers available, such a behavior won't cause
> >> any problems, because the batch_size is usually small, but if we have a
> >> lot of slab shrinkers, which is perfectly possible since FS shrinkers
> >> are now per-superblock, we can end up with hundreds of megabytes of
> >> practically unreclaimable kmem objects. For instance, mounting a
> >> thousand of ext2 FS images with a hundred of files in each and iterating
> >> over all the files using du(1) will result in about 200 Mb of FS caches
> >> that cannot be dropped even with the aid of the vm.drop_caches sysctl!
> > True. I suspect this was an accidental consequence of the chosen
> > implementation. As you mentioned, I was thinking that the caches would
> > all be large, and the remaining 1 .. SHRINK_BATCH-1 objects just
> > didn't matter.
> >
> >> This problem was initially pointed out by Glauber Costa [*]. Glauber
> >> proposed to fix it by making the shrink_slab() always take at least one
> >> pass, to put it simply, turning the scan loop above to a do{}while()
> >> loop. However, this proposal was rejected, because it could result in
> >> more aggressive and frequent slab shrinking even under low memory
> >> pressure when total_scan is naturally very small.
> > Well, it wasn't "rejected" - Mel pointed out that Glauber's change
> > could potentially trigger problems which already exist in shrinkers.
> >
> > The potential issues seem pretty unlikely to me, and they're things we
> > can fix up if they eventuate.
>
> When preparing this patch, I considered not the problems that
> potentially exist in some shrinkers, but the issues that unconditional
> scan of < batch_size objects might trigger for any shrinker:
>
> 1) We would call shrinkers more frequently,
hm, why?
> which could possibly
> increase contention on shrinker-internal locks. The point is that under
> very light memory pressure when we can fulfill the allocation request
> after a few low-prio scans, we would not call slab shrinkers at all,
> instead we would only add the delta to nr_deferred in order to keep
> slab-vs-pagecache reclaim balanced. Original Glauber's patch changes
> this behavior - it makes shrink_slab() always call the shrinker at least
> once, even if the current delta is negligible. I'm afraid, this might
> affect performance. Note, this is irrespective of how much objects the
> shrinker has to reclaim (< or > batch_size).
I doubt if it affects performance much at all - memory reclaim in
general is a slow path.
> 2) As Mel Gorman pointed out
> (http://thread.gmane.org/gmane.linux.kernel.mm/99059):
>
> > It's possible for caches to shrink to zero where before the last
> > SHRINK_SLAB objects would often be protected for any slab. If this is
> > an inode or dentry cache and there are very few objects then it's
> > possible that objects will be reclaimed before they can be used by the
> > process allocating them.
I don't understand that one. It appears to assume that vfs code will
allocate and initialise a dentry or inode, will put it in cache and
release all references to it and then will look it up again and start
using it. vfs doesn't work like that - it would be crazy to do so. It
will instead hold references to those objects while using them. Mainly
to protect them from reclaim.
And if there were any problems of this nature, they would already be
demonstrable with a sufficiently large number of threads/cpus.
> > So I'm thinking we should at least try Glauber's approach - it's a bit
> > weird that we should treat the final 0 .. batch_size-1 objects in a
> > different manner from all the others.
>
> It's not exactly that we treat the final 0 .. batch_size-1 objects
> differently from others. We rather try to accumulate at least batch_size
> objects before calling ->scan().
And if there are < batch_size objects, they never get scanned. That's
different treatment.
> > That being said, I think I'll schedule this patch as-is for 3.14. Can
> > you please take a look at implementing the simpler approach, send me
> > something for 3.15-rc1?
>
> IMHO the simpler approach (Glauber's patch) is not suitable as is,
> because it, in fact, neglects the notion of batch_size when doing low
> prio scans, because it calls ->scan() for < batch_size objects even if
> the slab has >= batch_size objects while AFAIU it should accumulate a
> sufficient number of objects to scan in nr_deferred instead.
Well. If you mean that when nr-objects=large and batch_size=32 and
total_scan=33, the patched code will scan 32 objects and then 1 object
then yes, that should be fixed.
But I remain quite unconvinced that the additional complexity in this
code is justified. The alleged problems with the simple version are
all theoretical and unproven. Simple code is of course preferred - can
we please start out that way and see if any of the theoretical problems
are actually real?
* Re: [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
2014-01-14 22:14 ` Andrew Morton
@ 2014-01-15 8:47 ` Vladimir Davydov
2014-01-15 9:25 ` Andrew Morton
0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-15 8:47 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On 01/15/2014 02:14 AM, Andrew Morton wrote:
> On Tue, 14 Jan 2014 11:23:30 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
>
>> On 01/14/2014 03:05 AM, Andrew Morton wrote:
>>> That being said, I think I'll schedule this patch as-is for 3.14. Can
>>> you please take a look at implementing the simpler approach, send me
>>> something for 3.15-rc1?
>> IMHO the simpler approach (Glauber's patch) is not suitable as is,
>> because it, in fact, neglects the notion of batch_size when doing low
>> prio scans, because it calls ->scan() for < batch_size objects even if
>> the slab has >= batch_size objects while AFAIU it should accumulate a
>> sufficient number of objects to scan in nr_deferred instead.
> Well. If you mean that when nr-objects=large and batch_size=32 and
> total_scan=33, the patched code will scan 32 objects and then 1 object
> then yes, that should be fixed.
I mean if nr_objects=large and batch_size=32 and shrink_slab() is called
8 times with total_scan=4, we can either call ->scan() 8 times with
nr_to_scan=4 (Glauber's patch) or call it only once with nr_to_scan=32
(that's how it works now). Frankly, after a bit of thinking I am
starting to doubt that this can affect performance at all provided the
shrinker is implemented in a sane way, because as you've mentioned
shrink_slab() is already a slow path. It seems I misunderstood the
purpose of batch_size initially: I though we need it to limit the number
of calls to ->scan(), but now I guess the only purpose of it is limiting
the number of objects scanned in one pass to avoid latency issues. But
then another question arises - why do you think the behavior you
described above (scanning 32 and then 1 object if total_scan=33,
batch_size=32) is bad? In other words why can't we make the scan loop
look like this:
	while (total_scan > 0) {
		unsigned long ret;
		unsigned long nr_to_scan = min(total_scan, batch_size);

		shrinkctl->nr_to_scan = nr_to_scan;
		ret = shrinker->scan_objects(shrinker, shrinkctl);
		if (ret == SHRINK_STOP)
			break;
		freed += ret;

		count_vm_events(SLABS_SCANNED, nr_to_scan);
		total_scan -= nr_to_scan;

		cond_resched();
	}
?
Thanks.
* Re: [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
2014-01-15 8:47 ` Vladimir Davydov
@ 2014-01-15 9:25 ` Andrew Morton
2014-01-15 15:55 ` Vladimir Davydov
0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2014-01-15 9:25 UTC (permalink / raw)
To: Vladimir Davydov
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On Wed, 15 Jan 2014 12:47:35 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> On 01/15/2014 02:14 AM, Andrew Morton wrote:
> > On Tue, 14 Jan 2014 11:23:30 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> >
> >> On 01/14/2014 03:05 AM, Andrew Morton wrote:
> >>> That being said, I think I'll schedule this patch as-is for 3.14. Can
> >>> you please take a look at implementing the simpler approach, send me
> >>> something for 3.15-rc1?
> >> IMHO the simpler approach (Glauber's patch) is not suitable as is,
> >> because it, in fact, neglects the notion of batch_size when doing low
> >> prio scans, because it calls ->scan() for < batch_size objects even if
> >> the slab has >= batch_size objects while AFAIU it should accumulate a
> >> sufficient number of objects to scan in nr_deferred instead.
> > Well. If you mean that when nr-objects=large and batch_size=32 and
> > total_scan=33, the patched code will scan 32 objects and then 1 object
> > then yes, that should be fixed.
>
> I mean if nr_objects=large and batch_size=32 and shrink_slab() is called
> 8 times with total_scan=4, we can either call ->scan() 8 times with
> nr_to_scan=4 (Glauber's patch) or call it only once with nr_to_scan=32
> (that's how it works now). Frankly, after a bit of thinking I am
> starting to doubt that this can affect performance at all provided the
> shrinker is implemented in a sane way, because as you've mentioned
> shrink_slab() is already a slow path. It seems I misunderstood the
> purpose of batch_size initially: I though we need it to limit the number
> of calls to ->scan(), but now I guess the only purpose of it is limiting
> the number of objects scanned in one pass to avoid latency issues.
Actually, the intent of batching is to limit the number of calls to
->scan(). At least, that was the intent when I wrote it! This is a
good principle and we should keep doing it. If we're going to send the
CPU away to tread on a pile of cold cachelines, we should make sure
that it does a good amount of work while it's there.
> But
> then another question arises - why do you think the behavior you
> described above (scanning 32 and then 1 object if total_scan=33,
> batch_size=32) is bad?
Yes, it's a bit inefficient but it won't be too bad. What would be bad
would be to scan a very small number of objects and then to advance to
the next shrinker.
> In other words why can't we make the scan loop
> look like this:
>
> while (total_scan > 0) {
> unsigned long ret;
> unsigned long nr_to_scan = min(total_scan, batch_size);
>
> shrinkctl->nr_to_scan = nr_to_scan;
> ret = shrinker->scan_objects(shrinker, shrinkctl);
> if (ret == SHRINK_STOP)
> break;
> freed += ret;
>
> count_vm_events(SLABS_SCANNED, nr_to_scan);
> total_scan -= nr_to_scan;
>
> cond_resched();
> }
Well, if we come in here with total_scan=1 then we defeat the original
intent of the batching, don't we? We end up doing a lot of work just
to scan one object. So perhaps add something like
	if (total_scan < batch_size && max_pass > batch_size)
		skip the while loop
If we do this, total_scan will be accumulated into nr_deferred, up to
the point where the threshold is exceeded, yes?
All the arithmetic in there hurts my brain and I don't know what values
total_scan typically ends up with.
btw. all that trickery with delta and lru_pages desperately needs
documenting. What the heck is it intended to do??
We could avoid the "scan 32 then scan just 1" issue with something like
	if (total_scan > batch_size)
		total_scan %= batch_size;
before the loop. But I expect the effects of that will be unmeasurable
- on average the number of objects which are scanned in the final pass
of the loop will be batch_size/2, yes? That's still a decent amount.
* Re: [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
2014-01-15 9:25 ` Andrew Morton
@ 2014-01-15 15:55 ` Vladimir Davydov
2014-01-15 22:53 ` Andrew Morton
0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-15 15:55 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On 01/15/2014 01:25 PM, Andrew Morton wrote:
> On Wed, 15 Jan 2014 12:47:35 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
>
>> On 01/15/2014 02:14 AM, Andrew Morton wrote:
>>> On Tue, 14 Jan 2014 11:23:30 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
>>>
>>>> On 01/14/2014 03:05 AM, Andrew Morton wrote:
>>>>> That being said, I think I'll schedule this patch as-is for 3.14. Can
>>>>> you please take a look at implementing the simpler approach, send me
>>>>> something for 3.15-rc1?
>>>> IMHO the simpler approach (Glauber's patch) is not suitable as is,
>>>> because it, in fact, neglects the notion of batch_size when doing low
>>>> prio scans, because it calls ->scan() for < batch_size objects even if
>>>> the slab has >= batch_size objects while AFAIU it should accumulate a
>>>> sufficient number of objects to scan in nr_deferred instead.
>>> Well. If you mean that when nr-objects=large and batch_size=32 and
>>> total_scan=33, the patched code will scan 32 objects and then 1 object
>>> then yes, that should be fixed.
>> I mean if nr_objects=large and batch_size=32 and shrink_slab() is called
>> 8 times with total_scan=4, we can either call ->scan() 8 times with
>> nr_to_scan=4 (Glauber's patch) or call it only once with nr_to_scan=32
>> (that's how it works now). Frankly, after a bit of thinking I am
>> starting to doubt that this can affect performance at all provided the
>> shrinker is implemented in a sane way, because as you've mentioned
>> shrink_slab() is already a slow path. It seems I misunderstood the
>> purpose of batch_size initially: I though we need it to limit the number
>> of calls to ->scan(), but now I guess the only purpose of it is limiting
>> the number of objects scanned in one pass to avoid latency issues.
> Actually, the intent of batching is to limit the number of calls to
> ->scan(). At least, that was the intent when I wrote it! This is a
> good principle and we should keep doing it. If we're going to send the
> CPU away to tread on a pile of cold cachelines, we should make sure
> that it does a good amount of work while it's there.
>
>> But
>> then another question arises - why do you think the behavior you
>> described above (scanning 32 and then 1 object if total_scan=33,
>> batch_size=32) is bad?
> Yes, it's a bit inefficient but it won't be too bad. What would be bad
> would be to scan a very small number of objects and then to advance to
> the next shrinker.
>
>> In other words why can't we make the scan loop
>> look like this:
>>
>> while (total_scan > 0) {
>> unsigned long ret;
>> unsigned long nr_to_scan = min(total_scan, batch_size);
>>
>> shrinkctl->nr_to_scan = nr_to_scan;
>> ret = shrinker->scan_objects(shrinker, shrinkctl);
>> if (ret == SHRINK_STOP)
>> break;
>> freed += ret;
>>
>> count_vm_events(SLABS_SCANNED, nr_to_scan);
>> total_scan -= nr_to_scan;
>>
>> cond_resched();
>> }
>
> Well, if we come in here with total_scan=1 then we defeat the original
> intent of the batching, don't we? We end up doing a lot of work just
> to scan one object. So perhaps add something like
>
> if (total_scan < batch_size && max_pass > batch_size)
> skip the while loop
>
> If we do this, total_scan will be accumulated into nr_deferred, up to
> the point where the threshold is exceeded, yes?
Yes, but only if the shrinker has > batch_size objects to scan,
otherwise (max_pass < batch_size) we will proceed to the while loop and
call ->scan() even if total_scan = 1. I mean if batch_size = 32,
max_pass = 31 and total_scan = 1 (low memory pressure), we will call
->scan() 31 times scanning only 1 object during each pass, which is not
good.
> All the arithmetic in there hurts my brain and I don't know what values
> total_scan typically ends up with.
>
> btw. all that trickery with delta and lru_pages desperately needs
> documenting. What the heck is it intended to do??
>
>
>
> We could avoid the "scan 32 then scan just 1" issue with something like
>
> if (total_scan > batch_size)
> total_scan %= batch_size;
>
> before the loop. But I expect the effects of that will be unmeasurable
> - on average the number of objects which are scanned in the final pass
> of the loop will be batch_size/2, yes? That's still a decent amount.
Let me try to summarize. We want to scan batch_size objects in one pass,
not more (to keep latency low) and not less (to avoid cpu cache
pollution due to too frequent calls); if the calculated value of
nr_to_scan is less than the batch_size we should accumulate it in
nr_deferred instead of calling ->scan() and add nr_deferred to
nr_to_scan on the next pass, i.e. in pseudo-code:
	/* calculate current nr_to_scan */
	max_pass = shrinker->count();
	delta = max_pass * nr_user_pages_scanned / nr_user_pages;

	/* add nr_deferred */
	total_scan = delta + nr_deferred;

	while (total_scan >= batch_size) {
		shrinker->scan(batch_size);
		total_scan -= batch_size;
	}

	/* save the remainder to nr_deferred */
	nr_deferred = total_scan;
That would work, but if max_pass is < batch_size, it would not scan the
objects immediately even if prio is high (we want to scan all objects).
For example, dropping caches would not work on the first attempt - the
user would have to call it batch_size / max_pass times. This could be
fixed by making the code proceed to ->scan() not only if total_scan is
>= batch_size, but also if max_pass is < batch_size and total_scan is >=
max_pass, i.e.
	while (total_scan >= batch_size ||
	       (max_pass < batch_size && total_scan >= max_pass)) ...
which is equivalent to
	while (total_scan >= batch_size ||
	       total_scan >= max_pass) ...
The latter is the loop condition from the current patch, i.e. this patch
would do the trick if shrink_slab() followed the pseudo-code above. In
real life, it actually does not - we have to bias total_scan before the
while loop in order to avoid dropping fs meta caches under light memory
pressure due to a large number being built up in nr_deferred:
	if (delta < max_pass / 4)
		total_scan = min(total_scan, max_pass / 2);

	while (total_scan >= batch_size) ...
With this biasing, it is impossible to achieve the ideal behavior I've
described above, because we will never accumulate max_pass objects in
nr_deferred if memory pressure is low. So, if applied to the real code,
this patch takes on a slightly different sense, which I tried to reflect
in the comment to the code: it will call ->scan() with nr_to_scan <
batch_size only if:
1) max_pass < batch_size && total_scan >= max_pass
and
2) we're tight on memory, i.e. the current delta is high (otherwise
total_scan will be biased as max_pass / 2 and condition 1 won't be
satisfied).
From our discussion it seems condition 2 is not necessary at all, but it
follows directly from the biasing rule. So I propose to tweak the
biasing a bit so that total_scan won't be lowered < batch_size:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d..78ddd5e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -267,7 +267,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	 * a large delta change is calculated directly.
 	 */
 	if (delta < max_pass / 4)
-		total_scan = min(total_scan, max_pass / 2);
+		total_scan = min(total_scan, max(max_pass / 2, batch_size));
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
@@ -281,7 +281,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 				nr_pages_scanned, lru_pages,
 				max_pass, delta, total_scan);
 
-	while (total_scan >= batch_size) {
+	while (total_scan >= batch_size || total_scan >= max_pass) {
 		unsigned long ret;
 
 		shrinkctl->nr_to_scan = batch_size;
The first hunk guarantees that total_scan will always accumulate at
least batch_size objects no matter how small max_pass is. That means
that when max_pass is < batch_size we will eventually get >= max_pass
objects to scan and shrink the slab to 0 as we need. What do you think
about that?
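As a quick sanity check with assumed numbers: batch_size = 32 and a tiny
cache with max_pass = 20. Under light pressure the old bias caps total_scan
at max_pass / 2 = 10, so neither branch of the while condition can fire and
the 20 objects are never scanned. With the tweak the cap becomes
max(10, 32) = 32, nr_deferred can keep growing between calls, and once
total_scan reaches 20 the total_scan >= max_pass branch lets us shrink the
cache to zero.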
Thanks.
* Re: [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
2014-01-15 15:55 ` Vladimir Davydov
@ 2014-01-15 22:53 ` Andrew Morton
2014-01-16 8:50 ` Vladimir Davydov
0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2014-01-15 22:53 UTC (permalink / raw)
To: Vladimir Davydov
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On Wed, 15 Jan 2014 19:55:11 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> >
> > We could avoid the "scan 32 then scan just 1" issue with something like
> >
> > if (total_scan > batch_size)
> > total_scan %= batch_size;
> >
> > before the loop. But I expect the effects of that will be unmeasurable
> > - on average the number of objects which are scanned in the final pass
> > of the loop will be batch_size/2, yes? That's still a decent amount.
>
> Let me try to summarize. We want to scan batch_size objects in one pass,
> not more (to keep latency low) and not less (to avoid cpu cache
> pollution due to too frequent calls); if the calculated value of
> nr_to_scan is less than the batch_size we should accumulate it in
> nr_deferred instead of calling ->scan() and add nr_deferred to
> nr_to_scan on the next pass, i.e. in pseudo-code:
>
> /* calculate current nr_to_scan */
> max_pass = shrinker->count();
> delta = max_pass * nr_user_pages_scanned / nr_user_pages;
>
> /* add nr_deferred */
> total_scan = delta + nr_deferred;
>
> while (total_scan >= batch_size) {
> shrinker->scan(batch_size);
> total_scan -= batch_size;
> }
>
> /* save the remainder to nr_deferred */
> nr_deferred = total_scan;
>
> That would work, but if max_pass is < batch_size, it would not scan the
> objects immediately even if prio is high (we want to scan all objects).
Yes, that's a problem.
> For example, dropping caches would not work on the first attempt - the
> user would have to call it batch_size / max_pass times.
And we do want drop_caches to work immediately.
> This could be
> fixed by making the code proceed to ->scan() not only if total_scan is
> >= batch_size, but also if max_pass is < batch_size and total_scan is >=
> max_pass, i.e.
>
> while (total_scan >= batch_size ||
> (max_pass < batch_size && total_scan >= max_pass)) ...
>
> which is equivalent to
>
> while (total_scan >= batch_size ||
> total_scan >= max_pass) ...
>
> The latter is the loop condition from the current patch, i.e. this patch
> would do the trick if shrink_slab() followed the pseudo-code above. In
> real life it does not, because we have to bias total_scan before the
> while loop to avoid dropping fs metadata caches under light memory
> pressure when a large number has built up in nr_deferred:
>
> if (delta < max_pass / 4)
> total_scan = min(total_scan, max_pass / 2);
Oh, is that what it's for. Where did you discover this gem?
> while (total_scan >= batch_size) ...
>
> With this biasing, it is impossible to achieve the ideal behavior I've
> described above, because we will never accumulate max_pass objects in
> nr_deferred if memory pressure is low. So, if applied to the real code,
> this patch takes on a slightly different sense, which I tried to reflect
> in the comment to the code: it will call ->scan() with nr_to_scan <
> batch_size only if:
>
> 1) max_pass < batch_size && total_scan >= max_pass
>
> and
>
> 2) we're tight on memory, i.e. the current delta is high (otherwise
> total_scan will be biased as max_pass / 2 and condition 1 won't be
> satisfied).
(is max_pass misnamed?)
> From our discussion it seems condition 2 is not necessary at all, but it
> follows directly from the biasing rule. So I propose to tweak the
> biasing a bit so that total_scan won't be lowered < batch_size:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eea668d..78ddd5e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -267,7 +267,7 @@ shrink_slab_node(struct shrink_control *shrinkctl,
> struct shrinker *shrinker,
> * a large delta change is calculated directly.
> */
> if (delta < max_pass / 4)
> - total_scan = min(total_scan, max_pass / 2);
> + total_scan = min(total_scan, max(max_pass / 2, batch_size));
>
> /*
> * Avoid risking looping forever due to too large nr value:
> @@ -281,7 +281,7 @@ shrink_slab_node(struct shrink_control *shrinkctl,
> struct shrinker *shrinker,
> nr_pages_scanned, lru_pages,
> max_pass, delta, total_scan);
>
> - while (total_scan >= batch_size) {
> + while (total_scan >= batch_size || total_scan >= max_pass) {
> unsigned long ret;
>
> shrinkctl->nr_to_scan = batch_size;
>
> The first hunk guarantees that total_scan will always accumulate at
> least batch_size objects no matter how small max_pass is. That means
> that when max_pass is < batch_size we will eventually get >= max_pass
> objects to scan and shrink the slab to 0 as we need. What do you think
> about that?
I'm a bit lost :(
* Re: [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
2014-01-15 22:53 ` Andrew Morton
@ 2014-01-16 8:50 ` Vladimir Davydov
0 siblings, 0 replies; 17+ messages in thread
From: Vladimir Davydov @ 2014-01-16 8:50 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, devel, Mel Gorman, Michal Hocko,
Johannes Weiner, Rik van Riel, Dave Chinner, Glauber Costa
On 01/16/2014 02:53 AM, Andrew Morton wrote:
> On Wed, 15 Jan 2014 19:55:11 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
>
>>> We could avoid the "scan 32 then scan just 1" issue with something like
>>>
>>> if (total_scan > batch_size)
>>> total_scan %= batch_size;
>>>
>>> before the loop. But I expect the effects of that will be unmeasurable
>>> - on average the number of objects which are scanned in the final pass
>>> of the loop will be batch_size/2, yes? That's still a decent amount.
>> Let me try to summarize. We want to scan batch_size objects in one pass,
>> not more (to keep latency low) and not less (to avoid cpu cache
>> pollution due to too frequent calls); if the calculated value of
>> nr_to_scan is less than the batch_size we should accumulate it in
>> nr_deferred instead of calling ->scan() and add nr_deferred to
>> nr_to_scan on the next pass, i.e. in pseudo-code:
>>
>> /* calculate current nr_to_scan */
>> max_pass = shrinker->count();
>> delta = max_pass * nr_user_pages_scanned / nr_user_pages;
>>
>> /* add nr_deferred */
>> total_scan = delta + nr_deferred;
>>
>> while (total_scan >= batch_size) {
>> shrinker->scan(batch_size);
>> total_scan -= batch_size;
>> }
>>
>> /* save the remainder to nr_deferred */
>> nr_deferred = total_scan;
>>
>> That would work, but if max_pass is < batch_size, it would not scan the
>> objects immediately even if prio is high (we want to scan all objects).
> Yes, that's a problem.
>
>> For example, dropping caches would not work on the first attempt - the
>> user would have to call it batch_size / max_pass times.
> And we do want drop_caches to work immediately.
>
>> This could be
>> fixed by making the code proceed to ->scan() not only if total_scan is
>> >= batch_size, but also if max_pass is < batch_size and total_scan is >=
>> max_pass, i.e.
>>
>> while (total_scan >= batch_size ||
>> (max_pass < batch_size && total_scan >= max_pass)) ...
>>
>> which is equivalent to
>>
>> while (total_scan >= batch_size ||
>> total_scan >= max_pass) ...
>>
>> The latter is the loop condition from the current patch, i.e. this patch
>> would do the trick if shrink_slab() followed the pseudo-code above. In
>> real life it does not, because we have to bias total_scan before the
>> while loop to avoid dropping fs metadata caches under light memory
>> pressure when a large number has built up in nr_deferred:
>>
>> if (delta < max_pass / 4)
>> total_scan = min(total_scan, max_pass / 2);
> Oh, is that what it's for. Where did you discover this gem?
Thread overview: 17+ messages
2014-01-11 12:36 [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Vladimir Davydov
2014-01-11 12:36 ` [PATCH 2/5] mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask Vladimir Davydov
2014-01-11 12:36 ` [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim Vladimir Davydov
2014-01-13 23:11 ` Andrew Morton
2014-01-14 6:56 ` Vladimir Davydov
2014-01-11 12:36 ` [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
2014-01-13 23:13 ` Andrew Morton
2014-01-14 6:53 ` Vladimir Davydov
2014-01-11 12:36 ` [PATCH 5/5] mm: vmscan: remove shrink_control arg from do_try_to_free_pages() Vladimir Davydov
2014-01-13 23:05 ` [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory Andrew Morton
2014-01-14 7:23 ` Vladimir Davydov
2014-01-14 22:14 ` Andrew Morton
2014-01-15 8:47 ` Vladimir Davydov
2014-01-15 9:25 ` Andrew Morton
2014-01-15 15:55 ` Vladimir Davydov
2014-01-15 22:53 ` Andrew Morton
2014-01-16 8:50 ` Vladimir Davydov