* make swappiness safer to use
@ 2007-07-31 21:52 Andrea Arcangeli
  2007-07-31 22:12 ` Andrew Morton
  2007-07-31 23:09 ` Andrew Morton
  0 siblings, 2 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2007-07-31 21:52 UTC (permalink / raw)
To: linux-mm; +Cc: Nick Piggin, Andrew Morton

Swappiness isn't a safe sysctl. Setting it to 0, for example, can hang a
system. That's a corner case, but even setting it to 10 or lower can waste
enormous amounts of cpu without making much progress. We have customers who
want to use swappiness but can't because of the current implementation (if
you change it so the system stops swapping, it really stops swapping, and
nothing works sanely anymore if you really had to swap something to make
progress). This patch from Kurt Garloff makes swappiness safer to use (no
more huge cpu usage or hangs with low swappiness values).

I think prev_priority can also be nuked, since it wastes 4 bytes per zone
(that would be an incremental patch, but I'll wait for nr_scan_[in]active
to be nuked first for similar reasons). Clearly somebody at some point
noticed how broken that thing was and had to add min(priority,
prev_priority) to give it some reliability, but they didn't go the last
mile and nuke prev_priority too. Calculating distress only as a function of
the not-racy priority is correct and surely more than enough, without
having to add randomness into the equation. The patch is tested on older
kernels, but it compiles and it's quite simple so...

Overall I'm not very satisfied with the swappiness tweak, since it doesn't
really do anything with the dirty pagecache that may be inactive. We need
another kind of tweak that controls the inactive scan and tunes the
can_writepage feature (not yet in mainline despite having been submitted a
few times), not only the active one. That new tweak will tell the kernel
how hard to scan the inactive list for pure clean pagecache (something the
mainline kernel isn't capable of yet).
We already have that feature working in all our enterprise kernels with the
default reasonable tune, or they can't even run a readonly backup with tar
without triggering huge write I/O. I think it should also be available in
mainline later.

Signed-off-by: Kurt Garloff <garloff@suse.de>
Acked-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -914,6 +914,22 @@ static void shrink_active_list(unsigned
 	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;

 	/*
+	 * With low vm_swappiness values, we can actually reach
+	 * situations, where we have the inactive list almost
+	 * completely depleted.
+	 * This will result in the kernel behaving badly,
+	 * looping trying to find free ram and thrashing on
+	 * the working set's page faults.
+	 * So let's increase our swap_tendency when we get
+	 * into such a situation. The formula ensures we only
+	 * boost it when really needed.
+	 */
+	swap_tendency += zone_page_state(zone, NR_ACTIVE) /
+			(zone_page_state(zone, NR_INACTIVE) + 1)
+			* (vm_swappiness + 1) / 100
+			* mapped_ratio / 100;
+
+	/*
 	 * Now use this metric to decide whether to start moving mapped
 	 * memory onto the inactive list.
 	 */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  2007-07-31 21:52 make swappiness safer to use Andrea Arcangeli
@ 2007-07-31 22:12 ` Andrew Morton
  2007-07-31 22:40   ` Andrea Arcangeli
  2007-07-31 23:09 ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2007-07-31 22:12 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-mm, Nick Piggin

On Tue, 31 Jul 2007 23:52:28 +0200
Andrea Arcangeli <andrea@suse.de> wrote:

> +	swap_tendency += zone_page_state(zone, NR_ACTIVE) /
> +			(zone_page_state(zone, NR_INACTIVE) + 1)
> +			* (vm_swappiness + 1) / 100
> +			* mapped_ratio / 100;

I must say, that's a pretty ugly-looking statement. For a start, at first
glance the clause

	* (vm_swappiness + 1) / 100

looks like it always evaluates to zero. The L->R associativity prevents
that, but the layout is super-misleading, no?

And it matters - the potential for overflow and rounding errors here is
considerable. Let's go through it. Probably 32-bit is the problem.

	zone_page_state(zone, NR_ACTIVE) /
		0 -> 8,000,000
	(zone_page_state(zone, NR_INACTIVE) + 1)
		min: 1, max: 8,000,000
	* (vm_swappiness + 1)
		min: 1, max: 101
		total min: 1, total max: 800,000,000
	/ 100
		total min: 0, total max: 8,000,000
	* mapped_ratio
		total min: 0, total max: 800,000,000
	/ 100;
		total min: 0, total max: 8,000,000

then we divide zone_page_state(zone, NR_ACTIVE) by this value.

We can get a divide-by-zero if zone_page_state(zone, NR_INACTIVE) is
sufficiently small, I think? At least, it isn't obvious that we cannot.

I suspect that we can get a value >100, too. Especially when we add it to
the existing value of swap_tendency, but I didn't think about it too hard.

Want to see if we can present that expression in a more logical fashion,
and be more careful about the underflows and overflows, and fix the
potential divide-by-zero?

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  2007-07-31 22:12 ` Andrew Morton
@ 2007-07-31 22:40   ` Andrea Arcangeli
  2007-07-31 22:51     ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2007-07-31 22:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Nick Piggin

Hi Andrew!

On Tue, Jul 31, 2007 at 03:12:44PM -0700, Andrew Morton wrote:
> On Tue, 31 Jul 2007 23:52:28 +0200
> Andrea Arcangeli <andrea@suse.de> wrote:
>
> > +	swap_tendency += zone_page_state(zone, NR_ACTIVE) /
> > +			(zone_page_state(zone, NR_INACTIVE) + 1)
> > +			* (vm_swappiness + 1) / 100
> > +			* mapped_ratio / 100;
>
> I must say, that's a pretty ugly-looking statement. For a start, at first
> glance the clause
>
> 	* (vm_swappiness + 1) / 100
>
> looks like it always evaluates to zero. The L->R associativity prevents
> that, but the layout is super-misleading, no?

I can split it into multiple lines if you prefer, but it wouldn't make much
difference. The basic idea is that the feedback provided by priority is
cpu-wasteful: if we have an active list of 8000000 pages and an inactive
list of 0, it's absolutely pointless to do what mainline does, i.e. wait
for priority to go down to zero before refiling mapped pages down to the
inactive list. We can clearly get a better feedback loop by checking for
insane imbalances between the two lists.

> And it matters - the potential for overflow and rounding errors here is
> considerable. Let's go through it. Probably 32-bit is the problem.
>
> 	zone_page_state(zone, NR_ACTIVE) /
> 		0 -> 8,000,000
> 	(zone_page_state(zone, NR_INACTIVE) + 1)
> 		min: 1, max: 8,000,000
> 	* (vm_swappiness + 1)
> 		min: 1, max: 101
> 		total min: 1, total max: 800,000,000
> 	/ 100
> 		total min: 0, total max: 8,000,000
> 	* mapped_ratio
> 		total min: 0, total max: 800,000,000
> 	/ 100;
> 		total min: 0, total max: 8,000,000
>
> then we divide zone_page_state(zone, NR_ACTIVE) by this value.

Hmm, no: we divide zone_page_state(zone, NR_ACTIVE) immediately by
zone_page_state(zone, NR_INACTIVE) + 1.

So in the extreme case that inactive is 0, active is 8000000, swappiness is
99 and mapped_ratio is 100, we get this:

	8000000 / 1 * (swappiness+1)/100 * mapped_ratio / 100

	8000000 / 1 = 8000000
	8000000 * 100 = 800000000
	800000000 / 100 = 8000000
	8000000 * 100 = 800000000
	800000000 / 100 = 8000000

So in the most extreme case swap_tendency will be 8000000 plus the previous
swap_tendency value, which is fine.

> We can get a divide-by-zero if zone_page_state(zone, NR_INACTIVE) is
> sufficiently small, I think? At least, it isn't obvious that we cannot.

I think gcc should be guaranteed to go from left to right like you said (I
don't think we're required to put it in separate local variables to get
that guarantee from gcc). "zone_page_state(zone, NR_INACTIVE) + 1" has a
minimum value of 1. For this to generate a divide by zero,
zone_page_state(zone, NR_INACTIVE) would have to return ~0UL, which will
never happen due to ram constraints.

> I suspect that we can get a value >100, too. Especially when we add it to
> the existing value of swap_tendency, but I didn't think about it too hard.

swap_tendency can already be > 100, of course, no problem with that. The
idea is to easily boost swap_tendency when there is memory pressure, a tiny
inactive list and swappiness close to 0, without waiting for distress to
hit the breakpoint after wasting cpu touching all those ptes marked young
in the failed attempt to find some unmapped pages. distress is a last
resort to avoid hitting oom early; depending on it doesn't provide for
graceful behavior when swappiness is zero or close to zero (swappiness
zero truly deadlocks, actually).

> Want to see if we can present that expression in a more logical fashion,
> and be more careful about the underflows and overflows, and fix the
> potential divide-by-zero?

I may be missing something; if I could see it I could fix it. As for
presenting it in a more logical fashion, I guess all I can do is split it
into different lines. As far as I can tell this is already the correct way
to compute it w.r.t. divide by zero and making sure to avoid overflows. We
multiply by 100 and then shrink the value back immediately every time. We
only want the effect to be visible when active is significantly larger
(order of 100 times larger) than inactive. In all normal conditions, with
quite some pagecache and not 100% mapped, the effect shouldn't be visible
at all. It's only the currently too rough corner cases that we intend to
smooth with this.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  2007-07-31 22:40 ` Andrea Arcangeli
@ 2007-07-31 22:51   ` Andrew Morton
  2007-07-31 23:02     ` Andrea Arcangeli
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2007-07-31 22:51 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-mm, Nick Piggin

On Wed, 1 Aug 2007 00:40:52 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> > Want to see if we can present that expression in a more logical fashion,
> > and be more careful about the underflows and overflows, and fix the
> > potential divide-by-zero?
>
> I may be missing something,

Yeah, I misread the parenthesisation. Sorry.

A nice way of coding this would be:

	/*
	 * comment goes here
	 */
	adjust = zone_page_state(zone, NR_ACTIVE) /
			(zone_page_state(zone, NR_INACTIVE) + 1);

	/*
	 * comment goes here
	 */
	adjust *= (vm_swappiness + 1) / 100;

	/*
	 * comment goes here
	 */
	adjust *= mapped_ratio / 100;

	/*
	 * comment goes here
	 */
	swap_tendency += adjust;

so there's no confusion over parenthesisation or associativity, and the
reader can see the logic as it unfolds. The compiler should do exactly the
same thing.

It is worth expending the extra effort and screen space for clarity in that
part of the kernel, given the amount of trouble it causes and the amount of
time people spend sweating over it. Those would want to be good comments,
too.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  2007-07-31 22:51 ` Andrew Morton
@ 2007-07-31 23:02   ` Andrea Arcangeli
  [not found]         ` <20070801011925.GB20109@mail.ustc.edu.cn>
  0 siblings, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2007-07-31 23:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Nick Piggin

On Tue, Jul 31, 2007 at 03:51:09PM -0700, Andrew Morton wrote:
> Yeah, I misread the parenthesisation. Sorry.

Never mind!

> A nice way of coding this would be:
>
> 	/*
> 	 * comment goes here
> 	 */
> 	adjust = zone_page_state(zone, NR_ACTIVE) /
> 			(zone_page_state(zone, NR_INACTIVE) + 1);
>
> 	/*
> 	 * comment goes here
> 	 */
> 	adjust *= (vm_swappiness + 1) / 100;
>
> 	/*
> 	 * comment goes here
> 	 */
> 	adjust *= mapped_ratio / 100;
>
> 	/*
> 	 * comment goes here
> 	 */
> 	swap_tendency += adjust;
>
> so there's no confusion over parenthesisation or associativity, and the
> reader can see the logic as it unfolds. The compiler should do exactly
> the same thing.
>
> It is worth expending the extra effort and screen space for clarity in
> that part of the kernel, given the amount of trouble it causes and the
> amount of time people spend sweating over it. Those would want to be
> good comments, too.

Ok.

Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -879,6 +879,7 @@ static void shrink_active_list(unsigned
 	long mapped_ratio;
 	long distress;
 	long swap_tendency;
+	long imbalance;

 	if (zone_is_near_oom(zone))
 		goto force_reclaim_mapped;
@@ -912,6 +913,44 @@ static void shrink_active_list(unsigned
 	 * altogether.
 	 */
 	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
+
+	/*
+	 * If there's huge imbalance between active and inactive
+	 * (think active 100 times larger than inactive) we should
+	 * become more permissive, or the system will take too much
+	 * cpu before it start swapping during memory pressure.
+	 * Distress is about avoiding early-oom, this is about
+	 * making swappiness graceful despite setting it to low
+	 * values.
+	 *
+	 * Avoid div by zero with nr_inactive+1, and max resulting
+	 * value is vm_total_pages.
+	 */
+	imbalance = zone_page_state(zone, NR_ACTIVE) /
+		(zone_page_state(zone, NR_INACTIVE) + 1);
+
+	/*
+	 * Reduce the effect of imbalance if swappiness is low,
+	 * this means for a swappiness very low, the imbalance
+	 * must be much higher than 100 for this logic to make
+	 * the difference.
+	 *
+	 * Max temporary value is vm_total_pages*100.
+	 */
+	imbalance *= (vm_swappiness + 1) / 100;
+
+	/*
+	 * If not much of the ram is mapped, makes the imbalance
+	 * less relevant, it's high priority we refill the inactive
+	 * list with mapped pages only in presence of high ratio of
+	 * mapped pages.
+	 *
+	 * Max temporary value is vm_total_pages*100.
+	 */
+	imbalance *= mapped_ratio / 100;
+
+	/* apply imbalance feedback to swap_tendency */
+	swap_tendency += imbalance;

 	/*
 	 * Now use this metric to decide whether to start moving mapped
 	 * memory onto the inactive list.
 	 */

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <20070801011925.GB20109@mail.ustc.edu.cn>]
* Re: make swappiness safer to use
  [not found] ` <20070801011925.GB20109@mail.ustc.edu.cn>
@ 2007-08-01  1:19   ` Fengguang Wu
  [not found]         ` <20070801012222.GA20565@mail.ustc.edu.cn>
  2007-08-01  2:30   ` Andrea Arcangeli
  2 siblings, 0 replies; 17+ messages in thread
From: Fengguang Wu @ 2007-08-01 1:19 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-mm, Nick Piggin

On Wed, Aug 01, 2007 at 01:02:51AM +0200, Andrea Arcangeli wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
[...]
> @@ -912,6 +913,44 @@ static void shrink_active_list(unsigned
>  	 * altogether.
>  	 */
>  	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
> +
> +	/*
> +	 * If there's huge imbalance between active and inactive
> +	 * (think active 100 times larger than inactive) we should
> +	 * become more permissive, or the system will take too much
> +	 * cpu before it start swapping during memory pressure.
> +	 * Distress is about avoiding early-oom, this is about
> +	 * making swappiness graceful despite setting it to low
> +	 * values.
> +	 *
> +	 * Avoid div by zero with nr_inactive+1, and max resulting
> +	 * value is vm_total_pages.
> +	 */
> +	imbalance = zone_page_state(zone, NR_ACTIVE) /
> +		(zone_page_state(zone, NR_INACTIVE) + 1);
> +
> +	/*
> +	 * Reduce the effect of imbalance if swappiness is low,
> +	 * this means for a swappiness very low, the imbalance
> +	 * must be much higher than 100 for this logic to make
> +	 * the difference.
> +	 *
> +	 * Max temporary value is vm_total_pages*100.
> +	 */
> +	imbalance *= (vm_swappiness + 1) / 100;
 	             ~~~~~~~~~~~~~~~~~~~~~~~~~ It will be zero!

Better to scale it up before the division:

	imbalance *= (vm_swappiness + 1) * 1024 / 100;

> +
> +	/*
> +	 * If not much of the ram is mapped, makes the imbalance
> +	 * less relevant, it's high priority we refill the inactive
> +	 * list with mapped pages only in presence of high ratio of
> +	 * mapped pages.
> +	 *
> +	 * Max temporary value is vm_total_pages*100.
> +	 */
> +	imbalance *= mapped_ratio / 100;

	imbalance *= mapped_ratio * 1024 / 100;

> +	/* apply imbalance feedback to swap_tendency */
> +	swap_tendency += imbalance;

	swap_tendency += imbalance / 1024 / 1024;

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <20070801012222.GA20565@mail.ustc.edu.cn>]
* Re: make swappiness safer to use
  [not found] ` <20070801012222.GA20565@mail.ustc.edu.cn>
@ 2007-08-01  1:22   ` Fengguang Wu
  [not found]         ` <20070801013208.GA20085@mail.ustc.edu.cn>
  1 sibling, 0 replies; 17+ messages in thread
From: Fengguang Wu @ 2007-08-01 1:22 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-mm, Nick Piggin

On Wed, Aug 01, 2007 at 09:19:25AM +0800, Fengguang Wu wrote:
> > +	imbalance *= (vm_swappiness + 1) / 100;
>  	             ~~~~~~~~~~~~~~~~~~~~~~~~~ It will be zero!
>
> Better to scale it up before the division:
> 	imbalance *= (vm_swappiness + 1) * 1024 / 100;
>
> > +
> > +	/*
> > +	 * If not much of the ram is mapped, makes the imbalance
> > +	 * less relevant, it's high priority we refill the inactive
> > +	 * list with mapped pages only in presence of high ratio of
> > +	 * mapped pages.
> > +	 *
> > +	 * Max temporary value is vm_total_pages*100.
> > +	 */
> > +	imbalance *= mapped_ratio / 100;
>
> 	imbalance *= mapped_ratio * 1024 / 100;
>
> > +	/* apply imbalance feedback to swap_tendency */
> > +	swap_tendency += imbalance;
>
> 	swap_tendency += imbalance / 1024 / 1024;

Or simply move the two ' / 100' to this last line?

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <20070801013208.GA20085@mail.ustc.edu.cn>]
* Re: make swappiness safer to use
  [not found] ` <20070801013208.GA20085@mail.ustc.edu.cn>
@ 2007-08-01  1:32   ` Fengguang Wu
  2007-08-01  2:33   ` Andrea Arcangeli
  1 sibling, 0 replies; 17+ messages in thread
From: Fengguang Wu @ 2007-08-01 1:32 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-mm, Nick Piggin

Here's the updated patch without underflows.

Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 mm/vmscan.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

--- linux-2.6.22-rc6-mm1.orig/mm/vmscan.c
+++ linux-2.6.22-rc6-mm1/mm/vmscan.c
@@ -887,6 +887,7 @@ static void shrink_active_list(unsigned
 	long mapped_ratio;
 	long distress;
 	long swap_tendency;
+	long imbalance;

 	if (zone_is_near_oom(zone))
 		goto force_reclaim_mapped;
@@ -922,6 +923,46 @@ static void shrink_active_list(unsigned
 	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;

 	/*
+	 * If there's huge imbalance between active and inactive
+	 * (think active 100 times larger than inactive) we should
+	 * become more permissive, or the system will take too much
+	 * cpu before it start swapping during memory pressure.
+	 * Distress is about avoiding early-oom, this is about
+	 * making swappiness graceful despite setting it to low
+	 * values.
+	 *
+	 * Avoid div by zero with nr_inactive+1, and max resulting
+	 * value is vm_total_pages.
+	 */
+	imbalance = zone_page_state(zone, NR_ACTIVE);
+	imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;
+
+	/*
+	 * Reduce the effect of imbalance if swappiness is low,
+	 * this means for a swappiness very low, the imbalance
+	 * must be much higher than 100 for this logic to make
+	 * the difference.
+	 *
+	 * Max temporary value is vm_total_pages*100.
+	 */
+	imbalance *= (vm_swappiness + 1);
+	imbalance /= 100;
+
+	/*
+	 * If not much of the ram is mapped, makes the imbalance
+	 * less relevant, it's high priority we refill the inactive
+	 * list with mapped pages only in presence of high ratio of
+	 * mapped pages.
+	 *
+	 * Max temporary value is vm_total_pages*100.
+	 */
+	imbalance *= mapped_ratio;
+	imbalance /= 100;
+
+	/* apply imbalance feedback to swap_tendency */
+	swap_tendency += imbalance;
+
+	/*
 	 * Now use this metric to decide whether to start moving mapped
 	 * memory onto the inactive list.
 	 */

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  [not found] ` <20070801013208.GA20085@mail.ustc.edu.cn>
  2007-08-01  1:32 ` Fengguang Wu
@ 2007-08-01  2:33 ` Andrea Arcangeli
  2007-08-06 18:21   ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2007-08-01 2:33 UTC (permalink / raw)
To: Fengguang Wu; +Cc: Andrew Morton, linux-mm, Nick Piggin

On Wed, Aug 01, 2007 at 09:32:08AM +0800, Fengguang Wu wrote:
> Here's the updated patch without underflows.

this is ok.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  2007-08-01  2:33 ` Andrea Arcangeli
@ 2007-08-06 18:21   ` Andrew Morton
  [not found]         ` <20070807050032.GA16179@mail.ustc.edu.cn>
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2007-08-06 18:21 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Fengguang Wu, linux-mm, Nick Piggin

On Wed, 1 Aug 2007 04:33:15 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> On Wed, Aug 01, 2007 at 09:32:08AM +0800, Fengguang Wu wrote:
> > Here's the updated patch without underflows.
>
> this is ok.

I lost the plot a bit here. Can I please have a resend of the full and
final patch?

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <20070807050032.GA16179@mail.ustc.edu.cn>]
* Re: make swappiness safer to use
  [not found] ` <20070807050032.GA16179@mail.ustc.edu.cn>
@ 2007-08-07  5:00   ` Fengguang Wu
  2007-11-12  2:07   ` YAMAMOTO Takashi
  1 sibling, 0 replies; 17+ messages in thread
From: Fengguang Wu @ 2007-08-07 5:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Nick Piggin

On Mon, Aug 06, 2007 at 11:21:54AM -0700, Andrew Morton wrote:
> On Wed, 1 Aug 2007 04:33:15 +0200 Andrea Arcangeli <andrea@suse.de> wrote:
>
> > On Wed, Aug 01, 2007 at 09:32:08AM +0800, Fengguang Wu wrote:
> > > Here's the updated patch without underflows.
> >
> > this is ok.
>
> I lost the plot a bit here. Can I please have a resend of the full and
> final patch?

OK, here it is.

===
From: Andrea Arcangeli <andrea@suse.de>
Subject: make swappiness safer to use

Swappiness isn't a safe sysctl. Setting it to 0, for example, can hang a
system. That's a corner case, but even setting it to 10 or lower can waste
enormous amounts of cpu without making much progress. We have customers who
want to use swappiness but can't because of the current implementation (if
you change it so the system stops swapping, it really stops swapping, and
nothing works sanely anymore if you really had to swap something to make
progress). This patch from Kurt Garloff makes swappiness safer to use (no
more huge cpu usage or hangs with low swappiness values).

I think prev_priority can also be nuked, since it wastes 4 bytes per zone
(that would be an incremental patch, but I'll wait for nr_scan_[in]active
to be nuked first for similar reasons). Clearly somebody at some point
noticed how broken that thing was and had to add min(priority,
prev_priority) to give it some reliability, but they didn't go the last
mile and nuke prev_priority too. Calculating distress only as a function of
the not-racy priority is correct and surely more than enough, without
having to add randomness into the equation. The patch is tested on older
kernels, but it compiles and it's quite simple so...

Overall I'm not very satisfied with the swappiness tweak, since it doesn't
really do anything with the dirty pagecache that may be inactive. We need
another kind of tweak that controls the inactive scan and tunes the
can_writepage feature (not yet in mainline despite having been submitted a
few times), not only the active one. That new tweak will tell the kernel
how hard to scan the inactive list for pure clean pagecache (something the
mainline kernel isn't capable of yet). We already have that feature working
in all our enterprise kernels with the default reasonable tune, or they
can't even run a readonly backup with tar without triggering huge write
I/O. I think it should also be available in mainline later.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 mm/vmscan.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

--- linux-2.6.22-rc6-mm1.orig/mm/vmscan.c
+++ linux-2.6.22-rc6-mm1/mm/vmscan.c
@@ -887,6 +887,7 @@ static void shrink_active_list(unsigned
 	long mapped_ratio;
 	long distress;
 	long swap_tendency;
+	long imbalance;

 	if (zone_is_near_oom(zone))
 		goto force_reclaim_mapped;
@@ -922,6 +923,46 @@ static void shrink_active_list(unsigned
 	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;

 	/*
+	 * If there's huge imbalance between active and inactive
+	 * (think active 100 times larger than inactive) we should
+	 * become more permissive, or the system will take too much
+	 * cpu before it start swapping during memory pressure.
+	 * Distress is about avoiding early-oom, this is about
+	 * making swappiness graceful despite setting it to low
+	 * values.
+	 *
+	 * Avoid div by zero with nr_inactive+1, and max resulting
+	 * value is vm_total_pages.
+	 */
+	imbalance = zone_page_state(zone, NR_ACTIVE);
+	imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;
+
+	/*
+	 * Reduce the effect of imbalance if swappiness is low,
+	 * this means for a swappiness very low, the imbalance
+	 * must be much higher than 100 for this logic to make
+	 * the difference.
+	 *
+	 * Max temporary value is vm_total_pages*100.
+	 */
+	imbalance *= (vm_swappiness + 1);
+	imbalance /= 100;
+
+	/*
+	 * If not much of the ram is mapped, makes the imbalance
+	 * less relevant, it's high priority we refill the inactive
+	 * list with mapped pages only in presence of high ratio of
+	 * mapped pages.
+	 *
+	 * Max temporary value is vm_total_pages*100.
+	 */
+	imbalance *= mapped_ratio;
+	imbalance /= 100;
+
+	/* apply imbalance feedback to swap_tendency */
+	swap_tendency += imbalance;
+
+	/*
 	 * Now use this metric to decide whether to start moving mapped
 	 * memory onto the inactive list.
 	 */

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  [not found] ` <20070807050032.GA16179@mail.ustc.edu.cn>
  2007-08-07  5:00 ` Fengguang Wu
@ 2007-11-12  2:07 ` YAMAMOTO Takashi
  1 sibling, 0 replies; 17+ messages in thread
From: YAMAMOTO Takashi @ 2007-11-12 2:07 UTC (permalink / raw)
To: fengguang.wu; +Cc: akpm, andrea, linux-mm, npiggin

> +	/*
> +	 * Reduce the effect of imbalance if swappiness is low,
> +	 * this means for a swappiness very low, the imbalance
> +	 * must be much higher than 100 for this logic to make
> +	 * the difference.
> +	 *
> +	 * Max temporary value is vm_total_pages*100.
> +	 */
> +	imbalance *= (vm_swappiness + 1);
> +	imbalance /= 100;

why vm_swappiness rather than sc->swappiness?

YAMAMOTO Takashi

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  [not found] ` <20070801011925.GB20109@mail.ustc.edu.cn>
  2007-08-01  1:19 ` Fengguang Wu
  [not found] ` <20070801012222.GA20565@mail.ustc.edu.cn>
@ 2007-08-01  2:30 ` Andrea Arcangeli
  2 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2007-08-01 2:30 UTC (permalink / raw)
To: Fengguang Wu; +Cc: Andrew Morton, linux-mm, Nick Piggin

On Wed, Aug 01, 2007 at 09:19:25AM +0800, Fengguang Wu wrote:
> On Wed, Aug 01, 2007 at 01:02:51AM +0200, Andrea Arcangeli wrote:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> [...]
> > +	/*
> > +	 * Reduce the effect of imbalance if swappiness is low,
> > +	 * this means for a swappiness very low, the imbalance
> > +	 * must be much higher than 100 for this logic to make
> > +	 * the difference.
> > +	 *
> > +	 * Max temporary value is vm_total_pages*100.
> > +	 */
> > +	imbalance *= (vm_swappiness + 1) / 100;
>  	             ~~~~~~~~~~~~~~~~~~~~~~~~~ It will be zero!

You know what, while editing I even wrote "imbalance = imbalance *
(vm_swappiness + 1) / 100" because I felt it was different; then I edited
it back to the cut-and-pasted version thinking it was actually the same,
but it's not.

> Better to scale it up before the division:
> 	imbalance *= (vm_swappiness + 1) * 1024 / 100;
>
> > +	/*
> > +	 * If not much of the ram is mapped, makes the imbalance
> > +	 * less relevant, it's high priority we refill the inactive
> > +	 * list with mapped pages only in presence of high ratio of
> > +	 * mapped pages.
> > +	 *
> > +	 * Max temporary value is vm_total_pages*100.
> > +	 */
> > +	imbalance *= mapped_ratio / 100;
>
> 	imbalance *= mapped_ratio * 1024 / 100;
>
> > +	/* apply imbalance feedback to swap_tendency */
> > +	swap_tendency += imbalance;
>
> 	swap_tendency += imbalance / 1024 / 1024;

No, this is exactly what we want to avoid: we must not let the intermediate
value grow more than 100 times higher than the number of pages. The fix is
only to write "imbalance = imbalance *" instead of "imbalance *=".

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: make swappiness safer to use
  2007-07-31 21:52 make swappiness safer to use Andrea Arcangeli
  2007-07-31 22:12 ` Andrew Morton
@ 2007-07-31 23:09 ` Andrew Morton
  2007-07-31 23:23 ` Andrea Arcangeli
  2007-07-31 23:32 ` Martin Bligh
  1 sibling, 2 replies; 17+ messages in thread

From: Andrew Morton @ 2007-07-31 23:09 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-mm, Nick Piggin, Martin Bligh

On Tue, 31 Jul 2007 23:52:28 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> I think the prev_priority can also be nuked since it wastes 4 bytes
> per zone (that would be an incremental patch but I wait the
> nr_scan_[in]active to be nuked first for similar reasons). Clearly
> somebody at some point noticed how broken that thing was and they had
> to add min(priority, prev_priority) to give it some reliability, but
> they didn't go the last mile to nuke prev_priority too. Calculating
> distress only in function of not-racy priority is correct and sure
> more than enough without having to add randomness into the equation.

I don't recall seeing any such patch, and I suspect it'd cause problems anyway.

If we were to base swap_tendency purely on sc->priority, then the VM would incorrectly fail to deactivate mapped pages until the scanning had reached a sufficiently high (i.e. numerically low) scanning priority.

The net effect would be that each time some process runs shrink_active_list(), some pages would be incorrectly retained on the active list, and only after a while would the code start moving mapped pages down to the inactive list. In fact, I think that was (effectively) the behaviour which we had in there, and it caused problems with some workload which Martin was looking at, and things got better when we fixed it.

Anyway, we can say more if we see the patch (or, more accurately, the analysis which comes with that patch).
* Re: make swappiness safer to use
  2007-07-31 23:09 ` Andrew Morton
@ 2007-07-31 23:23 ` Andrea Arcangeli
  2007-07-31 23:32 ` Martin Bligh
  1 sibling, 0 replies; 17+ messages in thread

From: Andrea Arcangeli @ 2007-07-31 23:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Nick Piggin, Martin Bligh

On Tue, Jul 31, 2007 at 04:09:43PM -0700, Andrew Morton wrote:
> On Tue, 31 Jul 2007 23:52:28 +0200
> Andrea Arcangeli <andrea@suse.de> wrote:
>
> > I think the prev_priority can also be nuked since it wastes 4 bytes
> > per zone (that would be an incremental patch but I wait the
> > nr_scan_[in]active to be nuked first for similar reasons). Clearly
> > somebody at some point noticed how broken that thing was and they had
> > to add min(priority, prev_priority) to give it some reliability, but
> > they didn't go the last mile to nuke prev_priority too. Calculating
> > distress only in function of not-racy priority is correct and sure
> > more than enough without having to add randomness into the equation.
>
> I don't recall seeing any such patch and I suspect it'd cause problems
> anyway.
>
> If we were to base swap_tendency purely on sc->priority then the VM would
> incorrectly fail to deactivate mapped pages until the scanning had reached
> a sufficiently high (ie: low) scanning priority.
>
> The net effect would be that each time some process runs
> shrink_active_list(), some pages would be incorrectly retained on the
> active list and after a while, the code would start moving mapped pages down
> to the inactive list.
>
> In fact, I think that was (effectively) the behaviour which we had in
> there, and it caused problems with some workload which Martin was looking
> at and things got better when we fixed it.
>
> Anyway, we can say more if we see the patch (or, more accurately, the
> analysis which comes with that patch).
My reasoning for prev_priority not being such a great feature is that, between the two, sc->priority is critically more important because it's being set for the current run; prev_priority is set later (originally only prev_priority was used as a failsafe for the swappiness logic; these days sc->priority is mixed in too, because clearly prev_priority alone was not enough).

But my whole dislike for those prev_* things is that they're all SMP-racy. Your beloved prev_priority will go back to 12 if a new try_to_free_pages runs with a different gfpmask and/or a different allocation order, screwing the other task on the other CPU that is having such a hard time finding unmapped pages to free because it has a stricter gfpmask (perhaps not allowed to eat into dcache/icache) or a bigger order (perhaps even looping nearly forever thanks to the order <= PAGE_ALLOC_COSTLY_ORDER check). So I have a hard time appreciating the prev_priority thing, because like nr_scan_[in]active it's imperfect.

Comments like this one also show the imperfection:

	/* Now that we've scanned all the zones at this priority level, note
	 * that level within the zone so that the next thread

That's a lie: there is no such thing as a "next thread". All threads may be running in parallel on multiple CPUs, or they may be context switching. The comment would only be remotely correct if there were a big global semaphore around the VM, which will never happen.

It's really in the same category as nr_scan_[in]active, and my dislike for those things is exactly the same and motivated by mostly the same reasons.
* Re: make swappiness safer to use
  2007-07-31 23:09 ` Andrew Morton
  2007-07-31 23:23 ` Andrea Arcangeli
@ 2007-07-31 23:32 ` Martin Bligh
  2007-07-31 23:49 ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread

From: Martin Bligh @ 2007-07-31 23:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Nick Piggin

Andrew Morton wrote:
> On Tue, 31 Jul 2007 23:52:28 +0200
> Andrea Arcangeli <andrea@suse.de> wrote:
>
>> I think the prev_priority can also be nuked since it wastes 4 bytes
>> per zone (that would be an incremental patch but I wait the
>> nr_scan_[in]active to be nuked first for similar reasons). Clearly
>> somebody at some point noticed how broken that thing was and they had
>> to add min(priority, prev_priority) to give it some reliability, but

Yeah, that was me.

>> they didn't go the last mile to nuke prev_priority too. Calculating
>> distress only in function of not-racy priority is correct and sure
>> more than enough without having to add randomness into the equation.
>
> I don't recall seeing any such patch and I suspect it'd cause problems
> anyway.
>
> If we were to base swap_tendency purely on sc->priority then the VM would
> incorrectly fail to deactivate mapped pages until the scanning had reached
> a sufficiently high (ie: low) scanning priority.
>
> The net effect would be that each time some process runs
> shrink_active_list(), some pages would be incorrectly retained on the
> active list and after a while, the code would start moving mapped pages down
> to the inactive list.
>
> In fact, I think that was (effectively) the behaviour which we had in
> there, and it caused problems with some workload which Martin was looking
> at and things got better when we fixed it.

I think I described it reasonably in the changelog ...
if not, IIRC the issue was that one task was doing an "easy" reclaim (e.g. it could wait, do writeback, etc.) and one was doing a "hard" reclaim (e.g. no writeback), and the one having an easy time would set prio back to def_priority. So having a local variable made way more sense to me.

It's a race condition that's bloody hard to recreate, and very hard to prove anything about. I did publish exact traces of what happened at the time, if anyone wants to go back and look.

> Anyway, we can say more if we see the patch (or, more accurately, the
> analysis which comes with that patch).

I must say, I don't see what's wrong with killing it and having it local. We're rotating the list all the time, IIRC ... so if we start off with only 1/2^12th of the list ... does it matter? We'll just crank it up higher fairly quickly. Not sure why we want to start with the same chunk size we did last time.
* Re: make swappiness safer to use
  2007-07-31 23:32 ` Martin Bligh
@ 2007-07-31 23:49 ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread

From: Andrew Morton @ 2007-07-31 23:49 UTC (permalink / raw)
To: Martin Bligh; +Cc: Andrea Arcangeli, linux-mm, Nick Piggin

On Tue, 31 Jul 2007 16:32:06 -0700 Martin Bligh <mbligh@mbligh.org> wrote:

> > Anyway, we can say more if we see the patch (or, more accurately, the
> > analysis which comes with that patch).
>
> I must say, I don't see what's wrong with killing it and having it
> local. We're rotating the list all the time, IIRC ... so if we start
> off with only 1/2^12th of the list ... does it matter? we'll just
> crank it up higher fairly quickly. Not sure why we want to start
> with the same chunk size we did last time.

This scenario:

- A thread goes into shrink_active_list() and does scan, scan, scan, finding only mapped pages.

- Eventually we reach a sufficiently high priority to flip into reclaim-mapped mode.

- Now we quickly move SWAP_CLUSTER_MAX pages onto the inactive list, and we're done.

So we scanned a few thousand pages, then moved 32-odd down to the inactive list. Now someone else comes in and does some reclaim. It would be bad to scan another few thousand pages and then move another 32-odd pages down to the inactive list. Think what that pattern looks like: lumps of 32 pages with a few thousand pages between them getting deactivated.

To fix this, we attempt to start scanning in the state which the zone was originally in: i.e. the state which this caller of shrink_active_list() would have discovered for himself _anyway_, after enough pages had been pointlessly recirculated.

Can the current implementation make mistakes? Sure. But I'd suggest that it will make far fewer than (thousands/32) mistakes.
Thread overview: 17+ messages
2007-07-31 21:52 make swappiness safer to use Andrea Arcangeli
2007-07-31 22:12 ` Andrew Morton
2007-07-31 22:40 ` Andrea Arcangeli
2007-07-31 22:51 ` Andrew Morton
2007-07-31 23:02 ` Andrea Arcangeli
[not found] ` <20070801011925.GB20109@mail.ustc.edu.cn>
2007-08-01 1:19 ` Fengguang Wu
[not found] ` <20070801012222.GA20565@mail.ustc.edu.cn>
2007-08-01 1:22 ` Fengguang Wu
[not found] ` <20070801013208.GA20085@mail.ustc.edu.cn>
2007-08-01 1:32 ` Fengguang Wu
2007-08-01 2:33 ` Andrea Arcangeli
2007-08-06 18:21 ` Andrew Morton
[not found] ` <20070807050032.GA16179@mail.ustc.edu.cn>
2007-08-07 5:00 ` Fengguang Wu
2007-11-12 2:07 ` YAMAMOTO Takashi
2007-08-01 2:30 ` Andrea Arcangeli
2007-07-31 23:09 ` Andrew Morton
2007-07-31 23:23 ` Andrea Arcangeli
2007-07-31 23:32 ` Martin Bligh
2007-07-31 23:49 ` Andrew Morton