On Fri, May 20, 2011 at 12:19 PM, Minchan Kim wrote: > On Fri, May 20, 2011 at 12:01:12PM -0400, Andrew Lutomirski wrote: >> On Fri, May 20, 2011 at 11:33 AM, Minchan Kim wrote: >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c >> > index 8bfd450..a5c01e9 100644 >> > --- a/mm/vmscan.c >> > +++ b/mm/vmscan.c >> > @@ -1430,7 +1430,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, >> > >> >        /* Check if we should syncronously wait for writeback */ >> >        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) { >> > +               unsigned long nr_active; >> >                set_reclaim_mode(priority, sc, true); >> > +               nr_active = clear_active_flags(&page_list, NULL); >> > +               count_vm_events(PGDEACTIVATE, nr_active); >> >                nr_reclaimed += shrink_page_list(&page_list, zone, sc); >> >        } >> > >> > -- >> >> I'm now running that patch *without* the pgdat_balanced fix or the >> need_resched check.  The VM_BUG_ON doesn't happen but I still get > > Please forget need_resched. > Instead of it, could you test shrink_slab patch with !pgdat_balanced? > > @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink, >       if (scanned == 0) >               scanned = SWAP_CLUSTER_MAX; > > -       if (!down_read_trylock(&shrinker_rwsem)) > -               return 1;       /* Assume we'll be able to shrink next time */ > +       if (!down_read_trylock(&shrinker_rwsem)) { > +               /* Assume we'll be able to shrink next time */ > +               ret = 1; > +               goto out; > +       } > >       list_for_each_entry(shrinker, &shrinker_list, list) { >               unsigned long long delta; > @@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink, >               shrinker->nr += total_scan; >       } >       up_read(&shrinker_rwsem); > +out: > +       cond_resched(); >       return ret; >  } > >> incorrect OOM kills. >> >> However, if I replace the check with: >> >>       if (false &&should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) { >> >> then my system lags under bad memory pressure but recovers without >> OOMs or oopses. > > I agree you can see OOM but oops? Did you see any oops? No oops. I've now reproduced the OOPS with both the if (false) change and the clear_active_flags change. Also, would this version be better? I think your version overcounts nr_scanned, but I'm not sure what effect that would have. diff --git a/mm/vmscan.c b/mm/vmscan.c index 3f44b81..d1dabc9 100644 @@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, /* Check if we should syncronously wait for writeback */ if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) { + unsigned long nr_active, old_nr_scanned; set_reclaim_mode(priority, sc, true); + nr_active = clear_active_flags(&page_list, NULL); + count_vm_events(PGDEACTIVATE, nr_active); + old_nr_scanned = sc->nr_scanned; nr_reclaimed += shrink_page_list(&page_list, zone, sc); + sc->nr_scanned = old_nr_scanned; } local_irq_disable(); I just tested 2.6.38.6 with the attached patch. It survived dirty_ram and test_mempressure without any problems other than slowness, but when I hit ctrl-c to stop test_mempressure, I got the attached oom. --Andy