On 21/06/11 14:07, Mel Gorman wrote:
> On Tue, Jun 21, 2011 at 12:59:00PM +0100, Pádraig Brady wrote:
>> On 21/06/11 12:34, Mel Gorman wrote:
>>> On Tue, Jun 21, 2011 at 11:47:35AM +0100, Pádraig Brady wrote:
>>>> On 21/06/11 11:39, Mel Gorman wrote:
>>>>> On Tue, Jun 21, 2011 at 10:53:02AM +0100, Pádraig Brady wrote:
>>>>>> I tried the 2 patches here to no avail:
>>>>>> http://marc.info/?l=linux-mm&m=130503811704830&w=2
>>>>>>
>>>>>> I originally logged this at:
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=712019
>>>>>>
>>>>>> I can compile up and quickly test any suggestions.
>>>>>>
>>>>>
>>>>> I recently looked through what kswapd does and there are a number
>>>>> of problem areas. Unfortunately, I haven't gotten around to doing
>>>>> anything about it yet or running the test cases to see if they are
>>>>> really problems. In your case, the following is a strong possibility
>>>>> though. This should be applied on top of the two patches merged from
>>>>> that thread.
>>>>>
>>>>> This is not tested in any way, based on 3.0-rc3
>>>>
>>>> This does not fix the issue here.
>>>>
>>>
>>> I made a silly mistake here. When you mentioned two patches applied,
>>> I assumed you meant two patches that were finally merged from that
>>> discussion thread instead of looking at your linked mail. Now that I
>>> have checked, I think you applied the SLUB patches while the patches
>>> I was thinking of are;
>>>
>>> [afc7e326: mm: vmscan: correct use of pgdat_balanced in sleeping_prematurely]
>>> [f06590bd: mm: vmscan: correctly check if reclaimer should schedule during shrink_slab]
>>>
>>> The first one in particular has been reported by another user to fix
>>> hangs related to copying large files. I'm assuming you are testing
>>> against the Fedora kernel. As these patches were merged for 3.0-rc1, can
>>> you check if applying just these two patches to your kernel helps?
>>
>> These patches are already present in my 2.6.38.8-32.fc15.x86_64 kernel :(
>>
>
> Would it be possible to record a profile while it is livelocked to check
> if it's stuck in this loop in shrink_slab()?

I did:

perf record -a -g sleep 10
perf report --stdio > livelock.perf #attached
perf annotate shrink_slab -k rpmbuild/BUILD/kernel-2.6.38.fc15/linux-2.6.38.x86_64/vmlinux > shrink_slab.annotate #attached

>
>         while (total_scan >= SHRINK_BATCH) {
>                 long this_scan = SHRINK_BATCH;
>                 int shrink_ret;
>                 int nr_before;
>
>                 nr_before = do_shrinker_shrink(shrinker, shrink, 0);
>                 shrink_ret = do_shrinker_shrink(shrinker, shrink,
>                                                 this_scan);
>                 if (shrink_ret == -1)
>                         break;
>                 if (shrink_ret < nr_before)
>                         ret += nr_before - shrink_ret;
>                 count_vm_events(SLABS_SCANNED, this_scan);
>                 total_scan -= this_scan;
>
>                 cond_resched();
>         }

shrink_slab() looks to be the culprit, but it seems to be
the loop outside the above that is spinning.

> Also, can you post the output of sysrq+m at a few different times while
> kswapd is spinning heavily? I want to see if all_unreclaimable has been
> set on zones with a reasonable amount of memory. If they are, it's
> possible for kswapd to be in a continual loop calling shrink_slab() and
> skipping over normal page reclaim because all_unreclaimable is set
> everywhere until a page is freed.

I did that 3 times. Attached.

cheers,
Padraig.