(copying Minchan because I just asked him the same question.)
Thank you, I can try this on ToT, although I think the problem is not with the OOM killer itself but earlier: invoking the OOM killer at all seems unnecessary and wrong. Here's the question.
The general strategy for page allocation seems to be (please correct me as needed):
1. look in the free lists
2. if that did not succeed, try to reclaim, then try again to allocate
3. keep trying as long as progress is made (i.e. something was reclaimed)
4. if no progress was made and no pages were found, invoke the OOM killer.
I'd like to know if that "progress is made" notion is possibly buggy. Specifically, does it mean "progress is made by this task"? Is it possible that resource contention creates a situation where most tasks in most cases can reclaim and allocate, but one task randomly fails to make progress?
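To make the question concrete, here is a toy model in C of the four steps above. It is only a sketch: every name in it (toy_zone, toy_alloc_page, toy_rival_steals, and so on) is made up for illustration and none is a kernel API. The assumption that "progress" is counted per task, and that other tasks can take pages between this task's reclaim and its retry, is exactly what I'm asking about, not a claim about how the real allocator behaves.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are kernel APIs. */
struct toy_zone {
	unsigned long free_pages;   /* pages currently on the free list */
	unsigned long reclaimable;  /* pages that reclaim could still free */
};

enum toy_result { TOY_ALLOC_OK, TOY_OOM };

/* Step 1: take one page from the free list, if any. */
static bool toy_try_alloc(struct toy_zone *z)
{
	if (z->free_pages == 0)
		return false;
	z->free_pages--;
	return true;
}

/* Step 2: reclaim up to `batch` pages; returns how many THIS caller freed. */
static unsigned long toy_reclaim(struct toy_zone *z, unsigned long batch)
{
	unsigned long n = z->reclaimable < batch ? z->reclaimable : batch;

	z->reclaimable -= n;
	z->free_pages += n;
	return n;
}

/* Hypothetical contention: other tasks grab up to `n` free pages. */
static void toy_rival_steals(struct toy_zone *z, unsigned long n)
{
	z->free_pages -= (z->free_pages < n ? z->free_pages : n);
}

/*
 * Steps 1-4 for a single task. `rival_takes` is how many pages the
 * other tasks take between our reclaim and our retry (assumed model).
 */
static enum toy_result toy_alloc_page(struct toy_zone *z,
				      unsigned long batch,
				      unsigned long rival_takes)
{
	unsigned long progress;

	assert(batch > 0);
	for (;;) {
		if (toy_try_alloc(z))           /* step 1: free lists */
			return TOY_ALLOC_OK;

		progress = toy_reclaim(z, batch); /* step 2: reclaim */
		toy_rival_steals(z, rival_takes); /* contention window */

		if (toy_try_alloc(z))           /* step 2: retry */
			return TOY_ALLOC_OK;

		if (progress == 0)              /* step 4: no progress */
			return TOY_OOM;
		/* step 3: this task made progress, so keep trying */
	}
}
```

Starting from free_pages = 0, reclaimable = 8 and batch = 4: with rival_takes = 0 the call succeeds after one reclaim pass. With rival_takes = 4 this task reclaims all 8 pages over two passes, gets none of them, sees zero progress on the third pass, and ends up at TOY_OOM even though every page it freed was successfully allocated by someone else.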