From: Mel Gorman <mel@csn.ul.ie>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>,
linux-mm@kvack.org, Nick Piggin <npiggin@suse.de>,
Chris Mason <chris.mason@oracle.com>,
Jens Axboe <jens.axboe@oracle.com>,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
Date: Fri, 12 Mar 2010 10:47:12 +0000 [thread overview]
Message-ID: <20100312104712.GB18274@csn.ul.ie> (raw)
In-Reply-To: <20100312020526.d424f2a8.akpm@linux-foundation.org>
On Fri, Mar 12, 2010 at 02:05:26AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>
> >
> >
> > Andrew Morton wrote:
> > > On Mon, 8 Mar 2010 11:48:20 +0000
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > >
> > >> Under memory pressure, the page allocator and kswapd can go to sleep using
> > >> congestion_wait(). In two of these cases, it may not be the appropriate
> > >> action as congestion may not be the problem.
> > >
> > > clear_bdi_congested() is called each time a write completes and the
> > > queue is below the congestion threshold.
> > >
> > > So if the page allocator or kswapd call congestion_wait() against a
> > > non-congested queue, they'll wake up on the very next write completion.
> >
> > Well the issue came up in all kind of loads where you don't have any
> > writes at all that can wake up congestion_wait.
> > Thats true for several benchmarks, but also real workload as well e.g. A
> > backup job reading almost all files sequentially and pumping out stuff
> > via network.
>
> Why is reclaim going into congestion_wait() at all if there's heaps of
> clean reclaimable pagecache lying around?
>
> (I don't thing the read side of the congestion_wqh[] has ever been used, btw)
>
I believe it's a race albeit one that has been there a long time.
In __alloc_pages_direct_reclaim, a process does approximately the
following
1. Enters direct reclaim
2. Calls cond_reched()
3. Drain pages if necessary
4. Attempt to allocate a page
Between steps 2 and 3, it's possible to have reclaimed the pages but
another process allocate them. It then proceeds and decides try again
but calls congestion_wait() before it loops around.
Plenty of read cache reclaimed but no forward progress.
> > > Hence the above-quoted claim seems to me to be a significant mis-analysis and
> > > perhaps explains why the patchset didn't seem to help anything?
> >
> > While I might have misunderstood you and it is a mis-analysis in your
> > opinion, it fixes a -80% Throughput regression on sequential read
> > workloads, thats not nothing - its more like absolutely required :-)
> >
> > You might check out the discussion with the subject "Performance
> > regression in scsi sequential throughput (iozone) due to "e084b -
> > page-allocator: preserve PFN ordering when __GFP_COLD is set"".
> > While the original subject is misleading from todays point of view, it
> > contains a lengthy discussion about exactly when/why/where time is lost
> > due to congestion wait with a lot of traces, counters, data attachments
> > and such stuff.
>
> Well if we're not encountering lots of dirty pages in reclaim then we
> shouldn't be waiting for writes to retire, of course.
>
> But if we're not encountering lots of dirty pages in reclaim, we should
> be reclaiming pages, normally.
>
We probably are.
> I could understand reclaim accidentally going into congestion_wait() if
> it hit a large pile of pages which are unreclaimable for reasons other
> than being dirty, but is that happening in this case?
>
Probably not. It's almost certainly the race I described above.
> If not, we broke it again.
>
We were broken with respect to this in the first place. That
cond_reched() is badly placed and waiting on congestion when congestion
might not be involved is also a bit odd.
It's possible that Christian's specific problem would also be addressed
by the following patch. Christian, willing to test?
It still feels a bit unnatural though that the page allocator waits on
congestion when what it really cares about is watermarks. Even if this
patch works for Christian, I think it still has merit so will kick it a
few more times.
==== CUT HERE ====
page-allocator: Attempt page allocation immediately after direct reclaim
After a process completes direct reclaim it calls cond_resched() as
potentially it has been running a long time. When it wakes up, it
attempts to allocate a page. There is a large window during which
another process can allocate the pages reclaimed by direct reclaim. This
patch attempts to allocate a page immediately after direct reclaim but
will still go to sleep afterwards if its quantum has expired.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/page_alloc.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8182c8..973b7fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1721,8 +1721,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
lockdep_clear_current_reclaim_state();
p->flags &= ~PF_MEMALLOC;
- cond_resched();
-
if (order != 0)
drain_all_pages();
@@ -1731,6 +1729,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
migratetype);
+
+ cond_resched();
+
return page;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2010-03-12 10:47 UTC|newest]
Thread overview: 68+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-03-08 11:48 Mel Gorman
2010-03-08 11:48 ` [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion Mel Gorman
2010-03-09 13:35 ` Nick Piggin
2010-03-09 14:17 ` Mel Gorman
2010-03-09 15:03 ` Nick Piggin
2010-03-09 15:42 ` Christian Ehrhardt
2010-03-09 18:22 ` Mel Gorman
2010-03-10 2:38 ` Nick Piggin
2010-03-09 17:35 ` Mel Gorman
2010-03-10 2:35 ` Nick Piggin
2010-03-09 15:50 ` Christoph Lameter
2010-03-09 15:56 ` Christian Ehrhardt
2010-03-09 16:09 ` Christoph Lameter
2010-03-09 17:01 ` Mel Gorman
2010-03-09 17:11 ` Christoph Lameter
2010-03-09 17:30 ` Mel Gorman
2010-03-08 11:48 ` [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed Mel Gorman
2010-03-09 9:53 ` Nick Piggin
2010-03-09 10:08 ` Mel Gorman
2010-03-09 10:23 ` Nick Piggin
2010-03-09 10:36 ` Mel Gorman
2010-03-09 11:11 ` Nick Piggin
2010-03-09 11:29 ` Mel Gorman
2010-03-08 11:48 ` [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion Mel Gorman
2010-03-09 10:00 ` Nick Piggin
2010-03-09 10:21 ` Mel Gorman
2010-03-09 10:32 ` Nick Piggin
2010-03-11 23:41 ` [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure Andrew Morton
2010-03-12 6:39 ` Christian Ehrhardt
2010-03-12 7:05 ` Andrew Morton
2010-03-12 10:47 ` Mel Gorman [this message]
2010-03-12 12:15 ` Christian Ehrhardt
2010-03-12 14:37 ` Andrew Morton
2010-03-15 12:29 ` Mel Gorman
2010-03-15 14:45 ` Christian Ehrhardt
2010-03-15 12:34 ` Christian Ehrhardt
2010-03-15 20:09 ` Andrew Morton
2010-03-16 10:11 ` Mel Gorman
2010-03-18 17:42 ` Mel Gorman
2010-03-22 23:50 ` Mel Gorman
2010-03-23 14:35 ` Christian Ehrhardt
2010-03-23 21:35 ` Corrado Zoccolo
2010-03-24 11:48 ` Mel Gorman
2010-03-24 12:56 ` Corrado Zoccolo
2010-03-23 22:29 ` Rik van Riel
2010-03-24 14:50 ` Mel Gorman
2010-04-19 12:22 ` Christian Ehrhardt
2010-04-19 21:44 ` Johannes Weiner
2010-04-20 7:20 ` Christian Ehrhardt
2010-04-20 8:54 ` Christian Ehrhardt
2010-04-20 15:32 ` Johannes Weiner
2010-04-20 17:22 ` Rik van Riel
2010-04-21 4:23 ` Christian Ehrhardt
2010-04-21 7:35 ` Christian Ehrhardt
2010-04-21 13:19 ` Rik van Riel
2010-04-22 6:21 ` Christian Ehrhardt
2010-04-26 10:59 ` Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2 Christian Ehrhardt
2010-04-26 11:59 ` KOSAKI Motohiro
2010-04-26 12:43 ` Christian Ehrhardt
2010-04-26 14:20 ` Rik van Riel
2010-04-27 14:00 ` Christian Ehrhardt
2010-04-21 9:03 ` [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure Johannes Weiner
2010-04-21 13:20 ` Rik van Riel
2010-04-20 14:40 ` Rik van Riel
2010-03-24 2:38 ` Greg KH
2010-03-24 11:49 ` Mel Gorman
2010-03-24 13:13 ` Johannes Weiner
2010-03-12 9:09 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100312104712.GB18274@csn.ul.ie \
--to=mel@csn.ul.ie \
--cc=akpm@linux-foundation.org \
--cc=chris.mason@oracle.com \
--cc=ehrhardt@linux.vnet.ibm.com \
--cc=jens.axboe@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=npiggin@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox