* writepage and high performance filesystems
@ 2005-08-23 3:38 Rahul Iyer
From: Rahul Iyer @ 2005-08-23 3:38 UTC (permalink / raw)
To: Rik van Riel, Marcelo Tosatti, linux-mm
Hi,
As part of some research I was doing, I was looking at high-bandwidth
filesystems that aim to serve the data requirements of computing
clusters. We think we are facing an issue here...
When memory pressure is felt, kswapd is woken up and calls
balance_pgdat(), which eventually results in pageout() being called.
From the pageout() function in 2.6.11:
325 SetPageReclaim(page);
326 res = mapping->a_ops->writepage(page, &wbc);
This results in ->writepage() being called for each dirty page that
has a mapping pointer. A few of the researchers at CMU tell me that
this behavior could be pretty bad for high-bandwidth storage back
ends: breaking a 500MB write down into many separate 4K writes
underutilizes the disk bandwidth, since the disk spins uselessly
between the 4K writes. Also, pages are not evicted fast enough to
maintain a steady stream of writes that would optimally utilize the
storage bandwidth.
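For reference, the writeback_control that pageout() hands down looks
roughly like this in 2.6.11 (quoting from memory, so treat it as
approximate):

	struct writeback_control wbc = {
		.sync_mode = WB_SYNC_NONE,
		.nr_to_write = SWAP_CLUSTER_MAX,
		.nonblocking = 1,
		.for_reclaim = 1,
	};

So the VM already hints that a small cluster may be written, but a
plain ->writepage() implementation only writes the single page it is
handed.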
So, I was thinking about a solution to this...
Having the writepage function look like this might help:
static int new_writepage(struct page *page, struct writeback_control *wbc)
{
	struct address_space *mapping = page->mapping;

	if (mapping->nr_coalesced < coalesce_limit) {
		/* defer the writeout, just count the page as coalesced */
		mapping->nr_coalesced++;
		return 0;
	}

	/* enough dirty pages gathered, flush them in one go */
	mapping->nr_coalesced = 0;
	return mapping->a_ops->writepages(mapping, wbc);
}
where nr_coalesced is the number of pages currently coalesced,
awaiting writeout, in the address_space, and coalesce_limit is the
number of dirty pages to coalesce before calling writepages(). This
of course requires adding that field to the address_space.
coalesce_limit could be set through a /proc interface; setting it to
0 would disable the coalescing.
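To make that knob concrete, here is a minimal sketch of it as a
2.6-style sysctl under /proc/sys/vm. The VM_COALESCE_LIMIT id and the
variable names are hypothetical, invented for this sketch:

	#include <linux/sysctl.h>

	static int coalesce_limit = 32;	/* dirty pages to gather before writepages() */

	static struct ctl_table coalesce_table[] = {
		{
			.ctl_name	= VM_COALESCE_LIMIT,	/* hypothetical id */
			.procname	= "coalesce_limit",
			.data		= &coalesce_limit,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= &proc_dointvec,
		},
		{ .ctl_name = 0 }
	};

This would be registered with register_sysctl_table(), or hooked in
next to the existing vm_table entries in kernel/sysctl.c.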
writepages() is only called in the synchronous page reclaim path,
i.e. try_to_free_pages() - via wakeup_bdflush() - but not in the
kswapd code path. Is there any specific reason for this?
And what would be the advantages of moving it into the kswapd code
path?
I do realize that this could result in pages not getting written out
when asked to, and so cause problems with memory reclaim. But given
that this is a high-bandwidth filesystem, there should be plenty of
dirty pages, and we should hit coalesce_limit pretty quickly; I
presume that would be the common case. If it doesn't happen, we still
have the call to writepages() in try_to_free_pages(), so that would
clear things up for us. I agree this behavior is not desirable, since
try_to_free_pages() is synchronous, but it should not be the common
case.
Is my reasoning logical, or am I missing the bigger picture?
Thanks
Rahul
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: writepage and high performance filesystems
@ 2005-08-23 15:12 Marcelo Tosatti
From: Marcelo Tosatti @ 2005-08-23 15:12 UTC (permalink / raw)
To: Rahul Iyer; +Cc: Rik van Riel, linux-mm
Hi Rahul,
On Mon, Aug 22, 2005 at 11:38:50PM -0400, Rahul Iyer wrote:
> Hi,
> As part of some research I was doing, I was looking at high-bandwidth
> filesystems that aim to serve the data requirements of computing
> clusters. We think we are facing an issue here...
>
> When memory pressure is felt, kswapd is woken up and calls
> balance_pgdat(), which eventually results in pageout() being called.
> From the pageout() function in 2.6.11:
>
> 325 SetPageReclaim(page);
> 326 res = mapping->a_ops->writepage(page, &wbc);
>
> This results in ->writepage() being called for each dirty page that
> has a mapping pointer. A few of the researchers at CMU tell me that
> this behavior could be pretty bad for high-bandwidth storage back
> ends: breaking a 500MB write down into many separate 4K writes
> underutilizes the disk bandwidth, since the disk spins uselessly
> between the 4K writes.
The VM relies on the IO scheduler for request coalescing (which does
not mean it performs optimal writeout behaviour, of course).
So requests are not "broken down into several 4K chunks" from the
perspective of the IO device.
You should examine what is going on with iostat, that should give a
better picture:

  mgr/s  number of read merges per second (smaller read requests
         that were successfully merged into a bigger one)
  mgw/s  number of write merges per second
  kr/s   kilobytes read per second
  kw/s   kilobytes written per second
  size   average size of the requests sent to disk, in kilobytes
> Also, pages are not evicted fast enough to maintain a steady stream
> of writes that would optimally utilize the storage bandwidth.
Do you have data to quantify that, along with information about the
working set?
I imagine that writing out contiguous dirty pages, bypassing the LRU
ordering, could help in many scenarios.
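To make "writing out contiguous dirty pages" concrete, a reclaim-side
clustering helper could look roughly like the sketch below, reusing
the dirty-tag lookups the writeback code already does. The names
(writeout_cluster, CLUSTER_PAGES) are made up for the sketch:

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/pagevec.h>
	#include <linux/writeback.h>

	#define CLUSTER_PAGES 256	/* hypothetical: 1MB worth of 4K pages */

	/* Push out the dirty pages file-contiguous with the page reclaim
	 * picked, so the IO scheduler sees one long stream instead of
	 * scattered 4K writes. */
	static void writeout_cluster(struct address_space *mapping,
				     pgoff_t index,
				     struct writeback_control *wbc)
	{
		pgoff_t next = index & ~((pgoff_t)CLUSTER_PAGES - 1);
		pgoff_t end = next + CLUSTER_PAGES;
		struct pagevec pvec;
		int i;

		pagevec_init(&pvec, 0);
		while (next < end &&
		       pagevec_lookup_tag(&pvec, mapping, &next,
					  PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
			for (i = 0; i < pagevec_count(&pvec); i++) {
				struct page *page = pvec.pages[i];

				if (page->index >= end)
					break;
				lock_page(page);
				/* recheck: the page may have been truncated
				 * or cleaned while we slept on the lock */
				if (page->mapping == mapping &&
				    clear_page_dirty_for_io(page))
					/* ->writepage() unlocks the page */
					mapping->a_ops->writepage(page, wbc);
				else
					unlock_page(page);
			}
			pagevec_release(&pvec);
		}
	}

Whether skipping the LRU ordering like this actually wins would have
to be measured; it trades reclaim fairness for IO contiguity.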
> So, I was thinking about a solution to this...
> Having the writepage function look like this might help:
>
> static int new_writepage(struct page *page, struct writeback_control *wbc)
> {
> 	struct address_space *mapping = page->mapping;
>
> 	if (mapping->nr_coalesced < coalesce_limit) {
> 		/* defer the writeout, just count the page as coalesced */
> 		mapping->nr_coalesced++;
> 		return 0;
> 	}
>
> 	/* enough dirty pages gathered, flush them in one go */
> 	mapping->nr_coalesced = 0;
> 	return mapping->a_ops->writepages(mapping, wbc);
> }
>
> where nr_coalesced is the number of pages currently coalesced,
> awaiting writeout, in the address_space, and coalesce_limit is the
> number of dirty pages to coalesce before calling writepages(). This
> of course requires adding that field to the address_space.
> coalesce_limit could be set through a /proc interface; setting it to
> 0 would disable the coalescing.
The ->writepages() semantic is to write all dirty pages of the given
mapping; it's used by fsync() and friends. But yes, something similar
would be nice.
Maybe pass a parameter to ->writepages() to indicate <start,len>,
and have the current ->writepages() users pass <0, 0xffffffff>.
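As a rough illustration of that <start,len> idea (the range fields
and the helper are hypothetical, nothing 2.6.11 actually has):

	#include <linux/fs.h>
	#include <linux/writeback.h>

	/* Assume struct writeback_control grew range_start/range_end
	 * fields, with the current "write everything" callers passing
	 * <0, 0xffffffff>.  A reclaim-side caller could then ask for a
	 * bounded, contiguous writeout of one file region: */
	static int writeback_range(struct address_space *mapping,
				   pgoff_t start, unsigned long len)
	{
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= len,
			/* hypothetical fields: */
			.range_start	= (loff_t)start << PAGE_CACHE_SHIFT,
			.range_end	= ((loff_t)(start + len) << PAGE_CACHE_SHIFT) - 1,
			.nonblocking	= 1,
			.for_reclaim	= 1,
		};

		return mapping->a_ops->writepages(mapping, &wbc);
	}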
Then at the VM level you need to know how many pages are dirty, and
ask the IO scheduler to drop these requests if they can't be merged.
Someone should try that.
> writepages() is only called in the synchronous page reclaim path,
> i.e. try_to_free_pages() - via wakeup_bdflush() - but not in the
> kswapd code path. Is there any specific reason for this?
Because ->writepages() writes _all_ pages of the mapping.
> And what would be the advantages of moving it into the kswapd code
> path?
>
> I do realize that this could result in pages not getting written out
> when asked to, and so cause problems with memory reclaim. But given
> that this is a high-bandwidth filesystem, there should be plenty of
> dirty pages, and we should hit coalesce_limit pretty quickly; I
> presume that would be the common case. If it doesn't happen, we still
> have the call to writepages() in try_to_free_pages(), so that would
> clear things up for us. I agree this behavior is not desirable, since
> try_to_free_pages() is synchronous, but it should not be the common
> case.
>
> Is my reasoning logical, or am I missing the bigger picture?
No, it is logical. A more appropriate name for "writepages" would be
"writeallpages" :)
Go ahead and write something!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2005-08-23 15:12 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-23 3:38 writepage and high performance filesystems Rahul Iyer
2005-08-23 15:12 ` Marcelo Tosatti
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox