From: Bill Davidsen <davidsen@tmr.com>
To: Ray Bryant <raybry@sgi.com>
Cc: Buddy Lumpkin <b.lumpkin@comcast.net>,
'Con Kolivas' <kernel@kolivas.org>,
'FabF' <fabian.frederick@skynet.be>,
'Bernd Eckenfels' <ecki-news2004-05@lina.inka.de>,
linux-kernel@vger.kernel.org, lse-tech@lists.sourceforge.net,
linux-mm@kvack.org
Subject: Re: why swap at all?
Date: Wed, 09 Jun 2004 15:24:13 -0400
Message-ID: <40C763DD.7090003@tmr.com>
In-Reply-To: <40C5D7FB.7020402@sgi.com>
Ray Bryant wrote:
>
> Buddy Lumpkin wrote:
>
>> <snip> One method would be to keep the
>> pagecache on its own list, and move pages to the head of the list any
>> time they are modified or referenced, and reclaim from the tail.
>> All pages on this list can be considered as "free memory", because any
>> new memory requests would just cause pages to be evicted from the tail
>> of the list.
>>
>
> We have code running on Altix that does exactly this. (Please note,
> however, that this is for our version of Linux 2.4.21 -- Yeah, it's
> old, but that is what the product runs at the moment -- we are in
> the process of switching over to Linux 2.6 when all of this will
> have to be re-evaluated.) The changes are in three parts:
>
> (1) We added a new page list, the reclaim list. Pages are put
> onto the reclaim list when they are inserted into the page cache.
> They are removed from the list when they are marked dirty (buffers
> from the page go on to the LRU dirty list) or when the pages are
> mmap'd into an address space, since in either of these situations,
> the pages are not reclaimable. (This list is per node in our
> NUMA system.)
>
> (2) We added code in __alloc_pages() so that if the local node
> allocation is going to fail (remember that Altix is a NUMA machine),
> we call out to a routine to scan the reclaim list on that node and
> to release enough clean buffer cache pages to make the local
> allocation succeed (plus a few pages, for efficiency). If this
> doesn't work, we most likely end up spilling the allocation over
> to another node.
>
> (3) We added code in generic_file_write() to limit the size of
> the page cache on buffered file I/O write operations. If the
> current size of the page cache is larger than the limit, we
> call the same routine as above to release some page cache pages.
> If we can't free enough pages to get below the limit, we throttle
> the write process by delaying it for a bit. This was all to
> avoid the problem of a large buffered file I/O request causing
> the page cache to grow to the point where the system would start
> to swap. (On our large memory systems, dropping into the
> swapping code can cause the system to freeze for tens of seconds,
> and that is something we would like to avoid).
>
> (We actually don't enforce the page cache limit unless the amount
> of free memory has dropped below a certain threshold. This is to
> keep the page cache from being limited if there is lots of free
> memory -- even though we only limit the page cache on writes,
> it turns out that the kernel is constantly writing to the disk,
> so this also effectively causes the page cache to be limited
> for reads as well.)
>
> This code was also written in response to customer demand. They
> don't like the fact that the buffer cache grows and grows on our
> Altix systems, and they want old buffer cache pages to be cleared
> out when they are no longer needed. Since we almost never suffer
> memory pressure on our systems (and if we do, we are likely in
> trouble), kswapd almost never does this. Buffer cache pages can
> sit around for days with no one removing them. The above was one
> approach to solve that problem.
>
> Please note: YMMV. An Altix is not a desktop system and I make
> no claims that the above approach is appropriate for everyone.
> For us, it turns out to work better to bias storage allocation
> against unbridled growth of the page cache. Indeed, we have
> spent a lot of time trying to solve problems related to page
> cache on Altix systems. Assuming we get our OLS paper done
> in time, you can read more about this in our paper at OLS.
> (If not, we intend to post our experiences paper on the
> oss.sgi.com website.)
>
> Finally, let me reiterate that we are beginning the process of
> evaluating the 2.6 memory manager wrt the same problem as above.
> Before we will propose a change such as above for 2.6, we have
> to convince ourselves that (1) setting vm_swappiness appropriately
> doesn't solve the problem, and (2) that patches such as the ones
> that Nick Piggin has been proposing don't solve the problem
> either, and that (3) there isn't some other mechanism to deal
> with this in 2.6.
I have to admit that the definition of "desktop machine" has changed a
lot in the last few years in terms of hardware, but since the 486 days I
have been asking "what can I build/buy for <$2k which best fits my
overall computing?" With the onset of cheap memory and Opteron, NUMA
will in all probability be a factor in the next few years, and SMP has
been one since the dual Pentium systems were new.
That said, I think that your work will be useful, even if it is used
piecemeal or as inspiration to Nick, Andrea, and others who have been
working in the area. I find Nick's work as of 2.6.7-rc1-mm1 so good that
I haven't moved any of my desktop machines beyond it, but it sounds as if
your work addresses the issue I mentioned about limiting buffer usage,
and Rik's comment that the code lacks checks and balances. You seem to
have a balance; I'd love to see it.
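
For readers following the thread, the three parts quoted above can be
pictured with a toy user-space model. This is only a sketch and is not
the SGI 2.4.21 code, which has not been posted here. The class Node and
all of its method names are hypothetical, pages are just counted rather
than managed, the oldest-first release order is an assumption (the
quoted text does not say how the reclaim list is ordered), and the
"few extra pages" default of 2 is made up.

```python
from collections import OrderedDict


class Node:
    """Toy model of one NUMA node's memory; page counts only, no real pages."""

    def __init__(self, total_pages):
        self.free_pages = total_pages
        self.reclaim = OrderedDict()   # clean, unmapped page-cache pages, oldest first
        self.pinned = set()            # dirty or mmap'd page-cache pages

    def cache_size(self):
        return len(self.reclaim) + len(self.pinned)

    def add_to_page_cache(self, page):
        # Part (1): a page goes on the reclaim list when it enters the page cache.
        if self.free_pages == 0:
            raise MemoryError("node has no free pages")
        self.free_pages -= 1
        self.reclaim[page] = None

    def mark_dirty_or_mapped(self, page):
        # Part (1): dirty or mmap'd pages are not reclaimable, so take them off.
        if page in self.reclaim:
            del self.reclaim[page]
            self.pinned.add(page)

    def release_clean_pages(self, wanted):
        # Shared routine: free up to `wanted` pages, oldest first (assumed order).
        freed = 0
        while freed < wanted and self.reclaim:
            self.reclaim.popitem(last=False)
            self.free_pages += 1
            freed += 1
        return freed

    def alloc_pages(self, count, extra=2):
        # Part (2): before failing a local allocation, reclaim the shortfall
        # plus a few extra pages; False means "spill to another node".
        if self.free_pages < count:
            self.release_clean_pages(count - self.free_pages + extra)
        if self.free_pages < count:
            return False
        self.free_pages -= count
        return True

    def must_throttle_write(self, cache_limit, free_threshold):
        # Part (3): the limit is enforced only when free memory is low.
        if self.free_pages >= free_threshold:
            return False
        excess = self.cache_size() - cache_limit
        if excess <= 0:
            return False
        # Throttle the writer if the cache cannot be brought back to the limit.
        return self.release_clean_pages(excess) < excess


node = Node(total_pages=8)
for p in range(6):
    node.add_to_page_cache(p)
node.mark_dirty_or_mapped(0)
ok = node.alloc_pages(4)
print(ok, node.free_pages, list(node.reclaim))  # True 2 [5]
```

In the example, the allocation of 4 pages finds only 2 free, so the
model releases the shortfall plus the extra pages from the old end of
the reclaim list and then succeeds locally. Page 0 survives because it
was taken off the list when it was dirtied.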
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me