From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from ccs-mail.lanl.gov (ccs-mail.lanl.gov [128.165.4.126]) by mailwasher-b.lanl.gov (8.12.11/8.12.11/(ccn-5)) with ESMTP id j9F3Z7I6027219 for ; Fri, 14 Oct 2005 21:35:07 -0600 Subject: Re: Another Clock-pro approx From: Song Jiang In-Reply-To: <1129275512.7845.47.camel@twins> References: <434EA6E8.30603@programming.kicks-ass.net> <1129272286.3637.186.camel@moon.c3.lanl.gov> <1129275512.7845.47.camel@twins> Content-Type: text/plain Message-Id: <1129347305.3637.310.camel@moon.c3.lanl.gov> Mime-Version: 1.0 Date: Fri, 14 Oct 2005 21:35:05 -0600 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Peter Zijlstra Cc: riel@redhat.com, linux-mm@kvack.org List-ID: Peter, Two obvious issues in the actions are (1) test bits have never been set; (2) pages in T3 have never been promoted into T1/T2. I might have to take the responsibility for all the confusion in the understanding of why clock-pro works. (e.g. clock-pro can deal with access-once, loop accesses effectively). I should have put a pseudo code in the clock-pro paper. However, even with a pseudo code, the description of clock-pro would seem much more complicated than the idea behind it actually is. To get its idea without being bothered by all the details involved in the clock-pro description, my suggestion is that let's base our discussion/design on the original LIRS algorithm. (see section 3 of the following paper http://www.cs.wm.edu/hpcs/WWW/HTML/publications/papers/TR-05-11.pdf) In the context of the LIRS algorithm, compare to clock-pro, cold <--> HIR hot <--> LIR HAND_hot --> bottom of the LIRS stack (= last resident hot page). This is the most critical position in the stack, which serves as a criterion of hot/cold. A reference to a page above it (if the stack is vertically viewed) makes the page a hot one, otherwise, the page remains as a cold one. This is also the place where a hot page is demoted to a cold one. HAND-cold --> last resident cold page (== bottom of stack Q) where a victim page is going to be searched. If we allow an unlimited number of non-resident stay in the stack, there is no need for a HAND_test. Actually HAND_hot serves the purpose. If we have a limit on that (say the memory size c), then the HAND-test points to the last non-resident cold page or last hot page, whichever is higher. The idea behind LIRS is very simple: the last hot page is the one least deserving to reside in memory (ignoring the small number of resident cold pages here). If you have a reference above it (in the stack), then you are more deserving to get in (as a hot one), and the last hot one gets out (as a cold one). Otherwise, no h/c states change. A major difference between LIRS and Clock-pro is that clock-pro delays the actions on references (except page faults). Taking this difference into reading clock-pro description, you can easily unravel all the tricks played in it. The conclusion is that it is essential for clock-pro to allow the location comparison in a single list between (non-resident) cold page and the last hot page, so that we know which cold page should be promoted into a hot page with a reference. If we use HAND_test, we must make sure HAND_test stays above HAND_hot. If we use separate lists for h/c pages (like active/inactive lists), we lose the chance for the comparison, and major performance advantages of clock-pro are lost. Actually CAR/CART have been shown in the aforementioned paper not be able to deal with loop access problem of LRU, and that is the reason. It is fine to have a common non-resident page pool for the zoned memory, if we can also set their corresponding place-holders in a single list for each zone. Let me know if the single list suggestion is feasible in the Linux kernel. Thanks. Song On Fri, 2005-10-14 at 01:38, Peter Zijlstra wrote: > On Fri, 2005-10-14 at 00:44 -0600, Song Jiang wrote: > > Hi, Peter, > > > > I didn't see your explanation of the actions such as T2-000. > > So I put my understanding at the below. > > My bad, I keep forgetting to write half the stuff down, often the most > important pieces appear missing ;-( > > As I told Rik yesterday on IRC, the main idea behind it was to make T1 + > T2 behave like 1 big SC-list and superimpose T3 thereon by coupling the > rotation speed. And thus creating 1 virtual wheel as clock-pro has. The > split in T1 and T2 acts as the hot hand and determines the length of the > test period. > > |T1| + |T2| <= c > |T1| + |T2| + |T3| <= 2c > > Resident: > T1 can contain pages marked as both hot and cold. > T2 can only contain cold pages. > > Non-resident: > T3 can only contain non-resident cold pages in their test period. > > The actions presented like Tn-abc are to be read as: place page on the > head of list Tn with the hot/cold bit set to a, the test bit set to b > and the referenced bit set to c. > > > While setting three lists, do you indicate that a page > > can have 3 set of pointers, each for the links in one list? > > No, a page can only be on 1 list at a time. > > > If we don't restrict ourselves to the current active and inactive > > list data structure, can you point to us what major approximations > > we have to make on the original clock-pro policy so that the policy > > becomes acceptable in the kernel? (I understand that zoned data > > structure is one of the required adaptations.) The reason I'm asking > > the question is that I didn't see some major changes here in your > > approxiamtion? Once I understand the kernel required adaptations, > > I can work with you for a clock-pro approximation. > > Yes the zoned data structure is the most difficult. It gives rise to the > fact that we have to split the resident from the non-resident page > management. Each zone will have a resident part and all zones will share > the non-resident part. This because a page leaving zone 1 could be > faulted back in zone 2. > > Another is that the pageout operation is async, ie. we start writeback > against a page but until it is finished the page has to stay on the > resident lists but as soon as writeback finishes and we have not yet had > another reference we need to reclaim the page asap. > > The last one, and I'm not quite sure on this one (Rik?), is that when we > insert a page we're sure it is going to be referenced right after the > context switch to userspace. Hence we effectively insert referenced > pages. > > Rik, any things missing here? > > > I have a clock-pro simulator in C code and powerpoint presentation > > slides explaining clock-pro operations. If you are interested in > > any of them, let me know. > > > > Thanks. > > > > Song > > > > > > > > On Thu, 2005-10-13 at 12:26, Peter Zijlstra wrote: > > > Hi, > > > > > > I've been thinking on another clock-pro approximation. > > > > > > Each page has 3 bits: hot/cold, test and referenced. > > > Say we have 3 lists: T1, T2 and T3. > > > and variable: s > > > > > > T1 will have hot and cold pages, T2 will only have cold pages and T3 > > > will have the non-resident pages. > > > c will be the total number of resident pages; |T1| + |T2| + |T3| = 2c. > > Do you mean |T1 U T2 U > > T3| = 2c ? > > See above. > > New and improved actions: > > > > T1-rotation: > > > > > > h/c test ref action > > > 0 0 0 T2-000 page is passed on to T2 > > > 0 0 1 T2-001 page is passed on to T2 > > > 0 1 0 T2-000 page to T2, clear test bit > > > 0 1 1 T1-100 make it hot > > > 1 0 0 T2-010 turn into cold page, start test period > > > 1 0 1 T1-100 keep it hot, clear ref bit > > > 1 1 0 > > > 1 1 1 > > > > > > > > > T2-rotation: > > > > > > h/c test ref action > > > 0 0 0 > > > 0 0 1 T1-000 keep it cold, keep it resident. > > > 0 1 0 T3-010 make it non-resident > > > 0 1 1 T1-100 make it hot > > > > > > > > > T3-rotation: remove from non-resident list. > > A fault will check the non-resident list and also remove the page if > found. > > > > So, on fault we rotate T2, unless empty then we start by rotating T1 > > > until T2 contains at least 1 cold page. > > > If a T2 rotation creates a hot page, we rotate T1 to degrade a hot > > > page to a cold page in order to keep the cold page target m_c. > > > Every T1 rotation adds |T1| to s. While s > c, we subtract c from s and > > > Does |T1| mean the total size if T1? > > Yes, |T1| is the size of T1. > > > Can you explain why in more > > detail? > > > > This because the hot hand pushes the test hand. > > Hmmm, maybe it should not be every T1 rotation, but only rotations on > hot pages that get accounted. Because at that point the hot page with > the largest inter reference period will disappear and the test period > changes. > > So the idea is to couple to rotation speed of T3 to T1 and scale the > respective sizes. > > > > turn T3 for each subtraction. > > > > > > Compare to clock-pro: > > > T1-rotation <-> Hand_hot > > > T2-rotation <-> Hand_cold > > > T3-rotation <-> Hand_test > > > > > > The normal m_c adaption rules can be applied. > > > > > > Zoned edition: > > > This can be done per zone by having: > > > T1_i, T2_i, T3_j, s, t, u_j > > > where _i is the zone index and _j the non-resident bucket index. > > > > > > Then each T1_i turn will add |T1_i| to s, each c in s will increment t by 1. > > > On each non-resident bucket access we increment u_j until it equals t > > > and for each increment we rotate the bucket. > > > > > More thoughts, it should probably be: u_j = t/J. Where J is the total > number of buckets. > > > Kind regards, > > Peter Zijlstra > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org