From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 7D8596B004F for ; Mon, 8 Jun 2009 02:46:54 -0400 (EDT) Date: Mon, 8 Jun 2009 15:52:46 +0800 From: Wu Fengguang Subject: Re: [rfc][patch] swap: virtual swap readahead Message-ID: <20090608075246.GA12644@localhost> References: <1243436746-2698-1-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="vtzGhvizbBRQ85DL" Content-Disposition: inline In-Reply-To: <1243436746-2698-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org To: Johannes Weiner Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins , Rik van Riel List-ID: --vtzGhvizbBRQ85DL Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Wed, May 27, 2009 at 05:05:46PM +0200, Johannes Weiner wrote: > The current swap readahead implementation reads a physically > contiguous group of swap slots around the faulting page to take > advantage of the disk head's position and in the hope that the > surrounding pages will be needed soon as well. > > This works as long as the physical swap slot order approximates the > LRU order decently, otherwise it wastes memory and IO bandwidth to > read in pages that are unlikely to be needed soon. > > However, the physical swap slot layout diverges from the LRU order > with increasing swap activity, i.e. high memory pressure situations, > and this is exactly the situation where swapin should not waste any > memory or IO bandwidth as both are the most contended resources at > this point. > > This patch makes swap-in base its readaround window on the virtual > proximity of pages in the faulting VMA, as an indicator for pages > needed in the near future, while still taking physical locality of > swap slots into account. > > This has the advantage of reading in big batches when the LRU order > matches the swap slot order while automatically throttling readahead > when the system is thrashing and swap slots are no longer nicely > grouped by LRU order. Hi Johannes, You may want to test the patch against a real desktop :) The attached scripts can do that. I also have the setup to test it out conveniently, so if you send me the latest patch.. Thanks, Fengguang > Signed-off-by: Johannes Weiner > Cc: Hugh Dickins > Cc: Rik van Riel > --- > mm/swap_state.c | 80 +++++++++++++++++++++++++++++++++++++++---------------- > 1 files changed, 57 insertions(+), 23 deletions(-) > > qsbench, 20 runs each, 1.7GB RAM, 2GB swap, 4 cores: > > "mean (standard deviation) median" > > All values are given in seconds. I used a t-test to make sure there > is a statistical difference of at least 95% probability in the > compared runs for the given number of samples, arithmetic mean and > standard deviation. > > 1 x 2048M > vanilla: 391.25 ( 71.76) 384.56 > vswapra: 445.55 ( 83.19) 415.41 > > This is an actual regression. I am not yet quite sure why > this happens and I am undecided whether one humonguous active > vma in the system is a common workload. > > It's also the only regression I found, with qsbench anyway. I > started out with powers of two and tweaked the parameters > until the results between the two kernel versions differed. > > 2 x 1024M > vanilla: 384.25 ( 75.00) 423.08 > vswapra: 290.26 ( 31.38) 299.51 > > 4 x 540M > vanilla: 553.91 (100.02) 554.57 > vswapra: 336.58 ( 52.49) 331.52 > > 8 x 280M > vanilla: 561.08 ( 82.36) 583.12 > vswapra: 319.13 ( 43.17) 307.69 > > 16 x 128M > vanilla: 285.51 (113.20) 236.62 > vswapra: 214.24 ( 62.37) 214.15 > > All these show a nice improvement in performance and runtime > stability. > > The missing shmem support is a big TODO, I will try to find time to > tackle this when the overall idea is not refused in the first place. > > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 3ecea98..8f8daaa 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -336,11 +336,6 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > * > * Returns the struct page for entry and addr, after queueing swapin. > * > - * Primitive swap readahead code. We simply read an aligned block of > - * (1 << page_cluster) entries in the swap area. This method is chosen > - * because it doesn't cost us any seek time. We also make sure to queue > - * the 'original' request together with the readahead ones... > - * > * This has been extended to use the NUMA policies from the mm triggering > * the readahead. > * > @@ -349,27 +344,66 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, > struct vm_area_struct *vma, unsigned long addr) > { > - int nr_pages; > - struct page *page; > - unsigned long offset; > - unsigned long end_offset; > - > - /* > - * Get starting offset for readaround, and number of pages to read. > - * Adjust starting address by readbehind (for NUMA interleave case)? > - * No, it's very unlikely that swap layout would follow vma layout, > - * more likely that neighbouring swap pages came from the same node: > - * so use the same "addr" to choose the same node for each swap read. > - */ > - nr_pages = valid_swaphandles(entry, &offset); > - for (end_offset = offset + nr_pages; offset < end_offset; offset++) { > - /* Ok, do the async read-ahead now */ > - page = read_swap_cache_async(swp_entry(swp_type(entry), offset), > - gfp_mask, vma, addr); > + int cluster = 1 << page_cluster; > + int window = cluster << PAGE_SHIFT; > + unsigned long start, pos, end; > + unsigned long pmin, pmax; > + > + /* XXX: fix this for shmem */ > + if (!vma || !vma->vm_mm) > + goto nora; > + > + /* Physical range to read from */ > + pmin = swp_offset(entry) & ~(cluster - 1); > + pmax = pmin + cluster; > + > + /* Virtual range to read from */ > + start = addr & ~(window - 1); > + end = start + window; > + > + for (pos = start; pos < end; pos += PAGE_SIZE) { > + struct page *page; > + swp_entry_t swp; > + spinlock_t *ptl; > + pgd_t *pgd; > + pud_t *pud; > + pmd_t *pmd; > + pte_t *pte; > + > + pgd = pgd_offset(vma->vm_mm, pos); > + if (!pgd_present(*pgd)) > + continue; > + pud = pud_offset(pgd, pos); > + if (!pud_present(*pud)) > + continue; > + pmd = pmd_offset(pud, pos); > + if (!pmd_present(*pmd)) > + continue; > + pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl); > + if (!is_swap_pte(*pte)) { > + pte_unmap_unlock(pte, ptl); > + continue; > + } > + swp = pte_to_swp_entry(*pte); > + pte_unmap_unlock(pte, ptl); > + > + if (swp_type(swp) != swp_type(entry)) > + continue; > + /* > + * Dont move the disk head too far away. This also > + * throttles readahead while thrashing, where virtual > + * order diverges more and more from physical order. > + */ > + if (swp_offset(swp) > pmax) > + continue; > + if (swp_offset(swp) < pmin) > + continue; > + page = read_swap_cache_async(swp, gfp_mask, vma, pos); > if (!page) > - break; > + continue; > page_cache_release(page); > } > lru_add_drain(); /* Push any new pages onto the LRU now */ > +nora: > return read_swap_cache_async(entry, gfp_mask, vma, addr); > } > -- > 1.6.3 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org --vtzGhvizbBRQ85DL Content-Type: application/x-sh Content-Disposition: attachment; filename="run-many-x-apps.sh" Content-Transfer-Encoding: quoted-printable #!/bin/zsh=0A# why zsh? bash does not support floating numbers=0A=0A# aptit= ude install iceweasel gnome-games=0A# aptitude install openoffice.org # and= uncomment the oo* lines=0A=0A=0Aread T0 T1 < /proc/uptime=0A=0Afunction pr= ogress()=0A{=0A read t0 t1 < /proc/uptime=0A t=3D$((t0 - T0))=0A printf "%8= =2E2f " $t=0A echo "$@"=0A}=0A=0Afunction switch_windows()=0A{=0A wmctrl= -l | while read a b c win=0A do=0A progress A "$win"=0A wmctrl -a "$win"= =0A done=0A firefox /usr/share/doc/debian/FAQ/index.html=0A}=0A=0Awhile rea= d app args=0Ado=0A progress N $app $args=0A $app $args &=0A switch_windows= =0Adone << EOF=0Axeyes=0Afirefox=0Anautilus=0Anautilus --browser=0Agthumb= =0Agedit=0Axpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf= =0A=0Axterm=0Amlterm=0Agnome-terminal=0Aurxvt=0A=0Agnome-system-monitor=0Ag= nome-help=0Agnome-dictionary=0A=0A/usr/games/sol=0A/usr/games/gnometris=0A/= usr/games/gnect=0A/usr/games/gtali=0A/usr/games/iagno=0A/usr/games/gnotrave= x=0A/usr/games/mahjongg=0A/usr/games/gnome-sudoku=0A/usr/games/glines=0A/us= r/games/glchess=0A/usr/games/gnomine=0A/usr/games/gnotski=0A/usr/games/gnib= bles=0A/usr/games/gnobots2=0A/usr/games/blackjack=0A/usr/games/same-gnome= =0A=0A/usr/bin/gnome-window-properties=0A/usr/bin/gnome-default-application= s-properties=0A/usr/bin/gnome-at-properties=0A/usr/bin/gnome-typing-monitor= =0A/usr/bin/gnome-at-visual=0A/usr/bin/gnome-sound-properties=0A/usr/bin/gn= ome-at-mobility=0A/usr/bin/gnome-keybinding-properties=0A/usr/bin/gnome-abo= ut-me=0A/usr/bin/gnome-display-properties=0A/usr/bin/gnome-network-preferen= ces=0A/usr/bin/gnome-mouse-properties=0A/usr/bin/gnome-appearance-propertie= s=0A/usr/bin/gnome-control-center=0A/usr/bin/gnome-keyboard-properties=0A= =0A: oocalc=0A: oodraw=0A: ooimpress=0A: oomath=0A: ooweb=0A: oowriter = =0A=0AEOF=0A --vtzGhvizbBRQ85DL Content-Type: application/x-sh Content-Disposition: attachment; filename="test-mmap-exec-prot.sh" Content-Transfer-Encoding: quoted-printable #!/bin/sh=0A=0Aprot=3D$( free.$prot=0A --vtzGhvizbBRQ85DL-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org