From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: "Noreclaim Infrastructure" [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active]
From: Lee Schermerhorn
In-Reply-To: <20070629141254.GA23310@v2.random>
References: <8e38f7656968417dfee0.1181332979@v2.random> <466C36AE.3000101@redhat.com> <20070610181700.GC7443@v2.random> <46814829.8090808@redhat.com> <20070626105541.cd82c940.akpm@linux-foundation.org> <468439E8.4040606@redhat.com> <1183124309.5037.31.camel@localhost> <20070629141254.GA23310@v2.random>
Content-Type: text/plain
Date: Fri, 29 Jun 2007 18:39:01 -0400
Message-Id: <1183156742.7012.25.camel@localhost>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Andrea Arcangeli
Cc: Rik van Riel, Andrew Morton, linux-mm@kvack.org, Nick Dokos
List-ID:

On Fri, 2007-06-29 at 16:12 +0200, Andrea Arcangeli wrote:
> On Fri, Jun 29, 2007 at 09:38:29AM -0400, Lee Schermerhorn wrote:
>
> > Here's a fairly recent version of the patch if you want to try it on
> > your workload. We've seen mixed results on somewhat larger systems,
> > with and without your split LRU patch. I've started writing up those
> > results. I'll try to get back to finishing up the writeup after OLS and
> > vacation.
>
> This looks a very good idea indeed.
>
> Overall the O(log(N)) change I doubt would help, being able to give an
> efficient answer to "give me only the vmas that maps this anon page"
> won't be helpful here since the answer will be the same as the current
> question "give me any vma that may be mapping this anon page". Only
> for the filebacked mappings it matters.
>
> Also I'm stunned this is being compared to a java workload, java is a
> threaded beast (unless you're capable of understanding async-io in
> which case it's still threaded but with tons less threads, but anyway
> you code it won't create any anonymous related overhead).
> What we deal
> with isn't really an issue with anon-vma but just with the fact the
> system is trying to unmap pages that are mapped in 4000-5000 pte, so
> no matter how you code it, there will be still 4000-5000 ptes to check
> for each page that we want to know if it's referenced and it will take
> system time, this is an hardware issue not a software one. And the
> other suspect thing is to do all that pte-mangling work without doing
> any I/O at all.

Andrea:

Yes, the patch is not a panacea. At best, it allows different kswapd's
to attempt to unmap different pages associated with the same VMA. But,
as you say, you still have to unmap X000 ptes. On one of the smaller
ia64 systems we've been testing, we hit this state in the 15000-20000
range of AIM jobs. This patch, along with Rik's split LRU patch allowed
us to make forward progress at saturation, and we were actually
swapping, instead of just spinning around in page_referenced() and
try_to_unmap(). [Actually, I don't think we get past page_referenced()
much w/o this patch--have to check.]

I have experimented with another "noreclaim" infrastructure, based on
some patches by Larry Woodman at Red Hat, to keep non-reclaimable pages
off the active/inactive list. I envisioned this as a general
infrastructure to handle this case--pages whose anon_vmas have
excessively long vma lists, swap-backed pages for which no swap space is
available and mlock()ed pages [a la Nick Piggin's patch].

I will include the patch overview here and send along the 2
infrastructure patches and one "client" patch--the excessively
referenced anon_vma case. I'm not proposing that these be considered
for inclusion. Just another take on this issue.

The patches are against 2.6.21-rc6. I have been distracted by other
issues lately, so they have languished, and even the overview is a bit
out of date relative to on-going activity in this area.
I did integrate this series with Rik's split LRU patch at one time, and
it all "worked" for some definition thereof.

One final note before the "noreclaim overview": I have seen similar
behavior on the i_mmap_lock for file-backed pages running a [too] heavy
Oracle/TPC-C workload--on a larger ia64 system with ~8TB of storage.
System hung/unresponsive, spitting out "Soft lockup" messages. Stack
traces showed cpus in spinlock contention called from
page_referenced_file. So, it's not limited to anon pages.

Lee

-----------------

This series of patches introduces support for managing
"non-reclaimable" pages off the LRU active and inactive list. In this
rather long-winded overview, I attempt to provide the motivation for
this work, describe how it relates to other recent patches that address
different aspects of the "problem", and give an overview of the
mechanism. I'll try not to repeat too much of this in the patch
descriptions.

We have seen instances of large Linux servers [10s/100s of GB of memory
=> millions of pages] apparently hanging for extended periods [10s of
minutes or more] while all processors attempt to reclaim memory. For
various reasons many of the pages on the LRU lists become difficult or
impossible to reclaim. The system spends a lot of time trying to
reclaim [unmap] the difficult pages and/or shuffling through the
impossible ones.

Some of the conditions that make pages difficult or impossible to
reclaim:

1) page is anon or shmem, but no swap space available
2) page is mlocked into memory
3) page is anon with an excessive number of related vmas [on the
   anon_vma list]. More on this below.

The basic noreclaim mechanism, described below, is based on a patch
developed by Larry Woodman of Red Hat for RHEL4 [2.6.9+ based kernel] to
address the first condition above--an x86_64 non-NUMA system with
64G-128G memory [16M-32M 4k pages] with very little swap space-- ~2GB.
The majority of the memory on the system was consumed by large database
shared memory areas.
A file IO intensive operation, such as backup, causes remaining free
memory to be consumed by the page cache, initiating reclaim. vmscan
then spends a great deal of time shuffling non-swappable anon and shmem
pages between the active and inactive lists, only to find that it can't
move them to the swap cache. The pages get reactivated and round and
round it goes. Because pages cannot be easily reclaimed, eventually
other processors need to allocate pages and enter direct reclaim, only
to compete for the zone lru lock. The single [normal] zone on the
non-numa platform exacerbates this problem, but it can also arise, per
zone, on numa platforms.

Larry's patch alleviates this problem by maintaining anon and shmem
pages for which no swap space exists on a per zone noreclaim list. Once
the pages have been parked there, vmscan deals only with page cache
pages, and anon/shmem pages to which swap space has already been
assigned. Pages move from the noreclaim list back to the LRU when swap
space becomes available.

Upstream developers have been addressing some of these issues in other
ways:

Christoph Lameter posted a patch to keep anon pages off the LRU when
SWAP support is not configured into the kernel. With Christoph's patch,
these pages are left out "in limbo"--not on any list. Because of this,
Christoph's patch does not address the more common situation of kernels
with SWAP configured in, but insufficient or no swap added. I think
this is a more common situation because most distros will ship kernels
with the SWAP support configured in--at least for "enterprise" use.
Maintaining these pages on a noreclaim list will make it possible to
restore these pages to the [in]active lists when/if swap is added.

Nick Piggin's patch to keep mlock'ed pages [condition 2 above] off the
LRU list also lets the mlocked/non-reclaimable pages float, not on any
list.
While Nick's patch does allow these pages to become reclaimable when all
memory locks are removed, there is another reason to keep pages on a
separate list. We want to be able to migrate anon pages that have no
swap space backing them, and those that are mlocked. Indeed, the
migration infrastructure supports this. However, the LRU lists, via the
zone lru locks, arbitrate between tasks attempting to migrate the same
pages simultaneously. To migrate a page, we must isolate it from the
LRU. If the page cannot be isolated, migration gives up and moves on to
another page. Whichever task is successful in isolating the page
proceeds with the migration. Keeping the nonreclaimable pages on a
separate list, protected by the zone lru lock, would preserve this
arbitration function. isolate_lru_page(), used by both migration and
Nick's mlock patch, can be enhanced to find pages on the noreclaim list,
as well as on the [in]active lists. What's the probability that tasks
will race on migrating the same page? Fairly high if auto-migration
ever makes it into the kernel, but non-zero in any case.

Rik van Riel's patch to split the active and inactive lists can address
the non-swappable page problem by throttling the scan of the anon LRU
lists, which contain both anon and shmem pages. However, if the system
supports any swap space at all, one still needs to scan the anon lists
to free up memory consumed by pages already in the swap cache. On large
memory systems, the anon lists can still be millions of pages long and
contain a large percentage of non-swappable and mlocked pages.

This series attempts to unify this work into a general mechanism for
managing non-reclaimable pages. The basic objective is to make vmscan
as productive as possible on very large memory systems, by eliminating
non-productive page shuffling.

Like Larry's patch, the noreclaim infrastructure maintains
"non-reclaimable" pages on a separate per-zone list.
This noreclaim list is, conceptually, another LRU list--a sibling of the
active and inactive lists. A page on the noreclaim list will have the
PG_lru and PG_noreclaim flags set. The PG_noreclaim flag is analogous
to, and mutually exclusive with, the PG_active flag--it specifies which
LRU list the page resides on. The noreclaim list supports a pagevec
cache, like the active and inactive lists, to reduce contention on the
zone lru lock in vmscan and in the fault path.

Pages on the noreclaim list are "hidden" from page reclaim scanning.
Thus, reclaim will not spend time attempting to reclaim the pages, only
to find that they can't be unmapped, have no swap space available, are
locked into memory, ... However, vmscan may find pages on the
[in]active lists that have become non-reclaimable since they were put on
the list. It will move them to the noreclaim list at that time.

This series of patches includes the basic noreclaim list support and one
patch, as a proof of concept, to address the 3rd condition listed above:
the excessively long anon_vma list of related vmas. This seemed to be
the easiest of the 3 conditions to address, and I have a test case handy
[AIM7--see below]. Additional patches to handle anon pages for which no
swap exists and to layer Nick Piggin's patch to keep "mlock pages off
the LRU" will be forthcoming, if feedback indicates that this approach
is worth pursuing.

Now, about those anon pages with really long "related vma" lists:

We have only seen this in AIM7 benchmarks on largish servers. The
situation occurs when a single task fork()s many [10s of] thousands of
children, and the system needs to reclaim memory. We've seen all
processors on a system spinning on the anon_vma lock attempting to unmap
pages mapped by these thousands of children--for 10s of minutes or until
we give up and reboot.

I discussed this issue at LCA'07 in a kernel miniconf presentation.
Linus questioned whether this was a problem that really needs solving.
After all, AIM7 is only a synthetic benchmark. Does any real
application behave this way? After the presentation, someone came up to
me and told me that Apache also fork()s for each incoming connection and
can fork thousands of children. However, I have not witnessed this, nor
do I know how long lived these children are.

I have included another patch that makes the anon_vma lock a
reader/writer lock. This allows different cpus to attempt to reclaim,
in parallel, different pages that point to the same anon_vma. However,
this doesn't solve the problem of trying to unmap pages that are
[potentially] mapped into thousands of vmas.

The last patch in this series counts the number of related vmas on an
anon_vma's list and, when it exceeds a tunable threshold, pages that
reference that anon_vma are declared nonreclaimable. We detect these
non-reclaimable pages either on fault [COW or new anon page in a vma
with an excessively shared anon_vma] or when vmscan encounters such a
page on the LRU list.

The patch/series does not [yet] support moving such a page back to the
[in]active lists when its anon_vma sharing drops below the threshold.
This usually occurs when a task exits or explicitly unmaps the area.
Any COWed private pages will be freed at this time, but anon pages that
are still shared will remain nonreclaimable even though the related vma
count is below the no-reclaim limit. Again, I will address this if the
overall approach is deemed worth pursuing.

Additional considerations:

If the noreclaim list contains mlocked pages, they can be directly
deleted from the noreclaim list without scanning when they become
unlocked. But, note that we can't use one of the lru link fields to
contain the mlocked vma count in this case.

If the noreclaim list contains anon/shmem pages for which no swap space
exists, it will be necessary to scan the list when swap space becomes
available, either because it has been freed from other pages, or because
additional swap has been added.
The latter case should not occur frequently enough to be a problem. We
should be able to defer the scanning when swap space is freed from other
pages until a sufficient number become available or system is under
severe pressure.

If the list contains pages that are merely difficult to reclaim because
of the excessive anon_vma sharing, and if we want to make them
reclaimable again when the anon_vma related vma count drops to an
acceptable value, one would have to scan the list at some point. Again,
this could be deferred until there are a sufficient number of such pages
to make it worthwhile or until the system is under severe memory
pressure.

The above considerations suggest that one consider separate lists for
non-reclaimable [no swap, mlocked] and difficult to reclaim. Or, maybe
not...

Interaction of noreclaim list and LRU lists:

My current patch moves pages to the noreclaim list as soon as they are
detected, either on the active or inactive list. I could change this
such that non-reclaimable pages found on the active list go to the
inactive list first, and take a ride there before being declared
non-reclaimable. However, we still have the issue of where to place the
pages when they come off the noreclaim list: back to the active list?
the inactive list? head or tail thereof? My current mechanism, with
the PG_active and PG_noreclaim flags being mutually exclusive, does not
track activeness of pages on the noreclaim list. To do so would require
additional scanning of the list, I think, sort of defeating the purpose
of the list. But, maybe acceptable if we scan just to test/modify the
active flags.
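
To make the three conditions and the flag handling from the overview a
bit more concrete, here is a small user-space model in C. It is only a
sketch and is not taken from the patches: the struct layouts, the flag
bit values, the names page_model, anon_vma_model, page_reclaimable(),
page_lru_list() and scan_one_page(), and the limit of 64 related vmas
are all assumptions made up for illustration. Locking, pagevecs, and
moving pages back off the noreclaim list are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, named after the flags in the overview. */
#define PG_LRU       (1u << 0)
#define PG_ACTIVE    (1u << 1)
#define PG_NORECLAIM (1u << 2)

enum lru_list { LRU_INACTIVE, LRU_ACTIVE, LRU_NORECLAIM };

/* Stand-in for an anon_vma: only the count of related vmas matters here. */
struct anon_vma_model {
    unsigned int nr_vmas;
};

/* Stand-in for struct page, reduced to what the three conditions need. */
struct page_model {
    unsigned int flags;
    bool mlocked;                           /* condition 2 */
    bool swap_backed;                       /* anon or shmem page */
    bool has_swap_slot;                     /* swap space already assigned */
    const struct anon_vma_model *anon_vma;  /* NULL for file pages */
};

/* Tunable threshold from the last patch; the value 64 is an assumption. */
static unsigned int anon_vma_reclaim_limit = 64;

/* Returns false if any of the three conditions in the overview holds. */
bool page_reclaimable(const struct page_model *p, long free_swap_pages)
{
    if (p->mlocked)
        return false;   /* 2) mlocked into memory */
    if (p->swap_backed && !p->has_swap_slot && free_swap_pages <= 0)
        return false;   /* 1) anon/shmem, no swap space available */
    if (p->anon_vma && p->anon_vma->nr_vmas > anon_vma_reclaim_limit)
        return false;   /* 3) excessive number of related vmas */
    return true;
}

/* PG_active and PG_noreclaim are mutually exclusive list selectors. */
enum lru_list page_lru_list(const struct page_model *p)
{
    assert(p->flags & PG_LRU);
    assert(!((p->flags & PG_ACTIVE) && (p->flags & PG_NORECLAIM)));
    if (p->flags & PG_NORECLAIM)
        return LRU_NORECLAIM;
    return (p->flags & PG_ACTIVE) ? LRU_ACTIVE : LRU_INACTIVE;
}

/*
 * What a scan of the [in]active lists does in this model when it meets a
 * page: a page that has become non-reclaimable is moved to the noreclaim
 * list at once, losing its active state.
 */
enum lru_list scan_one_page(struct page_model *p, long free_swap_pages)
{
    if (!page_reclaimable(p, free_swap_pages)) {
        p->flags &= ~PG_ACTIVE;
        p->flags |= PG_NORECLAIM;
    }
    return page_lru_list(p);
}
```

Clearing the active bit in scan_one_page() mirrors the point made above
under "Interaction of noreclaim list and LRU lists": with the two flags
mutually exclusive, activeness is simply not tracked once a page is on
the noreclaim list.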