On Wed, May 1, 2013 at 12:10 PM, Johannes Weiner wrote: > On Wed, May 01, 2013 at 08:28:30AM -0700, Hugh Dickins wrote: > > On Tue, 30 Apr 2013, Johannes Weiner wrote: > > > On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote: > > > > On Wed, 24 Apr 2013, Johannes Weiner wrote: > > > > > On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote: > > > > > > On Wed 24-04-13 12:42:55, Heiko Carstens wrote: > > > > > > > On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote: > > > > > > > > > > > > > > [ 48.347963] ------------[ cut here ]------------ > > > > > > > [ 48.347972] kernel BUG at mm/memcontrol.c:3994! > > > > > > > __mem_cgroup_uncharge_common() triggers: > > > > > > > > > > > > > > [...] > > > > > > > if (mem_cgroup_disabled()) > > > > > > > return NULL; > > > > > > > > > > > > > > VM_BUG_ON(PageSwapCache(page)); > > > > > > > [...] > > > > > > > > I agree that the actual memcg uncharging should be okay, but the > memsw > > > > swap stats will go wrong (doesn't matter toooo much), and > mem_cgroup_put > > > > get missed (leaking a struct mem_cgroup). > > > > > > Ok, so I just went over this again. For the swapout path the memsw > > > uncharge is deferred, but if we "steal" this uncharge from the swap > > > code, we actually do uncharge memsw in mem_cgroup_do_uncharge(), so we > > > may prematurely unaccount the swap page, but we never leak a charge. > > > Good. > > > > > > Because of this stealing, we also don't do the following: > > > > > > if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) { > > > mem_cgroup_swap_statistics(memcg, true); > > > mem_cgroup_get(memcg); > > > } > > > > > > I.e. it does not matter that mem_cgroup_uncharge_swap() doesn't do the > > > put, we are also not doing the get. We should not leak references. > > > > > > So the only thing that I can see go wrong is that we may have a > > > swapped out page that is not charged to memsw and not accounted as > > > MEM_CGROUP_STAT_SWAP. But I don't know how likely that is, because we > > > check for PG_swapcache in this uncharge path after the last pte is > > > torn down, so even though the page is put on swap cache, it probably > > > won't be swapped. It would require that the PG_swapcache setting > > > would become visible only after the page has been added to the swap > > > cache AND rmap has established at least one swap pte for us to > > > uncharge a page that actually continues to be used. And that's a bit > > > of a stretch, I think. > > > > Sorry, our minds seem to work in different ways, > > I understood very little of what you wrote above :-( > > > > But once I try to disprove you with a counter-example, I seem to > > arrive at the same conclusion as you have (well, I haven't quite > > arrived there yet, but cannot give it any more time). > > I might be losing my mind. But since you are reaching the same > conclusion, and I see the same mental milestones in your thought > process described below, it's more likely that I suck at describing my > train of thought coherently. Or the third possibility: we're both > losing it! > > > Looking at it from my point of view, I concentrate on the racy > > if (PageSwapCache(page)) > > return; > > __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, > false); > > in mem_cgroup_uncharge_page(). > > > > Now, that may or may not catch the case where last reference to page > > is unmapped at the same time as the page is added to swap: but being > > a MEM_CGROUP_CHARGE_TYPE_ANON call, it does not interfere with the > > memsw stats and get/put at all, those remain in balance. > > Yes, exactly. > > > And mem_cgroup_uncharge_swap() has all along been prepared to get > > a zero id from swap_cgroup_record(), if a SwapCache page should be > > uncharged when it was never quite charged as such. > > > > Yes, we may occasionally fail to charge a SwapCache page as such > > if its final unmap from userspace races with its being added to swap; > > but it's heading towards swap_writepage()'s try_to_free_swap() anyway, > > so I don't think that's anything to worry about. > > Agreed as well. If there are no pte references to the swap slot, it > will be freed either way. I didn't even think of the > try_to_free_swap() in the writeout call, but was looking at the > __remove_mapping later on in reclaim that will do a swapcache_free(). > > The only case I was worried about is the following: > > #0 #1 > page_remove_rmap() shrink_page_list() > if --page->mapcount == 0: add_to_swap() > mem_cgroup_uncharge_page() __add_to_swap_cache() > if PageSwapCache: SetPageSwapCache() > return try_to_unmap() > __mem_cgroup_uncharge_common() for each pte: > install swp_entry_t > page->mapcount-- > Thanks for spelling it out for me in more detail, this time I think I do grasp your concern. > > Looking at #1, I don't see anything that would force concurrent > threads to observe SetSwapCache ordered against the page->mapcount--. > My concern was that if those get reordered, #0 may see page->mapcount > == 1 AND !PageSwapcache, and then go ahead and uncharge the page while > there is actually a swp_entry_t pointing to it. The page will be a > proper long-term swap page without being charged as such. > But I don't see any problem with ordering here. #0 is using an atomic operation which returns a result on page->mapcount, so that amounts to (more than) an smp_rmb ensuring it reads mapcount before reading PageSwapCache flag. And in #1, there's at least an unlock of the radix_tree lock (after adding to swap tree) and a lock of the page table lock (before unmapping the page), and that pairing amounts to (more than) an smp_wmb. Hugh > > (If I had time to stop and read through that, I'd probably find it > > just as hard to understand as what you wrote!) > > > > > > > > Did I miss something? If not, I'll just send a patch that removes the > > > VM_BUG_ON() and adds a comment describing the scenarios and a note > > > that we may want to fix this in the future. > > > > I don't think you missed something. Yes, please just send Linus and > > Andrew a patch to remove the VM_BUG_ON() (with Cc stable tag), I now > > agree that's all that's really needed - thanks. > > Will do, thanks for taking them time to think through it again, even > after failing to decipher my ramblings... > > Johannes >