linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [RFC][PATCH] synchrouns swap freeing at zapping vmas
Date: Fri, 22 May 2009 09:26:56 +0900	[thread overview]
Message-ID: <20090522092656.21e76d8f.kamezawa.hiroyu@jp.fujitsu.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0905212035200.15631@sister.anvils>

On Thu, 21 May 2009 22:00:20 +0100 (BST)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:

> On Thu, 21 May 2009, KAMEZAWA Hiroyuki wrote:
> > 
> > In these 6-7 weeks, we tried to fix memcg's swap-leak race by checking
> > swap is valid or not after I/O.
> 
> I realize you've been working on different solutions for many weeks,
> and would love a positive response.  Sorry, I'm not providing that:
> these patches are not so beautiful that I'm eager to see them go in.
> 
> I ought to be attending to other priorities, but you've been clever
> enough to propose intrusive mods that I can't really ignore, just to
> force a response out of me!

Sorry and thank you for your time and kindness.

>  And I'd better get a reply in with my new address, before the old
 > starts bouncing in a few days time.
> 
Updated my address book.


> > But Andrew Morton pointed out that
> > "trylock in free_swap_and_cache() is not good"
> > Oh, yes. it's not good.
> 
> Well, it has served non-memcg very well for years:
> what's so bad about it now?
> 
> I've skimmed through the threads, starting from Nishimura-san's mail
> on 17 March, was that the right one?  My head spins like Balbir's.
> 
Maybe right.

> It seems like you have two leaks, but I may have missed the point.
> 
> One, that mem-swap accounting and mem+swap accounting have some
> disagreement about when to (un)account to a memcg, with the result
> that orphaned swapcache pages are liable to be accounted, but not
> on the LRUs of the memcg.  I'd have thought that inconsistency is
> something you should be sorting out at the memcg end, without
> needing changes to the non-memcg code.
> 
I did these things in memcg. But finally,

-----------------------------------------------
                       |     free_swap_and_cache()
lock_page()            |        
 try_to_free_swap()    |
   check swap refcnt   |
                       |   swap refcnt goes to 1.
                             trylock failure
unlock_page()          | 
------------------------------------------------
This race was the last obstacle in front of me in previous patch.
This patch is a trial to remove trylock. (this patch teach me much ;)

> Other, that orphaned swapcache pages can build up until swap is
> full, before reaching sufficient global memory pressure to run
> through the global LRUs, which is what has traditionally dealt
> with the issue.  And when swap is filled in this way, memcgs can
> no longer put their pages out to swap, so OOM prematurely instead.
> 
yes.

> I can imagine (just imagining, haven't checked, may be quite wrong)
> that split LRUs have interfered with that freeing of swapcache pages:
> since vmscan.c is mainly targetted at freeing memory, I think it tries
> to avoid the swapbacked LRUs once swap is full, so may now be missing
> out on freeing such pages?
> 
Hmm, I feel it is possible. 

> And it's probably an inefficient way to get at them anyway.
> Why not have a global scan to target swapcache pages whenever swap is
> approaching full (full in a real sense, not vm_swap_full's 50% sense)?
> And run that before OOMing, memcg or not.
> 
It's one of points.
I or Nishimura have to modify vm_swap_full() to see memcg information.

But the problem in readahead case is
 - swap entry is used.
 - it's accoutned to a memcg by swap_cgroup
 - but not on memcg's LRU and we can't free it.

> Sorry, you're probably going to have to explain for the umpteenth
> time why these approaches do not work.
> 
IIRC, Nishimura and guys walks mainly for HPC and they tends to have tons of
memory. Then, I'd like to avoid scanning global LRU without any hints, as much as
possible.




> > 
> > Then, this patch series is a trial to remove trylock for swapcache AMAP.
> > Patches are more complex and larger than expected but the behavior itself is
> > much appreciate than prevoius my posts for memcg...
> >  
> > This series contains 2 patches.
> >   1. change refcounting in swap_map.
> >      This is for allowing swap_map to indicate there is swap reference/cache.
> 
> You've gone to a lot of trouble to obscure what this patch is doing:
> lots of changes that didn't need to be made, and an enum of 0 or 1
> which keeps on being translated to a count of 2 or 1.
> 
Ah, ok. it's not good.

> Using the 0x8000 bit in the swap_map to indicate if that swap entry
> is in swapcache, yes, that may well be a good idea - and I don't know
> why that bit isn't already used: might relate to when pids were limited
> to 32000, but more likely was once used as a flag later abandoned.
> But you don't need to change every single call to swap_free() etc,
> they can mostly do just the same as they already do.
> 
yes. Using 0x8000 as flag is the choice.

> Whether it works correctly, I haven't tried to decide.  But was
> puzzled when by the end of it, no real use was actually made of
> the changes: the same trylock_page as before, it just wouldn't
> get tried unsuccessfully so often.  Just preparatory work for
> the second patch?
> 
When swap count returns 1, there are 2 possibilities.
  - there is a swap cache
  - there is swap reference.

In second patch, I wanted to avoid unnecesasry call for
  find_get_page() -> lock_page() -> try_to_free_swap().
because I know I can't use large buffer for batched work.

Without second patch, I have a chace to fix this race
-----------------------------------------------
                       |     free_swap_and_cache()
lock_page()            |        
 try_to_free_swap()    |
   check swap refcnt   |
                       |   swap refcnt goes to 1.
                       |   trylock failure
unlock_page()          | 
------------------------------------------------

There will be no race between swap cache handling v.s. swap usage.



> >   2. synchronous freeing of swap entries.
> >      For avoiding race, free swap_entries in appropriate way with lock_page().
> >      After this patch, race between swapin-readahead v.s. zap_page_range()
> >      will go away.
> >      Note: the whole code for zap_page_range() will not work until the system
> >      or cgroup is very swappy. So, no influence in typical case.
> 
> This patch adds quite a lot of ugliness in a hot path which is already
> uglier than we'd like.   Adding overhead to zap_pte_range, for the rare
> swap and memcg case, isn't very welcome.
> 
Ok, I have to agree.

> > 
> > There are used trylocks more than this patch treats. But IIUC, they are not
> > racy with memcg and I don't care them.
> > (And....I have no idea to remove trylock() in free_pages_and_swapcache(),
> >  which is called via tlb_flush_mmu()....preemption disabled and using percpu.)
> 
> I know well the difficulty, several of us have had patches to solve most
> of the percpu mmu_gather problems, but the file truncation case (under
> i_mmap_lock) has so far defeated us; and you can't ignore that case,
> truncation has to remove even the anon (possibly swapcache) pages
> from a private file mapping.
> 
Ah, I may misunderstand following lines.
== zap_pte_range()
 832                  * If details->check_mapping, we leave swap entries;
 833                  * if details->nonlinear_vma, we leave file entries.
 834                  */
 835                 if (unlikely(details))
 836                         continue;
==
Then...this is bug ?

> But I'm afraid, if you do nothing about free_pages_and_swapcache,
> then I can't see much point in studying the rest of it, which
> would only be addressing half of your problem.
> 
> > 
> > These patches + Nishimura-san's writeback fix will do complete work, I think.
> > But test is not enough.
> 
> Please provide a pointer to Nishimura-san's writeback fix,
> I seem to have missed that.
> 
This one.
  http://marc.info/?l=linux-kernel&m=124236139502335&w=2

> There is indeed little point in attacking the trylock_page()s here,
> unless you also attack all those PageWriteback backoffs.  I can imagine
> a simple patch to do that (removing from swapcache while PageWriteback),
> but it would be adding more atomic ops, and using spin_lock_irq on
> swap_lock everywhere, probably not a good tradeoff.
> 
Ok, I should consider more.
> > 
> > Any comments are welcome. 
> 
> I sincerely wish I could be less discouraging!
> 
I've been feeling like to crash my head by hitting it against an edge ot a tofu
in these days. But if patch 1/2 is acceptable with the modification you suggested,
there will be a way to go. 

You encouraged me :) thanks.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2009-05-22  0:28 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-21  7:41 KAMEZAWA Hiroyuki
2009-05-21  7:43 ` [RFC][PATCH 1/2] change swapcount handling KAMEZAWA Hiroyuki
2009-05-21  7:43 ` [RFC][PATCH 2/2] synchrouns swap freeing without trylock KAMEZAWA Hiroyuki
2009-05-21 12:44   ` Johannes Weiner
2009-05-21 23:46     ` KAMEZAWA Hiroyuki
2009-05-21 21:00 ` [RFC][PATCH] synchrouns swap freeing at zapping vmas Hugh Dickins
2009-05-22  0:26   ` KAMEZAWA Hiroyuki [this message]
2009-05-22  4:39 ` Daisuke Nishimura
2009-05-22  5:05   ` KAMEZAWA Hiroyuki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090522092656.21e76d8f.kamezawa.hiroyu@jp.fujitsu.com \
    --to=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=akpm@linux-foundation.org \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=linux-mm@kvack.org \
    --cc=nishimura@mxp.nes.nec.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox