linux-mm.kvack.org archive mirror
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Greg Thelen <gthelen@google.com>
Cc: lsf10-pc@lists.linuxfoundation.org, linux-mm@kvack.org,
	"nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>,
	"balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>
Subject: Re: [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path
Date: Tue, 29 Jun 2010 15:30:59 +0900
Message-ID: <20100629153059.c49db3b6.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <AANLkTik3l5jZlxqmDkkHdEFle4MJFcKLh1kPVNrK6CyE@mail.gmail.com>

On Mon, 28 Jun 2010 22:31:03 -0700
Greg Thelen <gthelen@google.com> wrote:

> On Sun, Jun 27, 2010 at 7:03 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 25 Jun 2010 13:43:45 -0700
> > Greg Thelen <gthelen@google.com> wrote:

> >> /dev/cgroup/cg1/cg11  # T1: want memory.limit = 30MB
> >> /dev/cgroup/cg1/cg12  # T2: want memory.limit = 100MB
> >> /dev/cgroup/cg1       # want memory.limit = 1GB + 30MB + 100MB
> >>
> >> I have implemented a prototype that allows a file system hierarchy to be charged
> >> to a particular cgroup using a new bind mount option:
> >> + mount -t cgroup none /dev/cgroup -o memory
> >> + mount --bind /tmp/db /tmp/db -o cgroup=/dev/cgroup/cg1
> >>
> >> Any accesses to files within /tmp/db are charged to /dev/cgroup/cg1. Accesses to
> >> other files behave normally: they charge the cgroup of the current task.
> >>
> >
> > Interesting, but I would rather use madvise() etc. for this kind of job than
> > add deep hooks into the kernel.
> >
> > madvise(addr, size, MEMORY_RECHARGE_THIS_PAGES_TO_ME);
> >
> > Then, you can write a command as:
> >
> >   file_recharge [path name] [cgroup]
> >   - this command moves a file's cache to the specified cgroup.
> >
> > A daemon program that uses this command plus inotify would give us much more
> > flexible control over file cache in memcg. Do you have any requirement that
> > rules out doing this move-charge lazily?
> >
> > Status:
> > We have code for move-charge and inotify, but no code for the new madvise.
> >
> >
> > Thanks,
> > -Kame
> 
> This is an interesting approach.  I like the idea of minimizing kernel
> changes.  I want to make sure I understand the idea using terms from
> my above example.
> 
> 1. The daemon establishes inotify() watches on /tmp/db and all
> subdirectories to catch any accesses.
> 
> 2. If cg11(T1) is the first process to mmap a portion of a /tmp/db
> file (pages_1) then cg11 will be charged.  T1 will not use madvise()
> because cg11 does not want to be charged.  cg11 will be temporarily
> charged for pages_1.
> 
yes.

> 3. inotify() will inform the proposed daemon that T1 opened /tmp/db,
> so the daemon will use file_recharge, which runs the following within
> the cg1 cgroup:
> - fd = open("/tmp/db/.../path_to_file")
> - fstat(fd, &st); va = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0)
> - madvise(va, st.st_size, MEMORY_RECHARGE_THIS_PAGES_TO_ME).  This
> will move the charge of pages_1 from cg11 to cg1.
> 
> Did I state this correctly?
> 
yes.
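A minimal C sketch of what step 3 (the proposed file_recharge command) would do. It assumes the command has already moved itself into the target cgroup, for example by writing its PID to that cgroup's tasks file. MEMORY_RECHARGE_THIS_PAGES_TO_ME does not exist in Linux, so the recharge is a hypothetical stub rather than a real madvise() call; the function names are illustrative only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * HYPOTHETICAL stand-in for the proposed
 * madvise(addr, len, MEMORY_RECHARGE_THIS_PAGES_TO_ME). That advice value
 * does not exist in Linux, so this stub only reports success.
 */
static int hypothetical_recharge(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    return 0;
}

/*
 * Sketch of file_recharge: map the whole file and ask for its cached pages
 * to be charged to the caller's cgroup. Returns the number of pages the
 * mapping covers, or -1 on error.
 */
long file_recharge(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    void *va = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (va == MAP_FAILED)
        return -1;

    int rc = hypothetical_recharge(va, len);
    munmap(va, len);
    if (rc < 0)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return (long)((len + (size_t)page - 1) / (size_t)page);
}
```

Because the whole file is mapped, pages that T1 touches later (pages_2) would only be picked up the next time the command runs, which is the lazy behaviour being discussed.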


> I am concerned that the follow-on step does not move the pages to cg1:
> 4. T1 then touches more /tmp/db pages (pages_2) using the same mmap.
> This charges cg11.  I assume that inotify() would not notify the
> daemon for this case because the file is still open. 
you're right.

> So the pages will not be moved to cg1.  Or are you suggesting
> that inotify() be enhanced to advertise charge events?

IIUC, inotify() currently does not report accesses made through mmap, but it does
provide read/write notifications (IN_ACCESS/IN_MODIFY). So, let's think about mmapped pages.

For an easy implementation, I suggest that file_recharge map the whole file
and move all of its pages. But maybe this is the answer you want.

If I were to write an _easy_ daemon, it would do...

==
  register inotify and add watches.
  The watches will report IN_OPEN and IN_DELETE_SELF.

  run 2 threads.

Thread1:
  while(1) {
      read() // check events from inotify.
      maintain opened-file information.
  }

Thread2:
  while (1) {
      check opened-file information.
      select a file // you may implement some scheduling here.
      open()
      mmap()
      mincore()  // check whether the file's pages are cached.
      madvise()
      // if you want, touch the pages to set their Accessed bits.
      close()

      sleep if necessary.
 }
==
A batch-style cron job instead of sleeping would not be too bad for typical use,
but we may need some additional interface to implement a cleverer algorithm.
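A minimal C sketch of Thread1's event handling for a single watched directory, using the inotify(7) API with IN_OPEN | IN_DELETE_SELF as in the pseudocode above. The open_db table and the function names are hypothetical. inotify watches are not recursive, so a real daemon needs one watch per subdirectory of /tmp/db, and it should also handle IN_Q_OVERFLOW.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_TRACKED 64

/* Hypothetical "opened-file information" kept by Thread1. */
struct open_db {
    int count;
    char names[MAX_TRACKED][NAME_MAX + 1];
    int dir_deleted;                /* set when IN_DELETE_SELF arrives */
};

/* Create a non-blocking inotify instance watching dir; returns fd or -1. */
int watch_dir(const char *dir)
{
    int ifd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
    if (ifd < 0)
        return -1;
    if (inotify_add_watch(ifd, dir, IN_OPEN | IN_DELETE_SELF) < 0) {
        close(ifd);
        return -1;
    }
    return ifd;
}

/* Record a name once; names are dropped when the table is full. */
static void db_add(struct open_db *db, const char *name)
{
    for (int i = 0; i < db->count; i++)
        if (strcmp(db->names[i], name) == 0)
            return;
    if (db->count < MAX_TRACKED) {
        strncpy(db->names[db->count], name, NAME_MAX);
        db->names[db->count][NAME_MAX] = '\0';
        db->count++;
    }
}

/* One pass of Thread1's loop: drain queued events into db. Returns events seen. */
int drain_events(int ifd, struct open_db *db)
{
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    int seen = 0;

    assert(db != NULL);
    for (;;) {
        ssize_t len = read(ifd, buf, sizeof buf);
        if (len <= 0)
            break;                  /* EAGAIN (queue empty) or error */
        for (char *p = buf; p < buf + len;) {
            const struct inotify_event *ev = (const struct inotify_event *)p;
            if ((ev->mask & IN_OPEN) && ev->len > 0)
                db_add(db, ev->name);   /* name is relative to the watched dir */
            if (ev->mask & IN_DELETE_SELF)
                db->dir_deleted = 1;
            seen++;
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    return seen;
}
```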


> If the number of directories within /tmp/db is large, then inotify()
> may be expensive.  I don't think this is a problem.
> 
> Another worry I have is that if for some reason the daemon is started
> after the job, or if the daemon crashes and is restarted, then files
> may have been opened and charged to cg11 without the inotify being
> setup. 
yes.

> The daemon would have problems finding the pages that were
> charged to cg11 and need to be moved to cg1.  The daemon could scan
> the open file table of T1, but any files that are no longer opened may
> be charged to cg11 with no way for the daemon to find them.
> 

Thread1 above can maintain the "opened-file" database.
Or you can run a recovery script that opens the files listed under /proc/<xxxx>/fd
of each process to trigger IN_OPEN events.
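The recovery path can be sketched by walking /proc/<pid>/fd, where each entry is a symlink to the opened file. The helper below (a hypothetical name) only counts descriptors whose target lies under a prefix such as /tmp/db/; a recovery script would then reopen those paths to regenerate IN_OPEN events. As noted above, files that were opened and already closed leave no trace here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count how many of pid's open descriptors refer to paths starting with
 * prefix. Returns the count, or -1 if /proc/<pid>/fd cannot be read.
 */
int count_open_under(pid_t pid, const char *prefix)
{
    char fddir[64];
    snprintf(fddir, sizeof fddir, "/proc/%d/fd", (int)pid);
    DIR *d = opendir(fddir);
    if (d == NULL)
        return -1;

    size_t plen = strlen(prefix);
    int hits = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;               /* skip "." and ".." */
        char link[PATH_MAX], target[PATH_MAX];
        snprintf(link, sizeof link, "%s/%s", fddir, de->d_name);
        ssize_t n = readlink(link, target, sizeof target - 1);
        if (n < 0)
            continue;               /* fd closed meanwhile, or no permission */
        target[n] = '\0';           /* readlink() does not NUL-terminate */
        if (strncmp(target, prefix, plen) == 0)
            hits++;
    }
    closedir(d);
    return hits;
}
```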

But yes, some in-kernel approach may be required, such as a new interface to memcg
rather than madvise:

/memory.move_file_caches
- when you open a file and pass its file descriptor to this control file with
  write() or ioctl(), all in-memory pages of that file will be moved to this cgroup.
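To make the proposed interface concrete, here is a sketch of the user-space side of the write() variant. memory.move_file_caches exists only as a proposal here, so the control-file path is whatever the caller passes, and the "<fd>\n" text format is an assumption, borrowed from memcg v1's cgroup.event_control, which also accepts file descriptor numbers written as text. The ioctl() variant is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * HYPOTHETICAL: ask the cgroup owning ctl_path (e.g. a proposed
 * memory.move_file_caches file) to take over the page cache of file_path.
 * Returns 0 on success, -1 on error.
 */
int request_move_file_caches(const char *ctl_path, const char *file_path)
{
    int file_fd = open(file_path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    int ctl = open(ctl_path, O_WRONLY);
    if (ctl < 0) {
        close(file_fd);
        return -1;
    }

    char msg[32];
    int n = snprintf(msg, sizeof msg, "%d\n", file_fd);
    /* The kernel would resolve file_fd in the writer's fd table,
       so file_fd must stay open until write() returns. */
    ssize_t w = write(ctl, msg, (size_t)n);

    close(ctl);
    close(file_fd);
    return w == n ? 0 : -1;
}
```

In the test setup a regular file stands in for the control file, since no kernel implements this interface.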

Hmm... we may also be able to add an interface that reports the last page-cache
update time (because atime updates tend to be disabled by mount options such as noatime).
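Since atime cannot be relied on, a daemon could at least detect how the filesystem treats it. A sketch using statvfs(3), whose f_flag field exposes the GNU-specific ST_NOATIME and ST_RELATIME bits; the helper name and return codes are illustrative only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/statvfs.h>

/*
 * Report how the filesystem holding path treats atime:
 *   2 = noatime (never updated), 1 = relatime (updated lazily),
 *   0 = neither flag set, -1 = statvfs() failed.
 * ST_NOATIME and ST_RELATIME are GNU extensions, hence _GNU_SOURCE.
 */
int atime_mode(const char *path)
{
    struct statvfs sv;
    if (statvfs(path, &sv) != 0)
        return -1;
    if (sv.f_flag & ST_NOATIME)
        return 2;
#ifdef ST_RELATIME
    if (sv.f_flag & ST_RELATIME)
        return 1;
#endif
    return 0;
}
```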

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

Thread overview: 8+ messages
2010-06-25 20:43 Greg Thelen
2010-06-28  2:03 ` KAMEZAWA Hiroyuki
2010-06-28  5:07   ` Balbir Singh
2010-06-29  6:42     ` Greg Thelen
2010-06-29  5:31   ` [ATTEND][LSF/VM TOPIC] " Greg Thelen
2010-06-29  6:30     ` KAMEZAWA Hiroyuki [this message]
2010-07-01  4:16       ` Greg Thelen
2010-07-01  6:33         ` KAMEZAWA Hiroyuki
