On Mon, Feb 5, 2018 at 2:03 PM, Daniel Colascione wrote:
> When SPLIT_RSS_COUNTING is in use (which it is on SMP systems,
> generally speaking), we buffer certain changes to mm-wide counters
> through counters local to the current struct task, flushing them to
> the mm after seeing 64 page faults, as well as on task exit and
> exec. This scheme can leave a large amount of memory unaccounted-for
> in process memory counters, especially for processes with many threads
> (each of which gets 64 "free" faults), and it produces an
> inconsistency with the same memory counters scanned VMA-by-VMA using
> smaps. This inconsistency can persist for an arbitrarily long time,
> since there is no way to force a task to flush its counters to its mm.
>
> This patch flushes counters on context switch. This way, we bound the
> amount of unaccounted memory without forcing tasks to flush to the
> mm-wide counters on each minor page fault. The flush operation should
> be cheap: we only have a few counters, adjacent in struct task, and we
> don't atomically write to the mm counters unless we've changed
> something since the last flush.
>
> Signed-off-by: Daniel Colascione
> ---
>  kernel/sched/core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a7bf32aabfda..7f197a7698ee 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3429,6 +3429,9 @@ asmlinkage __visible void __sched schedule(void)
>  	struct task_struct *tsk = current;
>
>  	sched_submit_work(tsk);
> +	if (tsk->mm)
> +		sync_mm_rss(tsk->mm);
> +
>  	do {
>  		preempt_disable();
>  		__schedule(false);

Ping? Is this approach just a bad idea? We could instead just manually
sync all mm-attached tasks at counter-retrieval time.
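
For reference while this sits in review, below is a minimal, self-contained
userspace sketch of the split-counting scheme the commit message describes.
It is a model built only from that description: names such as sync_mm_rss(),
add_mm_counter_fast(), and TASK_RSS_EVENTS_THRESH loosely mirror the kernel's,
but the two-counter layout, the task-pointer signatures, and
context_switch_flush() are simplifications for illustration, not the actual
mm/memory.c implementation.

/*
 * Userspace model of SPLIT_RSS_COUNTING (illustration only; layouts and
 * signatures are assumptions, not the kernel's real definitions).
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_MM_COUNTERS		2	/* e.g. file pages, anon pages */
#define TASK_RSS_EVENTS_THRESH	64	/* flush after this many faults */

struct mm_struct {
	atomic_long mm_rss[NR_MM_COUNTERS];	/* mm-wide shared counters */
};

struct task_struct {
	struct mm_struct *mm;
	long rss_count[NR_MM_COUNTERS];	/* task-local buffered deltas */
	int rss_events;			/* faults since last flush */
};

/* Flush buffered deltas to the shared mm counters; skip unchanged ones. */
static void sync_mm_rss(struct task_struct *tsk)
{
	for (int i = 0; i < NR_MM_COUNTERS; i++) {
		if (tsk->rss_count[i]) {
			atomic_fetch_add(&tsk->mm->mm_rss[i],
					 tsk->rss_count[i]);
			tsk->rss_count[i] = 0;
		}
	}
	tsk->rss_events = 0;
}

/* Fast path on a page fault: buffer locally, no atomic write. */
static void add_mm_counter_fast(struct task_struct *tsk, int member, long val)
{
	tsk->rss_count[member] += val;
	if (++tsk->rss_events > TASK_RSS_EVENTS_THRESH)
		sync_mm_rss(tsk);	/* periodic flush bounds the drift */
}

/* The patch's idea: flush at context switch so drift can't persist. */
static void context_switch_flush(struct task_struct *tsk)
{
	if (tsk->mm)
		sync_mm_rss(tsk);
}

int main(void)
{
	struct mm_struct mm = { 0 };
	struct task_struct tsk = { .mm = &mm };

	/* 40 buffered faults: below threshold, so the mm still reads 0 */
	for (int i = 0; i < 40; i++)
		add_mm_counter_fast(&tsk, 0, 1);
	printf("before flush: mm=%ld task-local=%ld\n",
	       atomic_load(&mm.mm_rss[0]), tsk.rss_count[0]);

	context_switch_flush(&tsk);
	printf("after flush:  mm=%ld task-local=%ld\n",
	       atomic_load(&mm.mm_rss[0]), tsk.rss_count[0]);
	return 0;
}

Built with cc -std=c11, this should print mm=0 with the 40 faults still
buffered task-locally, then mm=40 after the flush; that buffered-but-invisible
window is the drift the patch bounds at each context switch.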