On Mon, Feb 5, 2018 at 2:03 PM, Daniel Colascione wrote:
> When SPLIT_RSS_COUNTING is in use (which it is on SMP systems,
> generally speaking), we buffer certain changes to mm-wide counters
> through counters local to the current struct task, flushing them to
> the mm after seeing 64 page faults, as well as on task exit and
> exec. This scheme can leave a large amount of memory unaccounted-for
> in process memory counters, especially for processes with many threads
> (each of which gets 64 "free" faults), and it produces an
> inconsistency with the same memory counters scanned VMA-by-VMA using
> smaps. This inconsistency can persist for an arbitrarily long time,
> since there is no way to force a task to flush its counters to its mm.
>
> This patch flushes counters on context switch. This way, we bound the
> amount of unaccounted memory without forcing tasks to flush to the
> mm-wide counters on each minor page fault. The flush operation should
> be cheap: we only have a few counters, adjacent in struct task, and we
> don't atomically write to the mm counters unless we've changed
> something since the last flush.
>
> Signed-off-by: Daniel Colascione
> ---
>  kernel/sched/core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a7bf32aabfda..7f197a7698ee 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3429,6 +3429,9 @@ asmlinkage __visible void __sched schedule(void)
>  	struct task_struct *tsk = current;
>
>  	sched_submit_work(tsk);
> +	if (tsk->mm)
> +		sync_mm_rss(tsk->mm);
> +
>  	do {
>  		preempt_disable();
>  		__schedule(false);

Ping? Is this approach just a bad idea? We could instead just manually
sync all mm-attached tasks at counter-retrieval time.
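
For reference while this sits in review, below is a minimal, self-contained
userspace sketch of the split-counting scheme the commit message describes.
It is a model built only from that description: names such as sync_mm_rss(),
add_mm_counter_fast(), and TASK_RSS_EVENTS_THRESH loosely mirror the kernel's,
but the two-counter layout, the task-pointer signatures, and
context_switch_flush() are simplifications for illustration, not the actual
mm/memory.c implementation.

/*
 * Userspace model of SPLIT_RSS_COUNTING (illustration only; layouts and
 * signatures are assumptions, not the kernel's real definitions).
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_MM_COUNTERS		2	/* e.g. file pages, anon pages */
#define TASK_RSS_EVENTS_THRESH	64	/* flush after this many faults */

struct mm_struct {
	atomic_long mm_rss[NR_MM_COUNTERS];	/* mm-wide shared counters */
};

struct task_struct {
	struct mm_struct *mm;
	long rss_count[NR_MM_COUNTERS];	/* task-local buffered deltas */
	int rss_events;			/* faults since last flush */
};

/* Flush buffered deltas to the shared mm counters; skip unchanged ones. */
static void sync_mm_rss(struct task_struct *tsk)
{
	for (int i = 0; i < NR_MM_COUNTERS; i++) {
		if (tsk->rss_count[i]) {
			atomic_fetch_add(&tsk->mm->mm_rss[i],
					 tsk->rss_count[i]);
			tsk->rss_count[i] = 0;
		}
	}
	tsk->rss_events = 0;
}

/* Fast path on a page fault: buffer locally, no atomic write. */
static void add_mm_counter_fast(struct task_struct *tsk, int member, long val)
{
	tsk->rss_count[member] += val;
	if (++tsk->rss_events > TASK_RSS_EVENTS_THRESH)
		sync_mm_rss(tsk);	/* periodic flush bounds the drift */
}

/* The patch's idea: flush at context switch so drift can't persist. */
static void context_switch_flush(struct task_struct *tsk)
{
	if (tsk->mm)
		sync_mm_rss(tsk);
}

int main(void)
{
	struct mm_struct mm = { 0 };
	struct task_struct tsk = { .mm = &mm };

	/* 40 buffered faults: below threshold, so the mm still reads 0 */
	for (int i = 0; i < 40; i++)
		add_mm_counter_fast(&tsk, 0, 1);
	printf("before flush: mm=%ld task-local=%ld\n",
	       atomic_load(&mm.mm_rss[0]), tsk.rss_count[0]);

	context_switch_flush(&tsk);
	printf("after flush:  mm=%ld task-local=%ld\n",
	       atomic_load(&mm.mm_rss[0]), tsk.rss_count[0]);
	return 0;
}

Built with cc -std=c11, this should print mm=0 with the 40 faults still
buffered task-locally, then mm=40 after the flush; that buffered-but-invisible
window is the drift the patch bounds at each context switch.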