From: Daniel Colascione <dancol@google.com>
Date: Wed, 21 Feb 2018 11:05:04 -0800
Subject: Re: [PATCH] Synchronize task mm counters on context switch
To: linux-mm@kvack.org
Cc: Daniel Colascione <dancol@google.com>
In-Reply-To: <20180205220325.197241-1-dancol@google.com>
References: <20180205220325.197241-1-dancol@google.com>

On Mon, Feb 5, 2018 at 2:03 PM, Daniel Colascione <dancol@google.com> wrote:
> When SPLIT_RSS_COUNTING is in use (which it is on SMP systems,
> generally speaking), we buffer certain changes to mm-wide counters
> through counters local to the current struct task, flushing them to
> the mm after seeing 64 page faults, as well as on task exit and
> exec. This scheme can leave a large amount of memory unaccounted for
> in process memory counters, especially for processes with many threads
> (each of which gets 64 "free" faults), and it produces an
> inconsistency with the same memory counters scanned VMA-by-VMA using
> smaps. This inconsistency can persist for an arbitrarily long time,
> since there is no way to force a task to flush its counters to its mm.
>
> This patch flushes counters on context switch. This way, we bound the
> amount of unaccounted memory without forcing tasks to flush to the
> mm-wide counters on each minor page fault. The flush operation should
> be cheap: we only have a few counters, adjacent in struct task, and we
> don't atomically write to the mm counters unless we've changed
> something since the last flush.
>
> Signed-off-by: Daniel Colascione <dancol@google.com>
> ---
>  kernel/sched/core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a7bf32aabfda..7f197a7698ee 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3429,6 +3429,9 @@ asmlinkage __visible void __sched schedule(void)
>         struct task_struct *tsk = current;
>
>         sched_submit_work(tsk);
> +       if (tsk->mm)
> +               sync_mm_rss(tsk->mm);
> +
>         do {
>                 preempt_disable();
>                 __schedule(false);
>

Ping? Is this approach just a bad idea? We could instead just manually
sync all mm-attached tasks at counter-retrieval time.
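
For concreteness, here is a very rough, untested sketch of what the
counter-retrieval-time alternative might look like, assuming
CONFIG_SPLIT_RSS_COUNTING (so each task carries buffered deltas in
task_struct::rss_stat). The helper name mm_counter_synced is made up,
and the /proc reporting code (e.g. fs/proc/task_mmu.c) would have to be
switched over to it:

#include <linux/mm.h>            /* get_mm_counter() */
#include <linux/rcupdate.h>      /* rcu_read_lock() */
#include <linux/sched/signal.h>  /* for_each_process_thread() */

/*
 * Hypothetical helper: fold every attached task's buffered rss_stat
 * delta into the mm-wide counter at read time, instead of flushing in
 * schedule().  Reads of other tasks' counters are racy, but no more so
 * than the per-task buffering already is.
 */
static unsigned long mm_counter_synced(struct mm_struct *mm, int member)
{
        long value = get_mm_counter(mm, member);
        struct task_struct *p, *t;

        rcu_read_lock();
        for_each_process_thread(p, t) {
                if (READ_ONCE(t->mm) == mm)
                        value += READ_ONCE(t->rss_stat.count[member]);
        }
        rcu_read_unlock();

        return value > 0 ? value : 0;
}

The downside is that each retrieval becomes a walk over every thread in
the system rather than O(1), which is why flushing in schedule() looked
more attractive to me, but it would keep the scheduler hot path
completely untouched.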