Date: Fri, 23 Oct 2020 19:56:49 +0200
From: Vlastimil Babka
To: Axel Rasmussen
Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
 Daniel Jordan, Jann Horn, Chinwen Chang, Davidlohr Bueso, David Rientjes,
 Yafang Shao, LKML, Linux MM
Subject: Re: [PATCH v4 1/1] mmap_lock: add tracepoints around lock acquisition
References: <20201020184746.300555-1-axelrasmussen@google.com>
 <20201020184746.300555-2-axelrasmussen@google.com>

On 10/23/20 7:38 PM, Axel Rasmussen wrote:
> On Fri, Oct 23, 2020 at 7:00 AM Vlastimil Babka wrote:
>>
>> On 10/20/20 8:47 PM, Axel Rasmussen wrote:
>> > The goal of these tracepoints is to be able to debug lock contention
>> > issues. This lock is acquired on most (all?)
>> > mmap / munmap / page fault
>> > operations, so a multi-threaded process which does a lot of these can
>> > experience significant contention.
>> >
>> > We trace just before we start acquisition, when the acquisition returns
>> > (whether it succeeded or not), and when the lock is released (or
>> > downgraded). The events are broken out by lock type (read / write).
>> >
>> > The events are also broken out by memcg path. For container-based
>> > workloads, users often think of several processes in a memcg as a single
>> > logical "task", so collecting statistics at this level is useful.
>> >
>> > The end goal is to get latency information. This isn't directly included
>> > in the trace events. Instead, users are expected to compute the time
>> > between "start locking" and "acquire returned", using e.g. synthetic
>> > events or BPF. The benefit we get from this is simpler code.
>> >
>> > Because we use tracepoint_enabled() to decide whether or not to trace,
>> > this patch has effectively no overhead unless tracepoints are enabled at
>> > runtime. If tracepoints are enabled, there is a performance impact, but
>> > how much depends on exactly what e.g. the BPF program does.
>> >
>> > Reviewed-by: Michel Lespinasse
>> > Acked-by: Yafang Shao
>> > Acked-by: David Rientjes
>> > Signed-off-by: Axel Rasmussen
>>
>> All seems fine to me, except I started to wonder...
>>
>> > +
>> > +#ifdef CONFIG_MEMCG
>> > +
>> > +DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
>> > +
>> > +/*
>> > + * Write the given mm_struct's memcg path to a percpu buffer, and return a
>> > + * pointer to it. If the path cannot be determined, the buffer will contain the
>> > + * empty string.
>> > + *
>> > + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
>> > + * disabled by the caller before calling us, and re-enabled only after the
>> > + * caller is done with the pointer.
>>
>> Is this enough? What if we fill the buffer and then an interrupt comes and the
>> handler calls here again? We overwrite the buffer and potentially report a wrong
>> cgroup after the execution resumes?
>> If nothing worse can happen (are interrupts disabled while the ftrace code is
>> copying from the buffer?), then it's probably ok?
>
> I think you're right, get_cpu()/put_cpu() only deal with preemption,
> not interrupts.
>
> I'm somewhat sure this code can be called in interrupt context, so I
> don't think we can use locks to prevent this situation. I think it
> works like this: say we acquire the lock, an interrupt happens, and
> then we try to acquire again on the same CPU; we can't sleep, so we're
> stuck.

Yes, we could perhaps trylock() and if it fails, give up on the memcg path.

> I think we can't kmalloc here (instead of a percpu buffer) either,
> since I would guess that kmalloc may also acquire mmap_lock itself?

The overhead is not worth it anyway, for a tracepoint.

> Is adding local_irq_save()/local_irq_restore() in addition to
> get_cpu()/put_cpu() sufficient?

If you do that, then I guess you don't need get_cpu()/put_cpu() anymore, but
it's more costly.

But it sounds like we are solving something that the tracing subsystem has to
solve as well to store the trace event data, so maybe Steven has some better
idea?
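For concreteness, here is a rough sketch of the trylock idea. It is untested,
and the per-cpu spinlock (plus its spin_lock_init() somewhere at boot, omitted
here) is my invention rather than anything in the patch:

static DEFINE_PER_CPU(spinlock_t, trace_memcg_path_lock);

#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)				\
	do {								\
		spinlock_t *lock;					\
									\
		get_cpu();						\
		lock = this_cpu_ptr(&trace_memcg_path_lock);		\
		/* A caller re-entering on this CPU (e.g. from an	\
		 * interrupt) fails the trylock and reports an empty	\
		 * memcg path instead of deadlocking or overwriting	\
		 * the buffer mid-use.					\
		 */							\
		if (spin_trylock(lock)) {				\
			trace_mmap_lock_##type(mm,			\
					       get_mm_memcg_path(mm),	\
					       ##__VA_ARGS__);		\
			spin_unlock(lock);				\
		} else {						\
			trace_mmap_lock_##type(mm, "",			\
					       ##__VA_ARGS__);		\
		}							\
		put_cpu();						\
	} while (0)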
>>
>> > + */
>> > +static const char *get_mm_memcg_path(struct mm_struct *mm)
>> > +{
>> > +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>> > +
>> > +	if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
>> > +		char *buf = this_cpu_ptr(trace_memcg_path);
>> > +
>> > +		cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
>> > +		return buf;
>> > +	}
>> > +	return "";
>> > +}
>> > +
>> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)				\
>> > +	do {								\
>> > +		get_cpu();						\
>> > +		trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),	\
>> > +				       ##__VA_ARGS__);			\
>> > +		put_cpu();						\
>> > +	} while (0)
>> > +
>> > +#else /* !CONFIG_MEMCG */
>> > +
>> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)				\
>> > +	trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
>> > +
>> > +#endif /* CONFIG_MEMCG */
>> > +
>> > +/*
>> > + * Trace calls must be in a separate file, as otherwise there's a circular
>> > + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
>> > + */
>> > +
>> > +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
>> > +{
>> > +	TRACE_MMAP_LOCK_EVENT(start_locking, mm, write);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
>> > +
>> > +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
>> > +					   bool success)
>> > +{
>> > +	TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
>> > +
>> > +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
>> > +{
>> > +	TRACE_MMAP_LOCK_EVENT(released, mm, write);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
>> >
>>
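And for completeness, the local_irq_save() variant Axel floated would look
roughly like this (also untested). As noted above, once interrupts are off,
preemption is off too, so get_cpu()/put_cpu() become redundant, but the irq
toggling makes the enabled-tracepoint path more expensive:

#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)				\
	do {								\
		unsigned long flags;					\
									\
		/* With irqs disabled we cannot migrate CPUs, and no	\
		 * interrupt handler can reuse the per-cpu buffer	\
		 * while we are formatting or consuming it.		\
		 */							\
		local_irq_save(flags);					\
		trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),	\
				       ##__VA_ARGS__);			\
		local_irq_restore(flags);				\
	} while (0)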