Date: Fri, 23 Oct 2020 19:56:49 +0200
From: Vlastimil Babka
To: Axel Rasmussen
Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
 Daniel Jordan, Jann Horn, Chinwen Chang, Davidlohr Bueso, David Rientjes,
 Yafang Shao, LKML, Linux MM
Subject: Re: [PATCH v4 1/1] mmap_lock: add tracepoints around lock acquisition
References: <20201020184746.300555-1-axelrasmussen@google.com>
 <20201020184746.300555-2-axelrasmussen@google.com>

On 10/23/20 7:38 PM, Axel Rasmussen wrote:
> On Fri, Oct 23, 2020 at 7:00 AM Vlastimil Babka wrote:
>>
>> On 10/20/20 8:47 PM, Axel Rasmussen wrote:
>> > The goal of these tracepoints is to be able to debug lock contention
>> > issues. This lock is acquired on most (all?)
>> > mmap / munmap / page fault
>> > operations, so a multi-threaded process which does a lot of these can
>> > experience significant contention.
>> >
>> > We trace just before we start acquisition, when the acquisition returns
>> > (whether it succeeded or not), and when the lock is released (or
>> > downgraded). The events are broken out by lock type (read / write).
>> >
>> > The events are also broken out by memcg path. For container-based
>> > workloads, users often think of several processes in a memcg as a single
>> > logical "task", so collecting statistics at this level is useful.
>> >
>> > The end goal is to get latency information. This isn't directly included
>> > in the trace events. Instead, users are expected to compute the time
>> > between "start locking" and "acquire returned", using e.g. synthetic
>> > events or BPF. The benefit we get from this is simpler code.
>> >
>> > Because we use tracepoint_enabled() to decide whether or not to trace,
>> > this patch has effectively no overhead unless tracepoints are enabled at
>> > runtime. If tracepoints are enabled, there is a performance impact, but
>> > how much depends on exactly what e.g. the BPF program does.
>> >
>> > Reviewed-by: Michel Lespinasse
>> > Acked-by: Yafang Shao
>> > Acked-by: David Rientjes
>> > Signed-off-by: Axel Rasmussen
>>
>> All seems fine to me, except I started to wonder...
>>
>> > +
>> > +#ifdef CONFIG_MEMCG
>> > +
>> > +DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
>> > +
>> > +/*
>> > + * Write the given mm_struct's memcg path to a percpu buffer, and return a
>> > + * pointer to it. If the path cannot be determined, the buffer will contain the
>> > + * empty string.
>> > + *
>> > + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
>> > + * disabled by the caller before calling us, and re-enabled only after the
>> > + * caller is done with the pointer.
>>
>> Is this enough? What if we fill the buffer and then an interrupt comes and the
>> handler calls here again? We overwrite the buffer and potentially report a wrong
>> cgroup after the execution resumes?
>> If nothing worse can happen (are interrupts disabled while the ftrace code is
>> copying from the buffer?), then it's probably ok?
>
> I think you're right, get_cpu()/put_cpu() only deal with preemption,
> not interrupts.
>
> I'm somewhat sure this code can be called in interrupt context, so I
> don't think we can use locks to prevent this situation. I think it
> works like this: say we acquire the lock, an interrupt happens, and
> then we try to acquire again on the same CPU; we can't sleep, so we're
> stuck.

Yes, we could perhaps trylock() and if it fails, give up on the memcg path.

> I think we can't kmalloc here (instead of a percpu buffer) either,
> since I would guess that kmalloc may also acquire mmap_lock itself?

The overhead is not worth it anyway, for a tracepoint.

> Is adding local_irq_save()/local_irq_restore() in addition to
> get_cpu()/put_cpu() sufficient?

If you do that, then I guess you don't need get_cpu()/put_cpu() anymore, but
it's more costly.

But it sounds like we are solving something that the tracing subsystem has to
solve as well to store the trace event data, so maybe Steven has some better
idea?
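For concreteness, here is a rough sketch of the trylock idea. It is untested,
and the per-cpu spinlock (plus its spin_lock_init() somewhere at boot, omitted
here) is my invention rather than anything in the patch:

static DEFINE_PER_CPU(spinlock_t, trace_memcg_path_lock);

#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)				\
	do {								\
		spinlock_t *lock;					\
									\
		get_cpu();						\
		lock = this_cpu_ptr(&trace_memcg_path_lock);		\
		/* A caller re-entering on this CPU (e.g. from an	\
		 * interrupt) fails the trylock and reports an empty	\
		 * memcg path instead of deadlocking or overwriting	\
		 * the buffer mid-use.					\
		 */							\
		if (spin_trylock(lock)) {				\
			trace_mmap_lock_##type(mm,			\
					       get_mm_memcg_path(mm),	\
					       ##__VA_ARGS__);		\
			spin_unlock(lock);				\
		} else {						\
			trace_mmap_lock_##type(mm, "",			\
					       ##__VA_ARGS__);		\
		}							\
		put_cpu();						\
	} while (0)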
>>
>> > + */
>> > +static const char *get_mm_memcg_path(struct mm_struct *mm)
>> > +{
>> > +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>> > +
>> > +	if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
>> > +		char *buf = this_cpu_ptr(trace_memcg_path);
>> > +
>> > +		cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
>> > +		return buf;
>> > +	}
>> > +	return "";
>> > +}
>> > +
>> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)				\
>> > +	do {								\
>> > +		get_cpu();						\
>> > +		trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),	\
>> > +				       ##__VA_ARGS__);			\
>> > +		put_cpu();						\
>> > +	} while (0)
>> > +
>> > +#else /* !CONFIG_MEMCG */
>> > +
>> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)				\
>> > +	trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
>> > +
>> > +#endif /* CONFIG_MEMCG */
>> > +
>> > +/*
>> > + * Trace calls must be in a separate file, as otherwise there's a circular
>> > + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
>> > + */
>> > +
>> > +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
>> > +{
>> > +	TRACE_MMAP_LOCK_EVENT(start_locking, mm, write);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
>> > +
>> > +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
>> > +					   bool success)
>> > +{
>> > +	TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
>> > +
>> > +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
>> > +{
>> > +	TRACE_MMAP_LOCK_EVENT(released, mm, write);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
>> >
>>
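And for completeness, the local_irq_save() variant Axel floated would look
roughly like this (also untested). As noted above, once interrupts are off,
preemption is off too, so get_cpu()/put_cpu() become redundant, but the irq
toggling makes the enabled-tracepoint path more expensive:

#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)				\
	do {								\
		unsigned long flags;					\
									\
		/* With irqs disabled we cannot migrate CPUs, and no	\
		 * interrupt handler can reuse the per-cpu buffer	\
		 * while we are formatting or consuming it.		\
		 */							\
		local_irq_save(flags);					\
		trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),	\
				       ##__VA_ARGS__);			\
		local_irq_restore(flags);				\
	} while (0)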