From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org [172.17.192.35])
	by mail.linuxfoundation.org (Postfix) with ESMTPS id A6F5C6C;
	Thu, 21 Jul 2016 16:04:56 +0000 (UTC)
Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42])
	by smtp1.linuxfoundation.org (Postfix) with ESMTPS id 13D77E2;
	Thu, 21 Jul 2016 16:04:55 +0000 (UTC)
From: Chris Mason
To: Jan Kara
References: <578F36B9.802@huawei.com>
	<20160721100014.GB7901@quack2.suse.cz>
	<577236a8-2921-842a-2243-b8ecfe467381@fb.com>
	<20160721154532.GC14146@quack2.suse.cz>
Message-ID: <788eb231-1701-9602-c5dc-36b8f82db21b@fb.com>
Date: Thu, 21 Jul 2016 12:03:34 -0400
MIME-Version: 1.0
In-Reply-To: <20160721154532.GC14146@quack2.suse.cz>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: ksummit-discuss@lists.linuxfoundation.org
Subject: Re: [Ksummit-discuss] [TECH TOPIC] Kernel tracing and end-to-end performance breakdown

On 07/21/2016 11:45 AM, Jan Kara wrote:
> On Thu 21-07-16 09:54:53, Chris Mason wrote:
>> On 07/21/2016 06:00 AM, Jan Kara wrote:
>>>
>>> So I think improvements in performance analysis are always welcome but
>>> current proposal seems to be somewhat handwavy so I'm not sure what outcome
>>> you'd like to get from the discussion... If you have a more concrete
>>> proposal how you'd like to achieve what you need, then it may be worth
>>> discussion.
>>>
>>> As a side note I know that Google (and maybe Facebook, not sure here) have
>>> out-of-tree patches which provide really neat performance analysis
>>> capabilities. I have heard they are not really upstreamable because they
>>> are horrible hacks but maybe they can be a good inspiration for this work.
>>> If we could get someone from these companies to explain what capabilities
>>> they have and how they achieve this (regardless of how hacky the
>>> implementation may be), that may be an interesting topic.
>>
>> At least for Facebook, we're moving most things to bpf. The most
>> interesting part of our analysis isn't so much from the tool used to record
>> it, it's from being able to aggregate over the fleet and make comparisons
>> at scale.
>>
>> For example, Josef set up the off-cpu flame graphs such that we can record
>> stack traces for a latency higher than N, and then sum up the most expensive
>> stack traces over a large number of machines. It makes it much easier to
>> find those happens-once-a-day problems.
>
> By latency higher than N, do you mean that e.g. a syscall took more than N,
> or just that a process is sleeping for more than N in some place?

Single sleep longer than N. It would be a little more involved to track
all the sleeps in a single syscall, but we haven't needed to (yet).

-chris
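
For anyone who wants to picture the aggregation step described above, here is
a rough sketch in Python. It is not the tooling Josef set up, which isn't
shown in this thread. It assumes each machine has already exported
(stack, sleep time in microseconds) pairs, with stacks in the semicolon-folded
form commonly fed to flame graph scripts. The function names, host names and
sample stacks are all made up for illustration.

```python
from collections import Counter


def expensive_stacks(samples, threshold_us):
    """Sum off-CPU time per stack, keeping only single sleeps longer than threshold_us."""
    totals = Counter()
    for stack, sleep_us in samples:
        # Each sample is one sleep; sleeps are not grouped per syscall.
        if sleep_us > threshold_us:
            totals[stack] += sleep_us
    return totals


def aggregate_fleet(per_host_samples, threshold_us, top=10):
    """Sum the per-host totals and return the most expensive stacks fleet-wide."""
    fleet = Counter()
    for samples in per_host_samples.values():
        fleet.update(expensive_stacks(samples, threshold_us))
    return fleet.most_common(top)


# Hypothetical samples: host -> list of (folded stack, sleep time in us).
per_host = {
    "host-a": [
        ("read;wait_on_page_bit;io_schedule", 1200),
        ("futex_wait;schedule", 300),
    ],
    "host-b": [
        ("read;wait_on_page_bit;io_schedule", 5000),
        ("fsync;jbd2_log_wait_commit;schedule", 900),
    ],
}

if __name__ == "__main__":
    for stack, total_us in aggregate_fleet(per_host, threshold_us=500):
        print(total_us, stack)
```

The threshold is applied to each sleep on its own, matching the "single sleep
longer than N" answer. Tracking all the sleeps inside one syscall would need a
per-syscall key in the samples, which this sketch does not attempt.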