From: Tim Chen
To: Michal Hocko
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Dave Hansen, Dan Williams, Huang Ying
References: <20200214104541.GT31689@dhcp22.suse.cz>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
Message-ID: <20d941d9-0411-049f-76e0-f9bb9e07f6af@linux.intel.com>
Date: Wed, 4 Mar 2020 12:52:29 -0800
In-Reply-To: <20200214104541.GT31689@dhcp22.suse.cz>

On 2/14/20 2:45 AM, Michal Hocko wrote:
> On Wed 05-02-20 10:34:57, Tim Chen wrote:
>> Topic: Memory cgroups, whether you like it or not
>>
>> 1. Memory cgroup counters scalability
>>
>> Recently, benchmark teams at Intel were running some bare-metal
>> benchmarks.  To our great surprise, we saw lots of memcg activity in
>> the profiles.  When we asked the benchmark team, they did not even
>> realize they were using memory cgroups.  They were fond of running all
>> their benchmarks in containers that just happened to use memory cgroups
>> by default.  What were previously problems only for memory cgroup users
>> are quickly becoming a problem for everyone.
>>
>> There are mem cgroup counters that are read in page management paths
>> which scale poorly when read.  These counters are per cpu based and
>> need to be summed over all CPUs to get the overall value for the mem
>> cgroup in lruvec_page_state_local function.  This led to scalability
>> problems on system with large numbers of CPUs. For example, we’ve seen 14+% kernel
>> CPU time consumed in snapshot_refaults().  We have also encountered a
>> similar issue recently when computing the lru_size[1].
>>
>> We'll like to do some brainstorming to see if there are ways to make
>> such accounting more scalable.  For example, not all usages
>> of such counters need precise counts, and some approximate counts that are
>> updated lazily can be used.
>
> Please make sure to prepare numbers based on the current upstream kernel
> so that we have some grounds to base the discussion on. Ideally post
> them into the email.

Here's a profile on a 5.2-based kernel with some memory tiering
modifications.  It shows that snapshot_refaults() consumes a big chunk of
CPU cycles gathering the refault stats stored in the WORKINGSET_ACTIVATE
field of the root memcg's lruvec.  We have to read #memcg x #ncpu local
counters to get the complete refault snapshot, so the computation scales
poorly as the number of memcgs and CPUs increases.

We'll be recollecting some of the data on the 5.5 kernel.  I will post
the numbers when they become available.

The percentages below are relative to kernel CPU cycles, and kernel time
consumed 31% of CPU cycles overall.  The MySQL workload ran on a 2-socket
system with 24 cores per socket.

    14.22%  mysqld  [kernel.kallsyms]  [k] snapshot_refaults
            |
            ---snapshot_refaults
               do_try_to_free_pages
               try_to_free_pages
               __alloc_pages_slowpath
               __alloc_pages_nodemask
               |
               |--14.07%--alloc_pages_vma
               |          |
               |           --14.06%--__handle_mm_fault
               |                     handle_mm_fault
               |                     |
               |                     |--12.57%--__get_user_pages
               |                     |          get_user_pages_unlocked
               |                     |          get_user_pages_fast
               |                     |          iov_iter_get_pages
               |                     |          do_blockdev_direct_IO
               |                     |          ext4_direct_IO
               |                     |          generic_file_read_iter
               |                     |          |
               |                     |          |--12.16%--new_sync_read
               |                     |          |          vfs_read
               |                     |          |          ksys_pread64
               |                     |          |          do_syscall_64
               |                     |          |          entry_SYSCALL_64_after_hwframe

Thanks.

Tim
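
To make the "approximate counts that are updated lazily" idea from the
quoted proposal concrete, here is a minimal user-space sketch.  It is not
kernel code and not a proposed patch: struct lazy_counter, lazy_add(),
lazy_read_exact(), lazy_read_approx() and the BATCH value are all
hypothetical names chosen for illustration, and the sketch is
single-threaded (a real version would need an atomic or locked shared
total).  The kernel's existing percpu_counter follows a similar pattern,
with percpu_counter_read() as the cheap approximate read and
percpu_counter_sum() as the exact one.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 48 /* 2 sockets x 24 cores, matching the test system */
#define BATCH   32 /* hypothetical bound on each CPU's unfolded delta */

/* Hypothetical stand-in for one per-memcg statistic; not a kernel type. */
struct lazy_counter {
	long total;          /* shared value, may lag behind the truth */
	long local[NR_CPUS]; /* per-CPU deltas not yet folded into total */
};

/*
 * Update path: touch only this CPU's slot and fold it into the shared
 * total once the pending delta reaches BATCH in magnitude.
 */
static void lazy_add(struct lazy_counter *c, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	c->local[cpu] += delta;
	if (labs(c->local[cpu]) >= BATCH) {
		c->total += c->local[cpu];
		c->local[cpu] = 0;
	}
}

/* Exact read: walks every CPU slot, so the cost is O(NR_CPUS) per counter. */
static long lazy_read_exact(const struct lazy_counter *c)
{
	long sum = c->total;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}

/*
 * Approximate read: O(1).  Each slot holds less than BATCH in magnitude
 * after any update, so the result is off by less than NR_CPUS * BATCH.
 */
static long lazy_read_approx(const struct lazy_counter *c)
{
	return c->total;
}
```

A reader that can tolerate the bounded error would call
lazy_read_approx() once per memcg instead of summing #ncpu slots, which
removes the #ncpu factor from the #memcg x #ncpu cost described above.
Whether the refault snapshot can tolerate that error is exactly the open
question for discussion.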