From: Shakeel Butt
Date: Fri, 18 Jun 2021 16:59:37 -0700
Subject: Re: [LSF/MM TOPIC] Tiered memory accounting and management
To: Tim Chen
Cc: Yang Shi, lsf-pc@lists.linux-foundation.org, Linux MM, Michal Hocko,
    Dan Williams, Dave Hansen, David Rientjes, Wei Xu, Greg Thelen
References: <475cbc62-a430-2c60-34cc-72ea8baebf2c@linux.intel.com>
    <82ffac56-e3fb-2d2d-1601-64130310bfc1@linux.intel.com>
In-Reply-To: <82ffac56-e3fb-2d2d-1601-64130310bfc1@linux.intel.com>

On Fri, Jun 18, 2021 at 3:11 PM Tim Chen wrote:
>
> On 6/17/21 11:48 AM, Shakeel Butt wrote:
[...]
> >
> > At the moment "personally" I am more inclined towards a passive
> > approach towards the memcg accounting of memory tiers. By that I mean,
> > let's start by providing a 'usage' interface and get more
> > production/real-world data to motivate the 'limit' interfaces. (One
> > minor reason is that defining the 'limit' interface will force us to
> > make the decision on defining tiers i.e. numa or a set of numa or
> > others).
>
> Probably we could first start with accounting the memory used in each
> NUMA node for a cgroup and exposing this information to user space.
> I think that is useful regardless.
>

Is memory.numa_stat not good enough? This interface does miss
__GFP_ACCOUNT non-slab allocations, percpu and sock.

> There is still a question of whether we want to define a set of
> numa node or tier and extend the accounting and management at that
> memory tier abstraction level.
>
[...]
> >
> > To give a more concrete example: Let's say we have a system with two
> > memory tiers and multiple low and high priority jobs.
> > For high priority jobs, set the allocation try list from high to low
> > tier and for low priority jobs the reverse of that (I am not sure if
> > we can do that out of the box with today's kernel). In the background
> > we migrate cold memory down the tiers and hot memory in the reverse
> > direction.
> >
> > In this background mechanism we can enforce all different limiting
> > policies like Yang's original high and low tier percentage or
> > something like X% of accesses of high priority jobs should be from
> > high tier.
>
> If I understand what you are saying is you desire the kernel to provide
> the interface to expose performance information like
> "X% of accesses of high priority jobs is from high tier",

I think we can estimate "X% of accesses to high tier" using existing
perf/PMU counters. So, no new interface.

> and knobs for user space to tell kernel to re-balance pages on
> a per job class (or cgroup) basis based on this information.
> The page re-balancing will be initiated by user space rather than
> by the kernel, similar to what Wei proposed.

This is more open to discussion and we should brainstorm the pros and
cons of all proposed approaches.
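
To make the passive 'usage' idea concrete, here is a minimal user-space
sketch of what per-tier accounting on top of memory.numa_stat could look
like. It assumes the cgroup v2 file format ("<stat> N0=<bytes> N1=<bytes>
..."). The node-to-tier mapping, the sample numbers and the function names
are hypothetical and only for illustration; nothing in the thread defines
tiers this way. The result also inherits the gaps mentioned above
(__GFP_ACCOUNT non-slab allocations, percpu and sock are not covered).

```python
from collections import defaultdict


def parse_numa_stat(text):
    """Parse cgroup v2 memory.numa_stat text into {stat: {node_id: bytes}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        per_node = {}
        for item in fields[1:]:
            node, _, value = item.partition("=")
            # Keep only per-node entries of the form N<id>=<bytes>.
            if node.startswith("N") and node[1:].isdigit():
                per_node[int(node[1:])] = int(value)
        stats[fields[0]] = per_node
    return stats


def usage_by_tier(stats, node_to_tier, names=("anon", "file")):
    """Sum the selected stats per tier, using a caller-supplied node->tier map."""
    totals = defaultdict(int)
    for name in names:
        for node, nbytes in stats.get(name, {}).items():
            totals[node_to_tier.get(node, "unknown")] += nbytes
    return dict(totals)


# Hypothetical sample content and tier mapping (node 0 = high, node 1 = low).
SAMPLE = "anon N0=1048576 N1=3145728\nfile N0=2097152 N1=1048576\n"
NODE_TO_TIER = {0: "high", 1: "low"}

if __name__ == "__main__":
    print(usage_by_tier(parse_numa_stat(SAMPLE), NODE_TO_TIER))
```

On a real system the text would come from the cgroup's memory.numa_stat
file instead of SAMPLE. Keeping the node-to-tier map in user space means
the kernel does not have to decide what a tier is, which is the decision
the 'limit' interface would force.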