From: Yang Shi
Date: Mon, 21 Jun 2021 11:58:12 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
To: Wei Xu
Cc: lsf-pc@lists.linux-foundation.org, Linux MM, Dan Williams, Dave Hansen, Tim Chen, David Rientjes, Greg Thelen, Paul Turner, Shakeel Butt, ying.huang@intel.com

On Fri, Jun 18, 2021 at 10:50 AM Wei Xu wrote:
>
> In this proposal, I'd like to discuss userspace-managed memory tiering
> and the kernel support that it needs.
>
> New memory technologies and interconnect standard make it possible to
> have memory with different performance and cost on the same machine
> (e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
> We can expect heterogeneous memory systems that have performance
> implications far beyond classical NUMA to become increasingly common
> in the future. One of important use cases of such tiered memory
> systems is to improve the data center and cloud efficiency with
> better performance/TCO.
>
> Because different classes of applications (e.g. latency sensitive vs
> latency tolerant, high priority vs low priority) have different
> requirements, richer and more flexible memory tiering policies will
> be needed to achieve the desired performance target on a tiered
> memory system, which would be more effectively managed by a userspace
> agent, not by the kernel. Moreover, we (Google) are explicitly trying
> to avoid adding a ton of heuristics to enlighten the kernel about the
> policy that we want on multi-tenant machines when the userspace offers
> more flexibility.
>
> To manage memory tiering in userspace, we need the kernel support in
> the three key areas:
>
> - resource abstraction and control of tiered memory;
> - API to monitor page accesses for making memory tiering decisions;
> - API to migrate pages (demotion/promotion).
>
> Userspace memory tiering can work on just NUMA memory nodes, provided
> that memory resources from different tiers are abstracted into
> separate NUMA nodes. The userspace agent can create a tiering
> topology among these nodes based on their distances.
>
> An explicit memory tiering abstraction in the kernel is preferred,
> though, because it can not only allow the kernel to react in cases
> where it is challenging for userspace (e.g. reclaim-based demotion
> when the system is under DRAM pressure due to usage surge), but also
> enable tiering controls such as per-cgroup memory tier limits.
> This requirement is mostly aligned with the existing proposals [1]
> and [2].
>
> The userspace agent manages all migratable user memory on the system
> and this can be transparent from the point of view of applications.
> To demote cold pages and promote hot pages, the userspace agent needs
> page access information. Because it is a system-wide tiering for user
> memory, the access information for both mapped and unmapped user pages
> is needed, and so are the physical page addresses. A combination of
> page table accessed-bit scanning and struct page scanning should be
> needed. Such page access monitoring should be efficient as well
> because the scans can be frequent. To return the page-level access
> information to the userspace, one proposal is to use tracepoint
> events. The userspace agent can then use BPF programs to collect such
> data and also apply customized filters when necessary.

Just FYI, there is an existing project for a userspace daemon. Please
refer to https://github.com/fengguang/memory-optimizer

We (Alibaba, when I was there) did some preliminary tests and benchmarks
with it. The accuracy was pretty good, but the cost was relatively high.
I agree with you that efficiency is the key. BPF may be a good approach
to reduce the cost.

I'm not sure what the current status of this project is. You may reach
out to Huang Ying to get more information.

>
> The userspace agent can also make use of hardware PMU events, for
> which the existing kernel support should be sufficient.
>
> The third area is the API support for migrating pages. The existing
> move_pages() syscall can be a candidate, though it is virtual-address
> based and cannot migrate unmapped pages. Is a physical-address based
> variant (e.g. move_pfns()), an acceptable proposal?
>
> [1] https://lore.kernel.org/lkml/9cd0dcde-f257-1b94-17d0-f2e24a3ce979@intel.com/
> [2] https://lore.kernel.org/patchwork/cover/1408180/
>
> Thanks,
> Wei
>
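As a rough illustration of the "tiering topology among these nodes based
on their distances" idea in the quoted proposal, here is a minimal sketch.
It is not taken from memory-optimizer or from any kernel interface. The
function name demotion_targets, the example distance matrix, and the rule
that CPU-less nodes form the lower tier are all assumptions made for the
example. A real agent could read per-node distances from
/sys/devices/system/node/nodeN/distance instead of hard-coding them.

```python
from typing import Dict, List


def demotion_targets(distance: List[List[int]],
                     cpu_nodes: List[int]) -> Dict[int, int]:
    """Map each CPU-bearing node to its nearest CPU-less node.

    Assumption (not from the thread): nodes with CPUs are the top tier,
    and CPU-less nodes are slower memory that pages can be demoted to.
    """
    cpu_set = set(cpu_nodes)
    lower_tier = [n for n in range(len(distance)) if n not in cpu_set]
    targets: Dict[int, int] = {}
    for src in cpu_nodes:
        if lower_tier:
            # Pick the closest lower-tier node; break ties by node id.
            targets[src] = min(lower_tier,
                               key=lambda dst: (distance[src][dst], dst))
    return targets


# Hypothetical 4-node machine: nodes 0-1 are DRAM with CPUs,
# nodes 2-3 are CPU-less slower memory.
DISTANCE = [
    [10, 20, 17, 28],
    [20, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]

print(demotion_targets(DISTANCE, cpu_nodes=[0, 1]))  # {0: 2, 1: 3}
```

With this made-up matrix, node 0 would demote to node 2 and node 1 to
node 3. Promotion targets could be derived the same way in the opposite
direction.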