From: Wei Xu
Date: Fri, 18 Jun 2021 10:50:29 -0700
Subject: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
To: lsf-pc@lists.linux-foundation.org, Linux MM
Cc: Dan Williams, Dave Hansen, Tim Chen, David Rientjes, Greg Thelen,
    Paul Turner, Shakeel Butt

In this proposal, I'd like to discuss userspace-managed memory tiering
and the kernel support that it needs.

New memory technologies and interconnect standards make it possible to
have memory with different performance and cost on the same machine
(e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
We can expect heterogeneous memory systems that have performance
implications far beyond classical NUMA to become increasingly common in
the future. One of the important use cases of such tiered memory
systems is to improve data center and cloud efficiency with better
performance/TCO.

Because different classes of applications (e.g. latency sensitive vs
latency tolerant, high priority vs low priority) have different
requirements, richer and more flexible memory tiering policies will be
needed to achieve the desired performance target on a tiered memory
system. Such policies would be more effectively managed by a userspace
agent than by the kernel. Moreover, we (Google) are explicitly trying
to avoid adding a ton of heuristics to enlighten the kernel about the
policy that we want on multi-tenant machines when userspace offers more
flexibility.

To manage memory tiering in userspace, we need kernel support in three
key areas:

- resource abstraction and control of tiered memory;
- API to monitor page accesses for making memory tiering decisions;
- API to migrate pages (demotion/promotion).

Userspace memory tiering can work on just NUMA memory nodes, provided
that memory resources from different tiers are abstracted into separate
NUMA nodes. The userspace agent can create a tiering topology among
these nodes based on their distances.

An explicit memory tiering abstraction in the kernel is preferred,
though, because it not only allows the kernel to react in cases that
are challenging for userspace (e.g. reclaim-based demotion when the
system is under DRAM pressure due to a usage surge), but also enables
tiering controls such as per-cgroup memory tier limits. This
requirement is mostly aligned with the existing proposals [1] and [2].

The userspace agent manages all migratable user memory on the system,
and this can be transparent to applications. To demote cold pages and
promote hot pages, the userspace agent needs page access information.
Because the tiering is system-wide for user memory, access information
is needed for both mapped and unmapped user pages, along with their
physical page addresses. A combination of page table accessed-bit
scanning and struct page scanning is likely to be needed. Such page
access monitoring must also be efficient, because the scans can be
frequent.

To return the page-level access information to userspace, one proposal
is to use tracepoint events. The userspace agent can then use BPF
programs to collect such data and apply customized filters when
necessary.

The userspace agent can also make use of hardware PMU events, for which
the existing kernel support should be sufficient.

The third area is the API support for migrating pages. The existing
move_pages() syscall can be a candidate, though it is virtual-address
based and cannot migrate unmapped pages. Is a physical-address based
variant (e.g. move_pfns()) an acceptable proposal?

[1] https://lore.kernel.org/lkml/9cd0dcde-f257-1b94-17d0-f2e24a3ce979@intel.com/
[2] https://lore.kernel.org/patchwork/cover/1408180/

Thanks,
Wei
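
P.S. To make the NUMA-only fallback above more concrete, here is a
rough sketch of how a userspace agent might derive a tiering topology
from node distances. It is only an illustration under stated
assumptions, not part of the proposal: it treats nodes that have CPUs
as the top tier and CPU-less nodes as the lower tier, and it assumes
the sysfs "distance" file lists one value per online node in node-ID
order. The function names read_topology() and demotion_targets() are
hypothetical.

```python
from pathlib import Path

SYSFS_NODE_ROOT = Path("/sys/devices/system/node")


def read_topology(root=SYSFS_NODE_ROOT):
    """Read per-node distances and the set of nodes that have CPUs.

    Returns (distances, cpu_nodes), where distances[a][b] is the NUMA
    distance from node a to node b. Assumes each "distance" file holds
    one value per online node, in ascending node-ID order.
    """
    if not root.is_dir():
        return {}, set()
    ids = sorted(int(p.name[len("node"):]) for p in root.glob("node[0-9]*"))
    distances, cpu_nodes = {}, set()
    for node in ids:
        node_dir = root / f"node{node}"
        row = [int(d) for d in (node_dir / "distance").read_text().split()]
        distances[node] = dict(zip(ids, row))
        # An empty cpulist means the node is memory-only (e.g. PMEM, CXL).
        if (node_dir / "cpulist").read_text().strip():
            cpu_nodes.add(node)
    return distances, cpu_nodes


def demotion_targets(distances, cpu_nodes):
    """Map each top-tier (CPU) node to its nearest CPU-less node.

    Assumption for this sketch: two tiers only. Ties are broken by the
    lower node ID. Returns an empty dict if there is no lower tier.
    """
    lower_tier = [n for n in distances if n not in cpu_nodes]
    targets = {}
    for node in sorted(cpu_nodes):
        if lower_tier:
            targets[node] = min(lower_tier,
                                key=lambda m: (distances[node][m], m))
    return targets


if __name__ == "__main__":
    dist, cpus = read_topology()
    for src, dst in demotion_targets(dist, cpus).items():
        print(f"node{src} -> node{dst}")
```

A policy like this lives entirely in the agent, so it can be swapped
for a per-job or per-cgroup variant without kernel changes. What it
cannot do is react to sudden DRAM pressure, which is one reason an
explicit in-kernel tiering abstraction is still preferred.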