From mboxrd@z Thu Jan 1 00:00:00 1970
From: Zi Yan
To: Wei Xu
Cc: lsf-pc@lists.linux-foundation.org, Linux MM, Dan Williams,
 Dave Hansen, Tim Chen, David Rientjes, Greg Thelen, Paul Turner,
 Shakeel Butt
Subject: Re: [LSF/MM/BPF TOPIC] Userspace managed memory tiering
Date: Fri, 18 Jun 2021 15:13:09 -0400
X-Mailer: MailMate (1.14r5812)
Message-ID: <15B5F859-3F7C-4B27-9528-D42D478B48E6@nvidia.com>
Content-Type: multipart/signed;
 boundary="=_MailMate_ABB6DAE9-D7DB-4C3B-8ADB-F2ADBC087536_=";
 micalg=pgp-sha512; protocol="application/pgp-signature"
MIME-Version: 1.0

--=_MailMate_ABB6DAE9-D7DB-4C3B-8ADB-F2ADBC087536_=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 18 Jun 2021, at 13:50, Wei Xu wrote:

> In this proposal, I'd like to discuss userspace-managed memory tiering
> and the kernel support that it needs.
>
> New memory technologies and interconnect standard make it possible to
> have memory with different performance and cost on the same machine
> (e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
> We can expect heterogeneous memory systems that have performance
> implications far beyond classical NUMA to become increasingly common
> in the future. One of important use cases of such tiered memory
> systems is to improve the data center and cloud efficiency with
> better performance/TCO.
>
> Because different classes of applications (e.g.
> latency sensitive vs latency tolerant, high priority vs low priority)
> have different requirements, richer and more flexible memory tiering
> policies will be needed to achieve the desired performance target on a
> tiered memory system, which would be more effectively managed by a
> userspace agent, not by the kernel. Moreover, we (Google) are
> explicitly trying to avoid adding a ton of heuristics to enlighten the
> kernel about the policy that we want on multi-tenant machines when the
> userspace offers more flexibility.
>
> To manage memory tiering in userspace, we need the kernel support in
> the three key areas:
>
> - resource abstraction and control of tiered memory;
> - API to monitor page accesses for making memory tiering decisions;
> - API to migrate pages (demotion/promotion).
>
> Userspace memory tiering can work on just NUMA memory nodes, provided
> that memory resources from different tiers are abstracted into
> separate NUMA nodes. The userspace agent can create a tiering
> topology among these nodes based on their distances.
>
> An explicit memory tiering abstraction in the kernel is preferred,
> though, because it can not only allow the kernel to react in cases
> where it is challenging for userspace (e.g. reclaim-based demotion
> when the system is under DRAM pressure due to usage surge), but also
> enable tiering controls such as per-cgroup memory tier limits.
> This requirement is mostly aligned with the existing proposals [1]
> and [2].
>
> The userspace agent manages all migratable user memory on the system
> and this can be transparent from the point of view of applications.
> To demote cold pages and promote hot pages, the userspace agent needs
> page access information. Because it is a system-wide tiering for user
> memory, the access information for both mapped and unmapped user pages
> is needed, and so are the physical page addresses. A combination of
> page table accessed-bit scanning and struct page scanning should be
> needed.
> Such page access monitoring should be efficient as well
> because the scans can be frequent. To return the page-level access
> information to the userspace, one proposal is to use tracepoint
> events. The userspace agent can then use BPF programs to collect such
> data and also apply customized filters when necessary.
>
> The userspace agent can also make use of hardware PMU events, for
> which the existing kernel support should be sufficient.

I agree that a userspace agent would be more flexible in implementing
different page migration policies, provided the OS offers interfaces
for that, as IRIX once did [1].

> The third area is the API support for migrating pages. The existing
> move_pages() syscall can be a candidate, though it is virtual-address
> based and cannot migrate unmapped pages. Is a physical-address based
> variant (e.g. move_pfns()), an acceptable proposal?

A PFN itself cannot be moved, right? I guess you mean moving the data
from one page to another, with the source page selected by the given
PFN. What are the potential use cases for moving unmapped pages? Moving
unmapped page cache pages?

Besides the above, I am also interested in using a DMA engine or
another HW-provided data copy engine for page migration instead of the
CPUs [2], and in migrating pages asynchronously, since both could save
CPU resources when page migration between nodes becomes more frequent.
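
To make the PFN question above concrete, here is a toy Python model
(not kernel code) of what a physical-address based migration step
might do for one page. Everything in it is an assumption made for
illustration: Frame, ToyMemory and move_pfn() are hypothetical names,
and move_pfns() is only the name floated in the quoted proposal, not
an existing syscall. The sketch shows that the PFN stays put while the
data moves to a newly allocated frame on the target node, and that an
unmapped page (for example page cache) can be handled the same way
because no virtual address is needed to find it.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """One physical page frame in the toy model."""
    node: int            # NUMA node (memory tier) that owns this frame
    data: bytes = b""    # page contents
    in_use: bool = False


class ToyMemory:
    """Toy physical memory: the list index of a frame is its PFN."""

    def __init__(self, frames_per_node):
        # frames_per_node maps node id -> number of frames on that node.
        self.frames = [
            Frame(node)
            for node in sorted(frames_per_node)
            for _ in range(frames_per_node[node])
        ]
        # (pid, virtual address) -> PFN. Unmapped pages have no entry.
        self.page_tables = {}

    def alloc(self, node, data):
        """Take the first free frame on `node`, fill it, return its PFN."""
        for pfn, frame in enumerate(self.frames):
            if frame.node == node and not frame.in_use:
                frame.in_use, frame.data = True, data
                return pfn
        raise MemoryError(f"no free frame on node {node}")

    def move_pfn(self, src_pfn, dst_node):
        """Hypothetical per-page step of a move_pfns()-style call."""
        src = self.frames[src_pfn]
        if not src.in_use:
            raise ValueError(f"PFN {src_pfn} holds no page")
        # 1. Allocate a frame on the target node and copy the data over.
        dst_pfn = self.alloc(dst_node, src.data)
        # 2. Retarget any mappings; an unmapped page simply has none.
        for key, pfn in list(self.page_tables.items()):
            if pfn == src_pfn:
                self.page_tables[key] = dst_pfn
        # 3. Free the source frame. The frame itself never changed node.
        src.in_use, src.data = False, b""
        return dst_pfn


# Node 0 owns PFNs 0-1 (fast tier), node 1 owns PFNs 2-3 (slow tier).
mem = ToyMemory({0: 2, 1: 2})
cache_pfn = mem.alloc(0, b"page cache data")   # unmapped: no page table entry
new_pfn = mem.move_pfn(cache_pfn, dst_node=1)  # demote by physical address
```

In this model the caller names the page by PFN only, which is what
lets it reach pages that move_pages() cannot address. A real interface
would also have to deal with locking, reference counts and pages that
are freed or reused between the scan and the migration, none of which
the sketch attempts.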

[1] https://studies.ac.upc.edu/dso/papers/nikolopoulos00case.pdf
[2] https://lwn.net/Articles/784925/

--
Best Regards,
Yan, Zi

--=_MailMate_ABB6DAE9-D7DB-4C3B-8ADB-F2ADBC087536_=
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQJDBAEBCgAtFiEEh7yFAW3gwjwQ4C9anbJR82th+ooFAmDM8EYPHHppeUBudmlk
aWEuY29tAAoJEJ2yUfNrYfqKy9kQAKL78xYG92E/BiNhWFMyvmiVwJ+3sKkueLtA
FOJMaD5Xfx+z+dAwNAu4DNooD/Xm+taBy9hzpk6J5Mi7ON8ZLjn4v89//7RPYd/F
ieRVI1gu0Su33lgGwqi3o8qzGBsh6GkykaUxff03VKV2QBPvsSdYwHPYYjw0U5vA
GZdz2yMG5EQzmNX9PuWml7yRqfvJgFIuHCDQf3GyTEnE+mEGoY6+0IG2gsipLsRz
FZTaNQ2fgVKAuug7tB+wYooaGWWP+DdHhAtkuVPfC0dsjCtA3jaQmJlDYRDj+EHv
D1aO7Q/p4Vbh/FOnBbYVB4VDRwwQsvaWM33+BuDutgrc+ru40wQpqIQbWPJsgpv3
jI7x74wLAAuxNe8Vn9GCFeLz0t7+yH/Ae1kuhw7L+UhmpcwjAFvKPsCKgt6uFKgF
punm3+Kw3s9HTpRiSY3dodLoWECQAcBDJ26qgemVxqAg6/MTJ1wvL31fMRpMy9Wp
XD/0XujG2WVdw4FUTUXTQM83Q41f/lZOz7I7A1z5nsePNUaneKUENFntc06KimQO
Ha6ugpNdAJto9DktQqkZVnVOuI46kY2YVE6c8jK5/sEqtOXwCc2aSdB+u72h++2C
oRH76+AUA6ILP4lGlJgJIjSiXpdY2fNFkDfipquOBfMW3QN6rx7w/tsdUO86Mw1D
PQmEZwt2
=SD7p
-----END PGP SIGNATURE-----

--=_MailMate_ABB6DAE9-D7DB-4C3B-8ADB-F2ADBC087536_=--