Date: Mon, 8 Aug 2022 16:05:44 -0700
From: Dan Williams
To: Srinivas Aji, David Hildenbrand
CC: Linux MM, Dan Williams, Vivek Goyal, David Woodhouse, "Gowans, James", Yue Li, Beau Beauchamp
Subject: Re: [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM
Message-ID: <62f196c86bec5_1b3c2945d@dwillia2-xfh.jf.intel.com.notmuch>
References: <922eda33-be7b-f413-6285-33ed0ea0f09e@redhat.com>

Srinivas Aji wrote:
> On Fri, Aug 05, 2022 at 02:46:26PM +0200, David Hildenbrand wrote:
> > Can you explain how "zero copy snapshots of processes" would work, both
> >
> > a) From a user space POV
> > b) From a kernel-internal POV
> >
> > Especially, what I get is that you have a filesystem on that memory
> > region, and all memory that is not used for filesystem blocks can be
> > used as ordinary system RAM (a little like shmem, but restricted to dax
> > memory regions?).
> >
> > But how does this interact with zero-copy snapshots?
> >
> > I feel like I am missing one piece where we really need system RAM as
> > part of the bigger picture. Hopefully it's not some hack that converts
> > system RAM to file system blocks :)
>
> My proposal probably falls into this category. The idea is that if we
> have the persistent filesystem in the same space as system RAM, we
> could make most of the process pages part of a snapshot file by
> holding references to these pages and making the pages
> copy-on-write for the process, in about the same way a forked child
> would. (I still don't have this piece fully worked out. Maybe there
> are reasons why this won't work or will make something else difficult,
> and that is why you are advising against it.)

If I understand the proposal correctly, I think you eventually run into
situations similar to what killed RDMA+FSDAX support. The filesystem
needs to be the ultimate arbiter of the physical address space, and this
solution seems to want to put part of that control in an agent outside
of the filesystem.
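
For reference, the fork analogy in the quoted text can be observed from
userspace. The following is a minimal sketch, not part of the proposed
patches: it assumes a POSIX system with fork(), and the helper name
child_view_after_parent_write is made up for illustration. The child
keeps seeing the pages as they were at fork time even after the parent
overwrites them, which is the copy-on-write behavior a snapshot would
rely on.

```python
import os


def child_view_after_parent_write(state: bytearray, new_value: bytes) -> bytes:
    """Fork, overwrite `state` in the parent, and return the child's view.

    Hypothetical helper for illustration only. `new_value` is written over
    the start of `state`; `state` is assumed small enough to fit in a pipe.
    """
    result_r, result_w = os.pipe()   # child -> parent: the child's view
    go_r, go_w = os.pipe()           # parent -> child: "I have written"

    pid = os.fork()
    if pid == 0:
        # Child: shares the parent's pages copy-on-write from this point on.
        try:
            os.close(result_r)
            os.close(go_w)
            os.read(go_r, 1)                   # wait for the parent's write
            os.write(result_w, bytes(state))   # report what the child sees
        finally:
            os._exit(0)

    # Parent: modify the buffer, then let the child report its view.
    os.close(result_w)
    os.close(go_r)
    state[:len(new_value)] = new_value
    os.write(go_w, b"x")
    os.close(go_w)

    chunks = []
    while True:
        chunk = os.read(result_r, 65536)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(result_r)
    os.waitpid(pid, 0)
    return b"".join(chunks)


state = bytearray(b"before")
snapshot = child_view_after_parent_write(state, b"AFTER!")
print(snapshot, bytes(state))
```

Here the child plays the role of the snapshot holder: the parent's write
lands in a private copy, so the pages referenced at fork time stay
unchanged.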

> Regarding the userspace and kernel POV:
>
> The userspace operation would be that the process tries to save or
> restore its pages using vmsplice(). In the kernel, this would be
> implemented using a filesystem which shares pages with system RAM and
> uses a zero-copy COW mechanism for those process pages which can be
> shared with the filesystem.
>
> I had earlier been thinking of having a different interface to the
> kernel, which creates a file with only those memory pages which can be
> saved using COW and also indicates to the caller which pages have
> actually been saved. But having a vmsplice implementation which does
> COW as far as possible keeps the userspace process indicating the
> desired function (saving or restoring memory pages) and the kernel
> implementation handling the zero copy as an optimization where
> possible.

While my initial reaction on hearing about this proposal back at LSF was
that it sounded like an extension to FSDAX semantics, now I am not so
sure. This requirement you state, "...we have to get the blocks through
the memory allocation API, at an offset not under our control", makes me
feel like this is a new memory management facility where the application
thinks it is getting page allocations serviced via the typical
malloc+mempolicy APIs, but another agent is positioned to trap and
service those requests.

Correct me if I am wrong, but is the end goal similar to what an
application in a VM experiences when that VM's memory is backed by a
file mapping on the VMM side? I.e. the application is accessing a
virtual NUMA node, but the faults into physical address space are
trapped and serviced by the VMM. If that is the case then the solution
starts to look more like NUMA "namespacing" than a block-device + file
interface.

In other words, a rough (I mean rough) strawman like:

    numactlX --remap=3,/dev/dax0.0 --membind=3 $application

Where memory allocation and refault requests can be trapped by that
modified numactl.
As far as the application is concerned, its memory policy is set to
allocate from NUMA node 3, and those page allocation requests are
routed to numactlX via userfaultfd-like mechanics to map pages out of
/dev/dax0.0 (or any other file for that matter). Snapshotting would be
achieved by telling numactlX to CoW all of the pages that it currently
has mapped while the snapshot is taken.
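
To make the strawman a little more concrete, here is a toy userspace
model of the agent's bookkeeping. Everything in it is an assumption made
for illustration: numactlX does not exist, the ToyPager class and its
methods are invented names, an ordinary temporary file stands in for
/dev/dax0.0, and plain method calls stand in for the userfaultfd-like
trapping. It only shows the two behaviors described above: pages are
filled from the backing file on first touch, and a snapshot shares the
currently mapped pages until a later write forces a copy.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumed page size for this model


class ToyPager:
    """Toy stand-in for the hypothetical numactlX agent (not a real tool)."""

    def __init__(self, backing_path: str):
        self.backing = open(backing_path, "rb")
        self.pages = {}       # page number -> bytearray currently "mapped"
        self.frozen = set()   # page numbers still shared with a snapshot
        self.faults = 0       # first-touch faults serviced so far

    def _fault_in(self, pno: int) -> None:
        # Service a "fault" by filling the page from the backing file.
        self.backing.seek(pno * PAGE_SIZE)
        data = self.backing.read(PAGE_SIZE)
        self.pages[pno] = bytearray(data.ljust(PAGE_SIZE, b"\0"))
        self.faults += 1

    def read(self, pno: int) -> bytes:
        if pno not in self.pages:
            self._fault_in(pno)
        return bytes(self.pages[pno])

    def write(self, pno: int, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > PAGE_SIZE:
            raise ValueError("write must stay within one page")
        if pno not in self.pages:
            self._fault_in(pno)
        if pno in self.frozen:
            # Page is shared with a snapshot: copy before writing (CoW).
            self.pages[pno] = bytearray(self.pages[pno])
            self.frozen.discard(pno)
        self.pages[pno][offset:offset + len(data)] = data

    def snapshot(self) -> dict:
        # Share the mapped pages without copying; mark them all CoW.
        self.frozen = set(self.pages)
        return dict(self.pages)

    def close(self) -> None:
        self.backing.close()


# Usage: two pages in a temporary file standing in for the DAX device.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * PAGE_SIZE + b"B" * PAGE_SIZE)
    path = f.name

pager = ToyPager(path)
pager.read(0)
pager.read(1)
snap = pager.snapshot()
pager.write(1, 0, b"new")   # copies page 1; the snapshot keeps the old one
pager.close()
os.unlink(path)
```

After the write, the snapshot still holds the original contents of page
1, while page 0 remains a single object shared between the snapshot and
the live mapping. A real implementation would have to get the same
effect from page-table state rather than from Python object references.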