From: "Teterevkov, Ivan"
To: Alistair Popple
CC: "linux-mm@kvack.org", "jhubbard@nvidia.com", "jack@suse.cz", "rppt@linux.ibm.com", "jglisse@redhat.com", "ira.weiny@intel.com", "linux-kernel@vger.kernel.org"
Subject: RE: find_get_page() VS pin_user_pages()
Date: Wed, 12 Apr 2023 09:04:33 +0000
References: <87mt3ehti4.fsf@nvidia.com>
In-Reply-To: <87mt3ehti4.fsf@nvidia.com>

From: Alistair Popple

> "Teterevkov, Ivan" writes:
>
> > Hello folks,
> >
> > I work with an application which aims to share memory in the userspace and
> > interact with the NIC DMA. The memory allocation workflow begins in the
> > userspace, which creates a new file backed by 2MiB hugepages with
> > memfd_create(MFD_HUGETLB, MFD_HUGE_2MB) and fallocate(). Then the userspace
> > makes an IOCTL to the kernel module with the file descriptor and size so that
> > the kernel module can get the struct page with find_get_page(). Then the kernel
> > module calls dma_map_single(page_address(page)) for NIC, which concludes the
> > datapath. The allocated memory may (significantly) outlive the originating
> > userspace application. The hugepages stay mapped with NIC, and the kernel
> > module wants to continue using them and map to other applications that come and
> > go with vm_mmap().
> >
> > I am studying the pin_user_pages*() family of functions, and I wonder if the
> > outlined workflow requires it. The hugepages do not page out, but they can move
> > as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
> > must increment the page reference counter without mapping and prevent it from
> > moving.
> > In particular, https://docs.kernel.org/mm/page_migration.html:
>
> I'm not super familiar with the memfd_create()/find_get_page() workflow
> but is there some reason you're not using pin_user_pages*(FOLL_LONGTERM)
> to get the struct page initially? Your description above sounds
> exactly the use case pin_user_pages() was designed for because it marks
> the page as being written to by DMA, makes sure it's not in a movable
> zone, etc.
>

The biggest obstacle with the application workflow is that the memory
allocation is mostly kernel-driven. The kernel module may want to tell DMA
about the hugepages before the userspace application maps them into its
address space, so the kernel module does not have the starting user address
at hand. I believe one kernel-side workaround would be to vm_mmap(),
pin_user_pages(FOLL_LONGTERM) and possibly vm_munmap() shortly after if we do
not want to keep them mapped in the originating application. This would have
a side effect, but the pinning would stay in place until the kernel module
unpins the pages with unpin_user_page().

The pin_user_pages*() operating on behalf of the userspace application made me
think that the pinning was not designed to outlive the application, but
perhaps that is what FOLL_LONGTERM is for in comparison with FOLL_PIN?

> >> How migrate_pages() works
> >> ...
> >> Steps:
> >> ...
> >> 4. All the page table references to the page are converted to migration
> >>    entries. This decreases the mapcount of a page. If the resulting mapcount
> >>    is not zero then we do not migrate the page.
> >
> > Does find_get_page() achieve that condition or does the outlined workflow
> > still require pin_user_pages*() for safe DMA?
>
> Yes. The extra page reference will prevent the migration regardless of
> mapcount being zero or not. See folio_expected_refs() for how the extra
> reference is detected.
>

Thank you for pointing out folio_expected_refs().
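To check my understanding, the comparison can be modelled in userspace roughly
as below. This is only a simplified sketch, loosely following what I read in
folio_expected_refs() in mm/migrate.c around v6.3. The struct toy_folio type
and the toy_*() helpers are hypothetical names for illustration, not kernel
APIs, and as far as I can tell hugetlb folios go through a separate helper in
the kernel that this model does not cover.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct folio (not the kernel type). */
struct toy_folio {
	int refcount;      /* all references currently held on the folio */
	int nr_pages;      /* number of base pages in the folio */
	bool has_mapping;  /* folio is in a page cache (file-backed) */
	bool has_private;  /* filesystem private data is attached */
};

/*
 * References the migration code expects to see once the page table
 * mappings have been replaced by migration entries: one held by the
 * migration path itself, plus the page cache references, plus one for
 * private data. Assumption: this mirrors the v6.3-era logic only loosely.
 */
static int toy_expected_refs(const struct toy_folio *f)
{
	int refs = 1;

	if (!f->has_mapping)
		return refs;

	refs += f->nr_pages;
	if (f->has_private)
		refs++;

	return refs;
}

/* Any reference beyond the expected ones makes the migration back off. */
static bool toy_can_migrate(const struct toy_folio *f)
{
	return f->refcount == toy_expected_refs(f);
}

/* Small self-check: one page cache page, then the same page with an extra ref. */
static void toy_self_check(void)
{
	struct toy_folio f = {
		.refcount = 2, .nr_pages = 1,
		.has_mapping = true, .has_private = false,
	};

	assert(toy_can_migrate(&f));   /* only the expected references */

	f.refcount++;                  /* e.g. a reference from find_get_page() */
	assert(!toy_can_migrate(&f));  /* the extra reference blocks migration */
}
```

Calling toy_self_check() from a test main() exercises both cases: the folio
with only the expected references is migratable, and a single additional
reference, such as the one my module takes, is enough to block it.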
I see that as soon as the reference counter exceeds the number returned by
folio_expected_refs(), the page can no longer be migrated, so it is
effectively pinned. However, holding such a reference reduces the mobility of
pages that come from ZONE_MOVABLE, which makes pin_user_pages*() preferable.

Thanks,
Ivan