From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7A608C61DA4 for ; Thu, 23 Feb 2023 01:54:04 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D13F76B0074; Wed, 22 Feb 2023 20:54:03 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id CC41C6B0075; Wed, 22 Feb 2023 20:54:03 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B3D076B0078; Wed, 22 Feb 2023 20:54:03 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id A2D686B0074 for ; Wed, 22 Feb 2023 20:54:03 -0500 (EST) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 6AE4140A4D for ; Thu, 23 Feb 2023 01:54:03 +0000 (UTC) X-FDA: 80496885966.13.715DE73 Received: from NAM10-DM6-obe.outbound.protection.outlook.com (mail-dm6nam10on2076.outbound.protection.outlook.com [40.107.93.76]) by imf02.hostedemail.com (Postfix) with ESMTP id 9636C8000D for ; Thu, 23 Feb 2023 01:54:00 +0000 (UTC) Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=Nvidia.com header.s=selector2 header.b=bOpQo3Od; arc=pass ("microsoft.com:s=arcselector9901:i=1"); spf=pass (imf02.hostedemail.com: domain of jgg@nvidia.com designates 40.107.93.76 as permitted sender) smtp.mailfrom=jgg@nvidia.com; dmarc=pass (policy=reject) header.from=nvidia.com ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1677117240; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=5VSvVJDLYm3I4TVhgPviimt37DwUtMl1cULco1UlQLk=; b=c81+I83rS6UEKGC1k4ifN6aTCwF2URPXjiuK79j/clPTC4PI6kB4waBohDGOnH7xBwKwn8 Woqe9lBjsLdPWxqq71gCjvDwaDBTI99+OuUg7ZmyglC+JQPqsreuIwFZKPRQcKcGEt4xDO lSYP5qDTF2a9edXlIGy1KMSJv8Px/ZU= ARC-Authentication-Results: i=2; imf02.hostedemail.com; dkim=pass header.d=Nvidia.com header.s=selector2 header.b=bOpQo3Od; arc=pass ("microsoft.com:s=arcselector9901:i=1"); spf=pass (imf02.hostedemail.com: domain of jgg@nvidia.com designates 40.107.93.76 as permitted sender) smtp.mailfrom=jgg@nvidia.com; dmarc=pass (policy=reject) header.from=nvidia.com ARC-Seal: i=2; s=arc-20220608; d=hostedemail.com; t=1677117240; a=rsa-sha256; cv=pass; b=sLgpeUQyESlOcrUhFbLCEPEYT/Q3M5SNDP71K6Ykhs91ALUMcGyA6b7Nv0ywuf0AQ6/UYO n/h3R+EcZtAaiC1HYhPugOWnzMODPxky8IwIXxc7/0+wMPs9AYjv50vasUON0nYPlIHikQ D9pfTYlz0rO02fcQB2CfXSpwAwT0plY= ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=XYLK4W+k3lHg0TVv/f6uc8XzEaxeGpoxrHN3l9nFa15OEM3gOSaWmNbTF0wpoVTMDsF+qXazy67u9xN3VxA/0neR8/gG8wNP775ioJMv9PdNcoIDbm67aD17DoXpKLOXOJioYIaplRmyUzDnpKi6r432kHazqoGdDbYxXOjTbkmVknbhF8IlIJiF6+q2fJ+mSE8UqmG/BpEmrM+w5EHt+HM0eUUwkVUuJ+OkKrRS7WMLa48ZyQsM7izsWxpkPrb1M2xBv2VBcFtgWLDm0ImDstVpwYFTqvFvVf7PU1k2I4xZvz+h1z71qFc+1yMwtYYQysmznio6HpFhEd6bGGd6ww== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=5VSvVJDLYm3I4TVhgPviimt37DwUtMl1cULco1UlQLk=; b=FrCkk0UoO69sbLjPQgOS6F3by+QSH66Y4wtM/QGfPNisozk15VZCfpKibSd8SDA5XoglcBOXoBw1qI7fWT/FXcYSYuXVS4II+RA0KzkDjkFJFWpALunBe0mWU5JKfRv9ibymqcSyvdHZEPGowbmyUlY9rcGWeRRmxKDyD5xXu6a+PXAOe5hr2zd88Xb3ritlKasVDFTmp8sdnJEXo9StQ7hGVp4uCcyJqF613ibJlq6K6bWbuMOInDHSjGYrw/ve7NsvXt2+gMOblcGeRV5i6qnQWXM5/bXhylR7PUL1mympSvT0idftF+Y3Q843O2dF6t1AlvDLswgxE8VqCGIi2w== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=nvidia.com; dmarc=pass action=none header.from=nvidia.com; dkim=pass header.d=nvidia.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=5VSvVJDLYm3I4TVhgPviimt37DwUtMl1cULco1UlQLk=; b=bOpQo3Od3xaxzT3jCmjSOW/RN+6DFEuTM3Ci9nqjb1X/7uqP5NZ5Hk3wJqWoFefnFXgzPl1VaK5ywKXgi74Org5nDJctbCjKAT2ZEMV0ohWoPYHSHP2HA76wKJevWlrbwLprTkoX/4vkukRnGfmq1Z0cOH+ciYdwNSweQ1Fb9YZfsYiAgug4cocqZESYjt6pRzaid3w6/ETaSg8vWeMFW9HNf5MuXSLVQ9PZCHa4a3rG26s1QNlbbbr7n2cRvKQlP8azVXzJatOJqO4Uh3Wg6+CP4zCncGi4svAWH7Uo5XZULbu3c/SelWXoitcsLYaww056uk170umjF9b9cM+1lQ== Received: from LV2PR12MB5869.namprd12.prod.outlook.com (2603:10b6:408:176::16) by CYYPR12MB8730.namprd12.prod.outlook.com (2603:10b6:930:c1::15) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6111.21; Thu, 23 Feb 2023 01:53:58 +0000 Received: from LV2PR12MB5869.namprd12.prod.outlook.com ([fe80::3cb3:2fce:5c8f:82ee]) by LV2PR12MB5869.namprd12.prod.outlook.com ([fe80::3cb3:2fce:5c8f:82ee%4]) with mapi id 15.20.6111.021; Thu, 23 Feb 2023 01:53:58 +0000 Date: Wed, 22 Feb 2023 21:53:56 -0400 From: Jason Gunthorpe To: Alistair Popple Cc: Tejun Heo , Michal Hocko , Yosry Ahmed , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, jhubbard@nvidia.com, tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com, mkoutny@suse.com, daniel@ffwll.ch, "Daniel P . Berrange" , Alex Williamson , Zefan Li , Andrew Morton Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory Message-ID: References: <87o7pmnd0p.fsf@nvidia.com> <87k009nvnr.fsf@nvidia.com> Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87k009nvnr.fsf@nvidia.com> X-ClientProxiedBy: BL1P223CA0007.NAMP223.PROD.OUTLOOK.COM (2603:10b6:208:2c4::12) To LV2PR12MB5869.namprd12.prod.outlook.com (2603:10b6:408:176::16) MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: LV2PR12MB5869:EE_|CYYPR12MB8730:EE_ X-MS-Office365-Filtering-Correlation-Id: b63d4094-47df-4f5f-17ca-08db1540d3c3 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: YVRVIMlPaItfaJPWiKT3CncKvsYqLixUAZ/oq1zhiEsD1VsZtMB6YeNMocvr8Cppf0cDeTRW+NXL5xh4fRrCHW2VuafoxsX8meZ2Wjwg9ynw6/OhYURz2kK6CA4Fv8/s5RwU0x2uI9Hpq9ZhDvO6+Oa5KoUkjShZLMmcSM4zAjy+B45Hx/BbxPkfE9fSPwnBksW0ce0ywMIfsOjILp3G0pAGuDWGi3gUP7B7K0mMFFzJpIMlUVpIlonCiBpiXeb344warp8dYNV4weCv1LHybcVUQ5/3WetRI1G0QJdQ1DZcw3KwhlvQityJCxtf3bf8Go56lZGQZzOtXLo88YiJdKKjAZ7335hXdfwXUlg2o7fIl7CnEa08LbmdMbcEfzupv0G5IYyn9vjyvOZJWatwjouWOyUfglSea8Twl04aQz+Z40+DsQk3r2MdDkIcZ/16W1VaTrFbtywdsChIDAczT0oWPmz6m6WGBZ68C6bgeNw8KRwcBEsjkEfHu/Vs1eVreLzr4E6pSE78qEbTGMaSWON++4/oEqBqK3qa9k+DvKOApcBndGLbcKvrZg9K4wuczpJcZz+bH8tXTgV6Z3NtB8oJiyw3bf8ZaNPu5ywQHfOY8KhaX2z2JjTxYka25kGbENoc8u/07AXQLq96wZRdcg== X-Forefront-Antispam-Report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:LV2PR12MB5869.namprd12.prod.outlook.com;PTR:;CAT:NONE;SFS:(13230025)(4636009)(346002)(366004)(39860400002)(396003)(136003)(376002)(451199018)(36756003)(66946007)(6636002)(54906003)(83380400001)(316002)(37006003)(6506007)(478600001)(4326008)(186003)(41300700001)(26005)(2616005)(6512007)(5660300002)(86362001)(8936002)(2906002)(7416002)(6862004)(66556008)(66476007)(6486002)(8676002)(38100700002);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: =?us-ascii?Q?58vWHRxx/J5K5EnV6Au/F9xn9ZHc3dy7B/ZEw6UaWYjdkd9IUNuai0qvkcOC?= =?us-ascii?Q?+BOfJWRryF1Iw16MEbQX6mjP65cep51p8zsWL8TMuvRahXt8h/vT7GLK/a+K?= =?us-ascii?Q?ew/xWb2W8JNnt8B62m+9baa74P90wlZe3F/c27IjteELTtjW/zmcV9JEDqjq?= =?us-ascii?Q?Gr3sTYpzMZLXpUCgo1VxIdYzOHx2M2Zrb3XtS+DK6M92q2VdP7sJCY0xJwF2?= =?us-ascii?Q?hh84Fg9mqdsKurwT6SMG1Ww8Uwvmu7vUQI2L6XT3k7uRx/0z5XxyBXv/sk/W?= =?us-ascii?Q?qT7BxsE4SHThQZZu7/oGRjiV8kXrrFUmOWpNJQvv8wzAnLrSTTAiB9yrSqM9?= =?us-ascii?Q?UhQjUbC9svYt08nJ8H2iN2abvPFdJyEhnXwm02VrOVfyqqUc0aT+4UeppGMn?= =?us-ascii?Q?w1dMsAUwRPLjZuAoNxKaMmg5ITHd8LWUWAsMdwHmEyKLSnho0f+h4wCvZ8Du?= =?us-ascii?Q?rV7nLsGXMIvQI6srzoObmA0P3S0FcD57I2J6IRXlMuju7y/JDzj5q/D8W/Ep?= =?us-ascii?Q?7SAkz8/CN4TjSU2iCEjPshyZhho/Q4IDD9aCH8qqa3k/HrXIic6K4iYzl0Jq?= =?us-ascii?Q?KV5k5nrUf6ynZfgk5IK9ZFVGKNtx+O7Uw6n7XCIZYUXBHbAuRenJJVEr/s6x?= =?us-ascii?Q?X7atA4ycY4qnF49GQ7bcfMe6W7w+cuQvQ8M7yFK6UZXPuLAFVu4ERJGFrfqq?= =?us-ascii?Q?WioY4IYUQlOQxFldndei/QxdPC3Ry8jRH/7d/y/ObjKRJmA7a4fb1jPrMegt?= =?us-ascii?Q?IcfglwFO09tatAzhzSkomu0QnI3uXGh8Yt3DAiQ/M3jW5oT/IPq1wSVGGAWg?= =?us-ascii?Q?ZH7RItbP3JombDx7G4Vwrq2jt+mh9l1pxggXdZ6CXFkk6McsB+o28Gnj7gIF?= =?us-ascii?Q?MXHJIxJMdlT91JotbnQAihVxpbEiQjBc7ET7QkuHIx8Z6PB5D6+Yzj70Q1eQ?= =?us-ascii?Q?xZtXOH2b/M2XxwRl2JnEPHYkdyH6dPh9xB+0x09T7cvI/50yCWwHpmVbKnMp?= =?us-ascii?Q?qXn3+upZpkOdfcfAf8b13GzifyY08kRDqek2GDi4W7DwgqBlzlVimhWyzPPn?= =?us-ascii?Q?wYDoxpI9lzJMwHtU6CLRrZ4lTXk62GvhRa7e8Rq2FiarP2dHWMS4PhXLf2XK?= =?us-ascii?Q?sYciAsExdfT+eoo2hDZ2j6fgfbfAuoF0urwI0W7hB38od2qinSzVePya/8O5?= =?us-ascii?Q?QDh/kRsesAB3IMjJYoe+LNJ1bVn72VfdAEwylI+Sb5hkvdDAWz9XF6Pm8tLW?= =?us-ascii?Q?IPbpI+dq6S7KTcnmsUmznFcxiTeA/GsNWbOZufweYO6CIqbA1GtT2f68XeXt?= =?us-ascii?Q?bpHleXW8M54WOzeyMAbXY64LdMCGn2ZPUXtro/fXlE0J4RjjxOSSrMs1XDfr?= =?us-ascii?Q?rrMQNwwImPQe0q83BQJVG8nBji4xiVyPSPVUcakoeoDAnl3IR6T2fxA7T98B?= =?us-ascii?Q?daxD7MPegFzrpN+NSSeYTlL1dEMrbGtr+2idiMgXezq+YnacABWXZ72LkvFM?= =?us-ascii?Q?c2YU1rsojDbDdmSBVtFz2wXGcuux2yLOJGrqYkdTis6ycKTaqpmC0AWGf21U?= =?us-ascii?Q?ugUydk2r5GpDt9dhVmYf8qsXj2fI0RYKs9hWiuVc?= X-OriginatorOrg: Nvidia.com X-MS-Exchange-CrossTenant-Network-Message-Id: b63d4094-47df-4f5f-17ca-08db1540d3c3 X-MS-Exchange-CrossTenant-AuthSource: LV2PR12MB5869.namprd12.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 23 Feb 2023 01:53:57.9644 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: 56I4QJdC8L1YDG6Hzkz+a/7yjDkLVs40jIukv0SxHe2fzYr6t9Usou9OdaHe7yPR X-MS-Exchange-Transport-CrossTenantHeadersStamped: CYYPR12MB8730 X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 9636C8000D X-Stat-Signature: y9mp7wmraay55hozri7ixfxzygsyusxo X-Rspam-User: X-HE-Tag: 1677117240-878017 X-HE-Meta: U2FsdGVkX1+KY97RWSjehv4q09O3xnN0vBnMlmBF1LoaRrEBXYW8tLKOpVohneTAtyiN6qXnZCVE3lNiFkHfkyUEdwSe5IvLevzyZ/mCTSyLco8AxyGRU6J2lG75PQ8IXPCC1dkQW1pBiLtBfDdFeMz57BEuIMKibz2JHSA6xeqAog8Atp+DSQ4jVhwS2PFD6tcw0RkA+aiTFeMfKbYma1egjRJZ9uGLXzLZqYqdk9WZ2WAlBI+FwIWFOFdPhJnLhjOMi6UZT/U3G+H2T0W+ecXVD6g39rbJO1Yu3RlHzgc/ZHsa4DL9asIhW+t+0q4aBnPiQN5eHjA3IVVkOU12aOjD8aOzfncKUNGFhpbHOkSK7CMjdFktby5ZgKDkGVG/D+RWia6R58pbP54f+rpEJ9gVRpDQ4PD+4ir+kqOSDjQWP8MWsnL0zYI0Gu9sOIfryt3CbPUQvpRk6aRVgQFVF3Wh4xVDUHEQ/Wtt2PY2U64wop9dtbk33DEHL0SzkwrJ3XQig6+/QiV/7MiGexy9pT0ww+UYp8QT9tXVUZZL4D8Kf4Kch3cFImb7JWqrT9B2fObfeZ6U1Sg20McfuQ9BSAxo4uIATRtvWbPwnpTsD5oJpG7FKLGpWqNryG9cIJWbB8BUT2xsqKNIYkVixfiBD6jY9mvUp1NGyTpNVpVq/omuOPepoVwnvfQMjg345eBrCLnjh9hcvj2bLG/nhIyhvEnMoS2qu+19C7TN340yvpCi7AQdssW4fx+DFf8Y3bGdprsO6fkXRyGmjHpiInu50STqKCSytgubh6Zw01PDcCyjOuo2/w3HgrMrXyCOFTDTePHWnOA999LaZT7QfHihYipxetQt+1KDqzoBIM0z66tjM4xpjQJhtxLPqWhBzxBEDpRA0gcwEhY/x6uB4fOj4Fz/pwloPsryGd3U1kJ8l4JpZfqpywB+ywOFcBN1DD/sk6VM+Ge83IOJ2vSSAEX ltKNheOm 2+9WAf9z/CZt1zlHH0lJ/lXOgxXoY2H1Aems+hEoJA/KWOW0qIllWY7FJ/11u3W6gHyEN8SXz3l5DNU6ydw6Su4LOEg5yzoetqr3+skA2TtQubsogmzFrSptw5Gx6lguoeu19mbtVUYLWVlk22fjAwhUsnv3s+WB8YvbscQb5Ufpxu7wcYZeDi7l2NpH9J5uK8ChB9RcdFyeca9jNkSA09csCLKQKiWppbh5z71Fl+bVOhSrgDotgHn6vCtAHPTLhyQ8jS17xumXThafI41cWyjFd5XlcKkIrS2Rs X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Feb 23, 2023 at 09:59:35AM +1100, Alistair Popple wrote: > > Jason Gunthorpe writes: > > > On Wed, Feb 22, 2023 at 10:38:25PM +1100, Alistair Popple wrote: > >> When a driver unpins a page we scan the pinners list and assign > >> ownership to the next driver pinning the page by updating memcg_data and > >> removing the vm_account from the list. > > > > I don't see how this works with just the data structure you outlined?? > > Every unique page needs its own list_head in the vm_account, it is > > doable just incredibly costly. > > The idea was every driver already needs to allocate a pages array to > pass to pin_user_pages(), and by necessity drivers have to keep a > reference to the contents of that in one form or another. So > conceptually the equivalent of: > > struct vm_account { > struct list_head possible_pinners; > struct mem_cgroup *memcg; > struct pages **pages; > [...] > }; > > Unpinnig involves finding a new owner by traversing the list of > page->memcg_data->possible_pinners and iterating over *pages[] to figure > out if that vm_account actually has this page pinned or not and could > own it. Oh, you are focusing on Tejun's DOS scenario. The DOS problem is to prevent a pin users in cgroup A from keeping memory charged to cgroup B that it isn't using any more. cgroup B doesn't need to be pinning the memory, it could just be normal VMAs and "isn't using anymore" means it has unmapped all the VMAs. Solving that problem means figuring out when every cgroup stops using the memory - pinning or not. That seems to be very costly. AFAIK this problem also already exists today as the memcg of a page doesn't change while it is pinned. So maybe we don't need to address it. Arguably the pins are not the problem. If we want to treat the pin like allocation then we simply charge the non-owning memcg's for the pin as though it was an allocation. Eg go over every page and if the owning memcg is not the current memcg then charge the current memcg for an allocation of the MAP_SHARED memory. Undoing this is trivial enoug. This doesn't fix the DOS problem but it does sort of harmonize the pin accounting with the memcg by multi-accounting every pin of a MAP_SHARED page. The other drawback is that this isn't the same thing as the current rlimit. The rlimit is largely restricting the creation of unmovable memory. Though, AFAICT memcg seems to bundle unmovable memory (eg GFP_KERNEL) along with movable user pages so it would be self-consistent. I'm unclear if this is OK for libvirt.. > Agree this is costly though. And I don't think all drivers keep the > array around so "iterating over *pages[]" may need to be a callback. I think searching lists of pages is not reasonable. Things like VFIO & KVM use cases effectively pin 90% of all system memory, that is potentially TB of page lists that might need linear searching! Jason