From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 3 Sep 2024 13:26:13 +0200
From: Simona Vetter <simona.vetter@ffwll.ch>
To: Maxime Ripard
Cc: Tvrtko Ursulin, Maarten Lankhorst, intel-xe@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, Tejun Heo,
 Zefan Li, Johannes Weiner, Andrew Morton, Jonathan Corbet, David Airlie,
 Thomas Zimmermann, Friedrich Vock, cgroups@vger.kernel.org,
 linux-mm@kvack.org, linux-doc@vger.kernel.org
Subject: Re: [RFC PATCH 2/6] drm/cgroup: Add memory accounting DRM cgroup
References: <20240627154754.74828-3-maarten.lankhorst@linux.intel.com>
 <20240627-paper-vicugna-of-fantasy-c549ed@houat>
 <6cb7c074-55cb-4825-9f80-5cf07bbd6745@linux.intel.com>
 <20240628-romantic-emerald-snake-7b26ca@houat>
 <70289c58-7947-4347-8600-658821a730b0@linux.intel.com>
 <40ef0eed-c514-4ec1-9486-2967f23824be@ursulin.net>
 <20240806-gharial-of-abstract-reverence-aad6ea@houat>
 <20240903-resilient-quiet-oxpecker-d57d7a@houat>
In-Reply-To: <20240903-resilient-quiet-oxpecker-d57d7a@houat>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Sep 03, 2024 at 10:53:17AM +0200, Maxime Ripard wrote:
> On Tue, Aug 06, 2024 at 05:26:21PM GMT, Daniel Vetter wrote:
> > On Tue, Aug 06, 2024 at 04:09:43PM +0200, Maxime Ripard wrote:
> > > On Tue, Aug 06, 2024 at 03:01:44PM GMT, Daniel Vetter wrote:
> > > > On Mon, Jul 01, 2024 at 06:01:41PM +0100, Tvrtko Ursulin wrote:
> > > > >
> > > > > On 01/07/2024 10:25, Maarten Lankhorst wrote:
> > > > > > Den 2024-06-28 kl. 16:04, skrev Maxime Ripard:
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Thu, Jun 27, 2024 at 09:22:56PM GMT, Maarten Lankhorst wrote:
> > > > > > > > Den 2024-06-27 kl. 19:16, skrev Maxime Ripard:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > Thanks for working on this!
> > > > > > > > >
> > > > > > > > > On Thu, Jun 27, 2024 at 05:47:21PM GMT, Maarten Lankhorst wrote:
> > > > > > > > > > The initial version was based roughly on the rdma and misc cgroup
> > > > > > > > > > controllers, with a lot of the accounting code borrowed from rdma.
> > > > > > > > > >
> > > > > > > > > > The current version is a complete rewrite with page counter; it uses
> > > > > > > > > > the same min/low/max semantics as the memory cgroup as a result.
> > > > > > > > > >
> > > > > > > > > > There's a small mismatch as TTM uses u64, and page_counter long pages.
> > > > > > > > > > In practice it's not a problem. 32-bits systems don't really come
> > > > > > > > > > with >=4GB cards and as long as we're consistently wrong with units,
> > > > > > > > > > it's fine. The device page size may not be in the same units as
> > > > > > > > > > kernel page size, and each region might also have a different page
> > > > > > > > > > size (VRAM vs GART for example).
> > > > > > > > > >
> > > > > > > > > > The interface is simple:
> > > > > > > > > > - populate drmcgroup_device->regions[..] name and size for each
> > > > > > > > > >   active region, set num_regions accordingly.
> > > > > > > > > > - Call drm(m)cg_register_device()
> > > > > > > > > > - Use drmcg_try_charge to check if you can allocate a chunk of
> > > > > > > > > >   memory, use drmcg_uncharge when freeing it. This may return an
> > > > > > > > > >   error code, or -EAGAIN when the cgroup limit is reached. In that
> > > > > > > > > >   case a reference to the limiting pool is returned.
> > > > > > > > > > - The limiting cs can be used as compare function for
> > > > > > > > > >   drmcs_evict_valuable.
> > > > > > > > > > - After having evicted enough, drop reference to limiting cs with
> > > > > > > > > >   drmcs_pool_put.
> > > > > > > > > >
> > > > > > > > > > This API allows you to limit device resources with cgroups.
> > > > > > > > > > You can see the supported cards in /sys/fs/cgroup/drm.capacity
> > > > > > > > > > You need to echo +drm to cgroup.subtree_control, and then you can
> > > > > > > > > > partition memory.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Maarten Lankhorst
> > > > > > > > > > Co-developed-by: Friedrich Vock
> > > > > > > > > I'm sorry, I should have wrote minutes on the discussion we had with
> > > > > > > > > TJ and Tvrtko the other day.
> > > > > > > > >
> > > > > > > > > We're all very interested in making this happen, but doing a "DRM"
> > > > > > > > > cgroup doesn't look like the right path to us.
> > > > > > > > >
> > > > > > > > > Indeed, we have a significant number of drivers that won't have a
> > > > > > > > > dedicated memory but will depend on DMA allocations one way or the
> > > > > > > > > other, and those pools are shared between multiple frameworks (DRM,
> > > > > > > > > V4L2, DMA-Buf Heaps, at least).
> > > > > > > > >
> > > > > > > > > This was also pointed out by Sima some time ago here:
> > > > > > > > > https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
> > > > > > > > >
> > > > > > > > > So we'll want that cgroup subsystem to be cross-framework. We settled
> > > > > > > > > on a "device" cgroup during the discussion, but I'm sure we'll have
> > > > > > > > > plenty of bikeshedding.
> > > > > > > > >
> > > > > > > > > The other thing we agreed on, based on the feedback TJ got on the
> > > > > > > > > last iterations of his series was to go for memcg for drivers not
> > > > > > > > > using DMA allocations.
> > > > > > > > >
> > > > > > > > > It's the part where I expect some discussion there too :)
> > > > > > > > >
> > > > > > > > > So we went back to a previous version of TJ's work, and I've started
> > > > > > > > > to work on:
> > > > > > > > >
> > > > > > > > > - Integration of the cgroup in the GEM DMA and GEM VRAM helpers
> > > > > > > > >   (this works on tidss right now)
> > > > > > > > >
> > > > > > > > > - Integration of all heaps into that cgroup but the system one
> > > > > > > > >   (working on this at the moment)
> > > > > > > > Should be similar to what I have then. I think you could use my work
> > > > > > > > to continue it.
> > > > > > > >
> > > > > > > > I made nothing DRM specific except the name, if you renamed it the
> > > > > > > > device resource management cgroup and changed the init function
> > > > > > > > signature to take a name instead of a drm pointer, nothing would
> > > > > > > > change. This is exactly what I'm hoping to accomplish, including
> > > > > > > > reserving memory.
> > > > > > > I've started to work on rebasing my current work onto your series
> > > > > > > today, and I'm not entirely sure how what I described would best fit.
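
Just to make the flow concrete, here is a rough sketch of the driver-side
usage the quoted interface description implies. This is illustrative only;
the drmcg_*/drmcs_* names come from the description above, but the struct
layout and function signatures below are assumed, not taken from the patch.

#include <linux/types.h>

/*
 * Sketch only: struct layout and signatures are assumed here, they are
 * not taken from the patch; only the names follow the description above.
 */
struct drmcg_region_desc {
	const char *name;
	u64 size;
};

struct drmcgroup_device {
	struct drmcg_region_desc regions[8];	/* 8 is an arbitrary bound */
	int num_regions;
};

struct drmcgroup_pool_state;

int drmcg_register_device(struct drmcgroup_device *cgdev);
int drmcg_try_charge(struct drmcgroup_pool_state **limit_pool,
		     struct drmcgroup_device *cgdev, u32 region, u64 size);
void drmcg_uncharge(struct drmcgroup_device *cgdev, u32 region, u64 size);
void drmcs_pool_put(struct drmcgroup_pool_state *pool);

static struct drmcgroup_device my_cgdev;

static int mydrv_cgroup_init(u64 vram_size)
{
	/* One entry per memory region the device exposes (VRAM, GTT, ...). */
	my_cgdev.regions[0].name = "vram";
	my_cgdev.regions[0].size = vram_size;
	my_cgdev.num_regions = 1;

	return drmcg_register_device(&my_cgdev);
}

static int mydrv_charge_buffer(u64 size)
{
	struct drmcgroup_pool_state *limit_pool = NULL;
	int ret;

	ret = drmcg_try_charge(&limit_pool, &my_cgdev, 0 /* vram */, size);
	if (ret == -EAGAIN) {
		/*
		 * Over the cgroup limit: evict buffers belonging to
		 * limit_pool (drmcs_evict_valuable() is the compare helper
		 * for that), retry the charge, then drop the reference.
		 */
		drmcs_pool_put(limit_pool);
	}
	return ret;
}

/* The matching free path calls drmcg_uncharge(&my_cgdev, 0, size). */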
> > > > > > > Let's assume we have two KMS device, one using shmem, one using DMA
> > > > > > > allocations, two heaps, one using the page allocator, the other using
> > > > > > > CMA, and one v4l2 device using dma allocations.
> > > > > > >
> > > > > > > So we would have one KMS device and one heap using the page allocator,
> > > > > > > and one KMS device, one heap, and one v4l2 driver using the DMA
> > > > > > > allocator.
> > > > > > >
> > > > > > > Would these make different cgroup devices, or different cgroup regions?
> > > > > > Each driver would register a device, whatever feels most logical for
> > > > > > that device I suppose.
> > > > > >
> > > > > > My guess is that a prefix would also be nice here, so register a device
> > > > > > with name of drm/$name or v4l2/$name, heap/$name. I didn't give it much
> > > > > > thought and we're still experimenting, so just try something. :)
> > > > > >
> > > > > > There's no limit to amount of devices, I only fixed amount of pools to
> > > > > > match TTM, but even that could be increased arbitrarily. I just don't
> > > > > > think there is a point in doing so.
> > > > > Do we need a plan for top level controls which do not include region names?
> > > > > If the latter will be driver specific then I am thinking of ease of
> > > > > configuring it all from the outside. Especially considering that one cgroup
> > > > > can have multiple devices in it.
> > > > >
> > > > > Second question is about double accounting for shmem backed objects. I think
> > > > > they will be seen, for drivers which allocate backing store at buffer
> > > > > objects creation time, under the cgroup of process doing the creation, in
> > > > > the existing memory controller. Right?
> > > > We currently don't set __GFP_ACCOUNT respectively use GFP_KERNEL_ACCOUNT,
> > > > so no. Unless someone allocates them with GFP_USER ...
> > > >
> > > > > Is there a chance to exclude those from there and only have them in this new
> > > > > controller? Or would the opposite be a better choice? That is, not see those
> > > > > in the device memory controller but only in the existing one.
> > > > I missed this, so jumping in super late. I think guidance from Tejun was
> > > > to go the other way around: Exclude allocations from normal system
> > > > memory from device cgroups and instead make sure it's tracked in the
> > > > existing memcg.
> > > >
> > > > Which might mean we need memcg shrinkers and the assorted pain ...
> > > >
> > > > Also I don't think we ever reached some agreement on where things like cma
> > > > allocations should be accounted for in this case.
> > > Yeah, but that's the thing, memcg probably won't cut it for CMA. Because
> > > if you pull the thread, that means that dma-heaps also have to register
> > > their buffers into memcg too, even if it's backed by something else than
> > > RAM.
> > For cma I'm kinda leaning towards "both". If you don't have a special cma
> > cgroup and just memcg, you can exhaust the cma easily. But if the cma
> > allocations also aren't tracked in memcg, you have a blind spot there,
> > which isn't great.
> I think one of earlier comment from Tejun was that we don't want to
> double-account memory, but I guess your point is that we should double
> account if we allocate CMA buffers from the main CMA allocator, and not
> if we're allocating from a secondary one?

Maybe we need to discuss this with Tejun again, but with CMA the issue is
that it's both CMA and normal memory you get through alloc_pages().
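
(Side note on the __GFP_ACCOUNT exchange quoted above: allocations done
through alloc_pages()/kmalloc() are only charged to the allocating task's
memcg when they carry that flag. A minimal illustration, not from any
patch:)

#include <linux/gfp.h>

static void account_example(void)
{
	/* Not charged to the allocating task's memory cgroup. */
	struct page *invisible = alloc_pages(GFP_KERNEL, 0);

	/* Charged, because GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT. */
	struct page *charged = alloc_pages(GFP_KERNEL_ACCOUNT, 0);

	if (invisible)
		__free_pages(invisible, 0);
	if (charged)
		__free_pages(charged, 0);
}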
So I think this is one of the cases where we do have to double account,
because it really is two things in one.

My argument is that we should absolutely track it in the memcg, because if
CMA isn't accounted there you can use that to allocate more system memory
than the memcg allows you to. This is because CMA allocations require that
we move any system memory allocations out of there, so if they happen they
do create memory pressure, and should result in the memcg-aware shrinkers
kicking in if we go over the limits.

But we cannot exclusively rely on the memcg, because CMA is a subset of all
system memory, so if you set the memcg limit low enough to reasonably manage
CMA, you don't have anything left for normal application memory usage at
all. Which doesn't work.

Therefore I think there must be a limit for both, with the CMA limit
necessarily being smaller than the memcg limit. And I think we should do
that for all CMA regions, with each CMA region being tracked separately,
with its own limit on how much you're allowed to allocate in each if there's
more than one. Otherwise, if it's a total limit and you have a display CMA
and a separate camera CMA, applications that have a limit for the
display+camera use-case might exhaust one CMA completely if they can
allocate their entire limit there.

It's kinda like multiple dGPUs: if you only set a VRAM limit in total, with
the idea that e.g. two applications each get half of the VRAM of two GPUs,
then one application could completely fill one GPU, preventing the other app
from using it. Which defeats the point of accounting and resource limits.

Note that for shmem allocations I concur nowadays with Tejun: we really
don't want to double account those, because it all boils down to
alloc_pages(), whether it's a gem bo or application memory that's mmapped.

Maybe CMA is special enough that we really want to track it in the memcg
itself, as a special limit, and not in some kind of disconnected device
cgroup limit?

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
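
To make the dual accounting above concrete, a CMA-backed allocator could
charge both a per-CMA-region pool and the memcg, roughly like this.
cma_cg_try_charge()/cma_cg_uncharge() are hypothetical placeholders for the
per-region controller; cma_alloc(), cma_release() and mem_cgroup_charge()
are existing kernel interfaces, though whether charging the memcg directly
like this is the right plumbing is exactly the open question in the thread.

#include <linux/cma.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical per-CMA-region cgroup charge helpers, not existing APIs. */
int cma_cg_try_charge(struct cma *cma, unsigned long nr_pages);
void cma_cg_uncharge(struct cma *cma, unsigned long nr_pages);

static struct page *heap_alloc_from_cma(struct cma *cma, unsigned long nr_pages)
{
	struct page *page;

	/* First limit: the per-CMA-region pool (hypothetical controller). */
	if (cma_cg_try_charge(cma, nr_pages))
		return NULL;

	page = cma_alloc(cma, nr_pages, 0, false);
	if (!page)
		goto err_uncharge_cma;

	/*
	 * Second limit: also charge the allocating task's memcg, since the
	 * CMA range is carved out of normal system memory and allocating
	 * from it creates memory pressure. How the memcg side would really
	 * be plumbed is open; mem_cgroup_charge() is used here purely for
	 * illustration.
	 */
	if (mem_cgroup_charge(page_folio(page), current->mm, GFP_KERNEL))
		goto err_release;

	return page;

err_release:
	cma_release(cma, page, nr_pages);
err_uncharge_cma:
	cma_cg_uncharge(cma, nr_pages);
	return NULL;
}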