From: Dave Airlie <airlied@gmail.com>
Date: Wed, 11 Jun 2025 11:40:58 +1000
Subject: Re: list_lru isolate callback question?
To: Dave Chinner
Cc: Kairui Song, Johannes Weiner, Linux Memory Management List <linux-mm@kvack.org>

On Wed, 11 Jun 2025 at 08:44, Dave Chinner wrote:
>
> On Fri, Jun 06, 2025 at 08:59:16AM +1000, Dave Airlie wrote:
> > On Fri, 6 Jun 2025 at 08:39, Dave Chinner wrote:
> > >
> > > On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> > > > On Thu, 5 Jun 2025 at 17:55, Kairui Song wrote:
> > > > >
> > > > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie wrote:
> > > > > >
> > > > > > I've hit a case where I think it might be valuable to have the nid +
> > > > > > struct memcg for the item being iterated available in the isolate
> > > > > > callback. I know in theory we should be able to retrieve it from the
> > > > > > item, but I'm also not convinced we should need to since we have it
> > > > > > already in the outer function?
> > > > > >
> > > > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
> > > > > >                                             struct list_lru_one *list,
> > > > > >                                             int nid,
> > > > > >                                             struct mem_cgroup *memcg,
> > > > > >                                             void *cb_arg);
> > > > > >
> > > > >
> > > > > Hi Dave,
> > > > >
> > > > > > It's probably not essential (I think I can get the nid back easily,
> > > > > > not sure about the memcg yet), but I thought I'd ask if there would be
> > > > >
> > > > > If it's a slab object you should be able to get it easily with:
> > > > > memcg = mem_cgroup_from_slab_obj(item);
> > > > > nid = page_to_nid(virt_to_page(item));
> > > > >
> > > >
> > > > It's in relation to some work trying to tie GPU system memory
> > > > allocations into memcg properly.
> > > >
> > > > Not slab objects, but I do have pages so I'm using page_to_nid right now,
> > > > however these pages aren't currently setting p->memcg_data as I don't
> > > > need that for this, but maybe this gives me a reason to go down that road.
> > >
> > > How are you accounting the page to the memcg if the page is not
> > > marked as owned by a specific memcg?
> > >
> > > Are you relying on the page being indexed in a specific list_lru to
> > > account for the page correctly in reclaim contexts, and that's why
> > > you need this information in the walk context?
> > >
> > > I'd actually like to know more details of the problem you are trying
> > > to solve - all I've heard is "we're trying to do <something> with
> > > GPUs and memcgs with list_lrus", but I don't know what it is so I
> > > can't really give decent feedback on your questions....
> >
> > Big picture problem: GPU drivers do a lot of memory allocations for
> > userspace applications that historically have not gone via memcg
> > accounting. This has been pointed out to be bad and should be fixed.
> >
> > As part of that problem, GPU drivers have the ability to hand out
> > uncached/writecombined pages to userspace. Creating these pages
> > requires changing attributes and as such is a heavyweight operation,
> > which necessitates page pools. These page pools currently only have a
> > global shrinker and roll their own NUMA awareness.

> Ok, it looks to me like there's been a proliferation of these pools
> and shrinkers in recent times? I was aware of the TTM + i915/gem
> shrinkers, but now I look I see XE, panfrost and MSM all have their
> own custom shrinkers now? Ah, panfrost and msm look simple, but
> XE is a wrapper around ttm that does all sorts of weird runtime PM
> stuff.

There are a bunch of different reasons. TTM is for systems with discrete
device memory; it also manages the uncached pools and has a shrinker of
its own for them. Xe recently added a proper shrinker for the device
which needs to interact with TTM, so that is a second, different
shrinker. The PM interactions are so it can power up the GPU to do
certain operations, or refuse to do them if powered down. Then
panfrost/msm are ARM-only with no discrete memory, so they have their
own simpler ones. I think at some point amdgpu will grow a shrinker
more like the Xe one.

>
> I don't see anything obviously NUMA aware in any of them, though....

The NUMA awareness is in the uncached pools code. The shrinker isn't
NUMA aware, but the pools can be created per NUMA node; currently they
just shrink indiscriminately, and moving to list_lru should allow them
to shrink smarter.
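To make that concrete, the sort of shape I'm aiming for with the pool
conversion is roughly the below. Completely untested sketch: the pool_*
names, pool_free_pages() and the "drm-ttm-pool" shrinker name are
placeholders, and it assumes the current three-argument list_lru walk
callback plus the shrinker_alloc()/shrinker_register() interface.

#include <linux/list_lru.h>
#include <linux/shrinker.h>

static struct list_lru pool_lru;        /* one lru, per-node internally */

static enum lru_status pool_isolate(struct list_head *item,
                                    struct list_lru_one *list,
                                    void *cb_arg)
{
        struct list_head *dispose = cb_arg;

        /* move the page off this node's list; it gets freed after the walk */
        list_lru_isolate_move(list, item, dispose);
        return LRU_REMOVED;
}

static unsigned long pool_shrink_count(struct shrinker *shrink,
                                       struct shrink_control *sc)
{
        /* sc->nid (and later sc->memcg) pick the right sublist */
        return list_lru_shrink_count(&pool_lru, sc);
}

static unsigned long pool_shrink_scan(struct shrinker *shrink,
                                      struct shrink_control *sc)
{
        LIST_HEAD(dispose);
        unsigned long freed;

        freed = list_lru_shrink_walk(&pool_lru, sc, pool_isolate, &dispose);
        /* pool_free_pages(&dispose);  placeholder for the real free path */
        return freed;
}

static int pool_shrinker_setup(void)
{
        struct shrinker *s;
        int err;

        err = list_lru_init(&pool_lru);
        if (err)
                return err;

        s = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm-pool");
        if (!s) {
                list_lru_destroy(&pool_lru);
                return -ENOMEM;
        }
        s->count_objects = pool_shrink_count;
        s->scan_objects = pool_shrink_scan;
        shrinker_register(s);
        return 0;
}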
>
> > We don't need page level memcg tracking as the pages are all either
> > allocated to the process as part of a larger buffer object, or the
> > pages are in the pool which has the memcg info, so we aren't intending
> > on using __GFP_ACCOUNT at this stage. I also don't really like having
> > this as part of kmem; these really are userspace-only things mostly,
> > and they are mostly used by the GPU and userspace.
>
> Seems reasonable to me if you can manage the memcgs outside the LRU
> contexts sanely.
>
> > My rough plan:
> > 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker
> > 2. add global and memcg counters and tracking.
> > 3. convert TTM page pools over to a memcg aware shrinker so we get the
> > proper operation inside a memcg for some niche use cases.
>
> Once you've converted to list_lru, this step should mainly be
> adding a flag to the shrinker and changing the list_lru init
> function.
>
> Just remember that list_lru_{add,del}() require external means of
> pinning the memcg while the LRU operation is being done. i.e. the
> indexing and management of memcg lifetimes for LRU operations is the
> responsibility of the external code, not the list_lru
> infrastructure.
>
> > 4. Figure out how to deal with memory evictions from VRAM - this is
> > probably the hardest problem to solve as there is no great policy.
>
> Buffer objects on the LRU can be unused by userspace but still
> mapped into VRAM at the time the shrinker tries to reclaim them?
> Hence the shrinker tries to evict them from VRAM to reclaim the
> memory? Or are you talking about something else?

VRAM isn't currently tracked, but when we do have to evict something from
VRAM, we generally have to evict to system RAM somewhere, and whose memcg
that gets accounted to is hard to determine at this point.

There is also code in the Xe/TTM shrinker to handle evictions to swap
under memory pressure more smartly: say you have a 10MB VRAM object that
is getting evicted; normally we allocate 10MB of main memory, copy it,
and then it can get pushed to swap. The recent Xe shrinker allows the
10MB to get evicted in 10 1MB chunks, which can get pushed to swap
sooner and so only use 1MB of system RAM at a time.

> > Also handwave shouldn't this all be folios at some point.
>
> Or maybe a new uncached/writecombined specific type tailored to the
> exact needs of GPU resource management.

Yes, I'm not saying a page flag would solve it all, but sometimes a page
flag would make things easier; a folio flag for uncached/wc with core mm
dealing with the pools etc. would probably be nice in the future,
especially if more devices start needing this stuff.

Dave.
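P.S. For step 3 my working assumption is that the count/scan side of the
earlier sketch barely changes; roughly the below, again untested. The
pool->memcg below is a placeholder for wherever the pool stashes its
owning memcg, which we'd have to keep pinned around the add/del as you
point out.

/* memcg-aware variant: flag the shrinker and use the memcg-aware init;
 * the shrinker must be allocated first so the lru can use its id.
 */
s = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "drm-ttm-pool");
err = list_lru_init_memcg(&pool_lru, s);

/* add path: these pages don't set memcg_data, so pass nid/memcg
 * explicitly rather than using list_lru_add_obj()
 */
list_lru_add(&pool_lru, &page->lru, page_to_nid(page), pool->memcg);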