From: Balbir Singh <balbirs@nvidia.com>
To: Dave Airlie <airlied@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>,
Kairui Song <ryncsn@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: list_lru isolate callback question?
Date: Thu, 12 Jun 2025 08:34:42 +1000
Message-ID: <abada181-9ae7-4089-a025-1dc4d74d3487@nvidia.com>
In-Reply-To: <CAPM=9txKFNXZ0JT1mrbHOhFao7uo851-VUSWUG4FS-y6oJbakw@mail.gmail.com>
On 6/11/25 11:43, Dave Airlie wrote:
> On Wed, 11 Jun 2025 at 09:07, Balbir Singh <balbirs@nvidia.com> wrote:
>>
>> On 6/6/25 08:59, Dave Airlie wrote:
>>> On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote:
>>>>
>>>> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
>>>>> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
>>>>>>
>>>>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
>>>>>>>
>>>>>>> I've hit a case where I think it might be valuable to have the nid +
>>>>>>> struct memcg for the item being iterated available in the isolate
>>>>>>> callback. I know in theory we should be able to retrieve it from the
>>>>>>> item, but I'm also not convinced we should need to, since we have it
>>>>>>> already in the outer function?
>>>>>>>
>>>>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
>>>>>>>                                             struct list_lru_one *list,
>>>>>>>                                             int nid,
>>>>>>>                                             struct mem_cgroup *memcg,
>>>>>>>                                             void *cb_arg);
>>>>>>>
>>>>>>
>>>>>> Hi Dave,
>>>>>>
>>>>>>> It's probably not essential (I think I can get the nid back easily,
>>>>>>> not sure about the memcg yet), but I thought I'd ask if there would be
>>>>>>
>>>>>> If it's a slab object you should be able to get it easily with:
>>>>>> memcg = mem_cgroup_from_slab_obj(item);
>>>>>> nid = page_to_nid(virt_to_page(item));
>>>>>>
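For reference, a minimal sketch of how those two helpers could slot into the
current (item, list, cb_arg) callback without extending it, assuming the items
are slab-allocated objects. The function name and the LRU_REMOVED handling are
illustrative only; as noted below, Dave's items are pages rather than slab
objects, where page_to_nid() still works but the memcg would have to come from
folio_memcg() or be carried alongside the item:

#include <linux/list_lru.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/printk.h>

/* Illustrative only: recover nid/memcg from a slab-allocated item inside
 * the existing three-argument callback, rather than extending the API. */
static enum lru_status example_isolate(struct list_head *item,
                                       struct list_lru_one *list,
                                       void *cb_arg)
{
        struct mem_cgroup *memcg = mem_cgroup_from_slab_obj(item);
        int nid = page_to_nid(virt_to_page(item));

        pr_debug("isolating item on nid %d, memcg %p\n", nid, memcg);

        /* ... decide, based on nid/memcg, whether to reclaim this item ... */

        list_lru_isolate(list, item);
        return LRU_REMOVED;
}
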
>>>>>
>>>>> It's in relation to some work trying to tie GPU system memory
>>>>> allocations into memcg properly.
>>>>>
>>>>> Not slab objects, but I do have pages so I'm using page_to_nid right now;
>>>>> however, these pages aren't currently setting p->memcg_data as I don't
>>>>> need that for this, but maybe
>>>>> this gives me a reason to go down that road.
>>>>
>>>> How are you accounting the page to the memcg if the page is not
>>>> marked as owned by a specific memcg?
>>>>
>>>> Are you relying on the page being indexed in a specific list_lru to
>>>> account for the page correctly in reclaim contexts, and that's why
>>>> you need this information in the walk context?
>>>>
>>>> I'd actually like to know more details of the problem you are trying
>>>> to solve - all I've heard is "we're trying to do <something> with
>>>> GPUs and memcgs with list_lrus", but I don't know what it is so I
>>>> can't really give decent feedback on your questions....
>>>>
>>>
>>> Big picture problem, GPU drivers do a lot of memory allocations for
>>> userspace applications, allocations that historically have not gone via
>>> memcg accounting. This has been pointed out to be bad and should be fixed.
>>>
>>> As part of that problem, GPU drivers have the ability to hand out
>>> uncached/writecombined pages to userspace; creating these pages
>>> requires changing attributes and as such is a heavyweight operation
>>> which necessitates page pools. These page pools currently have only a
>>> global shrinker and roll their own NUMA awareness. The
>>> uncached/writecombined memory isn't a core feature of userspace usage
>>> patterns, but since we want to do things right it seems like a good
>>> idea to clean up the space first.
>>>
>>> Get proper vmstat/memcg tracking for all allocations done for the GPU;
>>> these can be very large, so I think we should add core mm counters for
>>> them and memcg ones as well, so userspace can see them and make more
>>> educated decisions.
>>>
>>> We don't need page level memcg tracking as the pages are all either
>>> allocated to the process as part of a larger buffer object, or the
>>> pages are in the pool which has the memcg info, so we aren't intending
>>> on using __GFP_ACCOUNT at this stage. I also don't really like having
>>> this as part of kmem; these are mostly userspace-only things, used
>>> mostly by the GPU and userspace.
>>>
>>> My rough plan:
>>> 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker
>>> 2. add global and memcg counters and tracking.
>>> 3. convert TTM page pools over to memcg aware shrinker so we get the
>>> proper operation inside a memcg for some niche use cases.
>>> 4. Figure out how to deal with memory evictions from VRAM - this is
>>> probably the hardest problem to solve as there is no great policy.
>>>
>>> Also handwave shouldn't this all be folios at some point.
>>>
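As a point of reference for steps 1 and 3 above, a rough sketch of the wiring
with the current list_lru/shrinker APIs might look like the below; every
ttm_pool_* name here is hypothetical, and the error handling and page-disposal
logic are elided:

#include <linux/errno.h>
#include <linux/list_lru.h>
#include <linux/shrinker.h>

static struct list_lru ttm_pool_lru;            /* hypothetical pool LRU */
static struct shrinker *ttm_pool_shrinker;

/* Hypothetical isolate callback: hand the pooled pages behind this item
 * back to the page allocator (disposal elided). */
static enum lru_status ttm_pool_shrink_cb(struct list_head *item,
                                          struct list_lru_one *list,
                                          void *cb_arg)
{
        list_lru_isolate(list, item);
        return LRU_REMOVED;
}

static unsigned long ttm_pool_count(struct shrinker *shrink,
                                    struct shrink_control *sc)
{
        /* per-node, per-memcg counts come straight from the list_lru */
        return list_lru_shrink_count(&ttm_pool_lru, sc);
}

static unsigned long ttm_pool_scan(struct shrinker *shrink,
                                   struct shrink_control *sc)
{
        /* walk only the node/memcg reclaim asked about (sc->nid, sc->memcg) */
        return list_lru_shrink_walk(&ttm_pool_lru, sc, ttm_pool_shrink_cb, NULL);
}

static int ttm_pool_shrinker_init(void)
{
        ttm_pool_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
                                           SHRINKER_MEMCG_AWARE, "ttm-pool");
        if (!ttm_pool_shrinker)
                return -ENOMEM;

        /* a memcg-aware list_lru needs to know its shrinker's id */
        if (list_lru_init_memcg(&ttm_pool_lru, ttm_pool_shrinker)) {
                shrinker_free(ttm_pool_shrinker);
                return -ENOMEM;
        }

        ttm_pool_shrinker->count_objects = ttm_pool_count;
        ttm_pool_shrinker->scan_objects = ttm_pool_scan;
        shrinker_register(ttm_pool_shrinker);
        return 0;
}

The SHRINKER_MEMCG_AWARE flag plus list_lru_init_memcg() are what would make
step 3 possible later; with only SHRINKER_NUMA_AWARE and list_lru_init() this
sketch would stop at step 1.
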
>>
>> The key requirement for memcg would be to track the mm on whose behalf
>> the allocation was made.
>>
>> kmemcg (__GFP_ACCOUNT) tracks only kernel
>> allocations (meant for kernel overheads); we don't really need it, and
>> you've already mentioned this.
>>
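As a small illustration of that point: kmemcg accounting is opt-in per
allocation via __GFP_ACCOUNT, and the charge lands in the allocating task's
memcg as kernel memory, which is why it suits driver-internal overheads rather
than large user-visible GPU buffers. The struct below is made up:

#include <linux/list.h>
#include <linux/slab.h>

/* Hypothetical per-buffer bookkeeping; only the gfp flags matter here. */
struct example_bo_meta {
        struct list_head lru;
        unsigned long npages;
};

static struct example_bo_meta *example_bo_meta_alloc(void)
{
        /* charged as kernel memory to the allocating task's memcg */
        return kzalloc(sizeof(struct example_bo_meta),
                       GFP_KERNEL | __GFP_ACCOUNT);
}
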
>> For memcg evictions, reference counting and reclaim are used today; I guess
>> in #4 you are referring to getting that information for VRAM?
>>
>> Is the overall goal to overcommit VRAM or to restrict the amount of
>> VRAM usage, or a combination of both?
>
> This is kinda the crux of where we are getting to.
>
> We don't track VRAM at all with memcg; that will be the dmem controller's job.
>
> But in the corner case where we do overcommit VRAM, who pays for the
> system RAM where we evict stuff to?
>
> I think ideally we would have system limits give an amount of VRAM and
> system RAM to a process, and it can live within that budget, and we'd
> try not to evict VRAM from processes that have a cgroup-accounted
> right to some of it. But that isn't great for average things like
> desktops or games (where overcommit makes sense); it would be more for
> container workloads on GPU clusters.
>
Makes sense! Thanks!
Balbir