From: Balbir Singh <balbirs@nvidia.com>
To: Dave Airlie <airlied@gmail.com>, Dave Chinner <david@fromorbit.com>
Cc: Kairui Song <ryncsn@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: list_lru isolate callback question?
Date: Wed, 11 Jun 2025 09:07:05 +1000
Message-ID: <206e24d8-9027-4a14-9e9a-710bc57921d6@nvidia.com>
In-Reply-To: <CAPM=9tzaB8DBWHegPD-8+iT3S5g0TtGKoOWXp7v9Psbbbr+uBg@mail.gmail.com>

On 6/6/25 08:59, Dave Airlie wrote:
> On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
>>> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
>>>>
>>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
>>>>>
>>>>> I've hit a case where I think it might be valuable to have the nid +
>>>>> struct memcg for the item being iterated available in the isolate
>>>>> callback. I know in theory we should be able to retrieve it from the
>>>>> item, but I'm also not convinced we should need to, since we have it
>>>>> already in the outer function?
>>>>>
>>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
>>>>>                         struct list_lru_one *list,
>>>>>                         int nid,
>>>>>                         struct mem_cgroup *memcg,
>>>>>                         void *cb_arg);
>>>>>
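
For what it's worth, with that extended signature the callback could
consume both values directly; a rough sketch (struct pool_page and the
freeing step are made-up illustration here, not existing code):

static enum lru_status pool_isolate(struct list_head *item,
				    struct list_lru_one *list,
				    int nid, struct mem_cgroup *memcg,
				    void *cb_arg)
{
	/* nid/memcg arrive from the walk; no lookup via the item needed */
	struct pool_page *p = container_of(item, struct pool_page, lru);

	list_lru_isolate(list, item);	/* unlink from this node/memcg list */
	/* ... free p and uncharge against nid/memcg here ... */
	return LRU_REMOVED;
}
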
>>>>
>>>> Hi Dave,
>>>>
>>>>> It's probably not essential (I think I can get the nid back easily,
>>>>> not sure about the memcg yet), but I thought I'd ask if there would be
>>>>
>>>> If it's a slab object you should be able to get it easily with:
>>>> memcg = mem_cgroup_from_slab_obj(item);
>>>> nid = page_to_nid(virt_to_page(item));
>>>>
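
And for reference, the outer walk entry point in current kernels already
takes both values explicitly (modulo the exact signature in your tree),
so plumbing them through to the callback is just parameter passing:

unsigned long list_lru_walk_one(struct list_lru *lru, int nid,
				struct mem_cgroup *memcg,
				list_lru_walk_cb isolate, void *cb_arg,
				unsigned long *nr_to_walk);
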
>>>
>>> It's in relation to some work trying to tie GPU system memory
>>> allocations into memcg properly.
>>>
>>> They're not slab objects, but I do have pages, so I'm using
>>> page_to_nid right now. However, these pages aren't currently setting
>>> p->memcg_data, as I don't need that for this, but maybe this gives me
>>> a reason to go down that road.
>>
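
As an aside, if those pages did get charged somewhere (i.e. p->memcg_data
set), both values are recoverable from the page itself; roughly, assuming
a charged page and CONFIG_MEMCG:

int nid = page_to_nid(page);
struct mem_cgroup *memcg = folio_memcg(page_folio(page)); /* NULL if uncharged */
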
>> How are you accounting the page to the memcg if the page is not
>> marked as owned by a specific memcg?
>>
>> Are you relying on the page being indexed in a specific list_lru to
>> account for the page correctly in reclaim contexts, and that's why
>> you need this information in the walk context?
>>
>> I'd actually like to know more details of the problem you are trying
>> to solve - all I've heard is "we're trying to do <something> with
>> GPUs and memcgs with list_lrus", but I don't know what it is so I
>> can't really give decent feedback on your questions....
>>
> 
> Big picture problem, GPU drivers do a lot of memory allocations for
> userspace applications that historically have not gone via memcg
> accounting. This has been pointed out to be bad and should be fixed.
> 
> As part of that problem, GPU drivers have the ability to hand out
> uncached/writecombined pages to userspace. Creating these pages
> requires changing page attributes, which is a heavyweight operation,
> and that necessitates page pools. These page pools currently only have
> a global shrinker and roll their own NUMA awareness. The
> uncached/writecombined memory isn't a core feature of userspace usage
> patterns, but since we want to do things right it seems like a good
> idea to clean up that space first.
> 
> The goal is to get proper vmstat/memcg tracking for all allocations
> done for the GPU. These can be very large, so I think we should add
> core mm counters for them, and memcg ones as well, so userspace can
> see them and make more educated decisions.
> 
> We don't need page-level memcg tracking, as the pages are all either
> allocated to the process as part of a larger buffer object, or sitting
> in the pool, which has the memcg info, so we aren't intending to use
> __GFP_ACCOUNT at this stage. I also don't really like having this as
> part of kmem; these really are userspace-facing allocations, used by
> the GPU on behalf of userspace.
> 
> My rough plan:
> 1. convert TTM page pools over to list_lru and use a NUMA-aware shrinker
> 2. add global and memcg counters and tracking.
> 3. convert TTM page pools over to a memcg-aware shrinker so we get
> proper operation inside a memcg for some niche use cases.
> 4. Figure out how to deal with memory evictions from VRAM - this is
> probably the hardest problem to solve as there is no great policy.
> 
> Also, handwave: shouldn't this all be folios at some point?
> 
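
On step 1 above, a bare-bones shape of that conversion with the modern
shrinker API might be something like the sketch below; the lru field in
ttm_pool and pool_isolate (a plain three-argument list_lru_walk_cb) are
hypothetical here, not existing TTM code:

static unsigned long pool_shrink_count(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	struct ttm_pool *pool = shrink->private_data;

	return list_lru_shrink_count(&pool->lru, sc);
}

static unsigned long pool_shrink_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	struct ttm_pool *pool = shrink->private_data;

	/* sc->nid (and sc->memcg once memcg-aware) select the sublist */
	return list_lru_shrink_walk(&pool->lru, sc, pool_isolate, pool);
}

/* in pool init: NUMA-aware for step 1, add SHRINKER_MEMCG_AWARE in step 3 */
struct shrinker *shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "ttm-pool");

if (shrinker) {
	shrinker->count_objects = pool_shrink_count;
	shrinker->scan_objects = pool_shrink_scan;
	shrinker->private_data = pool;
	shrinker_register(shrinker);
}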

The key requirement for memcg would be to track the mm on whose behalf
the allocation was made.

kmemcg (__GFP_ACCOUNT) tracks only kernel allocations (it is meant for
kernel overheads); we don't really need it here, and you've already
mentioned this.

For memcg evictions, reference counting and reclaim are used today. I
guess in #4 you are referring to getting that information for VRAM?

Is the overall goal to overcommit VRAM, to restrict the amount of VRAM
usage, or a combination of both?

Balbir Singh

