From: Dave Airlie <airlied@gmail.com>
Date: Wed, 11 Jun 2025 11:43:59 +1000
Subject: Re: list_lru isolate callback question?
In-Reply-To: <206e24d8-9027-4a14-9e9a-710bc57921d6@nvidia.com>
To: Balbir Singh
Cc: Dave Chinner, Kairui Song, Johannes Weiner, Linux Memory Management List <linux-mm@kvack.org>

On Wed, 11 Jun 2025 at 09:07, Balbir Singh wrote:
>
> On 6/6/25 08:59, Dave Airlie wrote:
> > On Fri, 6 Jun 2025 at 08:39, Dave Chinner wrote:
> >>
> >> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> >>> On Thu, 5 Jun 2025 at 17:55, Kairui Song wrote:
> >>>>
> >>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie wrote:
> >>>>>
> >>>>> I've hit a case where I think it might be valuable to have the nid +
> >>>>> struct memcg for the item being iterated available in the isolate
> >>>>> callback. I know in theory we should be able to retrieve it from the
> >>>>> item, but I'm also not convinced we should need to, since we already
> >>>>> have it in the outer function?
> >>>>>
> >>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
> >>>>>                                             struct list_lru_one *list,
> >>>>>                                             int nid,
> >>>>>                                             struct mem_cgroup *memcg,
> >>>>>                                             void *cb_arg);
> >>>>>
> >>>>
> >>>> Hi Dave,
> >>>>
> >>>>> It's probably not essential (I think I can get the nid back easily,
> >>>>> not sure about the memcg yet), but I thought I'd ask if there would be
> >>>>
> >>>> If it's a slab object you should be able to get it easily with:
> >>>>     memcg = mem_cgroup_from_slab_obj(item);
> >>>>     nid = page_to_nid(virt_to_page(item));
> >>>>
> >>>
> >>> It's in relation to some work trying to tie GPU system memory
> >>> allocations into memcg properly.
> >>>
> >>> They aren't slab objects, but I do have pages, so I'm using page_to_nid
> >>> right now. However, these pages aren't currently setting p->memcg_data,
> >>> as I don't need that for this, but maybe this gives me a reason to go
> >>> down that road.
> >>
> >> How are you accounting the page to the memcg if the page is not
> >> marked as owned by a specific memcg?
> >>
> >> Are you relying on the page being indexed in a specific list_lru to
> >> account for the page correctly in reclaim contexts, and is that why
> >> you need this information in the walk context?
> >>
> >> I'd actually like to know more details of the problem you are trying
> >> to solve - all I've heard is "we're trying to do something with
> >> GPUs and memcgs with list_lrus", but I don't know what it is, so I
> >> can't really give decent feedback on your questions....
> >>
> >
> > Big picture problem: GPU drivers do a lot of memory allocations for
> > userspace applications that historically have not gone via memcg
> > accounting. This has been pointed out to be bad and should be fixed.
> >
> > As part of that problem, GPU drivers have the ability to hand out
> > uncached/write-combined pages to userspace. Creating these pages
> > requires changing page attributes and as such is a heavyweight
> > operation, which necessitates page pools. These page pools currently
> > only have a global shrinker and roll their own NUMA awareness. The
> > uncached/write-combined memory isn't a core part of userspace usage
> > patterns, but since we want to do things right, it seems like a good
> > idea to clean up that space first.
> >
> > Get proper vmstat/memcg tracking for all allocations done for the GPU.
> > These can be very large, so I think we should add core mm counters for
> > them, and memcg ones as well, so userspace can see them and make more
> > educated decisions.
> >
> > We don't need page-level memcg tracking, as the pages are all either
> > allocated to the process as part of a larger buffer object, or sitting
> > in the pool, which has the memcg info, so we aren't intending to use
> > __GFP_ACCOUNT at this stage. I also don't really like having this as
> > part of kmem; these are mostly userspace-only things, used by the GPU
> > on behalf of userspace.
> >
> > My rough plan:
> > 1. convert TTM page pools over to list_lru and use a NUMA-aware shrinker
> > 2. add global and memcg counters and tracking.
> > 3. convert TTM page pools over to a memcg-aware shrinker so we get the
> >    proper operation inside a memcg for some niche use cases.
> > 4. Figure out how to deal with memory evictions from VRAM - this is
> >    probably the hardest problem to solve as there is no great policy.
> >
> > Also, handwave, shouldn't this all be folios at some point?
> >
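For concreteness, a rough sketch of what steps 1 and 3 above could look
like: a page pool backed by a list_lru, fronted by a NUMA- and
memcg-aware shrinker. The gpu_pool naming and the page handling are made
up for illustration (this is not the actual TTM code), and it assumes the
current upstream list_lru/shrinker interfaces; the isolate callback uses
today's signature, i.e. the quoted typedef without the proposed
nid/memcg arguments.

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/list_lru.h>
#include <linux/mm.h>
#include <linux/shrinker.h>

struct gpu_pool {
	struct list_lru lru;		/* pooled pages, indexed by (nid, memcg) */
	struct shrinker *shrinker;
};

/* Today's callback signature: the quoted typedef minus nid/memcg. */
static enum lru_status gpu_pool_isolate(struct list_head *item,
					struct list_lru_one *list,
					void *cb_arg)
{
	struct list_head *dispose = cb_arg;

	/* move the page to a private list; free it after the walk */
	list_lru_isolate_move(list, item, dispose);
	return LRU_REMOVED;
}

static unsigned long gpu_pool_count(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	struct gpu_pool *pool = shrink->private_data;

	/* per-node, per-memcg object count comes straight from the list_lru */
	return list_lru_shrink_count(&pool->lru, sc);
}

static unsigned long gpu_pool_scan(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	struct gpu_pool *pool = shrink->private_data;
	unsigned long freed;
	LIST_HEAD(dispose);

	/* only walks the (sc->nid, sc->memcg) sublist of the pool */
	freed = list_lru_shrink_walk(&pool->lru, sc, gpu_pool_isolate,
				     &dispose);

	/* assumes the pool links pages through page->lru */
	while (!list_empty(&dispose)) {
		struct page *page = list_first_entry(&dispose, struct page,
						     lru);

		list_del(&page->lru);
		__free_page(page);	/* TTM would also undo caching attrs */
	}
	return freed;
}

static int gpu_pool_init(struct gpu_pool *pool)
{
	int err;

	pool->shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
					SHRINKER_MEMCG_AWARE, "gpu-pool");
	if (!pool->shrinker)
		return -ENOMEM;

	/* memcg-aware list_lru: per-node, per-memcg lists tied to the shrinker */
	err = list_lru_init_memcg(&pool->lru, pool->shrinker);
	if (err) {
		shrinker_free(pool->shrinker);
		return err;
	}

	pool->shrinker->count_objects = gpu_pool_count;
	pool->shrinker->scan_objects = gpu_pool_scan;
	pool->shrinker->private_data = pool;
	shrinker_register(pool->shrinker);
	return 0;
}

Pages would go onto the lru when they are returned to the pool (e.g. via
list_lru_add() with the nid/memcg the pool already knows), and the memcg
side only does something useful once that information is actually
tracked somewhere - which is the question this thread is circling.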
> The key requirements for memcg would be to track the mm on whose behalf
> the allocation was made.
>
> kmemcg (__GFP_ACCOUNT) tracks only kernel allocations (meant for kernel
> overheads); we don't really need it, and you've already mentioned this.
>
> For memcg evictions, reference counting and reclaim are used today. I
> guess in #4 you are referring to getting that information for VRAM?
>
> Is the overall goal to overcommit VRAM, to restrict the amount of VRAM
> usage, or a combination of both?

This is kinda the crux of where we are getting to.

We don't track VRAM at all with memcg; that will be the dmem
controller's job.

But in the corner case where we do overcommit VRAM, who pays for the
system RAM we evict stuff into?

I think ideally we would have system limits give an amount of VRAM and
system RAM to a process, and it can live within that budget, and we'd
try not to evict VRAM from processes that have a cgroup-accounted right
to some of it. But that isn't great for average things like desktops or
games (where overcommit makes sense); it would be more for container
workloads on GPU clusters.

Dave.
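(For reference, since the thread started with whether the isolate
callback needs nid/memcg passed in: for page-backed, non-slab pool
entries the data can be recovered from the item itself roughly as below.
This assumes the pool links pages through page->lru; folio_memcg() only
returns something useful once the pages actually set memcg_data, which
is exactly the open question above.)

#include <linux/list_lru.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/printk.h>

static enum lru_status pool_isolate(struct list_head *item,
				    struct list_lru_one *list,
				    void *cb_arg)
{
	struct page *page = list_entry(item, struct page, lru);
	int nid = page_to_nid(page);
	/* NULL unless the driver has charged the page (memcg_data set) */
	struct mem_cgroup *memcg = folio_memcg(page_folio(page));

	pr_debug("pool page on node %d, memcg %p\n", nid, memcg);

	list_lru_isolate(list, item);
	return LRU_REMOVED;
}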