From: Johannes Weiner <hannes@cmpxchg.org>
To: Arjun Roy <arjunroy@google.com>
Cc: Arjun Roy <arjunroy.kdev@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Miller <davem@davemloft.net>,
netdev <netdev@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Cgroups <cgroups@vger.kernel.org>, Linux MM <linux-mm@kvack.org>,
Shakeel Butt <shakeelb@google.com>,
Eric Dumazet <edumazet@google.com>,
Soheil Hassas Yeganeh <soheil@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Michal Hocko <mhocko@kernel.org>, Yang Shi <shy828301@gmail.com>,
Roman Gushchin <guro@fb.com>
Subject: Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy
Date: Tue, 23 Mar 2021 13:01:36 -0400 [thread overview]
Message-ID: <YFoe8BO0JsbXTHHF@cmpxchg.org> (raw)
In-Reply-To: <CAOFY-A17g-Aq_TsSX8=mD7ZaSAqx3gzUuCJT8K0xwrSuYdP4Kw@mail.gmail.com>
On Mon, Mar 22, 2021 at 02:35:11PM -0700, Arjun Roy wrote:
> To make sure we're on the same page, then, here's a tentative
> mechanism - I'd rather get buy in before spending too much time on
> something that wouldn't pass muster afterwards.
>
> A) An opt-in mechanism, that a driver needs to explicitly support, in
> order to get properly accounted receive zerocopy.
Yep, opt-in makes sense. That allows piece-by-piece conversion and
avoids us having to have a flag day.
> B) Failure to opt-in (e.g. unchanged old driver) can either lead to
> unaccounted zerocopy (ie. existing behaviour) or, optionally,
> effectively disabled zerocopy (ie. any call to zerocopy will return
> something like EINVAL) (perhaps controlled via some sysctl, which
> either lets zerocopy through or not with/without accounting).
I'd suggest letting it fail gracefully (i.e. no -EINVAL) to not
disturb existing/working setups during the transition period. But the
exact policy is easy to change later on if we change our minds on it.
> The proposed mechanism would involve:
> 1) Some way of marking a page as being allocated by a driver that has
> decided to opt into this mechanism. Say, a page flag, or a memcg flag.
Right. I would stress it should not be a memcg flag or any direct
channel from the network to memcg, as this would limit its usefulness
while having the same maintenance overhead.
It should make the network page a first class MM citizen - like an LRU
page or a slab page - which can be accounted and introspected as such,
including from the memcg side.
So definitely a page flag.
> 2) A callback provided by the driver, that takes a struct page*, and
> returns a boolean. The value of the boolean being true indicates that
> any and all refs on the page are held by the driver. False means there
> exists at least one reference that is not held by the driver.
I was thinking the PageNetwork flag would cover this, but maybe I'm
missing something?
> 3) A branch in put_page() that, for pages marked thus, will consult
> the driver callback and if it returns true, will uncharge the memcg
> for the page.
The way I picture it, put_page() (and release_pages) should do this:
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
put_dev_pagemap(page->pgmap);
/*
* The page belongs to the device that created pgmap. Do
* not return it to page allocator.
*/
return;
}
+
+ if (PageNetwork(page)) {
+ put_page_network(page);
+ /* Page belongs to the network stack, not the page allocator */
+ return;
+ }
if (unlikely(PageCompound(page)))
__put_compound_page(page);
else
__put_single_page(page);
}
where put_page_network() is the network-side callback that uncharges
the page.
(..and later can be extended to do all kinds of things when informed
that the page has been freed: update statistics (mod_page_state), put
it on a private network freelist, or ClearPageNetwork() and give it
back to the page allocator etc.
But for starters it can set_page_count(page, 1) after the uncharge to
retain the current silent recycling behavior.)
> The anonymous struct you defined above is part of a union that I think
> normally is one qword in length (well, could be more depending on the
> typedefs I saw there) and I think that can be co-opted to provide the
> driver callback - though, it might require growing the struct by one
> more qword since there may be drivers like mlx5 that are already using
> the field already in there for dma_addr.
The page cache / anonymous struct it's shared with is 5 words (double
linked list pointers, mapping, index, private), and the network struct
is currently one word, so you can add 4 words to a PageNetwork() page
without increasing the size of struct page. That should be plenty of
space to store auxiliary data for drivers, right?
> Anyways, the callback could then be used by the driver to handle the
> other accounting quirks you mentioned, without needing to scan the
> full pool.
Right.
> Of course there are corner cases and such to properly account for, but
> I just wanted to provide a really rough sketch to see if this
> (assuming it were properly implemented) was what you had in mind. If
> so I can put together a v3 patch.
Yeah, makes perfect sense. We can keep iterating like this any time
you feel you accumulate too many open questions. Not just for MM but
also for the networking folks - although I suspect that the first step
would be mostly about the MM infrastructure, and I'm not sure how much
they care about the internals there ;)
> Per my response to Andrew earlier, this would make it even more
> confusing whether this is to be applied against net-next or mm trees.
> But that's a bridge to cross when we get to it.
The mm tree includes -next, so it should be a safe development target
for the time being.
I would then decide it based on how many changes your patch interacts
with on either side. Changes to struct page and the put path are not
very frequent, so I suspect it'll be easy to rebase to net-next and
route everything through there. And if there are heavy changes on both
sides, the -mm tree is the better route anyway.
Does that sound reasonable?
next prev parent reply other threads:[~2021-03-23 17:01 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-16 4:16 Arjun Roy
2021-03-16 4:20 ` Arjun Roy
2021-03-16 4:29 ` Shakeel Butt
2021-03-16 6:22 ` Arjun Roy
2021-03-16 6:28 ` Arjun Roy
2021-03-16 21:02 ` Jakub Kicinski
2021-03-16 10:26 ` Johannes Weiner
2021-03-17 6:05 ` Arjun Roy
2021-03-17 22:12 ` Johannes Weiner
2021-03-22 21:35 ` Arjun Roy
2021-03-23 17:01 ` Johannes Weiner [this message]
2021-03-23 18:42 ` Arjun Roy
2021-03-24 18:25 ` Shakeel Butt
2021-03-24 22:21 ` Arjun Roy
2021-03-23 14:34 ` Michal Hocko
2021-03-23 18:47 ` Arjun Roy
2021-03-24 9:12 ` Michal Hocko
2021-03-24 20:39 ` Arjun Roy
2021-03-24 20:53 ` Shakeel Butt
2021-03-24 21:56 ` Michal Hocko
2021-03-24 21:24 ` Johannes Weiner
2021-03-24 22:49 ` Arjun Roy
2021-03-25 9:02 ` Michal Hocko
2021-03-25 16:47 ` Johannes Weiner
2021-03-25 17:50 ` Michal Hocko
-- strict thread matches above, loose matches on Subject: below --
2021-03-16 1:30 Arjun Roy
2021-03-18 3:21 ` Andrew Morton
2021-03-22 21:19 ` Arjun Roy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YFoe8BO0JsbXTHHF@cmpxchg.org \
--to=hannes@cmpxchg.org \
--cc=akpm@linux-foundation.org \
--cc=arjunroy.kdev@gmail.com \
--cc=arjunroy@google.com \
--cc=cgroups@vger.kernel.org \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=guro@fb.com \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=shakeelb@google.com \
--cc=shy828301@gmail.com \
--cc=soheil@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox