From: David Hildenbrand
Organization: Red Hat
To: Kent Overstreet, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner, Matthew Wilcox, Linus Torvalds, Andrew Morton,
 "Darrick J. Wong", Christoph Hellwig, David Howells
Subject: Re: Struct page proposal
Date: Thu, 23 Sep 2021 11:03:44 +0200

On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew
> and Johannes is that we seem to be thinking along similar lines
> regarding our end goals for struct page.
>
> The fundamental reason for struct page is that we need memory to be
> self describing, without any context - we need to be able to go from a
> generic untyped struct page and figure out what it contains: handling
> physical memory failure is the most prominent example, but migration
> and compaction are more common. We need to be able to ask the thing
> that owns a page of memory "hey, stop using this and move your stuff
> here".
>
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
>
> But struct page could be a lot smaller than it is now. I think we can
> get it down to two pointers, which means it'll take up 0.4% of system
> memory. Both Matthew and Johannes have ideas for getting it down even
> further - the main thing to note is that virt_to_page() _should_ be an
> uncommon operation (most of the places we're currently using it are
> completely unnecessary, look at all the places we're using it on the
> zero page). Johannes is thinking two layer radix tree, Matthew was
> thinking about using maple trees - personally, I think that 0.4% of
> system memory is plenty good enough.
>
>
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
>
> The main thing to note is that in normal operation most folios are
> going to be describing many pages, not just one - so we'll be using
> _less_ memory overall if we allocate them separately. That's cool.
>
> Of course, for this to make sense, we'll have to get all the other
> stuff in struct page moved into their own types, but file & anon pages
> are the big one, and that's already being tackled.
>
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
>
> Because one of the things we really want and don't have now is a clean
> division between allocator and allocatee state.
> Allocator meaning either the buddy allocator or slab; allocatee state
> would be the folio or the network pool state or whatever actually
> called kmalloc() or alloc_pages().
>
> Right now slab state sits in the same place in struct page where
> allocatee state does, and the reason this is bad is that slab/slub are
> a hell of a lot faster than the buddy allocator, and Johannes wants to
> move the boundary between slab allocations and buddy allocator
> allocations up to like 64k. If we fix where slab state lives, this
> will become completely trivial to do.
>
> So if we have this:
>
> struct page {
>         unsigned long allocator;
>         unsigned long allocatee;
> };
>
> The allocator field would be used for either a pointer to slab/slub's
> state, if it's a slab page, or if it's a buddy allocator page it'd
> encode the order of the allocation - like compound order today, and
> probably whether or not the (compound group of) pages is free.
>
> The allocatee field would be a type-tagged pointer (with the tag in
> the low bits of the pointer) to one of:
> - struct folio
> - struct anon_folio, if that becomes a thing
> - struct network_pool_page
> - struct pte_page
> - struct zone_device_page
>
> Then we can further refactor things until all the stuff that's
> currently crammed in struct page lives in types where each struct
> field means one and precisely one thing, and also where we can freely
> reshuffle and reorganize and add stuff to the various types where we
> couldn't before because it'd make struct page bigger.
>
> Other notes & potential issues:
> - page->compound_dtor needs to die
>
> - page->rcu_head moves into the types that actually need it, no
>   issues there
>
> - page->refcount has question marks around it.
> I think we can also just move it into the types that need it; with RCU
> derefing the pointer to the folio or whatever, grabbing a ref on
> folio->refcount can happen under an RCU read lock - there's no real
> question about whether it's technically possible to get it out of
> struct page, and I think it would be cleaner overall that way.
>
> However, depending on how it's used from code paths that go from
> generic untyped pages, I could see it turning into more of a hassle
> than it's worth. More investigation is needed.
>
> - page->memcg_data - I don't know whether that one more properly
>   belongs in struct page or in the page subtypes - I'd love it if
>   Johannes could talk about that one.
>
> - page->flags - dealing with this is going to be a huge hassle but
>   also where we'll find some of the biggest gains in overall sanity
>   and readability of the code. Right now, PG_locked is super special
>   and ad hoc, and I have run into situations multiple times (and
>   Johannes was in vehement agreement on this one) where I simply could
>   not figure out the behaviour of the current code re: who is
>   responsible for locking pages without instrumenting the code with
>   assertions.
>
>   Meaning anything we do to create and enforce module boundaries
>   between different chunks of code is going to suck, but the end
>   result should be really worthwhile.
>
> Matthew Wilcox and David Howells have been having conversations on IRC
> about what to do about other page bits. It appears we should be able
> to kill a lot of filesystem usage of both PG_private and PG_private_2
> - filesystems in general hang state off of page->private, soon to be
> folio->private, and PG_private in current use just indicates whether
> page->private is nonzero - meaning it's completely redundant.
>

Don't get me wrong, but before there are answers to some of the very
basic questions raised above (especially everything that lives in
page->flags, which are not only page flags: refcount, ...) this isn't
very tempting to spend more time on from a reviewer's perspective.

-- 
Thanks,

David / dhildenb