From: David Hildenbrand
Organization: Red Hat
To: Kent Overstreet, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner, Matthew Wilcox, Linus Torvalds, Andrew Morton,
 "Darrick J. Wong", Christoph Hellwig, David Howells
Subject: Re: Struct page proposal
Date: Thu, 23 Sep 2021 11:03:44 +0200

On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew
> and Johannes is that we seem to be thinking along similar lines
> regarding our end goals for struct page.
>
> The fundamental reason for struct page is that we need memory to be
> self describing, without any context - we need to be able to go from a
> generic untyped struct page and figure out what it contains: handling
> physical memory failure is the most prominent example, but migration
> and compaction are more common. We need to be able to ask the thing
> that owns a page of memory "hey, stop using this and move your stuff
> here".
>
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
>
> But struct page could be a lot smaller than it is now. I think we can
> get it down to two pointers, which means it'll take up 0.4% of system
> memory. Both Matthew and Johannes have ideas for getting it down even
> further - the main thing to note is that virt_to_page() _should_ be an
> uncommon operation (most of the places we're currently using it are
> completely unnecessary, look at all the places we're using it on the
> zero page). Johannes is thinking two layer radix tree, Matthew was
> thinking about using maple trees - personally, I think that 0.4% of
> system memory is plenty good enough.
>
>
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
>
> The main thing to note is that in normal operation most folios are
> going to be describing many pages, not just one - so we'll be using
> _less_ memory overall if we allocate them separately. That's cool.
>
> Of course, for this to make sense, we'll have to get all the other
> stuff in struct page moved into their own types, but file & anon pages
> are the big one, and that's already being tackled.
>
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
>
> Because one of the things we really want and don't have now is a clean
> division between allocator and allocatee state.
> Allocator meaning either the buddy allocator or slab; allocatee state
> would be the folio or the network pool state or whatever actually
> called kmalloc() or alloc_pages().
>
> Right now slab state sits in the same place in struct page where
> allocatee state does, and the reason this is bad is that slab/slub are
> a hell of a lot faster than the buddy allocator, and Johannes wants to
> move the boundary between slab allocations and buddy allocator
> allocations up to like 64k. If we fix where slab state lives, this
> will become completely trivial to do.
>
> So if we have this:
>
> struct page {
>         unsigned long allocator;
>         unsigned long allocatee;
> };
>
> The allocator field would be used for either a pointer to slab/slub's
> state, if it's a slab page, or if it's a buddy allocator page it'd
> encode the order of the allocation - like compound order today, and
> probably whether or not the (compound group of) pages is free.
>
> The allocatee field would be a type-tagged pointer (with the tag in
> the low bits of the pointer) to one of:
> - struct folio
> - struct anon_folio, if that becomes a thing
> - struct network_pool_page
> - struct pte_page
> - struct zone_device_page
>
> Then we can further refactor things until all the stuff that's
> currently crammed in struct page lives in types where each struct
> field means one and precisely one thing, and also where we can freely
> reshuffle and reorganize and add stuff to the various types where we
> couldn't before because it'd make struct page bigger.
>
> Other notes & potential issues:
> - page->compound_dtor needs to die
>
> - page->rcu_head moves into the types that actually need it, no
>   issues there
>
> - page->refcount has question marks around it.
> I think we can also just move it into the types that need it; with RCU
> derefing the pointer to the folio or whatever, grabbing a ref on
> folio->refcount can happen under an RCU read lock - there's no real
> question about whether it's technically possible to get it out of
> struct page, and I think it would be cleaner overall that way.
>
> However, depending on how it's used from code paths that go from
> generic untyped pages, I could see it turning into more of a hassle
> than it's worth. More investigation is needed.
>
> - page->memcg_data - I don't know whether that one more properly
>   belongs in struct page or in the page subtypes - I'd love it if
>   Johannes could talk about that one.
>
> - page->flags - dealing with this is going to be a huge hassle but
>   also where we'll find some of the biggest gains in overall sanity
>   and readability of the code. Right now, PG_locked is super special
>   and ad hoc, and I have run into situations multiple times (and
>   Johannes was in vehement agreement on this one) where I simply could
>   not figure out the behaviour of the current code re: who is
>   responsible for locking pages without instrumenting the code with
>   assertions.
>
>   Meaning anything we do to create and enforce module boundaries
>   between different chunks of code is going to suck, but the end
>   result should be really worthwhile.
>
> Matthew Wilcox and David Howells have been having conversations on IRC
> about what to do about other page bits. It appears we should be able
> to kill a lot of filesystem usage of both PG_private and PG_private_2
> - filesystems in general hang state off of page->private, soon to be
> folio->private, and PG_private in current use just indicates whether
> page->private is nonzero - meaning it's completely redundant.
>

Don't get me wrong, but before there are answers to some of the very
basic questions raised above (especially everything that lives in
page->flags, which are not only page flags: refcount, ...) this isn't
very tempting to spend more time on from a reviewer's perspective.

-- 
Thanks,

David / dhildenb