Date: Thu, 14 Oct 2021 08:44:39 -0400
From: Johannes Weiner
To: David Hildenbrand
Cc: Matthew Wilcox, Kent Overstreet, linux-mm@kvack.org
Subject: Re: [PATCH 03/62] mm: Split slab into its own type

On Thu, Oct 14, 2021 at 09:22:13AM +0200, David Hildenbrand wrote:
> On 13.10.21 20:31, Matthew Wilcox wrote:
> > On Wed, Oct 13, 2021 at 02:08:57PM -0400, Johannes Weiner wrote:
> >> Btw, I think slab_nid() is an interesting thing when it comes to page
> >> polymorphy. We want to know the nid for all sorts of memory types:
> >> slab, file, anon, buddy etc.
> >> In the goal of distilling page down to the fewest number of bytes,
> >> this is probably something that should remain in the page rather
> >> than be replicated in all subtypes.
> >
> > Oh, this is a really interesting point.
> >
> > Node ID is typically 10 bits (I checked Debian & Oracle configs for
> > various architectures). That's far more than we can store in the bottom
> > bits of a single word, and it's even a good chunk of a second word.
> >
> > I was assuming that, for the page allocator's memory descriptor and for
> > that of many allocators (such as slab), it would be stored *somewhere*
> > in the memory descriptor. It wouldn't necessarily have to be the same
> > place for all memory descriptors, and maybe (if it's accessed rarely),
> > we delegate finding it to the page allocator's knowledge.
> >
> > But not all memory descriptors want/need/can know this. For example,
> > vmalloc() might well spread its memory across multiple nodes. As long
> > as we can restore the node assignment again once the pages are vfree(),
> > there's no particular need for the vmalloc memory descriptor to know
> > what node an individual page came from (and the concept of asking
> > vmalloc what node a particular allocation came from is potentially
> > nonsense, unless somebody used vmalloc_node() or one of the variants).
> >
> > Not sure there's an obviously right answer here. I was assuming that at
> > first we'd enforce memdesc->flags being the first word of every memory
> > descriptor and so we could keep passing page->flags around. That could
> > then change later, but it'd be a good first step?
> >
>
> It's really hard to make an educated guess here without having a full
> design proposal of what we actually want to achieve and especially how
> we're going to treat all the corner cases (as raised already in
> different context).
>
> I'm all for simplifying struct page and *eventually* being able to
> shrink it, even if we end up only shrinking by a little.
> However, I'm not sold on doing that by any means (e.g., I cannot agree
> to any fundamental page allocator rewrite without an idea what it does
> to performance but also complexity). We might always have a space vs.
> performance cost and saving space by sacrificing performance isn't
> necessarily always a good idea. But again, it's really hard to make an
> educated guess.
>
> Again, I'm all for cleanups and simplifications as long as they really
> make things cleaner. So I'm going to comment on the current state and
> how the cleanups make sense with the current state.

Very well put, and not off-topic at all.

A clearer overarching design proposal that exists in more than one
head, that people agree on, and that is concrete enough to allow others
to make educated guesses on what random snippets of code would or
should look like in the new world, would help immensely.

(This applies here, but to a certain degree to folio as well.)

HOWEVER, given the scope of struct page, I also think this is a very
difficult problem. There are a LOT of nooks and crannies that throw
curveballs at these refactors. We're finding new ones daily.

I think smaller iterations, with a manageable amount of novelty, that
can be reviewed and judged by everybody involved - against the current
code base, rather than against diverging future speculation - make
more sense. These can still push the paradigm in the direction we
want, but we can work out kinks in the overarching ideas as we go.

I think what isn't going to work is committing vast amounts of code to
open-ended projects and indefinite transitory chaos that makes the
codebase go through every conceptual dead end along the way. There are
just too many users and too much development work against each release
of the kernel nowadays to operate like this.

> Node/zone is a property of a base page and belongs into struct page OR
> has to be very easily accessible without any kind of heavy locking.
> The node/zone is determined once memory gets exposed to the system
> (e.g., to the buddy during boot or during memory onlining) and is
> stable until memory is offlined again (as of right now, one could
> imagine changing zones at runtime).

Yes, it's a property of the page frame, for which struct page is the
descriptor.

Even if we could get away with stuffing them into the higher-level
memory descriptors, it's inevitable that this will lead to duplication,
make adding new types more cumbersome, and get in the way of writing
generic code that works on those shared attributes.

struct page really is the right place for it.

> For example, node/zone information is required for (almost) lockless PFN
> walkers in memory offlining context, to figure out if all pages we're
> dealing with belong to one node/zone, but also to properly shrink
> zones+nodes to eventually be able to offline complete nodes. I recall
> that there are other PFN walkers (page compaction) that need this
> information easily accessible.

kmemleak is another one of them.

page_ext is another lower-level plumbing thing. (Although this one we
*may* be able to incorporate into dynamically allocated higher-level
descriptors. Another future idea that is worth keeping on the map, but
one we need to be careful not to make assumptions about today.)

Then there is generic code that doesn't care about higher-level types:
vmstats comes to mind, the generic list_lru code, page table walkers
(gup, numa balancing, per-task numa stats), ...
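
To illustrate why generic code wants node/zone in the base descriptor,
here is a minimal userspace sketch, not kernel code. It assumes a
hypothetical struct frame_desc standing in for struct page, with a
10-bit node id and a 3-bit zone id packed into the top bits of a 64-bit
flags word. The field widths, names and helpers are all invented for
the example; only the "10 bits for the node id" figure comes from the
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: node and zone ids sit in the top bits of flags. */
#define NID_BITS   10u                    /* node id width quoted in the thread */
#define ZONE_BITS  3u                     /* assumption made for this sketch */
#define NID_SHIFT  (64u - NID_BITS)
#define ZONE_SHIFT (NID_SHIFT - ZONE_BITS)
#define NID_MASK   ((UINT64_C(1) << NID_BITS) - 1)
#define ZONE_MASK  ((UINT64_C(1) << ZONE_BITS) - 1)

/* Stand-in for struct page: one descriptor per page frame. */
struct frame_desc {
	uint64_t flags;
};

/* Record the node/zone once, when the frame is exposed to the system. */
static void frame_set_location(struct frame_desc *f, unsigned nid, unsigned zone)
{
	assert(nid <= NID_MASK && zone <= ZONE_MASK);
	/* Clear the old ids, leave all other flag bits untouched. */
	f->flags &= ~((NID_MASK << NID_SHIFT) | (ZONE_MASK << ZONE_SHIFT));
	f->flags |= ((uint64_t)nid << NID_SHIFT) | ((uint64_t)zone << ZONE_SHIFT);
}

static unsigned frame_nid(const struct frame_desc *f)
{
	return (unsigned)((f->flags >> NID_SHIFT) & NID_MASK);
}

static unsigned frame_zone(const struct frame_desc *f)
{
	return (unsigned)((f->flags >> ZONE_SHIFT) & ZONE_MASK);
}

/*
 * PFN-walker style check: true if every frame in [start, end) belongs
 * to the same node and zone. It only reads the flags word, so it needs
 * no knowledge of what higher-level type currently owns each frame.
 */
static bool range_in_single_zone(const struct frame_desc *map,
				 size_t start, size_t end)
{
	if (start >= end)
		return true;

	unsigned nid = frame_nid(&map[start]);
	unsigned zone = frame_zone(&map[start]);

	for (size_t pfn = start + 1; pfn < end; pfn++) {
		if (frame_nid(&map[pfn]) != nid || frame_zone(&map[pfn]) != zone)
			return false;
	}
	return true;
}
```

The walker never dereferences anything beyond the per-frame word. If
the ids lived in per-type descriptors instead, each walker would first
have to work out which type owns the frame before it could ask where
the frame is.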