Date: Thu, 11 Aug 2022 22:31:21 +0100
From: Matthew Wilcox <willy@infradead.org>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: State of the Page (August 2022)

==============================
State Of The Page, August 2022
==============================

I thought I'd write down where we are with struct page and where we're
going, just to make sure we're all (still?) pulling in a similar
direction.

Destination
===========

For some users, the size of struct page is simply too large.  At 64
bytes per 4KiB page, memmap occupies 1.6% of memory.  If we can get
struct page down to an 8-byte tagged pointer, it will be 0.2% of
memory, which is an acceptable overhead.

struct page {
        unsigned long mem_desc;
};

Types of memdesc
----------------

This is very much subject to change as new users present themselves.
Here are the ones currently planned:

 - Undescribed.  Instead of the rest of the word being a pointer, there
   are 2^28 subtypes available:
   - Unmappable.  Typically device drivers allocating private memory.
   - Reserved.  These pages are not allocatable.
   - HWPoison
   - Offline (eg balloon)
   - Guard (see debug_pagealloc)
 - Slab
 - Anon Folio
 - File Folio
 - Buddy (ie free -- also for PCP?)
 - Page Table
 - Vmalloc
 - Net Pool
 - Zsmalloc
 - Z3Fold
 - Mappable.  Typically device drivers mapping memory to userspace

That implies 4 bits are needed for the tag, so all memdesc allocations
must be 16-byte aligned.  That is not an undue burden.  Memdescs must
also be TYPESAFE_BY_RCU if they are mappable to userspace or can be
stored in a file's address_space.

It may be worth distinguishing between vmalloc-mappable and
vmalloc-unmappable to prevent some things being mapped to userspace
inadvertently.

Contents of a memdesc
---------------------

At least initially, the first word of a memdesc must be identical to
the current page flags.  That allows various functions (eg
set_page_dirty()) to work on any kind of page without needing to know
whether it's a device driver page, a vmalloc page, or an anon or file
folio.  Similarly, both anon and file folios must have the list_head in
the same place so they can be placed on the same LRU list.  Whether
anon and file folios become separate types is still unclear to me.

Mappable
--------

All pages mapped to userspace must have:

 - A refcount
 - A mapcount

Preferably in the same place in the memdesc so we can handle them
without having separate cases for each type of memdesc.  It would be
nice to have a pincount as well, but that's already an optional
feature.  I propose:

struct mappable {
        unsigned long flags;    /* contains dirty flag */
        atomic_t _refcount;
        atomic_t _mapcount;
};

struct folio {
        union {
                unsigned long flags;
                struct mappable m;
        };
        ...
};

Memdescs which should never be mapped to userspace (eg slab, page
tables, zsmalloc) do not need to contain such a struct.
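As an illustration of why it matters that the refcount sits at the same
offset in every mappable memdesc, here is a minimal sketch, not part of
the proposal: memdesc_mappable() and memdesc_get() are hypothetical
names, and the sketch assumes the type tag occupies the low 4 bits
freed up by the 16-byte alignment described above.

#define MEMDESC_TAG_MASK        0xfUL

/* Caller has already checked that the tag denotes a mappable type. */
static inline struct mappable *memdesc_mappable(struct page *page)
{
        return (struct mappable *)(page->mem_desc & ~MEMDESC_TAG_MASK);
}

static inline void memdesc_get(struct page *page)
{
        /* Same code path for anon folios, file folios and driver mappables. */
        atomic_inc(&memdesc_mappable(page)->_refcount);
}

The same pattern would apply to the mapcount and to the dirty flag in
the first word; generic code never needs to know which kind of memdesc
it is holding.
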
Mapcount
--------

Although mapcount was mentioned above, handling it is tricky enough to
need its own section.  Since folios can be mapped unaligned, we may
need to increment mapcount once per page table entry that refers to the
folio.  This is different from how THPs are handled today (one refcount
per page plus a compound_mapcount for how many times the entire THP is
mapped).  Splitting a PMD entry therefore results in incrementing
mapcount by (PTRS_PER_PMD - 1).  If the mapcount is raised to
dangerously high levels, we can split the page.  This should not happen
in normal operation.

Extended Memdescs
-----------------

One of the things we're considering is that a filesystem might want to
have private data allocated alongside its folios.  Instead of hanging
extra stuff off folio->private, it could embed a struct folio inside a
struct ext4_folio (see the sketch at the end of this note).

Buddy memdesc
-------------

I need to firm up a plan for this.  Allocating memory in order to free
memory is generally a bad idea, so we either have to co-opt the
contents of other memdescs (and some allocations don't have memdescs!)
or we need to store everything we need in the remainder of the unsigned
long.  I'm not yet familiar enough with the page allocator to have a
clear picture of what is needed.

Where are we?
=============

v5.17:
 - Slab was broken out from struct page (thanks to Vlastimil)
 - XFS & iomap mostly converted from pages to folios
 - Block & page cache mostly have the folio interfaces in place

v5.18:
 - Large folio (multiple page) support added for filesystems that opt in
 - File truncation converted to folios
 - address_space_operations (aops) ->set_page_dirty converted to
   ->dirty_folio
 - Much of get_user_pages() converted to folios
 - rmap_walk() converted to folios

v5.19:
 - Most aops now converted to folios
 - More folio conversions in migration, shmem, swap, vmscan

v6.0:
 - aops->migratepage became ->migrate_folio
 - isolate_page and putback_page removed from aops
 - More folio conversions in migration, shmem, swap, vmscan

Todo
====

Well, most of the above!

 - Individual filesystems need converting from pages to folios
 - Zsmalloc, z3fold, page tables and netpools need to be split from
   struct page into their own types
 - Anywhere referring to page->... needs to be converted to a folio or
   some other type

Help with any of this is gratefully appreciated, especially if you're
the maintainer of a thing and want to convert it yourself.  I'd rather
help explain the subtleties of folios / mappables / ... to you than try
to figure out the details of your code to convert it myself (and get it
wrong).  Please contact me to avoid multiple people working on the same
thing.
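
To make the Extended Memdescs idea above concrete, here is a minimal
sketch of what the embedding might look like; struct ext4_folio, its
contents and EXT4_FOLIO() are hypothetical, and nothing like them
exists in ext4 today.

struct ext4_folio {
        struct folio folio;             /* the generic memdesc */
        unsigned long private_state;    /* hypothetical fs-private data */
};

static inline struct ext4_folio *EXT4_FOLIO(struct folio *folio)
{
        /* A no-op cast here, since folio is the first member. */
        return container_of(folio, struct ext4_folio, folio);
}

Generic code keeps passing struct folio pointers around; only
ext4-specific code converts back with EXT4_FOLIO(), instead of chasing
folio->private to a separately allocated object.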