Date: Wed, 25 Aug 2021 11:13:45 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Matthew Wilcox
Cc: Linus Torvalds, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Andrew Morton
Subject: Re: [GIT PULL] Memory folios for v5.15

On Tue, Aug 24, 2021 at 08:44:01PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 24, 2021 at 02:32:56PM -0400, Johannes Weiner wrote:
> > The folio doc says "It is at least as large as %PAGE_SIZE";
> > folio_order() says "A folio is composed of 2^order pages";
> > page_folio(), folio_pfn() and folio_nr_pages() all encode an N:1
> > relationship. And yes, the name implies it too.
> >
> > This is in direct conflict with what I'm talking about, where base
> > page granularity could become coarser than file cache granularity.
>
> That doesn't make any sense. A page is the fundamental unit of the
> mm. Why would we want to increase the granularity of page allocation
> and not increase the granularity of the file cache?

I'm not sure why one should be tied to the other.

The folio itself is based on the premise that a cache entry doesn't
have to correspond to exactly one struct page, and I agree with that
premise. I'm just wondering why it continues to imply that a cache
entry is at least one full page, rather than saying a cache entry is
a set of bytes that can be backed however the MM sees fit - so that
if we do bump the struct page size in the future, we don't have to
redo the filesystem interface again.
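To make the page-multiple assumption concrete, this is roughly what
the helpers quoted at the top encode - a simplified sketch, not the
exact code from the folio patches:

/*
 * Simplified sketch of the helpers quoted above - illustrative only,
 * not verbatim from the folio patches.
 */
struct folio {
	struct page page;	/* a folio is the head page of a compound page */
};

static inline unsigned int folio_order(struct folio *folio)
{
	/* "A folio is composed of 2^order pages" */
	return compound_order(&folio->page);
}

static inline long folio_nr_pages(struct folio *folio)
{
	/* so its size is always a whole multiple of PAGE_SIZE */
	return 1L << folio_order(folio);
}

static inline struct folio *page_folio(struct page *page)
{
	/* and every page maps to exactly one folio: the N:1 relationship */
	return (struct folio *)compound_head(page);
}

Every one of these assumes the backing store of a cache entry begins
life as one or more whole struct pages.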
I've listed reasons why 4k pages are increasingly the wrong choice
for many allocations, for reclaim and for paging. We also know there
is a need to maintain support for 4k cache entries.

> > Are we going to bump struct page to 2M soon? I don't know. Here is
> > what I do know about 4k pages, though:
> >
> > - It's a lot of transactional overhead to manage tens of gigs of
> >   memory in 4k pages. We're reclaiming, paging and swapping more
> >   than ever before in our DCs, because flash provides in abundance
> >   the low-latency IOPS required for that, and parking cold/warm
> >   workload memory on cheap flash saves expensive RAM. But we're
> >   continuously scanning thousands of pages per second to do this.
> >   There was also the RWF_UNCACHED thread around reclaim CPU
> >   overhead at the higher end of buffered IO rates. And there is a
> >   pending proposal from Google to replace rmap because it's too
> >   CPU-intensive when paging into compressed memory pools.
>
> This seems like an argument for folios, not against them. If user
> memory (both anon and file) is being allocated in larger chunks,
> there are fewer pages to scan, less book-keeping to do, and all
> you're paying for that is I/O bandwidth.

Well, it's an argument for huge pages, and we already have those in
the form of THP. The problem with THP today is that the page
allocator fragments the physical address space at 4k granularity by
default, and groups random allocations together with no type
information and only rudimentary lifetime/reclaimability hints. I'm
having a hard time seeing 2M allocations scale as long as we do this,
as opposed to making 2M the default block size and using slab-style
physical grouping - by type and instantiation time - for smaller
cache entries, to improve the chances of physically contiguous
reclaim. But because folios are compound/head pages first and
foremost, they are inherently tied to being multiples of PAGE_SIZE.

> > - It's a lot of internal fragmentation. Compaction is becoming the
> >   default method for allocating the majority of memory in our
> >   servers. This is a latency concern during page faults, and a
> >   predictability concern when we defer it to khugepaged collapsing.
>
> Again, the more memory that we allocate in higher-order chunks, the
> better this situation becomes.

It only takes one unfortunately placed 4k page out of the 512 in a 2M
block to mess up that block indefinitely. And the page allocator has
little awareness of whether the 4k page it's handing out to somebody
pairs well, in terms of type and lifetime, with the 4k page adjacent
to it.

> > - struct page is statically eating gigs of expensive memory on
> >   every single machine, when only some of our workloads would
> >   require this level of granularity for some of their memory. And
> >   that's *after* we're fighting over every bit in that structure.
>
> That, folios do not help with. I have post-folio ideas about how to
> address that, but I can't realistically start working on them until
> folios are upstream.

How would you reduce the memory overhead of struct page without
losing necessary 4k granularity at the cache level, as long as folio
implies that cache entries can't be smaller than a struct page?

I appreciate that folio is a big patchset and I don't mean to get too
deep into speculation about the future. But we're here in part
because filesystems have been too exposed to the backing memory
implementation details. So all I'm saying is: if you're touching the
whole file cache interface now anyway, why not use the opportunity to
properly disconnect it from the reality of pages, instead of making
the compound page the new interface for filesystems?

What's wrong with the idea of a struct cache_entry which can be
embedded wherever we want: in a page, a folio or a pageset? Or in the
future allocated on demand for
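A purely hypothetical sketch of what such an embeddable handle might
look like - every name and field below is invented for illustration,
none of it is from a posted patch:

/*
 * Hypothetical: a type-agnostic cache entry handle. All names and
 * fields are made up to illustrate the idea.
 */
struct cache_entry {
	struct address_space	*mapping;	/* owner of this entry */
	pgoff_t			index;		/* position in the file */
	size_t			size;		/* entry size in bytes, not pages */
	unsigned long		flags;		/* uptodate, dirty, locked, ... */
	void			*backing;	/* a page, a folio, a pageset -
						   or, down the road, something
						   smaller than a struct page;
						   the MM's private business */
};

Filesystems would then speak only in terms of mapping, offset and
bytes; whether an entry is backed by a 4k page, a 2M block, or
on-demand metadata for a sub-page chunk would stay private to the MM.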