From: Yang Shi
Date: Tue, 21 Feb 2023 15:05:33 -0800
Subject: Re: What size anonymous folios should we allocate?
To: Matthew Wilcox
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Tue, Feb 21, 2023 at 1:49 PM Matthew Wilcox wrote:
>
> In a sense this question is premature, because we don't have any code
> in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> but let's pretend that code already exists and is just waiting for us
> to answer this policy question.
>
> I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> option and 3. a sysfs tunable. It is foolish to expect the distro
> packager or the sysadmin to be able to make such a decision.
> The correct decision will depend upon the instantaneous workload of the
> entire machine and we'll want different answers for different VMAs.

Yeah, I agree those three options should be avoided. For some
architectures there are one or more sweet-spot sizes that benefit from
the hardware. For example, ARM64's contiguous PTE bit supports up to 16
consecutive 4K pages forming a single 64K TLB entry instead of 16
separate 4K entries. Some implementations may support intermediate
sizes (for example 8K, 16K and 32K, though this may make the hardware
design harder), but some may not. AMD's PTE coalescing supports a
different size (128K, if I remember correctly). So multiples of the
hardware-supported size (64K or 128K) seem like the common ground from
the point of view of maximizing the hardware benefit. Of course,
nothing prevents the kernel from allocating other orders. ARM even
supports contiguous PMDs, but those would be too big for the buddy
allocator.

>
> I'm open to applications having some kind of madvise() call they can
> use to specify hints, but I would prefer to handle memory efficiently
> for applications which do not.
>
> For pagecache memory, we use the per-fd readahead code; if readahead has
> been successful in the past we bump up the folio size until it reaches
> its maximum. There is no equivalent for anonymous memory.

Yes, the kernel can't tell, even though userspace may experience fewer
TLB misses. Either way, it is not an indicator the kernel could use to
make a decision.

>
> I'm working my way towards a solution that looks a little like this:
>
> A. We modify khugepaged to quadruple the folio size each time it scans.
>    At the moment, it always attempts to promote straight from order 0
>    to PMD size. Instead, if it finds four adjacent order-0 folios,
>    it will allocate an order-2 folio to replace them. Next time it
>    scans, it finds four order-2 folios and replaces them with a single
>    order-4 folio. And so on, up to PMD order.

Actually I was thinking about the reverse: start from the biggest
possible order, for example 2M -> 1M -> ... 64K -> ... 4K. The page
fault path should be able to use the same fallback order (a toy model
of that fallback loop is at the bottom of this mail). But excessive
fallback attempts may be harmful too.

>
> B. A further modification is that it will require three of the four
>    folios being combined to be on the active list. If two (or more)
>    of the four folios are inactive, we should leave them alone; either
>    they will remain inactive and eventually be evicted, or they will be
>    activated and eligible for merging in a future pass of khugepaged.

If we use the fallback policy, we should be able to just leave this to
reclaim time. When checking references we could tell which PTEs were
accessed, and split if there is significant internal fragmentation.

>
> C. We add a new wrinkle to the LRU handling code. When our scan of the
>    active list examines a folio, we look to see how many of the PTEs
>    mapping the folio have been accessed. If it is fewer than half, and
>    those half are all in either the first or last half of the folio, we
>    split it. The active half stays on the active list and the inactive
>    half is moved to the inactive list.

With contiguous PTEs, every PTE still maintains its own access bit
(although this is implementation defined; some implementations may set
the access bit on only one PTE in the contiguous region, per the Arm
ARM, IIUC). But anyway, this is definitely feasible.
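
To make the heuristic in (C) a bit more concrete, here is a stand-alone
toy model of the split decision. This is plain userspace C, not kernel
code; the helper names and the exact thresholds are made up for
illustration only:

#include <stdbool.h>
#include <stdio.h>

enum split_verdict { KEEP_WHOLE, SPLIT_KEEP_FIRST, SPLIT_KEEP_SECOND };

/*
 * accessed[] models the per-PTE access bits of one folio.  Split only
 * when fewer than half of the PTEs were accessed and all of the
 * accessed ones fall in the same half of the folio.
 */
static enum split_verdict folio_split_verdict(const bool *accessed, int nr_pages)
{
	int first = 0, second = 0, half = nr_pages / 2;

	for (int i = 0; i < nr_pages; i++) {
		if (!accessed[i])
			continue;
		if (i < half)
			first++;
		else
			second++;
	}

	if (first + second >= half)	/* too much of it is hot */
		return KEEP_WHOLE;
	if (second == 0 && first > 0)	/* only the front half is hot */
		return SPLIT_KEEP_FIRST;
	if (first == 0 && second > 0)	/* only the back half is hot */
		return SPLIT_KEEP_SECOND;
	return KEEP_WHOLE;		/* cold or scattered: leave it to the LRU */
}

int main(void)
{
	/* An order-2 folio (4 pages) where only the first page was touched. */
	bool accessed[4] = { true, false, false, false };

	printf("verdict: %d\n", folio_split_verdict(accessed, 4));
	return 0;
}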

>
> I feel that these three changes should allow us to iterate towards a
> solution for any given VMA that is close to optimal, and adapts to a
> changing workload with no intervention from a sysadmin, or even hint
> from a program.

Yes, I agree.

>
> There are three different circumstances where we currently allocate
> anonymous memory. The first is for mmap(MAP_ANONYMOUS), the second is
> COW on a file-backed MAP_PRIVATE and the third is COW of a post-fork
> anonymous mapping.
>
> For the first option, the only hint we have is the size of the VMA.
> I'm tempted to suggest our initial guess at the right size folio to
> allocate should be scaled to that, although I don't have a clear idea
> about what the scale factor should be.
>
> For the second case, I want to strongly suggest that the size of the
> folio allocated by the page cache should be of no concern. It is largely
> irrelevant to the application's usage pattern what size the page cache
> has chosen to cache the file. I might start out very conservatively
> here with an order-0 allocation.
>
> For the third case, in contrast, the parent had already established
> an appropriate size folio to use for this VMA before calling fork().
> Whether it is the parent or the child causing the COW, it should probably
> inherit that choice and we should default to the same size folio that
> was already found.

Actually this is not what THP does now. The current THP behavior is to
split the PMD and then fall back to an order-0 page fault. For smaller
orders, we may consider allocating a large folio instead.

>
>
> I don't stay current with the research literature, so if someone wants
> to point me to a well-studied algorithm and let me know that I can stop
> thinking about this, that'd be great. And if anyone wants to start
> working on implementing this, that'd also be great.
>
> P.S. I didn't want to interrupt the flow of the above description to
> note that allocation of any high-order folio can and will fail, so
> there will definitely be fallback points to order-0 folios, which will
> be no different from today. Except that maybe we'll be able to iterate
> towards the correct folio size in the new khugepaged.
>
> P.P.S. I still consider myself a bit of a novice in the handling of
> anonymous memory, so don't be shy to let me know what I got wrong.
>
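
Since the P.S. mentions falling back to order-0 anyway, here is the kind
of highest-order-first fallback loop I had in mind for the fault path,
again as a stand-alone toy rather than kernel code. try_alloc() just
stands in for whatever allocator call we would actually use, and the
real code would have to bound the number of attempts:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_ORDER	9	/* 2M folios with 4K base pages */

/* Stand-in for the real allocator; pretend only orders <= 4 (64K) succeed. */
static bool try_alloc(int order)
{
	return order <= 4;
}

/* Try 2M first, then halve the size until something can be allocated. */
static int alloc_anon_order(int max_order)
{
	for (int order = max_order; order > 0; order--) {
		if (try_alloc(order))
			return order;
	}
	return 0;	/* final fallback: a single order-0 page, as today */
}

int main(void)
{
	int order = alloc_anon_order(PMD_ORDER);

	printf("got order %d (%lu KB)\n", order,
	       (1UL << (order + PAGE_SHIFT)) >> 10);
	return 0;
}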