Date: Wed, 1 Mar 2023 10:24:53 +0000
Subject: Re: What size anonymous folios should we allocate?
To: Matthew Wilcox, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
From: Ryan Roberts

I'd like to throw in my 2p here. Quick disclaimer first, though: I'm new
to mm, so I'm sure I'll say a bunch of dumb stuff - please go easy ;-)

In the few workloads that I'm focused on, I can see a disparity in
performance between a kernel configured for 4K vs 16K pages. My goal is
to bring the extra performance we see in the 16K variant to the 4K
variant. My results show that ~half of the uplift is down to SW
efficiency savings in the kernel: 4x fewer data aborts (the vast
majority for anon memory) and less effort spent in mm-heavy syscalls
(as expected). The other ~half is down to HW: the TLB is under less
pressure, which speeds everything up a bit. But this "bit" is
important, given most of the time is spent in user space, which only
benefits from the HW part. See [1] for more details.

Newer Arm CPUs have a uarch feature called Hardware Page Aggregation
(HPA). This allows the TLB to aggregate 2-8 physically- and
virtually-contiguous pages into a single TLB entry to reduce pressure.
(Note this is separate from the contig bit and is invisible from a SW
programming perspective.) So my hope is that I can get the equivalent
SW efficiencies with a 4K base page size and large anonymous folios,
and also benefit from a lot of the HW performance due to it all
naturally fitting HPA's requirements.

On 21/02/2023 21:49, Matthew Wilcox wrote:
> In a sense this question is premature, because we don't have any code
> in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> but let's pretend that code already exists and is just waiting for us
> to answer this policy question.
>
> I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> option and 3. a sysfs tunable.
> It is foolish to expect the distro packager or the sysadmin to be
> able to make such a decision. The correct decision will depend upon
> the instantaneous workload of the entire machine and we'll want
> different answers for different VMAs.
>
> I'm open to applications having some kind of madvise() call they can
> use to specify hints, but I would prefer to handle memory efficiently
> for applications which do not.

Firmly agree.

> For pagecache memory, we use the per-fd readahead code; if readahead
> has been successful in the past we bump up the folio size until it
> reaches its maximum. There is no equivalent for anonymous memory.
>
> I'm working my way towards a solution that looks a little like this:
>
> A. We modify khugepaged to quadruple the folio size each time it
> scans. At the moment, it always attempts to promote straight from
> order 0 to PMD size. Instead, if it finds four adjacent order-0
> folios, it will allocate an order-2 folio to replace them. Next time
> it scans, it finds four order-2 folios and replaces them with a
> single order-4 folio. And so on, up to PMD order.

From the SW efficiencies perspective, what is the point of doing a
replacement after you have allocated all the order-0 folios? Surely
that just adds more overhead? I think the aim has to be to try to
allocate the correct order up front to cut down the allocation cost;
one order-2 allocation is ~4x less expensive than four order-0
allocations, right?

I wonder if it is preferable to optimistically allocate a mid-order
folio to begin with, then later choose to split or merge from there?
Perhaps these folios could initially go on a separate list to make
them faster to split and reclaim the unused portions when under memory
pressure? (My data/workloads suggest 16K allocations are the knee, and
making them bigger than that doesn't proportionally improve
performance.)

> B. A further modification is that it will require three of the four
> folios being combined to be on the active list. If two (or more)
> of the four folios are inactive, we should leave them alone; either
> they will remain inactive and eventually be evicted, or they will be
> activated and eligible for merging in a future pass of khugepaged.
>
> C. We add a new wrinkle to the LRU handling code. When our scan of
> the active list examines a folio, we look to see how many of the
> PTEs mapping the folio have been accessed. If it is fewer than half,
> and those half are all in either the first or last half of the
> folio, we split it. The active half stays on the active list and the
> inactive half is moved to the inactive list.
>
> I feel that these three changes should allow us to iterate towards a
> solution for any given VMA that is close to optimal, and adapts to a
> changing workload with no intervention from a sysadmin, or even hint
> from a program.
>
> There are three different circumstances where we currently allocate
> anonymous memory. The first is for mmap(MAP_ANONYMOUS), the second
> is COW on a file-backed MAP_PRIVATE and the third is COW of a
> post-fork anonymous mapping.
>
> For the first option, the only hint we have is the size of the VMA.
> I'm tempted to suggest our initial guess at the right size folio to
> allocate should be scaled to that, although I don't have a clear
> idea about what the scale factor should be.

Ahh - perhaps I misunderstood what you were saying above.
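If the initial guess is going to be derived from the VMA, then whatever
scale factor is chosen, the guess presumably also needs clamping so a
large folio never reaches outside the VMA or ends up unaligned. Just to
make that concrete, here is a minimal user-space sketch of the kind of
clamping I'd imagine (all names are made up; this is an illustration,
not a claim about the real fault path):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Pick the largest order whose naturally-aligned block fits the VMA. */
static unsigned int choose_initial_order(unsigned long fault_addr,
					 unsigned long vma_start,
					 unsigned long vma_end,
					 unsigned int max_order)
{
	unsigned int order;

	for (order = max_order; order > 0; order--) {
		unsigned long size = PAGE_SIZE << order;
		unsigned long block = fault_addr & ~(size - 1);

		if (block >= vma_start && block + size <= vma_end)
			return order;
	}
	return 0;	/* fall back to a single base page */
}

int main(void)
{
	unsigned long start = 1UL << 30;		/* 64K-aligned VMA */
	unsigned long small_end = start + 3 * PAGE_SIZE;/* 12K VMA         */
	unsigned long big_end = start + (1UL << 20);	/* 1M VMA          */

	printf("12K VMA: order %u\n",
	       choose_initial_order(start, start, small_end, 4));
	printf("1M VMA:  order %u\n",
	       choose_initial_order(start, start, big_end, 4));
	return 0;
}

With a 12K VMA that caps the guess at order-1; with a 1M VMA it allows
the full order-4.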
My experience has been that order-2 seems to be the knee in terms of
performance gain, so perhaps one approach would be to start with
order-2 allocations, then adjust based on the observed page fault
pattern within the VMA? i.e. if you're getting mostly in-order faults,
increase the VMA's scaling factor, and if it's mostly random, decrease
it. (Just a suggestion based on intuition, so feel free to shoot it
down - there's a rough sketch of what I mean at the bottom of this
mail.)

> For the second case, I want to strongly suggest that the size of the
> folio allocated by the page cache should be of no concern. It is
> largely irrelevant to the application's usage pattern what size the
> page cache has chosen to cache the file. I might start out very
> conservatively here with an order-0 allocation.
>
> For the third case, in contrast, the parent had already established
> an appropriate size folio to use for this VMA before calling fork().
> Whether it is the parent or the child causing the COW, it should
> probably inherit that choice and we should default to the same size
> folio that was already found.
>
> I don't stay current with the research literature, so if someone
> wants to point me to a well-studied algorithm and let me know that I
> can stop thinking about this, that'd be great. And if anyone wants
> to start working on implementing this, that'd also be great.
>
> P.S. I didn't want to interrupt the flow of the above description to
> note that allocation of any high-order folio can and will fail, so
> there will definitely be fallback points to order-0 folios, which
> will be no different from today. Except that maybe we'll be able to
> iterate towards the correct folio size in the new khugepaged.
>
> P.P.S. I still consider myself a bit of a novice in the handling of
> anonymous memory, so don't be shy to let me know what I got wrong.

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/4c991dcb-c5bb-86bb-5a29-05df24429607@arm.com/
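P.S. To make the "start at order-2 and adjust from the fault pattern"
suggestion above a bit less hand-wavy, here is a very rough user-space
sketch of the per-VMA scaling I have in mind. All of the names are
made up and the "in order" test is deliberately crude; it illustrates
the heuristic, not how the real anon fault path would hook it in:

#include <stdio.h>

#define PAGE_SHIFT	12
#define INITIAL_ORDER	2	/* start at 16K with 4K base pages */
#define MAX_ORDER_HINT	4	/* cap the guess at 64K            */

struct vma_hint {
	unsigned int order;		/* current per-VMA guess            */
	unsigned long last_fault_pfn;	/* location of the previous fault   */
	int score;			/* >0 sequential-ish, <0 random-ish */
};

static void record_anon_fault(struct vma_hint *h, unsigned long addr)
{
	unsigned long pfn = addr >> PAGE_SHIFT;

	/* "In order" == within one folio's worth of the previous fault. */
	if (pfn >= h->last_fault_pfn &&
	    pfn - h->last_fault_pfn <= (1UL << h->order))
		h->score++;
	else
		h->score--;
	h->last_fault_pfn = pfn;

	/* Hysteresis: only move the order after a clear trend. */
	if (h->score >= 4 && h->order < MAX_ORDER_HINT) {
		h->order++;
		h->score = 0;
	} else if (h->score <= -4 && h->order > 0) {
		h->order--;
		h->score = 0;
	}
}

int main(void)
{
	struct vma_hint hint = { .order = INITIAL_ORDER };
	unsigned long base = 1UL << 30;
	int i;

	/* A run of mostly in-order faults walks the guess up from 2. */
	for (i = 0; i < 16; i++)
		record_anon_fault(&hint, base + (unsigned long)i * 4 * 4096);
	printf("after sequential faults: order %u\n", hint.order);
	return 0;
}

The hysteresis is deliberate, so that a couple of stray faults don't
make the order flap back and forth.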