Date: Wed, 1 Mar 2023 10:24:53 +0000
Subject: Re: What size anonymous folios should we allocate?
To: Matthew Wilcox, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
From: Ryan Roberts

I'd like to throw in my 2p here. Quick disclaimer first, though: I'm new
to mm, so I'm sure I'll say a bunch of dumb stuff - please go easy ;-)

In the few workloads that I'm focused on, I can see a disparity in
performance between a kernel configured for 4K vs 16K pages. My goal is
to bring the extra performance we see in the 16K variant to the 4K
variant. My results show that ~half of the uplift is down to SW
efficiency savings in the kernel: 4x fewer data aborts (the vast
majority for anon memory) and less effort spent in mm-heavy syscalls
(as expected). The other ~half is down to HW: the TLB is under less
pressure, which speeds everything up a bit. But this "bit" is
important, given most of the time is spent in user space, which only
benefits from the HW part. See [1] for more details.

Newer Arm CPUs have a uarch feature called Hardware Page Aggregation
(HPA). This allows the TLB to aggregate 2-8 physically- and
virtually-contiguous pages into a single TLB entry to reduce pressure.
(Note this is separate from the contig bit and is invisible from a SW
programming perspective.) So my hope is that I can get the equivalent
SW efficiencies with a 4K base page size and large anonymous folios,
and also benefit from a lot of the HW performance due to it all
naturally fitting HPA's requirements.

On 21/02/2023 21:49, Matthew Wilcox wrote:
> In a sense this question is premature, because we don't have any code
> in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> but let's pretend that code already exists and is just waiting for us
> to answer this policy question.
>
> I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> option and 3. a sysfs tunable.
> It is foolish to expect the distro packager or the sysadmin to be
> able to make such a decision. The correct decision will depend upon
> the instantaneous workload of the entire machine and we'll want
> different answers for different VMAs.
>
> I'm open to applications having some kind of madvise() call they can
> use to specify hints, but I would prefer to handle memory efficiently
> for applications which do not.

Firmly agree.

> For pagecache memory, we use the per-fd readahead code; if readahead
> has been successful in the past we bump up the folio size until it
> reaches its maximum. There is no equivalent for anonymous memory.
>
> I'm working my way towards a solution that looks a little like this:
>
> A. We modify khugepaged to quadruple the folio size each time it
> scans. At the moment, it always attempts to promote straight from
> order 0 to PMD size. Instead, if it finds four adjacent order-0
> folios, it will allocate an order-2 folio to replace them. Next time
> it scans, it finds four order-2 folios and replaces them with a
> single order-4 folio. And so on, up to PMD order.

From the SW efficiencies perspective, what is the point of doing a
replacement after you have allocated all the order-0 folios? Surely
that just adds more overhead? I think the aim has to be to try to
allocate the correct order up front to cut down the allocation cost;
one order-2 allocation is ~4x less expensive than four order-0
allocations, right?

I wonder if it is preferable to optimistically allocate a mid-order
folio to begin with, then later choose to split or merge from there?
Perhaps these folios could initially go on a separate list to make
them faster to split and reclaim the unused portions when under memory
pressure? (My data/workloads suggest 16K allocations are the knee, and
making them bigger than that doesn't proportionally improve
performance.)

> B. A further modification is that it will require three of the four
> folios being combined to be on the active list. If two (or more)
> of the four folios are inactive, we should leave them alone; either
> they will remain inactive and eventually be evicted, or they will be
> activated and eligible for merging in a future pass of khugepaged.
>
> C. We add a new wrinkle to the LRU handling code. When our scan of
> the active list examines a folio, we look to see how many of the
> PTEs mapping the folio have been accessed. If it is fewer than half,
> and those half are all in either the first or last half of the
> folio, we split it. The active half stays on the active list and the
> inactive half is moved to the inactive list.
>
> I feel that these three changes should allow us to iterate towards a
> solution for any given VMA that is close to optimal, and adapts to a
> changing workload with no intervention from a sysadmin, or even hint
> from a program.
>
> There are three different circumstances where we currently allocate
> anonymous memory. The first is for mmap(MAP_ANONYMOUS), the second
> is COW on a file-backed MAP_PRIVATE and the third is COW of a
> post-fork anonymous mapping.
>
> For the first option, the only hint we have is the size of the VMA.
> I'm tempted to suggest our initial guess at the right size folio to
> allocate should be scaled to that, although I don't have a clear
> idea about what the scale factor should be.

Ahh - perhaps I misunderstood what you were saying above.
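If the initial guess is going to be derived from the VMA, then whatever
scale factor is chosen, the guess presumably also needs clamping so a
large folio never reaches outside the VMA or ends up unaligned. Just to
make that concrete, here is a minimal user-space sketch of the kind of
clamping I'd imagine (all names are made up; this is an illustration,
not a claim about the real fault path):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Pick the largest order whose naturally-aligned block fits the VMA. */
static unsigned int choose_initial_order(unsigned long fault_addr,
					 unsigned long vma_start,
					 unsigned long vma_end,
					 unsigned int max_order)
{
	unsigned int order;

	for (order = max_order; order > 0; order--) {
		unsigned long size = PAGE_SIZE << order;
		unsigned long block = fault_addr & ~(size - 1);

		if (block >= vma_start && block + size <= vma_end)
			return order;
	}
	return 0;	/* fall back to a single base page */
}

int main(void)
{
	unsigned long start = 1UL << 30;		/* 64K-aligned VMA */
	unsigned long small_end = start + 3 * PAGE_SIZE;/* 12K VMA         */
	unsigned long big_end = start + (1UL << 20);	/* 1M VMA          */

	printf("12K VMA: order %u\n",
	       choose_initial_order(start, start, small_end, 4));
	printf("1M VMA:  order %u\n",
	       choose_initial_order(start, start, big_end, 4));
	return 0;
}

With a 12K VMA that caps the guess at order-1; with a 1M VMA it allows
the full order-4.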
My experience has been that order-2 seems to be the knee in terms of
performance gain, so perhaps one approach would be to start with
order-2 allocations, then adjust based on the observed page fault
pattern within the VMA? i.e. if you're getting mostly in-order faults,
increase the VMA's scaling factor, and if it's mostly random, decrease
it. (Just a suggestion based on intuition, so feel free to shoot it
down - there's a rough sketch of what I mean at the bottom of this
mail.)

> For the second case, I want to strongly suggest that the size of the
> folio allocated by the page cache should be of no concern. It is
> largely irrelevant to the application's usage pattern what size the
> page cache has chosen to cache the file. I might start out very
> conservatively here with an order-0 allocation.
>
> For the third case, in contrast, the parent had already established
> an appropriate size folio to use for this VMA before calling fork().
> Whether it is the parent or the child causing the COW, it should
> probably inherit that choice and we should default to the same size
> folio that was already found.
>
> I don't stay current with the research literature, so if someone
> wants to point me to a well-studied algorithm and let me know that I
> can stop thinking about this, that'd be great. And if anyone wants
> to start working on implementing this, that'd also be great.
>
> P.S. I didn't want to interrupt the flow of the above description to
> note that allocation of any high-order folio can and will fail, so
> there will definitely be fallback points to order-0 folios, which
> will be no different from today. Except that maybe we'll be able to
> iterate towards the correct folio size in the new khugepaged.
>
> P.P.S. I still consider myself a bit of a novice in the handling of
> anonymous memory, so don't be shy to let me know what I got wrong.

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/4c991dcb-c5bb-86bb-5a29-05df24429607@arm.com/
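P.S. To make the "start at order-2 and adjust from the fault pattern"
suggestion above a bit less hand-wavy, here is a very rough user-space
sketch of the per-VMA scaling I have in mind. All of the names are
made up and the "in order" test is deliberately crude; it illustrates
the heuristic, not how the real anon fault path would hook it in:

#include <stdio.h>

#define PAGE_SHIFT	12
#define INITIAL_ORDER	2	/* start at 16K with 4K base pages */
#define MAX_ORDER_HINT	4	/* cap the guess at 64K            */

struct vma_hint {
	unsigned int order;		/* current per-VMA guess            */
	unsigned long last_fault_pfn;	/* location of the previous fault   */
	int score;			/* >0 sequential-ish, <0 random-ish */
};

static void record_anon_fault(struct vma_hint *h, unsigned long addr)
{
	unsigned long pfn = addr >> PAGE_SHIFT;

	/* "In order" == within one folio's worth of the previous fault. */
	if (pfn >= h->last_fault_pfn &&
	    pfn - h->last_fault_pfn <= (1UL << h->order))
		h->score++;
	else
		h->score--;
	h->last_fault_pfn = pfn;

	/* Hysteresis: only move the order after a clear trend. */
	if (h->score >= 4 && h->order < MAX_ORDER_HINT) {
		h->order++;
		h->score = 0;
	} else if (h->score <= -4 && h->order > 0) {
		h->order--;
		h->score = 0;
	}
}

int main(void)
{
	struct vma_hint hint = { .order = INITIAL_ORDER };
	unsigned long base = 1UL << 30;
	int i;

	/* A run of mostly in-order faults walks the guess up from 2. */
	for (i = 0; i < 16; i++)
		record_anon_fault(&hint, base + (unsigned long)i * 4 * 4096);
	printf("after sequential faults: order %u\n", hint.order);
	return 0;
}

The hysteresis is deliberate, so that a couple of stray faults don't
make the order flap back and forth.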