Subject: Re: What size anonymous folios should we allocate?
From: Vlastimil Babka <vbabka@suse.cz>
To: Ryan Roberts, Matthew Wilcox, Yang Shi
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Tue, 28 Mar 2023 14:16:32 +0200
Message-ID: <4740455e-0b41-3f52-eca2-bf8d4a7c6181@suse.cz>
In-Reply-To: <7981dd12-4e56-a449-980b-52f27279df81@arm.com>
References: <022e1c15-7988-9975-acbc-e661e989ca4a@suse.cz>
 <7981dd12-4e56-a449-980b-52f27279df81@arm.com>

On 3/28/23 12:12, Ryan Roberts wrote:
> On 27/03/2023 16:48, Vlastimil Babka wrote:
>> On 3/27/23 17:30, Ryan Roberts wrote:
>>> On 27/03/2023 13:41, Vlastimil Babka wrote:
>>>> On 2/22/23 04:52, Matthew Wilcox wrote:
>>>>> On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
>>>>>
>>>>>>> C. We add a new wrinkle to the LRU handling code. When our scan of the
>>>>>>> active list examines a folio, we look to see how many of the PTEs
>>>>>>> mapping the folio have been accessed. If it is fewer than half, and
>>>>>>> those half are all in either the first or last half of the folio, we
>>>>>>> split it. The active half stays on the active list and the inactive
>>>>>>> half is moved to the inactive list.
>>>>>>
>>>>>> With contiguous PTE, every PTE still maintains its own access bit (but
>>>>>> it is implementation defined, some implementations may just set the
>>>>>> access bit once for one PTE in the contiguous region, per the Arm ARM
>>>>>> IIUC). But anyway this is definitely feasible.
>>>>>
>>>>> If a CPU doesn't have separate access bits for PTEs, then we should just
>>>>> not use the contiguous bits. Knowing which parts of the folio are
>>>>> unused is more important than using the larger TLB entries.
>>>>
>>>> Hm, but AFAIK the AMD aggregation is transparent, there are no bits. And
>>>> IIUC the "Hardware Page Aggregation (HPA)" Ryan was talking about
>>>> elsewhere in the thread sounds similar. So IIUC there will be a larger
>>>> TLB entry transparently, and then I don't expect the CPU to update
>>>> individual bits as that would defeat the purpose. So I'd expect it will
>>>> either set them all to active when forming the larger TLB entry, or set
>>>> them on a single subpage and leave the rest in whatever state they were.
>>>> Hm, I wonder if the exact behavior is defined anywhere.
>>>
>>> For arm64, at least, there are 2 separate mechanisms:
>>>
>>> "The Contiguous Bit" (D8.6.1 in the Arm ARM) is a bit in the translation
>>> table descriptor that SW can set to indicate that a set of adjacent
>>> entries are contiguous and have the same attributes and permissions etc.
>>> It is architectural. The order of the contiguous range is fixed and
>>> depends on the base page size that is in use. When in use, HW access and
>>> dirty reporting is only done at the granularity of the contiguous block.
>>>
>>> "HPA" is a micro-architectural feature on some Arm CPUs, which aims to do
>>> a similar thing, but is transparent to SW. In this case, the dirty and
>>> access bits remain per-page. But when they differ, this affects the
>>> performance of the feature.

Oh, looks like I got this part right then. Wonder if AMD works the same way.

>>> Typically HPA can coalesce up to 4 adjacent entries, whereas for a 4KB
>>> base page at least, the contiguous bit applies to 16 adjacent entries.
>>
>> Hm, if it's 4 entries on arm64 and presumably 8 on AMD, maybe we should
>> only care about how actively accessed the individual "subpages" are above
>> that size, to avoid dealing with this uncertainty about whether HW tracks
>> them. At such smallish sizes we shouldn't induce massive overhead?
>
> I'm not sure I've fully understood this point. For arm64's HPA, there is
> no "uncertainty [about] whether HW tracks them"; HW will always track
> access/dirty individually for each base page. The problem is the inverse:
> if SW (or HW) sets those bits differently in each page, then TLB
> coalescing performance may decrease. Or are you actually suggesting that
> SW should always set the bits the same for a 4 or 8 page run, and forgo
> the extra granularity?

I guess we'll need some experiments to see what's the optimal way. IIRC what
we do now is just clear the accessed bits and let HW set them again. If we
have a 4/8-page folio on the LRU, then we should probably clear them across
the whole folio. If all subpages are indeed hot, the HW will eventually set
the accessed bits back and re-create the coalesced TLB entry. And by the time
we are about to reclaim or split the folio, we can see whether all subpages
have the accessed bit set or not (i.e. whether it was hot enough), so maybe
that should all work out automatically.
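To make that concrete, something roughly like the sketch below. The helper is
made up for illustration; it assumes the folio is fully mapped by adjacent
PTEs in a single page table and ignores locking and pte_offset_map() details:

/*
 * Hypothetical helper, not an existing kernel function: drop the accessed
 * (young) bit on every PTE mapping a small folio, so HW can set the bits
 * again - and hopefully re-coalesce the TLB entry - if the subpages really
 * are hot.
 */
static void folio_clear_young_all(struct vm_area_struct *vma,
                                  struct folio *folio,
                                  unsigned long addr, pte_t *ptep)
{
        long i;

        for (i = 0; i < folio_nr_pages(folio); i++)
                ptep_test_and_clear_young(vma, addr + i * PAGE_SIZE,
                                          ptep + i);
}
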
>>
>>> I'm hearing that there are workloads where being able to use the
>>> contiguous bit really does make a difference, so I would like to explore
>>> solutions that can work when we only have access/dirty at the folio level.
>>
>> And on the higher orders where we have explicit control via the bits, we
>> could split the explicitly contiguous mappings once in a while to determine
>> whether the sub-folios are still accessed? Although maybe with the 16x4kB
>> pages limit it may still not be worth the trouble?
>
> I have a bigger-picture question; why is it useful to split these large
> folios? I think there are 2 potential reasons (but would like to be
> educated):
>
> 1. If a set of sub-pages that were pre-faulted as part of a large folio
> have _never_ been accessed and we are under memory pressure, I guess we
> would like to split the folio and free those pages?
>
> 2. If a set of subpages within a folio are cold (but were written in the
> past) and a separate set of subpages within the same folio are hot and we
> are under memory pressure, we would like to swap out the cold pages?

These are not fundamentally different; only 1. depends on whether we
optimistically start large (I think the proposal here was not to start (too)
large).

> If the first reason is important, I guess we would want to initially map
> non-contig, then only remap as contig once every subpage has been touched
> at least once.

Yeah. But the second reason will always apply anyway; access patterns of a
workload may change over time.

> For the second reason, my intuition says that a conceptual single access
> and dirty bit per folio should be sufficient, and folios could be split
> from time-to-time to see if one half is cold?

Maybe it's not complete folios that need to be split, but just their
mappings?

> Thanks,
> Ryan
>
>>
>>> Thanks,
>>> Ryan
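PS: for completeness, a tiny self-contained model (plain userspace C, names
made up, and the handling of the "nothing accessed" case is my own guess) of
the split decision from Matthew's option C quoted at the top - fewer than
half of the subpages accessed, and all of the accessed ones in one half of
the folio:

#include <stdbool.h>
#include <stddef.h>

/*
 * "accessed" holds one flag per subpage of the folio (nr entries).
 * Split if fewer than half of the subpages were accessed and all of the
 * accessed ones fall entirely within the first or the second half.
 * *active_first_half then says which half would stay on the active list.
 */
static bool should_split(const bool *accessed, size_t nr,
                         bool *active_first_half)
{
        size_t first = 0, second = 0;

        for (size_t i = 0; i < nr; i++) {
                if (!accessed[i])
                        continue;
                if (i < nr / 2)
                        first++;
                else
                        second++;
        }

        if (first + second == 0)
                return false;   /* nothing accessed; presumably age the whole folio instead */
        if (first + second >= nr / 2)
                return false;   /* hot enough, keep the folio whole */
        if (first && second)
                return false;   /* accesses spread across both halves */

        *active_first_half = first > 0;
        return true;
}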