Date: Mon, 5 May 2025 12:06:46 +0200
From: Jan Kara
To: Ryan Roberts
Cc: Andrew Morton, "Matthew Wilcox (Oracle)", Alexander Viro,
	Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
	Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory
References: <20250430145920.3748738-1-ryan.roberts@arm.com>
	<20250430145920.3748738-6-ryan.roberts@arm.com>
In-Reply-To: <20250430145920.3748738-6-ryan.roberts@arm.com>

On Wed 30-04-25 15:59:18, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read into a set of folios with an
> arch-specified order and in a naturally aligned manner. We no longer
> center the read on the faulting page but simply align it down to the
> previous natural boundary. Additionally, we don't bother with an
> asynchronous part.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" feature, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low likelihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low likelihood is that the current readahead
> algorithm starts with an order-0 folio and increases the folio order by
> 2 every time the readahead mark is hit. But most executable memory tends
> to be accessed randomly and so the readahead mark is rarely hit and most
> executable folios remain order-0.
>
> So let's special-case the read(ahead) logic for executable mappings.
> The trade-off is performance improvement (due to more efficient storage of
> the translations in iTLB) vs potential for making reclaim more difficult
> (due to the folios being larger so if a part of the folio is hot the
> whole thing is considered hot). But executable memory is a small portion
> of the overall system memory so I doubt this will even register from a
> reclaim perspective.
>
> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
> base page size configs. Crucially the same amount of data is still read
> (usually 128K) so I'm not expecting any read amplification issues. I
> don't anticipate any write amplification because text is always RO.
>
> Note that the text region of an ELF file could be populated into the
> page cache for other reasons than taking a fault in a mmapped area. The
> most common case is due to the loader read()ing the header which can be
> shared with the beginning of text. So some text will still remain in
> small folios, but this simple, best effort change provides good
> performance improvements as is.
>
> Confine this special-case approach to the bounds of the VMA. This
> prevents wasting memory for any padding that might exist in the file
> between sections. Previously the padding would have been contained in
> order-0 folios and would be easy to reclaim. But now it would be part of
> a larger folio so more difficult to reclaim. Solve this by simply not
> reading it into memory in the first place.
>
> Benchmarking
> ============
> TODO: NUMBERS ARE FOR V3 OF SERIES. NEED TO RERUN FOR THIS VERSION.
>
> The below shows nginx and redis benchmarks on Ampere Altra arm64 system.
>
> First, confirmation that this patch causes more text to be contained in
> 64K folios:
>
> | File-backed folios     |   system boot   |      nginx      |      redis      |
> | by size as percentage  |-----------------|-----------------|-----------------|
> | of all mapped text mem | before |  after | before |  after | before |  after |
> |========================|========|========|========|========|========|========|
> | base-page-4kB          |    26% |     9% |    27% |     6% |    21% |     5% |
> | thp-aligned-8kB        |     4% |     2% |     3% |     0% |     4% |     1% |
> | thp-aligned-16kB       |    57% |    21% |    57% |     6% |    54% |    10% |
> | thp-aligned-32kB       |     4% |     1% |     4% |     1% |     3% |     1% |
> | thp-aligned-64kB       |     7% |    65% |     8% |    85% |     9% |    72% |
> | thp-aligned-2048kB     |     0% |     0% |     0% |     0% |     7% |     8% |
> | thp-unaligned-16kB     |     1% |     1% |     1% |     1% |     1% |     1% |
> | thp-unaligned-32kB     |     0% |     0% |     0% |     0% |     0% |     0% |
> | thp-unaligned-64kB     |     0% |     0% |     0% |     1% |     0% |     1% |
> | thp-partial            |     1% |     1% |     0% |     0% |     1% |     1% |
> |------------------------|--------|--------|--------|--------|--------|--------|
> | cont-aligned-64kB      |     7% |    65% |     8% |    85% |    16% |    80% |
>
> The above shows that for both workloads (each isolated with cgroups) as
> well as the general system state after boot, the amount of text backed
> by 4K and 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of text that is contpte-mapped
> significantly increases (see last row).
>
> And this is reflected in performance improvement:
>
> | Benchmark                                     |          Improvement |
> +===============================================+======================+
> | pts/nginx (200 connections)                   |                8.96% |
> | pts/nginx (1000 connections)                  |                6.80% |
> +-----------------------------------------------+----------------------+
> | pts/redis (LPOP, 50 connections)              |                5.07% |
> | pts/redis (LPUSH, 50 connections)             |                3.68% |
>
> Signed-off-by: Ryan Roberts

Looks good to me.
Feel free to add:

Reviewed-by: Jan Kara

								Honza

> diff --git a/mm/filemap.c b/mm/filemap.c
> index e61f374068d4..37fe4a55c00d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3252,14 +3252,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	if (mmap_miss > MMAP_LOTSAMISS)
>  		return fpin;
>  
> -	/*
> -	 * mmap read-around
> -	 */
>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> -	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> -	ra->size = ra->ra_pages;
> -	ra->async_size = ra->ra_pages / 4;
> -	ra->order = 0;
> +	if (vm_flags & VM_EXEC) {
> +		/*
> +		 * Allow arch to request a preferred minimum folio order for
> +		 * executable memory. This can often be beneficial to
> +		 * performance if (e.g.) arm64 can contpte-map the folio.
> +		 * Executable memory rarely benefits from readahead, due to its
> +		 * random access nature, so set async_size to 0.
> +		 *
> +		 * Limit to the boundaries of the VMA to avoid reading in any
> +		 * pad that might exist between sections, which would be a waste
> +		 * of memory.
> +		 */
> +		struct vm_area_struct *vma = vmf->vma;
> +		unsigned long start = vma->vm_pgoff;
> +		unsigned long end = start + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);
> +		unsigned long ra_end;
> +
> +		ra->order = exec_folio_order();
> +		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
> +		ra->start = max(ra->start, start);
> +		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
> +		ra_end = min(ra_end, end);
> +		ra->size = ra_end - ra->start;
> +		ra->async_size = 0;
> +	} else {
> +		/*
> +		 * mmap read-around
> +		 */
> +		ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> +		ra->size = ra->ra_pages;
> +		ra->async_size = ra->ra_pages / 4;
> +		ra->order = 0;
> +	}
>  	ractl._index = ra->start;
>  	page_cache_ra_order(&ractl, ra);
>  	return fpin;
> -- 
> 2.43.0
> 
-- 
Jan Kara
SUSE Labs, CR
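
For anyone checking the arithmetic in the VM_EXEC branch of the quoted diff,
here is a minimal user-space sketch of the window computation. It is not
kernel code: struct exec_ra_window, compute_exec_ra_window() and the two
rounding helpers are hypothetical names made up for illustration, and all
values are in units of pages. It only mirrors the round_down / max /
round_up / min sequence from the patch.

```c
#include <assert.h>

/*
 * User-space stand-ins for the kernel's round_down()/round_up();
 * both assume 'align' is a power of two.
 */
static unsigned long round_down_pow2(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);
}

static unsigned long round_up_pow2(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Hypothetical result container; not a kernel structure. */
struct exec_ra_window {
	unsigned long start;	/* first file page index to read */
	unsigned long size;	/* number of pages to read */
};

/*
 * Mirror of the VM_EXEC branch in the quoted diff:
 *  - align the faulting page index down to the folio boundary,
 *  - clamp the start to the first page of the VMA,
 *  - extend by ra_pages and round the end up to the folio boundary,
 *  - clamp the end to the last page of the VMA.
 * vma_start and vma_end are file page offsets, [vma_start, vma_end).
 */
static struct exec_ra_window compute_exec_ra_window(unsigned long fault_pgoff,
						    unsigned long vma_start,
						    unsigned long vma_end,
						    unsigned long ra_pages,
						    unsigned int order)
{
	unsigned long nr = 1UL << order;
	struct exec_ra_window w;
	unsigned long end;

	/* The fault must lie inside the VMA for the result to make sense. */
	assert(vma_start <= fault_pgoff && fault_pgoff < vma_end);

	w.start = round_down_pow2(fault_pgoff, nr);
	if (w.start < vma_start)
		w.start = vma_start;

	end = round_up_pow2(w.start + ra_pages, nr);
	if (end > vma_end)
		end = vma_end;

	w.size = end - w.start;
	return w;
}
```

Assuming 4K base pages, order 4 (64K folios), ra_pages = 32 (128K) and a VMA
covering file pages [10, 100): a fault at page 37 gives start = 32, size = 32;
a fault at page 12 gives start = 10, size = 38 (start clamped to the VMA, so
the first folio cannot be naturally aligned); a fault at page 95 gives
start = 80, size = 20 (end clamped to the VMA).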