From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <398fdb16-b8c5-4d02-bb5d-d4c9b8f9bf89@arm.com>
Date: Mon, 15 Jan 2024 09:33:35 +0000
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC PATCH v1] mm/filemap: Allow arch to request folio size for exec memory
Content-Language: en-GB
To: Barry Song <21cnbao@gmail.com>, Matthew Wilcox
Cc: Catalin Marinas, Will Deacon, Mark Rutland, Andrew Morton,
 David Hildenbrand, John Hubbard, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org
References:
 <20240111154106.3692206-1-ryan.roberts@arm.com>
 <654df189-e472-4a75-b2be-6faa8ba18a08@arm.com>
From: Ryan Roberts <ryan.roberts@arm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 13/01/2024 00:11, Barry Song wrote:
> On Sat, Jan 13, 2024 at 12:04 PM Matthew Wilcox wrote:
>>
>> On Sat, Jan 13, 2024 at 11:54:23AM +1300, Barry Song wrote:
>>>>> Perhaps an alternative would be to double ra->size and set
>>>>> ra->async_size to (ra->size / 2)? That would ensure we always have
>>>>> 64K aligned blocks but would give us an async portion so readahead
>>>>> can still happen.
>>>>
>>>> this might be worth trying, as the PMD path is doing exactly this
>>>> because the async portion can decrease the latency of subsequent
>>>> page faults.
>>>>
>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>         /* Use the readahead code, even if readahead is disabled */
>>>>         if (vm_flags & VM_HUGEPAGE) {
>>>>                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>>>                 ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
>>>>                 ra->size = HPAGE_PMD_NR;
>>>>                 /*
>>>>                  * Fetch two PMD folios, so we get the chance to actually
>>>>                  * readahead, unless we've been told not to.
>>>>                  */
>>>>                 if (!(vm_flags & VM_RAND_READ))
>>>>                         ra->size *= 2;
>>>>                 ra->async_size = HPAGE_PMD_NR;
>>>>                 page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
>>>>                 return fpin;
>>>>         }
>>>> #endif
>>>>
>>>
>>> BTW, rather than simply always reading backwards, we did something very
>>> "ugly" to simulate "read-around" for CONT-PTE exec before[1]:
>>>
>>> if a page fault happens in the first half of the cont-pte block, we
>>> read this 64KiB and its previous 64KiB; otherwise, we read it and its
>>> next 64KiB.

I actually tried something very similar to this while prototyping. I found
that it was about 10% less effective at getting text into 64K folios than the
approach I posted. I didn't investigate why, as I came to the conclusion that
text is unlikely to benefit from readahead anyway.
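For concreteness, the "double ra->size, half async" shape being discussed
would look something like the PMD block quoted above. This is just an
illustrative sketch, not the patch I posted; ARCH_EXEC_FOLIO_ORDER is a
made-up placeholder for however the arch ends up expressing its preferred
exec folio order:

        /*
         * Sketch only: align to the arch-preferred exec folio size, fetch
         * two such blocks and mark the second one async so readahead can
         * still trigger. ARCH_EXEC_FOLIO_ORDER is a placeholder, e.g.
         * 4 => 64K folios with 4K base pages.
         */
        if (vm_flags & VM_EXEC) {
                unsigned long nr = 1UL << ARCH_EXEC_FOLIO_ORDER;

                fpin = maybe_unlock_mmap_for_io(vmf, fpin);
                ractl._index &= ~(nr - 1);
                ra->size = 2 * nr;
                ra->async_size = nr;
                page_cache_ra_order(&ractl, ra, ARCH_EXEC_FOLIO_ORDER);
                return fpin;
        }

Whether the VM_RAND_READ check from the PMD path should carry over, or
whether, as Matthew argues below, text should be treated as random access
altogether, is exactly what's in question here.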
>>
>> I don't think that makes sense. The CPU executes instructions forwards,
>> not "around". I honestly think we should treat text as "random access"
>> because function A calls function B, and functions A and B might well be
>> very far apart from each other. The only time I see you actually getting
>> "readahead" hits is if a function is split across two pages (for whatever
>> size of page), but that's a false hit! The function is not, generally,
>> 64kB long, so doing readahead is no more likely to bring in the next page
>> of text that we want than reading any other random page.
>>
>
> it seems you are in favor of Ryan's modification even for filesystems
> which don't support large mappings?
>
>> Unless somebody finds the GNU Rope source code from 1998, or recreates it:
>> https://lwn.net/1998/1029/als/rope.html
>> Then we might actually have some locality.
>>
>> Did you actually benchmark what you did? Is there really some locality
>> between the code at offset 256-288kB in the file and then in the range
>> 192kB-256kB?
>
> I really didn't have benchmark data; at that point I instinctively didn't
> want to break the logic of read-around, so I just wrote the code that way.
> The info you provide makes me re-think whether the read-around code is
> necessary, thanks!

As a quick experiment, I modified my thpmaps script to collect data *only*
for executable mappings. This is run *without* my change:

| File-backed exec folios  | Speedometer | Kernel Compile |
|==========================|=============|================|
| file-thp-aligned-16kB    |         56% |            46% |
| file-thp-aligned-32kB    |          2% |             3% |
| file-thp-aligned-64kB    |          4% |             5% |
| file-thp-unaligned-16kB  |          0% |             3% |
| file-thp-unaligned-128kB |          2% |             0% |
| file-thp-partial         |          0% |             0% |

It's implied that the rest of the memory (up to 100%) is in small
(single-page) folios. I think the only reason we would see small folios is
if we would otherwise run off the end of the file? If so, then I think any
text in folios bigger than 16K is a rough proxy for how effective readahead
is for text: not very. Intuitively, I agree with Matthew that readahead
doesn't make much sense for text, and this rough data seems to agree.

>
> We were using filesystems without large-mapping support but worked around
> the problem by:
> 1. preparing 16*n normal pages
> 2. inserting the normal pages into the xarray
> 3. letting the filesystem read the 16 normal pages
> 4. after all the IO completes, transforming the 16 pages into an mTHP and
>    reinserting the mTHP into the xarray

I had a go at something like this too, but was doing it in the dynamic
loader and having it do MADV_COLLAPSE to generate PMD-sized THPs for the
text. I actually found this to be even faster for the use cases I was
measuring. But of course it uses more memory due to the 2M page size, and I
expect it slows down app load time because it is potentially reading in a
lot more text than is actually faulting. Ultimately I think the better
strategy is to make the filesystems large-folio capable.

>
> that was very painful and ultimately made no improvement, probably due to
> various sync overheads, so I ran away and didn't dig for more data.
>
> Thanks
> Barry
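P.S. For anyone who wants to experiment with the MADV_COLLAPSE approach I
mentioned above, the userspace side is roughly the sketch below. This is
illustrative only (collapse_text() and the fallback MADV_COLLAPSE value are
just for the example, not code from the loader), and IIRC file-backed exec
text also needs CONFIG_READ_ONLY_THP_FOR_FS enabled on the kernel side:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* available since Linux 6.1 */
#endif

/*
 * Ask the kernel to collapse an already-mapped text range into PMD-sized
 * THPs. The kernel only collapses the PMD-aligned 2M regions covered by
 * the range, so it is enough to page-align the start and adjust the length.
 */
static int collapse_text(void *text, size_t len)
{
        long page = sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)text & ~((uintptr_t)page - 1);
        size_t aligned_len = len + ((uintptr_t)text - start);

        if (madvise((void *)start, aligned_len, MADV_COLLAPSE)) {
                /* Best effort: keep whatever folio size we already have. */
                perror("madvise(MADV_COLLAPSE)");
                return -1;
        }
        return 0;
}

The idea is to call something like this once per executable segment after
it has been mapped.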