From: Ryan Roberts <ryan.roberts@arm.com>
To: Dave Chinner
Cc: Catalin Marinas, Will Deacon, Mark Rutland, "Matthew Wilcox (Oracle)",
 Andrew Morton, David Hildenbrand, Barry Song <21cnbao@gmail.com>,
 John Hubbard, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org
Subject: Re: [PATCH v2] mm/filemap: Allow arch to request folio size for exec memory
Date: Thu, 4 Jul 2024 17:23:51 +0100
References: <20240215154059.2863126-1-ryan.roberts@arm.com>
 <58a67051-6d61-4d16-b073-266522907e05@arm.com>
In-Reply-To: <58a67051-6d61-4d16-b073-266522907e05@arm.com>
Hi Dave,

I'm reviving this thread, hoping to make some progress on this. I'd appreciate
your thoughts...

On 16/02/2024 11:18, Ryan Roberts wrote:
> Hi Dave,
>
> Thanks for taking a look at this! Some comments below...
>
> On 16/02/2024 00:04, Dave Chinner wrote:
>> On Thu, Feb 15, 2024 at 03:40:59PM +0000, Ryan Roberts wrote:
>>> Change the readahead config so that if it is being requested for an
>>> executable mapping, do a synchronous read of an arch-specified size in a
>>> naturally aligned manner.
>>>
>>> On arm64 if memory is physically contiguous and naturally aligned to the
>>> "contpte" size, we can use contpte mappings, which improves utilization
>>> of the TLB. When paired with the "multi-size THP" changes, this works
>>> well to reduce dTLB pressure. However iTLB pressure is still high due to
>>> executable mappings having a low likelihood of being in the required
>>> folio size and mapping alignment, even when the filesystem supports
>>> readahead into large folios (e.g. XFS).
>>>
>>> The reason for the low likelihood is that the current readahead algorithm
>>> starts with an order-2 folio and increases the folio order by 2 every
>>> time the readahead mark is hit. But most executable memory is faulted in
>>> fairly randomly and so the readahead mark is rarely hit and most
>>> executable folios remain order-2.
>>
>> Yup, this is a bug in the readahead code, and really has nothing to
>> do with executable files, mmap or the architecture. We don't want
>> some magic new VM_EXEC min folio size per architecture thingy to be
>> set - we just want readahead to do the right thing.
>
> It sounds like we agree that there is a bug, but we don't agree on what the
> bug is? My view is that executable segments are accessed in a ~random
> manner and therefore readahead (as currently configured) is not very
> useful. But data may well be accessed more sequentially and therefore
> readahead is useful. Given both data and text can come from the same file,
> I don't think this can just be a mapping setting? (My understanding is that
> there is one "mapping" for the whole file.) So we need to look to VM_EXEC
> for that decision.

Additionally, what is "the right thing" in your view?

>
>> Indeed, we are already adding a mapping minimum folio order
>> directive to the address space to allow for filesystem block sizes
>> greater than PAGE_SIZE. That's the generic mechanism that this
>> functionality requires.
>> See here:
>>
>> https://lore.kernel.org/linux-xfs/20240213093713.1753368-5-kernel@pankajraghav.com/
>
> Great, I'm vaguely aware of this work, but haven't looked in detail. I'll
> go read it. But from your brief description, IIUC, this applies to the
> whole file, and is a constraint put in place by the filesystem? Applying to
> the whole file may make sense - that means more opportunity for contpte
> mappings for data pages too, although I guess this adds more scope for
> write amplification because data tends to be writable, and text isn't. But
> for my use case, it's not a hard constraint, it's just a preference which
> can improve performance. And the filesystem is the wrong place to make the
> decision; it's the arch that knows about the performance opportunities
> with different block mapping sizes.

Having finally taken a proper look at this, I still have the same opinion. I
don't think this (hard) minimum folio order work is the right fit for what I'm
trying to achieve. I need a soft minimum that can still fall back to order-0
(or the min mapping order), and ideally I want a different soft minimum to be
applied to different parts of the file (exec vs other).

I'm currently thinking about abandoning the arch hook and replacing it with a
sysfs ABI akin to the mTHP interface. The idea would be that for each size,
you could specify 'never', 'always', 'exec' or 'always+exec'. A maximum of
one size would be allowed to be marked 'exec' at a time. The set of sizes
marked 'always' would be the ones considered in page_cache_ra_order(), with
fallback to order-0 (or the min mapping order) still allowed. If a size is
marked 'exec' then we would take the VM_EXEC path added by this patch and do
a sync read into a folio of that size.
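To make the proposal concrete, here is a small userspace model of the policy
scheme described above. All names, the policy encoding, and the table layout
are hypothetical illustrations of the idea, not an existing kernel interface:

```c
#include <assert.h>
#include <stddef.h>

/* Each folio size (expressed as an order) carries one of the proposed
 * policies: 'never', 'always', 'exec' or 'always+exec'. */
enum size_policy {
	POLICY_NEVER       = 0,
	POLICY_ALWAYS      = 1 << 0,
	POLICY_EXEC        = 1 << 1,
	POLICY_ALWAYS_EXEC = POLICY_ALWAYS | POLICY_EXEC,
};

struct size_setting {
	unsigned int order;		/* folio order: size = 2^order pages */
	enum size_policy policy;
};

/* Return the single order marked 'exec', or -1 if none is configured.
 * The proposal allows at most one size to carry the 'exec' marking. */
static int exec_order(const struct size_setting *tbl, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].policy & POLICY_EXEC)
			return (int)tbl[i].order;
	return -1;
}

/* Return a bitmask of the orders marked 'always': the orders a
 * page_cache_ra_order()-style ramp would be allowed to pick from, with
 * fallback to order-0 (or the min mapping order) still permitted. */
static unsigned long always_orders(const struct size_setting *tbl, size_t n)
{
	unsigned long mask = 1UL << 0;	/* order-0 fallback always allowed */

	for (size_t i = 0; i < n; i++)
		if (tbl[i].policy & POLICY_ALWAYS)
			mask |= 1UL << tbl[i].order;
	return mask;
}
```

The point of the bitmask form is that the readahead ramp can cheaply test
"is this order enabled?" while the fault path asks one separate question,
"which single order serves VM_EXEC mappings?".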
This obviously expands the scope somewhat, but I suspect having the ability
to control the folio orders that get allocated by the pagecache will also
help reduce large folio allocation failures due to fragmentation; if only a
couple of folio sizes are in operation in a given system, you are more likely
to be able to reclaim the size that you need. This is all just a thought
experiment at the moment, and I'll obviously do some prototyping and large
folio allocation success rate measurements.

I appreciate that we don't want to add sysfs controls without good
justification. But I wonder if this could be a more palatable solution to
people, at least in principle?

Thanks,
Ryan

>
> As a side note, concerns have been expressed about the possibility of
> physical memory fragmentation becoming problematic, meaning we degrade
> back to small folios over time with my mTHP work. The intuition is that if
> the whole system is using a few folio sizes in ~equal quantities then we
> might be ok, but I don't have any data yet. Do you have any data on
> fragmentation? I guess this could be more concerning for your use case?
>
>> (Probably worth reading some of the other readahead mods in that
>> series and the discussion, because readahead needs to ensure that it
>> fills entire high-order folios in a single IO to avoid partial-folio
>> up-to-date states from partial reads.)
>>
>> IOWs, it seems to me that we could use this proposed generic mapping
>> min order functionality when mmap() is run and VM_EXEC is set to set
>> the min order to, say, 64kB. Then the readahead code would simply do
>> the right thing, as would all other reads and writes to that
>> mapping.
>
> Ahh yes, hooking into your new logic to set a min order based on VM_EXEC
> sounds perfect...
>
>> We could trigger this in the ->mmap() method of the filesystem so
>> that filesystems that can use large folios can turn it on, whilst
>> other filesystems remain blissfully unaware of the functionality.
>> Filesystems could also do smarter things here, too. e.g. enable PMD
>> alignment for large mapped files....
>
> ...but I don't think the filesystem is the right place. The size
> preference should be driven by the arch IMHO.
>
> Thanks,
> Ryan
>
>> -Dave.
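[Editor's note: the "synchronous read of an arch-specified size in a
naturally aligned manner" from the patch description can be modelled with a
few lines of arithmetic. This is a hypothetical sketch; the struct and
function names are illustrative, not the kernel's.]

```c
#include <assert.h>

/* Model of the naturally aligned exec readahead window: given the
 * faulting page index and a preferred folio order, read a 2^order-page
 * chunk whose start index is aligned to its own size, so the resulting
 * folio is eligible for contpte (or other block-mapping) optimisations. */
struct ra_window {
	unsigned long start;	/* first page index of the window */
	unsigned long nr;	/* number of pages to read */
};

static struct ra_window exec_ra_window(unsigned long fault_index,
				       unsigned int order)
{
	unsigned long nr = 1UL << order;
	struct ra_window w = {
		.start = fault_index & ~(nr - 1),  /* round down to alignment */
		.nr = nr,
	};
	return w;
}
```

For example, a fault at page index 1234 with order-4 (16 pages, i.e. 64kB with
4kB pages) yields a window starting at index 1232, which is 16-page aligned.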