From: Barry Song <21cnbao@gmail.com>
Date: Tue, 17 Sep 2024 17:09:08 +0800
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
To: Ryan Roberts
Cc: Dev Jain, Matthew Wilcox, akpm@linux-foundation.org, david@redhat.com,
 anshuman.khandual@arm.com, hughd@google.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, baolin.wang@linux.alibaba.com,
 gshan@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240913091902.1160520-1-dev.jain@arm.com>
 <091f517d-e7dc-4c10-b1ac-39658f31f0ed@arm.com>

On Tue, Sep 17, 2024 at 4:54 PM Ryan Roberts wrote:
>
> On 17/09/2024 09:44, Barry Song wrote:
> > On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts wrote:
> >>
> >> On 17/09/2024 04:55, Dev Jain wrote:
> >>>
> >>> On 9/16/24 18:54, Matthew Wilcox wrote:
> >>>> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
> >>>>> We use pte_range_none() to determine whether contiguous PTEs are
> >>>>> empty for an mTHP allocation. Instead of iterating the while loop
> >>>>> for every order, reuse information from the previous iteration
> >>>>> (the first set PTE found) to eliminate some cases without
> >>>>> rescanning. The key to understanding the correctness of the patch
> >>>>> is that the ranges we want to examine form a strictly decreasing
> >>>>> sequence of nested intervals.
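For illustration, a minimal sketch of the idea being described, not the
actual patch: it assumes a hypothetical helper pte_range_none_first()
that returns the address of the first non-none PTE in a range (or 0 if
the range is entirely empty), and the surrounding variables follow the
real loop in alloc_anon_folio() in mm/memory.c.

/* Sketch only; pte_range_none_first() is a hypothetical helper. */
unsigned long first_set = 0;	/* first non-none PTE found so far */

order = highest_order(orders);
while (orders) {
	addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);

	/*
	 * Every candidate range contains vmf->address, so each
	 * smaller range is nested inside the previous, larger one.
	 */
	if (first_set && first_set >= addr + (PAGE_SIZE << order)) {
		/*
		 * The earlier, larger scan walked past the end of this
		 * range before hitting a set PTE, so this range is
		 * already known to be empty: use this order.
		 */
		break;
	}
	if (!first_set || first_set < addr) {
		/* Range not covered by a previous scan: scan it. */
		first_set = pte_range_none_first(pte, addr, 1 << order);
		if (!first_set)
			break;	/* range is empty: use this order */
	}
	/* first_set lies inside this range, so it cannot be empty. */
	order = next_order(&orders, order);
}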
> >>>> This is a lot more complicated. Do you have any numbers that
> >>>> indicate that it's faster? Yes, it's fewer memory references, but
> >>>> you've gone from a simple linear scan that's easy to prefetch to
> >>>> an exponential scan that might confuse the prefetchers.
> >>>
> >>> I do have some numbers. I tested with a simple program, and also
> >>> used the ktime API; with the latter, enclosing everything from
> >>> "order = highest_order(orders)" to "pte_unmap(pte)" (i.e. the
> >>> entire while loop), a rough average estimate is that without the
> >>> patch it takes 1700 ns to execute, and with the patch it takes on
> >>> average 80-100 ns less. I cannot think of a good testing
> >>> program...
> >>>
> >>> As for the prefetching, I am still doing a linear scan, and in
> >>> each iteration, with the patch, the range I am scanning will
> >>> strictly lie inside the range I would have scanned without the
> >>> patch. Won't the compiler and the CPU still do prefetching, just
> >>> on a smaller range? Where does the prefetcher get confused? I
> >>> confess, I do not understand this very well.
> >>>
> >>
> >> A little history on this: my original "RFC v2" for mTHP included
> >> this optimization [1], but Yu Zhou suggested dropping it to keep
> >> things simple, which I did. Then at v8, DavidH suggested we could
> >> benefit from this sort of optimization, but we agreed to do it
> >> later as a separate change [2]:
> >>
> >> """
> >>>> Comment: Likely it would make sense to scan only once and
> >>>> determine the "largest none range" around that address, having
> >>>> the largest suitable order in mind.
> >>>
> >>> Yes, that's how I used to do it, but Yu Zhou requested simplifying
> >>> to this, IIRC. Perhaps this is an optimization opportunity for
> >>> later?
> >>
> >> Yes, definitely.
> >> """
> >>
> >> Dev independently discovered this opportunity while reading the
> >> code, so I pointed him to the history and suggested it would likely
> >> be worthwhile to send a patch.
> >>
> >> My view is that I don't see how this can harm performance; in the
> >> common case, when a single order is enabled, this is essentially
> >> the same as before. But when multiple orders are enabled, we are
> >> now doing a single linear scan of the ptes rather than multiple
> >> scans. There will likely be some stack accesses interleaved, but
> >> I'd be gobsmacked if the prefetchers can't tell the difference
> >> between the stack and other areas of memory.
> >>
> >> Perhaps some perf numbers would help; I think the simplest way to
> >> gather some would be to create a microbenchmark that allocates a
> >> large VMA, faults in single pages at a given stride (say, 1 every
> >> 128K), then enables 1M, 512K, 256K, 128K and 64K mTHP, then memsets
> >> the entire VMA. It's a bit contrived, but this patch will show
> >> improvement if the scan is currently a significant portion of the
> >> page fault cost.
> >>
> >> If the proposed benchmark shows an improvement, and we don't see
> >> any regression when only enabling 64K, then my vote would be to
> >> accept the patch.
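For concreteness, a rough sketch of such a microbenchmark (an
illustration, not code from the thread; the VMA size and stride are
arbitrary, and the mTHP sizes would be enabled between the pre-fault
pass and the memset, e.g. by writing "always" to
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define VMA_SIZE (1024UL * 1024 * 1024)	/* 1G anonymous VMA */
#define STRIDE   (128UL * 1024)		/* one 4K fault per 128K */

int main(void)
{
	struct timespec t0, t1;
	char *buf = mmap(NULL, VMA_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/* Pre-fault one page at every stride boundary (4K only). */
	for (unsigned long off = 0; off < VMA_SIZE; off += STRIDE)
		buf[off] = 1;

	/* ... enable the mTHP sizes under test here, via sysfs ... */

	/* Fault in the gaps; this is where the PTE scan runs. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(buf, 1, VMA_SIZE);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("memset: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}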
> > Agreed. The challenge now is how to benchmark this. In a system
> > without fragmentation, we consistently succeed in allocating the
> > largest size (1MB). Therefore, we need an environment where
> > allocations of various sizes can fail proportionally, allowing
> > pte_range_none() to fail on larger sizes but succeed on smaller
> > ones.
>
> I don't think this is about allocation failure? It's about finding a
> folio order that fits into the VMA without overlapping any already
> non-none PTEs.
>
> > It seems we can't micro-benchmark this with a small program.
>
> My proposal was to deliberately fault in a single (4K) page every
> 128K. That will cause the scanning logic to reduce the order to the
> next lowest enabled order and try again. So with the current code,
> for all orders {1M, 512K, 256K, 128K} you would scan the first 128K
> of ptes (32 entries), then for 64K you would scan 16 entries:
> 4 * 32 + 16 = 144 entries. With the new change, you would just scan
> 32 entries.

I'm a bit confused. If we have a VMA from 1GB to 1GB+4MB, then even if
you fault in a single 4KB page every 128KB with 1MB enabled, you'd
still succeed in allocating 1MB. For the range 1GB+128KB to 1GB+1MB,
wouldn't there be no page faults? So you'd still end up scanning 256
entries (1MB/4KB)?

For the entire 4MB VMA, we would only have 4 page faults, right? For
each fault, we scan 256 entries, and there's no way to scan the next
possible larger order, like 512KB, if 1MB has succeeded.

My point is that we need to make the 1MB allocation fail in order to
disrupt the continuity of pte_none(). Otherwise, pte_range_none() will
return true for the largest order.

> Although now that I've actually written that down, it doesn't feel
> like a very big win. Perhaps Dev can come up with an even more
> contrived single-page pre-allocation pattern that will maximise the
> number of PTEs we hit with the current code, and minimise it with the
> new code :)
>
> >> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
> >> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
> >>
> >> Thanks,
> >> Ryan

Thanks
Barry