From: Barry Song <21cnbao@gmail.com>
Date: Tue, 17 Sep 2024 16:44:30 +0800
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
To: Ryan Roberts
Cc: Dev Jain, Matthew Wilcox, akpm@linux-foundation.org, david@redhat.com, anshuman.khandual@arm.com, hughd@google.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, baolin.wang@linux.alibaba.com, gshan@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240913091902.1160520-1-dev.jain@arm.com> <091f517d-e7dc-4c10-b1ac-39658f31f0ed@arm.com>
On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts wrote:
>
> On 17/09/2024 04:55, Dev Jain wrote:
> >
> > On 9/16/24 18:54, Matthew Wilcox wrote:
> >> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
> >>> We use pte_range_none() to determine whether contiguous PTEs are empty
> >>> for an mTHP allocation.
> >>> Instead of iterating the while loop for every
> >>> order, use some information, which is the first set PTE found, from the
> >>> previous iteration, to eliminate some cases. The key to understanding
> >>> the correctness of the patch is that the ranges we want to examine
> >>> form a strictly decreasing sequence of nested intervals.
> >>
> >> This is a lot more complicated. Do you have any numbers that indicate
> >> that it's faster? Yes, it's fewer memory references, but you've gone
> >> from a simple linear scan that's easy to prefetch to an exponential scan
> >> that might confuse the prefetchers.
> >
> > I do have some numbers, I tested with a simple program, and also used
> > the ktime API, with the latter, enclosing from "order = highest_order(orders)"
> > till "pte_unmap(pte)" (enclosing the entire while loop), a rough average
> > estimate is that without the patch, it takes 1700 ns to execute, with the
> > patch, on an average it takes 80 - 100 ns less. I cannot think of a good
> > testing program...
> >
> > For the prefetching thingy, I am still doing a linear scan, and in each
> > iteration, with the patch, the range I am scanning is going to strictly
> > lie inside the range I would have scanned without the patch. Won't the
> > compiler and the CPU still do prefetching, but on a smaller range; where
> > does the prefetcher get confused? I confess, I do not understand this
> > very well.
>
> A little history on this; my original "RFC v2" for mTHP included this
> optimization [1], but Yu Zhou suggested dropping it to keep things simple,
> which I did. Then at v8, DavidH suggested we could benefit from this sort of
> optimization, but we agreed to do it later as a separate change [2]:
>
> """
> >> Comment: Likely it would make sense to scan only once and determine the
> >> "largest none range" around that address, having the largest suitable order
> >> in mind.
> >
> > Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
> > IIRC. Perhaps this is an optimization opportunity for later?
>
> Yes, definitely.
> """
>
> Dev independently discovered this opportunity while reading the code, and I
> pointed him to the history, and suggested it would likely be worthwhile to
> send a patch.
>
> My view is that I don't see how this can harm performance; in the common
> case, when a single order is enabled, this is essentially the same as before.
> But when there are multiple orders enabled, we are now just doing a single
> linear scan of the ptes rather than multiple scans. There will likely be some
> stack accesses interleaved, but I'd be gobsmacked if the prefetchers can't
> tell the difference between the stack and other areas of memory.
>
> Perhaps some perf numbers would help; I think the simplest way to gather some
> numbers would be to create a microbenchmark to allocate a large VMA, then
> fault in single pages at a given stride (say, 1 every 128K), then enable 1M,
> 512K, 256K, 128K and 64K mTHP, then memset the entire VMA. It's a bit
> contrived, but this patch will show improvement if the scan is currently a
> significant portion of the page fault.
>
> If the proposed benchmark shows an improvement, and we don't see any
> regression when only enabling 64K, then my vote would be to accept the patch.

Agreed. The challenge now is how to benchmark this. In a system without
fragmentation, we consistently succeed in allocating the largest size
(1MB). Therefore, we need an environment where allocations of various
sizes can fail proportionally, allowing pte_range_none() to fail on
larger sizes but succeed on smaller ones.

It seems we can't micro-benchmark this with a small program.

> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
>
> Thanks,
> Ryan