From: Barry Song <21cnbao@gmail.com>
Date: Tue, 17 Sep 2024 16:44:30 +0800
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
To: Ryan Roberts
Cc: Dev Jain, Matthew Wilcox, akpm@linux-foundation.org, david@redhat.com, anshuman.khandual@arm.com, hughd@google.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, baolin.wang@linux.alibaba.com, gshan@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240913091902.1160520-1-dev.jain@arm.com> <091f517d-e7dc-4c10-b1ac-39658f31f0ed@arm.com>
On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts wrote:
>
> On 17/09/2024 04:55, Dev Jain wrote:
> >
> > On 9/16/24 18:54, Matthew Wilcox wrote:
> >> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
> >>> We use pte_range_none() to determine whether contiguous PTEs are empty
> >>> for an mTHP allocation.
> >>> Instead of iterating the while loop for every
> >>> order, use some information, which is the first set PTE found, from the
> >>> previous iteration, to eliminate some cases. The key to understanding
> >>> the correctness of the patch is that the ranges we want to examine
> >>> form a strictly decreasing sequence of nested intervals.
> >>
> >> This is a lot more complicated. Do you have any numbers that indicate
> >> that it's faster? Yes, it's fewer memory references, but you've gone
> >> from a simple linear scan that's easy to prefetch to an exponential scan
> >> that might confuse the prefetchers.
> >
> > I do have some numbers, I tested with a simple program, and also used
> > the ktime API, with the latter, enclosing from "order = highest_order(orders)"
> > till "pte_unmap(pte)" (enclosing the entire while loop), a rough average
> > estimate is that without the patch, it takes 1700 ns to execute, with the
> > patch, on an average it takes 80 - 100 ns less. I cannot think of a good
> > testing program...
> >
> > For the prefetching thingy, I am still doing a linear scan, and in each
> > iteration, with the patch, the range I am scanning is going to strictly
> > lie inside the range I would have scanned without the patch. Won't the
> > compiler and the CPU still do prefetching, but on a smaller range; where
> > does the prefetcher get confused? I confess, I do not understand this
> > very well.
>
> A little history on this; my original "RFC v2" for mTHP included this
> optimization [1], but Yu Zhou suggested dropping it to keep things simple,
> which I did. Then at v8, DavidH suggested we could benefit from this sort of
> optimization, but we agreed to do it later as a separate change [2]:
>
> """
> >> Comment: Likely it would make sense to scan only once and determine the
> >> "largest none range" around that address, having the largest suitable order
> >> in mind.
> >
> > Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
> > IIRC. Perhaps this is an optimization opportunity for later?
>
> Yes, definitely.
> """
>
> Dev independently discovered this opportunity while reading the code, and I
> pointed him to the history, and suggested it would likely be worthwhile to
> send a patch.
>
> My view is that I don't see how this can harm performance; in the common
> case, when a single order is enabled, this is essentially the same as before.
> But when there are multiple orders enabled, we are now just doing a single
> linear scan of the ptes rather than multiple scans. There will likely be some
> stack accesses interleaved, but I'd be gobsmacked if the prefetchers can't
> tell the difference between the stack and other areas of memory.
>
> Perhaps some perf numbers would help; I think the simplest way to gather some
> numbers would be to create a microbenchmark to allocate a large VMA, then
> fault in single pages at a given stride (say, 1 every 128K), then enable 1M,
> 512K, 256K, 128K and 64K mTHP, then memset the entire VMA. It's a bit
> contrived, but this patch will show improvement if the scan is currently a
> significant portion of the page fault.
>
> If the proposed benchmark shows an improvement, and we don't see any
> regression when only enabling 64K, then my vote would be to accept the patch.

Agreed. The challenge now is how to benchmark this. In a system without
fragmentation, we consistently succeed in allocating the largest size
(1MB). Therefore, we need an environment where allocations of various
sizes can fail proportionally, allowing pte_range_none() to fail on
larger sizes but succeed on smaller ones.

It seems we can't micro-benchmark this with a small program.

> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
>
> Thanks,
> Ryan