From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 27 Jan 2025 16:02:41 +0000
From: Matthew Wilcox <willy@infradead.org>
To: David Hildenbrand
Cc: linux-mm@kvack.org, linux-block@vger.kernel.org, Muchun Song,
	Jane Chu, Andres Freund
Subject: Re: Direct I/O performance problems with 1GB pages
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

[Adding Andres to the cc. Sorry for leaving you off in the initial mail]

On Mon, Jan 27, 2025 at 03:09:23PM +0100, David Hildenbrand wrote:
> On 26.01.25 01:46, Matthew Wilcox wrote:
> > Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
> > Andres has gathered some performance data showing significantly worse
> > performance with 1GB pages compared to 2MB pages. I sent a patch
> > recently which improves matters [1], but problems remain.
> >
> > The primary problem we've identified is contention of folio->_refcount
> > with a strong secondary contention on folio->_pincount. This is coming
> > from the call chain:
> >
> > iov_iter_extract_pages ->
> >   gup_fast_fallback ->
> >     try_grab_folio_fast
> >
> > Obviously we can fix this by sharding the counts. We could do that by
> > address, since there's no observed performance problem with 2MB pages.
> > But I think we'd do better to shard by CPU. We have percpu-refcount.h
> > already, and I think it'll work.
> >
> > The key to percpu refcounts is knowing at what point you need to start
> > caring about whether the refcount has hit zero (we don't care if the
> > refcount oscillates between 1 and 2, but we very much care about when
> > we hit 0).
> >
> > I think the point at which we call percpu_ref_kill() is when we remove
> > a folio from the page cache. Before that point, the refcount is
> > guaranteed to always be positive. After that point, once the refcount
> > hits zero, we must free the folio.
> >
> > It's pretty rare to remove a hugetlb page from the page cache while
> > it's still mapped. So we don't need to worry about scalability at
> > that point.
> >
> > Any volunteers to prototype this? Andres is a delight to work with,
> > but I just don't have time to take on this project right now.
>
> Hmmm ... do we really want to make refcounting more complicated, and
> more importantly, hugetlb-refcounting more special ?! :)

No, I really don't. But I've always been mildly concerned about extra
contention on folio locks, folio refcounts, etc. I don't know if 2MB
page performance might be improved by a scheme like this, and we might
even want to cut over for sizes larger than, say, 64kB. That would be
something interesting to investigate.

> If the workload doing a lot of single-page try_grab_folio_fast(), could
> it do so on a larger area (multiple pages at once -> single refcount
> update)?

Not really. This is memory that's being used as the buffer cache, so
every thread in your database is hammering on it and pulling in exactly
the data that it needs for the SQL query that it's processing.

> Maybe there is a link to the report you could share, thanks.

Andres shared some gists, but I don't want to send those to a mailing
list without permission. Here's the kernel part of the perf report:

  14.04%  postgres  [kernel.kallsyms]  [k] try_grab_folio_fast
          |
           --14.04%--try_grab_folio_fast
                     gup_fast_fallback
                     |
                      --13.85%--iov_iter_extract_pages
                                bio_iov_iter_get_pages
                                iomap_dio_bio_iter
                                __iomap_dio_rw
                                iomap_dio_rw
                                xfs_file_dio_read
                                xfs_file_read_iter
                                __io_read
                                io_read
                                io_issue_sqe
                                io_submit_sqes
                                __do_sys_io_uring_enter
                                do_syscall_64

Now, since postgres is using io_uring, perhaps there could be a path
which registers the memory with the iouring (doing the refcount/pincount
dance once), and then use that pinned memory for each I/O. Maybe that
already exists; I'm not keeping up with io_uring development and I can't
seem to find any documentation on what things like io_provide_buffers()
actually do.
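
For illustration (this is not from the thread), the "register the memory
once, reuse it for each I/O" idea is roughly what liburing's fixed buffers
provide: io_uring_register_buffers() pins the pages a single time, and the
*_fixed ops then reuse that pinned buffer with no per-I/O
try_grab_folio_fast(). A minimal userspace sketch; the function name,
buffer size and alignment are made up, and error handling is abbreviated:

#include <liburing.h>
#include <stdlib.h>
#include <sys/uio.h>

#define BUF_SIZE	(1UL << 21)	/* 2MB for the sketch; think 1GB hugetlb in practice */

int read_with_fixed_buffer(int fd)	/* fd assumed open with O_DIRECT */
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int ret;

	if (posix_memalign(&buf, 4096, BUF_SIZE))
		return -1;
	if (io_uring_queue_init(8, &ring, 0) < 0)
		return -1;

	/* Pin the buffer once, up front. */
	iov.iov_base = buf;
	iov.iov_len = BUF_SIZE;
	if (io_uring_register_buffers(&ring, &iov, 1) < 0)
		return -1;

	/* Every subsequent I/O reuses the already-pinned buffer (index 0). */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, fd, buf, BUF_SIZE, 0, 0);
	io_uring_submit(&ring);

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0) {
		ret = cqe->res;
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);	/* tears down the ring and unpins the buffer */
	free(buf);
	return ret;
}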
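
And, to make the percpu-refcount proposal quoted above concrete, a rough
sketch of the intended lifecycle (not a real patch; "struct huge_buf" and
its helpers are invented stand-ins for the folio plumbing):

#include <linux/percpu-refcount.h>
#include <linux/slab.h>

struct huge_buf {
	struct percpu_ref ref;
	/* ... the actual folio/page state would live here ... */
};

static void huge_buf_release(struct percpu_ref *ref)
{
	struct huge_buf *buf = container_of(ref, struct huge_buf, ref);

	/* Refcount hit zero after percpu_ref_kill(): safe to free. */
	percpu_ref_exit(&buf->ref);
	kfree(buf);
}

static struct huge_buf *huge_buf_create(void)
{
	struct huge_buf *buf = kzalloc(sizeof(*buf), GFP_KERNEL);

	if (!buf)
		return NULL;
	/* Starts in percpu mode: gets/puts only touch a per-CPU counter. */
	if (percpu_ref_init(&buf->ref, huge_buf_release, 0, GFP_KERNEL)) {
		kfree(buf);
		return NULL;
	}
	return buf;
}

/* Hot path (what try_grab_folio_fast() would use): no shared cacheline. */
static inline void huge_buf_get(struct huge_buf *buf)
{
	percpu_ref_get(&buf->ref);
}

static inline void huge_buf_put(struct huge_buf *buf)
{
	percpu_ref_put(&buf->ref);
}

/* On removal from the page cache: switch to atomic mode, drop base ref. */
static void huge_buf_remove(struct huge_buf *buf)
{
	percpu_ref_kill(&buf->ref);
}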