From: Andres Freund <andres@anarazel.de>
To: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>,
linux-mm@kvack.org, linux-block@vger.kernel.org,
Muchun Song <muchun.song@linux.dev>,
Jane Chu <jane.chu@oracle.com>
Subject: Re: Direct I/O performance problems with 1GB pages
Date: Mon, 27 Jan 2025 12:25:15 -0500
Message-ID: <w7vcs35omcdqkaszcc6fzvakzdoqkzjwtvdsc3lelcnjgzytib@siim7yk4qjrf>
In-Reply-To: <e0ba55af-23c4-455e-9449-e74de652fb7c@redhat.com>
Hi,
On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
> Hmmm ... do we really want to make refcounting more complicated, and more
> importantly, hugetlb-refcounting more special ?! :)
I don't know the answer to that - I mainly wanted to report the issue because
it was pretty nasty to debug and initially surprising (to me).
> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
> do so on a larger area (multiple pages at once -> single refcount update)?
In the original case where I hit this (a VM with 10 PCIe 3 NVMe drives JBODed
together), the IO size averaged something like ~240kB (mostly 256kB, with some
smaller ones thrown in). Increasing the IO size beyond that starts to hurt
latency and thus requires even deeper IO queues...
Unfortunately, I don't have access to hardware performance counters for the
VMs with those disks :(.
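Just to make sure we're talking about the same thing, here's a rough sketch of
how I understand the suggestion: pin the whole IO buffer in one gup call, so
the per-page references on a large folio can be folded into a single atomic
update instead of one update per 4kB page. pin_io_buffer() is a made-up helper,
purely for illustration:

#include <linux/kernel.h>
#include <linux/mm.h>

/*
 * Hypothetical helper: pin an entire IO buffer in one call. For a range
 * backed by a single huge folio, GUP can then take one elevated reference
 * on that folio rather than bumping the refcount once per 4kB page.
 */
static int pin_io_buffer(unsigned long start, size_t len, struct page **pages)
{
	int nr_pages = DIV_ROUND_UP(len, PAGE_SIZE);

	/* FOLL_WRITE: the device writes into the buffer for reads */
	return pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
}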
> Maybe there is a link to the report you could share, thanks.
A profile of the "original" case where I hit this, without the patch that
Willy linked to:
Note this is a profile *not* using hardware perf counters, thus likely to be
rather skewed:
https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
(this was on Debian Sid's 6.12.6)
Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
pages.
After applying the patch that adds an unlocked already-dirty check to
bio_set_pages_dirty(), performance improves to ~20GB/s when using 1GB pages.
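For reference, the idea behind that patch (my rough reconstruction, not
necessarily the exact diff Willy posted) is to skip folios that are already
dirty with an unlocked check, so the common case avoids the folio
lock/dirty/unlock sequence and the cacheline dirtying that goes with it:

#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Sketch of bio_set_pages_dirty() with the unlocked already-dirty check. */
void bio_set_pages_dirty(struct bio *bio)
{
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		/* Unlocked fast path: skip folios that are already dirty. */
		if (folio_test_dirty(fi.folio))
			continue;
		folio_lock(fi.folio);
		folio_mark_dirty(fi.folio);
		folio_unlock(fi.folio);
	}
}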
A differential profile comparing 2MB and 1GB pages with the patch applied
(again, without hardware perf counters):
https://gist.github.com/anarazel/f993c238ea7d2c34f44440336d90ad8f
Willy then asked me for perf annotate output showing where in
gup_fast_fallback() the time is spent. I didn't have access to the VM at that
point, so I tried to repro the problem on local hardware.
As I don't have as much IO throughput available locally, the problem was harder
to reproduce. But after lowering the average IO size (which is not unrealistic;
far from every workload is a bulk sequential scan), it showed up with just two
PCIe 4 NVMe SSDs.
Here are profiles of the 2MB and 1GB cases, with the bio_set_pages_dirty()
patch applied:
https://gist.github.com/anarazel/f0d0a884c55ee18851dc9f15f03f7583
2MB pages get ~12.5GB/s, 1GB pages ~7GB/s, with a *lot* of variance.
This time the profiles do use actual hardware perf counters...
Relevant details about the c2c report, excerpted from IRC:
andres | willy: Looking at a bit more detail into the c2c report, it looks
like the dirtying is due to folio->_pincount and folio->_refcount in
about equal measure and folio->flags being modified in
gup_fast_fallback(). The modifications then, unsurprisingly, cause a
lot of cache misses for reads (like in bio_set_pages_dirty() and
bio_check_pages_dirty()).
willy | andres: that makes perfect sense, thanks
willy | really, the only way to fix that is to split it up
willy | and either we can split it per-cpu or per-physical-address-range
andres | willy: Yea, that's probably the only fundamental fix. I guess there
might be some around-the-edges improvements by colocating the write
heavy data on a separate cache line from flags and whatever is at
0x8, which are read more often than written. But I really don't know
enough about how all this is used.
willy | 0x8 is compound_head which is definitely read more often than written
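Re the "colocating the write heavy data on a separate cache line" idea above:
that is just the generic false-sharing pattern, sketched here with made-up
field names rather than the real (and heavily constrained) struct folio layout:

#include <linux/atomic.h>
#include <linux/cache.h>

/*
 * Illustration only: keep read-mostly fields on one cache line and the
 * write-heavy pin/ref counts on another, so concurrent pinners don't keep
 * invalidating the line that readers like bio_set_pages_dirty() fetch.
 */
struct example_folio_layout {
	/* read-mostly: inspected on every lookup, rarely modified */
	unsigned long flags;
	unsigned long head;

	/* write-heavy: bumped by every pin/unpin */
	atomic_t refcount ____cacheline_aligned_in_smp;
	atomic_t pincount;
};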
Greetings,
Andres Freund