Date: Mon, 27 Jan 2025 12:25:15 -0500
From: Andres Freund
To: David Hildenbrand
Cc: Matthew Wilcox, linux-mm@kvack.org, linux-block@vger.kernel.org,
    Muchun Song, Jane Chu
Subject: Re: Direct I/O performance problems with 1GB pages

Hi,

On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
> Hmmm ... do we really want to make refcounting more complicated, and more
> importantly, hugetlb-refcounting more special ?! :)

I don't know the answer to that - I mainly wanted to report the issue because
it was pretty nasty to debug and initially surprising (to me).

> If the workload doing a lot of single-page try_grab_folio_fast(), could it
> do so on a larger area (multiple pages at once -> single refcount update)?

In the original case where I hit this (a VM with 10 PCIe 3x NVMEs JBODed
together), the IO size averaged ~240kB (most 256kB, with some smaller ones
thrown in). Increasing the IO size further than that starts to hurt latency
and thus requires even deeper IO queues...

Unfortunately for the VMs with those disks I don't have access to hardware
performance counters :(.

> Maybe there is a link to the report you could share, thanks.

A profile of the "original" case where I hit this, without the patch that
Willy linked to:

Note this is a profile *not* using hardware perf counters, thus likely to be
rather skewed:
https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
(this was on Debian Sid's 6.12.6)

Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
pages.

After applying the patch to add an unlocked already-dirty check to
bio_set_pages_dirty(), performance improves to ~20GB/s when using 1GB pages.

A differential profile comparing 2MB and 1GB pages with the patch applied
(again, without hardware perf counters):
https://gist.github.com/anarazel/f993c238ea7d2c34f44440336d90ad8f

Willy then asked me for perf annotate of where in gup_fast_fallback() time is
spent. I didn't have access to the VM at that point, and tried to repro the
problem with local hardware.

As I don't have quite enough IO throughput available locally, I couldn't repro
the problem quite as easily.
But after lowering the average IO size (which is not unrealistic, far from
every workload is just a bulk sequential scan), it showed up when just using
two PCIe 4 NVMe SSDs.

Here are profiles of the 2MB and 1GB cases, with the bio_set_pages_dirty()
patch applied:
https://gist.github.com/anarazel/f0d0a884c55ee18851dc9f15f03f7583

2MB pages get ~12.5GB/s, 1GB pages ~7GB/s, with a *lot* of variance.

This time it's actual hardware perf counters...

Relevant details about the c2c report, excerpted from IRC:

andres | willy: Looking at a bit more detail into the c2c report, it looks
       | like the dirtying is due to folio->_pincount and folio->_refcount in
       | about equal measure and folio->flags being modified in
       | gup_fast_fallback(). The modifications then, unsurprisingly, cause a
       | lot of cache misses for reads (like in bio_set_pages_dirty() and
       | bio_check_pages_dirty()).
 willy | andres: that makes perfect sense, thanks
 willy | really, the only way to fix that is to split it up
 willy | and either we can split it per-cpu or per-physical-address-range
andres | willy: Yea, that's probably the only fundamental fix. I guess there
       | might be some around-the-edges improvements by colocating the write
       | heavy data on a separate cache line from flags and whatever is at
       | 0x8, which are read more often than written. But I really don't know
       | enough about how all this is used.
 willy | 0x8 is compound_head which is definitely read more often than written

Greetings,

Andres Freund