Date: Sun, 26 Jan 2025 00:46:45 +0000
From: Matthew Wilcox <willy@infradead.org>
To: linux-mm@kvack.org
Cc: linux-block@vger.kernel.org, Muchun Song, Jane Chu
Subject: Direct I/O performance problems with 1GB pages
Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
Andres has gathered some performance data showing significantly worse
performance with 1GB pages compared to 2MB pages.  I sent a patch recently
which improves matters [1], but problems remain.

The primary problem we've identified is contention on folio->_refcount,
with a strong secondary contention on folio->_pincount.
This is coming from the call chain:

    iov_iter_extract_pages ->
      gup_fast_fallback ->
        try_grab_folio_fast

Obviously we can fix this by sharding the counts.  We could do that by
address, since there's no observed performance problem with 2MB pages.
But I think we'd do better to shard by CPU.  We have percpu-refcount.h
already, and I think it'll work.

The key to percpu refcounts is knowing at what point you need to start
caring about whether the refcount has hit zero (we don't care if the
refcount oscillates between 1 and 2, but we very much care about when
we hit 0).  I think the point at which we call percpu_ref_kill() is when
we remove a folio from the page cache.  Before that point, the refcount
is guaranteed to always be positive.  After that point, once the refcount
hits zero, we must free the folio.

It's pretty rare to remove a hugetlb page from the page cache while it's
still mapped, so we don't need to worry about scalability at that point.

Any volunteers to prototype this?  Andres is a delight to work with, but
I just don't have time to take on this project right now.

[1] https://lore.kernel.org/linux-block/20250124225104.326613-1-willy@infradead.org/