From: Linus Torvalds
Date: Mon, 26 Feb 2024 09:17:33 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
To: Al Viro
Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 18:49, Al Viro wrote:
>
> Uh?  __generic_file_write_iter() doesn't take inode lock, but
> generic_file_write_iter() does.

Yes, as I replied to Kent later, I mis-remembered the details
(together with a too-quick grep).

It's the reading side that just doesn't care, and which causes us to
not actually give the POSIX guarantees anyway.

> O_APPEND handling aside, there's
> also file_remove_privs() in there and it does need that lock.

Fair enough, but that's such a rare special case that it really
doesn't merit the lock on every write.

What at least the ext4 DIO code does is something like "look at the
write position and the inode size, and decide optimistically whether
to take the lock shared or not". Then it re-checks it after
potentially taking it for a read (in case the inode size has
changed), and might decide to go for a write lock after all..

And I think we could fairly trivially do something similar on the
regular write side. Yes, it needs to check SUID/SGID bits in addition
to the inode size change, I guess (I don't think the ext4 dio code
does, but my quick grepping might once again be incomplete).

Anyway, DaveC is obviously also right that for the "actually need to
do writeback" case, our writeback is currently intentionally
throttled, which is why the benchmark by Luis shows that "almost two
orders of magnitude" slowdown with buffered writeback. That's
probably mainly an effect of having a backing store with no delays,
but still..

However, the reason I dislike our write-side locking is that it
actually serializes even the entirely cached case.

Now, writing concurrently to the same inode is obviously strange, but
it's a common thing for databases. And while all the *serious* ones
use DIO, I think the buffered case really should do better.

Willy - tangential side note: I looked closer at the issue that you
reported (indirectly) with the small reads during heavy write
activity.

Our _reading_ side is very optimized and has none of the write-side
oddities that I can see, and we just have

  filemap_read ->
    filemap_get_pages ->
      filemap_get_read_batch ->
        folio_try_get_rcu()

and there is no page locking or other locking involved (assuming the
page is cached and marked uptodate etc, of course).

So afaik, it really is just that *one* atomic access (and the
matching page ref decrement afterwards).
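
To make that "one atomic access plus the matching decrement" concrete, here is a minimal userspace sketch of a speculative reference grab, written in the spirit of folio_try_get_rcu() rather than copied from it. The names toy_folio, toy_folio_try_get() and toy_folio_put() are hypothetical, and C11 atomics stand in for the kernel's own refcount helpers:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page cache folio; only the refcount matters. */
struct toy_folio {
	atomic_int refcount;	/* 0 means the object is on its way to being freed */
};

/*
 * Take a reference only if the count is still non-zero. This is the
 * single atomic read-modify-write a cached read pays on the way in.
 */
bool toy_folio_try_get(struct toy_folio *f)
{
	int old = atomic_load_explicit(&f->refcount, memory_order_relaxed);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(&f->refcount,
							  &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;	/* lost the race against the final put */
}

/* Drop a reference; returns true if the caller dropped the last one. */
bool toy_folio_put(struct toy_folio *f)
{
	int old = atomic_fetch_sub_explicit(&f->refcount, 1,
					    memory_order_acq_rel);

	assert(old > 0);	/* catch refcount underflow in the sketch */
	return old == 1;
}
```

A reader would call toy_folio_try_get(), copy the data out, and then call toy_folio_put(). Both calls touch the same cacheline, which is where the bouncing under heavy concurrent access comes from.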

We could easily do all of this without getting any ref to the page at
all if we did the page cache release with RCU (and the user copy with
"copy_to_user_atomic()"). Honestly, anything else looks like a
complete disaster. For tiny reads, a temporary buffer sounds ok, but
really *only* for tiny reads where we could have that buffer on the
stack.

Are tiny reads (handwaving: 100 bytes or less) really worth
optimizing for to that degree?

In contrast, the RCU-delaying of the page cache might be a good idea
in general. We've had other situations where that would have been
nice. The main worry would be low-memory situations, I suspect.

The "tiny read" optimization smells like a benchmark thing to me.
Even with the cacheline possibly bouncing, the system call overhead
for tiny reads (particularly with all the mitigations) should be
orders of magnitude higher than two atomic accesses.

             Linus