From: Linus Torvalds
Date: Wed, 28 Feb 2024 12:17:50 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
To: Kent Overstreet
Cc: Dave Chinner, Luis Chamberlain, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox
References: <4uiwkuqkx3lt7cbqlqchhxjq4pxxb3kdt6foblkkhxxpohlolb@iqhjdbz2oy22>

On Wed, 28 Feb 2024 at 11:29, Kent Overstreet wrote:
>
> The more concerning situation to me would be if breaking write atomicity
> means that we end up with data in the file that doesn't correspond to a
> total ordering of writes; e.g. part of write a, then write b, then the
> rest of write a overlaying part of write b.
>
> Maybe that can't happen as long as writes are always happening in
> ascending folio order?

So that was what my suggestion about just overlapping one lock at a time was about - we always do writes in ascending order, and contiguously (even if the data source obviously isn't necessarily some contiguous thing).

And I think that's actually fundamental and not something we're likely to get out of. If you do non-contiguous writes, you'll always treat them as separate things.

Then just the "lock the next folio before unlocking the previous one" would already give some relevant guarantees, in that at least you wouldn't get overlapping writes where the write data would be mixed up.

So you'd get *some* ordering, and while I wouldn't call it "total ordering" (and I think with readers not taking locks you can't get that anyway because readers will *always* see partial writes), I think it's much better than some re-write model.

However, the "lock consecutive pages as you go along" does still have the issue of "you have to be able to take a page fault in the middle".

And right now that actually depends on us always dropping the page lock in between iterations.

This issue is solvable - if you get a partial read while you hold a page lock, you can always just check "are the pages I hold locks on not up-to-date?" and decide to do the read yourself (and mark them up-to-date). We don't do that right now because it's simpler not to, but it's not conceptually a huge problem.

It *is* a practical problem, though.

For example, in generic_perform_write(), we've left page locking on writes to the filesystems (i.e. it's done by "->write_begin/->write_end"), so I think in reality that "don't release the lock on folio N until after you've taken the lock on folio N+1" isn't actually wonderful. It has some really nasty practical issues.

And yes, those practical issues are because of historical mistakes (some of them very much by yours truly). Having one single "page lock" was probably one of those historical mistakes.
If we use a different bit for serializing page writers, the above problem would go away as an issue.

ANYWAY.

At least with the current setup we literally depend on that "one page at a time" behavior right now, and even XFS - which takes the inode lock shared for reading - does *not* do it for reading a page during a page fault for all these reasons.

XFS uses iomap_write_iter() instead of generic_perform_write(), but the solution there is exactly the same, and the issue is fairly fundamental (unless you want to always read in pages that you are going to overwrite first).

This is also one of the (many) reasons I think the POSIX atomicity model is complete garbage. I think the whole "reads are atomic with respect to writes" is a feel-good bedtime story. It's simplistic, and it's just not *real*, because it's basically not compatible with mmap.

So the whole POSIX atomicity story comes from a historical implementation background and ignores mmap.

Fine, people can live in that kind of "read and write without DIO is special" fairy tale and think that paper standards are more important than sane semantics. But I just am not a fan of that.

So I am personally perfectly happy to say "POSIX atomicity is a stupid paper standard that has no relevance for reality". The read side *cannot* be atomic wrt the write side.

But at the same time, I obviously then care a _lot_ about actual existing loads. I may not worry about some POSIX atomicity guarantees, but I *do* worry about real loads.

And I don't think real loads actually care about concurrent overlapping writes at all, but the "I don't think" may be another wishful feel-good bedtime story that isn't based on reality ;)

           Linus
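
For readers who want the "lock the next folio before unlocking the previous one" idea in concrete form, here is a minimal userspace sketch of that lock-coupling pattern. It is only an analogy, not kernel code: struct page_slot, file_pages, file_init() and coupled_write() are hypothetical names made up for illustration, pthread mutexes stand in for the folio lock, and the page-fault problem described above is ignored entirely.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define NPAGES  8   /* hypothetical: number of "pages" in the toy file */
#define PAGE_SZ 16  /* hypothetical: bytes per toy page */

/* One toy page: a lock plus its data. Stands in for a folio. */
struct page_slot {
	pthread_mutex_t lock;
	char data[PAGE_SZ];
};

static struct page_slot file_pages[NPAGES];

/* Initialize every page lock and zero the contents. */
static void file_init(void)
{
	for (size_t i = 0; i < NPAGES; i++) {
		pthread_mutex_init(&file_pages[i].lock, NULL);
		memset(file_pages[i].data, 0, PAGE_SZ);
	}
}

/*
 * Fill pages [first, first + count) with 'fill', in ascending order.
 * The lock on page i + 1 is taken before the lock on page i is
 * dropped, so a second writer on an overlapping range can never
 * overtake this one: it either stays ahead or stays behind.
 */
static void coupled_write(size_t first, size_t count, char fill)
{
	size_t end = first + count;

	assert(count > 0 && end <= NPAGES);

	pthread_mutex_lock(&file_pages[first].lock);
	for (size_t i = first; i < end; i++) {
		memset(file_pages[i].data, fill, PAGE_SZ);
		if (i + 1 < end)
			pthread_mutex_lock(&file_pages[i + 1].lock);
		pthread_mutex_unlock(&file_pages[i].lock);
	}
}
```

If two threads call coupled_write() on overlapping ranges, whichever one locks the first shared page first stays ahead for the rest of the overlap, so the overlapping pages end up holding data from a single writer rather than a mix. Locks are only ever taken in ascending order, so the two writers cannot deadlock against each other. Lock-free readers can still observe a half-finished write, which matches the point above about readers always seeing partial writes.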