From: Amir Goldstein
Date: Wed, 28 Feb 2024 13:38:44 +0200
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] untorn buffered writes
To: "Theodore Ts'o", "Luis R. Rodriguez"
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm, Jan Kara
References: <20240228061257.GA106651@mit.edu>
In-Reply-To: <20240228061257.GA106651@mit.edu>

On Wed, Feb 28, 2024 at 8:13 AM Theodore Ts'o wrote:
>
> Last year, I talked about an interest in providing databases such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
>
> [1] https://lwn.net/Articles/932900/
>
> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes". The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if that cannot be guaranteed, that the
> write should fail. In this interface, if userspace sends a 128k
> write with the RWF_ATOMIC flag, and the storage device supports
> an all-or-nothing write with the given size and alignment, the
> kernel will guarantee that it is sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs a guarantee that if the write is torn,
> the tear only happens on a 16k boundary. That is, if the write is split
> into a 32k and a 96k request, that would be totally fine as far as the
> database is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.
>
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case. Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem to come up with a credible use case
> that would require this.
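
To make the 16k-boundary point above concrete, here is a minimal sketch in C. It is not taken from any posted patch; tear_points_ok() and its parameters are hypothetical names used only for illustration. It checks that every point where a write was split into separate requests falls on a database-page boundary, which is all the database needs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: given the file offset of a
 * write and the sizes of the requests it was split into, report whether
 * every split point lands on a multiple of db_page bytes.
 */
static bool tear_points_ok(uint64_t offset, const uint64_t *req_len,
                           size_t nr_reqs, uint64_t db_page)
{
    uint64_t pos = offset;

    /* The write itself must start on a database-page boundary. */
    if (db_page == 0 || offset % db_page != 0)
        return false;

    for (size_t i = 0; i < nr_reqs; i++) {
        pos += req_len[i];
        /*
         * pos is the boundary after request i (for the last request,
         * the end of the whole write); each one must be page aligned.
         */
        if (pos % db_page != 0)
            return false;
    }
    return true;
}
```

With 16k database pages, a 128k write at offset 0 split into 32k + 96k passes this check, while a split into 24k + 104k does not, even though neither was sent as a single 128k request.
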
>
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and the Postgres database uses buffered, not direct
> I/O writes. Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enabled, and let's
> suppose they are written using the proposed RWF_ATOMIC flag. In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 64k RWF_ATOMIC write, etc., so that the
> writeback code knows what "atomic" guarantee was made at
> write time. This very quickly becomes a mess.
>
> Another interface that would be much simpler to implement for buffered
> writes would be one where the untorn write granularity is set on a
> per-file-descriptor basis, using fcntl(2). We validate whether the
> untorn write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode. (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)
>
> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity. And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity. This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
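
The fcntl(2)-time validation and the write(2)-time check described above could look roughly like the following userspace model in C. This is only a sketch under stated assumptions: struct untorn_inode, untorn_set_gran() and untorn_write_ok() are hypothetical names, and treating "can be supported" as "a power of two no larger than a device limit" is my assumption, not something the proposal specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode state: largest granularity requested by any fd. */
struct untorn_inode {
    uint32_t max_gran;  /* bytes; 0 means no untorn writes requested */
};

/* Assumption: a supportable granularity is a power of two <= dev_max. */
static bool gran_valid(uint32_t gran, uint32_t dev_max)
{
    return gran != 0 && (gran & (gran - 1)) == 0 && gran <= dev_max;
}

/*
 * Models the fcntl(2)-time step: reject an unsupported granularity,
 * otherwise remember the largest one requested for this inode.
 */
static bool untorn_set_gran(struct untorn_inode *ino, uint32_t gran,
                            uint32_t dev_max)
{
    if (!gran_valid(gran, dev_max))
        return false;
    if (gran > ino->max_gran)
        ino->max_gran = gran;
    return true;
}

/*
 * Models the write(2)-time check: offset and length must both be
 * multiples of the granularity set on the file descriptor.
 */
static bool untorn_write_ok(uint64_t offset, uint64_t len, uint32_t gran)
{
    return gran != 0 && len != 0 &&
           offset % gran == 0 && len % gran == 0;
}
```

Nothing here is recorded per dirty page. The only state is one number per inode, which writeback would then use as the unit for grouping contiguous aligned dirty pages.
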
>
> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.
>

Seems a duplicate of this topic proposed by Luis?

https://lore.kernel.org/linux-fsdevel/ZdfDxN26VOFaT_Tv@bombadil.infradead.org/

Maybe you guys want to co-lead this session?

Thanks,
Amir.