From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 09B77C3DA4A for ; Tue, 20 Aug 2024 16:30:48 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 92A856B0088; Tue, 20 Aug 2024 12:30:47 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 88B816B0089; Tue, 20 Aug 2024 12:30:47 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 705136B008A; Tue, 20 Aug 2024 12:30:47 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 50AA56B0088 for ; Tue, 20 Aug 2024 12:30:47 -0400 (EDT) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 94824A0369 for ; Tue, 20 Aug 2024 16:30:46 +0000 (UTC) X-FDA: 82473162492.29.233144C Received: from mail-il1-f178.google.com (mail-il1-f178.google.com [209.85.166.178]) by imf11.hostedemail.com (Postfix) with ESMTP id 6046B40032 for ; Tue, 20 Aug 2024 16:30:43 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=kernel-dk.20230601.gappssmtp.com header.s=20230601 header.b=hchkjWfG; spf=pass (imf11.hostedemail.com: domain of axboe@kernel.dk designates 209.85.166.178 as permitted sender) smtp.mailfrom=axboe@kernel.dk; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1724171364; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=DYpNnAfdwkTY7rhE6SZ5xCU9aT/qWIRpF1Zi1RiAgrM=; b=RB6nza2ThslsJQFfvB+hTSMa7TFJUdw6eZ4ws1oQrCRPHEkWMTzndP6FlZI2ffbdieqLeJ W3QeAK8khczN3lZbe5ucxDAYKudCSfpO9z/WmUf3vLBZXGfbFtYh0c1OUTe1mHhchgQuhv bO3onzUqArmPa5m0/88vNvcjpdJ+NL0= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1724171364; a=rsa-sha256; cv=none; b=3KHZQIoWIx2DPUpQnJ6mwxM1KvmH6BoiqKeKplgehG4+f2/20EPpEWmS6vKrPcyezOvByO 4GgRz6Omnm/TrUA04qWp4PhPXEh/J8dy2gMSqWh9Zn6dtd4UPzhmp9v99igoOqii8prrhq 9dl8zSiqaEV1T5FVS7ivCSXtVzEz+sQ= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=kernel-dk.20230601.gappssmtp.com header.s=20230601 header.b=hchkjWfG; spf=pass (imf11.hostedemail.com: domain of axboe@kernel.dk designates 209.85.166.178 as permitted sender) smtp.mailfrom=axboe@kernel.dk; dmarc=none Received: by mail-il1-f178.google.com with SMTP id e9e14a558f8ab-39b32f258c8so22004315ab.1 for ; Tue, 20 Aug 2024 09:30:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20230601.gappssmtp.com; s=20230601; t=1724171442; x=1724776242; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=DYpNnAfdwkTY7rhE6SZ5xCU9aT/qWIRpF1Zi1RiAgrM=; b=hchkjWfGwHvXUQGscLmCH+3qgTmUbqsLcXOgzCqg1isq8wQb6N+kMMm1D4uv5fh6/V X71Y+wfkKJqHHNHtZ3JM+LhWj9OE8sSqmxU5Y5aEpNUm7JpBOssq8XCqjOEApf1dyW+j kn+WKAh4lgrw1OTfICl5pG/I0F02Z+bIWWQlW+iylNn+LJBNIsbeJBcWhMsXTRqFRJ++ pmXHqviBfqzteP90CpGZjfZFO4ZvDmEofFyXgAFOUwbKtspuvMAb25ddfMpSTWG3jfk0 R5c/g2lKinH4MM1DNkRHJR6lEWfryoqC6tphpcEHyt9mPVYY5lnt9L6r3dGla1HpaXF8 NRXw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1724171442; x=1724776242; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=DYpNnAfdwkTY7rhE6SZ5xCU9aT/qWIRpF1Zi1RiAgrM=; b=Wplj3SOWEecdl5GYQldq7M/haoTsJJoil0KFX+WaLIGpi2jXabMkH//PUYgWyxmEpD swkCQ7PSB4Ap1x/4IGAwSGmJHsPVVHQFpL2t44uOHHa3BSyIhqLLGCXJ/SXwMyfPnaDG dW/Ncf2KnOonnnw2huEAlviD+3czEFbH+H35kHWKQby1UfHeWHNSXLHTdNuzAQIel2Gc IllzYzUcEdKvad0I7m//dZLE4haIV+pGZUnwF1bNkNky0rkWY9qHf837neqib8ne/Ih8 YdjZs3iaHHDk8bG0h+x2yp8Yv/1c21jHvyose6V52Tc6SIyrnNtV8f4gjYmc1a88Go1a 9kCA== X-Forwarded-Encrypted: i=1; AJvYcCX72s6mMkXaihhErDpDK3QOSlberJ7zWEaBDlV9CK7AbjD2LefiID3bh0tNWPVBq4d8AX28T2w1SdCyDu8dqoryOA4= X-Gm-Message-State: AOJu0YwuQKef6PIz3n/L6cyJsFiZ10X+dW1phzGy2e7N21k2cSvQnW9p n06ton5MSbh9SOwmmWv3xMy7KeVpak70eTbRIiwBSC/Sy85cBScIz6/9iRA2/ew= X-Google-Smtp-Source: AGHT+IFo1x+o4q+ouOKPox7DDrQHi+9mUH/kuJu0zMBLnyBj6r0dvjPKc26d3wbkH/VYLDydJnAxBw== X-Received: by 2002:a05:6e02:198b:b0:39d:3c87:1435 with SMTP id e9e14a558f8ab-39d56dd821fmr44069135ab.1.1724171441835; Tue, 20 Aug 2024 09:30:41 -0700 (PDT) Received: from [192.168.1.116] ([96.43.243.2]) by smtp.gmail.com with ESMTPSA id 8926c6da1cb9f-4ccd6e7db2csm3879660173.32.2024.08.20.09.30.40 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 20 Aug 2024 09:30:41 -0700 (PDT) Message-ID: Date: Tue, 20 Aug 2024 10:30:40 -0600 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC 5/5] block: implement io_uring discard cmd To: Ming Lei Cc: Pavel Begunkov , io-uring@vger.kernel.org, Conrad Meyer , linux-block@vger.kernel.org, linux-mm@kvack.org, Jan Kara , Shin'ichiro Kawasaki References: <6ecd7ab3386f63f1656dc766c1b5b038ff5353c2.1723601134.git.asml.silence@gmail.com> <4d016a30-d258-4d0e-b3bc-18bf0bd48e32@kernel.dk> Content-Language: en-US From: Jens Axboe In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Stat-Signature: zm68k5ik1znogm6hhry1ioacj3fqd471 X-Rspamd-Queue-Id: 6046B40032 X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1724171443-251282 X-HE-Meta: U2FsdGVkX1+ZTAw0XwdgQfyGzmoYSDqC5oAcSfz3zO/4ymc179o+VM2umkBz0EuTk3YTRyk9V81mnfH16nQWteulgrYn212UYInBu7ToVYz+qGwpdoQG4dWWsDF54GHYgpnZ6oHn7mAITDEibvitFLZFjdi0TmlsL7uZ8oNc9HcuPvxMzJ0XWN/Wlx11HD7roNJPacFkaRfuBq0u20lMIWChIqErzBlv+nE9xgDkLm7+huMG1SEGDqOjUEDLLbze/8BFtwbG/zaIbQIMWaYMCDDlJf2vemlH59X8BVZVRJZjYBSqG7aDnl/EW4PUINZCLyDL7+WBmIZ95sF7COHn/jyr7/OVs8p4w+f0eZwaMMAsza+903185JKpFTt85RhjIgz0eKEXeblykBCCfhrI952mj+XeTSPqVQDd2yNK1wPOr9XbAeHBHyd7lKxe4iQug9bjVf+rSn6L/MdH4pqPTDAJrQkcORwEaEn/zEzqROrY8e8N8qBVI6PUhywlaekyLQb3ipn5eyDSCvmrgp/L29xBA6Stxth/vcRijwaygpBOeTtGIgz5KGlVtJhELzS9rQgdHEcXmSckiYa2dNmFxVLopmE3+b1WE2ZN8nUI4+ulPxd5Hr3xN+j0BJH6XOr93NmvnL3Y+mizGh70cWkSRo1rP/eS+BAyjLczZG/3y3HBBQnOsHbDf5y/nlY+/3pREjrqtzdhwziBK8QzIXCasFxVCA6eTGgNpOwzzVACDeghdCPfmYeKDBhcJknnjVHs4FRkdx0BuM/ww9aLglXjfAzjkPAYQPq773DRp3tq/ZtgAoQvIXzt/J4yuyeoQ4RHH3i84gGWiLj1eE0+ttDJOIQHax5kZFEiiYuJxVUYVPpyTgXdPzGYeaeMa4ejG8jTrtOp5nICyLBKFBlVP8ARyD+q+4Kif0tmLdjkCE8VFIyE9MitwPynUpTzsCQYZwFTVTpCKeOJnF85mCNlv/t egXNesLm 3ykL3GRwZZ1ORHWV6I4C79mF9Mp2qC3vnO/Xo55ncAbP466eb/Fo7FRfEqMApnpJGhyHLyZCUl2DsUsp2MtxLhqOYNbRTu8JGGSIG7gKHIHxGnxHL05CurC+s1A30cLksZzEqivRs7dve2UdDn5D8rAbBfRAl+Jg1Ylg82VLm5JID+pg+XVON1d+vW5vJZ5F7y4b1x09cbt8qB7OQAuT25uyog65Ty4hQGqbzSOPbyYE3HdjKwp/WSvsjrsjJgnkJbkIRx3ZoJfLap6nZ+/t5oCClB0JH1mmeTq/AlpbVbtXAPj4l9tH7lfv1i+mA825B25/5PpiPecD5HZ7I0PcaXh8gp+IztsBL2MIyLnR6yGIBeEIBAjdgXJS5tUptsgd5gWBdBx6ou2gcGrFX8xStnsiua+Fxmpri1Aa+XvSfke55tRa52JWIjVM+Stu+jzLVpoIhVPvjYvm1PkvvRRl1+icH7MZCBkDzZDi/bU4Exx8+ovYciWtBx3zgf1VJ10EJDQnBqNX3ymujs07dmGlSE1mJeg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 8/19/24 8:36 PM, Ming Lei wrote: > On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote: >> On 8/15/24 7:45 PM, Ming Lei wrote: >>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: >>>> On 8/15/24 5:44 PM, Ming Lei wrote: >>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >>>>>> On 8/15/24 15:33, Jens Axboe wrote: >>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov wrote: >>>>>>>>> >>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>>>>>> asynchronous discard. Normally, it first tries to execute the command >>>>>>>>> from non-blocking context, which we limit to a single bio because >>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>>>>>> it in a blocking context. >>>>>>>>> >>>>>>>>> Suggested-by: Conrad Meyer >>>>>>>>> Signed-off-by: Pavel Begunkov >>>>>>>>> --- >>>>>>>>> block/blk.h | 1 + >>>>>>>>> block/fops.c | 2 + >>>>>>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>>>>>> include/uapi/linux/fs.h | 2 + >>>>>>>>> 4 files changed, 99 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/block/blk.h b/block/blk.h >>>>>>>>> index e180863f918b..5178c5ba6852 100644 >>>>>>>>> --- a/block/blk.h >>>>>>>>> +++ b/block/blk.h >>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>>>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>>>>>> loff_t lstart, loff_t lend); >>>>>>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>>>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>> >>>>>>>>> extern const struct address_space_operations def_blk_aops; >>>>>>>>> diff --git a/block/fops.c b/block/fops.c >>>>>>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>>>>>> --- a/block/fops.c >>>>>>>>> +++ b/block/fops.c >>>>>>>>> @@ -17,6 +17,7 @@ >>>>>>>>> #include >>>>>>>>> #include >>>>>>>>> #include >>>>>>>>> +#include >>>>>>>>> #include "blk.h" >>>>>>>>> >>>>>>>>> static inline struct inode *bdev_file_inode(struct file *file) >>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>>>>>> .splice_read = filemap_splice_read, >>>>>>>>> .splice_write = iter_file_splice_write, >>>>>>>>> .fallocate = blkdev_fallocate, >>>>>>>>> + .uring_cmd = blkdev_uring_cmd, >>>>>>>> >>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>>>>>> discard to block device, why is .uring_cmd added for this purpose? >>>>>> >>>>>> Which is a good question, I haven't thought about it, but I tend to >>>>>> agree with Jens. Because vfs_fallocate is created synchronous >>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >>>>>> Probably can be patched up, which would involve changing the >>>>>> fops->fallocate protot, but I'm not sure async there makes sense >>>>>> outside of bdev (?), and cmd approach is simpler, can be made >>>>>> somewhat more efficient (1 less layer in the way), and it's not >>>>>> really something completely new since we have it in ioctl. >>>>> >>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, >>>>> same with blkdev_fallocate(). >>>>> >>>>> But this patch drops this exclusive lock, so it becomes async friendly, >>>>> but may cause stale page cache. However, if the lock is required, it can't >>>>> be efficient anymore and io-wq may be inevitable, :-) >>>> >>>> If you want to grab the lock, you can still opportunistically grab it. >>>> For (by far) the common case, you'll get it, and you can still do it >>>> inline. >>> >>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync >>> interface cause there is at most one async discard cmd in-flight for each >>> device. >> >> Oh for sure, you could not do that anyway as you'd be holding a lock >> across the syscall boundary, which isn't allowed. > > Indeed. > >> >>> Meantime the handling has to move to io-wq for avoiding to block current >>> context, the interface becomes same with IORING_OP_FALLOCATE? >> >> I think the current truncate is overkill, we should be able to get by >> without. And no, I will not entertain an option that's "oh just punt it >> to io-wq". > > BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"), > and block/009 serves as regression test for covering page cache > coherency and discard. > > Here the issue is actually related with the exclusive lock of > filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during > discard for not polluting page cache. block/009 may fail too without the lock. > > It is just that concurrent discards can't be allowed any more by > down_write() of rw_semaphore, and block device is really capable of doing > that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in > BLKDISCARD ioctl"). > > Cc Jan Kara and Shin'ichiro Kawasaki. Honestly I just think that's nonsense. It's like mixing direct and buffered writes. Can you get corruption? Yes you most certainly can. There should be no reason why we can't run discards without providing page cache coherency. The sync interface attempts to do that, but that doesn't mean that an async (or a different sync one, if that made sense) should. If you do discards to the same range as you're doing buffered IO, you get to keep both potentially pieces. Fact is that most folks are doing dio for performant IO exactly because buffered writes tend to be horrible, and you could certainly use that with async discards and have the application manage it just fine. So I really think any attempts to provide page cache synchronization for this is futile. And the existing sync one looks pretty abysmal, but it doesn't really matter as it's a sync interfce. If one were to do something about it for an async interface, then just pretend it's dio and increment i_dio_count. -- Jens Axboe