From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AA009C3DA4A for ; Mon, 19 Aug 2024 20:02:15 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3D55C6B0083; Mon, 19 Aug 2024 16:02:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 386046B0085; Mon, 19 Aug 2024 16:02:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 209296B0088; Mon, 19 Aug 2024 16:02:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id EEEF96B0083 for ; Mon, 19 Aug 2024 16:02:14 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id AA326A9268 for ; Mon, 19 Aug 2024 20:02:14 +0000 (UTC) X-FDA: 82470066588.28.49E86AD Received: from mail-pj1-f45.google.com (mail-pj1-f45.google.com [209.85.216.45]) by imf02.hostedemail.com (Postfix) with ESMTP id A4D2C80024 for ; Mon, 19 Aug 2024 20:02:12 +0000 (UTC) Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=kernel-dk.20230601.gappssmtp.com header.s=20230601 header.b=VzajzVjs; dmarc=none; spf=pass (imf02.hostedemail.com: domain of axboe@kernel.dk designates 209.85.216.45 as permitted sender) smtp.mailfrom=axboe@kernel.dk ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1724097669; a=rsa-sha256; cv=none; b=JTe7O6IoPSwlYuYYRfcDOI1xhLS2kG/7HSd5SuFCEptqiV/IcvPUgC+roWLf7qiry1+2SX 7OnnfTuurD0YHAfwRY3ZkO47T8lCYkhlH7YklZivfL7Oz5de7iPFRIaLLY+NByp3rO/osr B5bzXdscdCLNJYJ2bsVuOVe+Hs1Fn2Y= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=pass header.d=kernel-dk.20230601.gappssmtp.com header.s=20230601 header.b=VzajzVjs; dmarc=none; spf=pass (imf02.hostedemail.com: domain of axboe@kernel.dk designates 209.85.216.45 as permitted sender) smtp.mailfrom=axboe@kernel.dk ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1724097669; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=pBkl3+XvJeAklRL+6TFbxoUucjnynnJ4wXSRFBTFaQc=; b=TwGAe4nGbQ+SAJhtofNfZV69eXIaK3zXxXNCIo8gxX4FeYR6lE1mgeNx8jpC0r4DG/mDWQ gD4WRA65/QUYBEoKh2ooeLCVc1Oa4SHzygDyoehxiL95OkQ+L44K1YANJObEGsmYHsDUgN ICs3t2c9UlnTM0vsrzhsmaSdF+AH0Z4= Received: by mail-pj1-f45.google.com with SMTP id 98e67ed59e1d1-2d3badec49eso633070a91.1 for ; Mon, 19 Aug 2024 13:02:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20230601.gappssmtp.com; s=20230601; t=1724097731; x=1724702531; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=pBkl3+XvJeAklRL+6TFbxoUucjnynnJ4wXSRFBTFaQc=; b=VzajzVjsZK+PSNsU1h0d8WjCO3XppUw4Et360KNwuD4TRte///r+u5bnLO9wl2p1RT 6evdu4t4E/uMweePT68aFIqglyqy2jcZwBZuKfKwORgYduJjPDavYFG2eCU3hayo+SQ5 nsHRybcVl0yU7lYdiear4B001If5c1smX8HoNyFmINlmC7a3A3RKO7tda3COSidlAFXy RNsoMIxbBwK5K0a1upEbZ/MndwxPuDwl06ae8n39S6SAJAo7jnU3M9jVhQGLTFnFrHRn OlSd0oygrFYHMjQT4KMUY2EzowMuyWVe0U9Q5yzAFeG/4oHa28sktlSpSsvCvmJlMtj4 CTAA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1724097731; x=1724702531; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=pBkl3+XvJeAklRL+6TFbxoUucjnynnJ4wXSRFBTFaQc=; b=wqyJN1xzIRoQGJ6ofFEehU0AJhFt7ADYV5uxbzU80uOQlJ0sjVEcvW6mWixO+twH/a WD5QKzUslayOMXHxqy39MW7If1Kycdp8UB3TLMspEEhDQYJn9c7ZErS0RNznrPnit9Ic aWzW6pEnopuAetXZGeJ2kUlpuqkuQfMXP9nCwYW1UcE0LBcGSr+ppd0WWE4TEoeEcQV9 8LJQTHUCtWz+eB5uVoGEo16MOrDdqSHL+6gdwqrlUPZKacGen//Kb08SxzAD9Oz9Hy1v ofVCK36FsFarw1yYJxpzb3Ke3FkDOoRJg7RFYB56uNJBAZ94b5fGe8ihMSScSl3e5MQZ IW6w== X-Forwarded-Encrypted: i=1; AJvYcCXPb808iAfgEgNvp6X7JlM38X5i7t19sm2YQmqpbW55RnGDHDfBCYFnoprBGA2mNlWgmct/Iiy6uw==@kvack.org X-Gm-Message-State: AOJu0Yxo26E1w3FI1Nk6f8Ebl/ihSEQyaw0XqQZjxKwNzsrOTClf3spm LLDiy7+sIL/3bpxOD0McWiPI7MMVyLMtSeon5nXwIODlO1CfhXkQAahio52k7r0= X-Google-Smtp-Source: AGHT+IGhGfvDEMLkeZJwlMhyh4v1+tO0SzOoGdLBcZEfbz9YTeG1J9v2kob+e+94rKiTd1PuLu4OJw== X-Received: by 2002:a17:90a:d517:b0:2d3:c488:fa6b with SMTP id 98e67ed59e1d1-2d3e151f6d8mr8019983a91.5.1724097730914; Mon, 19 Aug 2024 13:02:10 -0700 (PDT) Received: from [192.168.1.150] ([198.8.77.157]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-2d3e2e6b1f6sm7772049a91.19.2024.08.19.13.02.09 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 19 Aug 2024 13:02:10 -0700 (PDT) Message-ID: <5d7aabe5-3988-4a85-b329-4dfe89b22582@kernel.dk> Date: Mon, 19 Aug 2024 14:02:09 -0600 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC 5/5] block: implement io_uring discard cmd To: Pavel Begunkov , Ming Lei Cc: io-uring@vger.kernel.org, Conrad Meyer , linux-block@vger.kernel.org, linux-mm@kvack.org References: <6ecd7ab3386f63f1656dc766c1b5b038ff5353c2.1723601134.git.asml.silence@gmail.com> <4d016a30-d258-4d0e-b3bc-18bf0bd48e32@kernel.dk> Content-Language: en-US From: Jens Axboe In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: A4D2C80024 X-Stat-Signature: yhb1hb63nucptkyqwx9bphnuetnysz3b X-Rspam-User: X-HE-Tag: 1724097732-467103 X-HE-Meta: U2FsdGVkX18NT4DNAdBRoJhBT8xfOeD2XJsZImkNVs5PClxvrKGlMl8+KtvPJD+dbcn9E+7XsSenzc2qv3yw1jZe9M2YS8Pek0V9QVnQFN3zx0Mti4XTtjCF6HQWAv1yYcABZ701s89J6jEEOal5rCSaP7d6y5ApsHP1svX9mSfsK7wlI99dbGQlQ1jmHokpzjElWb6WSGp3/yCDsVu0yDEKaANKbM49Gp9eKGM54DQqaARY4dPkHsoWzBLkS34561QdbUL2sJlCr6dv+4DDb7if+6sAcbVRQA0h27bJMyTH7aabxqLTwHPY/hndJrhoUPgNk2deEsXwLx4R+PesMgc8Ak78QhgUtQn5lharxIS+eVal+ul8wPzHZ1ulHZvn31YktfD6fuDVpQ8f/WYB1X9r/kFzTpEWzpeVTSjAARU1Kbz/aUMMylOn7gJPsPSeJJlbAgtV1FH1zzFkMfHwUdV9u6oFhzN5NXvdMvGWeQ14TaUOSXe9Z+5vTOIQTOr/VY38ehtjEsqVWXpG6Ul3qwiFVhdsvPPmQuxeEzezcnWFFrDgKBjQ3pwJJgjN7+h80OMnWJ6znmbVIFpLQyhbTZqGVp/RuLkHoYHtKhzt8ycxJBwZ7fN+WC1IunzZKtRkNVj9okRiNonEzKZyZoJ4QLoQ5e76J6ZSiZhInp2PeO84kXnoqKZJNQ2xokFlyi3exswWWDD1t/pex20cejVyOLRg4fnN4bu+70zIDTwMmVkunHFgTAcGryWmBX22xDSCGJz87nROtRD0pGjTBNMVYU4fS6QCGJwXiPX0QHFc5od9olyhGZBOIZFX9jS+B7FCyAVRR3elSSa+Bylf9vbkkTUh3k5m0ScKMJ8gFrAy/hkXaps/ga3zibNozYYnxAwH9N8lrFhRJe4+Z8OVFA8b9vvHt9VrHPbRpH/bG1OmyX7aoBTWsjDCZIRjcSw0Mj2FCHLt6aLfLl+hnWFzORD W+WDhK4m snf6oHmvpgv7H5/ffAEnB+bwHLTU4JEfQLby6JjMDqItLge6nY6KVR1UwanBNYIVpr0coWKfi6/0E+4l/nM77LQdwU2JeeGw7/jlly6wQwXz+B4O/9fkuDdea4s7XJ87HxGIT2Ts5TRHCWfyVIEfiQ/okCmkzNEpyy3j6010GnewR0JilLbgIjMW9B3+6LxossEh+ywqAWEoBZ6vul7BqGMu8y/84C+5eftcm7XJlXCrN5UyYLtABOI2jyDy01a8UHZku3Ug/v2P9BRKfNqrzflpfS/suiMfWyrou7DmJG2NtERCtXyh1gd/V4DlrBo6bNodKx+LrhVfM+Eqc94/L61JvTj0AhhdzRmsBG49dEetJIo+7+CQkmlBL31qX8/Qs1KPpIWy6E2XZBYbwY/+uVHLprTB7yO0WAzrCVwU1fLE0jTaI5SxkQ+FQU5cWV5Kx7av4ZkguJIR/Q/jTMM/6sDTPEU+aLCWGLiMeXs0LuwLG7Cn2fFtvC7mlE/pflAKY4NKRCcBXpD1q4ez9qf7KN5RkPK0M0k2InoTFSoTmxuN/NZ0= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 8/15/24 8:16 PM, Pavel Begunkov wrote: > On 8/16/24 03:08, Ming Lei wrote: >> On Fri, Aug 16, 2024 at 02:59:49AM +0100, Pavel Begunkov wrote: >>> On 8/16/24 02:45, Ming Lei wrote: >>>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: >>>>> On 8/15/24 5:44 PM, Ming Lei wrote: >>>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >>>>>>> On 8/15/24 15:33, Jens Axboe wrote: >>>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov wrote: >>>>>>>>>> >>>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>>>>>>> asynchronous discard. Normally, it first tries to execute the command >>>>>>>>>> from non-blocking context, which we limit to a single bio because >>>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>>>>>>> it in a blocking context. >>>>>>>>>> >>>>>>>>>> Suggested-by: Conrad Meyer >>>>>>>>>> Signed-off-by: Pavel Begunkov >>>>>>>>>> --- >>>>>>>>>>     block/blk.h             |  1 + >>>>>>>>>>     block/fops.c            |  2 + >>>>>>>>>>     block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>>>>>>>     include/uapi/linux/fs.h |  2 + >>>>>>>>>>     4 files changed, 99 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/block/blk.h b/block/blk.h >>>>>>>>>> index e180863f918b..5178c5ba6852 100644 >>>>>>>>>> --- a/block/blk.h >>>>>>>>>> +++ b/block/blk.h >>>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>>>>>>>     int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>>>>>>>                    loff_t lstart, loff_t lend); >>>>>>>>>>     long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>>>>>>>     long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>>> >>>>>>>>>>     extern const struct address_space_operations def_blk_aops; >>>>>>>>>> diff --git a/block/fops.c b/block/fops.c >>>>>>>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>>>>>>> --- a/block/fops.c >>>>>>>>>> +++ b/block/fops.c >>>>>>>>>> @@ -17,6 +17,7 @@ >>>>>>>>>>     #include >>>>>>>>>>     #include >>>>>>>>>>     #include >>>>>>>>>> +#include >>>>>>>>>>     #include "blk.h" >>>>>>>>>> >>>>>>>>>>     static inline struct inode *bdev_file_inode(struct file *file) >>>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>>>>>>>            .splice_read    = filemap_splice_read, >>>>>>>>>>            .splice_write   = iter_file_splice_write, >>>>>>>>>>            .fallocate      = blkdev_fallocate, >>>>>>>>>> +       .uring_cmd      = blkdev_uring_cmd, >>>>>>>>> >>>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>>>>>>> discard to block device, why is .uring_cmd added for this purpose? >>>>>>> >>>>>>> Which is a good question, I haven't thought about it, but I tend to >>>>>>> agree with Jens. Because vfs_fallocate is created synchronous >>>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >>>>>>> Probably can be patched up, which would  involve changing the >>>>>>> fops->fallocate protot, but I'm not sure async there makes sense >>>>>>> outside of bdev (?), and cmd approach is simpler, can be made >>>>>>> somewhat more efficient (1 less layer in the way), and it's not >>>>>>> really something completely new since we have it in ioctl. >>>>>> >>>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, >>>>>> same with blkdev_fallocate(). >>>>>> >>>>>> But this patch drops this exclusive lock, so it becomes async friendly, >>>>>> but may cause stale page cache. However, if the lock is required, it can't >>>>>> be efficient anymore and io-wq may be inevitable, :-) >>>>> >>>>> If you want to grab the lock, you can still opportunistically grab it. >>>>> For (by far) the common case, you'll get it, and you can still do it >>>>> inline. >>>> >>>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync >>>> interface cause there is at most one async discard cmd in-flight for each >>>> device. >>>> >>>> Meantime the handling has to move to io-wq for avoiding to block current >>>> context, the interface becomes same with IORING_OP_FALLOCATE? >>> >>> Right, and agree that we can't trylock because we'd need to keep it >>> locked until IO completes, at least the sync versions does that. >>> >>> But I think *invalidate_pages() in the patch should be enough. That's >>> what the write path does, so it shouldn't cause any problem to the >>> kernel. As for user space, that'd be more relaxed than the ioctl, >>> just as writes are, so nothing new to the user. I hope someone with >>> better filemap understanding can confirm it (or not). >> >> I may not be familiar with filemap enough, but looks *invalidate_pages() >> is only for removing pages from the page cache range, and the lock is added >> for preventing new page cache read from being started, so stale data read >> can be avoided when DISCARD is in-progress. > > Sounds like it, but the point is it's the same data race for the > user as if it would've had a write in progress. Right, which is why it should not matter. I think it's pretty silly to take the sync implementation as gospel here, assuming that the original author knew what they were doing in full detail. It just needs proper documenting. -- Jens Axboe