From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3D03EC433F5 for ; Wed, 23 Mar 2022 14:33:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B985C6B0072; Wed, 23 Mar 2022 10:33:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B46666B0073; Wed, 23 Mar 2022 10:33:23 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9C14D6B0074; Wed, 23 Mar 2022 10:33:23 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0203.hostedemail.com [216.40.44.203]) by kanga.kvack.org (Postfix) with ESMTP id 87BDE6B0072 for ; Wed, 23 Mar 2022 10:33:23 -0400 (EDT) Received: from smtpin22.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 20660A5BC8 for ; Wed, 23 Mar 2022 14:33:23 +0000 (UTC) X-FDA: 79275893886.22.4447940 Received: from out30-43.freemail.mail.aliyun.com (out30-43.freemail.mail.aliyun.com [115.124.30.43]) by imf11.hostedemail.com (Postfix) with ESMTP id 4105540012 for ; Wed, 23 Mar 2022 14:33:20 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R781e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04395;MF=xiaoguang.wang@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0V80FrRj_1648045993; Received: from 30.225.28.175(mailfrom:xiaoguang.wang@linux.alibaba.com fp:SMTPD_---0V80FrRj_1648045993) by smtp.aliyun-inc.com(127.0.0.1); Wed, 23 Mar 2022 22:33:14 +0800 Message-ID: <57da7e54-f582-3b10-52a9-5166adacf4e6@linux.alibaba.com> Date: Wed, 23 Mar 2022 22:33:13 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: [RFC 3/3] scsi: target: tcmu: Support zero copy Content-Language: en-US To: Bodo Stroesser , linux-mm@kvack.org, target-devel@vger.kernel.org, linux-scsi@vger.kernel.org Cc: linux-block@vger.kernel.org, xuyu@linux.alibaba.com References: <20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com> <20220318095531.15479-4-xiaoguang.wang@linux.alibaba.com> <94b00e49-5efb-658f-3142-42e7cc551d19@gmail.com> From: Xiaoguang Wang In-Reply-To: <94b00e49-5efb-658f-3142-42e7cc551d19@gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed X-Rspam-User: X-Stat-Signature: h7ttshudfuxqa15o6muqjhfidumcy31d Authentication-Results: imf11.hostedemail.com; dkim=none; spf=pass (imf11.hostedemail.com: domain of xiaoguang.wang@linux.alibaba.com designates 115.124.30.43 as permitted sender) smtp.mailfrom=xiaoguang.wang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 4105540012 X-HE-Tag: 1648046000-160102 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: hi, First thanks for your kind comments, it really will make codes better. > On 18.03.22 10:55, Xiaoguang Wang wrote: >> Currently in tcmu, for READ commands, it copies user space backstore's >> data buffer to tcmu internal data area, then copies data in data area >> to READ commands sgl pages. For WRITE commands, tcmu copies sgl pages >> to tcmu internal data area, then copies data in data area to user spac= e >> backstore. For both cases, there are obvious copy overhead, which impa= ct >> io throughput, especially for large io size. >> >> To mitigate this issue, we implement zero copy feature to tcmu, which >> map sgl pages to user space backstore's address space. Currently only >> sgl pages's offset and length are both aligned to page size, can this >> command go into tcmu zero copy path. > > Apart from my further comments below: since this change implies > incompatibilities regarding timeout handling and KEEP_BUF, I think the > new feature should be configurable with in default mode being switched > off. OK, agree. > When userspace switches it on, I think > - tcmu should fix ring timeout to 0 (=3D off). No change allowed > - tcmu should either disallow KEEP_BUF at all (not set the feature bit > =C2=A0 in mailbox) or should use a per cmd bit in the ring to show, for= which > =C2=A0 cmds KEEP_BUF is possible. Sorry, since this is a RFC patch set, so I didn't consider this=20 incompatibilities well, I'll try to figure out a better method in following version. > >> >> Signed-off-by: Xiaoguang Wang >> --- >> =C2=A0 drivers/target/target_core_user.c | 257=20 >> +++++++++++++++++++++++++++++++++----- >> =C2=A0 1 file changed, 229 insertions(+), 28 deletions(-) >> >> diff --git a/drivers/target/target_core_user.c=20 >> b/drivers/target/target_core_user.c >> index 7b2a89a67cdb..4314e0c00f8e 100644 >> --- a/drivers/target/target_core_user.c >> +++ b/drivers/target/target_core_user.c >> @@ -16,6 +16,8 @@ >> =C2=A0 #include >> =C2=A0 #include >> =C2=A0 #include >> +#include >> +#include >> =C2=A0 #include >> =C2=A0 #include >> =C2=A0 #include >> @@ -72,6 +74,7 @@ >> =C2=A0=C2=A0 */ >> =C2=A0 #define DATA_PAGES_PER_BLK_DEF 1 >> =C2=A0 #define DATA_AREA_PAGES_DEF (256 * 1024) >> +#define ZC_DATA_AREA_PAGES_DEF (256 * 1024) >> =C2=A0 =C2=A0 #define TCMU_MBS_TO_PAGES(_mbs) ((size_t)_mbs << (20 - P= AGE_SHIFT)) >> =C2=A0 #define TCMU_PAGES_TO_MBS(_pages) (_pages >> (20 - PAGE_SHIFT)) >> @@ -145,9 +148,20 @@ struct tcmu_dev { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct list_head qfull_queue; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct list_head tmr_queue; >> =C2=A0 +=C2=A0=C2=A0=C2=A0 /* For zero copy handle */ >> +=C2=A0=C2=A0=C2=A0 int zc_data_area_mb; >> +=C2=A0=C2=A0=C2=A0 uint32_t zc_max_blocks; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 uint32_t dbi_max; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 uint32_t dbi_thresh; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long *data_bitmap; >> + >> +=C2=A0=C2=A0=C2=A0 struct mm_struct *vma_vm_mm; >> +=C2=A0=C2=A0=C2=A0 struct vm_area_struct *vma; >> + >> +=C2=A0=C2=A0=C2=A0 uint32_t zc_dbi_max; >> +=C2=A0=C2=A0=C2=A0 uint32_t zc_dbi_thresh; >> +=C2=A0=C2=A0=C2=A0 unsigned long *zc_data_bitmap; >> + >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct xarray data_pages; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 uint32_t data_pages_per_blk; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 uint32_t data_blk_size; >> @@ -177,6 +191,12 @@ struct tcmu_cmd { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tcmu_dev *tcmu_dev; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct list_head queue_entry; >> =C2=A0 +=C2=A0=C2=A0=C2=A0 /* For zero_copy handle */ >> +=C2=A0=C2=A0=C2=A0 struct mm_struct *vma_vm_mm; >> +=C2=A0=C2=A0=C2=A0 struct vm_area_struct *vma; >> +=C2=A0=C2=A0=C2=A0 struct iovec *iov; >> +=C2=A0=C2=A0=C2=A0 int iov_cnt; >> + > > I think, you don't need iov and iov_cnt. You can easily use the list of > dbis instead. Also, using the iovs from the ring during command > completion will not work, since the iovs might already be overwritten b= y > userspace. OK, will fix it. > >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 uint16_t cmd_id; >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Can't use se_cmd when cleanin= g up expired cmds, because if >> @@ -192,6 +212,7 @@ struct tcmu_cmd { >> =C2=A0 =C2=A0 #define TCMU_CMD_BIT_EXPIRED 0 >> =C2=A0 #define TCMU_CMD_BIT_KEEP_BUF 1 >> +#define TCMU_CMD_BIT_ZEROCOPY 2 >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long flags; >> =C2=A0 }; >> =C2=A0 @@ -495,10 +516,16 @@ static struct genl_family tcmu_genl_famil= y=20 >> __ro_after_init =3D { >> =C2=A0 static void tcmu_cmd_free_data(struct tcmu_cmd *tcmu_cmd, uint3= 2_t=20 >> len) >> =C2=A0 { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tcmu_dev *udev =3D tcmu_cmd->tcm= u_dev; >> +=C2=A0=C2=A0=C2=A0 unsigned long *data_bitmap; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 uint32_t i; >> =C2=A0 +=C2=A0=C2=A0=C2=A0 if (test_bit(TCMU_CMD_BIT_ZEROCOPY, &tcmu_c= md->flags)) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 data_bitmap =3D udev->zc_d= ata_bitmap; >> +=C2=A0=C2=A0=C2=A0 else >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 data_bitmap =3D udev->data= _bitmap; >> + >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for (i =3D 0; i < len; i++) >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 clear_bit(tcmu_cmd->dbi[i]= , udev->data_bitmap); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 clear_bit(tcmu_cmd->dbi[i]= , data_bitmap); >> =C2=A0 } >> =C2=A0 =C2=A0 static inline int tcmu_get_empty_block(struct tcmu_dev *= udev, >> @@ -549,8 +576,30 @@ static inline int tcmu_get_empty_block(struct=20 >> tcmu_dev *udev, >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return i =3D=3D page_cnt ? dbi : -1; >> =C2=A0 } >> =C2=A0 +static inline int tcmu_get_zc_empty_block(struct tcmu_dev *ude= v, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 st= ruct tcmu_cmd *tcmu_cmd, int prev_dbi, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 in= t *iov_cnt) >> +{ >> +=C2=A0=C2=A0=C2=A0 int dbi; >> + >> +=C2=A0=C2=A0=C2=A0 dbi =3D find_first_zero_bit(udev->zc_data_bitmap,=20 >> udev->zc_dbi_thresh); >> +=C2=A0=C2=A0=C2=A0 if (dbi =3D=3D udev->zc_dbi_thresh) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return -1; >> + >> +=C2=A0=C2=A0=C2=A0 if (dbi > udev->zc_dbi_max) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->zc_dbi_max =3D dbi; >> + >> +=C2=A0=C2=A0=C2=A0 set_bit(dbi, udev->zc_data_bitmap); >> +=C2=A0=C2=A0=C2=A0 tcmu_cmd_set_dbi(tcmu_cmd, dbi); >> + >> +=C2=A0=C2=A0=C2=A0 if (dbi !=3D prev_dbi + 1) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 *iov_cnt +=3D 1; >> +=C2=A0=C2=A0=C2=A0 return dbi; >> +} >> + >> =C2=A0 static int tcmu_get_empty_blocks(struct tcmu_dev *udev, >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tcmu_cmd *tcmu_cmd, int length) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tcmu_cmd *tcmu_cmd, int length, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 bool zero_copy) >> =C2=A0 { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* start value of dbi + 1 must not be a= valid dbi */ >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 int dbi =3D -2; >> @@ -559,16 +608,111 @@ static int tcmu_get_empty_blocks(struct=20 >> tcmu_dev *udev, >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for (; length > 0; length -=3D b= lk_size) { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 blk_data_len =3D= min_t(uint32_t, length, blk_size); >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 dbi =3D tcmu_get_empty_blo= ck(udev, tcmu_cmd, dbi, blk_data_len, >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 &iov_cnt)= ; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (zero_copy) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 db= i =3D tcmu_get_zc_empty_block(udev, tcmu_cmd, dbi, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 &iov_cnt); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } else { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 db= i =3D tcmu_get_empty_block(udev, tcmu_cmd, dbi, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 blk_data_len, &iov_cnt); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (dbi < 0) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 return -1; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return iov_cnt; >> =C2=A0 } >> =C2=A0 +#define TCMU_ZEROCOPY_PAGE_BATCH 32 >> + >> +static inline void tcmu_zerocopy_one_seg(struct iovec *iov, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 st= ruct vm_area_struct *vma, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 st= ruct sg_page_iter *sgiter) >> +{ >> +=C2=A0=C2=A0=C2=A0 struct page *pages[TCMU_ZEROCOPY_PAGE_BATCH]; >> +=C2=A0=C2=A0=C2=A0 unsigned int len =3D iov->iov_len; >> +=C2=A0=C2=A0=C2=A0 unsigned long address =3D (unsigned long)iov->iov_= base +=20 >> vma->vm_start; >> +=C2=A0=C2=A0=C2=A0 unsigned long pages_remaining, pg_index =3D 0; >> +=C2=A0=C2=A0=C2=A0 struct page *page; >> + >> +=C2=A0=C2=A0=C2=A0 while (len > 0) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __sg_page_iter_next(sgiter= ); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 page =3D sg_page_iter_page= (sgiter); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pages[pg_index++] =3D page= ; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 len -=3D PAGE_SIZE; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (pg_index =3D=3D TCMU_Z= EROCOPY_PAGE_BATCH || !len) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pa= ges_remaining =3D pg_index; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vm= _insert_pages_mkspecial(vma, address, pages,=20 >> &pages_remaining); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ad= dress =3D address + pg_index * PAGE_SIZE; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pg= _index =3D 0; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> +=C2=A0=C2=A0=C2=A0 } >> +} >> + >> +static long tcmu_cmd_zerocopy_map(struct tcmu_dev *udev, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tcmu_cmd *cmd, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct iovec *iov, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 int iov_cnt) >> +{ >> +=C2=A0=C2=A0=C2=A0 struct se_cmd *se_cmd =3D cmd->se_cmd; >> +=C2=A0=C2=A0=C2=A0 struct scatterlist *data_sg; >> +=C2=A0=C2=A0=C2=A0 unsigned int data_nents; >> +=C2=A0=C2=A0=C2=A0 struct iovec *tiov; >> +=C2=A0=C2=A0=C2=A0 struct sg_page_iter sgiter; >> +=C2=A0=C2=A0=C2=A0 struct vm_area_struct *vma =3D udev->vma; >> +=C2=A0=C2=A0=C2=A0 int i, ret =3D 0; > > What happens if the tcmu device currently is not open / mapped? I'm not sure how it will happen. But we may check whether udev->vma has a valid value. If yes, it'll enter tcmu_cmd_zerocopy_map(). > >> + >> +=C2=A0=C2=A0=C2=A0 mmap_read_lock(udev->vma_vm_mm); >> +=C2=A0=C2=A0=C2=A0 data_sg =3D se_cmd->t_data_sg; >> +=C2=A0=C2=A0=C2=A0 data_nents =3D se_cmd->t_data_nents; >> +=C2=A0=C2=A0=C2=A0 __sg_page_iter_start(&sgiter, data_sg, data_nents,= 0); >> +=C2=A0=C2=A0=C2=A0 tiov =3D iov; >> +=C2=A0=C2=A0=C2=A0 for (i =3D 0; i < iov_cnt; i++) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tcmu_zerocopy_one_seg(tiov= , vma, &sgiter); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tiov++; >> +=C2=A0=C2=A0=C2=A0 } >> +=C2=A0=C2=A0=C2=A0 cmd->iov =3D iov; >> +=C2=A0=C2=A0=C2=A0 cmd->iov_cnt =3D iov_cnt; > > This won't work! > > The iov's can already be overwritten when you later want to read > them. So you would have to take a real copy of the iovs. OK, will fix it. > >> +=C2=A0=C2=A0=C2=A0 cmd->vma_vm_mm =3D vma->vm_mm; >> +=C2=A0=C2=A0=C2=A0 cmd->vma =3D vma; >> +=C2=A0=C2=A0=C2=A0 mmgrab(cmd->vma_vm_mm); >> +=C2=A0=C2=A0=C2=A0 mmap_read_unlock(udev->vma_vm_mm); >> +=C2=A0=C2=A0=C2=A0 return ret; >> +} >> + >> +static void tcmu_cmd_zerocopy_unmap(struct tcmu_cmd *cmd) >> +{ >> +=C2=A0=C2=A0=C2=A0 struct mm_struct *mm; >> +=C2=A0=C2=A0=C2=A0 struct vm_area_struct *vma; >> +=C2=A0=C2=A0=C2=A0 struct iovec *iov; >> +=C2=A0=C2=A0=C2=A0 unsigned long address; >> +=C2=A0=C2=A0=C2=A0 int i; >> + >> +=C2=A0=C2=A0=C2=A0 mm =3D cmd->vma_vm_mm; >> +=C2=A0=C2=A0=C2=A0 if (!mm) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return; >> + >> +=C2=A0=C2=A0=C2=A0 vma =3D cmd->vma; >> +=C2=A0=C2=A0=C2=A0 iov =3D cmd->iov; >> +=C2=A0=C2=A0=C2=A0 if (mmget_not_zero(mm)) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 mmap_read_lock(mm); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for (i =3D 0; i < cmd->iov= _cnt; i++) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ad= dress =3D (unsigned long)iov->iov_base + vma->vm_start; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 za= p_page_range(vma, address, iov->iov_len); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 io= v++; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 mmap_read_unlock(mm); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 mmput(mm); >> +=C2=A0=C2=A0=C2=A0 } >> + >> +=C2=A0=C2=A0=C2=A0 cmd->vma_vm_mm =3D NULL; >> +=C2=A0=C2=A0=C2=A0 cmd->vma =3D NULL; >> +=C2=A0=C2=A0=C2=A0 mmdrop(mm); >> +} >> + >> =C2=A0 static inline void tcmu_free_cmd(struct tcmu_cmd *tcmu_cmd) >> =C2=A0 { >> +=C2=A0=C2=A0=C2=A0 if (test_bit(TCMU_CMD_BIT_ZEROCOPY, &tcmu_cmd->fla= gs)) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tcmu_cmd_zerocopy_unmap(tc= mu_cmd); >> + >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 kfree(tcmu_cmd->dbi); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 kmem_cache_free(tcmu_cmd_cache, tcmu_cm= d); >> =C2=A0 } >> @@ -850,37 +994,57 @@ static bool is_ring_space_avail(struct tcmu_dev=20 >> *udev, size_t cmd_size) >> =C2=A0=C2=A0 * Called with ring lock held. >> =C2=A0=C2=A0 */ >> =C2=A0 static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct=20 >> tcmu_cmd *cmd, >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 int *iov_bidi_cnt) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 int *iov_bidi_cnt, bool zero_copy) >> =C2=A0 { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 int space, iov_cnt =3D 0, ret =3D 0; >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!cmd->dbi_cnt) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 goto wr_iov_cnt= s; >> =C2=A0 -=C2=A0=C2=A0=C2=A0 /* try to check and get the data blocks as = needed */ >> -=C2=A0=C2=A0=C2=A0 space =3D spc_bitmap_free(udev->data_bitmap, udev-= >dbi_thresh); >> -=C2=A0=C2=A0=C2=A0 if (space < cmd->dbi_cnt) { >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long blocks_left = =3D >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 (udev->max_blocks - udev->dbi_thresh) + space; >> +=C2=A0=C2=A0=C2=A0 if (!zero_copy) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* try to check and get th= e data blocks as needed */ >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 space =3D spc_bitmap_free(= udev->data_bitmap, udev->dbi_thresh); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (space < cmd->dbi_cnt) = { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 un= signed long blocks_left =3D >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (udev->max_blocks - udev->d= bi_thresh) + space; >> + >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if= (blocks_left < cmd->dbi_cnt) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 pr_debug("no data space: only %lu available, but as= k=20 >> for %u\n", >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 blo= cks_left * udev->data_blk_size, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 cmd= ->dbi_cnt * udev->data_blk_size); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 return -1; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> =C2=A0 -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (blocks_left < c= md->dbi_cnt) { >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pr= _debug("no data space: only %lu available, but ask for=20 >> %u\n", >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 blocks_left * udev->data_bl= k_size, >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 cmd->dbi_cnt * udev->data_b= lk_size); >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 re= turn -1; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ud= ev->dbi_thresh +=3D cmd->dbi_cnt; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if= (udev->dbi_thresh > udev->max_blocks) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 udev->dbi_thresh =3D udev->max_blocks; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> +=C2=A0=C2=A0=C2=A0 } else { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* try to check and get th= e data blocks as needed */ >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 space =3D spc_bitmap_free(= udev->zc_data_bitmap,=20 >> udev->zc_dbi_thresh); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (space < cmd->dbi_cnt) = { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 un= signed long blocks_left =3D >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (udev->zc_max_blocks - udev= ->zc_dbi_thresh) +=20 >> space; >> + >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if= (blocks_left < cmd->dbi_cnt) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 pr_debug("no data space: only %lu available, but as= k=20 >> for %u\n", >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 blo= cks_left * udev->data_blk_size, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 cmd= ->dbi_cnt * udev->data_blk_size); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 return -1; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> =C2=A0 -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->dbi_thresh +=3D= cmd->dbi_cnt; >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (udev->dbi_thresh > ude= v->max_blocks) >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ud= ev->dbi_thresh =3D udev->max_blocks; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ud= ev->zc_dbi_thresh +=3D cmd->dbi_cnt; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if= (udev->zc_dbi_thresh > udev->zc_max_blocks) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 udev->zc_dbi_thresh =3D udev->zc_max_blocks; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> =C2=A0 -=C2=A0=C2=A0=C2=A0 iov_cnt =3D tcmu_get_empty_blocks(udev, cmd= ,=20 >> cmd->se_cmd->data_length); >> +=C2=A0=C2=A0=C2=A0 iov_cnt =3D tcmu_get_empty_blocks(udev, cmd,=20 >> cmd->se_cmd->data_length, zero_copy); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (iov_cnt < 0) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return -1; >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (cmd->dbi_bidi_cnt) { >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ret =3D tcmu_get_empty_blo= cks(udev, cmd, cmd->data_len_bidi); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ret =3D tcmu_get_empty_blo= cks(udev, cmd, cmd->data_len_bidi,=20 >> zero_copy); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (ret < 0) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 return -1; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> @@ -1021,6 +1185,7 @@ static int queue_cmd_ring(struct tcmu_cmd=20 >> *tcmu_cmd, sense_reason_t *scsi_err) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 uint32_t blk_size =3D udev->data_blk_si= ze; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* size of data buffer needed */ >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 size_t data_length =3D (size_t)tcmu_cmd= ->dbi_cnt * blk_size; >> +=C2=A0=C2=A0=C2=A0 bool zero_copy =3D false; >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 *scsi_err =3D TCM_NO_SENSE; >> =C2=A0 @@ -1044,7 +1209,22 @@ static int queue_cmd_ring(struct tcmu_cm= d=20 >> *tcmu_cmd, sense_reason_t *scsi_err) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return -1; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> =C2=A0 -=C2=A0=C2=A0=C2=A0 iov_cnt =3D tcmu_alloc_data_space(udev, tcm= u_cmd, &iov_bidi_cnt); >> +=C2=A0=C2=A0=C2=A0 if (!(se_cmd->se_cmd_flags & SCF_BIDI) && se_cmd->= data_length && >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 IS_ALIGNED(se_cmd->data_le= ngth, PAGE_SIZE)) { > > What is the problem with BIDI here? Why do you skip those cmds? No special reason, just want to be simple in RFC patch set. I will add this support in next version. > >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct scatterlist *data_s= g =3D se_cmd->t_data_sg, *sg; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned int data_nents =3D= se_cmd->t_data_nents; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 int i; >> + >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for_each_sg(data_sg, sg, d= ata_nents, i) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if= (!((!sg->offset || IS_ALIGNED(sg->offset, PAGE_SIZE)) && >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 IS_ALIGNED(sg->length, PAGE_SIZE))) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 break; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (i =3D=3D data_nents) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ze= ro_copy =3D true; >> +=C2=A0=C2=A0=C2=A0 } >> + >> +=C2=A0=C2=A0=C2=A0 iov_cnt =3D tcmu_alloc_data_space(udev, tcmu_cmd, = &iov_bidi_cnt,=20 >> zero_copy); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (iov_cnt < 0) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 goto free_and_q= ueue; >> =C2=A0 @@ -1093,7 +1273,7 @@ static int queue_cmd_ring(struct tcmu_cmd= =20 >> *tcmu_cmd, sense_reason_t *scsi_err) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tcmu_cmd_reset_dbi_cur(tcmu_cmd); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 iov =3D &entry->req.iov[0]; >> =C2=A0 -=C2=A0=C2=A0=C2=A0 if (se_cmd->data_direction =3D=3D DMA_TO_DE= VICE || >> +=C2=A0=C2=A0=C2=A0 if (((se_cmd->data_direction =3D=3D DMA_TO_DEVICE)= && !zero_copy) || >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 se_cmd->se_cmd_= flags & SCF_BIDI) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 scatter_data_ar= ea(udev, tcmu_cmd, &iov); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 else >> @@ -1111,6 +1291,11 @@ static int queue_cmd_ring(struct tcmu_cmd=20 >> *tcmu_cmd, sense_reason_t *scsi_err) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tcmu_setup_cmd_timer(tcmu_cmd, udev->cm= d_time_out,=20 >> &udev->cmd_timer); >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 entry->hdr.cmd_id =3D tcmu_cmd->= cmd_id; >> +=C2=A0=C2=A0=C2=A0 if (zero_copy) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 iov =3D &entry->req.iov[0]= ; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tcmu_cmd_zerocopy_map(udev= , tcmu_cmd, iov, entry->req.iov_cnt); > > This call maps the kernel pages from sgl to userspace no matter whether > it is a READ or WRITE type of cmd. That means, if it is READ, then you > might map old kernel data to userspace, right? Yes, I'm not sure whether it would be a security issue. Use real hardware storage will also run into this issue, if this real hardware storage=20 isn't credible. > >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 set_bit(TCMU_CMD_BIT_ZEROC= OPY, &tcmu_cmd->flags); >> +=C2=A0=C2=A0=C2=A0 } >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tcmu_hdr_set_len(&entry->hdr.len= _op, command_size); >> =C2=A0 @@ -1366,7 +1551,10 @@ static bool tcmu_handle_completion(struc= t=20 >> tcmu_cmd *cmd, >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 else >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 se_cmd->se_cmd_flags |=3D SCF_TREAT_READ_AS_NORMAL; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> -=C2=A0=C2=A0=C2=A0 if (se_cmd->se_cmd_flags & SCF_BIDI) { >> + >> +=C2=A0=C2=A0=C2=A0 if (test_bit(TCMU_CMD_BIT_ZEROCOPY, &cmd->flags)) = { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tcmu_cmd_zerocopy_unmap(cm= d); >> +=C2=A0=C2=A0=C2=A0 } else if (se_cmd->se_cmd_flags & SCF_BIDI) { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Get Data-In = buffer before clean up */ >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 gather_data_are= a(udev, cmd, true, read_len); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } else if (se_cmd->data_direction =3D=3D= DMA_FROM_DEVICE) { >> @@ -1520,6 +1708,8 @@ static void tcmu_check_expired_ring_cmd(struct=20 >> tcmu_cmd *cmd) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!time_after_eq(jiffies, cmd->deadli= ne)) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return; >> =C2=A0 +=C2=A0=C2=A0=C2=A0 if (test_bit(TCMU_CMD_BIT_ZEROCOPY, &cmd->f= lags)) >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tcmu_cmd_zerocopy_unmap(cm= d); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 set_bit(TCMU_CMD_BIT_EXPIRED, &cmd->fla= gs); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 list_del_init(&cmd->queue_entry); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 se_cmd =3D cmd->se_cmd; >> @@ -1618,6 +1808,8 @@ static struct se_device=20 >> *tcmu_alloc_device(struct se_hba *hba, const char *name) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->data_pages_per_blk =3D DATA_PAGES= _PER_BLK_DEF; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->max_blocks =3D DATA_AREA_PAGES_DE= F / udev->data_pages_per_blk; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->data_area_mb =3D TCMU_PAGES_TO_MB= S(DATA_AREA_PAGES_DEF); >> +=C2=A0=C2=A0=C2=A0 udev->zc_max_blocks =3D ZC_DATA_AREA_PAGES_DEF /=20 >> udev->data_pages_per_blk; >> +=C2=A0=C2=A0=C2=A0 udev->zc_data_area_mb =3D TCMU_PAGES_TO_MBS(ZC_DAT= A_AREA_PAGES_DEF); >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 mutex_init(&udev->cmdr_lock); >> =C2=A0 @@ -1841,6 +2033,9 @@ static void tcmu_vma_open(struct=20 >> vm_area_struct *vma) >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pr_debug("vma_open\n"); >> =C2=A0 +=C2=A0=C2=A0=C2=A0 udev->vma_vm_mm =3D vma->vm_mm; >> +=C2=A0=C2=A0=C2=A0 udev->vma =3D vma; >> +=C2=A0=C2=A0=C2=A0 mmgrab(udev->vma_vm_mm); > > What about the mmap being opened multiple times? > Up to now tcmu allows one open() only. But after open() userspace might > fork with parent and child calling mmap() independently. > AFAIK, tcmu_vma_open will then be called twice, but I don't know whethe= r > it is the same vma in both calls. I think parent and child process will have different vma address. I had=20 thought this usage before, it will result in inconsistent or bugs. But In real=20 application, even without zero copy feature, if parent and child process operate the s= ame comm ring, seems that it will also result inconsistent or bugs. I'll have a deep look at codes, thanks. Regards, Xiaoguang Wang > >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 kref_get(&udev->kref); >> =C2=A0 } >> =C2=A0 @@ -1850,6 +2045,10 @@ static void tcmu_vma_close(struct=20 >> vm_area_struct *vma) >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pr_debug("vma_close\n"); >> =C2=A0 +=C2=A0=C2=A0=C2=A0 mmdrop(udev->vma_vm_mm); >> +=C2=A0=C2=A0=C2=A0 udev->vma_vm_mm =3D NULL; >> +=C2=A0=C2=A0=C2=A0 udev->vma =3D NULL; >> + >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* release ref from tcmu_vma_open */ >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 kref_put(&udev->kref, tcmu_dev_kref_rel= ease); >> =C2=A0 } >> @@ -1901,7 +2100,7 @@ static int tcmu_mmap(struct uio_info *info,=20 >> struct vm_area_struct *vma) >> =C2=A0 { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tcmu_dev *udev =3D container_of(= info, struct tcmu_dev,=20 >> uio_info); >> =C2=A0 -=C2=A0=C2=A0=C2=A0 vma->vm_flags |=3D VM_DONTEXPAND | VM_DONTD= UMP; >> +=C2=A0=C2=A0=C2=A0 vma->vm_flags |=3D VM_DONTEXPAND | VM_DONTDUMP | V= M_MIXEDMAP; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma->vm_ops =3D &tcmu_vm_ops; >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma->vm_private_data =3D udev; >> @@ -2172,7 +2371,7 @@ static int tcmu_configure_device(struct=20 >> se_device *dev) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tcmu_dev *udev =3D TCMU_DEV(dev)= ; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct uio_info *info; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tcmu_mailbox *mb; >> -=C2=A0=C2=A0=C2=A0 size_t data_size; >> +=C2=A0=C2=A0=C2=A0 size_t data_size, zc_data_size; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 int ret =3D 0; >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ret =3D tcmu_update_uio_info(ude= v); >> @@ -2183,8 +2382,9 @@ static int tcmu_configure_device(struct=20 >> se_device *dev) >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 mutex_lock(&udev->cmdr_lock); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->data_bitmap =3D bitmap_zalloc(ude= v->max_blocks, GFP_KERNEL); >> +=C2=A0=C2=A0=C2=A0 udev->zc_data_bitmap =3D bitmap_zalloc(udev->zc_ma= x_blocks,=20 >> GFP_KERNEL); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 mutex_unlock(&udev->cmdr_lock); >> -=C2=A0=C2=A0=C2=A0 if (!udev->data_bitmap) { >> +=C2=A0=C2=A0=C2=A0 if (!udev->data_bitmap || !udev->zc_data_bitmap) { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ret =3D -ENOMEM= ; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 goto err_bitmap= _alloc; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> @@ -2201,7 +2401,8 @@ static int tcmu_configure_device(struct=20 >> se_device *dev) >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->cmdr_size =3D CMDR_SIZE; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->data_off =3D MB_CMDR_SIZE; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 data_size =3D TCMU_MBS_TO_PAGES(udev->d= ata_area_mb) << PAGE_SHIFT; >> -=C2=A0=C2=A0=C2=A0 udev->mmap_pages =3D (data_size + MB_CMDR_SIZE) >>= PAGE_SHIFT; >> +=C2=A0=C2=A0=C2=A0 zc_data_size =3D TCMU_MBS_TO_PAGES(udev->zc_data_a= rea_mb) <<=20 >> PAGE_SHIFT; >> +=C2=A0=C2=A0=C2=A0 udev->mmap_pages =3D (data_size + zc_data_size + M= B_CMDR_SIZE) >>=20 >> PAGE_SHIFT; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->data_blk_size =3D udev->data_page= s_per_blk * PAGE_SIZE; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 udev->dbi_thresh =3D 0; /* Default in I= dle state */ >> =C2=A0 @@ -2221,7 +2422,7 @@ static int tcmu_configure_device(struct=20 >> se_device *dev) >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 info->mem[0].name =3D "tcm-user = command & data buffer"; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 info->mem[0].addr =3D (phys_addr_t)(uin= tptr_t)udev->mb_addr; >> -=C2=A0=C2=A0=C2=A0 info->mem[0].size =3D data_size + MB_CMDR_SIZE; >> +=C2=A0=C2=A0=C2=A0 info->mem[0].size =3D data_size + zc_data_size + M= B_CMDR_SIZE; >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 info->mem[0].memtype =3D UIO_MEM_NONE; >> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 info->irqcontrol =3D tcmu_irqcon= trol;