From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: linux-block@vger.kernel.org, target-devel@vger.kernel.org
Subject: Question about how to map io request sg pages to user space
Date: Fri, 18 Feb 2022 15:15:20 +0800

hi,

I have some questions about how to map block device io requests' pages to
user space, and I could use your help. Thanks in advance.

Let me give a brief introduction first. One of our customers uses tcm_loop &
tcmu to export a virtual block device to user space; tcm_loop and tcmu belong
to the scsi/target subsystem. This virtual block device has a user-space
backend, which visits a remote distributed filesystem to complete io requests.
The data flow looks like below:
  1) client app issues an io request to this virtual block device.
  2) tcm_loop & tcmu are kernel modules; they handle io requests.
  3) tcmu maintains an internal data area, which is in fact an xarray managing
     kernel pages. tcmu allocates kernel pages to the data area and copies the
     io requests' sg pages to the data area's kernel pages (a rough sketch of
     this copy follows below the list).
  4) tcmu maps the data area's kernel pages to user space, then the tcmu user
     space backend can read or fill the mmaped user space area.
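Roughly, the copy in step 3 amounts to something like the sketch below. This
is only illustrative: the function name, the data_area_pages array and its
one-page-per-segment indexing are made up here, and tcmu's real code differs;
the point is just the extra memcpy per io request.

    #include <linux/highmem.h>
    #include <linux/scatterlist.h>
    #include <linux/string.h>

    /*
     * Illustrative sketch only (not tcmu's actual code): copy every sg
     * segment of an io request into separately allocated data area pages.
     * Assumes, for simplicity, one data area page per sg segment and that
     * a segment does not cross a page boundary.
     */
    static void copy_sg_to_data_area(struct scatterlist *sgl, int sg_count,
                                     struct page **data_area_pages)
    {
            struct scatterlist *sg;
            int i;

            for_each_sg(sgl, sg, sg_count, i) {
                    void *src = kmap_local_page(sg_page(sg));
                    void *dst = kmap_local_page(data_area_pages[i]);

                    /* one full data copy per segment, on every io request */
                    memcpy(dst + sg->offset, src + sg->offset, sg->length);

                    kunmap_local(dst);
                    kunmap_local(src);
            }
    }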

But this solution has obvious overhead: allocating tcmu data area pages and one
extra copy, which results in a tcmu throughput bottleneck. So I am trying to
map block device io requests' sg pages to user space directly, which I believe
can improve tcmu throughput. Currently I have implemented two prototypes:

Solution 1:
use vm_insert_pages, which is similar to what tcp getsockopt(TCP_ZEROCOPY_RECEIVE) does.
But there are two restrictions:
  1. anonymous pages can not be mmaped to user space
     ==> vm_insert_pages
     ====> insert_pages
     ======> insert_page_in_batch_locked
     ========> validate_page_before_insert
     In validate_page_before_insert(), anonymous pages can not be mapped to
     user space; we know that when issuing direct io to a block device, the io
     requests' sg pages may be anonymous pages.
         if (PageAnon(page) || PageSlab(page) || page_has_type(page))
             return -EINVAL;
     I wonder why there is such a restriction? For safety reasons?
     (A rough sketch of how this prototype calls vm_insert_pages is at the end
     of this solution, below.)

  2. warn_on triggered in __folio_mark_dirty
     When tcmu does zap_page_range after the user space backend completes the
     io, a warn_on is triggered in __folio_mark_dirty:
         if (folio->mapping) {   /* Race with truncate? */
             WARN_ON_ONCE(warn && !folio_test_uptodate(folio));

     I'm not familiar with folios yet, but I think the reason is that when
     issuing a buffered read to the block device, it is the page cache that
     gets mapped to user space, and initially those pages are newly allocated,
     hence the uptodate flag is not set. In zap_pte_range, there is such code:
         if (!PageAnon(page)) {
             if (pte_dirty(ptent)) {
                 force_flush = 1;
                 set_page_dirty(page);
             }
     So the set_page_dirty() issued here eventually reaches __folio_mark_dirty()
     on a folio that is not yet uptodate, and this warn_on is reasonable.
Indeed, what I want is just to map io request sg pages to the tcmu user space
backend, so that the backend can read or write data in the mapped area; I don't
want to care about the page or its mapping status, so I chose to use
remap_pfn_range.
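Before moving on, for reference, prototype 1 drove vm_insert_pages roughly like
the sketch below. The function name and the way the scatterlist is obtained
from tcmu are made up for illustration; the relevant part is the single
vm_insert_pages call and the fact that it fails with -EINVAL once an anonymous
page is in the list.

    #include <linux/mm.h>
    #include <linux/scatterlist.h>

    /*
     * Sketch of prototype 1 (illustrative only): insert an io request's
     * sg pages directly into the backend's vma.  Assumes at most 128
     * segments; a real implementation would not put this on the stack.
     */
    static int map_sg_pages_to_user(struct vm_area_struct *vma,
                                    unsigned long uaddr,
                                    struct scatterlist *sgl, int sg_count)
    {
            struct page *pages[128];        /* enough for a 512KB request */
            struct scatterlist *sg;
            unsigned long nr = 0;
            int i;

            for_each_sg(sgl, sg, sg_count, i)
                    pages[nr++] = sg_page(sg);

            /*
             * validate_page_before_insert() rejects anonymous pages, so
             * for direct io this returns -EINVAL (restriction 1 above).
             */
            return vm_insert_pages(vma, uaddr, pages, &nr);
    }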

Solution 2: use remap_pfn_range.
remap_pfn_range works well, but it has fairly obvious overhead. A 512KB io
request has 128 pages, and usually these 128 pages' pfns are not consecutive,
so in the worst case I'd need to issue 128 calls to remap_pfn_range for a
single 512KB io request, which is horrible. And in remap_pfn_range, if the x86
page attribute table (PAT) feature is enabled, lookup_memtype called by
track_pfn_remap() also introduces obvious overhead.
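Prototype 2 therefore ends up doing roughly the following; again the function
name and scatterlist plumbing are only illustrative. The point is the per-page
remap_pfn_range call, up to 128 of them per 512KB request, each going through
track_pfn_remap()/lookup_memtype() when PAT is enabled.

    #include <linux/mm.h>
    #include <linux/scatterlist.h>

    /*
     * Sketch of prototype 2 (illustrative only): one remap_pfn_range()
     * call per non-contiguous page.  Assumes each sg segment is a single
     * page, for simplicity.
     */
    static int remap_sg_pages_to_user(struct vm_area_struct *vma,
                                      unsigned long uaddr,
                                      struct scatterlist *sgl, int sg_count)
    {
            struct scatterlist *sg;
            int i, ret;

            for_each_sg(sgl, sg, sg_count, i) {
                    ret = remap_pfn_range(vma, uaddr, page_to_pfn(sg_page(sg)),
                                          PAGE_SIZE, vma->vm_page_prot);
                    if (ret)
                            return ret;
                    uaddr += PAGE_SIZE;
            }
            return 0;
    }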

Finally, my question is: is there any simple and efficient helper to map block
device sg pages to user space? It would accept an array of pages as a
parameter, anonymous pages could be mapped to user space, and the pages would
be installed as special ptes (pte_special returns true), so vm_normal_page
returns NULL and the above warn_on won't trigger. Does this sound reasonable?
I'm not a qualified mm developer, but if you think such a new helper is
reasonable, I can try to add one, thanks.
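To make the request concrete, the helper I have in mind would look something
like the strawman below; the name and signature are just my suggestion, not an
existing API:

    /*
     * Strawman, not existing kernel code: insert an array of pages into a
     * user vma as special ptes, so that pte_special() is true for them,
     * vm_normal_page() returns NULL, no rmap or dirty accounting is done,
     * and anonymous pages are accepted.
     */
    int vm_insert_pages_special(struct vm_area_struct *vma, unsigned long addr,
                                struct page **pages, unsigned long *num);

tcmu could then call it once per io request and simply zap the range when the
command completes, without touching the pages' dirty state.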


Regards,
Xiaoguang Wang