From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: linux-block@vger.kernel.org, target-devel@vger.kernel.org
Subject: Question about how to map io request sg pages to user space
Date: Fri, 18 Feb 2022 15:15:20 +0800

hi,

I have some questions about how to map block device io requests' pages to
user space, and I could use your help. Thanks in advance.

Let me give a brief introduction first. One of our customers uses tcm_loop &
tcmu to export a virtual block device to user space; tcm_loop and tcmu belong
to the scsi/target subsystem. This virtual block device has a user-space
backend, which visits a remote distributed filesystem to complete io requests.
The data flow looks like below:
  1) client app issues an io request to this virtual block device.
  2) tcm_loop & tcmu are kernel modules; they handle io requests.
  3) tcmu maintains an internal data area, which is in fact an xarray managing
     kernel pages. tcmu allocates kernel pages to the data area and copies the
     io requests' sg pages to the data area's kernel pages (a rough sketch of
     this copy follows below the list).
  4) tcmu maps the data area's kernel pages to user space, then the tcmu user
     space backend can read or fill the mmaped user space area.
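Roughly, the copy in step 3 amounts to something like the sketch below. This
is only illustrative: the function name, the data_area_pages array and its
one-page-per-segment indexing are made up here, and tcmu's real code differs;
the point is just the extra memcpy per io request.

    #include <linux/highmem.h>
    #include <linux/scatterlist.h>
    #include <linux/string.h>

    /*
     * Illustrative sketch only (not tcmu's actual code): copy every sg
     * segment of an io request into separately allocated data area pages.
     * Assumes, for simplicity, one data area page per sg segment and that
     * a segment does not cross a page boundary.
     */
    static void copy_sg_to_data_area(struct scatterlist *sgl, int sg_count,
                                     struct page **data_area_pages)
    {
            struct scatterlist *sg;
            int i;

            for_each_sg(sgl, sg, sg_count, i) {
                    void *src = kmap_local_page(sg_page(sg));
                    void *dst = kmap_local_page(data_area_pages[i]);

                    /* one full data copy per segment, on every io request */
                    memcpy(dst + sg->offset, src + sg->offset, sg->length);

                    kunmap_local(dst);
                    kunmap_local(src);
            }
    }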

But this solution has obvious overhead: allocating tcmu data area pages and one
extra copy, which results in a tcmu throughput bottleneck. So I am trying to
map block device io requests' sg pages to user space directly, which I believe
can improve tcmu throughput. Currently I have implemented two prototypes:

Solution 1:
use vm_insert_pages, which is similar to what tcp getsockopt(TCP_ZEROCOPY_RECEIVE) does.
But there are two restrictions:
  1. anonymous pages can not be mmaped to user space
     ==> vm_insert_pages
     ====> insert_pages
     ======> insert_page_in_batch_locked
     ========> validate_page_before_insert
     In validate_page_before_insert(), anonymous pages can not be mapped to
     user space; we know that when issuing direct io to a block device, the io
     requests' sg pages may be anonymous pages.
         if (PageAnon(page) || PageSlab(page) || page_has_type(page))
             return -EINVAL;
     I wonder why there is such a restriction? For safety reasons?
     (A rough sketch of how this prototype calls vm_insert_pages is at the end
     of this solution, below.)

  2. warn_on triggered in __folio_mark_dirty
     When tcmu does zap_page_range after the user space backend completes the
     io, a warn_on is triggered in __folio_mark_dirty:
         if (folio->mapping) {   /* Race with truncate? */
             WARN_ON_ONCE(warn && !folio_test_uptodate(folio));

     I'm not familiar with folios yet, but I think the reason is that when
     issuing a buffered read to the block device, it is the page cache that
     gets mapped to user space, and initially those pages are newly allocated,
     hence the uptodate flag is not set. In zap_pte_range, there is such code:
         if (!PageAnon(page)) {
             if (pte_dirty(ptent)) {
                 force_flush = 1;
                 set_page_dirty(page);
             }
     So the set_page_dirty() issued here eventually reaches __folio_mark_dirty()
     on a folio that is not yet uptodate, and this warn_on is reasonable.
Indeed, what I want is just to map io request sg pages to the tcmu user space
backend, so that the backend can read or write data in the mapped area; I don't
want to care about the page or its mapping status, so I chose to use
remap_pfn_range.
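Before moving on, for reference, prototype 1 drove vm_insert_pages roughly like
the sketch below. The function name and the way the scatterlist is obtained
from tcmu are made up for illustration; the relevant part is the single
vm_insert_pages call and the fact that it fails with -EINVAL once an anonymous
page is in the list.

    #include <linux/mm.h>
    #include <linux/scatterlist.h>

    /*
     * Sketch of prototype 1 (illustrative only): insert an io request's
     * sg pages directly into the backend's vma.  Assumes at most 128
     * segments; a real implementation would not put this on the stack.
     */
    static int map_sg_pages_to_user(struct vm_area_struct *vma,
                                    unsigned long uaddr,
                                    struct scatterlist *sgl, int sg_count)
    {
            struct page *pages[128];        /* enough for a 512KB request */
            struct scatterlist *sg;
            unsigned long nr = 0;
            int i;

            for_each_sg(sgl, sg, sg_count, i)
                    pages[nr++] = sg_page(sg);

            /*
             * validate_page_before_insert() rejects anonymous pages, so
             * for direct io this returns -EINVAL (restriction 1 above).
             */
            return vm_insert_pages(vma, uaddr, pages, &nr);
    }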

Solution 2: use remap_pfn_range.
remap_pfn_range works well, but it has fairly obvious overhead. A 512KB io
request has 128 pages, and usually these 128 pages' pfns are not consecutive,
so in the worst case I'd need to issue 128 calls to remap_pfn_range for a
single 512KB io request, which is horrible. And in remap_pfn_range, if the x86
page attribute table (PAT) feature is enabled, lookup_memtype called by
track_pfn_remap() also introduces obvious overhead.
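Prototype 2 therefore ends up doing roughly the following; again the function
name and scatterlist plumbing are only illustrative. The point is the per-page
remap_pfn_range call, up to 128 of them per 512KB request, each going through
track_pfn_remap()/lookup_memtype() when PAT is enabled.

    #include <linux/mm.h>
    #include <linux/scatterlist.h>

    /*
     * Sketch of prototype 2 (illustrative only): one remap_pfn_range()
     * call per non-contiguous page.  Assumes each sg segment is a single
     * page, for simplicity.
     */
    static int remap_sg_pages_to_user(struct vm_area_struct *vma,
                                      unsigned long uaddr,
                                      struct scatterlist *sgl, int sg_count)
    {
            struct scatterlist *sg;
            int i, ret;

            for_each_sg(sgl, sg, sg_count, i) {
                    ret = remap_pfn_range(vma, uaddr, page_to_pfn(sg_page(sg)),
                                          PAGE_SIZE, vma->vm_page_prot);
                    if (ret)
                            return ret;
                    uaddr += PAGE_SIZE;
            }
            return 0;
    }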

Finally, my question is: is there any simple and efficient helper to map block
device sg pages to user space? It would accept an array of pages as a
parameter, anonymous pages could be mapped to user space, and the pages would
be installed as special ptes (pte_special returns true), so vm_normal_page
returns NULL and the above warn_on won't trigger. Does this sound reasonable?
I'm not a qualified mm developer, but if you think such a new helper is
reasonable, I can try to add one, thanks.
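To make the request concrete, the helper I have in mind would look something
like the strawman below; the name and signature are just my suggestion, not an
existing API:

    /*
     * Strawman, not existing kernel code: insert an array of pages into a
     * user vma as special ptes, so that pte_special() is true for them,
     * vm_normal_page() returns NULL, no rmap or dirty accounting is done,
     * and anonymous pages are accepted.
     */
    int vm_insert_pages_special(struct vm_area_struct *vma, unsigned long addr,
                                struct page **pages, unsigned long *num);

tcmu could then call it once per io request and simply zap the range when the
command completes, without touching the pages' dirty state.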


Regards,
Xiaoguang Wang