From: "Figo.zhang"
Date: Thu, 18 Jan 2018 13:27:24 +0800
Subject: Re: [LSF/MM TOPIC] A high-performance userspace block driver
To: Matthew Wilcox
Cc: lsf-pc@lists.linux-foundation.org, Linux MM, linux-fsdevel, linux-block@vger.kernel.org

2018-01-16 22:52 GMT+08:00 Matthew Wilcox:
>
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
>
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
>
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.
How about SPDK?

> For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
>
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
>
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
>
> It's either that or make pipes seekable ...
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org
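The reason a non-seekable pipe forces one pipe per outstanding command is that a bare byte stream carries no offsets, so the only way to associate data with a block address is to dedicate a stream to each request. An nbd-style scheme sidesteps this by embedding the offset in every request header, so any number of commands can share one stream. A minimal sketch of that idea (the wire format here is hypothetical, not nbd's actual protocol, and the "device" is just an in-memory buffer):

```python
import io
import struct

# Hypothetical fixed-size request header: opcode (1 byte), byte offset
# (8 bytes), length (4 bytes), big-endian.  Carrying the offset in each
# request is what lets a non-seekable stream serve a seekable device.
HDR = struct.Struct(">BQI")
OP_READ, OP_WRITE = 0, 1

class UserspaceBlockDevice:
    """Toy userspace block device backed by an in-memory buffer."""

    def __init__(self, size):
        self.backing = io.BytesIO(b"\x00" * size)

    def handle(self, request):
        """Decode one request and return the response payload."""
        op, offset, length = HDR.unpack_from(request)
        self.backing.seek(offset)       # per-request seek: no shared cursor
        if op == OP_READ:
            return self.backing.read(length)
        if op == OP_WRITE:
            self.backing.write(request[HDR.size:HDR.size + length])
            return b""
        raise ValueError("unknown opcode")

dev = UserspaceBlockDevice(4096)
dev.handle(HDR.pack(OP_WRITE, 512, 4) + b"data")
assert dev.handle(HDR.pack(OP_READ, 512, 4)) == b"data"
```

Because each request is self-describing, a few hundred of them can be in flight on a single stream, with completions matched back by a tag rather than by which pipe they arrived on.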