From: Howard Chu
Date: Wed, 22 Jan 2014 00:13:15 -0800
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Persistent memory
To: Dave Chinner, Andy Lutomirski
Cc: Linux FS Devel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org

Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 12:59:42PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 21, 2014 at 12:36 PM, Dave Chinner wrote:
>>> On Tue, Jan 21, 2014 at 08:48:06AM -0800, Andy Lutomirski wrote:
>>>> On Tue, Jan 21, 2014 at 3:17 AM, Dave Chinner wrote:
>>>>> On Mon, Jan 20, 2014 at 11:38:16PM -0800, Howard Chu wrote:
>>>>>> Andy Lutomirski wrote:
>>>>>>> On 01/16/2014 08:17 PM, Howard Chu wrote:
>>>>>>>> Andy Lutomirski wrote:
>>>>>>>>> I'm interested in a persistent memory track. There seems to be
>>>>>>>>> plenty of other emails about this, but here's my take:
>>>>>>>>
>>>>>>>> I'm also interested in this track. I'm not up on FS development
>>>>>>>> these days; the last time I wrote filesystem code was nearly 20
>>>>>>>> years ago. But persistent memory is a topic near and dear to my
>>>>>>>> heart, and of great relevance to my current pet project, the
>>>>>>>> LMDB memory-mapped database.
>>>>>>>>
>>>>>>>> In a previous era I also developed block device drivers for
>>>>>>>> battery-backed external DRAM disks. (My ideal would have been
>>>>>>>> systems where all of RAM was persistent. I suppose we can just
>>>>>>>> about get there with mobile phones and tablets these days.)
>>>>>>>>
>>>>>>>> In the context of database engines, I'm interested in
>>>>>>>> leveraging persistent memory for write-back caching and how
>>>>>>>> user-level code can be made aware of it. (If all your cache is
>>>>>>>> persistent and guaranteed to eventually reach stable store,
>>>>>>>> then you never need to fsync() a transaction.)
>>>>>
>>>>> I don't think that is true - you're still going to need fsync to
>>>>> get the CPU to flush its caches and filesystem metadata into the
>>>>> persistent domain....
>>>>
>>>> I think that this depends on the technology in question.
>>>>
>>>> I suspect (I don't know for sure) that, if the mapping is WT or
>>>> UC, it would be possible to get the data fully flushed to
>>>> persistent storage by doing something like a UC read from any
>>>> appropriate type of I/O space (someone from Intel would have to
>>>> confirm).
>>>
>>> And what of the filesystem metadata that is necessary to reference
>>> that data? What flushes that? e.g. using mmap of sparse files to
>>> dynamically allocate persistent memory space requires fdatasync()
>>> at minimum....
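
For concreteness, the sparse-file case you're describing looks
something like this (a minimal sketch; the path is hypothetical, the
size arbitrary, and error handling elided):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/pmem/sparse.dat", O_CREAT | O_RDWR, 0600);

	/* Sparse file: this sets the size but allocates no blocks. */
	ftruncate(fd, 1 << 20);

	char *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);

	/* Store into a hole: the filesystem must allocate a block
	 * here, so this dirties metadata as well as data. */
	strcpy(p + 4096, "hello");

	/* Flushing the data alone is not enough to find that block
	 * again after a crash; fdatasync() also covers the allocation
	 * metadata needed to retrieve the data. */
	fdatasync(fd);

	munmap(p, 1 << 20);
	close(fd);
	return 0;
}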
Why are you talking about fdatasync(), which is used to *avoid*
flushing metadata? For reference, we've found that we get the highest
DB performance using ext2fs with a preallocated file. In that case we
can use fdatasync() and there are no metadata updates whatsoever. This
also means we can ignore the question of FS corruption on a crash.

>> If we're using dm-crypt with an NV-DIMM "block" device as cache and
>> a real disk as backing store, then ideally mmap would map the
>> NV-DIMM directly if the data in question lives there.
>
> dm-crypt does not use any block device as a cache. You're thinking
> about dm-cache or bcache. And neither of them operates at the
> filesystem level or is aware of the difference between filesystem
> metadata and user data.

Why should that layer need to be aware? A page is a page, as far as
they're concerned.

> But talking about non-existent block layer functionality doesn't
> answer the question about keeping user data and the filesystem
> metadata needed to reference that user data coherent in persistent
> memory...

One of the very useful tools for PCs in the '80s was reset-survivable
RAMdisks. Given the existence of persistent memory in a machine, this
is a pretty obvious feature to provide.

>> If that's happening, then, assuming that there are no metadata
>> changes, you could just flush the relevant hw caches. This assumes,
>> of course, no dm-crypt, no btrfs-style checksumming, and, in
>> general, nothing else that would require stable pages or similar
>> things.
>
> Well yes. Data IO path transformations are another reason why we'll
> need the volatile page cache involved in the persistent memory IO
> path. It follows immediately from this that applications will still
> require fsync() and other data integrity operations, because they
> have no idea where the persistence domain boundary lives in the IO
> stack.

And my point, stated a few times now, is that there should be a way
for applications to discover the existence and characteristics of
persistent memory being used in the system.

>>> And then there are things like encrypted persistent memory, which
>>> means applications can't directly access it, and so mmap() will be
>>> buffered by the page cache just like a normal block device...
>>>
>>>> All of this suggests to me that a vsyscall "sync persistent
>>>> memory" might be better than a real syscall.
>>>
>>> Perhaps, but that implies some method other than a filesystem to
>>> manage access to persistent memory.
>>
>> It should be at least as good as fdatasync if using XIP or
>> something like pmfs.
>>
>> For my intended application, I want to use pmfs or something
>> similar directly. This means that I want really fast synchronous
>> flushes, and I suspect that the usual set of fs calls that handle
>> fdatasync are already quite a bit slower than a vsyscall would be,
>> assuming that no MSR write is needed.
>
> What you are saying is that you want a fixed, allocated range of
> persistent memory mapped into the application's address space that
> you have direct control of. Yes, we can do that through the
> filesystem XIP interface (zero the file via memset() rather than via
> unwritten extents) and then fsync the file. The metadata on the file
> will then never change, and you can do what you want via mmap from
> then onwards. I'd suggest at this point that msync() is the
> operation that should then be used to flush the data pages in the
> mapped range into the persistence domain.
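
That's essentially the preallocated-file arrangement I described
above. A minimal sketch of the sequence you're suggesting (hypothetical
path, arbitrary size, error handling elided):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DB_SIZE (1UL << 30)	/* 1 GiB, arbitrary */

int main(void)
{
	int fd = open("/mnt/pmem/db.dat", O_CREAT | O_RDWR, 0600);
	ftruncate(fd, DB_SIZE);

	char *map = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);

	/* Preallocate for real: zero via memset() rather than relying
	 * on unwritten extents, so every block is allocated up front. */
	memset(map, 0, DB_SIZE);

	/* One fsync to push the now-final metadata out; it never
	 * changes again after this point. */
	fsync(fd);

	/* Steady state: data-only durability. msync() flushes the
	 * dirty range toward the persistence domain; no metadata is
	 * involved. */
	memcpy(map, "transaction payload", 20);
	msync(map, 4096, MS_SYNC);

	munmap(map, DB_SIZE);
	close(fd);
	return 0;
}

After the memset()+fsync() the metadata is frozen, so everything from
then on is a pure data-path flush - which is the property that makes
the preallocated setup attractive in the first place.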
>
>>>> For what it's worth, some of the NV-DIMM systems are supposed to
>>>> be configured in such a way that, if power fails, an NMI, SMI, or
>>>> even (not really sure) a hardwired thing in the memory controller
>>>> will trigger the requisite flush. I don't personally believe in
>>>> this if L2/L3 cache are involved (they're too big), but for the
>>>> little write buffers and memory controller things, this seems
>>>> entirely plausible.
>>>
>>> Right - at the moment we have to assume the persistence domain
>>> starts at the NVDIMM and doesn't cover the CPU's internal L*
>>> caches. I have no idea if/when we'll be seeing CPUs that have
>>> persistent caches, so we have to assume that data is still
>>> volatile and can be lost unless it has been specifically synced to
>>> persistent memory. i.e. persistent memory does not remove the need
>>> for fsync and friends...
>>
>> I have (NDAed and not entirely convincing) docs indicating a way
>> (on hardware that I don't have access to) to make the caches be
>> part of the persistence domain.
>
> Every platform will implement persistence domain management
> differently. So we can't assume that what works on one platform is
> going to work or be compatible with any other platform....
>
>> I also have non-NDA'd docs that suggest that it's really very fast
>> to flush things through the memory controller. (I would need to
>> time it, though. I do have this hardware, and it more or less
>> works.)
>
> It still takes non-zero time, so there is still scope for data loss
> on power failure, or even CPU failure.
>
> Hmmm, now there's something I hadn't really thought about - how does
> CPU failure, hotplug and/or power management affect persistence
> domains if the CPU cache contains persistent data and it's no longer
> accessible?
>
> Cheers,
>
> Dave.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/