From: ebiederm@xmission.com (Eric W. Biederman)
To: David Howells
Cc: ksummit-discuss@lists.linuxfoundation.org, "Serge E. Hallyn"
Subject: Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
Date: Fri, 24 Jul 2015 11:58:00 -0500
Message-ID: <87twst8pd3.fsf@x220.int.ebiederm.org>
In-Reply-To: <22483.1437754245@warthog.procyon.org.uk> (David Howells's message of "Fri, 24 Jul 2015 17:10:45 +0100")
References: <28240.1437753683@warthog.procyon.org.uk> <22483.1437754245@warthog.procyon.org.uk>

David Howells writes:

> [With Miklós's email address fixed]
>
> I would like to propose a technical session on filesystem unioning.
> There are a number of issues:
>
>  (1) Whiteouts.
>
>      Linus's idea that a union layer or overlay mounted not as part of a
>      union but separately, should expose whiteouts as 0,0 chardevs.
>      Whilst this might indeed make the backup tools easier as things like
>      tar can then use the stat() and mknod() interfaces rather than
>      having to use special ioctls or syscalls, Miklós's idea to implement
>      them as actual 0,0 chardevs in the underlying filesystem incurs some
>      problems:
>
>      (a) It's slow and resource intensive.
>
>          Every whiteout requires an inode to represent it.
>          This means that if you, say, have a directory in the lower
>          layer that has a few thousand inodes in it and you delete them
>          all, you then eat up inode table space in the upper layer.
>
>          Further, every chardev inode has to be stat'd to see if it is
>          really a whiteout.
>
>      (b) It has provided lock ordering issues in overlayfs directory
>          reading because overlayfs has to stat each chardev from within
>          the directory iterator.
>
>      I have patches to make Ext2 and JFFS2 use special directory entries
>      labelled with DT_WHITEOUT and no inode.  This is more space
>      efficient and faster and can be extended to Ext3 and Ext4.  XFS has
>      constants defined for doing similar.
>
>      I would propose that we change overlayfs to do this.
>
>      Unfortunately, we would still have to support the then obsolete 0,0
>      chardevs on disk.
>
>      The stat() and mknod() syscalls would then have to present these
>      objects to the user as 0,0 chardevs rather than ENOENT errors.  To
>      do this it might be necessary to have a special mount flag to turn
>      off the translation to DENTRY_WHITEOUT_TYPE dentries and record them
>      as DENTRY_SPECIAL_TYPE instead with an in-memory inode struct
>      showing it to be 0,0 chardevs.
>
>      David Woodhouse did make an additional suggestion that would make
>      0,0 chardevs less space inefficient - and that's to hard link a
>      reserved inode.
>
>  (2) Opaque inodes.
>
>      Should we use an xattr to mark inodes as opaque or should we use an
>      inode flag?  I have patches to add such an inode flag for Ext2 and
>      JFFS2.  Marking the inode would be more space and time efficient.
>
>  (3) Fall-through markers.
>
>      Unionmount - and possibly other filesystem unioning systems -
>      perform directory integration on disk.  (Note that overlayfs
>      maintains this in memory for the lifetime of a directory inode).
>
>      With unionmount, an integrated directory is marked as being opaque
>      with special directory entries of type DT_FALLTHRU indicating where
>      there is stuff in lower layers that can be accessed.
>
>      Should we, perhaps, declare that the user sees such markers as 0,1
>      chardevs when the layer is not mounted as part of a union?
>
>  (4) Unionmount and other filesystem unioning systems.
>
>      Do we want to add other filesystem unioning systems into the kernel?
>      I've brought in a lot of the stuff for unionmount to help support
>      overlayfs.  Unfortunately, overlayfs interferes with some of the
>      stuff that unionmount wants to do (e.g. doing whiteouts differently
>      and in an awkward manner).
>
>  (5) Lack of POSIX characteristics.
>
>      There have been complaints that overlayfs isn't sufficiently POSIX
>      like.  Now, this is by design on the part of overlayfs and I agree
>      with the Miklós that this is the right way to do it.  However, some
>      mitigation might be required.
>
>      One of the most annoying features is the fact that if you do:
>
>          fd1 = open("foo", O_RDONLY);
>          fd2 = open("foo", O_RDWR);
>
>      then fd1 and fd2 don't necessarily point to the same file.
>
>      I have been given patches by Ratna Bolla that speculatively copy the
>      file into the overlayfs file inode as the pages are accessed and
>      direct file accesses to the overlay inode rather than one of the two
>      layers.  I saw a number of problems with the approach, but it's
>      possible his latest patch fixes them.
>
>  (6) File-by-file waiver of unioning.
>
>      Jan Olszak has requested that it be possible to mark files in one of
>      the layers to suppress copy up on that file and to direct writes to
>      the lower layer.  This causes problems with rename however.
>
>  (7) File locking and notifications.
>
>      These are similar issues.  IIRC, we decided at the Filesystem Summit
>      that you get to take locks on the union inode only and that the
>      notifications only follow changes to the upper layer.
>      This means that you don't get union/union interactions through a
>      common lower layer.
>
>      However, we've since had complaints that tail doesn't follow changes
>      made to the lower layer (from James Harvey).
>
>  (8) LSMs and unions/overlays.
>
>      Path-based LSMs should just work now that file->f_path points to the
>      union layer inode, though they may require namespace awareness.
>
>      Label-based LSMs are another matter.  file->f_path.dentry->d_inode
>      points to the top layer label and file->f_inode points to the lower
>      layer label.  Currently the user of the overlay can 'see through'
>      the overlay and access lower files in terms of the labels from the
>      lower layer when doing file operations, but uses the label from the
>      upper layer when doing inode operations.  I think this should be
>      consistent and should only use the upper layer label.  I'm working
>      on patches to get this to work, but there is dissension over which
>      label should be seen.
>
>      Further, mandating that the upper label should be seen does cause
>      unionmount a problem as there's no upper inode to hang the label
>      off.  This means that the label must be forged anew each time it is
>      required until at such time a copy-up is effected.
>

 (9) Unprivileged mounts

     As there are no backing store issues it should be a tractable problem
     to get the semantics right to allow containers to use overlayfs.  A
     naive attempt was made by Serge Hallyn and he ran into security
     issues with copy-up.

     Can copy-up be made safe if unprivileged users (AKA user namespace
     root users) mount overlayfs?

     I think that also intersects with your LSM label handling issues.

Eric
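
For illustration, the 0,0 chardev representation discussed under (1) is
what lets a tool such as tar recognise a whiteout with nothing but the
stat() family.  The following is only a minimal userspace sketch in C,
assuming Linux with glibc and a layer that is mounted on its own rather
than as part of a union.  The helper names stat_is_whiteout() and
path_is_whiteout() are hypothetical and are not part of any kernel or
libc API.

```c
#define _DEFAULT_SOURCE         /* expose lstat() and S_IFCHR under strict -std= modes */
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>      /* major(), minor(), makedev() on glibc */

/*
 * Hypothetical helper: returns true if the stat buffer describes a
 * character device with device number 0,0, i.e. the on-disk whiteout
 * representation discussed in item (1).
 */
bool stat_is_whiteout(const struct stat *st)
{
    return S_ISCHR(st->st_mode) &&
           major(st->st_rdev) == 0 &&
           minor(st->st_rdev) == 0;
}

/*
 * Hypothetical helper: lstat() the path (without following symlinks)
 * and classify it.  Returns 1 for a whiteout, 0 for anything else and
 * -1 if lstat() itself fails (errno is left as set by lstat()).
 */
int path_is_whiteout(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return -1;
    return stat_is_whiteout(&st) ? 1 : 0;
}
```

The check costs one lstat() per candidate entry, which is the per-inode
stat overhead that (1)(a) above complains about.  A DT_WHITEOUT directory
entry would avoid it, but nothing in this sketch models that case.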