From: Lai Jiangshan
To: "Eric W. Biederman"
Cc: ksummit-discuss@lists.linuxfoundation.org, "Serge E. Hallyn"
Date: Sat, 25 Jul 2015 23:39:50 +0800
Subject: Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
In-Reply-To: <87twst8pd3.fsf@x220.int.ebiederm.org>
References: <28240.1437753683@warthog.procyon.org.uk> <22483.1437754245@warthog.procyon.org.uk> <87twst8pd3.fsf@x220.int.ebiederm.org>

On Sat, Jul 25, 2015 at 12:58 AM, Eric W. Biederman wrote:
> David Howells writes:
>
>> [With Miklós's email address fixed]
>>
>> I would like to propose a technical session on filesystem unioning.  There are
>> a number of issues:
>>
>>  (1) Whiteouts.
>>
>>      Linus's idea that a union layer or overlay mounted not as part of a union
>>      but separately, should expose whiteouts as 0,0 chardevs.  Whilst this
>>      might indeed make the backup tools easier as things like tar can then use
>>      the stat() and mknod() interfaces rather than having to use special
>>      ioctls or syscalls, Miklós's idea to implement them as actual 0,0
>>      chardevs in the underlying filesystem incurs some problems:
>>
>>      (a) It's slow and resource intensive.
>>
>>          Every whiteout requires an inode to represent it.
>>          This means that if you, say, have a directory in the lower layer
>>          that has a few thousand inodes in it and you delete them all, you
>>          then eat up inode table space in the upper layer.
>>
>>          Further, every chardev inode has to be stat'd to see if it is really
>>          a whiteout.
>>
>>      (b) It has provided lock ordering issues in overlayfs directory reading
>>          because overlayfs has to stat each chardev from within the directory
>>          iterator.
>>
>>      I have patches to make Ext2 and JFFS2 use special directory entries
>>      labelled with DT_WHITEOUT and no inode.  This is more space efficient and
>>      faster and can be extended to Ext3 and Ext4.  XFS has constants defined
>>      for doing similar.
>>
>>      I would propose that we change overlayfs to do this.
>>
>>      Unfortunately, we would still have to support the then obsolete 0,0
>>      chardevs on disk.
>>
>>      The stat() and mknod() syscalls would then have to present these objects
>>      to the user as 0,0 chardevs rather than ENOENT errors.  To do this it
>>      might be necessary to have a special mount flag to turn off the
>>      translation to DENTRY_WHITEOUT_TYPE dentries and record them as
>>      DENTRY_SPECIAL_TYPE instead with an in-memory inode struct showing it to
>>      be 0,0 chardevs.
>>
>>      David Woodhouse did make an additional suggestion that would make 0,0
>>      chardevs less space inefficient - and that's to hard link a reserved
>>      inode.
>
>
>>  (2) Opaque inodes.
>>
>>      Should we use an xattr to mark inodes as opaque or should we use an inode
>>      flag?  I have patches to add such an inode flag for Ext2 and JFFS2.
>>      Marking the inode would be more space and time efficient.
>>
>>  (3) Fall-through markers.
>>
>>      Unionmount - and possibly other filesystem unioning systems - perform
>>      directory integration on disk.  (Note that overlayfs maintains this in
>>      memory for the lifetime of a directory inode).
>>
>>      With unionmount, an integrated directory is marked as being opaque with
>>      special directory entries of type DT_FALLTHRU indicating where there is
>>      stuff in lower layers that can be accessed.
>>
>>      Should we, perhaps, declare that the user sees such markers as 0,1
>>      chardevs when the layer is not mounted as part of a union?
>>
>>  (4) Unionmount and other filesystem unioning systems.
>>
>>      Do we want to add other filesystem unioning systems into the kernel?
>>      I've brought in a lot of the stuff for unionmount to help support
>>      overlayfs.  Unfortunately, overlayfs interferes with some of the stuff
>>      that unionmount wants to do (e.g. doing whiteouts differently and in an
>>      awkward manner).
>>
>>  (5) Lack of POSIX characteristics.
>>
>>      There have been complaints that overlayfs isn't sufficiently POSIX like.
>>      Now, this is by design on the part of overlayfs and I agree with the
>>      Miklós that this is the right way to do it.  However, some mitigation
>>      might be required.
>>
>>      One of the most annoying features is the fact that if you do:
>>
>>              fd1 = open("foo", O_RDONLY);
>>              fd2 = open("foo", O_RDWR);
>>
>>      then fd1 and fd2 don't necessarily point to the same file.
>>
>>      I have been given patches by Ratna Bolla that speculatively copy the file
>>      into the overlayfs file inode as the pages are accessed and direct file
>>      accesses to the overlay inode rather than one of the two layers.  I saw a
>>      number of problems with the approach, but it's possible his latest patch
>>      fixes them.
>>
>>  (6) File-by-file waiver of unioning.
>>
>>      Jan Olszak has requested that it be possible to mark files in one of the
>>      layers to suppress copy up on that file and to direct writes to the lower
>>      layer.  This causes problems with rename however.
>>
>>  (7) File locking and notifications.
>>
>>      These are similar issues.
>>      IIRC, we decided at the Filesystem Summit that you get to take locks
>>      on the union inode only and that the notifications only follow changes
>>      to the upper layer.  This means that you don't get union/union
>>      interactions through a common lower layer.
>>
>>      However, we've since had complaints that tail doesn't follow changes made
>>      to the lower layer (from James Harvey).
>>
>>  (8) LSMs and unions/overlays.
>>
>>      Path-based LSMs should just work now that file->f_path points to the
>>      union layer inode, though they may require namespace awareness.
>>
>>      Label-based LSMs are another matter.  file->f_path.dentry->d_inode points
>>      to the top layer label and file->f_inode points to the lower layer label.
>>      Currently the user of the overlay can 'see through' the overlay and
>>      access lower files in terms of the labels from the lower layer when doing
>>      file operations, but uses the label from the upper layer when doing inode
>>      operations.  I think this should be consistent and should only use the
>>      upper layer label.  I'm working on patches to get this to work, but there
>>      is dissension over which label should be seen.
>>
>>      Further, mandating that the upper label should be seen does cause
>>      unionmount a problem as there's no upper inode to hang the label off.
>>      This means that the label must be forged anew each time it is required
>>      until at such time a copy-up is effected.
>>
>
>    (9) Unprivileged mounts
>
>        As there are no backing store issues it should be a tractable
>        problem to get the semantics right to allow containers to use
>        overlayfs.  A naive attempt was made by Serge Hallyn and he ran
>        into security issues with copy-up.  Can copy-up be made safe if
>        unprivileged users (AKA user namespace root users) mount overlayfs?
>
>        I think that also intersects with your LSM label handling issues.

(10) Big File copy-up

     A small modification to a big file causes copy-up and eats a lot of
     space in the upper layer fs.
(11) size-optimization diff

     A size-optimization diff helps to reduce network traffic.

thanks,
Lai

>
> Eric
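
For reference, the 0,0 chardev convention discussed in item (1) can be checked from user space with nothing more than stat(). The following is a minimal sketch, assuming the upper layer is mounted on its own so that whiteouts show up as character devices. The helper names `is_whiteout` and `path_is_whiteout` are illustrative only and are not part of overlayfs or of any existing tool.

```python
import os
import stat


def is_whiteout(st_mode: int, st_rdev: int) -> bool:
    """Return True if the stat fields describe a character device 0,0.

    Assumption: a whiteout is exposed as a chardev with major 0 and
    minor 0, as described in item (1) of the thread.
    """
    return (
        stat.S_ISCHR(st_mode)
        and os.major(st_rdev) == 0
        and os.minor(st_rdev) == 0
    )


def path_is_whiteout(path: str) -> bool:
    """lstat() a path (without following symlinks) and test the result."""
    st = os.lstat(path)
    return is_whiteout(st.st_mode, st.st_rdev)
```

A backup tool walking a standalone upper layer could call `path_is_whiteout()` on each entry. This also shows the cost raised in (1)(a): every chardev has to be stat'd before it can be told apart from an ordinary device node.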