From: Lai Jiangshan <jiangshanlai@gmail.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: ksummit-discuss@lists.linuxfoundation.org,
"Serge E. Hallyn" <serge@hallyn.com>
Subject: Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
Date: Sat, 25 Jul 2015 23:39:50 +0800
Message-ID: <CAJhGHyA8SAk25t9V-Jg6-zwnZrHthQN_FjkWeG6a6GZ9gnqh-w@mail.gmail.com>
In-Reply-To: <87twst8pd3.fsf@x220.int.ebiederm.org>
On Sat, Jul 25, 2015 at 12:58 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> David Howells <dhowells@redhat.com> writes:
>
>> [With Miklós's email address fixed]
>>
>> I would like to propose a technical session on filesystem unioning. There are
>> a number of issues:
>>
>> (1) Whiteouts.
>>
>> Linus's idea that a union layer or overlay mounted not as part of a union
>> but separately, should expose whiteouts as 0,0 chardevs. Whilst this
>> might indeed make the backup tools easier as things like tar can then use
>> the stat() and mknod() interfaces rather than having to use special
>> ioctls or syscalls, Miklós's idea to implement them as actual 0,0
>> chardevs in the underlying filesystem incurs some problems:
>>
>> (a) It's slow and resource intensive.
>>
>> Every whiteout requires an inode to represent it. This means that if
>> you, say, have a directory in the lower layer that has a few thousand
>> inodes in it and you delete them all, you then eat up inode table
>> space in the upper layer.
>>
>> Further, every chardev inode has to be stat'd to see if it is really
>> a whiteout.
>>
>> (b) It has provided lock ordering issues in overlayfs directory reading
>> because overlayfs has to stat each chardev from within the directory
>> iterator.
>>
>> I have patches to make Ext2 and JFFS2 use special directory entries
>> labelled with DT_WHITEOUT and no inode. This is more space efficient and
>> faster and can be extended to Ext3 and Ext4. XFS has constants defined
>> for doing similar.
>>
>> I would propose that we change overlayfs to do this.
>>
>> Unfortunately, we would still have to support the then obsolete 0,0
>> chardevs on disk.
>>
>> The stat() and mknod() syscalls would then have to present these objects
>> to the user as 0,0 chardevs rather than ENOENT errors. To do this it
>> might be necessary to have a special mount flag to turn off the
>> translation to DENTRY_WHITEOUT_TYPE dentries and record them as
>> DENTRY_SPECIAL_TYPE instead with an in-memory inode struct showing it to
>> be 0,0 chardevs.
>>
>> David Woodhouse did make an additional suggestion that would make 0,0
>> chardevs less space inefficient - and that's to hard link a reserved
>> inode.
>
>
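For reference, a backup tool could recognise the 0,0 chardev form of a
whiteout from plain stat() output roughly as follows. This is only a
minimal sketch, not overlayfs code: the helper name is_whiteout_chardev()
is made up for illustration, and it assumes the layer is being inspected
directly rather than through the union mount.

```c
#define _DEFAULT_SOURCE          /* expose S_IFCHR etc. under strict -std= modes */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>       /* major(), minor(), makedev() on glibc */

/*
 * Hypothetical helper: true if *st describes a character device with
 * device number 0,0, i.e. the on-disk whiteout representation
 * discussed in (1).
 */
bool is_whiteout_chardev(const struct stat *st)
{
    assert(st != NULL);
    return S_ISCHR(st->st_mode) &&
           major(st->st_rdev) == 0 &&
           minor(st->st_rdev) == 0;
}
```

A caller would fill the struct with lstat() on each entry of the upper
directory, which is the per-entry stat cost described in (a).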
>> (2) Opaque inodes.
>>
>> Should we use an xattr to mark inodes as opaque or should we use an inode
>> flag? I have patches to add such an inode flag for Ext2 and JFFS2.
>> Marking the inode would be more space and time efficient.
>>
>> (3) Fall-through markers.
>>
>> Unionmount - and possibly other filesystem unioning systems - perform
>> directory integration on disk. (Note that overlayfs maintains this in
>> memory for the lifetime of a directory inode).
>>
>> With unionmount, an integrated directory is marked as being opaque with
>> special directory entries of type DT_FALLTHRU indicating where there is
>> stuff in lower layers that can be accessed.
>>
>> Should we, perhaps, declare that the user sees such markers as 0,1
>> chardevs when the layer is not mounted as part of a union?
>>
>> (4) Unionmount and other filesystem unioning systems.
>>
>> Do we want to add other filesystem unioning systems into the kernel?
>> I've brought in a lot of the stuff for unionmount to help support
>> overlayfs. Unfortunately, overlayfs interferes with some of the stuff
>> that unionmount wants to do (e.g. doing whiteouts differently and in an
>> awkward manner).
>>
>> (5) Lack of POSIX characteristics.
>>
> >> There have been complaints that overlayfs isn't sufficiently POSIX-like.
> >> Now, this is by design on the part of overlayfs and I agree with
> >> Miklós that this is the right way to do it. However, some mitigation
> >> might be required.
>>
>> One of the most annoying features is the fact that if you do:
>>
>> fd1 = open("foo", O_RDONLY);
>> fd2 = open("foo", O_RDWR);
>>
>> then fd1 and fd2 don't necessarily point to the same file.
>>
>> I have been given patches by Ratna Bolla that speculatively copy the file
>> into the overlayfs file inode as the pages are accessed and direct file
>> accesses to the overlay inode rather than one of the two layers. I saw a
>> number of problems with the approach, but it's possible his latest patch
>> fixes them.
>>
>> (6) File-by-file waiver of unioning.
>>
>> Jan Olszak has requested that it be possible to mark files in one of the
>> layers to suppress copy up on that file and to direct writes to the lower
>> layer. This causes problems with rename however.
>>
>> (7) File locking and notifications.
>>
>> These are similar issues. IIRC, we decided at the Filesystem Summit that
>> you get to take locks on the union inode only and that the notifications
>> only follow changes to the upper layer. This means that you don't get
>> union/union interactions through a common lower layer.
>>
>> However, we've since had complaints that tail doesn't follow changes made
>> to the lower layer (from James Harvey).
>>
>> (8) LSMs and unions/overlays.
>>
>> Path-based LSMs should just work now that file->f_path points to the
>> union layer inode, though they may require namespace awareness.
>>
>> Label-based LSMs are another matter. file->f_path.dentry->d_inode points
>> to the top layer label and file->f_inode points to the lower layer label.
>> Currently the user of the overlay can 'see through' the overlay and
>> access lower files in terms of the labels from the lower layer when doing
>> file operations, but uses the label from the upper layer when doing inode
>> operations. I think this should be consistent and should only use the
>> upper layer label. I'm working on patches to get this to work, but there
>> is dissension over which label should be seen.
>>
>> Further, mandating that the upper label should be seen does cause
>> unionmount a problem as there's no upper inode to hang the label off.
> >> This means that the label must be forged anew each time it is required
> >> until such time as a copy-up is effected.
>>
>
> (9) Unprivileged mounts
>
> As there are no backing store issues it should be a tractable
> problem to get the semantics right to allow containers to use
> overlayfs. A naive attempt was made by Serge Hallyn and he ran
> into security issues with copy-up. Can copy-up be made safe if
> unprivileged users (AKA user namespace root users) mount overlayfs?
>
> I think that also intersects with your LSM label handling issues.
(10) Big file copy-up
A small modification to a big file triggers a copy-up of the whole file
and eats a lot of space in the upper layer fs.
(11) Size-optimized diff
A size-optimized diff helps to reduce network traffic.
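To put a number on (10): a back-of-the-envelope sketch, assuming the
whole file is copied to the upper layer on the first write. The helper
name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes copied to the upper layer per byte the
 * application actually changed, assuming a whole-file copy-up.
 * Integer division, so the result is rounded down.
 */
uint64_t copy_up_amplification(uint64_t file_size, uint64_t modified_bytes)
{
    assert(modified_bytes > 0);
    return file_size / modified_bytes;
}
```

For a 1 GiB file with a single 4 KiB page modified,
copy_up_amplification(1ULL << 30, 4096) gives 262144, i.e. the upper
layer stores 262144 bytes for every byte that really changed.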
thanks,
Lai
>
> Eric