ksummit.lists.linux.dev archive mirror
* [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
From: David Howells @ 2015-07-24 16:01 UTC
  To: ksummit-discuss; +Cc: mszeredi


I would like to propose a technical session on filesystem unioning.  There are
a number of issues:

 (1) Whiteouts.

     Linus's idea is that a union layer or overlay mounted separately, rather
     than as part of a union, should expose whiteouts as 0,0 chardevs.  Whilst
     this might indeed make life easier for backup tools, as things like tar
     can then use the stat() and mknod() interfaces rather than having to use
     special ioctls or syscalls, Miklós's idea to implement them as actual
     0,0 chardevs in the underlying filesystem incurs some problems:

     (a) It's slow and resource intensive.

     	 Every whiteout requires an inode to represent it.  This means that if
     	 you, say, have a directory in the lower layer that has a few thousand
     	 inodes in it and you delete them all, you then eat up inode table
     	 space in the upper layer.

	 Further, every chardev inode has to be stat'd to see if it is really
	 a whiteout.

     (b) It has caused lock ordering issues in overlayfs directory reading
     	 because overlayfs has to stat each chardev from within the directory
     	 iterator.

     I have patches to make Ext2 and JFFS2 use special directory entries
     labelled with DT_WHITEOUT and no inode.  This is more space efficient and
     faster and can be extended to Ext3 and Ext4.  XFS has constants defined
     for doing something similar.

     I would propose that we change overlayfs to do this.

     Unfortunately, we would still have to support the then-obsolete 0,0
     chardevs on disk.

     The stat() and mknod() syscalls would then have to present these objects
     to the user as 0,0 chardevs rather than ENOENT errors.  To do this it
     might be necessary to have a special mount flag to turn off the
     translation to DENTRY_WHITEOUT_TYPE dentries and record them as
     DENTRY_SPECIAL_TYPE instead, with an in-memory inode struct showing each
     one to be a 0,0 chardev.

     David Woodhouse did make an additional suggestion that would make 0,0
     chardevs less wasteful of space: make each whiteout a hard link to a
     single reserved inode.

 (2) Opaque inodes.

     Should we use an xattr to mark inodes as opaque or should we use an inode
     flag?  I have patches to add such an inode flag for Ext2 and JFFS2.
     Marking the inode would be more space and time efficient.
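
     For comparison, the xattr variant can be probed from userspace.
     overlayfs as merged marks an opaque directory in the upper layer with the
     xattr "trusted.overlay.opaque" set to "y".  The sketch below reads that
     xattr; dir_is_opaque() is a hypothetical name, and the choice to treat
     ENODATA and ENOTSUP as "not opaque" is an assumption of this example.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/xattr.h>

#define OVL_OPAQUE_XATTR "trusted.overlay.opaque"

/*
 * Hypothetical helper: returns 1 if 'path' carries the opaque marker,
 * 0 if it does not, and -1 on any other error (errno is left set).
 * Reading trusted.* xattrs needs CAP_SYS_ADMIN; without it the kernel
 * reports ENODATA, which this sketch treats as "not opaque".
 */
int dir_is_opaque(const char *path)
{
	char val[2];
	ssize_t n = lgetxattr(path, OVL_OPAQUE_XATTR, val, sizeof(val));

	if (n < 0)
		return (errno == ENODATA || errno == ENOTSUP) ? 0 : -1;
	return n == 1 && val[0] == 'y';
}
```

     Each such check costs an xattr lookup on the directory, which is the time
     overhead an inode flag would avoid.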

 (3) Fall-through markers.

     Unionmount - and possibly other filesystem unioning systems - perform
     directory integration on disk.  (Note that overlayfs maintains this in
     memory for the lifetime of a directory inode).

     With unionmount, an integrated directory is marked as being opaque with
     special directory entries of type DT_FALLTHRU indicating where there is
     stuff in lower layers that can be accessed.

     Should we, perhaps, declare that the user sees such markers as 0,1
     chardevs when the layer is not mounted as part of a union?

 (4) Unionmount and other filesystem unioning systems.

     Do we want to add other filesystem unioning systems into the kernel?
     I've brought in a lot of the stuff for unionmount to help support
     overlayfs.  Unfortunately, overlayfs interferes with some of the stuff
     that unionmount wants to do (e.g. doing whiteouts differently and in an
     awkward manner).

 (5) Lack of POSIX characteristics.

     There have been complaints that overlayfs isn't sufficiently POSIX-like.
     Now, this is by design on the part of overlayfs and I agree with Miklós
     that this is the right way to do it.  However, some mitigation might be
     required.

     One of the most annoying features is the fact that if you do:

	fd1 = open("foo", O_RDONLY);
	fd2 = open("foo", O_RDWR);

     then fd1 and fd2 don't necessarily point to the same file.

     I have been given patches by Ratna Bolla that speculatively copy the file
     into the overlayfs file inode as the pages are accessed and direct file
     accesses to the overlay inode rather than one of the two layers.  I saw a
     number of problems with the approach, but it's possible his latest patch
     fixes them.

 (6) File-by-file waiver of unioning.

     Jan Olszak has requested that it be possible to mark files in one of the
     layers to suppress copy up on that file and to direct writes to the lower
     layer.  This causes problems with rename, however.

 (7) File locking and notifications.

     These are similar issues.  IIRC, we decided at the Filesystem Summit that
     you get to take locks on the union inode only and that the notifications
     only follow changes to the upper layer.  This means that you don't get
     union/union interactions through a common lower layer.

     However, we've since had complaints that tail doesn't follow changes made
     to the lower layer (from James Harvey).
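
     For reference, the kind of watch a follower such as tail would set up
     looks roughly like the sketch below, which uses inotify on a path and
     polls for IN_MODIFY.  watch_open() and watch_wait() are hypothetical
     names.  With the behaviour described above, a watch placed on the union
     path would be expected to fire for writes made through the union but
     not for writes made directly to the lower layer.

```c
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper: start watching 'path' for content modification.
 * Returns an inotify descriptor, or -1 on error.
 */
int watch_open(const char *path)
{
	int ifd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);

	if (ifd < 0)
		return -1;
	if (inotify_add_watch(ifd, path, IN_MODIFY) < 0) {
		close(ifd);
		return -1;
	}
	return ifd;
}

/*
 * Returns 1 if an event is readable within 'timeout_ms' milliseconds,
 * 0 on timeout, and -1 on error.
 */
int watch_wait(int ifd, int timeout_ms)
{
	struct pollfd pfd = { .fd = ifd, .events = POLLIN };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```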

 (8) LSMs and unions/overlays.

     Path-based LSMs should just work now that file->f_path points to the
     union layer inode, though they may require namespace awareness.

     Label-based LSMs are another matter.  file->f_path.dentry->d_inode points
     to the top layer label and file->f_inode points to the lower layer label.
     Currently the user of the overlay can 'see through' the overlay and
     access lower files in terms of the labels from the lower layer when doing
     file operations, but uses the label from the upper layer when doing inode
     operations.  I think this should be consistent and should only use the
     upper layer label.  I'm working on patches to get this to work, but there
     is dissension over which label should be seen.

     Further, mandating that the upper label should be seen does cause
     unionmount a problem as there's no upper inode to hang the label off.
     This means that the label must be forged anew each time it is required
     until such time as a copy-up is effected.

David


* [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
From: David Howells @ 2015-07-24 16:10 UTC
  To: ksummit-discuss


[With Miklós's email address fixed]

David


* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
From: Eric W. Biederman @ 2015-07-24 16:58 UTC
  To: David Howells; +Cc: ksummit-discuss, Serge E. Hallyn

David Howells <dhowells@redhat.com> writes:

> [With Miklós's email address fixed]
>
> I would like to propose a technical session on filesystem unioning.  There are
> a number of issues:
>
>  (1) Whiteouts.
>
>      Linus's idea that a union layer or overlay mounted not as part of a union
>      but separately, should expose whiteouts as 0,0 chardevs.  Whilst this
>      might indeed make the backup tools easier as things like tar can then use
>      the stat() and mknod() interfaces rather than having to use special
>      ioctls or syscalls, Miklós's idea to implement them as actual 0,0
>      chardevs in the underlying filesystem incurs some problems:
>
>      (a) It's slow and resource intensive.
>
>      	 Every whiteout requires an inode to represent it.  This means that if
>      	 you, say, have a directory in the lower layer that has a few thousand
>      	 inodes in it and you delete them all, you then eat up inode table
>      	 space in the upper layer.
>
> 	 Further, every chardev inode has to be stat'd to see if it is really
> 	 a whiteout.
>
>      (b) It has provided lock ordering issues in overlayfs directory reading
>      	 because overlayfs has to stat each chardev from within the directory
>      	 iterator.
>
>      I have patches to make Ext2 and JFFS2 use special directory entries
>      labelled with DT_WHITEOUT and no inode.  This is more space efficient and
>      faster and can be extended to Ext3 and Ext4.  XFS has constants defined
>      for doing similar.
>
>      I would propose that we change overlayfs to do this.
>
>      Unfortunately, we would still have to support the then obsolete 0,0
>      chardevs on disk.
>
>      The stat() and mknod() syscalls would then have to present these objects
>      to the user as 0,0 chardevs rather than ENOENT errors.  To do this it
>      might be necessary to have a special mount flag to turn off the
>      translation to DENTRY_WHITEOUT_TYPE dentries and record them as
>      DENTRY_SPECIAL_TYPE instead with an in-memory inode struct showing it to
>      be 0,0 chardevs.
>
>      David Woodhouse did make an additional suggestion that would make 0,0
>      chardevs less space inefficient - and that's to hard link a reserved
>      inode.


>  (2) Opaque inodes.
>
>      Should we use an xattr to mark inodes as opaque or should we use an inode
>      flag?  I have patches to add such an inode flag for Ext2 and JFFS2.
>      Marking the inode would be more space and time efficient.
>
>  (3) Fall-through markers.
>
>      Unionmount - and possibly other filesystem unioning systems - perform
>      directory integration on disk.  (Note that overlayfs maintains this in
>      memory for the lifetime of a directory inode).
>
>      With unionmount, an integrated directory is marked as being opaque with
>      special directory entries of type DT_FALLTHRU indicating where there is
>      stuff in lower layers that can be accessed.
>
>      Should we, perhaps, declare that the user sees such markers as 0,1
>      chardevs when the layer is not mounted as part of a union?
>
>  (4) Unionmount and other filesystem unioning systems.
>
>      Do we want to add other filesystem unioning systems into the kernel?
>      I've brought in a lot of the stuff for unionmount to help support
>      overlayfs.  Unfortunately, overlayfs interferes with some of the stuff
>      that unionmount wants to do (e.g. doing whiteouts differently and in an
>      awkward manner).
>
>  (5) Lack of POSIX characteristics.
>
>      There have been complaints that overlayfs isn't sufficiently POSIX like.
>      Now, this is by design on the part of overlayfs and I agree with the
>      Miklós that this is the right way to do it.  However, some mitigation
>      might be required.
>
>      One of the most annoying features is the fact that if you do:
>
> 	fd1 = open("foo", O_RDONLY);
> 	fd2 = open("foo", O_RDWR);
>
>      then fd1 and fd2 don't necessarily point to the same file.
>
>      I have been given patches by Ratna Bolla that speculatively copy the file
>      into the overlayfs file inode as the pages are accessed and direct file
>      accesses to the overlay inode rather than one of the two layers.  I saw a
>      number of problems with the approach, but it's possible his latest patch
>      fixes them.
>
>  (6) File-by-file waiver of unioning.
>
>      Jan Olszak has requested that it be possible to mark files in one of the
>      layers to suppress copy up on that file and to direct writes to the lower
>      layer.  This causes problems with rename however.
>
>  (7) File locking and notifications.
>
>      These are similar issues.  IIRC, we decided at the Filesystem Summit that
>      you get to take locks on the union inode only and that the notifications
>      only follow changes to the upper layer.  This means that you don't get
>      union/union interactions through a common lower layer.
>
>      However, we've since had complaints that tail doesn't follow changes made
>      to the lower layer (from James Harvey).
>
>  (8) LSMs and unions/overlays.
>
>      Path-based LSMs should just work now that file->f_path points to the
>      union layer inode, though they may require namespace awareness.
>
>      Label-based LSMs are another matter.  file->f_path.dentry->d_inode points
>      to the top layer label and file->f_inode points to the lower layer label.
>      Currently the user of the overlay can 'see through' the overlay and
>      access lower files in terms of the labels from the lower layer when doing
>      file operations, but uses the label from the upper layer when doing inode
>      operations.  I think this should be consistent and should only use the
>      upper layer label.  I'm working on patches to get this to work, but there
>      is dissension over which label should be seen.
>
>      Further, mandating that the upper label should be seen does cause
>      unionmount a problem as there's no upper inode to hang the label off.
>      This means that the label must be forged anew each time it is required
>      until at such time a copy-up is effected.
>

(9) Unprivileged mounts

    As there are no backing store issues it should be a tractable
    problem to get the semantics right to allow containers to use
    overlayfs.  A naive attempt was made by Serge Hallyn and he ran
    into security issues with copy-up.  Can copy-up be made safe if
    unprivileged users (AKA user namespace root users) mount overlayfs?

    I think that also intersects with your LSM label handling issues.

Eric


* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
From: James Bottomley @ 2015-07-24 17:12 UTC
  To: Eric W. Biederman; +Cc: ksummit-discuss, Serge E. Hallyn

On Fri, 2015-07-24 at 11:58 -0500, Eric W. Biederman wrote:
> David Howells <dhowells@redhat.com> writes:
> 
> > [With Miklós's email address fixed]
> >
> > I would like to propose a technical session on filesystem unioning.  There are
> > a number of issues:
> >
> >  (1) Whiteouts.
> >
> >      Linus's idea that a union layer or overlay mounted not as part of a union
> >      but separately, should expose whiteouts as 0,0 chardevs.  Whilst this
> >      might indeed make the backup tools easier as things like tar can then use
> >      the stat() and mknod() interfaces rather than having to use special
> >      ioctls or syscalls, Miklós's idea to implement them as actual 0,0
> >      chardevs in the underlying filesystem incurs some problems:
> >
> >      (a) It's slow and resource intensive.
> >
> >      	 Every whiteout requires an inode to represent it.  This means that if
> >      	 you, say, have a directory in the lower layer that has a few thousand
> >      	 inodes in it and you delete them all, you then eat up inode table
> >      	 space in the upper layer.
> >
> > 	 Further, every chardev inode has to be stat'd to see if it is really
> > 	 a whiteout.
> >
> >      (b) It has provided lock ordering issues in overlayfs directory reading
> >      	 because overlayfs has to stat each chardev from within the directory
> >      	 iterator.
> >
> >      I have patches to make Ext2 and JFFS2 use special directory entries
> >      labelled with DT_WHITEOUT and no inode.  This is more space efficient and
> >      faster and can be extended to Ext3 and Ext4.  XFS has constants defined
> >      for doing similar.
> >
> >      I would propose that we change overlayfs to do this.
> >
> >      Unfortunately, we would still have to support the then obsolete 0,0
> >      chardevs on disk.
> >
> >      The stat() and mknod() syscalls would then have to present these objects
> >      to the user as 0,0 chardevs rather than ENOENT errors.  To do this it
> >      might be necessary to have a special mount flag to turn off the
> >      translation to DENTRY_WHITEOUT_TYPE dentries and record them as
> >      DENTRY_SPECIAL_TYPE instead with an in-memory inode struct showing it to
> >      be 0,0 chardevs.
> >
> >      David Woodhouse did make an additional suggestion that would make 0,0
> >      chardevs less space inefficient - and that's to hard link a reserved
> >      inode.
> 
> 
> >  (2) Opaque inodes.
> >
> >      Should we use an xattr to mark inodes as opaque or should we use an inode
> >      flag?  I have patches to add such an inode flag for Ext2 and JFFS2.
> >      Marking the inode would be more space and time efficient.
> >
> >  (3) Fall-through markers.
> >
> >      Unionmount - and possibly other filesystem unioning systems - perform
> >      directory integration on disk.  (Note that overlayfs maintains this in
> >      memory for the lifetime of a directory inode).
> >
> >      With unionmount, an integrated directory is marked as being opaque with
> >      special directory entries of type DT_FALLTHRU indicating where there is
> >      stuff in lower layers that can be accessed.
> >
> >      Should we, perhaps, declare that the user sees such markers as 0,1
> >      chardevs when the layer is not mounted as part of a union?
> >
> >  (4) Unionmount and other filesystem unioning systems.
> >
> >      Do we want to add other filesystem unioning systems into the kernel?
> >      I've brought in a lot of the stuff for unionmount to help support
> >      overlayfs.  Unfortunately, overlayfs interferes with some of the stuff
> >      that unionmount wants to do (e.g. doing whiteouts differently and in an
> >      awkward manner).
> >
> >  (5) Lack of POSIX characteristics.
> >
> >      There have been complaints that overlayfs isn't sufficiently POSIX like.
> >      Now, this is by design on the part of overlayfs and I agree with the
> >      Miklós that this is the right way to do it.  However, some mitigation
> >      might be required.
> >
> >      One of the most annoying features is the fact that if you do:
> >
> > 	fd1 = open("foo", O_RDONLY);
> > 	fd2 = open("foo", O_RDWR);
> >
> >      then fd1 and fd2 don't necessarily point to the same file.
> >
> >      I have been given patches by Ratna Bolla that speculatively copy the file
> >      into the overlayfs file inode as the pages are accessed and direct file
> >      accesses to the overlay inode rather than one of the two layers.  I saw a
> >      number of problems with the approach, but it's possible his latest patch
> >      fixes them.
> >
> >  (6) File-by-file waiver of unioning.
> >
> >      Jan Olszak has requested that it be possible to mark files in one of the
> >      layers to suppress copy up on that file and to direct writes to the lower
> >      layer.  This causes problems with rename however.
> >
> >  (7) File locking and notifications.
> >
> >      These are similar issues.  IIRC, we decided at the Filesystem Summit that
> >      you get to take locks on the union inode only and that the notifications
> >      only follow changes to the upper layer.  This means that you don't get
> >      union/union interactions through a common lower layer.
> >
> >      However, we've since had complaints that tail doesn't follow changes made
> >      to the lower layer (from James Harvey).
> >
> >  (8) LSMs and unions/overlays.
> >
> >      Path-based LSMs should just work now that file->f_path points to the
> >      union layer inode, though they may require namespace awareness.
> >
> >      Label-based LSMs are another matter.  file->f_path.dentry->d_inode points
> >      to the top layer label and file->f_inode points to the lower layer label.
> >      Currently the user of the overlay can 'see through' the overlay and
> >      access lower files in terms of the labels from the lower layer when doing
> >      file operations, but uses the label from the upper layer when doing inode
> >      operations.  I think this should be consistent and should only use the
> >      upper layer label.  I'm working on patches to get this to work, but there
> >      is dissension over which label should be seen.
> >
> >      Further, mandating that the upper label should be seen does cause
> >      unionmount a problem as there's no upper inode to hang the label off.
> >      This means that the label must be forged anew each time it is required
> >      until at such time a copy-up is effected.
> >
> 
> (9) Unprivileged mounts
> 
>     As there are no backing store issues it should be a tractable
>     problem to get the semantics right to allow containers to use
>     overlayfs.  A naive attempt was made by Serge Hallyn and he ran
>     into security issues with copy-up.  Can copy-up be made safe if
>     unprivileged users (AKA user namespace root users) mount overlayfs?
> 
>     I think that also intersects with your LSM label handling issues.

We'd be interested in this at Odin.  One of the biggest annoyances with
Docker is that you can't make a Docker container description of Docker
itself because of the way the proxy graph driver works (containers
cannot safely modify a block device then mount it).  Getting this right
for overlayfs would allow us to begin correcting this problem ... which
is also a big security hole in Docker.

Note also that Pavel Emelyanov has been considering generalised
namespace descriptions of overlays in his mosaic project:

https://github.com/xemul/mosaic

So he'd likely be interested in this as well.

James


* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
From: Lai Jiangshan @ 2015-07-25 15:39 UTC
  To: Eric W. Biederman; +Cc: ksummit-discuss, Serge E. Hallyn

On Sat, Jul 25, 2015 at 12:58 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> David Howells <dhowells@redhat.com> writes:
>
>> [With Miklós's email address fixed]
>>
>> I would like to propose a technical session on filesystem unioning.  There are
>> a number of issues:
>>
>>  (1) Whiteouts.
>>
>>      Linus's idea that a union layer or overlay mounted not as part of a union
>>      but separately, should expose whiteouts as 0,0 chardevs.  Whilst this
>>      might indeed make the backup tools easier as things like tar can then use
>>      the stat() and mknod() interfaces rather than having to use special
>>      ioctls or syscalls, Miklós's idea to implement them as actual 0,0
>>      chardevs in the underlying filesystem incurs some problems:
>>
>>      (a) It's slow and resource intensive.
>>
>>        Every whiteout requires an inode to represent it.  This means that if
>>        you, say, have a directory in the lower layer that has a few thousand
>>        inodes in it and you delete them all, you then eat up inode table
>>        space in the upper layer.
>>
>>        Further, every chardev inode has to be stat'd to see if it is really
>>        a whiteout.
>>
>>      (b) It has provided lock ordering issues in overlayfs directory reading
>>        because overlayfs has to stat each chardev from within the directory
>>        iterator.
>>
>>      I have patches to make Ext2 and JFFS2 use special directory entries
>>      labelled with DT_WHITEOUT and no inode.  This is more space efficient and
>>      faster and can be extended to Ext3 and Ext4.  XFS has constants defined
>>      for doing similar.
>>
>>      I would propose that we change overlayfs to do this.
>>
>>      Unfortunately, we would still have to support the then obsolete 0,0
>>      chardevs on disk.
>>
>>      The stat() and mknod() syscalls would then have to present these objects
>>      to the user as 0,0 chardevs rather than ENOENT errors.  To do this it
>>      might be necessary to have a special mount flag to turn off the
>>      translation to DENTRY_WHITEOUT_TYPE dentries and record them as
>>      DENTRY_SPECIAL_TYPE instead with an in-memory inode struct showing it to
>>      be 0,0 chardevs.
>>
>>      David Woodhouse did make an additional suggestion that would make 0,0
>>      chardevs less space inefficient - and that's to hard link a reserved
>>      inode.
>
>
>>  (2) Opaque inodes.
>>
>>      Should we use an xattr to mark inodes as opaque or should we use an inode
>>      flag?  I have patches to add such an inode flag for Ext2 and JFFS2.
>>      Marking the inode would be more space and time efficient.
>>
>>  (3) Fall-through markers.
>>
>>      Unionmount - and possibly other filesystem unioning systems - perform
>>      directory integration on disk.  (Note that overlayfs maintains this in
>>      memory for the lifetime of a directory inode).
>>
>>      With unionmount, an integrated directory is marked as being opaque with
>>      special directory entries of type DT_FALLTHRU indicating where there is
>>      stuff in lower layers that can be accessed.
>>
>>      Should we, perhaps, declare that the user sees such markers as 0,1
>>      chardevs when the layer is not mounted as part of a union?
>>
>>  (4) Unionmount and other filesystem unioning systems.
>>
>>      Do we want to add other filesystem unioning systems into the kernel?
>>      I've brought in a lot of the stuff for unionmount to help support
>>      overlayfs.  Unfortunately, overlayfs interferes with some of the stuff
>>      that unionmount wants to do (e.g. doing whiteouts differently and in an
>>      awkward manner).
>>
>>  (5) Lack of POSIX characteristics.
>>
>>      There have been complaints that overlayfs isn't sufficiently POSIX like.
>>      Now, this is by design on the part of overlayfs and I agree with the
>>      Miklós that this is the right way to do it.  However, some mitigation
>>      might be required.
>>
>>      One of the most annoying features is the fact that if you do:
>>
>>       fd1 = open("foo", O_RDONLY);
>>       fd2 = open("foo", O_RDWR);
>>
>>      then fd1 and fd2 don't necessarily point to the same file.
>>
>>      I have been given patches by Ratna Bolla that speculatively copy the file
>>      into the overlayfs file inode as the pages are accessed and direct file
>>      accesses to the overlay inode rather than one of the two layers.  I saw a
>>      number of problems with the approach, but it's possible his latest patch
>>      fixes them.
>>
>>  (6) File-by-file waiver of unioning.
>>
>>      Jan Olszak has requested that it be possible to mark files in one of the
>>      layers to suppress copy up on that file and to direct writes to the lower
>>      layer.  This causes problems with rename however.
>>
>>  (7) File locking and notifications.
>>
>>      These are similar issues.  IIRC, we decided at the Filesystem Summit that
>>      you get to take locks on the union inode only and that the notifications
>>      only follow changes to the upper layer.  This means that you don't get
>>      union/union interactions through a common lower layer.
>>
>>      However, we've since had complaints that tail doesn't follow changes made
>>      to the lower layer (from James Harvey).
>>
>>  (8) LSMs and unions/overlays.
>>
>>      Path-based LSMs should just work now that file->f_path points to the
>>      union layer inode, though they may require namespace awareness.
>>
>>      Label-based LSMs are another matter.  file->f_path.dentry->d_inode points
>>      to the top layer label and file->f_inode points to the lower layer label.
>>      Currently the user of the overlay can 'see through' the overlay and
>>      access lower files in terms of the labels from the lower layer when doing
>>      file operations, but uses the label from the upper layer when doing inode
>>      operations.  I think this should be consistent and should only use the
>>      upper layer label.  I'm working on patches to get this to work, but there
>>      is dissension over which label should be seen.
>>
>>      Further, mandating that the upper label should be seen does cause
>>      unionmount a problem as there's no upper inode to hang the label off.
>>      This means that the label must be forged anew each time it is required
>>      until at such time a copy-up is effected.
>>
>
> (9) Unprivileged mounts
>
>     As there are no backing store issues it should be a tractable
>     problem to get the semantics right to allow containers to use
>     overlayfs.  A naive attempt was made by Serge Hallyn and he ran
>     into security issues with copy-up.  Can copy-up be made safe if
>     unprivileged users (AKA user namespace root users) mount overlayfs?
>
>     I think that also intersects with your LSM label handling issues.

(10) Big File copy-up
       A small modification to a big file causes a full copy-up and eats
       a lot of space in the upper layer fs.

(11) size-optimization diff
       size-optimization diff helps to reduce the network traffic.

thanks,
Lai


>
> Eric


* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
  2015-07-24 16:01 [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues David Howells
  2015-07-24 16:10 ` David Howells
@ 2015-07-27 13:19 ` David Woodhouse
  2015-07-27 14:33   ` Theodore Ts'o
  1 sibling, 1 reply; 11+ messages in thread
From: David Woodhouse @ 2015-07-27 13:19 UTC (permalink / raw)
  To: David Howells, ksummit-discuss; +Cc: mszeredi


On Fri, 2015-07-24 at 17:01 +0100, David Howells wrote:
> 
>  (1) Whiteouts.
> 
>      Linus's idea that a union layer or overlay mounted not as part of a union
>      but separately, should expose whiteouts as 0,0 chardevs.  Whilst this
>      might indeed make the backup tools easier as things like tar can then use
>      the stat() and mknod() interfaces rather than having to use special
>      ioctls or syscalls, Miklós's idea to implement them as actual 0,0
>      chardevs in the underlying filesystem incurs some problems:

Yeah, this is screwed up. I can kind of understand exposing them to
userspace as 0,0 chardevs in the case where they're mounted outside the
context of a union mount.

But these are *directory entries*, not real inodes. Using chardevs to
implement them *internally* is problematic. It's inefficient, because
it involves an inode lookup and uses up inode#s (in implementations
where that matters), and it's also caused deadlocks in some cases like
JFFS2 where we don't expect the recursion that it requires.

This was much nicer when it was being done with DT_WHT internally. And
you could *still* expose that to userspace as a 0,0 chardev — or
however you like.
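
For illustration, a minimal userspace sketch of the check a backup tool
could make under that scheme, assuming whiteouts are presented through
stat() as character devices with device number 0,0.  The helper names
stat_is_whiteout() and path_is_whiteout() are made up for this example
and are not an existing API.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for lstat() and S_IFCHR under strict -std= */
#endif
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper: true if the stat result looks like a whiteout,
 * i.e. a character device with major 0 and minor 0. */
static bool stat_is_whiteout(const struct stat *st)
{
	return S_ISCHR(st->st_mode) &&
	       major(st->st_rdev) == 0 && minor(st->st_rdev) == 0;
}

/* Hypothetical helper: lstat() a path and classify it.
 * Returns 1 for a whiteout, 0 for anything else, -1 if lstat() fails. */
static int path_is_whiteout(const char *path)
{
	struct stat st;

	if (lstat(path, &st) != 0)
		return -1;
	return stat_is_whiteout(&st) ? 1 : 0;
}
```

Nothing in this check depends on how the filesystem stores the whiteout
internally, which is the point being made above.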

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation



* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
  2015-07-27 13:19 ` David Woodhouse
@ 2015-07-27 14:33   ` Theodore Ts'o
  2015-07-28  7:13     ` Miklos Szeredi
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2015-07-27 14:33 UTC (permalink / raw)
  To: David Woodhouse; +Cc: ksummit-discuss, mszeredi

On Mon, Jul 27, 2015 at 02:19:44PM +0100, David Woodhouse wrote:
> This was much nicer when it was being done with DT_WHT internally. And
> you could *still* expose that to userspace as a 0,0 chardev — or
> however you like.

Yeah, that was what I was going to propose.  We can expose it to
userspace as a 0,0 chardev with an inode number of UINT_MAX-1, but
that doesn't have to be the internal representation inside the kernel...
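
As a rough sketch of the userspace-visible side only: the function below
fills in the attributes such a synthetic whiteout would report.  The name
fill_whiteout_stat() is hypothetical, and leaving the permission bits at
zero is an assumption of this example, not something settled in the
thread.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for S_IFCHR under strict -std= */
#endif
#include <limits.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical: the attributes a whiteout directory entry would report
 * to userspace, independent of how the filesystem stores it. */
static void fill_whiteout_stat(struct stat *st)
{
	memset(st, 0, sizeof(*st));
	st->st_mode = S_IFCHR;		/* chardev; permission bits left at 0 */
	st->st_rdev = makedev(0, 0);	/* device number 0,0 */
	st->st_ino = UINT_MAX - 1;	/* fixed, synthetic inode number */
}
```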

     	     	     	    	     		    	   - Ted


* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
  2015-07-27 14:33   ` Theodore Ts'o
@ 2015-07-28  7:13     ` Miklos Szeredi
  2015-07-28 12:16       ` Theodore Ts'o
  2015-10-15 19:49       ` David Howells
  0 siblings, 2 replies; 11+ messages in thread
From: Miklos Szeredi @ 2015-07-28  7:13 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ksummit-discuss

On Mon, Jul 27, 2015 at 4:33 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Jul 27, 2015 at 02:19:44PM +0100, David Woodhouse wrote:
>> This was much nicer when it was being done with DT_WHT internally. And
>> you could *still* expose that to userspace as a 0,0 chardev — or
>> however you like.
>
> Yeah, that was what I was going to propose.  We can expose it to
> userspace as a 0,0 chardev with an inode number of UINT_MAX-1, but
> that doesn't have to be the internal representation inside the kernel...

Exactly.  Switching overlayfs to deal with DT_WHT is trivial.
Turning off backward compatibility (checking 0,0 chardev as internal
representation) can be made a mount option of overlayfs.  If the
DT_WHT representation is used, then it will work either way, but will
be suboptimal due to getattr on chardev.  If the backward compatibility
is turned off then it will be optimal, but wouldn't deal with the old
representation.

The only remaining issue is backward incompatibility of the
filesystem image itself (old fsck, ...).  Whether that's a problem or
not is up to the user/distro to decide.
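
The two modes can be sketched in userspace terms as follows.  This is
only an illustration of the decision logic, not overlayfs code:
entry_is_whiteout() and the legacy_compat flag are hypothetical names,
and it assumes the lower filesystem reports DT_WHT in d_type for
whiteout directory entries.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for DT_* and fstatat() under strict -std= */
#endif
#include <dirent.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

#ifndef DT_WHT
#define DT_WHT 14	/* value used by glibc and the BSDs */
#endif

/* Hypothetical: classify one directory entry read from dirfd.
 * legacy_compat stands in for the proposed mount option. */
static bool entry_is_whiteout(int dirfd, const struct dirent *de,
			      bool legacy_compat)
{
	struct stat st;

	/* New representation: the directory entry itself says so. */
	if (de->d_type == DT_WHT)
		return true;

	/* Compatibility off: no per-entry getattr at all. */
	if (!legacy_compat)
		return false;

	/* Old representation: only a chardev can be a whiteout. */
	if (de->d_type != DT_CHR && de->d_type != DT_UNKNOWN)
		return false;

	/* This stat per candidate entry is the suboptimal part. */
	if (fstatat(dirfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
		return false;	/* treat errors as "not a whiteout" here */

	return S_ISCHR(st.st_mode) &&
	       major(st.st_rdev) == 0 && minor(st.st_rdev) == 0;
}
```

With legacy_compat off, the function never calls fstatat(), so directory
reading costs nothing extra per entry, at the price of not recognising
old-style 0,0 chardev whiteouts.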

Thanks,
Miklos


* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
  2015-07-28  7:13     ` Miklos Szeredi
@ 2015-07-28 12:16       ` Theodore Ts'o
  2015-10-15 19:49       ` David Howells
  1 sibling, 0 replies; 11+ messages in thread
From: Theodore Ts'o @ 2015-07-28 12:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: ksummit-discuss

On Tue, Jul 28, 2015 at 09:13:18AM +0200, Miklos Szeredi wrote:
> 
> Exactly.  Switching overlayfs to deal with DT_WHT is trivial.
> Turning off backward compatibility (checking 0,0 chardev as internal
> representation) can be made a mount option of overlayfs.  If the
> DT_WHT representation is used, then it will work either way, but will
> be suboptimal due to getattr on chardev.  If the backward compatibility
> is turned off then it will be optimal, but wouldn't deal with the old
> representation.
> 
> The only remaining issue is backward incompatibility of the
> filesystem image itself (old fsck, ...).  Whether that's a problem or
> not is up to the user/distro to decide.

Yep, this seems like a no-brainer.  Given that, is there anything that
requires discussion at the Kernel Summit?

							- Ted


* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
  2015-07-24 16:58   ` Eric W. Biederman
  2015-07-24 17:12     ` James Bottomley
  2015-07-25 15:39     ` Lai Jiangshan
@ 2015-07-29 13:36     ` Serge E. Hallyn
  2 siblings, 0 replies; 11+ messages in thread
From: Serge E. Hallyn @ 2015-07-29 13:36 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: ksummit-discuss, Serge E. Hallyn

On Fri, Jul 24, 2015 at 11:58:00AM -0500, Eric W. Biederman wrote:
> David Howells <dhowells@redhat.com> writes:
> 
> > [With Miklós's email address fixed]
> >
> > I would like to propose a technical session on filesystem unioning.  There are
> > a number of issues:
> >
> >  (1) Whiteouts.
> >
> >      Linus's idea that a union layer or overlay mounted not as part of a union
> >      but separately, should expose whiteouts as 0,0 chardevs.  Whilst this
> >      might indeed make the backup tools easier as things like tar can then use
> >      the stat() and mknod() interfaces rather than having to use special
> >      ioctls or syscalls, Miklós's idea to implement them as actual 0,0
> >      chardevs in the underlying filesystem incurs some problems:
> >
> >      (a) It's slow and resource intensive.
> >
> >      	 Every whiteout requires an inode to represent it.  This means that if
> >      	 you, say, have a directory in the lower layer that has a few thousand
> >      	 inodes in it and you delete them all, you then eat up inode table
> >      	 space in the upper layer.
> >
> > 	 Further, every chardev inode has to be stat'd to see if it is really
> > 	 a whiteout.
> >
> >      (b) It has provided lock ordering issues in overlayfs directory reading
> >      	 because overlayfs has to stat each chardev from within the directory
> >      	 iterator.
> >
> >      I have patches to make Ext2 and JFFS2 use special directory entries
> >      labelled with DT_WHITEOUT and no inode.  This is more space efficient and
> >      faster and can be extended to Ext3 and Ext4.  XFS has constants defined
> >      for doing similar.
> >
> >      I would propose that we change overlayfs to do this.
> >
> >      Unfortunately, we would still have to support the then obsolete 0,0
> >      chardevs on disk.
> >
> >      The stat() and mknod() syscalls would then have to present these objects
> >      to the user as 0,0 chardevs rather than ENOENT errors.  To do this it
> >      might be necessary to have a special mount flag to turn off the
> >      translation to DENTRY_WHITEOUT_TYPE dentries and record them as
> >      DENTRY_SPECIAL_TYPE instead with an in-memory inode struct showing it to
> >      be 0,0 chardevs.
> >
> >      David Woodhouse did make an additional suggestion that would make 0,0
> >      chardevs less space inefficient - and that's to hard link a reserved
> >      inode.
> 
> 
> >  (2) Opaque inodes.
> >
> >      Should we use an xattr to mark inodes as opaque or should we use an inode
> >      flag?  I have patches to add such an inode flag for Ext2 and JFFS2.
> >      Marking the inode would be more space and time efficient.
> >
> >  (3) Fall-through markers.
> >
> >      Unionmount - and possibly other filesystem unioning systems - perform
> >      directory integration on disk.  (Note that overlayfs maintains this in
> >      memory for the lifetime of a directory inode).
> >
> >      With unionmount, an integrated directory is marked as being opaque with
> >      special directory entries of type DT_FALLTHRU indicating where there is
> >      stuff in lower layers that can be accessed.
> >
> >      Should we, perhaps, declare that the user sees such markers as 0,1
> >      chardevs when the layer is not mounted as part of a union?
> >
> >  (4) Unionmount and other filesystem unioning systems.
> >
> >      Do we want to add other filesystem unioning systems into the kernel?
> >      I've brought in a lot of the stuff for unionmount to help support
> >      overlayfs.  Unfortunately, overlayfs interferes with some of the stuff
> >      that unionmount wants to do (e.g. doing whiteouts differently and in an
> >      awkward manner).
> >
> >  (5) Lack of POSIX characteristics.
> >
> >      There have been complaints that overlayfs isn't sufficiently POSIX like.
> >      Now, this is by design on the part of overlayfs and I agree with the
> >      Miklós that this is the right way to do it.  However, some mitigation
> >      might be required.
> >
> >      One of the most annoying features is the fact that if you do:
> >
> > 	fd1 = open("foo", O_RDONLY);
> > 	fd2 = open("foo", O_RDWR);
> >
> >      then fd1 and fd2 don't necessarily point to the same file.
> >
> >      I have been given patches by Ratna Bolla that speculatively copy the file
> >      into the overlayfs file inode as the pages are accessed and direct file
> >      accesses to the overlay inode rather than one of the two layers.  I saw a
> >      number of problems with the approach, but it's possible his latest patch
> >      fixes them.
> >
> >  (6) File-by-file waiver of unioning.
> >
> >      Jan Olszak has requested that it be possible to mark files in one of the
> >      layers to suppress copy up on that file and to direct writes to the lower
> >      layer.  This causes problems with rename however.
> >
> >  (7) File locking and notifications.
> >
> >      These are similar issues.  IIRC, we decided at the Filesystem Summit that
> >      you get to take locks on the union inode only and that the notifications
> >      only follow changes to the upper layer.  This means that you don't get
> >      union/union interactions through a common lower layer.
> >
> >      However, we've since had complaints that tail doesn't follow changes made
> >      to the lower layer (from James Harvey).
> >
> >  (8) LSMs and unions/overlays.
> >
> >      Path-based LSMs should just work now that file->f_path points to the
> >      union layer inode, though they may require namespace awareness.
> >
> >      Label-based LSMs are another matter.  file->f_path.dentry->d_inode points
> >      to the top layer label and file->f_inode points to the lower layer label.
> >      Currently the user of the overlay can 'see through' the overlay and
> >      access lower files in terms of the labels from the lower layer when doing
> >      file operations, but uses the label from the upper layer when doing inode
> >      operations.  I think this should be consistent and should only use the
> >      upper layer label.  I'm working on patches to get this to work, but there
> >      is dissension over which label should be seen.
> >
> >      Further, mandating that the upper label should be seen does cause
> >      unionmount a problem as there's no upper inode to hang the label off.
> >      This means that the label must be forged anew each time it is required
> >      until at such time a copy-up is effected.
> >
> 
> (9) Unprivileged mounts
> 
>     As there are no backing store issues it should be a tractable
>     problem to get the semantics right to allow containers to use
>     overlayfs.  A naive attempt was made by Serge Hallyn and he ran
>     into security issues with copy-up.  Can copy-up be made safe if
>     unprivileged users (AKA user namespace root users) mount overlayfs?
> 
>     I think that also intersects with your LSM label handling issues.

Thanks, Eric.

Sounds like a very interesting topic all around.

-serge


* Re: [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues
  2015-07-28  7:13     ` Miklos Szeredi
  2015-07-28 12:16       ` Theodore Ts'o
@ 2015-10-15 19:49       ` David Howells
  1 sibling, 0 replies; 11+ messages in thread
From: David Howells @ 2015-10-15 19:49 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ksummit-discuss

Theodore Ts'o <tytso@mit.edu> wrote:

> Yep, this seems like a no-brainer.  Given that, is there anything that
> requires discussion at the Kernel Summit?

Seems not.  This didn't garner much interest that I can tell.

David


Thread overview: 11+ messages
2015-07-24 16:01 [Ksummit-discuss] [TECH TOPIC] Overlays and file(system) unioning issues David Howells
2015-07-24 16:10 ` David Howells
2015-07-24 16:58   ` Eric W. Biederman
2015-07-24 17:12     ` James Bottomley
2015-07-25 15:39     ` Lai Jiangshan
2015-07-29 13:36     ` Serge E. Hallyn
2015-07-27 13:19 ` David Woodhouse
2015-07-27 14:33   ` Theodore Ts'o
2015-07-28  7:13     ` Miklos Szeredi
2015-07-28 12:16       ` Theodore Ts'o
2015-10-15 19:49       ` David Howells
