On Fri, May 02, 2014 at 01:11:03PM -0400, Dave Jones wrote:
> On Fri, May 02, 2014 at 09:44:42AM -0700, Josh Triplett wrote:
>  
>  > Topics:
>  > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
>  > - Optimizing a kernel for its exact target userspace.
>  > - Examples of shrinking the kernel
> 
> Something that's partially related here: Making stuff optional
> reduces attack surface the kernel presents. We're starting to grow
> more and more CONFIG options to disable syscalls. I'd like to hear
> peoples reactions on introducing even more optionality in this area.

I'd certainly like to see just about every syscall made optional, for
userspace that doesn't need it.  For specialized systems, that would
certainly decrease attack surface.  However, seccomp decreases attack
surface by the same amount, and for everything except those specialized
systems it makes more sense, because the set of available syscalls can
then change with a simple policy change rather than a new kernel.

And this doesn't free us from the obligation to make all new APIs
secure against hostile userspace.

> I had a patch to make this particular syscall a cond_syscall, but then
> XFS ate my homework and I haven't had chance to revisit this.
> So, my questions are:
> - are there other obvious syscalls we could make optional without userspace
>   freaking out when they suddenly start getting ENOSYS ?

I've attached a complete list of the syscalls from
include/linux/syscalls.h that do not appear in kernel/sys_ni.c, and thus
always exist.  (syscalls.h notably does not include all the
arch-specific syscalls, some of which might make sense to leave out as
well.)
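
For anyone who wants to regenerate that kind of list, here is a rough
sketch of the comparison.  It assumes syscalls are declared as
"asmlinkage long sys_foo(...)" in syscalls.h and made optional via
"cond_syscall(sys_foo);" in sys_ni.c; the function name and the sample
excerpts are made up for illustration, and the regexes will miss
anything declared differently.

```python
import re

# Matches prototypes such as "asmlinkage long sys_foo(...)" in syscalls.h.
DECL_RE = re.compile(r"\basmlinkage\s+long\s+(sys_\w+)\s*\(")
# Matches "cond_syscall(sys_foo);" entries in kernel/sys_ni.c.
COND_RE = re.compile(r"\bcond_syscall\(\s*(\w+)\s*\)")


def always_present(syscalls_h, sys_ni_c):
    """Return declared syscalls that have no cond_syscall() fallback."""
    declared = set(DECL_RE.findall(syscalls_h))
    optional = set(COND_RE.findall(sys_ni_c))
    return sorted(declared - optional)


if __name__ == "__main__":
    # Made-up excerpts standing in for the real files.
    header = """
    asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
                             unsigned long idx1, unsigned long idx2);
    asmlinkage long sys_nice(int increment);
    asmlinkage long sys_times(struct tms __user *tbuf);
    """
    stubs = """
    cond_syscall(sys_kcmp);
    cond_syscall(compat_sys_kexec_load);
    """
    print(always_present(header, stubs))  # ['sys_nice', 'sys_times']
```

On a real tree you would pass in the contents of the two files instead
of the inline strings.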

Of those, a few classes of syscalls seem like obvious candidates, for
various kinds of specialized or legacy-free systems:

- For any syscall updated to have a foo2, foo3, etc, a single config
  option to leave out all the older versions would make sense, to go
  with userspace that never calls the older versions.
- Likewise, the non-64 file calls.
- Likewise, sys_old*
- splice/vmsplice/tee.
- sys_*sync*
- sys_clock_* and any other time functions.
- sys_sched_*
- All signal-related syscalls
- rlimit syscalls
- sys_*xattr*
- sys_nice
- sys_cap{get,set}
- fadvise, fallocate, readahead, etc.
- uid/gid functions.
- ioperm/iopl
- ptrace
- sendfile
- times
- utimes and company
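
Several of the classes above are just name patterns, so sorting a
syscall list into them can be mechanical.  A minimal sketch, assuming
glob-style matching on the sys_* names; the class labels are
hypothetical and only a few of the classes are shown.

```python
from fnmatch import fnmatchcase

# Hypothetical class labels mapped to glob patterns from the list above.
CLASSES = {
    "legacy": ["sys_old*"],
    "splice": ["sys_splice", "sys_vmsplice", "sys_tee"],
    "sync": ["sys_*sync*"],
    "clock": ["sys_clock_*"],
    "sched": ["sys_sched_*"],
    "xattr": ["sys_*xattr*"],
    "caps": ["sys_capget", "sys_capset"],
}


def classify(name):
    """Return the first class whose pattern matches name, or None."""
    for cls, patterns in CLASSES.items():
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return cls
    return None


if __name__ == "__main__":
    for name in ["sys_fsync", "sys_clock_gettime", "sys_olduname", "sys_read"]:
        print(name, classify(name))
```

Anything that comes back as None falls outside these patterns and would
need a decision by hand.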

> - how much configurability here is too much ?
>   r_f_p was an obvious candidate because it's.. well, nasty.  Some of the
>   more straightforward syscalls may not be such a big deal, but then we
>   have CONFIG's for kcmp and other 'simple' syscalls already..

We need a more systematic mechanism, I think.  CONFIG_SYSCALL_FOO for
every possible FOO seems like too much, even for classes of syscalls.
Ideally, we could feed in a table of syscalls collected by some
analysis of the target userspace, and the kernel would then have exactly
those syscalls.
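
To make the table idea concrete, here is a sketch of the bookkeeping
only.  It assumes the table is a plain list of sys_* names, and that
the syscalls to drop could be expressed as cond_syscall() entries in
the style of kernel/sys_ni.c.  Both function names are hypothetical,
and nothing in the build consumes such output today.

```python
def plan_kernel_syscalls(available, needed):
    """Split the available syscalls into (keep, drop) given what userspace needs."""
    available = set(available)
    needed = set(needed)
    unknown = sorted(needed - available)
    if unknown:
        # The userspace analysis asked for something the kernel does not provide.
        raise ValueError("unknown syscalls: " + ", ".join(unknown))
    return sorted(needed), sorted(available - needed)


def render_stubs(drop):
    """Render dropped syscalls as cond_syscall() lines (assumed output format)."""
    return "\n".join(f"cond_syscall({name});" for name in drop)


if __name__ == "__main__":
    available = ["sys_read", "sys_write", "sys_ptrace", "sys_nice"]
    needed = ["sys_write", "sys_read"]
    keep, drop = plan_kernel_syscalls(available, needed)
    print(keep)  # ['sys_read', 'sys_write']
    print(render_stubs(drop))
```

The hard part is the analysis that produces the table, not this step.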

- Josh Triplett