[Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions


* [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
@ 2014-05-02 16:44 Josh Triplett
  2014-05-02 17:11 ` Dave Jones
                   ` (4 more replies)
  0 siblings, 5 replies; 79+ messages in thread
From: Josh Triplett @ 2014-05-02 16:44 UTC (permalink / raw)
  To: ksummit-discuss
  Cc: Sarah Sharp, Greg KH, Julia Lawall, Darren Hart, Dan Carpenter

Over time, the Linux kernel has grown far more featureful, but it has
also grown significantly larger, even with all the optional features
turned off.  For the last several years, I've been working to make the
kernel smaller, and mentoring/coordinating projects to do the same, to
enable ridiculously small embedded applications and other fun uses.  I'd
like to discuss that work at Kernel Summit, get size regressions on the
radar of kernel developers and subsystem maintainers, and solicit
discussion on future possibilities to shrink the kernel further.

Topics:
- An overview of why the kernel's size still matters today ("but don't
  we all have tons of memory and storage?")
- Tiny in RAM versus tiny on storage.
- How much the kernel has grown over time.
- How size regressions happen and how to avoid them
- Size measurement, bloat-o-meter, allnoconfig, and other tools
- Compression and the decompression stub
- Kconfig, and avoiding excessive configurability in the pursuit of tiny
- Optimizing a kernel for its exact target userspace.
- Examples of shrinking the kernel
- Discussion on proposed ways to make the kernel tiny, how much they
  might save, how much work they'd require, and how to implement them
  with minimal impact to the un-shrunken common case.
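
A minimal sketch of the size-measurement idea, in the spirit of
bloat-o-meter but not its actual implementation. It assumes each build's
symbol listing is text in the three-column "hex-size type name" form that
`nm --size-sort` prints; the function names `parse_sizes` and
`size_deltas` are hypothetical, chosen only for this illustration.

```python
def parse_sizes(nm_text):
    """Map symbol name -> size in bytes from "hex-size type name" lines."""
    sizes = {}
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore anything that is not a three-column row
        size_hex, _sym_type, name = parts
        # Static symbols can share a name; sum them rather than overwrite.
        sizes[name] = sizes.get(name, 0) + int(size_hex, 16)
    return sizes


def size_deltas(old_text, new_text):
    """Return (name, delta) pairs for changed symbols, largest growth first."""
    old, new = parse_sizes(old_text), parse_sizes(new_text)
    deltas = []
    for name in old.keys() | new.keys():
        delta = new.get(name, 0) - old.get(name, 0)
        if delta:
            deltas.append((name, delta))
    # Sort by descending delta, then by name so the order is deterministic.
    return sorted(deltas, key=lambda item: (-item[1], item[0]))


# Made-up listings for two builds of the same object.
OLD = "00000010 T foo\n00000020 t bar\n"
NEW = "00000018 T foo\n00000008 T baz\n"
```

Calling `size_deltas(OLD, NEW)` on the made-up listings reports growth
first and shrinkage last, which is the ordering a size-regression check
would care about.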

After the session, I'll prepare and maintain a detailed summary of the
proposed ideas, ordered by how much space they'd save versus how much
work they'd be.  I plan to maintain that list on an ongoing basis, to
coordinate tinification projects for ongoing work by people working on
embedded applications, and for the benefit of mentorship projects such
as OPW and SoC.

The set of people on the CC list would be good to have at the
discussion.  I marked this as "CORE TOPIC", but I'd also be happy to
discuss it on the dual-track day if that's preferable.

(I'm currently on the auto-generated nomination list.)

- Josh Triplett


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 16:44 [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions Josh Triplett
@ 2014-05-02 17:11 ` Dave Jones
  2014-05-02 17:20   ` James Bottomley
                     ` (3 more replies)
  2014-05-08 15:52 ` Christoph Lameter
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 79+ messages in thread
From: Dave Jones @ 2014-05-02 17:11 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 02, 2014 at 09:44:42AM -0700, Josh Triplett wrote:

 > Topics:
 > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
 > - Optimizing a kernel for its exact target userspace.
 > - Examples of shrinking the kernel

Something that's partially related here: Making stuff optional
reduces attack surface the kernel presents. We're starting to grow
more and more CONFIG options to disable syscalls. I'd like to hear
people's reactions to introducing even more optionality in this area.

I first started thinking about this at LSF/MM where the subject of
sys_remap_file_pages came up. "What even uses this?" "hardly anything".
But for all the users that don't need it, there's this syscall always
built in that does horrible things with VM internals.  It's fortunate
that there hasn't been anything particularly awful beyond simple DoS
bugs in r_f_p.

Distribution kernels are in the sad position of having to always enable
this stuff, but at least for people building their own kernels, or
kernels for appliances, we could make their lives a little better by
not even building this stuff in.

I had a patch to make this particular syscall a cond_syscall, but then
XFS ate my homework and I haven't had a chance to revisit this.
So, my questions are:
- are there other obvious syscalls we could make optional without userspace
  freaking out when they suddenly start getting ENOSYS ?
- how much configurability here is too much ?
  r_f_p was an obvious candidate because it's.. well, nasty.  Some of the
  more straightforward syscalls may not be such a big deal, but then we
  have CONFIG's for kcmp and other 'simple' syscalls already..

thoughts?

	Dave
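
A minimal sketch of the ENOSYS concern above, assuming a userspace
program wraps an optional syscall and keeps a portable fallback. The
names `call_with_fallback`, `fast_remap`, and `portable_remap` are
hypothetical stand-ins; no real syscall is invoked here.

```python
import errno


def call_with_fallback(primary, fallback, *args):
    """Try primary; if the kernel reports ENOSYS, use fallback instead."""
    try:
        return primary(*args)
    except OSError as exc:
        if exc.errno != errno.ENOSYS:
            raise  # a real failure, not a compiled-out syscall
        return fallback(*args)


# Hypothetical wrapper around a syscall that this kernel was built without.
def fast_remap(n):
    raise OSError(errno.ENOSYS, "Function not implemented")


# Hypothetical slower replacement that needs no optional syscall.
def portable_remap(n):
    return n * 2
```

Userspace that is written this way keeps working when the syscall is
configured out; userspace that treats ENOSYS as fatal is the case that
would "freak out".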


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:11 ` Dave Jones
@ 2014-05-02 17:20   ` James Bottomley
  2014-05-02 17:33     ` Dave Jones
                       ` (3 more replies)
  2014-05-02 22:04   ` Jan Kara
                     ` (2 subsequent siblings)
  3 siblings, 4 replies; 79+ messages in thread
From: James Bottomley @ 2014-05-02 17:20 UTC (permalink / raw)
  To: Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 2014-05-02 at 13:11 -0400, Dave Jones wrote:
> On Fri, May 02, 2014 at 09:44:42AM -0700, Josh Triplett wrote:
>  
>  > Topics:
>  > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
>  > - Optimizing a kernel for its exact target userspace.
>  > - Examples of shrinking the kernel
> 
> Something that's partially related here: Making stuff optional
> reduces attack surface the kernel presents. We're starting to grow
> more and more CONFIG options to disable syscalls. I'd like to hear
> people's reactions to introducing even more optionality in this area.

My first reaction is reducing the attack surface sounds a reasonable
idea.  My second reaction is that the plural in options makes me want to
run for the hills.  Having a sea of options for enabling and disabling
syscalls gives us the potential for having a set of kernels all with a
slightly differing ABI as people choose what to enable and disable.

If we do this, I think we should have a small number of options related
to use case ... say something like a secure router kernel
CONFIG_SECURE_ROUTER which disables anything a secure router wouldn't
need.

For the distros we could have an ordinary and a reduced attack surface
kernel CONFIG_REDUCED_ATTACK_SURFACE.

The thing I really want to avoid is binaries compiled for one distro not
running on another because of syscall differences.

> I first started thinking about this at LSF/MM where the subject of
> sys_remap_file_pages came up. "What even uses this?" "hardly anything".
> But for all the users that don't need it, there's this syscall always
> built in that does horrible things with VM internals.  It's fortunate
> that there hasn't been anything particularly awful beyond simple DoS
> bugs in r_f_p.
> 
> Distribution kernels are in the sad position of having to always enable
> this stuff, but at least for people building their own kernels, or
> kernels for appliances, we could make their lives a little better by
> not even building this stuff in.
> 
> I had a patch to make this particular syscall a cond_syscall, but then
> XFS ate my homework and I haven't had a chance to revisit this.
> So, my questions are:
> - are there other obvious syscalls we could make optional without userspace
>   freaking out when they suddenly start getting ENOSYS ?
> - how much configurability here is too much ?

I covered this above.

>   r_f_p was an obvious candidate because it's.. well, nasty.  Some of the
>   more straightforward syscalls may not be such a big deal, but then we
>   have CONFIG's for kcmp and other 'simple' syscalls already..

Speaking with my Checkpoint/Restore/process migration container hat on,
we need kcmp.  It was designed with security in mind (originally we'd
exposed kernel virtual addresses).  Perhaps some of this hardening
should be focussed more sharply on what is this syscall trying to do and
could it achieve its aim in a more secure way.

James


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:20   ` James Bottomley
@ 2014-05-02 17:33     ` Dave Jones
  2014-05-02 17:46       ` Josh Boyer
  2014-05-02 19:03       ` Mark Brown
  2014-05-02 17:33     ` Guenter Roeck
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 79+ messages in thread
From: Dave Jones @ 2014-05-02 17:33 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 02, 2014 at 10:20:29AM -0700, James Bottomley wrote:

 > > Something that's partially related here: Making stuff optional
 > > reduces attack surface the kernel presents. We're starting to grow
 > > more and more CONFIG options to disable syscalls. I'd like to hear
 > > people's reactions to introducing even more optionality in this area.
 > 
 > My first reaction is reducing the attack surface sounds a reasonable
 > idea.  My second reaction is that the plural in options makes me want to
 > run for the hills.  Having a sea of options for enabling and disabling
 > syscalls gives us the potential for having a set of kernels all with a
 > slightly differing ABI as people choose what to enable and disable.

Absolutely.  Deciding where the line is between 'core' syscalls, and
optional ones is the tricky part.

 > If we do this, I think we should have a small number of options related
 > to use case ... say something like a secure router kernel
 > CONFIG_SECURE_ROUTER which disables anything a secure router wouldn't
 > need.

That might be one option.  And have these 'profiles' do select
WANT_SYS_WHATEVER to reduce #ifdef complexity in the code..

 > For the distros we could have an ordinary and a reduced attack surface
 > kernel CONFIG_REDUCED_ATTACK_SURFACE.

Speaking from experience, when you have two flavors of something, over
time things tend to creep.  After a few "I want the reduced kernel but with xyz"
requests, we're back to where we started, but I can certainly see people
who do specialized kernels like tails wanting something like this.

 > The thing I really want to avoid is binaries compiled for one distro not
 > running on another because of syscall differences.

This is one reason I think at least for general purpose distributions,
this isn't really an option.  Red Hat, SuSE etc _have_ to ship
remap_file_pages because some tiny percentage of their userbase wants it.

Something else that might be worth thinking about would be a runtime
method to disable syscalls.  That might actually be more useful in the
general case, but less so for the "I want a smaller build" crowd.

 > >   r_f_p was an obvious candidate because it's.. well, nasty.  Some of the
 > >   more straightforward syscalls may not be such a big deal, but then we
 > >   have CONFIG's for kcmp and other 'simple' syscalls already..
 > 
 > Speaking with my Checkpoint/Restore/process migration container hat on,
 > we need kcmp.  It was designed with security in mind (originally we'd
 > exposed kernel virtual addresses).

Right, but as not everyone uses/needs checkpointing, it's an optional thing.
I'm not proposing taking anything away for general purpose kernels here.
kcmp is an example of the sort of thing I'd like to see more of.

	Dave


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:20   ` James Bottomley
  2014-05-02 17:33     ` Dave Jones
@ 2014-05-02 17:33     ` Guenter Roeck
  2014-05-02 17:44     ` Steven Rostedt
  2014-05-07 11:32     ` David Woodhouse
  3 siblings, 0 replies; 79+ messages in thread
From: Guenter Roeck @ 2014-05-02 17:33 UTC (permalink / raw)
  To: ksummit-discuss

On 05/02/2014 10:20 AM, James Bottomley wrote:
> On Fri, 2014-05-02 at 13:11 -0400, Dave Jones wrote:
>> On Fri, May 02, 2014 at 09:44:42AM -0700, Josh Triplett wrote:
>>
>>   > Topics:
>>   > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
>>   > - Optimizing a kernel for its exact target userspace.
>>   > - Examples of shrinking the kernel
>>
>> Something that's partially related here: Making stuff optional
>> reduces attack surface the kernel presents. We're starting to grow
>> more and more CONFIG options to disable syscalls. I'd like to hear
>> people's reactions to introducing even more optionality in this area.
>
> My first reaction is reducing the attack surface sounds a reasonable
> idea.  My second reaction is that the plural in options makes me want to
> run for the hills.  Having a sea of options for enabling and disabling
> syscalls gives us the potential for having a set of kernels all with a
> slightly differing ABI as people choose what to enable and disable.
>
> If we do this, I think we should have a small number of options related
> to use case ... say something like a secure router kernel
> CONFIG_SECURE_ROUTER which disables anything a secure router wouldn't
> need.
>
> For the distros we could have an ordinary and a reduced attack surface
> kernel CONFIG_REDUCED_ATTACK_SURFACE.
>

Excellent idea. I'd like to be able to enable both options, for my company's
routers but even on my home servers if possible (ie for a distribution which
is tagged as 'server' distribution).

Guenter


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:20   ` James Bottomley
  2014-05-02 17:33     ` Dave Jones
  2014-05-02 17:33     ` Guenter Roeck
@ 2014-05-02 17:44     ` Steven Rostedt
  2014-05-07 11:32     ` David Woodhouse
  3 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2014-05-02 17:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 02 May 2014 10:20:29 -0700
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> If we do this, I think we should have a small number of options related
> to use case ... say something like a secure router kernel
> CONFIG_SECURE_ROUTER which disables anything a secure router wouldn't
> need.

I was thinking the same thing.

> 
> For the distros we could have an ordinary and a reduced attack surface
> kernel CONFIG_REDUCED_ATTACK_SURFACE.

Ug, that's a horrible name. Not selecting it would imply we want to
increase the attack surface.

> 
> The thing I really want to avoid is binaries compiled for one distro not
> running on another because of syscall differences.

Agreed.

Your first config option name looks more the way we want to go. Didn't
Linus once ask for config profiles? That is, we could say
CONFIG_FIREWALL, and everything for a firewall would be set. Or
CONFIG_LAPTOP, which would focus on everything for a laptop, etc.

Whatever happened to that? The kbuild environment too scary for
everyone?

I wonder if we should seek out people to rewrite it. Or at least
document how the entire thing works. Every time I have to look at that
code I get the willies.

-- Steve


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:33     ` Dave Jones
@ 2014-05-02 17:46       ` Josh Boyer
  2014-05-02 18:50         ` H. Peter Anvin
  2014-05-02 19:03       ` Mark Brown
  1 sibling, 1 reply; 79+ messages in thread
From: Josh Boyer @ 2014-05-02 17:46 UTC (permalink / raw)
  To: Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 2, 2014 at 1:33 PM, Dave Jones <davej@redhat.com> wrote:
> On Fri, May 02, 2014 at 10:20:29AM -0700, James Bottomley wrote:
>
>  > > Something that's partially related here: Making stuff optional
>  > > reduces attack surface the kernel presents. We're starting to grow
>  > > more and more CONFIG options to disable syscalls. I'd like to hear
>  > > people's reactions to introducing even more optionality in this area.
>  >
>  > My first reaction is reducing the attack surface sounds a reasonable
>  > idea.  My second reaction is that the plural in options makes me want to
>  > run for the hills.  Having a sea of options for enabling and disabling
>  > syscalls gives us the potential for having a set of kernels all with a
>  > slightly differing ABI as people choose what to enable and disable.
>
> Absolutely.  Deciding where the line is between 'core' syscalls, and
> optional ones is the tricky part.
>
>  > If we do this, I think we should have a small number of options related
>  > to use case ... say something like a secure router kernel
>  > CONFIG_SECURE_ROUTER which disables anything a secure router wouldn't
>  > need.
>
> That might be one option.  And have these 'profiles' do select
> WANT_SYS_WHATEVER to reduce #ifdef complexity in the code..
>
>  > For the distros we could have an ordinary and a reduced attack surface
>  > kernel CONFIG_REDUCED_ATTACK_SURFACE.
>
> Speaking from experience, when you have two flavors of something, over
> time things tend to creep.  After a few "I want the reduced kernel but with xyz"
> requests, we're back to where we started, but I can certainly see people
> who do specialized kernels like tails wanting something like this.
>
>  > The thing I really want to avoid is binaries compiled for one distro not
>  > running on another because of syscall differences.
>
> This is one reason I think at least for general purpose distributions,
> this isn't really an option.  Red Hat, SuSE etc _have_ to ship
> remap_file_pages because some tiny percentage of their userbase wants it.

To mitigate that some, new syscalls could be added with CONFIG
wrappers that default to disabled.  The userbases can't use something
that isn't explicitly turned on, and people would likely need to
request those syscalls.  It would give the distros at least a measure
of how frequently that new syscall would be used, and in what
situations.

josh


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:46       ` Josh Boyer
@ 2014-05-02 18:50         ` H. Peter Anvin
  2014-05-02 19:02           ` Josh Boyer
  2014-05-02 19:03           ` Michael Kerrisk (man-pages)
  0 siblings, 2 replies; 79+ messages in thread
From: H. Peter Anvin @ 2014-05-02 18:50 UTC (permalink / raw)
  To: Josh Boyer, Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On 05/02/2014 10:46 AM, Josh Boyer wrote:
> 
> To mitigate that some, new syscalls could be added with CONFIG
> wrappers that default to disabled.  The userbases can't use something
> that isn't explicitly turned on, and people would likely need to
> request those syscalls.  It would give the distros at least a measure
> of how frequently that new syscall would be used, and in what
> situations.
> 

In practice that is equivalent to not having the syscall at all.

	-hpa


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 18:50         ` H. Peter Anvin
@ 2014-05-02 19:02           ` Josh Boyer
  2014-05-02 19:03           ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 79+ messages in thread
From: Josh Boyer @ 2014-05-02 19:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 2, 2014 at 2:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 05/02/2014 10:46 AM, Josh Boyer wrote:
>>
>> To mitigate that some, new syscalls could be added with CONFIG
>> wrappers that default to disabled.  The userbases can't use something
>> that isn't explicitly turned on, and people would likely need to
>> request those syscalls.  It would give the distros at least a measure
>> of how frequently that new syscall would be used, and in what
>> situations.
>>
>
> In practice that is equivalent to not having the syscall at all.

Possibly.  Typical case is that end users won't have those syscalls
enabled and they won't care because nothing is using them.  In the
event that someone introduces something into the distro that does use
it, you'd enable it, etc.  That doesn't help the "one binary doesn't
work on multiple distros" problem though, I guess.

My concern with CONFIG_ROUTER and other target profile variants is
that it seems like an attempt at a system-wide seccomp of sorts, only
via Kconfig options.  You could accomplish similar things with SELinux
or other security modules, so why would we go through the hassle of
bugging people about syscall configs?  I'm also skeptical that a
general purpose distro would actually use anything but the broadest
profile, but I'm not saying they wouldn't be useful.

josh


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:33     ` Dave Jones
  2014-05-02 17:46       ` Josh Boyer
@ 2014-05-02 19:03       ` Mark Brown
  2014-05-02 19:45         ` Luck, Tony
  1 sibling, 1 reply; 79+ messages in thread
From: Mark Brown @ 2014-05-02 19:03 UTC (permalink / raw)
  To: Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter


On Fri, May 02, 2014 at 01:33:09PM -0400, Dave Jones wrote:

> Something else that might be worth thinking about would be a runtime
> method to disable syscalls.  That might actually be more useful in the
> general case, but less so for the "I want a smaller build" crowd.

It would be useful for the smaller build case to have a way of auditing
which syscalls are actually in use on a system so you can then go
through and construct a minimal config.



* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 18:50         ` H. Peter Anvin
  2014-05-02 19:02           ` Josh Boyer
@ 2014-05-02 19:03           ` Michael Kerrisk (man-pages)
  2014-05-02 19:33             ` Theodore Ts'o
  1 sibling, 1 reply; 79+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-02 19:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 2, 2014 at 8:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 05/02/2014 10:46 AM, Josh Boyer wrote:
>>
>> To mitigate that some, new syscalls could be added with CONFIG
>> wrappers that default to disabled.  The userbases can't use something
>> that isn't explicitly turned on, and people would likely need to
>> request those syscalls.  It would give the distros at least a measure
>> of how frequently that new syscall would be used, and in what
>> situations.
>>
>
> In practice that is equivalent to not having the syscall at all.

Amen!

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:03           ` Michael Kerrisk (man-pages)
@ 2014-05-02 19:33             ` Theodore Ts'o
  2014-05-02 19:38               ` Jiri Kosina
                                 ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Theodore Ts'o @ 2014-05-02 19:33 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

There's been a huge focus on system calls in this discussion, and I
suspect this is a bit of a red herring.  Taking a look at "git log
arch/x86/syscalls/syscall_64.tbl" --- since all the world is no
longer a VAX, but rather an x86_64 :-P --- there really haven't been
that many new system calls lately.  Yes, we recently added
renameat(2), but the one before that was half a year earlier, when the
new scheduler parameter syscalls went in.

There's much more in the way of kernel functionality and complexity
which isn't really syscall related --- for example, all of the control
group stuff, and security hair caused by things like user namespaces,
and new fallocate(2) modes --- we've added PUNCH_HOLE, COLLAPSE_RANGE,
and ZERO_RANGE, and there are threats to add INSERT_RANGE in the next
release or two.

And if you look at things like renameat(2), the actual code savings by
removing renameat(2) is pretty small, and IMHO, not worth the
complexity and uncertainty that it would represent to application
programmers of "does this system call exist or doesn't it".

In contrast, if you want to look at the bloat and complexity added by
the pluggable security LSMs, control groups, and namespaces, the
comparison isn't even close.  Furthermore, given that low-level
programs like systemd have grown to require control groups,
it's not like you can even realistically strip it from potentially
even many embedded kernels, since there seems to be a movement to have
systemd infect even smaller embedded applications.

Anyone want to lay odds on when systemd will start using various
namespaces for its own purposes?  :-)

						- Ted


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:33             ` Theodore Ts'o
@ 2014-05-02 19:38               ` Jiri Kosina
  2014-05-02 19:49               ` Dave Jones
  2014-05-03 13:32               ` Michael Kerrisk (man-pages)
  2 siblings, 0 replies; 79+ messages in thread
From: Jiri Kosina @ 2014-05-02 19:38 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, 2 May 2014, Theodore Ts'o wrote:

> Furthermore, given that low-level programs like systemd have
> grown to require control groups, it's not like you can even 
> realistically strip it from potentially even many embedded kernels, 
> since there seems to be a movement to have systemd infect even smaller 
> embedded applications.
> 
> Anyone want to lay odds on when systemd will start using various
> namespaces for its own purposes?  :-)

This is actually an interesting topic on its own (though I am not
completely sure what kernel maintainers could realistically do about
it) -- userspace _strictly_ depending on _optional_ kernel features.

systemd and cgroups is a good example, but not the only one, I think. A 
few weeks ago I was struggling to boot a systemd-based userspace, to find 
out that fanotify needs to be turned on.

-- 
Jiri Kosina
SUSE Labs


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:03       ` Mark Brown
@ 2014-05-02 19:45         ` Luck, Tony
  2014-05-02 21:03           ` Mark Brown
  0 siblings, 1 reply; 79+ messages in thread
From: Luck, Tony @ 2014-05-02 19:45 UTC (permalink / raw)
  To: Mark Brown, Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

> It would be useful for the smaller build case to have a way of auditing
> which syscalls are actually in use on a system so you can then go
> through and construct a minimal config.

"strace -c" ?

-Tony
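
A minimal sketch of how the summary from "strace -c" could be turned into
a list of syscalls in use, assuming the usual column layout of that
summary (% time, seconds, usecs/call, calls, optional errors, syscall
name). The function name `syscalls_in_use` and the sample text are
hypothetical, and the numbers in the sample are made up.

```python
def syscalls_in_use(summary):
    """Return {syscall_name: call_count} from `strace -c` summary text."""
    counts = {}
    for line in summary.splitlines():
        fields = line.split()
        # Data rows start with a percentage; this skips the header and
        # the dashed rule lines.
        if len(fields) < 5 or not fields[0].replace(".", "", 1).isdigit():
            continue
        name = fields[-1]
        if name == "total":
            continue  # the closing summary row is not a syscall
        # Columns: % time, seconds, usecs/call, calls, [errors], syscall
        counts[name] = int(fields[3])
    return counts


SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.00    0.000030           3        10           read
 40.00    0.000020           4         5         2 openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000050                    15         2 total
"""
```

Note that this only covers the processes that were actually traced, so it
would have to be run across a representative workload before the result
could guide a minimal config.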


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:33             ` Theodore Ts'o
  2014-05-02 19:38               ` Jiri Kosina
@ 2014-05-02 19:49               ` Dave Jones
  2014-05-02 20:06                 ` Steven Rostedt
                                   ` (2 more replies)
  2014-05-03 13:32               ` Michael Kerrisk (man-pages)
  2 siblings, 3 replies; 79+ messages in thread
From: Dave Jones @ 2014-05-02 19:49 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 02, 2014 at 03:33:14PM -0400, Theodore Ts'o wrote:
 > There's been a huge focus on system calls in this discussion, and I
 > suspect this is a bit of a red herring.  Taking a look at "git log
 > arch/x86/syscalls/syscall_64.tbl" --- since all the world is no
 > longer a VAX, but rather an x86_64 :-P --- there really haven't been
 > that many new system calls lately.

I may have a vested interest in syscalls :)

The rate we're adding them has slowed down, but the rate at which we're
finding bugs exposed through them has accelerated enormously over the
last few years.

To use just one example, on certain systems I'd love to be able to just
turn off sys_perf_event_open given what a trainwreck of vulnerabilities it's been
over the last few years [comedy: it is actually a config option, but x86
'selects' it, so you'll have it and you'll like it].
Thankfully at least the scarier parts of it are now hidden behind the
paranoid sysctl.

 > And if you look at things like renameat(2), the actual code savings by
 > removing renameat(2) is pretty small, and IMHO, not worth the
 > complexity and uncertainty that it would represent to application
 > programmers of "does this system call exist or doesn't it".

I think we've got two categories here.

"variant" syscalls like renameat, which just offer enhancements over
an existing syscall. Stuff that things like glibc tend to care about.
This stuff is usually pretty boring, and not even worth considering for
potentially disabling imo.

And then we have "enable boatload of code" syscalls that are typically
used by a few standalone apps/features. kexec, checkpointing, whatever
db it was that cares about remap_file_pages, mempolicy, etc. etc.

It's this "not used by every user" code that tends to scare me, because
it's written with 1-2 well behaved bits of userspace in mind, which
usually means "has so many unchecked corner cases it's not even funny"

Ok, maybe there is also a grey area in the middle, which I guess depends
on what your userspace is going to do, (things like vmsplice and
friends), but I lean towards just classing them in the 2nd category too.

 > In contrast, if you want to look at the bloat and complexity added by
 > the pluggable security LSMs, control groups, and namespaces, the
 > comparison isn't even close.  Furthermore, given that low-level
 > programs like systemd have grown to require control groups,
 > it's not like you can even realistically strip it from potentially
 > even many embedded kernels, since there seems to be a movement to have
 > systemd infect even smaller embedded applications.

Yeah, we've reached a point of no return with things like cgroups now.

 > Anyone want to lay odds on when systemd will start using various
 > namespaces for its own purposes?  :-)

I thought it already was tbh.

	Dave


* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:49               ` Dave Jones
@ 2014-05-02 20:06                 ` Steven Rostedt
  2014-05-02 20:41                 ` Theodore Ts'o
  2014-05-02 20:45                 ` Ben Hutchings
  2 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2014-05-02 20:06 UTC (permalink / raw)
  To: Dave Jones
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, 2 May 2014 15:49:35 -0400
Dave Jones <davej@redhat.com> wrote:


> I may have a vested interest in syscalls :)
> 
> The rate we're adding them has slowed down, but the rate at which we're
> finding bugs exposed through them has accelerated enormously over the
> last few years.

Well, besides perf, I'm sure the acceleration in finding bugs wasn't
due to system calls adding more of them, but to our now having much
better ways of finding bugs (yes, this is kudos to your trinity).

-- Steve

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:49               ` Dave Jones
  2014-05-02 20:06                 ` Steven Rostedt
@ 2014-05-02 20:41                 ` Theodore Ts'o
  2014-05-02 21:01                   ` Dave Jones
  2014-05-02 20:45                 ` Ben Hutchings
  2 siblings, 1 reply; 79+ messages in thread
From: Theodore Ts'o @ 2014-05-02 20:41 UTC (permalink / raw)
  To: Dave Jones
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 02, 2014 at 03:49:35PM -0400, Dave Jones wrote:
> 
> I may have a vested interest in syscalls :)

Well, yes, fair enough.  :-)

> And then we have "enable boatload of code" syscalls that are typically
> used by a few standalone apps/features. kexec, checkpointing, whatever
> db it was that cares about remap_file_pages, mempolicy, etc. etc.

Sure, that's fair enough.  I would just use a somewhat broader
definition of "boatload of code", since not all of these boatloads are
accessed via system calls.  Some are accessed via new pseudo-file
systems, some via new ioctls, some via new fallocate code points, etc.

> It's this "not used by every user" code that tends to scare me, because
> it's written with 1-2 well behaved bits of userspace in mind, which
> usually means "has so many unchecked corner cases it's not even funny"

And I think we can also further break this down into the classes of
code which require root privs (i.e., like kexec), and those which can
be used by any userid.  It's the latter which is much more problematic
from a security perspective --- as well as from an "it's much harder to
change the ABI without breaking large numbers of userspace programs"
standpoint.  (Whereas if it's only being used by 1-2 well behaved bits of
privileged userspace, it's actually a tiny bit easier to evolve the ABI.)

So perhaps what that means is that _these_ are the features which
require the most paranoia and testing before we let them
into the mainline kernel in the first place.  Otherwise, once they get
in, there's always a chance that systemd or some other piece of
userspace will start strictly requiring said optional feature, and it
doesn't matter whether we put in a CONFIG option to disable the
feature --- we'll never be able to do it.

	    	  	   	      - Ted

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:49               ` Dave Jones
  2014-05-02 20:06                 ` Steven Rostedt
  2014-05-02 20:41                 ` Theodore Ts'o
@ 2014-05-02 20:45                 ` Ben Hutchings
  2014-05-02 21:03                   ` Dave Jones
  2014-05-03 13:35                   ` Michael Kerrisk (man-pages)
  2 siblings, 2 replies; 79+ messages in thread
From: Ben Hutchings @ 2014-05-02 20:45 UTC (permalink / raw)
  To: Dave Jones
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, 2014-05-02 at 15:49 -0400, Dave Jones wrote:
> On Fri, May 02, 2014 at 03:33:14PM -0400, Theodore Ts'o wrote:
>  > There's been a huge focus on system calls in this discussion, and I
>  > suspect this is a bit of a red herring.  Taking a look at "git log
>  > arch/x86/syscalls/syscall_64.tbl" --- since all the world is no
>  > longer a VAX, but rather an x86_64 :-P --- there really haven't been
>  > that many new system calls lately.
> 
> I may have a vested interest in syscalls :)
> 
> The rate we're adding them has slowed down, but the rate at which we're
> finding bugs exposed through them has accelerated enormously over the
> last few years.
> 
> To use just one example, on certain systems I'd love to be able to just
> turn off sys_perf_event_open given what a trainwreck of vulnerabilities it's been
> over the last few years [comedy: it is actually a config option, but x86
> 'selects' it, so you'll have it and you'll like it].
> Thankfully at least the scarier parts of it are now hidden behind the
> paranoid sysctl.

I have considered proposing perf_event_paranoid=3 to disable it
completely for non-root.

>  > And if you look at things like renameat(2), the actual code savings by
>  > removing renameat(2) is pretty small, and IMHO, not worth the
>  > complexity and uncertainty that it would represent to application
>  > programmers of "does this system call exist or doesn't it".
> 
> I think we've got two categories here.
> 
> "variant" syscalls like renameat, which just offer enhancements over
> an existing syscall. Stuff that things like glibc tend to care about.
> This stuff is usually pretty boring, and not even worth considering for
> potentially disabling imo.
> 
> And then we have "enable boatload of code" syscalls that are typically
> used by a few standalone apps/features. kexec, checkpointing, whatever
> db it was that cares about remap_file_pages, mempolicy, etc. etc.
> 
> It's this "not used by every user" code that tends to scare me, because
> it's written with 1-2 well behaved bits of userspace in mind, which
> usually means "has so many unchecked corner cases it's not even funny"
[...]

Since Michael often seems to be the one testing those corner cases while
writing documentation, it seems like you're getting back to the old
issue of whether lack of documentation should be a blocker for adding
new system calls.

Ben.
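
The perf_event_paranoid levels discussed above can be inspected from
userspace.  What follows is only a minimal sketch: it assumes the sysctl
is exposed at /proc/sys/kernel/perf_event_paranoid and that levels -1
through 2 have the meanings documented for mainline kernels of this era.
Level 3 is merely the value proposed in this message, not mainline
behaviour, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed location of the sysctl on a Linux system.
SYSCTL = Path("/proc/sys/kernel/perf_event_paranoid")

def restrictions(level):
    """List what an unprivileged user is denied at a given paranoid level."""
    denied = []
    if level >= 0:
        denied.append("raw tracepoint access")
    if level >= 1:
        denied.append("CPU event access")
    if level >= 2:
        denied.append("kernel profiling")
    if level >= 3:
        # Hypothetical: the "=3" value proposed in this thread, not mainline.
        denied.append("perf_event_open entirely")
    return denied

def current_level(path=SYSCTL):
    """Return the running kernel's setting, or None if it is unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    level = current_level()
    if level is None:
        print("perf_event_paranoid not readable on this system")
    else:
        print(level, restrictions(level))
```

Each level adds to the restrictions of the one below it, which is why
the checks are cumulative rather than a lookup table.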

-- 
Ben Hutchings
Lowery's Law:
             If it jams, force it. If it breaks, it needed replacing anyway.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 20:41                 ` Theodore Ts'o
@ 2014-05-02 21:01                   ` Dave Jones
  2014-05-02 21:19                     ` Josh Boyer
  2014-05-02 21:56                     ` tytso
  0 siblings, 2 replies; 79+ messages in thread
From: Dave Jones @ 2014-05-02 21:01 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 02, 2014 at 04:41:41PM -0400, Theodore Ts'o wrote:

 > And I think we can also further break this down into the classes of
 > code which require root privs (i.e., like kexec), and those which can
 > be used by any userid.

In the brave new world of secure boot, we kind of have to care about
even the root cases now too [*], but I agree in the general case.

 > So perhaps what that means is that _these_ are the features which
 > require the most paranoia and testing before we let them
 > into the mainline kernel in the first place.  Otherwise, once they get
 > in, there's always a chance that systemd or some other piece of
 > userspace will start strictly requiring said optional feature, and it
 > doesn't matter whether we put in a CONFIG option to disable the
 > feature --- we'll never be able to do it.

This is starting to tread into the other thread about userspace
mandating 'optional' facilities, but is that even a problem, given
the proliferation of inits (taking the systemd example)?
Yes, systemd "won" by now being the default in all the general purpose
distributions, but with my upstream hat on, I think we still care
about embedded systems etc. that don't need anywhere near the
functionality that systemd provides.

	Dave

[*] oh god, ioctl.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 20:45                 ` Ben Hutchings
@ 2014-05-02 21:03                   ` Dave Jones
  2014-05-03 13:37                     ` Michael Kerrisk (man-pages)
  2014-05-03 13:35                   ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 79+ messages in thread
From: Dave Jones @ 2014-05-02 21:03 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 02, 2014 at 09:45:18PM +0100, Ben Hutchings wrote:
 > On Fri, 2014-05-02 at 15:49 -0400, Dave Jones wrote:
 
 > > To use just one example, on certain systems I'd love to be able to just
 > > turn off sys_perf_event_open given what a trainwreck of vulnerabilities it's been
 > > over the last few years [comedy: it is actually a config option, but x86
 > > 'selects' it, so you'll have it and you'll like it].
 > > Thankfully at least the scarier parts of it are now hidden behind the
 > > paranoid sysctl.
 > 
 > I have considered proposing perf_event_paranoid=3 to disable it
 > completely for non-root.

Doesn't seem too crazy an idea to me.

 > > It's this "not used by every user" code that tends to scare me, because
 > > it's written with 1-2 well behaved bits of userspace in mind, which
 > > usually means "has so many unchecked corner cases it's not even funny"
 > [...]
 > 
 > Since Michael often seems to be the one testing those corner cases while
 > writing documentation, it seems like you're getting back to the old
 > issue of whether lack of documentation should be a blocker for adding
 > new system calls.

That, and test cases.

	Dave

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:45         ` Luck, Tony
@ 2014-05-02 21:03           ` Mark Brown
  2014-05-02 21:08             ` Dave Jones
  2014-05-07 12:35             ` David Woodhouse
  0 siblings, 2 replies; 79+ messages in thread
From: Mark Brown @ 2014-05-02 21:03 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 02, 2014 at 07:45:44PM +0000, Luck, Tony wrote:

> > It would be useful for the smaller build case to have a way of auditing
> > which syscalls are actually in use on a system so you can then go
> > through and construct a minimal config.

> "strace -c" ?

That works for specific processes but I don't immediately see a
straightforward way to do it system wide (I guess a wrapper that straces
init and children might do the trick but it's not particularly nice).
Part of the trick for getting the general security win is to lower the
barrier to entry.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:03           ` Mark Brown
@ 2014-05-02 21:08             ` Dave Jones
  2014-05-02 21:14               ` Andy Lutomirski
                                 ` (2 more replies)
  2014-05-07 12:35             ` David Woodhouse
  1 sibling, 3 replies; 79+ messages in thread
From: Dave Jones @ 2014-05-02 21:08 UTC (permalink / raw)
  To: Mark Brown
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 02, 2014 at 02:03:40PM -0700, Mark Brown wrote:
 > On Fri, May 02, 2014 at 07:45:44PM +0000, Luck, Tony wrote:
 > 
 > > > It would be useful for the smaller build case to have a way of auditing
 > > > which syscalls are actually in use on a system so you can then go
 > > > through and construct a minimal config.
 > 
 > > "strace -c" ?
 > 
 > That works for specific processes but I don't immediately see a
 > straightforward way to do it system wide (I guess a wrapper that straces
 > init and children might do the trick but it's not particularly nice).
 > Part of the trick for getting the general security win is to lower the
 > barrier to entry.

Sounds like something you could use tracepoints for maybe ?
Failing that, kprobes ?

I'm pretty sure I've seen systemtap examples of this very thing years
ago, but who knows if they even work any more.

	Dave
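
The tracepoint idea can be sketched as follows.  With the
raw_syscalls:sys_enter event enabled through ftrace, syscalls from every
task land in one trace buffer, which gives the system-wide view that
per-process strace lacks.  This is a minimal sketch, assuming the text
layout ftrace commonly prints for that event; the sample lines are made
up for illustration, the function name is hypothetical, and mapping
syscall numbers to names (which is per-architecture) is left out.

```python
import re
from collections import Counter

# Assumed line shape:
#   "<comm>-<pid>  [cpu] flags  timestamp: sys_enter: NR <nr> (args...)"
SYS_ENTER = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[\d+\].*\bsys_enter: NR (?P<nr>\d+)"
)

def tally(trace_lines):
    """Count syscall numbers per command name across all traced tasks."""
    per_comm = {}
    for line in trace_lines:
        m = SYS_ENTER.match(line)
        if m:
            per_comm.setdefault(m["comm"], Counter())[int(m["nr"])] += 1
    return per_comm

# Illustrative input only; real lines would be read from the trace buffer.
SAMPLE = """\
            bash-1234  [000] ....  100.000001: sys_enter: NR 0 (0, 7ffd0000, 1, 0, 0, 0)
            bash-1234  [000] ....  100.000002: sys_enter: NR 22 (7ffd0010, 0, 0, 0, 0, 0)
           dmesg-1240  [001] ....  100.000003: sys_enter: NR 1 (1, 7ffd0020, 40, 0, 0, 0)
            bash-1234  [000] ....  100.000004: sys_enter: NR 0 (0, 7ffd0000, 1, 0, 0, 0)
""".splitlines()

if __name__ == "__main__":
    for comm, counts in sorted(tally(SAMPLE).items()):
        print(comm, dict(counts))
```

Grouping by command name rather than pid keeps the result usable as a
per-program syscall inventory even when processes are short-lived.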

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:08             ` Dave Jones
@ 2014-05-02 21:14               ` Andy Lutomirski
  2014-05-02 21:21               ` Luck, Tony
  2014-05-03  1:21               ` Mark Brown
  2 siblings, 0 replies; 79+ messages in thread
From: Andy Lutomirski @ 2014-05-02 21:14 UTC (permalink / raw)
  To: Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 2, 2014 at 2:08 PM, Dave Jones <davej@redhat.com> wrote:
> On Fri, May 02, 2014 at 02:03:40PM -0700, Mark Brown wrote:
>  > On Fri, May 02, 2014 at 07:45:44PM +0000, Luck, Tony wrote:
>  >
>  > > > It would be useful for the smaller build case to have a way of auditing
>  > > > which syscalls are actually in use on a system so you can then go
>  > > > through and construct a minimal config.
>  >
>  > > "strace -c" ?
>  >
>  > That works for specific processes but I don't immediately see a
>  > straightforward way to do it system wide (I guess a wrapper that straces
>  > init and children might do the trick but it's not particularly nice).
>  > Part of the trick for getting the general security win is to lower the
>  > barrier to entry.
>
> Sounds like something you could use tracepoints for maybe ?
> Failing that, kprobes ?
>
> I'm pretty sure I've seen systemtap examples of this very thing years
> ago, but who knows if they even work any more.
>

It's actually pretty easy to do this with seccomp -- program it to
send SIGSYS and watch the kernel logs. Admittedly, the lack of log +
ENOSYS as a seccomp action might make this a little bit annoying.

--Andy
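
The "watch the kernel logs" half of this suggestion can be sketched as a
small log scan.  This is a minimal sketch, assuming the kernel emits an
audit record of type 1326 (the audit code for seccomp events) for each
trapped call and that the record carries comm= and syscall= fields; the
exact field layout varies between kernel versions, and the sample lines
and function name below are made up for illustration.

```python
import re
from collections import Counter

# Assumed record shape; only the type, comm= and syscall= fields are used.
RECORD = re.compile(
    r"type=1326\b.*?\bcomm=\"(?P<comm>[^\"]*)\".*?\bsyscall=(?P<nr>\d+)"
)

def trapped_syscalls(log_lines):
    """Count (command, syscall number) pairs reported by seccomp records."""
    counts = Counter()
    for line in log_lines:
        m = RECORD.search(line)
        if m:
            counts[(m["comm"], int(m["nr"]))] += 1
    return counts

# Illustrative input only; real lines would come from the kernel log.
SAMPLE_LOG = [
    'audit: type=1326 audit(1399065294.100:7): auid=1000 uid=1000 gid=1000 '
    'ses=1 pid=2001 comm="demo" sig=31 syscall=22 compat=0 ip=0x7f0000001000 code=0x30000',
    'audit: type=1326 audit(1399065294.200:8): auid=1000 uid=1000 gid=1000 '
    'ses=1 pid=2001 comm="demo" sig=31 syscall=59 compat=0 ip=0x7f0000002000 code=0x30000',
    'audit: type=1400 audit(1399065294.300:9): unrelated record syscall=1',
    'audit: type=1326 audit(1399065294.400:10): auid=1000 uid=1000 gid=1000 '
    'ses=1 pid=2002 comm="demo" sig=31 syscall=22 compat=0 ip=0x7f0000001000 code=0x30000',
]

if __name__ == "__main__":
    for (comm, nr), n in sorted(trapped_syscalls(SAMPLE_LOG).items()):
        print(comm, nr, n)
```

The filter itself (the part that raises SIGSYS) is not shown here; this
only covers turning the resulting log noise into a list of calls a
program attempted.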

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:01                   ` Dave Jones
@ 2014-05-02 21:19                     ` Josh Boyer
  2014-05-02 21:23                       ` Jiri Kosina
  2014-05-02 21:27                       ` James Bottomley
  2014-05-02 21:56                     ` tytso
  1 sibling, 2 replies; 79+ messages in thread
From: Josh Boyer @ 2014-05-02 21:19 UTC (permalink / raw)
  To: Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 2, 2014 at 5:01 PM, Dave Jones <davej@redhat.com> wrote:
> On Fri, May 02, 2014 at 04:41:41PM -0400, Theodore Ts'o wrote:
>
>  > And I think we can also further break this down into the classes of
>  > code which require root privs (i.e., like kexec), and those which can
>  > be used by any userid.
>
> In the brave new world of secure boot, we kind of have to care about
> even the root cases now too [*], but I agree in the general case.

Speaking of that... is it worth my time to propose a "What to do about
the secure_modules/trusted_kernel/whatever patch set that distros are
carrying to support Secure Boot?"  I thought we had agreement and a
path forward at LPC last year, but things seem to have gotten derailed
again.

josh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:08             ` Dave Jones
  2014-05-02 21:14               ` Andy Lutomirski
@ 2014-05-02 21:21               ` Luck, Tony
  2014-05-02 21:38                 ` H. Peter Anvin
  2014-05-03  1:21               ` Mark Brown
  2 siblings, 1 reply; 79+ messages in thread
From: Luck, Tony @ 2014-05-02 21:21 UTC (permalink / raw)
  To: Dave Jones, Mark Brown
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

> Sounds like something you could use tracepoints for maybe ?
> Failing that, kprobes ?
>
> I'm pretty sure I've seen systemtap examples of this very thing years
> ago, but who knows if they even work any more.

If we do head to a world where systems are configured with
a subset of system calls - it would be useful for application
packages to come with a list of syscall dependencies. So you
could avoid much sadness from installing a package that won't
actually work.

But to get this right you'd need to do more than just a
strace/kprobe/systemtap scan to see what they *typically* do. The
system calls used will depend on the arguments given and the
input data.  E.g. a trivial test might conclude that bash(1) doesn't
use the pipe(2) system call ... which it doesn't until some user
types:
  $ dmesg | grep ixgbe

-Tony
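
The point about typical runs can be made concrete.  This is a minimal
sketch, assuming the tabular summary that "strace -c" prints; the two
summaries below are invented for illustration (a trivial shell session
and one that ran a pipeline), and the function names and the ENABLED set
are hypothetical.  It takes the union of syscalls seen across runs and
reports which ones a reduced kernel would lack.

```python
def syscalls_in_summary(text):
    """Return the set of syscall names listed in one `strace -c` summary."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a percentage such as "45.00" and end with
        # the syscall name; the header and rule lines do not.
        if len(fields) >= 5 and fields[0].replace(".", "", 1).isdigit():
            if fields[-1] != "total":
                names.add(fields[-1])
    return names

def missing_syscalls(observed_runs, enabled):
    """Syscalls seen in any run that the target kernel does not provide."""
    needed = set().union(*observed_runs)
    return needed - enabled

TRIVIAL_RUN = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.00    0.000060          10         6           read
 30.00    0.000030          10         3           write
 10.00    0.000010          10         1         1 access
------ ----------- ----------- --------- --------- ----------------
100.00    0.000100                    10         1 total
"""

PIPELINE_RUN = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.00    0.000050          10         5           read
 30.00    0.000030          15         2           clone
 20.00    0.000020          20         1           pipe
------ ----------- ----------- --------- --------- ----------------
100.00    0.000100                     8           total
"""

# Hypothetical set of syscalls built into a reduced kernel.
ENABLED = {"read", "write", "access", "clone", "execve"}

if __name__ == "__main__":
    runs = [syscalls_in_summary(TRIVIAL_RUN), syscalls_in_summary(PIPELINE_RUN)]
    print(sorted(missing_syscalls(runs, ENABLED)))
```

Judged from the trivial run alone, the reduced kernel looks sufficient;
only the second run reveals the pipe dependency, which is why a declared
dependency list would be more reliable than observation.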

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:19                     ` Josh Boyer
@ 2014-05-02 21:23                       ` Jiri Kosina
  2014-05-02 21:36                         ` Josh Boyer
  2014-05-02 21:27                       ` James Bottomley
  1 sibling, 1 reply; 79+ messages in thread
From: Jiri Kosina @ 2014-05-02 21:23 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 2 May 2014, Josh Boyer wrote:

> >  > And I think we can also further break this down into the classes of
> >  > code which require root privs (i.e., like kexec), and those which can
> >  > be used by any userid.
> >
> > In the brave new world of secure boot, we kind of have to care about
> > even the root cases now too [*], but I agree in the general case.
> 
> Speaking of that... is it worth my time to propose a "What to do about
> the secure_modules/trusted_kernel/whatever patch set that distros are
> carrying to support Secure Boot?"  I thought we had agreement and a
> path forward at LPC last year, but things seem to have gotten derailed
> again.

I believe the biggest remaining thing on the plate is basically just 
kexec/kdump ... is there anything else comparably major?

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:19                     ` Josh Boyer
  2014-05-02 21:23                       ` Jiri Kosina
@ 2014-05-02 21:27                       ` James Bottomley
  2014-05-02 21:39                         ` Josh Boyer
  1 sibling, 1 reply; 79+ messages in thread
From: James Bottomley @ 2014-05-02 21:27 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 2014-05-02 at 17:19 -0400, Josh Boyer wrote:
> On Fri, May 2, 2014 at 5:01 PM, Dave Jones <davej@redhat.com> wrote:
> > On Fri, May 02, 2014 at 04:41:41PM -0400, Theodore Ts'o wrote:
> >
> >  > And I think we can also further break this down into the classes of
> >  > code which require root privs (i.e., like kexec), and those which can
> >  > be used by any userid.
> >
> > In the brave new world of secure boot, we kind of have to care about
> > even the root cases now too [*], but I agree in the general case.
> 
> Speaking of that... is it worth my time to propose a "What to do about
> the secure_modules/trusted_kernel/whatever patch set that distros are
> carrying to support Secure Boot?"  I thought we had agreement and a
> path forward at LPC last year, but things seem to have gotten derailed
> again.

Would you believe we're just discussing with the distros how we might
re-engineer the Linux secure boot process?  Unfortunately the details
depend on a UEFI forum proposal that is UEFI confidential at this time,
but you can probably pick them up from Peter Jones, since you're a Red
Hat employee.  One of the side effects of this, if it happens, will be
to separate Linux secure boot policy from Microsoft's binary signing
requirements, which might take some of the heat out of the arguments
about which parts of the patch are to please Microsoft and refocus the
debate towards how we make better use of secure boot.  I'll try and
ensure that either the proposals are public by KS or that we have
permission to share the details.

James

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:23                       ` Jiri Kosina
@ 2014-05-02 21:36                         ` Josh Boyer
  0 siblings, 0 replies; 79+ messages in thread
From: Josh Boyer @ 2014-05-02 21:36 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 2, 2014 at 5:23 PM, Jiri Kosina <jkosina@suse.cz> wrote:
> On Fri, 2 May 2014, Josh Boyer wrote:
>
>> >  > And I think we can also further break this down into the classes of
>> >  > code which require root privs (i.e., like kexec), and those which can
>> >  > be used by any userid.
>> >
>> > In the brave new world of secure boot, we kind of have to care about
>> > even the root cases now too [*], but I agree in the general case.
>>
>> Speaking of that... is it worth my time to propose a "What to do about
>> the secure_modules/trusted_kernel/whatever patch set that distros are
>> carrying to support Secure Boot?"  I thought we had agreement and a
>> path forward at LPC last year, but things seem to have gotten derailed
>> again.
>
> I believe the biggest remaining thing on the plate is basically just
> kexec/kdump ... is there anything else comparably major?

Alan seems to entirely disagree with the approach we've been taking
for the past couple of years.  I would have to go back and re-read
things, but he seems to favor some rework of capabilities into a
matrix somehow.  I have no idea if he's been working on this or if
it's feasible, or even if anyone else favors that approach.

The dissent seems to have been enough to get the trusted_kernel
patches not pulled into the security tree, so we probably should
discuss it at least a little.

josh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:21               ` Luck, Tony
@ 2014-05-02 21:38                 ` H. Peter Anvin
  0 siblings, 0 replies; 79+ messages in thread
From: H. Peter Anvin @ 2014-05-02 21:38 UTC (permalink / raw)
  To: Luck, Tony, Dave Jones, Mark Brown
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On 05/02/2014 02:21 PM, Luck, Tony wrote:
>> Sounds like something you could use tracepoints for maybe ?
>> Failing that, kprobes ?
>>
>> I'm pretty sure I've seen systemtap examples of this very thing years
>> ago, but who knows if they even work any more.
> 
> If we do head to a world where systems are configured with
> a subset of system calls - it would be useful for application
> packages to come with a list of syscall dependencies. So you
> could avoid much sadness from installing a package that won't
> actually work.
> 

This would also be useful for using seccomp to sandbox processes... to
simply not let them do things they don't have a legitimate need to do.

For super-low-end embedded systems, it makes a lot of sense to be able
to build a kernel with only the functionality needed by a small set of
fixed applications (sometimes only one process which runs both as init
and the application.)

	-hpa

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:27                       ` James Bottomley
@ 2014-05-02 21:39                         ` Josh Boyer
  2014-05-02 22:35                           ` Andy Lutomirski
  2014-05-03 17:30                           ` James Bottomley
  0 siblings, 2 replies; 79+ messages in thread
From: Josh Boyer @ 2014-05-02 21:39 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 2, 2014 at 5:27 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Fri, 2014-05-02 at 17:19 -0400, Josh Boyer wrote:
>> On Fri, May 2, 2014 at 5:01 PM, Dave Jones <davej@redhat.com> wrote:
>> > On Fri, May 02, 2014 at 04:41:41PM -0400, Theodore Ts'o wrote:
>> >
>> >  > And I think we can also further break this down into the classes of
>> >  > code which require root privs (i.e., like kexec), and those which can
>> >  > be used by any userid.
>> >
>> > In the brave new world of secure boot, we kind of have to care about
>> > even the root cases now too [*], but I agree in the general case.
>>
>> Speaking of that... is it worth my time to propose a "What to do about
>> the secure_modules/trusted_kernel/whatever patch set that distros are
>> carrying to support Secure Boot?"  I thought we had agreement and a
>> path forward at LPC last year, but things seem to have gotten derailed
>> again.
>
> Would you believe we're just discussing with the distros how we might
> re-engineer the Linux secure boot process?  Unfortunately the details

I would believe it.

> depend on a UEFI forum proposal that is UEFI confidential at this time,
> but you can probably pick them up from Peter Jones, since you're a Red
> Hat employee.  One of the side effects of this, if it happens, will be

OK.

> to separate Linux secure boot policy from Microsoft's binary signing
> requirements, which might take some of the heat out of the arguments
> about which parts of the patch are to please Microsoft and refocus the
> debate towards how we make better use of secure boot.  I'll try and
> ensure that either the proposals are public by KS or that we have
> permission to share the details.

The objectionable parts having to do with signing aren't even in the
patchset Matthew has posted.  That's the initial set he tried to get
pulled in and failed.  If the proposal drastically changes that
approach I'd be surprised (maybe pleasantly).

josh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:01                   ` Dave Jones
  2014-05-02 21:19                     ` Josh Boyer
@ 2014-05-02 21:56                     ` tytso
  1 sibling, 0 replies; 79+ messages in thread
From: tytso @ 2014-05-02 21:56 UTC (permalink / raw)
  To: Dave Jones
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 02, 2014 at 05:01:23PM -0400, Dave Jones wrote:
> This is starting to tread into the other thread about userspace
> mandating 'optional' facilities, but is that even a problem, given
> the proliferation of inits (taking the systemd example)?

For the record, there is no "other thread" because as Jiri pointed out
(and I agree) it's not at all clear there's anything we as kernel
developers can do about this.  Once some kind of system call interface
is made available in mainline, it's not at all realistic to say that
"it's only for this use case".

If someone thinks that it might be worthwhile to start such a thread
and/or propose that that might be a worthy topic for the kernel
summit, feel free to start such a thread --- but I'd suggest doing so
after coming up with a proposed solution to how we could even
influence what facilities various bits of userspace might want to
mandate.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:11 ` Dave Jones
  2014-05-02 17:20   ` James Bottomley
@ 2014-05-02 22:04   ` Jan Kara
  2014-05-05 23:45   ` Bird, Tim
  2014-05-09 16:22   ` Josh Triplett
  3 siblings, 0 replies; 79+ messages in thread
From: Jan Kara @ 2014-05-02 22:04 UTC (permalink / raw)
  To: Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri 02-05-14 13:11:03, Dave Jones wrote:
> On Fri, May 02, 2014 at 09:44:42AM -0700, Josh Triplett wrote:
>  
>  > Topics:
>  > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
>  > - Optimizing a kernel for its exact target userspace.
>  > - Examples of shrinking the kernel
> 
> Something that's partially related here: Making stuff optional
> reduces attack surface the kernel presents. We're starting to grow
> more and more CONFIG options to disable syscalls. I'd like to hear
> people's reactions to introducing even more optionality in this area.
> 
> I first started thinking about this at LSF/MM where the subject of
> sys_remap_file_pages came up. "What even uses this?" "hardly anything".
> But for all the users that don't need it, there's this syscall always
> built in that does horrible things with VM internals.  It's fortunate
> that there hasn't been anything particularly awful beyond simple DoS
> bugs in r_f_p.
> 
> Distribution kernels are in the sad position of having to always enable
> this stuff, but at least for people building their own kernels, or
> kernels for appliances, we could make their lives a little better by
> not even building this stuff in.
  So I always thought various security modules or audit are there exactly
to limit attack surface like this (now please pardon my ignorance if I'm
wrong because I know close to nothing about the security stuff). So in my
imagination I'd say you could ship even a distro with a default policy
where e.g. r_f_p would be prohibited and if you ever found an application
that needs it, you could create a separate policy for it (and in the ideal
case where the application is packaged by the distro the policy would come
with it). Am I dreaming too much?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:39                         ` Josh Boyer
@ 2014-05-02 22:35                           ` Andy Lutomirski
  2014-05-06 17:18                             ` josh
  2014-05-03 17:30                           ` James Bottomley
  1 sibling, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2014-05-02 22:35 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 2, 2014 at 2:39 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
> On Fri, May 2, 2014 at 5:27 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>> On Fri, 2014-05-02 at 17:19 -0400, Josh Boyer wrote:
>>> On Fri, May 2, 2014 at 5:01 PM, Dave Jones <davej@redhat.com> wrote:
>>> > On Fri, May 02, 2014 at 04:41:41PM -0400, Theodore Ts'o wrote:
>>> >
>>> >  > And I think we can also further break this down into the classes of
>>> >  > code which require root privs (i.e., like kexec), and those which can
>>> >  > be used by any userid.
>>> >
>>> > In the brave new world of secure boot, we kind of have to care about
>>> > even the root cases now too [*], but I agree in the general case.
>>>
>>> Speaking of that... is it worth my time to propose a "What to do about
>>> the secure_modules/trusted_kernel/whatever patch set that distros are
>>> carrying to support Secure Boot?  I thought we had agreement and a
>>> path forward at LPC last year, but things seem to have gotten derailed
>>> again.
>>
>> Would you believe we're just discussing with the distros how we might
>> re-engineer the Linux secure boot process.  Unfortunately the details
>
> I would believe it.
>
>> depend on a UEFI forum proposal that are UEFI confidential at this time,
>> but you can probably pick them up from Peter Jones, since you're a Red
>> Hat employee.  One of the side effects of this, if it happens, will be
>
> OK.
>
>> to separate Linux secure boot policy from Microsoft's binary signing
>> requirements which might take some of the heat out of the arguments
>> about which parts of the patch are to please microsoft and refocus the
>> debate towards how we make better use of secure boot.  I'll try and
>> ensure that either the proposals are public by KS or that we have
>> permission to share the details.
>
> The objectionable parts having to do with signing aren't even in the
> patchset Matthew has posted.  That's the initial set he tried to get
> pulled in and failed.  If the proposal drastically changes that
> approach I'd be surprised (maybe pleasantly).

FWIW, I really don't like the approach where we say that the kernel
must be inviolate but that user code can do whatever it likes as long
as the kernel isn't compromised.  This may be needed to comply with
current MS/UEFI policy, but I think it largely misses the point wrt
actual security.

If the policy can change, then that might be a huge win.

--Andy

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:08             ` Dave Jones
  2014-05-02 21:14               ` Andy Lutomirski
  2014-05-02 21:21               ` Luck, Tony
@ 2014-05-03  1:21               ` Mark Brown
  2 siblings, 0 replies; 79+ messages in thread
From: Mark Brown @ 2014-05-03  1:21 UTC (permalink / raw)
  To: Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 02, 2014 at 05:08:51PM -0400, Dave Jones wrote:
> On Fri, May 02, 2014 at 02:03:40PM -0700, Mark Brown wrote:

>  > That works for specific processes but I don't immediately see a
>  > straightforward way to do it system wide (I guess a wrapper that straces
>  > init and children might do the trick but it's not particularly nice).
>  > Part of the trick for getting the general security win is to lower the
>  > barrier to entry.

> Sounds like something you could use tracepoints for maybe ?
> Failing that, kprobes ?

Tracepoints do run the risk of overflowing the buffer if run for too
long, but if it's the only thing running and/or is monitored that should
be OK; it's more manageable than strace.  kprobes should definitely work,
I think, if there's a suitably canned way of setting it up.
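
One possible shape for such a canned setup, sketched under assumptions:
with the raw_syscalls:sys_enter tracepoint enabled, each trace line
carries the syscall number after "NR", so a small script can tally which
syscalls were seen system wide.  The sample lines below are made up for
illustration and count_syscalls() is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the "sys_enter: NR <number>" part of a raw_syscalls:sys_enter
# trace line; everything else on the line is ignored.
SYS_ENTER = re.compile(r"sys_enter: NR (-?\d+)")


def count_syscalls(trace_lines):
    """Tally syscall numbers seen in an iterable of trace lines."""
    seen = Counter()
    for line in trace_lines:
        match = SYS_ENTER.search(line)
        if match:
            seen[int(match.group(1))] += 1
    return seen


# Illustrative sample only, not captured from a real system.
sample = [
    "  bash-1234  [000] ....  100.000001: sys_enter: NR 0 (3, 7ffd0, 400, 0, 0, 0)",
    "  bash-1234  [000] ....  100.000002: sys_exit: NR 0 = 12",
    "  init-1     [001] ....  100.000003: sys_enter: NR 1 (1, 7ffd8, c, 0, 0, 0)",
    "  bash-1234  [000] ....  100.000004: sys_enter: NR 0 (3, 7ffd0, 400, 0, 0, 0)",
]

counts = count_syscalls(sample)
for nr, hits in sorted(counts.items()):
    print(f"syscall {nr}: {hits} call(s)")
```

Feeding the function from a consuming reader such as trace_pipe, rather
than from a snapshot of the ring buffer, would probably ease the overflow
concern as long as the reader keeps up.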

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 19:33             ` Theodore Ts'o
  2014-05-02 19:38               ` Jiri Kosina
  2014-05-02 19:49               ` Dave Jones
@ 2014-05-03 13:32               ` Michael Kerrisk (man-pages)
  2 siblings, 0 replies; 79+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-03 13:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On 05/02/2014 09:33 PM, Theodore Ts'o wrote:
> There's been a huge focus on system calls in this discussion, and I
> suspect this is a bit of a red herring.  Taking a look at "git log
> arch/x86/syscalls/syscall_64.tbl" --- since all the world's is no
> longer a Vax, but rather an x86_64 :-P --- there really hasn't been
> that many new system calls lately.  Yes, we recently added
> renameat(2), but the next addition was half a year earlier, when the
> new schedular parameters syscalls went in.

A minor correction: that wasn't 6 months ago -- it was 3.14, released 
at the end of March, that added sched_getattr() and sched_setattr().

> There's much more in the way of kernel functionality and complexity
> which isn't really syscall related --- for example, all of the control
> group stuff, and security hair caused by things like user namespaces,
> and new fallocate(2) modes --- we've added PUNCH_HOLE, COLLAPSE_RANGE,
> and ZERO_RANGE, and there are threats to add INSERT_RANGE in the next
> release or two.

Yes, that's a much bigger part of the growing surface. (Just by the 
bye, I try to track the growth of the surface at 
http://man7.org/tlpi/api_changes/ . Corrections and additions are 
welcome. It's reasonably complete with respect to system calls, 
partially complete on /proc and socket options, and rather out to 
lunch on other pieces such as /sys and other pseudo filesystems.)

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 20:45                 ` Ben Hutchings
  2014-05-02 21:03                   ` Dave Jones
@ 2014-05-03 13:35                   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 79+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-03 13:35 UTC (permalink / raw)
  To: Ben Hutchings, Dave Jones
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Heinrich Schuchardt, Darren Hart, Dan Carpenter

On 05/02/2014 10:45 PM, Ben Hutchings wrote:
> On Fri, 2014-05-02 at 15:49 -0400, Dave Jones wrote:
>> On Fri, May 02, 2014 at 03:33:14PM -0400, Theodore Ts'o wrote:
>>  > There's been a huge focus on system calls in this discussion, and I
>>  > suspect this is a bit of a red herring.  Taking a look at "git log
>>  > arch/x86/syscalls/syscall_64.tbl" --- since all the world's is no
>>  > longer a Vax, but rather an x86_64 :-P --- there really hasn't been
>>  > that many new system calls lately.
>>
>> I may have a vested interest in syscalls :)
>>
>> The rate we're adding them has slowed down, but the rate at which we're
>> finding bugs exposed through them has accelerated enormously over the
>> last few years.

Yes. The APIs delivered to userspace continue to be infested with bugs
and design infelicities, many of which go undetected for a long time.

>> To use just one example, on certain systems I'd love to be able to just
>> turn off sys_perf_event_open given what a trainwreck of vulnerabilities it's been
>> over the last few years [comedy: it is actually a config option, but x86
>> 'selects' it, so you'll have it and you'll like it].
>> Thankfully at least the scarier parts of it are now hidden behind the
>> paranoid sysctl.
> 
> I have considered proposing perf_event_paranoid=3 to disable it
> completely for non-root.
> 
>>  > And if you look at things like renameat(2), the actual code savings by
>>  > removing renameat(2) is pretty small, and IMHO, not worth the
>>  > complexity and uncertainty that it would represent to application
>>  > programmers of "does this system call exist or doesn't it".
>>
>> I think we've got two categories here.
>>
>> "variant" syscalls like renameat, which just offers enhancements over
>> an existing syscall. Stuff that things like glibc tend to care about.
>> This stuff is usually pretty boring, and not even worth considering for
>> potentially disabling imo.
>>
>> And then we have "enable boatload of code" syscalls that are typically
>> used by a few standalone apps/features. kexec, checkpointing, whatever
>> db it was that cares about remap_file_pages, mempolicy, etc. etc.
>>
>> It's this "not used by every user" code that tends to scare me, because
>> it's written with 1-2 well behaved bits of userspace in mind, which
>> usually means "has so many unchecked corner cases it's not even funny"

Well it's worse than that, I think. Those unchecked corner cases turn
up even in code that is not protected by config options or privs.
My example of the day: the timeout argument of recvmmsg() does nothing
sensible--there was no (or minimal) testing, seems to have been minimal
review of the feature, and of course there was no documentation of how
the timeout feature should work beyond the statement that "recvmmsg 
now has a struct timespec timeout, that works in the same fashion as
the ppoll one" (Newsflash: recvmmsg() and ppoll() are doing very 
different things, so describing one in terms of the other doesn't
provide much insight.)

https://bugzilla.kernel.org/show_bug.cgi?id=75371
http://thread.gmane.org/gmane.linux.man/5677

> [...]
> 
> Since Michael often seems to be the one testing those corner cases while
> writing documentation, it seems like you're getting back to the old
> issue of whether lack of documentation should be a blocker for adding
> new system calls.

I think there's really room for a lot more rigor here. There is way
too much crap hitting the userspace API. I've long argued that
(good) documentation is one of the best ways of finding bugs and
design errors. I know, because that's the way I've discovered a lot
of the problems. Of course, perhaps I am just an odd data point,
but I recently got to help out in an experiment that reproduced 
the results.

Heinrich Schuchardt recently took it upon himself to document the 
fanotify API, which has been undocumented since its release in 2.6.37.
(Heinrich's pages will probably be published in the next week or so,
in the meantime the drafts are here: 
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/tree/ )

In the course of writing the pages (and goaded by me at various
points to "explain this detail" or "tell the reader what happens 
in this case"), Heinrich has uncovered (and documented) one or 
two design infelicities and a good crop of bugs (at least one 
of which has some security implications: 
http://thread.gmane.org/gmane.linux.kernel/1686672/focus=1690201 )

So, Heinrich demonstrated what I've long known: show me a new
kernel-user-space API and I can probably pretty quickly show you
a bug. Writing good documentation goes a long way toward finding
those bugs and design problems, and it really should be done
well before an API is released, since, of course, some API 
problems can't be fixed later. And, it should be a collaborative
effort involving not just the developer concerned but someone
fairly distant from them who can look skeptically at the 
documentation.

Oh, and I didn't explicitly say it, but to me it's obvious:
good documentation necessarily implies good testing. And
that's the thing that made Heinrich's work good: when he
wrote in response to some of my goadings that the answers 
might take a while, because he'd need to write some tests,
that was exactly what I hoped to hear.

Tools like trinity do a great job of catching bizarre behaviors
in APIs, but in the end some bugs (and design problems) are 
only going to be found when human beings sit down and think
deeply about what is going on. (The timeout issue for 
recvmmsg() is a case in point. There's no fuzz testing for
that sort of issue, and for that matter no specification of
the expected behavior against which to test.)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:03                   ` Dave Jones
@ 2014-05-03 13:37                     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 79+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-03 13:37 UTC (permalink / raw)
  To: Dave Jones, Ben Hutchings
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On 05/02/2014 11:03 PM, Dave Jones wrote:
> On Fri, May 02, 2014 at 09:45:18PM +0100, Ben Hutchings wrote:
>  > On Fri, 2014-05-02 at 15:49 -0400, Dave Jones wrote:

[...]

>  > > It's this "not used by every user" code that tends to scare me, because
>  > > it's written with 1-2 well behaved bits of userspace in mind, which
>  > > usually means "has so many unchecked corner cases it's not even funny"
>  > [...]
>  > 
>  > Since Michael often seems to be the one testing those corner cases while
>  > writing documentation, it seems like you're getting back to the old
>  > issue of whether lack of documentation should be a blocker for adding
>  > new system calls.
> 
> That, and test cases.

From my point of view, test cases and documentation go hand in glove
(see my last mail).

Hmmm -- this almost starts to smell like a topic to revisit
at Ksummit.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:39                         ` Josh Boyer
  2014-05-02 22:35                           ` Andy Lutomirski
@ 2014-05-03 17:30                           ` James Bottomley
  1 sibling, 0 replies; 79+ messages in thread
From: James Bottomley @ 2014-05-03 17:30 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

Apologies for not seeing this.  Apparently this list is set up with
nodupes as the default and I have mail rules trashing personal copies of
mail I'm supposed to get from the list.

On Fri, 2014-05-02 at 17:39 -0400, Josh Boyer wrote:
> On Fri, May 2, 2014 at 5:27 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > to separate Linux secure boot policy from Microsoft's binary signing
> > requirements which might take some of the heat out of the arguments
> > about which parts of the patch are to please microsoft and refocus the
> > debate towards how we make better use of secure boot.  I'll try and
> > ensure that either the proposals are public by KS or that we have
> > permission to share the details.
> 
> The objectionable parts having to do with signing aren't even in the
> patchset Matthew has posted.  That's the initial set he tried to get
> pulled in and failed.  If the proposal drastically changes that
> approach I'd be surprised (maybe pleasantly).

Some of the objections are rooted in the suspicions that what we do, we
do to please Microsoft (or at least to get them not to blacklist our
signatures); others are simply based on the idea that secure boot isn't,
because Microsoft designed it wrongly, so we shouldn't call it secure.

Removing the Microsoft proxy allows us to have a more honest debate
about how we want to make use of the capability. I'm not saying it
produces immediate agreement because there's plenty of stuff we have
disagreements over that aren't rooted in suspicions of ulterior motives,
but at least we'll be disagreeing about real issues.

James

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:11 ` Dave Jones
  2014-05-02 17:20   ` James Bottomley
  2014-05-02 22:04   ` Jan Kara
@ 2014-05-05 23:45   ` Bird, Tim
  2014-05-06  2:14     ` H. Peter Anvin
  2014-05-09 16:22   ` Josh Triplett
  3 siblings, 1 reply; 79+ messages in thread
From: Bird, Tim @ 2014-05-05 23:45 UTC (permalink / raw)
  To: Dave Jones, Josh Triplett
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, al.stone, Julia Lawall,
	Darren Hart, Dan Carpenter

On Friday, May 02, 2014 10:11 AM, Dave Jones wrote:
> 
> On Fri, May 02, 2014 at 09:44:42AM -0700, Josh Triplett wrote:
> 
>  > Topics:
>  > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
>  > - Optimizing a kernel for its exact target userspace.
>  > - Examples of shrinking the kernel
> 
> Something that's partially related here: Making stuff optional
> reduces attack surface the kernel presents. We're starting to grow
> more and more CONFIG options to disable syscalls. I'd like to hear
> peoples reactions on introducing even more optionality in this area.
> 
> I first started thinking about this at LSF/MM where the subject of
> sys_remap_file_pages came up. "What even uses this?" "hardly anything".
> But for all the users that don't need it, there's this syscall always
> built in that does horrible things with VM internals.  It's fortunate
> that there hasn't been anything particularly awful beyond simple DoS
> bugs in r_f_p.
> 
> Distribution kernels are in the sad position of having to always enable
> this stuff, but at least for people building their own kernels, or
> kernels for appliances, we could make their lives a little better by
> not even building this stuff in.
> 
> I had a patch to make this particular syscall a cond_syscall, but then
> XFS ate my homework and I haven't had chance to revisit this.
> So, my questions are:
> - are there other obvious syscalls we could make optional without userspace
>   freaking out when they suddenly start getting ENOSYS ?
> - how much configurability here is too much ?
>   r_f_p was an obvious candidate because it's.. well, nasty.  Some of the
>   more straightforward syscalls may not be such a big deal, but then we
>   have CONFIG's for kcmp and other 'simple' syscalls already..
> 
> thoughts?

For deeply embedded, I think a good technique is to fine-tune the syscalls to the exact set of programs
that will be run.  For my size research last year (See http://elinux.org/System_Size_Auto-Reduction
and http://elinux.org/images/9/9e/Bird-Kernel-Size-Optimization-LCJ-2013.pdf), I discussed
some results from doing the following:
 - scanning all the binaries in the file system
 - generating a list of used and unused system calls
 - adding a kernel mechanism to eliminate any unused syscalls, at compile time
   (using LTO, all I had to do was make the syscall unreachable, and the compiler
   eliminated the calls automatically.  Some of the work was removing parts of Andi Kleen's LTO
   patches which prevented unreferenced code from being optimized out.)

The syscalls were detected with a tool that scanned the program's assembly code on ARM,
and found all syscall sequences.  Binaries were statically linked for analysis.
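
The bookkeeping behind the used/unused list can be sketched roughly as
follows.  This is only an illustration, not the actual tool (which scanned
ARM assembly): the syscall table, the per-binary scan results, and the
unused_syscalls() helper are all hypothetical stand-ins.

```python
def unused_syscalls(all_syscalls, used_by_binary):
    """Return (used, unused) syscall name sets for a whole file system image."""
    used = set()
    for names in used_by_binary.values():
        used |= set(names)
    known = set(all_syscalls)
    # Only names present in the kernel's table count as used; everything
    # else in the table is a candidate for elimination.
    return used & known, known - used


# Hypothetical inputs: a tiny syscall table and per-binary scan results.
syscall_table = ["read", "write", "open", "close", "kexec_load", "remap_file_pages"]
scan_results = {
    "/bin/busybox": ["read", "write", "open", "close"],
    "/usr/sbin/httpd": ["read", "write", "close"],
}

used, unused = unused_syscalls(syscall_table, scan_results)
print("keep:", sorted(used))
print("drop:", sorted(unused))
```

The "drop" set is what a build-time mechanism would then mark unreachable
so that the compiler can discard the corresponding code.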

On a default-configured kernel, the system eliminated 161 syscalls, and saved about 95K.
On a minimally configured kernel, the system eliminated 120 syscalls, and saved 48K.

My minimal system, consisting of busybox and a web server, was left with 184 syscalls.
With additional coding and refactoring of the app and busybox, additional
syscalls could have been eliminated.

Some syscalls that were unused still ended up hanging around due to a few funny references.
With some code refactoring, some additional savings might be possible.

Note that this system required no new kernel CONFIGs, and used macros to hide most of the
complexity from the rest of the kernel source.

I also tried some other automatic code elimination techniques, with varying results.  These can be seen
in the presentation.
 -- Tim

P.S. I realize this technique is not suitable for general-purpose distributions of Linux.

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-05 23:45   ` Bird, Tim
@ 2014-05-06  2:14     ` H. Peter Anvin
  0 siblings, 0 replies; 79+ messages in thread
From: H. Peter Anvin @ 2014-05-06  2:14 UTC (permalink / raw)
  To: Bird, Tim, Dave Jones, Josh Triplett
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, al.stone, Julia Lawall,
	Darren Hart, Dan Carpenter

On 05/05/2014 04:45 PM, Bird, Tim wrote:
>  - adding a kernel mechanism to eliminate any unused syscalls, at compile time
>    (using LTO, all I had to do was make the syscall unreachable, and the compiler
>    eliminated the calls automatically.  Some of the work was removing parts of Andi Kleen's LTO
>    patches which prevented unreferenced code from being optimized out.)

There is some irony here.

	-hpa

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 22:35                           ` Andy Lutomirski
@ 2014-05-06 17:18                             ` josh
  2014-05-06 17:31                               ` Andy Lutomirski
  0 siblings, 1 reply; 79+ messages in thread
From: josh @ 2014-05-06 17:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 02, 2014 at 03:35:28PM -0700, Andy Lutomirski wrote:
> On Fri, May 2, 2014 at 2:39 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
> > On Fri, May 2, 2014 at 5:27 PM, James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> >> On Fri, 2014-05-02 at 17:19 -0400, Josh Boyer wrote:
> >>> On Fri, May 2, 2014 at 5:01 PM, Dave Jones <davej@redhat.com> wrote:
> >>> > On Fri, May 02, 2014 at 04:41:41PM -0400, Theodore Ts'o wrote:
> >>> >
> >>> >  > And I think we can also further break this down into the classes of
> >>> >  > code which require root privs (i.e., like kexec), and those which can
> >>> >  > be used by any userid.
> >>> >
> >>> > In the brave new world of secure boot, we kind of have to care about
> >>> > even the root cases now too [*], but I agree in the general case.
> >>>
> >>> Speaking of that... is it worth my time to propose a "What to do about
> >>> the secure_modules/trusted_kernel/whatever patch set that distros are
> >>> carrying to support Secure Boot?  I thought we had agreement and a
> >>> path forward at LPC last year, but things seem to have gotten derailed
> >>> again.
> >>
> >> Would you believe we're just discussing with the distros how we might
> >> re-engineer the Linux secure boot process.  Unfortunately the details
> >
> > I would believe it.
> >
> >> depend on a UEFI forum proposal that are UEFI confidential at this time,
> >> but you can probably pick them up from Peter Jones, since you're a Red
> >> Hat employee.  One of the side effects of this, if it happens, will be
> >
> > OK.
> >
> >> to separate Linux secure boot policy from Microsoft's binary signing
> >> requirements which might take some of the heat out of the arguments
> >> about which parts of the patch are to please microsoft and refocus the
> >> debate towards how we make better use of secure boot.  I'll try and
> >> ensure that either the proposals are public by KS or that we have
> >> permission to share the details.
> >
> > The objectionable parts having to do with signing aren't even in the
> > patchset Matthew has posted.  That's the initial set he tried to get
> > pulled in and failed.  If the proposal drastically changes that
> > approach I'd be surprised (maybe pleasantly).
> 
> FWIW, I really don't like the approach where we say that the kernel
> must be inviolate but that user code can do whatever it likes as long
> as the kernel isn't compromised.  This may be needed to comply with
> current MS/UEFI policy, but I think it largely misses the point wrt
> actual security.
> 
> If the policy can change, then that might be a huge win.

We shouldn't give up on securing userspace, either; protecting the
kernel is necessary but not sufficient.  But I do think it's worthwhile
to enforce "root != kernel", quite apart from any "secure boot"
requirements.  That's what Matthew's patches primarily serve to do: make
root not kernel-equivalent.

- Josh Triplett

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-06 17:18                             ` josh
@ 2014-05-06 17:31                               ` Andy Lutomirski
  2014-05-09 18:22                                 ` H. Peter Anvin
  0 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2014-05-06 17:31 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Tue, May 6, 2014 at 10:18 AM,  <josh@joshtriplett.org> wrote:
> On Fri, May 02, 2014 at 03:35:28PM -0700, Andy Lutomirski wrote:
>> On Fri, May 2, 2014 at 2:39 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
>> > On Fri, May 2, 2014 at 5:27 PM, James Bottomley
>> > <James.Bottomley@hansenpartnership.com> wrote:
>> >> On Fri, 2014-05-02 at 17:19 -0400, Josh Boyer wrote:
>> >>> On Fri, May 2, 2014 at 5:01 PM, Dave Jones <davej@redhat.com> wrote:
>> >>> > On Fri, May 02, 2014 at 04:41:41PM -0400, Theodore Ts'o wrote:
>> >>> >
>> >>> >  > And I think we can also further break this down into the classes of
>> >>> >  > code which require root privs (i.e., like kexec), and those which can
>> >>> >  > be used by any userid.
>> >>> >
>> >>> > In the brave new world of secure boot, we kind of have to care about
>> >>> > even the root cases now too [*], but I agree in the general case.
>> >>>
>> >>> Speaking of that... is it worth my time to propose a "What to do about
>> >>> the secure_modules/trusted_kernel/whatever patch set that distros are
>> >>> carrying to support Secure Boot?  I thought we had agreement and a
>> >>> path forward at LPC last year, but things seem to have gotten derailed
>> >>> again.
>> >>
>> >> Would you believe we're just discussing with the distros how we might
>> >> re-engineer the Linux secure boot process.  Unfortunately the details
>> >
>> > I would believe it.
>> >
>> >> depend on a UEFI forum proposal that are UEFI confidential at this time,
>> >> but you can probably pick them up from Peter Jones, since you're a Red
>> >> Hat employee.  One of the side effects of this, if it happens, will be
>> >
>> > OK.
>> >
>> >> to separate Linux secure boot policy from Microsoft's binary signing
>> >> requirements which might take some of the heat out of the arguments
>> >> about which parts of the patch are to please microsoft and refocus the
>> >> debate towards how we make better use of secure boot.  I'll try and
>> >> ensure that either the proposals are public by KS or that we have
>> >> permission to share the details.
>> >
>> > The objectionable parts having to do with signing aren't even in the
>> > patchset Matthew has posted.  That's the initial set he tried to get
>> > pulled in and failed.  If the proposal drastically changes that
>> > approach I'd be surprised (maybe pleasantly).
>>
>> FWIW, I really don't like the approach where we say that the kernel
>> must be inviolate but that user code can do whatever it likes as long
>> as the kernel isn't compromised.  This may be needed to comply with
>> current MS/UEFI policy, but I think it largely misses the point wrt
>> actual security.
>>
>> If the policy can change, then that might be a huge win.
>
> We shouldn't give up on securing userspace, either; protecting the
> kernel is necessary but not sufficient.  But I do think it's worthwhile
> to enforce "root != kernel", quite apart from any "secure boot"
> requirements.  That's what Matthew's patches primarily serve to do: make
> root not kernel-equivalent.
>

I have two main objections to "root != kernel".  The bigger is that
I'd like to see the security argument for it so that people can think
about whether it makes sense.  The smaller is that "root != kernel"
isn't necessarily well-defined.

For example, should root be able to write to the filesystem from which
the kernel loads?  Should root be able to kexec a new kernel, if that
clears some key known to the current kernel in the process?  Should
root be able to start a KVM instance that passes essentially all
hardware through?  Should root be able to talk directly to the
system's embedded controller?  Should root be able to read all
physical memory?  How about reading just enough to learn the kernel's
semi-secret randomized addresses?  How about running perf without
restrictions?

In the past, the actual security goal seems to have been "root shall
not be able to do anything that would anger Microsoft and/or
Verisign", which is far-enough removed from actual security that I
don't want it anywhere near my system.  But if I could have a
reasonable policy that "root shall not be able to persistently
compromise the machine", then I think this could be great.

Note that the latter goal does not actually require that root be
unable to modify the running kernel.

--Andy

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:20   ` James Bottomley
                       ` (2 preceding siblings ...)
  2014-05-02 17:44     ` Steven Rostedt
@ 2014-05-07 11:32     ` David Woodhouse
  2014-05-07 16:38       ` James Bottomley
  3 siblings, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2014-05-07 11:32 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 2014-05-02 at 10:20 -0700, James Bottomley wrote:
> If we do this, I think we should have a small number of options related
> to use case ... say something like a secure router kernel
> CONFIG_SECURE_ROUTER which disables anything a secure router wouldn't
> need.

Have you seen the amount of stuff that OpenWRT packages? :)

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 21:03           ` Mark Brown
  2014-05-02 21:08             ` Dave Jones
@ 2014-05-07 12:35             ` David Woodhouse
  2014-05-09 15:51               ` Mark Brown
  1 sibling, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2014-05-07 12:35 UTC (permalink / raw)
  To: Mark Brown
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 2014-05-02 at 14:03 -0700, Mark Brown wrote:
> On Fri, May 02, 2014 at 07:45:44PM +0000, Luck, Tony wrote:
> 
> > > It would be useful for the smaller build case to have a way of auditing
> > > which syscalls are actually in use on a system so you can then go
> > > through and construct a minimal config.
> 
> > "strace -c" ?
> 
> That works for specific processes but I don't immediately see a
> straightforward way to do it system wide (I guess a wrapper that straces
> init and children might do the trick but it's not particularly nice).
> Part of the trick for getting the general security win is to lower the
> barrier to entry.

You can do it relatively easily with auditing, surely? Set up an audit
rule for each syscall you aren't already sure is in use. Disable the
rule when you see it used, and it shouldn't even have much of an
overhead over and above what it takes to have auditing enabled in the
first place (which we tried to keep to a minimum).
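As a rough illustration of the bookkeeping this implies (not of the audit
tooling itself), the sketch below consumes audit SYSCALL records, notes the
first sighting of each watched syscall number, and drops it from the watch
set, which is the point at which the matching rule would be disabled.  The
function name, the sample records, and the watched numbers are all
hypothetical; only the "syscall=N" field of a SYSCALL record is assumed.

```python
import re

# Matches the numeric "syscall=N" field of an audit SYSCALL record.
SYSCALL_FIELD = re.compile(r"\bsyscall=(\d+)\b")


def process_audit_records(records, watched):
    """Return (seen, still_watched) after consuming audit records.

    `watched` holds the syscall numbers that still have a rule installed.
    A number leaves `watched` the first time it shows up, which models
    disabling its rule so that later calls cost nothing extra.
    """
    watched = set(watched)
    seen = set()
    for line in records:
        match = SYSCALL_FIELD.search(line)
        if not match:
            continue  # not a SYSCALL record
        nr = int(match.group(1))
        if nr in watched:
            watched.discard(nr)  # rule would be removed here
            seen.add(nr)
    return seen, watched


# Made-up records in the general shape of audit log lines.
sample = [
    "type=SYSCALL msg=audit(1399465000.000:1): arch=c000003e syscall=2 success=yes",
    "type=SYSCALL msg=audit(1399465000.100:2): arch=c000003e syscall=59 success=yes",
    "type=SYSCALL msg=audit(1399465000.200:3): arch=c000003e syscall=2 success=yes",
]
seen, remaining = process_audit_records(sample, watched={2, 59, 101})
```

Whatever is left in `remaining` after a representative workload is the list
of candidates for removal, subject to how representative the workload was.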

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5745 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-07 11:32     ` David Woodhouse
@ 2014-05-07 16:38       ` James Bottomley
  0 siblings, 0 replies; 79+ messages in thread
From: James Bottomley @ 2014-05-07 16:38 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Wed, 2014-05-07 at 12:32 +0100, David Woodhouse wrote:
> On Fri, 2014-05-02 at 10:20 -0700, James Bottomley wrote:
> > If we do this, I think we should have a small number of options related
> > to use case ... say something like a secure router kernel
> > CONFIG_SECURE_ROUTER which disables anything a secure router wouldn't
> > need.
> 
> Have you seen the amount of stuff that OpenWRT packages? :)

Yes ... I use it (and actually had to build it a while ago for one of my
unsupported routers).  *I* like running the kitchen sink from my router,
but that's probably not the *normal* use, so CONFIG_SECURE_ROUTER would
probably not be for general OpenWRT.

James

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 16:44 [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions Josh Triplett
  2014-05-02 17:11 ` Dave Jones
@ 2014-05-08 15:52 ` Christoph Lameter
  2014-05-12 17:35 ` Wolfram Sang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 79+ messages in thread
From: Christoph Lameter @ 2014-05-08 15:52 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 2 May 2014, Josh Triplett wrote:

> - An overview of why the kernel's size still matters today ("but don't
>   we all have tons of memory and storage?")

Kernel size matters quite a bit for performance. Processor caches are key
to performance, and therefore the cache footprint of a function determines
the possible performance. The smaller the functions and the less data
they access, the faster they will run.

Therefore it needs to be possible to reduce the size of the kernel by
disabling unwanted functionality (e.g. cgroups). For that to happen,
features need to be as independent as possible, and the userspace tools
(like systemd) need to be able to handle a kernel with reduced
functionality.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-07 12:35             ` David Woodhouse
@ 2014-05-09 15:51               ` Mark Brown
  0 siblings, 0 replies; 79+ messages in thread
From: Mark Brown @ 2014-05-09 15:51 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

[-- Attachment #1: Type: text/plain, Size: 943 bytes --]

On Wed, May 07, 2014 at 01:35:06PM +0100, David Woodhouse wrote:
> On Fri, 2014-05-02 at 14:03 -0700, Mark Brown wrote:

> > That works for specific processes but I don't immediately see a
> > straightforward way to do it system wide (I guess a wrapper that straces
> > init and children might do the trick but it's not particularly nice).
> > Part of the trick for getting the general security win is to lower the
> > barrier to entry.

> You can do it relatively easily with auditing, surely? Set up an audit
> rule for each syscall you aren't already sure is in use. Disable the
> rule when you see it used, and it shouldn't even have much of an
> overhead over and above what it takes to have auditing enabled in the
> first place (which we tried to keep to a minimum).

I suspect that's got too high a barrier to entry for a lot of users,
especially since AFAICT it requires userspace tools on the target
system.  It should work though.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 17:11 ` Dave Jones
                     ` (2 preceding siblings ...)
  2014-05-05 23:45   ` Bird, Tim
@ 2014-05-09 16:22   ` Josh Triplett
  2014-05-09 16:59     ` Bird, Tim
  3 siblings, 1 reply; 79+ messages in thread
From: Josh Triplett @ 2014-05-09 16:22 UTC (permalink / raw)
  To: Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

[-- Attachment #1: Type: text/plain, Size: 2919 bytes --]

On Fri, May 02, 2014 at 01:11:03PM -0400, Dave Jones wrote:
> On Fri, May 02, 2014 at 09:44:42AM -0700, Josh Triplett wrote:
>  
>  > Topics:
>  > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
>  > - Optimizing a kernel for its exact target userspace.
>  > - Examples of shrinking the kernel
> 
> Something that's partially related here: Making stuff optional
> reduces attack surface the kernel presents. We're starting to grow
> more and more CONFIG options to disable syscalls. I'd like to hear
> people's reactions on introducing even more optionality in this area.

I'd certainly like to see just about every syscall made optional, for
userspace that doesn't need it.  For specialized systems, that certainly
would decrease attack surface.  However, seccomp decreases attack
surface by the same amount, and for all but those specialized systems
seccomp makes more sense, because the set of available syscalls can
then change with a simple policy change rather than a new kernel.

And this doesn't free us from the obligation to make all new APIs
secure against hostile userspace.

> I had a patch to make this particular syscall a cond_syscall, but then
> XFS ate my homework and I haven't had a chance to revisit this.
> So, my questions are:
> - are there other obvious syscalls we could make optional without userspace
>   freaking out when they suddenly start getting ENOSYS ?

I've attached a complete list of the syscalls from
include/linux/syscalls.h that do not appear in kernel/sys_ni.c, and thus
always exist.  (syscalls.h notably does not include all the
arch-specific syscalls, some of which might make sense to leave out as
well.)
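For anyone wanting to regenerate such a list, a minimal sketch of the set
difference follows.  It assumes declarations of the form
"asmlinkage long sys_foo(" in the header and "cond_syscall(sys_foo)" lines in
sys_ni.c; the function name and the inline sample text are illustrative
stand-ins, not the actual file contents.

```python
import re

# Declarations in the header: "asmlinkage long sys_foo(...)".
DECLARED = re.compile(r"\basmlinkage\s+long\s+(sys_\w+)\s*\(")
# Optional syscalls in sys_ni.c: "cond_syscall(sys_foo);".
CONDITIONAL = re.compile(r"\bcond_syscall\(\s*(sys_\w+)\s*\)")


def always_present(syscalls_h, sys_ni_c):
    """Return sorted syscalls declared in the header but never cond_syscall'ed."""
    declared = set(DECLARED.findall(syscalls_h))
    optional = set(CONDITIONAL.findall(sys_ni_c))
    return sorted(declared - optional)


# Tiny stand-ins for the two files.
header_sample = """
asmlinkage long sys_time(time_t __user *tloc);
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
                         unsigned long idx1, unsigned long idx2);
asmlinkage long sys_nice(int increment);
"""
ni_sample = """
cond_syscall(sys_kcmp);
cond_syscall(sys_quotactl);
"""
result = always_present(header_sample, ni_sample)
```

With the real files read in as strings, `result` would be a list of the same
kind as the attachment, modulo declarations that do not follow the assumed
pattern.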

Of those, a few classes of syscalls seem like obvious candidates, for
various classes of specialized or legacy-free systems:

- For any syscall updated to have a foo2, foo3, etc, a single config
  option to leave out all the older versions would make sense, to go
  with userspace that never calls the older versions.
- Likewise, the non-64 file calls.
- Likewise, sys_old*
- splice/vmsplice/tee.
- sys_*sync*
- sys_clock_* and any other time functions.
- sys_sched_*
- All signal-related syscalls
- rlimit syscalls
- sys_*xattr*
- sys_nice
- sys_cap{get,set}
- fadvise, fallocate, readahead, etc.
- uid/gid functions.
- ioperm/iopl
- ptrace
- sendfile
- times
- utimes and company

> - how much configurability here is too much ?
>   r_f_p was an obvious candidate because it's.. well, nasty.  Some of the
>   more straightforward syscalls may not be such a big deal, but then we
>   have CONFIG's for kcmp and other 'simple' syscalls already..

We need a more systematic mechanism, I think.  CONFIG_SYSCALL_FOO for
every possible FOO seems too much, even for classes of syscalls.
Ideally, we could feed in a table of syscalls collected by some
analysis of the target userspace, and the kernel will then have exactly
those syscalls.

- Josh Triplett

[-- Attachment #2: syscalls-i --]
[-- Type: text/plain, Size: 3105 bytes --]

sys_access
sys_adjtimex
sys_alarm
sys_brk
sys_capget
sys_capset
sys_chdir
sys_chmod
sys_chown
sys_chroot
sys_clock_adjtime
sys_clock_getres
sys_clock_gettime
sys_clock_nanosleep
sys_clock_settime
sys_clone
sys_close
sys_creat
sys_dup
sys_dup2
sys_dup3
sys_execve
sys_exit
sys_exit_group
sys_faccessat
sys_fadvise64
sys_fadvise64_64
sys_fallocate
sys_fchdir
sys_fchmod
sys_fchmodat
sys_fchown
sys_fchownat
sys_fcntl
sys_fcntl64
sys_fdatasync
sys_fgetxattr
sys_flistxattr
sys_fork
sys_fremovexattr
sys_fsetxattr
sys_fstat
sys_fstat64
sys_fstatat64
sys_fstatfs
sys_fstatfs64
sys_fsync
sys_ftruncate
sys_ftruncate64
sys_futimesat
sys_getcpu
sys_getcwd
sys_getdents
sys_getdents64
sys_getegid
sys_geteuid
sys_getgid
sys_getgroups
sys_gethostname
sys_getitimer
sys_getpgid
sys_getpgrp
sys_getpid
sys_getppid
sys_getpriority
sys_getresgid
sys_getresuid
sys_getrlimit
sys_getrusage
sys_getsid
sys_gettid
sys_gettimeofday
sys_getuid
sys_getxattr
sys_ioctl
sys_ioperm
sys_kill
sys_lchown
sys_lgetxattr
sys_link
sys_linkat
sys_listxattr
sys_llistxattr
sys_llseek
sys_lremovexattr
sys_lseek
sys_lsetxattr
sys_lstat
sys_lstat64
sys_mkdir
sys_mkdirat
sys_mknod
sys_mknodat
sys_mmap_pgoff
sys_mount
sys_munmap
sys_nanosleep
sys_newfstat
sys_newfstatat
sys_newlstat
sys_newstat
sys_newuname
sys_ni_syscall
sys_nice
sys_old_getrlimit
sys_old_mmap
sys_old_readdir
sys_old_select
sys_oldumount
sys_olduname
sys_open
sys_openat
sys_pause
sys_personality
sys_pipe
sys_pipe2
sys_pivot_root
sys_poll
sys_ppoll
sys_prctl
sys_pread64
sys_preadv
sys_prlimit64
sys_pselect6
sys_ptrace
sys_pwrite64
sys_pwritev
sys_read
sys_readahead
sys_readlink
sys_readlinkat
sys_readv
sys_reboot
sys_removexattr
sys_rename
sys_renameat
sys_renameat2
sys_restart_syscall
sys_rmdir
sys_rt_sigaction
sys_rt_sigpending
sys_rt_sigprocmask
sys_rt_sigqueueinfo
sys_rt_sigsuspend
sys_rt_sigtimedwait
sys_rt_tgsigqueueinfo
sys_sched_get_priority_max
sys_sched_get_priority_min
sys_sched_getaffinity
sys_sched_getattr
sys_sched_getparam
sys_sched_getscheduler
sys_sched_rr_get_interval
sys_sched_setaffinity
sys_sched_setattr
sys_sched_setparam
sys_sched_setscheduler
sys_sched_yield
sys_select
sys_sendfile
sys_sendfile64
sys_set_tid_address
sys_setdomainname
sys_setfsgid
sys_setfsuid
sys_setgid
sys_setgroups
sys_sethostname
sys_setitimer
sys_setns
sys_setpgid
sys_setpriority
sys_setregid
sys_setresgid
sys_setresuid
sys_setreuid
sys_setrlimit
sys_setsid
sys_settimeofday
sys_setuid
sys_setxattr
sys_sgetmask
sys_sigaction
sys_sigaltstack
sys_signal
sys_sigpending
sys_sigprocmask
sys_sigsuspend
sys_splice
sys_ssetmask
sys_stat
sys_stat64
sys_statfs
sys_statfs64
sys_stime
sys_symlink
sys_symlinkat
sys_sync
sys_sync_file_range
sys_sync_file_range2
sys_syncfs
sys_sysctl
sys_sysinfo
sys_tee
sys_tgkill
sys_time
sys_timer_create
sys_timer_delete
sys_timer_getoverrun
sys_timer_gettime
sys_timer_settime
sys_times
sys_tkill
sys_truncate
sys_truncate64
sys_umask
sys_umount
sys_uname
sys_unlink
sys_unlinkat
sys_unshare
sys_ustat
sys_utime
sys_utimensat
sys_utimes
sys_vfork
sys_vhangup
sys_vmsplice
sys_wait4
sys_waitid
sys_waitpid
sys_write
sys_writev

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 16:22   ` Josh Triplett
@ 2014-05-09 16:59     ` Bird, Tim
  2014-05-09 17:23       ` josh
  0 siblings, 1 reply; 79+ messages in thread
From: Bird, Tim @ 2014-05-09 16:59 UTC (permalink / raw)
  To: Josh Triplett, Dave Jones
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Friday, May 09, 2014 9:22 AM Josh Triplett wrote:
> 
> On Fri, May 02, 2014 at 01:11:03PM -0400, Dave Jones wrote:
> > On Fri, May 02, 2014 at 09:44:42AM -0700, Josh Triplett wrote:
> >
> >  > Topics:
> >  > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
> >  > - Optimizing a kernel for its exact target userspace.
> >  > - Examples of shrinking the kernel
> >
> > Something that's partially related here: Making stuff optional
> > reduces attack surface the kernel presents. We're starting to grow
> > more and more CONFIG options to disable syscalls. I'd like to hear
> > people's reactions on introducing even more optionality in this area.
> 
> I'd certainly like to see just about every syscall made optional, for
> userspace that doesn't need it.  For specialized systems, that certainly
> would decrease attack surface.  However, seccomp decreases attack
> surface by the same amount, and for any except those specialized systems
> that would make more sense, because the set of available syscalls can
> then change with a simple policy change rather than a new kernel.
> 
> And this doesn't free us from the obligation to make all new APIs
> secure against hostile userspace.
> 
> > I had a patch to make this particular syscall a cond_syscall, but then
> > XFS ate my homework and I haven't had chance to revisit this.
> > So, my questions are:
> > - are there other obvious syscalls we could make optional without userspace
> >   freaking out when they suddenly start getting ENOSYS ?
> 
> I've attached a complete list of the syscalls from
> include/linux/syscalls.h that do not appear in kernel/sys_ni.c, and thus
> always exist.  (syscalls.h notably does not include all the
> arch-specific syscalls, some of which might make sense to leave out as
> well.)
> 
> Of those, a few classes of syscalls that seem obvious, for various
> classes of specialized or legacy-free systems:
> 
> - For any syscall updated to have a foo2, foo3, etc, a single config
>   option to leave out all the older versions would make sense, to go
>   with userspace that never calls the older versions.
> - Likewise, the non-64 file calls.
> - Likewise, sys_old*
> - splice/vmsplice/tee.
> - sys_*sync*
> - sys_clock_* and any other time functions.
> - sys_sched_*
> - All signal-related syscalls
> - rlimit syscalls
> - sys_*xattr*
> - sys_nice
> - sys_cap{get,set}
> - fadvise, fallocate, readahead, etc.
> - uid/gid functions.
> - ioperm/iopl
> - ptrace
> - sendfile
> - times
> - utimes and company
> 
> > - how much configurability here is too much ?
> >   r_f_p was an obvious candidate because it's.. well, nasty.  Some of the
> >   more straightforward syscalls may not be such a big deal, but then we
> >   have CONFIG's for kcmp and other 'simple' syscalls already..
> 
> We need a more systematic mechanism, I think.  CONFIG_SYSCALL_FOO for
> every possible FOO seems too much, even for classes of syscalls.
> Ideally, we could feed in a table of syscalls collected by some
> analysis of the target userspace, and the kernel will then have exactly
> those syscalls.

In my system, I set it up so that every syscall had its own
SYSCALL_DEFINE macro, and then used a single header file
consisting of lines like:
#define syscall_setreuid16_unused 1

The SYSCALL_DEFINE macros would then control whether the
syscall was extern'ed or not.  A separate mechanism converted
the CALL macro in calls.S (on ARM) to use sys_ni_syscall, and
LTO made the (now unreferenced) function evaporate.

Overall, this allowed control of every syscall with a single easily
generated (or easily hand-edited) header file.  And, with a stub
header file, everything worked as without the changes.

The header file was auto-generated by tools that scanned the
user-space programs for all possible syscall sequences.

In hindsight this system could probably be improved with some
extra tweaking to the base SYSCALL_DEFINE macros, to make
it so no source changes were required at the function definition sites.

In any event, it's possible to get per-syscall granularity without
having to add new CONFIGS (but at the expense of adding a generated
header file).
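To make the generated-header step concrete, here is a minimal sketch of a
generator that emits one "#define syscall_<name>_unused 1" line per syscall
the scan did not find.  The function name and the syscall sets are
hypothetical; only the shape of the output line comes from the example
above, and the scanning of user-space binaries is not shown.

```python
def unused_syscall_header(all_syscalls, used_syscalls):
    """Build header text marking every syscall not in `used_syscalls` as unused.

    Names are given without the "sys_" prefix, matching the
    "#define syscall_setreuid16_unused 1" form.
    """
    used = set(used_syscalls)
    lines = [
        f"#define syscall_{name}_unused 1\n"
        for name in sorted(all_syscalls)
        if name not in used
    ]
    return "".join(lines)


# Hypothetical inputs: the kernel's syscalls and those a userspace scan found.
all_syscalls = {"read", "write", "setreuid16", "vhangup"}
used_syscalls = {"read", "write"}
header = unused_syscall_header(all_syscalls, used_syscalls)
```

An empty result corresponds to the stub header case, where nothing is
marked unused and the kernel builds as it would without the changes.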
 -- Tim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 16:59     ` Bird, Tim
@ 2014-05-09 17:23       ` josh
  0 siblings, 0 replies; 79+ messages in thread
From: josh @ 2014-05-09 17:23 UTC (permalink / raw)
  To: Bird, Tim
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, May 09, 2014 at 06:59:16PM +0200, Bird, Tim wrote:
> On Friday, May 09, 2014 9:22 AM Josh Triplett wrote:
> > 
> > On Fri, May 02, 2014 at 01:11:03PM -0400, Dave Jones wrote:
> > > - how much configurability here is too much ?
> > >   r_f_p was an obvious candidate because it's.. well, nasty.  Some of the
> > >   more straightforward syscalls may not be such a big deal, but then we
> > >   have CONFIG's for kcmp and other 'simple' syscalls already..
> > 
> > We need a more systematic mechanism, I think.  CONFIG_SYSCALL_FOO for
> > every possible FOO seems too much, even for classes of syscalls.
> > Ideally, we could feed in a table of syscalls collected by some
> > analysis of the target userspace, and the kernel will then have exactly
> > those syscalls.
> 
> In my system, I set it up so that every syscall had its own
> SYSCALL_DEFINE macro, and then used a single header file
> consisting of lines like:
> #define syscall_setreuid16_unused 1
> 
> The SYSCALL_DEFINE macros would then control whether the
> syscall was extern'ed or not.  A separate mechanism converted
> the CALL macro in calls.S (on ARM) to use sys_ni_syscall, and
> LTO made the (now unreferenced) function evaporate.
> 
> Overall, this allowed control of every syscall with a single easily
> generated (or easily hand-edited) header file.  And, with a stub
> header file, everything worked as without the changes.
> 
> The header file was auto-generated by tools that scanned the
> user-space programs for all possible syscall sequences.
> 
> In hindsight this system could probably be improved with some
> extra tweaking to the base SYSCALL_DEFINE macros, to make
> it so no source changes were required at the function definition sites.

Another possibility: make all the syscall functions garbage-collectable,
and only keep those referenced from the actual syscall table.  Then
generate the syscall table accordingly.
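A minimal sketch of the table-generation half, assuming a mapping from
syscall number to function name and a set of syscalls the target userspace
needs: anything not needed is pointed at sys_ni_syscall, so the real
function loses its only reference and becomes eligible for garbage
collection at link time.  The function name and the numbers are
illustrative only.

```python
def build_syscall_table(numbered_syscalls, used):
    """Return [(number, entry)] with unused slots routed to sys_ni_syscall.

    `numbered_syscalls` maps syscall number -> function name; `used` is the
    set of function names the target userspace actually needs.
    """
    table = []
    for nr, name in sorted(numbered_syscalls.items()):
        entry = name if name in used else "sys_ni_syscall"
        table.append((nr, entry))
    return table


# Illustrative numbering; a real table is per-architecture.
numbered = {0: "sys_read", 1: "sys_write", 2: "sys_open", 3: "sys_close"}
table = build_syscall_table(numbered, used={"sys_read", "sys_write"})
```

Keeping the slot numbers fixed and only swapping the entry preserves the
ABI for the syscalls that remain.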

- Josh Triplett

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-06 17:31                               ` Andy Lutomirski
@ 2014-05-09 18:22                                 ` H. Peter Anvin
  2014-05-09 20:37                                   ` Andy Lutomirski
  0 siblings, 1 reply; 79+ messages in thread
From: H. Peter Anvin @ 2014-05-09 18:22 UTC (permalink / raw)
  To: Andy Lutomirski, Josh Triplett
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On 05/06/2014 10:31 AM, Andy Lutomirski wrote:
> 
> I have two main objections to "root != kernel".  The bigger is that
> I'd like to see the security argument for it so that people can think
> about whether it makes sense.  The smaller is that "root != kernel"
> isn't necessarily well-defined.
> 
> For example, should root be able to write to the filesystem from which
> the kernel loads?  Should root be able to kexec a new kernel, if that
> clears some key known to the current kernel in the process?  Should
> root be able to start a KVM instance that passes essentially all
> hardware through?  Should root be able to talk directly to the
> system's embedded controller?  Should root be able to read all
> physical memory?  How about reading just enough to learn the kernel's
> semi-secret randomized addresses?  How about running perf without
> restrictions?
> 
> In the past, the actual security goal seems to have been "root shall
> not be able to do anything that would anger Microsoft and/or
> Verisign", which is far-enough removed from actual security that I
> don't want it anywhere near my system.  But if I could have a
> reasonable policy that "root shall not be able to persistently
> compromise the machine", then I think this could be great.
> 
> Note that the latter goal does not actually require that root be
> unable to modify the running kernel.
> 

The first aspect of this is that the kernel needs to *be able to* lock
out root from select functions.  These things will be system
configuration dependent.

Once you have the separation you can define exceptions.

	-hpa

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 18:22                                 ` H. Peter Anvin
@ 2014-05-09 20:37                                   ` Andy Lutomirski
  2014-05-09 22:50                                     ` Josh Triplett
  2014-05-10  0:23                                     ` James Bottomley
  0 siblings, 2 replies; 79+ messages in thread
From: Andy Lutomirski @ 2014-05-09 20:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 9, 2014 at 11:22 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 05/06/2014 10:31 AM, Andy Lutomirski wrote:
>>
>> I have two main objections to "root != kernel".  The bigger is that
>> I'd like to see the security argument for it so that people can think
>> about whether it makes sense.  The smaller is that "root != kernel"
>> isn't necessarily well-defined.
>>
>> For example, should root be able to write to the filesystem from which
>> the kernel loads?  Should root be able to kexec a new kernel, if that
>> clears some key known to the current kernel in the process?  Should
>> root be able to start a KVM instance that passes essentially all
>> hardware through?  Should root be able to talk directly to the
>> system's embedded controller?  Should root be able to read all
>> physical memory?  How about reading just enough to learn the kernel's
>> semi-secret randomized addresses?  How about running perf without
>> restrictions?
>>
>> In the past, the actual security goal seems to have been "root shall
>> not be able to do anything that would anger Microsoft and/or
>> Verisign", which is far-enough removed from actual security that I
>> don't want it anywhere near my system.  But if I could have a
>> reasonable policy that "root shall not be able to persistently
>> compromise the machine", then I think this could be great.
>>
>> Note that the latter goal does not actually require that root be
>> unable to modify the running kernel.
>>
>
> The first aspect of this is that the kernel needs to *be able to* lock
> out root from select functions.  These things will be system
> configuration dependent.
>

I'm still unconvinced.  For Chrome OS-style security, I think that
root just needs to be prevented from doing anything that will
interfere with the verified boot process the next time the machine
boots.  The kernel doesn't need any particular security feature for
this: the kernel can't change the verified boot keys either.  If an
attacker controls root on a Chromebook, the attacker has already won,
at least until the next reboot.

If the idea is to have a verified boot without any hardware or
firmware support, then, yes, the kernel needs to enforce that the
verification path can't be tampered with.  But I think we're talking
about Secure Boot here, and on a correct Secure Boot implementation*,
the worst that the kernel can do is to prevent the box from booting
next time.

The best arguments I've heard so far for why the kernel needs to try
to protect itself against root are:

1. MS/Verisign demand it.

2. It's annoying to fool a user into thinking that they just booted
Some Other OS when they're really running Linux without kernel help.
NB: no one has claimed that it's impossible AFAIK, just that it's
annoyingly complicated.

I like neither of these arguments.  #1 is politics, not security, and
#2 seems like security by annoying the attacker.

To be clear, I don't object on principle to making it possible for the
kernel to defend itself against root.  But it's hard, doing it right
will require a lot of care, and I don't think it's worth doing unless
there's a good reason.  If there's a good reason that I don't know
about, please tell me!

* The recent MITRE paper suggests that very few of these exist.
That's a separate issue.

--Andy

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 20:37                                   ` Andy Lutomirski
@ 2014-05-09 22:50                                     ` Josh Triplett
  2014-05-10  0:23                                     ` James Bottomley
  1 sibling, 0 replies; 79+ messages in thread
From: Josh Triplett @ 2014-05-09 22:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 09, 2014 at 01:37:22PM -0700, Andy Lutomirski wrote:
> The best arguments I've heard so far for why the kernel needs to try
> to protect itself against root are:
> 
> 1. MS/Verisign demand it.
> 
> 2. It's annoying to fool a user into thinking that they just booted
> Some Other OS when they're really running Linux without kernel help.
> NB: no one has claimed that it's impossible AFAIK, just that it's
> annoyingly complicated.
> 
> I like neither of these arguments.  #1 is politics, not security, and
> #2 seems like security by annoying the attacker.

#1 is useful if you care about supporting users booting Linux on modern
systems without changing BIOS configuration.

As for #2, I agree that it's just "annoying the attacker", and I don't
want to quibble over the value of that in this particular case, but keep
in mind that a *lot* of security is "annoying the attacker"; you can
rather precisely quantify how secure a system is by how much it costs to
purchase exploited systems or similar.  (See "An Agenda for Empirical
Cyber Crime Research", USENIX ATC 2011.)  And in very much the same
spirit as "I don't have to run faster than the bear", a lot of security
(against broad-scale exploits rather than targeted threats) is about
making it more painful to exploit a system than to do a
social-engineering attack or a physical security breach.

- Josh Triplett

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 20:37                                   ` Andy Lutomirski
  2014-05-09 22:50                                     ` Josh Triplett
@ 2014-05-10  0:23                                     ` James Bottomley
  2014-05-10  0:38                                       ` Andy Lutomirski
  1 sibling, 1 reply; 79+ messages in thread
From: James Bottomley @ 2014-05-10  0:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, 2014-05-09 at 13:37 -0700, Andy Lutomirski wrote:
> On Fri, May 9, 2014 at 11:22 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 05/06/2014 10:31 AM, Andy Lutomirski wrote:
> >>
> >> I have two main objections to "root != kernel".  The bigger is that
> >> I'd like to see the security argument for it so that people can think
> >> about whether it makes sense.  The smaller is that "root != kernel"
> >> isn't necessarily well-defined.
> >>
> >> For example, should root be able to write to the filesystem from which
> >> the kernel loads?  Should root be able to kexec a new kernel, if that
> >> clears some key known to the current kernel in the process?  Should
> >> root be able to start a KVM instance that passes essentially all
> >> hardware through?  Should root be able to talk directly to the
> >> system's embedded controller?  Should root be able to read all
> >> physical memory?  How about reading just enough to learn the kernel's
> >> semi-secret randomized addresses?  How about running perf without
> >> restrictions?
> >>
> >> In the past, the actual security goal seems to have been "root shall
> >> not be able to do anything that would anger Microsoft and/or
> >> Verisign", which is far-enough removed from actual security that I
> >> don't want it anywhere near my system.  But if I could have a
> >> reasonable policy that "root shall not be able to persistently
> >> compromise the machine", then I think this could be great.
> >>
> >> Note that the latter goal does not actually require that root be
> >> unable to modify the running kernel.
> >>
> >
> > The first aspect of this is that the kernel needs to *be able to* lock
> > out root from select functions.  These things will be system
> > configuration dependent.
> >
> 
> I'm still unconvinced.  For Chrome OS-style security, I think that
> root just needs to be prevented from doing anything that will
> interfere with the verified boot process the next time the machine
> boots.  The kernel doesn't need any particular security feature for
> this: the kernel can't change the verified boot keys either.  If an
> attacker controls root on a Chromebook, the attacker has already won,
> at least until the next reboot.
> 
> If the idea is to have a verified boot without any hardware or
> firmware support, then, yes, the kernel needs to enforce that the
> verification path can't be tampered with.  But I think we're talking
> about Secure Boot here, and on a correct Secure Boot implementation*,
> the worst that the kernel can do is to prevent the box from booting
> next time.
> 
> The best arguments I've heard so far for why the kernel needs to try
> to protect itself against root are:
> 
> 1. MS/Verisign demand it.

I think we should stop focussing on this case because firstly it's
second guessing what we actually have to do and secondly there might be
a solution on the horizon for Secure Boot which removes the issue.  As I
said, I'll try to have non-UEFI-confidential details by the time KS
rolls around.

> 2. It's annoying to fool a user into thinking that they just booted
> Some Other OS when they're really running Linux without kernel help.
> NB: no one has claimed that it's impossible AFAIK, just that it's
> annoyingly complicated.
> 
> I like neither of these arguments.  #1 is politics, not security, and
> #2 seems like security by annoying the attacker.
> 
> To be clear, I don't object on principle to making it possible for the
> kernel to defend itself against root.  But it's hard, doing it right
> will require a lot of care, and I don't think it's worth doing unless
> there's a good reason.  If there's a good reason that I don't know
> about, please tell me!

I can be agnostic on this.  What I want is something that gives my
internet attached server more security ... in particular, if someone
breaks in I want them to be able to do as little damage as possible.
The reason I'm agnostic is that I have had several break-ins over the
past decade I've been running my own colo hosted system.  None actually
managed to escalate to root (so any root security wouldn't have helped).
The problem for me was that they managed to do enough damage without
being root (one even managed to get apache to send out masses of spam
leaving me with a huge cleanup operation ... lesson learned, I now have
an iptables OUTPUT rule rejecting any non-postfix traffic to port 25 and
alerting me to the attempt).

The point is I like the theoretical idea of protecting my system from
someone who manages to escalate to root and hey, if it can be done
easily, I'll take it, but that's not usually my biggest problem when I
consider security.

James

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-10  0:23                                     ` James Bottomley
@ 2014-05-10  0:38                                       ` Andy Lutomirski
  2014-05-10  3:44                                         ` Josh Triplett
  0 siblings, 1 reply; 79+ messages in thread
From: Andy Lutomirski @ 2014-05-10  0:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall,
	Darren Hart, Dan Carpenter

On Fri, May 9, 2014 at 5:23 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Fri, 2014-05-09 at 13:37 -0700, Andy Lutomirski wrote:
>> On Fri, May 9, 2014 at 11:22 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> > On 05/06/2014 10:31 AM, Andy Lutomirski wrote:
>> >>
>> >> I have two main objections to "root != kernel".  The bigger is that
>> >> I'd like to see the security argument for it so that people can think
>> >> about whether it makes sense.  The smaller is that "root != kernel"
>> >> isn't necessarily well-defined.
>> >>
>> >> For example, should root be able to write to the filesystem from which
>> >> the kernel loads?  Should root be able to kexec a new kernel, if that
>> >> clears some key known to the current kernel in the process?  Should
>> >> root be able to start a KVM instance that passes essentially all
>> >> hardware through?  Should root be able to talk directly to the
>> >> system's embedded controller?  Should root be able to read all
>> >> physical memory?  How about reading just enough to learn the kernel's
>> >> semi-secret randomized addresses?  How about running perf without
>> >> restrictions?
>> >>
>> >> In the past, the actual security goal seems to have been "root shall
>> >> not be able to do anything that would anger Microsoft and/or
>> >> Verisign", which is far-enough removed from actual security that I
>> >> don't want it anywhere near my system.  But if I could have a
>> >> reasonable policy that "root shall not be able to persistently
>> >> compromise the machine", then I think this could be great.
>> >>
>> >> Note that the latter goal does not actually require that root be
>> >> unable to modify the running kernel.
>> >>
>> >
>> > The first aspect of this is that the kernel needs to *be able to* lock
>> > out root from select functions.  These things will be system
>> > configuration dependent.
>> >
>>
>> I'm still unconvinced.  For Chrome OS-style security, I think that
>> root just needs to be prevented from doing anything that will
>> interfere with the verified boot process the next time the machine
>> boots.  The kernel doesn't need any particular security feature for
>> this: the kernel can't change the verified boot keys either.  If an
>> attacker controls root on a Chromebook, the attacker has already won,
>> at least until the next reboot.
>>
>> If the idea is to have a verified boot without any hardware or
>> firmware support, then, yes, the kernel needs to enforce that the
>> verification path can't be tampered with.  But I think we're talking
>> about Secure Boot here, and on a correct Secure Boot implementation*,
>> the worst that the kernel can do is to prevent the box from booting
>> next time.
>>
>> The best arguments I've heard so far for why the kernel needs to try
>> to protect itself against root are:
>>
>> 1. MS/Verisign demand it.
>
> I think we should stop focussing on this case because firstly it's
> second guessing what we actually have to do and secondly there might be
> a solution on the horizon for Secure Boot which removes the issue.  As I
> said, I'll try to have non uefi confidential details by the time KS
> rolls around.
>
>> 2. It's annoying to fool a user into thinking that they just booted
>> Some Other OS when they're really running Linux without kernel help.
>> NB: no one has claimed that it's impossible AFAIK, just that it's
>> annoyingly complicated.
>>
>> I like neither of these arguments.  #1 is politics, not security, and
>> #2 seems like security by annoying the attacker.
>>
>> To be clear, I don't object on principle to making it possible for the
>> kernel to defend itself against root.  But it's hard, doing it right
>> will require a lot of care, and I don't think it's worth doing unless
>> there's a good reason.  If there's a good reason that I don't know
>> about, please tell me!
>
> I can be agnostic on this.  What I want is something that gives my
> internet attached server more security ... in particular, if someone
> breaks in I want them to be able to do as little damage as possible.
> The reason I'm agnostic is that I have had several break-ins over the
> past decade I've been running my own colo hosted system.  None actually
> managed to escalate to root (so any root security wouldn't have helped).
> The problem for me was that they managed to do enough damage without
> being root (one even managed to get apache to send out masses of spam
> leaving me with a huge cleanup operation ... lesson learned, I now have
> an iptables OUTPUT rule rejecting any non-postfix traffic to port 25 and
> alerting me to the attempt).
>
> The point is I like the theoretical idea of protecting my system from
> someone who manages to escalate to root and hey, if it can be done
> easily, I'll take it, but that's not usually my biggest problem when I
> consider security.

We're almost at the point where it would be reasonable to shove
basically every service on a system into a user namespace, in which
case, barring bugs, you shouldn't be able to own the kernel.  I wonder
if this might do pretty much exactly what you want.

Essentially, you'd mount your filesystems, make a new userns, move all
network devices into a new netns owned by that userns, unshare the
mount namespace, and somehow get systemd or whatever other init
program you're using to play along.

If you're just trying to protect the kernel, you can even map all uids
straight through, including root.

The major annoyance is that it's damn near impossible to mount real
filesystems (i.e. not tmpfs) from inside a userns.  There could be a
daemon outside that helps out.

This doesn't necessarily do anything sensible with device nodes, but
that might not matter for servers.

--Andy

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-10  0:38                                       ` Andy Lutomirski
@ 2014-05-10  3:44                                         ` Josh Triplett
  0 siblings, 0 replies; 79+ messages in thread
From: Josh Triplett @ 2014-05-10  3:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Boyer, Sarah Sharp, ksummit-discuss, Greg KH,
	James Bottomley, Julia Lawall, Darren Hart, Dan Carpenter

On Fri, May 09, 2014 at 05:38:49PM -0700, Andy Lutomirski wrote:
> We're almost at the point where it would be reasonable to shove
> basically every service on a system into a user namespace, in which
> case, barring bugs, you shouldn't be able to own the kernel.  I wonder
> if this might do pretty much exactly what you want.

We should absolutely do this, and it'll make a big difference.  However,
"barring bugs" is a pretty big bar; in practice, it's probably easier to
get from user->kernel than to get from user->root, just because you can
do the former from any process that can make system calls.  We're not
anywhere close to done with fixing system call vulnerabilities.

> Essentially, you'd mount your filesystems, make a new userns, move all
> network devices into a new netns owned by that userns, unshare the
> mount namespace, and somehow get systemd or whatever other init
> program you're using to play along.

systemd makes it rather easy to configure a service for this kind of
namespace isolation.  You can, for instance, put services that don't
need the network in a network namespace that only includes localhost.  I
suspect that far more services will take advantage of that than will
attempt to configure an equivalent isolation setup manually.

- Josh Triplett

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 16:44 [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions Josh Triplett
  2014-05-02 17:11 ` Dave Jones
  2014-05-08 15:52 ` Christoph Lameter
@ 2014-05-12 17:35 ` Wolfram Sang
  2014-05-13 16:36 ` Bird, Tim
  2014-08-17  9:45 ` [Ksummit-discuss] tiny.wiki.kernel.org Josh Triplett
  4 siblings, 0 replies; 79+ messages in thread
From: Wolfram Sang @ 2014-05-12 17:35 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

> - Tiny in RAM versus tiny on storage.
> - How much the kernel has grown over time.
> - How size regressions happen and how to avoid them
> - Size measurement, bloat-o-meter, allnoconfig, and other tools

Besides the discussions about keeping the kernel size small by
deselecting features, I believe we have a few options left to reduce the
size just by rethinking how data is arranged. I have just started to
research 'strings' in the kernel and am already seeing patterns which
look like low-hanging fruit to me. (Unsurprisingly, given the amount of
copy&paste code.) Take the OOM message removal as one example: such
things can be fixed and prevented. Although my focus is 'strings'
currently, I am sure the lessons learned can be of generic interest. In
short: keep the bloat at the level that a new feature really needs.

I proposed to present my results at LinuxCon anyhow, so that would fit.

Disclaimer: Yes, I work more on drivers than core code :)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-02 16:44 [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions Josh Triplett
                   ` (2 preceding siblings ...)
  2014-05-12 17:35 ` Wolfram Sang
@ 2014-05-13 16:36 ` Bird, Tim
  2014-05-13 18:00   ` josh
  2014-05-14  1:04   ` Julia Lawall
  2014-08-17  9:45 ` [Ksummit-discuss] tiny.wiki.kernel.org Josh Triplett
  4 siblings, 2 replies; 79+ messages in thread
From: Bird, Tim @ 2014-05-13 16:36 UTC (permalink / raw)
  To: Josh Triplett, ksummit-discuss
  Cc: Sarah Sharp, Greg KH, Julia Lawall, Dan Carpenter, Darren Hart, Alan

On Friday, May 02, 2014 9:44 AM, Josh Triplett wrote:
> 
> Over time, the Linux kernel has grown far more featureful, but it has
> also grown significantly larger, even with all the optional features
> turned off.  For the last several years, I've been working to make the
> kernel smaller, and mentoring/coordinating projects to do the same, to
> enable ridiculously small embedded applications and other fun uses.  I'd
> like to discuss that work at Kernel Summit, get size regressions on the
> radar of kernel developers and subsystem maintainers, and solicit
> discussion on future possibilities to shrink the kernel further.
> 
> Topics:
> - An overview of why the kernel's size still matters today ("but don't
>   we all have tons of memory and storage?")
> - Tiny in RAM versus tiny on storage.
> - How much the kernel has grown over time.
> - How size regressions happen and how to avoid them
> - Size measurement, bloat-o-meter, allnoconfig, and other tools
> - Compression and the decompression stub
> - Kconfig, and avoiding excessive configurability in the pursuit of tiny
> - Optimizing a kernel for its exact target userspace.
> - Examples of shrinking the kernel
> - Discussion on proposed ways to make the kernel tiny, how much they
>   might save, how much work they'd require, and how to implement them
>   with minimal impact to the un-shrunken common case.
> 

I'd really like to see a discussion of mechanisms to improve automated
reduction of the kernel.  This will really help, IMHO, to avoid excessive
configurability, and hopefully ameliorate complaints about the long-term
maintenance cost of keeping small configurations available in-tree.

One prime example of this would be the "static-ification" of DT, e.g.
replacing calls to lookup DT info with constants (via macros or some other
source replacement trick), so that we can leverage the compiler's
optimizations for constant propagation and dead code removal.
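
As a rough illustration of that idea, here is a minimal userspace sketch
(not kernel code). The names `dt_read_u32`, `board_dt`,
`STATIC_DT_HAS_DMA`, and the property names are all hypothetical. It
contrasts a runtime lookup with a build-time constant; with the constant,
a compiler can typically fold the test and drop the unused branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum xfer_mode { XFER_PIO, XFER_DMA };

/* Hypothetical stand-in for device-tree data that is only known at boot. */
struct dt_prop {
	const char *name;
	uint32_t value;
};

const struct dt_prop board_dt[] = {
	{ "uart-count", 1 },
	{ "has-dma",    0 },
};

/* Runtime lookup: walk the table and compare property names. */
uint32_t dt_read_u32(const char *name, uint32_t fallback)
{
	assert(name != NULL);
	for (size_t i = 0; i < sizeof(board_dt) / sizeof(board_dt[0]); i++) {
		if (strcmp(board_dt[i].name, name) == 0)
			return board_dt[i].value;
	}
	return fallback;
}

/* Hypothetical constant that a build step would generate for one board. */
#define STATIC_DT_HAS_DMA 0u

/* Dynamic form: when the value is only known at boot, both the DMA and
 * the PIO path have to be kept. */
enum xfer_mode pick_mode_dynamic(void)
{
	return dt_read_u32("has-dma", 0) ? XFER_DMA : XFER_PIO;
}

/* Static form: the condition is a compile-time constant, so the compiler
 * can fold it and treat the DMA path as dead code. */
enum xfer_mode pick_mode_static(void)
{
	return STATIC_DT_HAS_DMA ? XFER_DMA : XFER_PIO;
}
```

Both functions return `XFER_PIO` for this table; the static form is what
a source-replacement step would aim to produce from the dynamic one, and
only it hands the optimizer a constant to propagate.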

> After the session, I'll prepare and maintain a detailed summary of the
> proposed ideas, ordered by how much space they'd save versus how much
> work they'd be.  I plan to maintain that list on an ongoing basis, to
> coordinate tinification projects for ongoing work by people working on
> embedded applications, and for the benefit of mentorship projects such
> as OPW and SoC.

Thanks for taking the lead on this!

Can I recommend we use the linux-embedded mailing list for discussions?
It's underutilized and this topic seems like a good fit.  Also, the elinux wiki
is available if you're looking for a place to maintain information on this.
The Linux-tiny material there is stale, but I have been thinking about updating
it since the last ELC.


  -- Tim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-13 16:36 ` Bird, Tim
@ 2014-05-13 18:00   ` josh
  2014-05-14  1:04   ` Julia Lawall
  1 sibling, 0 replies; 79+ messages in thread
From: josh @ 2014-05-13 18:00 UTC (permalink / raw)
  To: Bird, Tim
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Alan,
	Darren Hart, Dan Carpenter

On Tue, May 13, 2014 at 06:36:40PM +0200, Bird, Tim wrote:
> On Friday, May 02, 2014 9:44 AM, Josh Triplett wrote:
> > 
> > Over time, the Linux kernel has grown far more featureful, but it has
> > also grown significantly larger, even with all the optional features
> > turned off.  For the last several years, I've been working to make the
> > kernel smaller, and mentoring/coordinating projects to do the same, to
> > enable ridiculously small embedded applications and other fun uses.  I'd
> > like to discuss that work at Kernel Summit, get size regressions on the
> > radar of kernel developers and subsystem maintainers, and solicit
> > discussion on future possibilities to shrink the kernel further.
> > 
> > Topics:
> > - An overview of why the kernel's size still matters today ("but don't
> >   we all have tons of memory and storage?")
> > - Tiny in RAM versus tiny on storage.
> > - How much the kernel has grown over time.
> > - How size regressions happen and how to avoid them
> > - Size measurement, bloat-o-meter, allnoconfig, and other tools
> > - Compression and the decompression stub
> > - Kconfig, and avoiding excessive configurability in the pursuit of tiny
> > - Optimizing a kernel for its exact target userspace.
> > - Examples of shrinking the kernel
> > - Discussion on proposed ways to make the kernel tiny, how much they
> >   might save, how much work they'd require, and how to implement them
> >   with minimal impact to the un-shrunken common case.
> > 
> 
> I'd really like to see a discussion of mechanisms to improve automated
> reduction of the kernel.  This will really help, IMHO, to avoid excessive
> configurability, and hopefully ameliorate complaints about the long-term
> maintenance cost of keeping small configurations available in-tree.

Agreed; that's one of the things I planned to include, under "avoiding
excessive configurability" and "Optimizing a kernel for its exact target
userspace".  LTO is another big part of this.

> One prime example of this would be the "static-ification" of DT, e.g.
> replacing calls to lookup DT info with constants (via macros or some other
> source replacement trick), so that we can leverage the compiler's
> optimizations for constant propagation and dead code removal.

I agree completely.  I'd also like to optimize the kernel for a specific
kernel command line (turning global variables set from the command line
into static consts), automatic syscall elimination (as discussed
elsewhere in the thread), and even optimization for specific target
hardware configurations (not just DT, but bits normally found via
probing).

> > After the session, I'll prepare and maintain a detailed summary of the
> > proposed ideas, ordered by how much space they'd save versus how much
> > work they'd be.  I plan to maintain that list on an ongoing basis, to
> > coordinate tinification projects for ongoing work by people working on
> > embedded applications, and for the benefit of mentorship projects such
> > as OPW and SoC.
> 
> Thanks for taking the lead on this!
> 
> Can I recommend we use the linux-embedded mailing list for discussions?
> It's underutilized and this topic seems like a good fit.  Also, the elinux wiki
> is available if you're looking for a place to maintain information on this.
> The Linux-tiny material there is stale, but I have been thinking about updating
> it since the last ELC.

Sounds plausible to me.  And I've been seriously considering taking on
maintenance of a new tiny tree.

- Josh Triplett

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-13 16:36 ` Bird, Tim
  2014-05-13 18:00   ` josh
@ 2014-05-14  1:04   ` Julia Lawall
  1 sibling, 0 replies; 79+ messages in thread
From: Julia Lawall @ 2014-05-14  1:04 UTC (permalink / raw)
  To: Bird, Tim
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Alan, Darren Hart, Dan Carpenter

> One prime example of this would be the "static-ification" of DT, e.g.
> replacing calls to lookup DT info with constants (via macros or some other
> source replacement trick), so that we can leverage the compiler's
> optimizations for constant propagation and dead code removal.

Could Coccinelle help with this somehow?  The requirement would be that
there is some pattern that can be recognized, and the desired change is
static, ie a modification in the source code.

julia

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [Ksummit-discuss] tiny.wiki.kernel.org
  2014-05-02 16:44 [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions Josh Triplett
                   ` (3 preceding siblings ...)
  2014-05-13 16:36 ` Bird, Tim
@ 2014-08-17  9:45 ` Josh Triplett
  4 siblings, 0 replies; 79+ messages in thread
From: Josh Triplett @ 2014-08-17  9:45 UTC (permalink / raw)
  To: ksummit-discuss

In preparation for the tinification discussion at Kernel Summit, I've
created tiny.wiki.kernel.org to track the status of kernel tinification,
potential projects, and HOWTOs.

Please feel free to add ideas or other material there in advance of the
discussion on Monday.

- Josh Triplett

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-15 20:00       ` Greg KH
@ 2014-05-15 20:29         ` Guenter Roeck
  0 siblings, 0 replies; 79+ messages in thread
From: Guenter Roeck @ 2014-05-15 20:29 UTC (permalink / raw)
  To: Greg KH
  Cc: Sarah Sharp, ksummit-discuss, James Bottomley, Julia Lawall,
	Darren Hart, Dan Carpenter

On Thu, May 15, 2014 at 01:00:19PM -0700, Greg KH wrote:
> On Thu, May 15, 2014 at 12:41:59PM -0700, H. Peter Anvin wrote:
> > 
> > I think kernel tinification and what should be acceptable goals and
> > non-goals for the mainline kernel would make an excellent KS topic.
> > Personally, I am for stretching Linux as far across the compute spectrum
> > as it can possibly go.
> 
> I'll second this for both a valid topic, and as a goal for Linux to
> achieve.
> 
Not that it matters much ;-), but same here.

Guenter

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-15 19:41     ` H. Peter Anvin
@ 2014-05-15 20:00       ` Greg KH
  2014-05-15 20:29         ` Guenter Roeck
  0 siblings, 1 reply; 79+ messages in thread
From: Greg KH @ 2014-05-15 20:00 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Sarah Sharp, ksummit-discuss, James Bottomley, Julia Lawall,
	Darren Hart, Dan Carpenter

On Thu, May 15, 2014 at 12:41:59PM -0700, H. Peter Anvin wrote:
> 
> I think kernel tinification and what should be acceptable goals and
> non-goals for the mainline kernel would make an excellent KS topic.
> Personally, I am for stretching Linux as far across the compute spectrum
> as it can possibly go.

I'll second this for both a valid topic, and as a goal for Linux to
achieve.

greg k-h

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-14  2:37   ` Li Zefan
@ 2014-05-15 19:41     ` H. Peter Anvin
  2014-05-15 20:00       ` Greg KH
  0 siblings, 1 reply; 79+ messages in thread
From: H. Peter Anvin @ 2014-05-15 19:41 UTC (permalink / raw)
  To: Li Zefan, James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On 05/13/2014 07:37 PM, Li Zefan wrote:
> 
> Mel has recently been working on a patchset for mm optimization, and I
> just noticed one particular patch adds static_key to eliminate cpuset
> overhead in mm code path. We sure can use static_key more in cgroup.
> 
> New features tend to add new stuff to structures like task_struct, which
> static_key or any other mechanisms can't help.
> 

Not to mention that static_key does nothing in a ROM-size-limited
application.

I think kernel tinification and what should be acceptable goals and
non-goals for the mainline kernel would make an excellent KS topic.
Personally, I am for stretching Linux as far across the compute spectrum
as it can possibly go.

	-hpa

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09  0:31 ` James Bottomley
  2014-05-09 14:48   ` Christoph Lameter
@ 2014-05-14  2:37   ` Li Zefan
  2014-05-15 19:41     ` H. Peter Anvin
  1 sibling, 1 reply; 79+ messages in thread
From: Li Zefan @ 2014-05-14  2:37 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On 2014/5/9 8:31, James Bottomley wrote:
> 
> On Thu, 2014-05-08 at 11:24 -0500, Christoph Lameter wrote:
>> On Fri, 2 May 2014, Josh Triplett wrote:
>>
>>> - An overview of why the kernel's size still matters today ("but don't
>>>   we all have tons of memory and storage?")
>>
>> Kernel size matters quite a bit for performance. Processor caches are key
>> to performance and therefore the cache footprint of a function determines
>> the possible performance. The smaller the functions and the less data
>> they access the faster they will run.
> 
> This is about footprint, though, it's about optimizing a code path to
> run in the fewest instructions possible, right?      
> 
>> Therefore it needs to be possible to reduce the size of the kernel by
>> disabling unwanted functionality (f.e. cgroups). In order for that to
>> happen features need to be as independent as possible and also the user
>> space tools (like systemd) need to be able to handle a kernel with reduced
>> functionality.
> 
> I don't believe that follows.  As long as the added code doesn't cause
> the cache footprint of the working set to expand, there's no performance
> reason to compile it out.   If you choose not to use syscalls, then the
> paths are inert from a performance point of view and it doesn't matter
> if they are config'd in or out.  Cgroups, on the other hand, impacts
> performance because it adds to the execution path of several syscalls.
> We were careful to use static branching to minimise this, but obviously
> it does expand the cache footprint.  Do you have any figures for the
> performance issues it's causing (being compiled in but unused)?  If it's
> significant, we could try static branching to out of line areas which
> shouldn't impact the cache footprint.
> 

Mel has recently been working on a patchset for mm optimization, and I
just noticed one particular patch adds static_key to eliminate cpuset
overhead in mm code path. We sure can use static_key more in cgroup.

New features tend to add new stuff to structures like task_struct, which
static_key or any other mechanisms can't help.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-12 18:06         ` Dave Hansen
@ 2014-05-12 20:20           ` Roland Dreier
  0 siblings, 0 replies; 79+ messages in thread
From: Roland Dreier @ 2014-05-12 20:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, James Bottomley,
	Julia Lawall, Darren Hart, Christoph Lameter, Dan Carpenter

On Mon, May 12, 2014 at 11:06 AM, Dave Hansen <dave@sr71.net> wrote:
>> Loadable modules are using vmalloc areas that use 4k pages which
>> is another issue.
>
> Isn't this just another case where we need to try kmalloc() and fall
> back to vmalloc() when it fails?  Most modules are loaded way before we
> see any kind of possibility of memory fragmentation so I'd expect that
> to be pretty successful in the common case.

This is probably a good idea to "just do".  I think it was about 10
years ago that I first added this hack to our ppc440 kernel (where
avoiding the software filled TLB gave us about 2x for module code) :).

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 16:55       ` Christoph Lameter
  2014-05-09 17:21         ` josh
  2014-05-09 17:42         ` James Bottomley
@ 2014-05-12 18:06         ` Dave Hansen
  2014-05-12 20:20           ` Roland Dreier
  2 siblings, 1 reply; 79+ messages in thread
From: Dave Hansen @ 2014-05-12 18:06 UTC (permalink / raw)
  To: Christoph Lameter, Steven Rostedt
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, James Bottomley,
	Julia Lawall, Darren Hart, Dan Carpenter

On 05/09/2014 09:55 AM, Christoph Lameter wrote:
> On Fri, 9 May 2014, Steven Rostedt wrote:
>>> > > One improvement would be to sort the functions by functionality. All the
>>> > > important functions in the first 2M of the code covered by one huge tlb
>>> > > f.e.
>> >
>> > I thought pretty much all of kernel core memory is mapped in by huge
>> > tlbs? At least for kernel core code (not modules), the size should not
>> > impact tlbs.
> Yes, but processors only support a limited number of 2M TLB entries, and
> applications also want to use them. A large 100M kernel would
> require 50 TLB entries and cause TLB thrashing if functions are accessed
> all over the code.

Is this taking into account that the second-level TLB on Haswell can now
hold 2M entries?  That should pretty drastically change the landscape here.
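
For reference, the 50-entry figure quoted above is plain division,
assuming "100M" and "2M" mean MiB and the kernel image is aligned to the
page size. A small sketch (the helper name `tlb_entries_needed` is made
up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Number of TLB entries needed to map `bytes` using pages of `page_size`
 * bytes, rounding up for a partially used last page. */
uint64_t tlb_entries_needed(uint64_t bytes, uint64_t page_size)
{
	assert(page_size != 0);
	return (bytes + page_size - 1) / page_size;
}
```

Calling `tlb_entries_needed(100ULL << 20, 2ULL << 20)` gives 50; the
same 100 MiB mapped with 4 KiB pages would need 25600 entries, which is
why the page size used for module memory matters as well.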

> Loadable modules are using vmalloc areas that use 4k pages which
> is another issue.

Isn't this just another case where we need to try kmalloc() and fall
back to vmalloc() when it fails?  Most modules are loaded way before we
see any kind of possibility of memory fragmentation so I'd expect that
to be pretty successful in the common case.
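
The try-then-fall-back pattern could look roughly like this. It is a
userspace model only: `alloc_contig` and `alloc_paged` are hypothetical
stand-ins for kmalloc() and vmalloc(), both backed by malloc() here, and
`contig_limit` is an invented knob that imitates fragmentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum backing { BACKING_NONE, BACKING_CONTIG, BACKING_PAGED };

struct region {
	void *addr;
	enum backing backing;
};

/* Hypothetical stand-in for kmalloc(): fails above `contig_limit` to
 * imitate fragmentation; here it simply wraps malloc(). */
void *alloc_contig(size_t size, size_t contig_limit)
{
	return size <= contig_limit ? malloc(size) : NULL;
}

/* Hypothetical stand-in for vmalloc(); also wraps malloc() here. */
void *alloc_paged(size_t size)
{
	return malloc(size);
}

/* Try the contiguous allocator first and fall back to the paged one. */
struct region alloc_region(size_t size, size_t contig_limit)
{
	struct region r = { NULL, BACKING_NONE };

	r.addr = alloc_contig(size, contig_limit);
	if (r.addr) {
		r.backing = BACKING_CONTIG;
		return r;
	}
	r.addr = alloc_paged(size);
	if (r.addr)
		r.backing = BACKING_PAGED;
	return r;
}

/* The recorded backing tells the caller which free routine applies
 * (kfree() versus vfree() in the kernel); both are free() in this model. */
void free_region(struct region *r)
{
	assert(r != NULL);
	free(r->addr);
	r->addr = NULL;
	r->backing = BACKING_NONE;
}
```

Recording which allocator succeeded is the one design point worth
noting: the free path has to match the allocation path, so the caller
needs that information at teardown.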

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 19:02               ` Julia Lawall
@ 2014-05-09 20:31                 ` Steven Rostedt
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2014-05-09 20:31 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, James Bottomley,
	Darren Hart, Christoph Lameter, Dan Carpenter

On Fri, 9 May 2014 21:02:26 +0200 (CEST)
Julia Lawall <julia.lawall@lip6.fr> wrote:
 
> Which kinds of functions are being discussed?  For example, similar 
> drivers may use similar functions, but if one has only one instance of a 
> driver that has a particular functionality, only one of those functions 
> may ever be executed.  Is there only a benefit if similar functions are 
> all executed frequently in practice?

I wasn't thinking about drivers, but more core kernel code, like the
scheduler or system call interface. Things that do have similar designs
and are called frequently.

-- Steve

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 18:32             ` Steven Rostedt
@ 2014-05-09 19:02               ` Julia Lawall
  2014-05-09 20:31                 ` Steven Rostedt
  0 siblings, 1 reply; 79+ messages in thread
From: Julia Lawall @ 2014-05-09 19:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, James Bottomley,
	Julia Lawall, Darren Hart, Christoph Lameter, Dan Carpenter

On Fri, 9 May 2014, Steven Rostedt wrote:

> On Fri, 9 May 2014 12:52:00 -0500 (CDT)
> Christoph Lameter <cl@linux.com> wrote:
> 
> > On Fri, 9 May 2014, James Bottomley wrote:
> > 
> > > > Global optimization may allow the folding of small functions into a larger
> > > > one when advantageous (which is not simple to determine).
> > >
> > > It's possible, but complex ... I'd really like to see proof that it
> > > helps before thinking about it.
> > 
> > The proof may mean that one will have to do the work.
> 
> Hello Chicken, meet Egg!
> 
> 
> This is what proofs of concept are for.

Which kinds of functions are being discussed?  For example, similar 
drivers may use similar functions, but if one has only one instance of a 
driver that has a particular functionality, only one of those functions 
may ever be executed.  Is there only a benefit if similar functions are 
all executed frequently in practice?

julia

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 17:52           ` Christoph Lameter
@ 2014-05-09 18:32             ` Steven Rostedt
  2014-05-09 19:02               ` Julia Lawall
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2014-05-09 18:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, James Bottomley,
	Julia Lawall, Darren Hart, Dan Carpenter

On Fri, 9 May 2014 12:52:00 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Fri, 9 May 2014, James Bottomley wrote:
> 
> > > Global optimization may allow the folding of small functions into a larger
> > > one when advantageous (which is not simple to determine).
> >
> > It's possible, but complex ... I'd really like to see proof that it
> > helps before thinking about it.
> 
> The proof may mean that one will have to do the work.

Hello Chicken, meet Egg!


This is what proofs of concept are for.

-- Steve

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 17:42         ` James Bottomley
  2014-05-09 17:52           ` Christoph Lameter
@ 2014-05-09 17:52           ` Matthew Wilcox
  1 sibling, 0 replies; 79+ messages in thread
From: Matthew Wilcox @ 2014-05-09 17:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Christoph Lameter, Dan Carpenter

On 2014-05-09 1:42 PM, "James Bottomley" <
James.Bottomley@hansenpartnership.com> wrote:
> In theory, we could use link time optimization to place all the most
> used functions in the first TLB entry.  However, as Steve said, have you
> got measurements showing this helps?  If it's down in the noise, it's a
> lot of work for no benefit.

It's going to be highly workload dependent. For example, TPC-C randomly
accesses all of memory. Even doubling the number of 2MB TLB entries isn't
going to help more than a couple of percent. On the other hand, for a
scientific workload which juuuust overflows the number of 2MB entries, you
might see a 100% speedup with the freeing of a single 2MB entry to
userspace. And there are many workloads in between (most exhibit at least
some locality of reference).
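
The "just overflows" case can be sketched with a toy model. The function below is a minimal illustration, assuming a fully associative TLB with true LRU replacement and a workload that sweeps cyclically over its pages; real TLBs are neither, and the name lru_tlb_misses and the MAX_ENTRIES limit are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 64  /* arbitrary cap for this toy model */

/*
 * Count TLB misses for `rounds` cyclic sweeps over `pages` distinct pages,
 * using a fully associative TLB with `entries` slots and LRU replacement.
 */
static size_t lru_tlb_misses(size_t entries, size_t pages, size_t rounds)
{
    size_t tag[MAX_ENTRIES];       /* page held by each slot */
    size_t last_use[MAX_ENTRIES];  /* timestamp of the slot's last access */
    size_t used = 0, misses = 0, clock = 0;

    assert(entries > 0 && entries <= MAX_ENTRIES);

    for (size_t r = 0; r < rounds; r++) {
        for (size_t p = 0; p < pages; p++) {
            size_t hit = entries;  /* sentinel: not found */
            size_t victim = 0;     /* least recently used slot so far */

            clock++;
            for (size_t i = 0; i < used; i++) {
                if (tag[i] == p)
                    hit = i;
                if (last_use[i] < last_use[victim])
                    victim = i;
            }
            if (hit < entries) {
                last_use[hit] = clock;
                continue;
            }
            misses++;
            if (used < entries)
                victim = used++;   /* fill an empty slot first */
            tag[victim] = p;
            last_use[victim] = clock;
        }
    }
    return misses;
}
```

With 4 entries, ten sweeps over 4 pages miss only on the first sweep, while ten sweeps over 5 pages miss on every access, which is the cliff that a single extra entry removes.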

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 17:42         ` James Bottomley
@ 2014-05-09 17:52           ` Christoph Lameter
  2014-05-09 18:32             ` Steven Rostedt
  2014-05-09 17:52           ` Matthew Wilcox
  1 sibling, 1 reply; 79+ messages in thread
From: Christoph Lameter @ 2014-05-09 17:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 9 May 2014, James Bottomley wrote:

> > Global optimization may allow the folding of small functions into a larger
> > one when advantageous (which is not simple to determine).
>
> It's possible, but complex ... I'd really like to see proof that it
> helps before thinking about it.

The proof may mean that one will have to do the work.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 16:55       ` Christoph Lameter
  2014-05-09 17:21         ` josh
@ 2014-05-09 17:42         ` James Bottomley
  2014-05-09 17:52           ` Christoph Lameter
  2014-05-09 17:52           ` Matthew Wilcox
  2014-05-12 18:06         ` Dave Hansen
  2 siblings, 2 replies; 79+ messages in thread
From: James Bottomley @ 2014-05-09 17:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 2014-05-09 at 11:55 -0500, Christoph Lameter wrote:
> On Fri, 9 May 2014, Steven Rostedt wrote:
> 
> > > One improvement would be to sort the functions by functionality. All the
> > > important functions in the first 2M of the code covered by one huge tlb
> > > f.e.
> >
> > I thought pretty much all of kernel core memory is mapped in by huge
> > tlbs? At least for kernel core code (not modules), the size should not
> > impact tlbs.
> 
> Yes, but processors only support a limited number of 2M TLB entries, and
> applications also want to use them. A large 100M kernel would require 50
> TLB entries and cause TLB thrashing if functions are accessed all over
> the code. Loadable modules use vmalloc areas that use 4k pages, which is
> another issue.

In theory, we could use link time optimization to place all the most
used functions in the first TLB entry.  However, as Steve said, have you
got measurements showing this helps?  If it's down in the noise, it's a
lot of work for no benefit.

> > > Maybe we could reduce the number of cachelines used by critical functions
> > > too? Aren't there some tools that can automate this in gcc?
> >
> > As I believe James has mentioned. This only helps if we keep the
> > critical functions tight in a cacheline. I did some benchmarks moving
> > the tracepoint code more out of line to help in cachelines, and I
> > haven't seen anything above the noise. Which is the reason I haven't
> > pushed that work further.
> >
> > Size may not be as important as having reuse of code. Perhaps if you
> > can tweak several functions to call one helper function, which may
> > actually increase the total size of the kernel, but having more helper
> > functions that live in cache longer may be of benefit.
> 
> More helper functions means more use of L1 cache lines, which reduces
> performance.

Not if the compiler inlines them. Plus, if we have five critical
functions and we make them share a helper (which the compiler doesn't
inline), then we save four times the size of the helper in code, which
outweighs the additional function call overhead ... this is what Steve
is referring to.  Correct use of helper functions should reduce our L1
cache footprint, but the key is "correct".
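
As a rough sketch of that arithmetic: the function name shared_helper_savings and the byte counts used with it are assumptions for illustration, not measurements. It treats each call site as costing a fixed number of bytes and ignores alignment padding and argument setup.

```c
#include <assert.h>

/*
 * Bytes of text saved when `sites` functions each drop a private copy of a
 * `helper_bytes` code sequence and call one shared out-of-line helper
 * instead. Each call site costs `call_bytes`. The result is negative when
 * the helper is too small to pay for the calls.
 */
static long shared_helper_savings(long sites, long helper_bytes,
                                  long call_bytes)
{
    assert(sites >= 1 && helper_bytes >= 0 && call_bytes >= 0);
    /* One copy of the helper remains, so (sites - 1) copies are removed. */
    return (sites - 1) * helper_bytes - sites * call_bytes;
}
```

For five call sites, a 64-byte helper and a 5-byte call instruction, this gives 4 * 64 - 5 * 5 = 231 bytes saved; with a 4-byte helper the same sharing costs 9 bytes, which is the "correct" part.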

> > > In general the ability to reduce the size of the kernel to a minimum is a
> > > desirable feature. I still see deployments of older kernels in the
> > > financial industry because they have a higher performance and lower
> > > latency. The only way to get those guys would be to keep the kernel size
> > > and the size of the data touched the same.
> >
> > I actually wonder if that performance is really on "size" of the kernel
> > and not just less features. Usually with features, we add more function
> > calls and branches, which I believe may be the culprit of the slowdowns
> > we are seeing.
> 
> That too... But James said they were using static branching.

Cgroups are, yes ... after you complained a lot.

> Global optimization may allow the folding of small functions into a larger
> one when advantageous (which is not simple to determine).

It's possible, but complex ... I'd really like to see proof that it
helps before thinking about it.

James

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 16:55       ` Christoph Lameter
@ 2014-05-09 17:21         ` josh
  2014-05-09 17:42         ` James Bottomley
  2014-05-12 18:06         ` Dave Hansen
  2 siblings, 0 replies; 79+ messages in thread
From: josh @ 2014-05-09 17:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, James Bottomley,
	Julia Lawall, Darren Hart, Dan Carpenter

On Fri, May 09, 2014 at 11:55:23AM -0500, Christoph Lameter wrote:
> On Fri, 9 May 2014, Steven Rostedt wrote:
> > Size may not be as important as having reuse of code. Perhaps if you
> > can tweak several functions to call one helper function, which may
> > actually increase the total size of the kernel, but having more helper
> > functions that live in cache longer may be of benefit.
> 
> More helper functions means more use of L1 cache lines, which reduces
> performance.

If done poorly, but on the other hand, factoring a common code path out
of many call sites into one helper function makes it more likely that
helper function will remain cached.

- Josh Triplett

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 16:24     ` Steven Rostedt
@ 2014-05-09 16:55       ` Christoph Lameter
  2014-05-09 17:21         ` josh
                           ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Christoph Lameter @ 2014-05-09 16:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, James Bottomley,
	Julia Lawall, Darren Hart, Dan Carpenter

On Fri, 9 May 2014, Steven Rostedt wrote:

> > One improvement would be to sort the functions by functionality. All the
> > important functions in the first 2M of the code covered by one huge tlb
> > f.e.
>
> I thought pretty much all of kernel core memory is mapped in by huge
> tlbs? At least for kernel core code (not modules), the size should not
> impact tlbs.

Yes, but processors only support a limited number of 2M TLB entries, and
applications also want to use them. A large 100M kernel would require 50
TLB entries and cause TLB thrashing if functions are accessed all over
the code. Loadable modules use vmalloc areas that use 4k pages, which is
another issue.
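
The figure of 50 follows from dividing the text size by the huge page size and rounding up. A minimal sketch, with tlb_entries_needed as a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of TLB entries needed to map `text_bytes` of kernel text with
 * pages of `page_bytes` each, rounding up to whole pages.
 */
static size_t tlb_entries_needed(size_t text_bytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    return (text_bytes + page_bytes - 1) / page_bytes;
}
```

Calling tlb_entries_needed(100UL << 20, 2UL << 20) yields 50, and the same 100M mapped with 4k pages would need 25600 entries.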

> > Maybe we could reduce the number of cachelines used by critical functions
> > too? Aren't there some tools that can automate this in gcc?
>
> As I believe James has mentioned. This only helps if we keep the
> critical functions tight in a cacheline. I did some benchmarks moving
> the tracepoint code more out of line to help in cachelines, and I
> haven't seen anything above the noise. Which is the reason I haven't
> pushed that work further.
>
> Size may not be as important as having reuse of code. Perhaps if you
> can tweak several functions to call one helper function, which may
> actually increase the total size of the kernel, but having more helper
> functions that live in cache longer may be of benefit.

More helper functions means more use of L1 cache lines, which reduces
performance.

> > In general the ability to reduce the size of the kernel to a minimum is a
> > desirable feature. I still see deployments of older kernels in the
> > financial industry because they have a higher performance and lower
> > latency. The only way to get those guys would be to keep the kernel size
> > and the size of the data touched the same.
>
> I actually wonder if that performance is really on "size" of the kernel
> and not just less features. Usually with features, we add more function
> calls and branches, which I believe may be the culprit of the slowdowns
> we are seeing.

That too... But James said they were using static branching.

Global optimization may allow the folding of small functions into a larger
one when advantageous (which is not simple to determine).

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09 14:48   ` Christoph Lameter
@ 2014-05-09 16:24     ` Steven Rostedt
  2014-05-09 16:55       ` Christoph Lameter
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2014-05-09 16:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, James Bottomley,
	Julia Lawall, Darren Hart, Dan Carpenter

On Fri, 9 May 2014 09:48:19 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> Static branching means that it is removed from the code path but the
> overall code size is still increased because the function needs to be
> somewhere. And usually the additional functions are mixed with other
> functions that are essential. Which means increased need for TLB entries
> to do the virtual mappings. Plus there are no-op holes here and there that
> still increase the size of the function.
> 
> One improvement would be to sort the functions by functionality. All the
> important functions in the first 2M of the code covered by one huge tlb
> f.e.

I thought pretty much all of kernel core memory is mapped in by huge
tlbs? At least for kernel core code (not modules), the size should not
impact tlbs.

> 
> Maybe we could reduce the number of cachelines used by critical functions
> too? Aren't there some tools that can automate this in gcc?

As I believe James has mentioned. This only helps if we keep the
critical functions tight in a cacheline. I did some benchmarks moving
the tracepoint code more out of line to help in cachelines, and I
haven't seen anything above the noise. Which is the reason I haven't
pushed that work further.

Size may not be as important as having reuse of code. Perhaps if you
can tweak several functions to call one helper function, which may
actually increase the total size of the kernel, but having more helper
functions that live in cache longer may be of benefit.

> 
> Syscalls are often essential to performance in particular if one wants to
> use the I/O services of the kernel instead of relying on something like
> RDMA that bypasses the kernel.
> 
> In general the ability to reduce the size of the kernel to a minimum is a
> desirable feature. I still see deployments of older kernels in the
> financial industry because they have a higher performance and lower
> latency. The only way to get those guys would be to keep the kernel size
> and the size of the data touched the same.

I actually wonder if that performance is really on "size" of the kernel
and not just less features. Usually with features, we add more function
calls and branches, which I believe may be the culprit of the slowdowns
we are seeing.

-- Steve

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-09  0:31 ` James Bottomley
@ 2014-05-09 14:48   ` Christoph Lameter
  2014-05-09 16:24     ` Steven Rostedt
  2014-05-14  2:37   ` Li Zefan
  1 sibling, 1 reply; 79+ messages in thread
From: Christoph Lameter @ 2014-05-09 14:48 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Thu, 8 May 2014, James Bottomley wrote:

> > >   we all have tons of memory and storage?")
> >
> > Kernel size matters quite a bit for performance. Processor caches are key
> > to performance and therefore the cache footprint of a function determines
> > the possible performance. The smaller the functions and the less data
> > they access the faster they will run.
>
> This is about footprint, though, it's about optimizing a code path to
> run in the fewest instructions possible, right?

Code speed depends on where the instructions and data can be retrieved
from. The fewest instructions no longer cut it.

> > Therefore it needs to be possible to reduce the size of the kernel by
> > disabling unwanted functionality (f.e. cgroups). In order for that to
> > happen features need to be as independent as possible and also the user
> > space tools (like systemd) need to be able to handle a kernel with reduced
> > functionality.
>
> I don't believe that follows.  As long as the added code doesn't cause
> the cache footprint of the working set to expand, there's no performance
> reason to compile it out.   If you choose not to use syscalls, then the
> paths are inert from a performance point of view and it doesn't matter
> if they are config'd in or out.  Cgroups, on the other hand, impact
> performance because they add to the execution path of several syscalls.
> We were careful to use static branching to minimise this, but obviously
> it does expand the cache footprint.  Do you have any figures for the
> performance issues it's causing (being compiled in but unused)?  If it's
> significant, we could try static branching to out of line areas which
> shouldn't impact the cache footprint.

Static branching means that it is removed from the code path but the
overall code size is still increased because the function needs to be
somewhere. And usually the additional functions are mixed with other
functions that are essential. Which means increased need for TLB entries
to do the virtual mappings. Plus there are no-op holes here and there that
still increase the size of the function.

One improvement would be to sort the functions by functionality. All the
important functions in the first 2M of the code covered by one huge tlb
f.e.

Maybe we could reduce the number of cachelines used by critical functions
too? Aren't there some tools that can automate this in gcc?

Syscalls are often essential to performance in particular if one wants to
use the I/O services of the kernel instead of relying on something like
RDMA that bypasses the kernel.

In general the ability to reduce the size of the kernel to a minimum is a
desirable feature. I still see deployments of older kernels in the
financial industry because they have a higher performance and lower
latency. The only way to get those guys would be to keep the kernel size
and the size of the data touched the same.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
  2014-05-08 16:24 [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions Christoph Lameter
@ 2014-05-09  0:31 ` James Bottomley
  2014-05-09 14:48   ` Christoph Lameter
  2014-05-14  2:37   ` Li Zefan
  0 siblings, 2 replies; 79+ messages in thread
From: James Bottomley @ 2014-05-09  0:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Thu, 2014-05-08 at 11:24 -0500, Christoph Lameter wrote:
> On Fri, 2 May 2014, Josh Triplett wrote:
> 
> > - An overview of why the kernel's size still matters today ("but don't
> >   we all have tons of memory and storage?")
> 
> Kernel size matters quite a bit for performance. Processor caches are key
> to performance and therefore the cache footprint of a function determines
> the possible performance. The smaller the functions and the less data
> they access the faster they will run.

This is about footprint, though, it's about optimizing a code path to
run in the fewest instructions possible, right?

> Therefore it needs to be possible to reduce the size of the kernel by
> disabling unwanted functionality (f.e. cgroups). In order for that to
> happen features need to be as independent as possible and also the user
> space tools (like systemd) need to be able to handle a kernel with reduced
> functionality.

I don't believe that follows.  As long as the added code doesn't cause
the cache footprint of the working set to expand, there's no performance
reason to compile it out.   If you choose not to use syscalls, then the
paths are inert from a performance point of view and it doesn't matter
if they are config'd in or out.  Cgroups, on the other hand, impact
performance because they add to the execution path of several syscalls.
We were careful to use static branching to minimise this, but obviously
it does expand the cache footprint.  Do you have any figures for the
performance issues it's causing (being compiled in but unused)?  If it's
significant, we could try static branching to out of line areas which
shouldn't impact the cache footprint.
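
The idea of branching to an out-of-line area can be sketched in userspace. This is not the kernel's static-branch (jump label) mechanism, which patches the branch at runtime; it is a plain conditional using the GCC/Clang `cold` and `noinline` attributes and `__builtin_expect`, and the names accounting_enabled, account_slow and do_io are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static bool accounting_enabled;  /* hypothetical feature switch, off by default */
static unsigned long charged;    /* bytes accounted so far */

/*
 * Rarely used feature path. `cold` asks the compiler to optimize it for
 * size and place it away from hot text; `noinline` keeps its body out of
 * the caller.
 */
__attribute__((cold, noinline))
static void account_slow(unsigned long bytes)
{
    charged += bytes;
}

/*
 * Hot path. With the feature off, only a test that is predicted not taken
 * stays inline; the accounting code itself lives elsewhere.
 */
static unsigned long do_io(unsigned long bytes)
{
    assert(bytes > 0);
    if (__builtin_expect(accounting_enabled, 0))
        account_slow(bytes);
    return bytes;
}
```

Whether moving the body out of line actually shrinks the hot path's cache footprint is exactly the kind of thing that would need measuring.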

James

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions
@ 2014-05-08 16:24 Christoph Lameter
  2014-05-09  0:31 ` James Bottomley
  0 siblings, 1 reply; 79+ messages in thread
From: Christoph Lameter @ 2014-05-08 16:24 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Sarah Sharp, ksummit-discuss, Greg KH, Julia Lawall, Darren Hart,
	Dan Carpenter

On Fri, 2 May 2014, Josh Triplett wrote:

> - An overview of why the kernel's size still matters today ("but don't
>   we all have tons of memory and storage?")

Kernel size matters quite a bit for performance. Processor caches are key
to performance and therefore the cache footprint of a function determines
the possible performance. The smaller the functions and the less data
they access the faster they will run.

Therefore it needs to be possible to reduce the size of the kernel by
disabling unwanted functionality (f.e. cgroups). In order for that to
happen features need to be as independent as possible and also the user
space tools (like systemd) need to be able to handle a kernel with reduced
functionality.

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2014-08-17  9:45 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-02 16:44 [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions Josh Triplett
2014-05-02 17:11 ` Dave Jones
2014-05-02 17:20   ` James Bottomley
2014-05-02 17:33     ` Dave Jones
2014-05-02 17:46       ` Josh Boyer
2014-05-02 18:50         ` H. Peter Anvin
2014-05-02 19:02           ` Josh Boyer
2014-05-02 19:03           ` Michael Kerrisk (man-pages)
2014-05-02 19:33             ` Theodore Ts'o
2014-05-02 19:38               ` Jiri Kosina
2014-05-02 19:49               ` Dave Jones
2014-05-02 20:06                 ` Steven Rostedt
2014-05-02 20:41                 ` Theodore Ts'o
2014-05-02 21:01                   ` Dave Jones
2014-05-02 21:19                     ` Josh Boyer
2014-05-02 21:23                       ` Jiri Kosina
2014-05-02 21:36                         ` Josh Boyer
2014-05-02 21:27                       ` James Bottomley
2014-05-02 21:39                         ` Josh Boyer
2014-05-02 22:35                           ` Andy Lutomirski
2014-05-06 17:18                             ` josh
2014-05-06 17:31                               ` Andy Lutomirski
2014-05-09 18:22                                 ` H. Peter Anvin
2014-05-09 20:37                                   ` Andy Lutomirski
2014-05-09 22:50                                     ` Josh Triplett
2014-05-10  0:23                                     ` James Bottomley
2014-05-10  0:38                                       ` Andy Lutomirski
2014-05-10  3:44                                         ` Josh Triplett
2014-05-03 17:30                           ` James Bottomley
2014-05-02 21:56                     ` tytso
2014-05-02 20:45                 ` Ben Hutchings
2014-05-02 21:03                   ` Dave Jones
2014-05-03 13:37                     ` Michael Kerrisk (man-pages)
2014-05-03 13:35                   ` Michael Kerrisk (man-pages)
2014-05-03 13:32               ` Michael Kerrisk (man-pages)
2014-05-02 19:03       ` Mark Brown
2014-05-02 19:45         ` Luck, Tony
2014-05-02 21:03           ` Mark Brown
2014-05-02 21:08             ` Dave Jones
2014-05-02 21:14               ` Andy Lutomirski
2014-05-02 21:21               ` Luck, Tony
2014-05-02 21:38                 ` H. Peter Anvin
2014-05-03  1:21               ` Mark Brown
2014-05-07 12:35             ` David Woodhouse
2014-05-09 15:51               ` Mark Brown
2014-05-02 17:33     ` Guenter Roeck
2014-05-02 17:44     ` Steven Rostedt
2014-05-07 11:32     ` David Woodhouse
2014-05-07 16:38       ` James Bottomley
2014-05-02 22:04   ` Jan Kara
2014-05-05 23:45   ` Bird, Tim
2014-05-06  2:14     ` H. Peter Anvin
2014-05-09 16:22   ` Josh Triplett
2014-05-09 16:59     ` Bird, Tim
2014-05-09 17:23       ` josh
2014-05-08 15:52 ` Christoph Lameter
2014-05-12 17:35 ` Wolfram Sang
2014-05-13 16:36 ` Bird, Tim
2014-05-13 18:00   ` josh
2014-05-14  1:04   ` Julia Lawall
2014-08-17  9:45 ` [Ksummit-discuss] tiny.wiki.kernel.org Josh Triplett
2014-05-08 16:24 [Ksummit-discuss] [CORE TOPIC] Kernel tinification: shrinking the kernel and avoiding size regressions Christoph Lameter
2014-05-09  0:31 ` James Bottomley
2014-05-09 14:48   ` Christoph Lameter
2014-05-09 16:24     ` Steven Rostedt
2014-05-09 16:55       ` Christoph Lameter
2014-05-09 17:21         ` josh
2014-05-09 17:42         ` James Bottomley
2014-05-09 17:52           ` Christoph Lameter
2014-05-09 18:32             ` Steven Rostedt
2014-05-09 19:02               ` Julia Lawall
2014-05-09 20:31                 ` Steven Rostedt
2014-05-09 17:52           ` Matthew Wilcox
2014-05-12 18:06         ` Dave Hansen
2014-05-12 20:20           ` Roland Dreier
2014-05-14  2:37   ` Li Zefan
2014-05-15 19:41     ` H. Peter Anvin
2014-05-15 20:00       ` Greg KH
2014-05-15 20:29         ` Guenter Roeck
