linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: kdbus: to merge or not to merge?
       [not found]                   ` <20150809190027.GA24185@kroah.com>
@ 2015-08-09 22:11                     ` Daniel Mack
  2015-08-10  2:10                       ` Andy Lutomirski
  2015-08-10 17:04                       ` Linus Torvalds
  0 siblings, 2 replies; 3+ messages in thread
From: Daniel Mack @ 2015-08-09 22:11 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Linus Torvalds
  Cc: Tom Gundersen, Kalle A. Sandstrom, Borislav Petkov,
	One Thousand Gnomes, Havoc Pennington, Djalal Harouni,
	Andy Lutomirski, linux-kernel, Eric W. Biederman, cee1,
	David Herrmann, linux-mm

On 08/09/2015 09:00 PM, Greg Kroah-Hartman wrote:
> In chatting with Daniel on IRC, he is writing up a summary of how the
> kdbus memory pools work in more detail, and he said he would sent that
> out in a day or so, so that everyone can review.

Yes, let me quickly describe again how the kdbus pool logic works.

Every bus connection (peer) owns a buffer which is used in order to
receive payloads. Such payloads are either messages sent from other
connections, notifications or returned answer structures in return of
query commands (name lists, etc).

In order to avoid the kernel having to maintaining an internal buffer
the connections then read from with an extra command, we decided to let
the connections own their buffer directly, so they can mmap() the memory
into their task. Allocating a local buffer to collect asynchronous
messages is what they would need to do anyway, so we implemented a
short-cut that allows the kernel to directly access the memory and write
to it. The size of this buffer pool is configured by each connection
individually, during the HELLO call, so the kernel interface is as
flexible as any other memory allocation scheme the kernel provides and
is subject to the same limits.

Internally, the connection pool is simply a shmem backed file. From the
context of the HELLO ioctl, we are calling into shmem_file_setup(), so
the file is eventually owned by the task which created the bus task
connecting to the bus. One reason why we do the shmem file allocation in
the kernel and on behalf of a the userspace task is that we clear the
VM_MAYWRITE bit to prevent the task from writing to the pool through its
mapped buffer. We also do not set VM_NORESERVE, so the entire buffer is
pre-accounted for the task that created the connection.

The pool implementation uses an r/b tree to organize the buffer into
slices. Those slices can be kept by userspace as long as the parsing
implementation needs to have access to them. When finished, the slices
are freed. A simple ring buffer cannot cope with the gaps that emerge by
that.

When a connection buffer is written to, it is done from the context of
another task which calls into the kdbus code through one of the ioctls.
The memcg implementation should hence charge the task that acts as
writer, which is maybe not ideal but can be changed easily with some
addition to the internal APIs. We omitted it for the current version,
which is non-intrusive with regards to other kernel subsystems.

The kdbus implementation is actually comparable to two tasks X and Y
which both have their own buffer file open and mmap()ed, and they both
pass their FD to the other side. If X now writes to Y's file, and that
is causing a page fault, X is accounted for it, correct?

The kernel does *not* do any memory allocation to buffer payload, and
all other allocations (for instance, to keep around the internal state
of a connection, names etc) are subject to conservatively chosen
limitations. There is no unbounded memory allocation in kdbus that I am
aware of. If there was, it would clearly be a bug.

Addressing the point Andy made earlier: yes, due to memory
overcommitment, OOM situations may happen with certain patterns, but the
kernel should have the same measures to deal with them that it already
has with other types of shared userspace memory. Right?

Hope that all makes sense, we're open to discussions around the desired
accounting details. I've copied linux-mm to let more people have a look
into this again.


Thanks,
Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: kdbus: to merge or not to merge?
  2015-08-09 22:11                     ` kdbus: to merge or not to merge? Daniel Mack
@ 2015-08-10  2:10                       ` Andy Lutomirski
  2015-08-10 17:04                       ` Linus Torvalds
  1 sibling, 0 replies; 3+ messages in thread
From: Andy Lutomirski @ 2015-08-10  2:10 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Greg Kroah-Hartman, Linus Torvalds, Tom Gundersen,
	Kalle A. Sandstrom, Borislav Petkov, One Thousand Gnomes,
	Havoc Pennington, Djalal Harouni, linux-kernel,
	Eric W. Biederman, cee1, David Herrmann, linux-mm

On Sun, Aug 9, 2015 at 3:11 PM, Daniel Mack <daniel@zonque.org> wrote:
>
> Internally, the connection pool is simply a shmem backed file. From the
> context of the HELLO ioctl, we are calling into shmem_file_setup(), so
> the file is eventually owned by the task which created the bus task
> connecting to the bus. One reason why we do the shmem file allocation in
> the kernel and on behalf of a the userspace task is that we clear the
> VM_MAYWRITE bit to prevent the task from writing to the pool through its
> mapped buffer. We also do not set VM_NORESERVE, so the entire buffer is
> pre-accounted for the task that created the connection.

I don't have access to the system I've been using for testing right
now, but I wonder how the kdbus pool stack up against the entire rest
of memory allocations for the average desktop process.

>
> The pool implementation uses an r/b tree to organize the buffer into
> slices. Those slices can be kept by userspace as long as the parsing
> implementation needs to have access to them. When finished, the slices
> are freed. A simple ring buffer cannot cope with the gaps that emerge by
> that.
>
> When a connection buffer is written to, it is done from the context of
> another task which calls into the kdbus code through one of the ioctls.
> The memcg implementation should hence charge the task that acts as
> writer, which is maybe not ideal but can be changed easily with some
> addition to the internal APIs. We omitted it for the current version,
> which is non-intrusive with regards to other kernel subsystems.
>

This has at least the following weakness.  I can very easily get
systemd to write to my shmem-backed pool: simply subscribe to one of
its broadcasts.  If I cause such a write to be very slow
(intentionally or otherwise), then PID 1 blocks.

If you change the memcg code to charge me instead of PID 1 (as it
should IMO), then the problem gets worse.

> The kdbus implementation is actually comparable to two tasks X and Y
> which both have their own buffer file open and mmap()ed, and they both
> pass their FD to the other side. If X now writes to Y's file, and that
> is causing a page fault, X is accounted for it, correct?

If PID 1 accepted a memfd from me (even a properly sealed one) and
wrote to it, I would wonder whether it were actually a good idea.

Does this scheme have any actual measurable advantage over the
traditional model of a small non-paged buffer in the kernel (i.e. the
way sockets work) with explicit userspace memfd use as appropriate?

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: kdbus: to merge or not to merge?
  2015-08-09 22:11                     ` kdbus: to merge or not to merge? Daniel Mack
  2015-08-10  2:10                       ` Andy Lutomirski
@ 2015-08-10 17:04                       ` Linus Torvalds
  1 sibling, 0 replies; 3+ messages in thread
From: Linus Torvalds @ 2015-08-10 17:04 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Greg Kroah-Hartman, Tom Gundersen, Kalle A. Sandstrom,
	Borislav Petkov, One Thousand Gnomes, Havoc Pennington,
	Djalal Harouni, Andy Lutomirski, linux-kernel, Eric W. Biederman,
	cee1, David Herrmann, linux-mm

On Sun, Aug 9, 2015 at 3:11 PM, Daniel Mack <daniel@zonque.org> wrote:
>
> The kdbus implementation is actually comparable to two tasks X and Y
> which both have their own buffer file open and mmap()ed, and they both
> pass their FD to the other side. If X now writes to Y's file, and that
> is causing a page fault, X is accounted for it, correct?

No.

With shared memory, there's no particularly obvious accounting rules.
In particular, when somebody maps an already allocated page, it's
basically a no-op from a memory allocation standpoint.

The whole "this is equivalent to the user space deamon" argument is
bogus. Shared memory is very very different from just sending messages
(copying the buffers) and is generally much harder to get a handle on.
And thats' what you should be comparing to.

The old "communicate over a unix domain socket" had pretty clear
accounting rules, and while unix domain sockets have some horribly
nasty issues (most are about passing fd's around) that isn't one of
them.

Anyway, the real issue for me here is that Andy is reporting all these
actual real problems that happen in practice, and the developer
replies are dismissing them on totally irrelevant grounds ("this
should be equivalent to something entirely different that nobody ever
does" or "well, people could opt out, even if they didn't" yadda yadda
yadda).

For example, the whole "tasks X and Y communicate over shmem" is
irrelevant. Normally, when people write those kinds of applications,
they are just regular applications. If they have issues, nobody else
cares. Andy's concern is about one of X/Y being a system daemon and
tricking it into doing bad things ends up effectively killing the
system - whether the *kernel* is alive or not and did the right thing
is almost entirely immaterial.

So please. When Andy sends a bug report with a exploit that kills his
system, just stop responding with irrelevant theoretical arguments. It
is not appropriate.  Instead, acknowledge the problem and work on
fixing it, none of this "but but but it's all the same" crap.

                     Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-08-10 17:04 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CANq1E4SnYq_pZMWYcafB9GmB_O77tbVLPT0=0d6LGQVpvThTrw@mail.gmail.com>
     [not found] ` <CALCETrWE-oYRq+AzRxxcz03AK0pAzgKJtmxAuNwQu+p5S0msBw@mail.gmail.com>
     [not found]   ` <CANq1E4Rek3HXCDU_13OGfRShS7Z0g+fxcTp5C1V3oKC4HgkD_A@mail.gmail.com>
     [not found]     ` <CALCETrUaSgdaq4_mr3GG-ekLwGXkQR5MoRLSj9Wu2dTXDYUp1g@mail.gmail.com>
     [not found]       ` <CANq1E4SkUWWXuksJnWzXd5KStZx-T6q6+WWTHdrQz_WiMry4Cw@mail.gmail.com>
     [not found]         ` <CALCETrXcqOFedk8r-jHK-deRwfum29JHspALE6JUi2gzbo-dhg@mail.gmail.com>
     [not found]           ` <55C3A403.8020202@zonque.org>
     [not found]             ` <CALCETrVr04ZdXHLZXLp_Y+m68Db5Mmh_Wnu6prNCfCqgWm0QzA@mail.gmail.com>
     [not found]               ` <55C4C35A.4070306@zonque.org>
     [not found]                 ` <CA+55aFxDLt-5+=xXeYG4nJKMb8L_iD9FmwTZ2VuughBku-mW3g@mail.gmail.com>
     [not found]                   ` <20150809190027.GA24185@kroah.com>
2015-08-09 22:11                     ` kdbus: to merge or not to merge? Daniel Mack
2015-08-10  2:10                       ` Andy Lutomirski
2015-08-10 17:04                       ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox