* [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-11 19:03 UTC
To: ksummit-discuss
Well, this is likely to be a bit of a hot subject, but I have been thinking
about it for a couple of years now. This is just a loose collection of
concerns that I see mostly at the high end, but many of them are also
valid for embedded systems (Android?), which have performance issues of
their own because the devices are low powered.
There are numerous issues in memory management that create a level of
complexity that suggests a rewrite would at some point be beneficial:
1. The need to use larger-order pages, and the resulting problems with
fragmentation. Memory sizes grow, and with them the number of page structs
in which state has to be maintained. Maybe there is something different? If
we use 2MB hugepages then we have 511 useless page structs per hugepage.
Some apps need linear memory, where we have trouble and keep creating new
memory allocators (recently the new bootmem allocator and CMA, plus lots
of specialized allocators in various subsystems).
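
To put rough numbers on the bookkeeping cost, here is a back-of-the-envelope
sketch; the 64-byte struct page size is an assumption (roughly right on
64-bit kernels), not a fixed constant:

#include <stdio.h>

/* Back-of-the-envelope cost of struct page bookkeeping.  The 64-byte
 * size is an assumption (roughly sizeof(struct page) on 64-bit). */
int main(void)
{
        unsigned long ram = 1UL << 40;          /* 1TB of memory */
        unsigned long page_structs = ram / 4096;

        printf("%lu page structs, %lu MB of metadata\n",
               page_structs, page_structs * 64 >> 20);

        /* A 2MB hugepage covers 512 base pages, so mapping it as one
         * unit leaves 511 page structs with nothing useful to track. */
        printf("%lu useless tail structs per 2MB hugepage\n",
               (2UL << 20) / 4096 - 1);
        return 0;
}
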
2. Support for machines with massive numbers of cpus. I got a POWER8 system
for testing and it has 160 "cpus" on two sockets and four NUMA
nodes. The new processors from Intel may have up to 18 cores per socket, which
only yields 72 "cpus" for a two-socket system, but there are systems with
more sockets available, and the outlook at that level is scary.
Per-cpu state and per-node state is replicated, and it becomes problematic
to aggregate the state for the whole machine, since looping over the per-cpu
areas becomes expensive.
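
The cost comes from loops of roughly this shape; a minimal kernel-style
sketch, with a made-up counter name:

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* A hypothetical event counter, bumped locklessly on the local cpu
 * in some hot path. */
static DEFINE_PER_CPU(unsigned long, subsys_events);

/* Reading the machine-wide total touches one cache line per possible
 * cpu: cheap with 8 cpus, painful with hundreds. */
static unsigned long subsys_events_total(void)
{
        unsigned long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                sum += per_cpu(subsys_events, cpu);
        return sum;
}
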
Can we develop the notion that subsystems own certain cores, so that their
execution is restricted to a subset of the system, avoiding data
replication and keeping subsystem data hot? I.e. have a device driver
and the subsystems driving those devices run only on the NUMA node to which
the PCI-E root complex is attached. Restricting execution to one NUMA node
reduces data-locality complexity and increases performance because data
stays cache hot.
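
A driver can approximate this today with existing primitives; a rough
sketch, where my_worker_fn and the thread name are placeholders, and this
fences scheduling only, not memory placement:

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/pci.h>
#include <linux/topology.h>

/* Sketch: create a service thread for a device and fence it onto the
 * NUMA node the device hangs off. */
static struct task_struct *start_node_local_worker(struct pci_dev *pdev,
                                                   int (*my_worker_fn)(void *))
{
        int node = dev_to_node(&pdev->dev);
        struct task_struct *t;

        /* Allocate the thread's stack and task struct on that node. */
        t = kthread_create_on_node(my_worker_fn, pdev, node, "mydev-worker");
        if (IS_ERR(t))
                return t;

        /* Keep the scheduler from migrating it to another node. */
        set_cpus_allowed_ptr(t, cpumask_of_node(node));
        wake_up_process(t);
        return t;
}
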
3. Allocation "zones". These are problematic because the zones often do
not reflect the ranges that devices can actually address.
They are used for other purposes, like MOVABLE pages, but then the pages are
not really movable because they are pinned for other reasons. Argh.
4. Performance characteristics often cannot be mapped onto kernel
mechanisms. We have NUMA, where we can do things, but CPU caching
effects, TLB sharing, and the caching that the DIMMs do in their page
buffers are not really well exploited.
5. Swap: No one really wants to swap today. This needs to be replaced with
something else. Going heavily into swap is akin to locking up the system.
There are numerous band-aid solutions but nothing appealing. Maybe the
best idea is the Android one of saving app state and removing the app from
memory.
6. Page faults:
We do not really use page faults the way they are intended to be used. A
file fault causes numerous readahead requests, and then only minor faults
are generated. There is the frequent desire not to have these long
interruptions occur while code is running. mlock[all] is there, but isn't
there a better, cleaner solution? Maybe we do not want to page a process at
all. Virtualization-like approaches that only support a single process
(like OSv) may be of interest.
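
For comparison, the blunt instrument we have today is just this; a minimal
userspace sketch, which needs CAP_IPC_LOCK or a suitable RLIMIT_MEMLOCK:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* Lock everything mapped now and everything mapped later, so
         * this process never takes a major fault (or the readahead
         * stall behind one) again. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");
                return 1;
        }

        /* ... latency-critical work ... */
        return 0;
}
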
Sometimes I think that something like MS-DOS (a "monitor") which provides
services but then gets out of the way may be better, because it does not
create the problems of an OS that then require workarounds. Maybe the
full-featured "OS" can run on some cores whereas others have only
monitor-like services (we are on the way there with the dynticks approaches
by Frederic Weisbecker).
7. Direct hardware access
Often the kernel subsystems are impeding performance. In high-speed
computing we regularly bypass the kernel network subsystems, block I/O,
etc. Direct hardware access, though, means that one is exposed to the ugly
particularities of how a certain device has to be handled. Can we have our
cake and eat it too by defining APIs that allow low-level hardware access
but also provide hardware abstraction (maybe limited to certain types of
devices)?

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Daniel Phillips @ 2014-06-11 19:26 UTC
To: ksummit-discuss

On 06/11/2014 12:03 PM, Christoph Lameter wrote:
> 1. The need to use larger-order pages, and the resulting problems with
> fragmentation. Memory sizes grow, and with them the number of page structs
> in which state has to be maintained. Maybe there is something different? If
> we use 2MB hugepages then we have 511 useless page structs per hugepage.
> Some apps need linear memory, where we have trouble and keep creating new
> memory allocators (recently the new bootmem allocator and CMA, plus lots
> of specialized allocators in various subsystems).

mem_map should be a radix tree?

Regards,

Daniel

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Greg KH @ 2014-06-11 19:45 UTC
To: Christoph Lameter; +Cc: ksummit-discuss

On Wed, Jun 11, 2014 at 02:03:05PM -0500, Christoph Lameter wrote:
> 7. Direct hardware access
>
> Often the kernel subsystems are impeding performance. In high-speed
> computing we regularly bypass the kernel network subsystems, block I/O,
> etc. Direct hardware access, though, means that one is exposed to the ugly
> particularities of how a certain device has to be handled. Can we have our
> cake and eat it too by defining APIs that allow low-level hardware access
> but also provide hardware abstraction (maybe limited to certain types of
> devices)?

What type of devices are you wanting here, block and networking, or
something else? We have the uio interface if you want to (and know how
to) talk to your hardware directly from userspace. What else do you want
to do here that this doesn't provide?

thanks,

greg k-h
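
The uio model Greg refers to is small enough to sketch. A minimal user
follows, with the device name and map size assumed rather than read from
/sys/class/uio/uio0/maps/map0/size as real code would:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        volatile uint32_t *regs;
        uint32_t irq_count;
        int fd = open("/dev/uio0", O_RDWR);

        if (fd < 0)
                return 1;

        /* Offset N * page size selects the device's memory region N. */
        regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                    fd, 0 * getpagesize());
        if (regs == MAP_FAILED)
                return 1;

        /* A blocking read returns the interrupt event count, so the
         * process can sleep until the device signals. */
        read(fd, &irq_count, sizeof(irq_count));
        return 0;
}
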

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: John W. Linville @ 2014-06-12 13:35 UTC
To: Greg KH; +Cc: ksummit-discuss

On Wed, Jun 11, 2014 at 12:45:04PM -0700, Greg KH wrote:
> What type of devices are you wanting here, block and networking, or
> something else? We have the uio interface if you want to (and know how
> to) talk to your hardware directly from userspace. What else do you want
> to do here that this doesn't provide?

AF_PACKET provides some level of hardware abstraction without a lot of
overhead for networking apps that are prepared to deal with raw frames.
Is this the kind of networking API you would propose?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com		might be all we have.  Be ready.
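
A minimal sketch of the raw-frame interface John describes (it needs
CAP_NET_RAW; a production version would add a PACKET_RX_RING mmap to cut
down per-packet system calls):

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        unsigned char frame[2048];
        ssize_t len;
        /* One AF_PACKET socket, every frame, payload delivered raw. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        if (fd < 0)
                return 1;

        /* Every frame still costs a system call here -- the crux of
         * the latency objection raised in the next message. */
        len = recv(fd, frame, sizeof(frame), 0);
        if (len > 0)
                printf("got a %zd byte frame\n", len);

        close(fd);
        return 0;
}
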

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-13 16:57 UTC
To: John W. Linville; +Cc: ksummit-discuss

On Thu, 12 Jun 2014, John W. Linville wrote:
> AF_PACKET provides some level of hardware abstraction without a lot of
> overhead for networking apps that are prepared to deal with raw frames.
> Is this the kind of networking API you would propose?

The kernel is still in the data path and will cause limitations in terms
of bandwidth and latency.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Greg KH @ 2014-06-13 17:31 UTC
To: Christoph Lameter; +Cc: ksummit-discuss

On Fri, Jun 13, 2014 at 11:57:04AM -0500, Christoph Lameter wrote:
> The kernel is still in the data path and will cause limitations in terms
> of bandwidth and latency.

Of course it will, nothing is "free". If this is a problem, then run
one of the many different networking stacks in userspace that are
tailored to a specific use case. The kernel has to provide a "general"
use-case stack; that is its job.

greg k-h

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-13 17:59 UTC
To: Greg KH; +Cc: ksummit-discuss

On Fri, 13 Jun 2014, Greg KH wrote:
> Of course it will, nothing is "free". If this is a problem, then run
> one of the many different networking stacks in userspace that are
> tailored to a specific use case. The kernel has to provide a "general"
> use-case stack; that is its job.

But again, I want both: a general stack that allows at least the data path
to go directly to the device. The metadata, connection management, etc.
should stay firmly in the hands of the kernel.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Stephen Hemminger @ 2014-06-13 19:18 UTC
To: Christoph Lameter; +Cc: ksummit-discuss

On Fri, 13 Jun 2014 12:59:32 -0500 (CDT), Christoph Lameter <cl@gentwo.org> wrote:
> But again, I want both: a general stack that allows at least the data path
> to go directly to the device. The metadata, connection management, etc.
> should stay firmly in the hands of the kernel.

There are several dataplane user-mode networking implementations that
do this. The problem is you either have to overlap with every networking
driver (netmap) or do the driver in userspace (DPDK).

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-13 22:30 UTC
To: Stephen Hemminger; +Cc: ksummit-discuss

On Fri, 13 Jun 2014, Stephen Hemminger wrote:
> There are several dataplane user-mode networking implementations that
> do this. The problem is you either have to overlap with every networking
> driver (netmap) or do the driver in userspace (DPDK).

The netmap approach still requires a system call for any sending and
receiving, so that does not work right. A driver in userspace does the
device control etc. in user space as well, which means the kernel does
not police the device.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-13 16:56 UTC
To: Greg KH; +Cc: ksummit-discuss

On Wed, 11 Jun 2014, Greg KH wrote:
> What type of devices are you wanting here, block and networking, or
> something else? We have the uio interface if you want to (and know how
> to) talk to your hardware directly from userspace. What else do you want
> to do here that this doesn't provide?

Block and networking mainly. The userspace VFIO API exposes
device-specific registers. We need something that is a decent
abstraction. IBverbs is something like that, but it could be done much
better.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Greg KH @ 2014-06-13 17:30 UTC
To: Christoph Lameter; +Cc: ksummit-discuss

On Fri, Jun 13, 2014 at 11:56:08AM -0500, Christoph Lameter wrote:
> Block and networking mainly. The userspace VFIO API exposes
> device-specific registers. We need something that is a decent
> abstraction. IBverbs is something like that, but it could be done much
> better.

Heh, we've been down this road before :)

In the end, userspace wants a socket-like interface to the networking
"stack", right? So either you provide that with a custom networking
library that talks directly to a specific hardware card (like 3
different companies provide), or you just deal with the in-kernel
network stack. What else is there that we can do here?

And as for block devices, "raw access", really? What is lacking in what
we already provide in "raw mode", with a no-op block scheduler? How
much more "lean" can we possibly go without you having to write a custom
userspace uio driver for every block controller out there?

thanks,

greg k-h

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: James Bottomley @ 2014-06-13 17:55 UTC
To: Greg KH; +Cc: ksummit-discuss

On Fri, 2014-06-13 at 10:30 -0700, Greg KH wrote:
> And as for block devices, "raw access", really? What is lacking in what
> we already provide in "raw mode", with a no-op block scheduler?

Just remember there are lessons from raw devices too. Oracle originally
forced the raw mode on our block devices for this reason ... just get
your block layer and filesystems mostly out of our way was their cry.
Then they discovered that not having a FS wrapper led to the system not
being able to recognise the raw devices as being raw, which led to an
awful lot of really expensive data-loss cockups. The compromise today is
using filesystems with O_DIRECT to the file data containers.

The point here is that of the lots of people who say "just get your
operating system out of my way", most realise they actually didn't mean
it when presented with the reality. The abstractions most people who say
this want are a zero-delay data path, with someone else taking care of
all of the metadata and setup problems ... effectively an MPI-type
interface. Is that what you're looking for, Christoph?

James
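
The O_DIRECT compromise James describes looks roughly like this in
practice; a minimal sketch, where the file name and the 4096-byte
alignment are assumptions (the required alignment is device-dependent):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd = open("datafile", O_RDWR | O_DIRECT);

        if (fd < 0)
                return 1;
        /* O_DIRECT requires aligned buffer, offset and length. */
        if (posix_memalign(&buf, 4096, 4096) != 0)
                return 1;

        /* Data DMAs straight between the device and buf; the
         * filesystem still owns and recovers the metadata. */
        if (pread(fd, buf, 4096, 0) < 0)
                return 1;

        free(buf);
        close(fd);
        return 0;
}
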

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-13 18:41 UTC
To: James Bottomley; +Cc: ksummit-discuss

On Fri, 13 Jun 2014, James Bottomley wrote:
> The point here is that of the lots of people who say "just get your
> operating system out of my way", most realise they actually didn't mean
> it when presented with the reality.

Right. Exactly. What I would like to see is the OS doing its part to make
things nice and provide a convenient abstraction of the ugly details.

> The abstractions most people who say this want are a zero-delay data
> path, with someone else taking care of all of the metadata and setup
> problems ... effectively an MPI-type interface. Is that what you're
> looking for, Christoph?

Ideally the setup/metadata should be handled by the OS while the data
path goes direct. The get-out-of-the-way piece is restricted to the
performance-critical portion, which is the actual data transfer.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Thomas Petazzoni @ 2014-06-16 11:39 UTC
To: Christoph Lameter; +Cc: James Bottomley, ksummit-discuss

Dear Christoph Lameter,

On Fri, 13 Jun 2014 13:41:12 -0500 (CDT), Christoph Lameter wrote:
> Ideally the setup/metadata should be handled by the OS while the data
> path goes direct. The get-out-of-the-way piece is restricted to the
> performance-critical portion, which is the actual data transfer.

I might be completely off topic here, but this very much sounds like
what is happening for graphics. There is a DRM/KMS kernel side, which
does all the mode setting, context allocation and things like that, and
then all the rest takes place in userspace, using hardware-specific
pieces of code in libdrm and other components of the graphics stack.

If we translate that to networking, there would be a need to have all
of the setup/initialization done in the kernel, and then some
hardware-specific userspace libraries to use for the data path.

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-16 14:05 UTC
To: Thomas Petazzoni; +Cc: James Bottomley, ksummit-discuss

On Mon, 16 Jun 2014, Thomas Petazzoni wrote:
> I might be completely off topic here, but this very much sounds like
> what is happening for graphics. There is a DRM/KMS kernel side, which
> does all the mode setting, context allocation and things like that, and
> then all the rest takes place in userspace, using hardware-specific
> pieces of code in libdrm and other components of the graphics stack.

I thought about that too.

> If we translate that to networking, there would be a need to have all
> of the setup/initialization done in the kernel, and then some
> hardware-specific userspace libraries to use for the data path.

Well, ideally these libraries would just be API-specific rather than
hardware-specific, in order to support multiple devices.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Thomas Petazzoni @ 2014-06-16 14:09 UTC
To: Christoph Lameter; +Cc: James Bottomley, ksummit-discuss

Dear Christoph Lameter,

On Mon, 16 Jun 2014 09:05:31 -0500 (CDT), Christoph Lameter wrote:
> Well, ideally these libraries would just be API-specific rather than
> hardware-specific, in order to support multiple devices.

Well, my understanding is that libdrm exposes one API, but internally
has support for various graphics hardware. Same for OpenGL: a unified,
normalized API that applications can rely on, and pure user-space
implementations that know about the hardware details.

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
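
A minimal sketch of the split Thomas describes, using libdrm's public
API: the application sees one interface, and the hardware-specific logic
stays behind the library and the driver ioctls. The "i915" driver name is
just an example; link with -ldrm:

#include <stdio.h>
#include <xf86drm.h>

int main(void)
{
        drmVersionPtr v;
        int fd = drmOpen("i915", NULL);   /* driver name: an example */

        if (fd < 0)
                return 1;

        v = drmGetVersion(fd);
        if (v) {
                /* The app never touches hardware registers; the
                 * specifics live behind the library and the ioctls. */
                printf("driver %s %d.%d.%d\n", v->name, v->version_major,
                       v->version_minor, v->version_patchlevel);
                drmFreeVersion(v);
        }
        drmClose(fd);
        return 0;
}
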

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-16 14:28 UTC
To: Thomas Petazzoni; +Cc: James Bottomley, ksummit-discuss

On Mon, 16 Jun 2014, Thomas Petazzoni wrote:
> Well, my understanding is that libdrm exposes one API, but internally
> has support for various graphics hardware. Same for OpenGL: a unified,
> normalized API that applications can rely on, and pure user-space
> implementations that know about the hardware details.

OK, then we would need to come up with an API for NICs and storage that
allows user space to determine the hardware and use the correct logic.
The same approach is used in the Infiniband subsystem. However, this
means that device-driver-like code is distributed separately from the
kernel. There are separate ibverbs, ibrdma, etc. trees, and it's an issue
to keep the in-kernel portions in sync with the userspace code. Ideally
these would go together and be modified by patches that change both the
kernel portion and the userspace portion. So maybe add a directory for
userspace driver code to the kernel?

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-13 18:01 UTC
To: Greg KH; +Cc: ksummit-discuss

On Fri, 13 Jun 2014, Greg KH wrote:
> In the end, userspace wants a socket-like interface to the networking
> "stack", right? So either you provide that with a custom networking
> library that talks directly to a specific hardware card (like 3
> different companies provide), or you just deal with the in-kernel
> network stack. What else is there that we can do here?

Standardize the kernel APIs for this use case, as well as the user-space
APIs, so that software runs on any of the 3 companies' stacks.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Greg KH @ 2014-06-13 18:25 UTC
To: Christoph Lameter; +Cc: ksummit-discuss

On Fri, Jun 13, 2014 at 01:01:02PM -0500, Christoph Lameter wrote:
> Standardize the kernel APIs for this use case

The UIO interface is being used for this, so all should be good on the
kernel side, right?

> as well as the user-space APIs, so that software runs on any of the 3
> companies' stacks.

As these libraries are outside of the kernel tree, there's not much we
kernel developers can do about this. Work with those companies to do
this...

greg k-h

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-13 18:54 UTC
To: Greg KH; +Cc: ksummit-discuss

On Fri, 13 Jun 2014, Greg KH wrote:
> The UIO interface is being used for this, so all should be good on the
> kernel side, right?

OK, I have not seen any vendor use that interface and thus I am not
familiar with it.

> As these libraries are outside of the kernel tree, there's not much we
> kernel developers can do about this. Work with those companies to do
> this...

It's not that easy a separation. The management functions are better left
with the kernel, so that security and permission management work, and so
that the device stays in a well-known state even if accessed from
multiple applications.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: josh @ 2014-06-11 20:08 UTC
To: Christoph Lameter; +Cc: ksummit-discuss

On Wed, Jun 11, 2014 at 02:03:05PM -0500, Christoph Lameter wrote:
> Well, this is likely to be a bit of a hot subject, but I have been thinking
> about it for a couple of years now. This is just a loose collection of
> concerns that I see mostly at the high end, but many of them are also
> valid for embedded systems (Android?), which have performance issues of
> their own because the devices are low powered.

On the low end, we could also reasonably ask how much overhead Linux
memory management adds. Does it make sense to run the standard Linux mm
subsystem on a system with, say, 1MB of RAM?

- Josh Triplett

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Andy Lutomirski @ 2014-06-11 20:15 UTC
To: Christoph Lameter; +Cc: ksummit-discuss

On Wed, Jun 11, 2014 at 12:03 PM, Christoph Lameter <cl@gentwo.org> wrote:
> 3. Allocation "zones". These are problematic because the zones often do
> not reflect the ranges that devices can actually address.
> They are used for other purposes, like MOVABLE pages, but then the pages are
> not really movable because they are pinned for other reasons. Argh.

What if you just couldn't sleep while you have a MOVABLE page pinned? Or
what if you had to pin it and provide a callback to forcibly unpin it?
This would complicate direct IO and such, but it would make movable pages
really movable.

It would also solve an annoyance with the sealing thing: the sealing code
wants to take writable pages and make them really read-only. This
interacts very badly with existing pins.

We have IOMMUs in many cases. Would it be so bad to say that direct IO is
only really direct if there's an IOMMU?

--Andy
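
What such a revocable pin could look like; purely a hypothetical sketch,
none of these names exist in the kernel:

struct page;

/* All names here are invented, for illustration only. */
struct revocable_pin {
        struct page *page;
        void *private;
        /* Called by the MM when it wants the page back (compaction,
         * hotplug).  The pinner must quiesce its DMA and drop the
         * pin; afterwards the page may move. */
        void (*revoke)(struct revocable_pin *pin);
};

/* Pin a movable page; the caller promises revoke() works at any time. */
int pin_page_revocable(struct page *page, struct revocable_pin *pin);

/* Voluntary release on the normal I/O completion path. */
void unpin_page_revocable(struct revocable_pin *pin);
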

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Dave Hansen @ 2014-06-11 20:52 UTC
To: Christoph Lameter, ksummit-discuss

On 06/11/2014 12:03 PM, Christoph Lameter wrote:
> 5. Swap: No one really wants to swap today. This needs to be replaced with
> something else. Going heavily into swap is akin to locking up the system.
> There are numerous band-aid solutions but nothing appealing. Maybe the
> best idea is the Android one of saving app state and removing the app from
> memory.

Yeah, our entire approach is getting a bit dated, and I think it's really
designed around the fact that our swap devices have historically been
(relatively) painfully slow to access. There are some patches in mm to
_help_, but currently if you throw a really fast swap device in a system
and try to swap heavily to it, you don't get anywhere near saturating the
device. We'd probably be better off if we just blasted data at the device
and ignored the LRU.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Phillip Lougher @ 2014-06-12 6:59 UTC
To: Christoph Lameter, ksummit-discuss

On 11/06/14 20:03, Christoph Lameter wrote:
> There are numerous issues in memory management that create a level of
> complexity that suggests a rewrite would at some point be beneficial:

Slow incremental improvements, which are already happening, yes. "Grand
plans" to rewrite everything from scratch, please no.

Academic computing research is littered with grand plans that never went
anywhere. Not least your list, which sounds like the objectives of the
late-80s/mid-90s research into "multi-service operating systems" (or the
wider distributed-operating-systems research of the time). There too we
(I was doing research into this at the time) were envisaging hundreds of
heterogeneous CPUs with diverse memory hierarchies, interconnects, I/O
configurations, instruction sets, etc., and imagining a grand unifying
system that would tie these together.

In addition, this was the time that audio and video became a serious
proposition, and so ideas to incorporate these new concepts into the
operating system as "first-class" objects became all the rage: knowledge
of the special characteristics of audio/video was to be built into the
memory management, the schedulers, the filesystems. Old-style operating
systems like Unix were out, and everything was to be redesigned from
scratch.

There were some good ideas proposed, some of which in various forms have
made their way incrementally into Linux (your list of zones, NUMA, page
fault minimisation, direct hardware access). But in general it failed; it
made no discernible impact on the state of the art in operating system
implementation. Because it was too much, too grand: no research group had
the wherewithal to design this from scratch, and by and large the
operating-system companies were happy with what they had. Some
universities (like Lancaster, and Cambridge where I worked) had
prototypes, but these were exemplars of how little rather than how much.

Only one company to my knowledge had the hubris to design a new operating
system along these lines from scratch: Acorn Computers of Cambridge UK
(the originators of the ARM CPU, BTW), where I left Cambridge University
to help design the operating system. Again, nice ideas, but it proved too
much and Acorn went bankrupt in 1998. The new operating system was called
Galileo, and there are a few links still around, e.g.

http://www.poppyfields.net/acorn/news/acopress/97-02-10b.shtml

In contrast Linux, which I'd installed in 1994 when I was busily doing
"real operating systems work" and had dismissed as a toy, took the
"modest" approach of reimplementing Unix. After 4 years, in 1998, Linux
was becoming something to be reckoned with, whilst grand plans just led
to failure. In fact within a few years Linux, with its "old school"
design, on a single core, was doing things that had taken us specialised
operating-system techniques to do, simply because hardware had become so
much better that it turned out they were no longer needed.

Yeah, this is probably highly off topic, but I had deja vu when reading
this "let's redesign everything from scratch, what could possibly go
wrong" list.

BTW, I looked up some of my old colleagues, and it turns out they were
still writing papers on this as late as 2009 (only 13 years after I left
for Acorn and industry):

"The multikernel: a new OS architecture for scalable multicore systems"
http://dl.acm.org/citation.cfm?doid=1629575.1629579

It's paywalled, but the abstract has the following, which may be of
interest to you: "We have implemented a multikernel OS to show that the
approach is promising, and we describe how traditional scalability
problems for operating systems (such as memory management) can be
effectively recast using messages and can exploit insights from
distributed systems and networking." lol

> 1. The need to use larger-order pages, and the resulting problems with
> fragmentation. Memory sizes grow, and with them the number of page structs
> in which state has to be maintained. Maybe there is something different? If
> we use 2MB hugepages then we have 511 useless page structs per hugepage.
> Some apps need linear memory, where we have trouble and keep creating new
> memory allocators (recently the new bootmem allocator and CMA, plus lots
> of specialized allocators in various subsystems).

This was never solved to my knowledge; there is no panacea here. Even in
the 90s we had video subsystems wanting to allocate in units of 1MB, and
others in units of 4K. The "solution" was so-called split-level
allocators, each specialised to deal with a particular "first-class
media", giving memory back to the underlying allocator when memory got
tight in another specialised allocator. Not much different from the
ad-hoc solutions being adopted in Linux, except that the general idea was
that each specialised allocator had the same API.

> Can we develop the notion that subsystems own certain cores, so that their
> execution is restricted to a subset of the system, avoiding data
> replication and keeping subsystem data hot?

Lots of academic hot air was expended here when designing distributed
systems which could scale seamlessly across heterogeneous CPUs connected
via different levels of interconnect (bus, ATM, ethernet, etc.): zoning,
migration, replication and so on. The "solution" is probably out there
somewhere, forgotten about.

> 5. Swap: No one really wants to swap today. This needs to be replaced with
> something else. Going heavily into swap is akin to locking up the system.

Embedded operating systems by and large never had swap. Embedded systems
which today use Linux see swap as a null op. It isn't used; it is madness
to swap to a NAND device. But I actually think Linux is ahead of the
curve here, with things like zcache, zswap and compressed filesystems
which can be used as an intermediate stage, storing data compressed in
memory which is only expanded when necessary. All of these minimise
memory footprint without having to resort to a swap device.

> 6. Page faults:
>
> We do not really use page faults the way they are intended to be used. A
> file fault causes numerous readahead requests, and then only minor faults
> are generated. There is the frequent desire not to have these long
> interruptions occur while code is running.

You concentrate only on page faults swapping file data into memory. By
and large embedded systems aim to run with their working set in memory
(i.e. demand paged at start-up but then in cache); trying to preserve any
kind of real-time guarantee when you discover half your working set has
been flushed and suddenly needs to be paged back in from slow NAND is a
null op.

Page faults between processes with shared mmap segments, or more often
context switches and repeated memcopying to do I/O between processes, are
what concern embedded systems. Context switching and memcopying just
throw away limited bandwidth on an embedded system.

Case in point: many years ago I was the lead Linux guy for a company
designing a SOC for digital TV. Just before I left I had an interesting
"conversation" with the chief hardware guy of the team who designed the
SOC. It turned out they'd budgeted for the RAM bandwidth needed to decode
a typical MPEG stream, but they'd not reckoned on all the memcopies Linux
needs to do between its "separate address space" processes. He'd been
used to embedded OSes which run in a single address space.

The fact is that security is ever more important, even in embedded
systems, and a multi-address-space operating system gives security that
is impossible in single-address-space operating systems which do away
with paging for efficiency. This security comes at a price.

Back when I was designing Galileo for Acorn in the 90s, we knew all about
the tradeoffs between single-address and multi-address operating systems.
I introduced the concept of containers (not the same as modern Linux
containers): separate units of I/O which could be transferred efficiently
between processes. We had the concept that trusted processes could be in
the same address space, and untrusted processes would be in separate
address spaces. Containers transferred between separate address spaces
were moved via page flipping (unmapping from the source, remapping to the
destination), while containers passed between processes in the same
address space were passed by handle. But the same API was used for both:
processes could be moved between address spaces and the API stayed the
same, thus trading off security and efficiency invisibly to the
application.

> Sometimes I think that something like MS-DOS (a "monitor") which provides
> services but then gets out of the way may be better, because it does not
> create the problems of an OS that then require workarounds.

> 7. Direct hardware access
>
> Often the kernel subsystems are impeding performance. In high-speed
> computing we regularly bypass the kernel network subsystems, block I/O,
> etc.

Been there, done that. One of the ideas at the time was to reduce the
"operating system" to a micro-microkernel dealing with the lowest
possible abstraction only. The relevant operating-system "stack" would be
directly mapped into each process (e.g. the networking stack), avoiding
the costly context switch into kernel mode. But unless you were to
produce a "stack" for each and every possible hardware device, it meant
you had to produce a stack dealing with hardware at the lowest level, but
in a generic-API way, the actual mapping of that generic hardware API in
theory being a wafer-thin "shim".

Real hardware doesn't work like that. One example: I tried to do that for
DMA controllers, but it turns out DMA controllers are widely different,
and the best performance is obtained via direct knowledge of their
quirks. By the time I had worked out a generic API that would work as a
shim across all controllers, none of the elegance or performance of
anything was retained.

Phillip

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Christoph Lameter @ 2014-06-13 17:02 UTC
To: Phillip Lougher; +Cc: ksummit-discuss

On Thu, 12 Jun 2014, Phillip Lougher wrote:
> This was never solved to my knowledge; there is no panacea here. Even in
> the 90s we had video subsystems wanting to allocate in units of 1MB, and
> others in units of 4K. The "solution" was so-called split-level
> allocators, each specialised to deal with a particular "first-class
> media", giving memory back to the underlying allocator when memory got
> tight in another specialised allocator.

It is solvable if the objects are inherently movable. If every object
allocated provides a function that makes the object movable, then
defragmentation is possible, and therefore large contiguous areas of
memory can be created at any time.

> Lots of academic hot air was expended here when designing distributed
> systems which could scale seamlessly across heterogeneous CPUs connected
> via different levels of interconnect (bus, ATM, ethernet, etc.): zoning,
> migration, replication and so on. The "solution" is probably out there
> somewhere, forgotten about.

We have the issue even with homogeneous cpus, due to the proliferation of
cores on processors now. Maybe that is solvable?

> Case in point: many years ago I was the lead Linux guy for a company
> designing a SOC for digital TV. It turned out they'd budgeted for the
> RAM bandwidth needed to decode a typical MPEG stream, but they'd not
> reckoned on all the memcopies Linux needs to do between its "separate
> address space" processes. He'd been used to embedded OSes which run in
> a single address space.

Well, maybe that is appropriate for some processes? And we could carve
out subsections of the hardware where single-address-space operation is
possible?
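
Christoph's movable-object idea implies a per-object contract along these
lines; a hypothetical sketch with invented names (the kernel has this
shape today only in specific spots, such as
address_space_operations->migratepage for page cache):

#include <stddef.h>

/* Invented names, for illustration only. */
struct movable_ops {
        /* Detach the object from anything that refers to it by
         * address; nonzero means the object is currently pinned. */
        int (*isolate)(void *obj);
        /* Copy the object to its new home and fix up every pointer
         * to the old location. */
        int (*migrate)(void *old_obj, void *new_obj);
};

/* An allocator given ops for every object can compact any region on
 * demand and hand back large contiguous ranges. */
void *alloc_movable(size_t size, const struct movable_ops *ops);
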

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Benjamin Herrenschmidt @ 2014-06-13 21:36 UTC
To: Christoph Lameter; +Cc: ksummit-discuss

On Fri, 2014-06-13 at 12:02 -0500, Christoph Lameter wrote:
> It is solvable if the objects are inherently movable. If every object
> allocated provides a function that makes the object movable, then
> defragmentation is possible, and therefore large contiguous areas of
> memory can be created at any time.

Another interesting thing is migration of pages with mapped DMA on
them :-)

Our IOMMUs support that, but there isn't a way to hook that up into
Linux page migration that wouldn't suck massively at this point.

* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
From: Rik van Riel @ 2014-06-13 22:23 UTC
To: ksummit-discuss

On 06/13/2014 05:36 PM, Benjamin Herrenschmidt wrote:
> Another interesting thing is migration of pages with mapped DMA on
> them :-)
>
> Our IOMMUs support that, but there isn't a way to hook that up into
> Linux page migration that wouldn't suck massively at this point.

The HMM stuff Jerome Glisse is working on may be a suitable framework to
add callbacks for things like migration to.

-- 
All rights reversed
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
  2014-06-13 21:36 ` Benjamin Herrenschmidt
  2014-06-13 22:23 ` Rik van Riel
@ 2014-06-13 23:04 ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 23:04 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: ksummit-discuss

On Sat, 14 Jun 2014, Benjamin Herrenschmidt wrote:

> > It is solvable if the objects are inherently movable. If every
> > allocated object provides a function that makes the object movable,
> > then defragmentation is possible, and therefore large contiguous areas
> > of memory can be created at any time.
>
> Another interesting thing is migration of pages with mapped DMA on
> them :-)
>
> Our IOMMUs support that, but there isn't a way to hook that up into
> Linux page migration that wouldn't suck massively at this point.

Well, yes, that would require a major rethink, but while we are at it we
may as well try to get more done. Maybe we can do that just for a limited
region within the existing memory management: something like OSV, cgroups
or cpusets that restricts it to certain nodes or CPUs where we would allow
this to occur, while the rest still runs the standard kernel. A kind of
sidecar approach.
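The closest userspace approximation of this sidecar today is pinning a
task and its memory to one node with existing interfaces. A minimal
sketch, assuming a machine where CPUs 8-15 sit on NUMA node 1 (the
in-kernel ownership model itself is the part that does not exist):

#define _GNU_SOURCE
#include <sched.h>
#include <numaif.h>     /* set_mempolicy(), MPOL_BIND; link with -lnuma */
#include <stdio.h>

int main(void)
{
        cpu_set_t cpus;
        unsigned long nodemask = 1UL << 1;  /* memory from node 1 only */
        int cpu;

        CPU_ZERO(&cpus);
        for (cpu = 8; cpu <= 15; cpu++)     /* CPUs assumed on node 1 */
                CPU_SET(cpu, &cpus);

        /* Restrict execution to the node's CPUs... */
        if (sched_setaffinity(0, sizeof(cpus), &cpus))
                perror("sched_setaffinity");

        /* ...and restrict allocations to the node's memory. */
        if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
                perror("set_mempolicy");

        /* Run the subsystem's work here, cache- and node-local. */
        return 0;
}

cpusets give the same effect hierarchically, and the isolcpus= boot
parameter removes CPUs from the scheduler entirely; what none of these
provide is the stronger notion of a subsystem actually owning the cores.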
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
  2014-06-13 17:02 ` Christoph Lameter
  2014-06-13 21:36 ` Benjamin Herrenschmidt
@ 2014-06-14  1:19 ` Phillip Lougher
  2014-06-16 14:04 ` Christoph Lameter
  1 sibling, 1 reply; 30+ messages in thread
From: Phillip Lougher @ 2014-06-14 1:19 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss

On 13/06/14 18:02, Christoph Lameter wrote:
> On Thu, 12 Jun 2014, Phillip Lougher wrote:
>
>>> 1. The need to use larger order pages, and the resulting problems with
>>> fragmentation. Memory sizes grow and therefore the number of page structs
>>> where state has to be maintained. Maybe there is something different? If
>>> we use hugepages then we have 511 useless page structs. Some apps need
>>> linear memory where we have trouble and are creating numerous memory
>>> allocators (recently the new bootmem allocator and CMA, plus lots of
>>> specialized allocators in various subsystems).
>>
>> This was never solved to my knowledge; there is no panacea here.
>> Even in the 90s we had video subsystems wanting to allocate in units
>> of 1 Mbyte, and others in units of 4 KB. The "solution" was so-called
>> split-level allocators, each specialised to deal with a particular
>> "first class media", giving memory back to the underlying
>> allocator when memory got tight in another specialised allocator.
>> Not much different to the ad-hoc solutions being adopted in Linux,
>> except the general idea was that each specialised allocator had the
>> same API.
>
> It is solvable if the objects are inherently movable. If every
> allocated object provides a function that makes the object movable,
> then defragmentation is possible, and therefore large contiguous areas
> of memory can be created at any time.
>
>>> Can we develop the notion that subsystems own certain cores so that their
>>> execution is restricted to a subset of the system, avoiding data
>>> replication and keeping subsystem data hot? I.e. have a device driver
>>> and the subsystems driving those devices run only on the NUMA node to
>>> which the PCI-E root complex is attached. Restricting to a NUMA node
>>> reduces data locality complexity and increases performance due to
>>> cache-hot data.
>>
>> Lots of academic hot air was expended here when designing distributed
>> systems that could scale seamlessly across heterogeneous CPUs connected
>> via different levels of interconnect (bus, ATM, Ethernet, etc.), with
>> zoning, migration, replication and so on. The "solution" is probably
>> out there somewhere, forgotten about.
>
> We have the issue with homogeneous CPUs too, due to the proliferation
> of cores on processors now. Maybe that is solvable?
>
>> Case in point: many years ago I was the lead Linux guy for a company
>> designing a SoC for digital TV. Just before I left I had an interesting
>> "conversation" with the chief hardware guy of the team who designed the
>> SoC. It turns out they'd budgeted for the RAM bandwidth needed to decode
>> a typical MPEG stream, but they'd not reckoned on all the memcopies Linux
>> needs to do between its "separate address space" processes. He'd been
>> used to embedded OSes which run in a single address space.
>
> Well, maybe that is appropriate for some processes? And we could carve out
> subsections of the hardware where single address space stuff is possible?

Apologies, maybe what I was trying to say wasn't clear :) I wasn't
arguing against it, but rather asking whether we should be trying to do
this at the Linux kernel level.

Embedded systems have long had the need to carve out (mainly
heterogeneous) processors from Linux. Media systems have VLIW media
processors (e.g. the Philips TriMedia), and mobile phones typically have
separate baseband processors. This is done without any core support from
the kernel: just write a device driver that presents a programming and
I/O channel to the carved-out hardware.

Additionally, where the Linux kernel has been too heavyweight, with its
slow real-time response and/or expensive paged multi-address-space model,
the solution is often to use a nano-kernel like ADEOS or RTLinux, running
Linux as a separate OS and leaving scope to run lighter-weight,
real-time, single-address-space operating systems in parallel.

In other words, if we need more efficiency, do it outside of Linux rather
than try to rewrite the strong protection model in Linux. That way leads
to pain.

My point about the hardware engineer is that people can't have their cake
and eat it. Unix/Linux has been successful partly because of its strong
protection/paged model. It is difficult to be both secure and efficient.
If you want both, then you need to design that in from the outset; Linux
isn't a good place to start.

Phillip
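The driver pattern Phillip describes needs nothing from the core kernel;
it is roughly the following shape. This is a hedged sketch with invented
names ("copro" is hypothetical, and the actual mailbox programming is
hardware-specific and omitted):

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

/* Userspace writes commands/firmware for the carved-out processor... */
static ssize_t copro_write(struct file *file, const char __user *buf,
                           size_t len, loff_t *ppos)
{
        /* copy_from_user() into a mailbox or shared ring, then kick
         * the coprocessor; entirely hardware-specific, so omitted. */
        return len;
}

/* ...and reads back status/completion messages. */
static ssize_t copro_read(struct file *file, char __user *buf,
                          size_t len, loff_t *ppos)
{
        return 0;       /* nothing pending in this stub */
}

static const struct file_operations copro_fops = {
        .owner  = THIS_MODULE,
        .read   = copro_read,
        .write  = copro_write,
};

static struct miscdevice copro_dev = {
        .minor  = MISC_DYNAMIC_MINOR,
        .name   = "copro",              /* appears as /dev/copro */
        .fops   = &copro_fops,
};

static int __init copro_init(void)
{
        return misc_register(&copro_dev);
}

static void __exit copro_exit(void)
{
        misc_deregister(&copro_dev);
}

module_init(copro_init);
module_exit(copro_exit);
MODULE_LICENSE("GPL");

Existing remoteproc/rpmsg drivers are an elaborated version of the same
idea; Christoph's objection below is that the carved-out side still
cannot call back into kernel services.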
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
  2014-06-14  1:19 ` Phillip Lougher
@ 2014-06-16 14:04 ` Christoph Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-16 14:04 UTC (permalink / raw)
To: Phillip Lougher; +Cc: ksummit-discuss

On Sat, 14 Jun 2014, Phillip Lougher wrote:

> Embedded systems have long had the need to carve out (mainly
> heterogeneous) processors from Linux. Media systems have VLIW media
> processors (e.g. the Philips TriMedia), and mobile phones typically have
> separate baseband processors. This is done without any core support from
> the kernel: just write a device driver that presents a programming and
> I/O channel to the carved-out hardware.

Well, but this is bad, because kernel services may be needed by these
carved-out processors. If the kernel supported this, life would be much
easier for you.

> Additionally, where the Linux kernel has been too heavyweight, with its
> slow real-time response and/or expensive paged multi-address-space model,
> the solution is often to use a nano-kernel like ADEOS or RTLinux, running
> Linux as a separate OS and leaving scope to run lighter-weight,
> real-time, single-address-space operating systems in parallel.

Having hardware and software that is handled by two different OSes is
pretty complex. Shoving something like that into the Linux kernel should
be pretty easy, because most of the infrastructure is already there.

> My point about the hardware engineer is that people can't have their cake
> and eat it. Unix/Linux has been successful partly because of its strong
> protection/paged model. It is difficult to be both secure and efficient.
> If you want both, then you need to design that in from the outset; Linux
> isn't a good place to start.

I think we can, if we allow cores to run with simplified support and
reduced overhead.
end of thread, other threads: [~2014-06-16 14:28 UTC | newest]

Thread overview: 30+ messages
2014-06-11 19:03 [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem Christoph Lameter
2014-06-11 19:26 ` Daniel Phillips
2014-06-11 19:45 ` Greg KH
2014-06-12 13:35 ` John W. Linville
2014-06-13 16:57 ` Christoph Lameter
2014-06-13 17:31 ` Greg KH
2014-06-13 17:59 ` Christoph Lameter
2014-06-13 19:18 ` Stephen Hemminger
2014-06-13 22:30 ` Christoph Lameter
2014-06-13 16:56 ` Christoph Lameter
2014-06-13 17:30 ` Greg KH
2014-06-13 17:55 ` James Bottomley
2014-06-13 18:41 ` Christoph Lameter
2014-06-16 11:39 ` Thomas Petazzoni
2014-06-16 14:05 ` Christoph Lameter
2014-06-16 14:09 ` Thomas Petazzoni
2014-06-16 14:28 ` Christoph Lameter
2014-06-13 18:01 ` Christoph Lameter
2014-06-13 18:25 ` Greg KH
2014-06-13 18:54 ` Christoph Lameter
2014-06-11 20:08 ` josh
2014-06-11 20:15 ` Andy Lutomirski
2014-06-11 20:52 ` Dave Hansen
2014-06-12 6:59 ` Phillip Lougher
2014-06-13 17:02 ` Christoph Lameter
2014-06-13 21:36 ` Benjamin Herrenschmidt
2014-06-13 22:23 ` Rik van Riel
2014-06-13 23:04 ` Christoph Lameter
2014-06-14 1:19 ` Phillip Lougher
2014-06-16 14:04 ` Christoph Lameter