* [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
@ 2014-06-11 19:03 Christoph Lameter
2014-06-11 19:26 ` Daniel Phillips
` (5 more replies)
0 siblings, 6 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-11 19:03 UTC (permalink / raw)
To: ksummit-discuss
Well, this is likely to be a bit of a hot subject, but I have been thinking
about this for a couple of years now. This is just a loose collection of
concerns that I see mostly at the high end, but many of them are also
valid for embedded systems, which have performance issues because the
devices are low powered (Android?).
There are numerous issues in memory management that create a level of
complexity that suggests a rewrite would at some point be beneficial:
1. The need to use larger-order pages, and the resulting problems with
fragmentation. Memory sizes grow, and with them the number of page structs
in which state has to be maintained. Maybe there is something different? If
we use hugepages then we have 511 useless page structs. Some apps need
linear memory, where we have trouble and are creating numerous memory
allocators (recently the new bootmem allocator and CMA, plus lots of
specialized allocators in various subsystems).
2. Support machines with massive numbers of cpus. I got a power8 system
for testing and it has 160 "cpus" on two sockets and four NUMA
nodes. The new processors from Intel may have up to 18 cores per socket,
which yields only 72 "cpus" for a two-socket system, but there are systems
with more sockets available and the outlook at that level is scary.
Per-cpu state and per-node state is replicated, and it becomes problematic
to aggregate the state for the whole machine since looping over the
per-cpu areas becomes expensive.
Can we develop the notion that subsystems own certain cores, so that their
execution is restricted to a subset of the system, avoiding data
replication and keeping subsystem data hot? I.e. have a device driver
and the subsystems driving those devices run only on the NUMA node to
which the PCI-E root complex is attached. Restricting to a NUMA node
reduces data-locality complexity and increases performance due to
cache-hot data.
3. Allocation "Zones". These are problematic because the zones often do
not reflect the capabilities of devices to allocate in certain ranges.
They are used for other purposes like MOVABLE pages, but then the pages
are not really movable because they are pinned for other reasons. Argh.
4. Performance characteristics often cannot be mapped to kernel
mechanisms. We have NUMA, where we can do things, but the cpu caching
effects, TLB sharing, and the caching of the DIMMs in page
buffers are not really well exploited.
5. Swap: No one really wants to swap today. This needs to be replaced with
something else. Going heavily into swap is akin to locking up the system.
There are numerous band-aid solutions but nothing appealing. Maybe the
best idea is the Android one of saving app state and removing the app
from memory.
6. Page faults:
We do not really use page faults the way they are intended to be used. A
file fault causes numerous readahead requests, and then only minor faults
are generated. There is the frequent desire not to have these long
interruptions occur while code is running. mlock[all] is there, but isn't
there a better, cleaner solution? Maybe we do not want to page a process
at all. Virtualization-like approaches that only support a single process
(like OSv) may be of interest.
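Today the blunt instrument is to pre-fault and pin everything, roughly as
below; the question is whether there is something cleaner:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* Sketch: lock all current and future mappings so the process
         * never takes a major fault once it is warmed up. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                perror("mlockall");
        /* ... latency-critical work runs here ... */
        return 0;
}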
Sometimes I think that something like MS-DOS (a "monitor") which provides
services but then gets out of the way may be better, because it does not
create the problems that require the workarounds of an OS. Maybe the
full-featured "OS" can run on some cores whereas others have only
monitor-like services (we are on the way there with the dynticks
approaches by Frederic Weisbecker).
7. Direct hardware access
Often the kernel subsystems are impeding performance. In high-speed
computing we regularly bypass the kernel network subsystems, block I/O,
etc. Direct hardware access, though, means that one is exposed to the ugly
particularities of how a certain device has to be handled. Can we have our
cake and eat it too by defining APIs that allow low-level hardware access
but also provide hardware abstraction (maybe limited to certain types of
devices)?
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-11 19:03 [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem Christoph Lameter
@ 2014-06-11 19:26 ` Daniel Phillips
2014-06-11 19:45 ` Greg KH
` (4 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: Daniel Phillips @ 2014-06-11 19:26 UTC (permalink / raw)
To: ksummit-discuss
On 06/11/2014 12:03 PM, Christoph Lameter wrote:
> Well, this is likely to be a bit of a hot subject, but I have been thinking
> about this for a couple of years now. This is just a loose collection of
> concerns that I see mostly at the high end, but many of them are also
> valid for embedded systems, which have performance issues because the
> devices are low powered (Android?).
>
> There are numerous issues in memory management that create a level of
> complexity that suggests a rewrite would at some point be beneficial:
>
> 1. The need to use larger-order pages, and the resulting problems with
> fragmentation. Memory sizes grow, and with them the number of page structs
> in which state has to be maintained. Maybe there is something different? If
> we use hugepages then we have 511 useless page structs. Some apps need
> linear memory, where we have trouble and are creating numerous memory
> allocators (recently the new bootmem allocator and CMA, plus lots of
> specialized allocators in various subsystems).
>
>
mem_map should be a radix tree?
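Purely hypothetical sketch of what I mean (using the existing
lib/radix-tree.c API; nothing like this backs mem_map today):

#include <linux/radix-tree.h>

/* Hypothetical: a sparse pfn -> struct page lookup instead of the
 * flat mem_map[] array, so large memory ranges that carry no state
 * would need no page structs at all. */
static RADIX_TREE(page_tree, GFP_KERNEL);

static struct page *pfn_to_page_rt(unsigned long pfn)
{
        return radix_tree_lookup(&page_tree, pfn);
}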
Regards,
Daniel
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-11 19:03 [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem Christoph Lameter
2014-06-11 19:26 ` Daniel Phillips
@ 2014-06-11 19:45 ` Greg KH
2014-06-12 13:35 ` John W. Linville
2014-06-13 16:56 ` Christoph Lameter
2014-06-11 20:08 ` josh
` (3 subsequent siblings)
5 siblings, 2 replies; 30+ messages in thread
From: Greg KH @ 2014-06-11 19:45 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On Wed, Jun 11, 2014 at 02:03:05PM -0500, Christoph Lameter wrote:
> 7. Direct hardware access
>
> Often the kernel subsystems are impeding performance. In high-speed
> computing we regularly bypass the kernel network subsystems, block I/O,
> etc. Direct hardware access, though, means that one is exposed to the ugly
> particularities of how a certain device has to be handled. Can we have our
> cake and eat it too by defining APIs that allow low-level hardware access
> but also provide hardware abstraction (maybe limited to certain types of
> devices)?
What type of devices are you wanting here, block and networking or
something else? We have the uio interface if you want to (and know how
to) talk to your hardware directly from userspace; what else do you want
to do here that this doesn't provide?
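For anyone unfamiliar, the userspace side of uio is roughly this sketch:
mmap the device's registers, and read() to wait for the next interrupt:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Sketch of a uio userspace driver. The mmap offset selects
         * the mapping (N * page size => map #N); read() blocks until
         * an interrupt arrives and returns the interrupt count. */
        int fd = open("/dev/uio0", O_RDWR);
        volatile uint32_t *regs = mmap(NULL, 4096,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        uint32_t irqs;

        read(fd, &irqs, sizeof(irqs));  /* wait for an interrupt */
        (void)regs;                     /* poke device registers here */
        return 0;
}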
thanks,
greg k-h
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-11 19:03 [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem Christoph Lameter
2014-06-11 19:26 ` Daniel Phillips
2014-06-11 19:45 ` Greg KH
@ 2014-06-11 20:08 ` josh
2014-06-11 20:15 ` Andy Lutomirski
` (2 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: josh @ 2014-06-11 20:08 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On Wed, Jun 11, 2014 at 02:03:05PM -0500, Christoph Lameter wrote:
> Well, this is likely to be a bit of a hot subject, but I have been thinking
> about this for a couple of years now. This is just a loose collection of
> concerns that I see mostly at the high end, but many of them are also
> valid for embedded systems, which have performance issues because the
> devices are low powered (Android?).
On the low end, we could also reasonably ask how much overhead Linux
memory management adds. Does it make sense to run the standard Linux
mm subsystem on a system with, say, 1MB of RAM?
- Josh Triplett
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-11 19:03 [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem Christoph Lameter
` (2 preceding siblings ...)
2014-06-11 20:08 ` josh
@ 2014-06-11 20:15 ` Andy Lutomirski
2014-06-11 20:52 ` Dave Hansen
2014-06-12 6:59 ` Phillip Lougher
5 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2014-06-11 20:15 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On Wed, Jun 11, 2014 at 12:03 PM, Christoph Lameter <cl@gentwo.org> wrote:
>
> 3. Allocation "Zones". These are problematic because the zones often do
> not reflect the capabilities of devices to allocate in certain ranges.
> They are used for other purposes like MOVABLE pages, but then the pages
> are not really movable because they are pinned for other reasons. Argh.
>
What if you just couldn't sleep while you have a MOVABLE page pinned?
Or what if you had to pin it and provide a callback to forcibly unpin
it? This would complicate direct IO and such, but it would make
movable pages really movable. It would also solve an annoyance with
the sealing thing: the sealing code wants to take writable pages and
make them really read-only. This interacts very badly with existing
pins.
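Something like this, as a purely hypothetical interface (none of these
names exist in the tree):

/* Hypothetical: pinning a movable page hands the MM a callback it may
 * invoke to revoke the pin when it wants to migrate the page; the
 * pinner must be prepared to re-pin at the new location. */
struct revocable_pin {
        struct page *page;
        void (*revoke)(struct revocable_pin *pin);
        void *private;
};

int pin_page_revocable(struct page *page, struct revocable_pin *pin);
void unpin_page_revocable(struct revocable_pin *pin);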
We have an IOMMU in many cases. Would it be so bad to say that direct IO
is only really direct if there's an IOMMU?
--Andy
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-11 19:03 [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem Christoph Lameter
` (3 preceding siblings ...)
2014-06-11 20:15 ` Andy Lutomirski
@ 2014-06-11 20:52 ` Dave Hansen
2014-06-12 6:59 ` Phillip Lougher
5 siblings, 0 replies; 30+ messages in thread
From: Dave Hansen @ 2014-06-11 20:52 UTC (permalink / raw)
To: Christoph Lameter, ksummit-discuss
On 06/11/2014 12:03 PM, Christoph Lameter wrote:
> 5. Swap: No one really wants to swap today. This needs to be replaced with
> something else. Going heavily into swap is akin to locking up the system.
> There are numerous band-aid solutions but nothing appealing. Maybe the
> best idea is the Android one of saving app state and removing the app
> from memory.
Yeah, our entire approach is getting a bit dated, and I think it's
really designed around the fact that our swap devices have historically
been (relatively) painfully slow to access.
There are some patches in mm to _help_, but currently if you throw a
really fast swap device in a system and try to swap heavily to it, you
don't get anywhere near saturating the device.
We'd probably be better off if we just blasted data at the device and
ignored the LRU.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-11 19:03 [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem Christoph Lameter
` (4 preceding siblings ...)
2014-06-11 20:52 ` Dave Hansen
@ 2014-06-12 6:59 ` Phillip Lougher
2014-06-13 17:02 ` Christoph Lameter
5 siblings, 1 reply; 30+ messages in thread
From: Phillip Lougher @ 2014-06-12 6:59 UTC (permalink / raw)
To: Christoph Lameter, ksummit-discuss
On 11/06/14 20:03, Christoph Lameter wrote:
> Well, this is likely to be a bit of a hot subject, but I have been thinking
> about this for a couple of years now. This is just a loose collection of
> concerns that I see mostly at the high end, but many of them are also
> valid for embedded systems, which have performance issues because the
> devices are low powered (Android?).
>
>
>
> There are numerous issues in memory management that create a level of
> complexity that suggests a rewrite would at some point be beneficial:
Slow incremental improvements, which are already happening, yes.
"Grand plans" to rewrite everything from scratch, please no.
Academic computing research is littered with grand plans that never
went anywhere. Not least your list, which sounds like the objectives
of the late 80s/mid 90s research into "multi-service operating systems"
(or the wider distributed operating systems research of the time).
There too, we (I was doing research into this at the time) were envisaging
hundreds of heterogeneous CPUs with diverse memory hierarchies,
interconnects, I/O configurations, instruction sets etc., and imagining a
grand unifying system that would tie these together. In addition, this was
the time that audio and video became a serious proposition, and ideas to
incorporate these new concepts into the operating system as "first-class"
objects became all the rage: knowledge of the special characteristics of
audio/video was to be built into memory management, the schedulers, the
filesystems. Old-style operating systems like Unix were out, and everything
was to be redesigned from scratch.
There were some good ideas proposed, some of which in various forms have
made their way incrementally into Linux (your list of zones, NUMA, page
fault minimisation, direct hardware access). But in general it failed; it
made no discernible impact on the state of the art in operating system
implementation. It was too much, too grand: no research group had the
wherewithal to design this from scratch, and by and large the operating
systems companies were happy with what they had. Some universities (like
Lancaster and Cambridge, where I worked) had prototypes, but these were
exemplars of how little rather than how much.
Only one company to my knowledge had the hubris to design a new operating
system along these lines from scratch: Acorn Computers of Cambridge, UK
(the originators of the ARM CPU, BTW), where I left Cambridge University to
help design the operating system. Again, nice ideas, but it proved too much
and Acorn went bankrupt in 1998. The new operating system was called
Galileo, and there are a few links still around, e.g.
http://www.poppyfields.net/acorn/news/acopress/97-02-10b.shtml
In contrast Linux, which I'd installed in 1994 (when I was busily doing
"real operating systems work" and dismissed it as a toy), took the
"modest" approach of reimplementing Unix. Four years later, in 1998, Linux
was becoming something to be reckoned with, whilst the grand plans had
just led to failure.
In fact, within a few years Linux with its "old school" design on a single
core was doing things that had taken us specialised operating-system
techniques to do, simply because hardware had become so much better that
those techniques turned out to be no longer needed.
Yeah, this is probably highly off topic, but I had deja vu when reading
this "let's redesign everything from scratch, what could possibly go
wrong" list.
BTW I looked up some of my old colleagues, and it turns out they
were still writing papers on this as late as 2009 (only 13 years after I
left for Acorn and industry).
"The multikernel: a new OS architecture for scalable multicore systems"
http://dl.acm.org/citation.cfm?doid=1629575.1629579
It's paywalled, but the abstract has the following, which may be of
interest to you:
"We have implemented a multikernel OS to show that the approach is promising,
and we describe how traditional scalability problems for operating systems
(such as memory management) can be effectively recast using messages and can
exploit insights from distributed systems and networking."
lol
>
>
> > 1. The need to use larger-order pages, and the resulting problems with
> > fragmentation. Memory sizes grow, and with them the number of page structs
> > in which state has to be maintained. Maybe there is something different? If
> > we use hugepages then we have 511 useless page structs. Some apps need
> > linear memory, where we have trouble and are creating numerous memory
> > allocators (recently the new bootmem allocator and CMA, plus lots of
> > specialized allocators in various subsystems).
>
This was never solved to my knowledge; there is no panacea here.
Even in the 90s we had video subsystems wanting to allocate in units
of 1Mbyte, and others in units of 4k. The "solution" was so-called
split-level allocators, each specialised to deal with a particular
"first-class media", each giving back memory to the underlying
allocator when memory got tight in another specialised allocator.
Not much different to the ad-hoc solutions being adopted in Linux,
except the general idea was that each specialised allocator had the
same API.
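In rough C terms the idea was something like this (an illustrative
reconstruction from memory, not any actual historical code):

/* Every specialised allocator spoke the same API and exposed a
 * reclaim hook, so the underlying allocator could pull memory back
 * when a sibling allocator ran tight. */
struct split_allocator {
        const char *name;
        size_t unit;                    /* e.g. 4k, or 1Mbyte for video */
        void *(*alloc)(struct split_allocator *a);
        void (*free)(struct split_allocator *a, void *p);
        size_t (*reclaim)(struct split_allocator *a, size_t wanted);
};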
> 2. Support machines with massive numbers of cpus. I got a power8 system
> for testing and it has 160 "cpus" on two sockets and four NUMA
> nodes. The new processors from Intel may have up to 18 cores per socket,
> which yields only 72 "cpus" for a two-socket system, but there are systems
> with more sockets available and the outlook at that level is scary.
>
> Per-cpu state and per-node state is replicated, and it becomes problematic
> to aggregate the state for the whole machine since looping over the
> per-cpu areas becomes expensive.
>
> Can we develop the notion that subsystems own certain cores, so that their
> execution is restricted to a subset of the system, avoiding data
> replication and keeping subsystem data hot? I.e. have a device driver
> and the subsystems driving those devices run only on the NUMA node to
> which the PCI-E root complex is attached. Restricting to a NUMA node
> reduces data-locality complexity and increases performance due to
> cache-hot data.
Lots of academic hot air was expended here when designing distributed
systems which could scale seamlessly across heterogeneous CPUs connected
via different levels of interconnects (bus, ATM, ethernet etc.), zoning,
migration, replication etc. The "solution" is probably out there somewhere,
forgotten about.
>
> 3. Allocation "Zones". These are problematic because the zones often do
> not reflect the capabilities of devices to allocate in certain ranges.
> They are used for other purposes like MOVABLE pages, but then the pages
> are not really movable because they are pinned for other reasons. Argh.
>
> 4. Performance characteristics often cannot be mapped to kernel
> mechanisms. We have NUMA, where we can do things, but the cpu caching
> effects, TLB sharing, and the caching of the DIMMs in page
> buffers are not really well exploited.
>
> 5. Swap: No one really wants to swap today. This needs to be replaced with
> something else. Going heavily into swap is akin to locking up the system.
> There are numerous band-aid solutions but nothing appealing. Maybe the
> best idea is the Android one of saving app state and removing the app
> from memory.
Embedded-system operating systems by and large never had swap.
Embedded systems which today use Linux see swap as a null op. It isn't
used. It is madness to swap to a NAND device.
But I actually think Linux is ahead of the curve here, with things
like zcache, zswap and compressed filesystems, which can be used as
an intermediate stage, storing data compressed in memory that is only
expanded when necessary. All of these minimise memory footprint without
having to resort to a swap device.
>
> 6. Page faults:
>
> We do not really use page faults the way they are intended to be used. A
> file fault causes numerous readahead requests, and then only minor faults
> are generated. There is the frequent desire not to have these long
> interruptions occur while code is running. mlock[all] is there, but isn't
> there a better, cleaner solution? Maybe we do not want to page a process
> at all. Virtualization-like approaches that only support a single process
> (like OSv) may be of interest.
You concentrate only on page faults swapping file data into memory.
By and large embedded systems aim to run with their working set in
memory (i.e. demand-paged at start-up but then in cache); trying to
preserve any kind of real-time guarantee when you discover half your
working set has been flushed and suddenly needs to be paged back in from
slow NAND is a null op.
Page faults between processes with shared mmap segments, or more often
context switches and repeated memcopying to do I/O between processes,
are what concern embedded systems. Context switching and memcopying just
throw away limited bandwidth on an embedded system.
Case in point: many years ago I was the lead Linux guy for a company
designing a SOC for digital TV. Just before I left I had an interesting
"conversation" with the chief hardware guy of the team who designed the SOC.
Turns out they'd budgeted for the RAM bandwidth needed to decode a typical
MPEG stream, but they'd not reckoned on all the memcopies Linux needs to do
between its "separate address space" processes. He'd been used to embedded
OSes which run in a single address space.
The fact is, security is ever more important even in embedded systems, and
a multi-address-space operating system gives security that is impossible in
single-address-space operating systems, which do away with paging for
efficiency. This security comes at a price.
Back when I was designing Galileo for Acorn in the 90s, we knew all about
the tradeoffs between single-address and multi-address operating systems.
I introduced the concept of containers (not the same as modern Linux
containers): separate units of I/O which could be transferred efficiently
between processes. We had the concept that trusted processes could be
in the same address space, and untrusted processes would be in separate
address spaces. Containers transferred between separate address spaces
were moved via page flipping (unmapping from the source, remapping to the
destination), but containers passed between processes in the same address
space were passed by handle. The same API was used for both, and processes
could be moved between address spaces without the API changing, thus
trading off security against efficiency invisibly to the application.
>
> Sometimes I think that something like MS-DOS (a "monitor") which provides
> services but then gets out of the way may be better, because it does not
> create the problems that require the workarounds of an OS. Maybe the
> full-featured "OS" can run on some cores whereas others have only
> monitor-like services (we are on the way there with the dynticks
> approaches by Frederic Weisbecker).
>
> 7. Direct hardware access
>
> Often the kernel subsystems are impeding performance. In high-speed
> computing we regularly bypass the kernel network subsystems, block I/O,
> etc. Direct hardware access, though, means that one is exposed to the ugly
> particularities of how a certain device has to be handled. Can we have our
> cake and eat it too by defining APIs that allow low-level hardware access
> but also provide hardware abstraction (maybe limited to certain types of
> devices)?
Been there, done that. One of the ideas at the time was to reduce
the "operating system" to a micro-microkernel, dealing with the
lowest possible abstraction only. The relevant operating-system "stack"
would be directly mapped into each process (i.e. the networking stack),
avoiding the costly context switch into kernel mode. But unless you
were to produce a "stack" for each and every possible hardware device,
it meant you had to produce a stack dealing with hardware at the lowest
level but through a generic API, the actual mapping of that generic
hardware API in theory being a wafer-thin "shim". Real hardware doesn't
work like that. One example: I tried to do this for DMA controllers, but
it turns out DMA controllers are wildly different; the best performance is
obtained via direct knowledge of their quirks. By the time I had worked
out a generic API that would work as a shim across all controllers, none
of the elegance or performance was retained.
Phillip
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-11 19:45 ` Greg KH
@ 2014-06-12 13:35 ` John W. Linville
2014-06-13 16:57 ` Christoph Lameter
2014-06-13 16:56 ` Christoph Lameter
1 sibling, 1 reply; 30+ messages in thread
From: John W. Linville @ 2014-06-12 13:35 UTC (permalink / raw)
To: Greg KH; +Cc: ksummit-discuss
On Wed, Jun 11, 2014 at 12:45:04PM -0700, Greg KH wrote:
> On Wed, Jun 11, 2014 at 02:03:05PM -0500, Christoph Lameter wrote:
> > 7. Direct hardware access
> >
> > Often the kernel subsystems are impeding performance. In high-speed
> > computing we regularly bypass the kernel network subsystems, block I/O,
> > etc. Direct hardware access, though, means that one is exposed to the ugly
> > particularities of how a certain device has to be handled. Can we have our
> > cake and eat it too by defining APIs that allow low-level hardware access
> > but also provide hardware abstraction (maybe limited to certain types of
> > devices)?
>
> What type of devices are you wanting here, block and networking or
> something else? We have the uio interface if you want to (and know how
> to) talk to your hardware directly from userspace; what else do you want
> to do here that this doesn't provide?
AF_PACKET provides some level of hardware abstraction without a lot of
overhead for networking apps that are prepared to deal with raw frames.
Is this the kind of networking API you would propose?
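For concreteness, the minimal AF_PACKET data path looks roughly like the
sketch below (needs CAP_NET_RAW; the app parses the raw frames itself):

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        /* Sketch: raw frames straight off the device, with no IP/TCP
         * processing in between, though the kernel still handles the
         * copy on every frame. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        unsigned char frame[ETH_FRAME_LEN];
        ssize_t n = read(fd, frame, sizeof(frame));

        (void)n;
        return 0;
}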
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-11 19:45 ` Greg KH
2014-06-12 13:35 ` John W. Linville
@ 2014-06-13 16:56 ` Christoph Lameter
2014-06-13 17:30 ` Greg KH
1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 16:56 UTC (permalink / raw)
To: Greg KH; +Cc: ksummit-discuss
On Wed, 11 Jun 2014, Greg KH wrote:
> > Often the kernel subsystems are impeding performance. In high-speed
> > computing we regularly bypass the kernel network subsystems, block I/O,
> > etc. Direct hardware access, though, means that one is exposed to the ugly
> > particularities of how a certain device has to be handled. Can we have our
> > cake and eat it too by defining APIs that allow low-level hardware access
> > but also provide hardware abstraction (maybe limited to certain types of
> > devices)?
>
> What type of devices are you wanting here, block and networking or
> something else? We have the uio interface if you want to (and know how
> to) talk to your hardware directly from userspace; what else do you want
> to do here that this doesn't provide?
Block and networking, mainly. The userspace VFIO API exposes
device-specific registers. We need something that is a decent abstraction.
IBverbs is something like that, but it could be done much better.
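Roughly the verbs model (a sketch, error handling omitted): the setup
below is all kernel-mediated, while ibv_post_send()/ibv_poll_cq() later
ring the device doorbells directly from userspace:

#include <infiniband/verbs.h>

int main(void)
{
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

        /* ... create a QP, register memory regions, then post sends
         * and poll completions with no syscall in the data path ... */
        (void)pd; (void)cq;
        return 0;
}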
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-12 13:35 ` John W. Linville
@ 2014-06-13 16:57 ` Christoph Lameter
2014-06-13 17:31 ` Greg KH
0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 16:57 UTC (permalink / raw)
To: John W. Linville; +Cc: ksummit-discuss
On Thu, 12 Jun 2014, John W. Linville wrote:
> AF_PACKET provides some level of hardware abstraction without a lot of
> overhead for networking apps that are prepared to deal with raw frames.
> Is this the kind of networking API you would propose?
The kernel is still in the data path and will cause limitations in terms
of bandwidth and latency.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-12 6:59 ` Phillip Lougher
@ 2014-06-13 17:02 ` Christoph Lameter
2014-06-13 21:36 ` Benjamin Herrenschmidt
2014-06-14 1:19 ` Phillip Lougher
0 siblings, 2 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 17:02 UTC (permalink / raw)
To: Phillip Lougher; +Cc: ksummit-discuss
On Thu, 12 Jun 2014, Phillip Lougher wrote:
> > 1. The need to use larger-order pages, and the resulting problems with
> > fragmentation. Memory sizes grow, and with them the number of page structs
> > in which state has to be maintained. Maybe there is something different? If
> > we use hugepages then we have 511 useless page structs. Some apps need
> > linear memory, where we have trouble and are creating numerous memory
> > allocators (recently the new bootmem allocator and CMA, plus lots of
> > specialized allocators in various subsystems).
> >
>
> This was never solved to my knowledge; there is no panacea here.
> Even in the 90s we had video subsystems wanting to allocate in units
> of 1Mbyte, and others in units of 4k. The "solution" was so-called
> split-level allocators, each specialised to deal with a particular
> "first-class media", each giving back memory to the underlying
> allocator when memory got tight in another specialised allocator.
> Not much different to the ad-hoc solutions being adopted in Linux,
> except the general idea was that each specialised allocator had the
> same API.
It is solvable if the objects are inherently movable. If every object
allocated provides a function that makes the object movable, then
defragmentation is possible and therefore large contiguous areas of
memory can be created at any time.
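In sketch form (hypothetical; nothing like this is in the tree):

/* Hypothetical: every allocation registers a migrate operation, so
 * the defragmenter can relocate any object on demand and open up
 * large physically contiguous ranges at any time. */
struct movable_ops {
        int (*migrate)(void *obj, void *new_location);
};

void *alloc_movable(size_t size, const struct movable_ops *ops);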
> > Can we develop the notion that subsystems own certain cores, so that their
> > execution is restricted to a subset of the system, avoiding data
> > replication and keeping subsystem data hot? I.e. have a device driver
> > and the subsystems driving those devices run only on the NUMA node to
> > which the PCI-E root complex is attached. Restricting to a NUMA node
> > reduces data-locality complexity and increases performance due to
> > cache-hot data.
>
> Lots of academic hot air was expended here when designing distributed
> systems which could scale seamlessly across heterogeneous CPUs connected
> via different levels of interconnects (bus, ATM, ethernet etc.), zoning,
> migration, replication etc. The "solution" is probably out there somewhere,
> forgotten about.
We have the issue with homogeneous cpus due to the proliferation of cores
on processors now. Maybe that is solvable?
> Case in point: many years ago I was the lead Linux guy for a company
> designing a SOC for digital TV. Just before I left I had an interesting
> "conversation" with the chief hardware guy of the team who designed the SOC.
> Turns out they'd budgeted for the RAM bandwidth needed to decode a typical
> MPEG stream, but they'd not reckoned on all the memcopies Linux needs to do
> between its "separate address space" processes. He'd been used to embedded
> OSes which run in a single address space.
Well, maybe that is appropriate for some processes? And we could carve out
subsections of the hardware where single-address-space stuff is possible?
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 16:56 ` Christoph Lameter
@ 2014-06-13 17:30 ` Greg KH
2014-06-13 17:55 ` James Bottomley
2014-06-13 18:01 ` Christoph Lameter
0 siblings, 2 replies; 30+ messages in thread
From: Greg KH @ 2014-06-13 17:30 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On Fri, Jun 13, 2014 at 11:56:08AM -0500, Christoph Lameter wrote:
> On Wed, 11 Jun 2014, Greg KH wrote:
>
> > > Often the kernel subsystems are impeding performance. In high-speed
> > > computing we regularly bypass the kernel network subsystems, block I/O,
> > > etc. Direct hardware access, though, means that one is exposed to the ugly
> > > particularities of how a certain device has to be handled. Can we have our
> > > cake and eat it too by defining APIs that allow low-level hardware access
> > > but also provide hardware abstraction (maybe limited to certain types of
> > > devices)?
> >
> > What type of devices are you wanting here, block and networking or
> > something else? We have the uio interface if you want to (and know how
> > to) talk to your hardware directly from userspace; what else do you want
> > to do here that this doesn't provide?
>
> Block and networking, mainly. The userspace VFIO API exposes
> device-specific registers. We need something that is a decent abstraction.
> IBverbs is something like that, but it could be done much better.
Heh, we've been down this road before :)
In the end, userspace wants a socket-like interface to the networking
"stack", right? So either you provide that with a custom networking
library that talks directly to a specific hardware card (like 3
different companies provide), or you just deal with the in-kernel
network stack. What else is there that we can do here?
And as for block devices, "raw access", really? What is lacking with
what we already provide in "raw mode", and a no-op block scheduler? How
much more "lean" can we possibly go without you having to write a custom
userspace uio driver for every block controller out there?
thanks,
greg k-h
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 16:57 ` Christoph Lameter
@ 2014-06-13 17:31 ` Greg KH
2014-06-13 17:59 ` Christoph Lameter
0 siblings, 1 reply; 30+ messages in thread
From: Greg KH @ 2014-06-13 17:31 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On Fri, Jun 13, 2014 at 11:57:04AM -0500, Christoph Lameter wrote:
> On Thu, 12 Jun 2014, John W. Linville wrote:
>
> > AF_PACKET provides some level of hardware abstraction without a lot of
> > overhead for networking apps that are prepared to deal with raw frames.
> > Is this the kind of networking API you would propose?
>
> The kernel is still in the data path and will cause limitations in terms
> of bandwidth and latency.
Of course it will, nothing is "free". If this is a problem, then run
one of the many different networking stacks that are in userspace and
are tailored to a specific use case. The kernel has to provide a
"general" use case stack; that is its job.
greg k-h
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 17:30 ` Greg KH
@ 2014-06-13 17:55 ` James Bottomley
2014-06-13 18:41 ` Christoph Lameter
2014-06-13 18:01 ` Christoph Lameter
1 sibling, 1 reply; 30+ messages in thread
From: James Bottomley @ 2014-06-13 17:55 UTC (permalink / raw)
To: Greg KH; +Cc: ksummit-discuss
On Fri, 2014-06-13 at 10:30 -0700, Greg KH wrote:
> On Fri, Jun 13, 2014 at 11:56:08AM -0500, Christoph Lameter wrote:
> > On Wed, 11 Jun 2014, Greg KH wrote:
> >
> > > > Often the kernel subsystems are impeding performance. In high-speed
> > > > computing we regularly bypass the kernel network subsystems, block I/O,
> > > > etc. Direct hardware access, though, means that one is exposed to the ugly
> > > > particularities of how a certain device has to be handled. Can we have our
> > > > cake and eat it too by defining APIs that allow low-level hardware access
> > > > but also provide hardware abstraction (maybe limited to certain types of
> > > > devices)?
> > >
> > > What type of devices are you wanting here, block and networking or
> > > something else? We have the uio interface if you want to (and know how
> > > to) talk to your hardware directly from userspace; what else do you want
> > > to do here that this doesn't provide?
> >
> > Block and networking, mainly. The userspace VFIO API exposes
> > device-specific registers. We need something that is a decent abstraction.
> > IBverbs is something like that, but it could be done much better.
>
> Heh, we've been down this road before :)
>
> In the end, userspace wants a socket-like interface to the networking
> "stack", right? So either you provide that with a custom networking
> library that talks directly to a specific hardware card (like 3
> different companies provide), or you just deal with the in-kernel
> network stack. What else is there that we can do here?
>
> And as for block devices, "raw access", really? What is lacking with
> what we already provide in "raw mode", and a no-op block scheduler? How
> much more "lean" can we possibly go without you having to write a custom
> userspace uio driver for every block controller out there?
Just remember there are lessons from raw devices too. Oracle originally
forced raw mode on our block devices for this reason ... "just get
your block layer and filesystems mostly out of our way" was their cry.
Then they discovered that not having a FS wrapper led to the system not
being able to recognise the raw devices as being raw, which led to an
awful lot of really expensive data-loss cockups.
The compromise today is using filesystems with O_DIRECT to the file data
containers.
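That is, roughly the sketch below: the filesystem keeps the naming and
metadata, the data path bypasses the page cache, and the alignment
burden lands on the application:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        /* Sketch: O_DIRECT requires suitably aligned buffers,
         * offsets and lengths; 4096 is a typical alignment. */
        int fd = open("datafile", O_RDWR | O_DIRECT);
        void *buf;

        posix_memalign(&buf, 4096, 4096);
        pread(fd, buf, 4096, 0);
        return 0;
}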
The point here is that lots of people say "just get your operating
system out of my way", but most realise they actually didn't mean it when
presented with the reality.
The abstractions most people who say this want are a zero-delay data
path with someone else taking care of all of the metadata and setup
problems ... effectively an MPI-type interface. Is that what you're
looking for, Christoph?
James
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 17:31 ` Greg KH
@ 2014-06-13 17:59 ` Christoph Lameter
2014-06-13 19:18 ` Stephen Hemminger
0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 17:59 UTC (permalink / raw)
To: Greg KH; +Cc: ksummit-discuss
On Fri, 13 Jun 2014, Greg KH wrote:
> On Fri, Jun 13, 2014 at 11:57:04AM -0500, Christoph Lameter wrote:
> > On Thu, 12 Jun 2014, John W. Linville wrote:
> >
> > > AF_PACKET provides some level of hardware abstraction without a lot of
> > > overhead for networking apps that are prepared to deal with raw frames.
> > > Is this the kind of networking API you would propose?
> >
> > The kernel is still in the data path and will cause limitations in terms
> > of bandwidth and latency.
>
> Of course it will, nothing is "free". If this is a problem, then run
> one of the many different networking stacks that are in userspace and
> are tailored to a specific use case. The kernel has to provide a
> "general" use case stack; that is its job.
But again, I want both: a general stack that allows at least the data path
to go directly to the device. The metadata and connection management etc.
should be firmly in the hands of the kernel.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 17:30 ` Greg KH
2014-06-13 17:55 ` James Bottomley
@ 2014-06-13 18:01 ` Christoph Lameter
2014-06-13 18:25 ` Greg KH
1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 18:01 UTC (permalink / raw)
To: Greg KH; +Cc: ksummit-discuss
On Fri, 13 Jun 2014, Greg KH wrote:
> In the end, userspace wants a socket-like interface to the networking
> "stack", right? So either you provide that with a custom networking
> library that talks directly to a specific hardware card (like 3
> different companies provide), or you just deal with the in-kernel
> network stack. What else is there that we can do here?
Standardize the kernel APIs for this use case as well as the user-space
APIs so that software runs on any of the 3 companies' stacks.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 18:01 ` Christoph Lameter
@ 2014-06-13 18:25 ` Greg KH
2014-06-13 18:54 ` Christoph Lameter
0 siblings, 1 reply; 30+ messages in thread
From: Greg KH @ 2014-06-13 18:25 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On Fri, Jun 13, 2014 at 01:01:02PM -0500, Christoph Lameter wrote:
> On Fri, 13 Jun 2014, Greg KH wrote:
>
> > In the end, userspace wants a socket-like interface to the networking
> > "stack", right? So either you provide that with a custom networking
> > library that talks directly to a specific hardware card (like 3
> > different companies provide), or you just deal with the in-kernel
> > network stack. What else is there that we can do here?
>
> Standardize the kernel APIs for this use case
The UIO interface is being used for this, so all should be good on the
kernel side, right?
> as well as the user-space APIs so that software runs on any of the 3
> companies' stacks.
As these libraries are outside of the kernel tree, there's not much we
kernel developers can do about this. Work with those companies to do
this...
greg k-h
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 17:55 ` James Bottomley
@ 2014-06-13 18:41 ` Christoph Lameter
2014-06-16 11:39 ` Thomas Petazzoni
0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 18:41 UTC (permalink / raw)
To: James Bottomley; +Cc: ksummit-discuss
On Fri, 13 Jun 2014, James Bottomley wrote:
> The point here is that lots of people say "just get your operating
> system out of my way", but most realise they actually didn't mean it when
> presented with the reality.
Right. Exactly. What I would like to see is the OS doing its part to make
things nice and provide a convenient abstraction of the ugly details.
> The abstractions most people who say this want are a zero-delay data
> path with someone else taking care of all of the metadata and setup
> problems ... effectively an MPI-type interface. Is that what you're
> looking for, Christoph?
Ideally the setup/metadata should be handled by the OS while the data
path would go direct. The get-out-of-the-way piece is restricted only to
the performance-critical portion, which is the actual data transfer.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 18:25 ` Greg KH
@ 2014-06-13 18:54 ` Christoph Lameter
0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 18:54 UTC (permalink / raw)
To: Greg KH; +Cc: ksummit-discuss
On Fri, 13 Jun 2014, Greg KH wrote:
> > Standardize the kernel APIs for this use case
>
> The UIO interface is being used for this, so all should be good on the
> kernel side, right?
Ok, I have not seen any vendor use that interface and thus I am not
familiar with it.
> > as well as the user-space APIs so that software runs on any of the 3
> > companies' stacks.
>
> As these libraries are outside of the kernel tree, there's not much we
> kernel developers can do about this. Work with those companies to do
> this...
It's not that easy a separation. The management functions had better be
left with the kernel so that security and permission management work, and
so that the device stays in a well-known state if accessed from multiple
applications.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 17:59 ` Christoph Lameter
@ 2014-06-13 19:18 ` Stephen Hemminger
2014-06-13 22:30 ` Christoph Lameter
0 siblings, 1 reply; 30+ messages in thread
From: Stephen Hemminger @ 2014-06-13 19:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On Fri, 13 Jun 2014 12:59:32 -0500 (CDT)
Christoph Lameter <cl@gentwo.org> wrote:
> On Fri, 13 Jun 2014, Greg KH wrote:
>
> > On Fri, Jun 13, 2014 at 11:57:04AM -0500, Christoph Lameter wrote:
> > > On Thu, 12 Jun 2014, John W. Linville wrote:
> > >
> > > > AF_PACKET provides some level of hardware abstraction without a lot of
> > > > overhead for networking apps that are prepared to deal with raw frames.
> > > > Is this the kind of networking API you would propose?
> > >
> > > The kernel is still in the data path and will cause limitations in terms
> > > of bandwidth and latency.
> >
> > Of course it will, nothing is "free". If this is a problem, then run
> > one of the many different networking stacks that are in userspace and
> > are tailored to a specific use case. The kernel has to provide a
> > "general" use case stack; that is its job.
>
> But again, I want both: a general stack that allows at least the data path
> to go directly to the device. The metadata and connection management etc.
> should be firmly in the hands of the kernel.
There are several dataplane user-mode networking implementations that
do this. The problem is you either have to overlap with every networking
driver (netmap) or do the driver in userspace (DPDK).
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 17:02 ` Christoph Lameter
@ 2014-06-13 21:36 ` Benjamin Herrenschmidt
2014-06-13 22:23 ` Rik van Riel
2014-06-13 23:04 ` Christoph Lameter
2014-06-14 1:19 ` Phillip Lougher
1 sibling, 2 replies; 30+ messages in thread
From: Benjamin Herrenschmidt @ 2014-06-13 21:36 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On Fri, 2014-06-13 at 12:02 -0500, Christoph Lameter wrote:
> On Thu, 12 Jun 2014, Phillip Lougher wrote:
>
> > > 1. The need to use larger-order pages, and the resulting problems with
> > > fragmentation. Memory sizes grow, and with them the number of page structs
> > > in which state has to be maintained. Maybe there is something different? If
> > > we use hugepages then we have 511 useless page structs. Some apps need
> > > linear memory, where we have trouble and are creating numerous memory
> > > allocators (recently the new bootmem allocator and CMA, plus lots of
> > > specialized allocators in various subsystems).
> > >
> >
> > This was never solved to my knowledge; there is no panacea here.
> > Even in the 90s we had video subsystems wanting to allocate in units
> > of 1Mbyte, and others in units of 4k. The "solution" was so-called
> > split-level allocators, each specialised to deal with a particular
> > "first-class media", each giving back memory to the underlying
> > allocator when memory got tight in another specialised allocator.
> > Not much different to the ad-hoc solutions being adopted in Linux,
> > except the general idea was that each specialised allocator had the
> > same API.
>
> It is solvable if the objects are inherently movable. If every object
> allocated provides a function that makes the object movable, then
> defragmentation is possible and therefore large contiguous areas of
> memory can be created at any time.
Another interesting thing is migration of pages with mapped DMA on
them :-)
Our IOMMUs support that, but there isn't a way to hook that up into
Linux page migration that wouldn't suck massively at this point.
> > > Can we develop the notion that subsystems own certain cores, so that their
> > > execution is restricted to a subset of the system, avoiding data
> > > replication and keeping subsystem data hot? I.e. have a device driver
> > > and the subsystems driving those devices run only on the NUMA node to
> > > which the PCI-E root complex is attached. Restricting to a NUMA node
> > > reduces data-locality complexity and increases performance due to
> > > cache-hot data.
> >
> > Lots of academic hot air was expended here when designing distributed
> > systems which could scale seamlessly across heterogeneous CPUs connected
> > via different levels of interconnects (bus, ATM, ethernet etc.), zoning,
> > migration, replication etc. The "solution" is probably out there somewhere,
> > forgotten about.
>
> We have the issue with homogeneous cpus due to the proliferation of cores
> on processors now. Maybe that is solvable?
>
> > Case in point: many years ago I was the lead Linux guy for a company
> > designing a SOC for digital TV. Just before I left I had an interesting
> > "conversation" with the chief hardware guy of the team who designed the SOC.
> > Turns out they'd budgeted for the RAM bandwidth needed to decode a typical
> > MPEG stream, but they'd not reckoned on all the memcopies Linux needs to do
> > between its "separate address space" processes. He'd been used to embedded
> > OSes which run in a single address space.
>
> Well, maybe that is appropriate for some processes? And we could carve out
> subsections of the hardware where single-address-space stuff is possible?
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 21:36 ` Benjamin Herrenschmidt
@ 2014-06-13 22:23 ` Rik van Riel
2014-06-13 23:04 ` Christoph Lameter
1 sibling, 0 replies; 30+ messages in thread
From: Rik van Riel @ 2014-06-13 22:23 UTC (permalink / raw)
To: ksummit-discuss
On 06/13/2014 05:36 PM, Benjamin Herrenschmidt wrote:
> On Fri, 2014-06-13 at 12:02 -0500, Christoph Lameter wrote:
>> On Thu, 12 Jun 2014, Phillip Lougher wrote:
>>
>>>> 1. The need to use larger-order pages, and the resulting problems with
>>>> fragmentation. Memory sizes grow, and with them the number of page structs
>>>> in which state has to be maintained. Maybe there is something different? If
>>>> we use hugepages then we have 511 useless page structs. Some apps need
>> It is solvable if the objects are inherently movable. If every object
>> allocated provides a function that makes the object movable, then
>> defragmentation is possible and therefore large contiguous areas of
>> memory can be created at any time.
>
> Another interesting thing is migration of pages with mapped DMA on
> them :-)
>
> Our IOMMUs support that, but there isn't a way to hook that up into
> Linux page migration that wouldn't suck massively at this point.
The HMM stuff Jerome Glisse is working on may be a suitable
framework to add callbacks for things like migration to.
--
All rights reversed
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 19:18 ` Stephen Hemminger
@ 2014-06-13 22:30 ` Christoph Lameter
0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 22:30 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: ksummit-discuss
On Fri, 13 Jun 2014, Stephen Hemminger wrote:
> > But again, I want both: a general stack that allows at least the data path
> > to go directly to the device. The metadata and connection management etc.
> > should be firmly in the hands of the kernel.
>
> There are several dataplane user-mode networking implementations that
> do this. The problem is you either have to overlap with every networking
> driver (netmap) or do the driver in userspace (DPDK).
The netmap stuff requires a system call for any sending and receiving, so
that does not work right. The driver-in-userspace approach does the device
control etc. in user space as well, which means the kernel does not police
the device.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 21:36 ` Benjamin Herrenschmidt
2014-06-13 22:23 ` Rik van Riel
@ 2014-06-13 23:04 ` Christoph Lameter
1 sibling, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-13 23:04 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: ksummit-discuss
On Sat, 14 Jun 2014, Benjamin Herrenschmidt wrote:
> > It is solvable if the objects are inherently movable. If every object
> > allocated provides a function that makes the object movable, then
> > defragmentation is possible and therefore large contiguous areas of
> > memory can be created at any time.
>
> Another interesting thing is migration of pages with mapped DMA on
> them :-)
>
> Our IOMMUs support that, but there isn't a way to hook that up into
> Linux page migration that wouldn't suck massively at this point.
Well, yes, that would require a major rethink. While we are at it we may
as well try to get more done. Maybe we can do that just for a limited
region within the existing memory management: something like OSv, cgroups
or cpusets that restricts it to certain nodes or cpus where we would allow
this to occur, while the rest still runs the standard kernel.
A kind of sidecar approach.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 17:02 ` Christoph Lameter
2014-06-13 21:36 ` Benjamin Herrenschmidt
@ 2014-06-14 1:19 ` Phillip Lougher
2014-06-16 14:04 ` Christoph Lameter
1 sibling, 1 reply; 30+ messages in thread
From: Phillip Lougher @ 2014-06-14 1:19 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ksummit-discuss
On 13/06/14 18:02, Christoph Lameter wrote:
> On Thu, 12 Jun 2014, Phillip Lougher wrote:
>
>>> 1. The need to use larger-order pages, and the resulting problems with
>>> fragmentation. Memory sizes grow, and with them the number of page structs
>>> in which state has to be maintained. Maybe there is something different? If
>>> we use hugepages then we have 511 useless page structs. Some apps need
>>> linear memory, where we have trouble and are creating numerous memory
>>> allocators (recently the new bootmem allocator and CMA, plus lots of
>>> specialized allocators in various subsystems).
>>>
>>
>> This was never solved to my knowledge; there is no panacea here.
>> Even in the 90s we had video subsystems wanting to allocate in units
>> of 1Mbyte, and others in units of 4k. The "solution" was so-called
>> split-level allocators, each specialised to deal with a particular
>> "first-class media", each giving back memory to the underlying
>> allocator when memory got tight in another specialised allocator.
>> Not much different to the ad-hoc solutions being adopted in Linux,
>> except the general idea was that each specialised allocator had the
>> same API.
>
> It is solvable if the objects are inherently movable. If every object
> allocated provides a function that makes the object movable, then
> defragmentation is possible and therefore large contiguous areas of
> memory can be created at any time.
>
>
>>> Can we develop the notion that subsystems own certain cores, so that their
>>> execution is restricted to a subset of the system, avoiding data
>>> replication and keeping subsystem data hot? I.e. have a device driver
>>> and the subsystems driving those devices run only on the NUMA node to
>>> which the PCI-E root complex is attached. Restricting to a NUMA node
>>> reduces data-locality complexity and increases performance due to
>>> cache-hot data.
>>
>> Lots of academic hot air was expended here when designing distributed
>> systems which could scale seamlessly across heterogeneous CPUs connected
>> via different levels of interconnects (bus, ATM, ethernet etc.), zoning,
>> migration, replication etc. The "solution" is probably out there somewhere,
>> forgotten about.
>
> We have the issue with homogeneous cpus due to the proliferation of cores
> on processors now. Maybe that is solvable?
>
>> Case in point: many years ago I was the lead Linux guy for a company
>> designing a SOC for digital TV. Just before I left I had an interesting
>> "conversation" with the chief hardware guy of the team who designed the SOC.
>> Turns out they'd budgeted for the RAM bandwidth needed to decode a typical
>> MPEG stream, but they'd not reckoned on all the memcopies Linux needs to do
>> between its "separate address space" processes. He'd been used to embedded
>> OSes which run in a single address space.
>
> Well, maybe that is appropriate for some processes? And we could carve out
> subsections of the hardware where single-address-space stuff is possible?
>
Apologies, maybe what I was trying to say wasn't clear :) I wasn't arguing
against it, but rather asking whether we should be trying to do this at
the Linux kernel level.
Embedded systems have long had the need to carve out (mainly heterogeneous)
processors from Linux. Media systems have VLIW media processors (e.g. the
Philips TriMedia), and mobile phones typically have separate baseband
processors. This is done without any core support from the kernel:
just write a device driver that presents a programming & I/O channel
to the carved-out hardware.
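As a minimal sketch of that pattern (coproc_send() is a hypothetical
stand-in for the real hardware mailbox, and the device name is made up;
the rest is the stock misc-device API), such a driver can be little
more than:

#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/uaccess.h>

static ssize_t coproc_write(struct file *f, const char __user *buf,
                            size_t len, loff_t *off)
{
        char cmd[64];

        if (len > sizeof(cmd))
                len = sizeof(cmd);
        if (copy_from_user(cmd, buf, len))
                return -EFAULT;
        /* coproc_send(cmd, len): hand the command over to the
         * carved-out processor's mailbox */
        return len;
}

static const struct file_operations coproc_fops = {
        .owner = THIS_MODULE,
        .write = coproc_write,
};

static struct miscdevice coproc_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "coproc",
        .fops  = &coproc_fops,
};

static int __init coproc_init(void)
{
        return misc_register(&coproc_dev);
}

static void __exit coproc_exit(void)
{
        misc_deregister(&coproc_dev);
}

module_init(coproc_init);
module_exit(coproc_exit);
MODULE_LICENSE("GPL");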
Additionally, where the Linux kernel has been too heavyweight, with its
slow real-time response and/or expensive paged multi-address spaces, the
solution has often been to use a nano-kernel like ADEOS or RTLinux,
running Linux as a separate OS and leaving scope to run lighter-weight
real-time single-address-space operating systems in parallel.
In other words, if we need more efficiency, do it outside of Linux rather
than trying to rewrite the strong protection model in Linux. That way
leads to pain.
My point about the hardware engineer is that people can't have their cake
and eat it. Unix/Linux has been successful partly because of its
strong protection/paged model. It is difficult to be both secure
and efficient. If you want both, then you need to design
it into the operating system from the outset. Linux isn't a good
place to start.
Phillip
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-13 18:41 ` Christoph Lameter
@ 2014-06-16 11:39 ` Thomas Petazzoni
2014-06-16 14:05 ` Christoph Lameter
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Petazzoni @ 2014-06-16 11:39 UTC (permalink / raw)
To: Christoph Lameter; +Cc: James Bottomley, ksummit-discuss
Dear Christoph Lameter,
On Fri, 13 Jun 2014 13:41:12 -0500 (CDT), Christoph Lameter wrote:
> > The point here is that lots of people say "just get your operating
> > system out of my way", but most realise they actually didn't mean it
> > when presented with the reality.
>
> Right. Exactly. What I would like to see is the OS doing its part to make
> things nice and provide a convenient abstraction of the ugly details.
>
> > The abstractions most people who say this want are a zero delay data
> > path with someone else taking care of all of the metadata and setup
> > problems ... effectively a MPI type interface. Is that what you're
> > looking for, Christoph?
>
> Ideally the setup/metadata should be handled by the OS while the data
> path goes direct. The get-out-of-the-way piece is restricted to the
> performance-critical portion, which is the actual data transfer.
I might be completely off topic here, but this sounds very much like
what happens for graphics. There is a DRM/KMS kernel side, which
does all the mode setting, context allocation and things like that, and
then all the rest takes place in userspace, using hardware-specific
pieces of code in libdrm and other components of the graphics stack.
If we translate that to networking, there would be a need to have all
of the setup/initialization done in the kernel, and then some
hardware-specific userspace libraries to use for the data path.
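For illustration, the userspace side of that split can be as small as
this (a hedged sketch, assuming a DRM-capable device at /dev/dri/card0
and libdrm installed; link with -ldrm): the kernel owns setup and
resources, userspace merely enumerates them through the common API.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int main(void)
{
        drmModeRes *res;
        int fd = open("/dev/dri/card0", O_RDWR);

        if (fd < 0)
                return 1;

        /* setup/metadata path: ask the DRM/KMS kernel side */
        res = drmModeGetResources(fd);
        if (res) {
                printf("%d connectors, %d crtcs\n",
                       res->count_connectors, res->count_crtcs);
                drmModeFreeResources(res);
        }
        close(fd);
        return 0;
}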
Thomas
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-14 1:19 ` Phillip Lougher
@ 2014-06-16 14:04 ` Christoph Lameter
0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-16 14:04 UTC (permalink / raw)
To: Phillip Lougher; +Cc: ksummit-discuss
On Sat, 14 Jun 2014, Phillip Lougher wrote:
> Embedded systems have long had the need to carve out (mainly heterogeneous)
> processors from Linux. Media systems have VLIW media processors (e.g. the
> Philips TriMedia), and mobile phones typically have separate baseband
> processors. This is done without any core support from the kernel:
> just write a device driver that presents a programming & I/O channel
> to the carved-out hardware.
Well, but this is bad, because kernel services may be needed by these
carved-out processors. If the kernel supported this, then life would be
much easier for you.
> Additionally, where the Linux kernel has been too heavyweight, with its
> slow real-time response and/or expensive paged multi-address spaces, the
> solution has often been to use a nano-kernel like ADEOS or RTLinux,
> running Linux as a separate OS and leaving scope to run lighter-weight
> real-time single-address-space operating systems in parallel.
Having hardware and software that is handled by two different OSes is
pretty complex. Shoving something like that into the Linux kernel should
be pretty easy because most of the infrastructure is already there.
> My point about the hardware engineer is that people can't have their cake
> and eat it. Unix/Linux has been successful partly because of its
> strong protection/paged model. It is difficult to be both secure
> and efficient. If you want both, then you need to design
> it into the operating system from the outset. Linux isn't a good
> place to start.
I think we can if we allow cores to run with simplified support and
reduced overhead.
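Some of the plumbing already exists. A minimal sketch, assuming cpu 3
was fenced off at boot with isolcpus=3 so the scheduler places nothing
else there: a process can dedicate itself to the carved-out core with
sched_setaffinity(); what is missing is the simplified, reduced-overhead
kernel mode for such cores.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(3, &set);       /* the core isolated at boot */

        /* pid 0 means this process; from here on it runs on cpu 3 only */
        if (sched_setaffinity(0, sizeof(set), &set)) {
                perror("sched_setaffinity");
                return 1;
        }
        return 0;
}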
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-16 11:39 ` Thomas Petazzoni
@ 2014-06-16 14:05 ` Christoph Lameter
2014-06-16 14:09 ` Thomas Petazzoni
0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2014-06-16 14:05 UTC (permalink / raw)
To: Thomas Petazzoni; +Cc: James Bottomley, ksummit-discuss
On Mon, 16 Jun 2014, Thomas Petazzoni wrote:
> I might be completely off topic here, but this sounds very much like
> what happens for graphics. There is a DRM/KMS kernel side, which
> does all the mode setting, context allocation and things like that, and
> then all the rest takes place in userspace, using hardware-specific
> pieces of code in libdrm and other components of the graphics stack.
I thought about that too.
> If we translate that to networking, there would be a need to have all
> of the setup/initialization done in the kernel, and then some
> hardware-specific userspace libraries to use for the data path.
Well, ideally these would just be API-specific, in order to support
multiple devices.
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-16 14:05 ` Christoph Lameter
@ 2014-06-16 14:09 ` Thomas Petazzoni
2014-06-16 14:28 ` Christoph Lameter
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Petazzoni @ 2014-06-16 14:09 UTC (permalink / raw)
To: Christoph Lameter; +Cc: James Bottomley, ksummit-discuss
Dear Christoph Lameter,
On Mon, 16 Jun 2014 09:05:31 -0500 (CDT), Christoph Lameter wrote:
> On Mon, 16 Jun 2014, Thomas Petazzoni wrote:
>
> > I might be completely off topic here, but this sounds very much like
> > what happens for graphics. There is a DRM/KMS kernel side, which
> > does all the mode setting, context allocation and things like that, and
> > then all the rest takes place in userspace, using hardware-specific
> > pieces of code in libdrm and other components of the graphics stack.
>
> I thought about that too.
>
> > If we translate that to networking, there would be a need to have all
> > of the setup/initialization done in the kernel, and then some
> > hardware-specific userspace libraries to use for the data path.
>
> Well, ideally these would just be API-specific, in order to support
> multiple devices.
Well, my understanding is that libdrm exposes one API, but internally
has support for various graphics hardware. Same for OpenGL: a unified,
normalized API that applications can rely on, and pure user-space
implementations that know about the hardware details.
Thomas
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
* Re: [Ksummit-discuss] [CORE TOPIC] Redesign Memory Management layer and more core subsystem
2014-06-16 14:09 ` Thomas Petazzoni
@ 2014-06-16 14:28 ` Christoph Lameter
0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-06-16 14:28 UTC (permalink / raw)
To: Thomas Petazzoni; +Cc: James Bottomley, ksummit-discuss
On Mon, 16 Jun 2014, Thomas Petazzoni wrote:
> Well, my understanding is that libdrm exposes one API, but internally
> has support for various graphics hardware. Same for OpenGL: a unified,
> normalized API that applications can rely on, and pure user-space
> implementations that know about the hardware details.
OK, then we would need to come up with an API for NICs and storage that
allows user space to determine the hardware and use the correct logic.
The same approach is used in the Infiniband subsystem. However, this means
that device-driver-like code is distributed separately from the kernel.
There are separate ibverbs, ibrdma etc. trees, and it's an issue to keep
the in-kernel portions in sync with the userspace code.
Ideally these would go together and be modified by patches that change
both the kernel portion and the userspace portion.
So maybe add a directory for userspace driver code to the kernel?
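For reference, the Infiniband pattern looks like this from userspace (a
minimal sketch using libibverbs, assuming its headers are installed;
link with -libverbs): one device-independent API, with the
hardware-specific logic hidden in provider libraries behind it.

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        int i, num;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list)
                return 1;

        for (i = 0; i < num; i++) {
                /* each device is served by a hardware-specific userspace
                 * provider (mlx4, cxgb4, ...) behind the common verbs API */
                printf("device: %s\n", ibv_get_device_name(list[i]));
        }
        ibv_free_device_list(list);
        return 0;
}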