Re: large page patch (fwd) (fwd)


* Re: large page patch (fwd) (fwd)
       [not found] <Pine.LNX.4.33.0208021252090.2466-100000@penguin.transmeta.com>
@ 2002-08-02 23:54 ` Martin J. Bligh
  2002-08-03  0:35   ` Andrew Morton
  0 siblings, 1 reply; 5+ messages in thread
From: Martin J. Bligh @ 2002-08-02 23:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hubertus Franke, wli, gh, akpm, swj, linux-mm mailing list

>> Let me then turn the tables. Have you looked at our patch for 2.4.18?
>> It doesn't add anything to the hot path either, if (vma->pg_order == 0).
>> Period.
> 
> Nobody has forwarded the patch, and I've seen no discussion of it on the
> kernel mailing lists.
> 
> Guess what the answer is?
> 
> Is it 10 lines of code in the VM subsystem?

No, and you're not going to like the patch in its current incarnation by
the sound of it. So, having listened to your objections, we're going to
take a slightly different course - we will prepare a minimal version of
the patch with very low impact on the core VM code, but using more
standard interfaces to access it (e.g. the shmem method you outlined
earlier). It'll have a little less functionality, but so be it.

There are other apps apart from Oracle that want the ability to use large
pages (e.g. DB2 and Java), and it seems that most of those want them for
anonymous mmap or shmem. If we can provide an interface that's more
standard, it'll make people's porting much easier. IBM Research has done
some significant benchmarking of large page support in a variety of
applications, and has seen a 20-40% performance boost for Java and a
6-22% improvement for the SPEC CPU2000 set of tests. For the full
details, see the OLS paper at:
http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz
Moreover, we need large pages to reduce PTE consumption in a variety
of applications using shared memory, especially given the additional
overhead of rmap.
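
To put a rough number on the PTE consumption point, here is a
back-of-the-envelope sketch. The figures are assumptions chosen purely for
illustration, not measurements from the patch or the paper: a 1 GiB shared
segment, 1000 attached processes each with its own page tables, 4-byte
entries, and non-PAE i386 page sizes (4 KiB base, 4 MiB large). The function
names are hypothetical, only leaf entries are counted, and rmap overhead is
left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bytes of leaf page-table entries one process needs to map `bytes` of
 * memory using pages of `page_size` bytes and `pte_size`-byte entries.
 * Upper-level tables and rmap bookkeeping are deliberately ignored. */
uint64_t pte_bytes(uint64_t bytes, uint64_t page_size, uint64_t pte_size)
{
    assert(page_size > 0);
    uint64_t entries = (bytes + page_size - 1) / page_size; /* round up */
    return entries * pte_size;
}

/* Total when `nprocs` processes each map the same segment through
 * their own page tables (no page-table sharing assumed). */
uint64_t total_pte_bytes(uint64_t bytes, uint64_t page_size,
                         uint64_t pte_size, unsigned nprocs)
{
    return pte_bytes(bytes, page_size, pte_size) * nprocs;
}

/* Print the comparison for the illustrative figures named above. */
void print_pte_comparison(void)
{
    const uint64_t seg = 1ULL << 30;        /* 1 GiB segment (assumed) */
    const uint64_t base_page = 4ULL << 10;  /* 4 KiB base page */
    const uint64_t large_page = 4ULL << 20; /* 4 MiB large page, non-PAE */
    const unsigned nprocs = 1000;           /* attached processes (assumed) */

    printf("4 KiB pages: %llu bytes of entries\n",
           (unsigned long long)total_pte_bytes(seg, base_page, 4, nprocs));
    printf("4 MiB pages: %llu bytes of entries\n",
           (unsigned long long)total_pte_bytes(seg, large_page, 4, nprocs));
}
```

Under those assumptions each process needs 1 MiB of entries with 4 KiB pages
but only 1 KiB with 4 MiB pages, so the saving scales with the number of
processes attached to the segment.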

We should have this available in a few days - if you could hold off 
until then, we should be able to do an objective comparison? I believe
we can make something that's acceptable to you.

Thanks,

Martin.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/


* Re: large page patch (fwd) (fwd)
  2002-08-02 23:54 ` large page patch (fwd) (fwd) Martin J. Bligh
@ 2002-08-03  0:35   ` Andrew Morton
  2002-08-03  1:26     ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2002-08-03  0:35 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Linus Torvalds, Hubertus Franke, wli, swj, linux-mm mailing list

"Martin J. Bligh" wrote:
> 
> >> Let me then turn the tables. Have you looked at our patch for 2.4.18?
> >> It doesn't add anything to the hot path either, if (vma->pg_order == 0).
> >> Period.
> >
> > Nobody has forwarded the patch, and I've seen no discussion of it on the
> > kernel mailing lists.
> >
> > Guess what the answer is?
> >
> > Is it 10 lines of code in the VM subsystem?
> 
> No, and you're not going to like the patch in its current incarnation by
> the sound of it. So, having listened to your objections, we're going to
> take a slightly different course - we will prepare a minimal version of
> the patch with very low impact on the core VM code, but using more
> standard interfaces to access it (e.g. the shmem method you outlined
> earlier). It'll have a little less functionality, but so be it.

Remind me again what's wrong with wrapping the Intel syscalls
inside malloc() and then maybe grafting a little hook into the shm code?

>...
> We should have this available in a few days - if you could hold off
> until then, we should be able to do an objective comparison? I believe
> we can make something that's acceptable to you.

More than a few days.  The patch which went around isn't Rohit's
latest, and it hasn't even been tested in 2.5 and we're considering
replacing the shm key with an fd, and...

* Re: large page patch (fwd) (fwd)
  2002-08-03  0:35   ` Andrew Morton
@ 2002-08-03  1:26     ` Linus Torvalds
  2002-08-03  4:26       ` Gerrit Huizenga
  0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2002-08-03  1:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin J. Bligh, Hubertus Franke, wli

On Fri, 2 Aug 2002, Andrew Morton wrote:
>
> Remind me again what's wrong with wrapping the Intel syscalls
> inside malloc() and then maybe grafting a little hook into the shm code?

Indeed.

However, don't think "Intel syscalls", think instead "bring out the
architecture-defined mapping features". In particular, the main objection
I had to Ingo's patch (which, by the sound of it, is fairly similar to the
IBM patches which I haven't seen) was that it was much too Intel-centric.

I admit to being x86-centric when it comes to implementation (simply due
to the fact that they are cheap and everywhere), but I try very hard to
avoid making _design_ revolve around x86. In particular, while I'm not a
big fan of the PPC hash tables (understatement of the year), I _do_ like
the BAT mapping that PPC has.

(Alternatively, if you aren't familiar with BAT registers, think
software-filled extra TLB entries that are outside the normal fill policy
and have large sizes. For some architectures it makes sense to do this at
sw TLB fill time, for others that isn't very practical because the page
table lookup is fixed in various ways.)

This is sometimes also referred to as "superpages".

And I think people will find the "separate path" approach more palatable
if you think of it as an interface to BAT registers (with the "normal" VM
path being the interface to the regular page tables), keeping very
much in mind that on some CPUs these two things really _are_ totally
separate (PPC being the best example).

The fact that on x86, which doesn't have a BAT array, we use the
PMD-spanning "large pages" instead, should be seen as the anomaly, not as
the design case.

This also hopefully explains why I consider anything that touches or cares
about page tables in generic VM code wrt the largepage support to be
fundamentally broken. If the largepage patch messes around with page
tables, it cannot be generic.

		Linus


* Re: large page patch (fwd) (fwd)
  2002-08-03  1:26     ` Linus Torvalds
@ 2002-08-03  4:26       ` Gerrit Huizenga
  2002-08-03  4:39         ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Gerrit Huizenga @ 2002-08-03  4:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Martin J. Bligh, Hubertus Franke, wli, swj,
	linux-mm mailing list

In message <Pine.LNX.4.44.0208021757490.2210-100000@home.transmeta.com>,
Linus Torvalds writes:
> 
> 
> On Fri, 2 Aug 2002, Andrew Morton wrote:
> >
> > Remind me again what's wrong with wrapping the Intel syscalls
> > inside malloc() and then maybe grafting a little hook into the shm code?
> 
> Indeed.

Do you really want all calls to malloc to allocate non-pageable
memory?  And I doubt that this memory will be pageable in time for
2.5.

> However, don't think "Intel syscalls", think instead "bring out the
> architecture-defined mapping features". In particular, the main objection
> I had to Ingo's patch (which, by the sound of it, is fairly similar to the
> IBM patches which I haven't seen) was that it was much too Intel-centric.

The IBM patch (Simon Winwood's work) was first done for PPC64 and then
ported at my insistence to IA32, since we had an immediate need and
an opportunity to do some specific application porting work on IA32.
The patch was intended to be both architecture neutral and to support
multiple page sizes.  In the interest of hitting the Halloween deadline,
we believe that dropping back for the moment to IA32, pinned,
mmap()/madvise()/shm*() versions, possibly gated by a capability (or not;
that is easily debatable and I doubt it matters much), will get at least
IBM apps on IA32 through the lifetime of 2.6.  It should also leave the
framework in place such that PPC64 can easily fit in, possibly pre-freeze,
possibly post-freeze, with mostly arch-specific mods.

> I admit to being x86-centric when it comes to implementation (simply due
> to the fact that they are cheap and everywhere), but I try very hard to
> avoid making _design_ revolve around x86. In particular, while I'm not a
> big fan of the PPC hash tables (understatement of the year), I _do_ like
> the BAT mapping that PPC has.

We folks in the LTC have much the same interest.  In addition to the
obvious IA32/PPC32/PPC64/zSeries/IA64/AMD issues (keep in mind we probably
sell more servers with PPC than with IA32 ;-), we have software products
which run on nearly every platform and every distro in existence.  So,
we too try to qualify most of our work on its potential application to
multiple architectures.

> (Alternatively, if you aren't familiar with BAT registers, think
> software-filled extra TLB entries that are outside the normal fill policy
> and have large sizes. For some architectures it makes sense to do this at
> sw TLB fill time, for others that isn't very practical because the page
> table lookup is fixed in various ways.)

From what I've heard from the Watson Research experts on PPC64, BAT
registers are actually a bad idea for this and AIX is slowly removing
its dependency on BAT registers.  I'd be interested in a read from
Anton or Paul Mackerras or even the Research folks involved in the
chip design.

And, we are doing everything possible to at least provide code to
demonstrate the solutions we are talking about.  It just may take
a few days to get it properly accelerated.  ;-)

gerrit

* Re: large page patch (fwd) (fwd)
  2002-08-03  4:26       ` Gerrit Huizenga
@ 2002-08-03  4:39         ` Linus Torvalds
  0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2002-08-03  4:39 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Andrew Morton, Martin J. Bligh, Hubertus Franke, wli, swj,
	linux-mm mailing list


On Fri, 2 Aug 2002, Gerrit Huizenga wrote:

> In message <Pine.LNX.4.44.0208021757490.2210-100000@home.transmeta.com>,
> Linus Torvalds writes:
> >
> >
> > On Fri, 2 Aug 2002, Andrew Morton wrote:
> > >
> > > Remind me again what's wrong with wrapping the Intel syscalls
> > > inside malloc() and then maybe grafting a little hook into the shm code?
> >
> > Indeed.
>
> Do you really want all calls to malloc to allocate non-pageable
> memory?  And I doubt that this memory will be pageable in time for
> 2.5.

No, I'm saying that you can do the SHM_LARGEPAGE bit testing in user space
if you want to.

And obviously it will only succeed for root or similar user anyway.
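
As a rough illustration of what doing the test in user space could look
like, here is a minimal sketch of an allocator wrapper. It is not taken from
any of the patches in this thread. The SHM_LARGEPAGE flag belongs to a patch
that is not shown here, so the sketch assumes the SHM_HUGETLB flag of later
mainline Linux kernels as a stand-in, and the lp_* names are hypothetical.
The wrapper asks for a large-page SysV segment first, then an ordinary
segment, then plain malloc(), so a caller without the needed privilege or
without reserved large pages still gets memory.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000 /* Linux value, used only if the headers omit it */
#endif

/* Which backing the wrapper ended up with (hypothetical names). */
enum lp_kind { LP_LARGE_SHM, LP_NORMAL_SHM, LP_MALLOC };

struct lp_buf {
    void *addr;
    enum lp_kind kind;
};

/* Create and attach a private SysV segment with `extra` shmget flags.
 * Returns the mapped address, or NULL on any failure. */
static void *try_shm(size_t size, int extra)
{
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600 | extra);
    if (id < 0)
        return NULL;
    void *p = shmat(id, NULL, 0);
    /* Mark for removal now; the segment is destroyed on last detach. */
    shmctl(id, IPC_RMID, NULL);
    return p == (void *)-1 ? NULL : p;
}

/* Try large pages, then normal shm, then malloc. Returns 0 on success. */
int lp_alloc(struct lp_buf *b, size_t size)
{
    if ((b->addr = try_shm(size, SHM_HUGETLB)) != NULL) {
        b->kind = LP_LARGE_SHM;
    } else if ((b->addr = try_shm(size, 0)) != NULL) {
        b->kind = LP_NORMAL_SHM;
    } else if ((b->addr = malloc(size)) != NULL) {
        b->kind = LP_MALLOC;
    } else {
        return -1;
    }
    return 0;
}

/* Release a buffer according to how it was obtained. */
void lp_free(struct lp_buf *b)
{
    if (b->kind == LP_MALLOC)
        free(b->addr);
    else
        shmdt(b->addr);
    b->addr = NULL;
}
```

The caller can inspect the kind field to see whether large pages were
actually granted; none of the fallback policy has to live in the kernel.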

But hey, the proof is in the pudding. If you guys can come up with a
better scheme that does not pollute the VM paths and has better semantics,
I don't think anybody will complain.

		Linus

