linux-mm.kvack.org archive mirror
* [bigmem-patch] 4GB with Linux on IA32
@ 1999-08-16 16:29 Andrea Arcangeli
  1999-08-16 16:48 ` Matthew Wilcox
  1999-08-16 18:43 ` Kanoj Sarcar
  0 siblings, 2 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-16 16:29 UTC (permalink / raw)
  To: Linus Torvalds, Stephen C. Tweedie, Kanoj Sarcar
  Cc: Wichert, Gerhard, Gerhard, Winfried, linux-kernel, linux-mm

This patch was co-developed by SuSE and Siemens (specifically, by me at SuSE
and Gerhard Wichert at Siemens).

Object of the patch:

	Allow close to 4giga of memory to be used as anonymous and shm
	memory on IA32.

Performance degradation:

	Close to zero.

Missing feature:

	The page/buffer cache (and so all shared/private non-anonymous
	mappings) can only grow to close to 4giga-PAGE_OFFSET bytes of
	RAM (2giga if CONFIG_2G is selected, 1giga if CONFIG_1G is
	selected).
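
The limits above are just 4giga minus PAGE_OFFSET. A minimal sketch of that arithmetic follows; the two PAGE_OFFSET constants are assumptions chosen to match the 1giga and 2giga figures stated above, and the function names are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values: the mail only states the resulting limits, not these constants. */
#define PAGE_OFFSET_1G 0xC0000000ULL /* hypothetical: CONFIG_1G, 3giga user / 1giga kernel */
#define PAGE_OFFSET_2G 0x80000000ULL /* hypothetical: CONFIG_2G, 2giga user / 2giga kernel */

#define FOUR_GIGA (1ULL << 32)

/* Upper bound on directly mapped RAM, and therefore on page/buffer cache size. */
static uint64_t cache_limit_bytes(uint64_t page_offset)
{
	assert(page_offset < FOUR_GIGA);
	return FOUR_GIGA - page_offset;
}

/* Print both limits in megabytes. */
static void print_cache_limits(void)
{
	printf("CONFIG_1G: %llu MB\n",
	       (unsigned long long)(cache_limit_bytes(PAGE_OFFSET_1G) >> 20));
	printf("CONFIG_2G: %llu MB\n",
	       (unsigned long long)(cache_limit_bytes(PAGE_OFFSET_2G) >> 20));
}
```

Calling print_cache_limits() from a small main() shows the two bounds; the real limit is slightly lower ("close to"), since part of the kernel address space is needed for other things.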

Implementation details:

	Basically we allow GFP to return addresses without a valid
	virtual->physical mapping. Such pages are the bigmem pages, and they
	have a valid page_map just like regular pages. The bigmem
	pages have the PG_BIGMEM bitflag set in the page->flags field.
	The bigmem pages are completely equivalent to regular pages,
	with the only difference that we can't access them by simply touching
	the virtual address returned by GFP. So to do COW or clear_page with
	bigmem pages we first need to create a proper virt-to-phys mapping
	in the fixmap area, and then we read or write the phys-page
	by reading or writing the virt-fixmap area. After the COW or
	after the shm/anonymous allocation the physical page will be mapped
	in the userspace pte and there won't be any difference for
	userspace between bigmem and regular pages.

	The only tiny performance degradation will be in the
	page-fault handler: a check for the bigmem page and, if the page is a
	bigmem page, a remap of the fixmap pte and an invlpg of the fixmap
	virtual address. I believe this little performance degradation
	will not even be noticeable.
	And once the allocation is complete there won't be any
	performance degradation at all.
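
The map-then-access flow described above can be modelled in plain userspace C. This is only a toy sketch under stated assumptions: every toy_* name is hypothetical, the "fixmap" is a two-slot pointer table, and a counter stands in for invlpg. It is not the interface of the patch (where kmap takes an address and a KM_* type); it only illustrates that regular pages take a fast path while bigmem pages pay one remap per access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_NR_PAGES  8
#define TOY_PG_BIGMEM 0x1u /* stands in for PG_BIGMEM */

enum toy_km_type { TOY_KM_READ, TOY_KM_WRITE, TOY_KM_NR };

struct toy_page {
	unsigned int flags;
	unsigned char *direct; /* permanent mapping; NULL for bigmem pages */
};

static unsigned char toy_phys[TOY_NR_PAGES][TOY_PAGE_SIZE]; /* "physical" RAM */
static struct toy_page toy_mem_map[TOY_NR_PAGES];
static unsigned char *toy_fixmap[TOY_KM_NR]; /* one "pte" per fixmap slot */
static unsigned long toy_invlpg_count;       /* counts simulated TLB flushes */

/* Pages [0, nr_regular) are regular, the rest are bigmem. */
static void toy_init(int nr_regular)
{
	for (int i = 0; i < TOY_NR_PAGES; i++) {
		int big = i >= nr_regular;
		toy_mem_map[i].flags = big ? TOY_PG_BIGMEM : 0;
		toy_mem_map[i].direct = big ? NULL : toy_phys[i];
		memset(toy_phys[i], 0xAA, TOY_PAGE_SIZE);
	}
	for (int t = 0; t < TOY_KM_NR; t++)
		toy_fixmap[t] = NULL;
	toy_invlpg_count = 0;
}

/* Return an address through which page idx can be accessed. */
static unsigned char *toy_kmap(int idx, enum toy_km_type type)
{
	struct toy_page *p = &toy_mem_map[idx];

	if (!(p->flags & TOY_PG_BIGMEM))
		return p->direct;             /* fast path: already mapped */
	assert(toy_fixmap[type] == NULL);     /* the slot must be free */
	toy_fixmap[type] = toy_phys[idx];     /* "remap the fixmap pte" */
	toy_invlpg_count++;                   /* "invlpg the fixmap address" */
	return toy_fixmap[type];
}

static void toy_kunmap(int idx, enum toy_km_type type)
{
	if (toy_mem_map[idx].flags & TOY_PG_BIGMEM)
		toy_fixmap[type] = NULL;
}

/* clear_page through a temporary mapping. */
static void toy_clear_page(int idx)
{
	unsigned char *v = toy_kmap(idx, TOY_KM_WRITE);

	memset(v, 0, TOY_PAGE_SIZE);
	toy_kunmap(idx, TOY_KM_WRITE);
}

/* COW copy: the source uses the read slot, the destination the write slot. */
static void toy_cow_copy(int dst_idx, int src_idx)
{
	unsigned char *src = toy_kmap(src_idx, TOY_KM_READ);
	unsigned char *dst = toy_kmap(dst_idx, TOY_KM_WRITE);

	memcpy(dst, src, TOY_PAGE_SIZE);
	toy_kunmap(dst_idx, TOY_KM_WRITE);
	toy_kunmap(src_idx, TOY_KM_READ);
}
```

In this model clearing a regular page never touches the fixmap, while a COW between two bigmem pages costs two remaps, which is the whole extra cost the paragraph above is talking about.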

	The reason we don't allow the bigmem pages to live in the cache is
	that the cache must be read and written by the buffer/page
	cache code and by the low-level block device code. Allowing
	bigmem pages to live in the buffer/page/swap cache would be
	possible, but we would have to change lots of common kernel code.
	Since we can already grow the cache up to 2giga of RAM (with
	CONFIG_2G) while at the same time running with 2giga of
	RAM allocated in shm or malloced memory, I don't believe it
	is worth changing all such code and adding further complexity and
	performance degradation to the I/O layer.

	To solve the swapout/swapin of bigmem pages I remap the bigmem page
	into a regular page, or I replace the swapped-in regular page
	with a bigmem page when necessary. At the same time as I alloc a
	page, I also release another page. If a page is not available
	in the freelist then the swapout returns as unsuccessful
	and we continue trying to swap out or unmap some other page in the
	process space. The swapout/swapin of bigmem pages will be a
	bit slower than the swapin/swapout of regular pages, but since I/O is
	almost always far slower than memory I believe that even this
	swapin/swapout performance degradation won't be an issue at all
	(almost sure if the swap-blockdevice is DMA driven ;).
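
The swap remap trick can be sketched as a toy page pool in userspace C. All sw_* names are hypothetical and the pool is only a model, assuming a fixed set of regular and bigmem pages; it shows the three behaviours described above: a bigmem page is copied into a freshly allocated regular page before I/O, the old page is released at the same time, and the operation fails (so the caller can try another page) when no regular page is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SW_PAGE_SIZE 4096
#define SW_NR_PAGES  6

struct sw_page {
	bool bigmem;
	bool in_use;
	unsigned char data[SW_PAGE_SIZE];
};

static struct sw_page sw_pool[SW_NR_PAGES];

/* Pages [0, nr_regular) are regular, the rest are bigmem; all start free. */
static void sw_init(int nr_regular)
{
	for (int i = 0; i < SW_NR_PAGES; i++) {
		sw_pool[i].bigmem = i >= nr_regular;
		sw_pool[i].in_use = false;
		memset(sw_pool[i].data, 0, SW_PAGE_SIZE);
	}
}

/* Return the index of a free page of the requested kind, or -1 if none. */
static int sw_alloc(bool want_bigmem)
{
	for (int i = 0; i < SW_NR_PAGES; i++) {
		if (!sw_pool[i].in_use && sw_pool[i].bigmem == want_bigmem) {
			sw_pool[i].in_use = true;
			return i;
		}
	}
	return -1;
}

static void sw_free(int idx)
{
	sw_pool[idx].in_use = false;
}

/*
 * Prepare page idx for swapout. A regular page is returned unchanged.
 * A bigmem page is copied into a new regular page and then released.
 * Returns -1 if no regular page is free; the bigmem page stays in use.
 */
static int sw_prepare_swapout(int idx)
{
	if (!sw_pool[idx].bigmem)
		return idx;

	int reg = sw_alloc(false);
	if (reg < 0)
		return -1;
	memcpy(sw_pool[reg].data, sw_pool[idx].data, SW_PAGE_SIZE);
	sw_free(idx);
	return reg;
}

/*
 * After swapin into regular page idx, move the data to a bigmem page if
 * one is free, releasing the regular page. Otherwise keep the regular page.
 */
static int sw_finish_swapin(int idx)
{
	if (sw_pool[idx].bigmem)
		return idx;

	int big = sw_alloc(true);
	if (big < 0)
		return idx;
	memcpy(sw_pool[big].data, sw_pool[idx].data, SW_PAGE_SIZE);
	sw_free(idx);
	return big;
}
```

The extra cost per swapped page in this model is one page copy, which is small next to the disk I/O that follows.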

How to use the patch:

1)	Grab and extract the 2.3.13 kernel.
	(ftp://ftp.kernel.org/pub/linux/kernel/v2.3/linux-2.3.13.tar.gz)
2)	Apply the attached patch on top of it.
3)	Configure the kernel with CONFIG_BIGMEM enabled.
4)	Recompile, install the new binary kernel image, reboot and enjoy ;).

CONFIG_1G/CONFIG_2G settings:

o	If you want to allow a task to grow up to 3giga of shm or
	anonymous virtual memory then select CONFIG_1G (the remaining giga
	of RAM can still be used by the other tasks, of course).
	NOTE: Selecting CONFIG_1G will allow you to alloc only 1giga of
	RAM as cache, and so as private/shared mmaps.
o	If you want to alloc up to 2giga of RAM in cache then select
	CONFIG_2G. But then the maximum virtual size of a task will be
	limited to 2giga (the other 2giga of RAM can be used by other
	processes as usual).

Testing:

	I personally did most of the testing with 32mbyte of RAM 8-). To
	test the bigmem code on low-memory machines you simply need to
	set CONFIG_BIGMEM and recompile: if your machine has less than
	1giga of RAM, part of your memory will be considered
	bigmemory even if it could have a valid virtual-physical mapping
	inside the 4mbyte kernel pagetables. So even if you have low-memory
	machines you'll be able to test the code equally well.

	BTW, the patch has some debugging code enabled, so if you are
	going to run precise benchmarks please #undef KMAP_DEBUG in
	include/asm-i386/bigmem.h .

	Of course the code has also been tested on amazing 4giga
	hardware:

>2.3.13 with bigmem patch is running on the 4GB machine.
>Meminfo after boot shows
>
>        total:    used:    free:  shared: buffers:  cached:
>Mem:  4079742976 118927360 3960815616        0 12242944 66252800
>Swap: 134209536        0 134209536
>MemTotal:   3984124 kB
>MemFree:    3867984 kB
>MemShared:        0 kB
>Buffers:      11956 kB
>Cached:       64700 kB
>BigTotal:   3128320 kB
>BigFree:    3114564 kB
>SwapTotal:   131064 kB
>SwapFree:    131064 kB
>
>and after launching 55 "animate dna.miff" (did you ever do this on a linux
>machine? ;))
>
>        total:    used:    free:  shared: buffers:  cached:
>Mem:  4079742976 3357941760 721801216        0 12255232 68075520
>Swap: 134209536        0 134209536
>MemTotal:   3984124 kB
>MemFree:     704884 kB
>MemShared:        0 kB
>Buffers:      11968 kB
>Cached:       66480 kB
>BigTotal:   3128320 kB
>BigFree:          0 kB
>SwapTotal:   131064 kB
>SwapFree:    131064 kB
>
>Gerhard.
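
As a quick sanity check of the two meminfo dumps above, the figures can be cross-checked with a few lines of C. This is only a sketch: the struct and function names are hypothetical, the numbers are copied from the dumps, and it assumes (as the totals suggest) that MemFree includes BigFree.

```c
#include <assert.h>
#include <stdint.h>

struct meminfo_sample {
	uint64_t total_bytes, used_bytes, free_bytes; /* "Mem:" line */
	uint64_t mem_free_kb, big_total_kb, big_free_kb;
};

/* Values copied from the dump taken after boot. */
static const struct meminfo_sample after_boot = {
	4079742976ULL, 118927360ULL, 3960815616ULL,
	3867984ULL, 3128320ULL, 3114564ULL
};

/* Values copied from the dump taken after launching the 55 "animate" jobs. */
static const struct meminfo_sample after_load = {
	4079742976ULL, 3357941760ULL, 721801216ULL,
	704884ULL, 3128320ULL, 0ULL
};

/* used + free must equal total, and the kB and byte views must agree. */
static int sample_is_consistent(const struct meminfo_sample *s)
{
	return s->used_bytes + s->free_bytes == s->total_bytes &&
	       s->mem_free_kb * 1024 == s->free_bytes &&
	       s->big_free_kb <= s->big_total_kb;
}

/* Bigmem currently handed out, in kB. */
static uint64_t bigmem_used_kb(const struct meminfo_sample *s)
{
	return s->big_total_kb - s->big_free_kb;
}

/* Free regular (non-bigmem) memory in kB, assuming MemFree includes BigFree. */
static uint64_t regular_free_kb(const struct meminfo_sample *s)
{
	assert(s->mem_free_kb >= s->big_free_kb);
	return s->mem_free_kb - s->big_free_kb;
}
```

Under that assumption, the second dump shows all 3128320 kB of bigmem in use while roughly 700 MB of regular memory is still free, i.e. the anonymous allocations were served from bigmem first.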

IMHO it would be nice if our bigmem patch were included in the
official 2.3.x. It doesn't need heavy common code changes, and we could
clean up the code still more by moving some code out of the #ifdef
CONFIG_BIGMEM.

Comments, questions, or incremental patches are welcome of course :).

Thanks.

Andrea

[Attachment: bigmem-2.3.13-L.gz (uuencoded, gzip-compressed patch; encoded payload omitted)]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/


* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 16:29 [bigmem-patch] 4GB with Linux on IA32 Andrea Arcangeli
@ 1999-08-16 16:48 ` Matthew Wilcox
  1999-08-16 17:19   ` Andrea Arcangeli
  1999-08-16 18:43 ` Kanoj Sarcar
  1 sibling, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 1999-08-16 16:48 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm

On Mon, Aug 16, 1999 at 06:29:30PM +0200, Andrea Arcangeli wrote:
> Performance degradation:
> 
> 	Close to zero.

Have you got some lmbench results to back this up?

-- 
Matthew Wilcox <willy@bofh.ai>
"Windows and MacOS are products, contrived by engineers in the service of
specific companies. Unix, by contrast, is not so much a product as it is a
painstakingly compiled oral history of the hacker subculture." - N Stephenson

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 16:48 ` Matthew Wilcox
@ 1999-08-16 17:19   ` Andrea Arcangeli
  0 siblings, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-16 17:19 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm

On Mon, 16 Aug 1999, Matthew Wilcox wrote:

>Have you got some lmbench results to back this up?

Does lmbench benchmark the _allocation_ of the memory? If so could you
point me to the exact lmbench command? (you would save me the time of
writing such a simple bench ;). I looked a bit at lmbench and it seems to
me that all mm tools are measuring the time _after_ the allocation
happened (so measuring the hardware bus/cache speed or page-colouring
algorithms and not the OS anonymous/shm page-fault time). But maybe I am
overlooking something?

All of bw_mem_rw/bw_mem_cp/bw_mem_rd are _useless_ for benchmarking the
bigmem patch since, as just said, once the allocation of memory is
completed the performance decrease will be _zero_ and not only close to zero.

The only tiny performance hit will happen while allocating a page, when
clearing it or doing the COW inside the page-fault handler (if you
are going to benchmark it make sure to #undef KMAP_DEBUG).
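
A simple bench of the kind meant here could look like the following. It is a minimal sketch, assuming a Linux-like userspace with anonymous mmap and clock_gettime; the function and struct names are hypothetical and it is not an lmbench tool. The first pass over a fresh anonymous mapping pays the page-fault (allocate and clear) cost, the second pass only touches memory that is already mapped.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

struct touch_times {
	double first_ns;  /* first touch: includes anonymous page faults */
	double second_ns; /* second touch: pages are already mapped */
	size_t pages;
};

static double now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* Write one byte in every page and return the elapsed time in ns. */
static double touch_pages(volatile unsigned char *buf, size_t pages,
			  size_t page_size)
{
	double start = now_ns();

	for (size_t i = 0; i < pages; i++)
		buf[i * page_size] = 1;
	return now_ns() - start;
}

/* Time first and second touch of a fresh anonymous mapping; 0 on success. */
static int time_anon_faults(size_t pages, struct touch_times *out)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = pages * page_size;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return -1;
	assert(out != NULL);
	out->first_ns = touch_pages(buf, pages, page_size);
	out->second_ns = touch_pages(buf, pages, page_size);
	out->pages = pages;
	munmap(buf, len);
	return 0;
}
```

Dividing first_ns by pages gives an approximate per-fault cost; comparing that figure between a kernel with and without CONFIG_BIGMEM is the measurement that matters, while second_ns should not change at all.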

Andrea


* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 16:29 [bigmem-patch] 4GB with Linux on IA32 Andrea Arcangeli
  1999-08-16 16:48 ` Matthew Wilcox
@ 1999-08-16 18:43 ` Kanoj Sarcar
  1999-08-16 19:43   ` Alan Cox
  1999-08-16 20:34   ` Andrea Arcangeli
  1 sibling, 2 replies; 39+ messages in thread
From: Kanoj Sarcar @ 1999-08-16 18:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: torvalds, sct, Gerhard.Wichert, Winfried.Gerhard, linux-kernel, linux-mm

Andrea,

I believe you are on the right track, by marking bigmem pages in
the flags, and requiring a mapping call before the kernel can access
the contents of the page. I have a few issues though. I haven't looked
at your code yet, so it is possible that you may have taken care
of some of this already.

For example, driver and fs code which operate on user pages might
need to be changed. I hear that Stephen's rawio code made it into
2.3.13, so would your patch work if a rawio request was made to
a range of user pages that were in bigmem area? Also, debuggers
want to look at user memory, so they would also need to map the
pages. Are there any other cases where a driver might want to 
look at such bigmem user pages (probably not in the context of
the process, in which case the uaccess functions are usable?).
Basically, any code that does a pte_page and similar calls is suspect, 
right?

Kanoj

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 18:43 ` Kanoj Sarcar
@ 1999-08-16 19:43   ` Alan Cox
  1999-08-16 20:54     ` Andrea Arcangeli
  1999-08-16 20:34   ` Andrea Arcangeli
  1 sibling, 1 reply; 39+ messages in thread
From: Alan Cox @ 1999-08-16 19:43 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: andrea, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

> a range of user pages that were in bigmem area? Also, debuggers
> want to look at user memory, so they would also need to map the
> pages. Are there any other cases where a driver might want to 

That is the tricky one. What occurs if I mmap a high memory page of
another process via /proc/pid/mem and then write to it?


* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 18:43 ` Kanoj Sarcar
  1999-08-16 19:43   ` Alan Cox
@ 1999-08-16 20:34   ` Andrea Arcangeli
  1 sibling, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-16 20:34 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: torvalds, sct, Gerhard.Wichert, Winfried.Gerhard, linux-kernel, linux-mm

On Mon, 16 Aug 1999, Kanoj Sarcar wrote:

>For example, driver and fs code which operate on user pages might
>need to be changed. I hear that Stephen's rawio code made it into

Yes. Or rather, I don't want to change the low-level block device internals,
so I always give such code regular pages to eat.

>2.3.13, so would your patch work if a rawio request was made to
>a range of user pages that were in bigmem area? Also, debuggers

No idea about rawio (I have not yet read the rawio code).

For debuggers I'll add a kmap to access_one_page in ptrace.c, thanks.

>Basically, any code that does a pte_page and similar calls is suspect, 
>right?

Yes it is. But only if such code can deal with anonymous or shm or
vmalloced pages.

Andrea


* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 19:43   ` Alan Cox
@ 1999-08-16 20:54     ` Andrea Arcangeli
  1999-08-16 22:47       ` Andrea Arcangeli
                         ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-16 20:54 UTC (permalink / raw)
  To: Alan Cox
  Cc: Kanoj Sarcar, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

On Mon, 16 Aug 1999, Alan Cox wrote:

>> a range of user pages that were in bigmem area? Also, debuggers
>> want to look at user memory, so they would also need to map the
>> pages. Are there any other cases where a driver might want to 
>
>That is the tricky one. What occurs if I mmap a high memory page of
>another process via /proc/pid/mem ? then write it

IMO Kanoj was talking about another thing (ptrace).

About /proc/pid/mem, I noticed there are two kmap calls missing in mem_write
and mem_read (i.e. when reading and writing through /proc/pid/mem).

The mmap over /proc/pid/mem instead seems just fine. The only thing you
must take care of is when you write to the page _inside_ the kernel; if you
touch pages from userspace you'll be fine as usual. For userspace, bigmem
pages are completely equal to regular pages and no kernel change is necessary
in such user-map places like mem_mmap in fs/proc/mem.c.

Andrea


* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 20:54     ` Andrea Arcangeli
@ 1999-08-16 22:47       ` Andrea Arcangeli
  1999-08-16 23:26         ` Andrea Arcangeli
                           ` (3 more replies)
  1999-08-16 23:28       ` Kanoj Sarcar
  1999-08-17  0:17       ` Andrea Arcangeli
  2 siblings, 4 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-16 22:47 UTC (permalink / raw)
  To: Alan Cox
  Cc: Kanoj Sarcar, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

This incremental patch (against bigmem-2.3.13-L) fixes the ptrace and
/proc/*/mem reads/writes to another process's VM inside the kernel.

diff -urN 2.3.13-bigmem-L/fs/proc/mem.c tmp/fs/proc/mem.c
--- 2.3.13-bigmem-L/fs/proc/mem.c	Tue Jul 13 02:02:09 1999
+++ tmp/fs/proc/mem.c	Tue Aug 17 00:02:48 1999
@@ -15,6 +15,9 @@
 #include <asm/uaccess.h>
 #include <asm/io.h>
 #include <asm/pgtable.h>
+#ifdef CONFIG_BIGMEM
+#include <asm/bigmem.h>
+#endif
 
 /*
  * mem_write isn't really a good idea right now. It needs
@@ -120,7 +123,13 @@
 		i = PAGE_SIZE-(addr & ~PAGE_MASK);
 		if (i > scount)
 			i = scount;
+#ifdef CONFIG_BIGMEM
+		page = (char *) kmap((unsigned long) page, KM_READ);
+#endif
 		copy_to_user(tmp, page, i);
+#ifdef CONFIG_BIGMEM
+		kunmap((unsigned long) page, KM_READ);
+#endif
 		addr += i;
 		tmp += i;
 		scount -= i;
@@ -177,7 +186,13 @@
 		i = PAGE_SIZE-(addr & ~PAGE_MASK);
 		if (i > count)
 			i = count;
+#ifdef CONFIG_BIGMEM
+		page = (unsigned long) kmap((unsigned long) page, KM_WRITE);
+#endif
 		copy_from_user(page, tmp, i);
+#ifdef CONFIG_BIGMEM
+		kunmap((unsigned long) page, KM_WRITE);
+#endif
 		addr += i;
 		tmp += i;
 		count -= i;
diff -urN 2.3.13-bigmem-L/kernel/ptrace.c tmp/kernel/ptrace.c
--- 2.3.13-bigmem-L/kernel/ptrace.c	Thu Jul 22 01:07:28 1999
+++ tmp/kernel/ptrace.c	Tue Aug 17 00:02:40 1999
@@ -13,6 +13,9 @@
 
 #include <asm/pgtable.h>
 #include <asm/uaccess.h>
+#ifdef CONFIG_BIGMEM
+#include <asm/bigmem.h>
+#endif
 
 /*
  * Access another process' address space, one page at a time.
@@ -52,7 +55,15 @@
 			dst = src;
 			src = buf;
 		}
+#ifdef CONFIG_BIGMEM
+		src = (void *) kmap((unsigned long) src, KM_READ);
+		dst = (void *) kmap((unsigned long) dst, KM_WRITE);
+#endif
 		memcpy(dst, src, len);
+#ifdef CONFIG_BIGMEM
+		kunmap((unsigned long) src, KM_READ);
+		kunmap((unsigned long) dst, KM_WRITE);
+#endif
 	}
 	flush_page_to_ram(page);
 	return len;

The /proc/*/mem read/write doesn't seem to work though (maybe I am doing
something wrong...).

black:/home/andrea# cat /proc/1/mem 
cat: /proc/1/mem: No such process

The same happens also on 2.2.11.

Andrea


* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 22:47       ` Andrea Arcangeli
@ 1999-08-16 23:26         ` Andrea Arcangeli
  1999-08-16 23:39           ` Kanoj Sarcar
  1999-08-17  6:39           ` Linus Torvalds
  1999-08-17  6:29         ` Linus Torvalds
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-16 23:26 UTC (permalink / raw)
  To: Alan Cox
  Cc: Kanoj Sarcar, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

This other incremental patch will make the bigmem code safe w.r.t. raw-io:

--- 2.3.13-bigmem-L/mm/memory.c	Fri Aug 13 00:31:59 1999
+++ 2.3.13-bigmem/mm/memory.c	Tue Aug 17 00:59:37 1999
@@ -436,6 +436,10 @@
 	map = mem_map + MAP_NR(page);
 	if (PageReserved(map))
 		return 0;
+#ifdef CONFIG_BIGMEM
+	if (PageBIGMEM(map))
+		return 0;
+#endif
 	return map;
 }
 

But now IMO we have to choose one of the options below:

1) should we change all device drivers to allow us to do I/O over
   bigmem pages? NOTE: all DMA engines are just fine since virt_to_bus
   just works right, as Gerhard pointed out to me. The only problem is for
   drivers that read and write b_data in software.
2) should we change ll_rw_block to enforce an upper limit on the bh queued in
   the same request and then remap the b_data in the ll_rw_block layer
   with a NR_REQUEST*MAX_BH_PER_REQUEST array of virtual-pages in the
   fixmap area? (many tlb_flush_all... or at least many SMP-invlpg with a
   smarter cross-CPU-invlpg message)
   virt_to_bus must be able to resolve the bus address starting from
   the fixmap virtual address.
3) using the remap trick that I am already using in the swapout/swapin code,
   I could just do raw-io on anonymous memory, but I get stuck with the shm
   memory where I can't simply realloc a page without browsing all
   processes' VM. Should I keep a list of all ptes that are mapping
   each shm page and do the remap trick on shm memory too?
4) should I avoid raw-io in the shm memory and use the remap trick
   with the anonymous memory?
5) should I avoid bigmem in shm memory and simply use the remap trick
   with the anonymous memory?

I guess big databases use the shm memory as cache. And I guess they use
raw-io to fill the shm memory with proper data. Am I right about this? If
so I can't choose (4). And since I would like to use the bigmem as shm
memory I would also like to avoid (5).

(3) looks dirty and adds a performance hit in the shm_nopage handler.

(2) looks dirty and slow due to the SMP tlb flushes.

(1) looks clean and efficient (100% efficient in the DMA case!) but it
    breaks all drivers out there... :(((

Theoretically the cleanest solution would be (1) but I don't know if this
will be a good choice in the long run (theoretically in 2038 we won't need
CONFIG_BIGMEM anymore...).

Right now I have temporarily applied solution (0): the patch at the top of
this email, so if you want to use raw-io on anonymous or shm memory you'll
have to recompile with CONFIG_BIGMEM not set.

Comments? (very welcome :)

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 20:54     ` Andrea Arcangeli
  1999-08-16 22:47       ` Andrea Arcangeli
@ 1999-08-16 23:28       ` Kanoj Sarcar
  1999-08-16 23:49         ` Andrea Arcangeli
  1999-08-17  6:29         ` David S. Miller
  1999-08-17  0:17       ` Andrea Arcangeli
  2 siblings, 2 replies; 39+ messages in thread
From: Kanoj Sarcar @ 1999-08-16 23:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: alan, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

> 
> On Mon, 16 Aug 1999, Alan Cox wrote:
> 
> >> a range of user pages that were in bigmem area? Also, debuggers
> >> want to look at user memory, so they would also need to map the
> >> pages. Are there any other cases where a driver might want to 
> >
> >That is the tricky one. What occurs if I mmap a high memory page of
> >another process via /proc/pid/mem ? then write it
> 
> IMO Kanoj was talking about another thing (ptrace).
>

Andrea, 

I was also talking about drivers which assume that all of memory is
direct mapped. For example, __va and __pa assume this. There might be 
other macros/procedures which have the same assumption built in. 
Basically, anything that is dependent on PAGE_OFFSET needs to be
checked. 

For example, on a 2.2.10 kernel:
[kanoj@entity kern]$ gid __va | grep drivers
drivers/char/mem.c:124: if (copy_to_user(buf, __va(p), count))
drivers/char/mem.c:142: return do_write_mem(file, __va(p), p, buf, count, ppos);
drivers/scsi/sym53c8xx.c:572:#define remap_pci_mem(base, size)  ((u_long) __va(base))
drivers/video/creatorfb.c:684:  disp->screen_base = (char *)__va(regs[0].phys_addr) + FFB_DFB24_POFF + 8192 * fb->y_margin + 4 * fb->x_margin;
drivers/video/creatorfb.c:687:  fb->s.ffb.fbc = (struct ffb_fbc *)((char *)__va(regs[0].phys_addr) + FFB_FBC_REGS_POFF);
drivers/video/creatorfb.c:688:  fb->s.ffb.dac = (struct ffb_dac *)((char *)__va(regs[0].phys_addr) + FFB_DAC_POFF);
drivers/sbus/char/zs.c:1934:                                            __va((((unsigned long)zsregs[0].which_io)<<32) |

For all such macros, a decision needs to be made whether such usage 
will create problems if the underlying page happens to be a bigmem page.
If so, the proper mapping (and unmapping) calls need to be made around
the kernel code that accesses the page contents.
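
The direct-map assumption can be illustrated with a small userspace
model. This is only a sketch, not kernel code: the model_* names are
hypothetical, and it assumes the default i386 split (PAGE_OFFSET at 3 GB)
and plain 32-bit offset arithmetic for __va/__pa. The real direct map is
somewhat smaller than 1 GB because part of the kernel range is reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the default i386 split, kernel direct map at 3 GB. */
#define MODEL_PAGE_OFFSET UINT32_C(0xC0000000)

/* Direct-map translation phys -> virt as 32-bit offset arithmetic. */
static uint32_t model_va(uint32_t phys)
{
    return (uint32_t)(phys + MODEL_PAGE_OFFSET);
}

/* Direct-map translation virt -> phys; only valid for kernel addresses. */
static uint32_t model_pa(uint32_t virt)
{
    assert(virt >= MODEL_PAGE_OFFSET);
    return (uint32_t)(virt - MODEL_PAGE_OFFSET);
}

/*
 * A physical address has a direct-map alias only while
 * phys + MODEL_PAGE_OFFSET still fits below 4 GB.
 */
static bool model_direct_mapped(uint32_t phys)
{
    return phys < (uint32_t)(UINT32_C(0) - MODEL_PAGE_OFFSET);
}
```

In this model, model_va() on a physical address at or above 1 GB wraps
around 4 GB, so the result no longer points at the page. That is the case
where code has to create an explicit temporary mapping instead.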

Kanoj

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 23:26         ` Andrea Arcangeli
@ 1999-08-16 23:39           ` Kanoj Sarcar
  1999-08-17  0:10             ` Andrea Arcangeli
  1999-08-17 14:26             ` Andrea Arcangeli
  1999-08-17  6:39           ` Linus Torvalds
  1 sibling, 2 replies; 39+ messages in thread
From: Kanoj Sarcar @ 1999-08-16 23:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: alan, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

> But now IMO we have to choose between the options below:
> 
> 1) should we change all device drivers to allow us to do I/O over
>    bigmem pages? NOTE: all DMA engines are just fine since virt_to_bus
>    just works right, as Gerhard pointed out to me. The only problem is
>    with drivers that read and write b_data in software.
> 2) should we change ll_rw_block to enforce an upper limit on the bh queued
>    in the same request and then remap the b_data in the ll_rw_block layer
>    with a NR_REQUEST*MAX_BH_PER_REQUEST array of virtual pages in the
>    fixmap area? (many tlb_flush_all... or at least many SMP-invlpg with a
>    smarter cross-CPU-invlpg message)
>    virt_to_bus must be able to resolve the bus address starting from
>    the fixmap virtual address.
> 3) using the remap trick that I am already using in the swapout/swapin
>    code, I could just do raw-io on anonymous memory, but I get stuck with
>    the shm memory, where I can't simply realloc a page without browsing
>    all processes' VM. Should I keep a list of all ptes that are mapping
>    each shm page and do the remap trick on shm memory too?
> 4) should I avoid raw-io in the shm memory and use the remap trick
>    with the anonymous memory?
> 5) should I avoid bigmem in shm memory and simply use the remap trick
>    with the anonymous memory?

Avoiding raw-io is not a good solution. Remapping would be a performance
hit anyway (apart from other problems, namely that you might not find
a page to remap to). Remapping in just the swap code is acceptable, but
probably not for other cases (like rawio).

> 
> I guess big databases use the shm memory as cache. And I guess they use
> raw-io to fill the shm memory with proper data. Am I right about this? If

Yes, I believe so ...

> so I can't choose (4). And since I would like to use the bigmem as shm
> memory I would also like to avoid (5).
> 
> (3) looks dirty and adds a performance hit in the shm_nopage handler.
> 
> (2) looks dirty and slow due to the SMP tlb flushes.
> 
> (1) looks clean and efficient (100% efficient in the DMA case!) but it
>     breaks all drivers out there... :(((
> 
> Theoretically the cleanest solution would be (1) but I don't know if this
> will be a good choice in the long run (theoretically in 2038 we won't need
							  ^^^ What is 2038?
> CONFIG_BIGMEM anymore...).

Part of the reason my bigmem patch for 2.2 has been implemented the
way it is, is so that drivers don't break. I think 2.3 is the place to 
teach the kernel and drivers that all of memory is not directly mappable.
Especially if we hope to put the PAE/36bit stuff in anytime. Yes,
that means fixing drivers. Of course, that's just my opinion ...

> 
> Right now I have temporarily applied solution (0): the patch at the top of
> this email, so if you want to use raw-io on anonymous or shm memory you'll
> have to recompile with CONFIG_BIGMEM not set.
> 
> Comments? (very welcome :)
> 

Once you can get resolution and decision on the driver issue, bigmem
pages can also exist in the page cache. And kmalloc could also use
bigmem pages for holding kernel data structures ...

Kanoj

> Andrea
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 23:28       ` Kanoj Sarcar
@ 1999-08-16 23:49         ` Andrea Arcangeli
  1999-08-17  6:29         ` David S. Miller
  1 sibling, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-16 23:49 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: alan, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

On Mon, 16 Aug 1999, Kanoj Sarcar wrote:

>I was also talking about drivers which assume that all of memory is
>direct mapped. For example, __va and __pa assume this. There might be 
>other macros/procedures which have the same assumption built in. 
>Basically, anything that is dependent on PAGE_OFFSET needs to be
>checked. 

Only places that may deal with bigmem pages and the core of the kernel
must be checked. I don't exclude that there is still something to fix (as
happened with kernel/ptrace.c and /proc/*/mem) but with the current design
we shouldn't need to touch the device drivers at all.

The only real problem currently seems to be raw-io to me... (hints?)

Andrea


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 23:39           ` Kanoj Sarcar
@ 1999-08-17  0:10             ` Andrea Arcangeli
  1999-08-17  6:37               ` Kanoj Sarcar
  1999-08-17 14:26             ` Andrea Arcangeli
  1 sibling, 1 reply; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-17  0:10 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: alan, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

On Mon, 16 Aug 1999, Kanoj Sarcar wrote:

>a page to remap to). Remapping in just the swap code is acceptable, but
>probably not for other cases (like rawio).

Agreed completely.

>way it is, is so that drivers don't break. I think 2.3 is the place to 
>teach the kernel and drivers that all of memory is not directly mappable.

I tried to avoid this (and I had been successful until I noticed raw-io
in 2.3.13... sigh).

In the meantime I'll keep raw-io disabled if CONFIG_BIGMEM is set.

> [..] And kmalloc could also use
>bigmem pages for holding kernel data structures ...

I really don't think this will ever happen.

BTW, the previous patch I posted to disable raw-io on bigmem pages doesn't
seem to work correctly, but that looks like a bug in map_user_kiobuf:

[..]
static struct page * get_page_map(unsigned long page)
{
	struct page *map;
	
	if (MAP_NR(page) >= max_mapnr)
		return 0;
	if (page == ZERO_PAGE(page))
		return 0;
	map = mem_map + MAP_NR(page);
	if (PageReserved(map))
		return 0;
#ifdef CONFIG_BIGMEM
	if (PageBIGMEM(map))
		return 0;
#endif
	return map;
}
[..]
		map = get_page_map(page);
		if (map) {
			if (TryLockPage(map)) {
				goto retry;
			}
			atomic_inc(&map->count);
		}
		spin_unlock(&mm->page_table_lock);
		dprintk ("Installing page %p %p: %d\n", (void *)page, map, i);
		iobuf->pagelist[i] = page;
		iobuf->maplist[i] = map;
		iobuf->nr_pages = ++i;
		
		ptr += PAGE_SIZE;
[..]

If get_page_map() returns zero, the page is queued in the iobuf anyway,
and brw_kiovec() doesn't check whether the map is null. So it seems you
could write to the ZERO_PAGE if you first mmap() the zero page and then
pass as buffer the userspace area where you have mapped it... What am I
missing?
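
The missing check can be sketched in userspace as follows. This is a
model only: struct model_kiobuf, struct model_page, MODEL_MAX_PAGES and
model_kiobuf_check() are hypothetical names that mirror the pagelist and
maplist fields quoted above, not the kernel's actual definitions. It
shows the kind of validation a consumer of the iobuf could do before
starting any I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; only its identity matters here. */
struct model_page { int count; };

#define MODEL_MAX_PAGES 8

/* Simplified model of an iobuf: one address and one map entry per page. */
struct model_kiobuf {
    int nr_pages;
    unsigned long pagelist[MODEL_MAX_PAGES];
    struct model_page *maplist[MODEL_MAX_PAGES];
};

/*
 * Returns 0 if every queued page has a usable map entry, -1 if any entry
 * is NULL (for example because the lookup rejected a reserved page).
 */
static int model_kiobuf_check(const struct model_kiobuf *iobuf)
{
    assert(iobuf != NULL);
    assert(iobuf->nr_pages <= MODEL_MAX_PAGES);

    for (int i = 0; i < iobuf->nr_pages; i++) {
        if (iobuf->maplist[i] == NULL)
            return -1;
    }
    return 0;
}
```

Whether the right place for such a check is the code that fills the
iobuf or the code that consumes it is a design choice this sketch does
not settle.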

Andrea


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 20:54     ` Andrea Arcangeli
  1999-08-16 22:47       ` Andrea Arcangeli
  1999-08-16 23:28       ` Kanoj Sarcar
@ 1999-08-17  0:17       ` Andrea Arcangeli
  1999-08-19 13:33         ` Thierry Vignaud
  2 siblings, 1 reply; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-17  0:17 UTC (permalink / raw)
  To: Alan Cox
  Cc: Kanoj Sarcar, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

I uploaded a new bigmem-2.3.13-M patch here:

	ftp://e-mind.com/pub/andrea/kernel-patches/2.3.13/bigmem-2.3.13-M

(raw-io must be avoided with bigmem enabled, since the protection I
added in get_page_map() doesn't work right now)

If you avoid doing raw-io the patch should be safe and ready to use.

Thanks.

Andrea


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 23:28       ` Kanoj Sarcar
  1999-08-16 23:49         ` Andrea Arcangeli
@ 1999-08-17  6:29         ` David S. Miller
  1999-08-17 12:38           ` Andrea Arcangeli
  1 sibling, 1 reply; 39+ messages in thread
From: David S. Miller @ 1999-08-17  6:29 UTC (permalink / raw)
  To: kanoj
  Cc: andrea, alan, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

   For example, on a 2.2.10 kernel:
   [kanoj@entity kern]$ gid __va | grep drivers
   drivers/char/mem.c:124: if (copy_to_user(buf, __va(p), count))
   drivers/char/mem.c:142: return do_write_mem(file, __va(p), p, buf, count, ppos);

Ok, this one could be a problem.

   drivers/scsi/sym53c8xx.c:572:#define remap_pci_mem(base, size)  ((u_long) __va(base))

Sparc specific ifdef'd code, it doesn't matter for ix86.

   drivers/video/creatorfb.c
 ...
   drivers/sbus/char/zs.c

More Sparc specific drivers.

So in essence there are only two spots in mem.c which you might need
to worry about on ix86.

Later,
David S. Miller
davem@redhat.com


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 22:47       ` Andrea Arcangeli
  1999-08-16 23:26         ` Andrea Arcangeli
@ 1999-08-17  6:29         ` Linus Torvalds
  1999-08-17 12:37           ` Andrea Arcangeli
  1999-08-17  8:52         ` Jakub Jelinek
  1999-08-17  9:13         ` Pavel Machek
  3 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 1999-08-17  6:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alan Cox, Kanoj Sarcar, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm


On Tue, 17 Aug 1999, Andrea Arcangeli wrote:
>
> This incremental (against bigmem-2.3.13-L) patch will fix the ptrace and
> /proc/*/mem read/writes to other process VM inside the kernel.

Andrea, you really need to clean these things up.

The bigmem patches look fine _except_ for the fact that they have these 

	#ifdef CONFIG_BIGMEM

turds all over the place. That's NOT how to do it.

Instead, you should unconditionally always do

	#include <linux/bigmem.h>

which in turn does something like this:

	#ifdef CONFIG_BIGMEM

	  #include <asm/bigmem.h>

	#else

	  #define kmap(page)	page_address(page)
	  #define kunmap(page)	do { } while (0)

	#endif

and then there is not a _single_ #ifdef inside any actual code.

Remember: if you have to have #ifdef's in actual functional code, you're
doing something wrong. I don't see why you can't just abstract the thing
away with zero performance degradation for the non-bigmem case by just
making the mapping function the existing identity function.
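
A userspace sketch of that pattern, under stated assumptions: the
model_* names and the MODEL_BIGMEM switch are hypothetical, and the
mapping works on plain addresses rather than on struct page. With
MODEL_BIGMEM undefined the map call is the identity and the unmap call
compiles to nothing, so the call site carries no #ifdef at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef MODEL_BIGMEM
/* Hypothetical remapping variant; it would live in an arch header. */
uintptr_t model_real_kmap(uintptr_t addr);
void model_real_kunmap(uintptr_t addr);
#define model_kmap(addr)   model_real_kmap(addr)
#define model_kunmap(addr) model_real_kunmap(addr)
#else
/* Non-bigmem case: mapping is the identity and unmapping is a no-op. */
#define model_kmap(addr)   (addr)
#define model_kunmap(addr) do { } while (0)
#endif

/* A call site: no #ifdef appears in the functional code. */
static void model_read_page(char *dst, uintptr_t page_addr, size_t len)
{
    uintptr_t mapped = model_kmap(page_addr);

    assert(dst != NULL);
    memcpy(dst, (const void *)mapped, len);
    model_kunmap(mapped);
}
```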

I'd like you to do the above cleanup, and then the bigmem patches look
like they could easily be integrated into the current 2.3.x series. But
with #ifdef's it won't.

			Linus


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  0:10             ` Andrea Arcangeli
@ 1999-08-17  6:37               ` Kanoj Sarcar
  1999-08-17  6:41                 ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Kanoj Sarcar @ 1999-08-17  6:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: alan, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

> 
> >way it is, is so that drivers don't break. I think 2.3 is the place to 
> >teach the kernel and drivers that all of memory is not directly mappable.
> 
> I tried to avoid this (and I had been successful until I noticed raw-io
> in 2.3.13... sigh).
> 
> In the meantime I'll keep raw-io disabled if CONFIG_BIGMEM is set.
>

Andrea,

As I pointed out before, I don't think rawio is the only case which
breaks.

I will give you one example of the type of cases that I am talking about.
In drivers/char/bttv.c, VIDIOCSFBUF ioctl seems to be setting the "vidadr"
to a kernel virtual address from the physical address present in the 
user's pte. This will not work for bigmem pages.

Now, you might claim that this driver is never used on ia32, or analyze
the way "vidadr" is used and show that the kernel never accesses the
kernel v/a stored in "vidadr". What I am pointing out is that this kind
of analysis needs to be made for all drivers (that use macros that are
dependent on PAGE_OFFSET) ... unless you can claim that you have already
done this analysis ...

Kanoj

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 23:26         ` Andrea Arcangeli
  1999-08-16 23:39           ` Kanoj Sarcar
@ 1999-08-17  6:39           ` Linus Torvalds
  1999-08-17 12:40             ` Andrea Arcangeli
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 1999-08-17  6:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alan Cox, Kanoj Sarcar, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm


On Tue, 17 Aug 1999, Andrea Arcangeli wrote:
>
> This other incremental patch will make the bigmem code safe w.r.t. raw-io:

Well, it makes it safe, but doesn't actually make it _work_. As such, it's
not very usable. I suspect it had better be our current fix, though.

I also suspect that we can't just break all drivers, so for now I would
just make this work for anonymous pages and ignore direct-IO. The driver
issue is going to need some serious thinking, and doing it for anonymous
pages only may be enough for many things. Especially if anonymous pages
_prefer_ the high-memory pages.

Oh, and copied-on-write pages count as anonymous, I assume you did that
already (ie when you allocate a new page and copy the old contents into
it, you might as well consider the new page to be anonymous, even though
it gets its initial data from a potentially non-anonymous page).

			Linus


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  6:37               ` Kanoj Sarcar
@ 1999-08-17  6:41                 ` Linus Torvalds
  1999-08-17  6:50                   ` Kanoj Sarcar
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 1999-08-17  6:41 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Andrea Arcangeli, alan, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm


On Mon, 16 Aug 1999, Kanoj Sarcar wrote:
> 
> As I pointed out before, I don't think rawio is the only case which
> breaks.
> 
> I will give you one example of the type of cases that I am talking about.
> In drivers/char/bttv.c, VIDIOCSFBUF ioctl seems to be setting the "vidadr"
> to a kernel virtual address from the physical address present in the 
> user's pte. This will not work for bigmem pages.

This is exactly why I have always been adamant that people should NOT do
direct IO and try to walk the page tables. But people have ignored me, and
quite frankly, those drivers should just be broken. The painful part is
finding out which of them do it, but once done they should just be broken
wrt bigmem, no questions asked.

		Linus


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  6:41                 ` Linus Torvalds
@ 1999-08-17  6:50                   ` Kanoj Sarcar
  1999-08-17  7:03                     ` Linus Torvalds
  1999-08-17 11:46                     ` Alan Cox
  0 siblings, 2 replies; 39+ messages in thread
From: Kanoj Sarcar @ 1999-08-17  6:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: andrea, alan, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

> 
> 
> 
> On Mon, 16 Aug 1999, Kanoj Sarcar wrote:
> > 
> > As I pointed out before, I don't think rawio is the only case which
> > breaks.
> > 
> > I will give you one example of the type of cases that I am talking about.
> > In drivers/char/bttv.c, VIDIOCSFBUF ioctl seems to be setting the "vidadr"
> > to a kernel virtual address from the physical address present in the 
> > user's pte. This will not work for bigmem pages.
> 
> This is exactly why I have always been adamant that people should NOT do
> direct IO and try to walk the page tables. But people have ignored me, and
> quite frankly, those drivers should just be broken. The painful part is
> finding out which of them do it, but once done they should just be broken
> wrt bigmem, no questions asked.
> 
> 		Linus
> 

The *only* way to prevent this really is to make code like this uncompilable.
That is, prevent definitions like pte_page, PAGE_OFFSET, __va, __pa etc
from being in header files; rather make the driver/fs code invoke specific
routines that do virt-to-phys etc translations. Granted, this might be a
little costlier, but in most cases, this extra cost will be in driver code
that is not performance sensitive anyway. There really should be some
ddi/dki that drivers have to follow. 

Btw, my vote goes for finding and fixing all such driver code, instead 
of just breaking them for bigmem machines.

Kanoj

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  6:50                   ` Kanoj Sarcar
@ 1999-08-17  7:03                     ` Linus Torvalds
  1999-08-17  7:23                       ` Linus Torvalds
  1999-08-17 11:46                     ` Alan Cox
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 1999-08-17  7:03 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: andrea, alan, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm


On Mon, 16 Aug 1999, Kanoj Sarcar wrote:
> 
> Btw, my vote goes for finding and fixing all such driver code, instead 
> of just breaking them for bigmem machines.

The code in question cannot be "fixed". It's doing something wrong in the
first place, 

		Linus


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  7:03                     ` Linus Torvalds
@ 1999-08-17  7:23                       ` Linus Torvalds
  1999-08-17 11:39                         ` Alan Cox
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 1999-08-17  7:23 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: andrea, alan, Stephen C. Tweedie, Gerhard.Wichert,
	Winfried.Gerhard, linux-kernel, linux-mm, linux-usb


On Tue, 17 Aug 1999, Linus Torvalds wrote:
> 
> The code in question cannot be "fixed". It's doing something wrong in the
> first place, 

To expand on the above:

 If you write a driver and you want to give direct DMA access to some
program, the way to do it is NOT by using some magic ioctl number and
doing stupid things like some drivers do (ie notably bttv).

The way to do it is to just be up-front about the fact that the user
process wants direct access to the buffers that the IO is done from, and
use an explicit mmap() on the file descriptor. The driver can then
allocate a contiguous chunk of memory of the right type, and with the
right restrictions, and then let the nopage() function page it into the
user process space. 

Suddenly, such a _wellwritten_ driver no longer needs to play games with
the page tables. And such a well written driver wouldn't have any problems
at all with the BIGMEM patches.

Btw, this is not something new. Quite a number of sound drivers do exactly
this, and have been doing it for several years. I don't know why the bttv
driver has to be so broken, but as far as I can tell it's one of two (the
other one being some completely obscure planb driver for power macs).

Oh, and I notice that the USB cpia driver does bad things too, although it
seems to be limited to vmalloc'ed memory so it's not nearly as horrible. 
It seems to have copied the bug from the bttv sources. Johannes, could you
look at that a bit, it really _is_ going to break horribly at some point,
and I hadn't noticed until after I did a quick grep.. You can use
__get_free_pages() to grab a larger area than just a single page. 

		Linus


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 22:47       ` Andrea Arcangeli
  1999-08-16 23:26         ` Andrea Arcangeli
  1999-08-17  6:29         ` Linus Torvalds
@ 1999-08-17  8:52         ` Jakub Jelinek
  1999-08-17  9:13         ` Pavel Machek
  3 siblings, 0 replies; 39+ messages in thread
From: Jakub Jelinek @ 1999-08-17  8:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kanoj Sarcar, linux-kernel, linux-mm

On Tue, Aug 17, 1999 at 12:47:50AM +0200, Andrea Arcangeli wrote:
> This incremental (against bigmem-2.3.13-L) patch will fix the ptrace and
> /proc/*/mem read/writes to other process VM inside the kernel.

Isn't it cleaner to provide asm/bigmem.h on all platforms and even on
asm-i386 do something like

#ifndef CONFIG_BIGMEM
#define PageBIGMEM(map) 0
#define kmap(x,y) x
#define kunmap(x,y)
#else
...
#endif
?
> 
> diff -urN 2.3.13-bigmem-L/fs/proc/mem.c tmp/fs/proc/mem.c
> --- 2.3.13-bigmem-L/fs/proc/mem.c	Tue Jul 13 02:02:09 1999
> +++ tmp/fs/proc/mem.c	Tue Aug 17 00:02:48 1999
> @@ -15,6 +15,9 @@
>  #include <asm/uaccess.h>
>  #include <asm/io.h>
>  #include <asm/pgtable.h>
> +#ifdef CONFIG_BIGMEM
> +#include <asm/bigmem.h>
> +#endif
>  
>  /*
>   * mem_write isn't really a good idea right now. It needs
> @@ -120,7 +123,13 @@
>  		i = PAGE_SIZE-(addr & ~PAGE_MASK);
>  		if (i > scount)
>  			i = scount;
> +#ifdef CONFIG_BIGMEM
> +		page = (char *) kmap((unsigned long) page, KM_READ);
> +#endif
>  		copy_to_user(tmp, page, i);
> +#ifdef CONFIG_BIGMEM
> +		kunmap((unsigned long) page, KM_READ);
> +#endif
>  		addr += i;
>  		tmp += i;
>  		scount -= i;
> @@ -177,7 +186,13 @@
>  		i = PAGE_SIZE-(addr & ~PAGE_MASK);
>  		if (i > count)
>  			i = count;
> +#ifdef CONFIG_BIGMEM
> +		page = (unsigned long) kmap((unsigned long) page, KM_WRITE);
> +#endif
>  		copy_from_user(page, tmp, i);
> +#ifdef CONFIG_BIGMEM
> +		kunmap((unsigned long) page, KM_WRITE);
> +#endif
>  		addr += i;
>  		tmp += i;
>  		count -= i;
> diff -urN 2.3.13-bigmem-L/kernel/ptrace.c tmp/kernel/ptrace.c
> --- 2.3.13-bigmem-L/kernel/ptrace.c	Thu Jul 22 01:07:28 1999
> +++ tmp/kernel/ptrace.c	Tue Aug 17 00:02:40 1999
> @@ -13,6 +13,9 @@
>  
>  #include <asm/pgtable.h>
>  #include <asm/uaccess.h>
> +#ifdef CONFIG_BIGMEM
> +#include <asm/bigmem.h>
> +#endif
>  
>  /*
>   * Access another process' address space, one page at a time.
> @@ -52,7 +55,15 @@
>  			dst = src;
>  			src = buf;
>  		}
> +#ifdef CONFIG_BIGMEM
> +		src = (void *) kmap((unsigned long) src, KM_READ);
> +		dst = (void *) kmap((unsigned long) dst, KM_WRITE);
> +#endif
>  		memcpy(dst, src, len);
> +#ifdef CONFIG_BIGMEM
> +		kunmap((unsigned long) src, KM_READ);
> +		kunmap((unsigned long) dst, KM_WRITE);
> +#endif
>  	}
>  	flush_page_to_ram(page);
>  	return len;
> 
> The /proc/*/mem read/write seems to not work though (maybe I am doing
> something wrong...).
> 
> black:/home/andrea# cat /proc/1/mem 
> cat: /proc/1/mem: No such process
> 
> The same happens also on 2.2.11.
> 
> Andrea
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.tux.org/lkml/

Cheers,
    Jakub
___________________________________________________________________
Jakub Jelinek | jj@sunsite.mff.cuni.cz | http://sunsite.mff.cuni.cz
Administrator of SunSITE Czech Republic, MFF, Charles University
___________________________________________________________________
UltraLinux  |  http://ultra.linux.cz/  |  http://ultra.penguin.cz/
Linux version 2.3.13 on a sparc64 machine (1343.49 BogoMips)
___________________________________________________________________

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 22:47       ` Andrea Arcangeli
                           ` (2 preceding siblings ...)
  1999-08-17  8:52         ` Jakub Jelinek
@ 1999-08-17  9:13         ` Pavel Machek
  1999-08-18 14:08           ` Andrea Arcangeli
  3 siblings, 1 reply; 39+ messages in thread
From: Pavel Machek @ 1999-08-17  9:13 UTC (permalink / raw)
  To: Andrea Arcangeli, Alan Cox
  Cc: Kanoj Sarcar, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

Hi!

> This incremental (against bigmem-2.3.13-L) patch will fix the ptrace and
> /proc/*/mem read/writes to other process VM inside the kernel.

Your patches start to contain more ifdefs than code. That's bad.

> diff -urN 2.3.13-bigmem-L/fs/proc/mem.c tmp/fs/proc/mem.c
> --- 2.3.13-bigmem-L/fs/proc/mem.c	Tue Jul 13 02:02:09 1999
> +++ tmp/fs/proc/mem.c	Tue Aug 17 00:02:48 1999
> @@ -15,6 +15,9 @@
>  #include <asm/uaccess.h>
>  #include <asm/io.h>
>  #include <asm/pgtable.h>
> +#ifdef CONFIG_BIGMEM
> +#include <asm/bigmem.h>
> +#endif

These ifdefs are probably not needed. A few unused symbols cannot hurt,
can they? And if you are worried, put the #ifdef pair into asm/bigmem.h,
not into every file.

> @@ -120,7 +123,13 @@
>  		i = PAGE_SIZE-(addr & ~PAGE_MASK);
>  		if (i > scount)
>  			i = scount;
> +#ifdef CONFIG_BIGMEM
> +		page = (char *) kmap((unsigned long) page, KM_READ);
> +#endif

What about kmap existing unconditionally, but (inside bigmem.h)

#ifdef CONFIG_BIGMEM
#define kmap(a,b) real_kmap(a,b)
#else
#define kmap(a,b) a
#endif

? Doing this, and the same for kunmap, would save lots of painful #ifdefs
elsewhere.

								Pavel
-- 
I'm really pavel@ucw.cz. Look at http://195.113.31.123/~pavel.  Pavel
Hi! I'm a .signature virus! Copy me into your ~/.signature, please!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  7:23                       ` Linus Torvalds
@ 1999-08-17 11:39                         ` Alan Cox
  1999-08-26 16:27                           ` Andrea Arcangeli
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Cox @ 1999-08-17 11:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kanoj, andrea, alan, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm, linux-usb

>  If you write a driver and you want to give direct DMA access to some
> program, the way to do it is NOT by using some magic ioctl number and
> doing stupid things like some drivers do (ie notably bttv).

The bttv does it right, Linus. It has to do some nastiness to work around
the Linux vm subsystem, that's all. That nastiness is solely to get a table
of bus addresses of vmalloc pages. I don't think the 4Gig patch breaks it
at all. In the ideal world virt_to_bus() would work on vmalloc pages. It
doesn't, and there are good reasons why, so we have to handle that bit ourselves.

The only ioctl stuff it has for directly sending stuff to addresses is for
frame buffer direct DMA, which is the only sane way to handle TV viewing.
That is given the bus address of the frame buffer by the X server, which
does know what it is doing.

> process wants direct access to the buffers that the IO is done from, and
> use an explicit mmap() on the file descriptor. The driver can then
> allocate a contiguous chunk of memory of the right type, and with the
> right restrictions, and then let the nopage() function page it into the
> user process space. 

That's basically what bttv does. When you start grabbing we do

	vmalloc
	write BT848 RISC DMA script to match the pages allocated
	
	mmap maps those pages into user space as a ring buffer

The vmalloc is done off the ioctls to begin frame grabbing because it would
be very stupid to have 2-4MB of RAM allocated on open when a user didn't
want to do capturing.

> this, and have been doing it for several years. I don't know why the bttv
> driver has to be so broken, but as far as I can tell it's one of two (the

It's doing what you say it should. So why is it broken? It has to grab
2MB of RAM or more at times and map them into user space. It does exactly
that.

> and I hadn't noticed until after I did a quick grep.. You can use
> __get_free_pages() to grab a larger area than just a single page. 

Video capture cards want several megabytes. get_free_pages() is unreliable
above about 16K. The bttv could certainly be written to do a loop of using
get_free_page() for each page it wants. That would be a fair bit cleaner.
However, Stephen's rawio promises to provide roughly the right framework
for doing this stuff properly, so I don't plan to do the job twice.
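
The "loop of get_free_page() for each page" idea can be sketched in user space. This is an analogy only, not driver code: aligned_alloc() stands in for get_free_page(), SKETCH_PAGE_SIZE is an assumed page size, and the function names are hypothetical. The point is that many single-page allocations succeed where one large contiguous request might not, at the cost of keeping a table of page addresses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/*
 * Allocate nr_pages page-aligned pages one at a time and record each
 * address in table. Returns 0 on success; on failure frees whatever
 * was already allocated and returns -1.
 */
static int alloc_page_table(void **table, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++) {
		table[i] = aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
		if (table[i] == NULL) {
			while (i-- > 0)
				free(table[i]);
			return -1;
		}
	}
	return 0;
}

/* Release every page recorded in table. */
static void free_page_table(void **table, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++)
		free(table[i]);
}
```

A driver would then program the device from the per-page table instead of from one contiguous base address.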

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  6:50                   ` Kanoj Sarcar
  1999-08-17  7:03                     ` Linus Torvalds
@ 1999-08-17 11:46                     ` Alan Cox
  1 sibling, 0 replies; 39+ messages in thread
From: Alan Cox @ 1999-08-17 11:46 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: torvalds, andrea, alan, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

> I will give you one example of the type of cases that I am talking about.
> In drivers/char/bttv.c, VIDIOCSFBUF ioctl seems to be setting the "vidadr"
> to a kernel virtual address from the physical address present in the 
> user's pte. This will not work for bigmem pages.

Oh, now I understand Linus' rather bizarre message.

VIDIOCSFBUF takes a physical base address. The &1 stuff that's in there is
a debug hook that never got taken out. You can ignore the &1 case in that
ioctl or just remove it.

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  6:29         ` Linus Torvalds
@ 1999-08-17 12:37           ` Andrea Arcangeli
  1999-08-17 14:04             ` Andrea Arcangeli
  0 siblings, 1 reply; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-17 12:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Kanoj Sarcar, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

On Mon, 16 Aug 1999, Linus Torvalds wrote:

>I'd like you to do the above cleanup, and then the bigmem patches look
>like they could easily be integrated into the current 2.3.x series. But
>with #ifdef's it won't.

Fine ;)). I'll do the cleanup and I'll give you a new patch without the
#ifdef in the common code (all other archs will have to #define some noop
as well then of course).

Thanks.

Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  6:29         ` David S. Miller
@ 1999-08-17 12:38           ` Andrea Arcangeli
  0 siblings, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-17 12:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: kanoj, alan, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

On Mon, 16 Aug 1999, David S. Miller wrote:

>   From: kanoj@google.engr.sgi.com (Kanoj Sarcar)
>   Date:   Mon, 16 Aug 1999 16:28:58 -0700 (PDT)
>
>   For example, on a 2.2.10 kernel:
>   [kanoj@entity kern]$ gid __va | grep drivers
>   drivers/char/mem.c:124: if (copy_to_user(buf, __va(p), count))
>   drivers/char/mem.c:142: return do_write_mem(file, __va(p), p, buf, count, ppos);
>
>Ok, this one could be a problem.

It isn't. The bigmem is not readable from /dev/mem right now.

Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  6:39           ` Linus Torvalds
@ 1999-08-17 12:40             ` Andrea Arcangeli
  0 siblings, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-17 12:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Kanoj Sarcar, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

On Mon, 16 Aug 1999, Linus Torvalds wrote:

>pages only may be enough for many things. Especially if anonymous pages
>_prefer_ the high-memory pages.

Yes, shm/vmalloc/anonymous memory always prefers the high-memory pages.

>Oh, and copied-on-write pages count as anonymous, I assume you did that
>already (ie when you allocate a new page and copy the old contents into
>it, you might as well consider the new page to be anonymous, even though
>it gets its initial data from a potentially non-anonymous page).

Yes, copy-on-write always prefers the bigmem pages for the allocation.

Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17 12:37           ` Andrea Arcangeli
@ 1999-08-17 14:04             ` Andrea Arcangeli
  0 siblings, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-17 14:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Kanoj Sarcar, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 856 bytes --]

On Tue, 17 Aug 1999, Andrea Arcangeli wrote:

>#ifdef in the common code (all other archs will have to #define some noop
>as well then of course).

I was wrong: with the linux/bigmem.h trick all other archs should be just
fine ;).

I did a preliminary cleanup. mm/page_alloc.c still uses some #ifdefs, but
they are not trivially removable, for example:

+#ifdef CONFIG_BIGMEM
+#define BIGMEM_LISTS_OFFSET	NR_MEM_LISTS
+static struct free_area_struct free_area[NR_MEM_LISTS*2];
+#else
 static struct free_area_struct free_area[NR_MEM_LISTS];
+#endif
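
One way to read the doubled array is that the second half holds the bigmem free lists, reached by adding BIGMEM_LISTS_OFFSET to the order. The sketch below illustrates that reading and the "prefer bigmem, fall back to regular" policy. It is an assumption, not the patch: the NR_MEM_LISTS value, the GFP_BIGMEM_SKETCH flag (a stand-in for __GFP_BIGMEM), the struct layout and pick_free_list() are all hypothetical, and a real buddy allocator would also try higher orders.

```c
#define NR_MEM_LISTS        10	/* assumed value, for illustration only */
#define BIGMEM_LISTS_OFFSET NR_MEM_LISTS
#define GFP_BIGMEM_SKETCH   0x1	/* hypothetical stand-in for __GFP_BIGMEM */

/* Simplified free-list head: only counts the free blocks of one order. */
struct free_area_sketch {
	unsigned long nr_free;
};

/* First half: regular pages. Second half: bigmem pages. */
static struct free_area_sketch free_area[NR_MEM_LISTS * 2];

/*
 * Return the index of the list to allocate from for the given order,
 * preferring the bigmem list when the caller allows it, or -1 if both
 * lists of that order are empty.
 */
static int pick_free_list(unsigned int order, int gfp_mask)
{
	if (order >= NR_MEM_LISTS)
		return -1;
	if ((gfp_mask & GFP_BIGMEM_SKETCH) &&
	    free_area[BIGMEM_LISTS_OFFSET + order].nr_free > 0)
		return BIGMEM_LISTS_OFFSET + (int) order;
	if (free_area[order].nr_free > 0)
		return (int) order;
	return -1;
}
```

Callers that cannot handle unmapped pages simply never pass the bigmem flag, so they only ever see the first half of the array.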

The cleaned-up patch (bigmem-2.3.13-N) is attached.

andrea@laser:~/kernel-patches/2.3.13 > ls -l bigmem-2.3.13-[NM]
-rw-r--r--   1 andrea   andrea      32416 Aug 17 01:28 bigmem-2.3.13-M
-rw-r--r--   1 andrea   andrea      30779 Aug 17 16:01 bigmem-2.3.13-N

(1637 bytes, roughly 2 kbytes, removed by the cleanup)

Thanks.

Andrea

[-- Attachment #2: bigmem without #ifdef CONFIG_BIGMEM all over the place --]
[-- Type: APPLICATION/octet-stream, Size: 9259 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-16 23:39           ` Kanoj Sarcar
  1999-08-17  0:10             ` Andrea Arcangeli
@ 1999-08-17 14:26             ` Andrea Arcangeli
  1 sibling, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-17 14:26 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: alan, torvalds, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm

On Mon, 16 Aug 1999, Kanoj Sarcar wrote:

>							^^^ What is 2038?

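#include <stdio.h>
#include <time.h>

/* With a 32-bit signed time_t (IA32 in 1999), ~0UL >> 1 is the largest
   representable time, and the increment in practice wraps negative. */
int main(void)
{
	time_t t = ~0UL >> 1;
	printf("%lx, %s", (unsigned long) t, ctime(&t));
	t++;
	printf("%lx, %s", (unsigned long) t, ctime(&t));
	return 0;
}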
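
The program above depends on time_t being 32 bits wide. A variant that shows the same wraparound regardless of the width of time_t can be sketched as follows. The helper names are hypothetical, and the sketch assumes gmtime() accepts negative values (true of glibc, not guaranteed everywhere) and that converting an out-of-range value to int32_t wraps, which is implementation-defined but is what gcc and clang do.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Format a 32-bit Unix timestamp as "YYYY-MM-DD HH:MM:SS" in UTC.
 * Returns buf, or NULL if the conversion or formatting fails.
 */
static const char *format_time32(int32_t seconds, char *buf, size_t len)
{
	time_t t = (time_t) seconds;
	struct tm *utc = gmtime(&t);

	if (utc == NULL ||
	    strftime(buf, len, "%Y-%m-%d %H:%M:%S", utc) == 0)
		return NULL;
	return buf;
}

/*
 * Advance by one second the way a 32-bit signed counter wraps. The
 * addition is done unsigned to avoid signed-overflow undefined behaviour.
 */
static int32_t next_second32(int32_t seconds)
{
	return (int32_t) ((uint32_t) seconds + 1u);
}
```

Passing INT32_MAX to format_time32() gives the last representable second, in January 2038, and next_second32(INT32_MAX) is INT32_MIN, which formats as a date in December 1901.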

Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  9:13         ` Pavel Machek
@ 1999-08-18 14:08           ` Andrea Arcangeli
  1999-08-19 12:20             ` Andrea Arcangeli
  0 siblings, 1 reply; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-18 14:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Kanoj Sarcar, torvalds, sct, Gerhard.Wichert,
	Winfried.Gerhard, linux-kernel, linux-mm

I cleaned up the kmap interface yesterday (in linux/bigmem.h).

Now I am cleaning things up further, putting all the common bigmem
code and variables in linux/mm/bigmem.c (separate from the arch-specific
arch/i386/mm/bigmem.c), so there won't be duplicated sources if some other
arch (not only i386) wants to take advantage of the common bigmem
interface (I just heard some interest in this field ;).

Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-18 14:08           ` Andrea Arcangeli
@ 1999-08-19 12:20             ` Andrea Arcangeli
  0 siblings, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-19 12:20 UTC (permalink / raw)
  To: x-linux-kernel, linux-mm; +Cc: Wichert, Gerhard, Gerhard, Winfried

[-- Attachment #1: Type: TEXT/PLAIN, Size: 668 bytes --]

On Wed, 18 Aug 1999, Andrea Arcangeli wrote:

>Now I am cleaning up the stuff still more putting all the bigmem common
>code and variables in linux/mm/bigmem.c (separated by the arch specific
>arch/i386/mm/bigmem.c) so there won't be duplicated sources if some other
>arch (not only i386) will want to take advantage of the bigmem common
>interface (I just heard some interest in this field ;).

I did the cleanup. The latest bigmem-2.3.14-pre2-O patch is attached to
this email. There are no real functional differences in the code. It's
only a low-priority architecture-cleanup diff (if you are already using
bigmem-2.3.13-N you can safely ignore this update).

Andrea

[-- Attachment #2: bigmem with arch-cleanup --]
[-- Type: APPLICATION/OCTET-STREAM, Size: 9308 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17  0:17       ` Andrea Arcangeli
@ 1999-08-19 13:33         ` Thierry Vignaud
  1999-08-19 16:49           ` Stephen C. Tweedie
  0 siblings, 1 reply; 39+ messages in thread
From: Thierry Vignaud @ 1999-08-19 13:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alan Cox, Kanoj Sarcar, torvalds, sct, Gerhard.Wichert,
	Winfried.Gerhard, linux-kernel, linux-mm

Andrea Arcangeli wrote:
> 
> I uploaded a new bigmem-2.3.13-M patch here:
> 
>         ftp://e-mind.com/pub/andrea/kernel-patches/2.3.13/bigmem-2.3.13-M
> 
> (the raw-io must be avoided with bigmem enabled, since the protection I
> added in get_page_map() doesn't work right now)
> 
> If you'll avoid to do raw-io the patch should be safe and ready to use.

Since only recent motherboards support more than 512MB of RAM, and since
they use i686 CPUs (PPro, P2, P3), why not use the pse36 extension of these
CPUs, which allows storing the segment length on 24 bits and gives 64TB
when the memory unit is a 4b page?
This would make the limit much higher (say 128MB of RAM for kernel-space
memory and 15.9TB for user space).
This would break some APIs, but why not add foo_64 for each foo()
function, as glibc does for big files?
As for standard APIs such as libc's, I don't think we have to worry
about them. There are few programs that want a lot of memory, such as Oracle.
For these, we may find a special way of accessing the memory (64-bit
pointers, 64-bit mmap, ...).

-- 
MandrakeSoft          http://www.mandrakesoft.com/
	somewhere between the playstation and the super cray
			         	 --Thierry

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-19 13:33         ` Thierry Vignaud
@ 1999-08-19 16:49           ` Stephen C. Tweedie
  1999-08-20  7:35             ` Thierry Vignaud
  0 siblings, 1 reply; 39+ messages in thread
From: Stephen C. Tweedie @ 1999-08-19 16:49 UTC (permalink / raw)
  To: Thierry Vignaud
  Cc: Andrea Arcangeli, Alan Cox, Kanoj Sarcar, torvalds, sct,
	Gerhard.Wichert, Winfried.Gerhard, x-linux-kernel, linux-mm

Hi,

On Thu, 19 Aug 1999 13:33:32 +0000, Thierry Vignaud
<tvignaud@mandrakesoft.com> said:

> since only recent motherboard support more than 512Mb RAM, and since
> they used i686 (PPro, P2, P3), why not use the pse36 extension of
> these cpu that enable to stock the segment length on 24bits, which
> give 64To when mem unit is 4b page.  this'll make the limit much
> higher (say 128Mb RAM for the kernel space memory and 15,9To for the
> user space).  

The PAE36 extensions let you address 64GB of physical memory, but don't
change the fact that you still have a 32-bit user address space: the
user space is still limited to 3GB.
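
The arithmetic behind these numbers can be sketched as follows. The helper names are hypothetical, and the 3GB figure assumes the default i386 PAGE_OFFSET of 0xC0000000, with user space below it and the kernel above it.

```c
#include <stdint.h>

/* Bytes addressable with the given number of address bits (below 64). */
static uint64_t addressable_bytes(unsigned int address_bits)
{
	/* 0 signals "not representable in 64 bits". */
	return address_bits < 64 ? (uint64_t) 1 << address_bits : 0;
}

/* User space covers virtual addresses below page_offset. */
static uint64_t user_space_bytes(uint32_t page_offset)
{
	return page_offset;
}

/* The kernel gets the rest of the 32-bit virtual address space. */
static uint64_t kernel_space_bytes(uint32_t page_offset)
{
	return addressable_bytes(32) - page_offset;
}
```

Thirty-six physical address bits give 2^36 bytes (64GB) of addressable RAM, while each process still sees only 2^32 bytes (4GB) of virtual addresses, split 3GB/1GB at the assumed PAGE_OFFSET.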

> This would break some api, but why not add foo_64 for each foo()
> function as glibc does for big files ?  As for standard api such as of
> libc, i don't think wa have to worry about. There are few Programs
> which want a lot of memory such as oracle.  For these, we may find a
> special way of accessing the mem (64bits pointers, 64bit mmap, ...)

The CPU doesn't support 64 bit pointers.  Kind of makes it a bit
inefficient to access the user memory if you have to make a system call
every time. :)

--Stephen

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-19 16:49           ` Stephen C. Tweedie
@ 1999-08-20  7:35             ` Thierry Vignaud
  1999-08-20  9:55               ` Alan Cox
  1999-08-20 18:25               ` Linus Torvalds
  0 siblings, 2 replies; 39+ messages in thread
From: Thierry Vignaud @ 1999-08-20  7:35 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Alan Cox, Kanoj Sarcar, torvalds,
	Gerhard.Wichert, Winfried.Gerhard, x-linux-kernel, linux-mm

"Stephen C. Tweedie" wrote:
> 
> Hi,
> 
> On Thu, 19 Aug 1999 13:33:32 +0000, Thierry Vignaud
> <tvignaud@mandrakesoft.com> said:
> 
> > since only recent motherboard support more than 512Mb RAM, and since
> > they used i686 (PPro, P2, P3), why not use the pse36 extension of
> > these cpu that enable to stock the segment length on 24bits, which
> > give 64To when mem unit is 4b page.  this'll make the limit much
> > higher (say 128Mb RAM for the kernel space memory and 15,9To for the
> > user space).
> 
> The PAE36 extensions let you address 64GB of physical memory, but don't
> change the fact that you still have a 32-bit user address space: the
> user space is still limited to 3GB.
> 
> > This would break some api, but why not add foo_64 for each foo()
> > function as glibc does for big files ?  As for standard api such as of
> > libc, i don't think wa have to worry about. There are few Programs
> > which want a lot of memory such as oracle.  For these, we may find a
> > special way of accessing the mem (64bits pointers, 64bit mmap, ...)
> 
> The CPU doesn't support 64 bit pointers.  Kind of makes it a bit
> inefficient to access the user memory if you have to make a system call
> every time. :)
Yes, but we can use 24:32 references (as
pse36_extended_selectors:offset). Each process may own an LDT that allows
it to own several 4GB segments: code, data, stack, mapped kernel memory,
libraries, shared memory (X11/DGA -> fb mem and IPC shm).
Each of these segments is still at most 4GB, but the process may
address more than 4GB.
We may have to hack gcc & binutils so they generate references against
new selectors. We may put the kernel memory region that the process sees in
another segment and alter the include macros so that they handle access to
kernel structs via the new selector (es, fs or gs on ix86).
As this could break a lot of software, we may define a flag in the ELF
header that selects either the 2:2 split of RAM or the new scheme.

Another solution: add a new brk36 on ix86 that enables very big apps
(such as Oracle) to own multiple 4GB segments, and then leave all the
difficulties to the userland developer.

Yes, I know, if someone wants a lot of memory, he should switch to a 64-bit
arch, but there are now some servers which manage up to 16GB of RAM. WinNT
has chosen a stupid trick and reinvented their classic solution
(EMS, XMS, ...) by allowing apps to copy blocks from memory above 4GB.
This is really stupid, but they can say they manage more than 4GB and
we cannot.

-- 
MandrakeSoft          http://www.mandrakesoft.com/
	somewhere between the playstation and the super cray
			         	 --Thierry

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-20  7:35             ` Thierry Vignaud
@ 1999-08-20  9:55               ` Alan Cox
  1999-08-20 18:25               ` Linus Torvalds
  1 sibling, 0 replies; 39+ messages in thread
From: Alan Cox @ 1999-08-20  9:55 UTC (permalink / raw)
  To: Thierry Vignaud
  Cc: sct, andrea, alan, kanoj, torvalds, Gerhard.Wichert,
	Winfried.Gerhard, x-linux-kernel, linux-mm

> Yes, but we do can use 24:32 referencse (as
> pse36_extended_selectors:offset). Each process may own a ldt that allow
> him to own several 4Gb segment : code, data, stack, kernel mem mapped,
> librairies, shared mem (X11/dga -> fb mem and IPC shm).

32bit large mode. 

> We may have to hack gcc & binutils so they generate references against
> new selectors. We may put the kernel mem region that the process see in

That's probably four years' work. You also need to do a large-mode glibc
port, so budget another year. And maybe a couple of man-years for the kernel.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-20  7:35             ` Thierry Vignaud
  1999-08-20  9:55               ` Alan Cox
@ 1999-08-20 18:25               ` Linus Torvalds
  1 sibling, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 1999-08-20 18:25 UTC (permalink / raw)
  To: Thierry Vignaud
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Alan Cox, Kanoj Sarcar,
	Gerhard.Wichert, Winfried.Gerhard, x-linux-kernel, linux-mm


On Fri, 20 Aug 1999, Thierry Vignaud wrote:
>
> Yes, but we do can use 24:32 referencse (as

Nope.

That's pure Intel propaganda, and has absolutely no basis in reality.

There's a 13:32 bit address space, with the 13 bits coming from the
segment registers. True.

However, that does NOT give you 45 bits of addressing, however much Intel
tried to claim that in early literature. The 13:32 address is mapped onto
a plain linear 32-bit address space, and that's all it gives you.
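
The point about the linear address space can be sketched in a few lines. This is a simplified model, not a description of the full descriptor format: struct segment_sketch and segment_to_linear() are hypothetical names, and only the base and limit of a segment are modelled. Whatever selector is used, the result is a single 32-bit linear address.

```c
#include <stdint.h>

/* Simplified view of a segment descriptor: just base and limit. */
struct segment_sketch {
	uint32_t base;	/* linear address where the segment starts */
	uint32_t limit;	/* highest valid offset inside the segment */
};

/*
 * Translate selector:offset to a linear address. Returns 0 and stores
 * the result in *linear, or -1 if the offset exceeds the limit.
 */
static int segment_to_linear(const struct segment_sketch *seg,
			     uint32_t offset, uint32_t *linear)
{
	if (offset > seg->limit)
		return -1;
	/* uint32_t arithmetic wraps modulo 2^32: still only 32 bits. */
	*linear = seg->base + offset;
	return 0;
}
```

Two segments with different bases therefore only give two views onto the same 4GB linear space; they do not add address bits.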

[ In theory, you can play games with the present bit in the segments to
  make it appear like more, but in practice that is basically useless too,
  don't even bother mentioning it ]

You can make the 36 physical bits available to software the same way
people used to do expansion memory on a 286 - by having a window and
having software change that window. Some databases would be happy with
that. But I much prefer just letting processes have their 3GB worth of
address space, and being able to map in the occasional big page when
really needed. 
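
The window technique mentioned above can be illustrated with a user-space analogy. Nothing here touches real page tables: backing[] stands in for memory that cannot be addressed directly, window_select() stands in for remapping the window, and all names and sizes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SIZE  4096u
#define BACKING_SIZE (16u * WINDOW_SIZE)

static uint8_t backing[BACKING_SIZE];	/* memory beyond direct reach */
static size_t window_base;		/* part of backing now visible */

/* Move the window; base must be window-aligned and inside backing. */
static int window_select(size_t base)
{
	if (base % WINDOW_SIZE != 0 || base >= BACKING_SIZE)
		return -1;
	window_base = base;
	return 0;
}

/* Pointer into the currently visible window, or NULL if out of range. */
static uint8_t *window_ptr(size_t offset)
{
	return offset < WINDOW_SIZE ? &backing[window_base + offset] : NULL;
}

/* Every access first moves the window, then goes through it. */
static int big_write(size_t addr, uint8_t value)
{
	if (window_select(addr - addr % WINDOW_SIZE) != 0)
		return -1;
	*window_ptr(addr % WINDOW_SIZE) = value;
	return 0;
}

static int big_read(size_t addr, uint8_t *value)
{
	if (window_select(addr - addr % WINDOW_SIZE) != 0)
		return -1;
	*value = *window_ptr(addr % WINDOW_SIZE);
	return 0;
}
```

Every access outside the current window pays for a window switch first, which is the cost being objected to here.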

Or, actually, I'd much prefer a sane architecture that doesn't continually
try to reinvent the bad idea of memory windows.

			Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [bigmem-patch] 4GB with Linux on IA32
  1999-08-17 11:39                         ` Alan Cox
@ 1999-08-26 16:27                           ` Andrea Arcangeli
  0 siblings, 0 replies; 39+ messages in thread
From: Andrea Arcangeli @ 1999-08-26 16:27 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, kanoj, sct, Gerhard.Wichert, Winfried.Gerhard,
	linux-kernel, linux-mm, linux-usb

On Tue, 17 Aug 1999, Alan Cox wrote:

>of bus addressses of vmalloc pages. I don't think the 4Gig patch breaks it
>at all. In the ideal world virt_to_bus() would work on vmalloc pages. It

Yes, the bigmem patch doesn't break bttv.

bttv allocates the DMA pool via vmalloc, and with the bigmem patch applied
vmalloc prefers the bigmem pages, so the DMA pool will always be allocated
in bigmem memory.

But with vmalloc all bigmem pages will have a valid virt-to-phys
translation. (Only GFP may return a pointer without a valid virt-to-phys
translation, and only if __GFP_BIGMEM has been specified in the gfp_mask.)

So the kernel can also copy-from/to-user the DMA pool using the vmalloc
addresses, since they are _valid_ addresses.

Via mmap the vmalloced pages will be remapped to userspace memory and
that's fine as well.

Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~1999-08-26 16:27 UTC | newest]

Thread overview: 39+ messages
1999-08-16 16:29 [bigmem-patch] 4GB with Linux on IA32 Andrea Arcangeli
1999-08-16 16:48 ` Matthew Wilcox
1999-08-16 17:19   ` Andrea Arcangeli
1999-08-16 18:43 ` Kanoj Sarcar
1999-08-16 19:43   ` Alan Cox
1999-08-16 20:54     ` Andrea Arcangeli
1999-08-16 22:47       ` Andrea Arcangeli
1999-08-16 23:26         ` Andrea Arcangeli
1999-08-16 23:39           ` Kanoj Sarcar
1999-08-17  0:10             ` Andrea Arcangeli
1999-08-17  6:37               ` Kanoj Sarcar
1999-08-17  6:41                 ` Linus Torvalds
1999-08-17  6:50                   ` Kanoj Sarcar
1999-08-17  7:03                     ` Linus Torvalds
1999-08-17  7:23                       ` Linus Torvalds
1999-08-17 11:39                         ` Alan Cox
1999-08-26 16:27                           ` Andrea Arcangeli
1999-08-17 11:46                     ` Alan Cox
1999-08-17 14:26             ` Andrea Arcangeli
1999-08-17  6:39           ` Linus Torvalds
1999-08-17 12:40             ` Andrea Arcangeli
1999-08-17  6:29         ` Linus Torvalds
1999-08-17 12:37           ` Andrea Arcangeli
1999-08-17 14:04             ` Andrea Arcangeli
1999-08-17  8:52         ` Jakub Jelinek
1999-08-17  9:13         ` Pavel Machek
1999-08-18 14:08           ` Andrea Arcangeli
1999-08-19 12:20             ` Andrea Arcangeli
1999-08-16 23:28       ` Kanoj Sarcar
1999-08-16 23:49         ` Andrea Arcangeli
1999-08-17  6:29         ` David S. Miller
1999-08-17 12:38           ` Andrea Arcangeli
1999-08-17  0:17       ` Andrea Arcangeli
1999-08-19 13:33         ` Thierry Vignaud
1999-08-19 16:49           ` Stephen C. Tweedie
1999-08-20  7:35             ` Thierry Vignaud
1999-08-20  9:55               ` Alan Cox
1999-08-20 18:25               ` Linus Torvalds
1999-08-16 20:34   ` Andrea Arcangeli
