* Non-GPL export of invalidate_mmap_range
@ 2004-02-16 19:09 Paul E. McKenney
2004-02-17 2:31 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-16 19:09 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm
Hello, Andrew,
The attached patch to make invalidate_mmap_range() non-GPL exported
seems to have been lost somewhere between 2.6.1-mm4 and 2.6.1-mm5.
It still applies cleanly. Could you please take it up again?
Thanx, Paul
------------------------------------------------------------------------
It was EXPORT_SYMBOL_GPL(), however IBM's GPFS is not GPL.
- the GPFS team contributed to the testing and development of
invalidate_mmap_range().
- GPFS was developed under AIX and was ported to Linux, and hence meets
Linus's "some binary modules are OK" exemption.
- The export makes sense: clustering filesystems need it for shootdowns to
ensure cache coherency.
25-akpm/mm/memory.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
diff -puN mm/memory.c~invalidate_mmap_range-non-gpl-export mm/memory.c
--- 25/mm/memory.c~invalidate_mmap_range-non-gpl-export Mon Nov 24 11:33:19 2003
+++ 25-akpm/mm/memory.c Mon Nov 24 11:33:34 2003
@@ -1164,7 +1164,7 @@ void invalidate_mmap_range(struct addres
invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen);
up(&mapping->i_shared_sem);
}
-EXPORT_SYMBOL_GPL(invalidate_mmap_range);
+EXPORT_SYMBOL(invalidate_mmap_range);
/*
* Handle all mappings that got truncated by a "truncate()"
_
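The effect of the one-line change above can be pictured with a small userspace model. This is only a sketch and not kernel code: struct toy_symbol, toy_resolve() and the two tables are hypothetical names made up for illustration, and the model assumes nothing more than "a GPL-only export is refused to a module that does not declare a GPL-compatible license".

```c
/* Toy userspace model of GPL-only symbol gating.
 * NOT kernel code: every name below is hypothetical and illustrative. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct toy_symbol {
    const char *name;
    bool gpl_only; /* true models EXPORT_SYMBOL_GPL(), false models EXPORT_SYMBOL() */
};

/* The export as it stood before the patch. */
static const struct toy_symbol before_patch[] = {
    { "invalidate_mmap_range", true },
};

/* The export as the patch would leave it. */
static const struct toy_symbol after_patch[] = {
    { "invalidate_mmap_range", false },
};

/* Return the table entry if a module with the given licensing may use
 * the symbol, or NULL if the symbol is missing or withheld. */
const struct toy_symbol *toy_resolve(const struct toy_symbol *tab, size_t n,
                                     const char *name, bool module_is_gpl)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].name, name) != 0)
            continue;
        if (tab[i].gpl_only && !module_is_gpl)
            return NULL; /* symbol exists but is withheld from this module */
        return &tab[i];
    }
    return NULL; /* no such export at all */
}

/* Walk through the cases the thread argues about. */
void toy_self_check(void)
{
    /* A GPL module can use the symbol either way. */
    assert(toy_resolve(before_patch, 1, "invalidate_mmap_range", true) != NULL);
    /* A non-GPL module is refused before the patch... */
    assert(toy_resolve(before_patch, 1, "invalidate_mmap_range", false) == NULL);
    /* ...and accepted after it. */
    assert(toy_resolve(after_patch, 1, "invalidate_mmap_range", false) != NULL);
    /* Unknown names fail regardless of licensing. */
    assert(toy_resolve(after_patch, 1, "no_such_symbol", true) == NULL);
}
```

In this model the patch changes exactly one outcome, the non-GPL lookup of invalidate_mmap_range; everything else behaves the same before and after.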
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
* Re: Non-GPL export of invalidate_mmap_range
2004-02-16 19:09 Non-GPL export of invalidate_mmap_range Paul E. McKenney
@ 2004-02-17 2:31 ` Andrew Morton
2004-02-17 7:35 ` Christoph Hellwig
2004-02-17 22:22 ` David Weinehall
2 siblings, 0 replies; 68+ messages in thread
From: Andrew Morton @ 2004-02-17 2:31 UTC (permalink / raw)
To: paulmck; +Cc: linux-kernel, linux-mm
"Paul E. McKenney" <paulmck@us.ibm.com> wrote:
>
> The attached patch to make invalidate_mmap_range() non-GPL exported
> seems to have been lost somewhere between 2.6.1-mm4 and 2.6.1-mm5.
> It still applies cleanly. Could you please take it up again?

I don't have any particular opinions either way but I do recall there was
some disquiet last time this came up. I'm sure someone will remind us ;)
* Re: Non-GPL export of invalidate_mmap_range
2004-02-16 19:09 Non-GPL export of invalidate_mmap_range Paul E. McKenney
2004-02-17 2:31 ` Andrew Morton
@ 2004-02-17 7:35 ` Christoph Hellwig
2004-02-17 12:40 ` Paul E. McKenney
2004-02-17 22:22 ` David Weinehall
2 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2004-02-17 7:35 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: akpm, linux-kernel, linux-mm
On Mon, Feb 16, 2004 at 11:09:27AM -0800, Paul E. McKenney wrote:
> Hello, Andrew,
>
> The attached patch to make invalidate_mmap_range() non-GPL exported
> seems to have been lost somewhere between 2.6.1-mm4 and 2.6.1-mm5.
> It still applies cleanly. Could you please take it up again?

And there's still no reason to ease IBM's GPL violations by exporting
deep VM internals. The GPLed DFS you claimed you needed this for still
hasn't shown up but instead you want to change the export all the time.

Tells a lot about IBM's promises..
* Re: Non-GPL export of invalidate_mmap_range
2004-02-17 7:35 ` Christoph Hellwig
@ 2004-02-17 12:40 ` Paul E. McKenney
2004-02-18 0:19 ` Andrew Morton
2004-02-18 12:12 ` Dominik Kubla
0 siblings, 2 replies; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-17 12:40 UTC (permalink / raw)
To: Christoph Hellwig, akpm, linux-kernel, linux-mm
On Tue, Feb 17, 2004 at 07:35:22AM +0000, Christoph Hellwig wrote:
> On Mon, Feb 16, 2004 at 11:09:27AM -0800, Paul E. McKenney wrote:
> > Hello, Andrew,
> >
> > The attached patch to make invalidate_mmap_range() non-GPL exported
> > seems to have been lost somewhere between 2.6.1-mm4 and 2.6.1-mm5.
> > It still applies cleanly. Could you please take it up again?
>
> And there's still no reason to ease IBM's GPL violations by exporting
> deep VM internals. The GPLed DFS you claimed you needed this for still
> hasn't shown up but instead you want to change the export all the time.
>
> Tells a lot about IBM's promises..

Hello, Christoph!

IBM shipped the promised SAN Filesystem some months ago. The source
code for the Linux client was released under GPL, as promised, and
may be found at the following URL:

https://www6.software.ibm.com/dl/sanfsys/sanfsref-i?S_PKG=dl&S_TACT=&S_CMP=

A PDF of the protocol specification may be found at the following URL:

http://www.storage.ibm.com/software/virtualization/sfs/protocol.html

These URLs do require that you register, but there is no cost nor
any agreement other than the GPL itself. The Linux client has not
been shipped as product yet. The code is still quite rough, which
is one reason that it has not been submitted to, for example, LKML. ;-)

Thanx, Paul
* Re: Non-GPL export of invalidate_mmap_range
2004-02-17 12:40 ` Paul E. McKenney
@ 2004-02-18 0:19 ` Andrew Morton
2004-02-18 12:51 ` Arjan van de Ven
2004-02-19 20:56 ` Daniel Phillips
2004-02-18 12:12 ` Dominik Kubla
1 sibling, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2004-02-18 0:19 UTC (permalink / raw)
To: paulmck; +Cc: hch, linux-kernel, linux-mm
"Paul E. McKenney" <paulmck@us.ibm.com> wrote:
>
> IBM shipped the promised SAN Filesystem some months ago.

Neat, but it's hard to see the relevance of this to your patch.

I don't see any licensing issues with the patch because the filesystem
which needs it clearly meets Linus's "this is not a derived work" criteria.

And I don't see a technical problem with the export: given that we export
truncate_inode_pages() it makes sense to also export the corresponding
pagetable shootdown function.

Yes, this is a sensitive issue. Can we please evaluate it strictly
according to technical and licensing considerations?

Having said that, what issues remain with Paul's patch?

Thanks.
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 0:19 ` Andrew Morton
@ 2004-02-18 12:51 ` Arjan van de Ven
2004-02-18 14:00 ` Paul E. McKenney
2004-02-18 18:04 ` Tim Bird
2004-02-19 20:56 ` Daniel Phillips
1 sibling, 2 replies; 68+ messages in thread
From: Arjan van de Ven @ 2004-02-18 12:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: paulmck, hch, linux-kernel, linux-mm
On Wed, 2004-02-18 at 01:19, Andrew Morton wrote:
> "Paul E. McKenney" <paulmck@us.ibm.com> wrote:
> >
> > IBM shipped the promised SAN Filesystem some months ago.
>
> Neat, but it's hard to see the relevance of this to your patch.
>
> I don't see any licensing issues with the patch because the filesystem
> which needs it clearly meets Linus's "this is not a derived work" criteria.

it does?

It needed no changes to work on linux?
it only uses "core unix" apis ?
it needs no changes to the core kernel? *buzz*
It doesn't require knowledge of deep and changing internals ? *buzz*
It doesn't need changing for various kernel versions ?

I remember this baby overriding syscalls and the like not too long
ago...

The word "clearly" isn't correct imo. Just because something has a few
lines of code that started on another OS doesn't make it "clearly" not a
derived work, at least not in my eyes.
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 12:51 ` Arjan van de Ven
@ 2004-02-18 14:00 ` Paul E. McKenney
2004-02-18 21:10 ` Christoph Hellwig
2004-02-18 18:04 ` Tim Bird
1 sibling, 1 reply; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-18 14:00 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: Andrew Morton, hch, linux-kernel, linux-mm
On Wed, Feb 18, 2004 at 01:51:35PM +0100, Arjan van de Ven wrote:
> On Wed, 2004-02-18 at 01:19, Andrew Morton wrote:
> > "Paul E. McKenney" <paulmck@us.ibm.com> wrote:
> > >
> > > IBM shipped the promised SAN Filesystem some months ago.
> >
> > Neat, but it's hard to see the relevance of this to your patch.
> >
> > I don't see any licensing issues with the patch because the filesystem
> > which needs it clearly meets Linus's "this is not a derived work" criteria.
>
> it does?

I believe so.

> It needed no changes to work on linux?

There is a small shim layer required, but the bulk of the code
implementing GPFS is common between AIX and Linux. It was on AIX
first by quite a few years.

> it only uses "core unix" apis ?

If they are made available, yes. That is the point of this patch,
after all. ;-)

> it needs no changes to the core kernel? *buzz*

You -can- run GPFS in the 2.4 kernel without core-kernel patches,
as long as you don't mind putting up with mmap/page-fault races
and with NFS exports from different nodes handing out the same lock
to two different NFS clients. ;-)

> It doesn't require knowledge of deep and changing internals ? *buzz*

That is indeed the idea.

> It doesn't need changing for various kernel versions ?

It is tested on specific kernel versions. Clearly moving from 2.4
to 2.6 requires some change.

> I remember this baby overriding syscalls and the like not too long
> ago...

???

> The word "clearly" isn't correct imo. Just because something has a few
> lines of code that started on another OS doesn't make it "clearly" not a
> derived work, at least not in my eyes.

Hmmm... You seem to have a rather expansive definition of "a few
lines of code". ;-)

Thanx, Paul
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 14:00 ` Paul E. McKenney
@ 2004-02-18 21:10 ` Christoph Hellwig
2004-02-18 15:06 ` Paul E. McKenney
0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2004-02-18 21:10 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Arjan van de Ven, Andrew Morton, hch, linux-kernel, linux-mm
On Wed, Feb 18, 2004 at 06:00:21AM -0800, Paul E. McKenney wrote:
> There is a small shim layer required, but the bulk of the code
> implementing GPFS is common between AIX and Linux. It was on AIX
> first by quite a few years.

Small glue layer? Unfortunately ibm took it off the website, but
the thing is damn huge.

> > it only uses "core unix" apis ?
>
> If they are made available, yes. That is the point of this patch,
> after all. ;-)

No, that's wrong. It patches the syscall table and plays evilish
tricks with lowlevel MM code.

> > It doesn't require knowledge of deep and changing internals ? *buzz*
>
> That is indeed the idea.

The one on the ibm website a little ago did. You're free to upload
a new one that clearly doesn't need all this, but..
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 21:10 ` Christoph Hellwig
@ 2004-02-18 15:06 ` Paul E. McKenney
2004-02-18 22:21 ` Christoph Hellwig
0 siblings, 1 reply; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-18 15:06 UTC (permalink / raw)
To: Christoph Hellwig, Arjan van de Ven, Andrew Morton, linux-kernel, linux-mm
On Wed, Feb 18, 2004 at 09:10:35PM +0000, Christoph Hellwig wrote:
> On Wed, Feb 18, 2004 at 06:00:21AM -0800, Paul E. McKenney wrote:
> > There is a small shim layer required, but the bulk of the code
> > implementing GPFS is common between AIX and Linux. It was on AIX
> > first by quite a few years.
>
> Small glue layer? Unfortunately ibm took it off the website, but
> the thing is damn huge.

Perhaps it is huge, but it is a small fraction of the GPFS kernel
implementation.

> > > it only uses "core unix" apis ?
> >
> > If they are made available, yes. That is the point of this patch,
> > after all. ;-)
>
> No, that's wrong. It patches the syscall table and plays evilish
> tricks with lowlevel MM code.

The sys_call_table stuff was under #ifdef, and was intended for
use by a research project that was later put out of its misery.
This stuff has since been removed from the source tree.

As to the evilish tricks with lowlevel MM code, the whole point
of the invalidate_mmap_range() patch is to be able to rid GPFS
of exactly these evilish tricks.

> > > It doesn't require knowledge of deep and changing internals ? *buzz*
> >
> > That is indeed the idea.
>
> The one on the ibm website a little ago did. You're free to upload
> a new one that clearly doesn't need all this, but..

Again, the point of the invalidate_mmap_range() patch is to be able to
do precisely this.

Thanx, Paul
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 15:06 ` Paul E. McKenney
@ 2004-02-18 22:21 ` Christoph Hellwig
2004-02-18 22:51 ` Andrew Morton
0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2004-02-18 22:21 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Christoph Hellwig, Arjan van de Ven, Andrew Morton, linux-kernel, linux-mm
> The sys_call_table stuff was under #ifdef, and was intended for
> use by a research project that was later put out of its misery.
> This stuff has since been removed from the source tree.
>
> As to the evilish tricks with lowlevel MM code, the whole point
> of the invalidate_mmap_range() patch is to be able to rid GPFS
> of exactly these evilish tricks.

It didn't look like that. Really Paul, the GPL is pretty clear on the
derived work thing, and when you need changes to the core kernel and all
kinds of nasty hacks it's pretty clear it is a derived work. And it's up
to IBM anyway to show it's not a derived work, which is pretty hard IMHO.

I don't understand why IBM is pushing this dubious change right now,
GPL violation and thus copyright violation issues in Linux is the
last thing IBM wants to see in the press with the current mess going
on, right?
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 22:21 ` Christoph Hellwig
@ 2004-02-18 22:51 ` Andrew Morton
2004-02-18 23:00 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 68+ messages in thread
From: Andrew Morton @ 2004-02-18 22:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: paulmck, arjanv, linux-kernel, linux-mm
Christoph Hellwig <hch@infradead.org> wrote:
>
> I don't understand why IBM is pushing this dubious change right now,

It isn't a dubious change, on technical grounds. It is reasonable for a
distributed filesystem to want to be able to shoot down pte's which map
sections of pagecache. Just as it is reasonable for the filesystem to be
able to shoot down the pagecache itself.

We've exported much lower-level stuff than this, because some in-kernel
module happened to use it.

> GPL violation and thus copyright violation issues in Linux is the
> last thing IBM wants to see in the press with the current mess going
> on, right?

Well this is a chicken-and-egg, isn't it. The only way in which we can
audit the IBM code for its derivedness is for the source to be made
available. Although not necessarily under GPL.

Or we accept Paul's claim, which I personally am inclined to do.

Look, this isn't going anywhere. We have a perfectly reasonable request
from Paul to make this symbol available for IBM's filesystem. The usual way
to handle this sort of thing is to say "ooh. shit. hard." and not reply to
the email. That is not adequate and hopefully Paul will not let us get
away with it.

We need to give Paul a reasoned and logically consistent answer to his
request. For that we need to establish some sort of framework against
which to make a decision and then make the decision. One approach is a
fait accompli from the top-level maintainer. Here, we're trying to do it
in a different way.

I have proposed two criteria upon which this should be judged:

a) Does the export make technical sense? Do filesystems have
legitimate need for access to this symbol?

(really, a) is sufficient grounds, but for real-world reasons:)

b) Does the IBM filesystem meet the kernel's licensing requirements?

It appears that the answers are a): yes and b) probably.

Please, feel free to add additional criteria. We could also ask "do we
want to withhold this symbol to encourage IBM to GPL the filesystem" or
"do we simply refuse to export any symbol which is not used by any GPL
software" (if so, why?).

Over to you.

But at the end of the day, if we decide to not export this symbol, we owe
Paul a good, solid reason, yes?
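The "shoot down pte's which map sections of pagecache" operation discussed above can be pictured with a toy userspace model. This is a sketch under loud assumptions: struct toy_mapping, toy_shootdown() and toy_invalidate_range() are hypothetical names, pages are plain array slots, and nothing here reflects how the kernel's invalidate_mmap_range() is actually implemented beyond the idea of walking every mapping of a file and dropping the translations that fall inside a given range.

```c
/* Toy model of a range shootdown across all mappings of one file.
 * NOT kernel code: every name below is hypothetical and illustrative. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PAGES 16

/* One mapping of the file, covering file pages [first, first + count). */
struct toy_mapping {
    size_t first;
    size_t count;
    bool present[TOY_MAX_PAGES]; /* present[i]: file page first+i is mapped in */
};

/* Drop every present page of one mapping that lies in [begin, begin + len).
 * Returns how many pages were dropped. */
size_t toy_shootdown(struct toy_mapping *m, size_t begin, size_t len)
{
    size_t end = begin + len;
    size_t cleared = 0;

    for (size_t i = 0; i < m->count && i < TOY_MAX_PAGES; i++) {
        size_t page = m->first + i;
        if (page >= begin && page < end && m->present[i]) {
            m->present[i] = false; /* next access would have to fault again */
            cleared++;
        }
    }
    return cleared;
}

/* Apply the shootdown to every mapping of the file, as a cluster
 * filesystem would want when another node takes over that range. */
size_t toy_invalidate_range(struct toy_mapping *maps, size_t n,
                            size_t begin, size_t len)
{
    size_t total = 0;

    for (size_t k = 0; k < n; k++)
        total += toy_shootdown(&maps[k], begin, len);
    return total;
}
```

With two overlapping mappings of the same file, a single call over the shared range drops the pages from both, and a second call over the same range finds nothing left to drop.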
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 22:51 ` Andrew Morton
@ 2004-02-18 23:00 ` Christoph Hellwig
2004-02-18 16:21 ` Paul E. McKenney
` (2 more replies)
2004-02-19 9:11 ` David Weinehall
2004-02-19 10:29 ` Lars Marowsky-Bree
2 siblings, 3 replies; 68+ messages in thread
From: Christoph Hellwig @ 2004-02-18 23:00 UTC (permalink / raw)
To: Andrew Morton, tovalds
Cc: Christoph Hellwig, paulmck, arjanv, linux-kernel, linux-mm
On Wed, Feb 18, 2004 at 02:51:32PM -0800, Andrew Morton wrote:
> a) Does the export make technical sense? Do filesystems have
> legitimate need for access to this symbol?
>
> (really, a) is sufficient grounds, but for real-world reasons:)
>
> b) Does the IBM filesystem meet the kernel's licensing requirements?
>
>
> It appears that the answers are a): yes and b) probably.

Well, the answer to b) is most likely not. I see it very hard to argue to
have something like gpfs not being a derived work. The glue code they
had online certainly looked very much like a derived work, and if the new
version got better they wouldn't have any reason to remove it from the
website, right?

> Please, feel free to add additional criteria. We could also ask "do we
> want to withhold this symbol to encourage IBM to GPL the filesystem" or
> "do we simply refuse to export any symbol which is not used by any GPL
> software" (if so, why?).

Yes. Andrew, please read the GPL, it's very clear about derived works.
Then please tell me why you think gpfs is not a derived work.

> But at the end of the day, if we decide to not export this symbol, we owe
> Paul a good, solid reason, yes?

Yes. We've traditionally not exported symbols unless we had an intree user,
and especially not if it's for a module that's not GPL licensed.

We had this discussion with Linus a few times, maybe he can comment again to
make it clear.
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 23:00 ` Christoph Hellwig
@ 2004-02-18 16:21 ` Paul E. McKenney
2004-02-18 23:32 ` Andrew Morton
2004-02-19 0:28 ` Andrew Morton
2 siblings, 0 replies; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-18 16:21 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, tovalds, arjanv, linux-kernel, linux-mm
On Wed, Feb 18, 2004 at 11:00:55PM +0000, Christoph Hellwig wrote:
> On Wed, Feb 18, 2004 at 02:51:32PM -0800, Andrew Morton wrote:
> > a) Does the export make technical sense? Do filesystems have
> > legitimate need for access to this symbol?
> >
> > (really, a) is sufficient grounds, but for real-world reasons:)
> >
> > b) Does the IBM filesystem meet the kernel's licensing requirements?
> >
> >
> > It appears that the answers are a): yes and b) probably.
>
> Well, the answer to b) is most likely not. I see it very hard to argue to
> have something like gpfs not being a derived work. The glue code they
> had online certainly looked very much like a derived work, and if the new
> version got better they wouldn't have any reason to remove it from the
> website, right?

Nice conspiracy theory! ;-) It was moved to a different website
some time ago:

http://techsupport.services.ibm.com/server/cluster/fixes/gpfsfixhome.html

The current version is 2.2.0-1. You will get a tar.gz file, and
the glue code source will be in gpfs.gpl-2.2.0-1.noarch.rpm after
you unpack.

Thanx, Paul

> > Please, feel free to add additional criteria. We could also ask "do we
> > want to withhold this symbol to encourage IBM to GPL the filesystem" or
> > "do we simply refuse to export any symbol which is not used by any GPL
> > software" (if so, why?).
>
> Yes. Andrew, please read the GPL, it's very clear about derived works.
> Then please tell me why you think gpfs is not a derived work.
>
> > But at the end of the day, if we decide to not export this symbol, we owe
> > Paul a good, solid reason, yes?
>
> Yes. We've traditionally not exported symbols unless we had an intree user,
> and especially not if it's for a module that's not GPL licensed.
>
> We had this discussion with Linus a few times, maybe he can comment again to
> make it clear.
>
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 23:00 ` Christoph Hellwig
2004-02-18 16:21 ` Paul E. McKenney
@ 2004-02-18 23:32 ` Andrew Morton
2004-02-19 12:32 ` Christoph Hellwig
2004-02-19 0:28 ` Andrew Morton
2 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2004-02-18 23:32 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: paulmck, arjanv, linux-kernel, linux-mm
Christoph Hellwig <hch@infradead.org> wrote:
>
> Yes. Andrew, please read the GPL, it's very clear about derived works.
> Then please tell me why you think gpfs is not a derived work.

I haven't seen the code.

> > But at the end of the day, if we decide to not export this symbol, we owe
> > Paul a good, solid reason, yes?
>
> Yes. We've traditionally not exported symbols unless we had an intree user,
> and especially not if it's for a module that's not GPL licensed.

That's certainly a good rule of thumb and we (and I) have used it before.

What is the reasoning behind it?
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 23:32 ` Andrew Morton
@ 2004-02-19 12:32 ` Christoph Hellwig
2004-02-19 18:56 ` Andrew Morton
0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2004-02-19 12:32 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, paulmck, arjanv, linux-kernel, linux-mm, torvalds
On Wed, Feb 18, 2004 at 03:32:34PM -0800, Andrew Morton wrote:
> > Yes. We've traditionally not exported symbols unless we had an intree user,
> > and especially not if it's for a module that's not GPL licensed.
>
> That's certainly a good rule of thumb and we (and I) have used it before.
>
> What is the reasoning behind it?

The reason is that someone who wants to distribute a binary only module
has to show its module is not a derived work, and someone who needs new
core in the kernel and new exports pretty much shows his work is deeply
integrated with the kernel.
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 12:32 ` Christoph Hellwig
@ 2004-02-19 18:56 ` Andrew Morton
2004-02-19 19:01 ` Christoph Hellwig
0 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2004-02-19 18:56 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: paulmck, arjanv, linux-kernel, linux-mm, torvalds
Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Feb 18, 2004 at 03:32:34PM -0800, Andrew Morton wrote:
> > > Yes. We've traditionally not exported symbols unless we had an intree user,
> > > and especially not if it's for a module that's not GPL licensed.
> >
> > That's certainly a good rule of thumb and we (and I) have used it before.
> >
> > What is the reasoning behind it?
>
> The reason is that someone who wants to distribute a binary only module
> has to show its module is not a derived work, and someone who needs new
> core in the kernel and new exports pretty much shows his work is deeply
> integrated with the kernel.

Needing access to invalidate_mmap_range() is surely not an indication of a
derived work. It is an indication of a need for a reliable way to achieve
inter-node cache consistency. Other distributed filesystems will need this
and probably AIX already provides it.
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 18:56 ` Andrew Morton
@ 2004-02-19 19:01 ` Christoph Hellwig
2004-02-19 13:04 ` Paul E. McKenney
2004-02-20 3:17 ` Anton Blanchard
0 siblings, 2 replies; 68+ messages in thread
From: Christoph Hellwig @ 2004-02-19 19:01 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, paulmck, arjanv, linux-kernel, linux-mm, torvalds
On Thu, Feb 19, 2004 at 10:56:08AM -0800, Andrew Morton wrote:
> inter-node cache consistency. Other distributed filesystems will need this
> and probably AIX already provides it.

You've probably not seen the AIX VM architecture. Good for you as it's
not good for your stomach. I did when I still was SCAldera and although
my NDAs don't allow me to go into details I can tell you that the AIX
VM architecture is deeply tied into the segment architecture of the Power
CPU and significantly different from any other UNIX variant.

So porting code from AIX that touches anything VM related is a complete
rewrite. Nice argumentation though, for everything but AIX it might
actually have worked :)
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 19:01 ` Christoph Hellwig
@ 2004-02-19 13:04 ` Paul E. McKenney
2004-02-20 3:17 ` Anton Blanchard
1 sibling, 0 replies; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-19 13:04 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, arjanv, linux-kernel, linux-mm, torvalds
On Thu, Feb 19, 2004 at 07:01:41PM +0000, Christoph Hellwig wrote:
> On Thu, Feb 19, 2004 at 10:56:08AM -0800, Andrew Morton wrote:
> > inter-node cache consistency. Other distributed filesystems will need this
> > and probably AIX already provides it.
>
> You've probably not seen the AIX VM architecture. Good for you as it's
> not good for your stomach. I did when I still was SCAldera and although
> my NDAs don't allow me to go into details I can tell you that the AIX
> VM architecture is deeply tied into the segment architecture of the Power
> CPU and significantly different from any other UNIX variant.
>
> So porting code from AIX that touches anything VM related is a complete
> rewrite.

Or, alternatively, requires a surprisingly large glue-code layer.

Thanx, Paul
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 19:01 ` Christoph Hellwig
2004-02-19 13:04 ` Paul E. McKenney
@ 2004-02-20 3:17 ` Anton Blanchard
2004-02-20 21:46 ` Valdis.Kletnieks
1 sibling, 1 reply; 68+ messages in thread
From: Anton Blanchard @ 2004-02-20 3:17 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, paulmck, arjanv, linux-kernel, linux-mm, torvalds
> You've probably not seen the AIX VM architecture. Good for you as it's
> not good for your stomach. I did when I still was SCAldera and although
> my NDAs don't allow me to go into details I can tell you that the AIX
> VM architecture is deeply tied into the segment architecture of the Power
> CPU and significantly different from any other UNIX variant.

Interesting, what version of AIX did you get access to? And how can you
be sure that's still the case?

Anton
* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 3:17 ` Anton Blanchard
@ 2004-02-20 21:46 ` Valdis.Kletnieks
0 siblings, 0 replies; 68+ messages in thread
From: Valdis.Kletnieks @ 2004-02-20 21:46 UTC (permalink / raw)
To: Anton Blanchard
Cc: Christoph Hellwig, Andrew Morton, paulmck, arjanv, linux-kernel, linux-mm, torvalds
On Fri, 20 Feb 2004 14:17:51 +1100, Anton Blanchard <anton@samba.org> said:
>
> > You've probably not seen the AIX VM architecture. Good for you as it's
> > not good for your stomach. I did when I still was SCAldera and although
> > my NDAs don't allow me to go into details I can tell you that the AIX
> > VM architecture is deeply tied into the segment architecture of the Power
> > CPU and significantly different from any other UNIX variant.
>
> Interesting, what version of AIX did you get access to? And how can you
> be sure that's still the case?

You don't need access to AIX source. Reading the IBM Redbook on writing a
device driver for AIX is sufficient proof. Or even reading up on how to get
more heap space than the usual number of segment registers using the 'ld'
command (yes, it's userspace visible).

And Christoph isn't pulling your leg - it's pretty bizarre...
* Re: Non-GPL export of invalidate_mmap_range 2004-02-18 23:00 ` Christoph Hellwig 2004-02-18 16:21 ` Paul E. McKenney 2004-02-18 23:32 ` Andrew Morton @ 2004-02-19 0:28 ` Andrew Morton 2004-02-18 18:36 ` Paul E. McKenney ` (2 more replies) 2 siblings, 3 replies; 68+ messages in thread From: Andrew Morton @ 2004-02-19 0:28 UTC (permalink / raw) To: Christoph Hellwig; +Cc: paulmck, arjanv, linux-kernel, linux-mm Christoph Hellwig <hch@infradead.org> wrote: > > Yes. Andrew, please read the GPL, it's very clear about derived works. > Then please tell me why you think gpfs is not a derived work. OK, so I looked at the wrapper. It wasn't a tremendously pleasant experience. It is huge, and uses fairly standard-looking filesytem interfaces and locking primitives. Also some awareness of NFSV4 for some reason. Still, the wrapper is GPL so this is not relevant. Its only use is to tell us whether or not the non-GPL bits are "derived" from Linux, and it doesn't do that. The GPL doesn't define a derived work. It says "If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, ..." And the "But when you distribute..." part is what the Linus doctrine rubs out. Because it is unreasonable to say that a large piece of work such as this is "derived" from Linux. Why do you believe that GPFS represents a kernel licensing violation? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 0:28 ` Andrew Morton @ 2004-02-18 18:36 ` Paul E. McKenney 2004-02-19 12:31 ` Christoph Hellwig 2004-02-20 1:27 ` David Schwartz 2 siblings, 0 replies; 68+ messages in thread From: Paul E. McKenney @ 2004-02-18 18:36 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Hellwig, arjanv, linux-kernel, linux-mm On Wed, Feb 18, 2004 at 04:28:58PM -0800, Andrew Morton wrote: > Christoph Hellwig <hch@infradead.org> wrote: > > > > Yes. Andrew, please read the GPL, it's very clear about derived works. > > Then please tell me why you think gpfs is not a derived work. > > OK, so I looked at the wrapper. It wasn't a tremendously pleasant > experience. It is huge, and uses fairly standard-looking filesytem > interfaces and locking primitives. Also some awareness of NFSV4 for some > reason. > > Still, the wrapper is GPL so this is not relevant. Its only use is to tell > us whether or not the non-GPL bits are "derived" from Linux, and it > doesn't do that. In the spirit of full disclosure, the wrapper is actually distributed under the BSD license. The GPFS guys tell me that the "gpl" in the RPM name means "GPFS Portability Layer". Thanx, Paul -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 0:28 ` Andrew Morton 2004-02-18 18:36 ` Paul E. McKenney @ 2004-02-19 12:31 ` Christoph Hellwig 2004-02-19 9:11 ` Paul E. McKenney 2004-02-19 18:59 ` Tim Bird 2004-02-20 1:27 ` David Schwartz 2 siblings, 2 replies; 68+ messages in thread From: Christoph Hellwig @ 2004-02-19 12:31 UTC (permalink / raw) To: Andrew Morton, torvalds Cc: Christoph Hellwig, paulmck, arjanv, linux-kernel, linux-mm On Wed, Feb 18, 2004 at 04:28:58PM -0800, Andrew Morton wrote: > OK, so I looked at the wrapper. It wasn't a tremendously pleasant > experience. It is huge, and uses fairly standard-looking filesystem > interfaces and locking primitives. Also some awareness of NFSV4 for some > reason. And pokes deep into internal structures that it shouldn't. > Still, the wrapper is GPL so this is not relevant. It's BSD licensed - they couldn't distribute it together with GPFS if it was GPL. > Its only use is to tell > us whether or not the non-GPL bits are "derived" from Linux, and it > doesn't do that. Well, something that needs an almost one-megabyte wrapper is by definition not a standalone work but something that's deeply intertwined with the kernel. The tons of kernel version checks certainly show it's poking deeper than it should. > Why do you believe that GPFS represents a kernel licensing violation? See above. Something that pokes deep into internal structures and even needs new exports certainly is a derived work. There are a few different interpretations of the derived works clause in the GPL around; the FSF one wouldn't allow binary modules at all, and Linus' one is also pretty strict. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 12:31 ` Christoph Hellwig @ 2004-02-19 9:11 ` Paul E. McKenney 2004-02-19 18:32 ` Lars Marowsky-Bree 2004-02-19 18:59 ` Tim Bird 1 sibling, 1 reply; 68+ messages in thread From: Paul E. McKenney @ 2004-02-19 9:11 UTC (permalink / raw) To: Christoph Hellwig, Andrew Morton, torvalds, arjanv, linux-kernel, linux-mm On Thu, Feb 19, 2004 at 12:31:10PM +0000, Christoph Hellwig wrote: > On Wed, Feb 18, 2004 at 04:28:58PM -0800, Andrew Morton wrote: > > OK, so I looked at the wrapper. It wasn't a tremendously pleasant > > experience. It is huge, and uses fairly standard-looking filesytem > > interfaces and locking primitives. Also some awareness of NFSV4 for some > > reason. > > And pokes deep into internal structures that it shouldn't. Again, the point of the patch is to get rid of such poking. > > Still, the wrapper is GPL so this is not relevant. > > It's BSD licensed - they couldn't distribute it together with GPFS if > it was GPL. Yep. > > Its only use is to tell > > us whether or not the non-GPL bits are "derived" from Linux, and it > > doesn't do that. > > Well, something that needs an almost one megabyte big wrapper per defintion > is not a standalone work but something that's deeply interwinded with > the kernel. The tons of kernel version checks certainly show it's poking > deeper than it should. On the size, I beg to differ. One of the reasons the glue module is so large is because of the fact that GPFS was written to run in an AIX kernel rather than a Linux kernel. I would guess that if GPFS had been instead been derived from Linux, the glue module would be much smaller. On the kernel version checks, the point of the patch is to get rid of at least some of these. > > Why do you believe that GPFS represents a kernel licensing violation? > > See above. Something that pokes deep into internal structures and even > needs new exports certainly is a derived work. 
There's a few different > interpretations of the derived works clause in the GPL around, the FSF > one wouldn't allow binary modules at all, and Linus' one is also pretty > strict. So why are you coming out against something that you seem to believe allows -better- alignment with Linus's rules? Thanx, Paul -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 9:11 ` Paul E. McKenney @ 2004-02-19 18:32 ` Lars Marowsky-Bree 2004-02-19 18:38 ` Arjan van de Ven 2004-02-19 19:16 ` viro 0 siblings, 2 replies; 68+ messages in thread From: Lars Marowsky-Bree @ 2004-02-19 18:32 UTC (permalink / raw) To: Paul E. McKenney, Christoph Hellwig, Andrew Morton, torvalds, arjanv, linux-kernel, linux-mm On 2004-02-19T01:11:29, "Paul E. McKenney" <paulmck@us.ibm.com> said: > > And pokes deep into internal structures that it shouldn't. > Again, the point of the patch is to get rid of such poking. I think this fiddling about this particular exported symbol is hiding the real issue. It seems that Christoph believes that _inherently_, any filesystem kernel module on Linux must be a derived work, because it is intimately tied into the kernel core / VFS. I can certainly see the reasoning here, and it is a valid point of view. Do we want to allow non-OSS filesystems in kernel space at all? That's the entire question. Personally, I would go with "No" and support the consequences of this, because I believe in Open Source; and that the value proposition of Linux is /not/ in binary-only modules, and I would /not/ sacrifice the OSS principles of the literal core of the Linux project for a short term pay-off. (But I'm personally trying to solve that by making them superfluous and putting them out of business by getting an OSS CFS, which seems to be more amiable ;-) Only if we can settle this, we can answer this export question. If we want to allow them, the export is a perfectly reasonable thing to ask for. If not, we probably need to add a few more _GPL barriers. A rule of thumb might be whether any code in the tree uses a given export, and if not, prune it. Anything which even we don't use or export across the user-land boundary certainly qualifies as a kernel interna. Currently, no kernel module seems to use this export. So I'd think such a point could certainly be made. 
Sincerely, Lars Marowsky-Bree <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs | try again. fail again. fail better. Research & Development, SUSE LINUX AG \ -- Samuel Beckett -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 18:32 ` Lars Marowsky-Bree @ 2004-02-19 18:38 ` Arjan van de Ven 2004-02-19 19:16 ` viro 1 sibling, 0 replies; 68+ messages in thread From: Arjan van de Ven @ 2004-02-19 18:38 UTC (permalink / raw) To: Lars Marowsky-Bree; +Cc: linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 489 bytes --] On Thu, Feb 19, 2004 at 07:32:10PM +0100, Lars Marowsky-Bree wrote: > > A rule of thumb might be whether any code in the tree uses a given > export, and if not, prune it. Anything which even we don't use or export > across the user-land boundary certainly qualifies as a kernel interna. political issues aside, this sounds like a decent rule-of-thumb in general; if NO module uses it, it is most likely the wrong API (for example obsoleted API left around) or something really internal. [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 18:32 ` Lars Marowsky-Bree 2004-02-19 18:38 ` Arjan van de Ven @ 2004-02-19 19:16 ` viro 2004-02-19 16:15 ` Paul E. McKenney 1 sibling, 1 reply; 68+ messages in thread From: viro @ 2004-02-19 19:16 UTC (permalink / raw) To: Lars Marowsky-Bree Cc: Paul E. McKenney, Christoph Hellwig, Andrew Morton, torvalds, arjanv, linux-kernel, linux-mm On Thu, Feb 19, 2004 at 07:32:10PM +0100, Lars Marowsky-Bree wrote: > Only if we can settle this, we can answer this export question. If we > want to allow them, the export is a perfectly reasonable thing to ask > for. If not, we probably need to add a few more _GPL barriers. > > A rule of thumb might be whether any code in the tree uses a given > export, and if not, prune it. Anything which even we don't use or export > across the user-land boundary certainly qualifies as a kernel interna. > > Currently, no kernel module seems to use this export. So I'd think such > a point could certainly be made. I'm not sure. I'm all for trimming the export list, but the real questions are * does that export make sense? * does it impose extra restrictions on what we can do with core code? (without breaking it, that is) * is it needed in the first place? If it's redundant - to hell it goes. Note that majority of the exported symbols fail at least one of the above and _that_ is why they should be killed. Whether their users are GPL or not doesn't matter - if they don't make sense, they must die, no matter what b0rken code might be using them. IMNSHO the questions above should be answered first and AFAICS they hadn't been even discussed in that case. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 19:16 ` viro @ 2004-02-19 16:15 ` Paul E. McKenney 0 siblings, 0 replies; 68+ messages in thread From: Paul E. McKenney @ 2004-02-19 16:15 UTC (permalink / raw) To: viro Cc: Lars Marowsky-Bree, Christoph Hellwig, Andrew Morton, torvalds, arjanv, linux-kernel, linux-mm On Thu, Feb 19, 2004 at 07:16:33PM +0000, viro@parcelfarce.linux.theplanet.co.uk wrote: > On Thu, Feb 19, 2004 at 07:32:10PM +0100, Lars Marowsky-Bree wrote: > > Only if we can settle this, we can answer this export question. If we > > want to allow them, the export is a perfectly reasonable thing to ask > > for. If not, we probably need to add a few more _GPL barriers. > > > > A rule of thumb might be whether any code in the tree uses a given > > export, and if not, prune it. Anything which even we don't use or export > > across the user-land boundary certainly qualifies as a kernel interna. > > > > Currently, no kernel module seems to use this export. So I'd think such > > a point could certainly be made. Good questions, see below for my nominations for the answers. > I'm not sure. I'm all for trimming the export list, but the real questions > are > * does that export make sense? Yes, invalidate_mmap_range() permits a distributed filesystem to shoot down mmap()s of a to-be-modified file so that all nodes see a consistent view of that file's data. Having an export means that this functionality need not be reproduced in each and every DFS, reducing DFS intrusiveness. Of course, the issue pointed out by Daniel does need to be addressed. More on that shortly. > * does it impose extra restrictions on what we can do with core > code? (without breaking it, that is) The invalidate_mmap_range() API is pretty generic. It takes an address_space structure, an offset, and a length. 
The caller can treat the address_space structure pointer as a cookie, so the only sorts of changes that could break this API would be ones that entirely did away with the concept of an address space. Or that introduced the concept of a file with non-integer offsets, in which case invalidate_mmap_range() is the least of our worries. Either case could happen, I suppose, but both seem a bit unlikely. > * is it needed in the first place? If it's redundant - to hell it > goes. Yes, to prevent DFSes from having to reach so far into the guts of the Linux VM system. Thanx, Paul > Note that majority of the exported symbols fail at least one of the above > and _that_ is why they should be killed. Whether their users are GPL or > not doesn't matter - if they don't make sense, they must die, no matter > what b0rken code might be using them. > > IMNSHO the questions above should be answered first and AFAICS they hadn't > been even discussed in that case. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
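As a rough illustration of the calling convention Paul describes above (an address_space pointer treated as an opaque cookie, plus an offset and a length), here is a minimal userspace sketch in C. It is not kernel code: `struct sketch_address_space`, `sketch_invalidate_mmap_range()` and `dfs_revoke_range()` are hypothetical stand-ins invented for this note, and the stub only records the request where the real function would tear down the mappings.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's loff_t so this compiles outside a kernel tree. */
typedef long long sketch_loff_t;

/*
 * Hypothetical stand-in for struct address_space. The filesystem never
 * looks inside it; it only passes the pointer along as a cookie.
 */
struct sketch_address_space {
    sketch_loff_t last_begin;  /* start of the last range invalidated */
    sketch_loff_t last_len;    /* length of the last range invalidated */
    int shootdowns;            /* number of invalidations requested */
};

/*
 * Stub with the shape described in the thread: mapping, offset, length.
 * The real function would unmap the pages; this one only records the call.
 */
static void sketch_invalidate_mmap_range(struct sketch_address_space *mapping,
                                         sketch_loff_t begin,
                                         sketch_loff_t len)
{
    mapping->last_begin = begin;
    mapping->last_len = len;
    mapping->shootdowns++;
}

/*
 * Hypothetical DFS hook: another node is about to modify the byte range
 * starting at offset, so local mmap()s of that range must be shot down
 * before the remote write is allowed to proceed.
 */
static int dfs_revoke_range(struct sketch_address_space *mapping,
                            sketch_loff_t offset, sketch_loff_t len)
{
    if (mapping == NULL || offset < 0 || len < 0)
        return -1;
    sketch_invalidate_mmap_range(mapping, offset, len);
    /* A real filesystem would then drop or refetch its cached pages. */
    return 0;
}
```

The point of the sketch is only that the filesystem side needs nothing beyond the three arguments, which is why the API is hard to break without removing the notion of an address space altogether.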
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 12:31 ` Christoph Hellwig 2004-02-19 9:11 ` Paul E. McKenney @ 2004-02-19 18:59 ` Tim Bird 1 sibling, 0 replies; 68+ messages in thread From: Tim Bird @ 2004-02-19 18:59 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, torvalds, paulmck, arjanv, linux-kernel, linux-mm Christoph Hellwig wrote: > On Wed, Feb 18, 2004 at 04:28:58PM -0800, Andrew Morton wrote: >>OK, so I looked at the wrapper. It wasn't a tremendously pleasant >>experience. It is huge, and uses fairly standard-looking filesytem >>interfaces and locking primitives. Also some awareness of NFSV4 for some >>reason. >> >>Still, the wrapper is GPL so this is not relevant. > > Well, something that needs an almost one megabyte big wrapper per defintion > is not a standalone work but something that's deeply interwinded with > the kernel. The tons of kernel version checks certainly show it's poking > deeper than it should. >... > > Something that pokes deep into internal structures and even > needs new exports certainly is a derived work. I'd argue (again) that having a complex glue layer is not evidence per se of the glued module being a derived work. If anything, it is evidence to the contrary. But it depends on the circumstances. The question for GPFS itself is whether it was modified to run with Linux, and how it was modified, and how much it was modified. If your argument is that Linux, after being modified with the glue layer, is now a derivative work of the glued module, that seems more likely. I'm not sure how the GPL reads on that case. ============================= Tim Bird Architecture Group Co-Chair CE Linux Forum Senior Staff Engineer Sony Electronics E-mail: Tim.Bird@am.sony.com ============================= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* RE: Non-GPL export of invalidate_mmap_range 2004-02-19 0:28 ` Andrew Morton 2004-02-18 18:36 ` Paul E. McKenney 2004-02-19 12:31 ` Christoph Hellwig @ 2004-02-20 1:27 ` David Schwartz 2 siblings, 0 replies; 68+ messages in thread From: David Schwartz @ 2004-02-20 1:27 UTC (permalink / raw) To: Christoph Hellwig; +Cc: paulmck, arjanv, linux-mm > Christoph Hellwig <hch@infradead.org> wrote: > And the "But when you distribute..." part is what the Linus doctrine rubs > out. Because it is unreasonable to say that a large piece of work such as > this is "derived" from Linux. I think you misunderstand how the GPL uses the term "derive". By a "derived work", the GPL is invoking the legal copyright principle of a "derivative work". You can google this term to get a better understanding of it. The term "derived work" does not imply that the work is wholly derived. Rather, it means that some part of the protected expression of the original work is present in the work. In the specific case of Linux kernel modules, the question is whether some part of the protectable expression in the Linux kernel is present in the module. This is a major issue for compiled modules distributed in object form because the compilation process, through header files, puts pieces of the header files in the resultant object. If the distributed work is in source code form, however, the argument becomes quite different. You are not likely to find pieces of the kernel code present in the source code that's distributed. However, one possible argument is that the module is a "sequel" to the kernel. It takes the framework the kernel creates and builds on it. I can't write and sell a Star Trek novel for just this reason: it would be derived from previous such novels because it borrows their universe. Another possible argument is that the module code is so intertwined with kernel code that you can't consider the module by itself a work at all. 
In the present case, we have a shim that is distributed in source form. The main module works with other operating systems and doesn't contain much Linux-specific code. So the module itself is not a derived work of Linux. The shim is probably a derived work, but the shim is open source. So if there's a license issue, I don't know what it is. DS -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-18 22:51 ` Andrew Morton 2004-02-18 23:00 ` Christoph Hellwig @ 2004-02-19 9:11 ` David Weinehall 2004-02-19 8:58 ` Paul E. McKenney 2004-02-19 10:29 ` Lars Marowsky-Bree 2 siblings, 1 reply; 68+ messages in thread From: David Weinehall @ 2004-02-19 9:11 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Hellwig, paulmck, arjanv, linux-kernel, linux-mm On Wed, Feb 18, 2004 at 02:51:32PM -0800, Andrew Morton wrote: > Christoph Hellwig <hch@infradead.org> wrote: > > > > I don't understand why IBM is pushing this dubious change right now, > > It isn't a dubious change, on technical grounds. It is reasonable for a > distributed filesystem to want to be able to shoot down pte's which map > sections of pagecache. Just as it is reasonable for the filesystem to be > able to shoot down the pagecache itself. > > We've exported much lower-level stuff than this, because some in-kernel > module happened to use it. Probably not always the right choice, though... I highly suspect far too much of our intestines is easily available. [snip] > We need to give Paul a reasoned and logically consistent answer to his > request. For that we need to establish some sort of framework against > which to make a decision and then make the decision. > > One approach is a fait accompli from the top-level maintainer. Here, > we're trying to do it in a different way. > > I have proposed two criteria upon which this should be judged: > > a) Does the export make technical sense? Do filesystems have > legitimate need for access to this symbol? > > (really, a) is sufficient grounds, but for real-world reasons:) > > b) Does the IBM filesystem meet the kernel's licensing requirements? > > > It appears that the answers are a): yes and b) probably. a.) Definitely b.) Perhaps > Please, feel free to add additional criteria. 
We could also ask "do we > want to withhold this symbols to encourage IBM to GPL the filesystem" or > "do we simply refuse to export any symbol which is not used by any GPL > software" (if so, why?). Over to you. Well, I wasn't altogether joking when I suggested IBM should GPL gpfs. A couple of questions: * Is gpfs a commercial product in the sense that it's something IBM earns revenue from? * Does gpfs contain third party "Intellectual Property" (no, I'm not particularly fond of using that expression, but I digress) If the answer is NO to both of these questions, why _not_ GPL the code? If the answer is NO to only the second question, is the revenue from gpfs big enough to warrant keeping it proprietary? > But at the end of the day, if we decide to not export this symbol, we owe > Paul a good, solid reason, yes? Yup. Silence isn't always golden, sometimes it's outright shitty. Regards: David Weinehall -- /) David Weinehall <tao@acc.umu.se> /) Northern lights wander (\ // Maintainer of the v2.0 kernel // Dance across the winter sky // \) http://www.acc.umu.se/~tao/ (/ Full colour fire (/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 9:11 ` David Weinehall @ 2004-02-19 8:58 ` Paul E. McKenney 2004-03-04 5:51 ` Mike Fedyk 0 siblings, 1 reply; 68+ messages in thread From: Paul E. McKenney @ 2004-02-19 8:58 UTC (permalink / raw) To: Andrew Morton, Christoph Hellwig, arjanv, linux-kernel, linux-mm On Thu, Feb 19, 2004 at 10:11:32AM +0100, David Weinehall wrote: > On Wed, Feb 18, 2004 at 02:51:32PM -0800, Andrew Morton wrote: > > Christoph Hellwig <hch@infradead.org> wrote: > > > > > > I don't understand why IBM is pushing this dubious change right now, > > > > It isn't a dubious change, on technical grounds. It is reasonable for a > > distributed filesystem to want to be able to shoot down pte's which map > > sections of pagecache. Just as it is reasonable for the filesystem to be > > able to shoot down the pagecache itself. > > > > We've exported much lower-level stuff than this, because some in-kernel > > module happened to use it. > > Probably not always the right choice, though... I highly suspect we > far to much of our intestines are easily available. Again, the whole point of the patch is to -reduce- the degree of intestinal export. Thanx, Paul -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 8:58 ` Paul E. McKenney @ 2004-03-04 5:51 ` Mike Fedyk 0 siblings, 0 replies; 68+ messages in thread From: Mike Fedyk @ 2004-03-04 5:51 UTC (permalink / raw) To: paulmck; +Cc: Andrew Morton, Christoph Hellwig, arjanv, linux-kernel, linux-mm Paul E. McKenney wrote: > On Thu, Feb 19, 2004 at 10:11:32AM +0100, David Weinehall wrote: > >>On Wed, Feb 18, 2004 at 02:51:32PM -0800, Andrew Morton wrote: >> >>>Christoph Hellwig <hch@infradead.org> wrote: >>> >>>>I don't understand why IBM is pushing this dubious change right now, >>> >>>It isn't a dubious change, on technical grounds. It is reasonable for a >>>distributed filesystem to want to be able to shoot down pte's which map >>>sections of pagecache. Just as it is reasonable for the filesystem to be >>>able to shoot down the pagecache itself. >>> >>>We've exported much lower-level stuff than this, because some in-kernel >>>module happened to use it. >> >>Probably not always the right choice, though... I highly suspect we >>far to much of our intestines are easily available. > > > Again, the whole point of the patch is to -reduce- the degree of > intestinal export. > > Thanx, Paul Paul, this still doesn't answer why GPFS can't be released under the GPL. If this has been answered, I'd love to see a pointer to which archives in which I should search. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-18 22:51 ` Andrew Morton 2004-02-18 23:00 ` Christoph Hellwig 2004-02-19 9:11 ` David Weinehall @ 2004-02-19 10:29 ` Lars Marowsky-Bree 2004-02-19 9:00 ` Paul E. McKenney 2004-02-19 11:11 ` Arjan van de Ven 2 siblings, 2 replies; 68+ messages in thread From: Lars Marowsky-Bree @ 2004-02-19 10:29 UTC (permalink / raw) To: Andrew Morton, Christoph Hellwig; +Cc: paulmck, arjanv, linux-kernel, linux-mm On 2004-02-18T14:51:32, Andrew Morton <akpm@osdl.org> said: > a) Does the export make technical sense? Do filesystems have > legitimate need for access to this symbol? > > (really, a) is sufficient grounds, but for real-world reasons:) Technically, I assume both OCFS, Lustre, (OpenGFS), PolyServe and basically /everyone/ doing a cluster file system, proprietary or not, will eventually need this capability. Vendors have included hooks for this in 2.4 already anyway. So on technical grounds, I'm strongly inclined to support it, but I would like to suggest that it is ensured that the hook is sufficient for all of the named CFS. Paul, have you spoken with them? > b) Does the IBM filsystem meet the kernel's licensing requirements? If you are worried about this one, you can export it GPL-only, which as an Open Source developer I'd appreciate, but from a real-world business perspective would be unhappy about ;-) Sincerely, Lars Marowsky-Bree <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs | try again. fail again. fail better. Research & Development, SUSE LINUX AG \ -- Samuel Beckett -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 10:29 ` Lars Marowsky-Bree @ 2004-02-19 9:00 ` Paul E. McKenney 2004-02-19 11:11 ` Arjan van de Ven 1 sibling, 0 replies; 68+ messages in thread From: Paul E. McKenney @ 2004-02-19 9:00 UTC (permalink / raw) To: Lars Marowsky-Bree Cc: Andrew Morton, Christoph Hellwig, arjanv, linux-kernel, linux-mm On Thu, Feb 19, 2004 at 11:29:00AM +0100, Lars Marowsky-Bree wrote: > On 2004-02-18T14:51:32, > Andrew Morton <akpm@osdl.org> said: > > > a) Does the export make technical sense? Do filesystems have > > legitimate need for access to this symbol? > > > > (really, a) is sufficient grounds, but for real-world reasons:) > > Technically, I assume both OCFS, Lustre, (OpenGFS), PolyServe and > basically /everyone/ doing a cluster file system, proprietary or not, > will eventually need this capability. Vendors have included hooks for > this in 2.4 already anyway. > > So on technical grounds, I'm strongly inclined to support it, but I > would like to suggest that it is ensured that the hook is sufficient for > all of the named CFS. > > Paul, have you spoken with them? Lustre, yes. At OLS last summer, Peter Braam said that it was useful. The others, no, but they are certainly free to chime in. > > b) Does the IBM filsystem meet the kernel's licensing requirements? > > If you are worried about this one, you can export it GPL-only, which as > an Open Source developer I'd appreciate, but from a real-world business > perspective would be unhappy about ;-) Been there, done that. ;-) Thanx, Paul > Sincerely, > Lars Marowsky-Bree <lmb@suse.de> > > -- > High Availability & Clustering \ ever tried. ever failed. no matter. > SUSE Labs | try again. fail again. fail better. > Research & Development, SUSE LINUX AG \ -- Samuel Beckett > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: Non-GPL export of invalidate_mmap_range 2004-02-19 10:29 ` Lars Marowsky-Bree 2004-02-19 9:00 ` Paul E. McKenney @ 2004-02-19 11:11 ` Arjan van de Ven 2004-02-19 11:53 ` Lars Marowsky-Bree 1 sibling, 1 reply; 68+ messages in thread From: Arjan van de Ven @ 2004-02-19 11:11 UTC (permalink / raw) To: Lars Marowsky-Bree Cc: Andrew Morton, Christoph Hellwig, paulmck, linux-kernel, linux-mm On Thu, Feb 19, 2004 at 11:29:00AM +0100, Lars Marowsky-Bree wrote: > > b) Does the IBM filesystem meet the kernel's licensing requirements? > > If you are worried about this one, you can export it GPL-only, which as > an Open Source developer I'd appreciate, but from a real-world business > perspective would be unhappy about ;-) It already is exported GPL-only; this is all about changing it so that binary-only modules can link against it as well... ^ permalink raw reply [flat|nested] 68+ messages in thread
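For readers unfamiliar with the distinction Arjan is pointing at: a symbol exported with EXPORT_SYMBOL_GPL() can only be resolved by modules that declare a GPL-compatible license, while a plain EXPORT_SYMBOL() is available to any module. The following userspace sketch in C models that rule and the effect of the one-line patch under discussion. The table layout and the names `export_entry`, `exports_before`, `exports_after` and `can_resolve()` are hypothetical and much simpler than what the module loader really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified export table entry; not a kernel structure. */
struct export_entry {
    const char *name;
    bool gpl_only;   /* true models EXPORT_SYMBOL_GPL, false EXPORT_SYMBOL */
};

/* State before the patch: the symbol is exported GPL-only. */
static const struct export_entry exports_before[] = {
    { "invalidate_mmap_range", true },
};

/* State after the patch: the same symbol, exported to all modules. */
static const struct export_entry exports_after[] = {
    { "invalidate_mmap_range", false },
};

/*
 * Returns true if a module may link against the named symbol.
 * module_is_gpl models whether the module declares a GPL-compatible license.
 */
static bool can_resolve(const struct export_entry *table, size_t n,
                        const char *name, bool module_is_gpl)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].name, name) == 0)
            return module_is_gpl || !table[i].gpl_only;
    }
    return false;  /* symbol is not exported at all */
}
```

Under this model a GPL module can use the symbol in both states, and the patch changes the outcome only for a non-GPL module, which is the whole of the licensing argument in this thread.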
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 11:11 ` Arjan van de Ven
@ 2004-02-19 11:53 ` Lars Marowsky-Bree
0 siblings, 0 replies; 68+ messages in thread
From: Lars Marowsky-Bree @ 2004-02-19 11:53 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: linux-kernel, linux-mm

On 2004-02-19T12:11:17,
Arjan van de Ven <arjanv@redhat.com> said:

> It already is exported GPL-only, this is all about changing it to be for
> linking bin only modules as well...

I blame lack of coffee and want a brown paper bag.

Sorry. ;)

Sincerely,
Lars Marowsky-Bree <lmb@suse.de>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 12:51 ` Arjan van de Ven
2004-02-18 14:00 ` Paul E. McKenney
@ 2004-02-18 18:04 ` Tim Bird
1 sibling, 0 replies; 68+ messages in thread
From: Tim Bird @ 2004-02-18 18:04 UTC (permalink / raw)
To: arjanv; +Cc: Andrew Morton, paulmck, hch, linux-kernel, linux-mm

I should know better than to stir up a hornet's nest by discussing
GPL issues on this list... :)

Arjan van de Ven wrote:
> On Wed, 2004-02-18 at 01:19, Andrew Morton wrote:
>>Neat, but it's hard to see the relevance of this to your patch.
>>I don't see any licensing issues with the patch because the filesystem
>>which needs it clearly meets Linus's "this is not a derived work"
>>criteria.
>
> it does?
...
> it needs no changes to the core kernel? *buzz*

Actually, this would tend towards an interpretation that it
was NOT a derived work. That is, if the Linux kernel must
be modified in order to run with a piece of software, that's
one indicator that the piece of software (when standing alone)
may not be derived from the kernel.

I am purposely avoiding the "but what about when it's linked"
argument.

=============================
Tim Bird
Architecture Group Co-Chair
CE Linux Forum
Senior Staff Engineer
Sony Electronics
E-mail: Tim.Bird@am.sony.com
=============================
* Re: Non-GPL export of invalidate_mmap_range
2004-02-18 0:19 ` Andrew Morton
2004-02-18 12:51 ` Arjan van de Ven
@ 2004-02-19 20:56 ` Daniel Phillips
2004-02-19 22:06 ` Stephen C. Tweedie
1 sibling, 1 reply; 68+ messages in thread
From: Daniel Phillips @ 2004-02-19 20:56 UTC (permalink / raw)
To: Andrew Morton, paulmck; +Cc: hch, linux-kernel, linux-mm

On Tuesday 17 February 2004 19:19, Andrew Morton wrote:
> I don't see any licensing issues with the patch because the filesystem
> which needs it clearly meets Linus's "this is not a derived work" criteria.
>
> And I don't see a technical problem with the export: given that we export
> truncate_inode_pages() it makes sense to also export the corresponding
> pagetable shootdown function.
>
> Yes, this is a sensitive issue. Can we please evaluate it strictly
> according to technical and licensing considerations?
>
> Having said that, what concerns remain with Paul's patch?

Hi Andrew,

OpenGFS and Sistina GFS use zap_page_range directly, essentially doing the
same as invalidate_mmap_range but skipping any vmas belonging to MAP_PRIVATE
mmaps. This avoids destroying data on anon pages. GPFS and every other DFS
have the same problem as far as I can see, and it isn't addressed by
exporting invalidate_mmap_range as it stands.

Paul?

Regards,

Daniel
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 20:56 ` Daniel Phillips
@ 2004-02-19 22:06 ` Stephen C. Tweedie
2004-02-19 22:31 ` Daniel Phillips
0 siblings, 1 reply; 68+ messages in thread
From: Stephen C. Tweedie @ 2004-02-19 22:06 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrew Morton, Paul E. McKenney, Christoph Hellwig, linux-kernel, linux-mm, Stephen Tweedie

Hi,

On Thu, 2004-02-19 at 20:56, Daniel Phillips wrote:

> OpenGFS and Sistina GFS use zap_page_range directly, essentially doing the
> same as invalidate_mmap_range but skipping any vmas belonging to MAP_PRIVATE
> mmaps.

Well, MAP_PRIVATE maps can contain shared pages too --- any page in a
MAP_PRIVATE map that has been mapped but not yet written to is still
shared, and still needs shot down on truncate().

--Stephen
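
Stephen's point can be observed from userspace. What follows is only a minimal sketch, assuming Linux page-cache behaviour on a local filesystem (POSIX leaves it unspecified whether later changes to the file show through a MAP_PRIVATE mapping). The names private_map_demo(), fill_file() and struct private_map_result are invented for this illustration, and the temporary path under /tmp is likewise an assumption. It maps two pages privately, only reads the first, writes to the second, then rewrites the file and checks what each page shows.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type for this illustration only. */
struct private_map_result {
	int unwritten_page_sees_file_change; /* page was only read before the file changed */
	int written_page_keeps_private_copy; /* page was written (CoWed) before the file changed */
};

/* Overwrite the first len bytes of fd with byte c; returns 0 on success. */
static int fill_file(int fd, size_t len, char c)
{
	char *buf = malloc(len);
	ssize_t n;

	if (!buf)
		return -1;
	memset(buf, c, len);
	n = pwrite(fd, buf, len, 0);
	free(buf);
	return n == (ssize_t)len ? 0 : -1;
}

/* Returns 0 and fills *res on success, -1 if any system call fails. */
int private_map_demo(struct private_map_result *res)
{
	char path[] = "/tmp/map_private_demo_XXXXXX"; /* assumed scratch location */
	long psz = sysconf(_SC_PAGESIZE);
	volatile char *p;
	size_t len;
	int fd;

	assert(res != NULL);
	if (psz <= 0)
		return -1;
	len = 2 * (size_t)psz;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (fill_file(fd, len, 'A') != 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	(void)p[0];   /* page 0: read fault only, still backed by the file's page */
	p[psz] = 'X'; /* page 1: write fault, now a private anonymous copy */

	/* Change the underlying file behind the mapping's back. */
	if (fill_file(fd, len, 'B') != 0) {
		munmap((void *)p, len);
		close(fd);
		return -1;
	}

	res->unwritten_page_sees_file_change = (p[0] == 'B');
	res->written_page_keeps_private_copy = (p[psz] == 'X' && p[psz + 1] == 'A');

	munmap((void *)p, len);
	close(fd);
	return 0;
}
```

On a typical Linux local filesystem one would expect both flags to come back set: the page that was only read is still the shared page-cache page, while the page that was written has become a private copy. That private copy is the anonymous data a cluster-wide shootdown should leave alone.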
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 22:06 ` Stephen C. Tweedie
@ 2004-02-19 22:31 ` Daniel Phillips
2004-02-19 16:42 ` Paul E. McKenney
0 siblings, 1 reply; 68+ messages in thread
From: Daniel Phillips @ 2004-02-19 22:31 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Andrew Morton, Paul E. McKenney, Christoph Hellwig, linux-kernel, linux-mm

Hi Stephen,

On Thursday 19 February 2004 17:06, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, 2004-02-19 at 20:56, Daniel Phillips wrote:
> > OpenGFS and Sistina GFS use zap_page_range directly, essentially doing
> > the same as invalidate_mmap_range but skipping any vmas belonging to
> > MAP_PRIVATE mmaps.
>
> Well, MAP_PRIVATE maps can contain shared pages too --- any page in a
> MAP_PRIVATE map that has been mapped but not yet written to is still
> shared, and still needs shot down on truncate().

Exactly, and we ought to take this opportunity to do that properly, which is
easy. I'm just curious how GPFS deals with this issue, or if it simply
doesn't support MAP_PRIVATE.

Regards,

Daniel
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 22:31 ` Daniel Phillips
@ 2004-02-19 16:42 ` Paul E. McKenney
2004-02-20 2:06 ` Daniel Phillips
0 siblings, 1 reply; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-19 16:42 UTC (permalink / raw)
To: Daniel Phillips
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

On Thu, Feb 19, 2004 at 05:31:33PM -0500, Daniel Phillips wrote:
> Hi Stephen,
>
> On Thursday 19 February 2004 17:06, Stephen C. Tweedie wrote:
> > Hi,
> >
> > On Thu, 2004-02-19 at 20:56, Daniel Phillips wrote:
> > > OpenGFS and Sistina GFS use zap_page_range directly, essentially doing
> > > the same as invalidate_mmap_range but skipping any vmas belonging to
> > > MAP_PRIVATE mmaps.
> >
> > Well, MAP_PRIVATE maps can contain shared pages too --- any page in a
> > MAP_PRIVATE map that has been mapped but not yet written to is still
> > shared, and still needs shot down on truncate().
>
> Exactly, and we ought to take this opportunity to do that properly, which is
> easy. I'm just curious how GPFS deals with this issue, or if it simply
> doesn't support MAP_PRIVATE.

GPFS supports MAP_PRIVATE, but does not specify the behavior if you
change the underlying file. There are a number of things one can do,
but one must keep in mind that different processes can MAP_PRIVATE the
same file at different times, and that some processes might MAP_SHARED it
at the same time that others MAP_PRIVATE it. Here are the alternatives
I can imagine:

1. Any time a file changes, create a copy of the old version
for any MAP_PRIVATE vmas. This would essentially create
a point-in-time copy of any file that a process mapped
MAP_PRIVATE. This is arguably the most intuitive from the
user's standpoint, but (a) it would not be a small change and
(b) I haven't heard of anyone coming up with a good use for it.
Please enlighten me if I am missing a simple implementation or
compelling uses.

2. Modify invalidate_mmap_range() to leave MAP_PRIVATE vmas,
as suggested by Daniel. This would mean that a
process that had mapped a file MAP_PRIVATE and faulted
in parts of it would see different versions of the file
in different pages. This should be straightforward to
implement, but in what situation is this skewed view of
the file useful?

3. Modify invalidate_mmap_range() to leave MAP_PRIVATE vmas,
but invalidate those pages in the vma that have not yet been
modified (that are not anonymous) as suggested by Stephen.
This would mean that a process that had mapped a file MAP_PRIVATE
and written on parts of it would see different versions of the
file in different pages.

Again, in what situation is this skewed view of the file useful?

5. The current behavior, where the process's writes do not
flow through to the file, but all changes to the file are
visible to the writing process.

6. Requiring that MAP_PRIVATE be applied only to unchanging
files, so that (for example) any change to the underlying
file removes that file from any MAP_PRIVATE address spaces.
Subsequent accesses would get a SEGV, rather than a
surprise from silently changing data.

So, please help me out here... What do applications that MAP_PRIVATE
changing files really expect to happen?

Thanx, Paul
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 16:42 ` Paul E. McKenney
@ 2004-02-20 2:06 ` Daniel Phillips
2004-02-19 19:47 ` Paul E. McKenney
0 siblings, 1 reply; 68+ messages in thread
From: Daniel Phillips @ 2004-02-20 2:06 UTC (permalink / raw)
To: paulmck
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

On Thursday 19 February 2004 11:42, Paul E. McKenney wrote:
> GPFS supports MAP_PRIVATE, but does not specify the behavior if you
> change the underlying file. There are a number of things one can do,
> but one must keep in mind that different processes can MAP_PRIVATE the
> same file at different times, and that some processes might MAP_SHARED it
> at the same time that others MAP_PRIVATE it. Here are the alternatives
> I can imagine:
>
> 1. Any time a file changes, create a copy of the old version
> for any MAP_PRIVATE vmas. This would essentially create
> a point-in-time copy of any file that a process mapped
> MAP_PRIVATE. This is arguably the most intuitive from the
> user's standpoint, but (a) it would not be a small change and
> (b) I haven't heard of anyone coming up with a good use for it.
> Please enlighten me if I am missing a simple implementation or
> compelling uses.

This is MAP_COPY I think. Even if somebody did manage to sneak it by Linus
one day it would certainly not be under the guise of MAP_PRIVATE.

> 2. Modify invalidate_mmap_range() to leave MAP_PRIVATE vmas,
> as suggested by Daniel.

I did not suggest that, rather I described the existing practice in OpenGFS
and Sistina GFS, which at least does not destroy anonymous data. The correct
behaviour is the one you describe in option 3, and we are perfectly willing
to change GFS to obtain that behaviour. To be precise: I suggest we change
invalidate_mmap_range to skip anon pages, and change vmtruncate to use
something else, having the current semantics.

As a historical note: the behavior GFS obtains from option 2 is
Posix-compliant, but falls short of Linus-compliance, who insists on
completely accurate invalidation behavior as is right and proper.

> This would mean that a
> process that had mapped a file MAP_PRIVATE and faulted
> in parts of it would see different versions of the file
> in different pages. This should be straightforward to
> implement, but in what situation is this skewed view of
> the file useful?

You've got me there ;) However, Posix explicitly blesses this sloppy
behaviour. I suppose that with additional user space locking, applications
could make it work reliably. But it's still sloppy, and worse, it's
different from Linux's local filesystem behaviour.

> 3. Modify invalidate_mmap_range() to leave MAP_PRIVATE vmas,
> but invalidate those pages in the vma that have not yet been
> modified (that are not anonymous) as suggested by Stephen.
> This would mean that a process that had mapped a file MAP_PRIVATE
> and written on parts of it would see different versions of the
> file in different pages.

This is the correct behaviour and is the current behaviour for local
filesystems. In particular, all processes on all nodes will see the current
contents of any file page that they have not yet faulted in, as of the last
time any process wrote that file page via mmap or otherwise.

Our goal for GFS, and the goal I'd like to hold up as definitive for any
distributed filesystem, is to imitate local filesystem semantics exactly,
even across the cluster.

> Again, in what situation is this skewed view of the file useful?

It's not skewed in any way that I can see. Though I am no linker expert, I
dimly recall that these are precisely the semantics ld relies on.

> 5. The current behavior, where the process's writes do not
> flow through to the file, but all changes to the file are
> visible to the writing process.

We all agree that's broken, I hope.

> 6. Requiring that MAP_PRIVATE be applied only to unchanging
> files, so that (for example) any change to the underlying
> file removes that file from any MAP_PRIVATE address spaces.
> Subsequent accesses would get a SEGV, rather than a
> surprise from silently changing data.

Creative :) Well, data that changes "silently" is a fact of life whenever
data is shared. It's up to applications to ensure that shared data changes
predictably.

> So, please help me out here... What do applications that MAP_PRIVATE
> changing files really expect to happen?

Number 3, is that ok with you? Incidentally, your list doesn't include the
semantics we'd get by just exporting and using invalidate_mmap_range. I
presume that is because you agree it's not correct (it will clobber CoWed
anonymous pages).

Regards,

Daniel
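
The difference between truncate-style zapping and the "skip anon pages" behaviour of option 3 can be sketched with a toy userspace model. This is not kernel code: enum pte_state and invalidate_range() are hypothetical names, and each array slot merely stands in for one page-table entry of a MAP_PRIVATE mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page state of a MAP_PRIVATE mapping (toy model). */
enum pte_state {
	PTE_NONE, /* nothing mapped at this page */
	PTE_FILE, /* maps the shared file page (read, never written) */
	PTE_ANON  /* maps a private CoWed copy */
};

/*
 * Drop mappings for pages first..last (inclusive), clamped to n entries.
 * With zap_anon_too == 0, private CoWed copies survive (option 3).
 * With zap_anon_too != 0, everything in the range goes (truncate-like).
 * Returns the number of entries dropped.
 */
size_t invalidate_range(enum pte_state *ptes, size_t n,
			size_t first, size_t last, int zap_anon_too)
{
	size_t dropped = 0;
	size_t i;

	assert(ptes != NULL);
	for (i = first; i <= last && i < n; i++) {
		if (ptes[i] == PTE_NONE)
			continue;
		if (ptes[i] == PTE_ANON && !zap_anon_too)
			continue; /* keep the process's private data */
		ptes[i] = PTE_NONE;
		dropped++;
	}
	return dropped;
}
```

In this model a cluster filesystem revoking cached pages would call invalidate_range() with zap_anon_too set to 0, so file-backed pages are refaulted from the new contents while CoWed pages stay put. A truncate would pass 1, matching the existing semantics that vmtruncate needs.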
* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 2:06 ` Daniel Phillips
@ 2004-02-19 19:47 ` Paul E. McKenney
2004-02-20 5:07 ` Daniel Phillips
0 siblings, 1 reply; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-19 19:47 UTC (permalink / raw)
To: Daniel Phillips
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

On Thu, Feb 19, 2004 at 09:06:55PM -0500, Daniel Phillips wrote:
> On Thursday 19 February 2004 11:42, Paul E. McKenney wrote:
> > GPFS supports MAP_PRIVATE, but does not specify the behavior if you
> > change the underlying file. There are a number of things one can do,
> > but one must keep in mind that different processes can MAP_PRIVATE the
> > same file at different times, and that some processes might MAP_SHARED it
> > at the same time that others MAP_PRIVATE it. Here are the alternatives
> > I can imagine:
> >
> > 1. Any time a file changes, create a copy of the old version
> > for any MAP_PRIVATE vmas. This would essentially create
> > a point-in-time copy of any file that a process mapped
> > MAP_PRIVATE. This is arguably the most intuitive from the
> > user's standpoint, but (a) it would not be a small change and
> > (b) I haven't heard of anyone coming up with a good use for it.
> > Please enlighten me if I am missing a simple implementation or
> > compelling uses.
>
> This is MAP_COPY I think. Even if somebody did manage to sneak it by Linus
> one day it would certainly not be under the guise of MAP_PRIVATE.

Whew! That is a relief!!! ;-)

> > 2. Modify invalidate_mmap_range() to leave MAP_PRIVATE vmas,
> > as suggested by Daniel.
>
> I did not suggest that, rather I described the existing practice in OpenGFS
> and Sistina GFS, which at least does not destroy anonymous data. The correct
> behaviour is the one you describe in option 3, and we are perfectly willing
> to change GFS to obtain that behaviour. To be precise: I suggest we change
> invalidate_mmap_range to skip anon pages, and change vmtruncate to use
> something else, having the current semantics.
>
> As a historical note: the behavior GFS obtains from option 2 is
> Posix-compliant, but falls short of Linus-compliance, who insists on
> completely accurate invalidation behavior as is right and proper.

OK, this is the OpenGFS zap_inode_mapping(), right?

> > This would mean that a
> > process that had mapped a file MAP_PRIVATE and faulted
> > in parts of it would see different versions of the file
> > in different pages. This should be straightforward to
> > implement, but in what situation is this skewed view of
> > the file useful?
>
> You've got me there ;) However, Posix explicitly blesses this sloppy
> behaviour. I suppose that with additional user space locking, applications
> could make it work reliably. But it's still sloppy, and worse, it's
> different from Linux's local filesystem behaviour.

;-)

> > 3. Modify invalidate_mmap_range() to leave MAP_PRIVATE vmas,
> > but invalidate those pages in the vma that have not yet been
> > modified (that are not anonymous) as suggested by Stephen.
> > This would mean that a process that had mapped a file MAP_PRIVATE
> > and written on parts of it would see different versions of the
> > file in different pages.
>
> This is the correct behaviour and is the current behaviour for local
> filesystems. In particular, all processes on all nodes will see the current
> contents of any file page that they have not yet faulted in, as of the last
> time any process wrote that file page via mmap or otherwise.
>
> Our goal for GFS, and the goal I'd like to hold up as definitive for any
> distributed filesystem, is to imitate local filesystem semantics exactly,
> even across the cluster.

OK, I surrender. I got some private email agreeing with this
viewpoint. Any dissenters, speak soon, or...

> > Again, in what situation is this skewed view of the file useful?
>
> It's not skewed in any way that I can see. Though I am no linker expert, I
> dimly recall that these are precisely the semantics ld relies on.

I thought that the linker relied on people refraining (or being prevented)
from updating executables while they are in use. But I am also no linker
expert.

> > 5. The current behavior, where the process's writes do not
> > flow through to the file, but all changes to the file are
> > visible to the writing process.
>
> We all agree that's broken, I hope.

I can buy DFSes implementing semantics that are the same as local
filesystems. But no one has yet shown me anything that it breaks!

> > 6. Requiring that MAP_PRIVATE be applied only to unchanging
> > files, so that (for example) any change to the underlying
> > file removes that file from any MAP_PRIVATE address spaces.
> > Subsequent accesses would get a SEGV, rather than a
> > surprise from silently changing data.
>
> Creative :) Well, data that changes "silently" is a fact of life whenever
> data is shared. It's up to applications to ensure that shared data changes
> predictably.

Glad you liked it. ;-) I think that predictability when using
MAP_PRIVATE requires that one refrain from modifying the underlying
file while someone has it mmap()ed with MAP_PRIVATE. I would welcome
an example proving me wrong.

> > So, please help me out here... What do applications that MAP_PRIVATE
> > changing files really expect to happen?
>
> Number 3, is that ok with you? Incidentally, your list doesn't include the
> semantics we'd get by just exporting and using invalidate_mmap_range. I
> presume that is because you agree it's not correct (it will clobber CoWed
> anonymous pages).

I will give it a shot, though I would still like to hear about examples
where the difference in semantics affects a real application.

BTW, my list didn't include exporting and using the current
invalidate_mmap_range() because I didn't say what I meant to say.
Hate it when that happens! ;-)

Thanx, Paul
* Re: Non-GPL export of invalidate_mmap_range
2004-02-19 19:47 ` Paul E. McKenney
@ 2004-02-20 5:07 ` Daniel Phillips
2004-02-20 12:02 ` Paul E. McKenney
0 siblings, 1 reply; 68+ messages in thread
From: Daniel Phillips @ 2004-02-20 5:07 UTC (permalink / raw)
To: paulmck
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

On Thursday 19 February 2004 14:47, Paul E. McKenney wrote:
> OK, I surrender. I got some private email agreeing with this
> viewpoint. Any dissenters, speak soon, or...

An implementation is going to look something like the patch below.
Unfortunately I don't think there is a way around passing an extra parameter
all the way down the unmap call chain. Doubly unfortunately, this doesn't
give any benefit at all to anybody who doesn't use a clustered filesystem
(which is nearly everybody) while there is a marginal cost. Do you know a
better way? Anyway, this is the price of correct MAP_PRIVATE semantics for
clustered filesystems. At least I have quantified it so we can decide if it's
worth it. (My opinion: correctness is always worth it.)

Regards,

Daniel

--- 2.6.3.clean/include/linux/mm.h	2004-02-17 22:57:13.000000000 -0500
+++ 2.6.3/include/linux/mm.h	2004-02-19 23:18:08.000000000 -0500
@@ -434,9 +434,7 @@
 		unsigned long size);
 int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
-		unsigned long end_addr, unsigned long *nr_accounted);
-void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
-			unsigned long address, unsigned long size);
+		unsigned long end_addr, unsigned long *nr_accounted, int zap);
 void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
@@ -444,8 +442,7 @@
 			unsigned long size, pgprot_t prot);
 
 extern void invalidate_mmap_range(struct address_space *mapping,
-				  loff_t const holebegin,
-				  loff_t const holelen);
+		loff_t const holebegin, loff_t const holelen, int zap);
 extern int vmtruncate(struct inode * inode, loff_t offset);
 extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
 extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
--- 2.6.3.clean/mm/memory.c	2004-02-17 22:57:47.000000000 -0500
+++ 2.6.3/mm/memory.c	2004-02-19 23:48:23.000000000 -0500
@@ -386,7 +386,7 @@
 
 static void
 zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd,
-		unsigned long address, unsigned long size)
+		unsigned long address, unsigned long size, int zap)
 {
 	unsigned long offset;
 	pte_t *ptep;
@@ -414,7 +414,7 @@
 			tlb_remove_tlb_entry(tlb, ptep, address+offset);
 			if (pfn_valid(pfn)) {
 				struct page *page = pfn_to_page(pfn);
-				if (!PageReserved(page)) {
+				if (!PageReserved(page) && (zap || (page->mapping && !PageSwapCache(page)))) {
 					if (pte_dirty(pte))
 						set_page_dirty(page);
 					if (page->mapping && pte_young(pte) &&
@@ -436,7 +436,7 @@
 
 static void
 zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir,
-		unsigned long address, unsigned long size)
+		unsigned long address, unsigned long size, int zap)
 {
 	pmd_t * pmd;
 	unsigned long end;
@@ -453,14 +453,14 @@
 	if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
 		end = ((address + PGDIR_SIZE) & PGDIR_MASK);
 	do {
-		zap_pte_range(tlb, pmd, address, end - address);
-		address = (address + PMD_SIZE) & PMD_MASK;
+		zap_pte_range(tlb, pmd, address, end - address, zap);
+		address = (address + PMD_SIZE) & PMD_MASK;
 		pmd++;
 	} while (address < end);
 }
 
-void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
-			unsigned long address, unsigned long end)
+static void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+			unsigned long address, unsigned long end, int zap)
 {
 	pgd_t * dir;
 
@@ -474,7 +474,7 @@
 	dir = pgd_offset(vma->vm_mm, address);
 	tlb_start_vma(tlb, vma);
 	do {
-		zap_pmd_range(tlb, dir, address, end - address);
+		zap_pmd_range(tlb, dir, address, end - address, zap);
 		address = (address + PGDIR_SIZE) & PGDIR_MASK;
 		dir++;
 	} while (address && (address < end));
@@ -524,7 +524,7 @@
  */
 int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		unsigned long end_addr, unsigned long *nr_accounted)
+		unsigned long end_addr, unsigned long *nr_accounted, int zap)
 {
 	unsigned long zap_bytes = ZAP_BLOCK_SIZE;
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
@@ -568,7 +568,7 @@
 				tlb_start_valid = 1;
 			}
 
-			unmap_page_range(*tlbp, vma, start, start + block);
+			unmap_page_range(*tlbp, vma, start, start + block, zap);
 			start += block;
 			zap_bytes -= block;
 			if ((long)zap_bytes > 0)
@@ -594,8 +594,8 @@
  * @address: starting address of pages to zap
  * @size: number of bytes to zap
  */
-void zap_page_range(struct vm_area_struct *vma,
-			unsigned long address, unsigned long size)
+void invalidate_page_range(struct vm_area_struct *vma,
+			unsigned long address, unsigned long size, int zap)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather *tlb;
@@ -612,11 +612,17 @@
 	lru_add_drain();
 	spin_lock(&mm->page_table_lock);
 	tlb = tlb_gather_mmu(mm, 0);
-	unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted);
+	unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, zap);
 	tlb_finish_mmu(tlb, address, end);
 	spin_unlock(&mm->page_table_lock);
 }
 
+void zap_page_range(struct vm_area_struct *vma,
+			unsigned long address, unsigned long size)
+{
+	invalidate_page_range(vma, address, size, 1);
+}
+
 /*
  * Do a quick page-table lookup for a single page.
  * mm->page_table_lock must be held.
@@ -1095,9 +1101,9 @@
 			continue;	/* Mapping disjoint from hole. */
 		zba = (hba <= vba) ? vba : hba;
 		zea = (vea <= hea) ? vea : hea;
-		zap_page_range(vp,
+		invalidate_page_range(vp,
 			       ((zba - vba) << PAGE_SHIFT) + vp->vm_start,
-			       (zea - zba + 1) << PAGE_SHIFT);
+			       (zea - zba + 1) << PAGE_SHIFT, 1);
 	}
 }
 
@@ -1116,7 +1122,7 @@
  * end of the file.
  */
 void invalidate_mmap_range(struct address_space *mapping,
-		      loff_t const holebegin, loff_t const holelen)
+		      loff_t const holebegin, loff_t const holelen, int zap)
 {
 	unsigned long hba = holebegin >> PAGE_SHIFT;
 	unsigned long hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -1156,7 +1162,7 @@
 	if (inode->i_size < offset)
 		goto do_expand;
 	i_size_write(inode, offset);
-	invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0);
+	invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
 	truncate_inode_pages(mapping, offset);
 	goto out_truncate;
 
--- 2.6.3.clean/mm/mmap.c	2004-02-17 22:58:32.000000000 -0500
+++ 2.6.3/mm/mmap.c	2004-02-19 22:46:01.000000000 -0500
@@ -1134,7 +1134,7 @@
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
-	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted);
+	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, 1);
 	vm_unacct_memory(nr_accounted);
 
 	if (is_hugepage_only_range(start, end - start))
@@ -1436,7 +1436,7 @@
 	flush_cache_mm(mm);
 	/* Use ~0UL here to ensure all VMAs in the mm are unmapped */
 	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
-					~0UL, &nr_accounted);
+					~0UL, &nr_accounted, 1);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
 	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
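
The memory.c hunk around line 1095 touches the arithmetic that turns a hole in the file into an address range inside one vma. The sketch below replays that arithmetic in userspace so it can be checked by hand. The zba/zea lines and the final shift expressions are taken from the patch context; how vba, vea and hea are derived is not shown in the patch and is an assumption here, as are the 4 KiB page size, struct zap_span and hole_to_vma_span().

```c
#include <assert.h>
#include <limits.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical result type: the byte range to unmap inside one vma. */
struct zap_span {
	int overlaps;        /* 0 if the hole misses this vma entirely */
	unsigned long start; /* first virtual address to unmap */
	unsigned long len;   /* number of bytes to unmap */
};

/*
 * vm_start: virtual address where the vma begins
 * vm_pgoff: file page index mapped at vm_start
 * vm_pages: length of the vma in pages (must be > 0)
 * hba:      first file page of the hole
 * hlen:     hole length in pages, 0 meaning "to end of file"
 */
struct zap_span hole_to_vma_span(unsigned long vm_start, unsigned long vm_pgoff,
				 unsigned long vm_pages,
				 unsigned long hba, unsigned long hlen)
{
	struct zap_span span = { 0, 0, 0 };
	unsigned long vba, vea, hea, zba, zea;

	assert(vm_pages > 0);
	vba = vm_pgoff;                /* assumed: first file page in the vma */
	vea = vba + vm_pages - 1;      /* assumed: last file page in the vma */
	hea = hba + hlen - 1;          /* assumed: last file page of the hole */
	if (hea < hba)
		hea = ULONG_MAX;       /* wrapped: hole runs to end of file */

	if (hea < vba || vea < hba)
		return span;           /* mapping disjoint from hole */

	zba = (hba <= vba) ? vba : hba; /* as in the patch context */
	zea = (vea <= hea) ? vea : hea;

	span.overlaps = 1;
	span.start = ((zba - vba) << DEMO_PAGE_SHIFT) + vm_start;
	span.len = (zea - zba + 1) << DEMO_PAGE_SHIFT;
	return span;
}
```

For example, a vma at 0x10000 mapping file pages 4 through 11, hit by a hole covering pages 6 through 8, yields a span starting at 0x12000 of length 0x3000. The same vma with a hole starting at page 10 and length 0 is unmapped from 0x16000 to the end of the vma.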
* Re: Non-GPL export of invalidate_mmap_range 2004-02-20 5:07 ` Daniel Phillips @ 2004-02-20 12:02 ` Paul E. McKenney 2004-02-20 20:37 ` Daniel Phillips 0 siblings, 1 reply; 68+ messages in thread From: Paul E. McKenney @ 2004-02-20 12:02 UTC (permalink / raw) To: Daniel Phillips Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm On Fri, Feb 20, 2004 at 12:07:25AM -0500, Daniel Phillips wrote: > On Thursday 19 February 2004 14:47, Paul E. McKenney wrote: > > OK, I surrender. I got some private email agreeing with this > > viewpoint. Any dissenters, speak soon, or... > > An implementation is going to look something like the patch below. > Unfortunately I don't think there is a way around passing an extra parameter > all the way down the unmap call chain. Doubly unfortunately, this doesn't > give any benefit at all to anybody who doesn't use a clustered filesystem > (which is nearly everybody) while there is a marginal cost. Do you know a > better way? Anyway, this is the price of correct MAP_PRIVATE semantics for > clustered filesystems. At least I have quantified it so we can decide if it's > worth it. (My opinion: correctness is always worth it.) "My work is done!" ;-) Almost, anyway. A few comments interspersed. This would be in addition to invalidate_mmap_range-non-gpl-export.patch, right? I cannot think of any reasonable alternative to passing the parameter down either, as it certainly does not be reasonable to duplicate the code... 
Thanx, Paul

> Regards,
>
> Daniel
>
> --- 2.6.3.clean/include/linux/mm.h	2004-02-17 22:57:13.000000000 -0500
> +++ 2.6.3/include/linux/mm.h	2004-02-19 23:18:08.000000000 -0500
> @@ -434,9 +434,7 @@
>  			unsigned long size);
>  int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
>  		struct vm_area_struct *start_vma, unsigned long start_addr,
> -		unsigned long end_addr, unsigned long *nr_accounted);
> -void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> -			unsigned long address, unsigned long size);
> +		unsigned long end_addr, unsigned long *nr_accounted, int zap);

How about something like "private_too" instead of "zap"?

(Ah! unmap_page_range() converted to static, since it is used
only in memory.c.)

>  void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
>  int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			struct vm_area_struct *vma);
> @@ -444,8 +442,7 @@
>  			unsigned long size, pgprot_t prot);
>
>  extern void invalidate_mmap_range(struct address_space *mapping,
> -			      loff_t const holebegin,
> -			      loff_t const holelen);
> +	loff_t const holebegin, loff_t const holelen, int zap);
>  extern int vmtruncate(struct inode * inode, loff_t offset);
>  extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
>  extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
> --- 2.6.3.clean/mm/memory.c	2004-02-17 22:57:47.000000000 -0500
> +++ 2.6.3/mm/memory.c	2004-02-19 23:48:23.000000000 -0500
> @@ -386,7 +386,7 @@
>
>  static void
>  zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd,
> -		unsigned long address, unsigned long size)
> +		unsigned long address, unsigned long size, int zap)
>  {
>  	unsigned long offset;
>  	pte_t *ptep;
> @@ -414,7 +414,7 @@
>  			tlb_remove_tlb_entry(tlb, ptep, address+offset);
>  			if (pfn_valid(pfn)) {
>  				struct page *page = pfn_to_page(pfn);
> -				if (!PageReserved(page)) {
> +				if (!PageReserved(page) && (zap || (page->mapping && !PageSwapCache(page)))) {

Longish line...

>  					if (pte_dirty(pte))
>  						set_page_dirty(page);
>  					if (page->mapping && pte_young(pte) &&
> @@ -436,7 +436,7 @@
>
>  static void
>  zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir,
> -		unsigned long address, unsigned long size)
> +		unsigned long address, unsigned long size, int zap)
>  {
>  	pmd_t * pmd;
>  	unsigned long end;
> @@ -453,14 +453,14 @@
>  	if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
>  		end = ((address + PGDIR_SIZE) & PGDIR_MASK);
>  	do {
> -		zap_pte_range(tlb, pmd, address, end - address);
> -		address = (address + PMD_SIZE) & PMD_MASK;
> +		zap_pte_range(tlb, pmd, address, end - address, zap);
> +		address = (address + PMD_SIZE) & PMD_MASK;
>  		pmd++;
>  	} while (address < end);
>  }
>
> -void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> -			unsigned long address, unsigned long end)
> +static void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long address, unsigned long end, int zap)
>  {
>  	pgd_t * dir;
>
> @@ -474,7 +474,7 @@
>  	dir = pgd_offset(vma->vm_mm, address);
>  	tlb_start_vma(tlb, vma);
>  	do {
> -		zap_pmd_range(tlb, dir, address, end - address);
> +		zap_pmd_range(tlb, dir, address, end - address, zap);
>  		address = (address + PGDIR_SIZE) & PGDIR_MASK;
>  		dir++;
>  	} while (address && (address < end));
> @@ -524,7 +524,7 @@
>   */
>  int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
> -		unsigned long end_addr, unsigned long *nr_accounted)
> +		unsigned long end_addr, unsigned long *nr_accounted, int zap)
>  {
>  	unsigned long zap_bytes = ZAP_BLOCK_SIZE;
>  	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
> @@ -568,7 +568,7 @@
>  				tlb_start_valid = 1;
>  			}
>
> -			unmap_page_range(*tlbp, vma, start, start + block);
> +			unmap_page_range(*tlbp, vma, start, start + block, zap);
>  			start += block;
>  			zap_bytes -= block;
>  			if ((long)zap_bytes > 0)
> @@ -594,8 +594,8 @@
>   * @address: starting address of pages to zap
>   * @size: number of bytes to zap
>   */
> -void zap_page_range(struct vm_area_struct *vma,
> -			unsigned long address, unsigned long size)
> +void invalidate_page_range(struct vm_area_struct *vma,

Would it be useful for this to be inline? (Wouldn't seem so,
zapping mappings has enough overhead that an extra level of
function call should be deep down in the noise...)

> +		unsigned long address, unsigned long size, int zap)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_gather *tlb;
> @@ -612,11 +612,17 @@
>  	lru_add_drain();
>  	spin_lock(&mm->page_table_lock);
>  	tlb = tlb_gather_mmu(mm, 0);
> -	unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted);
> +	unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, zap);
>  	tlb_finish_mmu(tlb, address, end);
>  	spin_unlock(&mm->page_table_lock);
>  }
>
> +void zap_page_range(struct vm_area_struct *vma,
> +			unsigned long address, unsigned long size)
> +{
> +	invalidate_page_range(vma, address, size, 1);
> +}
> +
>  /*
>   * Do a quick page-table lookup for a single page.
>   * mm->page_table_lock must be held.
> @@ -1095,9 +1101,9 @@
>  			continue;	/* Mapping disjoint from hole. */
>  		zba = (hba <= vba) ? vba : hba;
>  		zea = (vea <= hea) ? vea : hea;
> -		zap_page_range(vp,
> +		invalidate_page_range(vp,
>  			       ((zba - vba) << PAGE_SHIFT) + vp->vm_start,
> -			       (zea - zba + 1) << PAGE_SHIFT);
> +			       (zea - zba + 1) << PAGE_SHIFT, 1);
>  	}
>  }
>
> @@ -1116,7 +1122,7 @@
>   * end of the file.
>   */
>  void invalidate_mmap_range(struct address_space *mapping,
> -		      loff_t const holebegin, loff_t const holelen)
> +		      loff_t const holebegin, loff_t const holelen, int zap)
>  {
>  	unsigned long hba = holebegin >> PAGE_SHIFT;
>  	unsigned long hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

Doesn't the new argument need to be passed down through
invalidate_mmap_range_list()?

> @@ -1156,7 +1162,7 @@
>  	if (inode->i_size < offset)
>  		goto do_expand;
>  	i_size_write(inode, offset);
> -	invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0);
> +	invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
>  	truncate_inode_pages(mapping, offset);
>  	goto out_truncate;
>
> --- 2.6.3.clean/mm/mmap.c	2004-02-17 22:58:32.000000000 -0500
> +++ 2.6.3/mm/mmap.c	2004-02-19 22:46:01.000000000 -0500
> @@ -1134,7 +1134,7 @@
>
>  	lru_add_drain();
>  	tlb = tlb_gather_mmu(mm, 0);
> -	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted);
> +	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, 1);
>  	vm_unacct_memory(nr_accounted);
>
>  	if (is_hugepage_only_range(start, end - start))
> @@ -1436,7 +1436,7 @@
>  	flush_cache_mm(mm);
>  	/* Use ~0UL here to ensure all VMAs in the mm are unmapped */
>  	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
> -					~0UL, &nr_accounted);
> +					~0UL, &nr_accounted, 1);
>  	vm_unacct_memory(nr_accounted);
>  	BUG_ON(mm->map_count);	/* This is just debugging */
>  	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);

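The hole-to-mapping arithmetic visible in the patch context (hba, hlen, zba, zea) can be tried out in user space. The following is only a sketch: struct toy_vma, struct zap_range and clip_hole() are hypothetical names, 4 KiB pages are assumed, and the way hea, vba and vea are derived is an assumption, since those lines are not part of the quoted hunks.

```c
#include <assert.h>
#include <limits.h>

#define PAGE_SHIFT 12UL			/* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical stand-in for the vm_area_struct fields used below. */
struct toy_vma {
	unsigned long vm_start;		/* first virtual address of the mapping */
	unsigned long vm_end;		/* one past the last virtual address */
	unsigned long vm_pgoff;		/* file offset of vm_start, in pages */
};

/* Result of clipping a file hole against one mapping. */
struct zap_range {
	int overlaps;			/* 0 if the mapping is disjoint from the hole */
	unsigned long address;		/* first virtual address to unmap */
	unsigned long size;		/* number of bytes to unmap */
};

static struct zap_range clip_hole(const struct toy_vma *vp,
				  unsigned long long holebegin,
				  unsigned long long holelen)
{
	struct zap_range r = { 0, 0, 0 };
	/* Same conversions as in the patch: byte offsets to page numbers. */
	unsigned long hba = (unsigned long)(holebegin >> PAGE_SHIFT);
	unsigned long hlen = (unsigned long)((holelen + PAGE_SIZE - 1) >> PAGE_SHIFT);
	unsigned long hea = hba + hlen - 1;	/* last page of hole (assumed) */
	unsigned long vba, vea, zba, zea;

	if (hea < hba)			/* hlen == 0: hole runs to end of file */
		hea = ULONG_MAX;

	vba = vp->vm_pgoff;		/* first file page mapped (assumed) */
	vea = vba + ((vp->vm_end - vp->vm_start) >> PAGE_SHIFT) - 1;
	if (hea < vba || vea < hba)
		return r;		/* mapping disjoint from hole */

	zba = (hba <= vba) ? vba : hba;	/* as in the patch context */
	zea = (vea <= hea) ? vea : hea;
	r.overlaps = 1;
	r.address = ((zba - vba) << PAGE_SHIFT) + vp->vm_start;
	r.size = (zea - zba + 1) << PAGE_SHIFT;
	return r;
}

/* Small self-check: a mapping of file pages 4..11 and a hole at pages 6..7. */
static void clip_hole_demo(void)
{
	struct toy_vma v = { 0x10000000UL, 0x10000000UL + 8 * PAGE_SIZE, 4 };
	struct zap_range r = clip_hole(&v, 6ULL * PAGE_SIZE, 2ULL * PAGE_SIZE);

	assert(r.overlaps);
	assert(r.address == 0x10002000UL);
	assert(r.size == 2 * PAGE_SIZE);
}
```

With these assumptions, a holelen of zero clips to the end of the mapping, which matches the "holelen of zero truncates to the end of the file" comment in the patch.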
* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 12:02 ` Paul E. McKenney
@ 2004-02-20 20:37 ` Daniel Phillips
2004-02-20 14:01 ` Paul E. McKenney
2004-02-20 21:17 ` Non-GPL export of invalidate_mmap_range Christoph Hellwig
0 siblings, 2 replies; 68+ messages in thread
From: Daniel Phillips @ 2004-02-20 20:37 UTC (permalink / raw)
To: paulmck
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

Hi Paul,

> I cannot think of any reasonable alternative to passing the parameter
> down either, as it certainly does not seem reasonable to duplicate the
> code...

Yes, it's simply the (small) price that has to be paid in order to be able to
boast about our accurate semantics.

> How about something like "private_too" instead of "zap"?

How about just "all", which is what we mean.

> > -void zap_page_range(struct vm_area_struct *vma,
> > -			unsigned long address, unsigned long size)
> > +void invalidate_page_range(struct vm_area_struct *vma,
>
> Would it be useful for this to be inline? (Wouldn't seem so,
> zapping mappings has enough overhead that an extra level of
> function call should be deep down in the noise...)

Yes, it doesn't seem worth it just to save a stack frame.

Actually, I erred there in that invalidate_mmap_range should not export the
flag, because it never makes sense to pass in non-zero from a DFS.

> Doesn't the new argument need to be passed down through
> invalidate_mmap_range_list()?

It does, thanks for the catch. Please bear with me for a moment while I
reroll this, then hopefully we can move on to the more interesting discussion
of whether it's worth it. (Yes it is :)

Regards,

Daniel

* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 20:37 ` Daniel Phillips
@ 2004-02-20 14:01 ` Paul E. McKenney
2004-02-20 23:00 ` Daniel Phillips
2004-02-20 21:17 ` Non-GPL export of invalidate_mmap_range Christoph Hellwig
1 sibling, 1 reply; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-20 14:01 UTC (permalink / raw)
To: Daniel Phillips
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

On Fri, Feb 20, 2004 at 03:37:26PM -0500, Daniel Phillips wrote:
> Hi Paul,
>
> > I cannot think of any reasonable alternative to passing the parameter
> > down either, as it certainly does not seem reasonable to duplicate the
> > code...
>
> Yes, it's simply the (small) price that has to be paid in order to be able to
> boast about our accurate semantics.

;-)

> > How about something like "private_too" instead of "zap"?
>
> How about just "all", which is what we mean.

Fair enough, certainly keeps a few more lines of code within 80 columns.

> > > -void zap_page_range(struct vm_area_struct *vma,
> > > -			unsigned long address, unsigned long size)
> > > +void invalidate_page_range(struct vm_area_struct *vma,
> >
> > Would it be useful for this to be inline? (Wouldn't seem so,
> > zapping mappings has enough overhead that an extra level of
> > function call should be deep down in the noise...)
>
> Yes, it doesn't seem worth it just to save a stack frame.
>
> Actually, I erred there in that invalidate_mmap_range should not export the
> flag, because it never makes sense to pass in non-zero from a DFS.

Doesn't vmtruncate() want to pass non-zero "all" in to
invalidate_mmap_range() in order to maintain compatibility with existing
Linux semantics?

> > Doesn't the new argument need to be passed down through
> > invalidate_mmap_range_list()?
>
> It does, thanks for the catch. Please bear with me for a moment while I
> reroll this, then hopefully we can move on to the more interesting discussion
> of whether it's worth it. (Yes it is :)

;-)

Thanx, Paul

* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 14:01 ` Paul E. McKenney
@ 2004-02-20 23:00 ` Daniel Phillips
2004-02-20 16:17 ` Paul E. McKenney
0 siblings, 1 reply; 68+ messages in thread
From: Daniel Phillips @ 2004-02-20 23:00 UTC (permalink / raw)
To: paulmck
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

On Friday 20 February 2004 09:01, Paul E. McKenney wrote:
> On Fri, Feb 20, 2004 at 03:37:26PM -0500, Daniel Phillips wrote:
> > Actually, I erred there in that invalidate_mmap_range should not export
> > the flag, because it never makes sense to pass in non-zero from a DFS.
>
> Doesn't vmtruncate() want to pass non-zero "all" in to
> invalidate_mmap_range() in order to maintain compatibility with existing
> Linux semantics?

That comes from inside. The DFS's truncate interface should just be
vmtruncate. If I missed something, please shout.

Regards,

Daniel

* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 23:00 ` Daniel Phillips
@ 2004-02-20 16:17 ` Paul E. McKenney
2004-02-21 3:19 ` Daniel Phillips
2004-02-21 19:00 ` Daniel Phillips
0 siblings, 2 replies; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-20 16:17 UTC (permalink / raw)
To: Daniel Phillips
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

On Fri, Feb 20, 2004 at 06:00:32PM -0500, Daniel Phillips wrote:
> On Friday 20 February 2004 09:01, Paul E. McKenney wrote:
> > On Fri, Feb 20, 2004 at 03:37:26PM -0500, Daniel Phillips wrote:
> > > Actually, I erred there in that invalidate_mmap_range should not export
> > > the flag, because it never makes sense to pass in non-zero from a DFS.
> >
> > Doesn't vmtruncate() want to pass non-zero "all" in to
> > invalidate_mmap_range() in order to maintain compatibility with existing
> > Linux semantics?
>
> That comes from inside. The DFS's truncate interface should just be
> vmtruncate. If I missed something, please shout.

Agreed, the DFS's truncate interface should be vmtruncate().

Your earlier patch has a call to invalidate_mmap_range() within
vmtruncate(), which passes "1" to the last arg, so as to get
rid of all mappings to the truncated portion of the file.
So either invalidate_mmap_range() needs to keep the fourth arg
or needs to be a wrapper for an underlying function that
vmtruncate() can call, or some such.

The latter may be what you intended to do.

Thanx, Paul

* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 16:17 ` Paul E. McKenney
@ 2004-02-21 3:19 ` Daniel Phillips
2004-02-21 19:00 ` Daniel Phillips
1 sibling, 0 replies; 68+ messages in thread
From: Daniel Phillips @ 2004-02-21 3:19 UTC (permalink / raw)
To: paulmck
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

On Friday 20 February 2004 11:17, Paul E. McKenney wrote:
> Your earlier patch has a call to invalidate_mmap_range() within
> vmtruncate(), which passes "1" to the last arg, so as to get
> rid of all mappings to the truncated portion of the file.
> So either invalidate_mmap_range() needs to keep the fourth arg
> or needs to be a wrapper for an underlying function that
> vmtruncate() can call, or some such.
>
> The latter may be what you intended to do.

Yes, modulo nobody coming up with a legitimate use for the fourth argument.

Regards,

Daniel

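The arrangement agreed on in this exchange (an internal function that keeps the fourth argument, a three-argument entry point for the DFS, and vmtruncate() calling the internal function directly) might be sketched in user space as follows. All toy_ names and the call-recording struct are hypothetical, and 4 KiB pages are assumed; only the argument values (0 from the DFS entry point, 1 and a zero length from the truncate path) follow the discussion.

```c
#include <assert.h>

#define PAGE_SIZE 4096ULL		/* assumption: 4 KiB pages */

/* Records the most recent call so the sketch can be checked without a kernel. */
struct toy_call {
	unsigned long long start;
	unsigned long long length;
	int all;
};
static struct toy_call last_call;

/* Internal four-argument function: static, so only this file can pass "all". */
static void toy_invalidate_mmap_range(unsigned long long start,
				      unsigned long long length, int all)
{
	last_call.start = start;
	last_call.length = length;
	last_call.all = all;
}

/* Entry point a distributed filesystem would use: file-backed pages only. */
void toy_invalidate_filemap_range(unsigned long long start,
				  unsigned long long length)
{
	toy_invalidate_mmap_range(start, length, 0);
}

/* Truncate path: everything past the new size goes, private pages included. */
void toy_vmtruncate(unsigned long long offset)
{
	toy_invalidate_mmap_range(offset + PAGE_SIZE - 1, 0, 1);
}

/* Small self-check of which flag each entry point passes down. */
static void wrapper_demo(void)
{
	toy_invalidate_filemap_range(8192, 4096);
	assert(last_call.all == 0);

	toy_vmtruncate(5000);
	assert(last_call.all == 1 && last_call.length == 0);
}
```

The point of the split is that the flag never appears in the interface seen by modules, so a caller outside the file cannot ask for private pages to be dropped.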
* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 16:17 ` Paul E. McKenney
2004-02-21 3:19 ` Daniel Phillips
@ 2004-02-21 19:00 ` Daniel Phillips
2004-02-22 23:39 ` Paul E. McKenney
1 sibling, 1 reply; 68+ messages in thread
From: Daniel Phillips @ 2004-02-21 19:00 UTC (permalink / raw)
To: paulmck
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

Hi Paul et al,

Here is an updated patch. The name of the exported function is changed to
"invalidate_filemap_range" to reflect the fact that only file-backed pages are
invalidated, and to distinguish the three parameter flavour from the four
parameter version called from vmtruncate. The inner loop in zap_pte_range is
hopefully correct now.

While I'm in here, why is the assignment "pte =" at line 411 of memory.c not
redundant?

   http://lxr.linux.no/source/mm/memory.c?v=2.6.1#L411

As far as I can see, the ->filemap spinlock protects the pte from modification
and pte was already assigned at line 405.

Anyway, we can now see that the full cost of this DFS-specific feature in the inner
loop is a single (unlikely) branch.

I'll repeat my proposition here: providing local filesystem semantics for
MAP_PRIVATE on any distributed filesystem requires these decorations on the
unmap path. Though there is no benefit for local filesystems, the cost is
insignificant.

Regards,

Daniel

--- 2.6.3.clean/include/linux/mm.h	2004-02-17 22:57:13.000000000 -0500
+++ 2.6.3/include/linux/mm.h	2004-02-21 12:59:16.000000000 -0500
@@ -430,23 +430,23 @@
 void shmem_lock(struct file * file, int lock);
 int shmem_zero_setup(struct vm_area_struct *);
 
-void zap_page_range(struct vm_area_struct *vma, unsigned long address,
-			unsigned long size);
 int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
-		unsigned long end_addr, unsigned long *nr_accounted);
-void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
-			unsigned long address, unsigned long size);
+		unsigned long end_addr, unsigned long *nr_accounted, int zap);
 void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
 			unsigned long size, pgprot_t prot);
-
-extern void invalidate_mmap_range(struct address_space *mapping,
-			      loff_t const holebegin,
-			      loff_t const holelen);
+extern void invalidate_filemap_range(struct address_space *mapping, loff_t const start, loff_t const length);
 extern int vmtruncate(struct inode * inode, loff_t offset);
+void invalidate_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, int all);
+
+static inline void zap_page_range(struct vm_area_struct *vma, ulong address, ulong size)
+{
+	invalidate_page_range(vma, address, size, 1);
+}
+
 extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
 extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
 extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
--- 2.6.3.clean/mm/memory.c	2004-02-17 22:57:47.000000000 -0500
+++ 2.6.3/mm/memory.c	2004-02-21 13:23:36.000000000 -0500
@@ -384,9 +384,13 @@
 	return -ENOMEM;
 }
 
-static void
-zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd,
-		unsigned long address, unsigned long size)
+static inline int is_anon(struct page *page)
+{
+	return !page->mapping || PageSwapCache(page);
+}
+
+static void zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd,
+		unsigned long address, unsigned long size, int all)
 {
 	unsigned long offset;
 	pte_t *ptep;
@@ -409,7 +413,8 @@
 			continue;
 		if (pte_present(pte)) {
 			unsigned long pfn = pte_pfn(pte);
-
+			if (unlikely(!all) && is_anon(pfn_to_page(pfn)))
+				continue;
 			pte = ptep_get_and_clear(ptep);
 			tlb_remove_tlb_entry(tlb, ptep, address+offset);
 			if (pfn_valid(pfn)) {
@@ -426,7 +431,7 @@
 				}
 			}
 		} else {
-			if (!pte_file(pte))
+			if (!pte_file(pte) && all)
 				free_swap_and_cache(pte_to_swp_entry(pte));
 			pte_clear(ptep);
 		}
@@ -434,9 +439,8 @@
 	pte_unmap(ptep-1);
 }
 
-static void
-zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir,
-		unsigned long address, unsigned long size)
+static void zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir,
+		unsigned long address, unsigned long size, int all)
 {
 	pmd_t * pmd;
 	unsigned long end;
@@ -453,14 +457,14 @@
 	if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
 		end = ((address + PGDIR_SIZE) & PGDIR_MASK);
 	do {
-		zap_pte_range(tlb, pmd, address, end - address);
-		address = (address + PMD_SIZE) & PMD_MASK;
+		zap_pte_range(tlb, pmd, address, end - address, all);
+		address = (address + PMD_SIZE) & PMD_MASK;
 		pmd++;
 	} while (address < end);
 }
 
-void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
-			unsigned long address, unsigned long end)
+static void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, int all)
 {
 	pgd_t * dir;
 
@@ -474,7 +478,7 @@
 	dir = pgd_offset(vma->vm_mm, address);
 	tlb_start_vma(tlb, vma);
 	do {
-		zap_pmd_range(tlb, dir, address, end - address);
+		zap_pmd_range(tlb, dir, address, end - address, all);
 		address = (address + PGDIR_SIZE) & PGDIR_MASK;
 		dir++;
 	} while (address && (address < end));
@@ -524,7 +528,7 @@
  */
 int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		unsigned long end_addr, unsigned long *nr_accounted)
+		unsigned long end_addr, unsigned long *nr_accounted, int all)
 {
 	unsigned long zap_bytes = ZAP_BLOCK_SIZE;
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
@@ -568,7 +572,7 @@
 				tlb_start_valid = 1;
 			}
 
-			unmap_page_range(*tlbp, vma, start, start + block);
+			unmap_page_range(*tlbp, vma, start, start + block, all);
 			start += block;
 			zap_bytes -= block;
 			if ((long)zap_bytes > 0)
@@ -594,8 +598,8 @@
  * @address: starting address of pages to zap
  * @size: number of bytes to zap
  */
-void zap_page_range(struct vm_area_struct *vma,
-			unsigned long address, unsigned long size)
+void invalidate_page_range(struct vm_area_struct *vma,
+		unsigned long address, unsigned long size, int all)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather *tlb;
@@ -612,7 +616,7 @@
 	lru_add_drain();
 	spin_lock(&mm->page_table_lock);
 	tlb = tlb_gather_mmu(mm, 0);
-	unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted);
+	unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, all);
 	tlb_finish_mmu(tlb, address, end);
 	spin_unlock(&mm->page_table_lock);
 }
@@ -1071,10 +1075,8 @@
  * Both hba and hlen are page numbers in PAGE_SIZE units.
  * An hlen of zero blows away the entire portion file after hba.
  */
-static void
-invalidate_mmap_range_list(struct list_head *head,
-			   unsigned long const hba,
-			   unsigned long const hlen)
+static void invalidate_mmap_range_list(struct list_head *head,
+	unsigned long const hba, unsigned long const hlen, int all)
 {
 	struct list_head *curr;
 	unsigned long hea;	/* last page of hole. */
@@ -1095,9 +1097,9 @@
 			continue;	/* Mapping disjoint from hole. */
 		zba = (hba <= vba) ? vba : hba;
 		zea = (vea <= hea) ? vea : hea;
-		zap_page_range(vp,
+		invalidate_page_range(vp,
 			       ((zba - vba) << PAGE_SHIFT) + vp->vm_start,
-			       (zea - zba + 1) << PAGE_SHIFT);
+			       (zea - zba + 1) << PAGE_SHIFT, all);
 	}
 }
 
@@ -1115,8 +1117,8 @@
  * up to a PAGE_SIZE boundary. A holelen of zero truncates to the
  * end of the file.
  */
-void invalidate_mmap_range(struct address_space *mapping,
-		      loff_t const holebegin, loff_t const holelen)
+static void invalidate_mmap_range(struct address_space *mapping,
+		      loff_t const holebegin, loff_t const holelen, int all)
 {
 	unsigned long hba = holebegin >> PAGE_SHIFT;
 	unsigned long hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -1133,12 +1135,19 @@
 	/* Protect against page fault */
 	atomic_inc(&mapping->truncate_count);
 	if (unlikely(!list_empty(&mapping->i_mmap)))
-		invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen);
+		invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen, all);
 	if (unlikely(!list_empty(&mapping->i_mmap_shared)))
-		invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen);
+		invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen, all);
 	up(&mapping->i_shared_sem);
 }
-EXPORT_SYMBOL_GPL(invalidate_mmap_range);
+
+void invalidate_filemap_range(struct address_space *mapping,
+			loff_t const start, loff_t const length)
+{
+	invalidate_mmap_range(mapping, start, length, 0);
+}
+
+EXPORT_SYMBOL_GPL(invalidate_filemap_range);
 
 /*
  * Handle all mappings that got truncated by a "truncate()"
@@ -1156,7 +1165,7 @@
 	if (inode->i_size < offset)
 		goto do_expand;
 	i_size_write(inode, offset);
-	invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0);
+	invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
 	truncate_inode_pages(mapping, offset);
 	goto out_truncate;
 
--- 2.6.3.clean/mm/mmap.c	2004-02-17 22:58:32.000000000 -0500
+++ 2.6.3/mm/mmap.c	2004-02-19 22:46:01.000000000 -0500
@@ -1134,7 +1134,7 @@
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
-	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted);
+	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, 1);
 	vm_unacct_memory(nr_accounted);
 
 	if (is_hugepage_only_range(start, end - start))
@@ -1436,7 +1436,7 @@
 	flush_cache_mm(mm);
 	/* Use ~0UL here to ensure all VMAs in the mm are unmapped */
 	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
-					~0UL, &nr_accounted);
+					~0UL, &nr_accounted, 1);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
 	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);

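The effect of the "all" flag on present PTEs in zap_pte_range can be modelled outside the kernel. This is a simplified sketch, not the kernel code: struct toy_page, struct toy_pte and zap_ptes() are hypothetical, the is_anon() test mirrors the helper added by the patch, and the non-present (swap entry) branch of the real loop is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and a page table entry. */
struct toy_page {
	int has_mapping;	/* models page->mapping != NULL */
	int swap_cache;		/* models PageSwapCache(page) */
};

struct toy_pte {
	int present;
	struct toy_page *page;
};

/* Same test as the is_anon() helper in the patch. */
static int is_anon(const struct toy_page *page)
{
	return !page->has_mapping || page->swap_cache;
}

/*
 * Clear present entries.  With all == 0, anonymous pages (for example
 * private copies made by copy-on-write) are skipped, so only file-backed
 * pages are unmapped.  Returns the number of entries cleared.
 */
static size_t zap_ptes(struct toy_pte *ptes, size_t n, int all)
{
	size_t i, zapped = 0;

	for (i = 0; i < n; i++) {
		if (!ptes[i].present)
			continue;
		if (!all && is_anon(ptes[i].page))
			continue;	/* the one extra branch in the inner loop */
		ptes[i].present = 0;
		ptes[i].page = NULL;
		zapped++;
	}
	return zapped;
}

/* Small self-check: one file-backed page, one anonymous page. */
static void zap_demo(void)
{
	struct toy_page file_page = { 1, 0 };
	struct toy_page anon_page = { 0, 0 };
	struct toy_pte ptes[2] = { { 1, &file_page }, { 1, &anon_page } };

	assert(zap_ptes(ptes, 2, 0) == 1);	/* file-backed page only */
	assert(ptes[1].present);		/* private page survives */
	assert(zap_ptes(ptes, 2, 1) == 1);	/* truncate-style: the rest */
}
```

In this model a DFS-driven invalidation (all == 0) leaves a process's private copy in place, while the truncate and munmap paths (all == 1) behave as before.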
* Re: Non-GPL export of invalidate_mmap_range
2004-02-21 19:00 ` Daniel Phillips
@ 2004-02-22 23:39 ` Paul E. McKenney
2004-02-25 21:04 ` [RFC] Distributed mmap API Daniel Phillips
0 siblings, 1 reply; 68+ messages in thread
From: Paul E. McKenney @ 2004-02-22 23:39 UTC (permalink / raw)
To: Daniel Phillips
Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm

Hello, Dan,

How about the following?

	EXPORT_SYMBOL(invalidate_filemap_range);

Thanx, Paul

On Sat, Feb 21, 2004 at 02:00:16PM -0500, Daniel Phillips wrote:
> Hi Paul et al,
>
> Here is an updated patch. The name of the exported function is changed to
> "invalidate_filemap_range" to reflect the fact that only file-backed pages are
> invalidated, and to distinguish the three parameter flavour from the four
> parameter version called from vmtruncate. The inner loop in zap_pte_range is
> hopefully correct now.
>
> While I'm in here, why is the assignment "pte =" at line 411 of memory.c not
> redundant?
>
>    http://lxr.linux.no/source/mm/memory.c?v=2.6.1#L411
>
> As far as I can see, the ->filemap spinlock protects the pte from modification
> and pte was already assigned at line 405.
>
> Anyway, we can now see that the full cost of this DFS-specific feature in the inner
> loop is a single (unlikely) branch.
>
> I'll repeat my proposition here: providing local filesystem semantics for
> MAP_PRIVATE on any distributed filesystem requires these decorations on the
> unmap path. Though there is no benefit for local filesystems, the cost is
> insignificant.
> > Regards, > > Daniel > > --- 2.6.3.clean/include/linux/mm.h 2004-02-17 22:57:13.000000000 -0500 > +++ 2.6.3/include/linux/mm.h 2004-02-21 12:59:16.000000000 -0500 > @@ -430,23 +430,23 @@ > void shmem_lock(struct file * file, int lock); > int shmem_zero_setup(struct vm_area_struct *); > > -void zap_page_range(struct vm_area_struct *vma, unsigned long address, > - unsigned long size); > int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm, > struct vm_area_struct *start_vma, unsigned long start_addr, > - unsigned long end_addr, unsigned long *nr_accounted); > -void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, > - unsigned long address, unsigned long size); > + unsigned long end_addr, unsigned long *nr_accounted, int zap); > void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr); > int copy_page_range(struct mm_struct *dst, struct mm_struct *src, > struct vm_area_struct *vma); > int zeromap_page_range(struct vm_area_struct *vma, unsigned long from, > unsigned long size, pgprot_t prot); > - > -extern void invalidate_mmap_range(struct address_space *mapping, > - loff_t const holebegin, > - loff_t const holelen); > +extern void invalidate_filemap_range(struct address_space *mapping, loff_t const start, loff_t const length); > extern int vmtruncate(struct inode * inode, loff_t offset); > +void invalidate_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, int all); > + > +static inline void zap_page_range(struct vm_area_struct *vma, ulong address, ulong size) > +{ > + invalidate_page_range(vma, address, size, 1); > +} > + > extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)); > extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); > extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); > --- 2.6.3.clean/mm/memory.c 2004-02-17 22:57:47.000000000 -0500 > +++ 
2.6.3/mm/memory.c 2004-02-21 13:23:36.000000000 -0500 > @@ -384,9 +384,13 @@ > return -ENOMEM; > } > > -static void > -zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd, > - unsigned long address, unsigned long size) > +static inline int is_anon(struct page *page) > +{ > + return !page->mapping || PageSwapCache(page); > +} > + > +static void zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd, > + unsigned long address, unsigned long size, int all) > { > unsigned long offset; > pte_t *ptep; > @@ -409,7 +413,8 @@ > continue; > if (pte_present(pte)) { > unsigned long pfn = pte_pfn(pte); > - > + if (unlikely(!all) && is_anon(pfn_to_page(pfn))) > + continue; > pte = ptep_get_and_clear(ptep); > tlb_remove_tlb_entry(tlb, ptep, address+offset); > if (pfn_valid(pfn)) { > @@ -426,7 +431,7 @@ > } > } > } else { > - if (!pte_file(pte)) > + if (!pte_file(pte) && all) > free_swap_and_cache(pte_to_swp_entry(pte)); > pte_clear(ptep); > } > @@ -434,9 +439,8 @@ > pte_unmap(ptep-1); > } > > -static void > -zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir, > - unsigned long address, unsigned long size) > +static void zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir, > + unsigned long address, unsigned long size, int all) > { > pmd_t * pmd; > unsigned long end; > @@ -453,14 +457,14 @@ > if (end > ((address + PGDIR_SIZE) & PGDIR_MASK)) > end = ((address + PGDIR_SIZE) & PGDIR_MASK); > do { > - zap_pte_range(tlb, pmd, address, end - address); > - address = (address + PMD_SIZE) & PMD_MASK; > + zap_pte_range(tlb, pmd, address, end - address, all); > + address = (address + PMD_SIZE) & PMD_MASK; > pmd++; > } while (address < end); > } > > -void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, > - unsigned long address, unsigned long end) > +static void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, > + unsigned long address, unsigned long end, int all) > { > pgd_t * dir; > > @@ -474,7 +478,7 @@ > dir = pgd_offset(vma->vm_mm, address); > 
tlb_start_vma(tlb, vma); > do { > - zap_pmd_range(tlb, dir, address, end - address); > + zap_pmd_range(tlb, dir, address, end - address, all); > address = (address + PGDIR_SIZE) & PGDIR_MASK; > dir++; > } while (address && (address < end)); > @@ -524,7 +528,7 @@ > */ > int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm, > struct vm_area_struct *vma, unsigned long start_addr, > - unsigned long end_addr, unsigned long *nr_accounted) > + unsigned long end_addr, unsigned long *nr_accounted, int all) > { > unsigned long zap_bytes = ZAP_BLOCK_SIZE; > unsigned long tlb_start = 0; /* For tlb_finish_mmu */ > @@ -568,7 +572,7 @@ > tlb_start_valid = 1; > } > > - unmap_page_range(*tlbp, vma, start, start + block); > + unmap_page_range(*tlbp, vma, start, start + block, all); > start += block; > zap_bytes -= block; > if ((long)zap_bytes > 0) > @@ -594,8 +598,8 @@ > * @address: starting address of pages to zap > * @size: number of bytes to zap > */ > -void zap_page_range(struct vm_area_struct *vma, > - unsigned long address, unsigned long size) > +void invalidate_page_range(struct vm_area_struct *vma, > + unsigned long address, unsigned long size, int all) > { > struct mm_struct *mm = vma->vm_mm; > struct mmu_gather *tlb; > @@ -612,7 +616,7 @@ > lru_add_drain(); > spin_lock(&mm->page_table_lock); > tlb = tlb_gather_mmu(mm, 0); > - unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted); > + unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, all); > tlb_finish_mmu(tlb, address, end); > spin_unlock(&mm->page_table_lock); > } > @@ -1071,10 +1075,8 @@ > * Both hba and hlen are page numbers in PAGE_SIZE units. > * An hlen of zero blows away the entire portion file after hba. 
> */ > -static void > -invalidate_mmap_range_list(struct list_head *head, > - unsigned long const hba, > - unsigned long const hlen) > +static void invalidate_mmap_range_list(struct list_head *head, > + unsigned long const hba, unsigned long const hlen, int all) > { > struct list_head *curr; > unsigned long hea; /* last page of hole. */ > @@ -1095,9 +1097,9 @@ > continue; /* Mapping disjoint from hole. */ > zba = (hba <= vba) ? vba : hba; > zea = (vea <= hea) ? vea : hea; > - zap_page_range(vp, > + invalidate_page_range(vp, > ((zba - vba) << PAGE_SHIFT) + vp->vm_start, > - (zea - zba + 1) << PAGE_SHIFT); > + (zea - zba + 1) << PAGE_SHIFT, all); > } > } > > @@ -1115,8 +1117,8 @@ > * up to a PAGE_SIZE boundary. A holelen of zero truncates to the > * end of the file. > */ > -void invalidate_mmap_range(struct address_space *mapping, > - loff_t const holebegin, loff_t const holelen) > +static void invalidate_mmap_range(struct address_space *mapping, > + loff_t const holebegin, loff_t const holelen, int all) > { > unsigned long hba = holebegin >> PAGE_SHIFT; > unsigned long hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT; > @@ -1133,12 +1135,19 @@ > /* Protect against page fault */ > atomic_inc(&mapping->truncate_count); > if (unlikely(!list_empty(&mapping->i_mmap))) > - invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen); > + invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen, all); > if (unlikely(!list_empty(&mapping->i_mmap_shared))) > - invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen); > + invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen, all); > up(&mapping->i_shared_sem); > } > -EXPORT_SYMBOL_GPL(invalidate_mmap_range); > + > + void invalidate_filemap_range(struct address_space *mapping, > + loff_t const start, loff_t const length) > +{ > + invalidate_mmap_range(mapping, start, length, 0); > +} > + > +EXPORT_SYMBOL_GPL(invalidate_filemap_range); > > /* > * Handle all mappings that got truncated by a "truncate()" > @@ -1156,7 
+1165,7 @@ > if (inode->i_size < offset) > goto do_expand; > i_size_write(inode, offset); > - invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0); > + invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0, 1); > truncate_inode_pages(mapping, offset); > goto out_truncate; > > --- 2.6.3.clean/mm/mmap.c 2004-02-17 22:58:32.000000000 -0500 > +++ 2.6.3/mm/mmap.c 2004-02-19 22:46:01.000000000 -0500 > @@ -1134,7 +1134,7 @@ > > lru_add_drain(); > tlb = tlb_gather_mmu(mm, 0); > - unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted); > + unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, 1); > vm_unacct_memory(nr_accounted); > > if (is_hugepage_only_range(start, end - start)) > @@ -1436,7 +1436,7 @@ > flush_cache_mm(mm); > /* Use ~0UL here to ensure all VMAs in the mm are unmapped */ > mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0, > - ~0UL, &nr_accounted); > + ~0UL, &nr_accounted, 1); > vm_unacct_memory(nr_accounted); > BUG_ON(mm->map_count); /* This is just debugging */ > clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD); > >
* [RFC] Distributed mmap API 2004-02-22 23:39 ` Paul E. McKenney @ 2004-02-25 21:04 ` Daniel Phillips 2004-02-25 19:12 ` Paul E. McKenney ` (2 more replies) 0 siblings, 3 replies; 68+ messages in thread From: Daniel Phillips @ 2004-02-25 21:04 UTC (permalink / raw) To: paulmck Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm This is the function formerly known as invalidate_mmap_range, with the addition of a new code path in the zap_ call chain to handle MAP_PRIVATE properly. This function by itself is enough to support a crude but useful form of distributed mmap where a shared file is cached only on one cluster node at a time. To use this, the distributed filesystem has to hook do_no_page to intercept page faults and carry out the needed global locking. The locking itself does not require any new kernel hooks. In brief, the patch here and another patch to be presented for the do_no_page hook, together provide the core kernel API for a simplified, distributed mmap. (Note that there may be a workaround for the lack of a do_no_page hook, but certainly not as simple and robust.) To put this in perspective, I'll mention the two big limitations of the simplified API: 1) Invalidation is always a whole file at a time 2) Multiple readers may not cache the same data simultaneously To handle sub-file cache granularity, we also need to be able to flush dirty data and evict cache pages with sub-file granularity, giving a trio of cache management functions: unmap_mapping_range(mapping, start, length) /* this patch */ write_mapping_range(mapping, start, length) /* start IO for dirty cache */ evict_mapping_range(mapping, start, length) /* wait on IO and evict cache */ To handle (2) above, the distributed filesystem will need to hook and modify the behaviour of do_wp_page so that it can intercept memory writes to shared cache pages. 
To summarize the current proposal, and where we need to go in the future: Simple core kernel API for simplistic distributed memory map ------------------------------------------------------------ - unmap_mapping_range export (this patch) - do_no_page hook Improved core kernel API for optimal distributed memory map ----------------------------------------------------------- - unmap_mapping_range export (this patch) - write_mapping_range export - evict_mapping_range export - do_no_page hook - do_wp_page hook There's no big rush to move on to the optimal version just now, since the simplistic version is already a big step forward. I'd like to take this opportunity to apologize to Paul for derailing his more modest proposal, but unfortunately, the semantics that could be obtained that way are fatally flawed: private mmaps just won't work. What I've written here is about the minimum that supports acceptable mmap semantics. And finally, the EXPORT_SYMBOL_GPL issue: after much fretting I've changed it to just EXPORT_SYMBOL in this patch, because I feel that we have better ways to further our goals of free and open software than to try to use this particular API as a battering ram. Of course it's not my decision, I just want to register my vote here. 
Regards, Daniel --- 2.6.3.clean/include/linux/mm.h 2004-02-17 22:57:13.000000000 -0500 +++ 2.6.3/include/linux/mm.h 2004-02-21 12:59:16.000000000 -0500 @@ -430,23 +430,23 @@ void shmem_lock(struct file * file, int lock); int shmem_zero_setup(struct vm_area_struct *); -void zap_page_range(struct vm_area_struct *vma, unsigned long address, - unsigned long size); int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm, struct vm_area_struct *start_vma, unsigned long start_addr, - unsigned long end_addr, unsigned long *nr_accounted); -void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, - unsigned long address, unsigned long size); + unsigned long end_addr, unsigned long *nr_accounted, int zap); void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); int zeromap_page_range(struct vm_area_struct *vma, unsigned long from, unsigned long size, pgprot_t prot); - -extern void invalidate_mmap_range(struct address_space *mapping, - loff_t const holebegin, - loff_t const holelen); +extern void invalidate_filemap_range(struct address_space *mapping, loff_t const start, loff_t const length); extern int vmtruncate(struct inode * inode, loff_t offset); +void invalidate_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, int all); + +static inline void zap_page_range(struct vm_area_struct *vma, ulong address, ulong size) +{ + invalidate_page_range(vma, address, size, 1); +} + extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)); extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); --- 2.6.3.clean/mm/memory.c 2004-02-17 22:57:47.000000000 -0500 +++ 2.6.3/mm/memory.c 2004-02-25 13:34:57.000000000 -0500 @@ -384,9 +384,13 @@ return 
-ENOMEM; } -static void -zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd, - unsigned long address, unsigned long size) +static inline int is_anon(struct page *page) +{ + return !page->mapping || PageSwapCache(page); +} + +static void zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd, + unsigned long address, unsigned long size, int all) { unsigned long offset; pte_t *ptep; @@ -409,8 +413,9 @@ continue; if (pte_present(pte)) { unsigned long pfn = pte_pfn(pte); - - pte = ptep_get_and_clear(ptep); + if (unlikely(!all) && is_anon(pfn_to_page(pfn))) + continue; + pte = ptep_get_and_clear(ptep); /* get dirty bit atomically */ tlb_remove_tlb_entry(tlb, ptep, address+offset); if (pfn_valid(pfn)) { struct page *page = pfn_to_page(pfn); @@ -426,17 +431,19 @@ } } } else { - if (!pte_file(pte)) + if (!pte_file(pte)) { + if (!all) + continue; free_swap_and_cache(pte_to_swp_entry(pte)); + } pte_clear(ptep); } } pte_unmap(ptep-1); } -static void -zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir, - unsigned long address, unsigned long size) +static void zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir, + unsigned long address, unsigned long size, int all) { pmd_t * pmd; unsigned long end; @@ -453,14 +460,14 @@ if (end > ((address + PGDIR_SIZE) & PGDIR_MASK)) end = ((address + PGDIR_SIZE) & PGDIR_MASK); do { - zap_pte_range(tlb, pmd, address, end - address); - address = (address + PMD_SIZE) & PMD_MASK; + zap_pte_range(tlb, pmd, address, end - address, all); + address = (address + PMD_SIZE) & PMD_MASK; pmd++; } while (address < end); } -void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, - unsigned long address, unsigned long end) +static void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, + unsigned long address, unsigned long end, int all) { pgd_t * dir; @@ -474,7 +481,7 @@ dir = pgd_offset(vma->vm_mm, address); tlb_start_vma(tlb, vma); do { - zap_pmd_range(tlb, dir, address, end - address); + zap_pmd_range(tlb, dir, address, end - 
address, all); address = (address + PGDIR_SIZE) & PGDIR_MASK; dir++; } while (address && (address < end)); @@ -524,7 +531,7 @@ */ int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm, struct vm_area_struct *vma, unsigned long start_addr, - unsigned long end_addr, unsigned long *nr_accounted) + unsigned long end_addr, unsigned long *nr_accounted, int all) { unsigned long zap_bytes = ZAP_BLOCK_SIZE; unsigned long tlb_start = 0; /* For tlb_finish_mmu */ @@ -568,7 +575,7 @@ tlb_start_valid = 1; } - unmap_page_range(*tlbp, vma, start, start + block); + unmap_page_range(*tlbp, vma, start, start + block, all); start += block; zap_bytes -= block; if ((long)zap_bytes > 0) @@ -594,8 +601,8 @@ * @address: starting address of pages to zap * @size: number of bytes to zap */ -void zap_page_range(struct vm_area_struct *vma, - unsigned long address, unsigned long size) +void invalidate_page_range(struct vm_area_struct *vma, + unsigned long address, unsigned long size, int all) { struct mm_struct *mm = vma->vm_mm; struct mmu_gather *tlb; @@ -612,7 +619,7 @@ lru_add_drain(); spin_lock(&mm->page_table_lock); tlb = tlb_gather_mmu(mm, 0); - unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted); + unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, all); tlb_finish_mmu(tlb, address, end); spin_unlock(&mm->page_table_lock); } @@ -1071,10 +1078,8 @@ * Both hba and hlen are page numbers in PAGE_SIZE units. * An hlen of zero blows away the entire portion file after hba. */ -static void -invalidate_mmap_range_list(struct list_head *head, - unsigned long const hba, - unsigned long const hlen) +static void invalidate_mmap_range_list(struct list_head *head, + unsigned long const hba, unsigned long const hlen, int all) { struct list_head *curr; unsigned long hea; /* last page of hole. */ @@ -1095,9 +1100,9 @@ continue; /* Mapping disjoint from hole. */ zba = (hba <= vba) ? vba : hba; zea = (vea <= hea) ? 
vea : hea; - zap_page_range(vp, + invalidate_page_range(vp, ((zba - vba) << PAGE_SHIFT) + vp->vm_start, - (zea - zba + 1) << PAGE_SHIFT); + (zea - zba + 1) << PAGE_SHIFT, all); } } @@ -1115,8 +1120,8 @@ * up to a PAGE_SIZE boundary. A holelen of zero truncates to the * end of the file. */ -void invalidate_mmap_range(struct address_space *mapping, - loff_t const holebegin, loff_t const holelen) +static void invalidate_mmap_range(struct address_space *mapping, + loff_t const holebegin, loff_t const holelen, int all) { unsigned long hba = holebegin >> PAGE_SHIFT; unsigned long hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT; @@ -1133,12 +1138,19 @@ /* Protect against page fault */ atomic_inc(&mapping->truncate_count); if (unlikely(!list_empty(&mapping->i_mmap))) - invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen); + invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen, all); if (unlikely(!list_empty(&mapping->i_mmap_shared))) - invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen); + invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen, all); up(&mapping->i_shared_sem); } -EXPORT_SYMBOL_GPL(invalidate_mmap_range); + + void unmap_mapping_range(struct address_space *mapping, + loff_t const start, loff_t const length) +{ + invalidate_mmap_range(mapping, start, length, 0); +} + +EXPORT_SYMBOL(unmap_mapping_range); /* * Handle all mappings that got truncated by a "truncate()" @@ -1156,7 +1168,7 @@ if (inode->i_size < offset) goto do_expand; i_size_write(inode, offset); - invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0); + invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0, 1); truncate_inode_pages(mapping, offset); goto out_truncate; --- 2.6.3.clean/mm/mmap.c 2004-02-17 22:58:32.000000000 -0500 +++ 2.6.3/mm/mmap.c 2004-02-19 22:46:01.000000000 -0500 @@ -1134,7 +1134,7 @@ lru_add_drain(); tlb = tlb_gather_mmu(mm, 0); - unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted); + unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, 1); 
vm_unacct_memory(nr_accounted); if (is_hugepage_only_range(start, end - start)) @@ -1436,7 +1436,7 @@ flush_cache_mm(mm); /* Use ~0UL here to ensure all VMAs in the mm are unmapped */ mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0, - ~0UL, &nr_accounted); + ~0UL, &nr_accounted, 1); vm_unacct_memory(nr_accounted); BUG_ON(mm->map_count); /* This is just debugging */ clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
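For readers following the range arithmetic in invalidate_mmap_range_list in the patch above, here is a minimal user-space sketch of how a hole given as (hba, hlen) in file pages is clipped against one mapping. It is a model only, not kernel code: struct model_vma, struct model_range, MODEL_PAGE_SHIFT and clip_hole_to_vma are hypothetical names, a 4 KiB page size is assumed, and the computation of hea is an assumption inferred from the "hlen of zero" comment, since the diff context does not show it.

```c
#include <limits.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12UL /* assumed 4 KiB pages */

/* Hypothetical stand-in for the three vm_area_struct fields used here. */
struct model_vma {
    unsigned long vm_start; /* first user virtual address of the mapping */
    unsigned long vm_end;   /* one past the last user virtual address */
    unsigned long vm_pgoff; /* file page at which the mapping begins */
};

/* User virtual range that would be handed to invalidate_page_range(). */
struct model_range {
    unsigned long address;
    unsigned long size;
};

/*
 * Clip the hole [hba, hba + hlen - 1] (file pages) to one mapping.
 * Returns false when the mapping is disjoint from the hole.
 */
static bool clip_hole_to_vma(const struct model_vma *vp,
                             unsigned long hba, unsigned long hlen,
                             struct model_range *out)
{
    unsigned long hea = hba + hlen - 1; /* last page of hole */
    if (hea < hba)
        hea = ULONG_MAX; /* hlen == 0 or overflow: hole runs to end of file */

    unsigned long vba = vp->vm_pgoff;
    unsigned long vea =
        vba + ((vp->vm_end - vp->vm_start) >> MODEL_PAGE_SHIFT) - 1;

    if (hea < vba || vea < hba)
        return false; /* mapping disjoint from hole */

    unsigned long zba = (hba <= vba) ? vba : hba; /* first page to unmap */
    unsigned long zea = (vea <= hea) ? vea : hea; /* last page to unmap */

    out->address = ((zba - vba) << MODEL_PAGE_SHIFT) + vp->vm_start;
    out->size = (zea - zba + 1) << MODEL_PAGE_SHIFT;
    return true;
}
```

In this model a mapping of file pages 10..19 at 0x100000 with a hole of hba=12, hlen=3 clips to address 0x102000 and size 0x3000, so the arithmetic itself works on sub-file ranges.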
* Re: [RFC] Distributed mmap API 2004-02-25 21:04 ` [RFC] Distributed mmap API Daniel Phillips @ 2004-02-25 19:12 ` Paul E. McKenney 2004-02-25 19:14 ` Paul E. McKenney 2004-02-25 22:07 ` Andrew Morton 2 siblings, 0 replies; 68+ messages in thread From: Paul E. McKenney @ 2004-02-25 19:12 UTC (permalink / raw) To: Daniel Phillips Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm On Wed, Feb 25, 2004 at 04:04:19PM -0500, Daniel Phillips wrote: Very cool! > This is the function formerly known as invalidate_mmap_range, with the > addition of a new code path in the zap_ call chain to handle MAP_PRIVATE > properly. This function by itself is enough to support a crude but useful > form of distributed mmap where a shared file is cached only on one cluster > node at a time. > > To use this, the distributed filesystem has to hook do_no_page to intercept > page faults and carry out the needed global locking. The locking itself does > not require any new kernel hooks. In brief, the patch here and another patch > to be presented for the do_no_page hook, together provide the core kernel API > for a simplified, distributed mmap. (Note that there may be a workaround for > the lack of a do_no_page hook, but certainly not as simple and robust.) > > To put this in perspective, I'll mention the two big limitations of the > simplified API: > > 1) Invalidation is always a whole file at a time I must be missing something subtle here... It looks to me like the new unmap_mapping_range() API is capable of invalidating portions of files, based on the "start" and "length" arguments. What am I missing? Thanx, Paul
* Re: [RFC] Distributed mmap API 2004-02-25 21:04 ` [RFC] Distributed mmap API Daniel Phillips 2004-02-25 19:12 ` Paul E. McKenney @ 2004-02-25 19:14 ` Paul E. McKenney 2004-02-25 22:07 ` Andrew Morton 2 siblings, 0 replies; 68+ messages in thread From: Paul E. McKenney @ 2004-02-25 19:14 UTC (permalink / raw) To: Daniel Phillips Cc: Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm On Wed, Feb 25, 2004 at 04:04:19PM -0500, Daniel Phillips wrote: > > I'd like to take this opportunity to apologize to Paul for derailing his more > modest proposal, but unfortunately, the semantics that could be obtained that > way are fatally flawed: private mmaps just won't work. What I've written here > is about the minimum that supports acceptable mmap semantics. No problem -- it looks like we are getting a much better result than I was proposing, thank you for helping me to see the light! Thanx, Paul
* Re: [RFC] Distributed mmap API 2004-02-25 21:04 ` [RFC] Distributed mmap API Daniel Phillips 2004-02-25 19:12 ` Paul E. McKenney 2004-02-25 19:14 ` Paul E. McKenney @ 2004-02-25 22:07 ` Andrew Morton 2004-02-25 22:07 ` Daniel Phillips 2004-03-03 3:00 ` Daniel Phillips 2 siblings, 2 replies; 68+ messages in thread From: Andrew Morton @ 2004-02-25 22:07 UTC (permalink / raw) To: Daniel Phillips; +Cc: paulmck, sct, hch, linux-kernel, linux-mm Daniel Phillips <phillips@arcor.de> wrote: > > - pte = ptep_get_and_clear(ptep); > + if (unlikely(!all) && is_anon(pfn_to_page(pfn))) > + continue; > + pte = ptep_get_and_clear(ptep); /* get dirty bit atomically */ > tlb_remove_tlb_entry(tlb, ptep, address+offset); > if (pfn_valid(pfn)) { I think you need to check pfn_valid() before running is_anon(pfn_to_page())
* Re: [RFC] Distributed mmap API 2004-02-25 22:07 ` Andrew Morton @ 2004-02-25 22:07 ` Daniel Phillips 2004-02-25 22:16 ` Andrew Morton 2004-03-03 3:00 ` Daniel Phillips 1 sibling, 1 reply; 68+ messages in thread From: Daniel Phillips @ 2004-02-25 22:07 UTC (permalink / raw) To: Andrew Morton; +Cc: paulmck, sct, hch, linux-kernel, linux-mm On Wednesday 25 February 2004 17:07, Andrew Morton wrote: > Daniel Phillips <phillips@arcor.de> wrote: > > - pte = ptep_get_and_clear(ptep); > > + if (unlikely(!all) && is_anon(pfn_to_page(pfn))) > > + continue; > > + pte = ptep_get_and_clear(ptep); /* get dirty bit atomically */ > > tlb_remove_tlb_entry(tlb, ptep, address+offset); > > if (pfn_valid(pfn)) { > > I think you need to check pfn_valid() before running is_anon(pfn_to_page()) Easy enough: if (unlikely(!all) && pfn_valid(pfn) && is_anon(pfn_to_page(pfn))) but how can we legitimately get !pfn_valid there? Regards, Daniel
* Re: [RFC] Distributed mmap API 2004-02-25 22:07 ` Daniel Phillips @ 2004-02-25 22:16 ` Andrew Morton 2004-02-25 22:46 ` Daniel Phillips 0 siblings, 1 reply; 68+ messages in thread From: Andrew Morton @ 2004-02-25 22:16 UTC (permalink / raw) To: Daniel Phillips; +Cc: paulmck, sct, hch, linux-kernel, linux-mm Daniel Phillips <phillips@arcor.de> wrote: > > On Wednesday 25 February 2004 17:07, Andrew Morton wrote: > > Daniel Phillips <phillips@arcor.de> wrote: > > > - pte = ptep_get_and_clear(ptep); > > > + if (unlikely(!all) && is_anon(pfn_to_page(pfn))) > > > + continue; > > > + pte = ptep_get_and_clear(ptep); /* get dirty bit atomically */ > > > tlb_remove_tlb_entry(tlb, ptep, address+offset); > > > if (pfn_valid(pfn)) { > > > > I think you need to check pfn_valid() before running is_anon(pfn_to_page()) > > Easy enough: > > if (unlikely(!all) && pfn_valid(pfn) && is_anon(pfn_to_page(pfn))) You can probably factor this into page = NULL; if (pfn_valid(..)) page = pfn_to_page(..) if (page) .. if (page) .. > but how can we legitimately get !pfn_valid there? A mapping of some I/O region?
* Re: [RFC] Distributed mmap API 2004-02-25 22:16 ` Andrew Morton @ 2004-02-25 22:46 ` Daniel Phillips 0 siblings, 0 replies; 68+ messages in thread From: Daniel Phillips @ 2004-02-25 22:46 UTC (permalink / raw) To: Andrew Morton; +Cc: paulmck, sct, hch, linux-kernel, linux-mm On Wednesday 25 February 2004 17:16, Andrew Morton wrote: > > but how can we legitimately get !pfn_valid there? > > A mapping of some I/O region? With MAP_PRIVATE, on a distributed filesystem? OK... Can we recognize those I/O vmas and handle them with their own separate loop, saving a few cycles for the common case? Or just: if (pte_present(pte)) { unsigned long pfn = pte_pfn(pte); struct page *page; if (unlikely(!pfn_valid(pfn))) { ptep_get_and_clear(ptep); tlb_remove_tlb_entry(tlb, ptep, address+offset); continue; } page = pfn_to_page(pfn); if (unlikely(!all) && is_anon(page)) continue; pte = ptep_get_and_clear(ptep); /* get dirty bit atomically */ tlb_remove_tlb_entry(tlb, ptep, address+offset); if (PageReserved(page)) continue; if (pte_dirty(pte)) set_page_dirty(page); if (page->mapping && pte_young(pte) && !PageSwapCache(page)) mark_page_accessed(page); tlb->freed++; page_remove_rmap(page, ptep); tlb_remove_page(tlb, page); } else { Regards, Daniel
* Re: [RFC] Distributed mmap API 2004-02-25 22:07 ` Andrew Morton 2004-02-25 22:07 ` Daniel Phillips @ 2004-03-03 3:00 ` Daniel Phillips 2004-03-03 3:15 ` Andrew Morton 1 sibling, 1 reply; 68+ messages in thread From: Daniel Phillips @ 2004-03-03 3:00 UTC (permalink / raw) To: Andrew Morton; +Cc: paulmck, sct, hch, linux-kernel, linux-mm On Wednesday 25 February 2004 17:07, Andrew Morton wrote: > I think you need to check pfn_valid() before running is_anon(pfn_to_page()) Hi Andrew, Here is a rearranged zap_pte_range that avoids any operations for out-of-range pfns. The only annoyance with this factoring is that tlb_remove_tlb_entry is expanded in two places. For most architectures the macro is null anyway, and for the rest it's hardly any code at all, except for ppc64, which has __tlb_remove_tlb_entry as an inline that looks like it expands into a fair amount of code. But probably not enough to worry about. I took the opportunity to remove some indents by liberal use of continues. This version reads pretty easily. 
if (pte_present(pte)) { unsigned long pfn = pte_pfn(pte); struct page *page; if (unlikely(!pfn_valid(pfn))) { pte_clear(ptep); tlb_remove_tlb_entry(tlb, ptep, address+offset); continue; } page = pfn_to_page(pfn); if (unlikely(!all) && is_anon(page)) continue; pte = ptep_get_and_clear(ptep); /* get dirty bit atomically */ tlb_remove_tlb_entry(tlb, ptep, address+offset); if (PageReserved(page)) continue; if (pte_dirty(pte)) set_page_dirty(page); if (page->mapping && pte_young(pte) && !PageSwapCache(page)) mark_page_accessed(page); tlb->freed++; page_remove_rmap(page, ptep); tlb_remove_page(tlb, page); continue; } I also tried your "if (page)" suggestion, which looks like this: if (pte_present(pte)) { unsigned long pfn = pte_pfn(pte); struct page *page = NULL; if (likely(pfn_valid(pfn))) { page = pfn_to_page(pfn); if (unlikely(!all) && is_anon(page)) continue; } pte = ptep_get_and_clear(ptep); /* get dirty bit atomically */ tlb_remove_tlb_entry(tlb, ptep, address+offset); if (unlikely(!page) || PageReserved(page)) continue; if (pte_dirty(pte)) set_page_dirty(page); if (page->mapping && pte_young(pte) && !PageSwapCache(page)) mark_page_accessed(page); tlb->freed++; page_remove_rmap(page, ptep); tlb_remove_page(tlb, page); continue; } It came out ok too - only one "if (page)", a little shorter and no extra macro expansions, though it's a little harder to follow and might be microscopically slower. The complete patch below uses the first form, and does away with the is_anon inline. 
Regards, Daniel --- 2.6.3.clean/include/linux/mm.h 2004-02-17 22:57:13.000000000 -0500 +++ 2.6.3/include/linux/mm.h 2004-02-21 12:59:16.000000000 -0500 @@ -430,23 +430,23 @@ void shmem_lock(struct file * file, int lock); int shmem_zero_setup(struct vm_area_struct *); -void zap_page_range(struct vm_area_struct *vma, unsigned long address, - unsigned long size); int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm, struct vm_area_struct *start_vma, unsigned long start_addr, - unsigned long end_addr, unsigned long *nr_accounted); -void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, - unsigned long address, unsigned long size); + unsigned long end_addr, unsigned long *nr_accounted, int zap); void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); int zeromap_page_range(struct vm_area_struct *vma, unsigned long from, unsigned long size, pgprot_t prot); - -extern void invalidate_mmap_range(struct address_space *mapping, - loff_t const holebegin, - loff_t const holelen); +extern void invalidate_filemap_range(struct address_space *mapping, loff_t const start, loff_t const length); extern int vmtruncate(struct inode * inode, loff_t offset); +void invalidate_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, int all); + +static inline void zap_page_range(struct vm_area_struct *vma, ulong address, ulong size) +{ + invalidate_page_range(vma, address, size, 1); +} + extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)); extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); --- 2.6.3.clean/mm/memory.c 2004-02-17 22:57:47.000000000 -0500 +++ 2.6.3/mm/memory.c 2004-03-02 20:59:58.000000000 -0500 @@ -384,9 +384,8 @@ return 
-ENOMEM; } -static void -zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd, - unsigned long address, unsigned long size) +static void zap_pte_range(struct mmu_gather *tlb, pmd_t * pmd, + unsigned long address, unsigned long size, int all) { unsigned long offset; pte_t *ptep; @@ -409,34 +408,41 @@ continue; if (pte_present(pte)) { unsigned long pfn = pte_pfn(pte); + struct page *page; - pte = ptep_get_and_clear(ptep); - tlb_remove_tlb_entry(tlb, ptep, address+offset); - if (pfn_valid(pfn)) { - struct page *page = pfn_to_page(pfn); - if (!PageReserved(page)) { - if (pte_dirty(pte)) - set_page_dirty(page); - if (page->mapping && pte_young(pte) && - !PageSwapCache(page)) - mark_page_accessed(page); - tlb->freed++; - page_remove_rmap(page, ptep); - tlb_remove_page(tlb, page); - } + if (unlikely(!pfn_valid(pfn))) { + pte_clear(ptep); + tlb_remove_tlb_entry(tlb, ptep, address+offset); + continue; } - } else { - if (!pte_file(pte)) - free_swap_and_cache(pte_to_swp_entry(pte)); - pte_clear(ptep); + page = pfn_to_page(pfn); + if (unlikely(!all) && (!page->mapping || PageSwapCache(page))) + continue; + pte = ptep_get_and_clear(ptep); /* get dirty bit atomically */ + tlb_remove_tlb_entry(tlb, ptep, address+offset); + if (PageReserved(page)) + continue; + if (pte_dirty(pte)) + set_page_dirty(page); + if (page->mapping && pte_young(pte) && !PageSwapCache(page)) + mark_page_accessed(page); + tlb->freed++; + page_remove_rmap(page, ptep); + tlb_remove_page(tlb, page); + continue; } + if (!pte_file(pte)) { + if (!all) + continue; + free_swap_and_cache(pte_to_swp_entry(pte)); + } + pte_clear(ptep); } pte_unmap(ptep-1); } -static void -zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir, - unsigned long address, unsigned long size) +static void zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir, + unsigned long address, unsigned long size, int all) { pmd_t * pmd; unsigned long end; @@ -453,14 +459,14 @@ if (end > ((address + PGDIR_SIZE) & PGDIR_MASK)) end = ((address + PGDIR_SIZE) & 
PGDIR_MASK); do { - zap_pte_range(tlb, pmd, address, end - address); - address = (address + PMD_SIZE) & PMD_MASK; + zap_pte_range(tlb, pmd, address, end - address, all); + address = (address + PMD_SIZE) & PMD_MASK; pmd++; } while (address < end); } -void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, - unsigned long address, unsigned long end) +static void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, + unsigned long address, unsigned long end, int all) { pgd_t * dir; @@ -474,7 +480,7 @@ dir = pgd_offset(vma->vm_mm, address); tlb_start_vma(tlb, vma); do { - zap_pmd_range(tlb, dir, address, end - address); + zap_pmd_range(tlb, dir, address, end - address, all); address = (address + PGDIR_SIZE) & PGDIR_MASK; dir++; } while (address && (address < end)); @@ -524,7 +530,7 @@ */ int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm, struct vm_area_struct *vma, unsigned long start_addr, - unsigned long end_addr, unsigned long *nr_accounted) + unsigned long end_addr, unsigned long *nr_accounted, int all) { unsigned long zap_bytes = ZAP_BLOCK_SIZE; unsigned long tlb_start = 0; /* For tlb_finish_mmu */ @@ -568,7 +574,7 @@ tlb_start_valid = 1; } - unmap_page_range(*tlbp, vma, start, start + block); + unmap_page_range(*tlbp, vma, start, start + block, all); start += block; zap_bytes -= block; if ((long)zap_bytes > 0) @@ -594,8 +600,8 @@ * @address: starting address of pages to zap * @size: number of bytes to zap */ -void zap_page_range(struct vm_area_struct *vma, - unsigned long address, unsigned long size) +void invalidate_page_range(struct vm_area_struct *vma, + unsigned long address, unsigned long size, int all) { struct mm_struct *mm = vma->vm_mm; struct mmu_gather *tlb; @@ -612,7 +618,7 @@ lru_add_drain(); spin_lock(&mm->page_table_lock); tlb = tlb_gather_mmu(mm, 0); - unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted); + unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, all); tlb_finish_mmu(tlb, address, end); 
spin_unlock(&mm->page_table_lock); } @@ -1071,10 +1077,8 @@ * Both hba and hlen are page numbers in PAGE_SIZE units. * An hlen of zero blows away the entire portion file after hba. */ -static void -invalidate_mmap_range_list(struct list_head *head, - unsigned long const hba, - unsigned long const hlen) +static void invalidate_mmap_range_list(struct list_head *head, + unsigned long const hba, unsigned long const hlen, int all) { struct list_head *curr; unsigned long hea; /* last page of hole. */ @@ -1095,9 +1099,9 @@ continue; /* Mapping disjoint from hole. */ zba = (hba <= vba) ? vba : hba; zea = (vea <= hea) ? vea : hea; - zap_page_range(vp, + invalidate_page_range(vp, ((zba - vba) << PAGE_SHIFT) + vp->vm_start, - (zea - zba + 1) << PAGE_SHIFT); + (zea - zba + 1) << PAGE_SHIFT, all); } } @@ -1115,8 +1119,8 @@ * up to a PAGE_SIZE boundary. A holelen of zero truncates to the * end of the file. */ -void invalidate_mmap_range(struct address_space *mapping, - loff_t const holebegin, loff_t const holelen) +static void invalidate_mmap_range(struct address_space *mapping, + loff_t const holebegin, loff_t const holelen, int all) { unsigned long hba = holebegin >> PAGE_SHIFT; unsigned long hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT; @@ -1133,12 +1137,19 @@ /* Protect against page fault */ atomic_inc(&mapping->truncate_count); if (unlikely(!list_empty(&mapping->i_mmap))) - invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen); + invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen, all); if (unlikely(!list_empty(&mapping->i_mmap_shared))) - invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen); + invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen, all); up(&mapping->i_shared_sem); } -EXPORT_SYMBOL_GPL(invalidate_mmap_range); + + void unmap_mapping_range(struct address_space *mapping, + loff_t const start, loff_t const length) +{ + invalidate_mmap_range(mapping, start, length, 0); +} + +EXPORT_SYMBOL(unmap_mapping_range); /* * Handle all mappings 
that got truncated by a "truncate()" @@ -1156,7 +1167,7 @@ if (inode->i_size < offset) goto do_expand; i_size_write(inode, offset); - invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0); + invalidate_mmap_range(mapping, offset + PAGE_SIZE - 1, 0, 1); truncate_inode_pages(mapping, offset); goto out_truncate; --- 2.6.3.clean/mm/mmap.c 2004-02-17 22:58:32.000000000 -0500 +++ 2.6.3/mm/mmap.c 2004-02-19 22:46:01.000000000 -0500 @@ -1134,7 +1134,7 @@ lru_add_drain(); tlb = tlb_gather_mmu(mm, 0); - unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted); + unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, 1); vm_unacct_memory(nr_accounted); if (is_hugepage_only_range(start, end - start)) @@ -1436,7 +1436,7 @@ flush_cache_mm(mm); /* Use ~0UL here to ensure all VMAs in the mm are unmapped */ mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0, - ~0UL, &nr_accounted); + ~0UL, &nr_accounted, 1); vm_unacct_memory(nr_accounted); BUG_ON(mm->map_count); /* This is just debugging */ clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
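The per-PTE decisions made by the rearranged zap_pte_range in the patch above can be summarized with a small user-space model. This is a sketch for illustration, not kernel code: struct model_pte, enum zap_action and zap_decision are hypothetical names, and the booleans stand in for pfn_valid(), page->mapping, PageSwapCache() and PageReserved(). Dirty and accessed bookkeeping is left out. The order of the tests follows the patch as posted; `all` is 1 for the zap_page_range wrapper and the truncate path, and 0 for unmap_mapping_range.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PTE states the loop distinguishes. */
enum pte_kind { PTE_NONE, PTE_PRESENT, PTE_SWAP, PTE_FILE };

struct model_pte {
    enum pte_kind kind;
    bool pfn_valid;   /* models pfn_valid(pte_pfn(pte)) */
    bool has_mapping; /* models page->mapping != NULL */
    bool swap_cache;  /* models PageSwapCache(page) */
    bool reserved;    /* models PageReserved(page) */
};

enum zap_action {
    ZAP_SKIP,                   /* leave the PTE untouched */
    ZAP_CLEAR_ONLY,             /* clear the PTE, release nothing */
    ZAP_CLEAR_AND_RELEASE_PAGE, /* clear the PTE, drop rmap and page */
    ZAP_CLEAR_AND_FREE_SWAP     /* free the swap entry, clear the PTE */
};

static enum zap_action zap_decision(const struct model_pte *pte, bool all)
{
    switch (pte->kind) {
    case PTE_NONE:
        return ZAP_SKIP;
    case PTE_PRESENT:
        if (!pte->pfn_valid)
            return ZAP_CLEAR_ONLY; /* e.g. a mapping of an I/O region */
        /* Anonymous or swap-cache page: keep it unless zapping everything. */
        if (!all && (!pte->has_mapping || pte->swap_cache))
            return ZAP_SKIP;
        if (pte->reserved)
            return ZAP_CLEAR_ONLY;
        return ZAP_CLEAR_AND_RELEASE_PAGE;
    case PTE_SWAP:
        /* Swapped-out private data survives a file-only invalidation. */
        return all ? ZAP_CLEAR_AND_FREE_SWAP : ZAP_SKIP;
    case PTE_FILE:
        return ZAP_CLEAR_ONLY; /* nonlinear file PTE */
    }
    return ZAP_SKIP;
}
```

With all == 0 the model skips a private copy-on-write page and a swap entry but still releases a file-backed page, which is the behaviour the MAP_PRIVATE discussion in this thread is after.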
* Re: [RFC] Distributed mmap API
2004-03-03 3:00 ` Daniel Phillips
@ 2004-03-03 3:15 ` Andrew Morton
2004-03-03 13:06 ` Daniel Phillips
0 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2004-03-03 3:15 UTC (permalink / raw)
To: Daniel Phillips; +Cc: paulmck, sct, hch, linux-kernel, linux-mm
Daniel Phillips <phillips@arcor.de> wrote:
>
> Here is a rearranged zap_pte_range that avoids any operations for out-of-range
> pfns.

Please remind us why Linux needs this patch?

> +static void invalidate_mmap_range_list(struct list_head *head,
> +	unsigned long const hba, unsigned long const hlen, int all)
> {

I forget what `all' does? anon+swapcache as well as pagecache?

A bit of API documentation here would be appropriate.
* Re: [RFC] Distributed mmap API
2004-03-03 3:15 ` Andrew Morton
@ 2004-03-03 13:06 ` Daniel Phillips
2004-03-04 18:55 ` Paul E. McKenney
0 siblings, 1 reply; 68+ messages in thread
From: Daniel Phillips @ 2004-03-03 13:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: paulmck, sct, hch, linux-kernel, linux-mm
On Tuesday 02 March 2004 22:15, Andrew Morton wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> > Here is a rearranged zap_pte_range that avoids any operations for
> > out-of-range pfns.
>
> Please remind us why Linux needs this patch?

This is purely to support mmap, including MAP_PRIVATE, accurately on
distributed filesystems, where "accurately" is defined as "with local
filesystem semantics".

If the same file region is mmapped by more than one node, only one of them is
allowed to have a given page of the mmap valid in the page tables at any
time. When a memory write occurs on one of the other nodes, it must fault so
that the distributed filesystem can arrange for exclusive ownership of the
file page (or as GFS currently implements it, the whole file) to change from
one node to the other. At this time, any pages already faulted in must be
unmapped so that future memory accesses will properly fault. This unmapping
is done by zap_page_range, which has nearly the semantics we want except that
it will also unmap private pages of a MAP_PRIVATE mapping, destroying the
only copy of that data. A user would observe the privately written data
spontaneously revert to the current file contents. The purpose of this patch
is to fix that.

This patch allows a distributed filesystem to unmap file-backed memory without
unmapping anonymous pages or deleting swap cache, avoiding the above data
destruction. Since zap_page_range is the only function that knows how to
unmap memory, it needs to be taught how to skip anonymous pages.

An alternative to this patch is simply to export zap_page_range, then the
distributed filesystem can walk the lists of mmapped vmas itself, skipping
any that are MAP_PRIVATE. This achieves POSIX local filesystem semantics,
but not Linux local filesystem semantics, because updates to the mmap from
other nodes become visible unpredictably. Earlier this year, Linus said that
he wants tighter semantics for distributed MAP_PRIVATE.

This patch presses zap_page_range into service in a way that was not
originally intended, that is, for invalidation as opposed to destruction of
memory regions. The requirements are identical except for the MAP_PRIVATE
detail. Forking the whole zap_ chain would be even more distasteful than
grafting on this option flag. It's also impractical to implement a zap_
variant within a dfs module because of the heavy use of per-arch APIs. As
far as I can see, this patch is the minimum cost of having accurate semantics
for distributed MAP_PRIVATE mmap.

I'll take the opportunity to beat my chest once again about the fact that
this doesn't benefit anything other than distributed filesystems. On the
other hand, the cost is minuscule: 54 bytes, a little stack and likely no
measurable cpu.

> I forget what `all' does? anon+swapcache as well as pagecache?

Yes

> A bit of API documentation here would be appropriate.

Oops, sorry:

/**
 * zap_page_range - remove user pages in a given range
 * @vma: vm_area_struct holding the applicable pages
 * @address: starting address of pages to zap
 * @size: number of bytes to zap
 * @all: also unmap anonymous pages
 */
void zap_page_range(struct vm_area_struct *vma,
	unsigned long address, unsigned long size, int all)

Regards,

Daniel
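The effect of the `all' flag described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct model_pte, enum page_kind and model_zap_range() are hypothetical names invented for the example and do not correspond to kernel structures; the model only captures the rule that anonymous pages are skipped unless `all' is nonzero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of what backs a mapped page; not a kernel type. */
enum page_kind {
	PAGE_FILE,	/* page cache page, can be refetched from the file */
	PAGE_ANON	/* private copy-on-write page, the only copy of its data */
};

/* Hypothetical model of one page table entry. */
struct model_pte {
	int present;		/* nonzero while the page is mapped */
	enum page_kind kind;
};

/*
 * Unmap the entries in [first, first + count).
 * all != 0: unmap everything, as truncate or munmap would.
 * all == 0: unmap only file-backed pages and leave anonymous pages
 *           in place, as a distributed filesystem invalidation wants.
 * Returns the number of entries that were unmapped.
 */
size_t model_zap_range(struct model_pte *ptes, size_t npte,
		       size_t first, size_t count, int all)
{
	size_t zapped = 0;

	assert(first <= npte && count <= npte - first);

	for (size_t i = first; i < first + count; i++) {
		if (!ptes[i].present)
			continue;	/* nothing mapped here */
		if (!all && ptes[i].kind == PAGE_ANON)
			continue;	/* keep the private copy */
		ptes[i].present = 0;
		zapped++;
	}
	return zapped;
}
```

In this model, an invalidation with `all' set to 0 leaves a privately written page mapped, so later reads still see the private data, while a call with `all' set to 1 removes it along with the file-backed pages.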
* Re: [RFC] Distributed mmap API
2004-03-03 13:06 ` Daniel Phillips
@ 2004-03-04 18:55 ` Paul E. McKenney
0 siblings, 0 replies; 68+ messages in thread
From: Paul E. McKenney @ 2004-03-04 18:55 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, sct, hch, linux-kernel, linux-mm
This matches what we are after here!

Thanx, Paul

On Wed, Mar 03, 2004 at 08:06:20AM -0500, Daniel Phillips wrote:
> On Tuesday 02 March 2004 22:15, Andrew Morton wrote:
> > Daniel Phillips <phillips@arcor.de> wrote:
> > > Here is a rearranged zap_pte_range that avoids any operations for
> > > out-of-range pfns.
> >
> > Please remind us why Linux needs this patch?
>
> This is purely to support mmap, including MAP_PRIVATE, accurately on
> distributed filesystems, where "accurately" is defined as "with local
> filesystem semantics".
>
> If the same file region is mmapped by more than one node, only one of them is
> allowed to have a given page of the mmap valid in the page tables at any
> time. When a memory write occurs on one of the other nodes, it must fault so
> that the distributed filesystem can arrange for exclusive ownership of the
> file page (or as GFS currently implements it, the whole file) to change from
> one node to the other. At this time, any pages already faulted in must be
> unmapped so that future memory accesses will properly fault. This unmapping
> is done by zap_page_range, which has nearly the semantics we want except that
> it will also unmap private pages of a MAP_PRIVATE mapping, destroying the
> only copy of that data. A user would observe the privately written data
> spontaneously revert to the current file contents. The purpose of this patch
> is to fix that.
>
> This patch allows a distributed filesystem to unmap file-backed memory without
> unmapping anonymous pages or deleting swap cache, avoiding the above data
> destruction. Since zap_page_range is the only function that knows how to
> unmap memory, it needs to be taught how to skip anonymous pages.
>
> An alternative to this patch is simply to export zap_page_range, then the
> distributed filesystem can walk the lists of mmapped vmas itself, skipping
> any that are MAP_PRIVATE. This achieves POSIX local filesystem semantics,
> but not Linux local filesystem semantics, because updates to the mmap from
> other nodes become visible unpredictably. Earlier this year, Linus said that
> he wants tighter semantics for distributed MAP_PRIVATE.
>
> This patch presses zap_page_range into service in a way that was not
> originally intended, that is, for invalidation as opposed to destruction of
> memory regions. The requirements are identical except for the MAP_PRIVATE
> detail. Forking the whole zap_ chain would be even more distasteful than
> grafting on this option flag. It's also impractical to implement a zap_
> variant within a dfs module because of the heavy use of per-arch APIs. As
> far as I can see, this patch is the minimum cost of having accurate semantics
> for distributed MAP_PRIVATE mmap.
>
> I'll take the opportunity to beat my chest once again about the fact that
> this doesn't benefit anything other than distributed filesystems. On the
> other hand, the cost is minuscule: 54 bytes, a little stack and likely no
> measurable cpu.
>
> > I forget what `all' does? anon+swapcache as well as pagecache?
>
> Yes
>
> > A bit of API documentation here would be appropriate.
>
> Oops, sorry:
>
> /**
>  * zap_page_range - remove user pages in a given range
>  * @vma: vm_area_struct holding the applicable pages
>  * @address: starting address of pages to zap
>  * @size: number of bytes to zap
>  * @all: also unmap anonymous pages
>  */
> void zap_page_range(struct vm_area_struct *vma,
> 	unsigned long address, unsigned long size, int all)
>
> Regards,
>
> Daniel
* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 20:37 ` Daniel Phillips
2004-02-20 14:01 ` Paul E. McKenney
@ 2004-02-20 21:17 ` Christoph Hellwig
2004-02-20 22:16 ` Daniel Phillips
1 sibling, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2004-02-20 21:17 UTC (permalink / raw)
To: Daniel Phillips
Cc: paulmck, Stephen C. Tweedie, Andrew Morton, Christoph Hellwig, linux-kernel, linux-mm
On Fri, Feb 20, 2004 at 03:37:26PM -0500, Daniel Phillips wrote:
> It does, thanks for the catch. Please bear with me for a moment while I
> reroll this, then hopefully we can move on to the more interesting discussion
> of whether it's worth it. (Yes it is :)

What about the more interesting question of who needs it? I think this
whole discussion of who needs what and which approach is better is pretty
much moot as long as we don't have any in-tree users.

Instead of wasting your time on different designs you should hurry up and
get your filesystems encumbrance-reviewed, cleaned up and merged - with
in-tree users we have a chance of finding the right API. And your newly
started discussion shows pretty much that with only out-of-tree users we'll
never get a sane API.
* Re: Non-GPL export of invalidate_mmap_range
2004-02-20 21:17 ` Non-GPL export of invalidate_mmap_range Christoph Hellwig
@ 2004-02-20 22:16 ` Daniel Phillips
0 siblings, 0 replies; 68+ messages in thread
From: Daniel Phillips @ 2004-02-20 22:16 UTC (permalink / raw)
To: Christoph Hellwig
Cc: paulmck, Stephen C. Tweedie, Andrew Morton, linux-kernel, linux-mm
On Friday 20 February 2004 16:17, Christoph Hellwig wrote:
> On Fri, Feb 20, 2004 at 03:37:26PM -0500, Daniel Phillips wrote:
> > It does, thanks for the catch. Please bear with me for a moment while I
> > reroll this, then hopefully we can move on to the more interesting
> > discussion of whether it's worth it. (Yes it is :)
>
> What about the more interesting question of who needs it? I think this
> whole discussion of who needs what and which approach is better is pretty
> much moot as long as we don't have any in-tree users.

We settled that question in this case, see Paul's "surrender" above ;)

> Instead of wasting your time on different designs you should hurry up and
> get your filesystems encumbrance-reviewed, cleaned up and merged -
> with in-tree users we have a chance of finding the right API. And your
> newly started discussion shows pretty much that with only out-of-tree users
> we'll never get a sane API.

Again, we (everybody who cared to jump in) now agree on what is sane here;
it's quite logical. As for supplying background material so this makes sense
to a wider group of people, sorry, it's been on my to-do list for a while.

Getting a DFS, namely Sistina GFS, into the tree is underway as you know from
the press release; however, turning the ship takes time. Meanwhile, the API
discussion can't wait because the rudder on that ship is even smaller.

Regards,

Daniel
* Re: Non-GPL export of invalidate_mmap_range
2004-02-17 12:40 ` Paul E. McKenney
2004-02-18 0:19 ` Andrew Morton
@ 2004-02-18 12:12 ` Dominik Kubla
1 sibling, 0 replies; 68+ messages in thread
From: Dominik Kubla @ 2004-02-18 12:12 UTC (permalink / raw)
To: paulmck; +Cc: Christoph Hellwig, akpm, linux-kernel, linux-mm
On Tuesday 17 February 2004 13:40, Paul E. McKenney wrote:
> These URLs do require that you register, but there is no cost nor any
> agreement other than the GPL itself. The Linux client has not been
> shipped as product yet. The code is still quite rough, which is one
> reason that it has not been submitted to, for example, LKML. ;-)

But registering requires disclosing an unreasonable amount of personal
data. This is not acceptable.

Kind regards,
Dominik Kubla
--
Steal my cash, car and TV - but leave the computer!
-- Soenke Lange <soenke@escher.north.de>
* Re: Non-GPL export of invalidate_mmap_range
2004-02-16 19:09 Non-GPL export of invalidate_mmap_range Paul E. McKenney
2004-02-17 2:31 ` Andrew Morton
2004-02-17 7:35 ` Christoph Hellwig
@ 2004-02-17 22:22 ` David Weinehall
2 siblings, 0 replies; 68+ messages in thread
From: David Weinehall @ 2004-02-17 22:22 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: akpm, linux-kernel, linux-mm
On Mon, Feb 16, 2004 at 11:09:27AM -0800, Paul E. McKenney wrote:
> Hello, Andrew,
>
> The attached patch to make invalidate_mmap_range() non-GPL exported
> seems to have been lost somewhere between 2.6.1-mm4 and 2.6.1-mm5.
> It still applies cleanly. Could you please take it up again?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> It was EXPORT_SYMBOL_GPL(), however IBM's GPFS is not GPL.

Ahhh, but it would be really nice if it was, even if it's irksome
to get decent performance out of it ;-)

[snip]

Regards: David Weinehall
--
 /) David Weinehall <tao@acc.umu.se> /) Northern lights wander      (\
//  Maintainer of the v2.0 kernel   //  Dance across the winter sky //
\)  http://www.acc.umu.se/~tao/    (/   Full colour fire           (/
end of thread, other threads:[~2004-03-04 18:55 UTC | newest]
Thread overview: 68+ messages
2004-02-16 19:09 Non-GPL export of invalidate_mmap_range Paul E. McKenney
2004-02-17 2:31 ` Andrew Morton
2004-02-17 7:35 ` Christoph Hellwig
2004-02-17 12:40 ` Paul E. McKenney
2004-02-18 0:19 ` Andrew Morton
2004-02-18 12:51 ` Arjan van de Ven
2004-02-18 14:00 ` Paul E. McKenney
2004-02-18 21:10 ` Christoph Hellwig
2004-02-18 15:06 ` Paul E. McKenney
2004-02-18 22:21 ` Christoph Hellwig
2004-02-18 22:51 ` Andrew Morton
2004-02-18 23:00 ` Christoph Hellwig
2004-02-18 16:21 ` Paul E. McKenney
2004-02-18 23:32 ` Andrew Morton
2004-02-19 12:32 ` Christoph Hellwig
2004-02-19 18:56 ` Andrew Morton
2004-02-19 19:01 ` Christoph Hellwig
2004-02-19 13:04 ` Paul E. McKenney
2004-02-20 3:17 ` Anton Blanchard
2004-02-20 21:46 ` Valdis.Kletnieks
2004-02-19 0:28 ` Andrew Morton
2004-02-18 18:36 ` Paul E. McKenney
2004-02-19 12:31 ` Christoph Hellwig
2004-02-19 9:11 ` Paul E. McKenney
2004-02-19 18:32 ` Lars Marowsky-Bree
2004-02-19 18:38 ` Arjan van de Ven
2004-02-19 19:16 ` viro
2004-02-19 16:15 ` Paul E. McKenney
2004-02-19 18:59 ` Tim Bird
2004-02-20 1:27 ` David Schwartz
2004-02-19 9:11 ` David Weinehall
2004-02-19 8:58 ` Paul E. McKenney
2004-03-04 5:51 ` Mike Fedyk
2004-02-19 10:29 ` Lars Marowsky-Bree
2004-02-19 9:00 ` Paul E. McKenney
2004-02-19 11:11 ` Arjan van de Ven
2004-02-19 11:53 ` Lars Marowsky-Bree
2004-02-18 18:04 ` Tim Bird
2004-02-19 20:56 ` Daniel Phillips
2004-02-19 22:06 ` Stephen C. Tweedie
2004-02-19 22:31 ` Daniel Phillips
2004-02-19 16:42 ` Paul E. McKenney
2004-02-20 2:06 ` Daniel Phillips
2004-02-19 19:47 ` Paul E. McKenney
2004-02-20 5:07 ` Daniel Phillips
2004-02-20 12:02 ` Paul E. McKenney
2004-02-20 20:37 ` Daniel Phillips
2004-02-20 14:01 ` Paul E. McKenney
2004-02-20 23:00 ` Daniel Phillips
2004-02-20 16:17 ` Paul E. McKenney
2004-02-21 3:19 ` Daniel Phillips
2004-02-21 19:00 ` Daniel Phillips
2004-02-22 23:39 ` Paul E. McKenney
2004-02-25 21:04 ` [RFC] Distributed mmap API Daniel Phillips
2004-02-25 19:12 ` Paul E. McKenney
2004-02-25 19:14 ` Paul E. McKenney
2004-02-25 22:07 ` Andrew Morton
2004-02-25 22:07 ` Daniel Phillips
2004-02-25 22:16 ` Andrew Morton
2004-02-25 22:46 ` Daniel Phillips
2004-03-03 3:00 ` Daniel Phillips
2004-03-03 3:15 ` Andrew Morton
2004-03-03 13:06 ` Daniel Phillips
2004-03-04 18:55 ` Paul E. McKenney
2004-02-20 21:17 ` Non-GPL export of invalidate_mmap_range Christoph Hellwig
2004-02-20 22:16 ` Daniel Phillips
2004-02-18 12:12 ` Dominik Kubla
2004-02-17 22:22 ` David Weinehall