* [RFC] OVERCOMMIT_ALWAYS extension
@ 2005-10-17 17:30 Badari Pulavarty
  2005-10-17 18:13 ` Hugh Dickins
  0 siblings, 1 reply; 19+ messages in thread
From: Badari Pulavarty @ 2005-10-17 17:30 UTC (permalink / raw)
To: linux-mm

Hi MM-experts,

I have been looking at possible ways to extend OVERCOMMIT_ALWAYS
to avoid its abuse.

A few applications (databases) would like to overcommit
memory (by creating shared memory segments larger than RAM+swap),
but use only a portion of it at any given time, and get rid
of portions of it through madvise(DONTNEED) when needed.
They want this especially to handle hotplug memory situations
(where apps may not have a clear idea of how much memory the
system has at the time the shared memory is created). Currently,
they use OVERCOMMIT_ALWAYS system-wide to do this, but
that affects every other application on the system.

I am wondering if there is a better way to do this. A simple solution
would be to add an IPC_OVERCOMMIT flag, or to require CAP_SYS_ADMIN to
do the overcommit. This way only the specific applications requesting
it would be able to overcommit. I am worried about the overall
effect this has on the system. But again, this can't be worse
than system-wide OVERCOMMIT_ALWAYS, can it?

Ideas?

Thanks,
Badari

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [RFC] OVERCOMMIT_ALWAYS extension 2005-10-17 17:30 [RFC] OVERCOMMIT_ALWAYS extension Badari Pulavarty @ 2005-10-17 18:13 ` Hugh Dickins 2005-10-17 18:25 ` Hugh Dickins 0 siblings, 1 reply; 19+ messages in thread From: Hugh Dickins @ 2005-10-17 18:13 UTC (permalink / raw) To: Badari Pulavarty; +Cc: linux-mm On Mon, 17 Oct 2005, Badari Pulavarty wrote: > > I have been looking at possible ways to extend OVERCOMMIT_ALWAYS > to avoid its abuse. > > Few of the applications (database) would like to overcommit > memory (by creating shared memory segments more than RAM+swap), > but use only portion of it at any given time and get rid > of portions of them through madvise(DONTNEED), when needed. > They want this, especially to handle hotplug memory situations > (where apps may not have clear idea on how much memory they have > in the system at the time of shared memory create). Currently, > they are using OVERCOMMIT_ALWAYS system wide to do this - but > they are affecting every other application on the system. > > I am wondering, if there is a better way to do this. Simple solution > would be to add IPC_OVERCOMMIT flag or add CAP_SYS_ADMIN to > do the overcommit. This way only specific applications, requesting > this would be able to overcommit. I am worried about, the over > all affects it has on the system. But again, this can't be worse > than system wide OVERCOMMIT_ALWAYS. Isn't it ? mmap has MAP_NORESERVE, without CAP_SYS_ADMIN or other restriction, which exempts that mmap from security_vm_enough_memory checking - unless current setting is OVERCOMMIT_NEVER, in which case MAP_NORESERVE is ignored. So if you're content to move to the OVERCOMMIT_GUESS world, I don't think you could be blamed for adding an IPC_NORESERVE which behaves in the same way, without CAP_SYS_ADMIN restriction. But if you want to move to OVERCOMMIT_NEVER, yet have a flag which says overcommit now, you'll get into a tussle with NEVER-adherents. 
Hugh
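A minimal userspace sketch of the MAP_NORESERVE behaviour described in the message above. The helper names map_noreserve() and unmap_noreserve() are hypothetical and exist only for illustration; the mmap() flags are the standard Linux ones. The sketch assumes a Linux system and does not try to detect the overcommit mode.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of private anonymous memory and ask the kernel not to
 * reserve backing store for it up front (MAP_NORESERVE).
 * Returns the mapping, or NULL on failure.
 *
 * As noted in the thread, the flag is ignored when the system runs in
 * OVERCOMMIT_NEVER mode, so the call can still fail there for a large len.
 */
static void *map_noreserve(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a mapping obtained from map_noreserve(). Returns 0 on success. */
static int unmap_noreserve(void *p, size_t len)
{
    return munmap(p, len);
}
```

A caller would typically request a large range this way and touch only the pages it needs; pages are then charged as they are instantiated rather than at mmap() time.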
* Re: [RFC] OVERCOMMIT_ALWAYS extension 2005-10-17 18:13 ` Hugh Dickins @ 2005-10-17 18:25 ` Hugh Dickins 2005-10-17 23:14 ` Badari Pulavarty 2005-10-18 16:05 ` [RFC][PATCH] " Badari Pulavarty 0 siblings, 2 replies; 19+ messages in thread From: Hugh Dickins @ 2005-10-17 18:25 UTC (permalink / raw) To: Badari Pulavarty; +Cc: Chris Wright, linux-mm On Mon, 17 Oct 2005, Hugh Dickins wrote: > On Mon, 17 Oct 2005, Badari Pulavarty wrote: > > > > I have been looking at possible ways to extend OVERCOMMIT_ALWAYS > > to avoid its abuse. > > > > Few of the applications (database) would like to overcommit > > memory (by creating shared memory segments more than RAM+swap), > > but use only portion of it at any given time and get rid > > of portions of them through madvise(DONTNEED), when needed. > > They want this, especially to handle hotplug memory situations > > (where apps may not have clear idea on how much memory they have > > in the system at the time of shared memory create). Currently, > > they are using OVERCOMMIT_ALWAYS system wide to do this - but > > they are affecting every other application on the system. > > > > I am wondering, if there is a better way to do this. Simple solution > > would be to add IPC_OVERCOMMIT flag or add CAP_SYS_ADMIN to > > do the overcommit. This way only specific applications, requesting > > this would be able to overcommit. I am worried about, the over > > all affects it has on the system. But again, this can't be worse > > than system wide OVERCOMMIT_ALWAYS. Isn't it ? > > mmap has MAP_NORESERVE, without CAP_SYS_ADMIN or other restriction, > which exempts that mmap from security_vm_enough_memory checking - > unless current setting is OVERCOMMIT_NEVER, in which case > MAP_NORESERVE is ignored. Having written that, it does seem rather odd that we have a flag anyone can set to evade that security_ checking. 
It was okay when it was just vm_enough_memory, but now that it's
security_vm_enough_memory, I wonder whether this is a significant
oversight and some CAP should be required.
Might break things though.  CC'ed Chris.

Ah, there's a security_file_mmap earlier, which could reject the
MAP_NORESERVE flag if it feels so inclined.  Perhaps you'll need
to allow a similar opportunity for rejection in your approach.

Hugh

> So if you're content to move to the OVERCOMMIT_GUESS world, I
> don't think you could be blamed for adding an IPC_NORESERVE which
> behaves in the same way, without CAP_SYS_ADMIN restriction.
>
> But if you want to move to OVERCOMMIT_NEVER, yet have a flag which
> says overcommit now, you'll get into a tussle with NEVER-adherents.
>
> Hugh
* Re: [RFC] OVERCOMMIT_ALWAYS extension 2005-10-17 18:25 ` Hugh Dickins @ 2005-10-17 23:14 ` Badari Pulavarty 2005-10-18 16:05 ` [RFC][PATCH] " Badari Pulavarty 1 sibling, 0 replies; 19+ messages in thread From: Badari Pulavarty @ 2005-10-17 23:14 UTC (permalink / raw) To: Hugh Dickins; +Cc: Chris Wright, linux-mm On Mon, 2005-10-17 at 19:25 +0100, Hugh Dickins wrote: > On Mon, 17 Oct 2005, Hugh Dickins wrote: > > On Mon, 17 Oct 2005, Badari Pulavarty wrote: > > > > > > I have been looking at possible ways to extend OVERCOMMIT_ALWAYS > > > to avoid its abuse. > > > > > > Few of the applications (database) would like to overcommit > > > memory (by creating shared memory segments more than RAM+swap), > > > but use only portion of it at any given time and get rid > > > of portions of them through madvise(DONTNEED), when needed. > > > They want this, especially to handle hotplug memory situations > > > (where apps may not have clear idea on how much memory they have > > > in the system at the time of shared memory create). Currently, > > > they are using OVERCOMMIT_ALWAYS system wide to do this - but > > > they are affecting every other application on the system. > > > > > > I am wondering, if there is a better way to do this. Simple solution > > > would be to add IPC_OVERCOMMIT flag or add CAP_SYS_ADMIN to > > > do the overcommit. This way only specific applications, requesting > > > this would be able to overcommit. I am worried about, the over > > > all affects it has on the system. But again, this can't be worse > > > than system wide OVERCOMMIT_ALWAYS. Isn't it ? > > > > mmap has MAP_NORESERVE, without CAP_SYS_ADMIN or other restriction, > > which exempts that mmap from security_vm_enough_memory checking - > > unless current setting is OVERCOMMIT_NEVER, in which case > > MAP_NORESERVE is ignored. > > Having written that, it does seem rather odd that we have a flag > anyone can set to evade that security_ checking. 
It was okay when
> it was just vm_enough_memory, but now it's security_vm_enough_memory,
> I wonder if this is a significant oversight, and some CAP required.
> Might break things though.  CC'ed Chris.
>
> Ah, there's a security_file_mmap earlier, which could reject the
> MAP_NORESERVE flag if it feels so inclined.  Perhaps you'll need
> to allow a similar opportunity for rejection in your approach.
>
> Hugh
>
> > So if you're content to move to the OVERCOMMIT_GUESS world, I
> > don't think you could be blamed for adding an IPC_NORESERVE which
> > behaves in the same way, without CAP_SYS_ADMIN restriction.
> >
> > But if you want to move to OVERCOMMIT_NEVER, yet have a flag which
> > says overcommit now, you'll get into a tussle with NEVER-adherents.
> >

I am perfectly happy with IPC_NORESERVE for OVERCOMMIT_GUESS
(since it's the default), and with failing or ignoring IPC_NORESERVE
for OVERCOMMIT_NEVER. I will try to code this up and pass it by you.

Thanks,
Badari
* [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-17 18:25 ` Hugh Dickins 2005-10-17 23:14 ` Badari Pulavarty @ 2005-10-18 16:05 ` Badari Pulavarty 2005-10-19 17:56 ` Hugh Dickins 1 sibling, 1 reply; 19+ messages in thread From: Badari Pulavarty @ 2005-10-18 16:05 UTC (permalink / raw) To: Hugh Dickins; +Cc: Chris Wright, linux-mm [-- Attachment #1: Type: text/plain, Size: 2827 bytes --] On Mon, 2005-10-17 at 19:25 +0100, Hugh Dickins wrote: > On Mon, 17 Oct 2005, Hugh Dickins wrote: > > On Mon, 17 Oct 2005, Badari Pulavarty wrote: > > > > > > I have been looking at possible ways to extend OVERCOMMIT_ALWAYS > > > to avoid its abuse. > > > > > > Few of the applications (database) would like to overcommit > > > memory (by creating shared memory segments more than RAM+swap), > > > but use only portion of it at any given time and get rid > > > of portions of them through madvise(DONTNEED), when needed. > > > They want this, especially to handle hotplug memory situations > > > (where apps may not have clear idea on how much memory they have > > > in the system at the time of shared memory create). Currently, > > > they are using OVERCOMMIT_ALWAYS system wide to do this - but > > > they are affecting every other application on the system. > > > > > > I am wondering, if there is a better way to do this. Simple solution > > > would be to add IPC_OVERCOMMIT flag or add CAP_SYS_ADMIN to > > > do the overcommit. This way only specific applications, requesting > > > this would be able to overcommit. I am worried about, the over > > > all affects it has on the system. But again, this can't be worse > > > than system wide OVERCOMMIT_ALWAYS. Isn't it ? > > > > mmap has MAP_NORESERVE, without CAP_SYS_ADMIN or other restriction, > > which exempts that mmap from security_vm_enough_memory checking - > > unless current setting is OVERCOMMIT_NEVER, in which case > > MAP_NORESERVE is ignored. 
>
> Having written that, it does seem rather odd that we have a flag
> anyone can set to evade that security_ checking.  It was okay when
> it was just vm_enough_memory, but now it's security_vm_enough_memory,
> I wonder if this is a significant oversight, and some CAP required.
> Might break things though.  CC'ed Chris.
>
> Ah, there's a security_file_mmap earlier, which could reject the
> MAP_NORESERVE flag if it feels so inclined.  Perhaps you'll need
> to allow a similar opportunity for rejection in your approach.
>
> Hugh
>
> > So if you're content to move to the OVERCOMMIT_GUESS world, I
> > don't think you could be blamed for adding an IPC_NORESERVE which
> > behaves in the same way, without CAP_SYS_ADMIN restriction.
> >
> > But if you want to move to OVERCOMMIT_NEVER, yet have a flag which
> > says overcommit now, you'll get into a tussle with NEVER-adherents.
> >
> > Hugh
>

Hugh,

As you suggested, here is the patch to add SHM_NORESERVE, which does
the same thing as MAP_NORESERVE. This flag is ignored for OVERCOMMIT_NEVER.
I decided to do SHM_NORESERVE instead of IPC_NORESERVE - just to limit
its scope.

BTW, there is a call to security_shm_alloc() earlier, which could
be modified to reject shmget() if it needs to.

Is this reasonable? Please review.

Thanks,
Badari

[-- Attachment #2: shm-noreserve.patch --]
[-- Type: text/x-patch, Size: 1357 bytes --]

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
--- linux-2.6.14-rc3.org/include/linux/shm.h 2005-10-18 08:44:28.000000000 -0700
+++ linux-2.6.14-rc3/include/linux/shm.h 2005-10-18 08:46:03.000000000 -0700
@@ -92,6 +92,7 @@ struct shmid_kernel /* private to the ke
 #define SHM_DEST 01000 /* segment will be destroyed on last detach */
 #define SHM_LOCKED 02000 /* segment will not be swapped */
 #define SHM_HUGETLB 04000 /* segment will use huge TLB pages */
+#define SHM_NORESERVE 010000 /* don't check for reservations */
 
 #ifdef CONFIG_SYSVIPC
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr);
--- linux-2.6.14-rc3.org/ipc/shm.c 2005-10-17 16:57:40.000000000 -0700
+++ linux-2.6.14-rc3/ipc/shm.c 2005-10-18 08:55:50.000000000 -0700
@@ -212,8 +212,16 @@ static int newseg (key_t key, int shmflg
 		file = hugetlb_zero_setup(size);
 		shp->mlock_user = current->user;
 	} else {
+		int acctflag = VM_ACCOUNT;
+		/*
+		 * Do not allow no accouting for OVERCOMMIT_NEVER, even
+		 * its asked for.
+		 */
+		if ((shmflg & SHM_NORESERVE) &&
+				sysctl_overcommit_memory != OVERCOMMIT_NEVER)
+			acctflag = 0;
 		sprintf (name, "SYSV%08x", key);
-		file = shmem_file_setup(name, size, VM_ACCOUNT);
+		file = shmem_file_setup(name, size, acctflag);
 	}
 	error = PTR_ERR(file);
 	if (IS_ERR(file))
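From the application side, using the flag added by the patch above might look roughly like the following sketch. The helper names create_noreserve_segment() and destroy_segment() are hypothetical. The fallback definition of SHM_NORESERVE copies the value 010000 from the patch and is only an assumption for systems whose headers do not provide the constant; on a kernel without such a patch the extra bit would have no effect.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_NORESERVE
#define SHM_NORESERVE 010000 /* assumption: value taken from the patch above */
#endif

/*
 * Create a private SysV shared memory segment of `size` bytes and ask the
 * kernel not to account the whole size up front.
 * Returns the segment id, or -1 on failure (errno is set by shmget()).
 */
static int create_noreserve_segment(size_t size)
{
    assert(size > 0);
    return shmget(IPC_PRIVATE, size, IPC_CREAT | 0600 | SHM_NORESERVE);
}

/* Mark the segment for destruction; it goes away after the last detach. */
static int destroy_segment(int shmid)
{
    return shmctl(shmid, IPC_RMID, NULL);
}
```

Per the patch, the flag only changes accounting when the overcommit mode is not OVERCOMMIT_NEVER; in that mode the segment is accounted in full as before.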
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension
  2005-10-18 16:05 ` [RFC][PATCH] " Badari Pulavarty
@ 2005-10-19 17:56 ` Hugh Dickins
  2005-10-19 18:32 ` Jeff Dike
  2005-10-19 18:50 ` Badari Pulavarty
  0 siblings, 2 replies; 19+ messages in thread
From: Hugh Dickins @ 2005-10-19 17:56 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Chris Wright, Jeff Dike, linux-mm

On Tue, 18 Oct 2005, Badari Pulavarty wrote:
>
> As you suggested, here is the patch to add SHM_NORESERVE which does
> same thing as MAP_NORESERVE. This flag is ignored for OVERCOMMIT_NEVER.
> I decided to do SHM_NORESERVE instead of IPC_NORESERVE - just to limit
> its scope.

Good, yes, SHM_NORESERVE is a better name.

> BTW, there is a call to security_shm_alloc() earlier, which could
> be modified to reject shmget() if it needs to.

Excellent.  But it can only see shp, and the
	shp->shm_flags = (shmflg & S_IRWXUGO);
will conceal SHM_NORESERVE from it.

Since nothing in security/ is worrying about MAP_NORESERVE at present,
perhaps you need not bother about this for now.  But easily overlooked
later if MAP_NORESERVE rejection is added.

> Is this reasonable ? Please review.

Looks fine as far as it goes, except for the typos in the comment
+		 * Do not allow no accouting for OVERCOMMIT_NEVER, even
+		 * its asked for.
should be
		 * Do not allow no accounting for OVERCOMMIT_NEVER, even
		 * if it's asked for.
(rather a lot of negatives, but okay there I think!)

I say "as far as it goes" because I don't think it's actually going to
achieve the effect you said you wanted in your original post.

As you've probably noticed, switching off VM_ACCOUNT here will mean that
the shm object is accounted page by page as it's instantiated, and I
expect you're okay with that.  But you want madvise(DONTNEED) to free
up those reservations: it'll unmap the pages from userspace, but it
won't free the pages from the shm object, so the reservations will
still be in force, and accumulate.

To achieve the effect you want, along these lines, there needs to be
a way to truncate pages out of the middle of the shm object: I believe
"punch holes" is the phrase that's been used when this kind of behaviour
has been discussed (not particularly in relation to tmpfs) before.
Some have proposed a sys_punch syscall to the VFS.

Jeff Dike had a patch for like functionality for UML, via a /dev/anon
to tmpfs, nearly two years ago.  I've kept his mail in my TODO folder
ever since, ambivalent about it, and never got around to giving it the
review needed.  I've a feeling time has moved on so far that Jeff may
now be achieving the effect he needs by other means (remap_file_pages?).

Is /dev/anon still of interest to you, Jeff?  Not that I'm any closer
to the point of thinking about it now than then, just want to factor
your idea in with what Badari is thinking of.

Hugh
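The point about madvise(DONTNEED) on a shm object can be checked from userspace with a small sketch like the one below. The helper name byte_after_dontneed() is hypothetical. It assumes a Linux system with SysV shared memory available and only shows that the data survives in the shm object; it does not measure accounting.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

/*
 * Fill one page of a fresh SysV shm segment with `fill`, call
 * madvise(MADV_DONTNEED) on it, then read the first byte back.
 * Returns the byte read (0..255), or -1 if any step failed.
 */
static int byte_after_dontneed(unsigned char fill)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    unsigned char *p = shmat(id, NULL, 0);
    /* Mark for removal now; the segment is destroyed on last detach. */
    shmctl(id, IPC_RMID, NULL);
    if (p == (void *)-1)
        return -1;

    memset(p, fill, len);

    int result = -1;
    /* This only unmaps the page from this address space. */
    if (madvise(p, len, MADV_DONTNEED) == 0)
        result = p[0]; /* refault finds the page still in the shm object */

    shmdt(p);
    return result;
}
```

Because the page is still held by the shm object, the refault returns the old contents rather than zeroes, which is why the memory stays charged.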
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-19 17:56 ` Hugh Dickins @ 2005-10-19 18:32 ` Jeff Dike 2005-10-19 21:21 ` Badari Pulavarty 2005-10-19 18:50 ` Badari Pulavarty 1 sibling, 1 reply; 19+ messages in thread From: Jeff Dike @ 2005-10-19 18:32 UTC (permalink / raw) To: Hugh Dickins; +Cc: Badari Pulavarty, Chris Wright, linux-mm On Wed, Oct 19, 2005 at 06:56:59PM +0100, Hugh Dickins wrote: > To achieve the effect you want, along these lines, there needs to be > a way to truncate pages out of the middle of the shm object: I believe > "punch holes" is the phrase that's been used when this kind of behaviour > has been discussed (not particularly in relation to tmpfs) before. > Some have proposed a sys_punch syscall to the VFS. > > Jeff Dike had a patch for like functionality for UML, via a /dev/anon > to tmpfs, nearly two years ago. I've kept his mail in my TODO folder > ever since, ambivalent about it, and never got around to giving it the > review needed. I've a feeling time has moved on so far that Jeff may > now be achieving the effect he needs by other means (remap_file_pages?). > > Is /dev/anon still of interest to you, Jeff? Not that I'm any closer > to the point of thinking about it now than then, just want to factor > your idea in with what Badari is thinking of. Yes, either sys_punch or something like /dev/anon is still needed. I need to be able to dirty file-backed pages and tell the host to drop them as though they were clean. Punching a hole in the middle of the file, effectively sparsing it, or having a special driver that drops pages when their map count goes to zero will both work for me. This will avoid having the host swap out pages that are clean from the UML point of view (but dirty from the host's point of view). It will also allow me to free memory back to the host, allowing memory to be added and removed dynamically from UML instances. remap_file_pages is entirely different. 
That decreases the number of vmas, which, for some reason that is
mysterious to me, dramatically increases UML performance.

Jeff
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-19 18:32 ` Jeff Dike @ 2005-10-19 21:21 ` Badari Pulavarty 2005-10-19 22:38 ` Jeff Dike 0 siblings, 1 reply; 19+ messages in thread From: Badari Pulavarty @ 2005-10-19 21:21 UTC (permalink / raw) To: Jeff Dike; +Cc: Hugh Dickins, linux-mm On Wed, 2005-10-19 at 14:32 -0400, Jeff Dike wrote: > On Wed, Oct 19, 2005 at 06:56:59PM +0100, Hugh Dickins wrote: > > To achieve the effect you want, along these lines, there needs to be > > a way to truncate pages out of the middle of the shm object: I believe > > "punch holes" is the phrase that's been used when this kind of behaviour > > has been discussed (not particularly in relation to tmpfs) before. > > Some have proposed a sys_punch syscall to the VFS. > > > > Jeff Dike had a patch for like functionality for UML, via a /dev/anon > > to tmpfs, nearly two years ago. I've kept his mail in my TODO folder > > ever since, ambivalent about it, and never got around to giving it the > > review needed. I've a feeling time has moved on so far that Jeff may > > now be achieving the effect he needs by other means (remap_file_pages?). > > > > Is /dev/anon still of interest to you, Jeff? Not that I'm any closer > > to the point of thinking about it now than then, just want to factor > > your idea in with what Badari is thinking of. > > Yes, either sys_punch or something like /dev/anon is still needed. I need to > be able to dirty file-backed pages and tell the host to drop them as though > they were clean. Punching a hole in the middle of the file, effectively > sparsing it, or having a special driver that drops pages when their map count > goes to zero will both work for me. This will avoid having the host swap out > pages that are clean from the UML point of view (but dirty from the host's > point of view). It will also allow me to free memory back to the host, > allowing memory to be added and removed dynamically from UML instances. My requirement is a simple subset of yours. 
All I want is the ability to
completely drop a range of pages in a shared memory segment (as if they
were clean). madvise(MADV_DISCARD) would be good enough for me. In fact,
I have another weird requirement: it should be able to drop these
pages even when the map count is NOT zero. I am still thinking about this
one.

Our database folks map these regions into different db2 processes,
and they want this to work from any given process (even if other
processes have it mapped). I am not sure what would happen if some
other process touches it after we dropped it - maybe a zero page?

Thanks,
Badari
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension
  2005-10-19 21:21 ` Badari Pulavarty
@ 2005-10-19 22:38 ` Jeff Dike
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff Dike @ 2005-10-19 22:38 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Hugh Dickins, linux-mm

On Wed, Oct 19, 2005 at 02:21:48PM -0700, Badari Pulavarty wrote:
> My requirement is a simple subset of yours. All I want is ability to
> completely drop range of pages in a shared memory segment (as if they
> are clean).

Your implementation seems to do something very different, though.

> madvise(MADV_DISCARD) would be good enough for me. In fact,
> I have another weird requirement that - it should be able to drop these
> pages even when map count is NOT zero. I am still thinking about this
> one.

I had been planning on using map count == 0 as a sign to the driver that it
should drop the page.  However, dropping map count > 0 pages would also work
for UML.

> Our database folks, map these regions into different db2 processes
> and they want this to work from any given process (even if other
> processes have it mapped). I am not sure what would happen, some
> other process touches it after we dropped it - may be a zero page ?

Yeah, that works for me.  I don't have a requirement that dropped pages be
accessible from multiple processes.  But if they were, that might let me map
them directly into processes without UML zeroing them first (as they'd be
already zeroed by the host).

Jeff
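The "zero page" question can be explored with a rough sketch like the following. MADV_DISCARD is only a proposal in this thread, so the sketch substitutes MADV_REMOVE, an advice value available in later Linux kernels that frees a range together with its backing store on shmem; treating it as a stand-in is an assumption, not something the thread states. The helper name byte_seen_elsewhere_after_remove() is hypothetical, and two attachments within one process stand in for two separate processes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

#ifndef MADV_REMOVE
#define MADV_REMOVE 9 /* assumption: Linux value, if the libc headers lack it */
#endif

/*
 * Attach one SysV shm page twice (mappings a and b), fill it through a,
 * drop it with madvise(MADV_REMOVE) through a, then read it through b.
 * Returns the byte seen through b (0..255), or -1 if any step failed.
 */
static int byte_seen_elsewhere_after_remove(unsigned char fill)
{
    assert(fill != 0); /* a zero fill could not be told apart from a hole */
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    unsigned char *a = shmat(id, NULL, 0);
    unsigned char *b = shmat(id, NULL, 0);
    /* Mark for removal now; the segment is destroyed on last detach. */
    shmctl(id, IPC_RMID, NULL);

    int result = -1;
    if (a != (void *)-1 && b != (void *)-1) {
        memset(a, fill, len);
        /* b must see the data before the drop for the check to mean anything */
        if (b[0] == fill && madvise(a, len, MADV_REMOVE) == 0)
            result = b[0]; /* the second mapping touches the dropped page */
    }

    if (a != (void *)-1)
        shmdt(a);
    if (b != (void *)-1)
        shmdt(b);
    return result;
}
```

With MADV_REMOVE the page is dropped even though a second mapping exists, and the later access through that mapping reads a fresh zero-filled page.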
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-19 17:56 ` Hugh Dickins 2005-10-19 18:32 ` Jeff Dike @ 2005-10-19 18:50 ` Badari Pulavarty 2005-10-19 19:12 ` Darren Hart ` (2 more replies) 1 sibling, 3 replies; 19+ messages in thread From: Badari Pulavarty @ 2005-10-19 18:50 UTC (permalink / raw) To: Hugh Dickins; +Cc: Chris Wright, Jeff Dike, linux-mm, dvhltc [-- Attachment #1: Type: text/plain, Size: 2700 bytes --] On Wed, 2005-10-19 at 18:56 +0100, Hugh Dickins wrote: > On Tue, 18 Oct 2005, Badari Pulavarty wrote: > > > > As you suggested, here is the patch to add SHM_NORESERVE which does > > same thing as MAP_NORESERVE. This flag is ignored for OVERCOMMIT_NEVER. > > I decided to do SHM_NORESERVE instead of IPC_NORESERVE - just to limit > > its scope. > > Good, yes, SHM_NORESERVE is a better name. Hugh, Big Thank you for review and help on this. > > > BTW, there is a call to security_shm_alloc() earlier, which could > > be modified to reject shmget() if it needs to. > > Excellent. But it can only see shp, and the > shp->shm_flags = (shmflg & S_IRWXUGO); > will conceal SHM_NORESERVE from it. I noticed that, but didn't feel like passing it to security_shm_alloc(), since even SHM_HUGETLB and others are not getting passed today. That's why I said, "we could, if need to". > > Since nothing in security/ is worrying about MAP_NORESERVE at present, > perhaps you need not bother about this for now. But easily overlooked > later if MAP_NORESERVE rejection is added. > > > Is this reasonable ? Please review. > > Looks fine as far as it goes, except for the typos in the comment > + * Do not allow no accouting for OVERCOMMIT_NEVER, even > + * its asked for. > should be > * Do not allow no accounting for OVERCOMMIT_NEVER, even > * if it's asked for. > (rather a lot of negatives, but okay there I think!) Initially I wrote it as "For OVERCOMMIT_GUESS and OVERCOMMIT_ALWAYS, allow no accounting if asked for." - which matches the code. 
But, in future, if we decide to add another mode - then we need to
update the comment.

> I say "as far as it goes" because I don't think it's actually going to
> achieve the effect you said you wanted in your original post.
>
> As you've probably noticed, switching off VM_ACCOUNT here will mean that
> the shm object is accounted page by page as it's instantiated, and I
> expect you're okay with that.  But you want madvise(DONTNEED) to free
> up those reservations: it'll unmap the pages from userspace, but it
> won't free the pages from the shm object, so the reservations will
> still be in force, and accumulate.

Darren Hart is working on a patch to add madvise(DISCARD), extending
the functionality of madvise(DONTNEED) to really drop those pages.
I was going to ask your opinion on that approach :)

shmget(SHM_NORESERVE) + madvise(DISCARD) should do what I was
hoping for. (BTW, none of this has been tested with database stuff -
I am just concentrating on reasonable extensions.)

Here is the version of the patch under test. (Darren - I am sending this
out without your permission, I hope you are okay with it).

Thanks,
Badari

[-- Attachment #2: madvise_discard.patch --]
[-- Type: text/x-patch, Size: 14528 bytes --]

diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-alpha/mman.h 2.6.12-madvise/include/asm-alpha/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-alpha/mman.h 2003-12-17 18:58:04.000000000 -0800
+++ 2.6.12-madvise/include/asm-alpha/mman.h 2005-07-06 09:27:11.000000000 -0700
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED 3 /* will need these pages */
 #define MADV_SPACEAVAIL 5 /* ensure resources are available */
 #define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_DISCARD 7 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-arm/mman.h 2.6.12-madvise/include/asm-arm/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-arm/mman.h 2003-12-17 18:58:39.000000000 -0800
+++ 2.6.12-madvise/include/asm-arm/mman.h 2005-07-06 09:28:31.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-arm26/mman.h 2.6.12-madvise/include/asm-arm26/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-arm26/mman.h 2003-12-17 18:58:04.000000000 -0800
+++ 2.6.12-madvise/include/asm-arm26/mman.h 2005-07-06 09:28:40.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-cris/mman.h 2.6.12-madvise/include/asm-cris/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-cris/mman.h 2003-12-17 18:59:44.000000000 -0800
+++ 2.6.12-madvise/include/asm-cris/mman.h 2005-07-06 09:28:53.000000000 -0700
@@ -37,6 +37,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-frv/mman.h 2.6.12-madvise/include/asm-frv/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-frv/mman.h 2005-03-02 03:00:08.000000000 -0800
+++ 2.6.12-madvise/include/asm-frv/mman.h 2005-07-06 09:29:01.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-h8300/mman.h 2.6.12-madvise/include/asm-h8300/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-h8300/mman.h 2005-06-17 17:21:39.000000000 -0700
+++ 2.6.12-madvise/include/asm-h8300/mman.h 2005-07-06 09:29:05.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-i386/mman.h 2.6.12-madvise/include/asm-i386/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-i386/mman.h 2003-12-17 18:58:15.000000000 -0800
+++ 2.6.12-madvise/include/asm-i386/mman.h 2005-07-06 09:29:10.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-ia64/mman.h 2.6.12-madvise/include/asm-ia64/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-ia64/mman.h 2004-04-05 16:25:06.000000000 -0700
+++ 2.6.12-madvise/include/asm-ia64/mman.h 2005-07-06 09:29:14.000000000 -0700
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-m32r/mman.h 2.6.12-madvise/include/asm-m32r/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-m32r/mman.h 2004-10-18 15:51:10.000000000 -0700
+++ 2.6.12-madvise/include/asm-m32r/mman.h 2005-07-06 09:29:20.000000000 -0700
@@ -37,6 +37,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-m68k/mman.h 2.6.12-madvise/include/asm-m68k/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-m68k/mman.h 2003-12-17 18:58:16.000000000 -0800
+++ 2.6.12-madvise/include/asm-m68k/mman.h 2005-07-06 09:29:25.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-mips/mman.h 2.6.12-madvise/include/asm-mips/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-mips/mman.h 2003-12-17 18:58:39.000000000 -0800
+++ 2.6.12-madvise/include/asm-mips/mman.h 2005-07-06 09:29:37.000000000 -0700
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-parisc/mman.h 2.6.12-madvise/include/asm-parisc/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-parisc/mman.h 2003-12-17 18:58:58.000000000 -0800
+++ 2.6.12-madvise/include/asm-parisc/mman.h 2005-07-06 09:32:51.000000000 -0700
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
 #define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_DISCARD 8 /* free memory and page cache now */
 
 /* The range 12-64 is reserved for page size specification. */
 #define MADV_4K_PAGES 12 /* Use 4K pages */
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-ppc/mman.h 2.6.12-madvise/include/asm-ppc/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-ppc/mman.h 2003-12-17 19:00:03.000000000 -0800
+++ 2.6.12-madvise/include/asm-ppc/mman.h 2005-07-06 09:33:13.000000000 -0700
@@ -36,6 +36,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-ppc64/mman.h 2.6.12-madvise/include/asm-ppc64/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-ppc64/mman.h 2003-12-17 18:58:47.000000000 -0800
+++ 2.6.12-madvise/include/asm-ppc64/mman.h 2005-07-06 09:33:25.000000000 -0700
@@ -44,6 +44,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-s390/mman.h 2.6.12-madvise/include/asm-s390/mman.h
--- /home/linux/views/linux-2.6.12/include/asm-s390/mman.h 2003-12-17 18:58:08.000000000 -0800
+++ 2.6.12-madvise/include/asm-s390/mman.h 2005-07-06 09:33:36.000000000 -0700
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* free memory and page cache now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-sh/mman.h 2.6.12-madvise/include/asm-sh/mman.h
--- 
/home/linux/views/linux-2.6.12/include/asm-sh/mman.h 2003-12-17 18:59:27.000000000 -0800 +++ 2.6.12-madvise/include/asm-sh/mman.h 2005-07-06 09:33:57.000000000 -0700 @@ -35,6 +35,7 @@ #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ +#define MADV_DISCARD 0x5 /* free memory and page cache now */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-sparc/mman.h 2.6.12-madvise/include/asm-sparc/mman.h --- /home/linux/views/linux-2.6.12/include/asm-sparc/mman.h 2003-12-17 18:59:43.000000000 -0800 +++ 2.6.12-madvise/include/asm-sparc/mman.h 2005-07-06 09:35:02.000000000 -0700 @@ -54,6 +54,7 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ +#define MADV_DISCARD 0x6 /* free memory and page cache now */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-sparc64/mman.h 2.6.12-madvise/include/asm-sparc64/mman.h --- /home/linux/views/linux-2.6.12/include/asm-sparc64/mman.h 2003-12-17 18:58:49.000000000 -0800 +++ 2.6.12-madvise/include/asm-sparc64/mman.h 2005-07-06 09:35:15.000000000 -0700 @@ -54,6 +54,7 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ +#define MADV_DISCARD 0x6 /* free memory and page cache now */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-v850/mman.h 2.6.12-madvise/include/asm-v850/mman.h --- /home/linux/views/linux-2.6.12/include/asm-v850/mman.h 2003-12-17 18:59:26.000000000 -0800 +++ 2.6.12-madvise/include/asm-v850/mman.h 2005-07-06 09:35:35.000000000 
-0700 @@ -32,6 +32,7 @@ #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ +#define MADV_DISCARD 0x5 /* free memory and page cache now */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/include/asm-x86_64/mman.h 2.6.12-madvise/include/asm-x86_64/mman.h --- /home/linux/views/linux-2.6.12/include/asm-x86_64/mman.h 2003-12-17 18:59:05.000000000 -0800 +++ 2.6.12-madvise/include/asm-x86_64/mman.h 2005-07-06 09:35:40.000000000 -0700 @@ -36,6 +36,7 @@ #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ +#define MADV_DISCARD 0x5 /* free memory and page cache now */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff -purN -X /home/dvhart/.diff.exclude /home/linux/views/linux-2.6.12/mm/madvise.c 2.6.12-madvise/mm/madvise.c --- /home/linux/views/linux-2.6.12/mm/madvise.c 2005-03-02 03:00:18.000000000 -0800 +++ 2.6.12-madvise/mm/madvise.c 2005-07-06 10:15:09.000000000 -0700 @@ -111,6 +111,37 @@ static long madvise_dontneed(struct vm_a return 0; } +static long madvise_discard(struct vm_area_struct * vma, + unsigned long start, unsigned long end) +{ + struct semaphore *i_sem; + loff_t offset; + + if (vma->vm_file && vma->vm_file->f_mapping) { + if (vma->vm_file->f_mapping == &swapper_space) { + printk("%s: vma (%p)'s mapping is swapper_space\n", __FUNCTION__, vma); + return -EINVAL; + } + + if (!vma->vm_file->f_mapping->host) { + printk("%s: vma (%p)'s mapping->host is null\n", __FUNCTION__, vma); + return -EINVAL; + } + + /* looks good, try and rip it out of page cache */ + printk("%s: trying to rip shm vma (%p) inode from page cache\n", __FUNCTION__, vma); + i_sem = &vma->vm_file->f_mapping->host->i_sem; + offset = (loff_t)(start - vma->vm_start); + printk("%s: call 
truncate_inode_pages(%p, %x\n", __FUNCTION__, + vma->vm_file->f_mapping, (unsigned int)offset); + down(i_sem); + truncate_inode_pages(vma->vm_file->f_mapping, offset); + up(i_sem); + } + + return 0; +} + static long madvise_vma(struct vm_area_struct * vma, unsigned long start, unsigned long end, int behavior) { @@ -130,6 +161,9 @@ static long madvise_vma(struct vm_area_s case MADV_DONTNEED: error = madvise_dontneed(vma, start, end); break; + case MADV_DISCARD: + error = madvise_discard(vma, start, end); + break; default: error = -EINVAL; ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-19 18:50 ` Badari Pulavarty @ 2005-10-19 19:12 ` Darren Hart 2005-10-19 20:10 ` Hugh Dickins 2005-10-19 20:47 ` Jeff Dike 2 siblings, 0 replies; 19+ messages in thread
From: Darren Hart @ 2005-10-19 19:12 UTC
To: Badari Pulavarty; +Cc: Hugh Dickins, Chris Wright, Jeff Dike, linux-mm

Badari Pulavarty wrote:
>> ...
>>I say "as far as it goes" because I don't think it's actually going to
>>achieve the effect you said you wanted in your original post.
>>
>>As you've probably noticed, switching off VM_ACCOUNT here will mean that
>>the shm object is accounted page by page as it's instantiated, and I
>>expect you're okay with that. But you want madvise(DONTNEED) to free
>>up those reservations: it'll unmap the pages from userspace, but it
>>won't free the pages from the shm object, so the reservations will
>>still be in force, and accumulate.
>
>
> Darren Hart is working on patch to add madvise(DISCARD) to extend
> the functionality of madvise(DONTNEED) to really drop those pages.
> I was going to ask your opinion on that approach :)
>
> shmget(SHM_NORESERVE) + madvise(DISCARD) should do what I was
> hoping for. (BTW, none of this has been tested with database stuff -
> I am just concentrating on reasonable extensions.
>
> Here is the version of patch under test.
> (Darren - I am sending this out without your permission, I hope
> you are okay with it).
>

Of course, no problem. I have a separate patch for sles9sp2 if that is
of interest. Please keep me in the loop with any feedback on the patch.

--
Darren Hart
IBM Linux Technology Center
Linux Kernel Team
Phone: 503 578 3185
T/L: 775 3185
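Hugh's quoted point, that madvise(DONTNEED) unmaps the pages from userspace but does not free them from the shm object, can be observed from userspace. What follows is only a minimal sketch, assuming a Linux system: it uses a shared anonymous mapping (which Linux backs with a shmem object) as a stand-in for a SysV segment, and the function name is made up for illustration. On Linux one would expect it to return 1, because the next access faults the still-present page back in.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of any patch in this thread.
 * Returns 1 if data written to a shared anonymous mapping is still
 * readable after madvise(MADV_DONTNEED), 0 if it reads back changed,
 * and -1 if a system call failed.
 */
int shared_data_survives_dontneed(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t len;
	int survived;

	if (pagesize <= 0)
		return -1;
	len = (size_t)pagesize;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAB, len);		/* instantiate the page */
	assert(p[0] == 0xAB);		/* the write itself took effect */

	/* Drop this process's mapping of the page. */
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* Touching the page again faults it in from the backing object. */
	survived = (p[0] == 0xAB);
	munmap(p, len);
	return survived;
}
```

The page stays allocated in the backing object across the madvise() call, which is why the accounting Hugh describes would keep accumulating.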
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-19 18:50 ` Badari Pulavarty 2005-10-19 19:12 ` Darren Hart @ 2005-10-19 20:10 ` Hugh Dickins 2005-10-19 20:47 ` Jeff Dike 2 siblings, 0 replies; 19+ messages in thread
From: Hugh Dickins @ 2005-10-19 20:10 UTC
To: Badari Pulavarty; +Cc: Chris Wright, Jeff Dike, linux-mm, Darren Hart

On Wed, 19 Oct 2005, Badari Pulavarty wrote:
>
> Darren Hart is working on patch to add madvise(DISCARD) to extend
> the functionality of madvise(DONTNEED) to really drop those pages.
> I was going to ask your opinion on that approach :)
>
> shmget(SHM_NORESERVE) + madvise(DISCARD) should do what I was
> hoping for. (BTW, none of this has been tested with database stuff -
> I am just concentrating on reasonable extensions.

That sounds interesting, and reasonable. But I'm afraid it's likely to
be several days (a week or more) before I get around to studying it.

If Jeff gets to look at it sooner, it would be interesting to hear if
it suits his need too (but it's entirely inappropriate for me to expect
Jeff to find time to do what I'm not).

Hugh
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-19 18:50 ` Badari Pulavarty 2005-10-19 19:12 ` Darren Hart 2005-10-19 20:10 ` Hugh Dickins @ 2005-10-19 20:47 ` Jeff Dike 2005-10-20 15:11 ` Badari Pulavarty 2 siblings, 1 reply; 19+ messages in thread
From: Jeff Dike @ 2005-10-19 20:47 UTC
To: Badari Pulavarty; +Cc: Hugh Dickins, Chris Wright, linux-mm, dvhltc

On Wed, Oct 19, 2005 at 11:50:55AM -0700, Badari Pulavarty wrote:
> Darren Hart is working on patch to add madvise(DISCARD) to extend
> the functionality of madvise(DONTNEED) to really drop those pages.
> I was going to ask your opinion on that approach :)
>
> shmget(SHM_NORESERVE) + madvise(DISCARD) should do what I was
> hoping for. (BTW, none of this has been tested with database stuff -
> I am just concentrating on reasonable extensions.

madvise(DISCARD) has a promising name, but the implementation seems to be
very different from what the name says.

This would seem to throw out all pages in the file after offset, which
makes the end parameter kind of pointless:

+	down(i_sem);
+	truncate_inode_pages(vma->vm_file->f_mapping, offset);
+	up(i_sem);

It will also fully truncate files which you have only partially
mapped, which is somewhat counterintuitive.

Jeff
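Jeff's objection can be pictured with a toy model of a file's page cache. This is only an illustrative userspace sketch, not kernel code: the array, NPAGES, and the toy_* function names are all invented here. toy_truncate_from() mimics the to-end-of-file behaviour he describes for truncate_inode_pages(mapping, offset), while toy_discard_range() shows what honouring an end bound would look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8	/* size of the toy "file" in pages (illustrative) */

/* Drop every cached page at index >= first, as a truncate would. */
void toy_truncate_from(bool cached[NPAGES], size_t first)
{
	for (size_t i = first; i < NPAGES; i++)
		cached[i] = false;
}

/* Drop only the cached pages with first <= index < end. */
void toy_discard_range(bool cached[NPAGES], size_t first, size_t end)
{
	assert(first <= end && end <= NPAGES);
	for (size_t i = first; i < end; i++)
		cached[i] = false;
}

/* Count the pages that are still cached. */
size_t toy_count_cached(const bool cached[NPAGES])
{
	size_t n = 0;

	for (size_t i = 0; i < NPAGES; i++)
		n += cached[i] ? 1 : 0;
	return n;
}
```

Starting with all eight pages cached, toy_truncate_from(cached, 2) leaves two pages, whereas toy_discard_range(cached, 2, 4) leaves six; the difference is the part of the file beyond the range the caller asked about.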
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-19 20:47 ` Jeff Dike @ 2005-10-20 15:11 ` Badari Pulavarty 2005-10-20 17:27 ` Jeff Dike 0 siblings, 1 reply; 19+ messages in thread
From: Badari Pulavarty @ 2005-10-20 15:11 UTC
To: Jeff Dike; +Cc: Hugh Dickins, linux-mm, dvhltc

On Wed, 2005-10-19 at 16:47 -0400, Jeff Dike wrote:
> On Wed, Oct 19, 2005 at 11:50:55AM -0700, Badari Pulavarty wrote:
> > Darren Hart is working on patch to add madvise(DISCARD) to extend
> > the functionality of madvise(DONTNEED) to really drop those pages.
> > I was going to ask your opinion on that approach :)
> >
> > shmget(SHM_NORESERVE) + madvise(DISCARD) should do what I was
> > hoping for. (BTW, none of this has been tested with database stuff -
> > I am just concentrating on reasonable extensions.
>
> madvise(DISCARD) has a promising name, but the implementation seems to be
> very different from what the name says.
>
> This would seem to throw out all pages in the file after offset, which
> makes the end parameter kind of pointless:
>
> +	down(i_sem);
> +	truncate_inode_pages(vma->vm_file->f_mapping, offset);
> +	up(i_sem);
>
> It will also fully truncate files which you have only partially
> mapped, which is somewhat counterintuitive.

Yes. I agree. We were just trying to re-use existing code to see if it
even works.

Initial plan was to use invalidate_inode_pages2_range(). But it didn't
really do what we wanted. So we ended up using truncate_inode_pages().
If it really works, then I plan to add truncate_inode_pages2_range(),
which works on a range of pages instead of the whole file.
madvise(DONTNEED) followed by madvise(DISCARD) should be able to drop
all the pages in the given range.

Does this make sense ? Does this seem like right approach ?

Thanks,
Badari
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-20 15:11 ` Badari Pulavarty @ 2005-10-20 17:27 ` Jeff Dike 2005-10-20 22:37 ` Badari Pulavarty 0 siblings, 1 reply; 19+ messages in thread
From: Jeff Dike @ 2005-10-20 17:27 UTC
To: Badari Pulavarty; +Cc: Hugh Dickins, linux-mm, dvhltc

On Thu, Oct 20, 2005 at 08:11:05AM -0700, Badari Pulavarty wrote:
> Initial plan was to use invalidate_inode_pages2_range(). But it didn't
> really do what we wanted. So we ended up using truncate_inode_pages().
> If it really works, then I plan to add truncate_inode_pages2_range(),
> which works on a range of pages instead of the whole file.
> madvise(DONTNEED) followed by madvise(DISCARD) should be able to drop
> all the pages in the given range.
>
> Does this make sense ? Does this seem like right approach ?

Works for me. I obviously have no idea about the wider vm implications of
this - that would be Hugh's territory :-)

Jeff
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension 2005-10-20 17:27 ` Jeff Dike @ 2005-10-20 22:37 ` Badari Pulavarty 2005-10-24 20:04 ` Hugh Dickins 0 siblings, 1 reply; 19+ messages in thread
From: Badari Pulavarty @ 2005-10-20 22:37 UTC
To: Jeff Dike; +Cc: Hugh Dickins, linux-mm, dvhltc

[-- Attachment #1: Type: text/plain, Size: 1117 bytes --]

On Thu, 2005-10-20 at 13:27 -0400, Jeff Dike wrote:
> On Thu, Oct 20, 2005 at 08:11:05AM -0700, Badari Pulavarty wrote:
> > Initial plan was to use invalidate_inode_pages2_range(). But it didn't
> > really do what we wanted. So we ended up using truncate_inode_pages().
> > If it really works, then I plan to add truncate_inode_pages2_range(),
> > which works on a range of pages instead of the whole file.
> > madvise(DONTNEED) followed by madvise(DISCARD) should be able to drop
> > all the pages in the given range.
> >
> > Does this make sense ? Does this seem like right approach ?
>
> Works for me. I obviously have no idea about the wider vm implications of
> this - that would be Hugh's territory :-)

Here is the latest version of madvise(DISCARD) I cooked up after
talking to Darren.

Changes from previous:

1) madvise(DISCARD) - zaps the range and discards the pages. So, no
need to call madvise(DONTNEED) before.

2) I added truncate_inode_pages2_range() to just discard only the
range of pages - not the whole file.

Hugh, when you get a chance could you review this instead ?
Thanks,
Badari

[-- Attachment #2: madvise-discard2.patch --]
[-- Type: text/x-patch, Size: 16242 bytes --]

diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-alpha/mman.h linux-2.6.14-rc3.db2/include/asm-alpha/mman.h
--- linux-2.6.14-rc3/include/asm-alpha/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-alpha/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED 3 /* will need these pages */
 #define MADV_SPACEAVAIL 5 /* ensure resources are available */
 #define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_DISCARD 7 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-arm/mman.h linux-2.6.14-rc3.db2/include/asm-arm/mman.h
--- linux-2.6.14-rc3/include/asm-arm/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-arm/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-arm26/mman.h linux-2.6.14-rc3.db2/include/asm-arm26/mman.h
--- linux-2.6.14-rc3/include/asm-arm26/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-arm26/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-cris/mman.h linux-2.6.14-rc3.db2/include/asm-cris/mman.h
--- linux-2.6.14-rc3/include/asm-cris/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-cris/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -37,6 +37,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-frv/mman.h linux-2.6.14-rc3.db2/include/asm-frv/mman.h
--- linux-2.6.14-rc3/include/asm-frv/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-frv/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-h8300/mman.h linux-2.6.14-rc3.db2/include/asm-h8300/mman.h
--- linux-2.6.14-rc3/include/asm-h8300/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-h8300/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-i386/mman.h linux-2.6.14-rc3.db2/include/asm-i386/mman.h
--- linux-2.6.14-rc3/include/asm-i386/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-i386/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-ia64/mman.h linux-2.6.14-rc3.db2/include/asm-ia64/mman.h
--- linux-2.6.14-rc3/include/asm-ia64/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-ia64/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-m32r/mman.h linux-2.6.14-rc3.db2/include/asm-m32r/mman.h
--- linux-2.6.14-rc3/include/asm-m32r/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-m32r/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -37,6 +37,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-m68k/mman.h linux-2.6.14-rc3.db2/include/asm-m68k/mman.h
--- linux-2.6.14-rc3/include/asm-m68k/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-m68k/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-mips/mman.h linux-2.6.14-rc3.db2/include/asm-mips/mman.h
--- linux-2.6.14-rc3/include/asm-mips/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-mips/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-parisc/mman.h linux-2.6.14-rc3.db2/include/asm-parisc/mman.h
--- linux-2.6.14-rc3/include/asm-parisc/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-parisc/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
 #define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_DISCARD 8 /* discard pages right now */
 
 /* The range 12-64 is reserved for page size specification. */
 #define MADV_4K_PAGES 12 /* Use 4K pages */
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-powerpc/mman.h linux-2.6.14-rc3.db2/include/asm-powerpc/mman.h
--- linux-2.6.14-rc3/include/asm-powerpc/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-powerpc/mman.h	2005-10-20 13:55:18.000000000 -0700
@@ -44,6 +44,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-s390/mman.h linux-2.6.14-rc3.db2/include/asm-s390/mman.h
--- linux-2.6.14-rc3/include/asm-s390/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-s390/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-sh/mman.h linux-2.6.14-rc3.db2/include/asm-sh/mman.h
--- linux-2.6.14-rc3/include/asm-sh/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-sh/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-sparc/mman.h linux-2.6.14-rc3.db2/include/asm-sparc/mman.h
--- linux-2.6.14-rc3/include/asm-sparc/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-sparc/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -54,6 +54,7 @@
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
 #define MADV_FREE 0x5 /* (Solaris) contents can be freed */
+#define MADV_DISCARD 0x6 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-sparc64/mman.h linux-2.6.14-rc3.db2/include/asm-sparc64/mman.h
--- linux-2.6.14-rc3/include/asm-sparc64/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-sparc64/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -54,6 +54,7 @@
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
 #define MADV_FREE 0x5 /* (Solaris) contents can be freed */
+#define MADV_DISCARD 0x6 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-v850/mman.h linux-2.6.14-rc3.db2/include/asm-v850/mman.h
--- linux-2.6.14-rc3/include/asm-v850/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-v850/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -32,6 +32,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-x86_64/mman.h linux-2.6.14-rc3.db2/include/asm-x86_64/mman.h
--- linux-2.6.14-rc3/include/asm-x86_64/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-x86_64/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -36,6 +36,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/asm-xtensa/mman.h linux-2.6.14-rc3.db2/include/asm-xtensa/mman.h
--- linux-2.6.14-rc3/include/asm-xtensa/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-xtensa/mman.h	2005-10-20 13:56:45.000000000 -0700
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
 #define MADV_WILLNEED 0x3 /* pre-fault pages */
 #define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DISCARD 0x5 /* discard pages right now */
 
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14-rc3/include/linux/mm.h linux-2.6.14-rc3.db2/include/linux/mm.h
--- linux-2.6.14-rc3/include/linux/mm.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/linux/mm.h	2005-10-20 13:41:57.000000000 -0700
@@ -865,6 +865,7 @@ extern unsigned long do_brk(unsigned lon
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
+extern void truncate_inode_pages2_range(struct address_space *, loff_t, loff_t);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
diff -Naurp -X dontdiff linux-2.6.14-rc3/mm/madvise.c linux-2.6.14-rc3.db2/mm/madvise.c
--- linux-2.6.14-rc3/mm/madvise.c	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/mm/madvise.c	2005-10-20 13:37:41.000000000 -0700
@@ -137,6 +137,40 @@ static long madvise_dontneed(struct vm_a
 	return 0;
 }
 
+static long madvise_discard(struct vm_area_struct * vma,
+			    struct vm_area_struct ** prev,
+			    unsigned long start, unsigned long end)
+{
+	struct address_space *mapping;
+	loff_t offset, endoff;
+	int error = 0;
+
+	if (!vma->vm_file || !vma->vm_file->f_mapping
+			|| !vma->vm_file->f_mapping->host) {
+		return -EINVAL;
+	}
+
+	mapping = vma->vm_file->f_mapping;
+	if (mapping == &swapper_space) {
+		return -EINVAL;
+	}
+
+	error = madvise_dontneed(vma, prev, start, end);
+	if (error)
+		return error;
+
+	/* looks good, try and rip it out of page cache */
+	printk("%s: trying to rip shm vma (%p) inode from page cache\n", __FUNCTION__, vma);
+	offset = (loff_t)(start - vma->vm_start);
+	endoff = (loff_t)(end - vma->vm_start);
+	printk("call truncate_inode_pages(%p, %x %x)\n", mapping,
+		(unsigned int)offset, (unsigned int)endoff);
+	down(&mapping->host->i_sem);
+	truncate_inode_pages2_range(mapping, offset, endoff);
+	up(&mapping->host->i_sem);
+	return 0;
+}
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -153,6 +187,9 @@ madvise_vma(struct vm_area_struct *vma,
 	case MADV_RANDOM:
 		error = madvise_behavior(vma, prev, start, end, behavior);
 		break;
+	case MADV_DISCARD:
+		error = madvise_discard(vma, prev, start, end);
+		break;
 
 	case MADV_WILLNEED:
 		error = madvise_willneed(vma, prev, start, end);
diff -Naurp -X dontdiff linux-2.6.14-rc3/mm/truncate.c linux-2.6.14-rc3.db2/mm/truncate.c
--- linux-2.6.14-rc3/mm/truncate.c	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/mm/truncate.c	2005-10-20 13:59:20.000000000 -0700
@@ -113,7 +113,8 @@ invalidate_complete_page(struct address_
  *
  * Called under (and serialised by) inode->i_sem.
  */
-void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
+void truncate_inode_pages2_range(struct address_space *mapping, loff_t lstart,
+				 loff_t end)
 {
 	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
 	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
@@ -126,7 +127,8 @@ void truncate_inode_pages(struct address
 
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (next <= end &&
+	       pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index = page->index;
@@ -142,6 +144,8 @@ void truncate_inode_pages(struct address
 			}
 			truncate_complete_page(mapping, page);
 			unlock_page(page);
+			if (next > end)
+				break;
 		}
 		pagevec_release(&pvec);
 		cond_resched();
@@ -176,12 +180,20 @@ void truncate_inode_pages(struct address
 			next++;
 			truncate_complete_page(mapping, page);
 			unlock_page(page);
+			if (next > end)
+				break;
 		}
 		pagevec_release(&pvec);
 	}
 }
 
+void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
+{
+	return truncate_inode_pages2_range(mapping, lstart, ~0UL);
+}
+
 EXPORT_SYMBOL(truncate_inode_pages);
+EXPORT_SYMBOL(truncate_inode_pages2_range);
 
 /**
  * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
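One detail a reader may want to check in the patch above: `start` is derived from the byte offset `lstart` by rounding up to a page index, while the loop index `next` is compared directly against `end`, which madvise_discard() appears to pass as a byte offset. The sketch below is userspace arithmetic only, under the assumption of 4 KiB pages; the TOY_* constants and toy_* functions are invented for illustration and are not kernel interfaces. It shows the round-up conversion the patch already uses for the start, next to one plausible round-down conversion for the end.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Index of the first page lying wholly at or after byte lstart (rounds up). */
unsigned long toy_first_whole_page(unsigned long long lstart)
{
	return (unsigned long)((lstart + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT);
}

/* One past the last page lying wholly before byte lend (rounds down). */
unsigned long toy_end_page(unsigned long long lend)
{
	return (unsigned long)(lend >> TOY_PAGE_SHIFT);
}

/* Number of whole pages inside the byte range [lstart, lend). */
unsigned long toy_whole_pages(unsigned long long lstart,
			      unsigned long long lend)
{
	unsigned long first = toy_first_whole_page(lstart);
	unsigned long end = toy_end_page(lend);

	assert(lstart <= lend);
	return end > first ? end - first : 0;
}
```

For example, the byte range [4096, 12288) covers exactly pages 1 and 2, while [100, 200) covers no whole page at all; comparing a page index against the raw byte value 12288 would instead let the loop run far past the intended range.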
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension
  2005-10-20 22:37 ` Badari Pulavarty
@ 2005-10-24 20:04 ` Hugh Dickins
  2005-10-24 20:22 ` Darren Hart
  2005-10-24 20:24 ` Badari Pulavarty
  0 siblings, 2 replies; 19+ messages in thread
From: Hugh Dickins @ 2005-10-24 20:04 UTC
To: Badari Pulavarty; +Cc: Jeff Dike, linux-mm, dvhltc

On Thu, 20 Oct 2005, Badari Pulavarty wrote:
>
> Changes from previous:
>
> 1) madvise(DISCARD) - zaps the range and discards the pages. So, no
> need to call madvise(DONTNEED) before.
>
> 2) I added truncate_inode_pages2_range() to just discard only the
> range of pages - not the whole file.
>
> Hugh, when you get a chance could you review this instead ?

I haven't had time to go through it thoroughly, and will have no time
the next couple of days, but here are some remarks.

--- linux-2.6.14-rc3/include/asm-alpha/mman.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/asm-alpha/mman.h	2005-10-20 10:52:37.000000000 -0700
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_DISCARD    7               /* discard pages right now */

Throughout the patch there's lots of spaces where there should be tabs.
But I'm glad you've put a space after the "#define" here, unlike in that
MADV_SPACEAVAIL higher up! Not so glad at your spaces to the right of it.

Are we free to define MADV_DISCARD, coming after the others, in each of
the architectures? In general, I think mman.h reflects definitions made
by native Operating Systems of the architectures in question, and they
might have added a few since.

--- linux-2.6.14-rc3/include/linux/mm.h	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/include/linux/mm.h	2005-10-20 13:41:57.000000000 -0700
@@ -865,6 +865,7 @@ extern unsigned long do_brk(unsigned lon
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
+extern void truncate_inode_pages2_range(struct address_space *, loff_t, loff_t);

Personally, I have an aversion to sticking a "2" in there. I know you're
just following the convention established by invalidate_inode_pages2, but..

Hold on, -mm already contains reiser4-truncate_inode_pages_range.patch,
you should be working with that. Doesn't it do just what you need,
even without a "2" :-?

--- linux-2.6.14-rc3/mm/madvise.c	2005-09-30 14:17:35.000000000 -0700
+++ linux-2.6.14-rc3.db2/mm/madvise.c	2005-10-20 13:37:41.000000000 -0700
@@ -137,6 +137,40 @@ static long madvise_dontneed(struct vm_a
 	return 0;
 }
 
+static long madvise_discard(struct vm_area_struct * vma,
+			struct vm_area_struct ** prev,
+			unsigned long start, unsigned long end)
+{
....
+	error = madvise_dontneed(vma, prev, start, end);
+	if (error)
+		return error;
+
+	/* looks good, try and rip it out of page cache */
+	printk("%s: trying to rip shm vma (%p) inode from page cache\n", __FUNCTION__, vma);
+	offset = (loff_t)(start - vma->vm_start);
+	endoff = (loff_t)(end - vma->vm_start);
+	printk("call truncate_inode_pages(%p, %x %x)\n", mapping,
+			(unsigned int)offset, (unsigned int)endoff);
+	down(&mapping->host->i_sem);
+	truncate_inode_pages2_range(mapping, offset, endoff);
+	up(&mapping->host->i_sem);
+	return 0;
+}

Hmm. I don't think it's consistent to zap the pages from a single mm,
then remove them from the page cache, while leaving the pages mapped into
other mms. Just what would those pages then be? they're not file pages,
they're not anonymous pages, such pages have given trouble in the past.

I think you'll need to follow vmtruncate much more closely - and the
unmap_mapping_range code already allows for a range, shouldn't need
much change - going through all the vmas before truncating the range.

Which makes it feel more like sys_fpunch than an madvise.

You of course need write access to the underlying file, is that checked?

What should it be doing to anonymous COWed pages? Not clear whether
it should be following truncate in discarding those too, or not.

Hugh
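The consistency concern in the review above can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not kernel code: struct toy_file, discard_single_mm(), punch_all_mms() and orphaned_pages() are hypothetical names, and a page is modelled as nothing more than a "cached" flag plus one "mapped" flag per address space. It only shows the ordering being asked for: unmap the range from every mm first, then drop it from the page cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8	/* assumption: toy file size in pages */
#define TOY_NR_MMS   3	/* assumption: address spaces that may map the file */

struct toy_file {
	bool cached[TOY_NR_PAGES];		/* page is in the page cache */
	bool mapped[TOY_NR_MMS][TOY_NR_PAGES];	/* page is mapped into mm i */
};

/* Miniature of the posted approach: zap from one mm, then drop from cache. */
void discard_single_mm(struct toy_file *f, int mm, size_t first, size_t last)
{
	assert(mm >= 0 && mm < TOY_NR_MMS);
	assert(first <= last && last < TOY_NR_PAGES);
	for (size_t i = first; i <= last; i++) {
		f->mapped[mm][i] = false;
		f->cached[i] = false;
	}
}

/* Miniature of the suggested approach: unmap everywhere, then drop. */
void punch_all_mms(struct toy_file *f, size_t first, size_t last)
{
	assert(first <= last && last < TOY_NR_PAGES);
	for (int mm = 0; mm < TOY_NR_MMS; mm++)
		for (size_t i = first; i <= last; i++)
			f->mapped[mm][i] = false;
	for (size_t i = first; i <= last; i++)
		f->cached[i] = false;
}

/* Count pages that some mm still maps although the cache no longer holds them. */
int orphaned_pages(const struct toy_file *f)
{
	int n = 0;

	for (size_t i = 0; i < TOY_NR_PAGES; i++) {
		if (f->cached[i])
			continue;
		for (int mm = 0; mm < TOY_NR_MMS; mm++) {
			if (f->mapped[mm][i]) {
				n++;
				break;
			}
		}
	}
	return n;
}
```

With two address spaces mapping the whole toy file, discarding pages 2 to 4 through a single mm leaves three pages that are still mapped elsewhere but absent from the cache, while the unmap-everywhere variant leaves none. Those leftover pages stand in for the "not file pages, not anonymous pages" case the review warns about.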
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension
  2005-10-24 20:04 ` Hugh Dickins
@ 2005-10-24 20:22 ` Darren Hart
  2005-10-24 20:24 ` Badari Pulavarty
  1 sibling, 0 replies; 19+ messages in thread
From: Darren Hart @ 2005-10-24 20:22 UTC
To: Hugh Dickins; +Cc: Badari Pulavarty, Jeff Dike, linux-mm

Hugh Dickins wrote:
> On Thu, 20 Oct 2005, Badari Pulavarty wrote:
>
>>Changes from previous:
>>
>>1) madvise(DISCARD) - zaps the range and discards the pages. So, no
>>need to call madvise(DONTNEED) before.
>>
>>2) I added truncate_inode_pages2_range() to just discard only the
>>range of pages - not the whole file.
>>
>>Hugh, when you get a chance could you review this instead ?
>
>
> I haven't had time to go through it thoroughly, and will have no time
> the next couple of days, but here are some remarks.

Excellent points all, I'll take some time here in the next couple days and
work up a response to each and an updated patch. Regarding the spaces... I
must have used the wrong vi wrapper to edit the patch - apologies, will be
fixed in next rev.

--Darren

>
> --- linux-2.6.14-rc3/include/asm-alpha/mman.h	2005-09-30 14:17:35.000000000 -0700
> +++ linux-2.6.14-rc3.db2/include/asm-alpha/mman.h	2005-10-20 10:52:37.000000000 -0700
> @@ -42,6 +42,7 @@
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
>  #define MADV_DONTNEED	6		/* don't need these pages */
> +#define MADV_DISCARD    7               /* discard pages right now */
>
> Throughout the patch there's lots of spaces where there should be tabs.
> But I'm glad you've put a space after the "#define" here, unlike in that
> MADV_SPACEAVAIL higher up! Not so glad at your spaces to the right of it.
>
> Are we free to define MADV_DISCARD, coming after the others, in each of
> the architectures? In general, I think mman.h reflects definitions made
> by native Operating Systems of the architectures in question, and they
> might have added a few since.
>
> --- linux-2.6.14-rc3/include/linux/mm.h	2005-09-30 14:17:35.000000000 -0700
> +++ linux-2.6.14-rc3.db2/include/linux/mm.h	2005-10-20 13:41:57.000000000 -0700
> @@ -865,6 +865,7 @@ extern unsigned long do_brk(unsigned lon
>  /* filemap.c */
>  extern unsigned long page_unuse(struct page *);
>  extern void truncate_inode_pages(struct address_space *, loff_t);
> +extern void truncate_inode_pages2_range(struct address_space *, loff_t, loff_t);
>
> Personally, I have an aversion to sticking a "2" in there. I know you're
> just following the convention established by invalidate_inode_pages2, but..
>
> Hold on, -mm already contains reiser4-truncate_inode_pages_range.patch,
> you should be working with that. Doesn't it do just what you need,
> even without a "2" :-?
>
> --- linux-2.6.14-rc3/mm/madvise.c	2005-09-30 14:17:35.000000000 -0700
> +++ linux-2.6.14-rc3.db2/mm/madvise.c	2005-10-20 13:37:41.000000000 -0700
> @@ -137,6 +137,40 @@ static long madvise_dontneed(struct vm_a
>  	return 0;
>  }
>
> +static long madvise_discard(struct vm_area_struct * vma,
> +			struct vm_area_struct ** prev,
> +			unsigned long start, unsigned long end)
> +{
> ....
> +	error = madvise_dontneed(vma, prev, start, end);
> +	if (error)
> +		return error;
> +
> +	/* looks good, try and rip it out of page cache */
> +	printk("%s: trying to rip shm vma (%p) inode from page cache\n", __FUNCTION__, vma);
> +	offset = (loff_t)(start - vma->vm_start);
> +	endoff = (loff_t)(end - vma->vm_start);
> +	printk("call truncate_inode_pages(%p, %x %x)\n", mapping,
> +			(unsigned int)offset, (unsigned int)endoff);
> +	down(&mapping->host->i_sem);
> +	truncate_inode_pages2_range(mapping, offset, endoff);
> +	up(&mapping->host->i_sem);
> +	return 0;
> +}
>
> Hmm. I don't think it's consistent to zap the pages from a single mm,
> then remove them from the page cache, while leaving the pages mapped into
> other mms. Just what would those pages then be? they're not file pages,
> they're not anonymous pages, such pages have given trouble in the past.
>
> I think you'll need to follow vmtruncate much more closely - and the
> unmap_mapping_range code already allows for a range, shouldn't need
> much change - going through all the vmas before truncating the range.
>
> Which makes it feel more like sys_fpunch than an madvise.
>
> You of course need write access to the underlying file, is that checked?
>
> What should it be doing to anonymous COWed pages? Not clear whether
> it should be following truncate in discarding those too, or not.
>
> Hugh
>

-- 
Darren Hart
IBM Linux Technology Center
Linux Kernel Team
Phone: 503 578 3185
  T/L: 775 3185
* Re: [RFC][PATCH] OVERCOMMIT_ALWAYS extension
  2005-10-24 20:04 ` Hugh Dickins
  2005-10-24 20:22 ` Darren Hart
@ 2005-10-24 20:24 ` Badari Pulavarty
  1 sibling, 0 replies; 19+ messages in thread
From: Badari Pulavarty @ 2005-10-24 20:24 UTC
To: Hugh Dickins; +Cc: Jeff Dike, linux-mm, dvhltc

On Mon, 2005-10-24 at 21:04 +0100, Hugh Dickins wrote:
> On Thu, 20 Oct 2005, Badari Pulavarty wrote:
> >
> > Changes from previous:
> >
> > 1) madvise(DISCARD) - zaps the range and discards the pages. So, no
> > need to call madvise(DONTNEED) before.
> >
> > 2) I added truncate_inode_pages2_range() to just discard only the
> > range of pages - not the whole file.
> >
> > Hugh, when you get a chance could you review this instead ?
>
> I haven't had time to go through it thoroughly, and will have no time
> the next couple of days, but here are some remarks.
>
> --- linux-2.6.14-rc3/include/asm-alpha/mman.h	2005-09-30 14:17:35.000000000 -0700
> +++ linux-2.6.14-rc3.db2/include/asm-alpha/mman.h	2005-10-20 10:52:37.000000000 -0700
> @@ -42,6 +42,7 @@
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
>  #define MADV_DONTNEED	6		/* don't need these pages */
> +#define MADV_DISCARD    7               /* discard pages right now */
>
> Throughout the patch there's lots of spaces where there should be tabs.
> But I'm glad you've put a space after the "#define" here, unlike in that
> MADV_SPACEAVAIL higher up! Not so glad at your spaces to the right of it.

Sorry about that. I was working with Darren's old patch and didn't
bother cleaning up the white space.

>
> Are we free to define MADV_DISCARD, coming after the others, in each of
> the architectures? In general, I think mman.h reflects definitions made
> by native Operating Systems of the architectures in question, and they
> might have added a few since.

I looked at all architectures. No matter what their header files say,
none of them actually implemented anything other than the standard
ones (documented in the manpages).

> --- linux-2.6.14-rc3/include/linux/mm.h	2005-09-30 14:17:35.000000000 -0700
> +++ linux-2.6.14-rc3.db2/include/linux/mm.h	2005-10-20 13:41:57.000000000 -0700
> @@ -865,6 +865,7 @@ extern unsigned long do_brk(unsigned lon
>  /* filemap.c */
>  extern unsigned long page_unuse(struct page *);
>  extern void truncate_inode_pages(struct address_space *, loff_t);
> +extern void truncate_inode_pages2_range(struct address_space *, loff_t, loff_t);
>
> Personally, I have an aversion to sticking a "2" in there. I know you're
> just following the convention established by invalidate_inode_pages2, but..
>
> Hold on, -mm already contains reiser4-truncate_inode_pages_range.patch,
> you should be working with that. Doesn't it do just what you need,
> even without a "2" :-?

Yes. That's exactly what I did also. One less thing for me to worry about :)

>
> --- linux-2.6.14-rc3/mm/madvise.c	2005-09-30 14:17:35.000000000 -0700
> +++ linux-2.6.14-rc3.db2/mm/madvise.c	2005-10-20 13:37:41.000000000 -0700
> @@ -137,6 +137,40 @@ static long madvise_dontneed(struct vm_a
>  	return 0;
>  }
>
> +static long madvise_discard(struct vm_area_struct * vma,
> +			struct vm_area_struct ** prev,
> +			unsigned long start, unsigned long end)
> +{
> ....
> +	error = madvise_dontneed(vma, prev, start, end);
> +	if (error)
> +		return error;
> +
> +	/* looks good, try and rip it out of page cache */
> +	printk("%s: trying to rip shm vma (%p) inode from page cache\n", __FUNCTION__, vma);
> +	offset = (loff_t)(start - vma->vm_start);
> +	endoff = (loff_t)(end - vma->vm_start);
> +	printk("call truncate_inode_pages(%p, %x %x)\n", mapping,
> +			(unsigned int)offset, (unsigned int)endoff);
> +	down(&mapping->host->i_sem);
> +	truncate_inode_pages2_range(mapping, offset, endoff);
> +	up(&mapping->host->i_sem);
> +	return 0;
> +}
>
> Hmm. I don't think it's consistent to zap the pages from a single mm,
> then remove them from the page cache, while leaving the pages mapped into
> other mms. Just what would those pages then be? they're not file pages,
> they're not anonymous pages, such pages have given trouble in the past.
>
> I think you'll need to follow vmtruncate much more closely - and the
> unmap_mapping_range code already allows for a range, shouldn't need
> much change - going through all the vmas before truncating the range.
>
> Which makes it feel more like sys_fpunch than an madvise.
>
> You of course need write access to the underlying file, is that checked?
>
> What should it be doing to anonymous COWed pages? Not clear whether
> it should be following truncate in discarding those too, or not.

You are right. What we have here is a kludge - pointed out by Andrea
also (in a private e-mail). He recommended that I look at doing a
"real" MADV_TRUNCATE and add filesystem hooks to make sure it's not
limited to only "shmfs".

I am redoing it. I am scared to touch this part of the VM code, that's
why I was trying to get away with the smallest possible change. I guess
it's time to sit down and do it for real.

Thanks,
Badari
Thread overview: 19+ messages
2005-10-17 17:30 [RFC] OVERCOMMIT_ALWAYS extension Badari Pulavarty
2005-10-17 18:13 ` Hugh Dickins
2005-10-17 18:25 ` Hugh Dickins
2005-10-17 23:14 ` Badari Pulavarty
2005-10-18 16:05 ` [RFC][PATCH] " Badari Pulavarty
2005-10-19 17:56 ` Hugh Dickins
2005-10-19 18:32 ` Jeff Dike
2005-10-19 21:21 ` Badari Pulavarty
2005-10-19 22:38 ` Jeff Dike
2005-10-19 18:50 ` Badari Pulavarty
2005-10-19 19:12 ` Darren Hart
2005-10-19 20:10 ` Hugh Dickins
2005-10-19 20:47 ` Jeff Dike
2005-10-20 15:11 ` Badari Pulavarty
2005-10-20 17:27 ` Jeff Dike
2005-10-20 22:37 ` Badari Pulavarty
2005-10-24 20:04 ` Hugh Dickins
2005-10-24 20:22 ` Darren Hart
2005-10-24 20:24 ` Badari Pulavarty