* unconditional discard calls in the swap code @ 2009-10-30 6:51 Christoph Hellwig 2009-10-30 17:26 ` Hugh Dickins 0 siblings, 1 reply; 7+ messages in thread From: Christoph Hellwig @ 2009-10-30 6:51 UTC (permalink / raw) To: hugh.dickins; +Cc: linux-mm Hi Hugh, since 6a6ba83175c029c7820765bae44692266b29e67a the swap code unconditionally calls blkdev_issue_discard when swap clusters get freed. So far this was harmless because only the mtd driver has discard support wired up and it's pretty fast there (entirely done in-kernel). We're now adding support for real UNAP/TRIM support for SCSI arrays and SSDs, and so far all the real life ones we've dealt with have too many performance issues to just issue the discard requests on the fly. Because of that unconditionally enabling this code is a bad idea, it really needs an option to disable it or even better just leave it disabled by default for now with an option to enable it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: unconditional discard calls in the swap code 2009-10-30 6:51 unconditional discard calls in the swap code Christoph Hellwig @ 2009-10-30 17:26 ` Hugh Dickins 2009-11-18 17:12 ` Christoph Hellwig 0 siblings, 1 reply; 7+ messages in thread From: Hugh Dickins @ 2009-10-30 17:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi Hi Christoph, (I've added Ccs, hoping for more expertise than we have in linux-mm.) On Fri, 30 Oct 2009, Christoph Hellwig wrote: > > since 6a6ba83175c029c7820765bae44692266b29e67a the swap code > unconditionally calls blkdev_issue_discard when swap clusters get freed. > So far this was harmless because only the mtd driver has discard support > wired up and it's pretty fast there (entirely done in-kernel). > > We're now adding support for real UNMAP/TRIM support for SCSI arrays and > SSDs, and so far all the real life ones we've dealt with have too many > performance issues to just issue the discard requests on the fly. > Because of that unconditionally enabling this code is a bad idea, it > really needs an option to disable it or even better just leave it > disabled by default for now with an option to enable it. Thanks for the info. Yes, in practice TRIM seems a huge disappointment: is there a device on which it is really implemented, and not more trouble than it's worth? I'd been waiting for OCZ to get a Vertex 1.4* firmware out of Beta before looking at swap discard again; but even then, the Linux ATA support is still up in the air, so far as I know. You don't mention swap's discard of the whole partition (or all extents of the swapfile) at swapon time: do you believe that usage is okay to retain? Is it likely on some devices to take so long, that I ought to make it asynchronous? Assuming that initial swap discard is good, I wonder whether just to revert the discard of swap clusters for now: until such time as we find devices (other than mtd) that can implement it efficiently. If we do retain the discard of swap clusters, under something more than an #if 0, any ideas for what I should make it conditional upon? Something near /sys/block/sda/queue/rotational (nicely rw these days) seems appropriate: any chance of a /sys/block/sda/queue/discard_is_useful? I think I'd prefer that to a new option to swapon. Or is there a sensible measurement I could make in swapfile.c: for example, does discard of a range complete faster than write of the same range? (But my guess is that those devices we'd want to avoid discard on, would give erratic answers to any such test; never mind the noise of what other I/Os are concurrent to the same device.) Something I should almost certainly revert: at one stage I made the non-rotational case spread its swapping evenly over the partition, in case the device's wear-levelling was inadequate (localized). But now I think it's better to ignore that possibility, and anchor swapping to the start of the partition just as in the rotational case: in the rotational case it's done to minimize seeking, in the non- rotational case it would be to minimize encroaching upon that initially discarded total extent. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: unconditional discard calls in the swap code 2009-10-30 17:26 ` Hugh Dickins @ 2009-11-18 17:12 ` Christoph Hellwig 2009-11-30 17:22 ` [PATCH] mm: don't discard unused swap slots by default Christoph Hellwig 0 siblings, 1 reply; 7+ messages in thread From: Christoph Hellwig @ 2009-11-18 17:12 UTC (permalink / raw) To: Hugh Dickins Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi On Fri, Oct 30, 2009 at 05:26:18PM +0000, Hugh Dickins wrote: > Yes, in practice TRIM seems a huge disappointment: is there a device on > which it is really implemented, and not more trouble than it's worth? > > I'd been waiting for OCZ to get a Vertex 1.4* firmware out of Beta > before looking at swap discard again; but even then, the Linux ATA > support is still up in the air, so far as I know. I've tied it up now for libata, and testing with the releases OCZ 1.4 firmware. Haven't tested anything else yet except for my own implementations of TRIM and WRITE SAME in qemu which are a lot faster than real hardware. > > You don't mention swap's discard of the whole partition (or all > extents of the swapfile) at swapon time: do you believe that usage > is okay to retain? Is it likely on some devices to take so long, > that I ought to make it asynchronous? The use on swapon seems fine - we've also added support to discard on mkfs which is generally fast enough - the existing implementations seem to have mostly constant overhead, the more blocks your discard, the better. > Assuming that initial swap discard is good, I wonder whether just > to revert the discard of swap clusters for now: until such time as > we find devices (other than mtd) that can implement it efficiently. > > If we do retain the discard of swap clusters, under something more > than an #if 0, any ideas for what I should make it conditional upon? add a sysctl / sysfs tunable for it? For all filesystems we now have patches pending to require and -o discard option to use it, which will be quite nessecary for 2.6.33 where all the block layer / scsi layer / libata support will fall into place. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH] mm: don't discard unused swap slots by default 2009-11-18 17:12 ` Christoph Hellwig @ 2009-11-30 17:22 ` Christoph Hellwig 2009-11-30 18:28 ` Hugh Dickins 0 siblings, 1 reply; 7+ messages in thread From: Christoph Hellwig @ 2009-11-30 17:22 UTC (permalink / raw) To: Hugh Dickins; +Cc: Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi Current TRIM/UNMAP/etc implementation are slow enough that discarding small chunk during run time is a bad idea. So only discard the whole swap space on swapon by default, but require the admin to enable it for run-time discards using the new vm.discard_swapspace sysctl. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/include/linux/swap.h =================================================================== --- linux-2.6.orig/include/linux/swap.h 2009-11-27 11:50:47.319003920 +0100 +++ linux-2.6/include/linux/swap.h 2009-11-27 11:51:55.617286868 +0100 @@ -247,6 +247,7 @@ extern unsigned long mem_cgroup_shrink_n extern int __isolate_lru_page(struct page *page, int mode, int file); extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; +extern int vm_discard_swapspace; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; Index: linux-2.6/kernel/sysctl.c =================================================================== --- linux-2.6.orig/kernel/sysctl.c 2009-11-27 11:49:02.935254088 +0100 +++ linux-2.6/kernel/sysctl.c 2009-11-27 11:53:10.333006621 +0100 @@ -1163,6 +1163,16 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, .extra2 = &one_hundred, }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "discard_swapspace", + .data = &vm_discard_swapspace, + .maxlen = sizeof(vm_discard_swapspace), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, #ifdef CONFIG_HUGETLB_PAGE { .procname = "nr_hugepages", Index: linux-2.6/mm/swapfile.c =================================================================== --- linux-2.6.orig/mm/swapfile.c 2009-11-27 11:53:19.449254088 +0100 +++ linux-2.6/mm/swapfile.c 2009-11-27 11:54:07.883255931 +0100 @@ -41,6 +41,7 @@ long nr_swap_pages; long total_swap_pages; static int swap_overflow; static int least_priority; +int vm_discard_swapspace; static const char Bad_file[] = "Bad swap file entry "; static const char Unused_file[] = "Unused swap file entry "; @@ -1978,7 +1979,7 @@ SYSCALL_DEFINE2(swapon, const char __use p->flags |= SWP_SOLIDSTATE; p->cluster_next = 1 + (random32() % p->highest_bit); } - if (discard_swap(p) == 0) + if (discard_swap(p) == 0 && vm_discard_swapspace) p->flags |= SWP_DISCARDABLE; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] mm: don't discard unused swap slots by default 2009-11-30 17:22 ` [PATCH] mm: don't discard unused swap slots by default Christoph Hellwig @ 2009-11-30 18:28 ` Hugh Dickins 2009-11-30 18:58 ` Martin K. Petersen 2009-12-31 0:33 ` Hugh Dickins 0 siblings, 2 replies; 7+ messages in thread From: Hugh Dickins @ 2009-11-30 18:28 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi On Mon, 30 Nov 2009, Christoph Hellwig wrote: > Current TRIM/UNMAP/etc implementation are slow enough that discarding > small chunk during run time is a bad idea. So only discard the whole > swap space on swapon by default, but require the admin to enable it > for run-time discards using the new vm.discard_swapspace sysctl. > > Signed-off-by: Christoph Hellwig <hch@lst.de> Thanks: you having suggested it, I guess it's no coincidence that this looks a little like what I'm currently experimenting with, on 2.6.32-rc8-mm1 which contains your mods, on Vertex with 1.4 firmware. There's several variables (not the least my own idiocy), and too soon for me to say anything definite; but the impression I'm getting from numbers so far is that (on that SSD anyway) the dubious discards from SWP_DISCARDABLE are actually beneficial - more so than the initial discard of the whole partition. Each SWP_DISCARDABLE discard is of a 1MB range (if 4kB pagesize, and if swap partition - if swapping to fragmented regular file, they would often be of less, so indeed less efficient). Please could you send me, on or offlist, the tests you have which show them to be worth suppressing? I do prefer to avoid tunables if we can; and although this sysctl you suggested is the easiest way, it doesn't seem the correct way. Something in /sys/block/<device>/queue/ would be more correct: but perhaps more trouble than it's worth; and so very specific to swap (and whatever mm/swapfile.c happens to be doing in any release) that it wouldn't belong very well there either. You mentioned an "-o discard" mount option before: so I think what we ought to be doing is an option to swapon. But you can imagine that I'd prefer to avoid that too, if we can work this out without it. If I could see how bad these SWP_DISCARDABLE discards are for myself, I might for the moment prefer just to cut out that code, until we can be more intelligent about it (instead of fixing visible sysctl/sysfs/swapon options which limit to the current implementation). Thanks, Hugh > > Index: linux-2.6/include/linux/swap.h > =================================================================== > --- linux-2.6.orig/include/linux/swap.h 2009-11-27 11:50:47.319003920 +0100 > +++ linux-2.6/include/linux/swap.h 2009-11-27 11:51:55.617286868 +0100 > @@ -247,6 +247,7 @@ extern unsigned long mem_cgroup_shrink_n > extern int __isolate_lru_page(struct page *page, int mode, int file); > extern unsigned long shrink_all_memory(unsigned long nr_pages); > extern int vm_swappiness; > +extern int vm_discard_swapspace; > extern int remove_mapping(struct address_space *mapping, struct page *page); > extern long vm_total_pages; > > Index: linux-2.6/kernel/sysctl.c > =================================================================== > --- linux-2.6.orig/kernel/sysctl.c 2009-11-27 11:49:02.935254088 +0100 > +++ linux-2.6/kernel/sysctl.c 2009-11-27 11:53:10.333006621 +0100 > @@ -1163,6 +1163,16 @@ static struct ctl_table vm_table[] = { > .extra1 = &zero, > .extra2 = &one_hundred, > }, > + { > + .ctl_name = CTL_UNNUMBERED, > + .procname = "discard_swapspace", > + .data = &vm_discard_swapspace, > + .maxlen = sizeof(vm_discard_swapspace), > + .mode = 0644, > + .proc_handler = &proc_dointvec_minmax, > + .extra1 = &zero, > + .extra2 = &one, > + }, > #ifdef CONFIG_HUGETLB_PAGE > { > .procname = "nr_hugepages", > Index: linux-2.6/mm/swapfile.c > =================================================================== > --- linux-2.6.orig/mm/swapfile.c 2009-11-27 11:53:19.449254088 +0100 > +++ linux-2.6/mm/swapfile.c 2009-11-27 11:54:07.883255931 +0100 > @@ -41,6 +41,7 @@ long nr_swap_pages; > long total_swap_pages; > static int swap_overflow; > static int least_priority; > +int vm_discard_swapspace; > > static const char Bad_file[] = "Bad swap file entry "; > static const char Unused_file[] = "Unused swap file entry "; > @@ -1978,7 +1979,7 @@ SYSCALL_DEFINE2(swapon, const char __use > p->flags |= SWP_SOLIDSTATE; > p->cluster_next = 1 + (random32() % p->highest_bit); > } > - if (discard_swap(p) == 0) > + if (discard_swap(p) == 0 && vm_discard_swapspace) > p->flags |= SWP_DISCARDABLE; > } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] mm: don't discard unused swap slots by default 2009-11-30 18:28 ` Hugh Dickins @ 2009-11-30 18:58 ` Martin K. Petersen 2009-12-31 0:33 ` Hugh Dickins 1 sibling, 0 replies; 7+ messages in thread From: Martin K. Petersen @ 2009-11-30 18:58 UTC (permalink / raw) To: Hugh Dickins Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi >>>>> "Hugh" == Hugh Dickins <hugh.dickins@tiscali.co.uk> writes: Hugh> You mentioned an "-o discard" mount option before: so I think what Hugh> we ought to be doing is an option to swapon. But you can imagine Hugh> that I'd prefer to avoid that too, if we can work this out without Hugh> it. The main problem we have is that the devices currently supporting TRIM are doing a piss poor job at it. We have pretty good vendor guarantees that discards are going to be essentially free on SCSI-class hardware. But in the ATA space things are currently being driven by early adopters / tweakers that care more about benchmarketing and feature checklists. Whether things actually work as intended is mostly irrelevant. I think we'll need to give things a little bit of time for decent ATA TRIM implementations to materialize. And then we can switch to an "assume it works, blacklist bad eggs" approach. Until then I think we need to make discard an explicit opt-in feature. -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] mm: don't discard unused swap slots by default 2009-11-30 18:28 ` Hugh Dickins 2009-11-30 18:58 ` Martin K. Petersen @ 2009-12-31 0:33 ` Hugh Dickins 1 sibling, 0 replies; 7+ messages in thread From: Hugh Dickins @ 2009-12-31 0:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Matthew Wilcox, Martin K. Petersen, linux-mm, linux-scsi On Mon, 30 Nov 2009, Hugh Dickins wrote: > On Mon, 30 Nov 2009, Christoph Hellwig wrote: > > > Current TRIM/UNMAP/etc implementation are slow enough that discarding > > small chunk during run time is a bad idea. So only discard the whole > > swap space on swapon by default, but require the admin to enable it > > for run-time discards using the new vm.discard_swapspace sysctl. > > > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > Thanks: you having suggested it, I guess it's no coincidence that > this looks a little like what I'm currently experimenting with, on > 2.6.32-rc8-mm1 which contains your mods, on Vertex with 1.4 firmware. I continued those experiments, on the OCZ Vertex, and then on the Intel SSDSA2M080G2 which appeared at my door a few weeks ago. > > There's several variables (not the least my own idiocy), and too soon > for me to say anything definite; but the impression I'm getting from > numbers so far is that (on that SSD anyway) the dubious discards from > SWP_DISCARDABLE are actually beneficial - more so than the initial > discard of the whole partition. That initial impression was borne out by further testing on the Vertex, which continued to benefit significantly (but not dramatically) from SWP_DISCARDABLE discards, more so than from swapon's initial discard. The Intel showed altogether less benefit from any swap discards, but more benefit from swapon's initial discard (odd since I'd expect its effects soon to get wiped out) than from ongoing SWP_DISCARDABLEs. I didn't observe strong reason to tune out the SWP_DISCARDABLEs; though it may well be that I'd have done better to invest some time in looking for other swap speedups, than bother with discard in the first place. Martin mentioned "We have pretty good vendor guarantees that discards are going to be essentially free on SCSI-class hardware": can anyone suggest other ATA SSDs implementing discard that I could check? I was surprised by what came through most strongly from this testing, not an effect of initial or ongoing discards at all. I mentioned earlier that I intended to remove how SWP_SOLIDSTATE rotates around the swap area (whereas rotationals anchor to the start of the area). I put that in for low-end flash, on which wear-levelling might be very localized; but I'd come to see that it would be a bad idea on discard-capable SSDs, since they would tend to appear never less than 1MB away from full (after the first pass around the area). So while experimenting with tuning out the initial or ongoing discards, I also experimented with tuning out the initial randomization and the continued rotation. But on both the Vertex and the Intel, the randomization and rotation actually came through as much more consistently beneficial than the discards: definitely behaviour to be retained, even though I'm clueless why. > > Each SWP_DISCARDABLE discard is of a 1MB range (if 4kB pagesize, and > if swap partition - if swapping to fragmented regular file, they would > often be of less, so indeed less efficient). > > Please could you send me, on or offlist, the tests you have which show > them to be worth suppressing? You didn't respond, so perhaps the problem was a worry rather than a demonstrated issue. I can see that discarding filesystems appear liable to be calling discard even on inappropriately small extents, which swap avoids doing (unless swapping to a regular file which is too fragmented to be sensible anyway). > I do prefer to avoid tunables if we can; > and although this sysctl you suggested is the easiest way, it doesn't > seem the correct way. > > Something in /sys/block/<device>/queue/ would be more correct: but > perhaps more trouble than it's worth; and so very specific to swap > (and whatever mm/swapfile.c happens to be doing in any release) > that it wouldn't belong very well there either. > > You mentioned an "-o discard" mount option before: so I think what > we ought to be doing is an option to swapon. But you can imagine > that I'd prefer to avoid that too, if we can work this out without it. > > If I could see how bad these SWP_DISCARDABLE discards are for > myself, I might for the moment prefer just to cut out that code, > until we can be more intelligent about it (instead of fixing visible > sysctl/sysfs/swapon options which limit to the current implementation). Yes, I still have these reservations, so haven't pushed forward your patch, nor any such alternative. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2009-12-31 0:33 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-10-30 6:51 unconditional discard calls in the swap code Christoph Hellwig 2009-10-30 17:26 ` Hugh Dickins 2009-11-18 17:12 ` Christoph Hellwig 2009-11-30 17:22 ` [PATCH] mm: don't discard unused swap slots by default Christoph Hellwig 2009-11-30 18:28 ` Hugh Dickins 2009-11-30 18:58 ` Martin K. Petersen 2009-12-31 0:33 ` Hugh Dickins
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox