* Re: Swap Questions (includes possible bug) - swapfile.c / swap.c
[not found] <Pine.LNX.4.03.9905111114210.19954-100000@baltimore.wwaves.com>
@ 1999-05-11 21:30 ` Rik van Riel
1999-05-12 15:42 ` [PATCH] " Joseph Pranevich
0 siblings, 1 reply; 5+ messages in thread
From: Rik van Riel @ 1999-05-11 21:30 UTC
To: Joseph Pranevich; +Cc: Linux Kernel, Linux MM
On Tue, 11 May 1999, Joseph Pranevich wrote:
> I've been gradually sifting my way through the kernel source and I
> have a few minor questions about memory management.
linux-mm@kvack.org (majordomo-managed)
http://www.linux.eu.org/Linux-MM/
> 1) swap.c : page clustering?
> else
> page_cluster = 4;
>
> This is fine, but wouldn't it make sense to generalize this, or is
> the benefit not as great with larger amounts of RAM?
The swapOUT clustering is only done to a maximum of 32 (2^5)
pages, so it doesn't make much sense to read in more pages
(which are probably unrelated to the current process).
For mmap() reading we might want to switch to a smarter
algorithm though. Not with reading in more pages, but with
reading in the _next_ area while the program is still busy
processing this one. The idea is to have all data in memory
just before the process needs it :)
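
A minimal sketch of the arithmetic behind the "32 (2^5) pages" figure, assuming 4 KiB pages. The names cluster_pages, cluster_bytes and SKETCH_PAGE_SIZE are made up for illustration and are not kernel symbols:

```c
#include <assert.h>

/* Assumption for illustration: 4 KiB pages, as on i386. */
#define SKETCH_PAGE_SIZE 4096UL

/* page_cluster is a log2 exponent: one cluster spans 2^page_cluster pages. */
static unsigned long cluster_pages(unsigned int page_cluster)
{
    /* Guard against shifting by the full width of the type. */
    assert(page_cluster < 8 * sizeof(unsigned long));
    return 1UL << page_cluster;
}

/* Bytes transferred when one full cluster is read in. */
static unsigned long cluster_bytes(unsigned int page_cluster)
{
    return cluster_pages(page_cluster) * SKETCH_PAGE_SIZE;
}
```

Under these assumptions page_cluster = 5 means 32 pages, or 128 KiB, per cluster, which is why a read-in value above 5 would fetch more than swap-out ever wrote together.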
> 2) swapfile.c : sys_swapon() question 1
>
> I'm unable to figure out exactly what this code is supposed to be
> doing. Can someone help me out here? I don't understand why we set
> the blocksize twice or what the funniness is with "filp"
>
> p->swap_device = swap_dentry->d_inode->i_rdev;
> set_blocksize(p->swap_device, PAGE_SIZE);
We do I/O on this device in chunks of PAGE_SIZE.
> filp.f_dentry = swap_dentry;
> filp.f_mode = 3; /* read write */
Of course, we want to have our swap device read-write and we
mark it with a magic number so no harm will come to it...
> set_blocksize(p->swap_device, PAGE_SIZE);
Hmm, haven't we seen this one before? Stephen?
> I do apologise for the many questions, I'm just trying to get a
> feel for the swapping subsystem. I apologise if this is already
> documented someplace.
AFAIK it's not yet documented. I'd really appreciate it
if you could do that and send me the docs for inclusion
on the Linux-MM site...
cheers,
Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV: http://www.reseau.nl/ |
| Linux Memory Management site: http://www.linux.eu.org/Linux-MM/ |
| Nederlandse Linux documentatie: http://www.nl.linux.org/ |
+-------------------------------------------------------------------+
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
* [PATCH] Re: Swap Questions (includes possible bug) - swapfile.c / swap.c
1999-05-11 21:30 ` Swap Questions (includes possible bug) - swapfile.c / swap.c Rik van Riel
@ 1999-05-12 15:42 ` Joseph Pranevich
0 siblings, 0 replies; 5+ messages in thread
From: Joseph Pranevich @ 1999-05-12 15:42 UTC
To: Rik van Riel; +Cc: Linux MM
Hello,
Based on what was said below, does this make sense? In addition, I added
the case where we have 64 megs of RAM or more and may want to raise
page_cluster to 5 (which, if I understand your message, is the highest
that we could reasonably want it).
We could, in theory, do this better and compute the value on the fly,
using the swap-out page cluster size as a maximum, but there doesn't
appear to be a benefit at this point. If, however, the maximum is
architecture-dependent (I haven't checked), it would definitely be
advantageous to generalize this further.
(I do have most of the code written to do that for my own personal
justification, but it doesn't yet check for an upper bound.)
Joe
--- swap.c.old Tue May 11 17:42:02 1999
+++ swap.c Wed May 12 09:29:49 1999
@@ -11,6 +11,7 @@
* Started 18.12.91
* Swap aging added 23.2.95, Stephen Tweedie.
* Buffermem limits added 12.3.98, Rik van Riel.
+ * Additional documentation/code added 11.5.99, Joseph Pranevich
*/
#include <linux/mm.h>
@@ -70,11 +71,31 @@
void __init swap_setup(void)
{
- /* Use a smaller cluster for memory <16MB or <32MB */
+ /* The number for page_cluster can be approximately determined
+ using the formula:
+
+ floor ( log2(M / 2) )
+
+ Where M is the size of memory in megabytes.
+
+ However, the maximum page_cluster value for swapping out
+ is 5, so it does not make sense to have a higher value here
+ unless that is changed. We also do not ever want to have
+ page_cluster be less than 2.
+
+ With those constraints in mind, we have chosen to implement
+ this like a switch and not calculate the value in code. This
+ should hopefully make this more readable. However, if the
+ maximum cluster value for swapping out is increased, it may
+ make sense to generalize this code then.
+ */
+
if (num_physpages < ((16 * 1024 * 1024) >> PAGE_SHIFT))
page_cluster = 2;
else if (num_physpages < ((32 * 1024 * 1024) >> PAGE_SHIFT))
page_cluster = 3;
- else
+ else if (num_physpages < ((64 * 1024 * 1024) >> PAGE_SHIFT))
page_cluster = 4;
+ else
+ page_cluster = 5;
}
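
A sketch of how the closed form and the if/else chain in the patch line up, assuming memory is given in whole megabytes rather than pages. The helper names (floor_log2, cluster_from_formula, cluster_from_chain) are hypothetical and exist only for this comparison. Note the divisor: M / 2 agrees with the chain at every boundary, while M / 4 would come out one lower from 16 MB up.

```c
#include <assert.h>

/* Hypothetical helper: floor(log2(x)) for x >= 1, by counting shifts. */
static int floor_log2(unsigned long x)
{
    int n = 0;

    assert(x >= 1);
    while (x >>= 1)
        n++;
    return n;
}

/* Closed form: floor(log2(M / 2)) clamped to [2, 5], M in megabytes. */
static int cluster_from_formula(unsigned long mem_mb)
{
    int c = (mem_mb >= 2) ? floor_log2(mem_mb / 2) : 0;

    if (c < 2)
        c = 2;   /* never go below 2 */
    if (c > 5)
        c = 5;   /* swap-out clusters at most 2^5 pages */
    return c;
}

/* The if/else chain from the patch, restated in megabytes. */
static int cluster_from_chain(unsigned long mem_mb)
{
    if (mem_mb < 16)
        return 2;
    else if (mem_mb < 32)
        return 3;
    else if (mem_mb < 64)
        return 4;
    else
        return 5;
}
```

Writing the chain out explicitly, as the patch does, keeps the boundaries visible at a glance; the closed form would only pay off if the upper limit of 5 were lifted.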
On Tue, 11 May 1999, Rik van Riel wrote:
> On Tue, 11 May 1999, Joseph Pranevich wrote:
>
> > I've been gradually sifting my way through the kernel source and I
> > have a few minor questions about memory management.
>
> linux-mm@kvack.org (majordomo-managed)
> http://www.linux.eu.org/Linux-MM/
>
> > 1) swap.c : page clustering?
>
> > else
> > page_cluster = 4;
> >
> > This is fine, but wouldn't it make sense to generalize this, or is
> > the benefit not as great with larger amounts of RAM?
>
> The swapOUT clustering is only done to a maximum of 32 (2^5)
> pages, so it doesn't make much sense to read in more pages
> (which are probably unrelated to the current process).
>
> For mmap() reading we might want to switch to a smarter
> algorithm though. Not with reading in more pages, but with
> reading in the _next_ area while the program is still busy
> processing this one. The idea is to have all data in memory
> just before the process needs it :)
>
>
> > 2) swapfile.c : sys_swapon() question 1
> >
> > I'm unable to figure out exactly what this code is supposed to be
> > doing. Can someone help me out here? I don't understand why we set
> > the blocksize twice or what the funniness is with "filp"
> >
> > p->swap_device = swap_dentry->d_inode->i_rdev;
> > set_blocksize(p->swap_device, PAGE_SIZE);
>
> We do I/O on this device in chunks of PAGE_SIZE.
>
> > filp.f_dentry = swap_dentry;
> > filp.f_mode = 3; /* read write */
>
> Of course, we want to have our swap device read-write and we
> mark it with a magic number so no harm will come to it...
>
> > set_blocksize(p->swap_device, PAGE_SIZE);
>
> Hmm, haven't we seen this one before? Stephen?
>
>
> > I do apologise for the many questions, I'm just trying to get a
> > feel for the swapping subsystem. I apologise if this is already
> > documented someplace.
>
> AFAIK it's not yet documented. I'd really appreciate it
> if you could do that and send me the docs for inclusion
> on the Linux-MM site...
>
> cheers,
>
> Rik -- Open Source: you deserve to be in control of your data.
> +-------------------------------------------------------------------+
> | Le Reseau netwerksystemen BV: http://www.reseau.nl/ |
> | Linux Memory Management site: http://www.linux.eu.org/Linux-MM/ |
> | Nederlandse Linux documentatie: http://www.nl.linux.org/ |
> +-------------------------------------------------------------------+
>
* Re: Swap Questions (includes possible bug) - swapfile.c / swap.c
1999-05-12 18:36 ` Stephen C. Tweedie
@ 1999-05-12 19:45 ` Manfred Spraul
0 siblings, 0 replies; 5+ messages in thread
From: Manfred Spraul @ 1999-05-12 19:45 UTC
To: Stephen C. Tweedie
Cc: Manfred Spraul, Rik van Riel, Joseph Pranevich, Linux Kernel, Linux MM
[-- Attachment #1: Type: text/plain, Size: 1432 bytes --]
"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Wed, 12 May 1999 12:30:27 +0200, "Manfred Spraul"
> <masp0008@stud.uni-sb.de> said:
>
> > There is another problem with this line:
> > set_blocksize() also means that the previous block size
> > doesn't work anymore:
> > if you accidentally enter 'swapon /dev/hda1' (my root drive)
> > instead of 'swapon /dev/hda3', then you have to fsck:
>
> Yep, it would make perfect sense to move the set_blocksize to be after
> the EBUSY check.
Unfortunately that doesn't solve the problem:
The current EBUSY check checks that the partition is not used as a
swap partition, it doesn't check the VFS, and it doesn't check
whether the RAID driver uses the volume.
I've attached an old patch (vs. 2.2.6):
I sent that patch to linux-kernel@vger, Alan (..wait until Linus
returns from vacation..), and Linus (no reply).
The patch adds a bitmap to the block cache for EBUSY checks.
Actually, we can use this bitmap for other bits if we use devfs and
dynamic MAJOR/MINOR codes:
we must replace all 'MAJOR==LOOP', 'MAJOR==IDE' etc. if we want to
support dynamic block device MAJOR/MINOR's.
Additionally, we save 6-8 kB kernel memory. (ro_bits was an 8 kB
static array).
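
The per-minor bit indexing this relies on can be sketched in userspace as follows. This is an illustration only: SKETCH_MINORBITS, struct minor_bitmap, bitmap_assign and bitmap_test are hypothetical names, and 8-bit minor numbers are an assumption.

```c
#include <assert.h>

/* Assumption: 8-bit minor numbers, i.e. 256 minors per major. */
#define SKETCH_MINORBITS 8

/* One bit per minor: 256 / 8 = 32 bytes per bitmap. */
struct minor_bitmap {
    unsigned char bits[(1U << SKETCH_MINORBITS) / 8];
};

/* Set (flag != 0) or clear (flag == 0) the bit belonging to one minor. */
static void bitmap_assign(struct minor_bitmap *bm, unsigned int minor, int flag)
{
    assert(minor < (1U << SKETCH_MINORBITS));
    if (flag)
        bm->bits[minor >> 3] |= (unsigned char)(1U << (minor & 7));
    else
        bm->bits[minor >> 3] &= (unsigned char)~(1U << (minor & 7));
}

/* Return 1 if the bit for this minor is set, 0 otherwise. */
static int bitmap_test(const struct minor_bitmap *bm, unsigned int minor)
{
    assert(minor < (1U << SKETCH_MINORBITS));
    return (bm->bits[minor >> 3] >> (minor & 7)) & 1;
}
```

With byte-wide storage the byte index is minor >> 3 and the bit index is minor & 7. Two such bitmaps (read-only and busy) cost 64 bytes per major, and only for majors that are actually touched, which is where the memory saving over a fully static array comes from.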
If you think that the patch is useful, then I'll make a new patch
vs. 2.3.0; otherwise I'll wait until devfs is added, and I'll
try to write a larger patch (dynamic MAJOR/MINOR for block cache)
that includes this one.
--
Manfred
[-- Attachment #2: patch_busy-2.2.6 --]
[-- Type: text/plain, Size: 5707 bytes --]
diff -r -u -P -x CVS -x *,v 2.2.6/drivers/block/ll_rw_blk.c current/drivers/block/ll_rw_blk.c
--- 2.2.6/drivers/block/ll_rw_blk.c Wed Mar 31 00:56:57 1999
+++ current/drivers/block/ll_rw_blk.c Thu Apr 22 18:02:20 1999
@@ -16,6 +16,7 @@
#include <linux/config.h>
#include <linux/locks.h>
#include <linux/mm.h>
+#include <linux/slab.h>
#include <linux/init.h>
#include <asm/system.h>
@@ -241,8 +242,24 @@
}
/* RO fail safe mechanism */
+/* device busy: (C) Manfred Spraul masp0008@stud.uni-sb.de */
-static long ro_bits[MAX_BLKDEV][8];
+struct kdev_bits {
+ unsigned char ro_bits[(1U << MINORBITS)/8];
+ unsigned char busy_bits[(1U << MINORBITS)/8];
+};
+
+static struct kdev_bits* kdev_info[MAX_BLKDEV] = { NULL, NULL };
+
+#define ALLOC_KDEV_BITS(major) \
+ if (kdev_info[major] == NULL) { \
+ kdev_info[major] = kmalloc(sizeof(struct kdev_bits),GFP_KERNEL); \
+ if(kdev_info[major] == NULL) { \
+ printk("ALLOC_KDEV_BITS() failed due to ENOMEM.\n"); \
+ return; \
+ } \
+ memset(kdev_info[major],0,sizeof(struct kdev_bits)); \
+ }
int is_read_only(kdev_t dev)
{
@@ -251,7 +268,8 @@
major = MAJOR(dev);
minor = MINOR(dev);
if (major < 0 || major >= MAX_BLKDEV) return 0;
- return ro_bits[major][minor >> 5] & (1 << (minor & 31));
+ if (kdev_info[major] == NULL) return 0;
+ return kdev_info[major]->ro_bits[minor >> 3] & (1 << (minor & 7));
}
void set_device_ro(kdev_t dev,int flag)
@@ -261,10 +279,39 @@
major = MAJOR(dev);
minor = MINOR(dev);
if (major < 0 || major >= MAX_BLKDEV) return;
- if (flag) ro_bits[major][minor >> 5] |= 1 << (minor & 31);
- else ro_bits[major][minor >> 5] &= ~(1 << (minor & 31));
+ ALLOC_KDEV_BITS(major)
+ if (flag)
+ kdev_info[major]->ro_bits[minor >> 3] |= 1 << (minor & 7);
+ else
+ kdev_info[major]->ro_bits[minor >> 3] &= ~(1 << (minor & 7));
+}
+
+int is_device_busy(kdev_t dev)
+{
+ int minor,major;
+
+ major = MAJOR(dev);
+ minor = MINOR(dev);
+ if (major < 0 || major >= MAX_BLKDEV) return 0;
+ if (kdev_info[major] == NULL) return 0;
+ return kdev_info[major]->busy_bits[minor >> 3] & (1 << (minor & 7));
}
+void set_device_busy(kdev_t dev,int flag)
+{
+ int minor,major;
+
+ major = MAJOR(dev);
+ minor = MINOR(dev);
+ if (major < 0 || major >= MAX_BLKDEV) return;
+ ALLOC_KDEV_BITS(major)
+ if (flag)
+ kdev_info[major]->busy_bits[minor >> 3] |= 1 << (minor & 7);
+ else
+ kdev_info[major]->busy_bits[minor >> 3] &= ~(1 << (minor & 7));
+}
+
+
static inline void drive_stat_acct(int cmd, unsigned long nr_sectors,
short disk_index)
{
@@ -731,7 +778,6 @@
req->rq_status = RQ_INACTIVE;
req->next = NULL;
}
- memset(ro_bits,0,sizeof(ro_bits));
memset(max_readahead, 0, sizeof(max_readahead));
memset(max_sectors, 0, sizeof(max_sectors));
#ifdef CONFIG_AMIGA_Z2RAM
diff -r -u -P -x CVS -x *,v 2.2.6/fs/super.c current/fs/super.c
--- 2.2.6/fs/super.c Tue Apr 20 13:41:57 1999
+++ current/fs/super.c Thu Apr 22 18:02:20 1999
@@ -131,6 +131,7 @@
vfsmnttail->mnt_next = lptr;
vfsmnttail = lptr;
}
+ set_device_busy(sb->s_dev,1);
out:
return lptr;
}
@@ -165,6 +166,8 @@
kfree(tofree->mnt_devname);
kfree(tofree->mnt_dirname);
kfree_s(tofree, sizeof(struct vfsmount));
+
+ set_device_busy(dev,0);
}
int register_filesystem(struct file_system_type * fs)
@@ -873,6 +876,8 @@
if (dir_d->d_covers != dir_d)
goto dput_and_out;
+ if (is_device_busy(dev))
+ goto dput_and_out;
/*
* Note: If the superblock already exists,
* read_super just does a get_super().
diff -r -u -P -x CVS -x *,v 2.2.6/include/linux/fs.h current/include/linux/fs.h
--- 2.2.6/include/linux/fs.h Tue Apr 20 13:41:58 1999
+++ current/include/linux/fs.h Thu Apr 22 18:02:20 1999
@@ -839,6 +839,8 @@
extern struct buffer_head * find_buffer(kdev_t dev, int block, int size);
extern void ll_rw_block(int, int, struct buffer_head * bh[]);
extern int is_read_only(kdev_t);
+extern int is_device_busy(kdev_t);
+extern void set_device_busy(kdev_t dev, int flag);
extern void __brelse(struct buffer_head *);
extern inline void brelse(struct buffer_head *buf)
{
diff -r -u -P -x CVS -x *,v 2.2.6/kernel/ksyms.c current/kernel/ksyms.c
--- 2.2.6/kernel/ksyms.c Wed Mar 31 00:56:57 1999
+++ current/kernel/ksyms.c Thu Apr 22 18:02:20 1999
@@ -47,7 +47,7 @@
#endif
extern char *get_options(char *str, int *ints);
-extern void set_device_ro(kdev_t dev,int flag);
+extern void set_device_ro(kdev_t dev, int flag);
extern struct file_operations * get_blkfops(unsigned int);
extern int blkdev_release(struct inode * inode);
#if !defined(CONFIG_NFSD) && defined(CONFIG_NFSD_MODULE)
@@ -209,6 +209,8 @@
EXPORT_SYMBOL(blk_dev);
EXPORT_SYMBOL(is_read_only);
EXPORT_SYMBOL(set_device_ro);
+EXPORT_SYMBOL(is_device_busy);
+EXPORT_SYMBOL(set_device_busy);
EXPORT_SYMBOL(bmap);
EXPORT_SYMBOL(sync_dev);
EXPORT_SYMBOL(get_blkfops);
diff -r -u -P -x CVS -x *,v 2.2.6/mm/swapfile.c current/mm/swapfile.c
--- 2.2.6/mm/swapfile.c Wed Mar 31 00:56:57 1999
+++ current/mm/swapfile.c Thu Apr 22 18:02:20 1999
@@ -414,6 +414,7 @@
filp.f_op->release(dentry->d_inode,&filp);
filp.f_op->release(dentry->d_inode,&filp);
}
+ set_device_busy(p->swap_device,0);
}
dput(dentry);
@@ -531,6 +532,10 @@
if (S_ISBLK(swap_dentry->d_inode->i_mode)) {
p->swap_device = swap_dentry->d_inode->i_rdev;
+ if(is_device_busy(p->swap_device)) {
+ error = -EBUSY;
+ goto bad_swap;
+ }
set_blocksize(p->swap_device, PAGE_SIZE);
filp.f_dentry = swap_dentry;
@@ -686,6 +691,8 @@
swap_info[prev].next = p - swap_info;
}
error = 0;
+ if(p->swap_device != 0)
+ set_device_busy(p->swap_device,1);
goto out;
bad_swap:
if(filp.f_op && filp.f_op->release)
* Re: Swap Questions (includes possible bug) - swapfile.c / swap.c
1999-05-12 10:30 Manfred Spraul
@ 1999-05-12 18:36 ` Stephen C. Tweedie
1999-05-12 19:45 ` Manfred Spraul
0 siblings, 1 reply; 5+ messages in thread
From: Stephen C. Tweedie @ 1999-05-12 18:36 UTC
To: Manfred Spraul; +Cc: Rik van Riel, Joseph Pranevich, Linux Kernel, Linux MM
Hi,
On Wed, 12 May 1999 12:30:27 +0200, "Manfred Spraul"
<masp0008@stud.uni-sb.de> said:
> There is another problem with this line:
> set_blocksize() also means that the previous block size
> doesn't work anymore:
> if you accidentally enter 'swapon /dev/hda1' (my root drive)
> instead of 'swapon /dev/hda3', then you have to fsck:
Yep, it would make perfect sense to move the set_blocksize to be after
the EBUSY check.
--Stephen
* Re: Swap Questions (includes possible bug) - swapfile.c / swap.c
@ 1999-05-12 10:30 Manfred Spraul
1999-05-12 18:36 ` Stephen C. Tweedie
0 siblings, 1 reply; 5+ messages in thread
From: Manfred Spraul @ 1999-05-12 10:30 UTC
To: Rik van Riel, Joseph Pranevich; +Cc: Linux Kernel, Linux MM
>On Tue, 11 May 1999, Joseph Pranevich wrote:
> case 2:
> error = -EINVAL;
> if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
> goto bad_swap;
MAX_SWAP_BADPAGES is a limitation of the swap format 2,
it's not a kernel limitation. (check include/linux/swap.h)
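
To make the format limit concrete, here is a sketch of where it comes from. The layout below is written from memory of the version 2 swap header and should be checked against include/linux/swap.h before being relied on. The sketch_ names and the 4 KiB page size are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096

/* Two views of the first page of a swap area (layout recalled, not copied). */
union sketch_swap_header {
    struct {
        char reserved[SKETCH_PAGE_SIZE - 10];
        char magic[10];            /* signature in the last 10 bytes */
    } magic;
    struct {
        char bootbits[1024];       /* space for a disklabel etc. */
        unsigned int version;
        unsigned int last_page;
        unsigned int nr_badpages;
        unsigned int padding[125];
        unsigned int badpages[1];  /* list grows up towards the signature */
    } info;
};

/* Bad-page slots that fit between the start of the list and the signature. */
static size_t sketch_max_badpages(void)
{
    size_t list_start = offsetof(union sketch_swap_header, info.badpages);
    size_t magic_start = offsetof(union sketch_swap_header, magic.magic);

    assert(magic_start > list_start);
    return (magic_start - list_start) / sizeof(unsigned int);
}
```

With 4-byte ints and 4 KiB pages this works out to (4086 - 1536) / 4 = 637 entries: the bad-page list simply runs out of room before the signature, so the bound is a property of the on-disk format and not of the kernel.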
Rik wrote:
>On Tue, 11 May 1999, Joseph Pranevich wrote:
>> set_blocksize(p->swap_device, PAGE_SIZE);
>
>Hmm, haven't we seen this one before? Stephen?
There is another problem with this line:
set_blocksize() also means that the previous block size
doesn't work anymore:
if you accidentally enter 'swapon /dev/hda1' (my root drive)
instead of 'swapon /dev/hda3', then you have to fsck:
sys_swapon sets the blocksize, then it rejects the call
because there is no swap signature, but now ext2
can't access the partition (blocksize 4096, ext2 needs 1024).
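
The ordering problem can be shown with a small userspace model. Everything here is hypothetical (struct sketch_dev and both functions are stand-ins, not kernel API); it only illustrates why the busy check has to come before the block size is touched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for a block device. */
struct sketch_dev {
    int blocksize;       /* block size the current user expects */
    int busy;            /* nonzero while mounted or otherwise in use */
    int has_swap_magic;  /* nonzero if a swap signature is present */
};

/* Problematic order: the block size changes before anything is validated. */
static int swapon_mutate_first(struct sketch_dev *d)
{
    assert(d != NULL);
    d->blocksize = SKETCH_PAGE_SIZE;
    if (!d->has_swap_magic)
        return -EINVAL;   /* rejected, but the block size is already changed */
    d->busy = 1;
    return 0;
}

/* Safer order: refuse busy devices first, so a mounted one is never touched. */
static int swapon_check_first(struct sketch_dev *d)
{
    assert(d != NULL);
    if (d->busy)
        return -EBUSY;
    d->blocksize = SKETCH_PAGE_SIZE;  /* only reached for idle devices */
    if (!d->has_swap_magic)
        return -EINVAL;
    d->busy = 1;
    return 0;
}
```

In this model a mounted root device with a 1024-byte block size ends up at 4096 after the first variant fails, while the second variant returns -EBUSY and leaves it at 1024. The check only helps if "busy" covers every user of the device, not just swap.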
I posted a patch a few weeks ago, but I received no reply.
Are such problems ignored? (The superuser can crash the
machine at will, so one more crash doesn't matter.)
Regards,
Manfred