swapcache bug?

linux-mm.kvack.org archive mirror

* swapcache bug?
@ 1999-02-07 18:21 Manfred Spraul
  1999-02-07 21:30 ` Eric W. Biederman
  1999-02-08 16:39 ` [PATCH] " Stephen C. Tweedie
  0 siblings, 2 replies; 10+ messages in thread
From: Manfred Spraul @ 1999-02-07 18:21 UTC (permalink / raw)
  To: linux-mm

I'm currently debugging my physical memory ramdisk, and I see lots of
entries in the page cache that have 'page->offset' which aren't
multiples of 4096. (they are multiples of 256)
All of them belong to swapper_inode.

If this is the intended behaviour, then page_hash() should be changed:
it assumes that 'page->offset' is a multiple of 4096.

If this should not happen, please ask me for further details.

Note that there is NO crash, just lots of entries with the same hash
value.
---
- 2.2.1 kernel
- 12 MB Ram
- 72256 kB Swap-partition
---
	Manfred
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/


* Re: swapcache bug?
  1999-02-07 18:21 swapcache bug? Manfred Spraul
@ 1999-02-07 21:30 ` Eric W. Biederman
  1999-02-08 16:39 ` [PATCH] " Stephen C. Tweedie
  1 sibling, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 1999-02-07 21:30 UTC (permalink / raw)
  To: masp0008; +Cc: linux-mm

>>>>> "MS" == Manfred Spraul <masp0008@stud.uni-sb.de> writes:

MS> I'm currently debugging my physical memory ramdisk, and I see lots of
MS> entries in the page cache that have 'page->offset' which aren't
MS> multiples of 4096. (they are multiples of 256)
MS> All of them belong to swapper_inode.

MS> If this is the intended behaviour, then page_hash() should be changed:
MS> it assumes that 'page->offset' is a multiple of 4096.

Yes.  For the swap cache we store the swap entry, which already has the
page size shifted out of it, but it is also set up so you can store it
directly in a pte, which leaves some of its bits 0.
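
A minimal user-space sketch of that encoding may help. It assumes the
i386 layout of this era as recalled from the headers (bit 0 clear so the
pte reads as not present, swap type in bits 1-6, page index from bit 8
up). None of these macro bodies are quoted in this thread, so the
function names and bit positions below are assumptions.

```c
#include <assert.h>

/* Assumed i386 swap entry layout (not quoted in the thread):
 *   bit 0      clear, so the value can sit in a pte as "not present"
 *   bits 1-6   swap type (which swap area)
 *   bits 8+    page index within that swap area
 */
static unsigned long swp_entry(unsigned long type, unsigned long index)
{
	assert(type < 64);	/* type must fit in six bits */
	return (type << 1) | (index << 8);
}

static unsigned long swp_type(unsigned long entry)
{
	return (entry >> 1) & 0x3f;
}

static unsigned long swp_offset(unsigned long entry)
{
	return entry >> 8;
}
```

With a single swap area (type 0), every entry is a multiple of 256 and
most are not multiples of 4096, which matches what Manfred observed in
'page->offset'.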

Good spotting, but unless someone can show a significant performance impact,
changing page_hash should wait for 2.3.

Eric

* [PATCH] Re: swapcache bug?
  1999-02-07 18:21 swapcache bug? Manfred Spraul
  1999-02-07 21:30 ` Eric W. Biederman
@ 1999-02-08 16:39 ` Stephen C. Tweedie
  1999-02-08 17:32   ` Linus Torvalds
  1 sibling, 1 reply; 10+ messages in thread
From: Stephen C. Tweedie @ 1999-02-08 16:39 UTC (permalink / raw)
  To: masp0008, Linus Torvalds; +Cc: linux-mm, Stephen Tweedie

Hi,

On Sun, 07 Feb 1999 19:21:38 +0100, Manfred Spraul
<masp0008@stud.uni-sb.de> said:

> I'm currently debugging my physical memory ramdisk, and I see lots of
> entries in the page cache that have 'page->offset' which aren't
> multiples of 4096. (they are multiples of 256)
> All of them belong to swapper_inode.

That is normal.

> If this is the intended behaviour, then page_hash() should be changed:
> it assumes that 'page->offset' is a multiple of 4096.

Good point, the line include/linux/pagemap.h:39,

	return s(i+o) & (PAGE_HASH_SIZE-1);

should probably be 

	return s(i+o+offset) & (PAGE_HASH_SIZE-1);

to mix in the low order bits for swap entries.  Well spotted.  Anyone
see anything wrong with this one-liner change?
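
The effect of the one-liner can be sketched in user space. Only the two
return lines above come from pagemap.h as quoted; the definitions of o
and s, the value PAGE_HASH_BITS = 11, and passing the inode term i as a
plain number are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SHIFT      12			/* 4096-byte pages */
#define PAGE_HASH_BITS  11			/* assumed table size */
#define PAGE_HASH_SIZE  (1UL << PAGE_HASH_BITS)

typedef unsigned long (*hash_fn)(unsigned long i, unsigned long offset);

/* Assumed shape of s(): fold the high bits down onto the low ones. */
static unsigned long fold(unsigned long x)
{
	return x + (x >> PAGE_HASH_BITS);
}

/* Current form: only offset >> PAGE_SHIFT contributes. */
static unsigned long hash_current(unsigned long i, unsigned long offset)
{
	unsigned long o = offset >> PAGE_SHIFT;
	return fold(i + o) & (PAGE_HASH_SIZE - 1);
}

/* Proposed form: the raw offset is mixed in as well. */
static unsigned long hash_proposed(unsigned long i, unsigned long offset)
{
	unsigned long o = offset >> PAGE_SHIFT;
	return fold(i + o + offset) & (PAGE_HASH_SIZE - 1);
}

/* Count the distinct buckets hit by n swap-style offsets 0, 256, 512, ... */
static size_t distinct_buckets(hash_fn hash, unsigned long i, size_t n)
{
	unsigned char seen[PAGE_HASH_SIZE];
	size_t k, count = 0;

	memset(seen, 0, sizeof(seen));
	for (k = 0; k < n; k++) {
		unsigned long bucket = hash(i, k * 256UL);

		assert(bucket < PAGE_HASH_SIZE);
		if (!seen[bucket]) {
			seen[bucket] = 1;
			count++;
		}
	}
	return count;
}
```

Under these assumptions, sixteen consecutive swap entries (offsets 0 to
3840) all land in one bucket with the current form and in sixteen
different buckets with the proposed form.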

--Stephen

* Re: [PATCH] Re: swapcache bug?
  1999-02-08 16:39 ` [PATCH] " Stephen C. Tweedie
@ 1999-02-08 17:32   ` Linus Torvalds
  1999-02-08 17:51     ` Stephen C. Tweedie
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 1999-02-08 17:32 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: masp0008, linux-mm


On Mon, 8 Feb 1999, Stephen C. Tweedie wrote:
> 
> Good point, the line include/linux/pagemap.h:39,
> 
> 	return s(i+o) & (PAGE_HASH_SIZE-1);
> 
> should probably be 
> 
> 	return s(i+o+offset) & (PAGE_HASH_SIZE-1);
> 
> to mix in the low order bits for swap entries.  Well spotted.  Anyone
> see anything wrong with this one-liner change?

Yes, the above will potentially result in different hash entries for the
same page, which means that we now have aliasing and basically just random
behaviour. 

It _may_ be that the hash function is always called with a page-aligned
offset, but that was not how it was strictly meant to be: the way the
thing was envisioned you could just find the page at "offset" by doing

	page_hash(inode,offset)

without page-aligning offset before you did this.

If anything, maybe the swap cache should just use the high bits in the
"offset" field (or at least prefer to do so: something like

	page->offset = swap_entry_to_offset(entry);

and 
	entry = offset_to_swap_entry(page->offset);

that does a PAGE_MASK_BITS rotate on the bits).
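
One possible reading of that suggestion, sketched in user space: rotate
the 32-bit swap entry left by the page shift when storing it, and rotate
it back when reading. The two function names come from the message
above; their bodies, the use of PAGE_SHIFT as the rotate count, and the
fixed 32-bit width are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumed rotate count: 4096-byte pages */

/* Rotate left so the entry's significant bits sit above PAGE_SHIFT,
 * where a hash that uses offset >> PAGE_SHIFT will see them. */
static uint32_t swap_entry_to_offset(uint32_t entry)
{
	return (entry << PAGE_SHIFT) | (entry >> (32 - PAGE_SHIFT));
}

/* Inverse rotation: recover the original swap entry. */
static uint32_t offset_to_swap_entry(uint32_t offset)
{
	return (offset >> PAGE_SHIFT) | (offset << (32 - PAGE_SHIFT));
}
```

Because a rotation loses no bits, the round trip is exact, and the
existing hash function could stay unchanged.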

		Linus


* Re: [PATCH] Re: swapcache bug?
  1999-02-08 17:32   ` Linus Torvalds
@ 1999-02-08 17:51     ` Stephen C. Tweedie
  1999-02-08 18:48       ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen C. Tweedie @ 1999-02-08 17:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stephen C. Tweedie, masp0008, linux-mm

Hi,

On Mon, 8 Feb 1999 09:32:24 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> It _may_ be that the hash function is always called with a page-aligned
> offset, but that was not how it was strictly meant to be: the way the
> thing was envisioned you could just find the page at "offset" by doing

> 	page_hash(inode,offset)

It does appear to be: we enforce it pretty much everywhere I can see,
with one possible exception: filemap_nopage(), which assumes
area->vm_offset is already page-aligned.  I think we can still violate
that internally if we are mapping a ZMAGIC binary (urgh), but the VM
breaks anyway if we do that: update_vm_cache cannot deal with such
pages, for a start.

The assumption that we might have flexible offsets will break
__find_page massively anyway, because we _always_ lookup the struct page
by exact match on the offset; __find_page never tries to align things
itself.
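
A minimal sketch of an exact-match lookup shows the consequence. The
struct and function here are simplified stand-ins written for this
illustration, not the kernel's struct page or __find_page.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a page cache entry on one hash chain. */
struct page {
	const void *inode;		/* owner of the page */
	unsigned long offset;		/* offset exactly as stored */
	struct page *next_hash;		/* next entry in the same bucket */
};

/* Walk the chain and compare inode and offset exactly; no alignment
 * is applied to the offset on the way in. */
static struct page *find_page(struct page *chain, const void *inode,
			      unsigned long offset)
{
	for (; chain != NULL; chain = chain->next_hash)
		if (chain->inode == inode && chain->offset == offset)
			return chain;
	return NULL;
}
```

A page stored at offset 4096 is found only by asking for 4096; asking
for 4100 misses it, so callers already have to pass the stored offset.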

Linus, I know Matti Aarnio has been working on supporting >32bit offsets
on Intel, and for that we really do need to start using the low bits in
the page offset for something more useful than MBZ padding.  If there is
a long-term desire to keep those bits in the offset insignificant then
that will really hurt his work; otherwise, I can't see mixing in the
low-order bits to the page hash breaking anything new.

> If anything, maybe the swap cache should just use the high bits in the
> "offset" field 

Yes, we can certainly do that to fix the current hash collision problems,
but since there are long term reasons for using more bits of
significance in the page cache offset, it would be good to know whether
you'd be willing to entertain that possibility.  If so, we'll need a
hash function which observes the low bits anyway.

--Stephen

* Re: [PATCH] Re: swapcache bug?
  1999-02-08 17:51     ` Stephen C. Tweedie
@ 1999-02-08 18:48       ` Linus Torvalds
  1999-02-08 21:13         ` Matti Aarnio
  1999-02-09  7:15         ` Eric W. Biederman
  0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 1999-02-08 18:48 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: masp0008, linux-mm

On Mon, 8 Feb 1999, Stephen C. Tweedie wrote:
> 
> It does appear to be: we enforce it pretty much everywhere I can see,
> with one possible exception: filemap_nopage(), which assumes
> area->vm_offset is already page-aligned.  I think we can still violate
> that internally if we are mapping a ZMAGIC binary (urgh), but the VM
> breaks anyway if we do that: update_vm_cache cannot deal with such
> pages, for a start.

This was done on purpose: it still works as a mapping, but it isn't
coherent with regards to writes to the file. That's fine, as writing to an
executable while it has been mapped is a losing proposition anyway, and
you can't get access through these non-page-aligned mappings any other way
(the "mmap()" system calls etc will all enforce page-aligned regions,
because coherency just wouldn't be possible otherwise). 

> The assumption that we might have flexible offsets will break
> __find_page massively anyway, because we _always_ lookup the struct page
> by exact match on the offset; __find_page never tries to align things
> itself.

Good point.

> Linus, I know Matti Aarnio has been working on supporting >32bit offsets
> on Intel, and for that we really do need to start using the low bits in
> the page offset for something more useful than MBZ padding. 

Yes. The page offset will become a "sector offset" (I'd actually like to
make it a page number, but then I'd have to break ZMAGIC dynamic loading
due to the fractional page offsets, so it's not worth it for three extra
bits), and that gives you 41 bits of addressing even on a 32-bit machine.
Which is plenty - considering that by the time you need more than that
you'd _really_ better be running on a larger machine anyway. 
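
The arithmetic behind the 41 bits and the "three extra bits" can be
checked with a short sketch, assuming 512-byte sectors, 4096-byte pages
and a 32-bit index. The helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT  9		/* 512-byte sectors */
#define PAGE_SHIFT    12	/* 4096-byte pages */
#define INDEX_BITS    32	/* width of the index on a 32-bit machine */

/* Bits of byte address covered by an index counted in 2^unit_shift units. */
static unsigned addressable_bits(unsigned index_bits, unsigned unit_shift)
{
	return index_bits + unit_shift;
}

/* Byte offset of the start of a sector, computed in 64 bits. */
static uint64_t sector_index_to_bytes(uint32_t sector)
{
	return (uint64_t)sector << SECTOR_SHIFT;
}
```

A sector index gives 32 + 9 = 41 bits (2 TB); a page index would give
32 + 12 = 44 bits, which is the three-bit difference mentioned above.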

Note that some patches I saw (I think by Matti) made "page->offset" a long
long, and that is never going to happen. That's just a stupid waste of
time and memory.

>						 If there is
> a long-term desire to keep those bits in the offset insignificant then
> that will really hurt his work; otherwise, I can't see mixing in the
> low-order bits to the page hash breaking anything new.

Ok, you convinced me. 

		Linus


* Re: [PATCH] Re: swapcache bug?
  1999-02-08 18:48       ` Linus Torvalds
@ 1999-02-08 21:13         ` Matti Aarnio
  1999-02-09  7:15         ` Eric W. Biederman
  1 sibling, 0 replies; 10+ messages in thread
From: Matti Aarnio @ 1999-02-08 21:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: sct, masp0008, linux-mm

Linus Torvalds <torvalds@transmeta.com> wrote:
...
> > Linus, I know Matti Aarnio has been working on supporting >32bit offsets
> > on Intel, and for that we really do need to start using the low bits in
> > the page offset for something more useful than MBZ padding. 
> 
> Yes. The page offset will become a "sector offset" (I'd actually like to
> make it a page number, but then I'd have to break ZMAGIC dynamic loading
> due to the fractional page offsets, so it's not worth it for three extra
> bits), and that gives you 41 bits of addressing even on a 32-bit machine.
> Which is plenty - considering that by the time you need more than that
> you'd _really_ better be running on a larger machine anyway. 

	I forgot (and didn't log) who sent me a patch to my L-F-S stuff
	for ZMAGIC page misalignment reporting.  (It was somebody here
	on the linux-mm list.)  His comment was that only *very old* systems
	contain ZMAGIC files whose alignment is not already at page
	granularity.

	Given certain limitations in low-level block drivers, using that
	'sector index' idea might be worthwhile.  It gives us essentially up
	to 512 * 4GB or 2 TB file sizes, which matches current low-level
	limitations.

	However, now doing page offset work, we might need to mask the low
	bits of the sector index to do page cache searches.  (Unless the
	alignment is always guaranteed ?)
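
	A sketch of such masking, assuming 512-byte sectors and
	4096-byte pages (eight sectors per page).  The helper names are
	hypothetical and appear nowhere in the patches discussed here.

```c
#include <assert.h>

#define SECTOR_SHIFT      9
#define PAGE_SHIFT        12
#define SECTORS_PER_PAGE  (1UL << (PAGE_SHIFT - SECTOR_SHIFT))	/* 8 */

/* Sector index of the page-aligned start: the key for a cache search. */
static unsigned long sector_to_page_key(unsigned long sector)
{
	return sector & ~(SECTORS_PER_PAGE - 1);
}

/* Which sector inside that page the index refers to (0..7). */
static unsigned long sector_within_page(unsigned long sector)
{
	return sector & (SECTORS_PER_PAGE - 1);
}
```

	If alignment is always guaranteed, sector_within_page() is
	always 0 and the mask is a no-op.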

> Note that some patches I saw (I think by Matti) made "page->offset" a long
> long, and that is never going to happen. That's just a stupid waste of
> time and memory.

	Good heavens! No!  That can't have been mine.

	In my patches the 'page->offset' became an ADT called 'pgoff_t',
	which I used to trap missing conversions at compile time.
	When simplified ("#if 1" -> "#if 0" in <linux/mm.h> header file),
	the type is just 'u_long'.

	I don't think you have seen my patches, I have posted the URL,
	but not the patches themselves.

	With recent talks in linux-kernel about internal VFS ABI stability
	being an issue, my current L-F-S patch is *not* ready for 2.2.*.
	It changes one thing, and adds another in the inode_operations
	structure, plus adds a field into 'struct task'.

	I would wait a bit until 2.3 opens, collect a bit of experience
	of it there, and then backport (without doing VFS ABI changes) to
	2.2.*.    Otherwise: "Damn the torpedoes!  Full steam ahead!".
	(And we would hear lots of noisy torpedoes...)

... 
> 		Linus

/Matti Aarnio <matti.aarnio@sonera.fi>

* Re: [PATCH] Re: swapcache bug?
  1999-02-08 18:48       ` Linus Torvalds
  1999-02-08 21:13         ` Matti Aarnio
@ 1999-02-09  7:15         ` Eric W. Biederman
  1999-02-09 16:32           ` Linus Torvalds
  1 sibling, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 1999-02-09  7:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stephen C. Tweedie, masp0008, linux-mm

>>>>> "LT" == Linus Torvalds <torvalds@transmeta.com> writes:

LT> Yes. The page offset will become a "sector offset" (I'd actually like to
LT> make it a page number, but then I'd have to break ZMAGIC dynamic loading
LT> due to the fractional page offsets, so it's not worth it for three extra
LT> bits), and that gives you 41 bits of addressing even on a 32-bit machine.
LT> Which is plenty - considering that by the time you need more than that
LT> you'd _really_ better be running on a larger machine anyway. 

???  With the latter OMAGIC format everything is page aligned already.

I have a patch that removes page sharing support from ZMAGIC but keeps
everything functional.  Tested with an OMAGIC libc, ZMAGIC doom, and
ZMAGIC Xlibs.  This is on my queue for submission to 2.3.

Eric

* Re: [PATCH] Re: swapcache bug?
  1999-02-09  7:15         ` Eric W. Biederman
@ 1999-02-09 16:32           ` Linus Torvalds
  1999-02-10  0:28             ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 1999-02-09 16:32 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Stephen C. Tweedie, masp0008, linux-mm

On 9 Feb 1999, Eric W. Biederman wrote:
> 
> ???  With the latter OMAGIC format everything is page aligned already.

Yes.

However, it's a question of pride too. I don't want to break "normal" user
land applications (as opposed to things like "ifconfig" that are really
very very special), unless I really have to.

As such, I want to support even the old 1kB-aligned ZMAGIC binaries for as
long as it's not a liability, and quite frankly the issue of whether you
make the page cache "offset" be a sector or a page offset is purely a
thing of taste, not a liability.

			Linus


* Re: [PATCH] Re: swapcache bug?
  1999-02-09 16:32           ` Linus Torvalds
@ 1999-02-10  0:28             ` Eric W. Biederman
  0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 1999-02-10  0:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stephen C. Tweedie, masp0008, linux-mm

>>>>> "LT" == Linus Torvalds <torvalds@transmeta.com> writes:

LT> On 9 Feb 1999, Eric W. Biederman wrote:
>> 
>> ???  With the latter OMAGIC format everything is page aligned already.

LT> Yes.

LT> However, it's a question of pride too. I don't want to break "normal" user
LT> land applications (as opposed to things like "ifconfig" that are really
LT> very very special), unless I really have to.

You don't have to break programs, just have them use a little more memory.

The way we currently support shared ZMAGIC binaries is a real hack.
There are a lot of cases where it doesn't work: ext2fs with 2k+ blocks,
and network file systems.

And the code is very unobvious.

The filesystem code becomes much cleaner if we remove support for
non-aligned mappings.

The following patch is all that it takes to remove the need to support
non-aligned mappings.  Everything still works; we just use a little
more memory (if multiple copies of the program are running at once),
and complain.

Avoiding this patch is not worth losing 3 bits of address space, and
code clarity.  
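
The alignment test the patch relies on can be tried in user space with
a minimal sketch, assuming 4096-byte pages.  PAGE_SIZE and PAGE_MASK
are redefined locally here and is_page_aligned() is a hypothetical
helper; the patch itself open-codes the expression.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL			/* assumed page size */
#define PAGE_MASK (~(PAGE_SIZE - 1))		/* keeps the page-number bits */

/* Same test as in the patch: the offset is aligned when none of the
 * bits below the page boundary are set. */
static int is_page_aligned(unsigned long offset)
{
	return (offset & ~PAGE_MASK) == 0;
}
```

An old ZMAGIC binary with its text at a 1kB file offset fails this test
and takes the read-into-anonymous-memory path instead of being mapped.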

Eric

diff -uNrX linux-ignore-files linux-2.1.132.eb2/fs/binfmt_aout.c linux-2.1.132.eb3.make/fs/binfmt_aout.c
--- linux-2.1.132.eb2/fs/binfmt_aout.c	Fri Dec 25 16:42:47 1998
+++ linux-2.1.132.eb3.make/fs/binfmt_aout.c	Fri Dec 25 22:42:36 1998
@@ -409,7 +409,14 @@
 			return fd;
 		file = fcheck(fd);
 
-		if (!file->f_op || !file->f_op->mmap) {
+		if ((fd_offset & ~PAGE_MASK) != 0) {
+			printk(KERN_WARNING 
+			       "fd_offset is not page aligned. Please convert program: %s\n",
+			       file->f_dentry->d_name.name
+			       );
+		}
+
+		if (!file->f_op || !file->f_op->mmap || ((fd_offset & ~PAGE_MASK) != 0)) {
 			sys_close(fd);
 			do_mmap(NULL, 0, ex.a_text+ex.a_data,
 				PROT_READ|PROT_WRITE|PROT_EXEC,
@@ -530,6 +537,24 @@
 
 	start_addr =  ex.a_entry & 0xfffff000;
 
+	if ((N_TXTOFF(ex) & ~PAGE_MASK) != 0) {
+		printk(KERN_WARNING 
+		       "N_TXTOFF is not page aligned. Please convert library: %s\n",
+		       file->f_dentry->d_name.name
+		       );
+		
+		do_mmap(NULL, start_addr & PAGE_MASK, ex.a_text + ex.a_data + ex.a_bss,
+			PROT_READ | PROT_WRITE | PROT_EXEC,
+			MAP_FIXED| MAP_PRIVATE, 0);
+		
+		read_exec(file->f_dentry, N_TXTOFF(ex),
+			  (char *)start_addr, ex.a_text + ex.a_data, 0);
+		flush_icache_range((unsigned long) start_addr,
+				   (unsigned long) start_addr + ex.a_text + ex.a_data);
+
+		retval = 0;
+		goto out_putf;
+	}
 	/* Now use mmap to map the library into memory. */
 	error = do_mmap(file, start_addr, ex.a_text + ex.a_data,
 			PROT_READ | PROT_WRITE | PROT_EXEC,
diff -uNrX linux-ignore-files linux-2.1.132.eb2/mm/filemap.c linux-2.1.132.eb3.make/mm/filemap.c
--- linux-2.1.132.eb2/mm/filemap.c	Fri Dec 25 16:48:50 1998
+++ linux-2.1.132.eb3.make/mm/filemap.c	Fri Dec 25 23:04:10 1998
@@ -1350,7 +1350,7 @@
 			return -EINVAL;
 	} else {
 		ops = &file_private_mmap;
-		if (vma->vm_offset & (inode->i_sb->s_blocksize - 1))
+		if (vma->vm_offset & (PAGE_SIZE - 1))
 			return -EINVAL;
 	}
 	if (!inode->i_sb || !S_ISREG(inode->i_mode))


