* swapcache bug? @ 1999-02-07 18:21 Manfred Spraul 1999-02-07 21:30 ` Eric W. Biederman 1999-02-08 16:39 ` [PATCH] " Stephen C. Tweedie 0 siblings, 2 replies; 10+ messages in thread From: Manfred Spraul @ 1999-02-07 18:21 UTC (permalink / raw) To: linux-mm I'm currently debugging my physical memory ramdisk, and I see lots of entries in the page cache that have 'page->offset' which aren't multiples of 4096. (they are multiples of 256) All of them belong to swapper_inode. If this is the intended behaviour, then page_hash() should be changed: it assumes that 'page->offset' is a multiple of 4096. If this should not happen, please ask me for further details. Note that there is NO crash, just lots of entries with the same hash value. --- - 2.2.1 kernel - 12 MB Ram - 72256 kB Swap-partition --- Manfred -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: swapcache bug? 1999-02-07 18:21 swapcache bug? Manfred Spraul @ 1999-02-07 21:30 ` Eric W. Biederman 1999-02-08 16:39 ` [PATCH] " Stephen C. Tweedie 1 sibling, 0 replies; 10+ messages in thread From: Eric W. Biederman @ 1999-02-07 21:30 UTC (permalink / raw) To: masp0008; +Cc: linux-mm >>>>> "MS" == Manfred Spraul <masp0008@stud.uni-sb.de> writes: MS> I'm currently debugging my physical memory ramdisk, and I see lots of MS> entries in the page cache that have 'page->offset' which aren't MS> multiples of 4096. (they are multiples of 256) MS> All of them belong to swapper_inode. MS> If this is the intended behaviour, then page_hash() should be changed: MS> it assumes that 'page->offset' is a multiple of 4096. Yes. Because for the swap cache we store the swap entry which is already has the page size shifted out of it, but it's also setup so you can store it directly in a pte which means some 0 bits. Good spotting, but unless someone can show a significant performance impact changing page_hash should wait for 2.3. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] Re: swapcache bug? 1999-02-07 18:21 swapcache bug? Manfred Spraul 1999-02-07 21:30 ` Eric W. Biederman @ 1999-02-08 16:39 ` Stephen C. Tweedie 1999-02-08 17:32 ` Linus Torvalds 1 sibling, 1 reply; 10+ messages in thread From: Stephen C. Tweedie @ 1999-02-08 16:39 UTC (permalink / raw) To: masp0008, Linus Torvalds; +Cc: linux-mm, Stephen Tweedie Hi, On Sun, 07 Feb 1999 19:21:38 +0100, Manfred Spraul <masp0008@stud.uni-sb.de> said: > I'm currently debugging my physical memory ramdisk, and I see lots of > entries in the page cache that have 'page->offset' which aren't > multiples of 4096. (they are multiples of 256) > All of them belong to swapper_inode. That is normal. > If this is the intended behaviour, then page_hash() should be changed: > it assumes that 'page->offset' is a multiple of 4096. Good point, the line include/linux/pagemap.h:39, return s(i+o) & (PAGE_HASH_SIZE-1); should probably be return s(i+o+offset) & (PAGE_HASH_SIZE-1); to mix in the low order bits for swap entries. Well spotted. Anyone see anything wrong with this one-liner change? --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Re: swapcache bug? 1999-02-08 16:39 ` [PATCH] " Stephen C. Tweedie @ 1999-02-08 17:32 ` Linus Torvalds 1999-02-08 17:51 ` Stephen C. Tweedie 0 siblings, 1 reply; 10+ messages in thread From: Linus Torvalds @ 1999-02-08 17:32 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: masp0008, linux-mm On Mon, 8 Feb 1999, Stephen C. Tweedie wrote: > > Good point, the line include/linux/pagemap.h:39, > > return s(i+o) & (PAGE_HASH_SIZE-1); > > should probably be > > return s(i+o+offset) & (PAGE_HASH_SIZE-1); > > to mix in the low order bits for swap entries. Well spotted. Anyone > see anything wrong with this one-liner change? Yes, the above will potentially result in different hash entries for the same page, which means that we now have aliasing and basically just random behaviour. It _may_ be that the hash function is always called with a page-aligned offset, but that was not how it was strictly meant to be: the way the thing was envisioned you could just find the page at "offset" by doing page_hash(inode,offset) without page-aligning offset before you did this. If anything, maybe the swap cache should just use the high bits in the "offset" field (or at least prefer to do so: something like page->offset = swap_entry_to_offset(entry); and entry = offset_to_swap_entry(page->offset); that does a PAGE_MASK_BITS rotate on the bits.. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Re: swapcache bug? 1999-02-08 17:32 ` Linus Torvalds @ 1999-02-08 17:51 ` Stephen C. Tweedie 1999-02-08 18:48 ` Linus Torvalds 0 siblings, 1 reply; 10+ messages in thread From: Stephen C. Tweedie @ 1999-02-08 17:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: Stephen C. Tweedie, masp0008, linux-mm Hi, On Mon, 8 Feb 1999 09:32:24 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said: > It _may_ be that the hash function is always called with a page-aligned > offset, but that was not how it was strictly meant to be: the way the > thing was envisioned you could just find the page at "offset" by doing > page_hash(inode,offset) It does appear to be: we enforce it pretty much everywhere I can see, with one possible exception: filemap_nopage(), which assumes area->vm_offset is already page-aligned. I think we can still violate that internally if we are mapping a ZMAGIC binary (urgh), but the VM breaks anyway if we do that: update_vm_cache cannot deal with such pages, for a start. The assumption that we might have flexible offsets will break __find_page massively anyway, because we _always_ lookup the struct page by exact match on the offset; __find_page never tries to align things itself. Linus, I know Matti Aarnio has been working on supporting >32bit offsets on Intel, and for that we really do need to start using the low bits in the page offset for something more useful than MBZ padding. If there is a long-term desire to keep those bits in the offset insignificant then that will really hurt his work; otherwise, I can't see mixing in the low-order bits to the page hash breaking anything new. > If anything, maybe the swap cache should just use the high bits in the > "offset" field Yes, we can certainly do that to fix the current has collision problems, but since there are long term reasons for using more bits of significance in the page cache offset, it would be good to know whether you'd be willing to entertain that possibility. If so, we'll need a hash function which observes the low bits anyway. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Re: swapcache bug? 1999-02-08 17:51 ` Stephen C. Tweedie @ 1999-02-08 18:48 ` Linus Torvalds 1999-02-08 21:13 ` Matti Aarnio 1999-02-09 7:15 ` Eric W. Biederman 0 siblings, 2 replies; 10+ messages in thread From: Linus Torvalds @ 1999-02-08 18:48 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: masp0008, linux-mm On Mon, 8 Feb 1999, Stephen C. Tweedie wrote: > > It does appear to be: we enforce it pretty much everywhere I can see, > with one possible exception: filemap_nopage(), which assumes > area->vm_offset is already page-aligned. I think we can still violate > that internally if we are mapping a ZMAGIC binary (urgh), but the VM > breaks anyway if we do that: update_vm_cache cannot deal with such > pages, for a start. This was done on purpose: it still works as a mapping, but it isn't coherent with regards to writes to the file. That's fine, as writing to an executable while it has been mapped is a losing proposition anyway, and you can't get access through these non-page-aligned mappings any other way (the "mmap()" system calls etc will all enforce page-aligned regions, because coherency just wouldn't be possible otherwise). > The assumption that we might have flexible offsets will break > __find_page massively anyway, because we _always_ lookup the struct page > by exact match on the offset; __find_page never tries to align things > itself. Good point. > Linus, I know Matti Aarnio has been working on supporting >32bit offsets > on Intel, and for that we really do need to start using the low bits in > the page offset for something more useful than MBZ padding. Yes. The page offset will become a "sector offset" (I'd actually like to make it a page number, but then I'd have to break ZMAGIC dynamic loading due to the fractional page offsets, so it's not worth it for three extra bits), and that gives you 41 bits of addressing even on a 32-bit machine. Which is plenty - considering that by the time you need more than that you'd _really_ better be running on a larger machine anyway. Note that some patches I saw (I think by Matti) made "page->offset" a long long, and that is never going to happen. That's just a stupid waste of time and memory. > If there is > a long-term desire to keep those bits in the offset insignificant then > that will really hurt his work; otherwise, I can't see mixing in the > low-order bits to the page hash breaking anything new. Ok, you convinced me. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Re: swapcache bug? 1999-02-08 18:48 ` Linus Torvalds @ 1999-02-08 21:13 ` Matti Aarnio 1999-02-09 7:15 ` Eric W. Biederman 1 sibling, 0 replies; 10+ messages in thread From: Matti Aarnio @ 1999-02-08 21:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: sct, masp0008, linux-mm Linus Torvalds <torvalds@transmeta.com> wrote: ... > > Linus, I know Matti Aarnio has been working on supporting >32bit offsets > > on Intel, and for that we really do need to start using the low bits in > > the page offset for something more useful than MBZ padding. > > Yes. The page offset will become a "sector offset" (I'd actually like to > make it a page number, but then I'd have to break ZMAGIC dynamic loading > due to the fractional page offsets, so it's not worth it for three extra > bits), and that gives you 41 bits of addressing even on a 32-bit machine. > Which is plenty - considering that by the time you need more than that > you'd _really_ better be running on a larger machine anyway. I forgot (didn't log), who sent me a patch to my L-F-S stuff for ZMAGIC page mis-alignment report. (It was somebody here at linux-mm list) His comment was that only *very old* systems contain ZMAGIC files with alignments not already in page granularity. Given certain limitations in low-level block drivers, using that 'sector index' idea might be worthy. It gives us essentially up to 512 * 4GB or 2 TB file sizes, which matches current low-level limitations. However, now doing page offset work, we might need to mask the low bits of the sector index to do page cache searches. (Unless the alignment is always guaranteed ?) > Note that some patches I saw (I think by Matti) made "page->offset" a long > long, and that is never going to happen. That's just a stupid waste of > time and memory. Good heavens! No! That can't have been mine. In my patches the 'page->offset' became ADT called 'pgoff_t' which I used to do compile time trapping of missing convertions. When simplified ("#if 1" -> "#if 0" in <linux/mm.h> header file), the type is just 'u_long'. I don't think you have seen my patches, I have posted the URL, but not the patches themselves. With recent talks in linux-kernel about internal VFS ABI stability being an issue, my current L-F-S patch is *not* ready for 2.2.*. It changes one thing, and adds another in the inode_operations structure, plus adds a field into 'struct task'. I would wait a bit until 2.3 opens, collect a bit of experience of it there, and then backport (without doing VFS ABI changes) to 2.2.*. Otherwise: "Damn the torpedoes! Full steam ahead!". (And we would hear lots of noicy torpedoes...) ... > Linus /Matti Aarnio <matti.aarnio@sonera.fi> -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Re: swapcache bug? 1999-02-08 18:48 ` Linus Torvalds 1999-02-08 21:13 ` Matti Aarnio @ 1999-02-09 7:15 ` Eric W. Biederman 1999-02-09 16:32 ` Linus Torvalds 1 sibling, 1 reply; 10+ messages in thread From: Eric W. Biederman @ 1999-02-09 7:15 UTC (permalink / raw) To: Linus Torvalds; +Cc: Stephen C. Tweedie, masp0008, linux-mm >>>>> "LT" == Linus Torvalds <torvalds@transmeta.com> writes: LT> Yes. The page offset will become a "sector offset" (I'd actually like to LT> make it a page number, but then I'd have to break ZMAGIC dynamic loading LT> due to the fractional page offsets, so it's not worth it for three extra LT> bits), and that gives you 41 bits of addressing even on a 32-bit machine. LT> Which is plenty - considering that by the time you need more than that LT> you'd _really_ better be running on a larger machine anyway. ??? With the latter OMAGIC format everthing is page aligned already. I have a patch that removes page sharing support from ZMAGIC but keeps everything functional. Tested with a OMAGIC libc ZMAGIC doom and ZMAGIC Xlibs. This is on my queue for submission to 2.3. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Re: swapcache bug? 1999-02-09 7:15 ` Eric W. Biederman @ 1999-02-09 16:32 ` Linus Torvalds 1999-02-10 0:28 ` Eric W. Biederman 0 siblings, 1 reply; 10+ messages in thread From: Linus Torvalds @ 1999-02-09 16:32 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Stephen C. Tweedie, masp0008, linux-mm On 9 Feb 1999, Eric W. Biederman wrote: > > ??? With the latter OMAGIC format everthing is page aligned already. Yes. However, it's a question of pride too. I don't want to break "normal" user land applications (as opposed to things like "ifconfig" that are really very very special), unless I really have to. As such, I want to support even the old 1kB-aligned ZMAGIC binaries for as long as it's not a liability, and quite frankly the issue of whether you make the page cache "offset" be a sector or a page offset is purely a thing of taste, not a liability. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Re: swapcache bug? 1999-02-09 16:32 ` Linus Torvalds @ 1999-02-10 0:28 ` Eric W. Biederman 0 siblings, 0 replies; 10+ messages in thread From: Eric W. Biederman @ 1999-02-10 0:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: Stephen C. Tweedie, masp0008, linux-mm >>>>> "LT" == Linus Torvalds <torvalds@transmeta.com> writes: LT> On 9 Feb 1999, Eric W. Biederman wrote: >> >> ??? With the latter OMAGIC format everthing is page aligned already. LT> Yes. LT> However, it's a question of pride too. I don't want to break "normal" user LT> land applications (as opposed to things like "ifconfig" that are really LT> very very special), unless I really have to. You don't have to break programs, just have them use a little more memory. The way we currently support shared ZMAGIC binaries is a real hack. There are a lot of cases where it doesn't work. 2k+ ext2fs, and network file systems. And the code is very unobvious. The filesytem code becomes much cleaner if we remove support for non aligned mappings. The following patch is all that it takes to remove the need to support non-aligned mappings. Everything still works we just use a little more memory (if multiple copies of the program are running at once), and complain. Avoiding this patch is not worth losing 3 bits of address space, and code clarity. Eric diff -uNrX linux-ignore-files linux-2.1.132.eb2/fs/binfmt_aout.c linux-2.1.132.eb3.make/fs/binfmt_aout.c --- linux-2.1.132.eb2/fs/binfmt_aout.c Fri Dec 25 16:42:47 1998 +++ linux-2.1.132.eb3.make/fs/binfmt_aout.c Fri Dec 25 22:42:36 1998 @@ -409,7 +409,14 @@ return fd; file = fcheck(fd); - if (!file->f_op || !file->f_op->mmap) { + if ((fd_offset & ~PAGE_MASK) != 0) { + printk(KERN_WARNING + "fd_offset is not page aligned. Please convert program: %s\n", + file->f_dentry->d_name.name + ); + } + + if (!file->f_op || !file->f_op->mmap || ((fd_offset & ~PAGE_MASK) != 0)) { sys_close(fd); do_mmap(NULL, 0, ex.a_text+ex.a_data, PROT_READ|PROT_WRITE|PROT_EXEC, @@ -530,6 +537,24 @@ start_addr = ex.a_entry & 0xfffff000; + if ((N_TXTOFF(ex) & ~PAGE_MASK) != 0) { + printk(KERN_WARNING + "N_TXTOFF is not page aligned. Please convert library: %s\n", + file->f_dentry->d_name.name + ); + + do_mmap(NULL, start_addr & PAGE_MASK, ex.a_text + ex.a_data + ex.a_bss, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_FIXED| MAP_PRIVATE, 0); + + read_exec(file->f_dentry, N_TXTOFF(ex), + (char *)start_addr, ex.a_text + ex.a_data, 0); + flush_icache_range((unsigned long) start_addr, + (unsigned long) start_addr + ex.a_text + ex.a_data); + + retval = 0; + goto out_putf; + } /* Now use mmap to map the library into memory. */ error = do_mmap(file, start_addr, ex.a_text + ex.a_data, PROT_READ | PROT_WRITE | PROT_EXEC, diff -uNrX linux-ignore-files linux-2.1.132.eb2/mm/filemap.c linux-2.1.132.eb3.make/mm/filemap.c --- linux-2.1.132.eb2/mm/filemap.c Fri Dec 25 16:48:50 1998 +++ linux-2.1.132.eb3.make/mm/filemap.c Fri Dec 25 23:04:10 1998 @@ -1350,7 +1350,7 @@ return -EINVAL; } else { ops = &file_private_mmap; - if (vma->vm_offset & (inode->i_sb->s_blocksize - 1)) + if (vma->vm_offset & (PAGE_SIZE - 1)) return -EINVAL; } if (!inode->i_sb || !S_ISREG(inode->i_mode)) -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~1999-02-10 0:27 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 1999-02-07 18:21 swapcache bug? Manfred Spraul 1999-02-07 21:30 ` Eric W. Biederman 1999-02-08 16:39 ` [PATCH] " Stephen C. Tweedie 1999-02-08 17:32 ` Linus Torvalds 1999-02-08 17:51 ` Stephen C. Tweedie 1999-02-08 18:48 ` Linus Torvalds 1999-02-08 21:13 ` Matti Aarnio 1999-02-09 7:15 ` Eric W. Biederman 1999-02-09 16:32 ` Linus Torvalds 1999-02-10 0:28 ` Eric W. Biederman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox