* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
[not found] <8ener4$6djpb$1@fido.engr.sgi.com>
@ 2000-05-03 3:11 ` Rajagopal Ananthanarayanan
2000-05-03 3:47 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-03 3:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm
Linus Torvalds wrote:
>
> On 2 May 2000, Juan J. Quintela wrote:
> > Hi
> > several people have reported Oops in __free_pages_ok, after a
> > BUG() in page_alloc.h. This happens in 2.3.99-pre[67]. The BUGs are:
>
> I'd like to know what the back-traces for those reports are?
>
> I'm not against getting rid of the PageSwapEntry logic (it's complication
> for not very much gain), but I'd like to understand this more..
Below please find information about one of the BUG()s in __free_pages_ok().
I'm re-sending the message I sent to linux-kernel earlier last week:
-------------
I ran into a BUG in __free_pages_ok which checks:
----------
	if (PageSwapCache(page))
		BUG();
----------
The call to free the page was from try_to_swap_out():
----------
	/*
	 * Is the page already in the swap cache? If so, then
	 * we can just drop our reference to it without doing
	 * any IO - it's already up-to-date on disk.
	 *
	 * Return 0, as we didn't actually free any real
	 * memory, and we should just continue our scan.
	 */
	if (PageSwapCache(page)) {
		entry.val = page->index;
		swap_duplicate(entry);
		set_pte(page_table, swp_entry_to_pte(entry));
drop_pte:
		vma->vm_mm->rss--;
		flush_tlb_page(vma, address);
		__free_page(page);
		goto out_failed;
	}
-----------
The entire trace from kdb is as follows (XXX = unknown):
----------
XXXXXXXXXX XXXXXXXXXX __free_pages_ok(XXXX)
0xc35d5d9c 0xc013543c try_to_swap_out+0xc8( 0xc179dc20, 0x43589000, 0xc2963624, 0x5, 0xc179dc20 )
0xc35d5dd8 0xc0135710 swap_out_vma+0x11c( 0xc179dc20, 0x432d4000, 0x5, 0xc02c7c68, 0xc3256520 )
0xc35d5df8 0xc01357ee swap_out_mm+0x7e( 0xc3256520, 0x5, 0x4, 0x6, 0x5 )
0xc35d5e24 0xc01359de swap_out+0x176( 0x6, 0x5, 0xc35d4000, 0xc02d0890, 0x5 )
0xc35d5e40 0xc0135b31 do_try_to_free_pages+0x89( 0x5, 0xc02d06b8, 0xc02d06b8, 0xc35d5e78, 0xc013666b )
0xc35d5e54 0xc0135d17 try_to_free_pages+0x2b( 0x5, 0xc02d06b8, 0xc02d0898, 0x0, 0xc02d088c )
0xc35d5e78 0xc013666b zone_balance_memory+0x63( 0xc02d088c, 0xc35d4000, 0x0, 0x5b5f00 )
0xc35d5e98 0xc0136724 __alloc_pages+0x80( 0xc35d4000, 0x0, 0xc35d4000 )
0xc35d5eac 0xc0136dbe read_swap_cache_async+0x62( 0x5b5f00, 0x1, 0x5b5f00, 0x5b5f00, 0xc13919c4 )
0xc35d5ecc 0xc012889b do_swap_page+0x97( 0xc35d4000, 0xc179dc20, 0x45671fff, 0xc13919c4, 0x5b5f00 )
0xc35d5efc 0xc0128ce7 handle_mm_fault+0x13b( 0xc35d4000, 0xc179dc20, 0x45671fff, 0x0, 0xc35d4000 )
0xc35d5fb4 0xc011513e do_page_fault+0x18e
0xbffff540 0xc010a681 error_code+0x2d
-------------
I have a kdb helper function which prints out some of the fields in the page,
and also below is the hex dump of the "struct page"
----------
[1]kdb> page 0xc1002de0
struct page at 0xc1002de0
next 0xc1042ee0 prev 0xc101e900 addr space 0xc02d0520 index 557568
count 1 flags PG_uptodate PG_swap_cache PG_swap_entry virtual 0xc0092000
buffers 0x00000000 block_map 00000000000000000000000000000000
[ ... ]
c1002de0: c1042ee0 c101e900 c02d0520 00088200 a..A.e.A .-A....
c1002df0: 00000000 00000001 00000a08 c1042efc ............u..A
c1002e00: c101e91c 00000000 dead4ead c1002e0c .e.A....-N-TH...A
c1002e10: c1002e0c c1002e14 c02f3b3f c1150e5c ...A...A?;/A\..A
c1002e20: 00000000 c0092000 c02d0600 00000000 ..... .A..-A....
---------------------
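As a cross-check of the dump above, the following is a minimal sketch that decodes the flags word by hand. It assumes that the third word on the second dump line (00000a08) is page->flags, and that bits 3, 9 and 11 correspond to PG_uptodate, PG_swap_cache and PG_swap_entry. Those bit positions are an assumption inferred from the kdb output, not taken from the kernel headers, and decode_page_flags() is a hypothetical helper, not a kdb or kernel function.

```c
#include <assert.h>
#include <string.h>

/* Assumed bit positions: 0xa08 has exactly bits 3, 9 and 11 set,
 * and kdb prints exactly three flag names for this page. */
struct flag_name {
    unsigned bit;
    const char *name;
};

static const struct flag_name flag_names[] = {
    { 3,  "PG_uptodate"   },
    { 9,  "PG_swap_cache" },
    { 11, "PG_swap_entry" },
};

/* Write the space-separated names of the known set bits into buf
 * (len must be at least 1) and return how many were found. */
static int decode_page_flags(unsigned long flags, char *buf, size_t len)
{
    int found = 0;
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
        if (flags & (1UL << flag_names[i].bit)) {
            if (found > 0)
                strncat(buf, " ", len - strlen(buf) - 1);
            strncat(buf, flag_names[i].name, len - strlen(buf) - 1);
            found++;
        }
    }
    return found;
}
```

Under these assumptions the word 0x00000a08 decodes to the same three names kdb printed, and the index word 0x00088200 equals the decimal 557568 shown by the helper, so the hex dump and the summary agree with each other.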
Question: Is this a problem in the reference count on the page?
If indeed the page can be freed by the call in try_to_swap_out,
then the test in __free_pages_ok will trigger every time this path
is taken. Anyone have ideas as to what's wrong?
BTW, the above happened during a relatively normal operation of
using 'diff'. Don't know if it is reproducible.
----------------------
As Jeff G. pointed out in another mail, I left out some
important information. If you need more info., please let me know.
(1) this is with kernel version 2.3.99-pre2 with some XFS changes.
(2) I have a 2 CPU X-86 box with 64MB memory.
Here's the .config used to build the kernel:
-----------
#
# Automatically generated by make menuconfig: don't edit
#
CONFIG_X86=y
CONFIG_ISA=y
CONFIG_UID16=y
#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
#
# Processor type and features
#
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
CONFIG_M686=y
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
# CONFIG_MICROCODE is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
# CONFIG_MATH_EMULATION is not set
# CONFIG_MTRR is not set
CONFIG_SMP=y
#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODVERSIONS=y
CONFIG_KMOD=y
#
# General setup
#
CONFIG_NET=y
# CONFIG_VISWS is not set
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_NAMES=y
# CONFIG_MCA is not set
# CONFIG_HOTPLUG is not set
CONFIG_SYSVIPC=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_SYSCTL is not set
CONFIG_KCORE_ELF=y
# CONFIG_KCORE_AOUT is not set
CONFIG_BINFMT_AOUT=m
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
# CONFIG_PM is not set
# CONFIG_ACPI is not set
# CONFIG_APM is not set
#
# Parallel port support
#
# CONFIG_PARPORT is not set
#
# Plug and Play configuration
#
# CONFIG_PNP is not set
# CONFIG_ISAPNP is not set
#
# Block devices
#
CONFIG_BLK_DEV_FD=m
# CONFIG_BLK_DEV_XD is not set
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_LOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_MD is not set
# CONFIG_BLK_DEV_RAM is not set
#
# Networking options
#
CONFIG_PACKET=m
# CONFIG_PACKET_MMAP is not set
# CONFIG_NETLINK is not set
# CONFIG_NETFILTER is not set
# CONFIG_FILTER is not set
CONFIG_UNIX=y
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
# CONFIG_IP_ROUTER is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_ALIAS is not set
# CONFIG_SYN_COOKIES is not set
CONFIG_SKB_LARGE=y
# CONFIG_IPV6 is not set
# CONFIG_KHTTPD is not set
# CONFIG_ATM is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_DECNET is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_BRIDGE is not set
# CONFIG_LLC is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_NET_FASTROUTE is not set
# CONFIG_NET_HW_FLOWCONTROL is not set
#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set
#
# Telephony Support
#
# CONFIG_PHONE is not set
# CONFIG_PHONE_IXJ is not set
#
# ATA/IDE/MFM/RLL support
#
# CONFIG_IDE is not set
# CONFIG_BLK_DEV_IDE_MODES is not set
# CONFIG_BLK_DEV_HD is not set
#
# SCSI support
#
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_SD_EXTRA_DEVS=40
# CONFIG_CHR_DEV_ST is not set
# CONFIG_BLK_DEV_SR is not set
# CONFIG_CHR_DEV_SG is not set
CONFIG_SCSI_DEBUG_QUEUES=y
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set
#
# SCSI low-level drivers
#
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
CONFIG_SCSI_AHA152X=m
# CONFIG_SCSI_AHA1542 is not set
CONFIG_SCSI_AHA1740=m
CONFIG_SCSI_AIC7XXX=y
# CONFIG_AIC7XXX_TCQ_ON_BY_DEFAULT is not set
CONFIG_AIC7XXX_CMDS_PER_DEVICE=8
# CONFIG_AIC7XXX_PROC_STATS is not set
CONFIG_AIC7XXX_RESET_DELAY=5
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_MEGARAID is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_EATA_DMA is not set
# CONFIG_SCSI_EATA_PIO is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_SIM710 is not set
# CONFIG_SCSI_NCR53C7xx is not set
# CONFIG_SCSI_NCR53C8XX is not set
CONFIG_SCSI_SYM53C8XX=m
CONFIG_SCSI_NCR53C8XX_DEFAULT_TAGS=4
CONFIG_SCSI_NCR53C8XX_MAX_TAGS=32
CONFIG_SCSI_NCR53C8XX_SYNC=20
# CONFIG_SCSI_NCR53C8XX_PROFILE is not set
# CONFIG_SCSI_NCR53C8XX_IOMAPPED is not set
# CONFIG_SCSI_NCR53C8XX_PQS_PDS is not set
# CONFIG_SCSI_NCR53C8XX_SYMBIOS_COMPAT is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_PCI2000 is not set
# CONFIG_SCSI_PCI2220I is not set
# CONFIG_SCSI_PSI240I is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_ISP is not set
CONFIG_SCSI_QLOGIC_FC=m
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_SEAGATE is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_DEBUG is not set
#
# IEEE 1394 (FireWire) support
#
# CONFIG_IEEE1394 is not set
#
# I2O device support
#
# CONFIG_I2O is not set
# CONFIG_I2O_PCI is not set
# CONFIG_I2O_BLOCK is not set
# CONFIG_I2O_LAN is not set
# CONFIG_I2O_SCSI is not set
# CONFIG_I2O_PROC is not set
#
# Network device support
#
CONFIG_NETDEVICES=y
#
# ARCnet devices
#
# CONFIG_ARCNET is not set
CONFIG_DUMMY=m
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
# CONFIG_NET_SB1000 is not set
#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_EL1 is not set
# CONFIG_EL2 is not set
# CONFIG_ELPLUS is not set
# CONFIG_EL16 is not set
# CONFIG_EL3 is not set
# CONFIG_3C515 is not set
CONFIG_VORTEX=m
CONFIG_LANCE=m
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_VENDOR_RACAL is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_NET_ISA is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_CS89x0 is not set
# CONFIG_DE4X5 is not set
# CONFIG_TULIP is not set
# CONFIG_DGRS is not set
# CONFIG_DM9102 is not set
# CONFIG_EEPRO100 is not set
# CONFIG_LNE390 is not set
# CONFIG_NE3210 is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_RTL8129 is not set
# CONFIG_8139TOO is not set
# CONFIG_SIS900 is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_ES3210 is not set
# CONFIG_EPIC100 is not set
# CONFIG_NET_POCKET is not set
#
# Ethernet (1000 Mbit)
#
# CONFIG_YELLOWFIN is not set
# CONFIG_ACENIC is not set
# CONFIG_SK98LIN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
#
# Wireless LAN (non-hamradio)
#
# CONFIG_NET_RADIO is not set
#
# Token Ring devices
#
# CONFIG_TR is not set
# CONFIG_NET_FC is not set
# CONFIG_RCPCI is not set
# CONFIG_SHAPER is not set
#
# Wan interfaces
#
# CONFIG_WAN is not set
#
# Amateur Radio support
#
# CONFIG_HAMRADIO is not set
#
# IrDA (infrared) support
#
# CONFIG_IRDA is not set
#
# ISDN subsystem
#
# CONFIG_ISDN is not set
#
# Old CD-ROM drivers (not SCSI, not IDE)
#
# CONFIG_CD_NO_IDESCSI is not set
#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_SERIAL_CONSOLE=y
# CONFIG_SERIAL_EXTENDED is not set
# CONFIG_SERIAL_NONSTANDARD is not set
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=256
#
# I2C support
#
# CONFIG_I2C is not set
#
# Mice
#
# CONFIG_BUSMOUSE is not set
CONFIG_MOUSE=y
CONFIG_PSMOUSE=y
CONFIG_82C710_MOUSE=y
# CONFIG_PC110_PAD is not set
#
# Joysticks
#
# CONFIG_JOYSTICK is not set
# CONFIG_QIC02_TAPE is not set
#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
CONFIG_PROFILING=y
# CONFIG_NVRAM is not set
# CONFIG_RTC is not set
#
# Video For Linux
#
# CONFIG_VIDEO_DEV is not set
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
#
# Ftape, the floppy tape device driver
#
# CONFIG_FTAPE is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_GAMMA is not set
# CONFIG_AGP is not set
#
# USB support
#
# CONFIG_USB is not set
#
# File systems
#
# CONFIG_QUOTA is not set
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_FAT_FS is not set
# CONFIG_MSDOS_FS is not set
# CONFIG_UMSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
CONFIG_ISO9660_FS=m
# CONFIG_JOLIET is not set
# CONFIG_MINIX_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_HPFS_FS is not set
CONFIG_PROC_FS=y
# CONFIG_DEVFS_FS is not set
# CONFIG_DEVFS_DEBUG is not set
CONFIG_DEVPTS_FS=y
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_EXT2_FS=y
# CONFIG_SYSV_FS is not set
# CONFIG_UDF_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_XFS_FS=m
CONFIG_PAGE_BUF=y
CONFIG_PAGE_BUF_LOCKING=y
CONFIG_AVL=y
# CONFIG_PAGE_BUF_META is not set
# CONFIG_XFS_VNODE_TRACING is not set
CONFIG_AVL=y
# CONFIG_XFS_ARCH_MIPS is not set
CONFIG_XFS_ARCH_NATIVE=y
# CONFIG_XFS_ARCH_MULTI is not set
#
# Network File Systems
#
# CONFIG_CODA_FS is not set
CONFIG_NFS_FS=m
# CONFIG_ROOT_NFS is not set
# CONFIG_NFSD is not set
CONFIG_SUNRPC=m
CONFIG_LOCKD=m
# CONFIG_SMB_FS is not set
# CONFIG_NCP_FS is not set
#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
# CONFIG_NLS is not set
#
# Console drivers
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VIDEO_SELECT is not set
# CONFIG_MDA_CONSOLE is not set
#
# Frame-buffer support
#
# CONFIG_FB is not set
#
# Sound
#
# CONFIG_SOUND is not set
#
# Kernel hacking
#
# CONFIG_MAGIC_SYSRQ is not set
# CONFIG_LOCKMETER is not set
CONFIG_KDB=y
CONFIG_KDB_FRAMEPTR=y
CONFIG_KDB_STBSIZE=13500
CONFIG_MCOUNT=y
CONFIG_PROF_FRAME_PTR=y
--
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 3:11 ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Rajagopal Ananthanarayanan
@ 2000-05-03 3:47 ` Linus Torvalds
2000-05-03 5:26 ` Kanoj Sarcar
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 3:47 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan; +Cc: linux-mm
On Tue, 2 May 2000, Rajagopal Ananthanarayanan wrote:
> -------------
>
> I ran into a BUG in __free_pages_ok which checks:
>
> ----------
> if (PageSwapCache(page))
> BUG();
> ----------
>
> The call to free the page was from try_to_swap_out():
>
> ----------
> /*
> * Is the page already in the swap cache? If so, then
> * we can just drop our reference to it without doing
> * any IO - it's already up-to-date on disk.
> *
> * Return 0, as we didn't actually free any real
> * memory, and we should just continue our scan.
> */
> if (PageSwapCache(page)) {
> entry.val = page->index;
> swap_duplicate(entry);
> set_pte(page_table, swp_entry_to_pte(entry));
> drop_pte:
> vma->vm_mm->rss--;
> flush_tlb_page(vma, address);
> __free_page(page);
> goto out_failed;
> }
Wow.
That code definitely looks buggy.
Looking at the whole try_to_swap_out() in this light shows how it messes
with a _lot_ of page information without holding the page lock. I thought
we fixed this once already, but maybe not.
In try_to_swap_out(), earlier it does a
	if (PageLocked(page))
		goto out_failed;
and that really is wrong - it should do a
	if (TryLockPage(page))
		goto out_failed;
and do all the rest with the page locked so that there are no races on
changing the state of the page (and then unlock just before actually
returning, or freeing the page).
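
The difference between the two checks can be sketched in user space. This is a model only, assuming C11 atomics in place of the kernel's bit operations; the model_* names and the MODEL_PG_LOCKED value are hypothetical and are not the kernel's definitions. A test-only check reports the lock state and takes nothing, while a test-and-set both checks and acquires in one atomic step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PG_LOCKED 0x1UL   /* hypothetical bit value for this model */

struct model_page {
    atomic_ulong flags;
};

/* Test only, in the spirit of PageLocked(): reports the state and
 * leaves the page unlocked, so two callers can both see "unlocked". */
static bool model_page_locked(struct model_page *p)
{
    return (atomic_load(&p->flags) & MODEL_PG_LOCKED) != 0;
}

/* Test-and-set, in the spirit of TryLockPage(): returns true if the
 * page was ALREADY locked (caller must back off), false if the caller
 * now owns the lock. */
static bool model_try_lock_page(struct model_page *p)
{
    return (atomic_fetch_or(&p->flags, MODEL_PG_LOCKED) & MODEL_PG_LOCKED) != 0;
}

/* Release the lock taken by model_try_lock_page(). */
static void model_unlock_page(struct model_page *p)
{
    atomic_fetch_and(&p->flags, ~MODEL_PG_LOCKED);
}
```

In this model two callers that only test can both proceed, whereas of two callers that test-and-set, exactly one gets a false return and the other is told to back off.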
As far as I can tell, this is a real bug, and has absolutely nothing to do
with the swap entry cache. It may be that the swap entry cache code just
changed timings for some people enough to show the race.
But maybe I've overlooked something. Anybody else have comments on this?
Linus
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 3:47 ` Linus Torvalds
@ 2000-05-03 5:26 ` Kanoj Sarcar
2000-05-03 6:22 ` Rajagopal Ananthanarayanan
2000-05-03 8:11 ` Linus Torvalds
0 siblings, 2 replies; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 5:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm
>
>
> On Tue, 2 May 2000, Rajagopal Ananthanarayanan wrote:
> > -------------
> >
> > I ran into a BUG in __free_pages_ok which checks:
> >
> > ----------
> > if (PageSwapCache(page))
> > BUG();
> > ----------
> >
> > The call to free the page was from try_to_swap_out():
> >
> > ----------
> > /*
> > * Is the page already in the swap cache? If so, then
> > * we can just drop our reference to it without doing
> > * any IO - it's already up-to-date on disk.
> > *
> > * Return 0, as we didn't actually free any real
> > * memory, and we should just continue our scan.
> > */
> > if (PageSwapCache(page)) {
> > entry.val = page->index;
> > swap_duplicate(entry);
> > set_pte(page_table, swp_entry_to_pte(entry));
> > drop_pte:
> > vma->vm_mm->rss--;
> > flush_tlb_page(vma, address);
> > __free_page(page);
> > goto out_failed;
> > }
>
> Wow.
>
> That code definitely looks buggy.
>
> Looking at the whole try_to_swap_out() in this light shows how it messes
> with a _lot_ of page information without holding the page lock. I thought
> we fixed this once already, but maybe not.
>
> In try_to_swap_out(), earlier it does a
>
> if (PageLocked(page))
> goto out_failed;
>
> and that really is wrong - it should do a
>
> if (TryLockPage(page))
> goto out_failed;
Umm, I am not saying this is not a good idea, but maybe code that
try_to_swap_out() invokes (like filemap_swapout etc) need to be
taught that the incoming page has already been locked.
Nonetheless, unless you show me a possible scenario that will lead
to the observed panic, I am skeptical that this is the real problem.
Let's just talk about swapcache pages (since the problem happened with
that type), and let's forget swapfile deletion, I am pretty sure Ananth
was not trying that. In this restricted situation, I _think_ you can
not theorize what the problem is. That is, if all code that adds/deletes
pages from the swap cache makes sure it never deletes a "shared" page
from the scache (as determined by is_page_shared). This is because
the process that kswapd is looking at already ensures that the page is
"shared". The only code that does delete "shared" pages from the scache
is shrink_mmap, but if a process already has a page-reference, shrink_mmap
can not touch that page. Also, most process level code that takes a
page out from the swapcache is interlocked out because kswapd is
holding the vmlist/page_table_lock.
Anyway, I will try to think if there are more race conditions possible.
Ananth, were there shared memory programs in your test suite? Also, if you
have any success in reproducing this, let us know.
Kanoj
>
> and do all the rest with the page locked so that there are no races on
> changing the state of the page (and then unlock just before actually
> returning, or freeing the page).
>
> As far as I can tell, this is a real bug, and has absolutely nothing to do
> with the swap entry cache. It may be that the swap entry cache code just
> changed timings for some people enough to show the race.
>
> But maybe I've overlooked something. Anybody else have comments on this?
>
> Linus
>
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 5:26 ` Kanoj Sarcar
@ 2000-05-03 6:22 ` Rajagopal Ananthanarayanan
2000-05-03 16:11 ` Kanoj Sarcar
2000-05-03 8:11 ` Linus Torvalds
1 sibling, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-03 6:22 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Linus Torvalds, linux-mm
Kanoj Sarcar wrote:
> >
> > Wow.
> >
> > That code definitely looks buggy.
> >
> > Looking at the whole try_to_swap_out() in this light shows how it messes
> > with a _lot_ of page information without holding the page lock. I thought
> > we fixed this once already, but maybe not.
> >
> > In try_to_swap_out(), earlier it does a
> >
> > if (PageLocked(page))
> > goto out_failed;
> >
> > and that really is wrong - it should do a
> >
> > if (TryLockPage(page))
> > goto out_failed;
>
> Umm, I am not saying this is not a good idea, but maybe code that
> try_to_swap_out() invokes (like filemap_swapout etc) need to be
> taught that the incoming page has already been locked.
Dunno. I tend to agree with Linus. Fundamentally, how can any
code examine & change page state (flags, etc.) if the code
does not hold the page lock?
>
> Nonetheless, unless you show me a possible scenario that will lead
> to the observed panic, I am skeptical that this is the real problem.
Look at the trace I sent out. Basically it goes swap_out() -> swap_out_mm() ->
swap_out_vma() -> try_to_swap_out() -> __free_pages_ok().
1. swap_out selects a process, and a vm area within that process, to swap out.
2. swap_out_mm selects an "address" within the mm.
3. swap_out_vma converts the address to a pgd.
4. try_to_swap_out takes the pgd and looks at the "software" state in "struct page".
Step 2 is about the earliest you can lock the victim page;
it isn't locked there. Step 3 doesn't lock it either. Step 4,
as pointed out, explicitly avoids pages which are locked,
but doesn't lock the page!
Some more clarifications below:
>
> Lets just talk about swapcache pages (since the problem happened with
> that type), and lets forget swapfile deletion, I am pretty sure Ananth
> was not trying that. [ ... ]
No, I didn't try to remove swap.
> Anyway, I will try to think if there are more race conditions possible.
> Ananth, was there shared memory programs in your test suite? Also, if you
> have any success in reproducing this, let us know.
I don't think there was any shm stuff in the tests I was running
(again, AFAICT, diff was the only thing running; previous pages
in memory weren't likely from any shm segments). I haven't
reproduced it even a second time. Will let you know.
OTOH, if try_to_swap_out is so broken why aren't we seeing
these problems more often? Or, from other reports in l-k are we?
ananth.
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 5:26 ` Kanoj Sarcar
2000-05-03 6:22 ` Rajagopal Ananthanarayanan
@ 2000-05-03 8:11 ` Linus Torvalds
2000-05-03 8:31 ` Linus Torvalds
1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 8:11 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm
On Tue, 2 May 2000, Kanoj Sarcar wrote:
> Moi wrote:
> >
> > Wow.
> >
> > That code definitely looks buggy.
> >
> > Looking at the whole try_to_swap_out() in this light shows how it messes
> > with a _lot_ of page information without holding the page lock. I thought
> > we fixed this once already, but maybe not.
> >
> > In try_to_swap_out(), earlier it does a
> >
> > if (PageLocked(page))
> > goto out_failed;
> >
> > and that really is wrong - it should do a
> >
> > if (TryLockPage(page))
> > goto out_failed;
>
> Umm, I am not saying this is not a good idea, but maybe code that
> try_to_swap_out() invokes (like filemap_swapout etc) need to be
> taught that the incoming page has already been locked.
Oh, definitely. It's more than a one-liner change. Right now all the code
afterwards is written with the notion that the page is unlocked, and
having the page locked means that things have to be done differently (eg
use "__add_to_page_cache()" instead of "add_to_page_cache()" etc - all the
functions that get the page expecting the caller to have already locked
it).
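
The convention described here, one variant that takes the lock itself and one that expects the caller to hold it, can be sketched generically. This is a simplified model with hypothetical names (model_add_to_cache and model_add_to_cache_locked); it does not reproduce what add_to_page_cache() or __add_to_page_cache() actually do in the kernel, and a plain bool stands in for the page lock.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
    bool locked;     /* stands in for the page lock */
    bool in_cache;   /* stands in for "page is in the cache" */
};

/* Caller-locked variant: expects the lock to be held on entry.
 * Returns 0 on success, -1 if the precondition is violated. */
static int model_add_to_cache_locked(struct model_page *p)
{
    if (!p->locked)
        return -1;
    p->in_cache = true;
    return 0;
}

/* Self-locking variant: takes the lock, does the work, drops the lock.
 * Returns -1 if the page is already locked, because a caller that
 * already holds the lock must use the _locked variant instead. */
static int model_add_to_cache(struct model_page *p)
{
    int ret;

    if (p->locked)
        return -1;
    p->locked = true;
    ret = model_add_to_cache_locked(p);
    p->locked = false;
    return ret;
}
```

In this model, once a caller starts holding the lock across the whole operation, every call to the self-locking variant fails and has to be switched to the caller-locked variant, which is why the change is more than a one-liner.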
> Nonetheless, unless you show me a possible scenario that will lead
> to the observed panic, I am skeptical that this is the real problem.
You may be right. The code certainly tries to be careful. However, I don't
trust "is_page_shared()" at all, _especially_ if there are people around
who play with the page state without locking the page.
If "is_page_shared()" ends up ever getting the wrong value, I suspect we'd
be screwed. There may be other scenarios..
Linus
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 8:11 ` Linus Torvalds
@ 2000-05-03 8:31 ` Linus Torvalds
2000-05-03 16:08 ` Kanoj Sarcar
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 8:31 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm
On Wed, 3 May 2000, Linus Torvalds wrote:
>
> You may be right. The code certainly tries to be careful. However, I don't
> trust "is_page_shared()" at all, _especially_ if there are people around
> who play with the page state without locking the page.
>
> If "is_page_shared()" ends up ever getting the wrong value, I suspect we'd
> be screwed. There may be other scenarios..
Kanoj, why couldn't this happen:
 - CPU0 runs swapout
    - finds page which is a swap cache entry
    - does the swap_duplicate()
    - does __free_page() on it without locking it (it wasn't locked
      before, either)
 - CPU1 runs shrink_mmap
    - finds same page on the LRU list
    - locks it _just_ after CPU0 tested that it was unlocked
    - looks at the page counters and the swap cache counters to see if
      it was shared ("is_page_shared()").
 - There is _no_ synchronization between the two, as far as I can tell.
   "swap_duplicate()" on CPU0 will get the swap device lock, and
   "is_page_shared()" will run with the page lock held, but there is no
   common locking between the two at all that I can see.
So "is_page_shared()" can be entirely crap. And can tell shrink_mmap()
that the page cache entry can be freed. Now, I have no idea what that will
actually result in, but I bet that we can just get the usage counters off
by one here, and then at some later date we free a page that we've already
freed - and that page may have been re-allocated for something else and
is in the middle of a page-in right now (which is how we end up freeing a
page that is locked).
Or something. The lack of any synchronization looks fishy to me. The page
lock would act as synchronization, but so would the swap device lock. And
maybe I'm still barking up the wrong tree..
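
The "no common locking" point can be sketched in isolation. This is a user-space model only, assuming C11 atomic flags as stand-ins for the two locks; the model_* names are hypothetical. It shows the general property being argued (holding one lock does not exclude a path that takes a different lock), not whether the specific race above can actually occur in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Two independent locks, modelled as C11 atomic flags. */
struct model_locks {
    atomic_flag swap_device_lock;   /* taken on the swap_duplicate() side */
    atomic_flag page_lock;          /* held on the is_page_shared() side */
};

/* Put both locks into the unlocked state. */
static void model_locks_init(struct model_locks *m)
{
    atomic_flag_clear(&m->swap_device_lock);
    atomic_flag_clear(&m->page_lock);
}

/* Returns true if the lock was acquired, false if it was already held. */
static bool model_trylock(atomic_flag *lock)
{
    return !atomic_flag_test_and_set(lock);
}

/* Release a lock acquired with model_trylock(). */
static void model_unlock(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

In this model, one side taking swap_device_lock does not stop the other side from taking page_lock, so both critical sections can run at once. Only when both sides go for the same lock does the second attempt fail.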
Linus
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 8:31 ` Linus Torvalds
@ 2000-05-03 16:08 ` Kanoj Sarcar
2000-05-03 16:14 ` Linus Torvalds
2000-05-04 1:38 ` Linus Torvalds
0 siblings, 2 replies; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 16:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm
>
>
> On Wed, 3 May 2000, Linus Torvalds wrote:
> >
> > You may be right. The code certainly tries to be careful. However, I don't
> > trust "is_page_shared()" at all, _especially_ if there are people around
> > who play with the page state without locking the page.
> >
> > If "is_page_shared()" ends up ever getting the wrong value, I suspect we'd
> > be screwed. There may be other scenarios..
>
> Kanoj, why couldn't this happen:
> - CPU0 runs swapout
> - finds page which is a swap cache entry
> - does the swap_duplicate()
> - does __free_page() on it without locking it (it wasn't locked
> before, either)
> - CPU1 runs shrink_mmap
> - finds same page on the LRU list
> - locks it _just_ after CPU0 tested that it was unlocked
> - looks at the page counters and the swap cache counters to see if
> it was shared ("is_page_shared()").
>
> - There is _no_ synchronization between the two, as far as I can tell.
> "swap_duplicate()" on CPU0 will get the swap device lock, and
> "is_page_shared()" will run with the page lock held, but there is no
> common locking between the two at all that I can see.
FWIW, I think you are looking in the right direction, ie, shrink_mmap
previously used to run with lock_kernel, and not anymore, so there is a
chance of shrink_mmap racing with try_to_swap_out. I thought about this
though, but couldn't come up with an example ...
But, your example does not pull thru. Note that before shrink_mmap will
even touch the page, it does a
if (!page->buffers && page_count(page) > 1)
goto dispose_continue;
The page in question is guaranteed to have page_count(page) > 1, since
try_to_swap_out has not dropped the user pte reference in your example.
Another thing to note is that shrink_mmap does not do a is_page_shared(),
it just checks for page-reference count to be 0 (the swapentry might have
references from other processes). Else, shrink_mmap will never be able
to free these pages ...
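
The guard quoted above can be written out as a small predicate. This is a sketch over a reduced stand-in structure, assuming only the two fields the quoted test reads; struct model_page and model_shrink_mmap_skips() are hypothetical names, and the reference counts in the usage note are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page {
    int count;       /* stands in for page_count(page) */
    void *buffers;   /* stands in for page->buffers */
};

/* Mirrors the quoted test: true means the scan would take the
 * "goto dispose_continue" branch and leave the page alone. */
static bool model_shrink_mmap_skips(const struct model_page *p)
{
    return p->buffers == NULL && p->count > 1;
}
```

With a count of 2 (say, one reference from the swap cache and one from the user pte that has not been dropped yet) and no buffers, the predicate is true and the page is skipped. It becomes false only once the count is down to 1, or when the page has buffers attached.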
>
> So "is_page_shared()" can be entirely crap. And can tell shrink_mmap()
Not really ... look at other places that call is_page_shared, they all
hold the pagelock. shrink_mmap does not bother with is_page_shared logic.
What is interesting is that people are reporting PageSwapEntry deletion
seems to fix this ...
Kanoj
> that the page cache entry can be freed. Now, I have no idea what that will
> actually result in, but I bet that we can just get the usage counters off
> by one here, and then at some later date we free a page that we've already
> freed - and that page may have been re-allocated for something else and
> is in the middle of a page-in right now (which is how we end up freeing a
> page that is locked).
>
> Or something. The lack of any synchronization looks fishy to me. The page
> lock would act as synchronization, but so would the swap device lock. And
> maybe I'm still barking up the wrong tree..
>
> Linus
>
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 6:22 ` Rajagopal Ananthanarayanan
@ 2000-05-03 16:11 ` Kanoj Sarcar
2000-05-03 16:19 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 16:11 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan; +Cc: Linus Torvalds, linux-mm
>
> Kanoj Sarcar wrote:
>
> > >
> > > Wow.
> > >
> > > That code definitely looks buggy.
> > >
> > > Looking at the whole try_to_swap_out() in this light shows how it messes
> > > with a _lot_ of page information without holding the page lock. I thought
> > > we fixed this once already, but maybe not.
> > >
> > > In try_to_swap_out(), earlier it does a
> > >
> > > 	if (PageLocked(page))
> > > 		goto out_failed;
> > >
> > > and that really is wrong - it should do a
> > >
> > > 	if (TryLockPage(page))
> > > 		goto out_failed;
> >
> > Umm, I am not saying this is not a good idea, but maybe code that
> > try_to_swap_out() invokes (like filemap_swapout etc) need to be
> > taught that the incoming page has already been locked.
>
> Dunno. I tend to agree with Linus. Fundamentally, how can any
> code examine & change page state (flags, etc.) if the code
> does not hold the page lock?
>
Note that try_to_swap_out holds the vmlist/page_table_lock on the
victim process, as well as lock_kernel, and though this is not the
easiest code to analyze, it seems to me that is enough protection
on the swapcache pages. Also note, I am not saying it is not a good
idea to lock the page in try_to_swap_out, but let's not rush into
that without understanding the root cause ...
Kanoj
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 16:08 ` Kanoj Sarcar
@ 2000-05-03 16:14 ` Linus Torvalds
2000-05-03 16:24 ` Kanoj Sarcar
2000-05-04 1:38 ` Linus Torvalds
1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 16:14 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm
On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > So "is_page_shared()" can be entirely crap. And can tell shrink_mmap()
>
> Not really ... look at other places that call is_page_shared, they all
> hold the pagelock. shrink_mmap does not bother with is_page_shared logic.
That wasn't my argument.
My argument is that yes, the _callers_ of is_page_shared() all hold the
page lock. No question about that. But the things that is_page_shared()
actually tests can be modified without holding the page lock, so the page
lock doesn't actually _protect_ it. See?
So the callers might as well hold one of the networking spinlocks - it
just doesn't matter as a lock, because the places that modify the stuff do
not care about the lock.. And that is fishy.
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 16:11 ` Kanoj Sarcar
@ 2000-05-03 16:19 ` Linus Torvalds
2000-05-03 16:35 ` Kanoj Sarcar
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 16:19 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm
On Wed, 3 May 2000, Kanoj Sarcar wrote:
>
> Note that try_to_swap_out holds the vmlist/page_table_lock on the
> victim process, as well as lock_kernel, and though this is not the
> easiest code to analyze, it seems to me that is enough protection
> on the swapcache pages.
The swapcache code gets none of those locks as far as I can tell.
The swapcache code gets the page lock, and the "page cache" lock. But it
doesn't get the vmlist lock (the swap cache is not associated with any
particular mm), nor does it get the kernel lock (I think - I didn't look
through the code-paths).
> Also note, I am not saying it is not a good
> idea to lock the page in try_to_swap_out, but let's not rush into
> that without understanding the root cause ...
Certainly agreed. The interactions in this area are rather complex. But in
the end (whether this is a real bug or not) I suspect that I'd just prefer
to have the simple "you must lock the page before mucking with the page
flags" rule - even if some other magic lock happens to make all of the
current code ok. Just for clarity.
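A minimal user-space sketch of the two checks discussed in this thread, and of the "lock the page before touching its flags" rule. It is not kernel code: `model_page`, `MODEL_PG_LOCKED` and the `model_*` helpers are hypothetical names, and C11 atomics stand in for the kernel's bit operations. The one behaviour assumed from the kernel is that TryLockPage() is an atomic test-and-set that returns nonzero when the page was already locked.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct page; only the flags word is modelled. */
#define MODEL_PG_LOCKED 0x1u

struct model_page {
	atomic_uint flags;
};

/*
 * Like PageLocked(): a plain test. The answer can already be stale by
 * the time the caller acts on it, so it protects nothing by itself.
 */
static int model_page_locked(struct model_page *page)
{
	return (atomic_load(&page->flags) & MODEL_PG_LOCKED) != 0;
}

/*
 * Like TryLockPage(): atomic test-and-set. Returns nonzero if the page
 * was already locked, 0 if the caller now owns the lock.
 */
static int model_try_lock_page(struct model_page *page)
{
	return (atomic_fetch_or(&page->flags, MODEL_PG_LOCKED) & MODEL_PG_LOCKED) != 0;
}

static void model_unlock_page(struct model_page *page)
{
	atomic_fetch_and(&page->flags, ~MODEL_PG_LOCKED);
}

/*
 * Shape of the rule: take the lock or back off. Returns 1 if the page
 * state was examined under the lock, 0 if the page was skipped.
 */
static int model_scan_page(struct model_page *page)
{
	if (model_try_lock_page(page))
		return 0;	/* somebody else owns the page: skip it */

	/* ... page flags may be examined and changed here ... */
	assert(model_page_locked(page));

	model_unlock_page(page);
	return 1;
}
```

With the plain test in place of the test-and-set, two scanners could both see "unlocked" and both go on to change the page state; with the test-and-set only one of them gets past the check.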
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 16:14 ` Linus Torvalds
@ 2000-05-03 16:24 ` Kanoj Sarcar
0 siblings, 0 replies; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 16:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm
>
>
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > > So "is_page_shared()" can be entirely crap. And can tell shrink_mmap()
> >
> > Not really ... look at other places that call is_page_shared, they all
> > hold the pagelock. shrink_mmap does not bother with is_page_shared logic.
>
> That wasn't my argument.
>
> My argument is that yes, the _callers_ of is_page_shared() all hold the
> page lock. No question about that. But the things that is_page_shared()
> actually tests can be modified without holding the page lock, so the page
> lock doesn't actually _protect_ it. See?
>
Give me an example where the page_lock is not actually protecting the
"sharedness" of the page. Note that though the page_count and swap_count
are not themselves protected by page_lock, the "sharedness" could never
change while you have the page_lock. "Sharedness" being whatever
is_page_shared() returns. Unless you can give me an example ....
Wait a second. I was familiar with is_page_shared() having
	if (PageSwapCache(page))
		count += swap_count(page) - 2;
and now I see it is
	if (PageSwapCache(page))
		count += swap_count(page) - 2 - !!page->buffers;
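A small user-space model of the two forms of the check quoted above, to make the arithmetic concrete. `model_page` and both `model_is_page_shared_*` functions are hypothetical; the starting value (the page count) and the final `count > 1` comparison are not quoted in this thread and are assumed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of the fields the check reads. */
struct model_page {
	int page_count;		/* stands in for page_count(page) */
	int in_swap_cache;	/* stands in for PageSwapCache(page) */
	int swap_count;		/* stands in for swap_count(page) */
	void *buffers;		/* stands in for page->buffers */
};

/* Older form: discount the swap cache's own page and swap references. */
static int model_is_page_shared_old(const struct model_page *page)
{
	int count;

	assert(page != NULL);
	count = page->page_count;
	if (page->in_swap_cache)
		count += page->swap_count - 2;
	return count > 1;	/* assumed comparison */
}

/* Newer form: additionally discount one reference if buffers are attached. */
static int model_is_page_shared_new(const struct model_page *page)
{
	int count;

	assert(page != NULL);
	count = page->page_count;
	if (page->in_swap_cache)
		count += page->swap_count - 2 - !!page->buffers;
	return count > 1;	/* assumed comparison */
}
```

In this model a swap-cache page with page count 2, swap count 2 and buffers attached is reported as shared by the old form and as not shared by the new one, which is the difference the extra term makes.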
Kanoj
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 16:19 ` Linus Torvalds
@ 2000-05-03 16:35 ` Kanoj Sarcar
2000-05-03 17:16 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 16:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm
>
>
>
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> >
> > Note that try_to_swap_out holds the vmlist/page_table_lock on the
> > victim process, as well as lock_kernel, and though this is not the
> > easiest code to analyze, it seems to me that is enough protection
> > on the swapcache pages.
>
> The swapcache code gets none of those locks as far as I can tell.
>
> The swapcache code gets the page lock, and the "page cache" lock. But it
> doesn't get the vmlist lock (the swap cache is not associated with any
> particular mm), nor does it get the kernel lock (I think - I didn't look
> through the code-paths).
>
What we are coming down to is a case by case analysis. For example,
do_wp_page, which does pull a page out of the swap cache, has the
vmlist_lock. do_swap_page does not, but neither is the page in the
pte at that point. free_page_and_swap_cache already has the vmlist_lock.
Some of this is documented in Documentation/vm/locking under the
section "Swap cache locking".
Kanoj
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 16:35 ` Kanoj Sarcar
@ 2000-05-03 17:16 ` Linus Torvalds
2000-05-03 17:31 ` Kanoj Sarcar
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 17:16 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm
On Wed, 3 May 2000, Kanoj Sarcar wrote:
>
> What we are coming down to is a case by case analysis. For example,
> do_wp_page, which does pull a page out of the swap cache, has the
> vmlist_lock.
_which_ vmlist? You can share swapcache entries on multiple VM's, and that
is exactly what is_page_shared() is trying to protect against.
Let's say that we have page X in the swap cache from process 1.
Process 2 also has that page, but it's in the page tables.
We do a vmscan on process 2, and will do a "swap_duplicate()" on the swap
entry that we find in page X and free the page (leaving it _just_ in the
swap cache), but at that exact moment another process 1 exits, for
example, and calls free_page_and_swap_cache(). If is_page_shared() gets
that wrong, we're now going to delete the page from the swap cache, yet we
now have an entry to it in the page tables on process 2.
And none of this seems to be synchronized - the vmlist lock is two
separate locks and doesn't protect this case. And as we've seen, vmscan
doesn't get the page lock.
Note that I don't actually believe in this scenario on x86, because with
processor ordering I suspect that is_page_shared() should still at worst
be too pessimistic, which is ok. I just think it's conceptually wrong.
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 17:16 ` Linus Torvalds
@ 2000-05-03 17:31 ` Kanoj Sarcar
2000-05-03 18:17 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 17:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm
>
>
>
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> >
> > What we are coming down to is a case by case analysis. For example,
> > do_wp_page, which does pull a page out of the swap cache, has the
> > vmlist_lock.
>
> _which_ vmlist? You can share swapcache entries on multiple VM's, and that
> is exactly what is_page_shared() is trying to protect against.
>
> Let's say that we have page X in the swap cache from process 1.
>
> Process 2 also has that page, but it's in the page tables.
>
> We do a vmscan on process 2, and will do a "swap_duplicate()" on the swap
> entry that we find in page X and free the page (leaving it _just_ in the
> swap cache), but at that exact moment another process 1 exits, for
> example, and calls free_page_and_swap_cache(). If is_page_shared() gets
> that wrong, we're now going to delete the page from the swap cache, yet we
> now have an entry to it in the page tables on process 2.
>
Okay, here's this example in a little more detail:
Page X: page ref count: 1 (from swapcache) + 1 (from P2)
swap ref count: 1 (from swapcache) + 1 (from P1)
try_to_swap_out will do something like this:
after the swap_duplicate:
Page X: page ref count: 1 (from swapcache) + 1 (from P2)
swap ref count: 1 (from swapcache) + 1 (from P1) + 1 (swap_duplicate)
later on, after __free_page:
Page X: page ref count: 1 (from swapcache)
swap ref count: 1 (from swapcache) + 1 (from P1) + 1 (swap_duplicate)
At no point between the time try_to_swap_out() is running, will is_page_shared()
wrongly indicate the page is _not shared_, when it is really shared (as you
say, it is pessimistic).
Process 2 doing a free_page_and_swap_cache will throughout see the page as
shared.
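The three states above can be checked mechanically. This is a sketch only: `model_counts` and the `model_*` names are hypothetical, the page is assumed to be in the swap cache with no buffers attached, and the `count > 1` comparison is assumed rather than quoted from the kernel source.

```c
#include <assert.h>

struct model_counts {
	int page_refs;	/* page reference count */
	int swap_refs;	/* swap reference count */
};

/*
 * Sharedness test for a swap-cache page without buffers. The "- 2"
 * discounts the swap cache's own page reference and swap reference.
 */
static int model_shared(struct model_counts c)
{
	/* A swap-cache page always carries the cache's own two references. */
	assert(c.page_refs >= 1 && c.swap_refs >= 1);
	return c.page_refs + c.swap_refs - 2 > 1;
}

/* The three states from the example, written the same way as in the text. */
static const struct model_counts model_initial    = { 1 + 1, 1 + 1 };
static const struct model_counts model_after_dup  = { 1 + 1, 1 + 1 + 1 };
static const struct model_counts model_after_free = { 1,     1 + 1 + 1 };

/* Returns 1 if the page looks shared in every one of the three states. */
static int model_always_shared(void)
{
	return model_shared(model_initial) &&
	       model_shared(model_after_dup) &&
	       model_shared(model_after_free);
}
```

The adjusted counts come out as 2, 3 and 2, so the page is reported as shared in all three states, provided the observer sees the two counters change in that order.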
A similar race in transferring the pageref count to swapcount also exists
in do_swap_page(), there the pagelock is held ...
When I sent some of the swapcache locking code to you, I convinced myself
that the code was protected. Of course, I might have let some conditions
slip by in my reasoning, the code hasn't changed that much since then ...
Kanoj
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 17:31 ` Kanoj Sarcar
@ 2000-05-03 18:17 ` Linus Torvalds
2000-05-03 18:37 ` Rajagopal Ananthanarayanan
` (2 more replies)
0 siblings, 3 replies; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 18:17 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm
On Wed, 3 May 2000, Kanoj Sarcar wrote:
>
> At no point between the time try_to_swap_out() is running, will is_page_shared()
> wrongly indicate the page is _not shared_, when it is really shared (as you
> say, it is pessimistic).
Note that this is true only if you assume processor ordering.
With no common locks, a less strictly ordered system (like an alpha) might
see the update of the swap-count _much_ later on the second CPU, so that
is_page_shared() may end up not being pessimistic after all (it could get
the new page count, but the old swap-count, and thinks that the page is
free to be removed from the swap cache).
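The mixed observation described above can be worked through with plain arithmetic. This is a hypothetical user-space model, not kernel code: `model_view` and the `model_*` names are invented for illustration, the counts are taken from the two-process example earlier in the thread, and the `count > 1` comparison is assumed.

```c
#include <assert.h>

/* What one CPU believes the counts to be at the moment it runs the check. */
struct model_view {
	int page_refs_seen;
	int swap_refs_seen;
};

/* Sharedness test for a swap-cache page without buffers. */
static int model_shared(struct model_view v)
{
	assert(v.page_refs_seen >= 1 && v.swap_refs_seen >= 1);
	return v.page_refs_seen + v.swap_refs_seen - 2 > 1;
}

/* Counts before try_to_swap_out() touches them, and after it is done. */
static const struct model_view model_before = { 2, 2 };
static const struct model_view model_after  = { 1, 3 };

/*
 * A second CPU that observes the stores out of order: it already sees
 * the decremented page count but still sees the old swap count.
 */
static struct model_view model_stale_mix(void)
{
	struct model_view v = {
		model_after.page_refs_seen,
		model_before.swap_refs_seen
	};
	return v;
}
```

Both consistent views give an adjusted count above 1, so the page is shared. The mixed view gives 1 + 2 - 2 = 1, so the page looks unshared and the swap cache entry would look safe to delete.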
This is why not having a shared lock looks like a bug to me. Even if that
particular bug might never trigger on an x86.
_Something_ obviously triggers on the x86, though.
Note that we may be barking up the wrong tree here: it may be a completely
different page mishandling that causes this. For example, one bug in NFS
used to be that it free'd a page that was allocated with "alloc_pages()"
using "free_page()" - which takes the virtual address and only works for
"normal" pages. Now, if you have more than about 960MB of memory and the
allocated page was a highmem page, you may end up freeing the wrong page
due to mixing metaphors, and suddenly the page counts are wrong.
And with the wrong page counts, the BUG() can/will happen only much later,
because an innocent "__free_page()" ends up doing the BUG(), but the real
offender happened earlier.
We fixed one such bug in NFS. Maybe there are more lurking? How much
memory do the machines have that have problems?
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 18:17 ` Linus Torvalds
@ 2000-05-03 18:37 ` Rajagopal Ananthanarayanan
2000-05-03 18:37 ` Kanoj Sarcar
2000-05-03 21:28 ` Jeff Garzik
2 siblings, 0 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-03 18:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm
Linus Torvalds wrote:
>
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> >
> > At no point between the time try_to_swap_out() is running, will is_page_shared()
> > wrongly indicate the page is _not shared_, when it is really shared (as you
> > say, it is pessimistic).
>
>
> _Something_ obviously triggers on the x86, though.
IMHO, that's the right attitude. I really like the idea of
having the page locked if its state is being fiddled with.
I know, we don't fully understand the problem, in the sense
that no one has been able to construct a sample execution
which will hit the bug. But so what? Since the bug is elusive,
even if one comes up with a scenario, no saying that _that_
is what happened during the particular manifestation.
[ ... ]
>
> We fixed one such bug in NFS. Maybe there are more lurking? How much
> memory do the machines have that have problems?
>
I don't use NFS on my test systems. So, that couldn't have
been a problem. I had about 64MB of memory in the system.
BTW, I've been running the test (some tar & diff) for
several hours now on the same system. The system is staying up fine.
regards,
ananth.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 18:17 ` Linus Torvalds
2000-05-03 18:37 ` Rajagopal Ananthanarayanan
@ 2000-05-03 18:37 ` Kanoj Sarcar
2000-05-03 19:41 ` Rajagopal Ananthanarayanan
2000-05-03 21:28 ` Jeff Garzik
2 siblings, 1 reply; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 18:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm
>
>
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> >
> > At no point between the time try_to_swap_out() is running, will is_page_shared()
> > wrongly indicate the page is _not shared_, when it is really shared (as you
> > say, it is pessimistic).
>
> Note that this is true only if you assume processor ordering.
>
True ... not to deviate from the current topic, I would think that instead
of imposing locks here, you would want to inject instructions (like the
mips "sync") that make sure memory is consistent. Imposing locks is a
roundabout way of ensuring memory consistency, since the unlock normally
has this "sync" type instruction encoded in it anyway.
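A rough sketch of what ordering without a shared lock could look like. It uses C11 release/acquire atomics as a stand-in for explicit barrier instructions; it is not the kernel's barrier macros and not a proposed patch. `model_counts` and the `model_*` functions are hypothetical, and the `count > 1` comparison is assumed.

```c
#include <assert.h>
#include <stdatomic.h>

struct model_counts {
	atomic_int page_refs;
	atomic_int swap_refs;
};

/*
 * Writer (the try_to_swap_out() role): take the extra swap reference
 * first, then drop the page reference with release ordering, so the
 * increment is published no later than the decrement.
 */
static void model_move_ref_to_swap(struct model_counts *c)
{
	int old;

	atomic_fetch_add_explicit(&c->swap_refs, 1, memory_order_relaxed);
	old = atomic_fetch_sub_explicit(&c->page_refs, 1, memory_order_release);
	/* The mapping's reference plus the swap cache's reference. */
	assert(old >= 2);
	(void)old;
}

/*
 * Reader (the is_page_shared() role): acquire-load the page count
 * first. If it observes the decremented page count, it is guaranteed
 * to observe the incremented swap count as well, so any error is on
 * the pessimistic side.
 */
static int model_shared(struct model_counts *c)
{
	int page_refs = atomic_load_explicit(&c->page_refs, memory_order_acquire);
	int swap_refs = atomic_load_explicit(&c->swap_refs, memory_order_relaxed);

	return page_refs + swap_refs - 2 > 1;
}
```

The guarantee depends on every reader loading the two counters in this order and every writer updating them in this order, which is the kind of convention that is easy to break by accident.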
> With no common locks, a less strictly ordered system (like an alpha) might
> see the update of the swap-count _much_ later on the second CPU, so that
> is_page_shared() may end up not being pessimistic after all (it could get
> the new page count, but the old swap-count, and thinks that the page is
> free to be removed from the swap cache).
>
> This is why not having a shared lock looks like a bug to me. Even if that
> particular bug might never trigger on an x86.
>
> _Something_ obviously triggers on the x86, though.
>
> Note that we may be barking up the wrong tree here: it may be a completely
> different page mishandling that causes this. For example, one bug in NFS
> used to be that it free'd a page that was allocated with "alloc_pages()"
> using "free_page()" - which takes the virtual address and only works for
> "normal" pages. Now, if you have more than about 960MB of memory and the
> allocated page was a highmem page, you may end up freeing the wrong page
> due to mixing metaphors, and suddenly the page counts are wrong.
>
Absolutely ... any subsystem which is screwing up the page reference count
would lead to a similar symptom. Very hard to track these ... maybe I will
take some time near the end of the week to run Juan's programs.
Kanoj
> And with the wrong page counts, the BUG() can/will happen only much later,
> because an innocent "__free_page()" ends up doing the BUG(), but the real
> offender happened earlier.
>
> We fixed one such bug in NFS. Maybe there are more lurking? How much
> memory do the machines have that have problems?
>
> Linus
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 18:37 ` Kanoj Sarcar
@ 2000-05-03 19:41 ` Rajagopal Ananthanarayanan
0 siblings, 0 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-03 19:41 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Linus Torvalds, linux-mm
Kanoj Sarcar wrote:
>
> >
> >
> > On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > >
> > > At no point between the time try_to_swap_out() is running, will is_page_shared()
> > > wrongly indicate the page is _not shared_, when it is really shared (as you
> > > say, it is pessimistic).
> >
> > Note that this is true only if you assume processor ordering.
> >
>
> True ... not to deviate from the current topic, I would think that instead
> of imposing locks here, you would want to inject instructions (like the
> mips "sync") that make sure memory is consistent. Imposing locks is a
> roundabout way of ensuring memory consistency, since the unlock normally
> has this "sync" type instruction encoded in it anyway.
Using "sync"-type operations to ensure memory ordering is not an
approach I'd recommend ... We've used it in only a couple of places
in IRIX synchronization code; but I'm yet to meet anyone who can
comfortably argue the correctness of it. Also, it opens up
chances of the compiler screwing up the writes ... and _those_ bugs
are really hard to pin down.
Furthermore, in the case at hand in Linux, we are not dealing with
high performance operations ... this is swapping after all.
Finally, in an MP system, the s/w synchronization primitives (lock/unlock/rwlock, etc.)
are the building blocks for ensuring correctness of interleaved execution.
Let's use those instead of low-level h/w primitives. Optimizations
can be pushed into (and isolated to) the implementation of the s/w
synchronization primitives.
--
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 18:17 ` Linus Torvalds
2000-05-03 18:37 ` Rajagopal Ananthanarayanan
2000-05-03 18:37 ` Kanoj Sarcar
@ 2000-05-03 21:28 ` Jeff Garzik
2 siblings, 0 replies; 45+ messages in thread
From: Jeff Garzik @ 2000-05-03 21:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kanoj Sarcar, Rajagopal Ananthanarayanan, linux-mm
Linus Torvalds wrote:
> We fixed one such bug in NFS. Maybe there are more lurking? How much
> memory do the machines have that have problems?
FWIW
Dual P-II w/ 128 MB of memory. pre7-2 and pre7-3 (with #error removed)
both boot up and let me login ok -- I have an NFS automounted home dir.
But... doing a lot of "netscaping" -- clicking around, opening new
windows, making the machine do lots of mmap() and swap -- causes the box
to lock hard.
I'm gonna hook up a serial console and see if I can get output. Might
try booting w/ CONFIG_SMP/num-cpus==1 to see if that triggers any ugly
behavior too.
I'll also try to reproduce the problem without NFS in the picture.
Jeff
--
Jeff Garzik | Nothing cures insomnia like the
Building 1024 | realization that it's time to get up.
MandrakeSoft, Inc. | -- random fortune
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-03 16:08 ` Kanoj Sarcar
2000-05-03 16:14 ` Linus Torvalds
@ 2000-05-04 1:38 ` Linus Torvalds
2000-05-04 2:44 ` Rajagopal Ananthanarayanan
` (2 more replies)
1 sibling, 3 replies; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 1:38 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm, David S. Miller
Ok,
there's a pre7-4 out there that does the swapout with the page locked.
I've given it some rudimentary testing, but certainly nothing really
exotic. Please comment..
David pointed out that swapout_highmem can't really work, and he's right
and wrong. It does work, but it works for rather undocumented reasons: it
only gets invoked for anonymous dirty pages, and they are always
cow-shared, so it's ok to "break" the page up into an "old" page and a
"new" page with the same contents. Even though it's not legal in general.
I'm not claiming that this fixes any known bugs, but it _does_ mean that
we probably have the page locked in all fundamental cases where it really
matters. If anybody finds a case where we play with the page-cached-ness
(or similar) of a page without holding the page lock, please holler
loudly.
This way it should be easy to verify that yes, our coherency is fine.
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 1:38 ` Linus Torvalds
@ 2000-05-04 2:44 ` Rajagopal Ananthanarayanan
2000-05-04 4:05 ` Linus Torvalds
2000-05-04 3:16 ` Rajagopal Ananthanarayanan
2000-05-04 7:42 ` Rajagopal Ananthanarayanan
2 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 2:44 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller
Linus Torvalds wrote:
>
> Ok,
> there's a pre7-4 out there that does the swapout with the page locked.
> I've given it some rudimentary testing, but certainly nothing really
> exotic. Please comment..
One quick comment: Looking at this part of the diff to mm/vmscan.c:
----------
@@ -138,6 +139,7 @@
 	flush_tlb_page(vma, address);
 	vmlist_access_unlock(vma->vm_mm);
 	error = swapout(page, file);
+	UnlockPage(page);
 	if (file) fput(file);
 	if (!error)
 		goto out_free_success;
-----------------
Didn't you mean the UnlockPage() to go before swapout(...)?
For example, one of the swapout routines, filemap_write_page()
expects the page to be unlocked. If called with page locked,
I'd expect a "double-trip" dead-lock. Right?
Like you said in an earlier mail, most of the code in
try_to_swap_out expects the page to be unlocked. Now,
of course, the reverse is true ... need to watch out!
--
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 1:38 ` Linus Torvalds
2000-05-04 2:44 ` Rajagopal Ananthanarayanan
@ 2000-05-04 3:16 ` Rajagopal Ananthanarayanan
2000-05-04 4:10 ` Linus Torvalds
2000-05-04 7:42 ` Rajagopal Ananthanarayanan
2 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 3:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller
Linus Torvalds wrote:
>
> Ok,
> there's a pre7-4 out there that does the swapout with the page locked.
> I've given it some rudimentary testing, but certainly nothing really
> exotic. Please comment..
>
One other problem with having the page locked in
try_to_swap_out() is in the call to
prepare_highmem_swapout() when the incoming
page is in highmem. Then,
(1) The newly allocated page (regular_page) needs to be locked.
    This may be as trivial as setting PG_locked in regular_page,
    since no one else knows about it.
(2) Before __free_page() is called on the incoming highmem
    page it needs to be unlocked --- otherwise, we'll have
    deja vu all over again in __free_pages_ok!
This is a little tricky however, since not all callers of
prepare_highmem_swapout() have the incoming page locked.
For now, you can get away with something like
(in mm/highmem.c):
 	/*
 	 * ok, we can just forget about our highmem page since
 	 * we stored its data into the new regular_page.
 	 */
+	if (PageLocked(page))
+		UnlockPage(page);
 	__free_page(page);
--
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 2:44 ` Rajagopal Ananthanarayanan
@ 2000-05-04 4:05 ` Linus Torvalds
0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 4:05 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan; +Cc: Kanoj Sarcar, linux-mm, David S. Miller
On Wed, 3 May 2000, Rajagopal Ananthanarayanan wrote:
>
> One quick comment: Looking at this part of the diff to mm/vmscan.c:
>
> ----------
> @@ -138,6 +139,7 @@
>  	flush_tlb_page(vma, address);
>  	vmlist_access_unlock(vma->vm_mm);
>  	error = swapout(page, file);
> +	UnlockPage(page);
>  	if (file) fput(file);
>  	if (!error)
>  		goto out_free_success;
> -----------------
>
> Didn't you mean the UnlockPage() to go before swapout(...)?
> For example, one of the swapout routines, filemap_write_page()
> expects the page to be unlocked. If called with page locked,
> I'd expect a "double-trip" dead-lock. Right?
Nope. I changed swap_out() so that it gets called with the page locked
(which is much more like the other VM routines work too). Otherwise the
first thing swap_out() would do would be to just re-lock the page, and then
you'd have a window between the caller and the callee when neither the
page lock nor the page table lock were held.
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 3:16 ` Rajagopal Ananthanarayanan
@ 2000-05-04 4:10 ` Linus Torvalds
2000-05-05 4:46 ` Rajagopal Ananthanarayanan
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 4:10 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan; +Cc: Kanoj Sarcar, linux-mm, David S. Miller
On Wed, 3 May 2000, Rajagopal Ananthanarayanan wrote:
>
> One other problem with having the page locked in
> try_to_swap_out() is in the call to
> prepare_highmem_swapout() when the incoming
> page is in highmem.
Look at how I handled this in pre7-4.
Just unlocking the old page and returning with the new page locked is
quite acceptable. The "prepare_highmem_swapout()" thing breaks the
association with the pages anyway, and as such there is no race (and this
is allowable only exactly because of the anonymous and non-shared nature
of a private COW-mapping - which is the only thing we accept in that
code-path anyway).
Doing it that way means that there are no special cases in vmscan.c.
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 1:38 ` Linus Torvalds
2000-05-04 2:44 ` Rajagopal Ananthanarayanan
2000-05-04 3:16 ` Rajagopal Ananthanarayanan
@ 2000-05-04 7:42 ` Rajagopal Ananthanarayanan
2000-05-04 15:33 ` Linus Torvalds
2 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 7:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller
Linus Torvalds wrote:
>
> Ok,
> there's a pre7-4 out there that does the swapout with the page locked.
I did some testing of this patch with dbench.
The kernel starts shooting processes down pretty quickly
("VM: killing process XXX") on a 2 CPU 64MB system,
with nothing but dbench (8 clients). A concurrently
running vmstat shows very low free memory with some swapping,
and the buffer space remaining around 50MB.
I had applied the 7-4 patch on top of pre6.
When the patch was reversed (leaving just pre6),
the resulting kernel did not have any problems
running dbench in several tries.
Will try some more tomorrow after hearing others' experiences.
ananth.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 7:42 ` Rajagopal Ananthanarayanan
@ 2000-05-04 15:33 ` Linus Torvalds
2000-05-04 15:57 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 15:33 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan
Cc: Kanoj Sarcar, linux-mm, David S. Miller, Rik van Riel
On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
>
> I did some testing of this patch with dbench.
> The kernel starts shooting processes down pretty quickly
> ("VM: killing process XXX") on a 2 CPU 64MB system,
> with nothing but dbench (8 clients). A concurrently
> running vmstat shows very low free memory with some swapping,
> and the buffer space remaining around 50MB.
Ok. The page locking patch should not change any swap behaviour at all, so
this behaviour is likely to be due to the pageout changes by Rik.
(Oh, the page locking might cause another part of the vmscanning logic to
temporarily ignore a page because it is locked, but that should be a very
small second-order effect compared to the "big picture" changes in how
much to page out).
Rik, I think the kswapd logic is wrong, and I suspect you made it
worse when you added the while-loop. The problem looks like that while
kswapd is working on one zone, it will entirely ignore any other zones. I
think the logic should be more like
	for (;;) {
		int something_to_do = 0;
		pgdat = pgdat_list;
		while (pgdat) {
			for (i = 0; i < MAX_NR_ZONES; i++) {
				zone = pgdat->node_zones + i;
				if (!zone->size || !zone->zone_wake_kswapd)
					continue;
				something_to_do = 1;
				do_try_to_free_pages(GFP_KSWAPD, zone);
			}
			run_task_queue(&tq_disk);
			pgdat = pgdat->node_next;
		}
		if (something_to_do) {
			if (tsk->need_resched)
				schedule();
			continue;
		}
		tsk->state = TASK_INTERRUPTIBLE;
		interruptible_sleep_on(&kswapd_wait);
	}
See? This has two changes to the current logic:
 - it is more "balanced" on the do_try_to_free_pages(), ie it calls it for
   different zones instead of repeating one zone until no longer needed.
 - it continues to do this until no zone needs balancing any more, unlike
   the old one that could easily lose kswapd wakeup-requests and just do
   one zone.
What do you think? I suspect that the added do-loop in pre7 just made the
"lost wakeups" problem worse by concentrating on one zone for a longer
while and thus more likely to lose wakeups for lower zones (because it
already looked at those).
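The balancing idea above can be tried out in userspace. What follows is only
a rough sketch, not kernel code: toy_zone, toy_free_pages() and the other
toy_* names are hypothetical stand-ins for the zone structure and
do_try_to_free_pages(), and a non-zero "deficit" plays the role of
zone_wake_kswapd. It shows one batch of work per needy zone per pass, with
passes repeated until no zone needs balancing.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_ZONES 3

/* Hypothetical stand-in for a zone: a size and an outstanding page deficit. */
struct toy_zone {
	unsigned long size;	/* 0 means the zone is not populated */
	unsigned long deficit;	/* pages still to free; non-zero acts like
				   zone_wake_kswapd */
};

/* Hypothetical stand-in for do_try_to_free_pages(): free up to 'batch' pages. */
static void toy_free_pages(struct toy_zone *zone, unsigned long batch)
{
	zone->deficit -= (zone->deficit < batch) ? zone->deficit : batch;
}

/*
 * One pass over all zones, doing a single batch of work for each zone
 * that needs it. Returns non-zero if any zone needed work.
 */
static int toy_kswapd_pass(struct toy_zone *zones, size_t nr, unsigned long batch)
{
	int something_to_do = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!zones[i].size || !zones[i].deficit)
			continue;
		something_to_do = 1;
		toy_free_pages(&zones[i], batch);
	}
	return something_to_do;
}

/*
 * Repeat passes until no zone needs balancing any more.
 * Returns the number of passes that found work to do.
 */
static unsigned long toy_kswapd_balance(struct toy_zone *zones, size_t nr,
					unsigned long batch)
{
	unsigned long passes = 0;

	assert(batch > 0);	/* a zero batch would never make progress */
	while (toy_kswapd_pass(zones, nr, batch))
		passes++;
	return passes;
}
```

With two populated zones that both have a deficit, every pass makes progress
on both of them, rather than finishing one zone before looking at the other.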
There might be other details like this lurking, but this looks like a good
first try. Ananth, willing to give it a whirl?
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 15:33 ` Linus Torvalds
@ 2000-05-04 15:57 ` Rik van Riel
2000-05-04 17:19 ` Rajagopal Ananthanarayanan
2000-05-04 20:40 ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Roger Larsson
2 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2000-05-04 15:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rajagopal Ananthanarayanan, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Linus Torvalds wrote:
> 	for (;;) {
> 		int something_to_do = 0;
> 		pgdat = pgdat_list;
> 		while (pgdat) {
> 			for (i = 0; i < MAX_NR_ZONES; i++) {
> 				zone = pgdat->node_zones + i;
> 				if (!zone->size || !zone->zone_wake_kswapd)
> 					continue;
> 				something_to_do = 1;
> 				do_try_to_free_pages(GFP_KSWAPD, zone);
> 			}
> 			run_task_queue(&tq_disk);
> 			pgdat = pgdat->node_next;
> 		}
> 		if (something_to_do) {
> 			if (tsk->need_resched)
> 				schedule();
> 			continue;
> 		}
> 		tsk->state = TASK_INTERRUPTIBLE;
> 		interruptible_sleep_on(&kswapd_wait);
> 	}
>
> See? This has two changes to the current logic:
> - it is more "balanced" on the do_try_to_free_pages(), ie it calls it for
> different zones instead of repeating one zone until no longer needed.
> - it continues to do this until no zone needs balancing any more, unlike
> the old one that could easily lose kswapd wakeup-requests and just do
> one zone.
>
> What do you think?
Indeed, this is probably better ...
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 15:33 ` Linus Torvalds
2000-05-04 15:57 ` Rik van Riel
@ 2000-05-04 17:19 ` Rajagopal Ananthanarayanan
2000-05-04 17:41 ` Rik van Riel
2000-05-04 20:40 ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Roger Larsson
2 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 17:19 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller, Rik van Riel
Linus Torvalds wrote:
>
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > I did some testing of this patch with dbench.
> > The kernel starts shooting processes down pretty quickly
> > ("VM: killing process XXX") on a 2 CPU 64MB system,
> > with nothing but dbench (8 clients). A concurrently
> > running vmstat shows very low free memory with some swapping,
> > and the buffer space remaining around 50MB.
[ ... ]
>
> Rik, I think the kswapd logic is wrong, and I suspect you made it
> worse when you added the while-loop. The problem looks like that while
> kswapd is working on one zone, it will entirely ignore any other zones. I
> think the logic should be more like
[ ... ]
> What do you think? I suspect that the added do-loop in pre7 just made the
> "lost wakeups" problem worse by concentrating on one zone for a longer
> while and thus more likely to lose wakeups for lower zones (because it
> already looked at those).
>
> There might be other details like this lurking, but this looks like a good
> first try. Ananth, willing to give it a whirl?
>
> Linus
I haven't looked at the code, but I replaced the whole while (1) loop
with the new for(;;). Things still remain the same: when running
dbench, the VM starts killing processes. Following is a vmstat trace during
the dbench run with the latest change to kswapd. For some reason,
the system seems to swap more with the change; I've attached a second
vmstat trace as of early AM today before the change. Again, I have
a 2 CPU box with 64M memory ... the tests are very controlled, nothing
changes except what I want to change (kernel bits).
------------- vmstat trace with kswapd change ----------------------
[root@delilah /root]# vmstat 1 1000
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 41664 968 11144 0 0 12 2 55 14 1 1 98
0 0 0 0 41664 968 11144 0 0 0 0 106 13 0 0 100
0 0 0 0 41664 968 11144 0 0 0 0 105 8 0 0 100
0 0 0 0 41664 968 11144 0 0 0 0 104 8 0 0 100
0 0 0 0 41664 968 11144 0 0 0 0 107 26 0 0 100
0 0 0 0 41664 968 11144 0 0 0 0 110 36 0 0 100
0 0 0 0 41664 968 11144 0 0 0 0 114 46 0 0 100
0 0 0 0 41236 968 11144 0 0 0 0 108 42 0 0 100
0 0 0 0 41216 968 11144 0 0 0 0 105 23 1 0 99
0 8 0 0 38324 1036 13260 0 0 142 85 203 356 2 5 93
0 8 1 0 24376 1096 25548 0 0 33 3378 239 705 1 21 78
0 8 1 0 21340 1116 28220 0 0 26 2482 304 1960 0 7 92
1 7 1 0 11540 1160 36896 0 0 41 2623 318 5384 1 17 82
0 8 1 0 6212 1180 41568 0 0 20 2813 332 3467 1 8 91
0 8 1 0 3040 1192 44360 0 0 6 2727 271 803 0 7 92
0 8 2 432 1208 1148 46484 0 432 60 6968 754 4099 0 7 92
0 8 2 648 1112 1164 46780 0 216 14 3249 348 2720 0 9 91
0 8 0 3948 1240 1112 49984 0 3300 44 22247 2210 6100 0 5 95
0 8 0 4336 1540 1076 49896 0 388 28 4597 317 222 1 12 87
3 5 1 6132 1200 444 51400 1736 3892 1872 31316 5098 2307 0 15 85
0 7 0 6612 1092 452 51528 172 908 329 7872 862 463 0 21 79
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 5 1 6944 1548 456 51148 192 1052 339 8263 1362 405 1 14 86
2 4 1 7212 1532 436 51216 16 668 35 7667 678 470 0 24 76
2 6 1 7072 1540 436 51136 360 808 224 8202 1170 756 0 21 79
1 3 2 7048 1260 448 51320 28 328 61 4082 484 126 0 20 80
2 1 3 7012 964 460 51720 24 180 46 2545 310 90 0 7 93
1 4 1 6984 992 472 51664 24 192 29 2917 320 91 1 17 82
0 1 5 7064 1508 468 51116 32 248 28 2062 345 85 1 16 83
0 3 3 6956 1540 456 51004 32 288 188 4072 525 183 0 14 85
0 5 1 7168 1516 464 50864 84 628 454 4657 824 249 1 10 89
0 7 0 7500 1108 496 51204 20 672 204 2668 412 171 4 9 87
2 5 0 7484 756 504 51476 0 404 179 3601 558 312 1 6 93
0 7 0 7476 1244 488 51076 4 228 342 1557 402 194 1 5 94
0 6 1 7456 1244 496 51076 4 120 691 1530 399 118 3 5 92
0 7 0 7372 1416 488 50780 28 344 1068 1086 351 178 1 6 93
2 4 0 7352 1448 500 50740 0 244 893 506 413 152 2 3 95
0 6 0 7252 1200 500 50944 48 124 1411 289 225 166 0 7 93
0 7 0 7324 1404 488 50796 72 124 708 531 455 526 1 1 98
0 7 0 7364 1480 488 50804 12 224 858 225 965 892 1 4 96
0 6 1 7284 1424 500 50736 16 240 1986 140 218 192 0 10 89
2 4 0 7420 876 500 51476 4 352 1424 588 599 511 2 10 88
1 6 0 7424 1356 496 50944 68 168 679 1042 695 337 1 3 96
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 6 0 7312 1228 520 50936 44 176 513 2044 476 210 4 8 88
0 6 0 7408 1464 516 50812 12 264 135 1566 277 94 3 12 85
1 5 0 7384 1228 536 50952 0 120 208 2530 312 147 1 6 92
0 6 0 7512 1332 528 50996 0 212 104 1553 326 104 0 10 89
0 6 0 7548 1544 536 50908 16 236 132 2059 430 110 1 7 92
2 4 1 7740 1360 552 51156 64 540 427 4635 614 232 3 6 91
0 6 0 7564 1204 552 51172 56 124 217 1031 333 98 3 3 94
0 6 0 7460 1064 560 51112 12 116 718 1529 423 195 6 7 87
0 6 0 7516 1212 580 50968 16 236 662 559 258 176 2 7 91
0 6 0 7472 1448 596 50580 16 0 638 1000 232 182 5 10 85
1 5 1 7556 1380 608 50700 16 124 354 531 306 147 2 5 93
1 5 0 7604 1204 624 50912 12 180 712 2545 470 139 3 6 91
1 5 0 7612 1372 636 50688 8 292 1082 2573 507 199 2 6 92
0 6 0 7600 1516 652 50548 4 168 392 1042 264 125 5 4 90
0 6 0 7592 1356 676 50756 0 108 442 527 286 157 5 8 87
0 6 0 7488 3400 704 48556 0 0 659 0 213 186 4 2 93
0 6 0 7392 7188 712 44660 0 0 745 0 221 220 11 13 76
0 6 0 7392 2588 724 49252 0 0 302 3000 306 486 14 13 73
0 6 2 7420 2004 732 49944 0 136 130 1292 292 3150 7 7 85
5 2 0 7520 2040 732 49988 0 104 38 2268 333 357 1 2 97
4 3 1 7808 1144 724 51068 8 408 371 7555 844 3960 4 8 88
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
4 2 2 8116 1480 728 50852 24 692 190 5143 664 335 3 6 92
1 5 0 7860 1516 736 50608 24 100 122 3083 452 228 3 9 88
0 6 0 7760 1132 744 50780 4 100 352 44 423 99 3 3 94
4 2 0 7676 2516 744 49292 0 128 610 1032 271 189 15 15 70
3 1 5 8352 852 756 51380 28 1112 624 11778 1640 3164 3 9 88
4 2 0 7928 1420 760 50480 8 264 225 1566 483 306 7 10 83
0 6 0 7876 4504 764 47252 0 0 425 1500 328 144 10 16 74
0 6 0 7876 10068 764 41664 0 0 265 2000 418 124 19 21 60
2 4 1 7876 6608 764 45124 0 0 208 3137 266 1170 18 26 55
0 6 1 7872 1712 764 50000 0 0 255 1943 327 4264 8 14 78
3 3 2 7864 1540 764 50364 0 204 21 2003 357 3479 7 16 77
0 6 2 7716 888 764 50992 0 68 147 2071 306 3040 8 10 83
2 4 2 7804 900 772 51016 0 272 270 3982 599 3871 8 14 78
1 5 1 7688 7028 780 44668 0 8 120 2416 326 1194 25 25 50
0 6 1 7688 3152 780 48540 0 0 189 2364 335 10084 17 39 45
0 6 1 7688 1352 780 50332 0 0 185 1843 381 6073 16 20 64
4 2 1 7708 2636 776 49132 0 120 40 4975 298 2135 6 12 82
4 2 2 7832 3856 792 48100 0 348 259 8521 714 6546 6 10 84
4 2 1 7816 5792 792 46028 4 152 145 5785 377 1006 13 16 71
1 5 1 7804 2348 792 49380 52 0 223 859 247 4483 13 19 69
1 5 1 7828 2624 792 49156 0 112 104 2323 314 3228 11 15 74
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 4 1 7760 6600 792 44908 0 0 204 2878 361 1783 17 21 61
2 5 1 7736 4428 792 47224 220 0 297 3065 417 4822 27 38 35
0 7 1 7728 5128 792 46468 60 0 102 1865 434 2128 7 8 85
4 3 1 7724 2768 792 48844 12 0 169 3404 498 7431 11 16 73
2 4 0 7696 4128 792 47472 0 116 132 7416 444 4061 13 24 63
1 5 1 7692 5016 792 46572 0 0 8 1500 220 63 11 12 76
0 6 0 7672 5000 792 46564 0 0 12 2000 224 102 8 8 84
5 1 0 7672 4720 792 46860 0 0 69 2000 258 796 7 12 80
0 6 1 7672 5080 792 46484 0 0 115 3114 302 1880 28 34 38
0 6 1 7672 4308 792 47256 0 0 7 1718 286 2861 4 10 86
0 6 1 7672 6528 792 45020 12 0 114 1461 330 1508 3 5 92
0 6 0 7672 9080 792 42468 0 0 7 707 296 233 0 3 97
0 6 0 7672 11948 792 39592 8 0 30 3500 317 182 11 14 75
0 6 0 7672 35396 792 16144 0 0 69 0 273 169 0 7 93
0 6 0 7672 39716 792 11824 0 0 119 0 223 246 0 4 96
0 1 0 7628 41828 792 10520 768 0 532 0 216 248 1 1 98
0 0 0 7616 41320 792 10956 320 0 212 0 124 44 0 1 99
0 0 0 7616 41320 792 10956 0 0 0 16 119 12 0 0 100
0 0 0 7616 40580 792 11252 196 0 159 0 127 69 1 0 99
0 0 0 7616 40568 792 11252 0 0 0 0 106 23 1 0 99
0 8 1 7616 28856 792 22556 0 0 50 1045 137 239 2 17 80
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
7 1 1 7616 19400 792 32076 0 0 5 4999 307 1382 1 17 83
0 8 1 7616 8604 792 42832 0 0 10 5133 312 1871 0 19 81
0 8 1 7616 4616 792 46816 0 0 9 4186 295 758 0 8 92
0 8 1 7616 1032 792 50404 0 0 5 3804 325 803 0 10 90
-----------------------------------------------------------------------------------
------------------ vmstat trace as of pure 7-4 patch ----------------------------
[root@delilah /root]# vmstat 1 1000
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 45176 712 8144 0 0 62 8 71 60 3 3 94
0 0 0 0 45176 712 8144 0 0 0 0 106 12 0 0 100
0 1 0 0 45160 716 8144 0 0 1 0 106 14 0 0 100
0 0 0 0 45024 732 8252 0 0 34 59 140 49 0 1 99
0 0 0 0 45024 732 8252 0 0 0 0 121 8 0 0 100
0 0 0 0 45024 732 8252 0 0 0 0 104 10 0 0 100
0 0 0 0 45024 732 8252 0 0 0 0 121 76 0 0 100
0 0 0 0 45020 736 8252 0 0 6 0 117 55 1 0 98
0 2 0 0 44640 772 8324 0 0 27 65 157 84 1 0 98
0 1 0 0 43436 908 8900 0 0 182 0 231 240 5 1 94
1 0 0 0 40604 940 10584 0 0 402 0 173 150 17 2 81
1 0 0 0 40452 952 10896 0 0 84 0 117 80 42 4 54
1 0 0 0 39072 952 10980 0 0 4 0 108 41 50 2 49
1 0 0 0 40476 988 11264 0 0 111 13 146 115 26 3 70
1 0 0 0 40916 988 11196 0 0 8 0 110 76 47 2 50
0 0 0 0 41576 988 11220 0 0 0 0 108 51 17 2 80
0 0 0 0 41576 988 11220 0 0 0 0 104 8 0 0 100
0 0 0 0 41576 988 11220 0 0 0 0 103 8 0 0 100
0 0 0 0 41576 988 11220 0 0 0 1 108 18 0 0 100
0 0 0 0 41576 988 11220 0 0 0 0 115 48 0 0 100
0 0 0 0 41576 988 11220 0 0 0 0 110 32 0 0 100
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 41140 988 11220 0 0 0 0 110 43 0 1 99
0 0 0 0 41128 988 11220 0 0 0 0 111 25 0 0 100
8 0 0 0 30488 1072 20260 0 0 73 0 126 186 1 13 86
0 8 0 0 20308 1120 29148 0 0 4 8500 203 66 1 16 83
0 8 1 0 11632 1160 36792 0 0 30 4599 267 395 0 15 84
0 8 1 0 2240 1208 45052 0 0 8 5787 346 981 1 17 82
0 8 2 0 1228 736 46520 0 0 14 2806 298 735 1 17 82
0 8 3 0 1144 388 46648 0 0 13 5154 392 871 1 15 84
5 4 1 1068 1080 384 47832 0 1068 17 15921 1059 1134 0 11 89
8 0 2 2220 1456 404 48616 0 1152 9 5788 487 123 0 10 89
0 8 2 3044 1028 408 49572 0 840 5 4210 264 95 0 17 83
0 8 1 3540 1100 412 49964 0 500 2 8125 222 89 0 8 92
0 8 2 3852 1088 416 50188 0 312 8 3359 224 89 0 9 91
0 9 1 4048 1072 448 50208 84 204 65 770 350 168 1 22 77
0 8 2 4016 1132 460 50068 200 0 88 5464 376 808 1 31 67
0 8 3 4016 1072 476 50112 56 0 25 4052 363 668 0 11 89
0 8 3 4016 1100 484 50080 0 0 13 4529 356 583 0 11 89
0 8 2 4384 1032 452 50504 0 372 4 9637 345 569 1 10 89
0 8 0 4432 1536 448 50052 0 48 0 1923 219 206 0 2 98
0 8 2 5004 1540 444 50628 0 572 71 6836 416 86 0 7 92
1 3 6 5280 1188 432 51168 0 332 21 9390 714 6872 1 11 88
Killed
[root@delilah /root]# vmstat 1 1000
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 4 0 5672 30508 760 21128 12 31 85 710 114 159 4 8 87
0 0 0 5608 31360 760 20740 436 0 172 121 210 227 0 2 98
0 0 0 5608 31360 760 20740 0 0 0 0 168 10 0 0 100
0 0 0 5608 31360 760 20740 0 0 0 0 104 8 0 0 100
0 0 0 5608 31360 760 20740 0 0 0 0 105 8 0 0 100
0 0 0 5592 31324 760 20756 28 0 11 0 112 24 0 0 100
0 0 0 5592 31308 760 20768 12 0 3 22 124 26 0 0 100
0 0 0 5592 31300 760 20776 8 0 2 0 108 20 0 0 100
0 0 0 5592 31300 760 20776 0 0 0 0 103 8 0 0 100
0 0 0 5592 30860 760 20780 4 0 1 0 106 32 0 0 99
0 0 0 5592 30848 760 20780 0 0 0 0 106 25 0 0 100
1 7 1 5592 16064 760 35156 0 0 63 2365 176 1593 1 23 76
0 8 1 5592 2608 760 48632 0 0 38 5350 448 3538 0 28 72
0 8 2 5820 1060 744 50576 0 252 14 4888 409 848 0 10 89
0 5 4 6196 1208 744 50760 0 384 2 10556 543 1439 0 4 96
2 4 3 6080 1120 520 51372 100 1308 399 30927 2351 6011 0 36 63
0 8 2 5792 1056 552 51092 416 0 389 4967 776 6246 0 41 59
5 3 3 5344 1540 564 50908 188 1052 597 22696 2076 2402 0 25 75
0 4 0 5308 1536 576 50724 100 208 731 5552 732 243 1 7 91
0 5 1 5300 1348 588 50900 120 180 968 7045 809 1056 2 12 87
0 5 0 5200 1356 584 50760 56 0 299 1500 416 108 1 2 96
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 5 2 5100 1524 544 50644 24 56 169 4347 301 406 2 11 87
0 5 0 5128 1392 556 50896 48 68 225 2184 398 198 3 4 93
0 3 1 5032 1368 560 50884 32 0 306 1000 334 109 1 5 94
1 2 0 5100 1048 572 51256 0 96 327 2024 266 104 1 8 91
0 3 0 5080 1376 556 50900 12 40 386 2510 277 90 2 11 86
0 3 0 5160 1332 580 50892 0 120 1105 2030 220 148 4 10 86
0 3 0 5108 1228 588 50944 0 0 286 2500 349 76 2 5 93
2 1 0 5108 2488 604 49672 0 0 645 0 251 162 10 6 84
0 3 0 5140 1028 604 51248 0 232 365 2058 262 148 10 20 71
0 3 0 5452 1172 604 51304 0 392 262 4098 342 160 1 4 95
0 3 0 5324 1160 600 51152 0 0 194 1500 455 67 5 7 87
1 2 0 5308 1500 608 50736 0 0 204 2000 310 83 5 7 88
2 1 1 5228 1260 612 51024 0 92 1408 2023 220 209 33 42 25
0 3 0 5200 1084 616 51148 0 0 5 2000 308 721 7 7 86
0 3 0 5004 3116 616 49120 0 4 0 2001 263 48 5 3 92
3 0 0 4956 4336 616 47868 0 0 588 1500 217 106 31 34 34
0 3 0 4848 1496 616 50676 0 0 145 2000 242 62 14 16 70
3 0 0 4848 3684 616 48500 0 0 385 500 219 64 23 26 51
0 3 0 4848 1280 612 50888 0 0 140 3500 212 64 16 14 70
3 0 0 4848 5464 612 46760 0 0 225 0 201 52 8 23 70
0 3 0 4848 1396 616 50768 0 0 293 3000 207 74 20 26 54
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 2 0 4848 4236 616 47936 0 0 0 1500 258 49 6 4 90
0 3 0 4848 1956 616 50216 0 0 310 2000 209 81 24 27 49
2 1 0 4820 1488 608 50692 0 0 196 1500 303 58 10 8 81
0 3 0 4820 3896 608 48308 8 0 281 1000 237 127 18 22 60
0 3 0 4816 33088 608 19068 0 0 178 0 224 261 1 9 90
0 0 0 4752 34680 608 17808 144 0 232 0 168 121 0 1 98
0 1 0 4748 34600 608 17888 4 0 79 40 138 24 0 1 99
0 0 0 4736 34584 608 17884 8 0 2 0 109 16 0 0 100
0 0 0 4736 33996 608 18036 32 0 134 0 131 66 1 1 98
0 1 0 4736 31912 608 20056 0 0 55 0 132 74 1 4 95
0 8 0 4736 15540 608 36272 0 0 72 10000 217 121 0 23 77
4 4 2 4736 5060 608 46716 0 0 9 1670 222 107 0 20 80
0 8 1 4736 2728 608 48856 0 0 22 1653 225 593 0 8 92
2 6 2 4876 1184 600 50744 0 244 35 10361 400 640 1 10 89
3 5 2 5084 1204 608 50908 0 272 93 9176 647 6848 1 7 92
Killed
--------------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 17:19 ` Rajagopal Ananthanarayanan
@ 2000-05-04 17:41 ` Rik van Riel
2000-05-04 18:18 ` Rajagopal Ananthanarayanan
0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2000-05-04 17:41 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan
Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> Linus Torvalds wrote:
> > There might be other details like this lurking, but this looks like a good
> > first try. Ananth, willing to give it a whirl?
>
> I haven't looked at the code, but I replaced the whole while (1)
> loop with the new for(;;). Things still remain the same: when
> running dbench VM starts killing processes.
I've been thinking about it some more. When we look
carefully the killing is always accompanied by a sudden
decrease in free memory (while kswapd could easily keep
up a few seconds ago).
Having an active/inactive queue, where we maintain a
certain target number of inactive pages, should give us
some more robustness against sudden overload. Also,
guaranteeing that we have indeed a certain number of
freeable pages in every zone...
I'm coding this up as we speak, so please hold on a
little longer ...
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 17:41 ` Rik van Riel
@ 2000-05-04 18:18 ` Rajagopal Ananthanarayanan
2000-05-04 18:43 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 18:18 UTC (permalink / raw)
To: riel; +Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller
Rik van Riel wrote:
>
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> > Linus Torvalds wrote:
>
> > > There might be other details like this lurking, but this looks like a good
> > > first try. Ananth, willing to give it a whirl?
> >
> > I haven't looked at the code, but I replaced the whole while (1)
> > loop with the new for(;;). Things still remain the same: when
> > running dbench VM starts killing processes.
>
> I've been thinking about it some more. When we look
> carefully the killing is always accompanied by a sudden
> decrease in free memory (while kswapd could easily keep
> up a few seconds ago).
You may have something here. It's the burstiness of
the demand. One thing I haven't noticed here in linux-mm
is any approach to throttle the demand (or maybe I haven't
looked enough). Why not keep requests for new pages unsatisfied
if the _rate_ of allocations exceeds the _rate_ of freeing
(through swap-out or through write-out [bdflush])?
Simple counters don't capture rates. We need deltas in
the last 'n' time intervals. Then, match the delta-A
(allocation) to delta-F (free).
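To make the "deltas over the last n intervals" idea concrete, here is a small
userspace sketch. It is an illustration only: the rate_window structure, the
RATE_WINDOW size of 4 and all function names are made up for this example,
and nothing like it exists in the kernel being discussed. It keeps
per-interval counts of allocated and freed pages and compares their sums
over the window.

```c
#include <string.h>

#define RATE_WINDOW 4	/* hypothetical 'n': number of intervals remembered */

struct rate_window {
	unsigned long alloc[RATE_WINDOW];	/* pages allocated per interval */
	unsigned long freed[RATE_WINDOW];	/* pages freed per interval */
	unsigned int cur;			/* index of the current interval */
};

static void rate_init(struct rate_window *rw)
{
	memset(rw, 0, sizeof(*rw));
}

/* Called once per time interval: reuse the oldest slot as the new current one. */
static void rate_tick(struct rate_window *rw)
{
	rw->cur = (rw->cur + 1) % RATE_WINDOW;
	rw->alloc[rw->cur] = 0;
	rw->freed[rw->cur] = 0;
}

static void rate_note_alloc(struct rate_window *rw, unsigned long pages)
{
	rw->alloc[rw->cur] += pages;
}

static void rate_note_free(struct rate_window *rw, unsigned long pages)
{
	rw->freed[rw->cur] += pages;
}

/* Non-zero if allocations over the window outpace frees (delta-A > delta-F). */
static int rate_should_throttle(const struct rate_window *rw)
{
	unsigned long a = 0, f = 0;
	unsigned int i;

	for (i = 0; i < RATE_WINDOW; i++) {
		a += rw->alloc[i];
		f += rw->freed[i];
	}
	return a > f;
}
```

An allocator using this would call rate_note_alloc() and rate_note_free() as
pages move, rate_tick() from a timer, and delay new requests while
rate_should_throttle() is true. Whether delaying is better than reclaiming
synchronously is a separate question.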
Just a thought,
ananth.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 18:18 ` Rajagopal Ananthanarayanan
@ 2000-05-04 18:43 ` Linus Torvalds
2000-05-04 19:00 ` Rik van Riel
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 18:43 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
>
> You may have something here. It's the burstiness of
> the demand.
The way bursty demand is _supposed_ to be handled is that the "demander"
just ends up doing the "try_to_free_pages()" call itself at that point.
There's no way kswapd can handle these cases sanely, and at some point we
just need to start freeing memory synchronously.
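A toy model of that fallback, for illustration only: toy_mem and both toy_*
functions are hypothetical, and the batch size of 8 is arbitrary. The
allocating caller reclaims synchronously when nothing is free, and fails only
when the reclaim pass comes back empty.

```c
/* Toy memory state; all names here are hypothetical. */
struct toy_mem {
	unsigned long free;		/* pages immediately available */
	unsigned long reclaimable;	/* pages that could be freed with effort */
};

/* Stand-in for try_to_free_pages(): returns the number of pages actually freed. */
static unsigned long toy_try_to_free_pages(struct toy_mem *m, unsigned long want)
{
	unsigned long got = (m->reclaimable < want) ? m->reclaimable : want;

	m->reclaimable -= got;
	m->free += got;
	return got;
}

/*
 * The "demander" path: if nothing is free, reclaim synchronously before
 * giving up. Returns 0 on success, -1 if reclaim found nothing either.
 */
static int toy_alloc_page(struct toy_mem *m)
{
	if (!m->free && !toy_try_to_free_pages(m, 8))
		return -1;
	m->free--;
	return 0;
}
```

In this model the allocation fails only once the reclaim step returns
nothing, so a reclaim step that gives up too early turns directly into failed
allocations.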
So what's probably started happening is that "try_to_free_pages()" is not
trying hard enough to free stuff (probably the counters changed, and it
now only scans half the memory it used to under pressure), so when we get
into the bursty demand situation, the allocator ends up giving up in
disgust.
This is something you'll never see under non-bursty load, simply because
kswapd doesn't care - it will continue to page stuff out whether
try_to_free_pages() returns happy or not. So if try_to_free_pages() isn't
trying hard enough, kswapd will compensate by just calling it more.
Note that changing how hard try_to_free_pages() tries to free a page is
exactly part of what Rik has been doing, so this is something that has
changed recently. It's not trivial to get right, for a very simple reason:
we need to balance the "hardness" between the VM area scanning and the LRU
list scanning.
Rik probably balanced it ok, but ended up making it too soft, giving up
much too easily even when memory really would be available if it were to
just try a bit harder..
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 18:43 ` Linus Torvalds
@ 2000-05-04 19:00 ` Rik van Riel
2000-05-04 19:17 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2000-05-04 19:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rajagopal Ananthanarayanan, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Linus Torvalds wrote:
> Note that changing how hard try_to_free_pages() tries to free a page is
> exactly part of what Rik has been doing, so this is something that has
> changed recently. It's not trivial to get right, for a very simple reason:
> we need to balance the "hardness" between the VM area scanning and the LRU
> list scanning.
With the current scheme, it's pretty much impossible to get it
right.
> Rik probably balanced it ok, but ended up making it too soft,
> giving up much too easily even when memory really would be
> available if it were to just try a bit harder..
*nod*
I hope the active/inactive page list scheme will fix this.
(we can push harder since we'll have pages in every stage
of aging every time)
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 19:00 ` Rik van Riel
@ 2000-05-04 19:17 ` Linus Torvalds
2000-05-04 21:16 ` Rajagopal Ananthanarayanan
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 19:17 UTC (permalink / raw)
To: riel; +Cc: Rajagopal Ananthanarayanan, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Rik van Riel wrote:
> On Thu, 4 May 2000, Linus Torvalds wrote:
>
> > Note that changing how hard try_to_free_pages() tries to free a page is
> > exactly part of what Rik has been doing, so this is something that has
> > changed recently. It's not trivial to get right, for a very simple reason:
> > we need to balance the "hardness" between the VM area scanning and the LRU
> > list scanning.
>
> With the current scheme, it's pretty much impossible to get it
> right.
Not really. That is what the "priority levels" are really there for: for
normal use it's actually sufficient to just make sure that the starter
levels (ie 6) balance reasonably well between VM scanning and LRU
scanning. If they balance ok, then system behaviour will be quite
acceptable.
At the same time it is important to make sure that the higher priorities
(ie 1 and 0) try _much_ harder to swap things out than the lower ones.
They don't need to be very balanced, but they need to be effective. That's
why shrink_mmap() uses a quite grotesque
count = nr_lru_pages >> priority;
which means that level 0 will try 64 times harder than level 6 to page
something out. It's also important that once you get to level 0, it really
should scan everything available more than once (once for aging, once for
"everything was aged the first time, the second time we really free
something").
This, I think, is where the new swap_out() falls down flat on its face. It
does a much softer swapout, and the priority is not as aggressive as it is
for shrink_mmap(). Instead of an exponential increase with priority, it
uses a linear one: "counter = nr_threads / (priority+1)".
I suspect that for "priority = 0", we should make sure that "counter" is
at _least_ "nr_threads * 2", simply because we should walk the page tables
at least twice before giving up: the "mm->swap_address" logic means that
walking them once may have started somewhere in the middle and never even
looked at the low values. Because we age, it should probably be more than
that.
So it might be that we should really use something like
counter = (nr_threads << 1) >> (priority >> 1);
instead (just a completely made up heuristic - I just made up something
that has exponential behaviour while still having a starting point close
to what we have now to get roughly the same balancing).
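The three formulas quoted above are easy to compare side by side. The helper
names below are invented for this sketch; only the expressions themselves
come from the text.

```c
/* shrink_mmap()-style scan count: doubles with each step toward priority 0. */
static unsigned long lru_scan_count(unsigned long nr_lru_pages, int priority)
{
	return nr_lru_pages >> priority;
}

/* swap_out() counter as quoted above: grows only linearly. */
static unsigned long swap_counter_linear(unsigned long nr_threads, int priority)
{
	return nr_threads / (priority + 1);
}

/* The made-up exponential heuristic suggested above. */
static unsigned long swap_counter_proposed(unsigned long nr_threads, int priority)
{
	return (nr_threads << 1) >> (priority >> 1);
}
```

For nr_threads = 70, the linear form gives 10 at priority 6 and 70 at
priority 0, while the suggested form gives 17 at priority 6 and 140 at
priority 0. That is exactly nr_threads * 2 at the most urgent level, i.e. two
full walks of the page tables.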
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 15:33 ` Linus Torvalds
2000-05-04 15:57 ` Rik van Riel
2000-05-04 17:19 ` Rajagopal Ananthanarayanan
@ 2000-05-04 20:40 ` Roger Larsson
2 siblings, 0 replies; 45+ messages in thread
From: Roger Larsson @ 2000-05-04 20:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, Rik van Riel
Oh,
This moves the need_resched test out of the inner loop.
Not good if you like low latencies (I like them).
(The do_try_to_free_pages can take quite some time...)
> if (tsk->need_resched)
> schedule();
Please move it back into the inner loop!
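A small userspace model of the difference, with made-up names throughout:
toy_reclaim() stands in for a long do_try_to_free_pages() call during which
need_resched becomes set, and toy_schedule() stands in for schedule().

```c
/* Hypothetical scheduler state for the model. */
struct toy_sched {
	int need_resched;	/* set when some other task should run */
	unsigned long yields;	/* how often we gave up the CPU */
};

/* Stand-in for schedule(): yield and clear the flag. */
static void toy_schedule(struct toy_sched *s)
{
	s->need_resched = 0;
	s->yields++;
}

/* Stand-in for a long reclaim call; assume a reschedule became due meanwhile. */
static void toy_reclaim(struct toy_sched *s)
{
	s->need_resched = 1;
}

/* Check after every zone: at most one reclaim call between yields. */
static void pass_check_inner(struct toy_sched *s, int nr_zones)
{
	int i;

	for (i = 0; i < nr_zones; i++) {
		toy_reclaim(s);
		if (s->need_resched)
			toy_schedule(s);
	}
}

/* Check once per pass: every zone is processed before the first yield. */
static void pass_check_outer(struct toy_sched *s, int nr_zones)
{
	int i;

	for (i = 0; i < nr_zones; i++)
		toy_reclaim(s);
	if (s->need_resched)
		toy_schedule(s);
}
```

With three zones the inner-loop variant yields three times per pass and the
outer variant once, so the worst-case wait for another task grows with the
number of zones that need work.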
/RogerL
Linus Torvalds wrote:
>
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > I did some testing of this patch with dbench.
> > The kernel starts shooting processes down pretty quickly
> > ("VM: killing process XXX") on a 2 CPU 64MB system,
> > with nothing but dbench (8 clients). A concurrently
> > running vmstat shows very low free memory with some swapping,
> > and the buffer space remaining around 50MB.
>
> Ok. The page locking patch should not change any swap behaviour at all, so
> this behaviour is likely to be due to the pageout changes by Rik.
>
> (Oh, the page locking might cause another part of the vmscanning logic to
> temporarily ignore a page because it is locked, but that should be a very
> small second-order effect compared to the "big picture" changes in how
> much to page out).
>
> Rik, I think the kswapd logic is wrong, and I suspect you made it
> worse when you added the while-loop. The problem looks like that while
> kswapd is working on one zone, it will entirely ignore any other zones. I
> think the logic should be more like
>
> 	for (;;) {
> 		int something_to_do = 0;
> 		pgdat = pgdat_list;
> 		while (pgdat) {
> 			for (i = 0; i < MAX_NR_ZONES; i++) {
> 				zone = pgdat->node_zones + i;
> 				if (!zone->size || !zone->zone_wake_kswapd)
> 					continue;
> 				something_to_do = 1;
> 				do_try_to_free_pages(GFP_KSWAPD, zone);
> 			}
> 			run_task_queue(&tq_disk);
> 			pgdat = pgdat->node_next;
> 		}
> 		if (something_to_do) {
> 			if (tsk->need_resched)
> 				schedule();
> 			continue;
> 		}
> 		tsk->state = TASK_INTERRUPTIBLE;
> 		interruptible_sleep_on(&kswapd_wait);
> 	}
>
> See? This has two changes to the current logic:
> - it is more "balanced" on the do_try_to_free_pages(), ie it calls it for
> different zones instead of repeating one zone until no longer needed.
> - it continues to do this until no zone needs balancing any more, unlike
> the old one that could easily lose kswapd wakeup-requests and just do
> one zone.
>
> What do you think? I suspect that the added do-loop in pre7 just made the
> "lost wakeups" problem worse by concentrating on one zone for a longer
> while and thus more likely to lose wakeups for lower zones (because it
> already looked at those).
>
> There might be other details like this lurking, but this looks like a good
> first try. Ananth, willing to give it a whirl?
>
> Linus
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux.eu.org/Linux-MM/
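For illustration only, the round-robin balancing idea in the quoted loop can be modeled in user space roughly as follows. This is a simplified sketch, not kernel code: `struct zone_model`, `free_some_pages()` and `balance_all_zones()` are hypothetical names, and the test `free_pages > pages_high` is an assumed stand-in for the kernel's `zone_wake_kswapd` flag.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for a memory zone (not the kernel's zone type). */
struct zone_model {
    long free_pages;  /* pages currently free */
    long pages_high;  /* zone counts as balanced once free_pages > pages_high */
};

/* Stand-in for do_try_to_free_pages(GFP_KSWAPD, zone): reclaim one batch. */
static void free_some_pages(struct zone_model *z, long batch)
{
    z->free_pages += batch;
}

/*
 * Visit every zone in turn, do a bounded amount of work for each zone that
 * still needs it, and repeat until a full pass finds nothing to do.
 * Returns the number of passes that did work, or -1 if batch is not positive.
 */
static int balance_all_zones(struct zone_model *zones, size_t nr_zones, long batch)
{
    int passes = 0;

    if (batch <= 0)
        return -1;  /* no progress would be possible */

    for (;;) {
        int something_to_do = 0;

        for (size_t i = 0; i < nr_zones; i++) {
            if (zones[i].free_pages > zones[i].pages_high)
                continue;  /* this zone is already balanced */
            something_to_do = 1;
            free_some_pages(&zones[i], batch);
        }
        if (!something_to_do)
            return passes;  /* the point where kswapd would go to sleep */
        passes++;
    }
}
```

Because each pass touches every needy zone once, no zone waits for another zone to be fully balanced first, which is the property the proposed loop is after.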
--
Home page:
http://www.norran.net/nra02596/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 19:17 ` Linus Torvalds
@ 2000-05-04 21:16 ` Rajagopal Ananthanarayanan
2000-05-04 21:51 ` Rik van Riel
2000-05-04 22:21 ` Linus Torvalds
0 siblings, 2 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 21:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller
Linus Torvalds wrote:
>
> On Thu, 4 May 2000, Rik van Riel wrote:
>
> > On Thu, 4 May 2000, Linus Torvalds wrote:
> >
> > > Note that changing how hard try_to_free_pages() tries to free a page is
> > > exactly part of what Rik has been doing, so this is something that has
> > > changed recently. It's not trivial to get right, for a very simple reason:
> > > we need to balance the "hardness" between the VM area scanning and the LRU
> > > list scanning.
> >
> > With the current scheme, it's pretty much impossible to get it
> > right.
>
> Not really. That is what the "priority levels" are really there for: for
[ ... discussion about shrink_mmap() ... ]
> This, I think, is where the new swap_out() falls down flat on its face. It
[ ... discussion about swap_out() ... ]
I looked over the latest (7-4) implementation of swap_out,
shrink_mmap & try_to_free_pages, etc.
One clarification: In the case I reported only
dbench was running, presumably doing a lot of read/write. So, why
isn't shrink_mmap able to find freeable pages? Is it because
the shrink_mmap() is too conservative about implementing LRU?
I mean, it doesn't make sense to swap pages just to keep others
in cache ... if the demand is too high, start shooting down
pages regardless.
Or, is shrink_mmap bailing not because of referenced bit,
but because bdflush is too slow, for example? That is,
are the pages under active I/O, so they can't be freed?
Do you guys think a profile using gcc-style mcount
would be useful?
--
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 21:16 ` Rajagopal Ananthanarayanan
@ 2000-05-04 21:51 ` Rik van Riel
2000-05-04 22:21 ` Linus Torvalds
1 sibling, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2000-05-04 21:51 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan
Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> One clarification: In the case I reported only
> dbench was running, presumably doing a lot of read/write. So, why
> isn't shrink_mmap able to find freeable pages? Is it because
> the shrink_mmap() is too conservative about implementing LRU?
> I mean, it doesn't make sense to swap pages just to keep others
> in cache ... if the demand is too high, start shooting down
> pages regardless.
Indeed, we've seen kswapd fail to get us free pages even
when the total RSS was small...
> Or, is shrink_mmap bailing not because of referenced bit,
> but because bdflush is too slow, for example? That is,
> are the pages under active I/O, so they can't be freed?
>
> Do you guys think a profile using gcc-style mcount
> would be useful?
This could be very useful indeed. To be honest I'm not sure
what is happening (though I have some suspicions).
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 21:16 ` Rajagopal Ananthanarayanan
2000-05-04 21:51 ` Rik van Riel
@ 2000-05-04 22:21 ` Linus Torvalds
2000-05-05 0:47 ` 7-4 VM killing (A solution) Rajagopal Ananthanarayanan
1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 22:21 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
>
> One clarification: In the case I reported only
> dbench was running, presumably doing a lot of read/write. So, why
> isn't shrink_mmap able to find freeable pages? Is it because
> the shrink_mmap() is too conservative about implementing LRU?
Probably. One of the things that has changed is exactly _which_ pages are
on the LRU list, so the old heuristics from shrink_mmap() may need some
tweaking too. In fact, as with vmscan, we should probably scan the LRU
list at least _twice_ when the priority level reaches zero (in order to
defeat the aging).
This is also an area where the secondary effects of the vmscan page
lockedness changes could start showing up - the page being locked on the
LRU list makes a difference to the shrink_mmap() algorithm..
Linus
* 7-4 VM killing (A solution)
2000-05-04 22:21 ` Linus Torvalds
@ 2000-05-05 0:47 ` Rajagopal Ananthanarayanan
2000-05-05 1:30 ` Rik van Riel
2000-05-05 5:13 ` Linus Torvalds
0 siblings, 2 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-05 0:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller
Linus Torvalds wrote:
>
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > One clarification: In the case I reported only
> > dbench was running, presumably doing a lot of read/write. So, why
> > isn't shrink_mmap able to find freeable pages? Is it because
> > the shrink_mmap() is too conservative about implementing LRU?
>
> Probably. One of the things that has changed is exactly _which_ pages are
> on the LRU list, so the old heuristics from shrink_mmap() may need some
> tweaking too. In fact, as with vmscan, we should probably scan the LRU
> list at least _twice_ when the priority level reaches zero (in order to
> defeat the aging).
Ok, I may have a solution after having asked, mostly to myself,
why doesn't shrink_mmap() find pages to free?
The answer apparently is because in 7-4 shrink_mmap(),
unreferenced pages get filed as "young" if the zone has
enough pages in it (free_pages > pages_high).
Because of this bug, if we examine a zone which already
has enough free pages, all referenced pages now go to
the "back" of the lru list.
On a subsequent scan, we may never get to these pages in time.
Comments?
Here's the new code to shrink_mmap:
------------
[ ... ]
    dispose = &young;
    if (test_and_clear_bit(PG_referenced, &page->flags))
        goto dispose_continue;
    if (!page->buffers && page_count(page) > 1)
        goto dispose_continue;
    dispose = &old;
    if (p_zone->free_pages > p_zone->pages_high)
        goto dispose_continue;
    count--;
    /* Page not used -> free it or put it on the old list
     * so it gets freed first the next time */
    if (TryLockPage(page))
        goto dispose_continue;
[ ... ]
-------------------
With this I'm able to run dbench up to 16 threads (using over
0.5 GB of disk). For reference, without the fix,
dbench wouldn't run even with as few as 4 threads (using
much less disk space).
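To make the difference concrete, here is a small user-space model of the two orderings. It is only a sketch under stated assumptions: `struct page_model`, `classify_pre_fix()` and `classify_fixed()` are hypothetical names, the fields are simplified stand-ins for the kernel state, and the pre-fix ordering is inferred from the description above (the zone test runs while `dispose` still points at `&young`), not copied from the 7-4 source.

```c
/* Where a scanned page ends up in this simplified model. */
enum disposition { DISPOSE_YOUNG, DISPOSE_OLD, TRY_TO_FREE };

/* Hypothetical stand-in for the page state that shrink_mmap() looks at. */
struct page_model {
    int referenced;       /* models the PG_referenced bit */
    int has_buffers;      /* models page->buffers != NULL */
    int count;            /* models page_count(page) */
    int zone_has_enough;  /* models p_zone->free_pages > p_zone->pages_high */
};

/* Assumed pre-fix behaviour: a page from a satisfied zone is filed as young. */
static enum disposition classify_pre_fix(struct page_model *p)
{
    if (p->referenced) {
        p->referenced = 0;  /* like test_and_clear_bit() */
        return DISPOSE_YOUNG;
    }
    if (p->zone_has_enough)
        return DISPOSE_YOUNG;
    if (!p->has_buffers && p->count > 1)
        return DISPOSE_YOUNG;
    return TRY_TO_FREE;
}

/* Ordering from the fix: the zone test runs after dispose is switched to old. */
static enum disposition classify_fixed(struct page_model *p)
{
    if (p->referenced) {
        p->referenced = 0;
        return DISPOSE_YOUNG;
    }
    if (!p->has_buffers && p->count > 1)
        return DISPOSE_YOUNG;
    if (p->zone_has_enough)
        return DISPOSE_OLD;
    return TRY_TO_FREE;
}
```

In this model an unreferenced, unshared page in a zone with plenty of free memory is classified as young before the fix and as old after it, so it keeps its place among the first candidates for a later scan.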
>
> This is also an area where the secondary effects of the vmscan page
> lockedness changes could start showing up - the page being locked on the
> LRU list makes a difference to the shrink_mmap() algorithm..
>
> Linus
Kanoj & I looked over your changes (lot easier to do over
the phone!) ... and didn't find anything wrong with them.
Again, with the above fix things look good. Since
7-4 is badly broken in this respect, do you want a patch?
Since it is a small change, you can put it in "by hand" ...
--
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
* Re: 7-4 VM killing (A solution)
2000-05-05 0:47 ` 7-4 VM killing (A solution) Rajagopal Ananthanarayanan
@ 2000-05-05 1:30 ` Rik van Riel
2000-05-05 1:47 ` Rajagopal Ananthanarayanan
2000-05-05 5:13 ` Linus Torvalds
1 sibling, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2000-05-05 1:30 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan
Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> Ok, I may have a solution after having asked, mostly to myself,
> why doesn't shrink_mmap() find pages to free?
>
> The answer apparently is because in 7-4 shrink_mmap(),
> unreferenced pages get filed as "young" if the zone has
> enough pages in it (free_pages > pages_high).
>
> Because of this bug, if we examine a zone which already
> has enough free pages, all referenced pages now go to
> the "back" of the lru list.
>
> On a subsequent scan, we may never get to these pages in time.
> Comments?
>
> Here's the new code to shrink_mmap:
>
> ------------
> [ ... ]
>     dispose = &young;
>     if (test_and_clear_bit(PG_referenced, &page->flags))
>         goto dispose_continue;
>
>     if (!page->buffers && page_count(page) > 1)
>         goto dispose_continue;
>
>     dispose = &old;
>     if (p_zone->free_pages > p_zone->pages_high)
>         goto dispose_continue;
I've tried this variant (a few weeks ago, before submitting
the current code to Linus) and have found a serious bug in
it.
If we put all the unreferenced pages from one zone (with
enough free pages) on the front of the queue, a subsequent
run will not make it to the pages of the zone which needs
to have pages freed currently...
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
* Re: 7-4 VM killing (A solution)
2000-05-05 1:30 ` Rik van Riel
@ 2000-05-05 1:47 ` Rajagopal Ananthanarayanan
0 siblings, 0 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-05 1:47 UTC (permalink / raw)
To: riel; +Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller
Rik van Riel wrote:
>
> I've tried this variant (a few weeks ago, before submitting
> the current code to Linus) and have found a serious bug in
> it.
>
> If we put all the unreferenced pages from one zone (with
> enough free pages) on the front of the queue, a subsequent
> run will not make it to the pages of the zone which needs
> to have pages freed currently...
>
The only reason why pages should be moved to the tail
of the lru list is when they are referenced, and maybe
if they have high page->count.
Pages in zones with enough free memory should not be re-ordered.
Such pages should not control the iterations of shrink_mmap.
The unreferenced pages currently at the front of the lru
queue are the ones that we should free first anyway. Just because
the corresponding zone has enough free memory in it, the
relative order does not change. Are you talking about these
pages adding up against "count" in shrink_mmap?
On a more practical note, how does your bug manifest?
What does not run, or does not run better?
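The question about "count" can be illustrated with a toy scan. This is a sketch under assumptions, not the kernel loop: `struct lru_page` and `scan_lru()` are hypothetical names, the array order stands in for the order in which the LRU list is walked, and the zone check is placed before the budget decrement, as in the ordering of the posted fix.

```c
#include <stddef.h>

/* Hypothetical model of one page on the LRU list. */
struct lru_page {
    int zone;   /* index into zone_has_enough[] */
    int freed;  /* set to 1 when the scan reclaims the page */
};

/*
 * Walk pages[0..n) in scan order with a budget of `count` pages.
 * Pages whose zone already has enough free memory are skipped before the
 * budget is touched, so they do not use up the scan.
 * Returns the number of pages freed.
 */
static int scan_lru(struct lru_page *pages, size_t n,
                    const int *zone_has_enough, int count)
{
    int freed = 0;

    for (size_t i = 0; i < n && count > 0; i++) {
        struct lru_page *p = &pages[i];

        if (zone_has_enough[p->zone])
            continue;  /* left in place, no budget spent */
        count--;
        p->freed = 1;
        freed++;
    }
    return freed;
}
```

With four pages from a satisfied zone ahead of two pages from a needy zone and a budget of two, this model still reaches and frees both needy-zone pages. Whether the real list ordering behaves the same way is exactly what is being debated here.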
--
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
2000-05-04 4:10 ` Linus Torvalds
@ 2000-05-05 4:46 ` Rajagopal Ananthanarayanan
0 siblings, 0 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-05 4:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller
Linus Torvalds wrote:
>
> On Wed, 3 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > One other problem with having the page locked in
> > try_to_swap_out() is in the call to
> > prepare_highmem_swapout() when the incoming
> > page is in highmem.
>
> Look at how I handled this in pre7-4.
>
> Just unlocking the old page and returning with the new page locked is
> quite acceptable. The "prepare_highmem_swapout()" thing breaks the
> association with the pages anyway, and as such there is no race (and this
> is allowable only exactly because of the anonymous and non-shared nature
> of a private COW-mapping - which is the only thing we accept in that
> code-path anyway).
>
> Doing it that way means that there are no special cases in vmscan.c.
Yep, now I see it after having actually applied the patch ;-)
I missed it in the original patch file with just the diffs, sorry.
ananth.
* Re: 7-4 VM killing (A solution)
2000-05-05 0:47 ` 7-4 VM killing (A solution) Rajagopal Ananthanarayanan
2000-05-05 1:30 ` Rik van Riel
@ 2000-05-05 5:13 ` Linus Torvalds
2000-05-05 6:44 ` Rajagopal Ananthanarayanan
1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-05 5:13 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
>
> Ok, I may have a solution after having asked, mostly to myself,
> why doesn't shrink_mmap() find pages to free?
>
> The answer apparently is because in 7-4 shrink_mmap(),
> unreferenced pages get filed as "young" if the zone has
> enough pages in it (free_pages > pages_high).
Good catch.
That's obviously a bug, and your fix looks like the obvious fix. Thanks,
Linus
* Re: 7-4 VM killing (A solution)
2000-05-05 5:13 ` Linus Torvalds
@ 2000-05-05 6:44 ` Rajagopal Ananthanarayanan
2000-05-05 6:51 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-05 6:44 UTC (permalink / raw)
To: Linus Torvalds; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller
Linus Torvalds wrote:
>
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > Ok, I may have a solution after having asked, mostly to myself,
> > why doesn't shrink_mmap() find pages to free?
> >
> > The answer apparently is because in 7-4 shrink_mmap(),
> > unreferenced pages get filed as "young" if the zone has
> > enough pages in it (free_pages > pages_high).
>
> Good catch.
>
> That's obviously a bug, and your fix looks like the obvious fix. Thanks,
Rik still had some reservations, although he hasn't
sent a response to my rebuttal yet. We'll see.
On another note, noticed your change to shrink_mmap in 7-5:
-------
- count = nr_lru_pages >> priority;
+ count = (nr_lru_pages << 1) >> priority;
-------
Is this to defeat aging? If so, I think it's overly cautious:
if all an iteration of shrink_mmap did was to flip the referenced bit,
then that iteration shouldn't be included in count (and in the
current code it isn't). So why double the effort?
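As a quick arithmetic check of the two expressions in the diff, here is a tiny sketch. The function names `scan_budget()` and `scan_budget_doubled()` are hypothetical; only the shift expressions come from the diff, and the inputs are assumed to be small enough that the left shift does not overflow.

```c
/* Scan budget before the 7-5 change: nr_lru_pages >> priority. */
static unsigned long scan_budget(unsigned long nr_lru_pages, unsigned int priority)
{
    return nr_lru_pages >> priority;
}

/* Scan budget after the 7-5 change: (nr_lru_pages << 1) >> priority. */
static unsigned long scan_budget_doubled(unsigned long nr_lru_pages, unsigned int priority)
{
    return (nr_lru_pages << 1) >> priority;
}
```

For example, with 4096 pages on the LRU list and priority 6 the budgets are 64 and 128 pages, and at priority 0 they are 4096 and 8192, i.e. the doubled form allows up to two full passes over the list.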
--
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
* Re: 7-4 VM killing (A solution)
2000-05-05 6:44 ` Rajagopal Ananthanarayanan
@ 2000-05-05 6:51 ` Linus Torvalds
2000-05-05 10:23 ` Rik van Riel
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-05 6:51 UTC (permalink / raw)
To: Rajagopal Ananthanarayanan; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
>
> On another note, noticed your change to shrink_mmap in 7-5:
>
> -------
> - count = nr_lru_pages >> priority;
> + count = (nr_lru_pages << 1) >> priority;
> -------
>
> Is this to defeat aging? If so, I think it's overly cautious:
> if all an iteration of shrink_mmap did was to flip the referenced bit,
> then that iteration shouldn't be included in count (and in the
> current code it isn't). So why double the effort?
It was indeed because I thought we should defeat aging. But you're right,
the reference bit flip doesn't get counted. My bad, and I'll revert that
one (and you found the real reason for pages not getting freed anyway).
Linus
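The point that a reference-bit flip is not counted can be sketched as follows. This is a simplified user-space model, not the shrink_mmap() code: `scan_with_aging()` is a hypothetical name, and the array of flags stands in for PG_referenced on the pages in scan order.

```c
#include <stddef.h>

/*
 * referenced[i] != 0 models PG_referenced on the i-th page scanned.
 * A referenced page only has its bit cleared and costs no budget;
 * an unreferenced page consumes one unit of `count` and is reclaimed.
 * Returns the number of pages reclaimed in this pass.
 */
static int scan_with_aging(int *referenced, size_t n, int count)
{
    int reclaimed = 0;

    for (size_t i = 0; i < n && count > 0; i++) {
        if (referenced[i]) {
            referenced[i] = 0;  /* aging step: second chance, not counted */
            continue;
        }
        count--;
        reclaimed++;
    }
    return reclaimed;
}
```

With three referenced pages followed by two unreferenced ones and a budget of two, a single pass in this model clears all three bits and still reclaims two pages, so doubling the budget is not needed just to get past the aging step.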
* Re: 7-4 VM killing (A solution)
2000-05-05 6:51 ` Linus Torvalds
@ 2000-05-05 10:23 ` Rik van Riel
0 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2000-05-05 10:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rajagopal Ananthanarayanan, Kanoj Sarcar, linux-mm, David S. Miller
On Thu, 4 May 2000, Linus Torvalds wrote:
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> > On another note, noticed your change to shrink_mmap in 7-5:
> >
> > -------
> > - count = nr_lru_pages >> priority;
> > + count = (nr_lru_pages << 1) >> priority;
> > -------
> >
> > Is this to defeat aging? If so, I think it's overly cautious:
> > if all an iteration of shrink_mmap did was to flip the referenced bit,
> > then that iteration shouldn't be included in count (and in the
> > current code it isn't). So why double the effort?
>
> It was indeed because I thought we should defeat aging. But
> you're right, the reference bit flip doesn't get counted.
Also, we'll be holding the pages on our local &young list, so
we won't be able to see them again (but that's ok since the
next call to shrink_mmap() can easily free them all).
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
Thread overview: 45+ messages
[not found] <8ener4$6djpb$1@fido.engr.sgi.com>
2000-05-03 3:11 ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Rajagopal Ananthanarayanan
2000-05-03 3:47 ` Linus Torvalds
2000-05-03 5:26 ` Kanoj Sarcar
2000-05-03 6:22 ` Rajagopal Ananthanarayanan
2000-05-03 16:11 ` Kanoj Sarcar
2000-05-03 16:19 ` Linus Torvalds
2000-05-03 16:35 ` Kanoj Sarcar
2000-05-03 17:16 ` Linus Torvalds
2000-05-03 17:31 ` Kanoj Sarcar
2000-05-03 18:17 ` Linus Torvalds
2000-05-03 18:37 ` Rajagopal Ananthanarayanan
2000-05-03 18:37 ` Kanoj Sarcar
2000-05-03 19:41 ` Rajagopal Ananthanarayanan
2000-05-03 21:28 ` Jeff Garzik
2000-05-03 8:11 ` Linus Torvalds
2000-05-03 8:31 ` Linus Torvalds
2000-05-03 16:08 ` Kanoj Sarcar
2000-05-03 16:14 ` Linus Torvalds
2000-05-03 16:24 ` Kanoj Sarcar
2000-05-04 1:38 ` Linus Torvalds
2000-05-04 2:44 ` Rajagopal Ananthanarayanan
2000-05-04 4:05 ` Linus Torvalds
2000-05-04 3:16 ` Rajagopal Ananthanarayanan
2000-05-04 4:10 ` Linus Torvalds
2000-05-05 4:46 ` Rajagopal Ananthanarayanan
2000-05-04 7:42 ` Rajagopal Ananthanarayanan
2000-05-04 15:33 ` Linus Torvalds
2000-05-04 15:57 ` Rik van Riel
2000-05-04 17:19 ` Rajagopal Ananthanarayanan
2000-05-04 17:41 ` Rik van Riel
2000-05-04 18:18 ` Rajagopal Ananthanarayanan
2000-05-04 18:43 ` Linus Torvalds
2000-05-04 19:00 ` Rik van Riel
2000-05-04 19:17 ` Linus Torvalds
2000-05-04 21:16 ` Rajagopal Ananthanarayanan
2000-05-04 21:51 ` Rik van Riel
2000-05-04 22:21 ` Linus Torvalds
2000-05-05 0:47 ` 7-4 VM killing (A solution) Rajagopal Ananthanarayanan
2000-05-05 1:30 ` Rik van Riel
2000-05-05 1:47 ` Rajagopal Ananthanarayanan
2000-05-05 5:13 ` Linus Torvalds
2000-05-05 6:44 ` Rajagopal Ananthanarayanan
2000-05-05 6:51 ` Linus Torvalds
2000-05-05 10:23 ` Rik van Riel
2000-05-04 20:40 ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Roger Larsson