Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
       [not found] <8ener4$6djpb$1@fido.engr.sgi.com>
@ 2000-05-03  3:11 ` Rajagopal Ananthanarayanan
  2000-05-03  3:47   ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-03  3:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-mm

Linus Torvalds wrote:
> 
> On 2 May 2000, Juan J. Quintela wrote:
> > Hi
> >         several people have reported Oops in __free_pages_ok, after a
> > BUG() in page_alloc.h.  This happens in 2.3.99-pre[67].  The BUGs are:
> 
> I'd like ot know what the back-trace for those reports are?
> 
> I'm not against getting rid of the PageSwapEntry logic (it's complication
> for not very much gain), but I'd like to understand this more..


Following please find information about one of the BUG() in __free_pages_ok();
I'm re-sending the messages I sent to linux-kernel eariler last week:

-------------

I ran into a BUG in __free_pages_ok which checks:

----------
        if (PageSwapCache(page))
                BUG();
----------

The  call to free the page was from try_to_swap_out():

----------
        /*
         * Is the page already in the swap cache? If so, then
         * we can just drop our reference to it without doing
         * any IO - it's already up-to-date on disk.
         *
         * Return 0, as we didn't actually free any real
         * memory, and we should just continue our scan.
         */
        if (PageSwapCache(page)) {
                entry.val = page->index;
                swap_duplicate(entry);
                set_pte(page_table, swp_entry_to_pte(entry));
drop_pte:
                vma->vm_mm->rss--;
                flush_tlb_page(vma, address);
                __free_page(page);
                goto out_failed;
        }
-----------

The entire trace from kdb is as follows (XXX = unknown):

----------
XXXXXXXXXX XXXXXXXXXX  __free_pages_ok(XXXX)
0xc35d5d9c 0xc013543c  try_to_swap_out+0xc8( 0xc179dc20, 0x43589000, 0xc2963624,
0x5,
0xc179dc20 )
0xc35d5dd8 0xc0135710  swap_out_vma+0x11c( 0xc179dc20, 0x432d4000, 0x5, 0xc02c7c68,
0xc3256520 )
0xc35d5df8 0xc01357ee  swap_out_mm+0x7e( 0xc3256520, 0x5, 0x4, 0x6, 0x5 )
0xc35d5e24 0xc01359de  swap_out+0x176( 0x6, 0x5, 0xc35d4000, 0xc02d0890, 0x5 )
0xc35d5e40 0xc0135b31  do_try_to_free_pages+0x89( 0x5, 0xc02d06b8, 0xc02d06b8,
0xc35d5e78,
0xc013666b )
0xc35d5e54 0xc0135d17  try_to_free_pages+0x2b( 0x5, 0xc02d06b8, 0xc02d0898, 0x0,
0xc02d088c )
0xc35d5e78 0xc013666b  zone_balance_memory+0x63( 0xc02d088c, 0xc35d4000, 0x0,
0x5b5f00 )
0xc35d5e98 0xc0136724  __alloc_pages+0x80( 0xc35d4000, 0x0, 0xc35d4000 )
0xc35d5eac 0xc0136dbe  read_swap_cache_async+0x62( 0x5b5f00, 0x1, 0x5b5f00,
0x5b5f00,
0xc13919c4 )
0xc35d5ecc 0xc012889b  do_swap_page+0x97( 0xc35d4000, 0xc179dc20, 0x45671fff,
0xc13919c4,
0x5b5f00 )
0xc35d5efc 0xc0128ce7  handle_mm_fault+0x13b( 0xc35d4000, 0xc179dc20, 0x45671fff,
0x0,
0xc35d4000 )
0xc35d5fb4 0xc011513e  do_page_fault+0x18e
[1]more>
0xbffff540 0xc010a681  error_code+0x2d 
-------------

I have a kdb helper function which prints out some of the fields in the page,
and also below is the hex dump of the "struct page"

----------
  [1]kdb> page 0xc1002de0
struct page at 0xc1002de0
  next 0xc1042ee0 prev 0xc101e900 addr space 0xc02d0520 index 557568
  count 1 flags PG_uptodate PG_swap_cache PG_swap_entry virtual 0xc0092000
  buffers 0x00000000  block_map 00000000000000000000000000000000
    [ ... ] 
                         
c1002de0: c1042ee0 c101e900 c02d0520 00088200  a..A.e.A .-A....
c1002df0: 00000000 00000001 00000a08 c1042efc  ............u..A
c1002e00: c101e91c 00000000 dead4ead c1002e0c  .e.A....-N-TH...A
c1002e10: c1002e0c c1002e14 c02f3b3f c1150e5c  ...A...A?;/A\..A
c1002e20: 00000000 c0092000 c02d0600 00000000  ..... .A..-A.... 
---------------------


Question: Is this a problem in the reference count on the page?
If indeed the page can be freed by the call in try_to_swap_out,
then the test in __free_pages_ok will trigger every time this path
is taken. Any one have ideas as to what's wrong?

BTW, the above happened during a relatively normal operation of
using 'diff'. Don't know if it reproducible.
----------------------

As Jeff G. pointed out in another mail, I left out some
important information. If you need more info., please let me know.

(1) this is with kernel version 2.3.99-pre2 with some XFS changes.
(2) I have a 2 CPU X-86 box with 64MB memory.

Here's the .config used to build the kernel:

-----------
#
# Automatically generated by make menuconfig: don't edit
#
CONFIG_X86=y
CONFIG_ISA=y
CONFIG_UID16=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y

#
# Processor type and features
#
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
CONFIG_M686=y
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
# CONFIG_MICROCODE is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
# CONFIG_MATH_EMULATION is not set
# CONFIG_MTRR is not set
CONFIG_SMP=y

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODVERSIONS=y
CONFIG_KMOD=y

#
# General setup
#
CONFIG_NET=y
# CONFIG_VISWS is not set
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_NAMES=y
# CONFIG_MCA is not set
# CONFIG_HOTPLUG is not set
CONFIG_SYSVIPC=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_SYSCTL is not set
CONFIG_KCORE_ELF=y
# CONFIG_KCORE_AOUT is not set
CONFIG_BINFMT_AOUT=m
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
# CONFIG_PM is not set
# CONFIG_ACPI is not set
# CONFIG_APM is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play configuration
#
# CONFIG_PNP is not set
# CONFIG_ISAPNP is not set

#
# Block devices
#
CONFIG_BLK_DEV_FD=m
# CONFIG_BLK_DEV_XD is not set
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_LOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_MD is not set
# CONFIG_BLK_DEV_RAM is not set

#
# Networking options
#
CONFIG_PACKET=m
# CONFIG_PACKET_MMAP is not set
# CONFIG_NETLINK is not set
# CONFIG_NETFILTER is not set
# CONFIG_FILTER is not set
CONFIG_UNIX=y
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
# CONFIG_IP_ROUTER is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_ALIAS is not set
# CONFIG_SYN_COOKIES is not set
CONFIG_SKB_LARGE=y
# CONFIG_IPV6 is not set
# CONFIG_KHTTPD is not set
# CONFIG_ATM is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_DECNET is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_BRIDGE is not set
# CONFIG_LLC is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_NET_FASTROUTE is not set
# CONFIG_NET_HW_FLOWCONTROL is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set
# CONFIG_PHONE_IXJ is not set

#
# ATA/IDE/MFM/RLL support
#
# CONFIG_IDE is not set
# CONFIG_BLK_DEV_IDE_MODES is not set
# CONFIG_BLK_DEV_HD is not set

#
# SCSI support
#
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_SD_EXTRA_DEVS=40
# CONFIG_CHR_DEV_ST is not set
# CONFIG_BLK_DEV_SR is not set
# CONFIG_CHR_DEV_SG is not set
CONFIG_SCSI_DEBUG_QUEUES=y
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set

#
# SCSI low-level drivers
#
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
CONFIG_SCSI_AHA152X=m
# CONFIG_SCSI_AHA1542 is not set
CONFIG_SCSI_AHA1740=m
CONFIG_SCSI_AIC7XXX=y
# CONFIG_AIC7XXX_TCQ_ON_BY_DEFAULT is not set
CONFIG_AIC7XXX_CMDS_PER_DEVICE=8
# CONFIG_AIC7XXX_PROC_STATS is not set
CONFIG_AIC7XXX_RESET_DELAY=5
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_MEGARAID is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_EATA_DMA is not set
# CONFIG_SCSI_EATA_PIO is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_SIM710 is not set
# CONFIG_SCSI_NCR53C7xx is not set
# CONFIG_SCSI_NCR53C8XX is not set
CONFIG_SCSI_SYM53C8XX=m
CONFIG_SCSI_NCR53C8XX_DEFAULT_TAGS=4
CONFIG_SCSI_NCR53C8XX_MAX_TAGS=32
CONFIG_SCSI_NCR53C8XX_SYNC=20
# CONFIG_SCSI_NCR53C8XX_PROFILE is not set
# CONFIG_SCSI_NCR53C8XX_IOMAPPED is not set
# CONFIG_SCSI_NCR53C8XX_PQS_PDS is not set
# CONFIG_SCSI_NCR53C8XX_SYMBIOS_COMPAT is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_PCI2000 is not set
# CONFIG_SCSI_PCI2220I is not set
# CONFIG_SCSI_PSI240I is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_ISP is not set
CONFIG_SCSI_QLOGIC_FC=m
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_SEAGATE is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_DEBUG is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_IEEE1394 is not set

#
# I2O device support
#
# CONFIG_I2O is not set
# CONFIG_I2O_PCI is not set
# CONFIG_I2O_BLOCK is not set
# CONFIG_I2O_LAN is not set
# CONFIG_I2O_SCSI is not set
# CONFIG_I2O_PROC is not set

#
# Network device support
#
CONFIG_NETDEVICES=y

#
# ARCnet devices
#
# CONFIG_ARCNET is not set
CONFIG_DUMMY=m
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
# CONFIG_NET_SB1000 is not set

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_EL1 is not set
# CONFIG_EL2 is not set
# CONFIG_ELPLUS is not set
# CONFIG_EL16 is not set
# CONFIG_EL3 is not set
# CONFIG_3C515 is not set
CONFIG_VORTEX=m
CONFIG_LANCE=m
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_VENDOR_RACAL is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_NET_ISA is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_CS89x0 is not set
# CONFIG_DE4X5 is not set
# CONFIG_TULIP is not set
# CONFIG_DGRS is not set
# CONFIG_DM9102 is not set
# CONFIG_EEPRO100 is not set
# CONFIG_LNE390 is not set
# CONFIG_NE3210 is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_RTL8129 is not set
# CONFIG_8139TOO is not set
# CONFIG_SIS900 is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_ES3210 is not set
# CONFIG_EPIC100 is not set
# CONFIG_NET_POCKET is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_YELLOWFIN is not set
# CONFIG_ACENIC is not set
# CONFIG_SK98LIN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set

#
# Wireless LAN (non-hamradio)
#
# CONFIG_NET_RADIO is not set

#
# Token Ring devices
#
# CONFIG_TR is not set
# CONFIG_NET_FC is not set
# CONFIG_RCPCI is not set
# CONFIG_SHAPER is not set

#
# Wan interfaces
#
# CONFIG_WAN is not set

#
# Amateur Radio support
#
# CONFIG_HAMRADIO is not set

#
# IrDA (infrared) support
#
# CONFIG_IRDA is not set

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Old CD-ROM drivers (not SCSI, not IDE)
#
# CONFIG_CD_NO_IDESCSI is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_SERIAL_CONSOLE=y
# CONFIG_SERIAL_EXTENDED is not set
# CONFIG_SERIAL_NONSTANDARD is not set
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=256

#
# I2C support
#
# CONFIG_I2C is not set

#
# Mice
#
# CONFIG_BUSMOUSE is not set
CONFIG_MOUSE=y
CONFIG_PSMOUSE=y
CONFIG_82C710_MOUSE=y
# CONFIG_PC110_PAD is not set

#
# Joysticks
#
# CONFIG_JOYSTICK is not set
# CONFIG_QIC02_TAPE is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
CONFIG_PROFILING=y
# CONFIG_NVRAM is not set
# CONFIG_RTC is not set

#
# Video For Linux
#
# CONFIG_VIDEO_DEV is not set
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# Ftape, the floppy tape device driver
#
# CONFIG_FTAPE is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_GAMMA is not set
# CONFIG_AGP is not set

#
# USB support
#
# CONFIG_USB is not set

#
# File systems
#
# CONFIG_QUOTA is not set
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_FAT_FS is not set
# CONFIG_MSDOS_FS is not set
# CONFIG_UMSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
CONFIG_ISO9660_FS=m
# CONFIG_JOLIET is not set
# CONFIG_MINIX_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_HPFS_FS is not set
CONFIG_PROC_FS=y
# CONFIG_DEVFS_FS is not set
# CONFIG_DEVFS_DEBUG is not set
CONFIG_DEVPTS_FS=y
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_EXT2_FS=y
# CONFIG_SYSV_FS is not set
# CONFIG_UDF_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_XFS_FS=m
CONFIG_PAGE_BUF=y
CONFIG_PAGE_BUF_LOCKING=y
CONFIG_AVL=y
# CONFIG_PAGE_BUF_META is not set
# CONFIG_XFS_VNODE_TRACING is not set
CONFIG_AVL=y
# CONFIG_XFS_ARCH_MIPS is not set
CONFIG_XFS_ARCH_NATIVE=y
# CONFIG_XFS_ARCH_MULTI is not set

#
# Network File Systems
#
# CONFIG_CODA_FS is not set
CONFIG_NFS_FS=m
# CONFIG_ROOT_NFS is not set
# CONFIG_NFSD is not set
CONFIG_SUNRPC=m
CONFIG_LOCKD=m
# CONFIG_SMB_FS is not set
# CONFIG_NCP_FS is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
# CONFIG_NLS is not set

#
# Console drivers
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VIDEO_SELECT is not set
# CONFIG_MDA_CONSOLE is not set

#
# Frame-buffer support
#
# CONFIG_FB is not set

#
# Sound
#
# CONFIG_SOUND is not set

#
# Kernel hacking
#
# CONFIG_MAGIC_SYSRQ is not set
# CONFIG_LOCKMETER is not set
CONFIG_KDB=y
CONFIG_KDB_FRAMEPTR=y
CONFIG_KDB_STBSIZE=13500
CONFIG_MCOUNT=y
CONFIG_PROF_FRAME_PTR=y

-- 
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03  3:11 ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Rajagopal Ananthanarayanan
@ 2000-05-03  3:47   ` Linus Torvalds
  2000-05-03  5:26     ` Kanoj Sarcar
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03  3:47 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan; +Cc: linux-mm


On Tue, 2 May 2000, Rajagopal Ananthanarayanan wrote:
> -------------
> 
> I ran into a BUG in __free_pages_ok which checks:
> 
> ----------
>         if (PageSwapCache(page))
>                 BUG();
> ----------
> 
> The  call to free the page was from try_to_swap_out():
> 
> ----------
>         /*
>          * Is the page already in the swap cache? If so, then
>          * we can just drop our reference to it without doing
>          * any IO - it's already up-to-date on disk.
>          *
>          * Return 0, as we didn't actually free any real
>          * memory, and we should just continue our scan.
>          */
>         if (PageSwapCache(page)) {
>                 entry.val = page->index;
>                 swap_duplicate(entry);
>                 set_pte(page_table, swp_entry_to_pte(entry));
> drop_pte:
>                 vma->vm_mm->rss--;
>                 flush_tlb_page(vma, address);
>                 __free_page(page);
>                 goto out_failed;
>         }

Wow.

That code definitely looks buggy.

Looking at the whole try_to_swap_out() in this light shows how it messes
with a _lot_ of page information without holding the page lock. I thought
we fixed this once already, but maybe not.

In try_to_swap_out(), earlier it does a

	if (PageLocked(page))
		goto out_failed;

and that really is wrong - it should do a

	if (TryLockPage(page))
		goto out_failed;

and do all the rest with the page locked so that there are no races on
changing the state of the page (and then unlock just before actually
returning, or freeing the page).

As far as I can tell, this is a real bug, and has absolutely nothing to do
with the swap entry cache. It may be that the swap entry cache code just
changed timings for some people enough to show the race.

But maybe I've overlooked something. Anybody else have comments on this?

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03  3:47   ` Linus Torvalds
@ 2000-05-03  5:26     ` Kanoj Sarcar
  2000-05-03  6:22       ` Rajagopal Ananthanarayanan
  2000-05-03  8:11       ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03  5:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm

> 
> 
> On Tue, 2 May 2000, Rajagopal Ananthanarayanan wrote:
> > -------------
> > 
> > I ran into a BUG in __free_pages_ok which checks:
> > 
> > ----------
> >         if (PageSwapCache(page))
> >                 BUG();
> > ----------
> > 
> > The  call to free the page was from try_to_swap_out():
> > 
> > ----------
> >         /*
> >          * Is the page already in the swap cache? If so, then
> >          * we can just drop our reference to it without doing
> >          * any IO - it's already up-to-date on disk.
> >          *
> >          * Return 0, as we didn't actually free any real
> >          * memory, and we should just continue our scan.
> >          */
> >         if (PageSwapCache(page)) {
> >                 entry.val = page->index;
> >                 swap_duplicate(entry);
> >                 set_pte(page_table, swp_entry_to_pte(entry));
> > drop_pte:
> >                 vma->vm_mm->rss--;
> >                 flush_tlb_page(vma, address);
> >                 __free_page(page);
> >                 goto out_failed;
> >         }
> 
> Wow.
> 
> That code definitely looks buggy.
> 
> Looking at the whole try_to_swap_out() in this light shows how it messes
> with a _lot_ of page information without holding the page lock. I thought
> we fixed this once already, but maybe not.
> 
> In try_to_swap_out(), earlier it does a
> 
> 	if (PageLocked(page))
> 		goto out_failed;
> 
> and that really is wrong - it should do a
> 
> 	if (TryLockPage(page))
> 		goto out_failed;

Umm, I am not saying this is not a good idea, but maybe code that 
try_to_swap_out() invokes (like filemap_swapout etc) need to be 
taught that the incoming page has already been locked. 

Nonetheless, unless you show me a possible scenario that will lead
to the observed panic, I am skeptical that this is the real problem.

Lets just talk about swapcache pages (since the problem happened with
that type), and lets forget swapfile deletion, I am pretty sure Ananth
was not trying that. In this restricted situation, I _think_ you can 
not theorize what the problem is. That is, if all code that add/delete
pages from the swap cache make sure they never delete a "shared" page
from the scache (as determined by is_page_shared). This is because 
the process that kswapd is looking at already ensures that the page is
"shared". The only code that does delete "shared" pages from the scache
is shrink_mmap, but if a process already has a page-reference, shrink_mmap
can not touch that page. Also, most process level code that takes a
page out from the swapcache is interlocked out because kswapd is
holding the vmlist/page_table_lock.

Anyway, I will try to think if there are more race conditions possible.
Ananth, was there shared memory programs in your test suite? Also, if you
have any success in reproducing this, let us know.

Kanoj

> 
> and do all the rest with the page locked so that there are no races on
> changing the state of the page (and then unlock just before actually
> returning, or freeing the page).
> 
> As far as I can tell, this is a real bug, and has absolutely nothing to do
> with the swap entry cache. It may be that the swap entry cache code just
> changed timings for some people enough to show the race.
> 
> But maybe I've overlooked something. Anybody else have comments on this?
> 
> 		Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux.eu.org/Linux-MM/
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03  5:26     ` Kanoj Sarcar
@ 2000-05-03  6:22       ` Rajagopal Ananthanarayanan
  2000-05-03 16:11         ` Kanoj Sarcar
  2000-05-03  8:11       ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-03  6:22 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Linus Torvalds, linux-mm

Kanoj Sarcar wrote:

> >
> > Wow.
> >
> > That code definitely looks buggy.
> >
> > Looking at the whole try_to_swap_out() in this light shows how it messes
> > with a _lot_ of page information without holding the page lock. I thought
> > we fixed this once already, but maybe not.
> >
> > In try_to_swap_out(), earlier it does a
> >
> >       if (PageLocked(page))
> >               goto out_failed;
> >
> > and that really is wrong - it should do a
> >
> >       if (TryLockPage(page))
> >               goto out_failed;
> 
> Umm, I am not saying this is not a good idea, but maybe code that
> try_to_swap_out() invokes (like filemap_swapout etc) need to be
> taught that the incoming page has already been locked.

Dunno. I tend to agree with Linus. Fundamentally, how can any
code examine & change page state (flags, etc). if the code
does not hold the page lock?

> 
> Nonetheless, unless you show me a possible scenario that will lead
> to the observed panic, I am skeptical that this is the real problem.

Look at trace I sent out. Basically it goes swap_out() -> swap_out_mm() ->
swap_out_vma() -> try_to_swap_out() -> __free_pages_ok().

1. swap_out select process & vm area within the process to swapout.
2. swap_out_mm selects an "address" within the mm.
3. swap_out_vma converts address to pgd.
4. try_to_swap_out takes pgd looks at the "software" state in "struct page".

Step 2 is about the earliest you can lock the victim page;
it isn't locked there. Step 3 doesn't lock it either. Step 4
as pointed out, explicitly avoids pages which are locked,
but doesn't lock the page!

Some more clarifications below:

> 
> Lets just talk about swapcache pages (since the problem happened with
> that type), and lets forget swapfile deletion, I am pretty sure Ananth
> was not trying that. [ ... ]

No, I didn't try to remove swap.

> Anyway, I will try to think if there are more race conditions possible.
> Ananth, was there shared memory programs in your test suite? Also, if you
> have any success in reproducing this, let us know.

I don't think there were any shm stuff in the tests I was running
(again, AFAICT, diff was the only thing running; previous pages
in memory weren't likely from any shm segments). I haven't
reproduced it even a second time. Will let you know.

OTOH, if try_to_swap_out is so broken why aren't we seeing
these problems more often? Or, from other reports in l-k are we?

ananth.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03  5:26     ` Kanoj Sarcar
  2000-05-03  6:22       ` Rajagopal Ananthanarayanan
@ 2000-05-03  8:11       ` Linus Torvalds
  2000-05-03  8:31         ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03  8:11 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm

On Tue, 2 May 2000, Kanoj Sarcar wrote:
> Moi wrote:
> > 
> > Wow.
> > 
> > That code definitely looks buggy.
> > 
> > Looking at the whole try_to_swap_out() in this light shows how it messes
> > with a _lot_ of page information without holding the page lock. I thought
> > we fixed this once already, but maybe not.
> > 
> > In try_to_swap_out(), earlier it does a
> > 
> > 	if (PageLocked(page))
> > 		goto out_failed;
> > 
> > and that really is wrong - it should do a
> > 
> > 	if (TryLockPage(page))
> > 		goto out_failed;
> 
> Umm, I am not saying this is not a good idea, but maybe code that 
> try_to_swap_out() invokes (like filemap_swapout etc) need to be 
> taught that the incoming page has already been locked. 

Oh, definitely. It's more than a one-liner change. Right now all the code
afterwards is written with the notion that the page is unlocked, and
having the page locked means that things have to be done differently (eg
use "__add_to_page_cache()" instead of "add_to_page_cache()" etc - all the
functions that get the page expecting the caller to have already locked
it).

> Nonetheless, unless you show me a possible scenario that will lead
> to the observed panic, I am skeptical that this is the real problem.

You may be right. The code certainly tries to be careful. However, I don't
trust "is_page_shared()" at all, _especially_ if there are people around
who play with the page state without locking the page. 

If "is_page_shared()" ends up ever getting the wrong value, I suspect we'd
be screwed. There may be other schenarios..

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03  8:11       ` Linus Torvalds
@ 2000-05-03  8:31         ` Linus Torvalds
  2000-05-03 16:08           ` Kanoj Sarcar
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03  8:31 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm


On Wed, 3 May 2000, Linus Torvalds wrote:
> 
> You may be right. The code certainly tries to be careful. However, I don't
> trust "is_page_shared()" at all, _especially_ if there are people around
> who play with the page state without locking the page. 
> 
> If "is_page_shared()" ends up ever getting the wrong value, I suspect we'd
> be screwed. There may be other schenarios..

Kanoj, why couldn't this happen:
 - CPU0 runs swapout
	- finds page which is a swap cache entry
	- does the swap_duplicate()
	- does __free_page() on it without locking it (it wasn't locked
	  before, either)
 - CPU1 runs shrink_mmap
	- finds same page on the LRU list
	- locks it _just_ after CPU0 tested that it was unlocked
	- looks at the page countersand the swap cache counters to see if
	  it was shared ("is_page_shared()").

 - There is _no_ synchronization between the two, as far as I can tell.
   "swap_duplicate()" on CPU0 will get the swap device lock, and
   "is_page_shared()" will run with the page lock held, but there is no
   common locking between the two at all that I can see.

So "is_page_shared()" can be entirely crap. And can tell shrink_mmap()
that the page cache entry can be freed. Now, I have no idea what that will
actually result in, but I bet that we can just get the usage counters off
by one here, and then at some later date we free page that we've already
free'd - and that page may have been re-allocated for something else and
isin the middle of a page-in right now (which is how we end up freeing a
page that is locked).

Or something. The lack of any synchronization looks fishy to me. The page
lock would act as synchronization, but so would the swap device lock.  And
maybe I'm still barking up the wrong tree..

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03  8:31         ` Linus Torvalds
@ 2000-05-03 16:08           ` Kanoj Sarcar
  2000-05-03 16:14             ` Linus Torvalds
  2000-05-04  1:38             ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 16:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm

> 
> 
> On Wed, 3 May 2000, Linus Torvalds wrote:
> > 
> > You may be right. The code certainly tries to be careful. However, I don't
> > trust "is_page_shared()" at all, _especially_ if there are people around
> > who play with the page state without locking the page. 
> > 
> > If "is_page_shared()" ends up ever getting the wrong value, I suspect we'd
> > be screwed. There may be other schenarios..
> 
> Kanoj, why couldn't this happen:
>  - CPU0 runs swapout
> 	- finds page which is a swap cache entry
> 	- does the swap_duplicate()
> 	- does __free_page() on it without locking it (it wasn't locked
> 	  before, either)
>  - CPU1 runs shrink_mmap
> 	- finds same page on the LRU list
> 	- locks it _just_ after CPU0 tested that it was unlocked
> 	- looks at the page countersand the swap cache counters to see if
> 	  it was shared ("is_page_shared()").
> 
>  - There is _no_ synchronization between the two, as far as I can tell.
>    "swap_duplicate()" on CPU0 will get the swap device lock, and
>    "is_page_shared()" will run with the page lock held, but there is no
>    common locking between the two at all that I can see.

FWIW, I think you are looking in the right direction, ie, shrink_mmap
previously used to run with lock_kernel, and not anymore, so there is a 
chance of shrink_mmap racing with try_to_swap_out. I thought about this 
though, but couldn't come up with an example ...

But, your example does not pull thru. Note that before shrink_mmap will
even touch the page, it does a 

                if (!page->buffers && page_count(page) > 1)
                        goto dispose_continue;

The page is question is guaranteed to have page_count(page) > 1, since 
try_to_swap_out has not dropped the user pte reference in your example.

Another thing to note is that shrink_mmap does not do a is_page_shared(),
it just checks for page-reference count to be 0 (the swapentry might have
references from other processes). Else, shrink_mmap will never be able
to free these pages ...

> 
> So "is_page_shared()" can be entirely crap. And can tell shrink_mmap()

Not really ... look at other places that call is_page_shared, they all
hold the pagelock. shrink_mmap does not bother with is_page_shared logic.

What is interesting is that people are reporting PageSwapEntry deletion
seems to fix this ...

Kanoj

> that the page cache entry can be freed. Now, I have no idea what that will
> actually result in, but I bet that we can just get the usage counters off
> by one here, and then at some later date we free page that we've already
> free'd - and that page may have been re-allocated for something else and
> isin the middle of a page-in right now (which is how we end up freeing a
> page that is locked).
> 
> Or something. The lack of any synchronization looks fishy to me. The page
> lock would act as synchronization, but so would the swap device lock.  And
> maybe I'm still barking up the wrong tree..
> 
> 		Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux.eu.org/Linux-MM/
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03  6:22       ` Rajagopal Ananthanarayanan
@ 2000-05-03 16:11         ` Kanoj Sarcar
  2000-05-03 16:19           ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 16:11 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan; +Cc: Linus Torvalds, linux-mm

> 
> Kanoj Sarcar wrote:
> 
> > >
> > > Wow.
> > >
> > > That code definitely looks buggy.
> > >
> > > Looking at the whole try_to_swap_out() in this light shows how it messes
> > > with a _lot_ of page information without holding the page lock. I thought
> > > we fixed this once already, but maybe not.
> > >
> > > In try_to_swap_out(), earlier it does a
> > >
> > >       if (PageLocked(page))
> > >               goto out_failed;
> > >
> > > and that really is wrong - it should do a
> > >
> > >       if (TryLockPage(page))
> > >               goto out_failed;
> > 
> > Umm, I am not saying this is not a good idea, but maybe code that
> > try_to_swap_out() invokes (like filemap_swapout etc) need to be
> > taught that the incoming page has already been locked.
> 
> Dunno. I tend to agree with Linus. Fundamentally, how can any
> code examine & change page state (flags, etc). if the code
> does not hold the page lock?
>

Note that try_to_swap_out holds the vmlist/page_table_lock on the
victim process, as well as lock_kernel, and though this is not the
easiest code to analyze, it seems to me that is enough protection 
on the swapcache pages. Also note, I am not saying it is not a good
idea to lock the page in try_to_swap_out, but lets not rush into
that without understanding the root cause ...

Kanoj 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 16:08           ` Kanoj Sarcar
@ 2000-05-03 16:14             ` Linus Torvalds
  2000-05-03 16:24               ` Kanoj Sarcar
  2000-05-04  1:38             ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 16:14 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm

On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > So "is_page_shared()" can be entirely crap. And can tell shrink_mmap()
> 
> Not really ... look at other places that call is_page_shared, they all
> hold the pagelock. shrink_mmap does not bother with is_page_shared logic.

That wasn't my argument.

My argument is that yes, the _callers_ of is_page_shared() all hold the
page lock. No question about that. But the things that is_page_shared()
actually tests can be modified without holding the page lock, so the page
lock doesn't actually _protect_ it. See?

So the callers might as well hold one of the networking spinlocks - it
just doesn't matter as a lock, because the places that modify the stuff do
not care about the lock.. And that is fishy.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 16:11         ` Kanoj Sarcar
@ 2000-05-03 16:19           ` Linus Torvalds
  2000-05-03 16:35             ` Kanoj Sarcar
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 16:19 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm

On Wed, 3 May 2000, Kanoj Sarcar wrote:
> 
> Note that try_to_swap_out holds the vmlist/page_table_lock on the
> victim process, as well as lock_kernel, and though this is not the
> easiest code to analyze, it seems to me that is enough protection 
> on the swapcache pages.

The swapcache code gets none of those locks as far as I can tell.

The swapcache code gets the page lock, and the "page cache" lock. But it
doesn't get the vmlist lock (the swap cache is not associated withany
particular mm), nor does it get the kernel lock (I think - I didn't look
through the code-paths).

>		 Also note, I am not saying it is not a good
> idea to lock the page in try_to_swap_out, but lets not rush into
> that without understanding the root cause ...

Certainly agreed. The interactions in this area are rather complex. But in
the end (whether this is a real bug or not) I suspect that I'd just prefer
to have the simple "you must lock the page before mucking with the page
flags" rule - even if some other magic lock happens to make all of the
current code ok. Just for clarity. 

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 16:14             ` Linus Torvalds
@ 2000-05-03 16:24               ` Kanoj Sarcar
  0 siblings, 0 replies; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 16:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm

> 
> 
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > > So "is_page_shared()" can be entirely crap. And can tell shrink_mmap()
> > 
> > Not really ... look at other places that call is_page_shared, they all
> > hold the pagelock. shrink_mmap does not bother with is_page_shared logic.
> 
> That wasn't my argument.
> 
> My argument is that yes, the _callers_ of is_page_shared() all hold the
> page lock. No question about that. But the things that is_page_shared()
> actually tests can be modified without holding the page lock, so the page
> lock doesn't actually _protect_ it. See?
>

Give me an example where the page_lock is not actually protecting the
"sharedness" of the page. Note that though the page_count and swap_count
are not themselves protected by page_lock, the "sharedness" could never 
change while you have the page_lock. "Sharedness" being whatever
is_page_shared() returns. Unless you can give me an example ....

Wait a second. I was familiar with is_page_shared() having 

        if (PageSwapCache(page))
                count += swap_count(page) - 2;

and now I see it is

        if (PageSwapCache(page))
                count += swap_count(page) - 2 - !!page->buffers;

Kanoj
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 16:19           ` Linus Torvalds
@ 2000-05-03 16:35             ` Kanoj Sarcar
  2000-05-03 17:16               ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 16:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm

> 
> 
> 
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > 
> > Note that try_to_swap_out holds the vmlist/page_table_lock on the
> > victim process, as well as lock_kernel, and though this is not the
> > easiest code to analyze, it seems to me that is enough protection 
> > on the swapcache pages.
> 
> The swapcache code gets none of those locks as far as I can tell.
> 
> The swapcache code gets the page lock, and the "page cache" lock. But it
> doesn't get the vmlist lock (the swap cache is not associated withany
> particular mm), nor does it get the kernel lock (I think - I didn't look
> through the code-paths).
>

What we are coming down to is a case by case analysis. For example,
do_wp_page, which does pull a page out of the swap cache, has the
vmlist_lock. do_swap_page does not, but neither is the page in the
pte at that point. free_page_and_swap_cache already has the vmlist_lock.

Some of this is documented in Documentation/vm/locking under the
section "Swap cache locking".

Kanoj
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 16:35             ` Kanoj Sarcar
@ 2000-05-03 17:16               ` Linus Torvalds
  2000-05-03 17:31                 ` Kanoj Sarcar
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 17:16 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm

On Wed, 3 May 2000, Kanoj Sarcar wrote:
> 
> What we are coming down to is a case by case analysis. For example,
> do_wp_page, which does pull a page out of the swap cache, has the
> vmlist_lock.

_which_ vmlist? You can share swapcache entries on multiple VM's, and that
is exactly what is_page_shared() is trying to protect against. 

Let's say that we have page X in the swap cache from process 1.

Process 2 also has that page, but it's in the page tables.

We do a vmscan on process 2, and will do a "swap_duplicate()" on the swap
entry that we find in page X and free the page (leaving it _just_ in the
swap cache), but at that exact moment another process 1 exits, for
example, and calls free_page_and_swap_cache(). If is_page_shared() gets
that wrong, we're now going to delete the page from the swap cache, yet we
now have an entry to it in the page tables on process 2.

And none of this seems to be synchronized - the vmlist lock is two
separate locks and doesn't protect this case. And as we've seen, vmscan
doesn't get the page lock.

Note that I don't actually believe in this schenario on x86, because with
processor ordering I suspect that is_page_shared() should still at worst
be too pessimistic, which is ok. I just think it's conceptually wrong.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 17:16               ` Linus Torvalds
@ 2000-05-03 17:31                 ` Kanoj Sarcar
  2000-05-03 18:17                   ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 17:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm

> 
> 
> 
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > 
> > What we are coming down to is a case by case analysis. For example,
> > do_wp_page, which does pull a page out of the swap cache, has the
> > vmlist_lock.
> 
> _which_ vmlist? You can share swapcache entries on multiple VM's, and that
> is exactly what is_page_shared() is trying to protect against. 
> 
> Let's say that we have page X in the swap cache from process 1.
> 
> Process 2 also has that page, but it's in the page tables.
> 
> We do a vmscan on process 2, and will do a "swap_duplicate()" on the swap
> entry that we find in page X and free the page (leaving it _just_ in the
> swap cache), but at that exact moment another process 1 exits, for
> example, and calls free_page_and_swap_cache(). If is_page_shared() gets
> that wrong, we're now going to delete the page from the swap cache, yet we
> now have an entry to it in the page tables on process 2.
>

Okay, here's this example in a little more detail:

Page X: page ref count: 1 (from swapcache) + 1 (from P2)
	swap ref count: 1 (from swapcache) + 1 (from P1)

try_to_swap_out will do something like this:

after the swap_duplicate:

Page X: page ref count: 1 (from swapcache) + 1 (from P2)
	swap ref count: 1 (from swapcache) + 1 (from P1) + 1 (swap_duplicate) 

later on, after __free_page:

Page X: page ref count: 1 (from swapcache)
	swap ref count: 1 (from swapcache) + 1 (from P1) + 1 (swap_duplicate) 

At no point between the time try_to_swap_out() is running, will is_page_shared()
wrongly indicate the page is _not shared_, when it is really shared (as you
say, it is pessimistic). 

Process 2 doing a free_page_and_swap_cache will thruout see the page as
shared.

A similar race in transferring the pageref count to swapcount also exists
in do_swap_page(), there the pagelock is held ...

When I sent some of the swapcache locking code to you, I convinced myself
that the code was protected. Of course, I might have let some conditions 
slip by in my reasoning, the code hasn't changed that much since then ...

Kanoj
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 17:31                 ` Kanoj Sarcar
@ 2000-05-03 18:17                   ` Linus Torvalds
  2000-05-03 18:37                     ` Rajagopal Ananthanarayanan
                                       ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Linus Torvalds @ 2000-05-03 18:17 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm

On Wed, 3 May 2000, Kanoj Sarcar wrote:
> 
> At no point between the time try_to_swap_out() is running, will is_page_shared()
> wrongly indicate the page is _not shared_, when it is really shared (as you
> say, it is pessimistic). 

Note that this is true only if you assume processor ordering.

With no common locks, a less strictly ordered system (like an alpha) might
see the update of the swap-count _much_ later on the second CPU, so that
is_page_shared() may end up not being pessimistic after all (it could get
the new page count, but the old swap-count, and thinks that the page is
free to be removed from the swap cache).

This is why not having a shared lock looks like a bug to me. Even if that
particular bug might never trigger on an x86.

_Something_ obviously triggers on the x86, though. 

Note that we may be barking up the wrong tree here: it may be a completely
different page mishandling that causes this. For example, one bug in NFS
used to be that it free'd a page that was allocated with "alloc_pages()"
using "free_page()" - which takes the virtual address and only works for
"normal" pages. Now, if you have more than about 960MB of memory and the
allocated page was a highmem page, you may end up freeing the wrong page
due to mixing metaphors, and suddenly the page counts are wrong.

And with the wrong page counts, the BUG() can/will happen only much later,
because a innocent "__free_page()" ends up doing the BUG(), but the real
offender happened earlier.

We fixed one such bug in NFS. Maybe there are more lurking? How much
memory do the machines have that have problems?

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 18:17                   ` Linus Torvalds
@ 2000-05-03 18:37                     ` Rajagopal Ananthanarayanan
  2000-05-03 18:37                     ` Kanoj Sarcar
  2000-05-03 21:28                     ` Jeff Garzik
  2 siblings, 0 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-03 18:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm

Linus Torvalds wrote:
> 
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> >
> > At no point between the time try_to_swap_out() is running, will is_page_shared()
> > wrongly indicate the page is _not shared_, when it is really shared (as you
> > say, it is pessimistic).
> 

> 
> _Something_ obviously triggers on the x86, though.

IMHO, that's the right attitude. I really like the idea of
having the page locked if its state is being fiddled with.
I know, we don't fully understand the problem, in the sense
that no one has been able to construct a sample execution
which will hit the bug. But so what? Since the bug is elusive,
even if one comes up with a scenario, no saying that _that_
is what happened during the particular manifestation.

	[ ... ]

> 
> We fixed one such bug in NFS. Maybe there are more lurking? How much
> memory do the machines have that have problems?
> 

I don't use NFS on my test systems. So, that couldn't have
been a problem. I had about 64MB of memory in the system.

BTW, I've been running the test (some tar & diff) for
several hours now on the same system. The system is staying up fine.

regards,

ananth.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 18:17                   ` Linus Torvalds
  2000-05-03 18:37                     ` Rajagopal Ananthanarayanan
@ 2000-05-03 18:37                     ` Kanoj Sarcar
  2000-05-03 19:41                       ` Rajagopal Ananthanarayanan
  2000-05-03 21:28                     ` Jeff Garzik
  2 siblings, 1 reply; 45+ messages in thread
From: Kanoj Sarcar @ 2000-05-03 18:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rajagopal Ananthanarayanan, linux-mm

> 
> 
> On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > 
> > At no point between the time try_to_swap_out() is running, will is_page_shared()
> > wrongly indicate the page is _not shared_, when it is really shared (as you
> > say, it is pessimistic). 
> 
> Note that this is true only if you assume processor ordering.
>

True ... not to deviate from the current topic, I would think that instead
of imposing locks here, you would want to inject instructions (like the 
mips "sync") that makes sure memory is consistant. Imposing locks is a
roundabout way of insuring memory consistancy, since the unlock normally
has this "sync" type instruction encoded in it anyway.

> With no common locks, a less strictly ordered system (like an alpha) might
> see the update of the swap-count _much_ later on the second CPU, so that
> is_page_shared() may end up not being pessimistic after all (it could get
> the new page count, but the old swap-count, and thinks that the page is
> free to be removed from the swap cache).
> 
> This is why not having a shared lock looks like a bug to me. Even if that
> particular bug might never trigger on an x86.
> 
> _Something_ obviously triggers on the x86, though. 
> 
> Note that we may be barking up the wrong tree here: it may be a completely
> different page mishandling that causes this. For example, one bug in NFS
> used to be that it free'd a page that was allocated with "alloc_pages()"
> using "free_page()" - which takes the virtual address and only works for
> "normal" pages. Now, if you have more than about 960MB of memory and the
> allocated page was a highmem page, you may end up freeing the wrong page
> due to mixing metaphors, and suddenly the page counts are wrong.
>

Absolutely ... any subsystem which is screwing up the page reference count
would lead to a similar symptom. Very hard to track these ... maybe I will
take some time near the end of the week to run Juan's programs.

Kanoj

 
> And with the wrong page counts, the BUG() can/will happen only much later,
> because a innocent "__free_page()" ends up doing the BUG(), but the real
> offender happened earlier.
> 
> We fixed one such bug in NFS. Maybe there are more lurking? How much
> memory do the machines have that have problems?
> 
> 		Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux.eu.org/Linux-MM/
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 18:37                     ` Kanoj Sarcar
@ 2000-05-03 19:41                       ` Rajagopal Ananthanarayanan
  0 siblings, 0 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-03 19:41 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Linus Torvalds, linux-mm

Kanoj Sarcar wrote:
> 
> >
> >
> > On Wed, 3 May 2000, Kanoj Sarcar wrote:
> > >
> > > At no point between the time try_to_swap_out() is running, will is_page_shared()
> > > wrongly indicate the page is _not shared_, when it is really shared (as you
> > > say, it is pessimistic).
> >
> > Note that this is true only if you assume processor ordering.
> >
> 
> True ... not to deviate from the current topic, I would think that instead
> of imposing locks here, you would want to inject instructions (like the
> mips "sync") that makes sure memory is consistant. Imposing locks is a
> roundabout way of insuring memory consistancy, since the unlock normally
> has this "sync" type instruction encoded in it anyway.

Using "sync"-type operations to ensure memory ordering is not an
approach I'd recommend ... We've used it in only a couple of places
in IRIX synchronization code; but I'm yet to meet anyone who can
comfortably argue the correctness of it. Also, it opens up
chances of the compiler screwing up the writes ... and _those_ bugs
are really hard to pin down.

Further more, in the case at hand in Linux, we are not dealing with
high performance operations ... this is swapping after all.

Finally, in an MP system, the s/w synchronization primitives (lock/unlock/rwlock, etc.)
are the building blocks for ensuring correctness of interleaved execution.
Let's use those instead of low-level h/w primitives. Optimizations
can be pushed into (and isolated to) the implementation of the s/w
synchronization primtives.

-- 
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 18:17                   ` Linus Torvalds
  2000-05-03 18:37                     ` Rajagopal Ananthanarayanan
  2000-05-03 18:37                     ` Kanoj Sarcar
@ 2000-05-03 21:28                     ` Jeff Garzik
  2 siblings, 0 replies; 45+ messages in thread
From: Jeff Garzik @ 2000-05-03 21:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kanoj Sarcar, Rajagopal Ananthanarayanan, linux-mm

Linus Torvalds wrote:
> We fixed one such bug in NFS. Maybe there are more lurking? How much
> memory do the machines have that have problems?

FWIW

Dual P-II w/ 128 MB of memory.  pre7-2 and pre7-3 (with #error removed)
both boot up and let me login ok -- I have an NFS automounted home dir. 
But... doing a lot of "netscaping" -- clicking around, opening new
windows, making the machine do lots of mmap() and swap -- causes the box
to lock hard.

I'm gonna hook up a serial console and see if I can get output.  Might
try booting w/ CONFIG_SMP/num-cpus==1 to see if that triggers any ugly
behavior too.

I'll also try to reproduce the problem without NFS in the picture.

	Jeff

-- 
Jeff Garzik              | Nothing cures insomnia like the
Building 1024            | realization that it's time to get up.
MandrakeSoft, Inc.       |        -- random fortune
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-03 16:08           ` Kanoj Sarcar
  2000-05-03 16:14             ` Linus Torvalds
@ 2000-05-04  1:38             ` Linus Torvalds
  2000-05-04  2:44               ` Rajagopal Ananthanarayanan
                                 ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04  1:38 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Rajagopal Ananthanarayanan, linux-mm, David S. Miller

Ok,
 there's a pre7-4 out there that does the swapout with the page locked.
I've given it some rudimentary testing, but certainly nothing really
exotic. Please comment..

David pointed out that swapout_highmem can't really work, and he's right
and wrong. It does work, but it works for rather undocumented reasons: it
only gets invoced for anonymous dirty pages, and they are always
cow-shared, so it's ok to "break" the page up into an "old" page and a
"new" page with the same contents. Even though it's not legal in general.

I'm not claiming that this fixes any known bugs, but it _does_ mean that
we probably have the page locked in all fundamental cases where it really
matters. If anybody finds a case where we play with the page-cached-ness
(or similar) of a page without holding the page lock, please holler
loudly.

This way it should be easy to verify that yes, our coherency is fine.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04  1:38             ` Linus Torvalds
@ 2000-05-04  2:44               ` Rajagopal Ananthanarayanan
  2000-05-04  4:05                 ` Linus Torvalds
  2000-05-04  3:16               ` Rajagopal Ananthanarayanan
  2000-05-04  7:42               ` Rajagopal Ananthanarayanan
  2 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04  2:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller

Linus Torvalds wrote:
> 
> Ok,
>  there's a pre7-4 out there that does the swapout with the page locked.
> I've given it some rudimentary testing, but certainly nothing really
> exotic. Please comment..

One quick comment: Looking at this part of the diff to mm/vmscan.c:

----------
@@ -138,6 +139,7 @@
                flush_tlb_page(vma, address);
                vmlist_access_unlock(vma->vm_mm);
                error = swapout(page, file);
+               UnlockPage(page);
                if (file) fput(file);
                if (!error)
                        goto out_free_success;
-----------------

Didn't you mean the UnlockPage() to go before swapout(...)?
For example, one of the swapout routines, filemap_write_page()
expects the page to be unlocked. If called with page locked,
I'd expect a "double-trip" dead-lock. Right?

Like you said in an  earlier mail, most of the code in
try_to_swap_out expects the page to be unlocked. Now,
of course, the reverse is true ... need to watch out!

-- 
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04  1:38             ` Linus Torvalds
  2000-05-04  2:44               ` Rajagopal Ananthanarayanan
@ 2000-05-04  3:16               ` Rajagopal Ananthanarayanan
  2000-05-04  4:10                 ` Linus Torvalds
  2000-05-04  7:42               ` Rajagopal Ananthanarayanan
  2 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04  3:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller

Linus Torvalds wrote:
> 
> Ok,
>  there's a pre7-4 out there that does the swapout with the page locked.
> I've given it some rudimentary testing, but certainly nothing really
> exotic. Please comment..
> 

One other problem with having the page locked in
try_to_swapout() is in the call to 
prepare_highmem_swapout() when the incoming
page is in highmem. Then,
  
(1) The newly allocated page (regular_page) needs to be locked.
    This is may be trivial as setting PG_locked in regular_page,
    since no one else knows about it.

(2) Before __free_page() is called on the incoming highmem
     page it needs to be unlocked --- otherwise, we'll have
     dejavu all over in __free_pages_ok!
    This is a little tricky however, since not all callers of
    prepare_highmem_swapout() have the incoming page locked.
    For now, you can get away with something like
    (in mm/highmem.c):

        /*
         * ok, we can just forget about our highmem page since
         * we stored its data into the new regular_page.
         */
+	if (PageLocked(page))
+		UnlockPage(page);
       __free_page(page);



    
-- 
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04  2:44               ` Rajagopal Ananthanarayanan
@ 2000-05-04  4:05                 ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04  4:05 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan; +Cc: Kanoj Sarcar, linux-mm, David S. Miller


On Wed, 3 May 2000, Rajagopal Ananthanarayanan wrote:
> 
> One quick comment: Looking at this part of the diff to mm/vmscan.c:
> 
> ----------
> @@ -138,6 +139,7 @@
>                 flush_tlb_page(vma, address);
>                 vmlist_access_unlock(vma->vm_mm);
>                 error = swapout(page, file);
> +               UnlockPage(page);
>                 if (file) fput(file);
>                 if (!error)
>                         goto out_free_success;
> -----------------
> 
> Didn't you mean the UnlockPage() to go before swapout(...)?
> For example, one of the swapout routines, filemap_write_page()
> expects the page to be unlocked. If called with page locked,
> I'd expect a "double-trip" dead-lock. Right?

Nope. I changed swap_out() so that it gets called with the page locked
(which is much more like the other VM routines work too). Otherwise the
first thing swap_out() would do would be to just re-lock the page,and then
you'd have a window between the caller and the callee when neither the
page lock nor the page table lock were held.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04  3:16               ` Rajagopal Ananthanarayanan
@ 2000-05-04  4:10                 ` Linus Torvalds
  2000-05-05  4:46                   ` Rajagopal Ananthanarayanan
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04  4:10 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan; +Cc: Kanoj Sarcar, linux-mm, David S. Miller

On Wed, 3 May 2000, Rajagopal Ananthanarayanan wrote:
> 
> One other problem with having the page locked in
> try_to_swapout() is in the call to 
> prepare_highmem_swapout() when the incoming
> page is in highmem. 

Look at how I handled this in pre7-4.

Just unlocking the old page and returning with the new page locked is
quite acceptable. The "prepare_highmem_swapout()" thing breaks the
association with the pages anyway, and as such there is no race (and this
is allowable only exactly because of the anonymous and non-shared nature
of a private COW-mapping - which is the only thing we accept in that
code-path anyway).

Doing it that way means that there are no special cases in vmscan.c.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04  1:38             ` Linus Torvalds
  2000-05-04  2:44               ` Rajagopal Ananthanarayanan
  2000-05-04  3:16               ` Rajagopal Ananthanarayanan
@ 2000-05-04  7:42               ` Rajagopal Ananthanarayanan
  2000-05-04 15:33                 ` Linus Torvalds
  2 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04  7:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller

Linus Torvalds wrote:
> 
> Ok,
>  there's a pre7-4 out there that does the swapout with the page locked.

I did some testing of this patch with dbench.
The kernel starts shooting processes down pretty quickly
("VM: killing process XXX") on a 2 CPU 64MB system,
with nothing but dbench (8 clients). A concurrently
running vmstat shows very low free memory with some swapping,
and the buffer space remaining around 50MB.

I had applied the 7-4 patch on top of pre6.
When the patch was reversed (leaving just pre6),
the resulting kernel did not have any problems
running dbench in several tries.

Will try some more tomorrow after hearing others experience.

ananth.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04  7:42               ` Rajagopal Ananthanarayanan
@ 2000-05-04 15:33                 ` Linus Torvalds
  2000-05-04 15:57                   ` Rik van Riel
                                     ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 15:33 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan
  Cc: Kanoj Sarcar, linux-mm, David S. Miller, Rik van Riel

On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> 
> I did some testing of this patch with dbench.
> The kernel starts shooting processes down pretty quickly
> ("VM: killing process XXX") on a 2 CPU 64MB system,
> with nothing but dbench (8 clients). A concurrently
> running vmstat shows very low free memory with some swapping,
> and the buffer space remaining around 50MB.

Ok. The page locking patch should not change any swap behaviour at all, so
this behaviour is likely to be due to the pageout changes by Rik.

(Oh, the page locking might cause another part of the vmscanning logic to
temporarily ignore a page because it is locked, but that should be a very
small second-order effect compared to the "big picture" changes in how
much to page out).

Rik, I think the kswapd logic is wrong, and I suspect you made it
worsewhen you added the while-loop. The problem looks like that while
kswapd is working on one zone, it will entirely ignore any other zones. I
think the logic should be more like

	for (;;) {
		int something_to_do = 0;
		pgdat = pgdat_list;
		while (pgdat) {
			for(i = 0; i < MAX_NR_ZONES; i++) {
				zone = pgdat->node_zones+ i;
				if (!zone->size || !zone->zone_wake_kswapd)
					continue;
				something_to_do = 1;
				do_try_to_free_pages(GFP_KSWAPD, zone);
			}
			run_task_queue(&tq_disk);
			pgdat = pgdat->node_next;
		}
		if (something_to_do) {
			if (tsk->need_resched)
				schedule();
			continue;
		}
		tsk->state = TASK_INTERRUPTIBLE;
		interruptible_sleep_on(&kswapd_wait);
	}

See? This has two changes to the current logic:
 - it is more "balanced" on the do_try_to_free_pages(), ie it calls it for
   different zones instead of repeating one zone until no longer needed.
 - it continues to do this until no zone needs balancing any more, unlike
   the old one that could easily lose kswapd wakeup-requests and just do
   one zone.

What do you think? I suspect that the added do-loop in pre7 just made the
"lost wakeups" problem worse by concentrating on one zone for a longer
while and thus more likely to lose wakeups for lower zones (because it
already looked at those).

There might be other details like this lurking, but this looks like a good
first try. Ananth, willing to give it a whirl?

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 15:33                 ` Linus Torvalds
@ 2000-05-04 15:57                   ` Rik van Riel
  2000-05-04 17:19                   ` Rajagopal Ananthanarayanan
  2000-05-04 20:40                   ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Roger Larsson
  2 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2000-05-04 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rajagopal Ananthanarayanan, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Linus Torvalds wrote:

> 	for (;;) {
> 		int something_to_do = 0;
> 		pgdat = pgdat_list;
> 		while (pgdat) {
> 			for(i = 0; i < MAX_NR_ZONES; i++) {
> 				zone = pgdat->node_zones+ i;
> 				if (!zone->size || !zone->zone_wake_kswapd)
> 					continue;
> 				something_to_do = 1;
> 				do_try_to_free_pages(GFP_KSWAPD, zone);
> 			}
> 			run_task_queue(&tq_disk);
> 			pgdat = pgdat->node_next;
> 		}
> 		if (something_to_do) {
> 			if (tsk->need_resched)
> 				schedule();
> 			continue;
> 		}
> 		tsk->state = TASK_INTERRUPTIBLE;
> 		interruptible_sleep_on(&kswapd_wait);
> 	}
> 
> See? This has two changes to the current logic:
>  - it is more "balanced" on the do_try_to_free_pages(), ie it calls it for
>    different zones instead of repeating one zone until no longer needed.
>  - it continues to do this until no zone needs balancing any more, unlike
>    the old one that could easily lose kswapd wakeup-requests and just do
>    one zone.
> 
> What do you think?

Indeed, this probably better ...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 15:33                 ` Linus Torvalds
  2000-05-04 15:57                   ` Rik van Riel
@ 2000-05-04 17:19                   ` Rajagopal Ananthanarayanan
  2000-05-04 17:41                     ` Rik van Riel
  2000-05-04 20:40                   ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Roger Larsson
  2 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 17:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller, Rik van Riel

Linus Torvalds wrote:
> 
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > I did some testing of this patch with dbench.
> > The kernel starts shooting processes down pretty quickly
> > ("VM: killing process XXX") on a 2 CPU 64MB system,
> > with nothing but dbench (8 clients). A concurrently
> > running vmstat shows very low free memory with some swapping,
> > and the buffer space remaining around 50MB.
	
	[ ... ]

> 
> Rik, I think the kswapd logic is wrong, and I suspect you made it
> worsewhen you added the while-loop. The problem looks like that while
> kswapd is working on one zone, it will entirely ignore any other zones. I
> think the logic should be more like
	
	[ ... ]

> What do you think? I suspect that the added do-loop in pre7 just made the
> "lost wakeups" problem worse by concentrating on one zone for a longer
> while and thus more likely to lose wakeups for lower zones (because it
> already looked at those).
> 
> There might be other details like this lurking, but this looks like a good
> first try. Ananth, willing to give it a whirl?
> 
>                         Linus



I haven't looked at the code, but I replaced the whole while (1) loop
with the new for(;;). Things still remain the same: when running
dbench VM starts killing processes. Following is a vmstat trace during
the dbench run with the latest change to kswapd. For some reason,
the system seems to swap more with the change; I've attached a second
vmstat trace as of early AM today before the change. Again, I have
a 2 CPU box with 64M memory ... the tests are very controlled, nothing
changes except what I want to change (kernel bits).



------------- vmstat trace with kswapd change ----------------------
[root@delilah /root]# vmstat 1 1000
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0  41664    968  11144   0   0    12     2   55    14   1   1  98
 0  0  0      0  41664    968  11144   0   0     0     0  106    13   0   0 100
 0  0  0      0  41664    968  11144   0   0     0     0  105     8   0   0 100
 0  0  0      0  41664    968  11144   0   0     0     0  104     8   0   0 100
 0  0  0      0  41664    968  11144   0   0     0     0  107    26   0   0 100
 0  0  0      0  41664    968  11144   0   0     0     0  110    36   0   0 100
 0  0  0      0  41664    968  11144   0   0     0     0  114    46   0   0 100
 0  0  0      0  41236    968  11144   0   0     0     0  108    42   0   0 100
 0  0  0      0  41216    968  11144   0   0     0     0  105    23   1   0  99
 0  8  0      0  38324   1036  13260   0   0   142    85  203   356   2   5  93
 0  8  1      0  24376   1096  25548   0   0    33  3378  239   705   1  21  78
 0  8  1      0  21340   1116  28220   0   0    26  2482  304  1960   0   7  92
 1  7  1      0  11540   1160  36896   0   0    41  2623  318  5384   1  17  82
 0  8  1      0   6212   1180  41568   0   0    20  2813  332  3467   1   8  91
 0  8  1      0   3040   1192  44360   0   0     6  2727  271   803   0   7  92
 0  8  2    432   1208   1148  46484   0 432    60  6968  754  4099   0   7  92
 0  8  2    648   1112   1164  46780   0 216    14  3249  348  2720   0   9  91
 0  8  0   3948   1240   1112  49984   0 3300    44 22247 2210  6100   0   5  95
 0  8  0   4336   1540   1076  49896   0 388    28  4597  317   222   1  12  87
 3  5  1   6132   1200    444  51400 1736 3892  1872 31316 5098  2307   0  15  85
 0  7  0   6612   1092    452  51528 172 908   329  7872  862   463   0  21  79
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  5  1   6944   1548    456  51148 192 1052   339  8263 1362   405   1  14  86
 2  4  1   7212   1532    436  51216  16 668    35  7667  678   470   0  24  76
 2  6  1   7072   1540    436  51136 360 808   224  8202 1170   756   0  21  79
 1  3  2   7048   1260    448  51320  28 328    61  4082  484   126   0  20  80
 2  1  3   7012    964    460  51720  24 180    46  2545  310    90   0   7  93
 1  4  1   6984    992    472  51664  24 192    29  2917  320    91   1  17  82
 0  1  5   7064   1508    468  51116  32 248    28  2062  345    85   1  16  83
 0  3  3   6956   1540    456  51004  32 288   188  4072  525   183   0  14  85
 0  5  1   7168   1516    464  50864  84 628   454  4657  824   249   1  10  89
 0  7  0   7500   1108    496  51204  20 672   204  2668  412   171   4   9  87
 2  5  0   7484    756    504  51476   0 404   179  3601  558   312   1   6  93
 0  7  0   7476   1244    488  51076   4 228   342  1557  402   194   1   5  94
 0  6  1   7456   1244    496  51076   4 120   691  1530  399   118   3   5  92
 0  7  0   7372   1416    488  50780  28 344  1068  1086  351   178   1   6  93
 2  4  0   7352   1448    500  50740   0 244   893   506  413   152   2   3  95
 0  6  0   7252   1200    500  50944  48 124  1411   289  225   166   0   7  93
 0  7  0   7324   1404    488  50796  72 124   708   531  455   526   1   1  98
 0  7  0   7364   1480    488  50804  12 224   858   225  965   892   1   4  96
 0  6  1   7284   1424    500  50736  16 240  1986   140  218   192   0  10  89
 2  4  0   7420    876    500  51476   4 352  1424   588  599   511   2  10  88
 1  6  0   7424   1356    496  50944  68 168   679  1042  695   337   1   3  96
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  6  0   7312   1228    520  50936  44 176   513  2044  476   210   4   8  88
 0  6  0   7408   1464    516  50812  12 264   135  1566  277    94   3  12  85
 1  5  0   7384   1228    536  50952   0 120   208  2530  312   147   1   6  92
 0  6  0   7512   1332    528  50996   0 212   104  1553  326   104   0  10  89
 0  6  0   7548   1544    536  50908  16 236   132  2059  430   110   1   7  92
 2  4  1   7740   1360    552  51156  64 540   427  4635  614   232   3   6  91
 0  6  0   7564   1204    552  51172  56 124   217  1031  333    98   3   3  94
 0  6  0   7460   1064    560  51112  12 116   718  1529  423   195   6   7  87
 0  6  0   7516   1212    580  50968  16 236   662   559  258   176   2   7  91
 0  6  0   7472   1448    596  50580  16   0   638  1000  232   182   5  10  85
 1  5  1   7556   1380    608  50700  16 124   354   531  306   147   2   5  93
 1  5  0   7604   1204    624  50912  12 180   712  2545  470   139   3   6  91
 1  5  0   7612   1372    636  50688   8 292  1082  2573  507   199   2   6  92
 0  6  0   7600   1516    652  50548   4 168   392  1042  264   125   5   4  90
 0  6  0   7592   1356    676  50756   0 108   442   527  286   157   5   8  87
 0  6  0   7488   3400    704  48556   0   0   659     0  213   186   4   2  93
 0  6  0   7392   7188    712  44660   0   0   745     0  221   220  11  13  76
 0  6  0   7392   2588    724  49252   0   0   302  3000  306   486  14  13  73
 0  6  2   7420   2004    732  49944   0 136   130  1292  292  3150   7   7  85
 5  2  0   7520   2040    732  49988   0 104    38  2268  333   357   1   2  97
 4  3  1   7808   1144    724  51068   8 408   371  7555  844  3960   4   8  88
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 4  2  2   8116   1480    728  50852  24 692   190  5143  664   335   3   6  92
 1  5  0   7860   1516    736  50608  24 100   122  3083  452   228   3   9  88
 0  6  0   7760   1132    744  50780   4 100   352    44  423    99   3   3  94
 4  2  0   7676   2516    744  49292   0 128   610  1032  271   189  15  15  70
 3  1  5   8352    852    756  51380  28 1112   624 11778 1640  3164   3   9  88
 4  2  0   7928   1420    760  50480   8 264   225  1566  483   306   7  10  83
 0  6  0   7876   4504    764  47252   0   0   425  1500  328   144  10  16  74
 0  6  0   7876  10068    764  41664   0   0   265  2000  418   124  19  21  60
 2  4  1   7876   6608    764  45124   0   0   208  3137  266  1170  18  26  55
 0  6  1   7872   1712    764  50000   0   0   255  1943  327  4264   8  14  78
 3  3  2   7864   1540    764  50364   0 204    21  2003  357  3479   7  16  77
 0  6  2   7716    888    764  50992   0  68   147  2071  306  3040   8  10  83
 2  4  2   7804    900    772  51016   0 272   270  3982  599  3871   8  14  78
 1  5  1   7688   7028    780  44668   0   8   120  2416  326  1194  25  25  50
 0  6  1   7688   3152    780  48540   0   0   189  2364  335 10084  17  39  45
 0  6  1   7688   1352    780  50332   0   0   185  1843  381  6073  16  20  64
 4  2  1   7708   2636    776  49132   0 120    40  4975  298  2135   6  12  82
 4  2  2   7832   3856    792  48100   0 348   259  8521  714  6546   6  10  84
 4  2  1   7816   5792    792  46028   4 152   145  5785  377  1006  13  16  71
 1  5  1   7804   2348    792  49380  52   0   223   859  247  4483  13  19  69
 1  5  1   7828   2624    792  49156   0 112   104  2323  314  3228  11  15  74
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  4  1   7760   6600    792  44908   0   0   204  2878  361  1783  17  21  61
 2  5  1   7736   4428    792  47224 220   0   297  3065  417  4822  27  38  35
 0  7  1   7728   5128    792  46468  60   0   102  1865  434  2128   7   8  85
 4  3  1   7724   2768    792  48844  12   0   169  3404  498  7431  11  16  73
 2  4  0   7696   4128    792  47472   0 116   132  7416  444  4061  13  24  63
 1  5  1   7692   5016    792  46572   0   0     8  1500  220    63  11  12  76
 0  6  0   7672   5000    792  46564   0   0    12  2000  224   102   8   8  84
 5  1  0   7672   4720    792  46860   0   0    69  2000  258   796   7  12  80
 0  6  1   7672   5080    792  46484   0   0   115  3114  302  1880  28  34  38
 0  6  1   7672   4308    792  47256   0   0     7  1718  286  2861   4  10  86
 0  6  1   7672   6528    792  45020  12   0   114  1461  330  1508   3   5  92
 0  6  0   7672   9080    792  42468   0   0     7   707  296   233   0   3  97
 0  6  0   7672  11948    792  39592   8   0    30  3500  317   182  11  14  75
 0  6  0   7672  35396    792  16144   0   0    69     0  273   169   0   7  93
 0  6  0   7672  39716    792  11824   0   0   119     0  223   246   0   4  96
 0  1  0   7628  41828    792  10520 768   0   532     0  216   248   1   1  98
 0  0  0   7616  41320    792  10956 320   0   212     0  124    44   0   1  99
 0  0  0   7616  41320    792  10956   0   0     0    16  119    12   0   0 100
 0  0  0   7616  40580    792  11252 196   0   159     0  127    69   1   0  99
 0  0  0   7616  40568    792  11252   0   0     0     0  106    23   1   0  99
 0  8  1   7616  28856    792  22556   0   0    50  1045  137   239   2  17  80
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 7  1  1   7616  19400    792  32076   0   0     5  4999  307  1382   1  17  83
 0  8  1   7616   8604    792  42832   0   0    10  5133  312  1871   0  19  81
 0  8  1   7616   4616    792  46816   0   0     9  4186  295   758   0   8  92
 0  8  1   7616   1032    792  50404   0   0     5  3804  325   803   0  10  90
-----------------------------------------------------------------------------------




------------------ vmstat trace as of pure 7-4 patch ----------------------------
[root@delilah /root]# vmstat 1 1000
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0  45176    712   8144   0   0    62     8   71    60   3   3  94
 0  0  0      0  45176    712   8144   0   0     0     0  106    12   0   0 100
 0  1  0      0  45160    716   8144   0   0     1     0  106    14   0   0 100
 0  0  0      0  45024    732   8252   0   0    34    59  140    49   0   1  99
 0  0  0      0  45024    732   8252   0   0     0     0  121     8   0   0 100
 0  0  0      0  45024    732   8252   0   0     0     0  104    10   0   0 100
 0  0  0      0  45024    732   8252   0   0     0     0  121    76   0   0 100
 0  0  0      0  45020    736   8252   0   0     6     0  117    55   1   0  98
 0  2  0      0  44640    772   8324   0   0    27    65  157    84   1   0  98
 0  1  0      0  43436    908   8900   0   0   182     0  231   240   5   1  94
 1  0  0      0  40604    940  10584   0   0   402     0  173   150  17   2  81
 1  0  0      0  40452    952  10896   0   0    84     0  117    80  42   4  54
 1  0  0      0  39072    952  10980   0   0     4     0  108    41  50   2  49
 1  0  0      0  40476    988  11264   0   0   111    13  146   115  26   3  70
 1  0  0      0  40916    988  11196   0   0     8     0  110    76  47   2  50
 0  0  0      0  41576    988  11220   0   0     0     0  108    51  17   2  80
 0  0  0      0  41576    988  11220   0   0     0     0  104     8   0   0 100
 0  0  0      0  41576    988  11220   0   0     0     0  103     8   0   0 100
 0  0  0      0  41576    988  11220   0   0     0     1  108    18   0   0 100
 0  0  0      0  41576    988  11220   0   0     0     0  115    48   0   0 100
 0  0  0      0  41576    988  11220   0   0     0     0  110    32   0   0 100
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0  41140    988  11220   0   0     0     0  110    43   0   1  99
 0  0  0      0  41128    988  11220   0   0     0     0  111    25   0   0 100
 8  0  0      0  30488   1072  20260   0   0    73     0  126   186   1  13  86
 0  8  0      0  20308   1120  29148   0   0     4  8500  203    66   1  16  83
 0  8  1      0  11632   1160  36792   0   0    30  4599  267   395   0  15  84
 0  8  1      0   2240   1208  45052   0   0     8  5787  346   981   1  17  82
 0  8  2      0   1228    736  46520   0   0    14  2806  298   735   1  17  82
 0  8  3      0   1144    388  46648   0   0    13  5154  392   871   1  15  84
 5  4  1   1068   1080    384  47832   0 1068    17 15921 1059  1134   0  11  89
 8  0  2   2220   1456    404  48616   0 1152     9  5788  487   123   0  10  89
 0  8  2   3044   1028    408  49572   0 840     5  4210  264    95   0  17  83
 0  8  1   3540   1100    412  49964   0 500     2  8125  222    89   0   8  92
 0  8  2   3852   1088    416  50188   0 312     8  3359  224    89   0   9  91
 0  9  1   4048   1072    448  50208  84 204    65   770  350   168   1  22  77
 0  8  2   4016   1132    460  50068 200   0    88  5464  376   808   1  31  67
 0  8  3   4016   1072    476  50112  56   0    25  4052  363   668   0  11  89
 0  8  3   4016   1100    484  50080   0   0    13  4529  356   583   0  11  89
 0  8  2   4384   1032    452  50504   0 372     4  9637  345   569   1  10  89
 0  8  0   4432   1536    448  50052   0  48     0  1923  219   206   0   2  98
 0  8  2   5004   1540    444  50628   0 572    71  6836  416    86   0   7  92
 1  3  6   5280   1188    432  51168   0 332    21  9390  714  6872   1  11  88
Killed
[root@delilah /root]# vmstat 1 1000
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  4  0   5672  30508    760  21128  12  31    85   710  114   159   4   8  87
 0  0  0   5608  31360    760  20740 436   0   172   121  210   227   0   2  98
 0  0  0   5608  31360    760  20740   0   0     0     0  168    10   0   0 100
 0  0  0   5608  31360    760  20740   0   0     0     0  104     8   0   0 100
 0  0  0   5608  31360    760  20740   0   0     0     0  105     8   0   0 100
 0  0  0   5592  31324    760  20756  28   0    11     0  112    24   0   0 100
 0  0  0   5592  31308    760  20768  12   0     3    22  124    26   0   0 100
 0  0  0   5592  31300    760  20776   8   0     2     0  108    20   0   0 100
 0  0  0   5592  31300    760  20776   0   0     0     0  103     8   0   0 100
 0  0  0   5592  30860    760  20780   4   0     1     0  106    32   0   0  99
 0  0  0   5592  30848    760  20780   0   0     0     0  106    25   0   0 100
 1  7  1   5592  16064    760  35156   0   0    63  2365  176  1593   1  23  76
 0  8  1   5592   2608    760  48632   0   0    38  5350  448  3538   0  28  72
 0  8  2   5820   1060    744  50576   0 252    14  4888  409   848   0  10  89
 0  5  4   6196   1208    744  50760   0 384     2 10556  543  1439   0   4  96
 2  4  3   6080   1120    520  51372 100 1308   399 30927 2351  6011   0  36  63
 0  8  2   5792   1056    552  51092 416   0   389  4967  776  6246   0  41  59
 5  3  3   5344   1540    564  50908 188 1052   597 22696 2076  2402   0  25  75
 0  4  0   5308   1536    576  50724 100 208   731  5552  732   243   1   7  91
 0  5  1   5300   1348    588  50900 120 180   968  7045  809  1056   2  12  87
 0  5  0   5200   1356    584  50760  56   0   299  1500  416   108   1   2  96
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  5  2   5100   1524    544  50644  24  56   169  4347  301   406   2  11  87
 0  5  0   5128   1392    556  50896  48  68   225  2184  398   198   3   4  93
 0  3  1   5032   1368    560  50884  32   0   306  1000  334   109   1   5  94
 1  2  0   5100   1048    572  51256   0  96   327  2024  266   104   1   8  91
 0  3  0   5080   1376    556  50900  12  40   386  2510  277    90   2  11  86
 0  3  0   5160   1332    580  50892   0 120  1105  2030  220   148   4  10  86
 0  3  0   5108   1228    588  50944   0   0   286  2500  349    76   2   5  93
 2  1  0   5108   2488    604  49672   0   0   645     0  251   162  10   6  84
 0  3  0   5140   1028    604  51248   0 232   365  2058  262   148  10  20  71
 0  3  0   5452   1172    604  51304   0 392   262  4098  342   160   1   4  95
 0  3  0   5324   1160    600  51152   0   0   194  1500  455    67   5   7  87
 1  2  0   5308   1500    608  50736   0   0   204  2000  310    83   5   7  88
 2  1  1   5228   1260    612  51024   0  92  1408  2023  220   209  33  42  25
 0  3  0   5200   1084    616  51148   0   0     5  2000  308   721   7   7  86
 0  3  0   5004   3116    616  49120   0   4     0  2001  263    48   5   3  92
 3  0  0   4956   4336    616  47868   0   0   588  1500  217   106  31  34  34
 0  3  0   4848   1496    616  50676   0   0   145  2000  242    62  14  16  70
 3  0  0   4848   3684    616  48500   0   0   385   500  219    64  23  26  51
 0  3  0   4848   1280    612  50888   0   0   140  3500  212    64  16  14  70
 3  0  0   4848   5464    612  46760   0   0   225     0  201    52   8  23  70
 0  3  0   4848   1396    616  50768   0   0   293  3000  207    74  20  26  54
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  2  0   4848   4236    616  47936   0   0     0  1500  258    49   6   4  90
 0  3  0   4848   1956    616  50216   0   0   310  2000  209    81  24  27  49
 2  1  0   4820   1488    608  50692   0   0   196  1500  303    58  10   8  81
 0  3  0   4820   3896    608  48308   8   0   281  1000  237   127  18  22  60
 0  3  0   4816  33088    608  19068   0   0   178     0  224   261   1   9  90
 0  0  0   4752  34680    608  17808 144   0   232     0  168   121   0   1  98
 0  1  0   4748  34600    608  17888   4   0    79    40  138    24   0   1  99
 0  0  0   4736  34584    608  17884   8   0     2     0  109    16   0   0 100
 0  0  0   4736  33996    608  18036  32   0   134     0  131    66   1   1  98
 0  1  0   4736  31912    608  20056   0   0    55     0  132    74   1   4  95
 0  8  0   4736  15540    608  36272   0   0    72 10000  217   121   0  23  77
 4  4  2   4736   5060    608  46716   0   0     9  1670  222   107   0  20  80
 0  8  1   4736   2728    608  48856   0   0    22  1653  225   593   0   8  92
 2  6  2   4876   1184    600  50744   0 244    35 10361  400   640   1  10  89
 3  5  2   5084   1204    608  50908   0 272    93  9176  647  6848   1   7  92
Killed
--------------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 17:19                   ` Rajagopal Ananthanarayanan
@ 2000-05-04 17:41                     ` Rik van Riel
  2000-05-04 18:18                       ` Rajagopal Ananthanarayanan
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2000-05-04 17:41 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan
  Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> Linus Torvalds wrote:

> > There might be other details like this lurking, but this looks like a good
> > first try. Ananth, willing to give it a whirl?
> 
> I haven't looked at the code, but I replaced the whole while (1)
> loop with the new for(;;). Things still remain the same: when
> running dbench VM starts killing processes.

I've been thinking about it some more. When we look
carefully the killing is always accompanied by a sudden
decrease in free memory (while kswapd could easily keep
up a few seconds ago).

Having an active/inactive queue, where we maintain a
certain target number of inactive pages, should give us
some more robustness against sudden overload. Also,
guaranteeing that we have indeed a certain number of
freeable pages in every zone...

I'm coding this up as we speak, so please hold on a
little longer ...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 17:41                     ` Rik van Riel
@ 2000-05-04 18:18                       ` Rajagopal Ananthanarayanan
  2000-05-04 18:43                         ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 18:18 UTC (permalink / raw)
  To: riel; +Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller

Rik van Riel wrote:
> 
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> > Linus Torvalds wrote:
> 
> > > There might be other details like this lurking, but this looks like a good
> > > first try. Ananth, willing to give it a whirl?
> >
> > I haven't looked at the code, but I replaced the whole while (1)
> > loop with the new for(;;). Things still remain the same: when
> > running dbench VM starts killing processes.
> 
> I've been thinking about it some more. When we look
> carefully the killing is always accompanied by a sudden
> decrease in free memory (while kswapd could easily keep
> up a few seconds ago).

You may have something here. It's the burstiness of
the demand. One thing I haven't noticed here in linux-mm
is any approaches to throttle the demand (Or may be I haven't
looked enough). Why not keep requests for new pages unsatisfied
if the _rate_ of allocations exceeds the _rate_ of freeing
(through swap-out or through write-out [bdflush])?

Simple counters don't capture rates. We need deltas in
the last 'n' time intervals. Then, match the delta-A
(allocation) to delta-F (free). 

Just a thought,

ananth.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 18:18                       ` Rajagopal Ananthanarayanan
@ 2000-05-04 18:43                         ` Linus Torvalds
  2000-05-04 19:00                           ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 18:43 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> 
> You may have something here. It's the burstiness of
> the demand.

The way bursty demand is _supposed_ to be handled is that the "demander"
just ends up doing the "try_to_free_pages()" call itself at that point.
There's no way kswapd can handle these cases sanely, and at some point we
just need to start freeing memory synchronously.

So what's probably started happening is that "try_to_free_pages()" is not
trying hard enough to free stuff (probably the counters changed, and it
now only scans half the memory it used to under pressure), so when we get
into the bursty demand situation, the allocator ends up giving up in
disgust.

This is something you'll never see under non-busrty load, simply because
kswapd doesn't care - it will continue to page stuff out whether
try_to_free_pages() returns happy or not. So if try_to_free_pages() isn't
trying hard enough, kswapd will compensate by just calling it more.

Note that changing how hard try_to_free_pages() tries to free a page is
exactly part of what Rik has been doing, so this is something that has
changed recently. It's not trivial to get right, for a very simple reason:
we need to balance the "hardness" between the VM area scanning and the RLU
list scanning.

Rik probably balanced it ok, but ended up making it too soft, giving up
much too easily even when memory really would be available if it were to
just try a bit harder..

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 18:43                         ` Linus Torvalds
@ 2000-05-04 19:00                           ` Rik van Riel
  2000-05-04 19:17                             ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2000-05-04 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rajagopal Ananthanarayanan, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Linus Torvalds wrote:

> Note that changing how hard try_to_free_pages() tries to free a page is
> exactly part of what Rik has been doing, so this is something that has
> changed recently. It's not trivial to get right, for a very simple reason:
> we need to balance the "hardness" between the VM area scanning and the RLU
> list scanning.

With the current scheme, it's pretty much impossible to get it
right.

> Rik probably balanced it ok, but ended up making it too soft,
> giving up much too easily even when memory really would be
> available if it were to just try a bit harder..

*nod*

I hope the active/inactive page list scheme will fix this.

(we can push harder since we'll have pages in every stage
of aging every time)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 19:00                           ` Rik van Riel
@ 2000-05-04 19:17                             ` Linus Torvalds
  2000-05-04 21:16                               ` Rajagopal Ananthanarayanan
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 19:17 UTC (permalink / raw)
  To: riel; +Cc: Rajagopal Ananthanarayanan, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Rik van Riel wrote:

> On Thu, 4 May 2000, Linus Torvalds wrote:
> 
> > Note that changing how hard try_to_free_pages() tries to free a page is
> > exactly part of what Rik has been doing, so this is something that has
> > changed recently. It's not trivial to get right, for a very simple reason:
> > we need to balance the "hardness" between the VM area scanning and the RLU
> > list scanning.
> 
> With the current scheme, it's pretty much impossible to get it
> right.

Not really. That is what the "priority levels" are really there for: for
normal use it's actually sufficient to just make sure that the starter
levels (ie 6) balance reasonably well between VM scanning and RLU
scanning. If they balance ok, then system behaviour will be quite
acceptable.

At the same time it is important to make sure that the higher priorities
(ie 1 and 0) try _much_ harder to swap things out than the lower ones.
They don't need to be very balanced, but they need to be effective. That's
why shrink_mmap() uses a quite grotesque 

	count = nr_lru_pages >> priority;

which means that level 0 will try 64 times harder than level 6 to page
something out. It's also important that once you get to level 0, it really
should scan everything available more than once (once for aging, once for
"everything was aged the first time, the second time we really free
something").

This, I think, is where the new swap_out() falls down flat on its face. It
does a much softer swapout, and the priority is not as aggressive as it is
for shrink_mmap(). Instead of a exponential increase with priority, it
uses a linear one: "counter = nr_threads / (priority+1)".

I suspect that for "priority = 0", we should make sure that "counter" is
at _least_ "nr_threads * 2", simply because we should walk the page tables
at least twice before giving up: the "mm->swap_address" logic means that
walking them once may have started somewhere in the middle and never even
looked at the low values. Because we age, it should probably be more than
that.

So it might be that we should really use something like

	counter = (nr_threads << 1) >> (priority >> 1);

instead (just a completely made up heuristic - I just made up something
that has exponential behaviour while still having a starting point close
to what we have now to get roughly the same balancing).

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 15:33                 ` Linus Torvalds
  2000-05-04 15:57                   ` Rik van Riel
  2000-05-04 17:19                   ` Rajagopal Ananthanarayanan
@ 2000-05-04 20:40                   ` Roger Larsson
  2 siblings, 0 replies; 45+ messages in thread
From: Roger Larsson @ 2000-05-04 20:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-mm, Rik van Riel

Oh,

This moves the need_resched test out of the inner loop.
Not good if you like low latencies (I like them).
(The do_try_to_free_pages can take quite some time...)

>                         if (tsk->need_resched)
>                                 schedule();

Please move it back into the inner loop!

/RogerL

Linus Torvalds wrote:
> 
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > I did some testing of this patch with dbench.
> > The kernel starts shooting processes down pretty quickly
> > ("VM: killing process XXX") on a 2 CPU 64MB system,
> > with nothing but dbench (8 clients). A concurrently
> > running vmstat shows very low free memory with some swapping,
> > and the buffer space remaining around 50MB.
> 
> Ok. The page locking patch should not change any swap behaviour at all, so
> this behaviour is likely to be due to the pageout changes by Rik.
> 
> (Oh, the page locking might cause another part of the vmscanning logic to
> temporarily ignore a page because it is locked, but that should be a very
> small second-order effect compared to the "big picture" changes in how
> much to page out).
> 
> Rik, I think the kswapd logic is wrong, and I suspect you made it
> worsewhen you added the while-loop. The problem looks like that while
> kswapd is working on one zone, it will entirely ignore any other zones. I
> think the logic should be more like
> 
>         for (;;) {
>                 int something_to_do = 0;
>                 pgdat = pgdat_list;
>                 while (pgdat) {
>                         for(i = 0; i < MAX_NR_ZONES; i++) {
>                                 zone = pgdat->node_zones+ i;
>                                 if (!zone->size || !zone->zone_wake_kswapd)
>                                         continue;
>                                 something_to_do = 1;
>                                 do_try_to_free_pages(GFP_KSWAPD, zone);
>                         }
>                         run_task_queue(&tq_disk);
>                         pgdat = pgdat->node_next;
>                 }
>                 if (something_to_do) {
>                         if (tsk->need_resched)
>                                 schedule();
>                         continue;
>                 }
>                 tsk->state = TASK_INTERRUPTIBLE;
>                 interruptible_sleep_on(&kswapd_wait);
>         }
> 
> See? This has two changes to the current logic:
>  - it is more "balanced" on the do_try_to_free_pages(), ie it calls it for
>    different zones instead of repeating one zone until no longer needed.
>  - it continues to do this until no zone needs balancing any more, unlike
>    the old one that could easily lose kswapd wakeup-requests and just do
>    one zone.
> 
> What do you think? I suspect that the added do-loop in pre7 just made the
> "lost wakeups" problem worse by concentrating on one zone for a longer
> while and thus more likely to lose wakeups for lower zones (because it
> already looked at those).
> 
> There might be other details like this lurking, but this looks like a good
> first try. Ananth, willing to give it a whirl?
> 
>                         Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux.eu.org/Linux-MM/

--
Home page:
  http://www.norran.net/nra02596/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 19:17                             ` Linus Torvalds
@ 2000-05-04 21:16                               ` Rajagopal Ananthanarayanan
  2000-05-04 21:51                                 ` Rik van Riel
  2000-05-04 22:21                                 ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-04 21:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller

Linus Torvalds wrote:
> 
> On Thu, 4 May 2000, Rik van Riel wrote:
> 
> > On Thu, 4 May 2000, Linus Torvalds wrote:
> >
> > > Note that changing how hard try_to_free_pages() tries to free a page is
> > > exactly part of what Rik has been doing, so this is something that has
> > > changed recently. It's not trivial to get right, for a very simple reason:
> > > we need to balance the "hardness" between the VM area scanning and the RLU
> > > list scanning.
> >
> > With the current scheme, it's pretty much impossible to get it
> > right.
> 
> Not really. That is what the "priority levels" are really there for: for

	[ ... discussion about shrink_mmap() ... ]

> This, I think, is where the new swap_out() falls down flat on its face. It
	
	[ ... discussion about swap_out() ... ]

I looked over the latest (7-4) implementation of swap_out,
shrink_mmap & try_to_free_pages, etc.

One clarification: In the case I reported only
dbench was running, presumably doing a lot of read/write. So, why
isn't shrink_mmap able to find freeable pages? Is it because
the shrink_mmap() is too conservative about implementing LRU?
I mean, it doesn't make sense to swap pages just to keep others
in cache ... if the demand is too high, start shooting down
pages regardless.

Or, is shrink_mmap bailing not because of referenced bit,
but because bdflush is too slow, for example? That is,
are the pages having active I/O so can't be freed?

Do you guys think a profile using gcc-style mcount
would be useful?


-- 
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 21:16                               ` Rajagopal Ananthanarayanan
@ 2000-05-04 21:51                                 ` Rik van Riel
  2000-05-04 22:21                                 ` Linus Torvalds
  1 sibling, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2000-05-04 21:51 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan
  Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:

> One clarification: In the case I reported only
> dbench was running, presumably doing a lot of read/write. So, why
> isn't shrink_mmap able to find freeable pages? Is it because
> the shrink_mmap() is too conservative about implementing LRU?
> I mean, it doesn't make sense to swap pages just to keep others
> in cache ... if the demand is too high, start shooting down
> pages regardless.

Indeed, we've seen kswapd fail to get us free pages even
when the total RSS was small...

> Or, is shrink_mmap bailing not because of referenced bit,
> but because bdflush is too slow, for example? That is,
> are the pages having active I/O so can't be freed?
> 
> Do you guys think a profile using gcc-style mcount
> would be useful?

This could be very useful indeed. To be honest I'm not sure
what is happening (though I have some suspicions).

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04 21:16                               ` Rajagopal Ananthanarayanan
  2000-05-04 21:51                                 ` Rik van Riel
@ 2000-05-04 22:21                                 ` Linus Torvalds
  2000-05-05  0:47                                   ` 7-4 VM killing (A solution) Rajagopal Ananthanarayanan
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-04 22:21 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> 
> One clarification: In the case I reported only
> dbench was running, presumably doing a lot of read/write. So, why
> isn't shrink_mmap able to find freeable pages? Is it because
> the shrink_mmap() is too conservative about implementing LRU?

Probably. One of the things that has changed is exactly _which_ pages are
on the LRU list, so the old heuristics from shrink_mmap() may need some
tweaking too. In fact, as with vmscan, we should probably scan the LRU
list at least _twice_ when the priority level reaches zero (in order to
defeat the aging).

This is also an area where the secondary effects of the vmscan page
lockedness changes could start showing up - the page being locked on the
LRU list makes a difference to the shrink_mmap() algorithm..

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* 7-4 VM killing (A solution)
  2000-05-04 22:21                                 ` Linus Torvalds
@ 2000-05-05  0:47                                   ` Rajagopal Ananthanarayanan
  2000-05-05  1:30                                     ` Rik van Riel
  2000-05-05  5:13                                     ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-05  0:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller

Linus Torvalds wrote:
> 
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > One clarification: In the case I reported only
> > dbench was running, presumably doing a lot of read/write. So, why
> > isn't shrink_mmap able to find freeable pages? Is it because
> > the shrink_mmap() is too conservative about implementing LRU?
> 
> Probably. One of the things that has changed is exactly _which_ pages are
> on the LRU list, so the old heuristics from shrink_mmap() may need some
> tweaking too. In fact, as with vmscan, we should probably scan the LRU
> list at least _twice_ when the priority level reaches zero (in order to
> defeat the aging).

Ok, I may have a solution after having asked, mostly to myself,
why doesn't shrink_mmap() find pages to free?

The answer apparenlty is because in 7-4 shrink_mmap(),
unreferenced pages get filed as "young" if the zone has
enough pages in it (free_pages > pages_high).

Because of this bug, if we examine a zone which already
has enough free pages, all referenced pages now go to
the "back" of the lru list.

On a subsequent scan, we may never get to these pages in time.
Comments?

Here's the new code to shrink_mmap:

------------
		[ ... ]
		 dispose = &young;
                if (test_and_clear_bit(PG_referenced, &page->flags))
                        goto dispose_continue;

                if (!page->buffers && page_count(page) > 1)
                        goto dispose_continue;

                dispose = &old;
                if (p_zone->free_pages > p_zone->pages_high)
                        goto dispose_continue;

                count--;
                /* Page not used -> free it or put it on the old list
                 * so it gets freed first the next time */
                if (TryLockPage(page))
                        goto dispose_continue;
		[ ... ]
-------------------

With this I'm able to run dbench upto 16 threads (using over
0.5 GB of disk). For reference, without the fix,
dbench wouldn't run even with as few as 4 threads (using
much less disk space).

> 
> This is also an area where the secondary effects of the vmscan page
> lockedness changes could start showing up - the page being locked on the
> LRU list makes a difference to the shrink_mmap() algorithm..
> 
>                 Linus

Kanoj & I looked over your changes (lot easier to do over
the phone!) ... and didn't find any thing wrong with it.

Again, with the above fix things look good. Since
7-4 is badly broken in this respect, do you want a patch?
Since it is a small change, you can put it in "by hand" ...


-- 
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: 7-4 VM killing (A solution)
  2000-05-05  0:47                                   ` 7-4 VM killing (A solution) Rajagopal Ananthanarayanan
@ 2000-05-05  1:30                                     ` Rik van Riel
  2000-05-05  1:47                                       ` Rajagopal Ananthanarayanan
  2000-05-05  5:13                                     ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2000-05-05  1:30 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan
  Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:

> Ok, I may have a solution after having asked, mostly to myself,
> why doesn't shrink_mmap() find pages to free?
> 
> The answer apparenlty is because in 7-4 shrink_mmap(),
> unreferenced pages get filed as "young" if the zone has
> enough pages in it (free_pages > pages_high).
> 
> Because of this bug, if we examine a zone which already
> has enough free pages, all referenced pages now go to
> the "back" of the lru list.
> 
> On a subsequent scan, we may never get to these pages in time.
> Comments?
> 
> Here's the new code to shrink_mmap:
> 
> ------------
> 		[ ... ]
> 		 dispose = &young;
>                 if (test_and_clear_bit(PG_referenced, &page->flags))
>                         goto dispose_continue;
> 
>                 if (!page->buffers && page_count(page) > 1)
>                         goto dispose_continue;
> 
>                 dispose = &old;
>                 if (p_zone->free_pages > p_zone->pages_high)
>                         goto dispose_continue;

I've tried this variant (a few weeks ago, before submitting
the current code to Linus) and have found a serious bug in
it.

If we put all the unreferenced pages from one zone (with
enough free pages) on the front of the queue, a subsequent
run will not make it to the pages of the zone which needs
to have pages freed currently...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: 7-4 VM killing (A solution)
  2000-05-05  1:30                                     ` Rik van Riel
@ 2000-05-05  1:47                                       ` Rajagopal Ananthanarayanan
  0 siblings, 0 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-05  1:47 UTC (permalink / raw)
  To: riel; +Cc: Linus Torvalds, Kanoj Sarcar, linux-mm, David S. Miller

Rik van Riel wrote:
> 

> I've tried this variant (a few weeks ago, before submitting
> the current code to Linus) and have found a serious bug in
> it.
> 
> If we put all the unreferenced pages from one zone (with
> enough free pages) on the front of the queue, a subsequent
> run will not make it to the pages of the zone which needs
> to have pages freed currently...
> 

The only reason why pages should be moved to the tail
of the lru list is when they are referenced, and may be
if they have high page->count.

Pages in zones with enough free memory should not be re-ordered.
Such pages should not control the iterations of shrink_mmap.

The unreferenced pages currently at the front of the lru
queue are the ones that we should free first anyway. Just because
the corresponding zone has enough free memory in it, the
relative order does not change. Are you talking about these
pages adding up against "count" in shrink_mmap?

On a more practical note, how does your bug manifest?
What does not run, or does not run better?

-- 
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Oops in __free_pages_ok (pre7-1) (Long) (backtrace)
  2000-05-04  4:10                 ` Linus Torvalds
@ 2000-05-05  4:46                   ` Rajagopal Ananthanarayanan
  0 siblings, 0 replies; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-05  4:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kanoj Sarcar, linux-mm, David S. Miller

Linus Torvalds wrote:
> 
> On Wed, 3 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > One other problem with having the page locked in
> > try_to_swapout() is in the call to
> > prepare_highmem_swapout() when the incoming
> > page is in highmem.
> 
> Look at how I handled this in pre7-4.
> 
> Just unlocking the old page and returning with the new page locked is
> quite acceptable. The "prepare_highmem_swapout()" thing breaks the
> association with the pages anyway, and as such there is no race (and this
> is allowable only exactly because of the anonymous and non-shared nature
> of a private COW-mapping - which is the only thing we accept in that
> code-path anyway).
> 
> Doing it that way means that there are no special cases in vmscan.c.

Yep, now I see it after having actually applied the patch ;-)
I missed it in the original patch file with just the diffs, sorry.

ananth.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: 7-4 VM killing (A solution)
  2000-05-05  0:47                                   ` 7-4 VM killing (A solution) Rajagopal Ananthanarayanan
  2000-05-05  1:30                                     ` Rik van Riel
@ 2000-05-05  5:13                                     ` Linus Torvalds
  2000-05-05  6:44                                       ` Rajagopal Ananthanarayanan
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-05  5:13 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller


On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> 
> Ok, I may have a solution after having asked, mostly to myself,
> why doesn't shrink_mmap() find pages to free?
> 
> The answer apparenlty is because in 7-4 shrink_mmap(),
> unreferenced pages get filed as "young" if the zone has
> enough pages in it (free_pages > pages_high).

Good catch.

That's obviously a bug, and your fix looks like the obvious fix. Thanks,

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: 7-4 VM killing (A solution)
  2000-05-05  5:13                                     ` Linus Torvalds
@ 2000-05-05  6:44                                       ` Rajagopal Ananthanarayanan
  2000-05-05  6:51                                         ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-05-05  6:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller

Linus Torvalds wrote:
> 
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> >
> > Ok, I may have a solution after having asked, mostly to myself,
> > why doesn't shrink_mmap() find pages to free?
> >
> > The answer apparenlty is because in 7-4 shrink_mmap(),
> > unreferenced pages get filed as "young" if the zone has
> > enough pages in it (free_pages > pages_high).
> 
> Good catch.
> 
> That's obviously a bug, and your fix looks like the obvious fix. Thanks,

Rik still had some reservations although he hasn't
sent a response to my rebuttal, yet. We'll see see.

On another note, noticed your change to shrink_mmap in 7-5:

-------
-       count = nr_lru_pages >> priority;
+       count = (nr_lru_pages << 1) >> priority;
-------

Is this to defeat aging? If so, I think its overly cautious:
if all an iteration of shrink_mmap did was to flip the referenced bit,
then that iteration shouldn't be included in count (and in the
current code it isn't). So why double the effort?

-- 
--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: 7-4 VM killing (A solution)
  2000-05-05  6:44                                       ` Rajagopal Ananthanarayanan
@ 2000-05-05  6:51                                         ` Linus Torvalds
  2000-05-05 10:23                                           ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2000-05-05  6:51 UTC (permalink / raw)
  To: Rajagopal Ananthanarayanan; +Cc: riel, Kanoj Sarcar, linux-mm, David S. Miller


On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> 
> On another note, noticed your change to shrink_mmap in 7-5:
> 
> -------
> -       count = nr_lru_pages >> priority;
> +       count = (nr_lru_pages << 1) >> priority;
> -------
> 
> Is this to defeat aging? If so, I think its overly cautious:
> if all an iteration of shrink_mmap did was to flip the referenced bit,
> then that iteration shouldn't be included in count (and in the
> current code it isn't). So why double the effort?

It was indeed because I thought we should defeat aging. But you're right,
the reference bit flip doesn't get counted. My bad, and I'll revert that
one (and you found the real reason for pages not getting free'd anyway)

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: 7-4 VM killing (A solution)
  2000-05-05  6:51                                         ` Linus Torvalds
@ 2000-05-05 10:23                                           ` Rik van Riel
  0 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2000-05-05 10:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rajagopal Ananthanarayanan, Kanoj Sarcar, linux-mm, David S. Miller

On Thu, 4 May 2000, Linus Torvalds wrote:
> On Thu, 4 May 2000, Rajagopal Ananthanarayanan wrote:
> > On another note, noticed your change to shrink_mmap in 7-5:
> > 
> > -------
> > -       count = nr_lru_pages >> priority;
> > +       count = (nr_lru_pages << 1) >> priority;
> > -------
> > 
> > Is this to defeat aging? If so, I think its overly cautious:
> > if all an iteration of shrink_mmap did was to flip the referenced bit,
> > then that iteration shouldn't be included in count (and in the
> > current code it isn't). So why double the effort?
> 
> It was indeed because I thought we should defeat aging. But
> you're right, the reference bit flip doesn't get counted.

Also, we'll be holding the pages on our local &young list, so
we won't be able to see them again (but that's ok since the
next call to shrink_mmap() can easily free them all).

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2000-05-05 10:23 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <8ener4$6djpb$1@fido.engr.sgi.com>
2000-05-03  3:11 ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Rajagopal Ananthanarayanan
2000-05-03  3:47   ` Linus Torvalds
2000-05-03  5:26     ` Kanoj Sarcar
2000-05-03  6:22       ` Rajagopal Ananthanarayanan
2000-05-03 16:11         ` Kanoj Sarcar
2000-05-03 16:19           ` Linus Torvalds
2000-05-03 16:35             ` Kanoj Sarcar
2000-05-03 17:16               ` Linus Torvalds
2000-05-03 17:31                 ` Kanoj Sarcar
2000-05-03 18:17                   ` Linus Torvalds
2000-05-03 18:37                     ` Rajagopal Ananthanarayanan
2000-05-03 18:37                     ` Kanoj Sarcar
2000-05-03 19:41                       ` Rajagopal Ananthanarayanan
2000-05-03 21:28                     ` Jeff Garzik
2000-05-03  8:11       ` Linus Torvalds
2000-05-03  8:31         ` Linus Torvalds
2000-05-03 16:08           ` Kanoj Sarcar
2000-05-03 16:14             ` Linus Torvalds
2000-05-03 16:24               ` Kanoj Sarcar
2000-05-04  1:38             ` Linus Torvalds
2000-05-04  2:44               ` Rajagopal Ananthanarayanan
2000-05-04  4:05                 ` Linus Torvalds
2000-05-04  3:16               ` Rajagopal Ananthanarayanan
2000-05-04  4:10                 ` Linus Torvalds
2000-05-05  4:46                   ` Rajagopal Ananthanarayanan
2000-05-04  7:42               ` Rajagopal Ananthanarayanan
2000-05-04 15:33                 ` Linus Torvalds
2000-05-04 15:57                   ` Rik van Riel
2000-05-04 17:19                   ` Rajagopal Ananthanarayanan
2000-05-04 17:41                     ` Rik van Riel
2000-05-04 18:18                       ` Rajagopal Ananthanarayanan
2000-05-04 18:43                         ` Linus Torvalds
2000-05-04 19:00                           ` Rik van Riel
2000-05-04 19:17                             ` Linus Torvalds
2000-05-04 21:16                               ` Rajagopal Ananthanarayanan
2000-05-04 21:51                                 ` Rik van Riel
2000-05-04 22:21                                 ` Linus Torvalds
2000-05-05  0:47                                   ` 7-4 VM killing (A solution) Rajagopal Ananthanarayanan
2000-05-05  1:30                                     ` Rik van Riel
2000-05-05  1:47                                       ` Rajagopal Ananthanarayanan
2000-05-05  5:13                                     ` Linus Torvalds
2000-05-05  6:44                                       ` Rajagopal Ananthanarayanan
2000-05-05  6:51                                         ` Linus Torvalds
2000-05-05 10:23                                           ` Rik van Riel
2000-05-04 20:40                   ` Oops in __free_pages_ok (pre7-1) (Long) (backtrace) Roger Larsson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox