From: "Michaud, Adrian"
Subject: [LSF/MM TOPIC][LSF/MM ATTEND] Multiple Page Caches, Memory Tiering, Better LRU evictions
Date: Fri, 13 Jan 2017 21:49:14 +0000
To: "lsf-pc@lists.linux-foundation.org"
Cc: "linux-mm@kvack.org"

I'd like to attend and propose one or all of the following topics at this year's summit.

 

Multiple Page Caches (Software Enhancements)

--------------------------

Support for multiple page caches can provide many benefits to the kernel.

Different memory types can be put into different page caches: one page cache for native DDR system memory, another for slower NV-DIMMs, and so on.

General memory can be partitioned into several page caches of different sizes; these could be dedicated to high-priority processes, or used with containers to better isolate memory by dedicating a page cache to a cgroup.

Each VMA, or process, could have a page cache identifier, or page alloc/free callbacks, that allow individual VMAs or processes to specify which page cache they want to use.

Some VMAs might want anonymous memory backed by vast amounts of slower server-class memory like NV-DIMMs.

Some processes or individual VMAs might want their own private page cache.

Each page cache can have its own eviction policy and low-water marks.

Individual page caches could also have their own swap device.
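
To make the per-cache knobs concrete, here is a minimal userspace sketch (not kernel code; struct pcache, pcache_for_vma() and all of the fields are illustrative assumptions) of a cache descriptor that carries its own backing tier, low-water mark and swap device, with a VMA-held identifier selecting among them:

#include <stdio.h>

/* Illustrative only: a page-cache descriptor with its own policy knobs. */
enum tier { TIER_DDR, TIER_NVDIMM };

struct pcache {
    int        id;           /* page cache identifier referenced by VMAs */
    enum tier  tier;         /* which kind of memory backs this cache    */
    long       low_water;    /* pages below which eviction kicks in      */
    const char *swap_dev;    /* per-cache swap device                    */
};

/* A hypothetical per-VMA field: which page cache to allocate from. */
struct vma_stub {
    int pcache_id;
};

static struct pcache caches[] = {
    { 0, TIER_DDR,    4096, "/dev/swap_fast" },
    { 1, TIER_NVDIMM, 1024, "/dev/swap_slow" },
};

/* Resolve a VMA's identifier to its cache, falling back to cache 0. */
static struct pcache *pcache_for_vma(const struct vma_stub *vma)
{
    for (unsigned i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
        if (caches[i].id == vma->pcache_id)
            return &caches[i];
    return &caches[0];
}

int main(void)
{
    struct vma_stub anon_vma = { .pcache_id = 1 };   /* wants NV-DIMM backing */
    struct pcache *pc = pcache_for_vma(&anon_vma);

    printf("VMA uses cache %d (tier %d, low water %ld, swap %s)\n",
           pc->id, pc->tier, pc->low_water, pc->swap_dev);
    return 0;
}

In a real implementation the identifier would presumably live in the VMA or mm_struct and select among separate LRU lists and writeback/swap targets rather than a static table.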

 

Memory Tiering (Software Enhancements)

--------------------

Using multiple page caches, evictions from one page cache could be moved and remapped to another page cache instead of being unmapped and written to swap.

If a system has 16GB of high-speed DDR memory and 64GB of slower memory, one could create one page cache backed by the high-speed DDR, another backed by the slower 64GB, and evict/copy/remap from the DDR page cache to the slow-memory page cache. Evictions from the slow-memory page cache would then be unmapped and written to swap.
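
As a toy model of the 16GB/64GB example (plain C, with FIFO order standing in for whatever eviction policy each cache would actually use, and small array sizes standing in for the two capacities): evicting from the fast cache demotes the page into the slow cache, and only slow-cache evictions go to swap.

#include <stdio.h>

#define FAST_CAP 4   /* stand-in for the 16GB DDR cache  */
#define SLOW_CAP 8   /* stand-in for the 64GB slow cache */

static int fast[FAST_CAP], slow[SLOW_CAP];
static int fast_n, slow_n;

/* Evict the oldest slow-tier page to "swap". */
static void evict_slow(void)
{
    printf("page %d: slow tier -> swap\n", slow[0]);
    for (int i = 1; i < slow_n; i++)
        slow[i - 1] = slow[i];
    slow_n--;
}

/* Demote the oldest fast-tier page into the slow tier instead of swapping it. */
static void demote_fast(void)
{
    if (slow_n == SLOW_CAP)
        evict_slow();
    printf("page %d: fast tier -> slow tier (remapped, not swapped)\n", fast[0]);
    slow[slow_n++] = fast[0];
    for (int i = 1; i < fast_n; i++)
        fast[i - 1] = fast[i];
    fast_n--;
}

/* Fault a new page into the fast tier, demoting if it is full. */
static void fault_in(int page)
{
    if (fast_n == FAST_CAP)
        demote_fast();
    fast[fast_n++] = page;
}

int main(void)
{
    for (int page = 0; page < 16; page++)
        fault_in(page);
    return 0;
}

The key point is that the fast-to-slow step is a migration/remap, not I/O; only the final slow-tier eviction pays the cost of writing to a swap device.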

 

Better LRU evictions (Software and Hardware Enhancements)

-------------------------

Add a page fault counter to the page struct to help colorize page demand.

We could suggest to Intel/AMD and other architecture leaders that TLB entries also have a translation counter (8-10 bits is sufficient) instead of just an "accessed" bit. Scanning/clearing access bits is obviously inefficient; however, if TLBs had a translation counter instead of a single accessed bit, then scanning and recording how much activity each TLB entry has seen would be significantly better and would allow us to better identify LRU pages for eviction.
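
The following sketch (ordinary userspace C, not the kernel's struct page; the array and function names are made up) shows what an 8-bit demand counter buys over a single accessed bit: faults/accesses bump a saturating counter, periodic aging halves it so stale activity fades, and eviction picks the coldest page rather than merely a "not recently accessed" one.

#include <stdint.h>
#include <stdio.h>

#define NPAGES 8

/* Hypothetical per-page metadata: an 8-bit demand counter instead of one bit. */
static uint8_t hotness[NPAGES];

static void page_touched(int page)
{
    if (hotness[page] < UINT8_MAX)
        hotness[page]++;          /* saturating increment on fault/access */
}

static void age_all(void)
{
    for (int i = 0; i < NPAGES; i++)
        hotness[i] >>= 1;         /* periodic decay so old activity fades */
}

static int pick_victim(void)
{
    int victim = 0;
    for (int i = 1; i < NPAGES; i++)
        if (hotness[i] < hotness[victim])
            victim = i;           /* coldest page, not just "not accessed" */
    return victim;
}

int main(void)
{
    for (int round = 0; round < 4; round++) {
        for (int i = 0; i < NPAGES; i++)
            for (int t = 0; t < i; t++)   /* page i is touched i times */
                page_touched(i);
        age_all();
    }
    int v = pick_victim();
    printf("evict page %d (hotness %u)\n", v, (unsigned)hotness[v]);
    return 0;
}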

 

TLB Shootdown (Hardware Enhancements)

--------------------------

We should stomp our feet and demand that TLB shootdowns be hardware-assisted in future architectures. Current TLB shootdown on x86 is horribly inefficient and obviously doesn't scale. The QPI/UPI local bus protocol should provide a TLB range-invalidation broadcast so that a single CPU can concurrently notify other CPUs/cores (with a selection mask) that a shared TLB entry has changed. Sending an IPI to each core is horribly inefficient, especially with core counts increasing and the frequency of TLB unmapping/remapping also likely to increase soon with new server-class memory extension technology.
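
A simplified model of today's software protocol (not the actual x86/kernel implementation; the CPU count, mask and address are made up) makes the scaling problem visible: the initiator signals every core in the mask individually and then waits for every acknowledgement, so both the work and the latency grow with the number of cores, whereas a bus-level broadcast invalidation would be a single operation regardless of mask size.

#include <stdint.h>
#include <stdio.h>

#define NCPUS 8

/* Simplified model of IPI-based shootdown: one notification and one ack per
 * remote core in the mask, with the initiator spinning until all acks arrive. */
static void ipi_shootdown(uint32_t target_mask, unsigned long addr)
{
    uint32_t pending = target_mask;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (!(target_mask & (1u << cpu)))
            continue;
        /* "IPI handler" runs on the target: invalidate and acknowledge. */
        printf("cpu %d: invalidate TLB entry for %#lx, ack\n", cpu, addr);
        pending &= ~(1u << cpu);
    }
    while (pending)
        ;   /* initiator waits until every target has acked */
    printf("initiator: shootdown of %#lx complete\n", addr);
}

int main(void)
{
    ipi_shootdown(0xAAu, 0x7f641000ul);   /* odd-numbered CPUs share the mapping */
    return 0;
}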

 

Page Tables, Interrupt Descriptor Table, Global Descriptor Table, etc. (Software and Hardware Enhancements)

-------------------------------------------------------------------------------------

As small amounts of ultra-high-speed memory on servers become available (for example, on-package memory from Intel), it would be good to use this memory first for things like the interrupt descriptor tables, which we would always like to have the lowest latency, and possibly for some or all of the page tables to allow faster TLB fills/evictions, since their frequency and latency directly affect overall load/store performance. It is also worth considering putting some of the most frequently accessed kernel data, such as the current PID, into this ultra-high-speed memory as well.
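
Purely as a thought experiment (the pool size, the bump allocator and the placement order are all assumptions made for illustration, not a proposal for the actual allocator), the software side could look like a small reservation of the ultra-fast region that the most latency-critical structures claim first, with everything else falling back to normal memory:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Pretend this 64 KiB buffer is the ultra-fast on-package region. */
static unsigned char fast_pool[64 * 1024];
static size_t fast_used;

/* Bump-allocate from the fast pool; fall back to ordinary memory if full. */
static void *alloc_pref_fast(size_t size, const char *what)
{
    if (fast_used + size <= sizeof(fast_pool)) {
        void *p = fast_pool + fast_used;
        fast_used += size;
        printf("%-16s -> fast memory (%zu bytes)\n", what, size);
        return p;
    }
    printf("%-16s -> normal memory (%zu bytes)\n", what, size);
    return malloc(size);
}

int main(void)
{
    /* Latency-critical structures get first claim on the fast region. */
    alloc_pref_fast(256 * 16, "IDT");             /* 256 gate descriptors   */
    alloc_pref_fast(4096 * 4, "top page tables"); /* a few page-table pages */
    alloc_pref_fast(512 * 1024, "page cache");    /* too big, falls back    */
    return 0;
}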

 

Over the last few years I've implemented all of these in a private kernel, with the exception of the hardware enhancements mentioned above. With support for multiple page caches, multiple swap devices, individual page coloring, and better LRU evictions, I've realized up to 30% overall performance improvements when testing large, memory-exhausting applications like MongoDB with MMAPv1. I've also implemented transparent memory tiering using an Intel 3DXP DIMM simulator as a second tier of slower memory. I'd love to discuss everything I've done in this space and see whether there is interest in moving some of this into the mainline kernel, or whether I could help with similar efforts that might already be active.

 

Thanks,

 

Adrian Michaud

 

 
