Here are my notes from the LSF/MM 2016 MM track. I expect LWN.net to have nicely readable articles on most of these discussions.

LSF/MM 2016 Memory Management track notes

Transparent Huge Pages
- Kirill & Hugh have different implementations of tmpfs transparent huge pages
- Kirill can split 4kB pages out of huge pages, to avoid full splits (refcounting implementation, compound pages)
- Hugh's implementation: get it up and running quickly and unobtrusively (team pages)
- Kirill's implementation can dirty 4kB inside a huge page on write()
- Kirill wants to get huge pages in the page cache to work for ext4
  - cannot be transparent to the filesystem
- Hugh: what about small files? huge pages would be wasted space
- Kirill: madvise/fadvise for THP, or a file-size-based policy (madvise usage sketched below)
- at write time allocate 4kB pages, khugepaged can collapse them
- Andrea: what is the advantage of using huge pages for small files?
- Hugh: 2MB initial allocation is shrinkable, not charged to memcg
- Kirill: for tmpfs, also need to check against the tmpfs filesystem size when deciding what page size to allocate
- Kirill: does not like how tmpfs is growing more and more special cases (radix tree exception entries, etc.)
- Aneesh, Andrea: also not happy that the kernel would grow yet another kind of huge page
- Hugh: Kirill can probably use the same mlock logic my code uses
- Kirill: I do not mlock pages, just VMAs, and prevent pageout that way
- Hugh: Kirill has some stuff working better than I realized, maybe can still use some of my code
- Kirill: on huge PMD split, Hugh splits into PTEs; Kirill just blows away the PMD and lets faults fill in the PTEs
- Hugh: what Kirill's code does is not quite correct for mlock
- Kirill: mlock does not guarantee lack of minor faults
- Aneesh: PPC64 needs deposited page tables
  - the hardware page table is hashed on the actual page size; the huge page is only logical, not HW supported
  - the last level page table stores slot/hash information
- Andrea: do not worry too much about memory consumption with THP
  - if worried, do small allocations and let khugepaged collapse them
  - use the same model for THP file cache as used for THP anonymous memory
- Andrea/Kirill/Hugh:
  - no need to use special radix tree entries for huge pages, in general
  - at hole punch time they could be useful later, as an optimization
  - might want a way to mark 4kB pages dirty on the radix tree side, inside a compound page (or use page flags on the tail page struct)
- Hugh: how about two radix trees?
- Everybody else: yuck :)
- Andrea: with the compound model, I see no benefit to multiple radix trees
- First preparation series (by Hugh) already went upstream
  - Kirill can use some of Hugh's code
  - DAX needs some of the same code, too
- Hugh: compound pages could be extended to offer his functionality, would like to integrate what he has
  - settle on sysfs/mount options before freezing
  - then add compound pages on top
- Hugh: current show stoppers with Kirill's code: small files, hole punching

khugepaged -> task_work
- advantage: concentrate THP on the tasks that use the most CPU and could benefit from them the most
- Hugh: having one single scanner/compacter might have advantages
- when to trigger scanning?
  - Hugh: observe at page fault time? Vlastimil: if there are no faults because the memory is already present, there would not be an observation event
  - Johannes: wait for someone to free a THP?
  - maybe background scanning is still best?
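
As a concrete reference for the madvise-based policy mentioned above, here is a minimal userspace sketch (my own illustration, not code from the session) that opts an anonymous mapping into THP with MADV_HUGEPAGE; the file-backed equivalent discussed in the session would depend on the tmpfs/ext4 work described above.

/* Minimal sketch: opt an anonymous mapping into transparent huge pages.
 * MADV_HUGEPAGE only expresses a preference; the kernel may still back
 * the range with 4kB pages and let khugepaged collapse them later. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4UL << 20;        /* 4MB: room for two 2MB huge pages */

        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Ask for huge pages on this range (needs CONFIG_TRANSPARENT_HUGEPAGE). */
        if (madvise(buf, len, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        /* Touch the memory so it actually gets faulted in. */
        for (size_t i = 0; i < len; i += 4096)
                ((char *)buf)[i] = 1;

        munmap(buf, len);
        return 0;
}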

merge plans
- Hugh would like to merge team pages now, switch to compound pages later
- Kirill would like to get compound pages into shape first, then merge things
- Andrea: if we go with team pages, we should ensure it is the right solution for both anonymous memory and ext4
- Andrea: can we integrate the best parts of both code bases and merge that?
- Mel: one of my patch series is heavily colliding with team pages (moving accounting from zones to nodes)
- Andrew: need a decision on team pages vs compound pages
- Hugh: if compound pages went in first, we would not replace them with team pages later - but the other way around might happen

merge blockers
- compound page issues: small-file memory waste, fast recovery for small files, get khugepaged into shape, maybe deposit/withdrawal, demonstrate recovery, demonstrate robustness (or Hugh demonstrates brokenness)
- team page issues: recovery (khugepaged cannot collapse team pages), anonymous memory support (Hugh: pretty sure it is possible), API compatible to test compound, don't use page->private, path forward for other filesystems
- revert team page patches from MMOTM until the blockers are addressed

GFP flags
- __GFP_REPEAT
  - fuzzy semantics: keep retrying until an allocation succeeds
  - meant for higher order allocations
  - but mostly used for order 0... (not useful)
  - can be cleaned up to get a useful semantic for higher order allocations: "can fail, try hard to be successful, but could still fail in the end"
- __GFP_NORETRY - fail after a single attempt to reclaim something; not very helpful except for optimistic/opportunistic allocations (see the sketch after this section)
- maybe have __GFP_BEST_EFFORT, try until a certain point then give up? (retry until OOM, then fail?)
- remove __GFP_REPEAT from non-costly allocations
- introduce a new flag, use it where useful
- can the allocator know compaction was deferred?
- more explicit flags? NORECLAIM, NOKSWAPD, NOCOMPACT, NO_OOM, etc.
  - use explicit flags to switch stuff off
  - clameter: have default definitions with all the "normal stuff" enabled
- flags are inconsistent - sometimes positive, sometimes negative, sometimes for common things, sometimes for uncommon things
- THP allocation is not explicit, but inferred from certain flags
- consensus on cleaning up GFP usage
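
To make the flag semantics concrete, here is a minimal kernel-style sketch (my own illustration, not code from the session) of the opportunistic pattern __GFP_NORETRY is useful for: try a high-order allocation once, without long retries, and fall back to order-0 pages when it fails. The alloc_buffer() name is hypothetical.

/* Sketch of an opportunistic high-order allocation.  __GFP_NORETRY makes
 * the allocator give up after a single reclaim/compaction attempt instead
 * of retrying, so failure here is cheap and we can fall back to 4kB pages. */
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *alloc_buffer(unsigned int *order)
{
        struct page *page;

        /* One-shot attempt at a contiguous 2^order block. */
        page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN, *order);
        if (page)
                return page;

        /* Cheap failure: fall back to a single 4kB page, which an order-0
         * GFP_KERNEL allocation will try much harder to satisfy. */
        *order = 0;
        return alloc_pages(GFP_KERNEL, 0);
}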

CMA
- KVM on PPC64 runs into a strange hardware requirement
  - needs contiguous memory for certain data structures
- tried to reduce fragmentation/allocation issues with ZONE_CMA
  - atomic order-0 allocations fail early, due to kswapd not kicking in on time
  - taking pages out of the CMA zone first
  - compaction does not move movable compound pages (e.g. THP), breaking CMA in ZONE_CMA
  - mlock and other things pinning allocated-as-movable pages also break CMA
- what to do instead of ZONE_CMA?
  - how to keep things movable? sticky MIGRATE_MOVABLE zones?
  - do not allow reclaimable & unmovable allocations in sticky MIGRATE_MOVABLE zones
  - memory hotplug has similar requirements to CMA, no need for a new name
  - need something like physical memory linear reclaim, finding sticky MIGRATE_MOVABLE zones and reclaiming everything inside
- Mel: would like to see ZONE_CMA and ZONE_MOVABLE go away
- FOLL_MIGRATE get_user_pages flag to migrate pages away from the movable region when they are pinned
  - should be handled by core code, get_user_pages

Compaction, higher order allocations
- compaction not invoked from THP allocations with the delayed fragmentation patch set
- kcompactd daemon for background compaction
  - should kcompactd do fast direct reclaim? let's see
  - cooperation with OOM
- compaction - hard to get useful feedback about
  - compaction "does random things, returns with random answers"
  - no notion of "costly allocations"
  - compaction can keep indefinitely deferring action, even for smaller allocations (e.g. order 2)
- sometimes compaction finds too many page blocks with the skip bit set
  - success rate of compaction skyrocketed with skip bits ignored (stale skip bits?)
- migration skips over MIGRATE_UNMOVABLE page blocks found during order-9 compaction
  - a page block may be perfectly suitable for smaller order compaction
  - have THP skip more aggressively, while order 2 scans inside more page blocks
  - priority for compaction code? aggressiveness of diving into blocks vs skipping
- order-9 allocators:
  - THP - wants the allocation to fail quickly if no order-9 page is available
  - hugetlbfs - really wants allocations to succeed

VM containers
- VMs imply more memory consumption than the applications running in them need
- how to pressure the guest to give back memory to the host?
- adding a new shrinker did not seem to perform well
- move the page cache to the host so it would be easier to reclaim memory for all guests
- move memory management from the guest kernel to the host, some kind of memory controller
- have the guest tell the host how to reclaim, sharing LRU information for instance
- mmu_notifier already shares some information via the accessed bit (young), but mmu_notifier is too coarse
- DAX (in the guest) should be fine to solve filesystem memory
  - if not DAX-backed on the host, needs a new mechanism for IO barriers, etc.
- FUSE driver in the guest and move the filesystem to the host
- exchange memory pressure between guest and host so that the host can ask the guest to adjust its pressure depending on the overall situation of the host

Generic page-pool recycle facility
- found bottlenecks in both the page allocator and the DMA APIs
- "packet-page" / explicit data path API
- make it generic across multiple use cases
  - get rid of open coded driver approaches (a sketch of such an approach follows this section)
- Mel: make the per-cpu allocator fast enough to act as the page pool
  - gets NUMA locality, shrinking, etc. all for free
- needs pool sizing for used pool items, too - can't keep collecting incoming packets without handling them
- allow the page allocator to reclaim memory
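
For context, here is a deliberately simplified sketch (my own, with hypothetical "pkt_pool" names; real drivers differ in detail) of the kind of open-coded per-driver page recycling this session wants to replace with a generic facility: a small array of pages that the receive path refills from and returns to, falling back to the page allocator when empty.

/* Simplified open-coded driver page pool (hypothetical names).  A generic
 * facility would replace this per-driver code and let the page allocator
 * reclaim pooled pages under pressure.  pool->lock is assumed to have been
 * initialized with spin_lock_init() at setup time. */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

#define PKT_POOL_SIZE 256

struct pkt_pool {
        spinlock_t lock;
        unsigned int count;
        struct page *pages[PKT_POOL_SIZE];
};

/* RX path: prefer a recycled page, fall back to the page allocator. */
static struct page *pkt_pool_get(struct pkt_pool *pool)
{
        struct page *page = NULL;

        spin_lock(&pool->lock);
        if (pool->count)
                page = pool->pages[--pool->count];
        spin_unlock(&pool->lock);

        if (!page)
                page = alloc_page(GFP_ATOMIC);
        return page;
}

/* Completion path: recycle into the pool, or give the page back when full. */
static void pkt_pool_put(struct pkt_pool *pool, struct page *page)
{
        spin_lock(&pool->lock);
        if (pool->count < PKT_POOL_SIZE) {
                pool->pages[pool->count++] = page;
                page = NULL;
        }
        spin_unlock(&pool->lock);

        if (page)
                put_page(page);
}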

Address Space Mirroring
- Haswell-EX allows memory mirroring, of part or all of memory
- goal: improve high availability by avoiding uncorrectable errors in kernel memory
- partial mirroring has higher remaining memory capacity, but is not software transparent
  - some memory mirrored, some not
- mirrored memory is set up in the BIOS, the amount in each NUMA node proportional to the amount of memory in that node
- mirror range info is in the EFI memory map
- avoid kernel allocations from non-mirrored memory ranges, avoid ZONE_MOVABLE allocations
- put user allocations in non-mirrored memory, avoid ZONE_NORMAL allocations
- MADV_MIRROR to put certain user memory in mirrored memory
  - problem: to put a whole program in mirrored memory, need to relocate libraries into mirrored memory
  - what is the value proposition of mirroring user space memory?
- policy: when mirrored memory is requested, do not fall back to non-mirrored memory
  - Michal: is this desired?
- Aneesh: how should we represent mirrored memory? zones? something else?
- Michal: we are back to the highmem problem
  - lesson from the highmem era: keep the ratio of kernel to non-kernel memory low enough, below 1:4
- how much userspace needs to be in mirrored memory, in order to be able to restart applications?
- should we have opt-out for mirrored instead of opt-in?
- proposed interface: prctl
- kcore mirror code is upstream already
- Mel: systems using lots of ZONE_MOVABLE have problems, and are often unstable
- Mel: assuming userspace can figure out the right thing to choose what needs to be mirrored is not safe
- Vlastimil: use non-mirrored memory as frontswap only, put all managed memory in mirrored memory
- dwmw2: for a workload of "guests we care about, guests we don't care about", we can put the guest memory of unimportant guests in non-mirrored memory
- Mel: even in that scenario a non-important guest's kernel allocations could exhaust mirrored memory
- Mel: partial mirroring makes a promise of reliability that it cannot deliver on
  - false hope
  - complex configuration makes the system less reliable
- Andrea: memory hotplug & other ZONE_MOVABLE users already cause the same problems today

Heterogeneous Memory Management
- used for GPUs, CAPI, other kinds of offload engines
- GPUs have much faster memory than system RAM
  - to get performance, GPU offload data needs to sit in VRAM
- a shared address space creates an easier programming model
- needs the ability to migrate memory between system RAM and VRAM
  - CPU cannot access VRAM
  - GPU can access system RAM ... very very slowly
- hardware is coming up real soon (this year)
- without HMM
  - GPU workloads run 10-100x slower
  - need to pin lots of system memory (16GB per device?)
  - use of mmu_notifier spreads to device drivers, instead of one common solution
- special swap type to handle migration
- future OpenCL API wants address space sharing
- HMM has some core VM impact, but it is relatively contained
- how to get HMM upstream? does anybody have objections to anything in HMM?
  - split up into several series
  - Andrew: put more info in the changelogs
- space for future optimizations
- dwmw2: the svm API should move to a generic API
  - intel_svm_bind_mm - bind the current process to a PASID

MM validation & debugging
- Sasha using KASAN on locking, to trap missed locks
  - requires annotation of what memory is locked by a lock
  - how to annotate what memory is protected by a lock? (see the sketch after this section)
  - Kirill: what about a struct with a lock inside?
  - annotate struct members with which lock protects them?
  - too much work
- trying to improve hugepage testing
  - split_all_huge_pages
  - expose a list of huge pages through debugfs, allow splitting arbitrarily chosen ones
- fuzzer to open, close, read & write random files in sysfs & debugfs
- how to coordinate security(?) issues with zero-day security folks?
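
As a rough illustration of the annotation question above (my own sketch, not a proposal from the session): a hypothetical __protected_by() marker documents which lock covers which fields, and the existing lockdep_assert_held() can at least enforce the rule at run time in the accessors.

/* Hypothetical annotation sketch.  __protected_by() does not exist in the
 * kernel; it only marks, for a reader or a future tool, which lock guards
 * which fields.  lockdep_assert_held() is real and catches missed locks
 * at run time when lockdep is enabled. */
#include <linux/lockdep.h>
#include <linux/spinlock.h>

#define __protected_by(lock)        /* annotation only, expands to nothing */

struct counter_set {
        spinlock_t lock;
        unsigned long events __protected_by(lock);
        unsigned long errors __protected_by(lock);
};

static void counter_add_event(struct counter_set *c)
{
        lockdep_assert_held(&c->lock);  /* complain if the caller forgot the lock */
        c->events++;
}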

Memory cgroups
- how to figure out the memory a cgroup needs (as opposed to what it currently uses)?
  - memory pressure is not enough to determine the needs of a cgroup
- cgroups are scanned in equal portions
  - unfair: streaming file IO can result in using lots of memory, even when the cgroup has mostly inactive file pages
- potential solution: dynamically balance the cgroups
  - adjust limits dynamically, based on their memory pressure
  - problem: how to detect memory pressure? when to increase memory? when to decrease memory?
- real time aging of the various LRU lists
  - only for the active / anon lists, not the inactive file list
  - "keep cgroup data in memory if its working set is younger than X seconds"
- refault info: distinguish between refaults (working set faulted back in) and evictions of data that is only used once
  - can be used to know when to grow a cgroup, but not when to shrink it
- vmpressure API: does not work well on very large systems, only on smaller ones
  - quickly reaches "critical" levels on large systems that are not even that busy
- Johannes: time-based statistic to measure how much time processes wait for IO
  - not iowait, which measures how long the _system_ waits, but per-task
  - add refault info in, only count time spent on refaults
  - wait time above threshold? grow the cgroup
  - wait time under threshold? shrink the cgroup, but not below its lower limit
- Larry: Docker people want per-cgroup vmstat info

TLB flush optimizations
- mmu_gather side of the TLB flush (pattern sketched after this section)
  - collect invalidations, gather items to flush
  - patch: increase the size of mmu_gather, and try to flush more at once
- Andrea - rmap length scalability issues
  - too many KSM pages merged together, the rmap chain becomes too long
  - put an upper limit on the number of shares of a KSM page (256 share limit)
- mmu_notifiers batch flush interface?
- limit max_page_sharing to reduce KSM rmap chain length
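
For reference, a minimal sketch of the mmu_gather batching pattern on the unmap path that the size-increase patch targets (my own illustration with function signatures as in kernels around 4.5, not the patch itself):

/* Unmap-side mmu_gather pattern: pages to be freed are gathered, the TLB
 * is flushed (ideally once, in a large batch), and only then are the pages
 * actually freed.  A larger mmu_gather means fewer intermediate flushes. */
#include <asm/tlb.h>
#include <linux/mm.h>

static void unmap_and_free_range(struct mm_struct *mm,
                                 struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end)
{
        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, mm, start, end);   /* start batching */
        unmap_vmas(&tlb, vma, start, end);      /* clears PTEs, queues pages */
        tlb_finish_mmu(&tlb, start, end);       /* flush TLB, free queued pages */
}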

OOM killer
- goal: make OOM invocation more deterministic
- currently: reclaim until there is nothing left to reclaim, then invoke the OOM killer
- problem: sometimes reclaim gets stuck, and the OOM killer is not invoked when it should be
  - one single page freed resets the OOM counter, causing livelock
  - thrashing is not detected; on the contrary, this helps thrashing happen
- make things more conservative?
  - OOM killer invoked on heavy thrashing and no progress made in the VM
- OOM reaper - to free resources before the OOM-killed task can exit by itself
- a timeout based solution is not trivial; doable, but not preferred by Michal
  - if Johannes can make a timeout scheme deterministic, Michal has no objections
  - Michal: I think we can do better without a timer solution
  - need a deterministic way to put the system into a consistent state
- tmpfs vs the OOM killer
  - the OOM killer cannot discard tmpfs files
  - with cgroups, reap a giant tmpfs file anyway in special cases at Google
  - restart the whole container, dump the container's tmpfs contents

MM tree workflow
- most of Andrew's job: solicit feedback from people
- the -mm git tree helps many people
- Michal: would like email message ID references in patches, both for original patches and fixes
  - the value of -fix patches is that previous reviews do not need to be re-done
  - sometimes a replacement patch is easier
- Kirill: sometimes difficult to get patch sets reviewed
- acked-by and reviewed-by lines generally get added by hand
- Michal: the -mm tree is the maintainer tree of last resort
- Andrew: carrying those extra patches isn't too much work

SLUB optimizations lightning talk
- bulk APIs for SLUB + SLAB
  - kmem_cache_{alloc,free}_bulk()
  - kfree_bulk()
  - 60% speedup measured
  - can be used from networking, RCU free, ... (usage sketched below)
- per-CPU freelist per page
  - nice speedup, but still suffers from a race condition
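
To show what the bulk API looks like from a caller's point of view, here is a small kernel-style sketch (my own, with a hypothetical "my_obj" cache passed in by the caller) using kmem_cache_alloc_bulk()/kmem_cache_free_bulk(); the networking and RCU-free call sites mentioned above follow the same pattern.

/* Bulk allocation/free cycle.  kmem_cache_alloc_bulk() fills the array and
 * returns the number of objects allocated (0 on failure);
 * kmem_cache_free_bulk() returns the whole batch in one call, which is
 * where the measured speedup comes from. */
#include <linux/errno.h>
#include <linux/slab.h>

#define BATCH 16

struct my_obj {
        unsigned long data;
};

static int demo_bulk(struct kmem_cache *cache)
{
        void *objs[BATCH];
        int n, i;

        n = kmem_cache_alloc_bulk(cache, GFP_KERNEL, BATCH, objs);
        if (!n)
                return -ENOMEM;

        for (i = 0; i < n; i++)
                ((struct my_obj *)objs[i])->data = i;

        kmem_cache_free_bulk(cache, n, objs);
        return 0;
}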