Here are my notes from the LSF/MM 2016 MM track. I expect LWN.net to have nicely readable articles on most of these discussions.

LSF/MM 2016 Memory Management track notes

Transparent Huge Pages
- Kirill & Hugh have different implementations of tmpfs transparent huge pages
- Kirill can split 4kB pages out of huge pages, to avoid full splits (refcounting implementation, compound pages)
- Hugh's implementation: get it up and running quickly and unobtrusively (team pages)
- Kirill's implementation can dirty 4kB inside a huge page on write()
- Kirill wants to get huge pages in the page cache to work for ext4
  - cannot be transparent to the filesystem
- Hugh: what about small files? huge pages would be wasted space
- Kirill: madvise/fadvise for THP, or a file-size-based policy (madvise usage sketched below)
- at write time allocate 4kB pages, khugepaged can collapse them
- Andrea: what is the advantage of using huge pages for small files?
- Hugh: 2MB initial allocation is shrinkable, not charged to memcg
- Kirill: for tmpfs, also need to check against the tmpfs filesystem size when deciding what page size to allocate
- Kirill: does not like how tmpfs is growing more and more special cases (radix tree exception entries, etc.)
- Aneesh, Andrea: also not happy that the kernel would grow yet another kind of huge page
- Hugh: Kirill can probably use the same mlock logic my code uses
- Kirill: I do not mlock pages, just VMAs, and prevent pageout that way
- Hugh: Kirill has some stuff working better than I realized, maybe can still use some of my code
- Kirill: on huge PMD split, Hugh splits into PTEs; Kirill just blows away the PMD and lets faults fill in the PTEs
- Hugh: what Kirill's code does is not quite correct for mlock
- Kirill: mlock does not guarantee lack of minor faults
- Aneesh: PPC64 needs deposited page tables
  - the hardware page table is hashed on the actual page size; the huge page is only logical, not HW supported
  - the last level page table stores slot/hash information
- Andrea: do not worry too much about memory consumption with THP
  - if worried, do small allocations and let khugepaged collapse them
  - use the same model for THP file cache as used for THP anonymous memory
- Andrea/Kirill/Hugh:
  - no need to use special radix tree entries for huge pages, in general
  - at hole punch time they could be useful later, as an optimization
  - might want a way to mark 4kB pages dirty on the radix tree side, inside a compound page (or use page flags on the tail page struct)
- Hugh: how about two radix trees?
- Everybody else: yuck :)
- Andrea: with the compound model, I see no benefit to multiple radix trees
- First preparation series (by Hugh) already went upstream
  - Kirill can use some of Hugh's code
  - DAX needs some of the same code, too
- Hugh: compound pages could be extended to offer his functionality, would like to integrate what he has
  - settle on sysfs/mount options before freezing
  - then add compound pages on top
- Hugh: current show stoppers with Kirill's code: small files, hole punching

khugepaged -> task_work
- advantage: concentrate THP on the tasks that use the most CPU and could benefit from them the most
- Hugh: having one single scanner/compacter might have advantages
- when to trigger scanning?
  - Hugh: observe at page fault time? Vlastimil: if there are no faults because the memory is already present, there would not be an observation event
  - Johannes: wait for someone to free a THP?
  - maybe background scanning is still best?
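
As a concrete reference for the madvise-based policy mentioned above, here is a minimal userspace sketch (my own illustration, not code from the session) that opts an anonymous mapping into THP with MADV_HUGEPAGE; the file-backed equivalent discussed in the session would depend on the tmpfs/ext4 work described above.

/* Minimal sketch: opt an anonymous mapping into transparent huge pages.
 * MADV_HUGEPAGE only expresses a preference; the kernel may still back
 * the range with 4kB pages and let khugepaged collapse them later. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4UL << 20;        /* 4MB: room for two 2MB huge pages */

        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Ask for huge pages on this range (needs CONFIG_TRANSPARENT_HUGEPAGE). */
        if (madvise(buf, len, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        /* Touch the memory so it actually gets faulted in. */
        for (size_t i = 0; i < len; i += 4096)
                ((char *)buf)[i] = 1;

        munmap(buf, len);
        return 0;
}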

merge plans
- Hugh would like to merge team pages now, switch to compound pages later
- Kirill would like to get compound pages into shape first, then merge things
- Andrea: if we go with team pages, we should ensure it is the right solution for both anonymous memory and ext4
- Andrea: can we integrate the best parts of both code bases and merge that?
- Mel: one of my patch series is heavily colliding with team pages (moving accounting from zones to nodes)
- Andrew: need a decision on team pages vs compound pages
- Hugh: if compound pages went in first, we would not replace them with team pages later - but the other way around might happen

merge blockers
- compound page issues: small-file memory waste, fast recovery for small files, get khugepaged into shape, maybe deposit/withdrawal, demonstrate recovery, demonstrate robustness (or Hugh demonstrates brokenness)
- team page issues: recovery (khugepaged cannot collapse team pages), anonymous memory support (Hugh: pretty sure it is possible), API compatible to test compound, don't use page->private, path forward for other filesystems
- revert team page patches from MMOTM until the blockers are addressed

GFP flags
- __GFP_REPEAT
  - fuzzy semantics: keep retrying until an allocation succeeds
  - meant for higher order allocations
  - but mostly used for order 0... (not useful)
  - can be cleaned up to get a useful semantic for higher order allocations: "can fail, try hard to be successful, but could still fail in the end"
- __GFP_NORETRY - fail after a single attempt to reclaim something; not very helpful except for optimistic/opportunistic allocations (see the sketch after this section)
- maybe have __GFP_BEST_EFFORT, try until a certain point then give up? (retry until OOM, then fail?)
- remove __GFP_REPEAT from non-costly allocations
- introduce a new flag, use it where useful
- can the allocator know compaction was deferred?
- more explicit flags? NORECLAIM, NOKSWAPD, NOCOMPACT, NO_OOM, etc.
  - use explicit flags to switch stuff off
  - clameter: have default definitions with all the "normal stuff" enabled
- flags are inconsistent - sometimes positive, sometimes negative, sometimes for common things, sometimes for uncommon things
- THP allocation is not explicit, but inferred from certain flags
- consensus on cleaning up GFP usage
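
To make the flag semantics concrete, here is a minimal kernel-style sketch (my own illustration, not code from the session) of the opportunistic pattern __GFP_NORETRY is useful for: try a high-order allocation once, without long retries, and fall back to order-0 pages when it fails. The alloc_buffer() name is hypothetical.

/* Sketch of an opportunistic high-order allocation.  __GFP_NORETRY makes
 * the allocator give up after a single reclaim/compaction attempt instead
 * of retrying, so failure here is cheap and we can fall back to 4kB pages. */
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *alloc_buffer(unsigned int *order)
{
        struct page *page;

        /* One-shot attempt at a contiguous 2^order block. */
        page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN, *order);
        if (page)
                return page;

        /* Cheap failure: fall back to a single 4kB page, which an order-0
         * GFP_KERNEL allocation will try much harder to satisfy. */
        *order = 0;
        return alloc_pages(GFP_KERNEL, 0);
}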

CMA
- KVM on PPC64 runs into a strange hardware requirement
  - needs contiguous memory for certain data structures
- tried to reduce fragmentation/allocation issues with ZONE_CMA
  - atomic order-0 allocations fail early, due to kswapd not kicking in on time
  - taking pages out of the CMA zone first
  - compaction does not move movable compound pages (e.g. THP), breaking CMA in ZONE_CMA
  - mlock and other things pinning allocated-as-movable pages also break CMA
- what to do instead of ZONE_CMA?
  - how to keep things movable? sticky MIGRATE_MOVABLE zones?
  - do not allow reclaimable & unmovable allocations in sticky MIGRATE_MOVABLE zones
  - memory hotplug has similar requirements to CMA, no need for a new name
  - need something like physical memory linear reclaim, finding sticky MIGRATE_MOVABLE zones and reclaiming everything inside
- Mel: would like to see ZONE_CMA and ZONE_MOVABLE go away
- FOLL_MIGRATE get_user_pages flag to migrate pages away from the movable region when they are pinned
  - should be handled by core code, get_user_pages

Compaction, higher order allocations
- compaction not invoked from THP allocations with the delayed fragmentation patch set
- kcompactd daemon for background compaction
  - should kcompactd do fast direct reclaim? let's see
  - cooperation with OOM
- compaction - hard to get useful feedback about
  - compaction "does random things, returns with random answers"
  - no notion of "costly allocations"
  - compaction can keep indefinitely deferring action, even for smaller allocations (e.g. order 2)
- sometimes compaction finds too many page blocks with the skip bit set
  - success rate of compaction skyrocketed with skip bits ignored (stale skip bits?)
- migration skips over MIGRATE_UNMOVABLE page blocks found during order-9 compaction
  - a page block may be perfectly suitable for smaller order compaction
  - have THP skip more aggressively, while order 2 scans inside more page blocks
  - priority for compaction code? aggressiveness of diving into blocks vs skipping
- order-9 allocators:
  - THP - wants the allocation to fail quickly if no order-9 page is available
  - hugetlbfs - really wants allocations to succeed

VM containers
- VMs imply more memory consumption than the applications running in them need
- how to pressure the guest to give back memory to the host?
- adding a new shrinker did not seem to perform well
- move the page cache to the host so it would be easier to reclaim memory for all guests
- move memory management from the guest kernel to the host, some kind of memory controller
- have the guest tell the host how to reclaim, sharing LRU information for instance
- mmu_notifier already shares some information via the accessed bit (young), but mmu_notifier is too coarse
- DAX (in the guest) should be fine to solve filesystem memory
  - if not DAX-backed on the host, needs a new mechanism for IO barriers, etc.
- FUSE driver in the guest and move the filesystem to the host
- exchange memory pressure between guest and host so that the host can ask the guest to adjust its pressure depending on the overall situation of the host

Generic page-pool recycle facility
- found bottlenecks in both the page allocator and the DMA APIs
- "packet-page" / explicit data path API
- make it generic across multiple use cases
  - get rid of open coded driver approaches (a sketch of such an approach follows this section)
- Mel: make the per-cpu allocator fast enough to act as the page pool
  - gets NUMA locality, shrinking, etc. all for free
- needs pool sizing for used pool items, too - can't keep collecting incoming packets without handling them
- allow the page allocator to reclaim memory
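
For context, here is a deliberately simplified sketch (my own, with hypothetical "pkt_pool" names; real drivers differ in detail) of the kind of open-coded per-driver page recycling this session wants to replace with a generic facility: a small array of pages that the receive path refills from and returns to, falling back to the page allocator when empty.

/* Simplified open-coded driver page pool (hypothetical names).  A generic
 * facility would replace this per-driver code and let the page allocator
 * reclaim pooled pages under pressure.  pool->lock is assumed to have been
 * initialized with spin_lock_init() at setup time. */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

#define PKT_POOL_SIZE 256

struct pkt_pool {
        spinlock_t lock;
        unsigned int count;
        struct page *pages[PKT_POOL_SIZE];
};

/* RX path: prefer a recycled page, fall back to the page allocator. */
static struct page *pkt_pool_get(struct pkt_pool *pool)
{
        struct page *page = NULL;

        spin_lock(&pool->lock);
        if (pool->count)
                page = pool->pages[--pool->count];
        spin_unlock(&pool->lock);

        if (!page)
                page = alloc_page(GFP_ATOMIC);
        return page;
}

/* Completion path: recycle into the pool, or give the page back when full. */
static void pkt_pool_put(struct pkt_pool *pool, struct page *page)
{
        spin_lock(&pool->lock);
        if (pool->count < PKT_POOL_SIZE) {
                pool->pages[pool->count++] = page;
                page = NULL;
        }
        spin_unlock(&pool->lock);

        if (page)
                put_page(page);
}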

Address Space Mirroring
- Haswell-EX allows memory mirroring, of part or all of memory
- goal: improve high availability by avoiding uncorrectable errors in kernel memory
- partial mirroring has higher remaining memory capacity, but is not software transparent
  - some memory mirrored, some not
- mirrored memory is set up in the BIOS, the amount in each NUMA node proportional to the amount of memory in that node
- mirror range info is in the EFI memory map
- avoid kernel allocations from non-mirrored memory ranges, avoid ZONE_MOVABLE allocations
- put user allocations in non-mirrored memory, avoid ZONE_NORMAL allocations
- MADV_MIRROR to put certain user memory in mirrored memory
  - problem: to put a whole program in mirrored memory, need to relocate libraries into mirrored memory
  - what is the value proposition of mirroring user space memory?
- policy: when mirrored memory is requested, do not fall back to non-mirrored memory
  - Michal: is this desired?
- Aneesh: how should we represent mirrored memory? zones? something else?
- Michal: we are back to the highmem problem
  - lesson from the highmem era: keep the ratio of kernel to non-kernel memory low enough, below 1:4
- how much userspace needs to be in mirrored memory, in order to be able to restart applications?
- should we have opt-out for mirrored instead of opt-in?
- proposed interface: prctl
- kcore mirror code is upstream already
- Mel: systems using lots of ZONE_MOVABLE have problems, and are often unstable
- Mel: assuming userspace can figure out the right thing to choose what needs to be mirrored is not safe
- Vlastimil: use non-mirrored memory as frontswap only, put all managed memory in mirrored memory
- dwmw2: for a workload of "guests we care about, guests we don't care about", we can put the guest memory of unimportant guests in non-mirrored memory
- Mel: even in that scenario a non-important guest's kernel allocations could exhaust mirrored memory
- Mel: partial mirroring makes a promise of reliability that it cannot deliver on
  - false hope
  - complex configuration makes the system less reliable
- Andrea: memory hotplug & other ZONE_MOVABLE users already cause the same problems today

Heterogeneous Memory Management
- used for GPUs, CAPI, other kinds of offload engines
- GPUs have much faster memory than system RAM
  - to get performance, GPU offload data needs to sit in VRAM
- a shared address space creates an easier programming model
- needs the ability to migrate memory between system RAM and VRAM
  - CPU cannot access VRAM
  - GPU can access system RAM ... very very slowly
- hardware is coming up real soon (this year)
- without HMM
  - GPU workloads run 10-100x slower
  - need to pin lots of system memory (16GB per device?)
  - use of mmu_notifier spreads to device drivers, instead of one common solution
- special swap type to handle migration
- future OpenCL API wants address space sharing
- HMM has some core VM impact, but it is relatively contained
- how to get HMM upstream? does anybody have objections to anything in HMM?
  - split up into several series
  - Andrew: put more info in the changelogs
- space for future optimizations
- dwmw2: the svm API should move to a generic API
  - intel_svm_bind_mm - bind the current process to a PASID

MM validation & debugging
- Sasha using KASAN on locking, to trap missed locks
  - requires annotation of what memory is locked by a lock
  - how to annotate what memory is protected by a lock? (see the sketch after this section)
  - Kirill: what about a struct with a lock inside?
  - annotate struct members with which lock protects them?
  - too much work
- trying to improve hugepage testing
  - split_all_huge_pages
  - expose a list of huge pages through debugfs, allow splitting arbitrarily chosen ones
- fuzzer to open, close, read & write random files in sysfs & debugfs
- how to coordinate security(?) issues with zero-day security folks?
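
As a rough illustration of the annotation question above (my own sketch, not a proposal from the session): a hypothetical __protected_by() marker documents which lock covers which fields, and the existing lockdep_assert_held() can at least enforce the rule at run time in the accessors.

/* Hypothetical annotation sketch.  __protected_by() does not exist in the
 * kernel; it only marks, for a reader or a future tool, which lock guards
 * which fields.  lockdep_assert_held() is real and catches missed locks
 * at run time when lockdep is enabled. */
#include <linux/lockdep.h>
#include <linux/spinlock.h>

#define __protected_by(lock)        /* annotation only, expands to nothing */

struct counter_set {
        spinlock_t lock;
        unsigned long events __protected_by(lock);
        unsigned long errors __protected_by(lock);
};

static void counter_add_event(struct counter_set *c)
{
        lockdep_assert_held(&c->lock);  /* complain if the caller forgot the lock */
        c->events++;
}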

Memory cgroups
- how to figure out the memory a cgroup needs (as opposed to what it currently uses)?
  - memory pressure is not enough to determine the needs of a cgroup
- cgroups are scanned in equal portions
  - unfair: streaming file IO can result in using lots of memory, even when the cgroup has mostly inactive file pages
- potential solution: dynamically balance the cgroups
  - adjust limits dynamically, based on their memory pressure
  - problem: how to detect memory pressure? when to increase memory? when to decrease memory?
- real time aging of the various LRU lists
  - only for the active / anon lists, not the inactive file list
  - "keep cgroup data in memory if its working set is younger than X seconds"
- refault info: distinguish between refaults (working set faulted back in) and evictions of data that is only used once
  - can be used to know when to grow a cgroup, but not when to shrink it
- vmpressure API: does not work well on very large systems, only on smaller ones
  - quickly reaches "critical" levels on large systems that are not even that busy
- Johannes: time-based statistic to measure how much time processes wait for IO
  - not iowait, which measures how long the _system_ waits, but per-task
  - add refault info in, only count time spent on refaults
  - wait time above threshold? grow the cgroup
  - wait time under threshold? shrink the cgroup, but not below its lower limit
- Larry: Docker people want per-cgroup vmstat info

TLB flush optimizations
- mmu_gather side of the TLB flush (pattern sketched after this section)
  - collect invalidations, gather items to flush
  - patch: increase the size of mmu_gather, and try to flush more at once
- Andrea - rmap length scalability issues
  - too many KSM pages merged together, the rmap chain becomes too long
  - put an upper limit on the number of shares of a KSM page (256 share limit)
- mmu_notifiers batch flush interface?
- limit max_page_sharing to reduce KSM rmap chain length
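
For reference, a minimal sketch of the mmu_gather batching pattern on the unmap path that the size-increase patch targets (my own illustration with function signatures as in kernels around 4.5, not the patch itself):

/* Unmap-side mmu_gather pattern: pages to be freed are gathered, the TLB
 * is flushed (ideally once, in a large batch), and only then are the pages
 * actually freed.  A larger mmu_gather means fewer intermediate flushes. */
#include <asm/tlb.h>
#include <linux/mm.h>

static void unmap_and_free_range(struct mm_struct *mm,
                                 struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end)
{
        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, mm, start, end);   /* start batching */
        unmap_vmas(&tlb, vma, start, end);      /* clears PTEs, queues pages */
        tlb_finish_mmu(&tlb, start, end);       /* flush TLB, free queued pages */
}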

OOM killer
- goal: make OOM invocation more deterministic
- currently: reclaim until there is nothing left to reclaim, then invoke the OOM killer
- problem: sometimes reclaim gets stuck, and the OOM killer is not invoked when it should be
  - one single page freed resets the OOM counter, causing livelock
  - thrashing is not detected; on the contrary, this helps thrashing happen
- make things more conservative?
  - OOM killer invoked on heavy thrashing and no progress made in the VM
- OOM reaper - to free resources before the OOM-killed task can exit by itself
- a timeout based solution is not trivial; doable, but not preferred by Michal
  - if Johannes can make a timeout scheme deterministic, Michal has no objections
  - Michal: I think we can do better without a timer solution
  - need a deterministic way to put the system into a consistent state
- tmpfs vs the OOM killer
  - the OOM killer cannot discard tmpfs files
  - with cgroups, reap a giant tmpfs file anyway in special cases at Google
  - restart the whole container, dump the container's tmpfs contents

MM tree workflow
- most of Andrew's job: solicit feedback from people
- the -mm git tree helps many people
- Michal: would like email message ID references in patches, both for original patches and fixes
  - the value of -fix patches is that previous reviews do not need to be re-done
  - sometimes a replacement patch is easier
- Kirill: sometimes difficult to get patch sets reviewed
- acked-by and reviewed-by lines generally get added by hand
- Michal: the -mm tree is the maintainer tree of last resort
- Andrew: carrying those extra patches isn't too much work

SLUB optimizations lightning talk
- bulk APIs for SLUB + SLAB
  - kmem_cache_{alloc,free}_bulk()
  - kfree_bulk()
  - 60% speedup measured
  - can be used from networking, RCU free, ... (usage sketched below)
- per-CPU freelist per page
  - nice speedup, but still suffers from a race condition
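
To show what the bulk API looks like from a caller's point of view, here is a small kernel-style sketch (my own, with a hypothetical "my_obj" cache passed in by the caller) using kmem_cache_alloc_bulk()/kmem_cache_free_bulk(); the networking and RCU-free call sites mentioned above follow the same pattern.

/* Bulk allocation/free cycle.  kmem_cache_alloc_bulk() fills the array and
 * returns the number of objects allocated (0 on failure);
 * kmem_cache_free_bulk() returns the whole batch in one call, which is
 * where the measured speedup comes from. */
#include <linux/errno.h>
#include <linux/slab.h>

#define BATCH 16

struct my_obj {
        unsigned long data;
};

static int demo_bulk(struct kmem_cache *cache)
{
        void *objs[BATCH];
        int n, i;

        n = kmem_cache_alloc_bulk(cache, GFP_KERNEL, BATCH, objs);
        if (!n)
                return -ENOMEM;

        for (i = 0; i < n; i++)
                ((struct my_obj *)objs[i])->data = i;

        kmem_cache_free_bulk(cache, n, objs);
        return 0;
}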