From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
To: Daniel Jordan <daniel.m.jordan@oracle.com>,
lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, steven.sistare@oracle.com,
pasha.tatashin@oracle.com, yossi.lev@oracle.com,
Dave.Dice@oracle.com, akpm@linux-foundation.org,
mhocko@kernel.org, dave@stgolabs.net,
khandual@linux.vnet.ibm.com, ak@linux.intel.com, mgorman@suse.de
Subject: Re: [LSF/MM TOPIC] lru_lock scalability
Date: Fri, 2 Feb 2018 12:02:10 +0100 [thread overview]
Message-ID: <471f59cc-b4f7-462a-b6e5-064aaf132132@linux.vnet.ibm.com> (raw)
In-Reply-To: <2a16be43-0757-d342-abfb-d4d043922da9@oracle.com>
Hi Daniel,
On 01/02/2018 05:44, Daniel Jordan wrote:
> I'd like to propose a discussion of lru_lock scalability on the mm track.A
> Since this is similar to Laurent Dufour's mmap_sem topic, it might make
> sense to discuss these around the same time.
I do agree, the scalability issue is not only due the mmap_sem, having a
larger discussion on this topic would make sense.
Cheers,
Laurent.
>
> On large systems, lru_lock is one of the hottest locks in the kernel,
> showing up on many memory-intensive benchmarks such as decision support.A
> It also inhibits scalability in many of the mm paths that could be
> parallelized, such as freeing pages during exit/munmap and inode eviction.
>
> I'd like to discuss the following two ways of solving this problem, as well
> as any other approaches or ideas people have.
>
>
> 1) LRU batches
> --------------
>
> This method, developed with Steve Sistare, is described in this RFC series:
>
> A https://lkml.org/lkml/2018/1/31/659
>
> It introduces an array of locks, with each lock protecting certain batches
> of LRU pages.
>
> A A A A A A *ooooooooooo**ooooooooooo**ooooooooooo**oooo ...
> A A A A A A |A A A A A A A A A A ||A A A A A A A A A A ||A A A A A A A A A A ||
> A A A A A A A \ batch 1 /A \ batch 2 /A \ batch 3 /
>
> In this ASCII depiction of an LRU, a page is represented with either '*' or
> 'o'.A An asterisk indicates a sentinel page, which is a page at the edge of
> a batch.A An 'o' indicates a non-sentinel page.
>
> To remove a non-sentinel LRU page, only one lock from the array is
> required.A This allows multiple threads to remove pages from different
> batches simultaneously.A A sentinel page requires lru_lock in addition to a
> lock from the array.
>
> Full performance numbers appear in the last patch in the linked series
> above, but this prototype allows a microbenchmark to do up to 28% more page
> faults per second with 16 or more concurrent processes.
>
>
> 2) Locking individual pages
> ---------------------------
>
> This work, developed in collaboration with Yossi Lev and Dave Dice, locks
> individual pages on the LRU for removal.A It converts lru_lock from a
> spinlock to a rw_lock, using the read mode for concurrent removals and the
> write mode for other operations, i.e. those that need exclusive access to
> the LRU.
>
> We hope to have a prototype and performance numbers closer to the time of
> the summit.
>
> Here's a more detailed description of this approach from Yossi Lev, with
> code snippets to support the ideas.
>
> /**
> * The proposed algorithm for concurrent removal of pages from an LRU list
> * relies on the ability to lock individual pages during removals.A In the
> * implementation below, we support that by storing the NULL value in the next
> * field of the page's lru field.A We can do that because the next field value
> * is not needed once a page is locked. If for some reason we need to maintain
> * the value of the next pointer during page removal, we could use its LSB for
> * the locking purpose, or any other bit in the page that we designate for this
> * purpose.
> */
>
> #define PAGE_LOCKED_VALA A A NULL
>
> #define IS_LOCKED(lru)A A A A (lru.next == PAGE_LOCKED_VAL)
>
>
> /**
> * We change the type of lru_lock from a regular spin lock (spinlock_t) to a
> read-write
> * spin lock (rwlock_t).A Locking in read mode allows some operations to run
> * concurrently with each other, as long as functions that requires exclusive
> * access do not hold the lock in write mode.
> */
> typedef struct pglist_data {
> A A // ...
>
> A A /* Write-intensive fields used by page reclaim */
> A A ZONE_PADDING(_pad1_)
> A A rwlock_tA A A lru_lock;
>
> A A // ...
> } pg_data_t;
>
> static inline rwlock_t *zone_lru_lock(struct zone *zone) {
> A A A A return &zone->zone_pgdat->lru_lock;
> }
>
>
> /**
> * A concurrent variant of del_page_from_lru_list
> *
> * Unlike the regular del_page_from_lru_list function, which must be called
> while
> * the lru_lock is held exclusively, here we are allowed to hold the lock in
> read
> * mode, allowing concurrent runs of the del_page_from_lru_list_concurrent
> function.
> *
> * The implementation assumes that the lru_lock is held in either read or write
> * mode on function entry.
> */
> void del_page_from_lru_list_concurrent(struct page *page,
> A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A struct lruvec *lruvec,
> A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A enum lru_list lru) {
> A A /**
> A A A * A removal of a page from the lru list only requires changing the prev
> and
> A A A * next pointers in its neighboring pages.A Thus, the only case where we
> A A A * need to synchronize concurrent removal of pages is when the pages are
> A A A * adjacent on the LRU.
> A A A *
> A A A * The key idea of the algorithm is to deal with this case by locking
> A A A * individual pages, and requiring a thread that removes a page to acquire
> A A A * the locks on both the page that it removes, and the predecessor of that
> A A A * page. Specifically, the removal of page P1 with a predecessor P0 in the
> A A A * LRU list requires locking both P1 and P0, and keeps P1 locked until P1's
> A A A * removal is completed.A Only then the lock on P0 is released, allowing it
> A A A * to be removed as well.
> A A A *
> A A A * Note that while P1 is removed by thread T, T can manipulate the lru.prev
> A A A * pointer of P1's successor page, even though T does not hold the lock on
> A A A * that page.A This is safe because the only thread other than T that
> may be
> A A A * accessing the lru.prev pointer of P1's successor page is a thread that
> A A A * tries to remove that page, and that thread cannot proceed with the
> A A A * removal (and update the lru.prev of the successor page) as long as P1 is
> A A A * the predecessor, and is locked by T.
> A A A *
> A A A * Since we expect the case of concurrent removal of adjacent pages to be
> A A A * rare, locking of adjacent pages is expected to succeed without
> waiting in
> A A A * the common case; thus we expect very little or no contention during
> A A A * parallel removal of pages from the list.
> A A A *
> A A A * Implementation note: having the information of whether a page is locked
> A A A * be encoded in the list_head structure simplifies the function code a
> bit,
> A A A * as it does not need to deal differently with the corner case where we
> A A A * remove the page at the front of the list (i.e. the most recently used
> A A A * page); this is because the pointers to the head and tail of a list also
> A A A * reside in a field of type list_head (stored in lruvec), so the
> A A A * implementation treats this field as if it is the lru field of a page,
> and
> A A A * locks it as the predecessor of the head page when it is removed.
> A A A */
>
> A A /**
> A A A * Step 1: lock our page (i.e. the page we need to remove); we only fail to
> A A A * do so if our successor is in the process of being removed, in which case
> A A A * we wait for its removal to finish, which will unlock our page.
> A A A */
> A A struct list_head *successor_p = READ_ONCE(page->lru.next);
> A A while (successor_p == PAGE_LOCKED_VAL ||
> A A A A A A A A A cmpxchg(&page->lru.next,
> A A A A A A A A A A A A A A A A A successor_p, PAGE_LOCKED_VAL) != successor_p) {
>
> A A A A A A /* our successor is being removed, wait for it to finish */
> A A A A A A cpu_relax();
> A A A A A A successor_p = READ_ONCE(page->lru.next)
> A A }
>
> A A /**
> A A A * Step 2: our page is locked, successor_p holds the pointer to our
> A A A * successor, which cannot be removed while our page is locked (as we are
> A A A * the predecessor of that page).
> A A A * Try locking our predecessor. Locking will only fail if our
> predecessor is
> A A A * being removed (or was already removed), in which case we wait for its
> A A A * removal to complete, and continue by trying to lock our new predecessor.
> A A A *
> A A A * Notes:
> A A A *
> A A A * 1. We detect that the predecessor is locked by checking whether its next
> A A A * pointer points to our page's node (i.e., our lru field).A If a thread is
> A A A * in the process of removing our predecessor, the test will fail until we
> A A A * have a new predecessor that is not in the process of being removed.
> A A A * We therefore need to keep reading the value stored in our lru.prev field
> A A A * every time an attempt to lock the predecessor fails.
> A A A *
> A A A * 2. The thread that removes our predecessor can update our lru.prev field
> A A A * even though it doesn't hold the lock on our page.A The reason is that we
> A A A * only need to reference the lru.prev pointer of a page when we remove it
> A A A * from the list, our page, is only used by a thread that removes our page,
> A A A * and that thread cannot proceed with the removal as long as the
> A A A * predecessor is
> A A A *
> A A A * 3. This is the part of the code that is responsible to check for the
> A A A * corner case of the head page removal if locking a page was not be a
> A A A * simple operation on a list_head field, because the head page does not
> A A A * have a predecessor page that we can lock. We can avoid dealing with this
> A A A * corner case because a) we lock a page simply by manipulating the next
> A A A * pointer in a field of type list_head, and b) the lru field of the head
> A A A * page points to an instance of the list_head type (this instance is not
> A A A * part of a page, but it still has a next pointer that points to the head
> A A A * page, so we can lock and unlock it just as if it is the lru field of a
> A A A * predecessor page).
> A A A *
> A A A */
> A A struct list_head *predecessor_p = READ_ONCE(page->lru.prev);
> A A while (predecessor_p->next != &page->lru ||
> A A A A A A A A A cmpxchg(&predecessor_p->next,
> A A A A A A A A A A A A A A A A A &page->lru, PAGE_LOCKED_VAL) != &page->lru) {
> A A A A A A /**
> A A A A A A A * Our predecessor is being removed; wait till we have a new unlocked
> A A A A A A A * predecessor
> A A A A A A A */
> A A A A A A cpu_relax();
> A A A A A A predecessor_p = READ_ONCE(page->lru.prev);
> A A }
>
> A A /**
> A A A * Step 3: we now hold the lock on both our page (the one we need to
> remove)
> A A A * and our predecessor page:A link together our successor and predecessor
> A A A * pages by updating their prev and next pointers, respectively.A At that
> A A A * point our node is removed, and we can safely release the lock on it by
> A A A * updating its next pointer.
> A A A *
> A A A * Critically, we have to update the successor prev pointer _before_
> A A A * updating the predecessor next pointer. The reason is that updating the
> A A A * next pointer of a page unlocks that page, and unlocking our predecessor
> A A A * page will allow it to be removed.A Thus, we have to make sure that both
> A A A * pages are properly linked together before releasing the lock on our
> A A A * predecessor.
> A A A *
> A A A * We guarantee that by setting the successor prev pointer first, which
> will
> A A A * now point to our predecessor while it is still locked.A Thus, neither of
> A A A * these pages can yet be removed. Only then we set the predecessor next
> A A A * pointer, which also relinquishes our acquisition of its lock.
> A A A *
> A A A * For similar reasons, we only release the lock on our own node after the
> A A A * successor points to its new predecessor.A (With the way the code is
> A A A * written above, this is not as critical because the successor will still
> A A A * fail to remove itself as long as our next pointer does not point to it,
> A A A * so having any value other than the pointer to the successor in our next
> A A A * field is safe.A That said, there is no reason to touch our next (or
> prev)
> A A A * pointers before we're done with the page removal.
> A A A */
> A A WRITE_ONCE(successor_p->prev, predecessor_p);
> A A WRITE_ONCE(predecessor_p->next, successor_p);
>
> A A /**
> A A A * Cleanup (also releases the lock on our page).
> A A A */
> A A page->lru.next = LIST_POISON1;
> A A page->lru.prev = LIST_POISON2;
> }
>
> /*
> A A A ------------------
> A A A Further Discussion
> A A A ------------------
>
> A A A As described above, the key observation behind the algorithm is that
> simple
> A A A lru page removal operations can proceed in parallel with each other with
> A A A very little per-page synchronization using the compare-and-swap (cmpxchg)
> A A A primitives.A This can only be done safely, though, as long as other
> A A A operations do not access the lru list at the same time.
> A A A To enable concurrent removals only when it is safe to do so, we replace
> the
> A A A spin lock that is used to protect the lru list (aka lru_lock) with a
> A A A read-write lock (rwlock_t), that can be acquired in either exclusive mode
> A A A (write acquisition), or shared mode (read-acquisition). Using the rwlock
> A A A allows us to run many page removal operations in parallel, as long as no
> A A A other operations that requires exclusive operations are being run.
>
> A A A Below we discuss a few key aspects properties of the algorithm.
>
> A A A Safety and synchronization overview
> A A A ===================================
>
> A A A 1. Safe concurrent page removal: because page removals from the list do
> not
> A A A need to traverse the list, but only manipulate the prev and next
> pointers of
> A A A their neighboring pages, most removals can be done in parallel without
> A A A synchronizing which each other; however, if two pages that are adjacent to
> A A A each other in the list are being removed concurrently, some care should be
> A A A taken to ensure that we correctly handle the data race over the pages'
> A A A lru.next and lru.prev fields.
>
> A A A The high level idea of synchronizing page removal is simple: we allow
> A A A individual pages to be locked, and in order to remove a page P1 with a
> A A A predecessor P0 and successor P2 (i.e., P0 <-> P1 <-> P2), the thread
> A A A removing P1 has to lock both P1 and P0 (i.e., the page to be removed, and
> A A A its predecessor).A Note that even though a thread T that removes P1 is not
> A A A holding the lock of P2, P2 cannot be removed while T is removing P1, as P1
> A A A is the predecessor of P2 in the list, and it is locked by T while it is
> A A A being removed.A Thus, once T is holding the locks on both P1 and P0, it
> can
> A A A manipulate the next and prev pointers between all 3 nodes.A The details of
> A A A how the locking protocol and the pointer manipulation is handled are
> A A A described in the code documentation above.
>
> A A A 2. Synchronization between page removals and other operations on the list:
> A A A as mentioned above, the use of a read-write lock guarantees that only
> A A A removal operations are running concurrently together, and the list is not
> A A A being accessed by any other operation during that time.A We note, though,
> A A A that a thread is allowed to acquire the read-write lock in exclusive mode
> A A A (that is, acquire it for writing), which guarantees that it has exclusive
> A A A access to the list. This can be used not only as a contention control
> A A A mechanism (for extreme cases where most concurrent removal operations are
> A A A for adjacent nodes), but also as a barrier to guarantee that all removal
> A A A operations are completed).A This can be useful, for example, if we want to
> A A A reclaim the memory used by previously removed pages and store random
> data in
> A A A it, that cannot be safely observed by a thread during a removal operation.
> A A A A simple write acquisition of the new lru lock will guarantee that all
> A A A threads that are in the midst of removing pages completes their operation
> A A A before the lock is being acquired in exclusive mode.
>
> A A A Progress
> A A A ========
>
> A A A 1. Synchronization granularity: the synchronization mechanism described
> A A A above is extremely fine grained --- a removal of a page only requires
> A A A holding the lock on that page and its predecessor.A Thus, in the
> example of
> A A A a list with pages P0<->P1<->P2 in it, the removal of P1 can only delay the
> A A A removal of P0 and P2, but not the removal of P2's successor or P0's
> A A A predecessor; those can take place concurrently with the removal of P1.
>
> A A A Because we expect concurrent removal of adjacent pages to be rare, we
> expect
> A A A to see very little contention under normal circumstances, and for most
> A A A remove operations to be executed in parallel without needing to wait
> for one
> A A A another.
>
> A A A 2. Deadlock and live locks: the algorithm is not susceptible to either
> A A A deadlocks or livelocks.A The former cannot occur because when a thread can
> A A A only block the removal of its successor node, and the first (head) page
> can
> A A A always be removed without waiting as it does not have a real predecessor
> A A A (see comments in the code above on handling this case).A As for live
> locks,
> A A A thread that acquired a lock never releases it before they finish their
> A A A removal operation.A Thus, threads never "step back" to yield for other
> A A A threads, and can only make progress towards finishing the removal of their
> A A A page, so we should never be in a live lock scenario.
>
> A A A 3. Removal vs. other operations on the list: the lru read-write lock in
> write
> A A A mode prevents any concurrency between operations on the list other than
> page
> A A A removal.A However, in theory, a stream of removal operations can starve
> A A A other operation that require exclusive access to the list; this may happen
> A A A if the removal operations are allowed to keep acquiring and releasing the
> A A A lock in shared mode, keeping the system in a state where there is
> always at
> A A A least one removal operation in progress.
> A A A It is critical, then, that we use a read-write lock that supports an
> A A A anti-starvation mechanism, and in particular protects the writers from
> being
> A A A starved by readers.A Almost all read write locks that are in use today
> have
> A A A such a mechanism; a common idiom is that once a thread is waiting for the
> A A A lock to be acquired in exclusive mode, threads that are trying to acquire
> A A A the lock in shared mode are delayed until the thread that requires
> exclusive
> A A A access acquires and releases the lock for writing.
>
> A A A 4. Handling extreme contention cases: in extreme cases that incur high
> wait
> A A A time for locking individual pages, threads that are waiting to remove a
> page
> A A A can always release their shared ownership of the lru lock, and request an
> A A A exclusive ownership of it instead (that is, acquire the lock in for
> A A A writing).A A successful write acquisition disallows any other operations
> A A A (removals or others) to run in parallel with that thread, but it may be
> used
> A A A as a fall back solution in extreme cases of high contention on a small
> A A A fragment of the list, where a single acquisition of the global list
> lock may
> A A A prove to be more beneficial than a series of individual pages locking.
>
> A A A While we do not expect to see it happening in practice, it is important to
> A A A keep that option in mind, and if necessary implement a fall back mechanism
> A A A that detects the case where most remove operations are waiting to acquire
> A A A individual locks, and signal them to move to a serial execution mode (one
> A A A removal operation at a time).
>
> A A A Integration and compatibility with existing code
> A A A ================================================
>
> A A A One important property of the new algorithm is that it requires very few
> A A A local changes to existing code.A The new concurrent function for page
> A A A removal is very short, and the only global change is replacing the lru
> spin
> A A A lock with a read write lock, and changing the acquisition for the
> purpose of
> A A A page removal to a read acquisition.
>
> A A A Furthermore, the implementation suggested above does not add any
> additional
> A A A fields or change the structure of existing data structures, and hence the
> A A A old, serial page removal function is still compatible with it. Thus, if a
> A A A page removal operation needs to acquire the lock exclusively for some
> A A A reason, it can simply call the regular, serial page removal function, and
> A A A avoid the additional synchronization overhead of the new parallel variant.
> */
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
prev parent reply other threads:[~2018-02-02 11:02 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-01 4:44 Daniel Jordan
2018-02-01 9:44 ` Matthew Wilcox
2018-02-02 4:07 ` Daniel Jordan
2018-02-02 17:00 ` Matthew Wilcox
2018-02-06 15:33 ` Matthew Wilcox
2018-02-08 13:33 ` Daniel Jordan
2018-02-08 13:38 ` Matthew Wilcox
2018-02-02 11:02 ` Laurent Dufour [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=471f59cc-b4f7-462a-b6e5-064aaf132132@linux.vnet.ibm.com \
--to=ldufour@linux.vnet.ibm.com \
--cc=Dave.Dice@oracle.com \
--cc=ak@linux.intel.com \
--cc=akpm@linux-foundation.org \
--cc=daniel.m.jordan@oracle.com \
--cc=dave@stgolabs.net \
--cc=khandual@linux.vnet.ibm.com \
--cc=linux-mm@kvack.org \
--cc=lsf-pc@lists.linux-foundation.org \
--cc=mgorman@suse.de \
--cc=mhocko@kernel.org \
--cc=pasha.tatashin@oracle.com \
--cc=steven.sistare@oracle.com \
--cc=yossi.lev@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox