From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E202FC433FE for ; Wed, 10 Nov 2021 18:54:52 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 8968C61178 for ; Wed, 10 Nov 2021 18:54:52 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 8968C61178 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id C68F06B0080; Wed, 10 Nov 2021 13:54:46 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id C18DE6B0081; Wed, 10 Nov 2021 13:54:46 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id ABB4B6B0082; Wed, 10 Nov 2021 13:54:46 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0163.hostedemail.com [216.40.44.163]) by kanga.kvack.org (Postfix) with ESMTP id 94D896B0080 for ; Wed, 10 Nov 2021 13:54:46 -0500 (EST) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 1F8FA811FA for ; Wed, 10 Nov 2021 18:54:46 +0000 (UTC) X-FDA: 78793922172.09.520D800 Received: from mail-pg1-f172.google.com (mail-pg1-f172.google.com [209.85.215.172]) by imf06.hostedemail.com (Postfix) with ESMTP id B0BA3801A8B8 for ; Wed, 10 Nov 2021 18:54:45 +0000 (UTC) Received: by mail-pg1-f172.google.com with SMTP id r132so907406pgr.9 for ; Wed, 10 Nov 2021 10:54:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=3szfw6jRsFyxQTVEG1ikPVQdN5P2dWMh3dSumqMxVhs=; b=Ol5iGuBRZcb8f/Rh9uFSTkYYeFxW7djYwpECqCCgZ6MCvWOSGZeoNoVhLRL9OGsFMU ZqKhFtNJ9AQvOgxtOscXAsWYsr59aDWN8S/RvMJhx6FYVboF0tvZFB6IebFyFNEjMikj vHP4UuFk/dqvht9m19IHxHz4EgulivNWiwU16axIugZjl+NDGHOXb1MeU+Etfm5m+K7g fwkuOTq0A5VveOiVMfMS/F1NncVwrdhXvAo0ztelP7anmprIaiklzoV4tCyJrZII5mxf xEgt5zhDAHtlL4RSS0tKRhO1x9tZwjvST5hHeyRtY29UCuW23DzfkHSH9N3VLqun8NJB OeRQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:sender:from:to:cc:subject:date:message-id :in-reply-to:references:mime-version:content-transfer-encoding; bh=3szfw6jRsFyxQTVEG1ikPVQdN5P2dWMh3dSumqMxVhs=; b=G7LhYGbFY47nzpX3vXdl8OjT6Piw8xUa9cOrYCoCsLotYkVw1bzGz08paT7OBt/TJf 0N33v673nibFhvwTakhuEReUw/tPQpsCaJfLK+ZFqddLpG65IkpIiuwkBe4pNv6XqhYK 0aJbrttxaXXzdzeYufVDYI70Z43I/ZL5CeA8uh2YwP6sQkgE1tJDlIHIQNND0yOSFge9 /HCHTMrNgr1zZfKEDtPx1G88YpLhKZFomSxYCTA8ouUJwVowv8Z9VhqkmM9Hom+vD/YB fTnZ/JOLlOp+PM2i5fl+HTsj5BlmNJMYFINSVhPBm0w5pw+1K0yQjGZFm+5U2MLGMf/1 4tSA== X-Gm-Message-State: AOAM530h+4YP9mrOCOIHCnHQstYVO+4F4NAJfxjP3J5ZFzsLC3rl7tLc 7Lxab1hbf1h50ZA05d6NUn4= X-Google-Smtp-Source: ABdhPJya7wem/PtwrWtEr1hgmrI/oj9/IhcHJBeUUTKLE80PoOgL3CbO3xEgOm6kVlvPZ76s6lbnLw== X-Received: by 2002:a63:a12:: with SMTP id 18mr653072pgk.171.1636570484641; Wed, 10 Nov 2021 10:54:44 -0800 (PST) Received: from bbox-1.mtv.corp.google.com ([2620:15c:211:201:11d4:2de3:ab82:be64]) by smtp.gmail.com with ESMTPSA id g13sm325253pjc.39.2021.11.10.10.54.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 10 Nov 2021 10:54:44 -0800 (PST) From: Minchan Kim To: Andrew Morton Cc: Sergey Senozhatsky , linux-mm , Minchan Kim Subject: [PATCH 7/8] zsmalloc: replace per zpage lock with pool->migrate_lock Date: Wed, 10 Nov 2021 10:54:32 -0800 Message-Id: <20211110185433.1981097-8-minchan@kernel.org> X-Mailer: git-send-email 2.34.0.rc1.387.gb447b232ab-goog In-Reply-To: <20211110185433.1981097-1-minchan@kernel.org> References: <20211110185433.1981097-1-minchan@kernel.org> MIME-Version: 1.0 Authentication-Results: imf06.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=Ol5iGuBR; spf=pass (imf06.hostedemail.com: domain of minchan.kim@gmail.com designates 209.85.215.172 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail reason="SPF not aligned (relaxed), DKIM not aligned (relaxed)" header.from=kernel.org (policy=none) X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: B0BA3801A8B8 X-Stat-Signature: 1gxaawjd6sp1b687u66aawpy9cd99pa7 X-HE-Tag: 1636570485-555070 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The zsmalloc has used a bit for spin_lock in zpage handle to keep zpage object alive during several operations. However, it causes the problem for PREEMPT_RT as well as introducing too complicated. This patch replaces the bit spin_lock with pool->migrate_lock rwlock. It could make the code simple as well as zsmalloc work under PREEMPT_RT. The drawback is the pool->migrate_lock is bigger granuarity than per zpage lock so the contention would be higher than old when both IO-related operations(i.e., zsmalloc, zsfree, zs_[map|unmap]) and compaction(page/zpage migration) are going in parallel(*, the migrate_lock is rwlock and IO related functions are all read side lock so there is no contention). However, the write-side is fast enough(dominant overhead is just page copy) so it wouldn't affect much. If the lock granurity becomes more problem later, we could introduce table locks based on handle as a hash value. Signed-off-by: Minchan Kim --- mm/zsmalloc.c | 205 +++++++++++++++++++++++--------------------------- 1 file changed, 96 insertions(+), 109 deletions(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index b8b098be92fa..5d4c4d254679 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -30,6 +30,14 @@ =20 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt =20 +/* + * lock ordering: + * page_lock + * pool->migrate_lock + * class->lock + * zspage->lock + */ + #include #include #include @@ -100,15 +108,6 @@ =20 #define _PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) =20 -/* - * Memory for allocating for handle keeps object position by - * encoding and the encoded value has a room - * in least bit(ie, look at obj_to_location). - * We use the bit to synchronize between object access by - * user and migration. - */ -#define HANDLE_PIN_BIT 0 - /* * Head in allocated object should have OBJ_ALLOCATED_TAG * to identify the object was allocated or not. @@ -255,6 +254,8 @@ struct zs_pool { struct inode *inode; struct work_struct free_work; #endif + /* protect page/zspage migration */ + rwlock_t migrate_lock; }; =20 struct zspage { @@ -297,6 +298,9 @@ static void zs_unregister_migration(struct zs_pool *p= ool); static void migrate_lock_init(struct zspage *zspage); static void migrate_read_lock(struct zspage *zspage); static void migrate_read_unlock(struct zspage *zspage); +static void migrate_write_lock(struct zspage *zspage); +static void migrate_write_lock_nested(struct zspage *zspage); +static void migrate_write_unlock(struct zspage *zspage); static void kick_deferred_free(struct zs_pool *pool); static void init_deferred_free(struct zs_pool *pool); static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage= ); @@ -308,6 +312,9 @@ static void zs_unregister_migration(struct zs_pool *p= ool) {} static void migrate_lock_init(struct zspage *zspage) {} static void migrate_read_lock(struct zspage *zspage) {} static void migrate_read_unlock(struct zspage *zspage) {} +static void migrate_write_lock(struct zspage *zspage) {} +static void migrate_write_lock_nested(struct zspage *zspage) {} +static void migrate_write_unlock(struct zspage *zspage) {} static void kick_deferred_free(struct zs_pool *pool) {} static void init_deferred_free(struct zs_pool *pool) {} static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage= ) {} @@ -359,14 +366,10 @@ static void cache_free_zspage(struct zs_pool *pool,= struct zspage *zspage) kmem_cache_free(pool->zspage_cachep, zspage); } =20 +/* class->lock(which owns the handle) synchronizes races */ static void record_obj(unsigned long handle, unsigned long obj) { - /* - * lsb of @obj represents handle lock while other bits - * represent object value the handle is pointing so - * updating shouldn't do store tearing. - */ - WRITE_ONCE(*(unsigned long *)handle, obj); + *(unsigned long *)handle =3D obj; } =20 /* zpool driver */ @@ -880,26 +883,6 @@ static bool obj_allocated(struct page *page, void *o= bj, unsigned long *phandle) return true; } =20 -static inline int testpin_tag(unsigned long handle) -{ - return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle); -} - -static inline int trypin_tag(unsigned long handle) -{ - return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle); -} - -static void pin_tag(unsigned long handle) __acquires(bitlock) -{ - bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle); -} - -static void unpin_tag(unsigned long handle) __releases(bitlock) -{ - bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle); -} - static void reset_page(struct page *page) { __ClearPageMovable(page); @@ -968,6 +951,11 @@ static void free_zspage(struct zs_pool *pool, struct= size_class *class, VM_BUG_ON(get_zspage_inuse(zspage)); VM_BUG_ON(list_empty(&zspage->list)); =20 + /* + * Since zs_free couldn't be sleepable, this function cannot call + * lock_page. The page locks trylock_zspage got will be released + * by __free_zspage. + */ if (!trylock_zspage(zspage)) { kick_deferred_free(pool); return; @@ -1263,15 +1251,20 @@ void *zs_map_object(struct zs_pool *pool, unsigne= d long handle, */ BUG_ON(in_interrupt()); =20 - /* From now on, migration cannot move the object */ - pin_tag(handle); - + /* It guarantees it can get zspage from handle safely */ + read_lock(&pool->migrate_lock); obj =3D handle_to_obj(handle); obj_to_location(obj, &page, &obj_idx); zspage =3D get_zspage(page); =20 - /* migration cannot move any subpage in this zspage */ + /* + * migration cannot move any zpages in this zspage. Here, class->lock + * is too heavy since callers would take some time until they calls + * zs_unmap_object API so delegate the locking from class to zspage + * which is smaller granularity. + */ migrate_read_lock(zspage); + read_unlock(&pool->migrate_lock); =20 class =3D zspage_class(pool, zspage); off =3D (class->size * obj_idx) & ~PAGE_MASK; @@ -1330,7 +1323,6 @@ void zs_unmap_object(struct zs_pool *pool, unsigned= long handle) put_cpu_var(zs_map_area); =20 migrate_read_unlock(zspage); - unpin_tag(handle); } EXPORT_SYMBOL_GPL(zs_unmap_object); =20 @@ -1424,6 +1416,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_= t size, gfp_t gfp) size +=3D ZS_HANDLE_SIZE; class =3D pool->size_class[get_size_class_index(size)]; =20 + /* class->lock effectively protects the zpage migration */ spin_lock(&class->lock); zspage =3D find_get_zspage(class); if (likely(zspage)) { @@ -1501,30 +1494,27 @@ void zs_free(struct zs_pool *pool, unsigned long = handle) if (unlikely(!handle)) return; =20 - pin_tag(handle); + /* + * The pool->migrate_lock protects the race with zpage's migration + * so it's safe to get the page from handle. + */ + read_lock(&pool->migrate_lock); obj =3D handle_to_obj(handle); obj_to_page(obj, &f_page); zspage =3D get_zspage(f_page); - - migrate_read_lock(zspage); class =3D zspage_class(pool, zspage); - spin_lock(&class->lock); + read_unlock(&pool->migrate_lock); + obj_free(class->size, obj); class_stat_dec(class, OBJ_USED, 1); fullness =3D fix_fullness_group(class, zspage); - if (fullness !=3D ZS_EMPTY) { - migrate_read_unlock(zspage); + if (fullness !=3D ZS_EMPTY) goto out; - } =20 - migrate_read_unlock(zspage); - /* If zspage is isolated, zs_page_putback will free the zspage */ free_zspage(pool, class, zspage); out: - spin_unlock(&class->lock); - unpin_tag(handle); cache_free_handle(pool, handle); } EXPORT_SYMBOL_GPL(zs_free); @@ -1608,11 +1598,8 @@ static unsigned long find_alloced_obj(struct size_= class *class, offset +=3D class->size * index; =20 while (offset < PAGE_SIZE) { - if (obj_allocated(page, addr + offset, &handle)) { - if (trypin_tag(handle)) - break; - handle =3D 0; - } + if (obj_allocated(page, addr + offset, &handle)) + break; =20 offset +=3D class->size; index++; @@ -1658,7 +1645,6 @@ static int migrate_zspage(struct zs_pool *pool, str= uct size_class *class, =20 /* Stop if there is no more space */ if (zspage_full(class, get_zspage(d_page))) { - unpin_tag(handle); ret =3D -ENOMEM; break; } @@ -1667,15 +1653,7 @@ static int migrate_zspage(struct zs_pool *pool, st= ruct size_class *class, free_obj =3D obj_malloc(pool, get_zspage(d_page), handle); zs_object_copy(class, free_obj, used_obj); obj_idx++; - /* - * record_obj updates handle's value to free_obj and it will - * invalidate lock bit(ie, HANDLE_PIN_BIT) of handle, which - * breaks synchronization using pin_tag(e,g, zs_free) so - * let's keep the lock bit. - */ - free_obj |=3D BIT(HANDLE_PIN_BIT); record_obj(handle, free_obj); - unpin_tag(handle); obj_free(class->size, used_obj); } =20 @@ -1789,6 +1767,11 @@ static void migrate_write_lock(struct zspage *zspa= ge) write_lock(&zspage->lock); } =20 +static void migrate_write_lock_nested(struct zspage *zspage) +{ + write_lock_nested(&zspage->lock, SINGLE_DEPTH_NESTING); +} + static void migrate_write_unlock(struct zspage *zspage) { write_unlock(&zspage->lock); @@ -1856,11 +1839,10 @@ static int zs_page_migrate(struct address_space *= mapping, struct page *newpage, struct zspage *zspage; struct page *dummy; void *s_addr, *d_addr, *addr; - int offset, pos; + int offset; unsigned long handle; unsigned long old_obj, new_obj; unsigned int obj_idx; - int ret =3D -EAGAIN; =20 /* * We cannot support the _NO_COPY case here, because copy needs to @@ -1873,32 +1855,25 @@ static int zs_page_migrate(struct address_space *= mapping, struct page *newpage, VM_BUG_ON_PAGE(!PageMovable(page), page); VM_BUG_ON_PAGE(!PageIsolated(page), page); =20 - zspage =3D get_zspage(page); - - /* Concurrent compactor cannot migrate any subpage in zspage */ - migrate_write_lock(zspage); pool =3D mapping->private_data; + + /* + * The pool migrate_lock protects the race between zpage migration + * and zs_free. + */ + write_lock(&pool->migrate_lock); + zspage =3D get_zspage(page); class =3D zspage_class(pool, zspage); - offset =3D get_first_obj_offset(page); =20 + /* + * the class lock protects zpage alloc/free in the zspage. + */ spin_lock(&class->lock); - if (!get_zspage_inuse(zspage)) { - /* - * Set "offset" to end of the page so that every loops - * skips unnecessary object scanning. - */ - offset =3D PAGE_SIZE; - } + /* the migrate_write_lock protects zpage access via zs_map_object */ + migrate_write_lock(zspage); =20 - pos =3D offset; + offset =3D get_first_obj_offset(page); s_addr =3D kmap_atomic(page); - while (pos < PAGE_SIZE) { - if (obj_allocated(page, s_addr + pos, &handle)) { - if (!trypin_tag(handle)) - goto unpin_objects; - } - pos +=3D class->size; - } =20 /* * Here, any user cannot access all objects in the zspage so let's move= . @@ -1907,25 +1882,30 @@ static int zs_page_migrate(struct address_space *= mapping, struct page *newpage, memcpy(d_addr, s_addr, PAGE_SIZE); kunmap_atomic(d_addr); =20 - for (addr =3D s_addr + offset; addr < s_addr + pos; + for (addr =3D s_addr + offset; addr < s_addr + PAGE_SIZE; addr +=3D class->size) { if (obj_allocated(page, addr, &handle)) { - BUG_ON(!testpin_tag(handle)); =20 old_obj =3D handle_to_obj(handle); obj_to_location(old_obj, &dummy, &obj_idx); new_obj =3D (unsigned long)location_to_obj(newpage, obj_idx); - new_obj |=3D BIT(HANDLE_PIN_BIT); record_obj(handle, new_obj); } } + kunmap_atomic(s_addr); =20 replace_sub_page(class, zspage, newpage, page); - get_page(newpage); - + /* + * Since we complete the data copy and set up new zspage structure, + * it's okay to release migration_lock. + */ + write_unlock(&pool->migrate_lock); + spin_unlock(&class->lock); dec_zspage_isolation(zspage); + migrate_write_unlock(zspage); =20 + get_page(newpage); if (page_zone(newpage) !=3D page_zone(page)) { dec_zone_page_state(page, NR_ZSPAGES); inc_zone_page_state(newpage, NR_ZSPAGES); @@ -1933,22 +1913,8 @@ static int zs_page_migrate(struct address_space *m= apping, struct page *newpage, =20 reset_page(page); put_page(page); - page =3D newpage; - - ret =3D MIGRATEPAGE_SUCCESS; -unpin_objects: - for (addr =3D s_addr + offset; addr < s_addr + pos; - addr +=3D class->size) { - if (obj_allocated(page, addr, &handle)) { - BUG_ON(!testpin_tag(handle)); - unpin_tag(handle); - } - } - kunmap_atomic(s_addr); - spin_unlock(&class->lock); - migrate_write_unlock(zspage); =20 - return ret; + return MIGRATEPAGE_SUCCESS; } =20 static void zs_page_putback(struct page *page) @@ -2077,8 +2043,13 @@ static unsigned long __zs_compact(struct zs_pool *= pool, struct zspage *dst_zspage =3D NULL; unsigned long pages_freed =3D 0; =20 + /* protect the race between zpage migration and zs_free */ + write_lock(&pool->migrate_lock); + /* protect zpage allocation/free */ spin_lock(&class->lock); while ((src_zspage =3D isolate_zspage(class, true))) { + /* protect someone accessing the zspage(i.e., zs_map_object) */ + migrate_write_lock(src_zspage); =20 if (!zs_can_compact(class)) break; @@ -2087,6 +2058,8 @@ static unsigned long __zs_compact(struct zs_pool *p= ool, cc.s_page =3D get_first_page(src_zspage); =20 while ((dst_zspage =3D isolate_zspage(class, false))) { + migrate_write_lock_nested(dst_zspage); + cc.d_page =3D get_first_page(dst_zspage); /* * If there is no more space in dst_page, resched @@ -2096,6 +2069,10 @@ static unsigned long __zs_compact(struct zs_pool *= pool, break; =20 putback_zspage(class, dst_zspage); + migrate_write_unlock(dst_zspage); + dst_zspage =3D NULL; + if (rwlock_is_contended(&pool->migrate_lock)) + break; } =20 /* Stop if we couldn't find slot */ @@ -2103,19 +2080,28 @@ static unsigned long __zs_compact(struct zs_pool = *pool, break; =20 putback_zspage(class, dst_zspage); + migrate_write_unlock(dst_zspage); + if (putback_zspage(class, src_zspage) =3D=3D ZS_EMPTY) { + migrate_write_unlock(src_zspage); free_zspage(pool, class, src_zspage); pages_freed +=3D class->pages_per_zspage; - } + } else + migrate_write_unlock(src_zspage); spin_unlock(&class->lock); + write_unlock(&pool->migrate_lock); cond_resched(); + write_lock(&pool->migrate_lock); spin_lock(&class->lock); } =20 - if (src_zspage) + if (src_zspage) { putback_zspage(class, src_zspage); + migrate_write_unlock(src_zspage); + } =20 spin_unlock(&class->lock); + write_unlock(&pool->migrate_lock); =20 return pages_freed; } @@ -2221,6 +2207,7 @@ struct zs_pool *zs_create_pool(const char *name) return NULL; =20 init_deferred_free(pool); + rwlock_init(&pool->migrate_lock); =20 pool->name =3D kstrdup(name, GFP_KERNEL); if (!pool->name) --=20 2.34.0.rc1.387.gb447b232ab-goog