From: Gang Li
To: Andrew Morton
Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
 Steffen Klassert, Daniel Jordan, Jane Chu, "Paul E . McKenney",
 Randy Dunlap, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 ligang.bdlg@bytedance.com, Gang Li
Subject: [PATCH v6 7/8] hugetlb: parallelize 2M hugetlb allocation and initialization
Date: Thu, 22 Feb 2024 22:04:20 +0800
Message-Id: <20240222140422.393911-8-gang.li@linux.dev>
In-Reply-To: <20240222140422.393911-1-gang.li@linux.dev>
References: <20240222140422.393911-1-gang.li@linux.dev>

By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb will be faster,
thereby improving the boot speed.

Here are some test results:
      test case        no patch(ms)   patched(ms)   saved
 ------------------- -------------- ------------- --------
  256c2T(4 node) 2M           3336          1051   68.52%
  128c1T(2 node) 2M           1943           716   63.15%

Signed-off-by: Gang Li
Tested-by: David Rientjes
Reviewed-by: Muchun Song
---
 mm/hugetlb.c | 73 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 56 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d1ce1a52ad504..3ce957b3e350b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -3510,6 +3511,30 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
 	}
 }
 
+static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned long end, void *arg)
+{
+	struct hstate *h = (struct hstate *)arg;
+	int i, num = end - start;
+	nodemask_t node_alloc_noretry;
+	LIST_HEAD(folio_list);
+	int next_node = first_online_node;
+
+	/* Bit mask controlling how hard we retry per-node allocations.*/
+	nodes_clear(node_alloc_noretry);
+
+	for (i = 0; i < num; ++i) {
+		struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+						&node_alloc_noretry, &next_node);
+		if (!folio)
+			break;
+
+		list_move(&folio->lru, &folio_list);
+		cond_resched();
+	}
+
+	prep_and_add_allocated_folios(h, &folio_list);
+}
+
 static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 {
 	unsigned long i;
@@ -3525,26 +3550,40 @@ static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 
 static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 {
-	unsigned long i;
-	struct folio *folio;
-	LIST_HEAD(folio_list);
-	nodemask_t node_alloc_noretry;
-
-	/* Bit mask controlling how hard we retry per-node allocations.*/
-	nodes_clear(node_alloc_noretry);
+	struct padata_mt_job job = {
+		.fn_arg = h,
+		.align = 1,
+		.numa_aware = true
+	};
 
-	for (i = 0; i < h->max_huge_pages; ++i) {
-		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
-						&node_alloc_noretry);
-		if (!folio)
-			break;
-		list_add(&folio->lru, &folio_list);
-		cond_resched();
-	}
+	job.thread_fn = hugetlb_pages_alloc_boot_node;
+	job.start = 0;
+	job.size = h->max_huge_pages;
 
-	prep_and_add_allocated_folios(h, &folio_list);
+	/*
+	 * job.max_threads is twice the num_node_state(N_MEMORY),
+	 *
+	 * Tests below indicate that a multiplier of 2 significantly improves
+	 * performance, and although larger values also provide improvements,
+	 * the gains are marginal.
+	 *
+	 * Therefore, choosing 2 as the multiplier strikes a good balance between
+	 * enhancing parallel processing capabilities and maintaining efficient
+	 * resource management.
+	 *
+	 * +------------+-------+-------+-------+-------+-------+
+	 * | multiplier |   1   |   2   |   3   |   4   |   5   |
+	 * +------------+-------+-------+-------+-------+-------+
+	 * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
+	 * | 2T   4node | 979ms | 679ms | 543ms | 489ms | 481ms |
+	 * | 50G  2node | 71ms  | 44ms  | 37ms  | 30ms  | 31ms  |
+	 * +------------+-------+-------+-------+-------+-------+
+	 */
+	job.max_threads = num_node_state(N_MEMORY) * 2;
+	job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+	padata_do_multithreaded(&job);
 
-	return i;
+	return h->nr_huge_pages;
 }
 
 /*
-- 
2.20.1
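
For readers who want to check the arithmetic behind job.max_threads and
job.min_chunk, the sketch below reproduces the two formulas from the patch
in plain userspace C. It is an illustration only and not kernel code:
struct boot_alloc_plan and plan_boot_alloc() are hypothetical names, and the
thread estimate (one thread per min_chunk pages, at least one, capped at
max_threads) is an assumed simplification rather than the actual padata
scheduling logic.

```c
#include <assert.h>

/* Hypothetical container mirroring the values the patch stores in the job. */
struct boot_alloc_plan {
	unsigned long size;        /* total huge pages requested (job.size) */
	unsigned long max_threads; /* job.max_threads from the patch */
	unsigned long min_chunk;   /* job.min_chunk from the patch */
	unsigned long threads;     /* assumed, simplified thread estimate */
};

/*
 * Compute the job parameters the way the patch does.
 * nr_mem_nodes stands in for num_node_state(N_MEMORY) and must be non-zero.
 */
static struct boot_alloc_plan plan_boot_alloc(unsigned long max_huge_pages,
					      unsigned long nr_mem_nodes)
{
	struct boot_alloc_plan p;
	unsigned long chunk;

	assert(nr_mem_nodes > 0);

	p.size = max_huge_pages;
	p.max_threads = nr_mem_nodes * 2;
	p.min_chunk = max_huge_pages / nr_mem_nodes / 2;

	/*
	 * Assumption: a zero min_chunk falls back to the alignment of 1 page
	 * (.align = 1 in the patch), so small requests still get a valid chunk.
	 */
	chunk = p.min_chunk ? p.min_chunk : 1;

	/* Assumed model: one thread per chunk, at least 1, at most max_threads. */
	p.threads = p.size / chunk;
	if (p.threads < 1)
		p.threads = 1;
	if (p.threads > p.max_threads)
		p.threads = p.max_threads;

	return p;
}
```

Under these assumptions, plan_boot_alloc(1048576, 4), that is 2T worth of
2M pages on 4 memory nodes, gives max_threads = 8 and min_chunk = 131072
pages, so the model spreads the work over 8 threads.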