From: Nicholas Piggin <npiggin@gmail.com>
To: linux-kernel@vger.kernel.org
Cc: Nicholas Piggin, Andrew Morton, Linus Torvalds, linux-mm@kvack.org, Anton Blanchard, Ingo Molnar
Subject: [PATCH v2] Increase page and bit waitqueue hash size
Date: Wed, 17 Mar 2021 17:54:27 +1000
Message-Id: <20210317075427.587806-1-npiggin@gmail.com>

The page waitqueue hash is a bit small (256 entries) on very big systems.
A 16 socket 1536 thread POWER9 system was found to encounter hash
collisions and excessive time in waitqueue locking at times. This was
intermittent and hard to reproduce easily with the setup we had (very
little real IO capacity). The theory is that sometimes (depending on
allocation luck) important pages would happen to collide a lot in the
hash, slowing down page locking, causing the problem to snowball.

A small test case was made where threads would write and fsync different
pages, generating just a small amount of contention across many pages.

Increasing the page waitqueue hash size to 262144 entries increased
throughput by 182% while also reducing standard deviation 3x.
perf before the increase:

  36.23%  [k] _raw_spin_lock_irqsave
          |
          |--34.60%--wake_up_page_bit
          |          0
          |          iomap_write_end.isra.38
          |          iomap_write_actor
          |          iomap_apply
          |          iomap_file_buffered_write
          |          xfs_file_buffered_aio_write
          |          new_sync_write

  17.93%  [k] native_queued_spin_lock_slowpath
          |
          |--16.74%--_raw_spin_lock_irqsave
          |          |
          |           --16.44%--wake_up_page_bit
          |                     iomap_write_end.isra.38
          |                     iomap_write_actor
          |                     iomap_apply
          |                     iomap_file_buffered_write
          |                     xfs_file_buffered_aio_write

This patch uses alloc_large_system_hash to allocate a bigger system hash
that scales somewhat with memory size. The bit/var wait-queue is also
changed to keep the code matching, albeit with a smaller scale factor.

A very small CONFIG_BASE_SMALL option is also added, because these are
two of the biggest static objects in the image on very small systems.

This hash could be made per-node, which may help reduce remote accesses
on well localised workloads, but that adds some complexity with indexing
and hotplug, so until we get a less artificial workload to test with,
keep it simple.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 kernel/sched/wait_bit.c | 30 +++++++++++++++++++++++-------
 mm/filemap.c            | 24 +++++++++++++++++++++---
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 02ce292b9bc0..dba73dec17c4 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -2,19 +2,24 @@
 /*
  * The implementation of the wait_bit*() and related waiting APIs:
  */
+#include <linux/memblock.h>
 #include "sched.h"
 
-#define WAIT_TABLE_BITS 8
-#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
-
-static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
+#define BIT_WAIT_TABLE_SIZE (1 << bit_wait_table_bits)
+#if CONFIG_BASE_SMALL
+static const unsigned int bit_wait_table_bits = 3;
+static wait_queue_head_t bit_wait_table[BIT_WAIT_TABLE_SIZE] __cacheline_aligned;
+#else
+static unsigned int bit_wait_table_bits __ro_after_init;
+static wait_queue_head_t *bit_wait_table __ro_after_init;
+#endif
 
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
 	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
 	unsigned long val = (unsigned long)word << shift | bit;
 
-	return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
+	return bit_wait_table + hash_long(val, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(bit_waitqueue);
 
@@ -152,7 +157,7 @@ EXPORT_SYMBOL(wake_up_bit);
 
 wait_queue_head_t *__var_waitqueue(void *p)
 {
-	return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
+	return bit_wait_table + hash_ptr(p, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(__var_waitqueue);
 
@@ -246,6 +251,17 @@ void __init wait_bit_init(void)
 {
 	int i;
 
-	for (i = 0; i < WAIT_TABLE_SIZE; i++)
+	if (!CONFIG_BASE_SMALL) {
+		bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
+							sizeof(wait_queue_head_t),
+							0,
+							22,
+							0,
+							&bit_wait_table_bits,
+							NULL,
+							0,
+							0);
+	}
+	for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
 		init_waitqueue_head(bit_wait_table + i);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 43700480d897..dbbb5b9d951d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include <linux/memblock.h>
 #include
 #include
 #include
@@ -990,19 +991,36 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-#define PAGE_WAIT_TABLE_BITS 8
-#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
+#define PAGE_WAIT_TABLE_SIZE (1 << page_wait_table_bits)
+#if CONFIG_BASE_SMALL
+static const unsigned int page_wait_table_bits = 4;
 static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
+#else
+static unsigned int page_wait_table_bits __ro_after_init;
+static wait_queue_head_t *page_wait_table __ro_after_init;
+#endif
 
 static wait_queue_head_t *page_waitqueue(struct page *page)
 {
-	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
+	return &page_wait_table[hash_ptr(page, page_wait_table_bits)];
 }
 
 void __init pagecache_init(void)
 {
 	int i;
 
+	if (!CONFIG_BASE_SMALL) {
+		page_wait_table = alloc_large_system_hash("page waitqueue hash",
+							sizeof(wait_queue_head_t),
+							0,
+							21,
+							0,
+							&page_wait_table_bits,
+							NULL,
+							0,
+							0);
+	}
 	for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
 		init_waitqueue_head(&page_wait_table[i]);
 
-- 
2.23.0