Date: Wed, 17 Mar 2021 20:02:18 +1000
From: Nicholas Piggin
Subject: Re: [PATCH v2] Increase page and bit waitqueue hash size
To: Ingo Molnar
Cc: Andrew Morton, Anton Blanchard, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Linus Torvalds
References: <20210317075427.587806-1-npiggin@gmail.com> <20210317083830.GC3881262@gmail.com>
In-Reply-To: <20210317083830.GC3881262@gmail.com>
Message-Id: <1615974423.0rc8elykcq.astroid@bobo.none>

Excerpts from Ingo Molnar's message of March 17, 2021 6:38 pm:
>
> * Nicholas Piggin wrote:
>
>> The page waitqueue hash is a bit small (256 entries) on very big systems. A
>> 16 socket 1536 thread POWER9 system was found to encounter hash collisions
>> and excessive time in waitqueue locking at times. This was intermittent and
>> hard to reproduce easily with the setup we had (very little real IO
>> capacity). The theory is that sometimes (depending on allocation luck)
>> important pages would happen to collide a lot in the hash, slowing down page
>> locking, causing the problem to snowball.
>>
>> A small test case was made where threads would write and fsync different
>> pages, generating just a small amount of contention across many pages.
>>
>> Increasing page waitqueue hash size to 262144 entries increased throughput
>> by 182% while also reducing standard deviation 3x. perf before the increase:
>>
>>   36.23% [k] _raw_spin_lock_irqsave        -      -
>>            |
>>            |--34.60%--wake_up_page_bit
>>            |          0
>>            |          iomap_write_end.isra.38
>>            |          iomap_write_actor
>>            |          iomap_apply
>>            |          iomap_file_buffered_write
>>            |          xfs_file_buffered_aio_write
>>            |          new_sync_write
>>
>>   17.93% [k] native_queued_spin_lock_slowpath      -      -
>>            |
>>            |--16.74%--_raw_spin_lock_irqsave
>>            |          |
>>            |           --16.44%--wake_up_page_bit
>>            |                     iomap_write_end.isra.38
>>            |                     iomap_write_actor
>>            |                     iomap_apply
>>            |                     iomap_file_buffered_write
>>            |                     xfs_file_buffered_aio_write
>>
>> This patch uses alloc_large_system_hash to allocate a bigger system hash
>> that scales somewhat with memory size. The bit/var wait-queue is also
>> changed to keep code matching, albeit with a smaller scale factor.
>>
>> A very small CONFIG_BASE_SMALL option is also added because these are two
>> of the biggest static objects in the image on very small systems.
>>
>> This hash could be made per-node, which may help reduce remote accesses
>> on well localised workloads, but that adds some complexity with indexing
>> and hotplug, so until we get a less artificial workload to test with,
>> keep it simple.
>>
>> Signed-off-by: Nicholas Piggin
>> ---
>>  kernel/sched/wait_bit.c | 30 +++++++++++++++++++++++-------
>>  mm/filemap.c            | 24 +++++++++++++++++++++---
>>  2 files changed, 44 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
>> index 02ce292b9bc0..dba73dec17c4 100644
>> --- a/kernel/sched/wait_bit.c
>> +++ b/kernel/sched/wait_bit.c
>> @@ -2,19 +2,24 @@
>>  /*
>>   * The implementation of the wait_bit*() and related waiting APIs:
>>   */
>> +#include
>>  #include "sched.h"
>>
>> -#define WAIT_TABLE_BITS 8
>> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
>
> Ugh, 256 entries is almost embarrassingly small indeed.
>
> I've put your patch into sched/core, unless Andrew is objecting.

Thanks. Andrew and Linus might have some opinions on it, but if it's
just in a testing branch for now that's okay.

>
>> -	for (i = 0; i < WAIT_TABLE_SIZE; i++)
>> +	if (!CONFIG_BASE_SMALL) {
>> +		bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
>> +							sizeof(wait_queue_head_t),
>> +							0,
>> +							22,
>> +							0,
>> +							&bit_wait_table_bits,
>> +							NULL,
>> +							0,
>> +							0);
>> +	}
>> +	for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
>>  		init_waitqueue_head(bit_wait_table + i);
>
>
> Meta suggestion: maybe the CONFIG_BASE_SMALL ugliness could be folded
> into alloc_large_system_hash() itself?

I don't like the ugliness and that's a good suggestion in some ways, but
having the constant size and table is nice for the small system. I don't
know, maybe we need to revise the alloc_large_system_hash API slightly.
Having some kind of DEFINE_LARGE_ARRAY perhaps, then you could have both
static and dynamic? I'll think about it.
>=20 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >=20 >> static wait_queue_head_t *page_waitqueue(struct page *page) >> { >> - return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)]; >> + return &page_wait_table[hash_ptr(page, page_wait_table_bits)]; >> } >=20 > I'm wondering whether you've tried to make this NUMA aware through=20 > page->node? >=20 > Seems like another useful step when having a global hash ... Yes I have patches for that on the back burner. Just wanted to try one step at a time, but I think we should be able to justify it (a well=20 NUMAified workload will tend to store mostly to local page waitqueue so=20 keep cacheline contention within the node). We need to get some access=20 to a big system again and try get some more IO on it at some point, so=20 stay tuned for that. We actually used to have similar to this, but Linux removed it with 9dcb8b685fc30. The difference now is that the page waitqueue has been split out from the bit waitqueue. Doing the page waitqueue is much easier because we don't have the vmalloc problem to deal with. But still it's some complexity. We also do have the page contention bit that Linus refers to which takes=20 pressure off the waitqueues (which is probably why 256 entries has held=20 up surprisingly well), but as we can see we do need larger at the high end. Thanks, Nick