Date: Thu, 28 Nov 2024 04:43:18 +0000
From: Matthew Wilcox
To: Mateusz Guzik
Cc: Bharata B Rao, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, nikunj@amd.com, vbabka@suse.cz, david@redhat.com, akpm@linux-foundation.org, yuzhao@google.com, axboe@kernel.dk, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, joshdon@google.com, clm@meta.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
References: <20241127054737.33351-1-bharata@amd.com> <3947869f-90d4-4912-a42f-197147fe64f0@amd.com> <5a517b3a-51b2-45d6-bea3-4a64b75dfd30@amd.com>

On Thu, Nov 28, 2024 at 05:22:41AM +0100, Mateusz Guzik wrote:
> This means that the folio waiting stuff has poor scalability,
> but without digging into it I have no idea what can be done. The
> easy way

Actually the easy way is to change:

#define PAGE_WAIT_TABLE_BITS 8

to a larger number.

> out would be to speculatively spin before buggering off, but one would
> have to check what happens in real workloads -- presumably the lock
> owner can be off cpu for a long time (I presume there is no way to
> store the owner).

So ...

 - There's no space in struct folio to put a rwsem.
 - But we want to be able to sleep waiting for a folio to (eg) do I/O.

This is the solution we have.  For the read case, there are three
important bits in folio->flags to pay attention to:

 - PG_locked.  This is held during the read.
 - PG_uptodate.  This is set if the read succeeded.
 - PG_waiters.  This is set if anyone is waiting for PG_locked [*]

The first thread comes along, allocates a folio, locks it, inserts
it into the mapping.
The second thread comes along, finds the folio, sees it's !uptodate,
sets the waiter bit, adds itself to the waitqueue.
The third thread, ditto.
The read completes.  In interrupt or maybe softirq context, the
BIO completion sets the uptodate bit, clears the locked bit and tests
the waiter bit.  Since the waiter bit is set, it walks the waitqueue
looking for waiters which match the locked bit and folio (see
folio_wake_bit()).

So there's not _much_ of a thundering herd problem here.  Most likely
the waitqueue is just too damn long with a lot of threads waiting for
I/O.

[*] oversimplification; don't worry about it.
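
As a rough illustration of the sequence described above, here is a
single-threaded userspace model in C. It is not the kernel
implementation: struct folio_model, struct waiter, struct wait_table
and every function name are hypothetical, the flag values are made up,
and the hash is only a stand-in for whatever the kernel uses. The
"bits" field plays the role of PAGE_WAIT_TABLE_BITS, so the model shows
how a shared, hashed wait table makes the completion walk past waiters
for other folios when the table is small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Model flag bits; the values are illustrative, not the kernel's. */
enum { PG_LOCKED = 1, PG_UPTODATE = 2, PG_WAITERS = 4 };

struct folio_model { unsigned flags; };

struct waiter {
	struct folio_model *folio;	/* folio this thread is waiting on */
	unsigned bit;			/* flag it is waiting to see cleared */
	int woken;
	struct waiter *next;
};

struct wait_table {
	unsigned bits;			/* plays the role of PAGE_WAIT_TABLE_BITS */
	struct waiter **bucket;		/* (1 << bits) list heads */
};

/* Multiplicative pointer hash (assumption: a stand-in, not the kernel's). */
static size_t bucket_of(const struct wait_table *t, const struct folio_model *f)
{
	if (t->bits == 0)
		return 0;
	return (size_t)(((uint64_t)(uintptr_t)f * 0x61C8864680B583EBull)
			>> (64 - t->bits));
}

static int table_init(struct wait_table *t, unsigned bits)
{
	assert(bits < 32);
	t->bits = bits;
	t->bucket = calloc((size_t)1 << bits, sizeof(*t->bucket));
	return t->bucket ? 0 : -1;
}

/* First thread: the folio is locked for the duration of the read. */
static void start_read(struct folio_model *f) { f->flags = PG_LOCKED; }

/* Later threads: folio is !uptodate, so set the waiter bit and queue up. */
static void wait_for_read(struct wait_table *t, struct folio_model *f,
			  struct waiter *w)
{
	size_t b = bucket_of(t, f);

	w->folio = f;
	w->bit = PG_LOCKED;
	w->woken = 0;
	w->next = t->bucket[b];
	t->bucket[b] = w;
	f->flags |= PG_WAITERS;
}

/*
 * Completion: set uptodate, clear locked, test the waiter bit, then walk
 * the shared bucket and wake only entries matching this folio and bit.
 * Returns the number of entries walked; *woken gets the number woken.
 */
static unsigned complete_read(struct wait_table *t, struct folio_model *f,
			      unsigned *woken)
{
	struct waiter **pp = &t->bucket[bucket_of(t, f)];
	unsigned walked = 0;

	*woken = 0;
	f->flags |= PG_UPTODATE;
	f->flags &= ~(unsigned)PG_LOCKED;
	if (!(f->flags & PG_WAITERS))
		return 0;
	while (*pp) {
		struct waiter *w = *pp;

		walked++;
		if (w->folio == f && w->bit == PG_LOCKED) {
			w->woken = 1;
			(*woken)++;
			*pp = w->next;		/* dequeue the woken waiter */
		} else {
			pp = &w->next;		/* someone else's waiter */
		}
	}
	f->flags &= ~(unsigned)PG_WAITERS;	/* simplification: none left */
	return walked;
}

/*
 * Scenario: nfolios (>= 1) reads in flight, nwaiters threads queued on each.
 * Returns how many entries the completion of folio 0 had to walk.
 */
static unsigned walk_length(unsigned bits, unsigned nfolios, unsigned nwaiters,
			    unsigned *woken)
{
	struct wait_table t;
	struct folio_model *folios = calloc(nfolios, sizeof(*folios));
	struct waiter *waiters = calloc((size_t)nfolios * nwaiters, sizeof(*waiters));
	unsigned walked = 0;

	*woken = 0;
	if (folios && waiters && table_init(&t, bits) == 0) {
		for (unsigned i = 0; i < nfolios; i++) {
			start_read(&folios[i]);
			for (unsigned j = 0; j < nwaiters; j++)
				wait_for_read(&t, &folios[i], &waiters[i * nwaiters + j]);
		}
		walked = complete_read(&t, &folios[0], woken);
		free(t.bucket);
	}
	free(folios);
	free(waiters);
	return walked;
}
```

Calling walk_length(0, 4, 16, &woken) puts all 64 waiters in one
bucket, so the completion walks every one of them to wake 16. With more
bits the walk can only get shorter, down to the 16 waiters that
actually belong to the folio. The model cannot show the other half of
the point: if one folio has a very long queue of its own, a larger
table does not shorten it.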