From: Mateusz Guzik
Date: Thu, 28 Nov 2024 05:31:38 +0100
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
To: Bharata B Rao
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, nikunj@amd.com,
    willy@infradead.org, vbabka@suse.cz, david@redhat.com,
    akpm@linux-foundation.org, yuzhao@google.com, axboe@kernel.dk,
    viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
    joshdon@google.com, clm@meta.com
References: <20241127054737.33351-1-bharata@amd.com>
    <3947869f-90d4-4912-a42f-197147fe64f0@amd.com>
    <5a517b3a-51b2-45d6-bea3-4a64b75dfd30@amd.com>

On Thu, Nov 28, 2024 at 5:22 AM Mateusz Guzik wrote:
>
> On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao wrote:
> >
> > The contention with inode_lock is gone after your above changes.
> > The new top 10 contention data looks like this now:
> >
> >  contended   total wait     max wait     avg wait         type   caller
> >
> > 2441494015     172.15 h      1.72 ms    253.83 us     spinlock   folio_wait_bit_common+0xd5
> >                         0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
> >                         0xffffffffacdd1905  folio_wait_bit_common+0xd5
> >                         0xffffffffacdd2d0a  filemap_get_pages+0x68a
> >                         0xffffffffacdd2e73  filemap_read+0x103
> >                         0xffffffffad1d67ba  blkdev_read_iter+0x6a
> >                         0xffffffffacf06937  vfs_read+0x297
> >                         0xffffffffacf07653  ksys_read+0x73
> >   25269947       1.58 h      1.72 ms    225.44 us     spinlock   folio_wake_bit+0x62
> >                         0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
> >                         0xffffffffacdcf322  folio_wake_bit+0x62
> >                         0xffffffffacdd2ca7  filemap_get_pages+0x627
> >                         0xffffffffacdd2e73  filemap_read+0x103
> >                         0xffffffffad1d67ba  blkdev_read_iter+0x6a
> >                         0xffffffffacf06937  vfs_read+0x297
> >                         0xffffffffacf07653  ksys_read+0x73
> >   44757761       1.05 h      1.55 ms     84.41 us     spinlock   folio_wake_bit+0x62
> >                         0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
> >                         0xffffffffacdcf322  folio_wake_bit+0x62
> >                         0xffffffffacdcf7bc  folio_end_read+0x2c
> >                         0xffffffffacf6d4cf  mpage_read_end_io+0x6f
> >                         0xffffffffad1d8abb  bio_endio+0x12b
> >                         0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
> >                         0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb
> [snip]
> > However a point of concern is that FIO bandwidth comes down drastically
> > after the change.
> >
>
> Nicely put :)
>
> >                 default                         inode_lock-fix
> > rw=30%
> > Instance 1      r=55.7GiB/s,w=23.9GiB/s         r=9616MiB/s,w=4121MiB/s
> > Instance 2      r=38.5GiB/s,w=16.5GiB/s         r=8482MiB/s,w=3635MiB/s
> > Instance 3      r=37.5GiB/s,w=16.1GiB/s         r=8609MiB/s,w=3690MiB/s
> > Instance 4      r=37.4GiB/s,w=16.0GiB/s         r=8486MiB/s,w=3637MiB/s
> >
>
> This means that the folio waiting stuff has poor scalability, but
> without digging into it I have no idea what can be done. The easy way
> out would be to speculatively spin before buggering off, but one would
> have to check what happens in real workloads -- presumably the lock
> owner can be off cpu for a long time (I presume there is no way to
> store the owner).
>
> The now-removed lock uses rwsems which behave better when contested
> and was pulling contention away from folios, artificially *helping*
> performance by having the folio bottleneck be exercised less.
>
> The right thing to do in the long run is still to whack the llseek
> lock acquire, but in the light of the above it can probably wait for
> better times.

Willy mentioned the folio wait queue hash table could be grown; you
can find it in mm/filemap.c:

    #define PAGE_WAIT_TABLE_BITS 8
    #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
    static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

    static wait_queue_head_t *folio_waitqueue(struct folio *folio)
    {
            return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
    }

Can you collect off cpu time?

    offcputime-bpfcc -K > /tmp/out

On debian this ships with the bpfcc-tools package.

-- 
Mateusz Guzik
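
As an aside on the wait table quoted above, here is a rough userspace
sketch (not kernel code) of why a table indexed by PAGE_WAIT_TABLE_BITS
makes unrelated folios share a wait queue head, and with it the spinlock
that shows up in the contention report. The names sketch_hash_ptr() and
sketch_buckets_used() are made up for illustration. The multiplier is
copied from the kernel's 64-bit hash, and the base address and 64-byte
stride used in the note below are assumptions standing in for folio
pointers. It says nothing about whether growing the table would help
this particular workload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiplier used by the kernel's 64-bit hash; copied here for illustration. */
#define SKETCH_GOLDEN_RATIO_64 0x61C8864680B583EBull
/* Largest table this sketch can track: 1 << 16 buckets. */
#define SKETCH_MAX_BITS 16

/*
 * Hypothetical userspace stand-in for hash_ptr() on a 64-bit machine:
 * multiply the address and keep the top `bits` bits of the product.
 */
static unsigned int sketch_hash_ptr(uint64_t addr, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (unsigned int)((addr * SKETCH_GOLDEN_RATIO_64) >> (64 - bits));
}

/*
 * Count how many distinct buckets are hit by n addresses starting at
 * `base` and spaced `stride` bytes apart. Returns 0 if `bits` is outside
 * the range this sketch supports.
 */
static size_t sketch_buckets_used(uint64_t base, uint64_t stride,
				  size_t n, unsigned int bits)
{
	static unsigned char seen[1u << SKETCH_MAX_BITS];
	size_t used = 0;

	if (bits < 1 || bits > SKETCH_MAX_BITS)
		return 0;

	memset(seen, 0, sizeof(seen));
	for (size_t i = 0; i < n; i++) {
		unsigned int b = sketch_hash_ptr(base + (uint64_t)i * stride, bits);

		if (!seen[b]) {
			seen[b] = 1;
			used++;
		}
	}
	return used;
}
```

With bits = 8, sketch_buckets_used(0xffffea0000000000ull, 64, 4096, 8)
can return at most 256, so 4096 such addresses average at least 16 per
wait queue head. With bits = 12 the count can only stay equal or grow,
because the 12-bit bucket index determines the 8-bit one.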