From: Kairui Song
To: Matthew Wilcox
Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] mm/filemap: optimize filemap folio adding
Date: Wed, 20 Mar 2024 17:06:55 +0800
References: <20240319092733.4501-1-ryncsn@gmail.com> <20240319092733.4501-5-ryncsn@gmail.com>

On Wed, Mar 20, 2024 at 6:20 AM Matthew Wilcox wrote:
>
> On Tue, Mar 19, 2024 at 05:27:33PM +0800, Kairui Song wrote:
> > From: Kairui Song
> >
> > Instead of doing multiple tree walks, do one optimistic range check
> > with the lock held, and exit if raced with another insertion. If a shadow
> > exists, check it with a new xas_get_order helper before releasing the
> > lock to avoid redundant tree walks for getting its order.
> >
> > Drop the lock and do the allocation only if a split is needed.
> >
> > In the best case, it only needs to walk the tree once. If it needs
> > to alloc and split, 3 walks are issued (one for the first raced
> > conflict check and order retrieval, one for the second check after
> > allocation, one for the insert after the split).
> >
> > Testing with 4k pages, in an 8G cgroup, with 20G brd as block device:
> >
> >   fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> >     --buffered=1 --ioengine=mmap --rw=randread --time_based \
> >     --ramp_time=30s --runtime=5m --group_reporting
> >
> > Before:
> >   bw ( MiB/s): min=  790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
> >   iops       : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698
> >
> > After (+4%):
> >   bw ( MiB/s): min=  451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
> >   iops       : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653
> >
> > Test result with THP (do a THP randread then switch to 4K page in the hope
> > that it issues a lot of splitting):
> >
> >   fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> >     --buffered=1 --ioengine mmap -thp=1 --readonly \
> >     --rw=randread --random_distribution=random \
> >     --time_based --runtime=5m --group_reporting
> >
> >   fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> >     --buffered=1 --ioengine mmap --readonly \
> >     --rw=randread --random_distribution=random \
> >     --time_based --runtime=5s --group_reporting
> >
> > Before:
> >   bw ( KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
> >   iops       : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
> >   bw ( MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
> >   iops       : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144
> >
> > After (+-0.0%):
> >   bw ( KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
> >   iops       : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
> >   bw ( MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
> >   iops       : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144
> >
> > The performance is better (+4%) for 4K cached read and unchanged for THP.
> >
> > Signed-off-by: Kairui Song
> > ---
> >  mm/filemap.c | 127 ++++++++++++++++++++++++++++++---------------------
> >  1 file changed, 76 insertions(+), 51 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 6bbec8783793..c1484bcdbddb 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -848,12 +848,77 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
> >  }
> >  EXPORT_SYMBOL_GPL(replace_page_cache_folio);
> >
> > +static int __split_add_folio_locked(struct xa_state *xas, struct folio *folio,
> > +               pgoff_t index, gfp_t gfp, void **shadowp)
>

Thanks for the very helpful review!

> I don't love the name of this function.  Splitting is a rare thing that
> it does.  I'd suggest it's more filemap_store().

Yes, the function name is a bit misleading indeed. I can rename it as
you suggested, e.g. __filemap_store_locked?
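Storing is indeed the common case; the split branch only triggers when an
existing shadow entry covers a larger range than the folio being stored.
Just to illustrate that condition in isolation, here is a standalone
userspace sketch with made-up values (index 35, an order-4 shadow, an
order-0 folio) -- it is not the xarray code, only the arithmetic behind
the order check in the hunk below:

    #include <stdio.h>

    int main(void)
    {
            unsigned long index = 35;       /* hypothetical faulting index */
            unsigned int shadow_order = 4;  /* hypothetical order of the existing shadow entry */
            unsigned int folio_order = 0;   /* order-0 (4K) folio being added */

            /* An order-N entry covers 2^N naturally aligned slots. */
            unsigned long start = index & ~((1UL << shadow_order) - 1);
            unsigned long end = start + (1UL << shadow_order) - 1;

            /* The shadow spans [32, 47] while the folio fills one slot,
             * so shadow_order > folio_order and the entry must be split
             * before the store can go in. */
            printf("shadow covers [%lu, %lu], folio covers %u slot(s)\n",
                   start, end, 1U << folio_order);
            return 0;
    }

With those values the shadow spans indices 32-47 while the folio fills a
single slot, which is exactly the order > folio_order(folio) case handled
below.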
> > +{
> > +        void *entry, *shadow, *alloced_shadow = NULL;
> > +        int order, alloced_order = 0;
> > +
> > +        gfp &= GFP_RECLAIM_MASK;
> > +        for (;;) {
> > +                shadow = NULL;
> > +                order = 0;
> > +
> > +                xas_for_each_conflict(xas, entry) {
> > +                        if (!xa_is_value(entry))
> > +                                return -EEXIST;
> > +                        shadow = entry;
> > +                }
> > +
> > +                if (shadow) {
> > +                        if (shadow == xas_reload(xas)) {
>
> Why do you need the xas_reload here?

This part of the code works based on the guarantee that if there is a
larger entry, it will be the first and only entry iterated by
xas_for_each_conflict/xas_find_conflict. I added an xas_reload here to
ensure that. But on second thought, this indeed seems not needed.

Will it be better if I write this part this way?

+                shadow = NULL;
+                order = -1;
+
+                xas_for_each_conflict(xas, entry) {
+                        if (!xa_is_value(entry))
+                                return -EEXIST;
+                        /*
+                         * If there is a larger entry, it will be the first
+                         * and only entry iterated.
+                         */
+                        if (order == -1)
+                                order = xas_get_order(xas);
+                        shadow = entry;
+                }
+
+                if (shadow) {
+                        /* check if alloc & split are needed, or if a previous alloc is still valid */
+                        if (order > 0 && order > folio_order(folio)) {
+                                if (shadow != alloced_shadow || order != alloced_order)
+                                        goto unlock;
+                                xas_split(xas, shadow, order);
+                                xas_reset(xas);
+                        }
+                        order = -1;
+                        if (shadowp)
+                                *shadowp = shadow;
+                }

> > +                                order = xas_get_order(xas);
> > +                                if (order && order > folio_order(folio)) {
> > +                                        /* entry may have been split before we acquired lock */
> > +                                        if (shadow != alloced_shadow || order != alloced_order)
> > +                                                goto unlock;
> > +                                        xas_split(xas, shadow, order);
> > +                                        xas_reset(xas);
> > +                                }
> > +                                order = 0;
> > +                        }
>
> I don't think this is right.  I think we can end up skipping a split
> and storing a folio into a slot which is of greater order than the folio
> we're storing.

If there is a larger slot, xas_for_each_conflict and the check above
should catch that?

> > +                        if (shadowp)
> > +                                *shadowp = shadow;
> > +                }
> > +
> > +                xas_store(xas, folio);
> > +                /* Success, return with mapping locked */
> > +                if (!xas_error(xas))
> > +                        return 0;
> > +unlock:
> > +                /*
> > +                 * Unlock path, if errored, return unlocked.
> > +                 * If allocation needed, alloc and retry.
> > +                 */
> > +                xas_unlock_irq(xas);
> > +                if (order) {
> > +                        if (unlikely(alloced_order))
> > +                                xas_destroy(xas);
> > +                        xas_split_alloc(xas, shadow, order, gfp);
> > +                        if (!xas_error(xas)) {
> > +                                alloced_shadow = shadow;
> > +                                alloced_order = order;
> > +                        }
> > +                        goto next;
> > +                }
> > +                /* xas_nomem result checked by xas_error below */
> > +                xas_nomem(xas, gfp);
> > +next:
> > +                xas_lock_irq(xas);
> > +                if (xas_error(xas))
> > +                        return xas_error(xas);
> > +
> > +                xas_reset(xas);
> > +        }
> > +}
>
> Splitting this out into a different function while changing the logic
> really makes this hard to review ;-(

Sorry about this :( This patch basically rewrites the logic of
__filemap_add_folio, and the function is getting long, so I thought it
would be easier to understand if we split it out. I initially updated
the code in place, but that diff seemed messier to me.

> I don't object to splitting the function, but maybe two patches; one
> to move the logic and a second to change it?
>
I can keep it in place in V2.
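And for the unlock path quoted above, the intent is the usual
drop-the-lock / allocate / relock / re-verify retry loop. Below is a
generic, standalone illustration of just that pattern, using a plain
userspace mutex and malloc as stand-ins; the names store_item, needs_prep
and prep are made up for the sketch and this is not the xarray code:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdlib.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static bool needs_prep;     /* protected by lock; stand-in for "shadow needs a split" */
    static void *stored;        /* protected by lock; stand-in for the stored folio */

    int store_item(void *item)
    {
            void *prep = NULL;  /* stand-in for memory preallocated for the split */
            int ret = 0;

            pthread_mutex_lock(&lock);
            for (;;) {
                    if (!needs_prep) {
                            stored = item;          /* common case: store directly */
                            break;
                    }
                    if (prep) {
                            needs_prep = false;     /* stand-in for doing the split */
                            continue;               /* store on the next pass */
                    }
                    /* Preparation needs memory; don't allocate with the lock
                     * held, so drop it, allocate, retake it and loop to
                     * re-check, since the state may have changed meanwhile. */
                    pthread_mutex_unlock(&lock);
                    prep = malloc(64);
                    pthread_mutex_lock(&lock);
                    if (!prep) {
                            ret = -1;
                            break;
                    }
            }
            pthread_mutex_unlock(&lock);
            free(prep); /* this sketch never hands prep off, so always free it */
            return ret;
    }

The point is only that any allocation happens with the lock dropped and
the conflict state is re-checked after relocking, which is what the
alloced_shadow / alloced_order checks in the patch are for.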