From: Janpieter Sollie <janpieter.sollie@kabelmail.de>
Date: Mon, 21 Oct 2024 10:46:47 +0200
Subject: Re: [PATCH] mm: Drop INT_MAX limit from kvmalloc()
To: Kent Overstreet, Linus Torvalds
Cc: Lorenzo Stoakes, linux-bcachefs@vger.kernel.org, linux-mm@kvack.org,
 Vlastimil Babka, Andrew Morton, Uladzislau Rezki, Christoph Hellwig

On 20/10/2024 at 22:29, Kent Overstreet wrote:
> On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote:
>
>> Enough said, and you're just making shit up to make excuses.
>>
>> Also, you might want to start looking at latency numbers in addition to
>> throughput. If your journal replay needs an *index* that is 2G in
>> size, you may have other issues.
> Latency for journal replay?
>
> No, journal replay is only something that happens at mount after an unclean
> shutdown. We can afford to take some time there, and journal replay
> performance hasn't been a concern.
>
>> Your journal size is insane, and your "artificial cap on performance"
>> had better come with numbers.
> I'm not going to run custom benchmarks just for a silly argument, sorry.
>
> But on a fileserver with 128 GB of ram and a 75 TB filesystem (yes,
> that's likely a dedicated fileserver), we can quite easily justify a
> btree node cache of perhaps 10GB, and on random update workloads the
> journal does need to be that big - otherwise our btree node write size
> goes down and throughput suffers.
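
For context, the limit this patch removes is the hard size check in
kvmalloc(); paraphrased from mm/util.c (the exact function name and shape
vary by kernel version), it looks roughly like this:

	void *kvmalloc_node(size_t size, gfp_t flags, int node)
	{
		...
		/* Don't even allow crazy sizes */
		if (unlikely(size > INT_MAX)) {
			WARN_ON_ONCE(!(flags & __GFP_NOWARN));
			return NULL;
		}
		...
	}

So today an allocation request above INT_MAX (~2 GiB) simply returns NULL,
which is presumably why the 2G journal replay index came up.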

Is this idea based on the assumption that the user only has 1 FS per device?
I assume that describes my setup too (and it probably does, it looks like
mine), yet I have 3 bcachefs filesystems, each taking 10% of RAM, so I end
up with 30% of memory dedicated to bcachefs caching.

If I read your argument correctly, you are saying "I want a large btree node
cache, because that makes the fs more efficient". No doubts about that.

However, VFS buffering may already save you a lot of the lookups you are
building the btree node cache for. In theory the two work very differently,
but in practice, which files will the filesystem mostly look up? Probably
the few that are already in your VFS buffer, so the added value of keeping a
large "metadata" cache seems doubtful.

I also have my doubts about trading 15G of buffer for 15G of btree node
cache: you lose the opportunity to share those 15G of RAM between all
filesystems. On the other hand, for workloads that look up many different
files, the btree node cache will shine with everything it has.

Maybe some tuning parameter could help here, along the lines of the sketch
below? It would at least limit the "insane" required journal size.
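
As a purely hypothetical illustration (no such knob exists in bcachefs today;
the function name and cap are made up for the example, and the 10% figure is
just the share I see on my machines), the kind of cap I have in mind would
look something like:

	/*
	 * Hypothetical sketch: size the btree node cache as a fraction of
	 * total RAM, but clamp it to a per-filesystem ceiling so that several
	 * filesystems on one machine do not each pin 10% of memory.
	 */
	static unsigned long btree_cache_target_bytes(unsigned long totalram_bytes,
						      unsigned long cap_bytes)
	{
		unsigned long target = totalram_bytes / 10;	/* ~10% of RAM */

		return target < cap_bytes ? target : cap_bytes;
	}

With a cap like that, three filesystems would cost three ceilings rather
than 30% of whatever RAM the machine happens to have.
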
Janpieter Sollie
