Date: Sat, 24 Feb 2024 18:13:12 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Linus Torvalds
Cc: Luis Chamberlain, lsf-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox wrote:
> >
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > > What are the limits to buffered IO and how do we test that? Who
> > > keeps track of it?
> >
> > TLDR: Why does the pagecache suck?
>
> What? No.
>
> Our page cache is so good that the question is literally "what are the
> limits of it", and "how we would measure them".
>
> That's not a sign of suckage.

For a lot of things our pagecache is amazing. And in other ways it
absolutely sucks. The trick will be fixing those less-used glass jaws
without damaging the common cases.

> When you have to have completely unrealistic loads that nobody would
> actually care about in reality just to get a number for the limit,
> it's not a sign of problems.

No, but sometimes the unrealistic loads are, alas, a good proxy for
problems that customers hit. For example, I have one where the customer
does an overnight backup with a shitty backup program that doesn't use
O_DIRECT and ends up evicting the actual working set from the page
cache. They start work the next morning with terrible performance
because everything they care about has been swapped out. The "fix" is
to limit the pagecache to one NUMA node, roughly along the lines of the
sketch below.

I suspect that if this customer could be persuaded to run a more recent
kernel, this problem would turn out to be solved already, so I'm not
sure there's a call to action from this particular case. Anyway, if
there's a way to fix an unrealistic load that doesn't affect realistic
loads, sometimes we fix a customer problem too.
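(What that workaround amounts to, very roughly: run the backup under a
bind mempolicy so the page cache it populates is allocated from a single
node. This is only a sketch of the idea, not the customer's actual
tooling; the node number and the wrapper itself are invented, and the
real deployment may well do it with cpusets rather than a per-task
policy.)

/* confine-backup.c: run a command with its memory allocations, including
 * the page cache folios it populates, bound to NUMA node 0.
 * Build with: gcc -o confine-backup confine-backup.c -lnuma
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long nodemask = 1UL << 0;	/* node 0 only */

	if (argc < 2)
		return 1;
	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask))) {
		perror("set_mempolicy");
		return 1;
	}
	/* the policy is preserved across execve(), so the backup inherits it */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}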
> Guess what? It's because the CPU in question had quite a bit of L3,
> and it was spread out, and the CPU doesn't even start the memory
> access before it has checked caches.
>
> And here's a big honking clue: only a complete nincompoop and mentally
> deficient rodent would look at that and say "caches suck".

although the problem might be that the CPU has a terrible cross-chiplet
interconnect ;-)

> > > ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount
> > > on x86_64
> > > Vs
> > > ~ 7,000 MiB/s with buffered IO
> >
> > Profile? My guess is that you're bottlenecked on the xa_lock between
> > memory reclaim removing folios from the page cache and the various
> > threads adding folios to the page cache.
>
> I doubt it's the locking.

It might not be! But there are read-only workloads that do bottleneck
on the xa_lock.
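The kind of thing I mean is a few dozen threads doing buffered random
reads of a file that doesn't come close to fitting in memory, so reclaim
is pulling folios out of the mapping at the same rate the readers are
inserting them. A minimal reproducer sketch (the path, file size and
thread count here are invented, and you need a file that big to already
exist):

/* xalock-repro.c: hammer a single mapping with many concurrent buffered
 * readers. Build with: gcc -O2 -pthread -o xalock-repro xalock-repro.c
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS	64
#define BUFSZ		(64 * 1024)
#define FILESIZE	(512ULL << 30)		/* far larger than RAM */

static int fd;

static void *reader(void *arg)
{
	unsigned int seed = (uintptr_t)arg;
	char *buf = malloc(BUFSZ);
	uint64_t off;

	for (;;) {
		/* random, BUFSZ-aligned offsets across the whole file */
		off = ((uint64_t)rand_r(&seed) * BUFSZ) % FILESIZE;
		pread(fd, buf, BUFSZ, off);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	uintptr_t i;

	fd = open("/mnt/test/bigfile", O_RDONLY);	/* invented path */
	if (fd < 0)
		return 1;
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, reader, (void *)(i + 1));
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

That's the shape of workload I'm talking about: in steady state both the
readers inserting folios and reclaim removing them have to take the same
per-mapping i_pages lock.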
> For writeout, we have a very traditional problem: we care about a
> million times more about latency than we care about throughput,
> because nobody ever actually cares all that much about performance of
> huge writes.
>
> Ask yourself when you have last *really* sat there waiting for writes,
> unless it's some dog-slow USB device that writes at 100kB/s?

You picked a bad day to send this email ;-)

$ sudo dd if=Downloads/debian-testing-amd64-netinst.iso of=/dev/sda
[sudo] password for willy:
1366016+0 records in
1366016+0 records out
699400192 bytes (699 MB, 667 MiB) copied, 296.219 s, 2.4 MB/s

ok, that was a cheap-arse USB stick, but then I had to wait for the
installer to write 800GB of random data to /dev/nvme0n1p3 as it set up
the crypto drive. The Debian installer didn't time that for me, but it
was enough time to vacuum the couch.

> Now, the benchmark that Luis highlighted is a completely different
> class of historical problems that has been around forever, namely the
> "fill up lots of memory with dirty data".
>
> And there - because the problem is easy to trigger but nobody tends to
> care deeply about throughput because they care much much *MUCH* more
> about latency, we have a rather stupid big hammer approach.
>
> It's called "vm_dirty_bytes".
>
> Well, that's the knob (not the only one). The actual logic around it
> is then quite the morass of turning that into the
> dirty_throttle_control, and the per-bdi dirty limits that try to take
> the throughput of the backing device into account etc etc.
>
> And then all those heuristics are used to actually LITERALLY PAUSE the
> writer. We literally have this code:
>
>         __set_current_state(TASK_KILLABLE);
>         bdi->last_bdp_sleep = jiffies;
>         io_schedule_timeout(pause);
>
> in balance_dirty_pages(), which is all about saying "I'm putting you
> to sleep, because I judge you to have dirtied so much memory that
> you're making things worse for others".
>
> And a lot of *that* is then because we haven't wanted everybody to
> rush in and start their own synchronous writeback, but instead want
> all writeback to be done by somebody else. So now we move from
> mm/page-writeback.c to fs/fs-writeback.c, and all the work-queues to
> do dirty writeout.
>
> Notice how the io_schedule_timeout() above doesn't even get woken up
> by IO completing. Nope. The "you have written too much" logic
> literally pauses the writer, and doesn't even want to wake it up when
> there is no more dirty data.
>
> So the "you went over the dirty limits" is a penalty box, and all of
> this comes from "you are doing something that is abnormal and that
> disturbs other people, so you get an unconditional penalty". Yes, the
> timeout is then obviously tied to how much of a problem the dirtying
> is (based on that whole "how fast is the device") but it's purely a
> heuristic.
>
> And (one) important part here is "nobody sane does that". So
> benchmarking this is a bit crazy. The code is literally meant for bad
> actors, and what you are benchmarking is the kernel telling you "don't
> do that then".

This was a really good writeup, thanks. I think this might need some
tuning (if it is what's going on). When the underlying device can do
86GB/s and we're only getting 7GB/s, we could afford to let this writer
do a bit more.
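For concreteness, the class of writer we're talking about is basically
the below. This is just an illustrative sketch, not Luis's actual
benchmark; the path and sizes are invented. A task like this quickly
blows through the dirty limits and from then on is paced by those
io_schedule_timeout() pauses (if that is indeed what's limiting Luis's
numbers):

/* dirty-filler.c: single buffered writer dirtying far more data than the
 * dirty limits allow, so it lands in the balance_dirty_pages() penalty
 * box. Build with: gcc -O2 -o dirty-filler dirty-filler.c
 */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK	(1UL << 20)		/* 1MiB buffered writes */
#define TOTAL	(400ULL << 30)		/* 400GiB, way past any dirty limit */

int main(void)
{
	unsigned long long written;
	char *buf = malloc(CHUNK);
	int fd;

	if (!buf)
		return 1;
	memset(buf, 0x5a, CHUNK);
	fd = open("/mnt/test/dirty-filler", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return 1;
	for (written = 0; written < TOTAL; written += CHUNK)
		if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
			return 1;
	return 0;
}

Nothing about it is exotic; it's just write(2) in a loop, which is
rather the point.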