Date: Mon, 29 Jan 2024 15:56:18 +1100
From: Dave Chinner <david@fromorbit.com>
To: Mike Snitzer
Cc: Mike Snitzer, Matthew Wilcox, Ming Lei, Andrew Morton,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Don Dutile, Raghavendra K T,
    Alexander Viro, Christian Brauner
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
References: <20240128142522.1524741-1-ming.lei@redhat.com>

On Sun, Jan 28, 2024 at 09:12:12PM -0500, Mike Snitzer wrote:
> On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner wrote:
> >
> > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > On Sun, Jan 28, 2024 at 7:22 PM
> > > Matthew Wilcox wrote:
> > > >
> > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > > Matthew Wilcox wrote:
> > > > Understood. But ... the application is asking for as much readahead as
> > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > a time". So why will we not get a bug report in 1-15 years time saying
> > > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > > typically we allow the sysadmin to override application requests,
> > > > don't we?
> > >
> > > The application isn't knowingly asking for readahead. It is asking to
> > > mmap the file (and reporter wants it done as quickly as possible..
> > > like occurred before).
> >
> > .. which we do within the constraints of the given configuration.
> >
> > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > request size based on read-ahead setting") -- same logic, just applied
> > > to callchain that ends up using madvise(MADV_WILLNEED).
> >
> > Not really. There is a difference between performing a synchronous
> > read IO here that we must complete, compared to optimistic
> > asynchronous read-ahead which we can fail or toss away without the
> > user ever seeing the data the IO returned.
> >
> > We want required IO to be done in as few, larger IOs as possible,
> > and not be limited by constraints placed on background optimistic
> > IOs.
> >
> > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > complete the data reads successfully. If the data is actually
> > required, we'll guarantee completion when the user accesses it, not
> > when madvise() is called. IOWs, madvise is async readahead, and so
> > really should be constrained by readahead bounds and not user IO
> > bounds.
> >
> > We could change this behaviour for madvise of large ranges that we
> > force into the page cache by ignoring device readahead bounds, but
> > I'm not sure we want to do this in general.
> >
> > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > value in this situation to override the device limit for large
> > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > restore it once the readahead operation is done. This would make it
> > behave less like readahead and more like a user read from an IO
> > perspective...
>
> I'm not going to pretend like I'm an expert in this code or all the
> distinctions that are in play. BUT, if you look at the high-level
> java reproducer: it is requesting mmap of a finite size, starting from
> the beginning of the file:
> FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

Mapping an entire file does not mean "we are going to access the
entire file". Lots of code will do this, especially code that does
random accesses within the file.

> Yet you're talking about the application like it is stabbingly
> triggering unbounded async reads that can get dropped, etc, etc. I

I don't know what the application actually does. All I see is a
microbenchmark that mmaps() a file and walks it sequentially, on a
system where readahead has been tuned to de-prioritise sequential IO
performance.

> just want to make sure the subtlety of (ab)using madvise(WILLNEED)
> like this app does isn't incorrectly attributed to something it isn't.
> The app really is effectively requesting a user read of a particular
> extent in terms of mmap, right?

madvise() is an -advisory- interface that does not guarantee any
specific behaviour. The man page says:

       MADV_WILLNEED
              Expect access in the near future. (Hence, it might be a
              good idea to read some pages ahead.)

It says nothing about guaranteeing that all the data is brought into
memory, or that if it does get brought into memory that it will
remain there until the application accesses it. It doesn't even
imply that IO will be done immediately. Any application relying on
madvise() to fully populate the page cache range before returning is
expecting behaviour that is neither documented nor guaranteed.

Similarly, the fadvise64() man page does not say that WILLNEED will
bring the entire file into cache:

       POSIX_FADV_WILLNEED
              The specified data will be accessed in the near future.

              POSIX_FADV_WILLNEED initiates a nonblocking read of the
              specified region into the page cache. The amount of data
              read may be decreased by the kernel depending on virtual
              memory load. (A few megabytes will usually be fully
              satisfied, and more is rarely useful.)

> BTW, your suggestion to have this app fiddle with ra_pages and then

No, I did not suggest that the app fiddle with anything. I was
talking about the in-kernel FADV_WILLNEED implementation changing
file->f_ra.ra_pages, much as FADV_RANDOM and FADV_SEQUENTIAL already
do, to change readahead IO behaviour. That then allows subsequent
readahead on that vma->file to use a larger value than the default
value pulled in off the device.

Largely, I think the problem is that the application has set a
readahead limit too low for optimal sequential performance.
Complaining that readahead is slow when it has been explicitly tuned
to be slow doesn't really seem like a problem we can fix with code.

-Dave.
-- 
Dave Chinner
david@fromorbit.com
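
For anyone who wants to see the advisory behaviour described above for
themselves, here is a minimal userspace sketch in C. It is not code from
this thread or from the reproducer: the helper name
willneed_resident_pages() is made up for illustration, and it assumes a
Linux/glibc system where madvise() and mincore() are available. It maps a
file, issues MADV_WILLNEED over the whole range, and then asks mincore()
how many pages happen to be resident at that instant. The count it returns
is only a snapshot, and may well be lower than the number of pages in the
mapping.

```c
#define _DEFAULT_SOURCE		/* glibc: expose madvise() and mincore() */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): map `path` read-only, advise
 * MADV_WILLNEED on the whole mapping, then count the pages mincore()
 * reports as resident. Returns that count, or -1 on error. On success
 * *total_pages is set to the number of pages in the mapping.
 */
long willneed_resident_pages(const char *path, long *total_pages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long resident = -1;
	unsigned char *vec = NULL;
	void *map = MAP_FAILED;
	size_t len = 0, npages = 0, i;
	struct stat st;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (page_size <= 0 || fstat(fd, &st) < 0 || st.st_size <= 0)
		goto out;

	len = (size_t)st.st_size;
	npages = (len + (size_t)page_size - 1) / (size_t)page_size;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	vec = malloc(npages);
	if (map == MAP_FAILED || vec == NULL)
		goto out;

	/* Advice only: the kernel may start readahead, trim it, or skip it. */
	if (madvise(map, len, MADV_WILLNEED) < 0)
		goto out;

	/*
	 * Residency snapshot. It can change at any time, and Linux may
	 * report every page as resident if the caller neither owns nor can
	 * write the file.
	 */
	if (mincore(map, len, vec) < 0)
		goto out;

	resident = 0;
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;	/* bit 0 set: page is in core */
	*total_pages = (long)npages;
out:
	free(vec);
	if (map != MAP_FAILED)
		munmap(map, len);
	close(fd);
	return resident;
}
```

Calling this on a large, cold file and comparing the return value with
*total_pages gives a rough feel for how much of the range the kernel
chose to read before madvise() returned. The only way to be sure a page
is in memory is still to access it.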