From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Kalesh Singh, lsf-pc@lists.linux-foundation.org, "open list:MEMORY MANAGEMENT" (linux-mm@kvack.org), linux-fsdevel
Cc: Suren Baghdasaryan, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett", Juan Yescas, android-mm, Matthew Wilcox, Vlastimil Babka, Michal Hocko
Subject: Re: [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
Date: Sun, 23 Feb 2025 11:04:50 +0530
Message-ID: <87wmdhgr5x.fsf@gmail.com>
Kalesh Singh writes:

> Hi organizers of LSF/MM,
>
> I realize this is a late submission, but I was hoping there might
> still be a chance to have this topic considered for discussion.
>
> Problem Statement
> ===============
>
> Readahead can result in unnecessary page cache pollution for mapped
> regions that are never accessed. Current mechanisms to disable
> readahead lack granularity and rather operate at the file or VMA

From what I understand, the readahead setting is done at the per-bdi
level (default 128K). That means we don't get to control the amount of
readahead pages on a per-file basis. If we could control the amount of
readahead pages per open fd, would that solve the problem you are
facing? It would also mean we don't need to change the setting for the
entire system; we could instead control this knob on a per-fd basis.

I quickly hacked fcntl to allow setting the number of ra_pages in
inode->i_ra_pages. The readahead algorithm then picks up this setting
whenever it initializes the readahead control in file_ra_state_init().
So after opening a file, one can use fcntl F_SET_FILE_READAHEAD to set
the preferred value on the open fd.

Note: I am not saying the implementation is 100% correct. It's just a
quick working PoC to discuss whether this is the right approach to the
given problem.

-ritesh

===========

fcntl: Add control to set per inode readahead pages

As of now the readahead setting is done in units of pages at the bdi
level (default 128K). But sometimes the user wants more granular
control over this knob on a per-file basis. This adds support to
control readahead pages on an open fd.
Signed-off-by: Ritesh Harjani (IBM)
---
 fs/btrfs/defrag.c           |  2 +-
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/relocation.c       |  2 +-
 fs/btrfs/send.c             |  2 +-
 fs/cramfs/inode.c           |  2 +-
 fs/fcntl.c                  | 44 +++++++++++++++++++++++++++++++++++++
 fs/nfs/nfs4file.c           |  2 +-
 fs/open.c                   |  2 +-
 include/linux/fs.h          |  4 +++-
 include/uapi/linux/fcntl.h  |  2 ++
 mm/readahead.c              |  7 ++++--
 11 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 968dae953948..c6616d69a9af 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -261,7 +261,7 @@ static int btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 	range.len = (u64)-1;
 	range.start = cur;
 	range.extent_thresh = defrag->extent_thresh;
-	file_ra_state_init(ra, inode->i_mapping);
+	file_ra_state_init(ra, inode);

 	sb_start_write(fs_info->sb);
 	ret = btrfs_defrag_file(inode, ra, &range, defrag->transid,
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cfa52ef40b06..ac240b148747 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -373,7 +373,7 @@ static void readahead_cache(struct inode *inode)
 	struct file_ra_state ra;
 	unsigned long last_index;

-	file_ra_state_init(&ra, inode->i_mapping);
+	file_ra_state_init(&ra, inode);

 	last_index = (i_size_read(inode) - 1) >> PAGE_SHIFT;
 	page_cache_sync_readahead(inode->i_mapping, &ra, NULL, 0, last_index);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index bf267bdfa8f8..7688b79ae7e7 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3057,7 +3057,7 @@ static int relocate_file_extent_cluster(struct reloc_control *rc)
 	if (ret)
 		goto out;

-	file_ra_state_init(ra, inode->i_mapping);
+	file_ra_state_init(ra, inode);

 	ret = setup_relocation_extent_mapping(rc);
 	if (ret)
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 7254279c3cc9..b22fc2a426e4 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5745,7 +5745,7 @@ static int send_extent_data(struct send_ctx *sctx, struct btrfs_path *path,
 		return err;
 	}

 	memset(&sctx->ra, 0, sizeof(struct file_ra_state));
-	file_ra_state_init(&sctx->ra, sctx->cur_inode->i_mapping);
+	file_ra_state_init(&sctx->ra, sctx->cur_inode);

 	/*
 	 * It's very likely there are no pages from this inode in the page
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index b84d1747a020..917f09040f6e 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -214,7 +214,7 @@ static void *cramfs_blkdev_read(struct super_block *sb, unsigned int offset,
 	devsize = bdev_nr_bytes(sb->s_bdev) >> PAGE_SHIFT;

 	/* Ok, read in BLKS_PER_BUF pages completely first. */
-	file_ra_state_init(&ra, mapping);
+	file_ra_state_init(&ra, mapping->host);
 	page_cache_sync_readahead(mapping, &ra, NULL, blocknr, BLKS_PER_BUF);

 	for (i = 0; i < BLKS_PER_BUF; i++) {
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 49884fa3c81d..277afe78536f 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -394,6 +394,44 @@ static long fcntl_set_rw_hint(struct file *file, unsigned int cmd,
 	return 0;
 }

+static long fcntl_get_file_readahead(struct file *file, unsigned int cmd,
+				     unsigned long arg)
+{
+	struct inode *inode = file_inode(file);
+	u64 __user *argp = (u64 __user *)arg;
+	u64 ra_pages = READ_ONCE(inode->i_ra_pages);
+
+	if (copy_to_user(argp, &ra_pages, sizeof(*argp)))
+		return -EFAULT;
+	return 0;
+}
+
+static long fcntl_set_file_readahead(struct file *file, unsigned int cmd,
+				     unsigned long arg)
+{
+	struct inode *inode = file_inode(file);
+	u64 __user *argp = (u64 __user *)arg;
+	u64 ra_pages;
+
+	if (!inode_owner_or_capable(file_mnt_idmap(file), inode))
+		return -EPERM;
+
+	if (copy_from_user(&ra_pages, argp, sizeof(ra_pages)))
+		return -EFAULT;
+
+	WRITE_ONCE(inode->i_ra_pages, ra_pages);
+
+	/*
+	 * file->f_mapping->host may differ from inode. As an example,
+	 * blkdev_open() modifies file->f_mapping.
+	 */
+	if (file->f_mapping->host != inode)
+		WRITE_ONCE(file->f_mapping->host->i_ra_pages, ra_pages);
+
+	return 0;
+}
+
 /* Is the file descriptor a dup of the file? */
 static long f_dupfd_query(int fd, struct file *filp)
 {
@@ -552,6 +590,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_RW_HINT:
 		err = fcntl_set_rw_hint(filp, cmd, arg);
 		break;
+	case F_GET_FILE_READAHEAD:
+		err = fcntl_get_file_readahead(filp, cmd, arg);
+		break;
+	case F_SET_FILE_READAHEAD:
+		err = fcntl_set_file_readahead(filp, cmd, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 1cd9652f3c28..cee84aa8aa0f 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -388,7 +388,7 @@ static struct file *__nfs42_ssc_open(struct vfsmount *ss_mnt,
 	nfs_file_set_open_context(filep, ctx);
 	put_nfs_open_context(ctx);

-	file_ra_state_init(&filep->f_ra, filep->f_mapping->host->i_mapping);
+	file_ra_state_init(&filep->f_ra, filep->f_mapping->host);
 	res = filep;
 out_free_name:
 	kfree(read_name);
diff --git a/fs/open.c b/fs/open.c
index 0f75e220b700..466c3affe161 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -961,7 +961,7 @@ static int do_dentry_open(struct file *f,
 		f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);

 	f->f_iocb_flags = iocb_flags(f);
-	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
+	file_ra_state_init(&f->f_ra, f->f_mapping->host);

 	if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
 		return -EINVAL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 12fe11b6e3dd..77ee23e30245 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -678,6 +678,8 @@ struct inode {
 	unsigned short		i_bytes;
 	u8			i_blkbits;
 	enum rw_hint		i_write_hint;
+	/* Per inode setting for max readahead in page_size units */
+	unsigned long		i_ra_pages;
 	blkcnt_t		i_blocks;

 #ifdef __NEED_I_SIZE_ORDERED
@@ -3271,7 +3273,7 @@ extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
 extern void
-file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
+file_ra_state_init(struct file_ra_state *ra, struct inode *inode);
 extern loff_t noop_llseek(struct file *file, loff_t offset, int whence);
 extern loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize);
 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int whence);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6e6907e63bfc..b6e5413ca660 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -60,6 +60,8 @@
 #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
+#define F_GET_FILE_READAHEAD	(F_LINUX_SPECIFIC_BASE + 15)
+#define F_SET_FILE_READAHEAD	(F_LINUX_SPECIFIC_BASE + 16)

 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
diff --git a/mm/readahead.c b/mm/readahead.c
index 2bc3abf07828..71079ae1753d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -136,9 +136,12 @@
  * memset *ra to zero.
  */
 void
-file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
+file_ra_state_init(struct file_ra_state *ra, struct inode *inode)
 {
-	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
+	unsigned int ra_pages = inode->i_ra_pages ? inode->i_ra_pages :
+				inode_to_bdi(inode)->ra_pages;
+
+	ra->ra_pages = ra_pages;
 	ra->prev_pos = -1;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
--
2.39.5