Date: Fri, 9 Apr 2021 08:11:54 +1000
From: Dave Chinner <david@fromorbit.com>
To: Daniel Xu
Cc: bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, jolsa@kernel.org, hannes@cmpxchg.org,
	yhs@fb.com
Subject: Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache
Message-ID: <20210408221154.GL1990290@dread.disaster.area>
References: <22bededbd502e0df45326a54b3056941de65a101.1617831474.git.dxu@dxuuu.xyz>
In-Reply-To: <22bededbd502e0df45326a54b3056941de65a101.1617831474.git.dxu@dxuuu.xyz>
On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> This commit introduces the bpf page cache iterator. This iterator allows
> users to run a bpf prog against each page in the "page cache".
> Internally, the "page cache" is extremely tied to VFS superblock + inode
> combo. Because of this, iter_pagecache will only examine pages in the
> caller's mount namespace.

No, it does not just examine pages within the caller's mount
namespace, because ....

> +static struct inode *goto_next_inode(struct bpf_iter_seq_pagecache_info *info)
> +{
> +	struct inode *prev_inode = info->cur_inode;
> +	struct inode *inode;
> +
> +retry:
> +	BUG_ON(!info->cur_sb);
> +	spin_lock(&info->cur_sb->s_inode_list_lock);
> +
> +	if (!info->cur_inode) {
> +		list_for_each_entry(inode, &info->cur_sb->s_inodes, i_sb_list) {

... this is an "all inodes on the superblock" walk. This will also
iterate inodes in other mount namespaces that point to the same
superblock.

IOWs, if you have different parts of the same filesystem mounted
into hundreds of container mount namespaces, this iterator will not
just walk the local mount namespace, it will iterate every inode in
every mount namespace.

And, of course, if the same files are mounted into multiple
containers (think read-only bind mounts using idmapping) then you
have zero indication of which container is actually using them,
just that there are hundreds of paths to the same inode. And every
container will appear to be using exactly the same amount of page
cache.

IOWs, the stats this generates provide no insight into page cache
usage across mount namespaces in many situations, and it leaks
information about page cache usage across mount namespace
boundaries.

And that's before I say "iterating all inodes in a superblock is
bad" because it causes lock contention and interrupts normal usage.
We avoid s_inodes list walks as much as we possibly can, and the
last thing we want is for userspace to be able to trivially
instigate long-running walks of the s_inodes list. Remember, we can
have hundreds of millions of inodes on this list....

> +		spin_lock(&inode->i_lock);
> +		if (inode_unusual(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +		__iget(inode);
> +		spin_unlock(&inode->i_lock);

This can hold the s_inode_list_lock long enough to have waiters
spinning on it and triggering livelock warnings. Even if it's not
held that long, it can cause unexpected long tail latencies in
memory reclaim and inode instantiation. Every s_inodes list walk
has cond_resched() built into it now....

> +	info->ns = current->nsproxy->mnt_ns;
> +	get_mnt_ns(info->ns);
> +	INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL);
> +
> +	spin_lock(&info->ns->ns_lock);
> +	list_for_each_entry(mnt, &info->ns->list, mnt_list) {
> +		sb = mnt->mnt.mnt_sb;
> +
> +		/* The same mount may be mounted in multiple places */
> +		if (radix_tree_lookup(&info->superblocks, (unsigned long)sb))
> +			continue;
> +
> +		err = radix_tree_insert(&info->superblocks,
> +					(unsigned long)sb, (void *)1);

And just because nobody has pointed it out yet: radix_tree_insert()
will do GFP_KERNEL memory allocations inside the spinlock being
held here.
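The usual idiom for that is to preallocate the radix tree nodes
outside the lock with radix_tree_preload() and then do the
insertion atomically under it. A completely untested sketch of the
idea (insert_sb() is a made-up helper, and because the preload can
sleep it cannot be called with ns_lock already held, so the mount
list walk would have to be restructured around it):

	static int insert_sb(struct radix_tree_root *root, spinlock_t *lock,
			     struct super_block *sb)
	{
		int err;

		/* Preallocate nodes for one insertion; may sleep, no lock held. */
		err = radix_tree_preload(GFP_KERNEL);
		if (err)
			return err;

		spin_lock(lock);
		err = radix_tree_insert(root, (unsigned long)sb, (void *)1);
		spin_unlock(lock);

		/* radix_tree_preload() disabled preemption; undo that. */
		radix_tree_preload_end();

		/* An existing entry just means the sb was already seen. */
		return err == -EEXIST ? 0 : err;
	}

Alternatively, initialising the tree with GFP_ATOMIC would at least
make the allocations legal under the spinlock, at the cost of
having them fail under memory pressure.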
----

You said that you didn't take the "walk the LRUs" approach because
walking superblocks "seemed simpler". It's not.

Page cache residency and accounting is managed by memcgs, not by
mount namespaces. That is, containers usually have a memcg
associated with them to control the memory usage of the container.
The page cache used by a container is accounted directly to that
memcg, and memory reclaim can find all the file-backed page cache
pages associated with a memcg very quickly (via
mem_cgroup_lruvec()). That will find the pages associated directly
with the memcg, so it gives you a fairly accurate picture of the
page cache usage within the container.

This has none of the issues that arise from "sb != mnt_ns" that
walking superblocks and inode lists has, and it doesn't require you
to play games with mounts, superblocks and inode references....
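As a rough illustration of where those pages live (completely
untested, count_memcg_file_pages() is a made-up name, and
mem_cgroup_lruvec() and the per-lruvec lru_lock are as of ~v5.12):

	static unsigned long count_memcg_file_pages(struct mem_cgroup *memcg,
						    struct pglist_data *pgdat)
	{
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
		unsigned long nr = 0;
		struct page *page;
		enum lru_list lru;

		spin_lock_irq(&lruvec->lru_lock);
		for_each_lru(lru) {
			/* Page cache pages sit on the file LRUs. */
			if (!is_file_lru(lru))
				continue;

			/* page->mapping/page->index name each cache entry. */
			list_for_each_entry(page, &lruvec->lists[lru], lru)
				nr++;
		}
		spin_unlock_irq(&lruvec->lru_lock);

		return nr;
	}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com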