From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id F19B8C3065C for ; Tue, 2 Jul 2024 20:33:59 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5356B6B007B; Tue, 2 Jul 2024 16:33:59 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4E5266B0092; Tue, 2 Jul 2024 16:33:59 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3AD046B0093; Tue, 2 Jul 2024 16:33:59 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 1EA896B007B for ; Tue, 2 Jul 2024 16:33:59 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id A6B09A1E8C for ; Tue, 2 Jul 2024 20:33:58 +0000 (UTC) X-FDA: 82295964156.28.42732F1 Received: from mail-ej1-f50.google.com (mail-ej1-f50.google.com [209.85.218.50]) by imf03.hostedemail.com (Postfix) with ESMTP id B412320023 for ; Tue, 2 Jul 2024 20:33:56 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=dCid6F2R; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf03.hostedemail.com: domain of mjguzik@gmail.com designates 209.85.218.50 as permitted sender) smtp.mailfrom=mjguzik@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1719952418; a=rsa-sha256; cv=none; b=ApIgNFbrtt+fkL/cbbAaWuhhjjAf3FLcL51kHezh6GvhPTZuhpVcD5qDytnFzqYAeCozRD sm8qBGmR/cnvH6s0qyv85XwaCmZcDWw7sxgsUUiLYk1YoaxMgnxMDkr0BCUowXhpDlAvo1 ovhTo6fk5Gy7WwaNWa7HlRznBNoiygg= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=dCid6F2R; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf03.hostedemail.com: domain of mjguzik@gmail.com designates 209.85.218.50 as permitted sender) smtp.mailfrom=mjguzik@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1719952418; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=UKwxGLtPtO1csp+XHmWB9Or+kYm/bdnHDR55R+xpdpo=; b=zsQAWM7pzpWAW/jDjMb9cImOddEfegpf/WBIaJfNyRRd+N8IeOAGvd/Z6vlMG/zgEZTlPZ cGYzzVzeebH2kkfxCSEXSZBXwyY0dMlsJj8nKT12udsB/rfO0p+ATCpnvCSF3IXrsQblwZ V0qheJgJsO5vhR9Xt6puvTj2TepEAWc= Received: by mail-ej1-f50.google.com with SMTP id a640c23a62f3a-a7252bfe773so487288566b.1 for ; Tue, 02 Jul 2024 13:33:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719952435; x=1720557235; darn=kvack.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=UKwxGLtPtO1csp+XHmWB9Or+kYm/bdnHDR55R+xpdpo=; b=dCid6F2RYov/4Hn97SZVNAgfsaGmOGPPSsjNWxqk2yvQcj63ZksXtlRYq1IF880DzS BTn5pACPUoUaVLZSOuVEfhvrGz4smRIcb69aFBHvS0Fe8PjSSi3d0yCBTH0g1Q7bDgoE 3GQrBNkb/bJ0/pLauhI4gtnWjI6ZdVnk+mwP9bGDgJMJBy8Rv0Xr2bQD4x594z3dvzg/ pNBbzRC8Ld4/9qDOoBUHMCPyvPJUyYgKkVcP2dnMzkwzDydvmBmbcRwvwC647bF8R6aw DkgzZTI1eRZYA4lqrjV2rtCz1AR900HAPtx0eb2wurTb7gfg+N8bXC76AQCZy2Ig+ivX nlHA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719952435; x=1720557235; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=UKwxGLtPtO1csp+XHmWB9Or+kYm/bdnHDR55R+xpdpo=; b=UjTyErB37e0Vr1WH4E0i5rxriU4kHpSRto3WGlKvUxKHHAU0Rkey75HuZBc2MxY1+i wEvruRNtmERYM/eXrvfY6fbUv1gwmAU6Ivp2kzeacvMQ8+NaYluix/JnyCmQABiBimGD KhFA/70xnc6IOApNLoQ/SFQilOjUV8k1eQmfeiQpAd1fnvhDR4oZpabKAfx384UeZdsR mBf5C9BolonOWi6pTBcUsOcDXEtisRNEEJxJ9xWJyr3ZHpBZzgEPByw5OE2beDfMZ7Rx PkHH7lvOsl1uxeYCwh3wWsBacb7VbAqrLzHBKTqgl6A+/+lWeJc3A/QUF9ht5LkRHC8j kQtQ== X-Forwarded-Encrypted: i=1; AJvYcCWfJDuPggvREPtXKJAX2THN4HjWrLbJC5DrZW59IX5RxQ0itx7POL+irba9fuNxD/SDmc6j6Q3XQFTeeEp9KGW/AqI= X-Gm-Message-State: AOJu0Yz6AZebHwEAumTZgow2msh6KrBSYR78uav5ibqF8FeaaIGLAfYG n/Q3UIawzI694DvsEtPojXQOUvv45ZXJFqp6Uz5ZZxqsJUf4e5sC X-Google-Smtp-Source: AGHT+IEUXKWqxm9bCxi/jzaT3o5g2LjxlsqVoFdk56x3NLDBfiPwQYgB26TF93yXPu/Y+9kunm3JaA== X-Received: by 2002:a17:907:f99:b0:a6f:64b0:1250 with SMTP id a640c23a62f3a-a75143fde21mr603463566b.35.1719952434710; Tue, 02 Jul 2024 13:33:54 -0700 (PDT) Received: from f (cst-prg-31-46.cust.vodafone.cz. [46.135.31.46]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-a72ab08cfccsm449526466b.148.2024.07.02.13.33.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 02 Jul 2024 13:33:54 -0700 (PDT) Date: Tue, 2 Jul 2024 22:33:45 +0200 From: Mateusz Guzik To: Linus Torvalds Cc: Christian Brauner , kernel test robot , oe-lkp@lists.linux.dev, lkp@intel.com, Linux Memory Management List , linux-kernel@vger.kernel.org, ying.huang@intel.com, feng.tang@intel.com, fengwei.yin@intel.com Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput -33.7% regression Message-ID: <3g3arsrwnyvv562v2rsfv2ms4ht4mk45vwdkvssxkrjhfjtpdz@umyx5tl2du7o> References: <202406270912.633e6c61-oliver.sang@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: B412320023 X-Stat-Signature: czkooceecxanuozpdtk1e91f6ywjxh4p X-Rspam-User: X-HE-Tag: 1719952436-418059 X-HE-Meta: U2FsdGVkX18/t3C2O5qBnRNPUV3qS/8XbRAvI1s6PjSCLaA2gcu/BpWS6cnF3LS+CEyIeEdzOpnYKiPICNSLmpMO+RWamSKHt3JYLkxY6YP3YyFIkKrVJUO8RpkYKXb8dZJZLtbn4iOxJpFEbVLlX3S7dpeTNK90ZIwCEfF93UQe3r1iD0Prcce8m+iSKxOoFPaDqMMurkSJnz+nd0GWcB5vk/DnwSbiYEeXH/ny5UbhebDqjJrOD57Y1R5Q5T0dO3FC/Ib1Fnn4L4WKmCBlGcHWdwh06CZydy2pINmFZbYFPlm97m6VNdJAbeSUq4mj94bfpGx0Hn5esl99VcJobUqu6Ke5OiRkF4H9GfuWdu2sXjVXj80c1dQ4hHdRm10AQOGlghf21jFGbu/ND6+CauNHTvyO+CpfT/UPNyYP440ZoYjTO2jHEnrwu5i7l6D5/mbWyBfipkHl0Md78Ia2V4nYUijFgAGqYfnGM5oupymApOzCFEhvNVqZ9ygapD9/UbvzJlDGQu1QJCsBUhUqbZDfjDCuosMWHQlR3kMXp+XC7aflNqxh9bI7t1RdBQI0IO0pT1QMtVS/O44+tE9gO+COt3fIFVZcTdjUwyxhauLZTckDcp1IzLaJB6RfXO6RCWTBJIHqR/i81qGh17I29eRjeTNjEesYejqGT2Aoz3PpGtLaq3WK46tDtGQfQDzqAVxIKJHnHRRXF9OzELbodr0sdevBmSuWi4H2Pc6khbL9Xbfmk4Oy8mXxXxJ7icg9KttTsbcNcCusTcTYI3OeCih8P0L6D/ZP59VeTtsYDhiKgeQHBv1y1BaGblY5TICUUgB+MNkXcWdiVYQRtt7PQcSWFxV1jG3GLvp8enwXuKA3LT25ZzCeP9L67qzP4BAyJvIwC5/7BPE5ajbJWye95il1KSXV7ORZoqnfAt3xek0riaFFM0WFJcxSlwmU0BjAxsHUqFhaujytvaXzHP7 ahDM77eY LLpAQBqq1hs1dqtoILjXpk2Hk3iZ4zehuzgbto4iu6qXRGeIpOqdWzpGYjXCDF+bDbvMPOszPCIIdxYl5RWg8GWrU21zIaZFoaB3TUSVB+3XdudCCSndTI1iSgvgAyW2oA+84gpdEvL0E/Zd+eJ14+5IzpQjp8ocDKAbbvNoGmGpJ0mTnppILnyOIdQbXyUhcBryNN6xqjhvcTBqRkMO9R0RvZeAwcoVdinVDdspTqYE0tI1wTJX5f4Bch9C38AVdTFOt1Wjvv22fg3pElFt9jLEU1nH9UATiaZTTqcaPP/FKjnguNswtXXP32n5e7z3IXePZL2USsdyg2OhOntKffv6iSc2oTyeCKhl1Rw9I8wjtmkfjHx7p/Tgo0aPqbvpRwKTq72U0TjNF/at0cBjyCFDVP3gDhM35Utu1U+LDb84VPsOz/rlmaWLPmJp7qHbM/hJ8uzseS+IIt0KiCqvTIo0C9GghVeBG/YysO3XzXahd7x3qlhDqUTc8DGB0zceC8ek0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Jul 02, 2024 at 11:41:44AM -0700, Linus Torvalds wrote: > On Tue, 2 Jul 2024 at 10:58, Mateusz Guzik wrote: > > > > Suppose the rcu fast path lookup reads the dentry seqc, then does all > > the legitimize_mnt and other work. Everything, except modifying the > > lockref. The caller is given a mnt to put (per-cpu scalable), dentry > > seqc read before any of the path validation and an indication this is > > rcu. > > Yes. > > > Then after whatever is done if the seqc still matches this is the same > > as if there was lockref get/put around it. > > So this is partly why I was thinking of a callback. That "check > sequence number afterwards" is still important. And if it's a > callback, it can be done in the path walking code, and it can go on > and say "oh, I'll need to redo this without RCU". > > If it's a "we returned a dentry under RCU", suddenly the caller has to > know this about the name lookup and do the repeating by hand. > > And as long as we don't expose it to modules and only use it for > "stat()" and friends, I'm ok with it, but I'm just saying that it's > all a bit scary. > > > The only worry is pointers suddenly going NULL or similar as > > dentry/inode is looked at. To be worked out on per-syscall basis. > > We have subtle rules wrt dentry->d_inode. It can indeed become NULL at > any time during the RCU walk, since what protects it is the d_lock and > the dentry count. > > The inode itself is then RCU-free'd, so it will *exist*, but you can't > just blindly use dentry->d_inode itself while under RCU. > Not only that, I think dentry can transition back to positive with another inode. > Which is why it's cached in 'struct nameidata', and we validate it > with nd->seq when it's loaded. And why things like may_lookup() use > nd->inode, not the dentry. > > And then a sequence point failure will return -ECHILD and do the walk > again, while a callback success with all the sequence numbers matching > would return -ECALLBACK or whatever, so that the caller would know > "the stat information was already successfully completed by the > callback". > > Anyway, that was my handwavy "this is why I was thinking of a > callback" thing. But it's also an example of just how nasty and subtle > this all is. > > But I'm convinced this is all eminently *solvable*. There's nothing > fundamental here. Just a lot of small nasty details. > If you are politely by lkml standards suggesting I should probably drop the idea due to unforseen complexities, I'll note I intend to write this for kicks as time permits. If it turns out to be buggy af or there will be nobody with time to review it, it simply wont go in and that will be that. I do think I have an ok understanding what's up, for example did you know xfs does not honor rcu grace periods when recycling inodes? https://lore.kernel.org/all/20231205113833.1187297-1-alexjlzheng@tencent.com/ So this would have to be opt-in per filesystem, probably stuffed somewhere within the inode or dentry. I am definitely not reviewing all the other filesystems for sanity on this front. Rather, one could look over tmpfs, ext4, btrfs and maybe ask Kent to sort out bcachefs (if necessary) and call it a day. Sounds like you are deadset on the callback approach. I'm not going to die on the inline hill, but I will spell it out so that we are on the same page (and I have a question too). In pseudo-code my stuff would like this (names are for ilustrative purposes): struct rcunameidata { .... bool in_rcu; }; ... struct rcunameidata *rnd; error = vfs_rcu_magic_lookup(&rnd, ....); if (error) return error; if (rnd->in_rcu) { /* * fast path goes here, callback code would be identical up to * the point below */ /* * Now validate */ error = vfs_rcu_magic_lookup_validate_or_drop(&rnd, ....)) if (error == 0) /* things worked out */ return export_stuff_to_the_user(....); if (error < 0) /* fail */ return error; } /* * slowpath goes here */ /* * all done, now whack the lookup state. the routine returns void */ vfs_rcu_magic_lookup_finish(&rnd, ....); if (!error) error = export_stuff_to_the_user(....); .... Can you pseudo-code how would the consumer look like in your case? Do you want the callback to execute for both slow and fastpath and switch on the flag? It is rather unclear what you are proposing here. fwiw I think the above would serve as an easy to copy-paste idiom for the few consumers which want it. All the complexity in their case is the in_rcu block which wont go away with a callback. If you still want the callback, callback it is.