From: Laurent Pinchart
To: ksummit-discuss@lists.linuxfoundation.org
Date: Thu, 22 May 2014 01:40:46 +0200
Message-ID: <5582461.hFQ1jGg3Wg@avalon>
References: <2980546.hqgiQV7seV@vostro.rjw.lan>
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"
Subject: Re: [Ksummit-discuss] [CORE TOPIC] [nomination] Move Fast and Oops Things

On Wednesday 21 May 2014 16:03:49 Dan Williams wrote:
> On Wed, May 21, 2014 at 4:06 PM, Rafael J. Wysocki wrote:
> > On Wednesday, May 21, 2014 08:35:55 AM Dan Williams wrote:
> >> On Wed, May 21, 2014 at 3:11 AM, NeilBrown wrote:
> >> > On Wed, 21 May 2014 01:36:55 -0700 Dan Williams wrote:
> >> >> On Wed, May 21, 2014 at 1:25 AM, NeilBrown wrote:
> >> >> > On Wed, 21 May 2014 00:48:48 -0700 Dan Williams wrote:
> >> >> >> On Fri, May 16, 2014 at 8:04 AM, Chris Mason wrote:
> >> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
> >> >> >> > Hash: SHA1
> >> >> >> >
> >> >> >> > On 05/15/2014 10:56 PM, NeilBrown wrote:
> >> >> >> >> On Thu, 15 May 2014 16:13:58 -0700 Dan Williams wrote:
> >> >> >> >>> What would it take and would we even consider moving 2x faster
> >> >> >> >>> than we are now?
> >> >> >> >>
> >> >> >> >> Hi Dan, you seem to be suggesting that there is some limit other
> >> >> >> >> than "competent engineering time" which is slowing Linux
> >> >> >> >> "progress" down.
> >> >> >> >>
> >> >> >> >> Are you really suggesting that? What might these other limits
> >> >> >> >> be?
> >> >> >> >>
> >> >> >> >> Certainly there are limits to the minimum gap between
> >> >> >> >> conceptualisation and release (at least one release cycle), but
> >> >> >> >> is there really a limit to the parallelism that can be achieved?
> >> >> >> >
> >> >> >> > I haven't compared the FB commit rates with the kernel, but I'll
> >> >> >> > pretend Dan's basic thesis is right and talk about which parts of
> >> >> >> > the facebook model may move faster than the kernel.
> >> >> >> >
> >> >> >> > The facebook model is pretty similar to the way the kernel works.
> >> >> >> > The merge window lasts a few days and the major releases are every
> >> >> >> > week, but overall it isn't too far away.
> >> >> >> >
> >> >> >> > The biggest difference is that we have a centralized tool for
> >> >> >> > reviewing the patches, and once it has been reviewed by a
> >> >> >> > specific number of people, you push it in.
> >> >> >> >
> >> >> >> > The patch submission tool runs the patch through lint and various
> >> >> >> > static analysis to make sure it follows proper coding style and
> >> >> >> > doesn't include patterns of known bugs. This cuts down on the
> >> >> >> > review work because the silly coding style mistakes are gone
> >> >> >> > before it gets to the tool.
> >> >> >> >
> >> >> >> > When you put in a patch, you have to put in reviewers, and they
> >> >> >> > get a little notification that your patch needs review. Once the
> >> >> >> > reviewers are happy, you push the patch in.
> >> >> >> >
> >> >> >> > The biggest difference: there are no maintainers. If I want to
> >> >> >> > go change the calendar tool to fix a bug, I patch it, get someone
> >> >> >> > else to sign off and push.
> >> >> >> >
> >> >> >> > All of which is my way of saying the maintainers (me included)
> >> >> >> > are the biggest bottleneck. There are a lot of reasons I think
> >> >> >> > the maintainer model fits the kernel better, but at least for
> >> >> >> > btrfs I'm trying to speed up the patch review process and use
> >> >> >> > patchwork more effectively.
> >> >> >>
> >> >> >> To be clear, I'm not arguing for a maintainer-less model. We don't
> >> >> >> have the tooling or operational data to support that. We need
> >> >> >> maintainers to say "no". But, what I think we can do is give
> >> >> >> maintainers more varied ways to say it. The goal: de-escalate the
> >> >> >> merge event as a declaration that the code quality/architecture
> >> >> >> conversation is over.
> >> >> >>
> >> >> >> Release early, release often, and with care merge often.
> >> >> >
> >> >> > I think this falls foul of the "no regressions" rule.
> >> >> >
> >> >> > The kernel policy is that once the functionality gets to users, it
> >> >> > cannot be taken away. Individual drivers in 'staging' manage to
> >> >> > avoid this rule because they are clearly separate things.
> >> >> > New system calls and attributes in sysfs etc. seem to be much harder
> >> >> > to "partially" release.
> >> >>
> >> >> My straw man is something like the following for driver "foo":
> >> >>
> >> >>     if (gatekeeper_foo_new_awesome_sauce)
> >> >>             do_new_thing();
> >> >>
> >> >> Where setting gatekeeper_foo_new_awesome_sauce taints the kernel and
> >> >> warns that there is no guarantee of this functionality being present
> >> >> in the same form or at all going forward.
> >> >
> >> > Interesting idea.
> >> > Trying to imagine how this might play out in practice....
> >> >
> >> > You talk about "value delivered to users". But users tend to use
> >> > applications, and applications are the users of kernel features.
> >> >
> >> > Will anyone bother writing or adapting an application to use a feature
> >> > which is not guaranteed to hang around?
> >> > Maybe they will, but will the users of the application know that it
> >> > might stop working after a kernel upgrade? Maybe...
> >> >
> >> > Maybe if we had some concrete examples of features that could have been
> >> > delayed using a gatekeeper.
> >> >
> >> > The one that springs to my mind is cgroups. Clearly useful, but
> >> > clearly controversial. It appears that the original implementation was
> >> > seriously flawed and Tejun is doing a massive amount of work to "fix"
> >> > it, and this apparently will lead to API changes. And this is
> >> > happening without any gatekeepers. Would it have been easier in some
> >> > way with gatekeepers? ... I don't see how it would be, except that
> >> > fewer people would have used cgroups, and then maybe we wouldn't have
> >> > as much collective experience to know what the real problems were(?).
> >> >
> >> > I think that is the key. With a user-facing option, people will try it
> >> > and probably cope if it disappears (though they might complain loudly
> >> > and sign petitions declaring facebook to be the anti-$DEITY). However
> >> > with kernel internal options, applications are unlikely to use them
> >> > without some expectation of stability. So finding the problems would
> >> > be a lot harder.
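To make Dan's straw man above a bit more concrete, here is a minimal sketch of
what such a gatekeeper could look like for a hypothetical driver "foo": an
off-by-default module parameter that unlocks the provisional code path and
taints the kernel so bug reports show that the gate was opened. The parameter
name, the helpers and the choice of TAINT_CRAP are illustrative assumptions,
not an existing interface.

#include <linux/kernel.h>
#include <linux/module.h>

/* Off by default; the name follows Dan's example and is made up. */
static bool gatekeeper_foo_new_awesome_sauce;
module_param(gatekeeper_foo_new_awesome_sauce, bool, 0444);
MODULE_PARM_DESC(gatekeeper_foo_new_awesome_sauce,
                 "Enable the provisional foo feature (no stability guarantee)");

/* Placeholder for the provisional code path. */
static void do_new_thing(void)
{
}

static void foo_maybe_do_new_thing(void)
{
        if (!gatekeeper_foo_new_awesome_sauce)
                return;

        /* Make it obvious in logs and oops reports that the gate was opened. */
        pr_warn("foo: provisional feature enabled, it may change or disappear\n");
        add_taint(TAINT_CRAP, LOCKDEP_STILL_OK);

        do_new_thing();
}

A user would then have to opt in explicitly, for instance with
foo.gatekeeper_foo_new_awesome_sauce=1 on the kernel command line, and the
taint flag would make the provisional state visible in any bug report.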
> >> >
> >> > Which doesn't mean that it can't work, but it would be nice if we create
> >> > some real life examples to see how it plays out in practice.
> >>
> >> Biased by my background of course, but I think driver development is
> >> more amenable to this sort of approach. For drivers the kernel is in
> >> many instances the application. For example, I currently have in my
> >> review queue a patch set to add sata port multiplier support to
> >> libsas. I hope I get the review done in time for merging it in 3.16.
> >> But, what if I also had the option of saying "let's gatekeeper this
> >> for a cycle"? Users that care could start using it and reporting
> >> bugs, and it would be clear that the implementation is provisional.
> >> My opinion is that bug reports would attract deeper code review that
> >> otherwise would not occur if the feature was simply delayed for a
> >> cycle.
> >
> > There's more to that.
> >
> > The model you're referring to is only possible if all participants are
> > employees of one company or otherwise members of one organization that
> > has some kind of control over them. The kernel development is not done
> > like that, though, so I'm afraid that the Facebook experience is not
> > applicable here directly.
> >
> > For example, we take patches from pretty much everyone on the Internet.
> > Does Facebook do that too? I don't think so.
>
> I'm struggling to see how this addresses my new libsas feature example.
>
> Simply, if an end user knows how to override a "gatekeeper", that user
> can test features that we are otherwise still debating upstream. They
> can of course also apply the patches directly, but I am proposing we
> formalize a mechanism to encourage more experimentation in-tree.

Isn't that what CONFIG_EXPERIMENTAL was for? A similar mechanism would likely
be abused the same way, and end up being enabled by default by distros at the
end of the day. http://lwn.net/Articles/520867/ explains how experimental
items should be handled, possibly depending on CONFIG_BROKEN (hopefully
distros won't enable that one).

Let's not forget that the kernel carries security implications. We might want
to make it easier for users to enable experimental features, but not so easy
that they could enable dangerous features without knowing it, or without
realizing what they're doing. Out-of-tree patches should be pretty safe in
that regard, but an in-tree mechanism should take those constraints into
account.

We also need to decide on where to put the limit. Experimental features that
haven't been properly reviewed can have side effects. They might make build
robots fail even when the feature is disabled, because the implementation
doesn't properly handle the disabled case. We would need to review
experimental patches to prevent that from happening, and that could just put
more burden on maintainers instead of helping them.

> I'm fully aware we do not have the tactical data nor operational
> control to run the kernel like a website, that's not my concern. My
> concern is with expanding a maintainer's options for mitigating risk.

-- 
Regards,

Laurent Pinchart
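P.S. On the "disabled case" point above, this is a minimal sketch of the usual
pattern that keeps callers building when a gated feature is compiled out. The
config symbol, function and structure names are hypothetical, for illustration
only.

/* foo_new_thing.h - hypothetical header for the provisional feature */
#include <linux/errno.h>

struct foo_device;

#ifdef CONFIG_FOO_NEW_AWESOME_SAUCE
int foo_do_new_thing(struct foo_device *dev);
#else
/* Feature compiled out: provide a stub so callers still build and link. */
static inline int foo_do_new_thing(struct foo_device *dev)
{
        return -EOPNOTSUPP;
}
#endif

Whether such an option should additionally depend on BROKEN, as the LWN
article above suggests, is a separate question; the stub only addresses the
build robot concern.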