From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from wp530.webpack.hosteurope.de (wp530.webpack.hosteurope.de [80.237.130.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CBBFA13DB96 for ; Thu, 13 Jun 2024 08:26:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=80.237.130.52 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718267204; cv=none; b=mo47snGmEBNppXL7NysQ5VjmYl0DbLctUxPpYDDKvzNRX2aJXPMYhJN5xYyEycBWxXlZZX6FfespHj8+LoMD+BFviTGuSBsgNzqOcy0hvfN45jTw8l8qfKc4EUNx2BQoIJ0cexx85ptGefMmip6OqFXjLfvLzQAExANddIGjtOs= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718267204; c=relaxed/simple; bh=Tx2qf6Ni8rXq9W9cNmxZdbq0w5VmjCttkmZBwNYzTAo=; h=Message-ID:Date:MIME-Version:Subject:From:To:References: In-Reply-To:Content-Type; b=NQ3C5H+K98vg+B+Kdv0/KdOjfgCmT91Jyi2ajecxazN/wgmqm/1NpMMG8pC1KTyWcnDbPGxBMQd/W1u11DIUqTqfVEX2MF0fJOZk02LKNL1Vwe05mt2JNbSd5pfLdSmlOpQHfQ9wifMAqCsO+0onpx/WA011569KOzCJQGBq3IQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=leemhuis.info; spf=pass smtp.mailfrom=leemhuis.info; dkim=pass (2048-bit key) header.d=leemhuis.info header.i=@leemhuis.info header.b=BylKvGAq; arc=none smtp.client-ip=80.237.130.52 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=leemhuis.info Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=leemhuis.info Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=leemhuis.info header.i=@leemhuis.info header.b="BylKvGAq" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=leemhuis.info; s=he214686; h=Content-Transfer-Encoding:Content-Type: In-Reply-To:References:To:From:Subject:MIME-Version:Date:Message-ID:From: Sender:Reply-To:Subject:Date:Message-ID:To:Cc:MIME-Version:Content-Type: Content-Transfer-Encoding:Content-ID:Content-Description:In-Reply-To: References; bh=8c8NHZRSbIFg8F6NSztBYAyXFybncDEskrd/bfAa0Vs=; t=1718267201; x=1718699201; b=BylKvGAqaWp+i0utzyub+rVRFjXeoQY3HQZIbFQENLYxQe20nwUAC4Kmx8QVm va6G738w/XOz6c2GO3K22hLg+LPe0wgLg+ZZyF2C9ebwHmnjx71hjHQRoVQXL4nBfEDAhpwVkjjT3 CKNb+n9lJeX7Dxv+TDPV3ujiMmq+po+mtHmdKMe8PseF6Lp6+tnQCzI0mGmfrkahUgFAvvtVRIq1J EZDj9jibNcOxsmvRliBJbqR+C8RgS34bplFtx3u1oDOTszRj9wwOgZRebM38qdzDZ5VHSqcu4QdH7 gHUN3o3MdsjqCnr/CuTeOl9pYTDaWRGNlzXshqY+hwcERYSWXQ==; Received: from [2a02:8108:8980:2478:8cde:aa2c:f324:937e]; authenticated by wp530.webpack.hosteurope.de running ExIM with esmtpsa (TLS1.3:ECDHE_RSA_AES_128_GCM_SHA256:128) id 1sHfn5-0003GV-NM; Thu, 13 Jun 2024 10:26:39 +0200 Message-ID: Date: Thu, 13 Jun 2024 10:26:39 +0200 Precedence: bulk X-Mailing-List: ksummit@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: [MAINTAINERS SUMMIT] [1/4] Create written down guidelines for handling regressions From: Thorsten Leemhuis To: "ksummit@lists.linux.dev" References: Content-Language: en-US, de-DE Autocrypt: addr=linux@leemhuis.info; keydata= xsFNBFJ4AQ0BEADCz16x4kl/YGBegAsYXJMjFRi3QOr2YMmcNuu1fdsi3XnM+xMRaukWby47 JcsZYLDKRHTQ/Lalw9L1HI3NRwK+9ayjg31wFdekgsuPbu4x5RGDIfyNpd378Upa8SUmvHik apCnzsxPTEE4Z2KUxBIwTvg+snEjgZ03EIQEi5cKmnlaUynNqv3xaGstx5jMCEnR2X54rH8j QPvo2l5/79Po58f6DhxV2RrOrOjQIQcPZ6kUqwLi6EQOi92NS9Uy6jbZcrMqPIRqJZ/tTKIR OLWsEjNrc3PMcve+NmORiEgLFclN8kHbPl1tLo4M5jN9xmsa0OZv3M0katqW8kC1hzR7mhz+ Rv4MgnbkPDDO086HjQBlS6Zzo49fQB2JErs5nZ0mwkqlETu6emhxneAMcc67+ZtTeUj54K2y Iu8kk6ghaUAfgMqkdIzeSfhO8eURMhvwzSpsqhUs7pIj4u0TPN8OFAvxE/3adoUwMaB+/plk sNe9RsHHPV+7LGADZ6OzOWWftk34QLTVTcz02bGyxLNIkhY+vIJpZWX9UrfGdHSiyYThHCIy /dLz95b9EG+1tbCIyNynr9TjIOmtLOk7ssB3kL3XQGgmdQ+rJ3zckJUQapLKP2YfBi+8P1iP rKkYtbWk0u/FmCbxcBA31KqXQZoR4cd1PJ1PDCe7/DxeoYMVuwARAQABzSdUaG9yc3RlbiBM ZWVtaHVpcyA8bGludXhAbGVlbWh1aXMuaW5mbz7CwZQEEwEKAD4CGwMFCwkIBwMFFQoJCAsF FgIDAQACHgECF4AWIQSoq8a+lZZX4oPULXVytubvTFg9LQUCX31PIwUJFmtPkwAKCRBytubv TFg9LWsyD/4t3g4i2YVp8RoKAcOut0AZ7/uLSqlm8Jcbb+LeeuzjY9T3mQ4ZX8cybc1jRlsL JMYL8GD3a53/+bXCDdk2HhQKUwBJ9PUDbfWa2E/pnqeJeX6naLn1LtMJ78G9gPeG81dX5Yq+ g/2bLXyWefpejlaefaM0GviCt00kG4R/mJJpHPKIPxPbOPY2REzWPoHXJpi7vTOA2R8HrFg/ QJbnA25W55DzoxlRb/nGZYG4iQ+2Eplkweq3s3tN88MxzNpsxZp475RmzgcmQpUtKND7Pw+8 zTDPmEzkHcUChMEmrhgWc2OCuAu3/ezsw7RnWV0k9Pl5AGROaDqvARUtopQ3yEDAdV6eil2z TvbrokZQca2808v2rYO3TtvtRMtmW/M/yyR233G/JSNos4lODkCwd16GKjERYj+sJsW4/hoZ RQiJQBxjnYr+p26JEvghLE1BMnTK24i88Oo8v+AngR6JBxwH7wFuEIIuLCB9Aagb+TKsf+0c HbQaHZj+wSY5FwgKi6psJxvMxpRpLqPsgl+awFPHARktdPtMzSa+kWMhXC4rJahBC5eEjNmP i23DaFWm8BE9LNjdG8Yl5hl7Zx0mwtnQas7+z6XymGuhNXCOevXVEqm1E42fptYMNiANmrpA OKRF+BHOreakveezlpOz8OtUhsew9b/BsAHXBCEEOuuUg87BTQRSeAENARAAzu/3satWzly6 +Lqi5dTFS9+hKvFMtdRb/vW4o9CQsMqL2BJGoE4uXvy3cancvcyodzTXCUxbesNP779JqeHy s7WkF2mtLVX2lnyXSUBm/ONwasuK7KLz8qusseUssvjJPDdw8mRLAWvjcsYsZ0qgIU6kBbvY ckUWkbJj/0kuQCmmulRMcaQRrRYrk7ZdUOjaYmjKR+UJHljxLgeregyiXulRJxCphP5migoy ioa1eset8iF9fhb+YWY16X1I3TnucVCiXixzxwn3uwiVGg28n+vdfZ5lackCOj6iK4+lfzld z4NfIXK+8/R1wD9yOj1rr3OsjDqOaugoMxgEFOiwhQDiJlRKVaDbfmC1G5N1YfQIn90znEYc M7+Sp8Rc5RUgN5yfuwyicifIJQCtiWgjF8ttcIEuKg0TmGb6HQHAtGaBXKyXGQulD1CmBHIW zg7bGge5R66hdbq1BiMX5Qdk/o3Sr2OLCrxWhqMdreJFLzboEc0S13BCxVglnPqdv5sd7veb 0az5LGS6zyVTdTbuPUu4C1ZbstPbuCBwSwe3ERpvpmdIzHtIK4G9iGIR3Seo0oWOzQvkFn8m 2k6H2/Delz9IcHEefSe5u0GjIA18bZEt7R2k8CMZ84vpyWOchgwXK2DNXAOzq4zwV8W4TiYi FiIVXfSj185vCpuE7j0ugp0AEQEAAcLBfAQYAQoAJgIbDBYhBKirxr6Vllfig9QtdXK25u9M WD0tBQJffU8wBQkWa0+jAAoJEHK25u9MWD0tv+0P/A47x8r+hekpuF2KvPpGi3M6rFpdPfeO RpIGkjQWk5M+oF0YH3vtb0+92J7LKfJwv7GIy2PZO2svVnIeCOvXzEM/7G1n5zmNMYGZkSyf x9dnNCjNl10CmuTYud7zsd3cXDku0T+Ow5Dhnk6l4bbJSYzFEbz3B8zMZGrs9EhqNzTLTZ8S Mznmtkxcbb3f/o5SW9NhH60mQ23bB3bBbX1wUQAmMjaDQ/Nt5oHWHN0/6wLyF4lStBGCKN9a TLp6E3100BuTCUCrQf9F3kB7BC92VHvobqYmvLTCTcbxFS4JNuT+ZyV+xR5JiV+2g2HwhxWW uC88BtriqL4atyvtuybQT+56IiiU2gszQ+oxR/1Aq+VZHdUeC6lijFiQblqV6EjenJu+pR9A 7EElGPPmYdO1WQbBrmuOrFuO6wQrbo0TbUiaxYWyoM9cA7v7eFyaxgwXBSWKbo/bcAAViqLW ysaCIZqWxrlhHWWmJMvowVMkB92uPVkxs5IMhSxHS4c2PfZ6D5kvrs3URvIc6zyOrgIaHNzR 8AF4PXWPAuZu1oaG/XKwzMqN/Y/AoxWrCFZNHE27E1RrMhDgmyzIzWQTffJsVPDMQqDfLBhV ic3b8Yec+Kn+ExIF5IuLfHkUgIUs83kDGGbV+wM8NtlGmCXmatyavUwNCXMsuI24HPl7gV2h n7RI In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-bounce-key: webpack.hosteurope.de;linux@leemhuis.info;1718267201;08ddc5c8; X-HE-SMSGID: 1sHfn5-0003GV-NM Different assumptions about the appropriate handling of regressions frequently lead to friction and time consuming discussions during my regression tracking and prodding work. That is frustrating, demotivating and exhausting for everyone involved and even brought us to situations like "then I'm stepping down as maintainer". To avoid things like this, I propose we try to pin down guidelines together and ideally make Linus bless them. The "Expectations and best practices for fixing regressions" in Documentation/process/handling-regressions.rst ( https://docs.kernel.org/process/handling-regressions.html#expectations-and-best-practices-for-fixing-regressions ) could be a start for such guidelines -- but I'm obviously biased here, as I wrote that text, so feel free to propose something new. That text is based on generalized interpretations of statements and actions from Linus while keeping practical application and our workflows in mind -- including the maintenance of stable trees. I have no idea if I went too far somewhere: the submission of that text was addressed to Linus, but he did not react; otoh he merged it later after Greg ACKed it and it came to his doorstep through the docs tree. But in the end it seems most people do not know about this text or do not take it for real. That's why I'd like to make this more official or create something new that is blessed and widely accepted, despite all the downsides that "writing things down" sometimes has. That means that the text likely should be moved somewhere else closer to Documentation/process/submitting-patches.rst and/or to Documentation/process/6.Followthrough.rst maybe. That "Expectations and best practices for fixing regressions" has a short section that basically says everything crucial in generic way already: > As a Linux kernel developer, you are expected to give your best to prevent > situations where a regression caused by a recent change of yours leaves users > only these options: > > * Run a kernel with a regression that impacts usage. > > * Switch to an older or newer kernel series. > > * Continue running an outdated and thus potentially insecure kernel for more > than three weeks after the regression's culprit was identified. Ideally it > should be less than two. And it ought to be just a few days, if the issue is > severe or affects many users -- either in general or in prevalent > environments. This thus should already prevent all variations of the example scenario the mail at the start of this thread covered. So maybe this is enough already. Fun fact: in an earlier version of that text (which was in mainline for about a year and also ACKed by Greg) it was more like "within two weeks, ideally one"; but that afaics turned out to be too demanding, especially for subsystem maintainers, as then they would definitely have to send PRs more often to reach that target. This and other practical aspects are also the reason why the text continues with more detailed instructions: > How to realize that in practice depends on various factors. Use the following > rules of thumb as a guide. Let me comment on some of the other points in case we want to use them as a base for guidelines. Again, many of the underlying problems that lead to the following points can be seen in the scenario at the start of the thread. > In general: > > * Prioritize work on regressions over all other Linux kernel work, unless the > latter concerns a severe issue (e.g. acute security vulnerability, data loss, > bricked hardware, ...). > > * Expedite fixing mainline regressions that recently made it into a proper > mainline, stable, or longterm release (either directly or via backport). Note: The "stable, or longterm release (either directly or via backport)" in this point is just from my interpretation, not sure what Linus thinks about it. Proposal 3/4 will focus on that, so maybe ignore this part here. Another note: That "recently" in the second point becomes more concrete in later points (see quotes below; yes, this is not really ideal and maybe should be fixed) and is based on https://lore.kernel.org/all/CAHk-=wis_qQy4oDNynNKi5b7Qhosmxtoj1jxo5wmB6SRUwQUBQ@mail.gmail.com/ > * Do not consider regressions from the current cycle as something that can wait > till the end of the cycle, as the issue might discourage or prevent users and > CI systems from testing mainline now or generally. Not sure if this and the two points before it are really what Linus wants, but from his actions it seemed to me it's something like that. > * Work with the required care to avoid additional or bigger damage, even if > resolving an issue then might take longer than outlined below. FWIW, this obviously provides a loophole that could be used in any situation -- but at the same time I think it's wise to have it here for cases where reverts are not an option and a proper fix takes time to get right. > On timing once the culprit of a regression is known: > > * Aim to mainline a fix within two or three days, if the issue is severe or > bothering many users -- either in general or in prevalent conditions like a > particular hardware environment, distribution, or stable/longterm series. Note, ignore... > * Aim to mainline a fix by Sunday after the next, if the culprit made it > into a recent mainline, stable, or longterm release (either directly or via > backport); if the culprit became known early during a week and is simple to > resolve, try to mainline the fix within the same week. ...this point here, as proposal 3/4 will cover that in more detail. > * For other regressions, aim to mainline fixes before the hindmost Sunday > within the next three weeks. One or two Sundays later are acceptable, if the > regression is something people can live with easily for a while -- like a > mild performance regression. FWIW, I chose this "Sunday after the next", "hindmost Sunday within the next three weeks", … approach so that subsystem maintainers normally have no extra work if they flush their fixes at the end of the week, which quite a few do. And when it comes to regression that imho is a wise approach, as that ensures the fix makes it into the next -rc. It often also aligns nicely with stable trees, as that way the fix one or two days later might already be in the next stable-rc, as they are often released on Mondays or Tuesdays (at least they usually did before Greg put even more load on his already overburdened shoulders with the CVE stuff a few months ago...). > * It's strongly discouraged to delay mainlining regression fixes till the next > merge window, except when the fix is extraordinarily risky or when the > culprit was mainlined more than a year ago. This used to be a big problem and already got somewhat better, but there is still quite a bit of room for improvement from what I see. > On procedure: > > * Always consider reverting the culprit, as it's often the quickest and least > dangerous way to fix a regression. Don't worry about mainlining a fixed > variant later: that should be straight-forward, as most of the code went > through review once already. This was meant to encourage reverts, as some people see them as something bad -- when in reality it in kernel context Linus afaics wants them to be seen as "this was not ready, no big deal, just revert and reapply in a few weeks together with a fix". Wondering how if we should do something to get that better across. > * Try to resolve any regressions introduced in mainline during the past > twelve months before the current development cycle ends: Linus wants such > regressions to be handled like those from the current cycle, unless fixing > bears unusual risks. > > * Consider CCing Linus on discussions or patch review, if a regression seems > tangly. Do the same in precarious or urgent cases -- especially if the > subsystem maintainer might be unavailable. Also CC the stable team, when you > know such a regression made it into a mainline, stable, or longterm release. > > * For urgent regressions, consider asking Linus to pick up the fix straight > from the mailing list: he is totally fine with that for uncontroversial > fixes. Ideally though such requests should happen in accordance with the > subsystem maintainers or come directly from them. > > * In case you are unsure if a fix is worth the risk applying just days before > a new mainline release, send Linus a mail with the usual lists and people in > CC; in it, summarize the situation while asking him to consider picking up > the fix straight from the list. He then himself can make the call and when > needed even postpone the release. Such requests again should ideally happen > in accordance with the subsystem maintainers or come directly from them. > > Regarding stable and longterm kernels: > > * You are free to leave regressions to the stable team, if they at no point in > time occurred with mainline or were fixed there already. Ignore... > * If a regression made it into a proper mainline release during the past > twelve months, ensure to tag the fix with "Cc: stable@vger.kernel.org", as a > "Fixes:" tag alone does not guarantee a backport. Please add the same tag, > in case you know the culprit was backported to stable or longterm kernels. ...this point here, as proposal 2/4 will cover it. > * When receiving reports about regressions in recent stable or longterm kernel > series, please evaluate at least briefly if the issue might happen in current > mainline as well -- and if that seems likely, take hold of the report. If in > doubt, ask the reporter to check mainline. > > * Whenever you want to swiftly resolve a regression that recently also made it > into a proper mainline, stable, or longterm release, fix it quickly in > mainline; when appropriate thus involve Linus to fast-track the fix (see > above). That's because the stable team normally does neither revert nor fix > any changes that cause the same problems in mainline. > > * In case of urgent regression fixes you might want to ensure prompt > backporting by dropping the stable team a note once the fix was mainlined; > this is especially advisable during merge windows and shortly thereafter, as > the fix otherwise might land at the end of a huge patch queue. > > On patch flow: > > * Developers, when trying to reach the time periods mentioned above, remember > to account for the time it takes to get fixes tested, reviewed, and merged by > Linus, ideally with them being in linux-next at least briefly. Hence, if a > fix is urgent, make it obvious to ensure others handle it appropriately. > > * Reviewers, you are kindly asked to assist developers in reaching the time > periods mentioned above by reviewing regression fixes in a timely manner. > > * Subsystem maintainers, FWIW, there is one problem related to this and the previous point that I haven't written a proposal for, but maybe should have: reviewers and subsystem maintainers have no dead simple and reliable way to detect "ohh, this is a regression fix I maybe should prioritize". Some agreed on tag in the subject could help. [REGFIX]? [URGENT]? [FASTTRACK]? Hmmm, do not really like any of them, except maybe the last... :-/ A `Cc: regressions@lists.linux.dev` as well, but would be yet another tag and harder to spot. :-/ Yes, I mentioned the latter idea two years ago already without success. But some people started doing it since then, which is nice, as it helps me keep an eye on things or to become aware of regressions; it would also allow me to easily spot regression fixes that are queued for the next cycle that instead might better be merged in the current cycle. > you likewise are encouraged to expedite the handling > of regression fixes. Thus evaluate if skipping linux-next is an option for > the particular fix. Also consider sending git pull requests more often than > usual when needed. And try to avoid holding onto regression fixes over > weekends -- especially when the fix is marked for backporting. If all subsystems would usually abide by this point, many regressions would be fixed quite a bit faster from what I see, as "fixes are sitting in subsystem trees for one or more weeks and are sometimes flushed shortly after a new rc" is the aspect that from my point of view currently is the one that is causing most delays (note: for a few subsystems that is not a problem at all, they are good at this!).