* [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
@ 2025-11-24 16:53 Ilias Stamatis
2025-11-24 16:58 ` Andrew Morton
2025-11-25 8:09 ` David Hildenbrand (Red Hat)
0 siblings, 2 replies; 14+ messages in thread
From: Ilias Stamatis @ 2025-11-24 16:53 UTC (permalink / raw)
To: akpm, linux-kernel
Cc: linux-mm, david, nadav.amit, huang.ying.caritas,
andriy.shevchenko, bhe, nh-open-source, Ilias Stamatis
Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
logic") removed an optimization introduced by commit 756398750e11
("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
was not called out in the message of the first commit explicitly so it's
not entirely clear whether removing the optimization happened
inadvertently or not.
As the original commit message of the optimization explains there is no
point considering the children of a subtree in find_next_iomem_res() if
the top level range does not match. Reinstating the optimization results
in significant performance improvements in systems with very large iomem
maps when mmaping /dev/mem.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
kernel/resource.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/resource.c b/kernel/resource.c
index b9fa2a4ce089..e4e9bac12e6e 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -341,6 +341,8 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
unsigned long flags, unsigned long desc,
struct resource *res)
{
+ /* Skip children until we find a top level range that matches */
+ bool skip_children = true;
struct resource *p;
if (!res)
@@ -351,7 +353,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
read_lock(&resource_lock);
- for_each_resource(&iomem_resource, p, false) {
+ for_each_resource(&iomem_resource, p, skip_children) {
/* If we passed the resource we are looking for, stop */
if (p->start > end) {
p = NULL;
@@ -362,6 +364,12 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
if (p->end < start)
continue;
+ /*
+ * We found a top level range that matches what we are looking
+ * for. Time to start checking children too.
+ */
+ skip_children = false;
+
/* Found a match, break */
if (is_type_match(p, flags, desc))
break;
--
2.43.0
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-24 16:53 [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()" Ilias Stamatis
@ 2025-11-24 16:58 ` Andrew Morton
2025-11-24 17:05 ` Andy Shevchenko
[not found] ` <c7411175b332f3befb5bebb6a75c7b91f2c1dbbc.camel@amazon.co.uk>
2025-11-25 8:09 ` David Hildenbrand (Red Hat)
1 sibling, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2025-11-24 16:58 UTC (permalink / raw)
To: Ilias Stamatis
Cc: linux-kernel, linux-mm, david, nadav.amit, huang.ying.caritas,
andriy.shevchenko, bhe, nh-open-source
On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> logic") removed an optimization introduced by commit 756398750e11
> ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> was not called out in the message of the first commit explicitly so it's
> not entirely clear whether removing the optimization happened
> inadvertently or not.
>
> As the original commit message of the optimization explains there is no
> point considering the children of a subtree in find_next_iomem_res() if
> the top level range does not match. Reinstating the optimization results
> in significant performance improvements in systems with very large iomem
> maps when mmaping /dev/mem.
It would be great if we could quantify "significant performance
improvements"?
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-24 16:58 ` Andrew Morton
@ 2025-11-24 17:05 ` Andy Shevchenko
[not found] ` <c7411175b332f3befb5bebb6a75c7b91f2c1dbbc.camel@amazon.co.uk>
1 sibling, 0 replies; 14+ messages in thread
From: Andy Shevchenko @ 2025-11-24 17:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Ilias Stamatis, linux-kernel, linux-mm, david, nadav.amit,
huang.ying.caritas, bhe, nh-open-source
On Mon, Nov 24, 2025 at 08:58:16AM -0800, Andrew Morton wrote:
> On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
>
> > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > logic") removed an optimization introduced by commit 756398750e11
> > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > was not called out in the message of the first commit explicitly so it's
> > not entirely clear whether removing the optimization happened
> > inadvertently or not.
> >
> > As the original commit message of the optimization explains there is no
> > point considering the children of a subtree in find_next_iomem_res() if
> > the top level range does not match. Reinstating the optimization results
> > in significant performance improvements in systems with very large iomem
> > maps when mmaping /dev/mem.
>
> It would be great if we could quantify "significant performance
> improvements"?
+1. It also would be good to know which exact function(s) is a bottleneck.
The mentioned change updated a handful of them.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 14+ messages in thread[parent not found: <c7411175b332f3befb5bebb6a75c7b91f2c1dbbc.camel@amazon.co.uk>]
* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
[not found] ` <c7411175b332f3befb5bebb6a75c7b91f2c1dbbc.camel@amazon.co.uk>
@ 2025-11-24 18:55 ` andriy.shevchenko
2025-11-24 19:35 ` Stamatis, Ilias
0 siblings, 1 reply; 14+ messages in thread
From: andriy.shevchenko @ 2025-11-24 18:55 UTC (permalink / raw)
To: Stamatis, Ilias
Cc: akpm, nadav.amit, david, linux-mm, linux-kernel,
huang.ying.caritas, bhe, nh-open-source
On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> >
> > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > logic") removed an optimization introduced by commit 756398750e11
> > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > was not called out in the message of the first commit explicitly so it's
> > > not entirely clear whether removing the optimization happened
> > > inadvertently or not.
> > >
> > > As the original commit message of the optimization explains there is no
> > > point considering the children of a subtree in find_next_iomem_res() if
> > > the top level range does not match. Reinstating the optimization results
> > > in significant performance improvements in systems with very large iomem
> > > maps when mmaping /dev/mem.
> >
> > It would be great if we could quantify "significant performance
> > improvements"?
>
> Hi Andrew and Andy,
>
> You are right to call that out and apologies for leaving it vague.
>
> I've done my testing with older kernel versions in systems where `wc -l
> /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> optimisation.
>
> The real-world use case we care about is hypervisor live update where having to
> do lots of these mmaps() serially can significantly affect the guest downtime
> if the cost is 20-30x.
Thanks for providing this information.
> > It also would be good to know which exact function(s) is a bottleneck.
>
> Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
Have you investigated possibility to return that check directly into
the culprit?
> the full call stack being:
>
> find_next_iomem_res+0x3b ([kernel.kallsyms])
> walk_system_ram_range+0x98 ([kernel.kallsyms])
> pat_pagerange_is_ram+0x6e ([kernel.kallsyms])
> reserve_pfn_range+0x47 ([kernel.kallsyms])
> track_pfn_remap+0xb6 ([kernel.kallsyms])
> remap_pfn_range+0x3b ([kernel.kallsyms])
> mmap_mem+0x9e ([kernel.kallsyms])
> mm_struct_mmap_region+0x1f3 ([kernel.kallsyms])
> mmap_region+0xa3 ([kernel.kallsyms])
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-24 18:55 ` andriy.shevchenko
@ 2025-11-24 19:35 ` Stamatis, Ilias
2025-11-24 19:52 ` andriy.shevchenko
0 siblings, 1 reply; 14+ messages in thread
From: Stamatis, Ilias @ 2025-11-24 19:35 UTC (permalink / raw)
To: andriy.shevchenko
Cc: nadav.amit, david, linux-mm, akpm, linux-kernel, bhe,
huang.ying.caritas, nh-open-source
On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > >
> > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > logic") removed an optimization introduced by commit 756398750e11
> > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > was not called out in the message of the first commit explicitly so it's
> > > > not entirely clear whether removing the optimization happened
> > > > inadvertently or not.
> > > >
> > > > As the original commit message of the optimization explains there is no
> > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > the top level range does not match. Reinstating the optimization results
> > > > in significant performance improvements in systems with very large iomem
> > > > maps when mmaping /dev/mem.
> > >
> > > It would be great if we could quantify "significant performance
> > > improvements"?
> >
> > Hi Andrew and Andy,
> >
> > You are right to call that out and apologies for leaving it vague.
> >
> > I've done my testing with older kernel versions in systems where `wc -l
> > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > optimisation.
> >
> > The real-world use case we care about is hypervisor live update where having to
> > do lots of these mmaps() serially can significantly affect the guest downtime
> > if the cost is 20-30x.
>
> Thanks for providing this information.
>
> > > It also would be good to know which exact function(s) is a bottleneck.
> >
> > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
>
> Have you investigated possibility to return that check directly into
> the culprit?
>
>
I'm sorry, I don't understand this. Could you please clarify what you mean?
What do you consider to be the culprit and which check do you refer to?
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-24 19:35 ` Stamatis, Ilias
@ 2025-11-24 19:52 ` andriy.shevchenko
2025-11-24 23:30 ` Stamatis, Ilias
0 siblings, 1 reply; 14+ messages in thread
From: andriy.shevchenko @ 2025-11-24 19:52 UTC (permalink / raw)
To: Stamatis, Ilias
Cc: nadav.amit, david, linux-mm, akpm, linux-kernel, bhe,
huang.ying.caritas, nh-open-source
On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > > >
> > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > was not called out in the message of the first commit explicitly so it's
> > > > > not entirely clear whether removing the optimization happened
> > > > > inadvertently or not.
> > > > >
> > > > > As the original commit message of the optimization explains there is no
> > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > the top level range does not match. Reinstating the optimization results
> > > > > in significant performance improvements in systems with very large iomem
> > > > > maps when mmaping /dev/mem.
> > > >
> > > > It would be great if we could quantify "significant performance
> > > > improvements"?
> > >
> > > Hi Andrew and Andy,
> > >
> > > You are right to call that out and apologies for leaving it vague.
> > >
> > > I've done my testing with older kernel versions in systems where `wc -l
> > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > optimisation.
> > >
> > > The real-world use case we care about is hypervisor live update where having to
> > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > if the cost is 20-30x.
> >
> > Thanks for providing this information.
> >
> > > > It also would be good to know which exact function(s) is a bottleneck.
> > >
> > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> >
> > Have you investigated possibility to return that check directly into
> > the culprit?
>
> I'm sorry, I don't understand this. Could you please clarify what you mean?
> What do you consider to be the culprit and which check do you refer to?
The mentioned patch removed the check for siblings from next_resource().
The function that your test case complains about is find_next_iomem_res().
Hence, have you tried to reinstantiate the (removed) check from next_resource()
in find_next_iomem_res() and see if it helps?
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-24 19:52 ` andriy.shevchenko
@ 2025-11-24 23:30 ` Stamatis, Ilias
2025-11-25 6:50 ` andriy.shevchenko
0 siblings, 1 reply; 14+ messages in thread
From: Stamatis, Ilias @ 2025-11-24 23:30 UTC (permalink / raw)
To: andriy.shevchenko
Cc: nadav.amit, david, linux-mm, akpm, linux-kernel,
huang.ying.caritas, bhe, nh-open-source
On Mon, 2025-11-24 at 21:52 +0200, andriy.shevchenko@linux.intel.com wrote:
> On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> > On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > > > >
> > > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > > was not called out in the message of the first commit explicitly so it's
> > > > > > not entirely clear whether removing the optimization happened
> > > > > > inadvertently or not.
> > > > > >
> > > > > > As the original commit message of the optimization explains there is no
> > > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > > the top level range does not match. Reinstating the optimization results
> > > > > > in significant performance improvements in systems with very large iomem
> > > > > > maps when mmaping /dev/mem.
> > > > >
> > > > > It would be great if we could quantify "significant performance
> > > > > improvements"?
> > > >
> > > > Hi Andrew and Andy,
> > > >
> > > > You are right to call that out and apologies for leaving it vague.
> > > >
> > > > I've done my testing with older kernel versions in systems where `wc -l
> > > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > > optimisation.
> > > >
> > > > The real-world use case we care about is hypervisor live update where having to
> > > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > > if the cost is 20-30x.
> > >
> > > Thanks for providing this information.
> > >
> > > > > It also would be good to know which exact function(s) is a bottleneck.
> > > >
> > > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> > >
> > > Have you investigated possibility to return that check directly into
> > > the culprit?
> >
> > I'm sorry, I don't understand this. Could you please clarify what you mean?
> > What do you consider to be the culprit and which check do you refer to?
>
> The mentioned patch removed the check for siblings from next_resource().
> The function that your test case complains about is find_next_iomem_res().
> Hence, have you tried to reinstantiate the (removed) check from next_resource()
> in find_next_iomem_res() and see if it helps?
>
next_resource() does accept a 'skip_children' parameter in the latest kernel
today which is equivalent to the 'sibling_only' parameter in the older kernels.
And the for_each_resource() macro currently used in find_next_iomem_res() calls
next_resource().
Hope that makes sense.
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-24 23:30 ` Stamatis, Ilias
@ 2025-11-25 6:50 ` andriy.shevchenko
2025-11-25 9:56 ` Stamatis, Ilias
0 siblings, 1 reply; 14+ messages in thread
From: andriy.shevchenko @ 2025-11-25 6:50 UTC (permalink / raw)
To: Stamatis, Ilias
Cc: nadav.amit, david, linux-mm, akpm, linux-kernel,
huang.ying.caritas, bhe, nh-open-source
On Mon, Nov 24, 2025 at 11:30:46PM +0000, Stamatis, Ilias wrote:
> On Mon, 2025-11-24 at 21:52 +0200, andriy.shevchenko@linux.intel.com wrote:
> > On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> > > On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > > > > >
> > > > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > > > was not called out in the message of the first commit explicitly so it's
> > > > > > > not entirely clear whether removing the optimization happened
> > > > > > > inadvertently or not.
> > > > > > >
> > > > > > > As the original commit message of the optimization explains there is no
> > > > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > > > the top level range does not match. Reinstating the optimization results
> > > > > > > in significant performance improvements in systems with very large iomem
> > > > > > > maps when mmaping /dev/mem.
> > > > > >
> > > > > > It would be great if we could quantify "significant performance
> > > > > > improvements"?
> > > > >
> > > > > Hi Andrew and Andy,
> > > > >
> > > > > You are right to call that out and apologies for leaving it vague.
> > > > >
> > > > > I've done my testing with older kernel versions in systems where `wc -l
> > > > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > > > optimisation.
> > > > >
> > > > > The real-world use case we care about is hypervisor live update where having to
> > > > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > > > if the cost is 20-30x.
> > > >
> > > > Thanks for providing this information.
> > > >
> > > > > > It also would be good to know which exact function(s) is a bottleneck.
> > > > >
> > > > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> > > >
> > > > Have you investigated possibility to return that check directly into
> > > > the culprit?
> > >
> > > I'm sorry, I don't understand this. Could you please clarify what you mean?
> > > What do you consider to be the culprit and which check do you refer to?
> >
> > The mentioned patch removed the check for siblings from next_resource().
> > The function that your test case complains about is find_next_iomem_res().
> > Hence, have you tried to reinstantiate the (removed) check from next_resource()
> > in find_next_iomem_res() and see if it helps?
>
> next_resource() does accept a 'skip_children' parameter in the latest kernel
> today which is equivalent to the 'sibling_only' parameter in the older
> kernels.
It used to be
if (sibling_only)
return p->sibling;
if (p->child)
return p->child;
...
and become (in the latest kernels)
if (!skip_children && p->child)
return p->child;
...
Can you elaborate how are they interoperable?
TL;DR: I don't think it's an equivalent.
> And the for_each_resource() macro currently used in find_next_iomem_res()
> calls next_resource().
>
> Hope that makes sense.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-25 6:50 ` andriy.shevchenko
@ 2025-11-25 9:56 ` Stamatis, Ilias
2025-11-25 10:23 ` andriy.shevchenko
0 siblings, 1 reply; 14+ messages in thread
From: Stamatis, Ilias @ 2025-11-25 9:56 UTC (permalink / raw)
To: andriy.shevchenko
Cc: nadav.amit, david, linux-mm, akpm, linux-kernel, bhe,
huang.ying.caritas, nh-open-source
On Tue, 2025-11-25 at 08:50 +0200, andriy.shevchenko@linux.intel.com wrote:
> On Mon, Nov 24, 2025 at 11:30:46PM +0000, Stamatis, Ilias wrote:
> > On Mon, 2025-11-24 at 21:52 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> > > > On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > > > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > > > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > > > > > >
> > > > > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > > > > was not called out in the message of the first commit explicitly so it's
> > > > > > > > not entirely clear whether removing the optimization happened
> > > > > > > > inadvertently or not.
> > > > > > > >
> > > > > > > > As the original commit message of the optimization explains there is no
> > > > > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > > > > the top level range does not match. Reinstating the optimization results
> > > > > > > > in significant performance improvements in systems with very large iomem
> > > > > > > > maps when mmaping /dev/mem.
> > > > > > >
> > > > > > > It would be great if we could quantify "significant performance
> > > > > > > improvements"?
> > > > > >
> > > > > > Hi Andrew and Andy,
> > > > > >
> > > > > > You are right to call that out and apologies for leaving it vague.
> > > > > >
> > > > > > I've done my testing with older kernel versions in systems where `wc -l
> > > > > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > > > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > > > > optimisation.
> > > > > >
> > > > > > The real-world use case we care about is hypervisor live update where having to
> > > > > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > > > > if the cost is 20-30x.
> > > > >
> > > > > Thanks for providing this information.
> > > > >
> > > > > > > It also would be good to know which exact function(s) is a bottleneck.
> > > > > >
> > > > > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> > > > >
> > > > > Have you investigated possibility to return that check directly into
> > > > > the culprit?
> > > >
> > > > I'm sorry, I don't understand this. Could you please clarify what you mean?
> > > > What do you consider to be the culprit and which check do you refer to?
> > >
> > > The mentioned patch removed the check for siblings from next_resource().
> > > The function that your test case complains about is find_next_iomem_res().
> > > Hence, have you tried to reinstantiate the (removed) check from next_resource()
> > > in find_next_iomem_res() and see if it helps?
> >
> > next_resource() does accept a 'skip_children' parameter in the latest kernel
> > today which is equivalent to the 'sibling_only' parameter in the older
> > kernels.
>
> It used to be
>
> if (sibling_only)
> return p->sibling;
>
> if (p->child)
> return p->child;
> ...
>
This returns p->sibling if sibling_only == true.
The return value might also be NULL.
>
> and become (in the latest kernels)
>
> if (!skip_children && p->child)
> return p->child;
> ...
>
>
if (!skip_children && p->child)
return p->child;
while (!p->sibling && p->parent) {
p = p->parent;
if (p == subtree_root)
return NULL;
}
return p->sibling;
This is the full function on the latest kernel. If skip_children == true and
there is a sibling, it also returns p->sibling.
If p->sibling is NULL, it'll try to get the parent. In the case of
find_next_iomem_res() the parent will be iomem_resource, in which case the if
(p == subtree_root) path is taken and we return NULL (same as the case of
p->sibling being NULL above).
> Can you elaborate how are they interoperable?
>
> TL;DR: I don't think it's an equivalent.
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-25 9:56 ` Stamatis, Ilias
@ 2025-11-25 10:23 ` andriy.shevchenko
2025-11-25 14:23 ` Stamatis, Ilias
0 siblings, 1 reply; 14+ messages in thread
From: andriy.shevchenko @ 2025-11-25 10:23 UTC (permalink / raw)
To: Stamatis, Ilias
Cc: nadav.amit, david, linux-mm, akpm, linux-kernel, bhe,
huang.ying.caritas, nh-open-source
On Tue, Nov 25, 2025 at 09:56:36AM +0000, Stamatis, Ilias wrote:
> On Tue, 2025-11-25 at 08:50 +0200, andriy.shevchenko@linux.intel.com wrote:
> > On Mon, Nov 24, 2025 at 11:30:46PM +0000, Stamatis, Ilias wrote:
> > > On Mon, 2025-11-24 at 21:52 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > > On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> > > > > On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > > > > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > > > > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > > > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
...
> > > > > > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > > > > > was not called out in the message of the first commit explicitly so it's
> > > > > > > > > not entirely clear whether removing the optimization happened
> > > > > > > > > inadvertently or not.
> > > > > > > > >
> > > > > > > > > As the original commit message of the optimization explains there is no
> > > > > > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > > > > > the top level range does not match. Reinstating the optimization results
> > > > > > > > > in significant performance improvements in systems with very large iomem
> > > > > > > > > maps when mmaping /dev/mem.
> > > > > > > >
> > > > > > > > It would be great if we could quantify "significant performance
> > > > > > > > improvements"?
> > > > > > >
> > > > > > > I've done my testing with older kernel versions in systems where `wc -l
> > > > > > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > > > > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > > > > > optimisation.
> > > > > > >
> > > > > > > The real-world use case we care about is hypervisor live update where having to
> > > > > > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > > > > > if the cost is 20-30x.
> > > > > >
> > > > > > Thanks for providing this information.
> > > > > >
> > > > > > > > It also would be good to know which exact function(s) is a bottleneck.
> > > > > > >
> > > > > > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> > > > > >
> > > > > > Have you investigated possibility to return that check directly into
> > > > > > the culprit?
> > > > >
> > > > > I'm sorry, I don't understand this. Could you please clarify what you mean?
> > > > > What do you consider to be the culprit and which check do you refer to?
> > > >
> > > > The mentioned patch removed the check for siblings from next_resource().
> > > > The function that your test case complains about is find_next_iomem_res().
> > > > Hence, have you tried to reinstantiate the (removed) check from next_resource()
> > > > in find_next_iomem_res() and see if it helps?
> > >
> > > next_resource() does accept a 'skip_children' parameter in the latest kernel
> > > today which is equivalent to the 'sibling_only' parameter in the older
> > > kernels.
> >
> > It used to be
> >
> > if (sibling_only)
> > return p->sibling;
> >
> > if (p->child)
> > return p->child;
> > ...
>
> This returns p->sibling if sibling_only == true.
> The return value might also be NULL.
>
> > and become (in the latest kernels)
> >
> > if (!skip_children && p->child)
> > return p->child;
> > ...
>
> if (!skip_children && p->child)
> return p->child;
> while (!p->sibling && p->parent) {
> p = p->parent;
> if (p == subtree_root)
> return NULL;
> }
> return p->sibling;
>
> This is the full function on the latest kernel. If skip_children == true and
> there is a sibling, it also returns p->sibling.
>
> If p->sibling is NULL, it'll try to get the parent. In the case of
> find_next_iomem_res() the parent will be iomem_resource, in which case the if
> (p == subtree_root) path is taken and we return NULL (same as the case of
> p->sibling being NULL above).
Thanks for elaboration.
Please summarise this, add the performance test results and send a v2.
Seems okay to me.
> > Can you elaborate how are they interoperable?
> >
> > TL;DR: I don't think it's an equivalent.
So, it's not a literal equivalent, but it behaves in a very similar way.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-25 10:23 ` andriy.shevchenko
@ 2025-11-25 14:23 ` Stamatis, Ilias
2025-11-25 18:30 ` andriy.shevchenko
0 siblings, 1 reply; 14+ messages in thread
From: Stamatis, Ilias @ 2025-11-25 14:23 UTC (permalink / raw)
To: andriy.shevchenko
Cc: nadav.amit, david, linux-mm, akpm, linux-kernel,
huang.ying.caritas, bhe, nh-open-source
On Tue, 2025-11-25 at 12:23 +0200, andriy.shevchenko@linux.intel.com wrote:
> Thanks for elaboration.
>
> Please summarise this, add the performance test results and send a v2.
> Seems okay to me.
How does this look?
No code change, but happy to send a v2 if that makes things easier for merging.
Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
logic") removed an optimization introduced by commit 756398750e11
("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
was not called out in the message of the first commit explicitly so it's
not entirely clear whether removing the optimization happened
inadvertently or not.
As the original commit message of the optimization explains there is no
point considering the children of a subtree in find_next_iomem_res() if
the top level range does not match.
Reinstating the optimization results in performance improvements in
systems where /proc/iomem is ~5k lines long. Calling mmap() on /dev/mem
in such platforms takes 700-1500μs without the optimisation and 10-50μs
with the optimisation.
Note that even though commit 97523a4edb7b removed the 'sibling_only'
parameter from next_resource(), newer kernels have basically reinstated
it under the name 'skip_children'.
Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u
Fixes: 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic")
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-25 14:23 ` Stamatis, Ilias
@ 2025-11-25 18:30 ` andriy.shevchenko
0 siblings, 0 replies; 14+ messages in thread
From: andriy.shevchenko @ 2025-11-25 18:30 UTC (permalink / raw)
To: Stamatis, Ilias
Cc: nadav.amit, david, linux-mm, akpm, linux-kernel,
huang.ying.caritas, bhe, nh-open-source
On Tue, Nov 25, 2025 at 02:23:38PM +0000, Stamatis, Ilias wrote:
> On Tue, 2025-11-25 at 12:23 +0200, andriy.shevchenko@linux.intel.com wrote:
> > Thanks for elaboration.
> >
> > Please summarise this, add the performance test results and send a v2.
> > Seems okay to me.
>
> How does this look?
>
> No code change, but happy to send a v2 if that makes things easier for merging.
Yes, please do.
> Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
resource: Reinstate "avoid unnecessary lookups in find_next_iomem_res()"
(always keep leading prefix).
> Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> logic") removed an optimization introduced by commit 756398750e11
> ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> was not called out in the message of the first commit explicitly so it's
> not entirely clear whether removing the optimization happened
> inadvertently or not.
>
> As the original commit message of the optimization explains there is no
> point considering the children of a subtree in find_next_iomem_res() if
> the top level range does not match.
>
> Reinstating the optimization results in performance improvements in
> systems where /proc/iomem is ~5k lines long. Calling mmap() on /dev/mem
> in such platforms takes 700-1500μs without the optimisation and 10-50μs
> with the optimisation.
>
> Note that even though commit 97523a4edb7b removed the 'sibling_only'
> parameter from next_resource(), newer kernels have basically reinstated
> it under the name 'skip_children'.
>
> Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u
> Fixes: 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic")
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
The rest LGTM.
Also, if required, make a comment in the code aligned with this new commit
message.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-24 16:53 [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()" Ilias Stamatis
2025-11-24 16:58 ` Andrew Morton
@ 2025-11-25 8:09 ` David Hildenbrand (Red Hat)
2025-11-25 8:18 ` David Hildenbrand (Red Hat)
1 sibling, 1 reply; 14+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-25 8:09 UTC (permalink / raw)
To: Ilias Stamatis, akpm, linux-kernel
Cc: linux-mm, nadav.amit, huang.ying.caritas, andriy.shevchenko, bhe,
nh-open-source
On 11/24/25 17:53, Ilias Stamatis wrote:
> Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> logic") removed an optimization introduced by commit 756398750e11
> ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> was not called out in the message of the first commit explicitly so it's
> not entirely clear whether removing the optimization happened
> inadvertently or not.
Remembering the history, we have some things where the top might not
fully describe what the lower levels do.
An example is for example found here:
Author: Dan Williams <dan.j.williams@intel.com>
Date: Thu Feb 16 00:36:02 2023 -0800
dax/kmem: Fix leak of memory-hotplug resources
While experimenting with CXL region removal the following corruption of
/proc/iomem appeared.
Before:
f010000000-f04fffffff : CXL Window 0
f010000000-f02fffffff : region4
f010000000-f02fffffff : dax4.0
f010000000-f02fffffff : System RAM (kmem)
The CXL Windows will certainly not match System RAM, as one example.
How would your change affect such cases?
--
Cheers
David
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
2025-11-25 8:09 ` David Hildenbrand (Red Hat)
@ 2025-11-25 8:18 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-25 8:18 UTC (permalink / raw)
To: Ilias Stamatis, akpm, linux-kernel
Cc: linux-mm, nadav.amit, huang.ying.caritas, andriy.shevchenko, bhe,
nh-open-source
On 11/25/25 09:09, David Hildenbrand (Red Hat) wrote:
> On 11/24/25 17:53, Ilias Stamatis wrote:
>> Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
>> logic") removed an optimization introduced by commit 756398750e11
>> ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
>> was not called out in the message of the first commit explicitly so it's
>> not entirely clear whether removing the optimization happened
>> inadvertently or not.
>
> Remembering the history, we have some things where the top might not
> fully describe what the lower levels do.
>
> An example is for example found here:
>
> Author: Dan Williams <dan.j.williams@intel.com>
> Date: Thu Feb 16 00:36:02 2023 -0800
>
> dax/kmem: Fix leak of memory-hotplug resources
>
> While experimenting with CXL region removal the following corruption of
> /proc/iomem appeared.
>
> Before:
> f010000000-f04fffffff : CXL Window 0
> f010000000-f02fffffff : region4
> f010000000-f02fffffff : dax4.0
> f010000000-f02fffffff : System RAM (kmem)
>
> The CXL Windows will certainly not match System RAM, as one example.
>
> How would your change affect such cases?
Looking into the details, I assume, as we only check that the actual
range matches, not the type, that this is fine.
So yeah, that makes sense to me. I guess I removed it as part of
97523a4edb7b by accident, when I primarily wanted to remove the
first_lvl parameter.
Given that the above still works as expected:
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-11-25 18:30 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-24 16:53 [PATCH] Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()" Ilias Stamatis
2025-11-24 16:58 ` Andrew Morton
2025-11-24 17:05 ` Andy Shevchenko
[not found] ` <c7411175b332f3befb5bebb6a75c7b91f2c1dbbc.camel@amazon.co.uk>
2025-11-24 18:55 ` andriy.shevchenko
2025-11-24 19:35 ` Stamatis, Ilias
2025-11-24 19:52 ` andriy.shevchenko
2025-11-24 23:30 ` Stamatis, Ilias
2025-11-25 6:50 ` andriy.shevchenko
2025-11-25 9:56 ` Stamatis, Ilias
2025-11-25 10:23 ` andriy.shevchenko
2025-11-25 14:23 ` Stamatis, Ilias
2025-11-25 18:30 ` andriy.shevchenko
2025-11-25 8:09 ` David Hildenbrand (Red Hat)
2025-11-25 8:18 ` David Hildenbrand (Red Hat)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox