* [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
@ 2026-02-06 19:38 Viacheslav Dubeyko
2026-02-06 23:28 ` Hillf Danton
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-02-06 19:38 UTC (permalink / raw)
To: lsf-pc
Cc: Viacheslav Dubeyko, linux-mm, Pavan Rallabhandi, linux-fsdevel,
linux-kernel, bpf
Hello,
Machine Learning (ML) is an approach to learning from data, finding patterns,
and making predictions without developers explicitly implementing
the algorithms. The number of application areas for ML grows every day.
Generally speaking, ML can introduce self-evolving and self-learning
capabilities into the Linux kernel. There are already research works
and industry efforts that employ ML approaches for configuring and
optimizing the Linux kernel. However, introducing ML approaches into
the Linux kernel is neither simple nor straightforward. There are multiple
problems and unanswered questions on this road. First of all, any ML model
requires floating-point operations to run, but the FPU cannot be used
directly in kernel space. Also, an ML model requires a training phase
that can cause significant performance degradation of the Linux kernel.
Even the inference phase could be problematic from the performance point
of view on the kernel side. Using ML approaches in the Linux kernel is
an inevitable step. But how can we use ML approaches in the Linux kernel?
What infrastructure do we need to adopt ML models in the Linux kernel?
What is the goal of using ML models in the Linux kernel? The main goal is
to employ ML models to elaborate the logic of a particular Linux kernel
subsystem based on processed data, and/or to derive an efficient subsystem
configuration based on the internal state of the subsystem. As a result,
it is necessary to: (1) collect data for training, (2) execute the ML model
training phase, (3) test the trained ML model, and (4) use the ML model for
executing the inference phase.
The ML model inference can be used for recommending a Linux kernel
subsystem configuration and/or for injecting synthesized subsystem logic
into kernel space (for example, eBPF logic).
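To illustrate this data flow, a minimal user-space agent loop could look like
the following sketch. Everything here is illustrative: the device path, the
record layout, and the single character device channel are assumptions and
are not taken from the actual patchset.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char dataset[4096];
	char recommendation[64];
	/* hypothetical per-subsystem channel exposed by the ML library */
	int fd = open("/dev/ml-lib/example_subsys", O_RDWR);

	if (fd < 0)
		return 1;
	for (;;) {
		/* (1) pull a data set snapshot published by the kernel side */
		ssize_t len = read(fd, dataset, sizeof(dataset));

		if (len <= 0)
			break;
		/*
		 * (2)-(3) training and testing happen here in the user-space
		 * ML framework; (4) inference produces a recommendation
		 */
		snprintf(recommendation, sizeof(recommendation), "noop");
		if (write(fd, recommendation, strlen(recommendation)) < 0)
			break;
	}
	close(fd);
	return 0;
}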
How can ML infrastructure be designed in the Linux kernel? It requires
introducing into the Linux kernel a special ML library that implements
a generalized interface of interaction between the ML model's thread in
user space and a kernel subsystem. Such an interface requires the means to:
(1) create/initialize/destroy an ML model proxy in a kernel subsystem,
(2) start/stop the ML model proxy, (3) get/preprocess/publish data sets
from kernel space, (4) receive/preprocess/apply ML model recommendation(s)
from user space, (5) execute synthesized logic/recommendations in kernel
space, (6) estimate the efficiency of synthesized logic/recommendations,
and (7) execute error back-propagation with the goal of correcting the ML
model on the user-space side (a sketch of such an interface follows below).
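To make the shape of this interface concrete, here is a minimal kernel-side
sketch. The structure and function names are invented for illustration and
do not come from the actual patchset [5]:

#include <linux/types.h>

struct ml_model_proxy;

/* hypothetical per-subsystem callbacks behind the generalized interface */
struct ml_model_proxy_ops {
	int (*start)(struct ml_model_proxy *proxy);
	void (*stop)(struct ml_model_proxy *proxy);
	/* (3) publish a data set snapshot for the user-space ML model */
	ssize_t (*publish_dataset)(struct ml_model_proxy *proxy,
				   void *buf, size_t len);
	/* (4)-(5) apply a recommendation received from user space */
	int (*apply_recommendation)(struct ml_model_proxy *proxy,
				    const void *rec, size_t len);
	/* (6)-(7) efficiency feedback used for error back-propagation */
	void (*report_feedback)(struct ml_model_proxy *proxy, s64 score);
};

/* (1) create/initialize and destroy the ML model proxy */
struct ml_model_proxy *
ml_model_proxy_create(const char *name, const struct ml_model_proxy_ops *ops);
void ml_model_proxy_destroy(struct ml_model_proxy *proxy);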
The create and initialize logic can be executed by a kernel subsystem during
module load or Linux kernel start (conversely, module unload or kernel
shutdown will destroy the ML model proxy). The ML model thread
in user space will be capable of re-initializing the proxy and executing
its start/stop logic on the kernel side. First of all,
the ML model needs to be trained with data from kernel space. The data can be
requested by the ML model from user space, or the data can be published by
the ML model proxy from kernel space. The sysfs interface can be used to
orchestrate this interaction. As a result, the ML model in user space should
be capable of extracting data set(s) from kernel space through sysfs, FUSE,
or a character device. Extracted data can be stored in persistent storage
and, finally, the ML model can be trained in user space by accessing this
data.
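For a subsystem built as a module, the lifecycle could follow the usual
pattern below; this is a sketch that assumes the hypothetical
ml_model_proxy_create()/ml_model_proxy_destroy() helpers from above:

#include <linux/err.h>
#include <linux/module.h>

static struct ml_model_proxy *example_proxy;

static const struct ml_model_proxy_ops example_ops = {
	/* subsystem-specific callbacks go here */
};

static int __init example_subsys_init(void)
{
	/* create/initialize the ML model proxy at module load */
	example_proxy = ml_model_proxy_create("example_subsys", &example_ops);
	return IS_ERR(example_proxy) ? PTR_ERR(example_proxy) : 0;
}

static void __exit example_subsys_exit(void)
{
	/* destroy the ML model proxy at module unload */
	ml_model_proxy_destroy(example_proxy);
}

module_init(example_subsys_init);
module_exit(example_subsys_exit);
MODULE_LICENSE("GPL");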
A continuous learning model can be adopted during the training phase.
It implies that a kernel subsystem can receive ML model recommendations
even during the training phase. The ML model proxy on the kernel side can
estimate the current kernel subsystem state, try to apply the ML model
recommendations, and estimate the efficiency of the applied recommendations.
Generally speaking, the ML model proxy on the kernel side can support several
modes of interaction with ML model recommendations: (1) emergency mode,
(2) learning mode, (3) collaboration mode, (4) recommendation mode.
Emergency mode is used when the kernel subsystem is in a critical state
and is required to work as efficiently as possible without involving
the ML model recommendations (for example, when the recommendations are
completely inadequate or the load is very high).
Learning mode implies that the kernel subsystem can try to apply
the ML model recommendations for some operations with the goal of
estimating the maturity of the ML model. The ML model proxy can also
downgrade to learning mode if the recommendations become inefficient.
Collaboration mode applies the ML recommendations to about 50% of
operations with the goal of bringing the ML model to a mature state.
And, finally, the ML model proxy can switch the kernel subsystem into
recommendation mode if the ML model is mature enough and the efficiency
of applying the ML recommendations is higher than that of the human-made
algorithms.
The back-propagation approach can be used to correct the ML model
by sharing efficiency-estimation feedback from the kernel side to the
user-space side.
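One way to picture the mode selection is the sketch below; the enum names
and the comparison thresholds are placeholders for illustration, not a
proposal:

/* hypothetical mode state machine for the kernel-side ML model proxy */
enum ml_proxy_mode {
	ML_PROXY_EMERGENCY,	/* ignore ML, run the built-in algorithm */
	ML_PROXY_LEARNING,	/* apply ML to a small sample of operations */
	ML_PROXY_COLLABORATION,	/* apply ML to ~50% of operations */
	ML_PROXY_RECOMMENDATION, /* prefer ML over the built-in algorithm */
};

static enum ml_proxy_mode
ml_proxy_pick_mode(long ml_score, long builtin_score, bool overloaded)
{
	/* placeholder thresholds: degrade whenever ML underperforms */
	if (overloaded || ml_score < 0)
		return ML_PROXY_EMERGENCY;
	if (ml_score < builtin_score / 2)
		return ML_PROXY_LEARNING;
	if (ml_score <= builtin_score)
		return ML_PROXY_COLLABORATION;
	return ML_PROXY_RECOMMENDATION;
}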
I would like to discuss the approach of an ML library in the Linux kernel
that can provide a way to run/employ ML models in the Linux kernel.
Thanks,
Slava.
[REFERENCES]
[1] https://lore.kernel.org/linux-fsdevel/20240605110219.7356-1-slava@dubeyko.com/
[2] https://www.youtube.com/watch?v=E7q0SKeniXU
[3] https://github.com/kernel-ml-lib/ml-lib
[4] https://github.com/kernel-ml-lib/ml-lib-linux
[5] https://lore.kernel.org/linux-fsdevel/20260206191136.2609767-1-slava@dubeyko.com/T/#t
* Re: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-06 19:38 [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel Viacheslav Dubeyko
@ 2026-02-06 23:28 ` Hillf Danton
2026-02-09 10:03 ` Chris Li
2026-02-09 10:25 ` Barry Song
2 siblings, 0 replies; 17+ messages in thread
From: Hillf Danton @ 2026-02-06 23:28 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: lsf-pc, Viacheslav Dubeyko, linux-mm, Pavan Rallabhandi,
linux-fsdevel, linux-kernel, bpf
On Fri, 6 Feb 2026 19:38:28 +0000 Viacheslav Dubeyko wrote:
> Hello,
>
> Machine Learning (ML) is an approach to learning from data, finding patterns,
> and making predictions without developers explicitly implementing
> the algorithms. The number of application areas for ML grows every day.
> Generally speaking, ML can introduce self-evolving and self-learning
> capabilities into the Linux kernel. There are already research works
> and industry efforts that employ ML approaches for configuring and
> optimizing the Linux kernel. However, introducing ML approaches into
> the Linux kernel is neither simple nor straightforward. There are multiple
> problems and unanswered questions on this road. First of all, any ML model
> requires floating-point operations to run, but the FPU cannot be used
> directly in kernel space. Also, an ML model requires a training phase
> that can cause significant performance degradation of the Linux kernel.
> Even the inference phase could be problematic from the performance point
> of view on the kernel side. Using ML approaches in the Linux kernel is
> an inevitable step. But how can we use ML approaches in the Linux kernel?
> What infrastructure do we need to adopt ML models in the Linux kernel?
>
Given the short list (eevdf, slab, ext4, IP stack, USB bus and KVM), ML is not
needed before the second half of 2027, because it wastes minutes to make
either the liver or the pancreas intelligent. By intelligent I mean the liver
can edit a ppt in Russian. Perhaps the cerebellum is an exception.
Can you build a bot to fix syzbot reports before 2028?
* Re: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-06 19:38 [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel Viacheslav Dubeyko
2026-02-06 23:28 ` Hillf Danton
@ 2026-02-09 10:03 ` Chris Li
2026-02-09 22:28 ` Viacheslav Dubeyko
2026-02-09 10:25 ` Barry Song
2 siblings, 1 reply; 17+ messages in thread
From: Chris Li @ 2026-02-09 10:03 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: lsf-pc, Viacheslav Dubeyko, linux-mm, Pavan Rallabhandi,
linux-fsdevel, linux-kernel, bpf, Chris Mason
On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> Hello,
>
> Machine Learning (ML) is an approach to learning from data, finding patterns,
> and making predictions without developers explicitly implementing
> the algorithms. The number of application areas for ML grows every day.
> Generally speaking, ML can introduce self-evolving and self-learning
> capabilities into the Linux kernel. There are already research works
> and industry efforts that employ ML approaches for configuring and
> optimizing the Linux kernel. However, introducing ML approaches into
> the Linux kernel is neither simple nor straightforward. There are multiple
> problems and unanswered questions on this road. First of all, any ML model
> requires floating-point operations to run, but the FPU cannot be used
> directly in kernel space. Also, an ML model requires a training phase
> that can cause significant performance degradation of the Linux kernel.
> Even the inference phase could be problematic from the performance point
> of view on the kernel side. Using ML approaches in the Linux kernel is
> an inevitable step. But how can we use ML approaches in the Linux kernel?
> What infrastructure do we need to adopt ML models in the Linux kernel?
I think there are two different things; I think you want the latter,
but I am not sure:

1) Using an ML model to help kernel development: code reviews, generating
patches from descriptions, etc. For example, Chris Mason has a kernel
review repo on GitHub and he is sharing his review findings on the
mailing list:
https://github.com/masoncl/review-prompts/tree/main
It is kernel development related, but the ML agent code is running in
user space. The actual ML computation might run on GPUs/TPUs. That
does not seem to be what you have in mind.

2) Running the ML model computation in kernel space.
Can you clarify if this is what you have in mind? You mention FPU usage
in the kernel for the ML model. It is only relevant if you need to run
the floating point in kernel CPU instructions. Most ML computations do
not run in CPU instructions; they run on GPUs/TPUs. Why not keep the
ML program (PyTorch/agents) in user space and pass the data to the
GPU/TPU driver to run? There will be some kernel infrastructure like
VFIO/IOMMU involved with the GPU/TPU driver. For the most part the
kernel is just facilitating the data passing to/from the GPU/TPU
driver and then to the GPU/TPU hardware. The ML hardware is doing the
heavy lifting.
> What is the goal of using ML models in the Linux kernel? The main goal is
> to employ ML models to elaborate the logic of a particular Linux kernel
> subsystem based on processed data, and/or to derive an efficient subsystem
> configuration based on the internal state of the subsystem. As a result,
> it is necessary to: (1) collect data for training, (2) execute the ML model
> training phase, (3) test the trained ML model, and (4) use the ML model for
> executing the inference phase.
As far as I can tell, a lot of this doesn't need to be the kernel's
business. It is more of a GPU/TPU driver user-space interface thing;
it might be easier to let each driver provide its own kernel/user-space
API and then expose a common user-space library API. Are you trying to
define something like Nvidia CUDA at the kernel level?
> The ML model inference can be used for recommending a Linux kernel
> subsystem configuration and/or for injecting synthesized subsystem logic
> into kernel space (for example, eBPF logic).
That again sounds very much like a user-space issue, i.e. the above
usage case 1).
> How can ML infrastructure be designed in the Linux kernel? It requires
> introducing into the Linux kernel a special ML library that implements
> a generalized interface of interaction between the ML model's thread in
> user space and a kernel subsystem. Such an interface requires the means to:
> (1) create/initialize/destroy an ML model proxy in a kernel subsystem,
> (2) start/stop the ML model proxy, (3) get/preprocess/publish data sets
> from kernel space, (4) receive/preprocess/apply ML model recommendation(s)
> from user space, (5) execute synthesized logic/recommendations in kernel
> space, (6) estimate the efficiency of synthesized logic/recommendations,
> and (7) execute error back-propagation with the goal of correcting the ML
> model on the user-space side.
Unfortunately a lot of this will be tied to the internal
implementation of the GPU/TPU. The model needs to be compiled into
GPU/TPU machine instructions. So forcing a common interface will be
hard because the lower-level interface requirements might be very different.
Maybe having some common user-space library or ML description language
is better than forcing a kernel interface.
> The create and initialize logic can be executed by a kernel subsystem during
> module load or Linux kernel start (conversely, module unload or kernel
> shutdown will destroy the ML model proxy). The ML model thread
> in user space will be capable of re-initializing the proxy and executing
> its start/stop logic on the kernel side. First of all,
> the ML model needs to be trained with data from kernel space. The data can be
> requested by the ML model from user space, or the data can be published by
> the ML model proxy from kernel space. The sysfs interface can be used to
> orchestrate this interaction. As a result, the ML model in user space should
> be capable of extracting data set(s) from kernel space through sysfs, FUSE,
> or a character device. Extracted data can be stored in persistent storage
> and, finally, the ML model can be trained in user space by accessing this
> data.
Currently a lot of this is happening in the GPU/TPU drivers and user-space
libraries. One challenging aspect is that the hardware interface is
very different between GPUs/TPUs, and it might be challenging to expose
common interfaces.
> A continuous learning model can be adopted during the training phase.
> It implies that a kernel subsystem can receive ML model recommendations
> even during the training phase. The ML model proxy on the kernel side can
> estimate the current kernel subsystem state, try to apply the ML model
> recommendations, and estimate the efficiency of the applied recommendations.
> Generally speaking, the ML model proxy on the kernel side can support several
> modes of interaction with ML model recommendations: (1) emergency mode,
That sounds like user-space interaction again. I am not sure it belongs
in kernel space.
Chris
* Re: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-06 19:38 [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel Viacheslav Dubeyko
2026-02-06 23:28 ` Hillf Danton
2026-02-09 10:03 ` Chris Li
@ 2026-02-09 10:25 ` Barry Song
2026-02-09 22:07 ` Viacheslav Dubeyko
2 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2026-02-09 10:25 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: lsf-pc, Viacheslav Dubeyko, linux-mm, Pavan Rallabhandi,
linux-fsdevel, linux-kernel, bpf
On Sat, Feb 7, 2026 at 3:40 AM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> Hello,
>
[...]
>
> A continuous learning model can be adopted during the training phase.
> It implies that a kernel subsystem can receive ML model recommendations
> even during the training phase. The ML model proxy on the kernel side can
> estimate the current kernel subsystem state, try to apply the ML model
> recommendations, and estimate the efficiency of the applied recommendations.
> Generally speaking, the ML model proxy on the kernel side can support several
> modes of interaction with ML model recommendations: (1) emergency mode,
> (2) learning mode, (3) collaboration mode, (4) recommendation mode.
> Emergency mode is used when the kernel subsystem is in a critical state
> and is required to work as efficiently as possible without involving
> the ML model recommendations (for example, when the recommendations are
> completely inadequate or the load is very high).
> Learning mode implies that the kernel subsystem can try to apply
> the ML model recommendations for some operations with the goal of
> estimating the maturity of the ML model. The ML model proxy can also
> downgrade to learning mode if the recommendations become inefficient.
> Collaboration mode applies the ML recommendations to about 50% of
> operations with the goal of bringing the ML model to a mature state.
> And, finally, the ML model proxy can switch the kernel subsystem into
> recommendation mode if the ML model is mature enough and the efficiency
> of applying the ML recommendations is higher than that of the human-made
> algorithms.
Hi Slava,
Do we have any concrete examples where an ML-based proxy,
together with its userspace ML agent, has demonstrated
measurable performance improvements over well-designed,
human-crafted kernel algorithms?
Such examples could be in scheduling, filesystem I/O, or memory
reclamation and readahead. I think having a real, data-backed
example would be much more helpful for this discussion than
reasoning about an abstract framework without a concrete use
case.
Thanks,
Barry
* RE: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-09 10:25 ` Barry Song
@ 2026-02-09 22:07 ` Viacheslav Dubeyko
2026-02-10 3:06 ` Barry Song
0 siblings, 1 reply; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-02-09 22:07 UTC (permalink / raw)
To: 21cnbao
Cc: linux-mm, Pavan Rallabhandi, linux-fsdevel, linux-kernel, lsf-pc, bpf
Hi Barry,
On Mon, 2026-02-09 at 18:25 +0800, Barry Song wrote:
> On Sat, Feb 7, 2026 at 3:40 AM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> >
> > Hello,
> >
> [...]
> >
> > A continuous learning model can be adopted during the training phase.
> > It implies that a kernel subsystem can receive ML model recommendations
> > even during the training phase. The ML model proxy on the kernel side can
> > estimate the current kernel subsystem state, try to apply the ML model
> > recommendations, and estimate the efficiency of the applied recommendations.
> > Generally speaking, the ML model proxy on the kernel side can support several
> > modes of interaction with ML model recommendations: (1) emergency mode,
> > (2) learning mode, (3) collaboration mode, (4) recommendation mode.
> > Emergency mode is used when the kernel subsystem is in a critical state
> > and is required to work as efficiently as possible without involving
> > the ML model recommendations (for example, when the recommendations are
> > completely inadequate or the load is very high).
> > Learning mode implies that the kernel subsystem can try to apply
> > the ML model recommendations for some operations with the goal of
> > estimating the maturity of the ML model. The ML model proxy can also
> > downgrade to learning mode if the recommendations become inefficient.
> > Collaboration mode applies the ML recommendations to about 50% of
> > operations with the goal of bringing the ML model to a mature state.
> > And, finally, the ML model proxy can switch the kernel subsystem into
> > recommendation mode if the ML model is mature enough and the efficiency
> > of applying the ML recommendations is higher than that of the human-made
> > algorithms.
>
> Hi Slava,
>
> Do we have any concrete examples where an ML-based proxy,
> together with its userspace ML agent, has demonstrated
> measurable performance improvements over well-designed,
> human-crafted kernel algorithms?
>
> Such examples could be in scheduling, filesystem I/O, or memory
> reclamation and readahead. I think having a real, data-backed
> example would be much more helpful for this discussion than
> reasoning about an abstract framework without a concrete use
> case.
>
This patchset [1] is the first step of declaring the ML library API, with the
goal of discussing it. As the next step, I am considering using the ML library
API to implement two real-life use cases: (1) the GC subsystem of LFS file
systems (NILFS2, F2FS, SSDFS), (2) an ML-based DAMON approach. I see multiple
potential real-life use cases for the ML library. But let me start with these
two and, then, we will be able to extend the approach to other use cases. The
goal of this talk is to hear the opinion of the community and to elaborate a
proper vision of the ML library architecture.
Thanks,
Slava.
[1] https://lore.kernel.org/linux-fsdevel/20260206191136.2609767-1-slava@dubeyko.com/T/#t
* RE: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-09 10:03 ` Chris Li
@ 2026-02-09 22:28 ` Viacheslav Dubeyko
2026-02-10 13:47 ` [Lsf-pc] " Jan Kara
0 siblings, 1 reply; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-02-09 22:28 UTC (permalink / raw)
To: chrisl
Cc: clm, linux-mm, Pavan Rallabhandi, linux-fsdevel, linux-kernel,
lsf-pc, bpf
On Mon, 2026-02-09 at 02:03 -0800, Chris Li wrote:
> On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> >
> > Hello,
> >
> > Machine Learning (ML) is an approach to learning from data, finding patterns,
> > and making predictions without developers explicitly implementing
> > the algorithms. The number of application areas for ML grows every day.
> > Generally speaking, ML can introduce self-evolving and self-learning
> > capabilities into the Linux kernel. There are already research works
> > and industry efforts that employ ML approaches for configuring and
> > optimizing the Linux kernel. However, introducing ML approaches into
> > the Linux kernel is neither simple nor straightforward. There are multiple
> > problems and unanswered questions on this road. First of all, any ML model
> > requires floating-point operations to run, but the FPU cannot be used
> > directly in kernel space. Also, an ML model requires a training phase
> > that can cause significant performance degradation of the Linux kernel.
> > Even the inference phase could be problematic from the performance point
> > of view on the kernel side. Using ML approaches in the Linux kernel is
> > an inevitable step. But how can we use ML approaches in the Linux kernel?
> > What infrastructure do we need to adopt ML models in the Linux kernel?
>
> I think there are two different things; I think you want the latter,
> but I am not sure:
>
> 1) Using an ML model to help kernel development: code reviews, generating
> patches from descriptions, etc. For example, Chris Mason has a kernel
> review repo on GitHub and he is sharing his review findings on the
> mailing list:
> https://github.com/masoncl/review-prompts/tree/main
> It is kernel development related, but the ML agent code is running in
> user space. The actual ML computation might run on GPUs/TPUs. That
> does not seem to be what you have in mind.
>
> 2) Running the ML model computation in kernel space.
> Can you clarify if this is what you have in mind? You mention FPU usage
> in the kernel for the ML model. It is only relevant if you need to run
> the floating point in kernel CPU instructions. Most ML computations do
> not run in CPU instructions; they run on GPUs/TPUs. Why not keep the
> ML program (PyTorch/agents) in user space and pass the data to the
> GPU/TPU driver to run? There will be some kernel infrastructure like
> VFIO/IOMMU involved with the GPU/TPU driver. For the most part the
> kernel is just facilitating the data passing to/from the GPU/TPU
> driver and then to the GPU/TPU hardware. The ML hardware is doing the
> heavy lifting.
The idea is to have the ML model running in user space, with the kernel
subsystem interacting with that user-space ML model. As the next step, I am
considering two real-life use cases: (1) the GC subsystem of an LFS file
system, (2) an ML-based DAMON approach. So, for example, GC can be represented
by an ML model in user space. The GC can request data (segment states) from
kernel space, and the ML model in user space can do training and/or inference.
As a result, the ML model in user space can select victim segments and
instruct the kernel-space logic to move valid data from the victim segment(s)
into clean/current one(s).
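For illustration only, the exchange could be shaped like the records below;
the structure and field names are invented here and do not come from any
existing file system or from the patchset:

#include <linux/types.h>

/* kernel -> user space: per-segment state for the data set */
struct gc_segment_state {
	__u64 segment_id;
	__u32 valid_blocks;	/* blocks that would have to be moved */
	__u32 invalid_blocks;	/* blocks reclaimed by cleaning */
	__u64 last_write_time;	/* hint about segment "temperature" */
};

/* user space -> kernel: victim selected by the ML model */
struct gc_victim_recommendation {
	__u64 segment_id;
	__u32 expected_valid_blocks;	/* for efficiency estimation */
	__u32 reserved;
};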
>
> > What is the goal of using ML models in the Linux kernel? The main goal is
> > to employ ML models to elaborate the logic of a particular Linux kernel
> > subsystem based on processed data, and/or to derive an efficient subsystem
> > configuration based on the internal state of the subsystem. As a result,
> > it is necessary to: (1) collect data for training, (2) execute the ML model
> > training phase, (3) test the trained ML model, and (4) use the ML model for
> > executing the inference phase.
>
> As far as I can tell, a lot of this doesn't need to be the kernel's
> business. It is more of a GPU/TPU driver user-space interface thing;
> it might be easier to let each driver provide its own kernel/user-space
> API and then expose a common user-space library API. Are you trying to
> define something like Nvidia CUDA at the kernel level?
>
> > The ML model inference can be used for recommending a Linux kernel
> > subsystem configuration and/or for injecting synthesized subsystem logic
> > into kernel space (for example, eBPF logic).
>
> That again sounds very much like a user-space issue, i.e. the above
> usage case 1).
>
> > How can ML infrastructure be designed in the Linux kernel? It requires
> > introducing into the Linux kernel a special ML library that implements
> > a generalized interface of interaction between the ML model's thread in
> > user space and a kernel subsystem. Such an interface requires the means to:
> > (1) create/initialize/destroy an ML model proxy in a kernel subsystem,
> > (2) start/stop the ML model proxy, (3) get/preprocess/publish data sets
> > from kernel space, (4) receive/preprocess/apply ML model recommendation(s)
> > from user space, (5) execute synthesized logic/recommendations in kernel
> > space, (6) estimate the efficiency of synthesized logic/recommendations,
> > and (7) execute error back-propagation with the goal of correcting the ML
> > model on the user-space side.
>
> Unfortunately a lot of this will be tied to the internal
> implementation of the GPU/TPU. The model needs to be compiled into
> GPU/TPU machine instructions. So forcing a common interface will be
> hard because the lower-level interface requirements might be very different.
> Maybe having some common user-space library or ML description language
> is better than forcing a kernel interface.
>
> > The create and initialize logic can be executed by a kernel subsystem during
> > module load or Linux kernel start (conversely, module unload or kernel
> > shutdown will destroy the ML model proxy). The ML model thread
> > in user space will be capable of re-initializing the proxy and executing
> > its start/stop logic on the kernel side. First of all,
> > the ML model needs to be trained with data from kernel space. The data can be
> > requested by the ML model from user space, or the data can be published by
> > the ML model proxy from kernel space. The sysfs interface can be used to
> > orchestrate this interaction. As a result, the ML model in user space should
> > be capable of extracting data set(s) from kernel space through sysfs, FUSE,
> > or a character device. Extracted data can be stored in persistent storage
> > and, finally, the ML model can be trained in user space by accessing this
> > data.
>
> Currently a lot of this is happening in the GPU/TPU drivers and user-space
> libraries. One challenging aspect is that the hardware interface is
> very different between GPUs/TPUs, and it might be challenging to expose
> common interfaces.
>
> > A continuous learning model can be adopted during the training phase.
> > It implies that a kernel subsystem can receive ML model recommendations
> > even during the training phase. The ML model proxy on the kernel side can
> > estimate the current kernel subsystem state, try to apply the ML model
> > recommendations, and estimate the efficiency of the applied recommendations.
> > Generally speaking, the ML model proxy on the kernel side can support several
> > modes of interaction with ML model recommendations: (1) emergency mode,
>
> That sounds like user-space interaction again. I am not sure it belongs
> in kernel space.
Thanks a lot for sharing all your thoughts. :) I think I need to point out
that the ML model runs in user space, and a kernel subsystem can interact
with that user-space ML model. :) This is the main idea. The goal of the
ML library is to implement a generalized interface/functionality that gives
any kernel subsystem the capability to be extended by an ML model in user
space. And I believe that we can provide this in a generic way.
You can check the patchset [1] to see the vision of a potential
implementation of the idea.
Thanks,
Slava.
[1] https://lore.kernel.org/linux-fsdevel/20260206191136.2609767-1-slava@dubeyko.com/T/#t
* Re: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-09 22:07 ` Viacheslav Dubeyko
@ 2026-02-10 3:06 ` Barry Song
2026-02-10 19:57 ` Viacheslav Dubeyko
0 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2026-02-10 3:06 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: linux-mm, Pavan Rallabhandi, linux-fsdevel, linux-kernel, lsf-pc, bpf
On Tue, Feb 10, 2026 at 6:07 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> Hi Barry,
>
> On Mon, 2026-02-09 at 18:25 +0800, Barry Song wrote:
> > On Sat, Feb 7, 2026 at 3:40 AM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > Hello,
> > >
> > [...]
> > >
> > > A continuous learning model can be adopted during the training phase.
> > > It implies that a kernel subsystem can receive ML model recommendations
> > > even during the training phase. The ML model proxy on the kernel side can
> > > estimate the current kernel subsystem state, try to apply the ML model
> > > recommendations, and estimate the efficiency of the applied recommendations.
> > > Generally speaking, the ML model proxy on the kernel side can support several
> > > modes of interaction with ML model recommendations: (1) emergency mode,
> > > (2) learning mode, (3) collaboration mode, (4) recommendation mode.
> > > Emergency mode is used when the kernel subsystem is in a critical state
> > > and is required to work as efficiently as possible without involving
> > > the ML model recommendations (for example, when the recommendations are
> > > completely inadequate or the load is very high).
> > > Learning mode implies that the kernel subsystem can try to apply
> > > the ML model recommendations for some operations with the goal of
> > > estimating the maturity of the ML model. The ML model proxy can also
> > > downgrade to learning mode if the recommendations become inefficient.
> > > Collaboration mode applies the ML recommendations to about 50% of
> > > operations with the goal of bringing the ML model to a mature state.
> > > And, finally, the ML model proxy can switch the kernel subsystem into
> > > recommendation mode if the ML model is mature enough and the efficiency
> > > of applying the ML recommendations is higher than that of the human-made
> > > algorithms.
> >
> > Hi Slava,
> >
> > Do we have any concrete examples where an ML-based proxy,
> > together with its userspace ML agent, has demonstrated
> > measurable performance improvements over well-designed,
> > human-crafted kernel algorithms?
> >
> > Such examples could be in scheduling, filesystem I/O, or memory
> > reclamation and readahead. I think having a real, data-backed
> > example would be much more helpful for this discussion than
> > reasoning about an abstract framework without a concrete use
> > case.
> >
>
> This patchset [1] is the first step of declaring the ML library API, with the
> goal of discussing it. As the next step, I am considering using the ML library
> API to implement two real-life use cases: (1) the GC subsystem of LFS file
> systems (NILFS2, F2FS, SSDFS), (2) an ML-based DAMON approach. I see multiple
> potential real-life use cases for the ML library. But let me start with these
> two and, then, we will be able to extend the approach to other use cases. The
> goal of this talk is to hear the opinion of the community and to elaborate a
> proper vision of the ML library architecture.
I’m very interested in your real-world use case.
If you have any early-stage prototype code that demonstrates the full
flow from user space to kernel space—including both the kernel ML proxy
and the user-space ML agent (for example, for filesystem garbage
collection)—I’d be glad to take a look if you’re able to share it.
Thanks
Barry
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-09 22:28 ` Viacheslav Dubeyko
@ 2026-02-10 13:47 ` Jan Kara
2026-02-10 14:20 ` Chris Mason
2026-02-10 21:02 ` Viacheslav Dubeyko
0 siblings, 2 replies; 17+ messages in thread
From: Jan Kara @ 2026-02-10 13:47 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: chrisl, clm, linux-mm, Pavan Rallabhandi, linux-fsdevel,
linux-kernel, lsf-pc, bpf
On Mon 09-02-26 22:28:59, Viacheslav Dubeyko via Lsf-pc wrote:
> On Mon, 2026-02-09 at 02:03 -0800, Chris Li wrote:
> > On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
> > <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > Hello,
> > >
> > > Machine Learning (ML) is an approach to learning from data, finding patterns,
> > > and making predictions without developers explicitly implementing
> > > the algorithms. The number of application areas for ML grows every day.
> > > Generally speaking, ML can introduce self-evolving and self-learning
> > > capabilities into the Linux kernel. There are already research works
> > > and industry efforts that employ ML approaches for configuring and
> > > optimizing the Linux kernel. However, introducing ML approaches into
> > > the Linux kernel is neither simple nor straightforward. There are multiple
> > > problems and unanswered questions on this road. First of all, any ML model
> > > requires floating-point operations to run, but the FPU cannot be used
> > > directly in kernel space. Also, an ML model requires a training phase
> > > that can cause significant performance degradation of the Linux kernel.
> > > Even the inference phase could be problematic from the performance point
> > > of view on the kernel side. Using ML approaches in the Linux kernel is
> > > an inevitable step. But how can we use ML approaches in the Linux kernel?
> > > What infrastructure do we need to adopt ML models in the Linux kernel?
> >
> > I think there are two different things; I think you want the latter,
> > but I am not sure:
> >
> > 1) Using an ML model to help kernel development: code reviews, generating
> > patches from descriptions, etc. For example, Chris Mason has a kernel
> > review repo on GitHub and he is sharing his review findings on the
> > mailing list:
> > https://github.com/masoncl/review-prompts/tree/main
> > It is kernel development related, but the ML agent code is running in
> > user space. The actual ML computation might run on GPUs/TPUs. That
> > does not seem to be what you have in mind.
> >
> > 2) Running the ML model computation in kernel space.
> > Can you clarify if this is what you have in mind? You mention FPU usage
> > in the kernel for the ML model. It is only relevant if you need to run
> > the floating point in kernel CPU instructions. Most ML computations do
> > not run in CPU instructions; they run on GPUs/TPUs. Why not keep the
> > ML program (PyTorch/agents) in user space and pass the data to the
> > GPU/TPU driver to run? There will be some kernel infrastructure like
> > VFIO/IOMMU involved with the GPU/TPU driver. For the most part the
> > kernel is just facilitating the data passing to/from the GPU/TPU
> > driver and then to the GPU/TPU hardware. The ML hardware is doing the
> > heavy lifting.
>
> > The idea is to have the ML model running in user space, with the kernel
> > subsystem interacting with that user-space ML model. As the next step, I am
> > considering two real-life use cases: (1) the GC subsystem of an LFS file
> > system, (2) an ML-based DAMON approach. So, for example, GC can be represented
> > by an ML model in user space. The GC can request data (segment states) from
> > kernel space, and the ML model in user space can do training and/or inference.
> > As a result, the ML model in user space can select victim segments and
> > instruct the kernel-space logic to move valid data from the victim segment(s)
> > into clean/current one(s).
To be honest I'm skeptical about how generic this can be. Essentially
you're describing a generic interface to offload arbitrary kernel decisions
to userspace. ML is a userspace business here and not really relevant for
the concept AFAICT. And we already have several ways of the kernel asking
userspace to do something for it, and unless it is very restricted and well
defined it is rather painful, prone to deadlocks, security issues, etc.
So by all means, if you want to do GC decisions for your filesystem in
userspace by ML, be my guest; it does make some sense, although I'd be wary
of issues where we need to write back dirty pages to free memory, which may
now depend on your userspace helper to make a decision, which may need the
memory to make the decision... But I don't see why you need all the ML fluff
around it when it seems like just another way to call a userspace helper and
why some of the existing methods would not suffice.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-10 13:47 ` [Lsf-pc] " Jan Kara
@ 2026-02-10 14:20 ` Chris Mason
2026-02-10 22:36 ` Viacheslav Dubeyko
2026-02-10 21:02 ` Viacheslav Dubeyko
1 sibling, 1 reply; 17+ messages in thread
From: Chris Mason @ 2026-02-10 14:20 UTC (permalink / raw)
To: Jan Kara, Viacheslav Dubeyko
Cc: chrisl, linux-mm, Pavan Rallabhandi, linux-fsdevel, linux-kernel,
lsf-pc, bpf
On 2/10/26 8:47 AM, Jan Kara wrote:
> On Mon 09-02-26 22:28:59, Viacheslav Dubeyko via Lsf-pc wrote:
>> On Mon, 2026-02-09 at 02:03 -0800, Chris Li wrote:
>>> On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
>>> <Slava.Dubeyko@ibm.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Machine Learning (ML) is an approach to learning from data, finding patterns,
>>>> and making predictions without developers explicitly implementing
>>>> the algorithms. The number of application areas for ML grows every day.
>>>> Generally speaking, ML can introduce self-evolving and self-learning
>>>> capabilities into the Linux kernel. There are already research works
>>>> and industry efforts that employ ML approaches for configuring and
>>>> optimizing the Linux kernel. However, introducing ML approaches into
>>>> the Linux kernel is neither simple nor straightforward. There are multiple
>>>> problems and unanswered questions on this road. First of all, any ML model
>>>> requires floating-point operations to run, but the FPU cannot be used
>>>> directly in kernel space. Also, an ML model requires a training phase
>>>> that can cause significant performance degradation of the Linux kernel.
>>>> Even the inference phase could be problematic from the performance point
>>>> of view on the kernel side. Using ML approaches in the Linux kernel is
>>>> an inevitable step. But how can we use ML approaches in the Linux kernel?
>>>> What infrastructure do we need to adopt ML models in the Linux kernel?
>>>
>>> I think there are two different things; I think you want the latter,
>>> but I am not sure:
>>>
>>> 1) Using an ML model to help kernel development: code reviews, generating
>>> patches from descriptions, etc. For example, Chris Mason has a kernel
>>> review repo on GitHub and he is sharing his review findings on the
>>> mailing list:
>>> https://github.com/masoncl/review-prompts/tree/main
>>> It is kernel development related, but the ML agent code is running in
>>> user space. The actual ML computation might run on GPUs/TPUs. That
>>> does not seem to be what you have in mind.
>>>
>>> 2) Running the ML model computation in kernel space.
>>> Can you clarify if this is what you have in mind? You mention FPU usage
>>> in the kernel for the ML model. It is only relevant if you need to run
>>> the floating point in kernel CPU instructions. Most ML computations do
>>> not run in CPU instructions; they run on GPUs/TPUs. Why not keep the
>>> ML program (PyTorch/agents) in user space and pass the data to the
>>> GPU/TPU driver to run? There will be some kernel infrastructure like
>>> VFIO/IOMMU involved with the GPU/TPU driver. For the most part the
>>> kernel is just facilitating the data passing to/from the GPU/TPU
>>> driver and then to the GPU/TPU hardware. The ML hardware is doing the
>>> heavy lifting.
>>
>> The idea is to have the ML model running in user space, with the kernel
>> subsystem interacting with that user-space ML model. As the next step, I am
>> considering two real-life use cases: (1) the GC subsystem of an LFS file
>> system, (2) an ML-based DAMON approach. So, for example, GC can be represented
>> by an ML model in user space. The GC can request data (segment states) from
>> kernel space, and the ML model in user space can do training and/or inference.
>> As a result, the ML model in user space can select victim segments and
>> instruct the kernel-space logic to move valid data from the victim segment(s)
>> into clean/current one(s).
>
> To be honest I'm skeptical about how generic this can be. Essentially
> you're describing a generic interface to offload arbitrary kernel decisions
> to userspace. ML is a userspace business here and not really relevant for
> the concept AFAICT. And we already have several ways of the kernel asking
> userspace to do something for it, and unless it is very restricted and well
> defined it is rather painful, prone to deadlocks, security issues, etc.
>
> So by all means, if you want to do GC decisions for your filesystem in
> userspace by ML, be my guest; it does make some sense, although I'd be wary
> of issues where we need to write back dirty pages to free memory, which may
> now depend on your userspace helper to make a decision, which may need the
> memory to make the decision... But I don't see why you need all the ML fluff
> around it when it seems like just another way to call a userspace helper and
> why some of the existing methods would not suffice.
Looking through the description (not the code, apologies), it really
feels like we're reinventing BPF here:
- introspection into what the kernel is currently doing
- a communications channel with applications
- a mechanism to override specific kernel functionality
- fancy applications arbitrating decisions.
My feedback during Plumbers, and also today, is that you can get 99% of
what you're looking for with some BPF code.
It may or may not be perfect for your needs, but it's a much faster path
to generating community and collaboration around the goals. After that,
it's a lot easier to justify larger changes in the kernel.
If this becomes an LSF/MM topic, my bar for discussion would be:
- extensive data collected about some kernel component (DAMON,
scheduling, etc.)
- a working proof of concept that improved on decisions made in the kernel
- discussion of changes needed to improve or enable the proof of concept
In other words, I don't think we need a list of ways ML might be used.
I think we need specific examples of a way that ML was used and why it's
better than what the kernel is already doing.
-chris
* RE: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-10 3:06 ` Barry Song
@ 2026-02-10 19:57 ` Viacheslav Dubeyko
0 siblings, 0 replies; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-02-10 19:57 UTC (permalink / raw)
To: 21cnbao
Cc: linux-mm, Pavan Rallabhandi, linux-fsdevel, linux-kernel, lsf-pc, bpf
On Tue, 2026-02-10 at 11:06 +0800, Barry Song wrote:
> On Tue, Feb 10, 2026 at 6:07 AM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> >
> > Hi Barry,
> >
> > On Mon, 2026-02-09 at 18:25 +0800, Barry Song wrote:
> > > On Sat, Feb 7, 2026 at 3:40 AM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > [...]
> > > >
> > > > A continuous learning model can be adopted during the training phase.
> > > > It implies that a kernel subsystem can receive ML model recommendations
> > > > even during the training phase. The ML model proxy on the kernel side can
> > > > estimate the current kernel subsystem state, try to apply the ML model
> > > > recommendations, and estimate the efficiency of the applied recommendations.
> > > > Generally speaking, the ML model proxy on the kernel side can support several
> > > > modes of interaction with ML model recommendations: (1) emergency mode,
> > > > (2) learning mode, (3) collaboration mode, (4) recommendation mode.
> > > > Emergency mode is used when the kernel subsystem is in a critical state
> > > > and is required to work as efficiently as possible without involving
> > > > the ML model recommendations (for example, when the recommendations are
> > > > completely inadequate or the load is very high).
> > > > Learning mode implies that the kernel subsystem can try to apply
> > > > the ML model recommendations for some operations with the goal of
> > > > estimating the maturity of the ML model. The ML model proxy can also
> > > > downgrade to learning mode if the recommendations become inefficient.
> > > > Collaboration mode applies the ML recommendations to about 50% of
> > > > operations with the goal of bringing the ML model to a mature state.
> > > > And, finally, the ML model proxy can switch the kernel subsystem into
> > > > recommendation mode if the ML model is mature enough and the efficiency
> > > > of applying the ML recommendations is higher than that of the human-made
> > > > algorithms.
> > >
> > > Hi Slava,
> > >
> > > Do we have any concrete examples where an ML-based proxy,
> > > together with its userspace ML agent, has demonstrated
> > > measurable performance improvements over well-designed,
> > > human-crafted kernel algorithms?
> > >
> > > Such examples could be in scheduling, filesystem I/O, or memory
> > > reclamation and readahead. I think having a real, data-backed
> > > example would be much more helpful for this discussion than
> > > reasoning about an abstract framework without a concrete use
> > > case.
> > >
> >
> > This patchset [1] is the first step of declaring the ML library API, with the
> > goal of discussing it. As the next step, I am considering using the ML library
> > API to implement two real-life use cases: (1) the GC subsystem of LFS file
> > systems (NILFS2, F2FS, SSDFS), (2) an ML-based DAMON approach. I see multiple
> > potential real-life use cases for the ML library. But let me start with these
> > two and, then, we will be able to extend the approach to other use cases. The
> > goal of this talk is to hear the opinion of the community and to elaborate a
> > proper vision of the ML library architecture.
>
> I’m very interested in your real-world use case.
> If you have any early-stage prototype code that demonstrates the full
> flow from user space to kernel space—including both the kernel ML proxy
> and the user-space ML agent (for example, for filesystem garbage
> collection)—I’d be glad to take a look if you’re able to share it.
>
>
I am going to extend the early-stage prototype code [1] for a real-life
use case. [2] is the Linux kernel with the integrated ML library. And [3] is
the patchset of this early-stage prototype code that I've shared recently.
It will be great to hear your opinion. :)
Thanks,
Slava.
[1] https://github.com/kernel-ml-lib/ml-lib
[2] https://github.com/kernel-ml-lib/ml-lib-linux
[3] https://lore.kernel.org/linux-fsdevel/20260206191136.2609767-1-slava@dubeyko.com/T/#t
* RE: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-10 13:47 ` [Lsf-pc] " Jan Kara
2026-02-10 14:20 ` Chris Mason
@ 2026-02-10 21:02 ` Viacheslav Dubeyko
2026-02-11 9:55 ` Jan Kara
1 sibling, 1 reply; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-02-10 21:02 UTC (permalink / raw)
To: jack
Cc: linux-mm, linux-fsdevel, linux-kernel, lsf-pc, chrisl, bpf,
Pavan Rallabhandi, clm
On Tue, 2026-02-10 at 14:47 +0100, Jan Kara wrote:
> On Mon 09-02-26 22:28:59, Viacheslav Dubeyko via Lsf-pc wrote:
> > On Mon, 2026-02-09 at 02:03 -0800, Chris Li wrote:
> > > On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
> > > <Slava.Dubeyko@ibm.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > Machine Learning (ML) is an approach to learning from data, finding patterns,
> > > > and making predictions without developers explicitly implementing
> > > > the algorithms. The number of application areas for ML grows every day.
> > > > Generally speaking, ML can introduce self-evolving and self-learning
> > > > capabilities into the Linux kernel. There are already research works
> > > > and industry efforts that employ ML approaches for configuring and
> > > > optimizing the Linux kernel. However, introducing ML approaches into
> > > > the Linux kernel is neither simple nor straightforward. There are multiple
> > > > problems and unanswered questions on this road. First of all, any ML model
> > > > requires floating-point operations to run, but the FPU cannot be used
> > > > directly in kernel space. Also, an ML model requires a training phase
> > > > that can cause significant performance degradation of the Linux kernel.
> > > > Even the inference phase could be problematic from the performance point
> > > > of view on the kernel side. Using ML approaches in the Linux kernel is
> > > > an inevitable step. But how can we use ML approaches in the Linux kernel?
> > > > What infrastructure do we need to adopt ML models in the Linux kernel?
> > >
> > > I think there are two different things; I think you want the latter,
> > > but I am not sure:
> > >
> > > 1) Using an ML model to help kernel development: code reviews, generating
> > > patches from descriptions, etc. For example, Chris Mason has a kernel
> > > review repo on GitHub and he is sharing his review findings on the
> > > mailing list:
> > > https://github.com/masoncl/review-prompts/tree/main
> > > It is kernel development related, but the ML agent code is running in
> > > user space. The actual ML computation might run on GPUs/TPUs. That
> > > does not seem to be what you have in mind.
> > >
> > > 2) Running the ML model computation in kernel space.
> > > Can you clarify if this is what you have in mind? You mention FPU usage
> > > in the kernel for the ML model. It is only relevant if you need to run
> > > the floating point in kernel CPU instructions. Most ML computations do
> > > not run in CPU instructions; they run on GPUs/TPUs. Why not keep the
> > > ML program (PyTorch/agents) in user space and pass the data to the
> > > GPU/TPU driver to run? There will be some kernel infrastructure like
> > > VFIO/IOMMU involved with the GPU/TPU driver. For the most part the
> > > kernel is just facilitating the data passing to/from the GPU/TPU
> > > driver and then to the GPU/TPU hardware. The ML hardware is doing the
> > > heavy lifting.
> >
> > The idea is to have the ML model running in user space, with the kernel
> > subsystem interacting with that user-space ML model. As the next step, I am
> > considering two real-life use cases: (1) the GC subsystem of an LFS file
> > system, (2) an ML-based DAMON approach. So, for example, GC can be represented
> > by an ML model in user space. The GC can request data (segment states) from
> > kernel space, and the ML model in user space can do training and/or inference.
> > As a result, the ML model in user space can select victim segments and
> > instruct the kernel-space logic to move valid data from the victim segment(s)
> > into clean/current one(s).
>
> To be honest I'm skeptical about how generic this can be. Essentially
> you're describing a generic interface to offload arbitrary kernel decisions
> to userspace. ML is a userspace business here and not really relevant for
> the concept AFAICT. And we already have several ways of the kernel asking
> userspace to do something for it, and unless it is very restricted and well
> defined it is rather painful, prone to deadlocks, security issues, etc.
Scepticism is a normal reaction. :) So, there is nothing wrong with being
sceptical. I believe it can be pretty generic from the data flow point of
view. Probably, different kernel subsystems could require different ways of
interacting with user space. However, if we are talking about data flow and
NOT execution flow, then it could be generic enough. And if it can be
generic, then we can suggest a generic way of extending any kernel subsystem
with ML support.
I don't think we need to consider the ML library approach as "the kernel
asking userspace to do something". Rather, it should be considered as a model
where "the kernel shares data with user space and user space recommends
something to the kernel". So, the user-space agent (ML model) can request
data from kernel space, or the kernel subsystem can notify the user-space
agent that data is available. And it's up to the kernel subsystem
implementation which data can be shared with user space. So, the ML model can
be trained in user space and, then, share recommendations (or eBPF code, for
example) with kernel space. Finally, it's up to the kernel subsystem how and
when to apply these recommendations on the kernel side.
>
> So by all means, if you want to do GC decisions for your filesystem in
> userspace by ML, be my guest; it does make some sense, although I'd be wary
> of issues where we need to write back dirty pages to free memory, which may
> now depend on your userspace helper to make a decision, which may need the
> memory to make the decision... But I don't see why you need all the ML fluff
> around it when it seems like just another way to call a userspace helper and
> why some of the existing methods would not suffice.
>
OK, I see. :) You understood GC as a subsystem that helps the kernel memory
subsystem manage the writeback of dirty memory pages. :) It is a potential
direction and I like your suggestion. :) But I meant something different,
because I am considering the GC subsystem of an LFS file system. So, if we
are using a Copy-On-Write (COW) policy, then after update operations we have
segments or erase blocks with a mixture of valid and invalid logical blocks.
And we need a GC subsystem to clean old segments by moving valid logical
blocks from exhausted segments into clean/current ones. The problem here is
to find an efficient algorithm for selecting victim segments with the
smallest number of valid blocks, with the goal of decreasing write
amplification. So, the file system needs to share the metadata details
(segment states, for example), the ML model can share recommendations, and
the kernel code of the file system can finally move the valid blocks in
the background.
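As a point of reference, the classic human-made baseline here is a greedy
policy that picks the segment with the fewest valid blocks, so the ML model
would have to beat something like the sketch below (it reuses the
hypothetical gc_segment_state record from my earlier reply and is not taken
from any real file system):

/* greedy victim selection: the baseline the ML recommendation competes with */
static __u64 pick_victim_greedy(const struct gc_segment_state *segs, int nr)
{
	int i, best = 0;

	for (i = 1; i < nr; i++) {
		/*
		 * Fewer valid blocks means less data to move and, so,
		 * less write amplification when cleaning this segment.
		 */
		if (segs[i].valid_blocks < segs[best].valid_blocks)
			best = i;
	}
	return segs[best].segment_id;
}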
I don't want to say that ML is a miracle that can solve all our problems;
it cannot work efficiently for all possible problems. But it can help us
solve some complicated issues, and it makes sense to elaborate a generic
framework for ML adoption into the Linux kernel.
Thanks,
Slava.
* RE: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-10 14:20 ` Chris Mason
@ 2026-02-10 22:36 ` Viacheslav Dubeyko
2026-02-11 1:30 ` SeongJae Park
0 siblings, 1 reply; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-02-10 22:36 UTC (permalink / raw)
To: jack, clm
Cc: bpf, linux-mm, chrisl, Pavan Rallabhandi, linux-kernel,
linux-fsdevel, lsf-pc
On Tue, 2026-02-10 at 09:20 -0500, Chris Mason wrote:
> On 2/10/26 8:47 AM, Jan Kara wrote:
> > On Mon 09-02-26 22:28:59, Viacheslav Dubeyko via Lsf-pc wrote:
> > > On Mon, 2026-02-09 at 02:03 -0800, Chris Li wrote:
> > > > On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
> > > > <Slava.Dubeyko@ibm.com> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > Machine Learning (ML) is approach/area of learning from data,
> > > > > finding patterns, and making predictions without implementing algorithms
> > > > > by developers. The number of areas of ML applications is growing
> > > > > with every day. Generally speaking, ML can introduce a self-evolving and
> > > > > self-learning capability in Linux kernel. There are already research works
> > > > > and industry efforts to employ ML approaches for configuration and
> > > > > optimization the Linux kernel. However, introduction of ML approaches
> > > > > in Linux kernel is not so simple and straightforward way. There are multiple
> > > > > problems and unanswered questions on this road. First of all, any ML model
> > > > > requires the floating-point operations (FPU) for running. But there is
> > > > > no direct use of FPUs in kernel space. Also, ML model requires training phase
> > > > > that can be a reason of significant performance degradation of Linux kernel.
> > > > > Even inference phase could be problematic from the performance point of view
> > > > > on kernel side. The using of ML approaches in Linux kernel is inevitable step.
> > > > > But, how can we use ML approaches in Linux kernel? Which infrastructure
> > > > > do we need to adopt ML models in Linux kernel?
> > > >
> > > > I think there are two different things, I think you want the latter
> > > > but I am not sure
> > > >
> > > > 1) using ML model to help kernel development, code reviews, generate
> > > > patches by descriptions etc. For example, Chris Mason has a kernel
> > > > review repo on github and he is sharing his review finding the mailing
> > > > list:
> > > > https://github.com/masoncl/review-prompts/tree/main
> > > > It is kernel development related, but the ML agent code is running in
> > > > the user space. The actual ML computation might run GPU/TPUs. That
> > > > does not seem to be what you have in mind.
> > > >
> > > > 2) Run the ML model computation in the kernel space.
> > > > Can you clarify if this is what you have in mind? You mention kernel
> > > > FPU usage in the kernel for ML model. It is only relevant if you need
> > > > to run the FP in the kernel CPU instructions. Most ML computations are
> > > > not run in CPU instructions. They run on GPUs/TPUs. Why not keep the
> > > > ML program (PyTorch/agents) in the user space and pass the data to the
> > > > GPU/TPU driver to run? There will be some kernel instructure like
> > > > VFIO/IOMMU involved with the GPU/TPU driver. For the most part the
> > > > kernel is just facilitating the data passing to/from the GPU/TPU
> > > > driver then to the GPU/TPU hardware. The ML hardware is doing the
> > > > heavy lifting.
> > >
> > > The idea is to have ML model running in user-space and kernel subsystem can
> > > interact with ML model in user-space. As the next step, I am considering two
> > > real-life use-cases: (1) GC subsystem of LFS file system, (2) ML-based DAMON
> > > approach. So, for example, GC can be represented by ML model in user-space. GC
> > > can request data (segments state) from kernel-space and ML model in user-space
> > > can do training or/and inference. As a result, ML model in user-space can select
> > > victim segments and instruct kernel-space logic of moving valid data from victim
> > > segment(s) into clean/current one(s).
> >
> > To be honest I'm skeptical about how generic this can be. Essentially
> > you're describing a generic interface to offload arbitrary kernel decision
> > to userspace. ML is a userspace bussiness here and not really relevant for
> > the concept AFAICT. And we already have several ways of kernel asking
> > userspace to do something for it and unless it is very restricted and well
> > defined it is rather painful, prone to deadlocks, security issues etc.
> >
> > So by all means if you want to do GC decisions for your filesystem in
> > userspace by ML, be my guest, it does make some sense although I'd be wary
> > of issues where we need to writeback dirty pages to free memory which may
> > now depend on your userspace helper to make a decision which may need the
> > memory to do the decision... But I don't see why you need all the ML fluff
> > around it when it seems like just another way to call userspace helper and
> > why some of the existing methods would not suffice.
>
> Looking through the description (not the code, apologies), it really
> feels like we're reinventing BPF here:
>
> - introspection into what the kernel is currently doing
> - communications channel with applications
> - a mechanism to override specific kernel functionality
> - fancy applications arbitrating decisions.
>
> My feedback during plumbers and also today is that you can get 99% of
> what you're looking for with some BPF code.
I see your point. And I can agree with you that eBPF could be used as the
communication channel; I am not trying to invent a new one. My point is that
the ML library should be the unified means of extending a kernel subsystem with
ML model(s) in user-space. So, eBPF could be one of the possible communication
mechanisms (or maybe the only one). The ML library should provide a unified
framework and workflow so that kernel subsystems can easily add and use ML
model(s) living in user-space.
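For example, the user-space side of such a channel could be as simple as
draining a BPF ring buffer with libbpf; the object file name, map name, and
event layout below are hypothetical and only illustrate the data flow:

// All names below (object file, map, event layout) are hypothetical.
#include <stdio.h>
#include <stdint.h>
#include <bpf/libbpf.h>

struct segment_event {
	uint32_t seg_id;
	uint32_t valid_blocks;
};

static int handle_event(void *ctx, void *data, size_t size)
{
	const struct segment_event *ev = data;

	/* Feed the sample into the user-space ML model here. */
	printf("segment %u: %u valid blocks\n", ev->seg_id, ev->valid_blocks);
	return 0;
}

int main(void)
{
	struct bpf_object *obj = bpf_object__open_file("ml_gc_proxy.bpf.o", NULL);
	struct bpf_map *map;
	struct ring_buffer *rb;

	if (!obj || bpf_object__load(obj))
		return 1;

	map = bpf_object__find_map_by_name(obj, "segment_events");
	if (!map)
		return 1;

	rb = ring_buffer__new(bpf_map__fd(map), handle_event, NULL, NULL);
	if (!rb)
		return 1;

	/* Poll until an error or a signal interrupts us. */
	while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
		;

	ring_buffer__free(rb);
	bpf_object__close(obj);
	return 0;
}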
>
> It may or may not be perfect for your needs, but it's a much faster path
> to generate community and collaboration around the goals. After that,
> it's a lot easier to justify larger changes in the kernel.
>
Yeah, makes sense. My current patchset explores the API that the ML library
should provide. And eBPF could be the communication channel between the ML
model in user-space and the kernel subsystem.
> If this becomes an LSF/MM topic, my bar for discussion would be:
> - extensive data collected about some kernel component (Damon,
> scheduling etc)
Exactly, an ML-based DAMON approach built on the ML library is my next
implementation/exploration step.
> - working proof of concept that improved on decisions made in the kernel
Also, I consider the GC of an LFS file system a low-hanging fruit for checking
the ML library approach, especially because NILFS2, for example, already runs
GC as a user-space process and needs an efficient GC policy. So, it could be a
potential proof of concept for the whole idea. Ideally, several use-cases
should benefit from it.
> - discussion of changes needed to improve or enable the proof of concept
Makes sense. This is why I've shared the patchset with an initial vision of the
ML library API. The goal is to hear all possible criticism and to check the
capability of the idea (and of me) to survive. :)
>
> In other words, I don't think we need a list of ways ML might be used.
> I think we need specific examples of a way that ML was used and why it's
> better than what the kernel is already doing.
>
Yes, as the next step, I am going to explore: (1) the GC of an LFS file system
use-case, (2) an ML-based DAMON approach. I hope to have enough time to
implement them before May and to share some numbers/results.
Thanks,
Slava.
^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-10 22:36 ` Viacheslav Dubeyko
@ 2026-02-11 1:30 ` SeongJae Park
2026-02-11 20:29 ` Viacheslav Dubeyko
0 siblings, 1 reply; 17+ messages in thread
From: SeongJae Park @ 2026-02-11 1:30 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: SeongJae Park, jack, clm, bpf, linux-mm, chrisl,
Pavan Rallabhandi, linux-kernel, linux-fsdevel, lsf-pc
On Tue, 10 Feb 2026 22:36:35 +0000 Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> Exactly, ML-based DAMON approach by using ML library is my next
> implementation/exploring step.
Glad to hear this. If you have any questions or need help with DAMON while doing
this, please feel free to reach out. I will be more than happy to help :)
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-10 21:02 ` Viacheslav Dubeyko
@ 2026-02-11 9:55 ` Jan Kara
2026-02-12 0:53 ` Viacheslav Dubeyko
0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2026-02-11 9:55 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: jack, linux-mm, linux-fsdevel, linux-kernel, lsf-pc, chrisl, bpf,
Pavan Rallabhandi, clm
On Tue 10-02-26 21:02:12, Viacheslav Dubeyko wrote:
> On Tue, 2026-02-10 at 14:47 +0100, Jan Kara wrote:
> > On Mon 09-02-26 22:28:59, Viacheslav Dubeyko via Lsf-pc wrote:
> > > The idea is to have ML model running in user-space and kernel subsystem can
> > > interact with ML model in user-space. As the next step, I am considering two
> > > real-life use-cases: (1) GC subsystem of LFS file system, (2) ML-based DAMON
> > > approach. So, for example, GC can be represented by ML model in user-space. GC
> > > can request data (segments state) from kernel-space and ML model in user-space
> > > can do training or/and inference. As a result, ML model in user-space can select
> > > victim segments and instruct kernel-space logic of moving valid data from victim
> > > segment(s) into clean/current one(s).
> >
> > To be honest I'm skeptical about how generic this can be. Essentially
> > you're describing a generic interface to offload arbitrary kernel decision
> > to userspace. ML is a userspace bussiness here and not really relevant for
> > the concept AFAICT. And we already have several ways of kernel asking
> > userspace to do something for it and unless it is very restricted and well
> > defined it is rather painful, prone to deadlocks, security issues etc.
>
> Scepticism is normal reaction. :) So, nothing wrong is to be sceptical.
>
> I believe it can be pretty generic from the data flow point of view. Probably,
> different kernel subsystems could require different ways of interaction with
> user-space. However, if we are talking about data flow but NOT execution flow,
> then it could be generic enough. And if it can be generic, then we can suggest
> generic way of extending any kernel subsystem by ML support.
>
> I don't think that we need to consider the ML library appraoch like "kernel
> asking userspace to do something". Rather it needs to consider the model like
> "kernel share data with user-space and user-space recommends something to
> kernel". So, user-space agent (ML model) can request data from kernel space or
> kernel subsystem can notify the user-space agent that data is available. And
> it's up to kernel subsystem implementation which data could be shared with user-
> space. So, ML model can be trained in user-space and, then, share
> recommendations (or eBPF code, for example) with kernel space. Finally, it's up
> to kernel subsystem how and when to apply these recommendations on kernel side.
I guess I have to see some examples, because so far it sounds so generic
that I'm failing to see the value in it :)
> > So by all means if you want to do GC decisions for your filesystem in
> > userspace by ML, be my guest, it does make some sense although I'd be wary
> > of issues where we need to writeback dirty pages to free memory which may
> > now depend on your userspace helper to make a decision which may need the
> > memory to do the decision... But I don't see why you need all the ML fluff
> > around it when it seems like just another way to call userspace helper and
> > why some of the existing methods would not suffice.
> >
>
> OK. I see. :) You understood GC like a subsystem that helps to kernel
> memory subsystem to manage the writeback dirty memory pages. :) It's
> potential direction and I like your suggestion. :) But I meant something
> different because I consider of LFS file system's GC subsystem. So, if we
> are using Copy-On-Write (COW) policy, then we have segments or erase
> blocks with a mixture of valid and invalid logical blocks after update
> operations. And we need GC subsystem to clean old segments by means of
> moving valid logical blocks from exhausted segments into clean/current
> ones. The problem here is to find an efficient algorithm of selecting
> victim segments with smallest amount of valid blocks with the goal of
> decreasing write amplification. So, file system needs to share the
> metadata details (segments state, for example), ML model can share the
> recommendations, and kernel code of file system can finally move valid
> blocks in the background.
No, I actually meant the LFS file system GC as you talk about it. But I was
just too terse about my concerns: As you said an LFS with COW needs to
select a new position to write each block. When there is no free block
available, it has to select partially used erase block (some logical blocks
in it became invalid) to reuse. And for this selection you want to use ML
AFAIU. Hence we have a dependency folio writeback -> COW block allocation ->
GC to make some block free -> ML decision. And now you have to be really
careful so that "ML decision" doesn't even indirectly depend on folio
writeback to complete. And bear in mind that e.g. if the code doing "ML
decision" dirties some mmaped file pages it *will* block waiting for page
writeback to complete to get the system below the limit of dirty pages.
This is the kind of deadlock I'm talking about that is hard to avoid when
offloading kernel decisions to userspace (and yes, I've seen this kind of
deadlock in practice, in various shapes and forms, with various methods,
whenever the kernel depended on userspace to make forward progress).
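One partial mitigation on the helper side is to pre-fault and pin the helper's
whole working set at startup so the decision path never allocates or dirties
pages; a minimal sketch, assuming the helper can bound its memory up front (it
would also have to avoid buffered writes to the filesystem it serves):

// Sketch of a helper hardening itself against the writeback dependency
// described above. The 64 MiB arena size is an arbitrary assumption.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define ARENA_SIZE (64UL << 20)

int main(void)
{
	/* Lock current and all future mappings into RAM. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
		perror("mlockall");
		return 1;
	}

	/* Pre-fault a fixed arena; later work must reuse this memory
	 * instead of allocating on the decision path. */
	void *arena = malloc(ARENA_SIZE);
	if (!arena)
		return 1;
	memset(arena, 0, ARENA_SIZE);

	/* ... run the ML decision loop inside the pinned arena ... */
	return 0;
}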
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-11 1:30 ` SeongJae Park
@ 2026-02-11 20:29 ` Viacheslav Dubeyko
0 siblings, 0 replies; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-02-11 20:29 UTC (permalink / raw)
To: sj
Cc: jack, linux-fsdevel, linux-mm, linux-kernel, lsf-pc, chrisl, bpf,
Pavan Rallabhandi, clm
On Tue, 2026-02-10 at 17:30 -0800, SeongJae Park wrote:
> On Tue, 10 Feb 2026 22:36:35 +0000 Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> > Exactly, ML-based DAMON approach by using ML library is my next
> > implementation/exploring step.
>
> Glad to hear this. If you find any question or need help for DAMON while doing
> this, please feel free to reach out. I will be more than happy to help :)
>
>
Sounds good! Let me start my implementation efforts and I'll share my questions.
:)
Thanks,
Slava.
^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-11 9:55 ` Jan Kara
@ 2026-02-12 0:53 ` Viacheslav Dubeyko
2026-02-12 11:02 ` Jan Kara
0 siblings, 1 reply; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-02-12 0:53 UTC (permalink / raw)
To: jack
Cc: linux-mm, linux-fsdevel, linux-kernel, lsf-pc, chrisl, bpf, clm,
Pavan Rallabhandi
On Wed, 2026-02-11 at 10:55 +0100, Jan Kara wrote:
> On Tue 10-02-26 21:02:12, Viacheslav Dubeyko wrote:
> > On Tue, 2026-02-10 at 14:47 +0100, Jan Kara wrote:
> > > On Mon 09-02-26 22:28:59, Viacheslav Dubeyko via Lsf-pc wrote:
> > > > The idea is to have ML model running in user-space and kernel subsystem can
> > > > interact with ML model in user-space. As the next step, I am considering two
> > > > real-life use-cases: (1) GC subsystem of LFS file system, (2) ML-based DAMON
> > > > approach. So, for example, GC can be represented by ML model in user-space. GC
> > > > can request data (segments state) from kernel-space and ML model in user-space
> > > > can do training or/and inference. As a result, ML model in user-space can select
> > > > victim segments and instruct kernel-space logic of moving valid data from victim
> > > > segment(s) into clean/current one(s).
> > >
> > > To be honest I'm skeptical about how generic this can be. Essentially
> > > you're describing a generic interface to offload arbitrary kernel decision
> > > to userspace. ML is a userspace bussiness here and not really relevant for
> > > the concept AFAICT. And we already have several ways of kernel asking
> > > userspace to do something for it and unless it is very restricted and well
> > > defined it is rather painful, prone to deadlocks, security issues etc.
> >
> > Scepticism is normal reaction. :) So, nothing wrong is to be sceptical.
> >
> > I believe it can be pretty generic from the data flow point of view. Probably,
> > different kernel subsystems could require different ways of interaction with
> > user-space. However, if we are talking about data flow but NOT execution flow,
> > then it could be generic enough. And if it can be generic, then we can suggest
> > generic way of extending any kernel subsystem by ML support.
> >
> > I don't think that we need to consider the ML library appraoch like "kernel
> > asking userspace to do something". Rather it needs to consider the model like
> > "kernel share data with user-space and user-space recommends something to
> > kernel". So, user-space agent (ML model) can request data from kernel space or
> > kernel subsystem can notify the user-space agent that data is available. And
> > it's up to kernel subsystem implementation which data could be shared with user-
> > space. So, ML model can be trained in user-space and, then, share
> > recommendations (or eBPF code, for example) with kernel space. Finally, it's up
> > to kernel subsystem how and when to apply these recommendations on kernel side.
>
> I guess I have to see some examples. Because so far it sounds so generic
> that I'm failing to see a value in this :)
I completely see your point. And I am not going to push anything abstract. I am
going to implement the ML-based approach for several real-life use-cases. So, I
will either have something real or I will fail. :)
>
> > > So by all means if you want to do GC decisions for your filesystem in
> > > userspace by ML, be my guest, it does make some sense although I'd be wary
> > > of issues where we need to writeback dirty pages to free memory which may
> > > now depend on your userspace helper to make a decision which may need the
> > > memory to do the decision... But I don't see why you need all the ML fluff
> > > around it when it seems like just another way to call userspace helper and
> > > why some of the existing methods would not suffice.
> > >
> >
> > OK. I see. :) You understood GC like a subsystem that helps to kernel
> > memory subsystem to manage the writeback dirty memory pages. :) It's
> > potential direction and I like your suggestion. :) But I meant something
> > different because I consider of LFS file system's GC subsystem. So, if we
> > are using Copy-On-Write (COW) policy, then we have segments or erase
> > blocks with a mixture of valid and invalid logical blocks after update
> > operations. And we need GC subsystem to clean old segments by means of
> > moving valid logical blocks from exhausted segments into clean/current
> > ones. The problem here is to find an efficient algorithm of selecting
> > victim segments with smallest amount of valid blocks with the goal of
> > decreasing write amplification. So, file system needs to share the
> > metadata details (segments state, for example), ML model can share the
> > recommendations, and kernel code of file system can finally move valid
> > blocks in the background.
>
> No, I actually meant the LFS file system GC as you talk about it. But I was
> just too terse about my concerns: As you said an LFS with COW needs to
> select a new position to write each block. When there is no free block
> available, it has to select partially used erase block (some logical blocks
> in it became invalid) to reuse.
>
I assume that you imply F2FS here, because I cannot imagine how an LFS file
system (like NILFS2) can do something like this. In an LFS file system, you add
logs into the current segment(s). Even if some logical blocks have been
invalidated in the current segment, you keep adding logs into its head/tail
until it is completely exhausted. Then the file system needs to allocate a
completely clean/free segment to become the current one and receive the logs.
So, GC has to take completely exhausted segments for cleaning. In a pure COW
file system, you cannot write anything into such a segment until complete
invalidation + "erase"/clean. So, GC moves valid blocks from completely
exhausted segments into the current one(s). It is the responsibility of GC to
guarantee that the file system does not run out of free physical space while it
still has free logical blocks. And if we do run out of free physical space,
then operations stop because of the GC's failure to keep enough clean segments.
> And for this selection you want to use ML
> AFAIU. Hence we have a dependency folio writeback -> COW block allocation ->
> GC to make some block free -> ML decision.
>
Usually, GC works in the background. So, the ML model in user-space gets the
segment-state metadata from the file system. Then, it selects one or several
segments and recommends that the file system move the valid blocks of the
selected segment ID(s), together with a maximum number of valid blocks per
single operation. A background process of the file system checks that these
logical blocks of the exhausted segment are still valid and initiates the
operation of moving them into the current segment by adding another log.
Finally, we have two flows: (1) regular file system operations: folio writeback
-> COW block allocation -> add log into current segment; (2) GC operations: ML
GC decision -> recommendation to move valid blocks of a segment -> check that
each logical block is still valid -> read block content (skip a logical block
if its folio is in the page cache) -> add log into current segment -> update
metadata.
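To make flow (2) more concrete, here is a rough sketch of the kernel-side
background logic; every type and helper in it (lfs_sb, block_is_valid(),
folio_in_page_cache(), read_block(), append_log(), update_metadata()) is a
hypothetical stand-in for real file system machinery:

/* All types and helpers below are hypothetical stand-ins. */
static int gc_apply_recommendation(struct lfs_sb *sb, u32 victim_seg,
				   u32 max_blocks)
{
	u32 moved = 0;

	for (u32 blk = 0; blk < sb->blocks_per_segment; blk++) {
		void *data;

		if (moved >= max_blocks)
			break;

		/* The block may have been invalidated since the model ran. */
		if (!block_is_valid(sb, victim_seg, blk))
			continue;

		/* Skip blocks whose folio is in the page cache; regular
		 * writeback will relocate them anyway. */
		if (folio_in_page_cache(sb, victim_seg, blk))
			continue;

		data = read_block(sb, victim_seg, blk);
		if (!data)
			return -EIO;

		append_log(sb, data);			/* into current segment */
		update_metadata(sb, victim_seg, blk);	/* invalidate old copy */
		moved++;
	}
	return 0;
}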
Thanks,
Slava.
> And now you have to be really
> careful so that "ML decision" doesn't even indirectly depend on folio
> writeback to complete. And bear in mind that e.g. if the code doing "ML
> decision" dirties some mmaped file pages it *will* block waiting for page
> writeback to complete to get the system below the limit of dirty pages.
> This is the kind of deadlock I'm talking about that is hard to avoid when
> offloading kernel decisions to userspace (and yes, I've seen these kind of
> deadlocks in practice in various shapes and forms with various methods when
> kernel depended on userspace to make forward progress).
>
> Honza
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
2026-02-12 0:53 ` Viacheslav Dubeyko
@ 2026-02-12 11:02 ` Jan Kara
0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2026-02-12 11:02 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: jack, linux-mm, linux-fsdevel, linux-kernel, lsf-pc, chrisl, bpf,
clm, Pavan Rallabhandi
On Thu 12-02-26 00:53:37, Viacheslav Dubeyko wrote:
> On Wed, 2026-02-11 at 10:55 +0100, Jan Kara wrote:
> > On Tue 10-02-26 21:02:12, Viacheslav Dubeyko wrote:
> > > On Tue, 2026-02-10 at 14:47 +0100, Jan Kara wrote:
> > > > On Mon 09-02-26 22:28:59, Viacheslav Dubeyko via Lsf-pc wrote:
> > > > > The idea is to have ML model running in user-space and kernel subsystem can
> > > > > interact with ML model in user-space. As the next step, I am considering two
> > > > > real-life use-cases: (1) GC subsystem of LFS file system, (2) ML-based DAMON
> > > > > approach. So, for example, GC can be represented by ML model in user-space. GC
> > > > > can request data (segments state) from kernel-space and ML model in user-space
> > > > > can do training or/and inference. As a result, ML model in user-space can select
> > > > > victim segments and instruct kernel-space logic of moving valid data from victim
> > > > > segment(s) into clean/current one(s).
> > > >
> > > > To be honest I'm skeptical about how generic this can be. Essentially
> > > > you're describing a generic interface to offload arbitrary kernel decision
> > > > to userspace. ML is a userspace bussiness here and not really relevant for
> > > > the concept AFAICT. And we already have several ways of kernel asking
> > > > userspace to do something for it and unless it is very restricted and well
> > > > defined it is rather painful, prone to deadlocks, security issues etc.
> > >
> > > Scepticism is normal reaction. :) So, nothing wrong is to be sceptical.
> > >
> > > I believe it can be pretty generic from the data flow point of view. Probably,
> > > different kernel subsystems could require different ways of interaction with
> > > user-space. However, if we are talking about data flow but NOT execution flow,
> > > then it could be generic enough. And if it can be generic, then we can suggest
> > > generic way of extending any kernel subsystem by ML support.
> > >
> > > I don't think that we need to consider the ML library appraoch like "kernel
> > > asking userspace to do something". Rather it needs to consider the model like
> > > "kernel share data with user-space and user-space recommends something to
> > > kernel". So, user-space agent (ML model) can request data from kernel space or
> > > kernel subsystem can notify the user-space agent that data is available. And
> > > it's up to kernel subsystem implementation which data could be shared with user-
> > > space. So, ML model can be trained in user-space and, then, share
> > > recommendations (or eBPF code, for example) with kernel space. Finally, it's up
> > > to kernel subsystem how and when to apply these recommendations on kernel side.
> >
> > I guess I have to see some examples. Because so far it sounds so generic
> > that I'm failing to see a value in this :)
>
> I completely see your point. And I am not going to push anything abstract
> one. I am going to implement ML-based approach for several real-life
> use-cases. So, I will have something real or I will fail. :)
OK, good then :)
> > > > So by all means if you want to do GC decisions for your filesystem in
> > > > userspace by ML, be my guest, it does make some sense although I'd be wary
> > > > of issues where we need to writeback dirty pages to free memory which may
> > > > now depend on your userspace helper to make a decision which may need the
> > > > memory to do the decision... But I don't see why you need all the ML fluff
> > > > around it when it seems like just another way to call userspace helper and
> > > > why some of the existing methods would not suffice.
> > > >
> > >
> > > OK. I see. :) You understood GC like a subsystem that helps to kernel
> > > memory subsystem to manage the writeback dirty memory pages. :) It's
> > > potential direction and I like your suggestion. :) But I meant something
> > > different because I consider of LFS file system's GC subsystem. So, if we
> > > are using Copy-On-Write (COW) policy, then we have segments or erase
> > > blocks with a mixture of valid and invalid logical blocks after update
> > > operations. And we need GC subsystem to clean old segments by means of
> > > moving valid logical blocks from exhausted segments into clean/current
> > > ones. The problem here is to find an efficient algorithm of selecting
> > > victim segments with smallest amount of valid blocks with the goal of
> > > decreasing write amplification. So, file system needs to share the
> > > metadata details (segments state, for example), ML model can share the
> > > recommendations, and kernel code of file system can finally move valid
> > > blocks in the background.
> >
> > No, I actually meant the LFS file system GC as you talk about it. But I was
> > just too terse about my concerns: As you said an LFS with COW needs to
> > select a new position to write each block. When there is no free block
> > available, it has to select partially used erase block (some logical blocks
> > in it became invalid) to reuse.
> >
>
> I assume that you imply F2FS here. Because, I cannot imagine how LFS file system
> (like NILFS2) can do something like this. If it's LFS file system, then you add
> logs into the current segment(s). Even if some logical blocks have been
> invalidated into this segment, then you add another log into the head/tail of
> current segment until complete exhaustion of it. And it needs to allocate the
> completely clean/free segment to be current and receive the logs. So, you need
> to take completely exhausted segment for cleaning by GC. If you have pure COW
> file system, then you cannot write anything in likewise segment until complete
> invalidation + "erase"/clean. So, GC moves valid blocks from completely
> exhausted segment into the current one(s). It's responsibility of GC to
> guarantee that file system is not running out of free physical space if file
> system still has free logical blocks. And if we are running out free physical
> space, then operation stops because of GC failure to keep enough clean segments.
Well, the details of different filesystem designs differ, but they all share
the property that on an aged filesystem you need GC to do its work before you
can write as much as you are supposed to be able to write.
> > And for this selection you want to use ML
> > AFAIU. Hence we have a dependency folio writeback -> COW block allocation ->
> > GC to make some block free -> ML decision.
>
> Usually, GC works in the background. So, ML model in user-space get
> segments state metadata from file system. Then, it selects one or several
> segments and recommends to file system of moving valid blocks for the
> selected segment(s) ID + maximal amount of valid blocks for single
> operation. Background process of file system checks that these logical
> blocks of exhausted segment are still valid and initiates operation of
> moving into the current segment by adding another log.
Sure, background operation is the easy case. I'm speaking about the
situation where the filesystem is under such write pressure that GC cannot
keep up and all the write activity is basically blocked waiting for GC to
make forward progress. And again details for different filesystems differ
but all have this property that the speed of GC is one of the limiting
factors for writes when the filesystem is aged enough and the write
pressure is large enough. And the point I'm trying to get across is that
under such pressure consulting userspace for GC decisions is likely to
cause deadlocks. So you will have to have some in-kernel fallbacks to avoid
such deadlocks and logic for triggering these fallbacks to guarantee
forward progress of GC which all gets kind of hairy.
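To sketch the shape I'd expect, an in-kernel fallback with a bounded wait on
the user-space recommendation; every helper name and field here is made up for
illustration:

/* All helpers and fields below are hypothetical. */
#define ML_REPLY_TIMEOUT	(HZ / 10)

static u32 select_victim_segment(struct lfs_sb *sb)
{
	u32 victim = 0;

	/* Bounded wait: never block GC on userspace indefinitely. */
	if (wait_event_timeout(sb->ml_proxy->waitq,
			       ml_proxy_recommendation_pending(sb->ml_proxy),
			       ML_REPLY_TIMEOUT))
		return ml_proxy_pop_recommendation(sb->ml_proxy);

	/* Fallback: greedy scan for the segment with the fewest valid
	 * blocks so that GC always makes forward progress. */
	for (u32 seg = 1; seg < sb->segment_count; seg++)
		if (valid_blocks(sb, seg) < valid_blocks(sb, victim))
			victim = seg;
	return victim;
}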
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2026-02-12 11:02 UTC | newest]
Thread overview: 17+ messages
2026-02-06 19:38 [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel Viacheslav Dubeyko
2026-02-06 23:28 ` Hillf Danton
2026-02-09 10:03 ` Chris Li
2026-02-09 22:28 ` Viacheslav Dubeyko
2026-02-10 13:47 ` [Lsf-pc] " Jan Kara
2026-02-10 14:20 ` Chris Mason
2026-02-10 22:36 ` Viacheslav Dubeyko
2026-02-11 1:30 ` SeongJae Park
2026-02-11 20:29 ` Viacheslav Dubeyko
2026-02-10 21:02 ` Viacheslav Dubeyko
2026-02-11 9:55 ` Jan Kara
2026-02-12 0:53 ` Viacheslav Dubeyko
2026-02-12 11:02 ` Jan Kara
2026-02-09 10:25 ` Barry Song
2026-02-09 22:07 ` Viacheslav Dubeyko
2026-02-10 3:06 ` Barry Song
2026-02-10 19:57 ` Viacheslav Dubeyko