* [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
@ 2024-11-19 20:55 Brian Johannesmeyer
2024-11-19 20:55 ` [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory Brian Johannesmeyer
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-19 20:55 UTC (permalink / raw)
To: Andrew Morton, linux-mm, linux-kernel, linux-hardening
Cc: Brian Johannesmeyer, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
We discovered a security-related issue in the DMA pool allocator.
V1 of our RFC was submitted to the Linux kernel security team. They
recommended submitting it to the relevant subsystem maintainers and the
hardening mailing list instead, as they did not consider this an explicit
security issue. Their rationale was that Linux implicitly assumes hardware
can be trusted.
**Threat Model**: While Linux drivers typically trust their hardware, there
may be specific drivers that do not operate under this assumption. Hence,
this threat model assumes a malicious peripheral device capable of
corrupting DMA data to exploit the kernel. In this scenario, the device
manipulates kernel-initialized data (similar to the attack described in the
Thunderclap paper [0]) to achieve arbitrary kernel memory corruption.
**DMA pool background**. A DMA pool aims to reduce the overhead of DMA
allocations by creating a large DMA buffer --- the "pool" --- from which
smaller buffers are allocated as needed. Fundamentally, a DMA pool
functions like a heap: it is a structure composed of linked memory
"blocks", which, in this context, are DMA buffers. When a driver employs a
DMA pool, it grants the device access not only to these blocks but also to
the pointers linking them.
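For reference, here is the metadata at stake: in the current code, each free block's link lives at the start of the block itself, i.e., in device-visible memory (abridged from mm/dmapool.c):
```
struct dma_block {
	struct dma_block *next_block;	/* free-list link, stored in DMA memory */
	dma_addr_t dma;			/* device-visible address of this block */
};
```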
**Vulnerability**. Similar to traditional heap corruption vulnerabilities
--- where a malicious program corrupts heap metadata to, e.g., hijack
control flow --- a malicious device may corrupt DMA pool metadata. This
corruption can trivially lead to arbitrary kernel memory corruption from
any driver that uses the pool. Because the DMA pool API is extensively
used, the vulnerability is not confined to a single instance: every user
of the API is potentially vulnerable. An exploit
proceeds with the following steps:
1. The DMA `pool` initializes its list of blocks, then points to the first
block.
2. The malicious device overwrites the first 8 bytes of the first block ---
which contain its `next_block` pointer --- with an arbitrary kernel address,
`kernel_addr`.
3. The driver makes its first call to `dma_pool_alloc()`, after which the
pool should point to the second block. However, it instead points to
`kernel_addr`.
4. The driver again calls `dma_pool_alloc()`, which incorrectly returns
`kernel_addr`. Therefore, anytime the driver writes to this "block", it may
corrupt sensitive kernel data.
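To illustrate step 3, here is a simplified sketch of the allocation fast path: popping the head of the free list dereferences a pointer the device can overwrite:
```
/* Simplified sketch of the pre-patch fast path behind dma_pool_alloc(). */
static struct dma_block *pool_block_pop(struct dma_pool *pool)
{
	struct dma_block *block = pool->next_block;

	if (block) {
		/* Step 3: if the device rewrote block->next_block, the
		 * pool head now holds the attacker-chosen kernel_addr. */
		pool->next_block = block->next_block;
		pool->nr_active++;
	}
	/* Step 4: the next call returns kernel_addr as a "block". */
	return block;
}
```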
I have a PDF document that illustrates how these steps work. Please let me
know if you would like me to share it with you.
**Proposed mitigation**. To mitigate the corruption of DMA pool metadata
(i.e., the pointers linking the blocks), the metadata should be moved into
non-DMA memory, ensuring it cannot be altered by a device. I have included
a patch series that implements this change. Since I am not deeply familiar
with the DMA pool internals, I would appreciate any feedback on the
patches. I have tested the patches with the `DMAPOOL_TEST` test and my own
basic unit tests that ensure the DMA pool allocator is not vulnerable.
**Performance**. I evaluated the patch set's performance by running the
`DMAPOOL_TEST` test with `DMAPOOL_DEBUG` enabled and with/without the
patches applied. Here is its output *without* the patches applied:
```
dmapool test: size:16 align:16 blocks:8192 time:3194110
dmapool test: size:64 align:64 blocks:8192 time:4730440
dmapool test: size:256 align:256 blocks:8192 time:5489630
dmapool test: size:1024 align:1024 blocks:2048 time:517150
dmapool test: size:4096 align:4096 blocks:1024 time:399616
dmapool test: size:68 align:32 blocks:8192 time:6156527
```
And here is its output *with* the patches applied:
```
dmapool test: size:16 align:16 blocks:8192 time:3541031
dmapool test: size:64 align:64 blocks:8192 time:4227262
dmapool test: size:256 align:256 blocks:8192 time:4890273
dmapool test: size:1024 align:1024 blocks:2048 time:515775
dmapool test: size:4096 align:4096 blocks:1024 time:523096
dmapool test: size:68 align:32 blocks:8192 time:3450830
```
Based on my interpretation of the output, the patch set does not appear to
negatively impact performance. In fact, it shows slight performance
improvements in some tests (i.e., for sizes 64, 256, 1024, and 68).
I speculate that these performance gains may be due to improved spatial
locality of the `next_block` pointers. With the patches applied, the
`next_block` pointers are consistently spaced 24 bytes apart, matching the
new size of `struct dma_block`. Previously, the spacing between
`next_block` pointers depended on the block size, so for 1024-byte blocks,
the pointers were spaced 1024 bytes apart. However, I am still unsure why
the performance improvement for 68-byte blocks is so significant.
[0] Link: https://www.csl.sri.com/~neumann/ndss-iommu.pdf
Brian Johannesmeyer (2):
dmapool: Move pool metadata into non-DMA memory
dmapool: Use pool_find_block() in pool_block_err()
mm/dmapool.c | 96 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 63 insertions(+), 33 deletions(-)
--
2.34.1
* [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory
2024-11-19 20:55 [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption Brian Johannesmeyer
@ 2024-11-19 20:55 ` Brian Johannesmeyer
2024-11-20 9:37 ` Christoph Hellwig
2024-11-19 20:55 ` [RFC v2 2/2] dmapool: Use pool_find_block() in pool_block_err() Brian Johannesmeyer
` (2 subsequent siblings)
3 siblings, 1 reply; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-19 20:55 UTC (permalink / raw)
To: Andrew Morton, linux-mm, linux-kernel, linux-hardening
Cc: Brian Johannesmeyer, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
If a `struct dma_block` object resides in DMA memory, a malicious
peripheral device can corrupt its metadata --- specifically, its
`next_block` pointer, which links blocks in a DMA pool. By corrupting these
pointers, an attacker can manipulate `dma_pool_alloc()` into returning
attacker-controllable pointers, which can lead to kernel memory corruption
from a driver that calls it.
To prevent this, move the `struct dma_block` metadata into non-DMA memory,
ensuring that devices cannot tamper with the internal pointers of the DMA
pool allocator. Specifically:
- Add a `vaddr` field to `struct dma_block` to point to the actual
DMA-accessible block.
- Maintain an array of `struct dma_block` objects in `struct dma_page` to
track the metadata of each block within an allocated page.
This change secures the DMA pool allocator by keeping its metadata in
kernel memory, inaccessible to peripheral devices, thereby preventing
potential attacks that could corrupt kernel memory through DMA operations.
Co-developed-by: Raphael Isemann <teemperor@gmail.com>
Signed-off-by: Raphael Isemann <teemperor@gmail.com>
Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com>
---
mm/dmapool.c | 60 +++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 52 insertions(+), 8 deletions(-)
diff --git a/mm/dmapool.c b/mm/dmapool.c
index a151a21e571b..25005a9fc201 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -43,6 +43,7 @@
 struct dma_block {
 	struct dma_block *next_block;
 	dma_addr_t dma;
+	void *vaddr;
 };
 
 struct dma_pool {		/* the pool */
@@ -64,6 +65,8 @@ struct dma_page {	/* cacheable header for 'allocation' bytes */
 	struct list_head page_list;
 	void *vaddr;
 	dma_addr_t dma;
+	struct dma_block *blocks;
+	size_t blocks_per_page;
 };
 
 static DEFINE_MUTEX(pools_lock);
@@ -91,14 +94,35 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
 
 static DEVICE_ATTR_RO(pools);
 
+static struct dma_block *pool_find_block(struct dma_pool *pool, void *vaddr)
+{
+	struct dma_page *page;
+	size_t offset, index;
+
+	list_for_each_entry(page, &pool->page_list, page_list) {
+		if (vaddr < page->vaddr)
+			continue;
+		offset = vaddr - page->vaddr;
+		if (offset >= pool->allocation)
+			continue;
+
+		index = offset / pool->size;
+		if (index >= page->blocks_per_page)
+			return NULL;
+
+		return &page->blocks[index];
+	}
+	return NULL;
+}
+
 #ifdef DMAPOOL_DEBUG
 static void pool_check_block(struct dma_pool *pool, struct dma_block *block,
 			     gfp_t mem_flags)
 {
-	u8 *data = (void *)block;
+	u8 *data = (void *)block->vaddr;
 	int i;
 
-	for (i = sizeof(struct dma_block); i < pool->size; i++) {
+	for (i = 0; i < pool->size; i++) {
 		if (data[i] == POOL_POISON_FREED)
 			continue;
 		dev_err(pool->dev, "%s %s, %p (corrupted)\n", __func__,
@@ -114,7 +138,7 @@ static void pool_check_block(struct dma_pool *pool, struct dma_block *block,
 	}
 
 	if (!want_init_on_alloc(mem_flags))
-		memset(block, POOL_POISON_ALLOCATED, pool->size);
+		memset(block->vaddr, POOL_POISON_ALLOCATED, pool->size);
 }
 
 static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
@@ -143,7 +167,7 @@ static bool pool_block_err(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 	}
 
 	while (block) {
-		if (block != vaddr) {
+		if (block->vaddr != vaddr) {
 			block = block->next_block;
 			continue;
 		}
@@ -301,6 +325,7 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 {
 	unsigned int next_boundary = pool->boundary, offset = 0;
 	struct dma_block *block, *first = NULL, *last = NULL;
+	size_t i = 0;
 
 	pool_init_page(pool, page);
 	while (offset + pool->size <= pool->allocation) {
@@ -310,7 +335,8 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 			continue;
 		}
 
-		block = page->vaddr + offset;
+		block = &page->blocks[i];
+		block->vaddr = page->vaddr + offset;
 		block->dma = page->dma + offset;
 		block->next_block = NULL;
 
@@ -322,6 +348,7 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 
 		offset += pool->size;
 		pool->nr_blocks++;
+		i++;
 	}
 
 	last->next_block = pool->next_block;
@@ -339,9 +366,18 @@ static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 	if (!page)
 		return NULL;
 
+	page->blocks_per_page = pool->allocation / pool->size;
+	page->blocks = kmalloc_array(page->blocks_per_page,
+				     sizeof(struct dma_block), GFP_KERNEL);
+	if (!page->blocks) {
+		kfree(page);
+		return NULL;
+	}
+
 	page->vaddr = dma_alloc_coherent(pool->dev, pool->allocation,
 					 &page->dma, mem_flags);
 	if (!page->vaddr) {
+		kfree(page->blocks);
 		kfree(page);
 		return NULL;
 	}
@@ -383,6 +419,7 @@ void dma_pool_destroy(struct dma_pool *pool)
 		if (!busy)
 			dma_free_coherent(pool->dev, pool->allocation,
 					  page->vaddr, page->dma);
+		kfree(page->blocks);
 		list_del(&page->page_list);
 		kfree(page);
 	}
@@ -432,9 +469,9 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	*handle = block->dma;
 	pool_check_block(pool, block, mem_flags);
 	if (want_init_on_alloc(mem_flags))
-		memset(block, 0, pool->size);
+		memset(block->vaddr, 0, pool->size);
 
-	return block;
+	return block->vaddr;
 }
 EXPORT_SYMBOL(dma_pool_alloc);
 
@@ -449,9 +486,16 @@ EXPORT_SYMBOL(dma_pool_alloc);
  */
 void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 {
-	struct dma_block *block = vaddr;
+	struct dma_block *block;
 	unsigned long flags;
 
+	block = pool_find_block(pool, vaddr);
+	if (!block) {
+		dev_err(pool->dev, "%s %s, invalid vaddr %p\n",
+			__func__, pool->name, vaddr);
+		return;
+	}
+
 	spin_lock_irqsave(&pool->lock, flags);
 	if (!pool_block_err(pool, vaddr, dma)) {
 		pool_block_push(pool, block, dma);
--
2.34.1
* [RFC v2 2/2] dmapool: Use pool_find_block() in pool_block_err()
2024-11-19 20:55 [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption Brian Johannesmeyer
2024-11-19 20:55 ` [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory Brian Johannesmeyer
@ 2024-11-19 20:55 ` Brian Johannesmeyer
2024-11-19 22:14 ` [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption Greg KH
2024-11-20 9:29 ` Christoph Hellwig
3 siblings, 0 replies; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-19 20:55 UTC (permalink / raw)
To: Andrew Morton, linux-mm, linux-kernel, linux-hardening
Cc: Brian Johannesmeyer, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
In the previous patch, the `pool_find_block()` function was added to
translate a virtual address into the corresponding `struct dma_block`. The
existing `pool_find_page()` function performs a similar role by translating
a DMA address into the `struct dma_page` containing it.
To reduce redundant code and improve consistency, remove the
`pool_find_page()` function and update `pool_block_err()` to use
`pool_find_block()` instead. Doing so eliminates duplicate functionality
and consolidates the block lookup process.
Co-developed-by: Raphael Isemann <teemperor@gmail.com>
Signed-off-by: Raphael Isemann <teemperor@gmail.com>
Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com>
---
mm/dmapool.c | 38 ++++++++++++--------------------------
1 file changed, 12 insertions(+), 26 deletions(-)
diff --git a/mm/dmapool.c b/mm/dmapool.c
index 25005a9fc201..267fe13347bd 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -141,39 +141,25 @@ static void pool_check_block(struct dma_pool *pool, struct dma_block *block,
 		memset(block->vaddr, POOL_POISON_ALLOCATED, pool->size);
 }
 
-static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
-{
-	struct dma_page *page;
-
-	list_for_each_entry(page, &pool->page_list, page_list) {
-		if (dma < page->dma)
-			continue;
-		if ((dma - page->dma) < pool->allocation)
-			return page;
-	}
-	return NULL;
-}
-
 static bool pool_block_err(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 {
-	struct dma_block *block = pool->next_block;
-	struct dma_page *page;
+	struct dma_block *block = pool_find_block(pool, vaddr);
 
-	page = pool_find_page(pool, dma);
-	if (!page) {
-		dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
-			__func__, pool->name, vaddr, &dma);
+	if (!block) {
+		dev_err(pool->dev, "%s %s, invalid block %p\n",
+			__func__, pool->name, vaddr);
 		return true;
 	}
 
-	while (block) {
-		if (block->vaddr != vaddr) {
-			block = block->next_block;
-			continue;
+	struct dma_block *iter = pool->next_block;
+
+	while (iter) {
+		if (iter == block) {
+			dev_err(pool->dev, "%s %s, dma %pad already free\n",
+				__func__, pool->name, &dma);
+			return true;
 		}
-		dev_err(pool->dev, "%s %s, dma %pad already free\n",
-			__func__, pool->name, &dma);
-		return true;
+		iter = iter->next_block;
 	}
 
 	memset(vaddr, POOL_POISON_FREED, pool->size);
--
2.34.1
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-19 20:55 [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption Brian Johannesmeyer
2024-11-19 20:55 ` [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory Brian Johannesmeyer
2024-11-19 20:55 ` [RFC v2 2/2] dmapool: Use pool_find_block() in pool_block_err() Brian Johannesmeyer
@ 2024-11-19 22:14 ` Greg KH
2024-11-19 22:22 ` Brian Johannesmeyer
2024-11-20 9:29 ` Christoph Hellwig
3 siblings, 1 reply; 18+ messages in thread
From: Greg KH @ 2024-11-19 22:14 UTC (permalink / raw)
To: Brian Johannesmeyer
Cc: Andrew Morton, linux-mm, linux-kernel, linux-hardening,
Raphael Isemann, Cristiano Giuffrida, Herbert Bos
On Tue, Nov 19, 2024 at 09:55:27PM +0100, Brian Johannesmeyer wrote:
> We discovered a security-related issue in the DMA pool allocator.
>
> V1 of our RFC was submitted to the Linux kernel security team. They
> recommended submitting it to the relevant subsystem maintainers and the
> hardening mailing list instead, as they did not consider this an explicit
> security issue. Their rationale was that Linux implicitly assumes hardware
> can be trusted.
>
> **Threat Model**: While Linux drivers typically trust their hardware, there
> may be specific drivers that do not operate under this assumption. Hence,
> this threat model assumes a malicious peripheral device capable of
> corrupting DMA data to exploit the kernel. In this scenario, the device
> manipulates kernel-initialized data (similar to the attack described in the
> Thunderclap paper [0]) to achieve arbitrary kernel memory corruption.
>
> **DMA pool background**. A DMA pool aims to reduce the overhead of DMA
> allocations by creating a large DMA buffer --- the "pool" --- from which
> smaller buffers are allocated as needed. Fundamentally, a DMA pool
> functions like a heap: it is a structure composed of linked memory
> "blocks", which, in this context, are DMA buffers. When a driver employs a
> DMA pool, it grants the device access not only to these blocks but also to
> the pointers linking them.
>
> **Vulnerability**. Similar to traditional heap corruption vulnerabilities
> --- where a malicious program corrupts heap metadata to, e.g., hijack
> control flow --- a malicious device may corrupt DMA pool metadata. This
> corruption can trivially lead to arbitrary kernel memory corruption from
> any driver that uses the pool. Because the DMA pool API is extensively
> used, the vulnerability is not confined to a single instance: every user
> of the API is potentially vulnerable. An exploit
> proceeds with the following steps:
>
> 1. The DMA `pool` initializes its list of blocks, then points to the first
> block.
> 2. The malicious device overwrites the first 8 bytes of the first block ---
> which contain its `next_block` pointer --- with an arbitrary kernel address,
> `kernel_addr`.
> 3. The driver makes its first call to `dma_pool_alloc()`, after which the
> pool should point to the second block. However, it instead points to
> `kernel_addr`.
> 4. The driver again calls `dma_pool_alloc()`, which incorrectly returns
> `kernel_addr`. Therefore, anytime the driver writes to this "block", it may
> corrupt sensitive kernel data.
>
> I have a PDF document that illustrates how these steps work. Please let me
> know if you would like me to share it with you.
I know I said it privately, but I'll say it here in public: very cool
finding, this is nice work!
> **Proposed mitigation**. To mitigate the corruption of DMA pool metadata
> (i.e., the pointers linking the blocks), the metadata should be moved into
> non-DMA memory, ensuring it cannot be altered by a device. I have included
> a patch series that implements this change. Since I am not deeply familiar
> with the DMA pool internals, I would appreciate any feedback on the
> patches. I have tested the patches with the `DMAPOOL_TEST` test and my own
> basic unit tests that ensure the DMA pool allocator is not vulnerable.
>
> **Performance**. I evaluated the patch set's performance by running the
> `DMAPOOL_TEST` test with `DMAPOOL_DEBUG` enabled and with/without the
> patches applied. Here is its output *without* the patches applied:
> ```
> dmapool test: size:16 align:16 blocks:8192 time:3194110
> dmapool test: size:64 align:64 blocks:8192 time:4730440
> dmapool test: size:256 align:256 blocks:8192 time:5489630
> dmapool test: size:1024 align:1024 blocks:2048 time:517150
> dmapool test: size:4096 align:4096 blocks:1024 time:399616
> dmapool test: size:68 align:32 blocks:8192 time:6156527
> ```
>
> And here is its output *with* the patches applied:
> ```
> dmapool test: size:16 align:16 blocks:8192 time:3541031
> dmapool test: size:64 align:64 blocks:8192 time:4227262
> dmapool test: size:256 align:256 blocks:8192 time:4890273
> dmapool test: size:1024 align:1024 blocks:2048 time:515775
> dmapool test: size:4096 align:4096 blocks:1024 time:523096
> dmapool test: size:68 align:32 blocks:8192 time:3450830
> ```
You had mentioned that the size:68 numbers were going to be re-run. Has
that happened, and is it really that much of a boost for that size? Or
are these the original numbers?
thanks,
greg k-h
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-19 22:14 ` [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption Greg KH
@ 2024-11-19 22:22 ` Brian Johannesmeyer
0 siblings, 0 replies; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-19 22:22 UTC (permalink / raw)
To: Greg KH
Cc: Andrew Morton, linux-mm, linux-kernel, linux-hardening,
Raphael Isemann, Cristiano Giuffrida, Herbert Bos
On Tue, Nov 19, 2024 at 3:15 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> I know I said it privately, but I'll say it here in public: very cool
> finding, this is nice work!
Thanks! I appreciate your earlier feedback as well.
> You had mentioned that the size:68 numbers were going to be re-run. Has
> that happened, and is it really that much of a boost for that size? Or
> are these the original numbers?
I re-ran the test, and the numbers are consistent across multiple
runs. I’m also surprised by how significant the improvement is for the
68-byte block size.
Thanks,
Brian Johannesmeyer
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-19 20:55 [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption Brian Johannesmeyer
` (2 preceding siblings ...)
2024-11-19 22:14 ` [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption Greg KH
@ 2024-11-20 9:29 ` Christoph Hellwig
2024-11-20 15:56 ` Keith Busch
2024-11-20 18:51 ` Keith Busch
3 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2024-11-20 9:29 UTC (permalink / raw)
To: Brian Johannesmeyer
Cc: Andrew Morton, linux-mm, linux-kernel, linux-hardening,
Raphael Isemann, Cristiano Giuffrida, Herbert Bos, Greg KH,
Keith Busch
On Tue, Nov 19, 2024 at 09:55:27PM +0100, Brian Johannesmeyer wrote:
> We discovered a security-related issue in the DMA pool allocator.
>
> V1 of our RFC was submitted to the Linux kernel security team. They
> recommended submitting it to the relevant subsystem maintainers and the
> hardening mailing list instead, as they did not consider this an explicit
> security issue. Their rationale was that Linux implicitly assumes hardware
> can be trusted.
You should probably Cc Keith as the person who most recently did major
work on the dmapool code and might still remember how it works.
* Re: [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory
2024-11-19 20:55 ` [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory Brian Johannesmeyer
@ 2024-11-20 9:37 ` Christoph Hellwig
2024-11-20 23:46 ` Brian Johannesmeyer
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2024-11-20 9:37 UTC (permalink / raw)
To: Brian Johannesmeyer
Cc: Andrew Morton, linux-mm, linux-kernel, linux-hardening,
Raphael Isemann, Cristiano Giuffrida, Herbert Bos, Greg KH,
Keith Busch
On Tue, Nov 19, 2024 at 09:55:28PM +0100, Brian Johannesmeyer wrote:
> +	page->blocks_per_page = pool->allocation / pool->size;
> +	page->blocks = kmalloc_array(page->blocks_per_page,
> +				     sizeof(struct dma_block), GFP_KERNEL);
Given that you now need an array of the blocks anyway, it might make
sense to switch from a linked list to a bitmap for tracking free state,
which would be a lot more efficient as you only need a bit per block
as tracking overhead instead of two pointers and a dma_addr_t.
E.g., do a find_first_zero_bit() to find the free slot, then calculate
the dma_addr and virt address by simply offsetting into the dma_page
ones with bitnr * pool->size.
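In rough code, with a hypothetical per-page `bitmap` field and helper
(not existing dmapool code, just a sketch of the idea):
```
/* Rough sketch only: one free/used bit per block, no intrusive list. */
static void *pool_page_alloc(struct dma_pool *pool, struct dma_page *page,
			     dma_addr_t *handle)
{
	unsigned long bit = find_first_zero_bit(page->bitmap,
						page->blocks_per_page);

	if (bit >= page->blocks_per_page)
		return NULL;	/* no free block in this page */
	set_bit(bit, page->bitmap);

	/* both addresses fall out of the bit number by offsetting */
	*handle = page->dma + bit * pool->size;
	return page->vaddr + bit * pool->size;
}
```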
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-20 9:29 ` Christoph Hellwig
@ 2024-11-20 15:56 ` Keith Busch
2024-11-20 18:51 ` Keith Busch
1 sibling, 0 replies; 18+ messages in thread
From: Keith Busch @ 2024-11-20 15:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Brian Johannesmeyer, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
On Wed, Nov 20, 2024 at 01:29:19AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 19, 2024 at 09:55:27PM +0100, Brian Johannesmeyer wrote:
> > We discovered a security-related issue in the DMA pool allocator.
> >
> > V1 of our RFC was submitted to the Linux kernel security team. They
> > recommended submitting it to the relevant subsystem maintainers and the
> > hardening mailing list instead, as they did not consider this an explicit
> > security issue. Their rationale was that Linux implicitly assumes hardware
> > can be trusted.
>
> You should probably Cc Keith as the person who most recently did major
> work on the dmapool code and might still remember how it works.
Thanks.
The intrusive list was overlaid in the freed blocks as a spatial
optimization. If you're moving these fields outside of it (I'll have to
review the patch on lore), you can probably relax the minimum dma block
size too, since we don't need to hold the data structure information in
it.
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-20 9:29 ` Christoph Hellwig
2024-11-20 15:56 ` Keith Busch
@ 2024-11-20 18:51 ` Keith Busch
2024-11-20 21:58 ` Brian Johannesmeyer
1 sibling, 1 reply; 18+ messages in thread
From: Keith Busch @ 2024-11-20 18:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Brian Johannesmeyer, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
On Wed, Nov 20, 2024 at 01:29:19AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 19, 2024 at 09:55:27PM +0100, Brian Johannesmeyer wrote:
> > **Performance**. I evaluated the patch set's performance by running the
> > `DMAPOOL_TEST` test with `DMAPOOL_DEBUG` enabled and with/without the
> > patches applied. Here is its output *without* the patches applied:
Could you rerun your tests without DMAPOOL_DEBUG enabled? That's the
more interesting kernel setup for performance comparisons.
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-20 18:51 ` Keith Busch
@ 2024-11-20 21:58 ` Brian Johannesmeyer
2024-11-21 3:37 ` Keith Busch
0 siblings, 1 reply; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-20 21:58 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
> You should probably Cc Keith as the person who most recently did major
> work on the dmapool code and might still remember how it works.
Thank you for adding him, and apologies for not including him initially.
> The intrusive list was overlaid in the freed blocks as a spatial
> optimization. If you're moving these fields outside of it (I'll have to
> review the patch on lore), you can probably relax the minimum dma block
> size too, since we don't need to hold the data structure information in
> it.
I see. AFAICT, relaxing the minimum DMA block size would just mean
removing these lines from `dma_pool_create()`:
```
	if (size < sizeof(struct dma_block))
		size = sizeof(struct dma_block);
```
> Could you rerun your tests without DMAPOOL_DEBUG enabled? That's the
> more interesting kernel setup for performance comparisons.
Sure, that makes sense. Here are the results with DMAPOOL_DEBUG disabled:
**Without the patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:11860
dmapool test: size:64 align:64 blocks:8192 time:11951
dmapool test: size:256 align:256 blocks:8192 time:12287
dmapool test: size:1024 align:1024 blocks:2048 time:3134
dmapool test: size:4096 align:4096 blocks:1024 time:1686
dmapool test: size:68 align:32 blocks:8192 time:12050
```
**With the patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:34432
dmapool test: size:64 align:64 blocks:8192 time:62262
dmapool test: size:256 align:256 blocks:8192 time:238137
dmapool test: size:1024 align:1024 blocks:2048 time:61386
dmapool test: size:4096 align:4096 blocks:1024 time:75342
dmapool test: size:68 align:32 blocks:8192 time:88243
```
These results are consistent across multiple runs. It seems that with
DMAPOOL_DEBUG disabled, the patches introduce a significant
performance hit. Let me know if you have any suggestions or further
tests you'd like me to run.
Thanks,
Brian Johannesmeyer
* Re: [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory
2024-11-20 9:37 ` Christoph Hellwig
@ 2024-11-20 23:46 ` Brian Johannesmeyer
2024-11-21 5:03 ` Christoph Hellwig
0 siblings, 1 reply; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-20 23:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, linux-mm, linux-kernel, linux-hardening,
Raphael Isemann, Cristiano Giuffrida, Herbert Bos, Greg KH,
Keith Busch
> Given that you now need an array of the blocks anyway, it might make
> sense to switch from a linked list to a bitmap for tracking free state,
> which would be a lot more efficient as you only need a bit per block
> as tracking overhead instead of two pointers and a dma_addr_t.
>
> E.g., do a find_first_zero_bit() to find the free slot, then calculate
> the dma_addr and virt address by simply offsetting into the dma_page
> ones with bitnr * pool->size.
Thank you for the suggestion. I hacked together a bitmap-based
approach as you proposed, and while it does improve memory efficiency
by reducing the per-block metadata overhead, it unfortunately appears
to significantly impact the runtime performance.
Here are the performance results, with DMAPOOL_DEBUG disabled. The
first two sets of numbers are the same as my latest response in the
other thread (i.e., [RFC v2 0/2]), and the last set of numbers is with
the bitmap approach applied:
**With no patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:11860
dmapool test: size:64 align:64 blocks:8192 time:11951
dmapool test: size:256 align:256 blocks:8192 time:12287
dmapool test: size:1024 align:1024 blocks:2048 time:3134
dmapool test: size:4096 align:4096 blocks:1024 time:1686
dmapool test: size:68 align:32 blocks:8192 time:12050
```
**With the submitted patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:34432
dmapool test: size:64 align:64 blocks:8192 time:62262
dmapool test: size:256 align:256 blocks:8192 time:238137
dmapool test: size:1024 align:1024 blocks:2048 time:61386
dmapool test: size:4096 align:4096 blocks:1024 time:75342
dmapool test: size:68 align:32 blocks:8192 time:88243
```
**With the submitted patches applied AND using a bitmap approach:**
```
dmapool test: size:16 align:16 blocks:8192 time:82733
dmapool test: size:64 align:64 blocks:8192 time:198460
dmapool test: size:256 align:256 blocks:8192 time:710316
dmapool test: size:1024 align:1024 blocks:2048 time:177801
dmapool test: size:4096 align:4096 blocks:1024 time:192297
dmapool test: size:68 align:32 blocks:8192 time:274931
```
My guess as to why: The current linked list implementation allows us
to find the next free block in constant time (`O(1)`) by directly
dereferencing `pool->next_block`, and then following the `next_block`
pointers for subsequent free blocks. In contrast, the bitmap approach
requires iterating over all pages in `page->page_list` and, for each
page, iterating through its bitmap to find the first zero bit. This
results in a worst-case complexity of `O(n * b)`, where `n` is the
number of pages and `b` is the number of bits in each page's bitmap.
If you have ideas for mitigating this runtime overhead, I’d be happy
to explore them further.
Thanks,
Brian Johannesmeyer
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-20 21:58 ` Brian Johannesmeyer
@ 2024-11-21 3:37 ` Keith Busch
2024-11-21 17:31 ` Brian Johannesmeyer
0 siblings, 1 reply; 18+ messages in thread
From: Keith Busch @ 2024-11-21 3:37 UTC (permalink / raw)
To: Brian Johannesmeyer
Cc: Christoph Hellwig, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
On Wed, Nov 20, 2024 at 02:58:54PM -0700, Brian Johannesmeyer wrote:
> These results are consistent across multiple runs. It seems that with
> DMAPOOL_DEBUG disabled, the patches introduce a significant
> performance hit. Let me know if you have any suggestions or further
> tests you'd like me to run.
That's what I was afraid of. I was working on the dma pool because it
showed significant lock contention on the pool for storage-heavy
workloads, so cutting down the critical section was a priority. With the
current kernel, the dma pool doesn't even register on the profiles
anymore, so it'd be great to keep it that way.
The idea for embedding the links in freed blocks was assuming a driver
wouldn't ask the kernel to free a dma block if the mapped device was
still using it. Untrustworthy hardware is why we can't have nice
things...
Here are my quick thoughts at this late hour, though I might have
something else tomorrow. If the links are external to the dma addr
being freed, then I think you need to change the entire alloc/free API
to replace the dma_addr_t handle with a new type, like a tuple of
{ dma_addr_t, priv_list_link }
That should make it possible to preserve the low constant time to alloc
and free in the critical section, which I think is a good thing to have.
I found 170 uses of dma_pool_alloc, and 360 dma_pool_free in the kernel,
so changing the API is no small task. :(
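As a very rough sketch of the shape I mean (all names hypothetical, not
a worked-out proposal):
```
/* Hypothetical handle type: the device only ever sees the dma_addr_t,
 * while the free-list link stays in kernel-private memory. */
struct dma_pool_handle {
	dma_addr_t dma;			/* handed to the device */
	struct dma_block *block;	/* kernel-private metadata link */
};

/* Hypothetical API: the caller keeps the handle and passes it back on
 * free, so alloc and free both stay O(1) without trusting DMA memory. */
void *dma_pool_alloc_handle(struct dma_pool *pool, gfp_t mem_flags,
			    struct dma_pool_handle *handle);
void dma_pool_free_handle(struct dma_pool *pool, void *vaddr,
			  struct dma_pool_handle *handle);
```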
* Re: [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory
2024-11-20 23:46 ` Brian Johannesmeyer
@ 2024-11-21 5:03 ` Christoph Hellwig
2024-11-21 17:48 ` Brian Johannesmeyer
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2024-11-21 5:03 UTC (permalink / raw)
To: Brian Johannesmeyer
Cc: Christoph Hellwig, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH, Keith Busch
On Wed, Nov 20, 2024 at 04:46:40PM -0700, Brian Johannesmeyer wrote:
> Thank you for the suggestion. I hacked together a bitmap-based
> approach as you proposed, and while it does improve memory efficiency
> by reducing the per-block metadata overhead, it unfortunately appears
> to significantly impact the runtime performance.
>
> My guess as to why: The current linked list implementation allows us
> to find the next free block in constant time (`O(1)`) by directly
> dereferencing `pool->next_block`, and then following the `next_block`
> pointers for subsequent free blocks. In contrast, the bitmap approach
> requires iterating over all pages in `page->page_list` and, for each
> page, iterating through its bitmap to find the first zero bit. This
> results in a worst-case complexity of `O(n * b)`, where `n` is the
> number of pages and `b` is the number of bits in each page's bitmap.
Indeed. You'd probably need to split the linkage of the pages into
a list of those that have free blocks and those that don't as a minimum.
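Something like this, roughly (the bitmap and list names here are made up):
```
/* Sketch: keep pages that still have free blocks on a separate list, so
 * allocation never has to scan fully-used pages. */

/* when handing out a page's last free block: */
if (bitmap_full(page->bitmap, page->blocks_per_page))
	list_move(&page->page_list, &pool->full_pages);

/* when freeing a block back into a previously-full page: */
if (page_was_full)
	list_move(&page->page_list, &pool->avail_pages);
```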
Can you share your current version?
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-21 3:37 ` Keith Busch
@ 2024-11-21 17:31 ` Brian Johannesmeyer
2024-11-21 18:06 ` Keith Busch
0 siblings, 1 reply; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-21 17:31 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
> Here's my quick thoughts at this late hour, though I might have
> something else tomorrow. If the links are external to the dma addr
> being freed, then I think you need to change the entire alloc/free API
> to replace the dma_addr_t handle with a new type, like a tuple of
>
> { dma_addr_t, priv_list_link }
>
> That should make it possible to preserve the low constant time to alloc
> and free in the critical section, which I think is a good thing to have.
> I found 170 uses of dma_pool_alloc, and 360 dma_pool_free in the kernel,
> so changing the API is no small task. :(
Indeed, an API change like this might be the only way to move the
metadata out of the DMA blocks while maintaining the current performance.
Regarding the performance hit of the submitted patches:
- *Allocations* remain constant time (`O(1)`), as they simply check
the `pool->next_block` pointer.
- *Frees* are no longer constant time. Previously, `dma_pool_free()`
converted a `vaddr` to its corresponding `block` by type casting, but
now it calls `pool_find_block()`, which iterates over all pages
(`O(n)`).
Therefore, optimizing `dma_pool_free()` is key. While restructuring
the API is the "best" solution, I implemented a compromise:
introducing a `struct maple_tree block_map` field in `struct dma_pool`
to save mappings of a `vaddr` to its corresponding `block`. A maple
tree isn’t constant time, but it offers `O(log n)` performance, which
is a significant improvement over the current `O(n)` iteration.
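In sketch form, the compromise looks like this (error handling elided;
`block_map` is the new field):
```
/* At page initialisation: record each block's entire vaddr range.
 * (Return-value/error handling elided in this sketch.) */
mtree_store_range(&pool->block_map, (unsigned long)block->vaddr,
		  (unsigned long)block->vaddr + pool->size - 1,
		  block, GFP_KERNEL);

/* In dma_pool_free(): an O(log n) lookup replaces the O(n) page walk. */
block = mtree_load(&pool->block_map, (unsigned long)vaddr);
```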
Here are the performance results. I've already reported the first two
sets of numbers, but I'll repeat them here:
**With no patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:11860
dmapool test: size:64 align:64 blocks:8192 time:11951
dmapool test: size:256 align:256 blocks:8192 time:12287
dmapool test: size:1024 align:1024 blocks:2048 time:3134
dmapool test: size:4096 align:4096 blocks:1024 time:1686
dmapool test: size:68 align:32 blocks:8192 time:12050
```
**With the submitted patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:34432
dmapool test: size:64 align:64 blocks:8192 time:62262
dmapool test: size:256 align:256 blocks:8192 time:238137
dmapool test: size:1024 align:1024 blocks:2048 time:61386
dmapool test: size:4096 align:4096 blocks:1024 time:75342
dmapool test: size:68 align:32 blocks:8192 time:88243
```
**With the submitted patches applied AND using a maple tree to improve
the performance of vaddr-to-block translations:**
```
dmapool test: size:16 align:16 blocks:8192 time:43668
dmapool test: size:64 align:64 blocks:8192 time:44746
dmapool test: size:256 align:256 blocks:8192 time:45434
dmapool test: size:1024 align:1024 blocks:2048 time:11013
dmapool test: size:4096 align:4096 blocks:1024 time:5250
dmapool test: size:68 align:32 blocks:8192 time:45900
```
The maple tree optimization reduces the performance hit for most block
sizes, especially for larger blocks. While the performance is not
fully back to baseline, it gives a reasonable trade-off between
protection, runtime performance, and ease of deployment (i.e., not
requiring an API change).
If this approach looks acceptable, I can submit it as a V3 patch
series for further review and discussion.
Thanks,
Brian Johannesmeyer
* Re: [RFC v2 1/2] dmapool: Move pool metadata into non-DMA memory
2024-11-21 5:03 ` Christoph Hellwig
@ 2024-11-21 17:48 ` Brian Johannesmeyer
0 siblings, 0 replies; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-21 17:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, linux-mm, linux-kernel, linux-hardening,
Raphael Isemann, Cristiano Giuffrida, Herbert Bos, Greg KH,
Keith Busch
On Wed, Nov 20, 2024 at 10:03 PM Christoph Hellwig <hch@infradead.org> wrote:
> Indeed. You'd probably need to split the linkage of the pages into
> a list of those that have free blocks and those that don't as a minimum.
>
> Can you share your current version?
Sure, I can share the current version, though fair warning---it’s
still quite messy.
FWIW, I wonder if the bitmap approach might be more suitable as a
separate RFC. AFAICT, the primary issue with the currently submitted
patches is their runtime overhead. I’ve proposed a way to address this
in my recent response to [RFC v2 0/2]. Unfortunately, as I noted,
improving the memory overhead without worsening the runtime
performance is challenging---for example, removing the `next_block`
pointers would require iterating over all pages to find a free
`block`, which significantly impacts the runtime.
That said, how would you prefer I share my bitmap approach? Should I
submit it as a separate patch series or provide the patch directly in
this thread?
Thanks,
Brian Johannesmeyer
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-21 17:31 ` Brian Johannesmeyer
@ 2024-11-21 18:06 ` Keith Busch
2024-11-21 19:07 ` Brian Johannesmeyer
0 siblings, 1 reply; 18+ messages in thread
From: Keith Busch @ 2024-11-21 18:06 UTC (permalink / raw)
To: Brian Johannesmeyer
Cc: Christoph Hellwig, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
On Thu, Nov 21, 2024 at 10:31:11AM -0700, Brian Johannesmeyer wrote:
> **With the submitted patches applied AND using a maple tree to improve
> the performance of vaddr-to-block translations:**
If you have the time, could you compare with using xarray instead?
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-21 18:06 ` Keith Busch
@ 2024-11-21 19:07 ` Brian Johannesmeyer
2024-11-22 19:19 ` Brian Johannesmeyer
0 siblings, 1 reply; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-21 19:07 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
On Thu, Nov 21, 2024 at 11:06 AM Keith Busch <kbusch@kernel.org> wrote:
> If you have the time, could you compare with using xarray instead?
Sure. Good idea.
**With the submitted patches applied AND using an xarray for
vaddr-to-block translations:**
```
dmapool test: size:16 align:16 blocks:8192 time:37954
dmapool test: size:64 align:64 blocks:8192 time:40036
dmapool test: size:256 align:256 blocks:8192 time:41942
dmapool test: size:1024 align:1024 blocks:2048 time:10964
dmapool test: size:4096 align:4096 blocks:1024 time:6101
dmapool test: size:68 align:32 blocks:8192 time:41307
```
The xarray approach shows a slight improvement in performance compared
to the maple tree approach.
FWIW, I implemented the two with slightly different semantics:
- In the maple tree implementation, I saved the `block`'s entire
`vaddr` range, allowing any `vaddr` within the `block` to be passed to
`dma_pool_free()`.
- In the xarray implementation, I saved only the `block`'s base
`vaddr`, requiring `dma_pool_free()` to be called with the exact
`vaddr` returned by `dma_pool_alloc()`. This aligns with the DMA pool
API documentation, which specifies that the `vaddr` returned by
`dma_pool_alloc()` should be passed to `dma_pool_free()`.
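In sketch form, the xarray variant differs only in the indexing
(assuming a `struct xarray block_xa` field in `struct dma_pool`):
```
/* At page initialisation: index each block by its base vaddr only. */
xa_store(&pool->block_xa, (unsigned long)block->vaddr, block, GFP_KERNEL);

/* In dma_pool_free(): exact-match lookup of the vaddr that
 * dma_pool_alloc() returned; interior pointers miss, unlike the
 * maple tree's range lookup. */
block = xa_load(&pool->block_xa, (unsigned long)vaddr);
```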
Let me know if you'd like further adjustments.
Thanks,
Brian Johannesmeyer
* Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption
2024-11-21 19:07 ` Brian Johannesmeyer
@ 2024-11-22 19:19 ` Brian Johannesmeyer
0 siblings, 0 replies; 18+ messages in thread
From: Brian Johannesmeyer @ 2024-11-22 19:19 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Andrew Morton, linux-mm, linux-kernel,
linux-hardening, Raphael Isemann, Cristiano Giuffrida,
Herbert Bos, Greg KH
I’ll go ahead and prepare a V3 patch series with the following updates:
- Using an xarray for vaddr-to-block translations, which improves the
performance of free operations.
- Removing the minimum DMA block size constraint, as it is no longer necessary.
Let me know if there are any additional suggestions or concerns to
address before submission.
Thanks,
Brian