* [RFC] Type-Partitioned vmalloc (with sample *.ko code)
0 siblings, 1 reply; 3+ messages in thread
From: Maxwell Bland @ 2025-02-28 20:57 UTC (permalink / raw)
To: linux-hardening
Cc: Kees Cook, Gustavo A. R. Silva, linux-security-module,
Serge Hallyn, Mickaël Salaün, Paul Moore, James Morris,
linux-mm, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Andrew Wheeler, Sammy BS2 Que, mbland
Dear Linux Hardening, Security, and Memory Management Mailing Lists,
This is primarily an FYI and an RFC. I have some code, included below,
that could be dropped into a *.ko for the 6.1.X kernel, but really this
mail is to query about ideas for acceptable upstream changes.
Thank you ahead of time for reading! If the title alone of this email
sticks out and makes sense immediately, feel free to skip the
introduction below.
INTRODUCTION
For the past few months, I have been sparring with recent CVE PoCs in
the kernel, applying monkey patches to dynamic data structure
allocations, attempting to prevent data-only attacks which use write
gadgets to modify dynamically allocated struct fields otherwise declared
constant.
I wanted to share, briefly, what I feel is a reasonable and general
solution to the standard contemporary exploit procedure. For those
unfamiliar with recent PoC's, see a case study of recent exploits in Man
Yue Mo's article here:
https://github.blog/security/vulnerability-research/the-android-kernel-mitigations-obstacle-race/
Particularly, understanding the "Running arbitrary root commands using
ret2kworker(TM)" section will give a general idea of the issue.
Summarizing, there are thousands of dynamic data structures alloc'd and
free'd in the kernel all the time, for files, for processes, and so
forth, and it is elementary to manipulate any instance of data, but hard
to protect every single one of them. These range from trng device
pointers to kworker queues---everything passing through vmalloc.
The strawman approach presented here is for security engineers to read
CVE-XYZ-ABC PoC, identify the portion of the system being manipulated,
and patch the allocation handler to protect just that data at the
page-table layer, by:
- Reorganizing allocations of those structures so that they sit
adjacently on the same 2MB hugepage, since otherwise the existing
hardware support for preventing their mutation (PTE flags) will trigger
for unrelated data allocated adjacently.
- Writing a handler to ensure non-malicious modifications, e.g. keeping
"const" fields const, ensuring modifications to other fields happen at
the right physical PC values and the right pages, handling atomic
updates so that the exception fault on these values maintains ordering
under race conditions (maybe "doubling up" on atomic assembly operations
due to certain microarch issues at the chipset level, see below), and so
on, and so forth.
Eventually, this Sisyphean task amounts to a mountain's worth of
point-patches and encoded wisdom, valuable but absurd insofar as there
are a thousand more places for an exploit to manipulate instead of the
protected ones.
DATATYPE PARTITIONED VIRTUAL MEMORY ALLOCATION
The above process can be generalized by changing Linux's vmalloc to
behave more like seL4 (though not identically), by tying allocation
itself to the typing of an object:
https://docs.sel4.systems/Tutorials/untyped.html

though without the tutorial's caveat that objects must be "allocated in
order of size, largest first, to avoid wasting memory."
I demonstrated something similar previously to prevent the intermixed
allocation of SECCOMP BPF code pages with data on ARM64's Android Kernel
here (with which you may be familiar):
https://lore.kernel.org/all/20240423095843.446565600-1-mbland@motorola.com/
That said, the above patch does not do the same for other critical
dynamically allocated data.
So, for instance, to prevent struct file manipulation, I've written the
following code into an init-time-loaded kernel (v6.1.x) module:
filp_cachep_ind =
(struct kmem_cache **)kallsyms_lookup_name_ind("filp_cachep");
/* Just nix the existing file cache for one which is page-aligned */
*filp_cachep_ind = kmem_cache_create(
"filp", sizeof(struct file), PAGE_SIZE,
SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
I.e. aligning cache allocations to PAGE_SIZE. See the appendix for
associated module code.
Of course, this is a little insane since:
(1) I'm effectively double allocating the cache to change how
the structs are allocated, because I can't change the kernel's
init process (part of this has to do with Google's GKI).
(2) The kmem infrastructure also needs to be monkey patched so
that this "PAGE_SIZE" alignment actually indicates that objects
can still be allocated next to each other at the originally
set alignment, reducing dead space due to wasted bytes (not
implemented). And, most importantly:
(3) struct file is just one case of thousands.
However, it seems fine for protecting a specific, given file allocation
targeted by something like:

https://github.com/chompie1337/s8_2019_2215_poc/blob/34f6481ed4ed4cff661b50ac465fc73655b82f64/poc/knox_bypass.c#L50

Given you also have the appropriate protection handlers (see the
appendix below), this works fine even without access to an HVCI system.
Hopefully the above reasoning is clear enough. If so, the proposal
(though the best way to do this with standard C is not clear; maybe
some preprocessor magic) would be to pass the data's type itself to
kmem_cache_create (and other APIs used to reserve virtual memory for a
struct).
kmem_cache_create would then use this type identifier to allocate and
resolve a region of virtual memory for just objects of that type.
This is an old idea, and I've found evidence of it in, for example,
Levy's discussion of Hydra in 1984's Capability-Based Computer Systems,
which contains the following statement regarding object allocations:
"the appropriate list for an object’s fixed part is determined by a
hashing function on the object’s 64-bit name" (though my implication
here is that the word "name" should be the 64-bit type). I also don't
see much reference to the hardware page tables and the write exception
faults which are the motivation behind the design of such a system.
CONCLUSION
Whatever the implications are, beyond seL4's rough sketch of this idea,
I cannot find Type-Partitioned Virtual Memory Allocation coded in many
other places.
Hopefully, even for those unfamiliar with the exploits in question, the
benefits here are clear, as it closes a certain semantic gap between
heap allocations and the hardware's ability to protect memory.
Thoughts? I've tried, pretty desperately, to figure out an
alternative/easy solution here, but knowing current hardware exception
fault handlers, I see few other ways that we will ever have a system to
prevent the repercussions of write gadgets.
References? I know of the existing efforts toward HVCI, KASAN, and the
KSPP, but hopefully the distinction here is clear enough: I am
referring, specifically, to the pain of adjacency between, for example,
f_lock and f_ops, and the implications that this has for hardware. From
what I understand (very little), even OpenBSD does not address this,
though maybe there has been some discussion of it somewhere in
https://www.openbsd.org/papers/ ... I found nothing in all the papers
grep-matching "alloc".
Please let me know if you've seen anything else discussing this problem,
particularly anything that might save me from having to rewrite the
virtual memory allocator in our OS to prevent these attacks.
Solutions? I have also been weighing a few other ideas, such as a second
page, similar to or built on KASAN, to understand the "allocation map"
for a given page: but the issue is this allocation map page, or datatype
tag, must then also have a window of writability unless maintained by a
hypervisor or otherwise isolated system.
Thank you again for your time in considering this subject, and providing
your thoughts in this public forum.
Best Regards,
Maxwell
REFERENCES
The patches/discussions here:
https://lore.kernel.org/all/rsk6wtj2ibtl5yygkxlwq3ibngtt5mwpnpjqsh6vz57lino6rs@rcohctmqugn3/
https://lore.kernel.org/all/994dce8b-08cb-474d-a3eb-3970028752e6@infradead.org/
https://lore.kernel.org/all/puj3euv5eafwcx5usqostpohmxgdeq3iout4hqnyk7yt5hcsux@gpiamodhfr54/
https://lore.kernel.org/all/h4hxxozslqmqhwljg5sfold764242pmw5y77mdigaykw5ehjjs@nc4xtzw7xprm/
https://lore.kernel.org/all/20240503131910.307630-1-mic@digikod.net/
PoC's floating around the following CVEs:
- CVE-2024-1086 (pagetable modification)
- CVE-2021-33909 (seccomp codepage modification)
- CVE-2022-22265 (selinux_enforcing state, AVC cache corruption)
- CVE-2019-2215 (struct file pointer manipulation)
- CVE-2022-22057 (kworker queue manipulation)
Some public discussions I've given here include additional notes on CFI
primitives and other errata (excuse my public speaking skills and
ignorance, as this is a developing subject for me):
https://www.youtube.com/watch?v=Rgg01n4jdBU&t=4s&pp=ygUNbWF4d2VsbCBibGFuZA%3D%3D
https://www.youtube.com/watch?v=3DBGardQsHk&t=1844s&pp=ygUNbWF4d2VsbCBibGFuZA%3D%3D
APPENDIX
Below, I'll include a specific example of protecting struct file for
the 6.1.x kernel. You'll have to excuse the stylistic and questionable
hacks here, since the GKI ensures any useful changes to the kernel need
to use the always-on kernel self-patching mechanism.
- Patching File Allocation:
static struct file *alloc_file_handler(const struct path *path, int flags,
const struct file_operations *fop)
{
struct file *file;
file = alloc_empty_file_ind(flags, current_cred());
if (IS_ERR(file))
return file;
/* TODO: had to expand out the direct struct assignment here
* since the snapdragon cannot handle perm faults on stp instructions
* with two input registers */
file->f_path.dentry = path->dentry;
file->f_path.mnt = path->mnt;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
file->f_sb_err = file_sample_sb_err(file);
if (fop->llseek)
file->f_mode |= FMODE_LSEEK;
if ((file->f_mode & FMODE_READ) && (fop->read || fop->read_iter))
file->f_mode |= FMODE_CAN_READ;
if ((file->f_mode & FMODE_WRITE) && (fop->write || fop->write_iter))
file->f_mode |= FMODE_CAN_WRITE;
file->f_iocb_flags = iocb_flags(file);
file->f_mode |= FMODE_OPENED;
file->f_op = fop;
if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_inc(path->dentry->d_inode);
/* NOTE/TODO: until the underlying vmalloc infrastructure is
* patched or rewritten, it is difficult, if not impossible,
* to effectively and efficiently protect all struct file's in the
* kernel. The same holds for kworker queues and many other
* dynamically allocated data structures. Will message mailing
* list about this and maybe continue working on it for the next
* decade )-: */
qcom_smc_waitloop("alloc_file_handler_smc", SMCID_TAG_MEM_PROTECT,
__virt_to_phys(file), PAGE_SIZE);
return file;
}
static void __fput_handler(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = file->f_inode;
fmode_t mode = file->f_mode;
bool run_dput = true;
if ((!(file->f_mode & FMODE_OPENED)))
goto out;
might_sleep();
/* Hacks because of QCOM's perm fault handler */
if (atomic_long_read(&file->f_count) == 0xFFFFFFFFFFFFFFFF)
return;
if (atomic_long_read(&file->f_count) == 0x0)
atomic_long_set(&file->f_count, 0xFFFFFFFFFFFFFFFF);
fsnotify_close(file);
/*
* The function eventpoll_release() should be the first called
* in the file cleanup chain.
*/
eventpoll_release_ind(file);
locks_remove_file_ind(file);
ima_file_free(file);
if ((file->f_flags & FASYNC)) {
if (file->f_op->fasync)
file->f_op->fasync(-1, file, 0);
}
if (file->f_op->release)
file->f_op->release(inode, file);
if ((S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
!(mode & FMODE_PATH))) {
cdev_put_ind(inode->i_cdev);
}
fops_put(file->f_op);
put_pid(file->f_owner.pid);
put_file_access(file);
if (run_dput)
dput(dentry);
if ((mode & FMODE_NEED_UNMOUNT))
dissolve_on_fput_ind(mnt);
mntput(mnt);
qcom_smc_waitloop("__fput_handler_smc", SMCID_TAG_MEM_UNPROTECT,
__virt_to_phys(file), PAGE_SIZE);
out:
file_free(file);
}
And on the fault handler side, because the kmem cache allocation places
each struct file on a separate page, maintaining the type mappings is
pretty easy to resolve via the SMC call:
if (type == FILE_STRUCT_TYPE) {
if (ipa % PAGE_SIZE == 0x048) {
// manage writes to the atomic type/updates according to CASA semantics on ARM64, etc
}
if (ipa % PAGE_SIZE == 0x030) {
// manage writes to the atomic type/updates according to CASA semantics on ARM64, etc
}
... // prevent writes to f_ops, etc, etc, etc
}
* Re: [RFC] Type-Partitioned vmalloc (with sample *.ko code)
From: Kees Cook @ 2025-03-03 18:26 UTC (permalink / raw)
To: Maxwell Bland
Cc: linux-hardening, Gustavo A. R. Silva, linux-security-module,
Serge Hallyn, Mickaël Salaün, Paul Moore, James Morris,
linux-mm, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Andrew Wheeler, Sammy BS2 Que
On Fri, Feb 28, 2025 at 02:57:40PM -0600, Maxwell Bland wrote:
> Dear Linux Hardening, Security, and Memory Management Mailing Lists,
>
> This is primarily an FYI and an RFC. I have some code, included below,
> that could be dropped into a *.ko for the 6.1.X kernel, but really this
> mail is to query about ideas for acceptable upstream changes.
>
> Thank you ahead of time for reading! If the title alone of this email
> sticks out and makes sense immediately, feel free to skip the
> introduction below.
>
> INTRODUCTION
>
> For the past few months, I have been sparring with recent CVE PoCs in
> the kernel, applying monkey patches to dynamic data structure
> allocations, attempting to prevent data-only attacks which use write
> gadgets to modify dynamically allocated struct fields otherwise declared
> constant.
>
> I wanted to share, briefly, what I feel is a reasonable and general
> solution to the standard contemporary exploit procedure. For those
> unfamiliar with recent PoC's, see a case study of recent exploits in Man
> Yue Mo's article here:
>
> https://github.blog/security/vulnerability-research/the-android-kernel-mitigations-obstacle-race/
>
> Particularly, understanding the "Running arbitrary root commands using
> ret2kworker(TM)" section will give a general idea of the issue.
>
> Summarizing, there are thousands of dynamic data structures alloc'd and
> free'd in the kernel all the time, for files, for processes, and so
> forth, and it is elementary to manipulate any instance of data, but hard
> to protect every single one of them. These range from trng device
> pointers to kworker queues---everything passing through vmalloc.
>
> The strawman approach presented here is for security engineers to read
> CVE-XYZ-ABC PoC, identify the portion of the system being manipulated,
> and patch the allocation handler to protect just that data at the
> page-table layer, by:
>
> - Reorganizing allocations of those structures so that they are on
> the same 2MB hugepage, adjacently, as otherwise existing hardware
> support to prevent their mutation (PTE flags) will trigger for unrelated
> data allocated adjacently.
This sounds like the "write rarely" proposal:
https://github.com/KSPP/linux/issues/130
which isolates chosen data structures into immutable memory, except for
specific helpers which are allowed to write to the memory. This is
needed most, by far, for page tables:
https://lore.kernel.org/lkml/20250203101839.1223008-1-kevin.brodsky@arm.com/
It looks from your example at the end that you're depending on a
hypervisor to perform the memory protection and fault handling? Or maybe
I'm misunderstanding where the protection is happening?
> - Writing a handler to ensure non-malicious modifications, e.g. keeping
> "const" fields const, ensuring modifications to other fields happen at
> the right physical PC values and the right pages, handling atomic
> updates so that the exception fault on these values maintains ordering
> under race conditions (maybe "doubling up" on atomic assembly operations
> due to certain microarch issues at the chipset level, see below), and so
> on, and so forth.
As I understand it, this depends on avoiding ROP attacks that will hijack
those PC values. (This is generally true for the entire concept of
"write rarely", though -- nothing specific to this implementation.)
The current proposals use a method of gaining temporary write permission
during the execution path of an approved writer (rather than doing it
via the fault handler, which tends to be very expensive).
> Eventually, this Sisyphean task amounts to a mountain worth of
> point-patches and encoded wisdom, valuable but absurd insofar as there
> are a thousand more places for an exploit to manipulate instead of the
> protected ones.
>
> DATATYPE PARTITIONED VIRTUAL MEMORY ALLOCATION
>
> The above process can be generalized by changing Linux's vmalloc to
> behave more like seL4 (though not identically), by tying allocation
> itself to the typing of an object:
>
> https://docs.sel4.systems/Tutorials/untyped.html
>
> though without the tutorial's caveat that objects must be "allocated in
> order of size, largest first, to avoid wasting memory."
>
> I demonstrated something similar previously to prevent the intermixed
> allocation of SECCOMP BPF code pages with data on ARM64's Android Kernel
> here (with which you may be familiar):
>
> https://lore.kernel.org/all/20240423095843.446565600-1-mbland@motorola.com/
Did this v4 go any further? I see earlier versions had some discussion
around it.
> That said, the above patch does not do the same for other critical
> dynamically allocated data.
>
> So, for instance, to prevent struct file manipulation, I've written the
> following code into a init-time loaded kernel (v6.1.x) module:
>
> filp_cachep_ind =
> (struct kmem_cache **)kallsyms_lookup_name_ind("filp_cachep");
> /* Just nix the existing file cache for one which is page-aligned */
> *filp_cachep_ind = kmem_cache_create(
> "filp", sizeof(struct file), PAGE_SIZE,
> SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>
> I.e. aligning cache allocations to PAGE_SIZE. See the appendix for
> associated module code.
Just a semantic note: this is kmalloc not vmalloc. You mention
vmalloc, but I don't really see anything specific to vmalloc vs kmalloc
in here... Do you mean to make a distinction between the allocators?
Linux has 3 allocators: page(buddy), kmalloc(kmem), and vmalloc(vmap).
> Of course, this is a little insane since:
>
> (1) I'm effectively double allocating the cache to change how
> the structs are allocated, because I can't change the kernel's
> init process (part of this has to do with Google's GKI).
>
> (2) The kmem infrastructure needs to be also monkey patched so
> that this "PAGE_SIZE" alignment actually indicates that objects
> can still be allocated next to eachother at the originally
> set alignment, reducing dead space due to wasted bytes (not
> implemented). And, most important
A dedicated kmem cache should live in its own page, so I don't think
anything special is needed for things already using kmem_cache_create(),
except that they must not be aliased with same-sized allocations. (i.e.
include the SLAB_NO_MERGE flag.)
> (3) struct file is just one case of thousands.
>
> However, it seems fine for protecting a specific, given file allocation
> targeted by something like:
>
> https://github.com/chompie1337/s8_2019_2215_poc/blob/34f6481ed4ed4cff661b50ac465fc73655b82f64/poc/knox_bypass.c#L50
>
> given you also have the appropriate protection handlers (see appendix
> below), this works fine even outside of access to a HVCI system.
>
> Hopefully the above reasoning is clear enough. If so, the proposal
> (though it is not clear the best way to do this with standard C, maybe
> some preprocessor magic), would be to pass the data's type itself to
> kmem_cache_create (and other APIs used to reserve virtual memory for a
> struct).
>
> kmem_cache_create would then use this type identifier to allocate and
> resolve a region of virtual memory for just objects of that type.
Doing this automatically requires compiler support to provide a way to
distinguish types from the lvalue of an expression:
// kmalloc has no idea what type it will provide an allocation for
struct whatever *p = kmalloc(...);
Or, for the allocation system in the kernel to be totally rearranged to
pass the variable into a helper so the type (and things like alignment)
can be introspected. I proposed doing that here:
https://lore.kernel.org/all/20240822231324.make.666-kees@kernel.org/
It's unclear if this approach will be successful, and I've been waiting
for compiler support of "counted_by" to be more widely usable.
An alternative is to separate allocations by call site (rather than
type). This has some performance benefits too, and requires no special
compiler support. I proposed that here:
https://lore.kernel.org/all/20240809072532.work.266-kees@kernel.org/
With this in place, simple Use-After-Free type confusion is blocked,
but not cross-cache attacks. To defend against cross-cache UAF attacks,
full separation of the virtual memory spaces ("memory allocation pinning")
is needed. SLAB_VIRTUAL has been proposed here:
https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/
> This is an old idea, and I've found evidence of it in, for example,
> Levy's discussion of Hydra in 1984's Capability-Based Computer Systems,
> which contains the following statement regarding object allocations:
> "the appropriate list for an object’s fixed part is determined by a
> hashing function on the object’s 64-bit name" (though my implication
> here is that the word "name" should be the 64 bit type. I also don't see
> much reference to the hardware page tables, and write exception faults
> which are the motivation behind the design of such a system.
>
> CONCLUSION
>
> Whatever the implications are, beyond seL4's rough sketch of this idea,
> I cannot find Type-Partitioned Virtual Memory Allocation coded in many
> other places.
>
> Hopefully, even for those unfamiliar with the exploits in question, the
> benefits here are clear, as it closes a certain semantic gap between
> heap allocations and the hardware's ability to protect memory.
>
> Thoughts? I've tried, pretty desperately, to figure out an
> alternative/easy solution here, but knowing current hardware exception
> fault handlers, I see few other ways that we will ever have a system to
> prevent the repercussions of write gadgets.
>
> References? I know of the existing efforts toward HVCI, KASAN, and the
> KSPP, but hopefully the distinction here is clear enough: I am
> referring, specifically to the pain of adjacency between, for example,
> f_lock and f_ops, and the implications that this has for hardware. From
> what I understand (very little), even OpenBSD does not, though maybe
> there has been some discussion of it somewhere in
> https://www.openbsd.org/papers/ ... I found nothing for all those
> grep-matching "alloc".
Yes, the "write rarely" case is painful since most structures have
elements that are very frequently written. :(
> Please let me know if you've seen anything else discussing this problem,
> particularly anything that might save me from having to rewrite the
> virtual memory allocator in our OS to prevent these attacks.
Hopefully some (all) of the proposals above should provide you with the
desired coverage, though if you're doing this on arm64 you'll need to
implement the arch-specific portions of SLAB_VIRTUAL on arm64.
> Solutions? I have also been weighing a few other ideas, such as a second
> page, similar to or built on KASAN, to understand the "allocation map"
> for a given page: but the issue is this allocation map page, or datatype
> tag, must then also have a window of writability unless maintained by a
> hypervisor or otherwise isolated system.
>
> Thank you again for your time in considering this subject, and providing
> your thoughts in this public forum.
I'd say there are two other things that probably need some
consideration. The first is hardware memory tagging, which should
provide much of the SLAB_VIRTUAL protection without the TLB overhead or
the confined memory usage limits.
Another possibility is using stuff like Top Byte Ignore (TBI, arm64)
or Linear Address Mapping (LAM, x86) where a hybrid hardware/software
memory tagging style protection could be built with compiler support
(i.e. checking the pointer tags against the current tag for a given
memory region before writes, or before first use in a function, etc).
There are a lot of options for protecting against UAF and data-only
attacks, but they span a pretty wide range of performance impact, unmet
hardware capabilities, and unmet compiler implementations...
I'd really like to see more movement in this whole area, but it's been a
tricky balancing act. :P
> The patches/discussions here:
> https://lore.kernel.org/all/rsk6wtj2ibtl5yygkxlwq3ibngtt5mwpnpjqsh6vz57lino6rs@rcohctmqugn3/
> https://lore.kernel.org/all/994dce8b-08cb-474d-a3eb-3970028752e6@infradead.org/
Has this moved forward? This got to a v5...
> https://lore.kernel.org/all/puj3euv5eafwcx5usqostpohmxgdeq3iout4hqnyk7yt5hcsux@gpiamodhfr54/
I think Peter Zijlstra solved all the BPF KCFI interactions? See commit
4f9087f16651 ("x86/cfi,bpf: Fix BPF JIT call")
> https://lore.kernel.org/all/h4hxxozslqmqhwljg5sfold764242pmw5y77mdigaykw5ehjjs@nc4xtzw7xprm/
Did this move forward? Another v5...
> https://lore.kernel.org/all/20240503131910.307630-1-mic@digikod.net/
I'd like to see something like Heki move forward too. :)
-Kees
--
Kees Cook
* Re: [RFC] Type-Partitioned vmalloc (with sample *.ko code)
From: Maxwell Bland @ 2025-03-06 15:50 UTC (permalink / raw)
To: Kees Cook
Cc: linux-hardening, Gustavo A. R. Silva, linux-security-module,
Serge Hallyn, Mickaël Salaün, Paul Moore, James Morris,
linux-mm, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
Lorenzo Stoakes, Andrew Wheeler, Sammy BS2 Que
On Mon, Mar 03, 2025 at 10:26:16AM -0800, Kees Cook wrote:
> On Fri, Feb 28, 2025 at 02:57:40PM -0600, Maxwell Bland wrote:
> > Summarizing, there are thousands of dynamic data structures alloc'd and
> > free'd in the kernel all the time, for files, for processes, and so
> > forth, and it is elementary to manipulate any instance of data, but hard
> > to protect every single one of them. These range from trng device
> > pointers to kworker queues---everything passing through vmalloc.
> >
> > - Reorganizing allocations of those structures so that they are on
> > the same 2MB hugepage, adjacently, as otherwise existing hardware
> > support to prevent their mutation (PTE flags) will trigger for unrelated
> > data allocated adjacently.
>
> This sounds like the "write rarely" proposal:
> https://github.com/KSPP/linux/issues/130
>
> which isolates chosen data structures into immutable memory, except for
> specific helpers which are allowed to write to the memory. This is
> needed most, by far, for page tables:
> https://lore.kernel.org/lkml/20250203101839.1223008-1-kevin.brodsky@arm.com/
Thank you for this pointer and the others below. I spent a lot of time
the past two days thinking about your email and the links.
> It looks from your example at the end that you're depending on a
> hypervisor to perform the memory protection and fault handling? Or maybe
> I'm misunderstanding where the protection is happening?
Correct. I use the fault handler, proper, though, and optimize it
through careful management of protected vs. unprotected resources, which
pushes me up against the problem of determining specific policies for
each type of kmalloc.
>
> > - Writing a handler to ensure non-malicious modifications, e.g. keeping
> > "const" fields const, ensuring modifications to other fields happen at
> > the right physical PC values and the right pages, handling atomic
> > updates so that the exception fault on these values maintains ordering
> > under race conditions (maybe "doubling up" on atomic assembly operations
> > due to certain microarch issues at the chipset level, see below), and so
> > on, and so forth.
>
> As I understand it, this depends on avoiding ROP attacks that will hijack
> those PC values. (This is generally true for the entire concept of
> "write rarely", though -- nothing specific to this implementation.)
I think a more general solution to this problem, leveraging the POE
mechanism (or just stage-2 translation tables), is to build something on
top of or around CFI. This is natural since the protections already
assume CFI for the data tagging. I can imagine some GCC plugin or
compiler pass for functions, which can appropriately inject
unlock/relock calls around "critical" functions (part of the paciasp
instrumentation).
In fact, I rewrote the QCOM SMC handler to ensure the lock/unlock
semantics were inlined into the specific data operation context, to
prevent creation of a privilege escalation callgate given a CFI bypass.
I attached the code for this at the end.
I will bring this and some other points up to Kevin.
> The current proposals use a method of gaining temporary write permission
> during the execution path of an approved writer (rather than doing it
> via the fault handler, which tends to be very expensive).
I've not found the fault handler approach to be too expensive, at least
for a system matching all current guarantees. Once we begin talking
about struct file's f_lock and every kmalloc, I am inclined to agree.
I think a fault handler based solution can still get a lot of distance
if frequently updated fields of structs were indirected through pointers
(and separate kmalloc calls).
One issue with the POE and other solutions I see is also a lack of
infrastructure for applying specific policies to updates on data
structures: it's one thing to lock the page table outside of
set_memory_rw, but another to ensure the arguments to that API are
not corrupted, e.g. overwriting plt->target here?
arch/arm64/net/bpf_jit_comp.c
2417: if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target),
> > I demonstrated something similar previously to prevent the intermixed
> > allocation of SECCOMP BPF code pages with data on ARM64's Android Kernel
> > here (with which you may be familiar):
> >
> > https://lore.kernel.org/all/20240423095843.446565600-1-mbland@motorola.com/
>
> Did this v4 go any further? I see earlier versions had some discussion
> around it.
No response, so I did not stress the issue (should I have?). I ended up
just hacking around Google's GKI, so upstreaming was no longer
necessary.
> in here... Do you mean to make a distinction between the allocators?
> Linux has 3 allocators: page(buddy), kmalloc(kmem), and vmalloc(vmap).
I guess so, though the vmalloc cases for swap and bpf are important:
void __always_inline patch_jump_to_handler(void *faddr, void *helper)
{
u32 insn;
insn = aarch64_insn_gen_branch_imm_ind((unsigned long)faddr,
(unsigned long)helper,
AARCH64_INSN_BRANCH_NOLINK);
aarch64_insn_patch_text_nosync_ind(faddr, insn);
}
...
/* This works since, even though it is marked "inline", the function is
 * not actually inlined, so we can resolve it via kallsyms and patch it. */
void bpf_jit_binary_lock_ro_injector(struct bpf_binary_header *hdr)
{
	/* set_vm_flush_reset_perms(hdr); */
	struct vm_struct *vm = find_vm_area_ind((void *)hdr);

	if (vm)
		vm->flags |= VM_FLUSH_RESET_PERMS;
	if (run_CVE_2021_33909) {
		/* Demo payload: corrupt the JIT image once, just before
		 * it is locked read-only. */
		hdr->image[8] = 0x13;
		hdr->image[9] = 0x37;
		hdr->image[10] = 0x13;
		hdr->image[11] = 0x37;
		pr_info("LOCK RO HERE!\n");
		run_CVE_2021_33909 = 0;
	}
	set_mem_ro((unsigned long)hdr, hdr->size >> PAGE_SHIFT);
	set_mem_x((unsigned long)hdr, hdr->size >> PAGE_SHIFT);
}
> A dedicated kmem cache should live in its own page, so I don't think
> anything special is needed for things already using kmem_cache_create(),
> except that they must not be aliased with same-sized allocations. (i.e.
> include the SLAB_NO_MERGE flag.)
Hmm, I thought the write faults on allocated pages were from other types
of data, but maybe I made a mistake here and they were actually from
other `struct file` allocations. This is good to know, if true; thank you.
> Doing this automatically requires compiler support to provide a way to
> distinguish types from the lvalue of an expression:
>
> // kmalloc has no idea what type it will provide an allocation for
> struct whatever *p = kmalloc(...);
>
> Or, for the allocation system in the kernel to be totally rearranged to
> pass the variable into a helper so the type (and things like alignment)
> can be introspected. I proposed doing that here:
> https://lore.kernel.org/all/20240822231324.make.666-kees@kernel.org/
>
> It's unclear if this approach will be successful, and I've been waiting
> for compiler support of "counted_by" to be more widely usable.
>
> An alternative is to separate allocations by call site (rather than
> type). This has some performance benefits too, and requires no special
> compiler support. I proposed that here:
> https://lore.kernel.org/all/20240809072532.work.266-kees@kernel.org/
>
> With this in place, simple Use-After-Free type confusion is blocked,
> but not cross-cache attacks. To defend against cross-cache UAF attacks,
> full separation of the virtual memory spaces ("memory allocation pinning")
> is needed. SLAB_VIRTUAL has been proposed here:
> https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/
The above were super valuable! One insight I have had, related to the
third link above, is that if you do have lock/unlock tracking based on
allocation/deallocation (the second link, or my file tracking from the
last email), you can get KASAN-like UAF detection: if you see a fault
from a PC on a resource which has already passed through the "free"
handler (with no corresponding alloc), you can trip a fault.
Essentially, a hashmap tracking each page and associating it with the
current kmalloc type... though I think SLAB_VIRTUAL also achieves this
by preventing the reuse of that virtual page.
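The tracking scheme above can be modeled in a few lines of userspace C (all names hypothetical; a small open-addressed table stands in for the hashmap): a "fault" on a page whose last recorded event was a free, with no intervening alloc, is flagged as a use-after-free.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of page-state tracking for UAF detection: each 4 KiB page
 * is mapped to its last allocation event.  A real implementation
 * would hook the alloc/free handlers described in the thread. */
#define NSLOTS 64
enum page_state { PG_UNKNOWN, PG_ALLOCED, PG_FREED };

static struct {
	uintptr_t page;
	enum page_state st;
} table[NSLOTS];

static unsigned slot(uintptr_t page)
{
	return (page >> 12) % NSLOTS;	/* hash on the page frame */
}

static void track_alloc(uintptr_t addr)
{
	unsigned i = slot(addr & ~0xfffUL);

	table[i].page = addr & ~0xfffUL;
	table[i].st = PG_ALLOCED;
}

static void track_free(uintptr_t addr)
{
	unsigned i = slot(addr & ~0xfffUL);

	if (table[i].page == (addr & ~0xfffUL))
		table[i].st = PG_FREED;
}

/* Called from the hypothetical fault handler: returns 1 for a
 * UAF-like access, i.e. a touch of a page whose last event was a
 * free with no corresponding alloc since. */
static int fault_is_uaf(uintptr_t addr)
{
	unsigned i = slot(addr & ~0xfffUL);

	return table[i].page == (addr & ~0xfffUL) && table[i].st == PG_FREED;
}
```

As noted above, SLAB_VIRTUAL sidesteps the need for this bookkeeping by never reusing the virtual page, which turns the same access pattern into a plain fault.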
> > Please let me know if you've seen anything else discussing this problem,
> > particularly anything that might save me from having to rewrite the
> > virtual memory allocator in our OS to prevent these attacks.
>
> Hopefully some (all) of the proposals above should provide you with the
> desired coverage, though if you're doing this on arm64 you'll need to
> implement the arch-specific portions of SLAB_VIRTUAL on arm64.
Absolutely. From the above, I still have some concerns and see some
limits to these approaches, but your insight that alloc_tag.h can be
used for type tracking and security is valuable; maybe I can see whether
Google can enable it for Android 16 (if they have not already).
> > The patches/discussions here:
> > https://lore.kernel.org/all/rsk6wtj2ibtl5yygkxlwq3ibngtt5mwpnpjqsh6vz57lino6rs@rcohctmqugn3/
> > https://lore.kernel.org/all/994dce8b-08cb-474d-a3eb-3970028752e6@infradead.org/
>
> Has this moved forward? This got to a v5...
Similar story to the other patches: I didn't see much response, so I did
not think I should petition too hard for the changes, though they are
valuable for PXNTable enforcement.
I suppose no response to a patch does not mean it is dead.
> > https://lore.kernel.org/all/puj3euv5eafwcx5usqostpohmxgdeq3iout4hqnyk7yt5hcsux@gpiamodhfr54/
>
> I think Peter Zijlstra solved all the BPF KCFI interactions? See commit
> 4f9087f16651 ("x86/cfi,bpf: Fix BPF JIT call")
Only on x86. I assumed the BPF maintainers would pick up my ARM change
via other means. I have not checked back, but will ping the BPF list
again if the *_handler CFI bypasses still exist in more recent kernels.
Cheers and thanks,
Maxwell Bland
Appendix 1. Rewritten QCOM SMCs
/* Calls the appropriate SMC to associate an SMC call location, labeled
 * by smc_call_tag_idiom, with a tag in the hypervisor. QCOM won't let
 * moto do atomic calls, BTW )-: */
#define add_el2_smc_tag(smc_label_name, tagtype) \
{ \
	__asm__ __volatile__( \
		"mrs x4, daif\n"	/* save interrupt state; x4 */ \
		"msr daifset, #0xf\n"	/* survives the tag move below */ \
		"sub x0, x0, x0\n" \
		"movk x0, #0x0007\n" \
		"movk x0, #0x" SMC_TYPE_NONATOMIC "300, lsl #16\n" \
		"mov x1, 0\n" \
		"adr x2, " #smc_label_name "\n" \
		"mov x3, %0\n" \
		"smc #0\n" \
		"msr daif, x4\n"	/* restore interrupt state */ \
		: \
		: "r"((uint64_t)tagtype) \
		: "x0", "x1", "x2", "x3", "x4"); \
}
/*
* Importantly, this idiom, and surrounding code, MUST INLINE to
* the function in question, otherwise we run the risk of tag
* allocation/deallocation being run via a CFI exploit. When
* it is introduced into the function in question itself, then
* it becomes possible to guarantee the surrounding code is
* valid, and we are not deallocating outside of the rest of
* the active kernel context.
*
* This makes it much harder for an adversary to "break" the
* system because they cannot just forge a function pointer
* to an arbitrary "unallocate tag" function, but must instead
* forge a function pointer/CFI tag using xxhash to the
* specific kernel operation being protected, and this
* should lead to a myriad of adjacent consequences and
* sanity checks.
*
* For example, in file_free_rcu, this means we have a strong
* guarantee that the SMC call will be followed by a call to
* put_cred, further cementing the necessary context for
* free'ing the file struct.
*/
#define smc_call_tag_idiom(smc_type, smc_tag, smcid, res, arg, size) \
{ \
	__asm__ __volatile__( \
		"movz x0, #0x000" smcid "\n" \
		"movk x0, #0x" smc_type "300, lsl #16\n" \
		"mov x1, 0\n" \
		"mov x2, %1\n" \
		"mov x3, %2\n" \
		"stp x29, x30, [sp, #-16]!\n" \
		"mov x29, sp\n" \
		smc_tag ":\n" \
		"smc #0\n" \
		"ldp x29, x30, [sp], #16\n" \
		"mov %0, x0\n" \
		: "=r"(res) \
		: "r"(arg), "r"(size) \
		: "x0", "x1", "x2", "x3"); \
}
static __always_inline void depreempt_sleep(void)
{
	uint64_t preempt_count_val = preempt_count();

	preempt_count_sub(preempt_count_val);
	msleep(30);
	preempt_count_add(preempt_count_val);
}
/* TODO: these need proper synchronization, not just msleep */
#define qcom_smc_waitloop(smc_tag, smcid_tag, arg, size) \
{ \
	uint64_t smc_res = 0; \
	do { \
		if (smc_res) \
			depreempt_sleep(); \
		smc_call_tag_idiom(SMC_TYPE_NONATOMIC, smc_tag, \
				   smcid_tag, smc_res, arg, size); \
	} while (smc_res); \
	__asm__ __volatile__("dmb sy; dsb sy; isb;"); \
}
#define qcom_smc_waitloop_nosleep(smc_tag, smcid_tag, arg, size) \
{ \
	uint64_t smc_res = 0; \
	do { \
		if (smc_res) \
			mdelay(30); \
		smc_call_tag_idiom(SMC_TYPE_NONATOMIC, smc_tag, \
				   smcid_tag, smc_res, arg, size); \
	} while (smc_res); \
	__asm__ __volatile__("dmb sy; dsb sy; isb;"); \
}
#define qcom_smc_waitloop_mayfail(smc_tag, smcid_tag, arg, size) \
{ \
	uint64_t smc_res = 0; \
	smc_call_tag_idiom(SMC_TYPE_NONATOMIC, smc_tag, \
			   smcid_tag, smc_res, arg, size); \
	__asm__ __volatile__("dmb sy; dsb sy; isb;"); \
}
2025-02-28 20:57 [RFC] Type-Partitioned vmalloc (with sample *.ko code) Maxwell Bland
2025-03-03 18:26 ` Kees Cook
2025-03-06 15:50 ` Maxwell Bland