linux-mm.kvack.org archive mirror
* [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
       [not found]   ` <20080625005739.GM6938@duo.random>
@ 2008-06-25  1:18     ` Andrea Arcangeli, Andrea Arcangeli
  2008-07-29 12:11       ` Andrea Arcangeli, Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli, Andrea Arcangeli @ 2008-06-25  1:18 UTC (permalink / raw)
  To: benami, Avi Kivity, Andrew Morton
  Cc: amit.shah, kvm, aliguori, allen.m.kay, muli, linux-mm

This has to be applied to the host kernel. For example, with a
relocation address of 0x20000000 it allows starting kvm guests capable
of pci-passthrough with up to "-m 512" by passing the "-reserved-ram"
parameter on the command line. There's no risk of user error because
the reserved ranges are provided to the virtualization software
through /proc/iomem. The only restriction is that you shouldn't run
more than one -reserved-ram kvm guest per system at once.

This works by reserving the ram early in the e820 map, so the initial
pagetables are allocated above the kernel .text relocation. Then I
make the sparse code think the reserved-ram is actually available (so
struct pages are allocated). Finally, I reserve those pages in the
bootmem allocator immediately after the bootmem allocator has been
initialized, so they remain PageReserved and unused by linux, but with
'struct page' backing so they can still be exported to qemu via device
driver vma->fault (they can still be the target of any emulated dma,
because not all devices will be passed through).

The virtualization software must create for the guest an e820 map that
only includes the "reserved RAM" regions. If the guest nevertheless
touches memory with a guest physical address in the "reserved RAM
failed" ranges, the software should provide that as ram and map it with
a non-linear mapping (in practice the only problem is the first page at
physical address 0, which is usually the bios, and no sane OS does DMA
to it).

vmx ~ # cat /proc/iomem |head -n 20
00000000-00000fff : reserved RAM failed
00001000-0008ffff : reserved RAM
00090000-00091fff : reserved RAM failed
00092000-0009cfff : reserved RAM
0009d000-0009ffff : reserved
000a0000-000ec16f : reserved RAM failed
000ec170-000fffff : reserved
00100000-1fffffff : reserved RAM
20000000-bff9ffff : System RAM
  20000000-20315f65 : Kernel code
  20315f66-204c3767 : Kernel data
  20557000-205c9eff : Kernel bss
bffa0000-bffaffff : ACPI Tables
bffb0000-bffdffff : ACPI Non-volatile Storage
bffe0000-bffedfff : reserved
bfff0000-bfffffff : reserved
d0000000-dfffffff : PCI Bus 0000:02
  d0000000-dfffffff : 0000:02:00.0
e0000000-efffffff : PCI MMCONFIG 0
  e0000000-efffffff : pnp 00:0c
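
As an illustration of how virtualization software could consume a
listing like the one above, here is a minimal userspace sketch in C. It
is not part of the patch: the names iomem_kind, iomem_range and
parse_iomem_line are hypothetical, and the only assumption about the
format is what the listing shows (top-level lines of the form
"start-end : name" in hex, nested resources indented).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of one top-level /proc/iomem line. */
enum iomem_kind {
	IOMEM_OTHER,              /* anything else, including plain "reserved" */
	IOMEM_RESERVED_RAM,       /* "reserved RAM": candidate for a 1:1 mapping */
	IOMEM_RESERVED_RAM_FAILED /* "reserved RAM failed": may be used by the host */
};

struct iomem_range {
	uint64_t start;
	uint64_t end; /* inclusive, as printed in /proc/iomem */
	enum iomem_kind kind;
};

/*
 * Parse one line such as "00100000-1fffffff : reserved RAM".
 * Returns 1 on success, 0 if the line is nested (indented) or malformed.
 */
int parse_iomem_line(const char *line, struct iomem_range *out)
{
	char name[64];
	uint64_t start, end;

	assert(line != NULL && out != NULL);

	/* Nested resources, e.g. "  20000000-20315f65 : Kernel code" */
	if (line[0] == ' ' || line[0] == '\t')
		return 0;

	if (sscanf(line, "%" SCNx64 "-%" SCNx64 " : %63[^\n]",
		   &start, &end, name) != 3)
		return 0;
	if (end < start)
		return 0;

	out->start = start;
	out->end = end;
	if (strcmp(name, "reserved RAM") == 0)
		out->kind = IOMEM_RESERVED_RAM;
	else if (strcmp(name, "reserved RAM failed") == 0)
		out->kind = IOMEM_RESERVED_RAM_FAILED;
	else
		out->kind = IOMEM_OTHER;
	return 1;
}
```

A caller would feed each line read from /proc/iomem to
parse_iomem_line() and keep only the IOMEM_RESERVED_RAM and
IOMEM_RESERVED_RAM_FAILED entries.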

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1198,8 +1198,36 @@ config CRASH_DUMP
 	  (CONFIG_RELOCATABLE=y).
 	  For more details see Documentation/kdump/kdump.txt
 
+config RESERVE_PHYSICAL_START
+	bool "Reserve all RAM below PHYSICAL_START (EXPERIMENTAL)"
+	depends on !RELOCATABLE && X86_64
+	help
+	  This makes the kernel use only RAM above __PHYSICAL_START.
+	  All memory below __PHYSICAL_START will be left unused and
+	  marked as "reserved RAM" in /proc/iomem. The few special
+	  pages that can't be relocated at addresses above
+	  __PHYSICAL_START and that can't be guaranteed to be unused
+	  by the running kernel will be marked "reserved RAM failed"
+	  in /proc/iomem. Those may or may not be used by the kernel
+	  (for example SMP trampoline pages would only be used if
+	  CPU hotplug is enabled).
+
+	  The "reserved RAM" can be mapped by virtualization software
+	  with /dev/mem to create a 1:1 mapping between guest physical
+	  (bus) address and host physical (bus) address. This will
+	  allow PCI passthrough with DMA for the guest using the RAM
+	  with the 1:1 mapping. The only detail to take care of is the
+	  RAM marked "reserved RAM failed". The virtualization
+	  software must create for the guest an e820 map that only
+	  includes the "reserved RAM" regions but if the guest touches
+	  memory with guest physical address in the "reserved RAM
+	  failed" ranges (Linux guest will do that even if the RAM
+	  isn't present in the e820 map), it should provide that as
+	  RAM and map it with a non-linear mapping. This should allow
+	  any Linux kernel to run fine and hopefully any other OS too.
+
 config PHYSICAL_START
-	hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
+	hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP || RESERVE_PHYSICAL_START)
 	default "0x1000000" if X86_NUMAQ
 	default "0x200000" if X86_64
 	default "0x100000"
diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -119,7 +119,31 @@ void __init early_res_to_bootmem(unsigne
 		printk(KERN_INFO "  early res: %d [%lx-%lx] %s\n", i,
 			final_start, final_end - 1, r->name);
 		reserve_bootmem_generic(final_start, final_end - final_start);
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		if (r->start < __PHYSICAL_START)
+			add_memory_region(r->start, r->end - r->start,
+					  E820_RESERVED_RAM_FAILED);
+#endif			
 	}
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+	/* solve E820_RESERVED_RAM vs E820_RESERVED_RAM_FAILED conflicts */
+	update_e820();
+
+	/* now reserve E820_RESERVED_RAM */
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type != E820_RESERVED_RAM)
+			continue;
+		final_start = max(start, (unsigned long) ei->addr);
+		final_end = min(end, (unsigned long) (ei->addr + ei->size));
+		if (final_start >= final_end)
+			continue;
+		reserve_bootmem_generic(final_start, final_end - final_start);
+		printk(KERN_INFO " bootmem reserved RAM: [%lx-%lx]\n",
+		       final_start, final_end - 1);
+	}
+#endif
 }
 
 /* Check for already reserved areas */
@@ -336,6 +360,16 @@ void __init e820_reserve_resources(void)
 		case E820_RAM:	res->name = "System RAM"; break;
 		case E820_ACPI:	res->name = "ACPI Tables"; break;
 		case E820_NVS:	res->name = "ACPI Non-volatile Storage"; break;
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		case E820_RESERVED_RAM_FAILED:
+			res->name = "reserved RAM failed";
+			break;
+		case E820_RESERVED_RAM:
+			memset(__va(e820.map[i].addr),
+			       POISON_FREE_INITMEM, e820.map[i].size);
+			res->name = "reserved RAM";
+			break;
+#endif
 		default:	res->name = "reserved";
 		}
 		res->start = e820.map[i].addr;
@@ -377,6 +411,16 @@ void __init e820_mark_nosave_regions(voi
 	}
 }
 
+static int __init e820_is_not_ram(int type)
+{
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+	return type != E820_RAM && type != E820_RESERVED_RAM &&
+		type != E820_RESERVED_RAM_FAILED;
+#else
+	return type != E820_RAM;
+#endif	
+}
+
 /*
  * Finds an active region in the address range from start_pfn to end_pfn and
  * returns its range in ei_startpfn and ei_endpfn for the e820 entry.
@@ -395,11 +439,11 @@ static int __init e820_find_active_regio
 		return 0;
 
 	/* Check if max_pfn_mapped should be updated */
-	if (ei->type != E820_RAM && *ei_endpfn > max_pfn_mapped)
+	if (e820_is_not_ram(ei->type) && *ei_endpfn > max_pfn_mapped)
 		max_pfn_mapped = *ei_endpfn;
 
 	/* Skip if map is outside the node */
-	if (ei->type != E820_RAM || *ei_endpfn <= start_pfn ||
+	if (e820_is_not_ram(ei->type) || *ei_endpfn <= start_pfn ||
 				    *ei_startpfn >= end_pfn)
 		return 0;
 
@@ -495,6 +539,14 @@ static void __init e820_print_map(char *
 		case E820_NVS:
 			printk(KERN_CONT "(ACPI NVS)\n");
 			break;
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		case E820_RESERVED_RAM:
+			printk(KERN_CONT "(reserved RAM)\n");
+			break;
+		case E820_RESERVED_RAM_FAILED:
+			printk(KERN_CONT "(reserved RAM failed)\n");
+			break;
+#endif
 		default:
 			printk(KERN_CONT "type %u\n", e820.map[i].type);
 			break;
@@ -724,9 +776,31 @@ static int __init copy_e820_map(struct e
 		u64 end = start + size;
 		u32 type = biosmap->type;
 
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		/* make space for two more low-prio types */
+		type += 2;
+#endif
+
 		/* Overflow in 64 bits? Ignore the memory map. */
 		if (start > end)
 			return -1;
+
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		if (type == E820_RAM) {
+			if (end <= __PHYSICAL_START) {
+				add_memory_region(start, size,
+						  E820_RESERVED_RAM);
+				continue;
+			}
+			if (start < __PHYSICAL_START) {
+				add_memory_region(start,
+						  __PHYSICAL_START-start,
+						  E820_RESERVED_RAM);
+				size -= __PHYSICAL_START-start;
+				start = __PHYSICAL_START;
+			}
+		}
+#endif
 
 		add_memory_region(start, size, type);
 	} while (biosmap++, --nr_map);
diff --git a/include/asm-x86/e820.h b/include/asm-x86/e820.h
--- a/include/asm-x86/e820.h
+++ b/include/asm-x86/e820.h
@@ -4,10 +4,19 @@
 #define E820MAX	128		/* number of entries in E820MAP */
 #define E820NR	0x1e8		/* # entries in E820MAP */
 
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+#define E820_RESERVED_RAM 1
+#define E820_RESERVED_RAM_FAILED 2
+#define E820_RAM	3
+#define E820_RESERVED	4
+#define E820_ACPI	5
+#define E820_NVS	6
+#else
 #define E820_RAM	1
 #define E820_RESERVED	2
 #define E820_ACPI	3
 #define E820_NVS	4
+#endif
 
 #ifndef __ASSEMBLY__
 struct e820entry {
diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
--- a/include/asm-x86/page_64.h
+++ b/include/asm-x86/page_64.h
@@ -29,6 +29,7 @@
 #define __PAGE_OFFSET           _AC(0xffff810000000000, UL)
 
 #define __PHYSICAL_START	CONFIG_PHYSICAL_START
+#define __PHYSICAL_OFFSET	(__PHYSICAL_START-0x200000)
 #define __KERNEL_ALIGN		0x200000
 
 /*
@@ -51,7 +52,7 @@
  * Kernel image size is limited to 512 MB (see level2_kernel_pgt in
  * arch/x86/kernel/head_64.S), and it is mapped here:
  */
-#define KERNEL_IMAGE_SIZE	(512 * 1024 * 1024)
+#define KERNEL_IMAGE_SIZE	(512 * 1024 * 1024 + __PHYSICAL_OFFSET)
 #define KERNEL_IMAGE_START	_AC(0xffffffff80000000, UL)
 
 #ifndef __ASSEMBLY__
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -145,7 +145,7 @@ static inline void native_pgd_clear(pgd_
 #define VMALLOC_START    _AC(0xffffc20000000000, UL)
 #define VMALLOC_END      _AC(0xffffe1ffffffffff, UL)
 #define VMEMMAP_START	 _AC(0xffffe20000000000, UL)
-#define MODULES_VADDR    _AC(0xffffffffa0000000, UL)
+#define MODULES_VADDR    (0xffffffffa0000000UL+__PHYSICAL_OFFSET)
 #define MODULES_END      _AC(0xfffffffffff00000, UL)
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
 
diff --git a/include/asm-x86/trampoline.h b/include/asm-x86/trampoline.h
--- a/include/asm-x86/trampoline.h
+++ b/include/asm-x86/trampoline.h
@@ -13,7 +13,11 @@ extern unsigned long init_rsp;
 extern unsigned long init_rsp;
 extern unsigned long initial_code;
 
+#ifndef CONFIG_RESERVE_PHYSICAL_START
 #define TRAMPOLINE_BASE 0x6000
+#else
+#define TRAMPOLINE_BASE 0x90000 /* move it next to 640k */
+#endif
 extern unsigned long setup_trampoline(void);
 
 #endif /* __ASSEMBLY__ */
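
The type renumbering in the e820.h hunk above ("make space for two more
low-prio types") can be illustrated with a small standalone sketch. It
assumes that update_e820() resolves overlapping entries in favour of the
highest numbered type, which is how sanitize_e820_map() is documented to
behave; the helper names bios_to_kernel_type and resolve_overlap are
hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Type values as defined by the patch when CONFIG_RESERVE_PHYSICAL_START=y. */
enum e820_type {
	E820_RESERVED_RAM = 1,
	E820_RESERVED_RAM_FAILED = 2,
	E820_RAM = 3,
	E820_RESERVED = 4,
	E820_ACPI = 5,
	E820_NVS = 6
};

/* BIOS types start at 1 (RAM); the patch shifts them up by two. */
uint32_t bios_to_kernel_type(uint32_t bios_type)
{
	assert(bios_type <= UINT32_MAX - 2); /* guard against wraparound */
	return bios_type + 2;
}

/*
 * Assumed overlap rule: when two entries cover the same range, the
 * higher numbered type wins.
 */
uint32_t resolve_overlap(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}
```

Under that assumption, an early reservation marked
E820_RESERVED_RAM_FAILED overrides the E820_RESERVED_RAM entry it
overlaps, and both new types lose to every type the BIOS reported.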

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-06-25  1:18     ` [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware Andrea Arcangeli, Andrea Arcangeli
@ 2008-07-29 12:11       ` Andrea Arcangeli, Andrea Arcangeli
  2008-07-29 12:43         ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli, Andrea Arcangeli @ 2008-07-29 12:11 UTC (permalink / raw)
  To: benami, Avi Kivity, Andrew Morton
  Cc: amit.shah, kvm, aliguori, allen.m.kay, muli, linux-mm, andi, tglx, mingo

The "reserved RAM" can be mapped by virtualization software with
/dev/mem to create a 1:1 mapping between guest physical (bus) address
and host physical (bus) address. This will allow pci passthrough with
DMA for the guest using the ram with the 1:1 mapping. The only detail
to take care of is the ram marked "reserved RAM failed". The
virtualization software must create for the guest an e820 map that
only includes the "reserved RAM" regions but if the guest touches
memory with guest physical address in the "reserved RAM failed" ranges
(linux guest will do that even if the ram isn't present in the e820
map), it should provide that as ram and map it with a non linear
mapping. This should allow any linux kernel to run fine and hopefully
any other OS too.
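
The rule described above can be sketched as a lookup from guest
physical address to the kind of backing the virtualization software
would provide. This is only an illustration: reserved_range,
gpa_backing and classify_gpa are hypothetical names, and the sample
table is copied from the "reserved RAM" and "reserved RAM failed"
lines of the svm /proc/iomem listing in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum gpa_backing {
	GPA_NOT_RESERVED, /* not in any reserved RAM range */
	GPA_IDENTITY,     /* "reserved RAM": guest physical == host physical */
	GPA_NONLINEAR     /* "reserved RAM failed": back with other host memory */
};

struct reserved_range {
	uint64_t start;
	uint64_t end; /* inclusive */
	int failed;   /* nonzero for "reserved RAM failed" */
};

/* Ranges taken from the svm listing in this message. */
const struct reserved_range svm_ranges[] = {
	{ 0x00000000, 0x00000fff, 1 },
	{ 0x00001000, 0x00005fff, 0 },
	{ 0x00006000, 0x00007fff, 1 },
	{ 0x00008000, 0x0009efff, 0 },
	{ 0x00100000, 0x0fffffff, 0 },
};
const size_t svm_nr_ranges = sizeof(svm_ranges) / sizeof(svm_ranges[0]);

/* Decide how a guest physical address would have to be backed. */
enum gpa_backing classify_gpa(const struct reserved_range *ranges,
			      size_t nr, uint64_t gpa)
{
	size_t i;

	assert(ranges != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (gpa < ranges[i].start || gpa > ranges[i].end)
			continue;
		return ranges[i].failed ? GPA_NONLINEAR : GPA_IDENTITY;
	}
	return GPA_NOT_RESERVED;
}
```

Only GPA_IDENTITY ranges would appear in the guest e820 map, while
GPA_NONLINEAR pages are provided on demand if the guest touches them.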

svm ~ # cat /proc/iomem |head -n 20
00000000-00000fff : reserved RAM failed
00001000-00005fff : reserved RAM
00006000-00007fff : reserved RAM failed
00008000-0009efff : reserved RAM
0009f000-0009ffff : reserved
000cd600-000cffff : pnp 00:0d
000f0000-000fffff : reserved
00100000-0fffffff : reserved RAM
10000000-3dedffff : System RAM
  10000000-10329ab2 : Kernel code
  10329ab3-104933e7 : Kernel data
  104f5000-10558e67 : Kernel bss
3dee0000-3dee2fff : ACPI Non-volatile Storage
3dee3000-3deeffff : ACPI Tables
3def0000-3defffff : reserved
3dff0000-3ffeffff : pnp 00:0d
e0000000-efffffff : reserved
fa000000-fbffffff : PCI Bus #01
  fa000000-fbffffff : 0000:01:05.0
fda00000-fdbfffff : PCI Bus #01
svm ~ # hexdump /dev/mem | grep -C2 'cccc cccc cccc cccc'
00007e0 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 cccc cccc cccc cccc cccc cccc cccc cccc
*
0006000 a5a5 a5a5 8ec8 8ed8 8ec0 66d0 06c7 0000
--
*
0007ff0 0000 0000 0000 0000 3063 1000 0000 0000
0008000 cccc cccc cccc cccc cccc cccc cccc cccc
*
009f000 0002 0000 0000 0000 0000 0000 0000 0000
--
00fffe0 6000 3c03 45e7 0184 0500 0082 01c0 0223
00ffff0 5bea 00e0 31f0 2f32 3931 302f 0037 12fc
0100000 cccc cccc cccc cccc cccc cccc cccc cccc
*
10000000 8d48 f92d ffff 48ff ed81 0000 1000 8948
^C
svm ~ #
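
The hexdump shows the "reserved RAM" ranges filled with 0xcc, the value
of POISON_FREE_INITMEM that the patch writes with memset(). A minimal
sketch of the same check on a buffer, assuming the caller has already
read or mapped the range (for example from /dev/mem); the helper name
first_unpoisoned is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Same value as POISON_FREE_INITMEM in include/linux/poison.h. */
#define POISON_FREE_INITMEM 0xcc

/*
 * Return the offset of the first byte that does not hold the poison
 * value, or len if the whole buffer is still poisoned.
 */
size_t first_unpoisoned(const unsigned char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL || len == 0);

	for (i = 0; i < len; i++)
		if (buf[i] != POISON_FREE_INITMEM)
			return i;
	return len;
}
```

A result equal to len means nothing has written to the range since the
kernel poisoned it at boot.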

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---

This is a port to current linux-2.6.git of the previous reserved-ram
patch. Let me know if there's a chance to get this acked and
included. Anything that isn't at compile time would require much
bigger changes just to parse the command line at 16bit realmode time
to know where to relocate the kernel dynamically. Because 1:1 is a
corner case feature required only by some users, this is the least
intrusive approach. It also has some limits: it can't reserve more
than 1g (2g with a few more changes), but this should be ok for a long
time as the virtualized 1:1 guest doesn't need to be huge, just a desktop.

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1276,8 +1276,36 @@ config CRASH_DUMP
 	  (CONFIG_RELOCATABLE=y).
 	  For more details see Documentation/kdump/kdump.txt
 
+config RESERVE_PHYSICAL_START
+	bool "Reserve all RAM below PHYSICAL_START (EXPERIMENTAL)"
+	depends on !RELOCATABLE && X86_64
+	help
+	  This makes the kernel use only RAM above __PHYSICAL_START.
+	  All memory below __PHYSICAL_START will be left unused and
+	  marked as "reserved RAM" in /proc/iomem. The few special
+	  pages that can't be relocated at addresses above
+	  __PHYSICAL_START and that can't be guaranteed to be unused
+	  by the running kernel will be marked "reserved RAM failed"
+	  in /proc/iomem. Those may or may not be used by the kernel
+	  (for example SMP trampoline pages would only be used if
+	  CPU hotplug is enabled).
+
+	  The "reserved RAM" can be mapped by virtualization software
+	  with /dev/mem to create a 1:1 mapping between guest physical
+	  (bus) address and host physical (bus) address. This will
+	  allow PCI passthrough with DMA for the guest using the RAM
+	  with the 1:1 mapping. The only detail to take care of is the
+	  RAM marked "reserved RAM failed". The virtualization
+	  software must create for the guest an e820 map that only
+	  includes the "reserved RAM" regions but if the guest touches
+	  memory with guest physical address in the "reserved RAM
+	  failed" ranges (Linux guest will do that even if the RAM
+	  isn't present in the e820 map), it should provide that as
+	  RAM and map it with a non-linear mapping. This should allow
+	  any Linux kernel to run fine and hopefully any other OS too.
+
 config PHYSICAL_START
-	hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
+	hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP || RESERVE_PHYSICAL_START)
 	default "0x1000000" if X86_NUMAQ
 	default "0x200000" if X86_64
 	default "0x100000"
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -148,6 +148,14 @@ void __init e820_print_map(char *who)
 		case E820_NVS:
 			printk(KERN_CONT "(ACPI NVS)\n");
 			break;
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		case E820_RESERVED_RAM:
+			printk(KERN_CONT "(reserved RAM)\n");
+			break;
+		case E820_RESERVED_RAM_FAILED:
+			printk(KERN_CONT "(reserved RAM failed)\n");
+			break;
+#endif
 		default:
 			printk(KERN_CONT "type %u\n", e820.map[i].type);
 			break;
@@ -384,10 +392,28 @@ static int __init __append_e820_map(stru
 		u64 end = start + size;
 		u32 type = biosmap->type;
 
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		/* make space for two more low-prio types */
+		type += 2;
+#endif
+
 		/* Overflow in 64 bits? Ignore the memory map. */
 		if (start > end)
 			return -1;
 
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		if (type == E820_RAM) {
+			if (end <= __PHYSICAL_START)
+				type = E820_RESERVED_RAM;
+			else if (start < __PHYSICAL_START) {
+				e820_add_region(start,
+						__PHYSICAL_START-start,
+						E820_RESERVED_RAM);
+				size -= __PHYSICAL_START-start;
+				start = __PHYSICAL_START;
+			}
+		}
+#endif
 		e820_add_region(start, size, type);
 
 		biosmap++;
@@ -893,7 +919,35 @@ void __init early_res_to_bootmem(u64 sta
 			final_start, final_end);
 		reserve_bootmem_generic(final_start, final_end - final_start,
 				BOOTMEM_DEFAULT);
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		if (r->start < __PHYSICAL_START)
+			e820_add_region(r->start, r->end - r->start,
+					E820_RESERVED_RAM_FAILED);
+#endif			
 	}
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+	/* solve E820_RESERVED_RAM vs E820_RESERVED_RAM_FAILED conflicts */
+	update_e820();
+
+	/* now reserve E820_RESERVED_RAM */
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type != E820_RESERVED_RAM)
+			continue;
+		final_start = max(start, (u64) ei->addr);
+		final_end = min(end, (u64) (ei->addr + ei->size));
+		if (final_start >= final_end)
+			continue;
+		if (reserve_bootmem_generic(final_start,
+					    final_end - final_start,
+					    BOOTMEM_DEFAULT))
+			printk(KERN_ERR "reserved physical start failure\n");
+		else
+			printk(KERN_INFO " bootmem reserved RAM: [%Lx-%Lx]\n",
+			       final_start, final_end - 1);
+	}
+#endif
 }
 
 /* Check for already reserved areas */
@@ -1095,6 +1149,17 @@ unsigned long __init e820_end_of_low_ram
 {
 	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
 }
+
+static int __init e820_is_not_ram(int type)
+{
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+	return type != E820_RAM && type != E820_RESERVED_RAM &&
+		type != E820_RESERVED_RAM_FAILED;
+#else
+	return type != E820_RAM;
+#endif	
+}
+
 /*
  * Finds an active region in the address range from start_pfn to last_pfn and
  * returns its range in ei_startpfn and ei_endpfn for the e820 entry.
@@ -1115,8 +1180,8 @@ int __init e820_find_active_region(const
 		return 0;
 
 	/* Skip if map is outside the node */
-	if (ei->type != E820_RAM || *ei_endpfn <= start_pfn ||
-				    *ei_startpfn >= last_pfn)
+	if (e820_is_not_ram(ei->type) || *ei_endpfn <= start_pfn ||
+	    *ei_startpfn >= last_pfn)
 		return 0;
 
 	/* Check for overlaps */
@@ -1260,6 +1325,10 @@ static inline const char *e820_type_to_s
 	case E820_RAM:	return "System RAM";
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+	case E820_RESERVED_RAM_FAILED: return "reserved RAM failed";
+	case E820_RESERVED_RAM: return "reserved RAM";
+#endif
 	default:	return "reserved";
 	}
 }
@@ -1289,6 +1358,12 @@ void __init e820_reserve_resources(void)
 		res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
 		insert_resource(&iomem_resource, res);
 		res++;
+
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+		if (e820.map[i].type == E820_RESERVED_RAM)
+			memset(__va(e820.map[i].addr),
+			       POISON_FREE_INITMEM, e820.map[i].size);
+#endif
 	}
 
 	for (i = 0; i < e820_saved.nr_map; i++) {
diff --git a/include/asm-x86/e820.h b/include/asm-x86/e820.h
--- a/include/asm-x86/e820.h
+++ b/include/asm-x86/e820.h
@@ -39,10 +39,19 @@
 
 #define E820NR	0x1e8		/* # entries in E820MAP */
 
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+#define E820_RESERVED_RAM 1
+#define E820_RESERVED_RAM_FAILED 2
+#define E820_RAM	3
+#define E820_RESERVED	4
+#define E820_ACPI	5
+#define E820_NVS	6
+#else
 #define E820_RAM	1
 #define E820_RESERVED	2
 #define E820_ACPI	3
 #define E820_NVS	4
+#endif
 
 /* reserved RAM used by kernel itself */
 #define E820_RESERVED_KERN        128
diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
--- a/include/asm-x86/page_64.h
+++ b/include/asm-x86/page_64.h
@@ -35,6 +35,7 @@
 #define __PAGE_OFFSET           _AC(0xffff880000000000, UL)
 
 #define __PHYSICAL_START	CONFIG_PHYSICAL_START
+#define __PHYSICAL_OFFSET	(__PHYSICAL_START-0x200000)
 #define __KERNEL_ALIGN		0x200000
 
 /*
@@ -57,7 +58,7 @@
  * Kernel image size is limited to 512 MB (see level2_kernel_pgt in
  * arch/x86/kernel/head_64.S), and it is mapped here:
  */
-#define KERNEL_IMAGE_SIZE	(512 * 1024 * 1024)
+#define KERNEL_IMAGE_SIZE	(512 * 1024 * 1024 + __PHYSICAL_OFFSET)
 #define KERNEL_IMAGE_START	_AC(0xffffffff80000000, UL)
 
 #ifndef __ASSEMBLY__
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -150,7 +150,7 @@ static inline void native_pgd_clear(pgd_
 #define VMALLOC_START    _AC(0xffffc20000000000, UL)
 #define VMALLOC_END      _AC(0xffffe1ffffffffff, UL)
 #define VMEMMAP_START	 _AC(0xffffe20000000000, UL)
-#define MODULES_VADDR    _AC(0xffffffffa0000000, UL)
+#define MODULES_VADDR    (0xffffffffa0000000UL+__PHYSICAL_OFFSET)
 #define MODULES_END      _AC(0xfffffffffff00000, UL)
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
 
diff --git a/include/asm-x86/trampoline.h b/include/asm-x86/trampoline.h
--- a/include/asm-x86/trampoline.h
+++ b/include/asm-x86/trampoline.h
@@ -13,7 +13,11 @@ extern unsigned long init_rsp;
 extern unsigned long init_rsp;
 extern unsigned long initial_code;
 
+#ifndef CONFIG_RESERVE_PHYSICAL_START
 #define TRAMPOLINE_BASE 0x6000
+#else
+#define TRAMPOLINE_BASE 0x90000 /* move it next to 640k */
+#endif
 extern unsigned long setup_trampoline(void);
 
 #endif /* __ASSEMBLY__ */
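
The way the __append_e820_map() hunk above splits a BIOS RAM entry at
__PHYSICAL_START can be sketched in standalone C. This is an
illustration, not kernel code: struct region and split_ram_at are
hypothetical names, and physical_start stands in for the compile-time
__PHYSICAL_START.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Type values from the patch (CONFIG_RESERVE_PHYSICAL_START=y). */
enum { E820_RESERVED_RAM = 1, E820_RAM = 3 };

struct region {
	uint64_t start;
	uint64_t size;
	int type;
};

/*
 * Split the RAM region [start, start + size) at physical_start.
 * Writes one or two regions to out and returns how many were written;
 * returns 0 if start + size overflows 64 bits.
 */
size_t split_ram_at(uint64_t start, uint64_t size, uint64_t physical_start,
		    struct region out[2])
{
	uint64_t end = start + size;
	size_t n = 0;

	assert(out != NULL);

	if (start > end)
		return 0; /* overflow, the patch ignores the map here */

	if (end <= physical_start) {
		/* Entirely below the kernel: all of it becomes reserved RAM. */
		out[n++] = (struct region){ start, size, E820_RESERVED_RAM };
		return n;
	}
	if (start < physical_start) {
		/* Straddles the boundary: reserve the low part. */
		out[n++] = (struct region){ start, physical_start - start,
					    E820_RESERVED_RAM };
		size -= physical_start - start;
		start = physical_start;
	}
	out[n++] = (struct region){ start, size, E820_RAM };
	return n;
}
```

With physical_start = 0x10000000, as in the svm listing, a BIOS RAM
entry covering 0x100000-0x3fffffff becomes reserved RAM up to
0x0fffffff and System RAM from 0x10000000.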

* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-29 12:11       ` Andrea Arcangeli, Andrea Arcangeli
@ 2008-07-29 12:43         ` Andi Kleen
  2008-07-29 12:53           ` Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2008-07-29 12:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: benami, Avi Kivity, Andrew Morton, amit.shah, kvm, aliguori,
	allen.m.kay, muli, linux-mm, andi, tglx, mingo

> This is a port to current linux-2.6.git of the previous reserved-ram
> patch. Let me know if there's a chance to get this acked and
> included. Anything that isn't at compile time would require much

I still think runtime would be far better. Nobody really wants
a proliferation of more weird special kernel images.

> bigger changes just to parse the command line at 16bit realmode time

You could always do it with kexec if you think 16bit real mode is
too hard.

-Andi

* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-29 12:43         ` Andi Kleen
@ 2008-07-29 12:53           ` Andrea Arcangeli
  2008-07-29 13:17             ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli @ 2008-07-29 12:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: benami, Avi Kivity, Andrew Morton, amit.shah, kvm, aliguori,
	allen.m.kay, muli, linux-mm, tglx, mingo

On Tue, Jul 29, 2008 at 02:43:17PM +0200, Andi Kleen wrote:
> > This is a port to current linux-2.6.git of the previous reserved-ram
> > patch. Let me know if there's a chance to get this acked and
> > included. Anything that isn't at compile time would require much
> 
> I still think runtime would be far better. Nobody really wants
> a proliferation of more weird special kernel images.

Not for the usage we're interested in, but a compile time option would
surely prevent distros from taking advantage of the feature. The
question is whether distros need to take advantage of the feature in
the first place instead of sticking with VT-d. 1:1 isn't secure
virtualization, as the guest must be trusted, so it's not necessarily
a good model to deploy to users that don't know exactly what they're
doing.

> > bigger changes just to parse the command line at 16bit realmode time
> 
> You could always do it with kexec if you think 16bit real mode is
> too hard.

It's not too hard, but it'll add bloat to the 16 bit part of the boot
in the bzImage. It's likely simpler than kexec and surely more
user-friendly to set up for the end user.

In any case, my patch does the needed bits with regard to the e820
map. An incremental patch can add the parsing of the bootloader and
switch the Kconfig dependency from PHYSICAL_START to RELOCATABLE. The
e820 file will then have to replace the __PHYSICAL_START define with
something else and that's all.

I mean it's not entirely backwards to provide a smaller and simpler
compile time approach initially, and then to go where you want to go
incrementally later if we're sure there's enough userbase needing 1:1.

I'm not so interested to go there right now, because while this code
is useful right now because the majority of systems out there lacks
VT-d/iommu, I suspect this code could be nuked in the long
run when all systems will ship with that, which is why I kept it all
under #ifdef, and the changes to the other files outside ifdef are
bugfixes needed if you want to kexec-relocate above 40m or so that
should be kept.
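
For reference, the effect of the page_64.h and pgtable_64.h changes on
the kernel virtual layout can be worked out with plain arithmetic. The
sketch below only evaluates the macros from the diff for a given
CONFIG_PHYSICAL_START; struct kernel_layout and compute_layout are
hypothetical names, and 0x200000 is the X86_64 default from the Kconfig
hunk.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_PHYSICAL_START 0x200000ULL           /* X86_64 Kconfig default */
#define KERNEL_IMAGE_START     0xffffffff80000000ULL
#define MODULES_END            0xfffffffffff00000ULL

struct kernel_layout {
	uint64_t physical_offset;   /* __PHYSICAL_OFFSET */
	uint64_t kernel_image_size; /* KERNEL_IMAGE_SIZE */
	uint64_t modules_vaddr;     /* MODULES_VADDR */
	uint64_t modules_len;       /* MODULES_LEN */
};

/* Evaluate the patched macros for one value of CONFIG_PHYSICAL_START. */
struct kernel_layout compute_layout(uint64_t physical_start)
{
	struct kernel_layout l;

	assert(physical_start >= DEFAULT_PHYSICAL_START);

	l.physical_offset = physical_start - DEFAULT_PHYSICAL_START;
	l.kernel_image_size = 512ULL * 1024 * 1024 + l.physical_offset;
	l.modules_vaddr = 0xffffffffa0000000ULL + l.physical_offset;
	l.modules_len = MODULES_END - l.modules_vaddr;
	return l;
}
```

For CONFIG_PHYSICAL_START=0x20000000 this gives a 0x3fe00000 byte
kernel image mapping that ends exactly at MODULES_VADDR
(0xffffffffbfe00000), leaving 0x40100000 bytes for modules instead of
the default 0x5ff00000.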

* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-29 12:53           ` Andrea Arcangeli
@ 2008-07-29 13:17             ` Andi Kleen
  2008-07-30  6:20               ` Amit Shah
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2008-07-29 13:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, benami, Avi Kivity, Andrew Morton, amit.shah, kvm,
	aliguori, allen.m.kay, muli, linux-mm, tglx, mingo

> I'm not so interested to go there right now, because while this code
> is useful right now because the majority of systems out there lacks
> VT-d/iommu, I suspect this code could be nuked in the long
> run when all systems will ship with that, which is why I kept it all

Actually at least on Intel platforms and if you exclude the lowest end
VT-d is shipping universally for quite some time now. If you
buy an Intel box today or bought it in the last year the chances are pretty
high that it has VT-d support.

> under #ifdef, and the changes to the other files outside ifdef are
> bugfixes needed if you want to kexec-relocate above 40m or so that
> should be kept.

You should split that out then into a separate patch.

-Andi

* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-29 13:17             ` Andi Kleen
@ 2008-07-30  6:20               ` Amit Shah
  2008-07-30 12:27                 ` Andi Kleen
  2008-07-30 13:58                 ` Andrea Arcangeli
  0 siblings, 2 replies; 11+ messages in thread
From: Amit Shah @ 2008-07-30  6:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrea Arcangeli, benami, Avi Kivity, Andrew Morton, kvm,
	aliguori, allen.m.kay, muli, linux-mm, tglx, mingo

* On Tuesday 29 July 2008 18:47:35 Andi Kleen wrote:
> > I'm not so interested to go there right now, because while this code
> > is useful right now because the majority of systems out there lacks
> > VT-d/iommu, I suspect this code could be nuked in the long
> > run when all systems will ship with that, which is why I kept it all
>
> Actually at least on Intel platforms and if you exclude the lowest end
> VT-d is shipping universally for quite some time now. If you
> buy a Intel box today or bought it in the last year the chances are pretty
> high that it has VT-d support.

I think you mean VT-x, which is virtualization extensions for the x86 
architecture. VT-d is virtualization extensions for devices (IOMMU).

Amit

* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-30  6:20               ` Amit Shah
@ 2008-07-30 12:27                 ` Andi Kleen
  2008-07-30 13:58                 ` Andrea Arcangeli
  1 sibling, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2008-07-30 12:27 UTC (permalink / raw)
  To: Amit Shah
  Cc: Andi Kleen, Andrea Arcangeli, benami, Avi Kivity, Andrew Morton,
	kvm, aliguori, allen.m.kay, muli, linux-mm, tglx, mingo

On Wed, Jul 30, 2008 at 11:50:43AM +0530, Amit Shah wrote:
> * On Tuesday 29 July 2008 18:47:35 Andi Kleen wrote:
> > > I'm not so interested to go there right now, because while this code
> > > is useful right now because the majority of systems out there lacks
> > > VT-d/iommu, I suspect this code could be nuked in the long
> > > run when all systems will ship with that, which is why I kept it all
> >
> > Actually at least on Intel platforms and if you exclude the lowest end
> > VT-d is shipping universally for quite some time now. If you
> > buy a Intel box today or bought it in the last year the chances are pretty
> > high that it has VT-d support.
> 
> I think you mean VT-x, which is virtualization extensions for the x86 
> architecture. VT-d is virtualization extensions for devices (IOMMU).

No, I really mean VT-d. The modern, not very low-end Intel IOHubs all have it.

-Andi

* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-30  6:20               ` Amit Shah
  2008-07-30 12:27                 ` Andi Kleen
@ 2008-07-30 13:58                 ` Andrea Arcangeli
  2008-07-30 14:16                   ` Dor Laor
  2008-07-30 14:22                   ` FUJITA Tomonori
  1 sibling, 2 replies; 11+ messages in thread
From: Andrea Arcangeli @ 2008-07-30 13:58 UTC (permalink / raw)
  To: Amit Shah
  Cc: Andi Kleen, benami, Avi Kivity, Andrew Morton, kvm, aliguori,
	allen.m.kay, muli, linux-mm, tglx, mingo

On Wed, Jul 30, 2008 at 11:50:43AM +0530, Amit Shah wrote:
> * On Tuesday 29 July 2008 18:47:35 Andi Kleen wrote:
> > > I'm not so interested to go there right now, because while this code
> > > is useful right now because the majority of systems out there lacks
> > > VT-d/iommu, I suspect this code could be nuked in the long
> > > run when all systems will ship with that, which is why I kept it all
> >
> > Actually at least on Intel platforms and if you exclude the lowest end
> > VT-d is shipping universally for quite some time now. If you
> > buy a Intel box today or bought it in the last year the chances are pretty
> > high that it has VT-d support.
> 
> I think you mean VT-x, which is virtualization extensions for the x86 
> architecture. VT-d is virtualization extensions for devices (IOMMU).

I think Andi understood VT-d correctly. But even if he is right that
anyone buying a new VT-x system today is also almost guaranteed to get
a VT-d motherboard (which I doubt, unless you buy some really expensive
toy), there are large existing installations of VT-x systems that lack
VT-d. With recent dual/quad-core CPUs they are very fast and will stay
in use for the next couple of years, and nobody will upgrade just the
motherboard to get pci-passthrough.


* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-30 13:58                 ` Andrea Arcangeli
@ 2008-07-30 14:16                   ` Dor Laor
  2008-07-30 14:38                     ` Andrea Arcangeli
  2008-07-30 14:22                   ` FUJITA Tomonori
  1 sibling, 1 reply; 11+ messages in thread
From: Dor Laor @ 2008-07-30 14:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Amit Shah, Andi Kleen, benami, Avi Kivity, Andrew Morton, kvm,
	aliguori, allen.m.kay, muli, linux-mm, tglx, mingo

Andrea Arcangeli wrote:
> On Wed, Jul 30, 2008 at 11:50:43AM +0530, Amit Shah wrote:
>   
>> * On Tuesday 29 July 2008 18:47:35 Andi Kleen wrote:
>>     
>>>> I'm not so interested to go there right now, because while this code
>>>> is useful right now because the majority of systems out there lacks
>>>> VT-d/iommu, I suspect this code could be nuked in the long
>>>> run when all systems will ship with that, which is why I kept it all
>>>>         
>>> Actually at least on Intel platforms and if you exclude the lowest end
>>> VT-d is shipping universally for quite some time now. If you
>>> buy a Intel box today or bought it in the last year the chances are pretty
>>> high that it has VT-d support.
>>>       
>> I think you mean VT-x, which is virtualization extensions for the x86 
>> architecture. VT-d is virtualization extensions for devices (IOMMU).
>>     
>
> I think Andi understood VT-d right but even if he was right that every
> reader of this email that is buying a new VT-x system today is also
> almost guaranteed to get a VT-d motherboard (which I disagree unless
> you buy some really expensive toy), there are current large
> installations of VT-x systems that lacks VT-d and that with recent
> current dual/quadcore cpus are very fast and will be used for the next
> couple of years and they will not upgrade just the motherboard to use
> pci-passthrough.

In addition, KVM is used in embedded systems too, and things move more
slowly there. We know of a specific production use case that demands
1:1 mapping and can't use VT-d.


* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-30 13:58                 ` Andrea Arcangeli
  2008-07-30 14:16                   ` Dor Laor
@ 2008-07-30 14:22                   ` FUJITA Tomonori
  1 sibling, 0 replies; 11+ messages in thread
From: FUJITA Tomonori @ 2008-07-30 14:22 UTC (permalink / raw)
  To: andrea
  Cc: amit.shah, andi, benami, avi, akpm, kvm, aliguori, allen.m.kay,
	muli, linux-mm, tglx, mingo

On Wed, 30 Jul 2008 15:58:46 +0200
Andrea Arcangeli <andrea@qumranet.com> wrote:

> On Wed, Jul 30, 2008 at 11:50:43AM +0530, Amit Shah wrote:
> > * On Tuesday 29 July 2008 18:47:35 Andi Kleen wrote:
> > > > I'm not so interested to go there right now, because while this code
> > > > is useful right now because the majority of systems out there lacks
> > > > VT-d/iommu, I suspect this code could be nuked in the long
> > > > run when all systems will ship with that, which is why I kept it all
> > >
> > > Actually at least on Intel platforms and if you exclude the lowest end
> > > VT-d is shipping universally for quite some time now. If you
> > > buy a Intel box today or bought it in the last year the chances are pretty
> > > high that it has VT-d support.
> > 
> > I think you mean VT-x, which is virtualization extensions for the x86 
> > architecture. VT-d is virtualization extensions for devices (IOMMU).
> 
> I think Andi understood VT-d right but even if he was right that every
> reader of this email that is buying a new VT-x system today is also
> almost guaranteed to get a VT-d motherboard (which I disagree unless
> you buy some really expensive toy), there are current large
> installations of VT-x systems that lacks VT-d and that with recent
> current dual/quadcore cpus are very fast and will be used for the next
> couple of years and they will not upgrade just the motherboard to use
> pci-passthrough.

Today, very inexpensive desktops (for example, Dell OptiPlex 755) have
VT-d support.


* Re: [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware
  2008-07-30 14:16                   ` Dor Laor
@ 2008-07-30 14:38                     ` Andrea Arcangeli
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Arcangeli @ 2008-07-30 14:38 UTC (permalink / raw)
  To: Dor Laor
  Cc: Amit Shah, Andi Kleen, benami, Avi Kivity, Andrew Morton, kvm,
	aliguori, allen.m.kay, muli, linux-mm, tglx, mingo

On Wed, Jul 30, 2008 at 05:16:06PM +0300, Dor Laor wrote:
> In addition KVM is used in embedded too and things are slower there, we 
> know of a specific use case (production) that demands
> 1:1 mapping and can't use VT-d

Since you mentioned this ;), I'll take the opportunity to add that those
embedded usages are the ones that are totally fine with a compile-time
passthrough-guest-ram decision instead of a boot-time decision. Those
host kernels will likely carry RT patches (KVM works great with
preempt-RT indeed), so the compile-time ram selection is the least of
their problems, as you can imagine ;). So you can see my patch as an
embedded-build option, similar to "Configure standard kernel features
(for small systems)", and no distro ships new kernels with that
feature on either.

Then, if we decide 1:1 should have a larger userbase than only the
people who know what they're doing (i.e. a 1:1 guest can destroy the
linux-hypervisor), we can always add a bit of strtol parsing to the
16-bit kernel loader.
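
The "bit of strtol parsing" mentioned above could look roughly like the
sketch below. It is only an illustration in ordinary userspace C: the
option name "reserved_ram=" and the function parse_reserved_ram() are
hypothetical and not part of the posted patch, and the real 16-bit boot
code has no libc, so it would need its own number parser.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical option name: the patch under discussion fixes the
 * address at compile time and defines no such boot parameter.
 */
#define RESERVED_RAM_OPT "reserved_ram="

/*
 * Look for "reserved_ram=<addr>" in a kernel-style command line and
 * store the address in *addr. Returns 0 on success, -1 if the option
 * is missing or malformed (*addr is left untouched in that case).
 */
static int parse_reserved_ram(const char *cmdline, unsigned long long *addr)
{
	const size_t optlen = strlen(RESERVED_RAM_OPT);
	const char *p = cmdline;
	unsigned long long val;
	char *end;

	while ((p = strstr(p, RESERVED_RAM_OPT)) != NULL) {
		/* Accept the match only at the start of a word. */
		if (p == cmdline || p[-1] == ' ')
			break;
		p += optlen;
	}
	if (p == NULL)
		return -1;
	p += optlen;

	/* Reject signs and leading blanks that strtoull would accept. */
	if (*p < '0' || *p > '9')
		return -1;

	errno = 0;
	/* Base 0 accepts decimal, octal (leading 0) and hex (leading 0x). */
	val = strtoull(p, &end, 0);
	if (end == p || errno == ERANGE)
		return -1;
	/* The number must end at a space or at the end of the string. */
	if (*end != '\0' && *end != ' ')
		return -1;

	*addr = val;
	return 0;
}
```

With a command line such as "root=/dev/sda1 reserved_ram=0x20000000"
the function would return 0 and store 0x20000000, the relocation
address used as an example in the original patch description.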


end of thread, other threads:[~2008-07-30 14:38 UTC | newest]

Thread overview: 11+ messages
     [not found] <1214232737-21267-1-git-send-email-benami@il.ibm.com>
     [not found] ` <1214232737-21267-2-git-send-email-benami@il.ibm.com>
     [not found]   ` <20080625005739.GM6938@duo.random>
2008-06-25  1:18     ` [PATCH] reserved-ram for pci-passthrough without VT-d capable hardware Andrea Arcangeli, Andrea Arcangeli
2008-07-29 12:11       ` Andrea Arcangeli, Andrea Arcangeli
2008-07-29 12:43         ` Andi Kleen
2008-07-29 12:53           ` Andrea Arcangeli
2008-07-29 13:17             ` Andi Kleen
2008-07-30  6:20               ` Amit Shah
2008-07-30 12:27                 ` Andi Kleen
2008-07-30 13:58                 ` Andrea Arcangeli
2008-07-30 14:16                   ` Dor Laor
2008-07-30 14:38                     ` Andrea Arcangeli
2008-07-30 14:22                   ` FUJITA Tomonori
