___        /  /\    
      /__/\      /  /::\   
      \__\:\    /  /:/\:\  
      /  /::\  /  /:/  \:\ 
   __/  /:/\/ /__/:/ \__\:\
  /__/\/:/~~  \  \:\ /  /:/
  \  \::/      \  \:\  /:/ 
   \  \:\       \  \:\/:/  
    \__\/        \  \::/   

Arm IntrO Challenge

Writing a demo for qemu-system-arm -M vexpress-a9, and possibly for other ARM emulators, or even for real devices if you have (or want to buy) an ARM dev board.

Last year we held this challenge. Optimizing, algorithmic power, and learning some low level details combined into some good fun. Hence it was not completely surprising that, when you were asked to choose what this year's challenge should be, doing the same on an ARM platform won out. In addition to the reasons for last year's challenge, many of us were also looking for a good excuse to get a bit more hands-on with the ARM architecture. Do give it a try.

Unlike x86/PC systems, ARM systems typically don't come with a BIOS. Hence there is no universal way of doing some of the most basic input and output tasks. That's why for now we will only focus on the system emulated by 'qemu-system-arm -M vexpress-a9'. We need to learn what hardware (such as the PrimeCell CLCD) is inside this virtual system. In order to communicate with additional hardware, a cpu on an x86/PC system has a separate address space known as "io ports" (IN/OUT instructions and /proc/ioports), and memory mapped io (MMIO). An ARM system relies more heavily on MMIO. ARM systems also have the option to be extended with coprocessors. This mechanism is sometimes used to talk to more hardware as well. The main difference though is that we need to learn how to express ourselves in ARM assembly instructions. ARM is a RISC architecture, and even though the lines have blurred a bit, this will require us to be more verbose.

Challenge Rules: The 564 bytes include the ones to set up the framebuffer. In this document we included code to do so that takes up 64 bytes. However you are welcome to optimize those to win a couple of bytes.



bin source


bin source


bin source

invalid: blasty&inz (oversize 720 bytes)

bin source

This concludes the challenge. A big thanks to everybody that participated. I've decided to crown timpwn as the "winner". I'll probably still process late entries if they show up.

Quickstart Guide and Documentation

Step 1: required tools

A complete ARM capable gnu toolchain would definitely be useful. However in this documentation we are only going to use an assembler. This exposes us to more low level details, and means that we can get by with just compiling binutils to target ARM.
wget http://ftp.gnu.org/gnu/binutils/binutils-2.22.tar.bz2
tar xjvf binutils-2.22.tar.bz2
./binutils-2.22/configure --target=arm-linux --prefix=$PWD
make install
This should have created all the required tools, namely bin/arm-linux-{as,objdump,objcopy,ld}. Adding these paths to the PATH and LD_LIBRARY_PATH environment variables will make things easier.

There are a few alternatives. Freefull reports that fasmarm works well too. In case you already have radare2 installed you can also (dis)assemble arm instructions using rasm2 -a arm.

Step 2: Getting Code To Execute

There is no BIOS to initialize hardware, read the first sector of the disks, find where to execute code from, etc. Qemu only allows you to start it using a "-kernel" option. It loads the binary into memory at 0x60010000 and uses a kernel calling convention for passing the command line and the location of a potential initrd. However we will ignore all that and just pretend that the file we supply is part of rom where code execution starts. Execution will start at the first byte of the file.
qemu-system-arm -M vexpress-a9 -kernel code.bin
The first step is to see if the hardware is alive. To do this we will write a little program that outputs the letter 'A' on a serial connection.
e3a03a09        mov     r3, $0x9000
e3413000        movt    r3, $0x1000
e3a02041        mov     r2, $0x41
e5c32000        strb    r2, [r3]
ARM instructions are 32 bits wide, unless you are talking about Thumb code, which we will cover later. The first 4 bits of an ARM instruction contain a condition code. "Always" is encoded as 1110, so expect a lot of our instructions to start with 0xe. This is already a pretty important feature of the ARM architecture that we came across here: most ARM instructions (not Thumb) can be conditional. Similar to cmov on x86, these instructions only execute based on bits in a flags register. This saves a number of branch instructions (on x86 these would be the jcc instructions; what x86 calls jump and call is called branch on ARM). This allegedly compensates a bit for the lack of branch prediction that can compete with x86's.

So what does this code do? It stores 0x10009000 into the r3 register (there are 16 32bit registers, r0 through r15; r13, r14 and r15 are special: respectively the stack pointer, link register, and program counter). It then stores 'A' into r2, and finally stores the low byte of r2 at the address held in r3. Basically all we're doing is writing 1 byte to a memory location. And yes, in plain ARMv7 instructions that takes 4 instructions and 16 bytes. Let's see if it works.
echo -ne "\x09\x3a\xa0\xe3\x00\x30\x41\xe3\x41\x20\xa0\xe3\x00\x20\xc3\xe5" > acode.bin
qemu-system-arm -M vexpress-a9 -serial $(tty) -kernel acode.bin
If all is well you saw a black window pop up (we pretend this is the display connected to the device) and an 'A' appear in the terminal. We gave qemu the option "-serial $(tty)", which makes it connect the first serial port to our terminal. You might be wondering why writing to this address makes a character appear on the serial port of the virtual device and then on our terminal. Let us first clarify that this address is not standardised. It's mostly up to the system designers, and when we learn how to talk to the display we'll see how to get the right address. The crux of it is that qemu emulates a simple serial controller which uses MMIO to communicate, and writing into the first register of its range makes it output a character.
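The echo command above just writes the four instruction words in little-endian byte order. As a minimal sketch of the same step in Python (the helper name assemble_words is made up for this illustration), which also shows the 0xe condition field in the top 4 bits of each word:

```python
import struct

# Instruction words copied from the listing above.
WORDS = [0xe3a03a09, 0xe3413000, 0xe3a02041, 0xe5c32000]

def assemble_words(words):
    """Pack 32-bit instruction words as little-endian bytes."""
    return b"".join(struct.pack("<I", w) for w in words)

# Same bytes as the echo command; write them to acode.bin to reproduce the file.
code = assemble_words(WORDS)

# The top 4 bits of every word hold the condition code (0xe = "always").
conditions = [w >> 28 for w in WORDS]
```

Writing code to a file named acode.bin should give the same 16 bytes as the echo command.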

Well to be perfectly honest I fibbed a bit for the sake of simplicity. You can actually do it in 3 instructions or 12 bytes.
e3a03a09        mov     r3, $0x9000
e3a02041        mov     r2, $0x41
e7c32e02        strb    r2, [r3, r2, lsl $28]
Because ARM uses fixed length 32bit instructions it can only encode constants in a subset of those bits. In fact 12 bits are reserved for the second operand in many instructions (and one more bit to indicate which format this is in). How these 12 bits are used depends a bit on which instruction it is. For ldr/str instructions these 12 bits hold an unsigned offset of 0 to 4095, and a separate bit in the instruction says whether it is added to or subtracted from the base register. Or, instead of a 12bit immediate value, 4 bits are used to encode a register (r0->r15) and the remaining 8 bits are used to encode a variety of shift and rotate operations.
In a lot of other instructions such a plain 12bit range would be a very inefficient use of bits. So in instructions like AND, instead of the 12 bit immediate value an 8bit immediate value is stored and 4 bits are used as a rotate count. The byte is rotated right by twice that count, which allows it to be placed at any even bit position in the register.
This was exploited here to reduce the code size. The strb instruction stores r2 at (r3 + (r2 << 28)), which modulo 32bit comes down to 0x10009000. In comparison, storing 0x41 at address 0x10009000 on x86 using its variable length encoding takes 8 bytes.
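To get a feel for which constants fit, here is a small Python sketch of the 8bit-plus-rotation scheme described above. The function names are invented for the illustration, and it only models the classic ARM data-processing immediate, not movw/movt:

```python
MASK = 0xffffffff

def rol(value, bits):
    """Rotate a 32-bit value left by the given number of bits."""
    bits %= 32
    return ((value << bits) | (value >> (32 - bits))) & MASK

def encode_immediate(value):
    """Return (rot, imm8) so that imm8 rotated right by 2*rot equals value.

    Returns None when the constant does not fit in one instruction.
    """
    for rot in range(16):
        imm8 = rol(value, 2 * rot)  # undo a right rotation by 2*rot
        if imm8 <= 0xff:
            return rot, imm8
    return None

# The address trick from the 3 instruction version: r3 + (r2 << 28).
address = (0x9000 + (0x41 << 28)) & MASK
```

For 0x9000 this yields a rotate field of 0xa and a byte of 0x09, which matches the low 12 bits (a09) of e3a03a09 in the listing. 0x10009000 has no such encoding, which is why the first version needed movt.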

We'll also want an endless loop instruction so when debugging we can halt execution at specific places.
[condition code][101][store return address][24bit offset]
The first byte is 0xea because we want an unconditional branch and we are not interested in storing the return address (a link operation in ARM speak). The offset is multiplied by 4, since all instructions are the same size: your instructions should always start aligned on a 4 byte boundary and those are the only places you can jump to. To branch to itself the instruction uses the offset 0xfffffe, giving the instruction word 0xeafffffe. You may wonder why 0xfffffe, which is -2. After multiplication it is -8. The offset is relative to the address of the branch instruction +8, because on ARM the program counter reads as two instructions ahead, a leftover from the original pipeline design.
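The offset arithmetic is easy to get wrong by hand, so here is a hedged Python sketch of it. The function name encode_branch and the example addresses are assumptions made for the illustration, and it only covers the "always" condition:

```python
def encode_branch(instr_addr, target_addr, link=False):
    """Encode an unconditional ARM B (or BL when link=True) instruction."""
    pc = instr_addr + 8                   # the pc reads as two instructions ahead
    offset = (target_addr - pc) >> 2      # offset in words, may be negative
    opcode = 0xeb if link else 0xea       # cond=1110, 101, link bit
    return (opcode << 24) | (offset & 0xffffff)

# An endless loop is a branch whose target is its own address.
endless_loop = encode_branch(0x60010000, 0x60010000)
```

Since the offset is relative, the result is the same wherever the loop is placed.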

Step 3: Setting Up Debugging

If you want some more advanced debugging capabilities than writing debug messages to the serial port, and using endless loops to halt execution, you can install gdb with support for remote arm targets. You can verify this by typing "set arch" in gdb and checking if arm is listed among the possible architectures.

Same as on x86 we can use gdb to debug code running on our qemu system. We simply add -s and -S to the command line. -S prevents the cpu from being started, and -s makes qemu listen for gdb on port 1234.
qemu-system-arm -M vexpress-a9 -serial $(tty) -kernel acode.bin -s -S
In another terminal we can then connect gdb to the device using these instructions:
set arch arm
target remote :1234

Step 4: Discovering The Hardware

Well, from the name of the system we already know that it uses a Versatile Express base board with a Cortex-A9 cpu, which in turn has an interface to the AMBA bus. That bus connects it, among other things, to the serial connection hardware, namely the PrimeCell pl011, which we have used before to output the A character. Perhaps more importantly it also connects to the PrimeCell pl111 clcd, or color lcd: the display, in other words. Other parts of the system can be found in the manual.

Of course I didn't trawl through all those documents before starting. I cheated a bit by loading an ARM linux kernel onto the virtual device and checking /proc/iomem. The linux kernel gets its information from arch/arm/mach-vexpress, which contains the hard coded addresses for this board.
	}, {	/* PL011 UART0 */
		.dev_id		= "10009000.uart",
		.clk		= &osc2_clk,
The v2m.c file builds what is called a device tree. It is a representation of the system that the rest of the kernel uses to know where things are. This feels ugly, especially since we're used to code that runs on one x86 machine running everywhere. So it may sound strange that we took a step back in evolution. However that's pretty much the case. ARM systems come in all flavors, developed under license by many companies. Up until now it has not been possible to make a single linux kernel that boots on different ARM systems. So we too can't do better than use hard coded addresses for our hardware "discovery". There are efforts to improve on this. One of these is the use of dtb files (device tree blobs). These are compiled versions of dts/dtsi files like the ones found in arch/arm/boot/dts. These will make the process more data driven. The bootloader could then just pass the right system description to the kernel, allowing it to boot in more places.

Beyond this, ARMv7 is a malleable platform. It has a lot of optional extensions, and different choices for settings. That is great if you're building a system, but if you're writing low level code you need to know what's available. Fortunately the cpu does provide an identification scheme as to what features are implemented, similar to the CPUID instruction on x86. I provide a program (source, screenshot) which reads the different registers and prints out the meaning of the different fields. Press space to advance in the program.

Step 5: Plotting Pixels

Having learned what video hardware is present, we need to know how to initialize it and use it. And it turns out it's very simple. timpwn beat me to the punch and got to plotting the pixels first. We'll take a few moments understanding this and compressing it down to a couple of assembly instructions.

We have a PrimeCell pl111 clcd, which the documentation tells us is mapped to 0x10020000. This means that the device's registers are accessible starting from that address. We can look at the linux kernel code to get a concise list of the registers and their offsets relative to the base of 0x10020000. First we'll concern ourselves with these ones.
#define CLCD_TIM0               0x00000000
#define CLCD_TIM1               0x00000004
#define CLCD_TIM2               0x00000008
These need to be set to the correct values for our chosen resolution. For now we pick 640x480 24bpp. The manual describes what the different bits in these registers mean. The linux kernel calculates them here. Acknowledging all that, we'll just use the ones that were documented here for another device. The values are 0x3F1F3F9C, 0x090B61DF and 0x067F1800. Next we'll set
#define CLCD_UBAS               0x00000010
This register needs to contain the base address of our framebuffer. Using a framebuffer will make a region of memory act much the same as the 0xa0000 region did in the x86 demo. Namely, when data is written to a memory location it is interpreted as pixel color information and it is plotted on the screen. The RAM of the device is mapped into the address space at 0x60000000, and our kernel is placed at offset 0x10000. We'll start our framebuffer at 0x60020000. In ARM assembly this looks like
	mov r1, $0
	movt r1, $0x1002

	movw r3, $0x3F9C
	movt r3, $0x3F1F
	str r3, [r1, $0x0]

	movw r3, $0x61DF
	movt r3, $0x090B
	str r3, [r1, $0x4]

	mov r3, $0x1800
	movt r3, $0x067F
	str r3, [r1, $0x8]

	mov r2, $0
	movt r2, $0x6002 
	str r2, [r1, $0x10]
Lastly we have to write to the control register. The relevant parts from the kernel header:
#define CLCD_PL111_CNTL         0x00000018

#define CNTL_LCDEN              (1 << 0)
#define CNTL_LCDBPP1            (0 << 1)
#define CNTL_LCDBPP2            (1 << 1)
#define CNTL_LCDBPP4            (2 << 1)
#define CNTL_LCDBPP8            (3 << 1)
#define CNTL_LCDBPP16           (4 << 1)
#define CNTL_LCDBPP16_565       (6 << 1)
#define CNTL_LCDBPP16_444       (7 << 1)
#define CNTL_LCDBPP24           (5 << 1)
#define CNTL_LCDBW              (1 << 4)
#define CNTL_LCDTFT             (1 << 5)
#define CNTL_LCDMONO8           (1 << 6)
#define CNTL_LCDDUAL            (1 << 7)
#define CNTL_BGR                (1 << 8)
#define CNTL_BEBO               (1 << 9)
#define CNTL_BEPO               (1 << 10)
#define CNTL_LCDPWR             (1 << 11)
#define CNTL_LCDVCOMP(x)        ((x) << 12)
#define CNTL_LDMAFIFOTIME       (1 << 15)
#define CNTL_WATERMARK          (1 << 16)
A bitwise OR of the options is expected in the CLCD_PL111_CNTL register. We want to enable the device (CNTL_LCDEN), using 24 bits per pixel (CNTL_LCDBPP24), for a TFT output (CNTL_LCDTFT), and finally give it some power (CNTL_LCDPWR). ORed together that gets us 0x82B.
	movw r3, $0x082B
	str r3, [r1, $0x18]
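The 0x82B value can be double checked with a few lines of Python. The constants are copied from the kernel header excerpt above; only the variable name control is made up:

```python
# Bit definitions copied from the kernel header excerpt above.
CNTL_LCDEN = 1 << 0
CNTL_LCDBPP24 = 5 << 1
CNTL_LCDTFT = 1 << 5
CNTL_LCDPWR = 1 << 11

# Enable + 24bpp + TFT + power, as described in the text.
control = CNTL_LCDEN | CNTL_LCDBPP24 | CNTL_LCDTFT | CNTL_LCDPWR
```

Swapping CNTL_LCDBPP24 for another CNTL_LCDBPP value from the header is how a different pixel depth would be selected, although the rest of this document assumes 24bpp.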
Finally we add a little test code, that just fills the screen to get a result.
mov r0, $0
mov r3, $0x12c000	@ 640 * 480 * 4 bytes

.redbars:
mov r1, r0
and r1, r1, $0xFF 
str r1, [r2, r0]
add r0, r0, $4
cmp r0, r3
bne .redbars

hcf: B hcf
You can download the file in one piece.


Here is an example demo. It should display:
Near the end you'll find a constant 23, which you can change for some variations on this effect. I also provided a graphical Hello World by loading the vgafont. Which is too large to be usable in this challenge but you can't go without helloworld.

Step 6: Creating Source Files

We want to write code specific for this device. As such we need to tell the assembler what we're targeting.
.arch armv7-a
.fpu neon
.syntax unified

.global _start
The first two lines tell the assembler what sort of hardware, and hence which instructions, are available. The third line states that we'll use the UAL (Unified Assembler Language) dialect. It's the new one and is required to use all of the Thumb-2 instructions. However this does mean we need to pay more attention, as a lot of the examples on the internet use pre-UAL syntax. Using UAL properly, the assembler should be able to generate both ARM and Thumb-2 instructions from our code. A concise summary of all instructions available can be found in the UAL Quick Reference. This however does not cover the VFP/SIMD extensions.

The second part just defines a start symbol. While this isn't really necessary it prevents warnings being thrown by the next step.

miniBill provided a basic arm syntax highlighting for kate (kde advanced text editor). I thought this was a great idea so I expanded it to cover the complete UAL syntax. The highlighting file is available.

Even if you don't use kate, it may be useful to check the list of pseudo-ops supported. The important ones are also covered in this quick reference (.req, .balign, .word, .byte, .include, ...). The .req pseudo-op is very interesting. It allows you to assign temporary names to registers. When writing x86 code, our brains are pretty good at remembering where we store the different concepts in the 8 general purpose registers. But when you get r0-r15 and s0-s31 it can be quite a challenge. Rather than adding tons of comments you can do:
xpos .req r10
add xpos, xpos, $10

Step 7: Building

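Here's a Makefile that will automate the build process. (Keep in mind that Makefiles are whitespace sensitive: the command lines must start with a tab.) NAME and TOOLPREFIX are expected to be set by you, for example make NAME=demo TOOLPREFIX=arm-linux when using the binutils built in step 1, with demo standing for whatever your source file is called.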

all: ${NAME}.bin
	qemu-system-arm -M vexpress-a9 -m 128M -kernel $^

%.elf: %.s
	${TOOLPREFIX}-as $^ -o $@
%.out: %.elf
	${TOOLPREFIX}-ld -Ttext=0x60010000 $^ -o $@
%.bin: %.out
	${TOOLPREFIX}-objcopy -O binary $^ $@
%.objdump: %.out
	${TOOLPREFIX}-objdump -d $^  > $@
The linker step is required to use the literal pool feature. It sets the address where our code is loaded (0x60010000). This allows you to write things similar to:
ldr r3, =0xffffffff
ldr r4, =0xcafebabe

ldr r1, =array
array:
.word 0xdeadc0de
In the first case the assembler knows that it can generate the constant 0xffffffff and as such replaces the instruction with mvn r3, #0. It cannot generate the 32bit constant 0xcafebabe inside one 32bit instruction, so it will find a place to store the value in between the code, and load it with an offset relative to the program counter. Normally the assembler should be capable of finding the best place for these literals. However you can control it with the .ltorg pseudo-op. In the last case it will load the address of the array into r1. This doesn't save any space but it's more convenient.
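The decision the assembler makes for ldr rX, =constant can be approximated in Python. This is only a sketch of the reasoning in the paragraph above, not the actual gas logic: it ignores movw, and the names fits_immediate and load_strategy are made up:

```python
MASK = 0xffffffff

def fits_immediate(value):
    """True if value is an 8-bit constant rotated right by an even amount."""
    for rot in range(0, 32, 2):
        rotated = ((value << rot) | (value >> (32 - rot))) & MASK
        if rotated <= 0xff:
            return True
    return False

def load_strategy(value):
    """Guess how 'ldr rX, =value' could be implemented."""
    if fits_immediate(value):
        return "mov"
    if fits_immediate(~value & MASK):
        return "mvn"               # load the bitwise inverse instead
    return "literal pool"          # pc-relative load of a stored word
```

With the constants from the example, 0xffffffff comes out as mvn (its inverse is 0) and 0xcafebabe ends up in the literal pool.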

Further Documentation: Instruction Sets

A good general introduction slideshow. The emulated cortex-a9 implements ARMv7-a. This profile supports two main instruction sets: ARM instructions and Thumb-2 instructions. Thumb-2 extends the Thumb instructions which were all 16bit, with 32bit ones to offer more complete functionality. The goal of Thumb/Thumb-2 is to increase the code density. Doing more with fewer bytes. However this also makes it more tedious to write by hand.

You can switch between the two using blx instruction (among others).
	blx thumb_code 
	@thumb instructions here
	blx arm_again
ARM and thumb instruction sets have extensions, for floating point (VFPv3) and SIMD(single instruction multiple data).

Besides these two main ones there's more. Jazelle DBX is a third execution state. It allows for direct execution of java bytecode. There is a BXJ instruction which branches to java. Complicated bytecode is handed back to software. However this mode has been largely deprecated, with newer processors handing back more (or all) of the bytecode to the software layer in favor of the next option. As I understand it, documentation on this state is lacking. Qemu does not implement this extension.

Instead of Jazelle DBX, a fourth execution state was created: ThumbEE, which modifies Thumb-2 instructions to make them better suited as a compilation target for dynamic languages. A large part of these changes involve automated null pointer checks. It's marketed as Jazelle RCT. You can switch to ThumbEE state from Thumb state using ENTERX/LEAVEX.
blx start
	@thumbEE code here

These last two modes probably don't suit our purpose too well, but may be interesting to explore.

Further Documentation: Privilege Levels and Processor Modes

Analogous to the ring model on x86, ARM processors have different privilege levels. However they are numbered in the reverse order: PL0 is the least privileged level. If we peek at the CPSR via an attached gdb, we see that
(gdb) info register cpsr
cpsr           0x400001d3       1073742291
ARMv7-a has a possible 11 modes. The lowest 5 bits of this register indicate the mode we are currently in. Mode "10011" is the supervisor (svc) mode. This is the mode processors start in after a reset, and it runs at PL1. Only the Hypervisor part of the optional Virtualization extensions (not implemented in the qemu device) can run at PL2. The next bit controls whether we are in a state executing ARM (0) or Thumb (1) instructions. The next 3 bits in turn disable fast interrupt, interrupt, and asynchronous data abort exceptions. Exceptions are the main way to switch between modes. Usermode calls back to supervisor mode using the SVC instruction (this is the UAL mnemonic, it used to be known as SWI). The processor needs to know where to resume execution after a mode switch. Analogous to the idt on x86 there is a Vector Table. It can have a different base address, but the default for our system is that it's mapped into the address space at 0x00000000. Privileged modes can also change mode using the CPS (Change Processor State) instruction.
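As a sanity check on the value gdb printed, here is a small Python sketch that pulls the CPSR apart. The function and field names are made up, and only the bits discussed here are decoded:

```python
def decode_cpsr(cpsr):
    """Split a CPSR value into the fields discussed in the text."""
    return {
        "mode": cpsr & 0x1f,                    # bits 4..0
        "thumb": bool((cpsr >> 5) & 1),         # T bit: 0 = ARM, 1 = Thumb
        "fiq_masked": bool((cpsr >> 6) & 1),    # F bit
        "irq_masked": bool((cpsr >> 7) & 1),    # I bit
        "abort_masked": bool((cpsr >> 8) & 1),  # A bit
        "nzcv": (cpsr >> 28) & 0xf,             # condition flags N, Z, C, V
    }

# The value reported by gdb right after startup.
state = decode_cpsr(0x400001d3)
```

For 0x400001d3 this gives mode 0b10011 (svc), ARM state, all three exception masks set, and only the Z flag set.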

Further Documentation: Stack

There are twice as many general purpose registers on ARM as on x86, so it took us this long to develop a need for a stack. The stack pointer is stored in register r13, which has the alias "sp". The great thing about push and pop is that you can push or pop a selection of registers at a time. I modified the display initialization code to illustrate how this might work.
bl stack
.word 0x10020000	@ clcd pl111 base
.word 0x3f1f3f9c
.word 0x090b61df
.word 0x067f1800
.word 0x0000082b
.word 0x60020000	@ framebuffer base
stack:
mov sp, lr
pop {r2-r6,r8}
str r3, [r2, $0x0]
str r4, [r2, $0x4]
str r5, [r2, $0x8]
str r8, [r2, $0x10]
This only shaves off three instructions, but every byte counts. If you were to write the rest of the demo in thumb code the initial jump could serve the double purpose of switching the instruction set. As shown the POP instruction takes a comma-separated list of registers and ranges of registers enclosed in braces. POP and PUSH are special cases of the LDM(load multiple) and STM(store multiple) instructions respectively. For example the pop {r2-r6,r8} in the snippet above is equivalent to LDMia r13!, {r2-r6,r8}.
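What the pop does to the registers can be modelled in a few lines of Python. This is a simplified sketch of LDMIA with writeback; the base address used for the data words is a hypothetical one (the word right after a bl placed at 0x60010000):

```python
def ldmia(memory, sp, registers):
    """Simulate LDMIA sp!, {registers}.

    The lowest numbered register is loaded from the lowest address,
    regardless of the order the registers are written in.
    """
    loaded = {}
    for reg in sorted(registers):
        loaded[reg] = memory[sp]
        sp += 4
    return loaded, sp

BASE = 0x60010004  # hypothetical address of the first .word
WORDS = [0x10020000, 0x3f1f3f9c, 0x090b61df, 0x067f1800, 0x0000082b, 0x60020000]
memory = {BASE + 4 * i: w for i, w in enumerate(WORDS)}

# pop {r2-r6,r8}
regs, new_sp = ldmia(memory, BASE, [2, 3, 4, 5, 6, 8])
```

After the pop, r2 holds the clcd base and r8 the framebuffer base, and the stack pointer has moved past the six data words to the next instruction's address.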

Further Documentation: Calling Convention

The default calling convention on ARM is to place the first four arguments to a function in r0, r1, r2 and r3. Arguments that don't fit in these registers are passed on the stack. Functions have to preserve registers r4 through r11. Since r13, r14 and r15 have a special purpose, that only leaves r12 to freely mess with, in addition to the argument registers when you don't need them any more. The return value is placed in r0.

You are free to completely ignore this calling convention to save bytes. However if you're just starting out these are a good default.

Further Documentation: Conditional Execution

As mentioned before there is an important feature to be aware of called conditional execution. ARM instructions, when suffixed with the letter S, will set flags. There are 4 main flags: N (negative), C (carry), V (overflow) and Z (zero or equal). 4 bits in the instruction encoding determine whether or not an instruction is executed, based on which of these are set. The flags are stored in the active PSR (program status register). Thumb instructions don't have the space to include 4 bit conditions. Instead Thumb uses a special instruction, IT (if-then):
cmp r0, r1
ITE lt
addlt r0, r0, $1
subge r0, r0, $1
We still add the conditional execution suffixes to the instructions, because we're writing UAL, which can be assembled both into Thumb-2 and ARM. In the latter case the ITE instruction will not be translated. There are additional bits in the CPSR which keep track of the ITE condition when in Thumb mode.
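To make the lt/ge pair in the snippet above concrete, here is a Python sketch of how cmp sets the flags and how the two conditions are evaluated from them. The helper names are made up, and only lt and ge are modelled:

```python
MASK = 0xffffffff

def cmp_flags(a, b):
    """Flags (N, Z, C, V) set by 'cmp a, b', which computes a - b."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                          # carry set means no borrow
    v = ((a ^ b) & (a ^ result)) >> 31       # signed overflow
    return n, z, c, v

def condition_holds(cond, flags):
    """Evaluate the signed conditions used in the example."""
    n, z, c, v = flags
    return {"lt": n != v, "ge": n == v}[cond]

def step(r0, r1):
    """Model of: cmp r0, r1 / ITE lt / addlt r0, r0, 1 / subge r0, r0, 1."""
    flags = cmp_flags(r0, r1)
    if condition_holds("lt", flags):
        return (r0 + 1) & MASK
    return (r0 - 1) & MASK
```

Exactly one of the two instructions takes effect, since ge is the inverse of lt. Note that the comparison is signed: 0xffffffff counts as -1 here.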

Floating point instructions from the VFP extension also support conditional execution. However they use separate comparison instructions (VCMP, ...) and a separate set of flags, which are stored in the top 4 bits of the FPSCR register. This register also controls other features of the fpu, for example which rounding mode to use.

Further Documentation: Coprocessor

As mentioned before one way to talk to hardware is through "coprocessors". The ARM and Thumb-2 instruction sets have instructions to communicate with what is called a coprocessor, namely MCR and MRC (and the related MCR2, MCRR, MCRR2, MRC2, MRRC, MRRC2), which move data between general purpose registers and registers of coprocessors. LDC, LDC2 and STC, STC2 allow these registers to be loaded from and stored to memory. MSR, MRS in turn provide access to coprocessor system registers. Lastly there is the SYS instruction.

There is a maximum of 16 coprocessors, named cp0 through to cp15. cp15 is for system control and identification. cp14 provides the debug interface, and the ThumbEE and Jazelle DBX configuration. cp10 and cp11 allow the control of the advanced SIMD and VFP.

Coprocessors cp14 and cp15 mainly provide access to some 160 registers. It's slightly comparable to the MSRs on x86. They offer a wide range of configurable settings, information, and features. An example setting is the EE bit in the system control register (SCTLR), which determines what endianness the cpu will use after an exception occurs (but you should be aware that not a lot of hardware actually implements this, most ARM devices out there are little endian only). The cpuid scheme mentioned before uses MIDR (main identification register) and friends. An example of a feature using the coprocessor is DTLBIALL. Any write to this register will invalidate all entries in the data translation lookaside buffer (TLB).

In the next section we will discuss the VFP instructions. These are simply prettier names for instructions based on these using cp10 or cp11. The coprocessor number is encoded in the third least significant nibble. If you look at the encoded instructions from the next section you'll find all 0xa's there.
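Extracting that nibble is a one-liner. The sketch below assumes two encodings that you may want to confirm against your own objdump output: 0xeee81a10 for VMSR FPEXC, r1 and 0xee100f10 for mrc p15, 0, r0, c0, c0, 0 (reading MIDR). The function name is made up:

```python
def coprocessor_number(instruction):
    """Coprocessor field: bits 11..8, the third least significant nibble."""
    return (instruction >> 8) & 0xf

# Assumed encodings, see the note above.
VMSR_FPEXC_R1 = 0xeee81a10   # VFP system register access, goes through cp10
MRC_MIDR_R0 = 0xee100f10     # system control coprocessor cp15

vfp_cp = coprocessor_number(VMSR_FPEXC_R1)
sys_cp = coprocessor_number(MRC_MIDR_R0)
```

Running arm-linux-objdump -d on an assembled file is the easy way to compare real encodings against this.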

Further Documentation: Floating Points(VFPv3) and SIMD(neon)

The ARM and Thumb-2 instruction sets are expanded with instructions for these extensions. They operate on a separate set of registers (analogous to the mmx/xmm registers on x86). However both VFPv3 and SIMD share the same registers. There are 32 double word (64bit) registers. VFPv3 sees these as D0 -> D31. Half of them can also be addressed as S0 -> S31 (single precision floats). Neon can call these either D0 -> D31 when it uses them as double words, or Q0 -> Q15 when it uses them in pairs as 128bit registers. To get a feel for this you can connect to the emulated device via gdb and issue the command: info registers all.
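The overlap between the S, D and Q names can be written down as simple index arithmetic. A minimal Python sketch, with helper names invented for the illustration:

```python
def s_to_d(s):
    """Return (D register index, half) holding single precision register S<s>.

    half is 0 for the low 32 bits and 1 for the high 32 bits.
    Only D0-D15 are covered by S0-S31.
    """
    if not 0 <= s <= 31:
        raise ValueError("S registers run from S0 to S31")
    return s // 2, s % 2

def q_to_d(q):
    """Return the pair of D register indices that make up Q<q>."""
    if not 0 <= q <= 15:
        raise ValueError("Q registers run from Q0 to Q15")
    return 2 * q, 2 * q + 1
```

So writing S1 changes the upper half of D0, and with it the lowest quarter of Q0, which is worth keeping in mind when mixing VFP and neon code.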

Since we already want to use UAL for the Thumb-2 instructions, we'll try to use UAL here as well. Here this means that most of our floating point instructions will start with a V (VADD.F32, VABS.F32, ...) rather than an F (FADDs, FABSs, ...).

Since we are using the system right after startup, we still need to enable the VFPv3
ldr r1, =0x40000000	@ VFPEnable
fmxr fpexc, r1
then we can start to use it. In fact fmxr is the pre-UAL version of the instruction. The UAL version of the instruction would be VMSR FPEXC, r1. However there is a bug in this version of binutils, which only supports the VMSR instruction in unprivileged mode. It will refuse to assemble access to the FPEXC register unless you use a binutils version based off a cvs version after March of this year, or manually patch the do_vmsr() function in gas/config/tc-arm.c.
ldr r0,=floats
VLDR.32 s0,[r0]
VLDR.32 s1,[r0, #4]

VMUL.f32 s0,s0,s1

vcvt.s32.f32 s0,s0
vmov r5, s0

hcf: b hcf

floats:
.float 20.0
.float 2.5
This example will leave 50 in the r5 register. Here is a reference for vfp instructions. While powerful, VFPv3 only implements basic functionality. It does not provide, for example, trigonometric functions, so we need to write custom implementations. Approximation and function modelling is an interesting field of maths/computer science in its own right. I provide a quick and dirty circle drawing example, which uses good old Taylor.
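For reference, this is roughly what a Taylor approximation of sine looks like. It is a Python sketch of the general technique, not a transcription of the circle example, and the function name and the default number of terms are arbitrary choices:

```python
import math

def taylor_sin(x, terms=6):
    """Approximate sin(x) with the first few terms of its Taylor series.

    sin(x) = x - x^3/3! + x^5/5! - ...
    Each term is derived from the previous one, which maps nicely onto
    a short loop of multiply and divide instructions.
    """
    term = x
    total = x
    for k in range(1, terms):
        term *= -x * x / ((2 * k) * (2 * k + 1))
        total += term
    return total

# Worst case error over a quarter turn, sampled in 1/100 steps.
worst = max(abs(taylor_sin(i * math.pi / 200) - math.sin(i * math.pi / 200))
            for i in range(101))
```

The series converges fastest near 0, so in a size-limited demo it helps to reduce the angle to a small range first and use symmetry for the rest of the circle. Fewer terms save bytes at the cost of accuracy.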

Send your submissions to bla@netgarage.org or link me over irc.netgarage.org+6697