## Running a Z-machine on an AVR

This blog has moved and won’t be updated here anymore. It’s now at blog.dsp.id.au.

Can you make a Z-machine console – a virtual machine for text adventure games –  complete with screen and keyboard on an AVR?

No. Well, probably not on an ATmega328P.

I got the code for ZIP, and stripped out all of the code for reading files and displaying stuff on the screen.  This let me compile it with avr-gcc.

I looked at the result of objdump:

    $ avr-objdump -h zip_linux

    zip_linux:     file format elf32-avr

    Sections:
    Idx Name          Size      VMA       LMA       File off  Algn
      0 .data         000001e4  00800100  0000579c  00005830  2**0
                      CONTENTS, ALLOC, LOAD, DATA
      1 .text         0000579c  00000000  00000000  00000094  2**1
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
      2 .bss          00000ac8  008002e4  008002e4  00005a14  2**0
                      ALLOC
      3 .stab         0000dcec  00000000  00000000  00005a14  2**2
                      CONTENTS, READONLY, DEBUGGING
      4 .stabstr      00003aad  00000000  00000000  00013700  2**0
                      CONTENTS, READONLY, DEBUGGING
      5 .comment      00000022  00000000  00000000  000171ad  2**0
                      CONTENTS, READONLY

The good news is that the code (the .text section) fits into 22k. That leaves 10k for the screen rendering code (probably from TellyMate), the keyboard and the SD card code. The .data section is quite small too, which means there aren’t many strings in the interpreter itself. The problem is the .bss section, which is reserved in RAM and zeroed on startup. It’s 2760 bytes, much more than the 2048 bytes of RAM which this chip has. Let’s look at this further:

    $ avr-objdump -t zip_linux | grep bss | sort -t $'\t' -k 2
    ...

    0080030c g     O .bss    00000004 pc
    00800da6 g     O .bss    00000006 __iob
    00000114 g       .text    00000010 __do_clear_bss
    00800552 g     O .bss    0000004e lookup_table
    00800436 l     O .bss    00000100 record_name
    00800332 l     O .bss    00000100 save_name
    008005a0 g     O .bss    00000800 stack

There’s a 2048 byte stack, which fills the entire memory!  The Infocom games might only need a 1k stack, which will help.
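To see how far over budget this is, here is a rough sketch that adds up the section sizes from the listings above. The function names are made up for illustration, and the sum ignores the C call stack and anything allocated at run time, so the real requirement is higher. Note that .data lives in RAM as well as .bss, so both count.

```c
#include <assert.h>
#include <stdio.h>

/* Section sizes copied from the avr-objdump listings above. */
enum {
    DATA_BYTES  = 0x1e4, /* .data: initialised variables, copied to RAM */
    BSS_BYTES   = 0xac8, /* .bss: zeroed variables, includes the stack array */
    STACK_BYTES = 0x800, /* the interpreter's "stack" array as compiled */
    SRAM_BYTES  = 2048   /* ATmega328P SRAM */
};

/* Static RAM needed if the "stack" array were resized to new_stack bytes. */
int static_ram_needed(int new_stack)
{
    assert(new_stack >= 0);
    return DATA_BYTES + (BSS_BYTES - STACK_BYTES) + new_stack;
}

/* Bytes of SRAM left over; a negative value means the build does not fit. */
int ram_headroom(int new_stack)
{
    return SRAM_BYTES - static_ram_needed(new_stack);
}

/* Print the budget for the original 2k stack and a hypothetical 1k one. */
void print_ram_budget(void)
{
    printf("2k stack: %d bytes needed, headroom %d\n",
           static_ram_needed(2048), ram_headroom(2048));
    printf("1k stack: %d bytes needed, headroom %d\n",
           static_ram_needed(1024), ram_headroom(1024));
}
```

With the sizes above, the 2k stack needs 3244 bytes, and even a 1k stack needs 2220 bytes, which is still 172 bytes more than the chip has.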

TellyMate has a 38×25 screen, which needs 950 bytes.  This could be brought down to 10 lines, but games might be tricky to play.  The ZX80 didn’t store a full line if it didn’t fill the width of the screen, but adventure games tend to be wordy so this won’t help much.

An SD card library needs about 600 bytes of RAM, which blows the budget.  We’d have to go without a filesystem on the SD card, because it takes up too much flash space.  It sounds like the 512 byte buffer might be optional.

4k of RAM should be plenty.  But with ARM chips being much cheaper than the larger AVRs, that might be the way to go – if I can sort out the jittering image problems.

You might be able to squeeze just the interpreter onto the chip, and communicate with it via its serial port, or another AVR running TellyMate.

Update: It turns out I wasn’t the only one to look at this. Rossum implemented it on the ATmega32U4, which has 2560 bytes of RAM, but needed to store a 1MB swap file on the SD card!  I don’t know why you need that much (maybe you don’t), but I forgot to include dynamic memory and room for the stack.

## Video signal output circuitry

I’ll need a simple circuit to mix my video signals together. The Arduino TV Out library shows how to do this, but that works with 5V IOs, whereas the STM32L Discovery (and all ARM chips AFAIK) uses 3.3V.

So which resistors will I need?  To produce a white signal, the sync and video lines will be high.  The equivalent circuit looks like this:

(The 75Ω resistor is the resistance inside the TV).

To show a black signal, the sync line will be high, and the video line will be low, which looks like this:

Wikipedia gives the formula for a voltage divider, so the resistors in the first diagram can be calculated with this formula:

$1= {75 \over {75+{1 \over { {1 \over V} + {1 \over S} }}}}\times 3.3$

and in the second:

$0.3= {{1 \over {{1 \over V} + {1 \over 75}}} \over {1 \over {{1 \over V} + {1 \over 75}}} + S} \times 3.3$

I tried solving these, but that’s well beyond my mathematical ability.  Instead I found an online site that could plot the two formulas (edit: I could have used Wolfram Alpha).  The lines crossed at about V=250Ω and S=580Ω. These aren’t standard resistor values, so V=270Ω and S=560Ω is close enough.  They seem to work fine in the circuit.
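The two equations can also be solved directly. Writing P for V in parallel with S, the first equation gives P = 75 × (3.3/1 − 1) = 172.5Ω. The second gives S = k × (V in parallel with 75Ω), where k = 3.3/0.3 − 1 = 10. Substituting one into the other leaves V = P × 75 × (k + 1) / (75k − P). The sketch below does that arithmetic and also checks what levels a given resistor pair produces; the function and type names are invented for this example.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    double video; /* resistor on the video line (V), ohms */
    double sync;  /* resistor on the sync line (S), ohms */
} mixer_resistors;

/* Resistance of two resistors in parallel. */
double parallel(double a, double b)
{
    return (a * b) / (a + b);
}

/* Solve the white-level and black-level divider equations for V and S. */
mixer_resistors solve_mixer(double vcc, double white, double black, double load)
{
    assert(vcc > white && white > black && black > 0.0 && load > 0.0);
    double p = load * (vcc / white - 1.0); /* required V || S */
    double k = vcc / black - 1.0;          /* required S / (V || load) */
    assert(k * load > p);                  /* otherwise no positive solution */
    mixer_resistors r;
    r.video = p * load * (k + 1.0) / (k * load - p);
    r.sync  = k * parallel(r.video, load);
    return r;
}

/* Output voltage with both the sync and video lines high. */
double white_level(double vcc, double v, double s, double load)
{
    return vcc * load / (load + parallel(v, s));
}

/* Output voltage with sync high and video low. */
double black_level(double vcc, double v, double s, double load)
{
    double q = parallel(v, load);
    return vcc * q / (q + s);
}

/* Print the ideal values and the levels from the standard 270/560 pair. */
void print_mixer(void)
{
    mixer_resistors r = solve_mixer(3.3, 1.0, 0.3, 75.0);
    printf("ideal: V=%.1f S=%.1f\n", r.video, r.sync);
    printf("270/560: white=%.3fV black=%.3fV\n",
           white_level(3.3, 270.0, 560.0, 75.0),
           black_level(3.3, 270.0, 560.0, 75.0));
}
```

The exact answer comes out at roughly V=246Ω and S=575Ω, which agrees with the plotted crossing point, and the 270Ω/560Ω pair gives about 0.96V for white and 0.31V for black.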

## DMA on the STM32L Discovery

There’s one more part to my video generator: the picture data, which I want to transfer to the SPI port using DMA. This actually looks fairly straightforward. These are the available settings in the channel configuration register (DMA_CCRx):

| Setting | What I’ll use |
| --- | --- |
| MEM2MEM | I’m transferring from memory to a peripheral, so this should be off. |
| PL | I’ll make this “very high” priority, because I want to keep the picture stable at all costs. If a program writes to the framebuffer during this DMA transfer, it will be blocked. |
| MSIZE | I’ve set the SPI port to 8 bits so I’ll stick with that. I don’t think it will make any difference whether it’s 8 or 16. |
| MINC | I want the memory pointer to increment during the transfer. |
| PINC | I guess this should be off, because to write SPI you keep sending data to the same memory location. |
| CIRC | I don’t want the memory pointer to circle around. |
| DIR | Read from memory. |
| Interrupts | I won’t need any yet, but eventually I’ll have to turn off the SPI port at the end of the transfer, otherwise I’ll get white bars down the sides of the screen. |

DMA_CNDTRx contains how much data to transfer. It needs to be set to the number of pixels / 8, since I’ll have 8 pixels in one byte.  There are 7 channels, and table 40 of the reference manual says SPI2_TX is on channel 5.  There’s an “auto-reload” setting somewhere which resets this counter value after a transfer; I think this happens in circular mode.
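As a small sketch of that count, assuming one bit per pixel and rounding up so that a partial final byte still gets sent (the function name is hypothetical, and the line widths in the note below are only example values):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes the DMA channel must move for one line at 1 bit per pixel. */
uint16_t dma_count_for_line(uint16_t pixels_per_line)
{
    /* The +7 would wrap a 16-bit value for widths this large. */
    assert(pixels_per_line <= 65528u);
    /* Round up so a partial final byte is not dropped. */
    return (uint16_t)((pixels_per_line + 7u) / 8u);
}
```

A 416-pixel line would need a count of 52, and a 208-pixel line a count of 26.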

Table 40 also suggests I must use DMA1 for these transfers.

The peripheral address register should point to the SPI data register (&(SPI2->DR)), and the memory register is the start of the current line of pixels.

That’s all of the available settings!  There’s one more thing to do though: section 10.3.7 says this:

The peripheral DMA requests can be independently activated/de-activated by programming the DMA control bit in the registers of the corresponding peripheral.

I guess this is the TXDMAEN bit in the SPI_CR2 register.

Now for some code… first I’ll make some data to send:

const uint8_t image[] = { 0xAA, 0x55, 0xAA, 0x55 };

Of course later on I’ll have a lot more data…

Now to set the above settings:

    DMA1_Channel5->CCR = DMA_CCR5_PL    // very high priority
                       | DMA_CCR5_MINC  // memory increment mode
                       | DMA_CCR5_DIR;  // read from memory, not peripheral

Section 10.3.3 has this useful bit of information:

The first transfer address is the one programmed in the DMA_CPARx/DMA_CMARx registers. During transfer operations, these registers keep the initially programmed value. The current transfer addresses (in the current internal peripheral/memory address register) are not accessible by software.

This suggests that I only need to set these at the start and shouldn’t need to touch them again.

To set these:

    DMA1_Channel5->CMAR = (uint32_t) image;       // where to read from
    DMA1_Channel5->CPAR = (uint32_t) &(SPI2->DR); // where to write to

Time to try it out… and… nothing!  Maybe there’s another clock setting for DMA, and sure enough there is:

  rccEnableAHB(RCC_AHBENR_DMA1EN, 0); // Enable DMA clock, run at SYSCLK

I still haven’t got anything, so I tried setting the source and destination registers each time before I start a DMA transfer. It looks like now I get a single transfer, but I’m trying to get a transfer on every hsync.

I poked around with the debugger, especially at 0x40026058, which is DMA1_Channel5->CCR (I calculated the address from values in stm32l1xx.h), and noticed that the Enable flag is still set.  Maybe it has to be toggled each time?  Now I get a square wave instead of my data…

I then tried decreasing my hsync timer, and decreasing the SPI speed, and I got a reasonable output.  I’m getting some nasty aliasing on my DSO Nano though; maybe I should have borrowed a faster scope!  I think I was triggering the DMA transfers too quickly, which produced that square wave.  Conveniently, I notice the SPI line is now low when it’s idle, which is the output I want.  I’m not sure why it’s gone low, but I’m not complaining.
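For reference, the address calculation can be sketched like this. The constants are what I believe the STM32L1 headers and reference manual give (DMA1 at 0x40026000, the first channel’s CCR at offset 0x08, and 0x14 bytes between channels), but treat them as assumptions and check them against stm32l1xx.h. The names are made up so they don’t clash with the real header.

```c
#include <assert.h>
#include <stdint.h>

#define DMA1_BASE_ADDR   0x40026000u /* assumed DMA1 base address on the STM32L1 */
#define DMA_CCR1_OFFSET  0x08u       /* offset of channel 1's CCR from the base */
#define DMA_CHANNEL_STEP 0x14u       /* bytes between one channel's registers and the next */

/* Address of the configuration register (CCR) for DMA1 channel 1..7. */
uint32_t dma1_ccr_address(unsigned channel)
{
    assert(channel >= 1 && channel <= 7);
    return DMA1_BASE_ADDR + DMA_CCR1_OFFSET + DMA_CHANNEL_STEP * (channel - 1u);
}
```

For channel 5 this gives 0x40026058, the address I was watching in the debugger.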

So to sum up:

    rccEnableAHB(RCC_AHBENR_DMA1EN, 0); // Enable DMA clock, run at SYSCLK
    // Configure DMA
    DMA1_Channel5->CCR = DMA_CCR5_PL    // very high priority
                       | DMA_CCR5_MINC  // memory increment mode
                       | DMA_CCR5_DIR;  // read from memory, not peripheral
    DMA1_Channel5->CMAR = (uint32_t) image;       // where to read from
    DMA1_Channel5->CPAR = (uint32_t) &(SPI2->DR); // where to write to
    ...
    SPI2->CR2 = SPI_CR2_SSOE | SPI_CR2_TXDMAEN;

then in my hsync handler:

    // Activate the DMA transfer
    DMA1_Channel5->CCR &= ~DMA_CCR5_EN;
    DMA1_Channel5->CNDTR = sizeof(image);
    DMA1_Channel5->CCR |= DMA_CCR5_EN;

I didn’t need to reset CMAR and CPAR after all.
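The disable, reload, enable sequence in the hsync handler can be wrapped in a helper and tried out on a PC. This is only a sketch: `dma_channel_regs` is a stand-in struct with the same four registers rather than the real CMSIS type, `DMA_CCR_EN_BIT` assumes the enable flag is bit 0, and `dma_restart` is a hypothetical name. As far as I can tell the counter can only be rewritten while the channel is disabled, which would explain why toggling the enable bit was needed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for one DMA channel's registers (not the real CMSIS type). */
typedef struct {
    volatile uint32_t CCR;   /* configuration, enable flag assumed in bit 0 */
    volatile uint32_t CNDTR; /* number of data items left to transfer */
    volatile uint32_t CPAR;  /* peripheral address */
    volatile uint32_t CMAR;  /* memory address */
} dma_channel_regs;

#define DMA_CCR_EN_BIT 0x1u

/* Restart a finished transfer: disable, reload the count, re-enable. */
void dma_restart(dma_channel_regs *ch, uint16_t count)
{
    assert(ch != 0);
    ch->CCR &= ~DMA_CCR_EN_BIT; /* the count is reloaded while disabled */
    ch->CNDTR = count;          /* CPAR and CMAR keep their programmed values */
    ch->CCR |= DMA_CCR_EN_BIT;  /* start the next line's transfer */
}
```

On the real chip the call would be something like `dma_restart((dma_channel_regs *)DMA1_Channel5, sizeof(image))`, assuming the struct layout matches the hardware.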

I think that’s now demonstrated everything I need for the video signal generator! My code needs a big cleanup, and I’d like to use ChibiOS functions where I can (palSetPadMode instead of messing around with memory locations and data structures, etc).

## Picture data using SPI

I plan to use SPI to send the picture data for my video generator.

First I need to work out what speed to run the port at.  The visible part of each line lasts 52 μs, or 1664 cycles at 32MHz.  I could divide this by 4 for 416 pixels per line or 8 for 208 per line.  This sets the baud rate (bits per second), so I shouldn’t need to divide this by 8 again to get a bytes per second speed.

It looks (from the clock registers) like SPI1 is connected to the APB2 clock, and SPI2 is connected to APB1.  I’m already running APB1 at the system clock (32MHz), so I’d like to use that if I can.  The speed is set in the CR1 register, by the BR bits, which support dividing by 4 or 8.  I might as well use SPI2.  The datasheet says that SPI2_MOSI can only be on pin B15.  I won’t need the clock output, so I won’t configure a pin for that.
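The arithmetic is simple enough to sketch in a couple of helpers (the names are hypothetical). One SPI bit is one pixel, so the pixel count is just the number of clock cycles in the line divided by the SPI clock divider.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles that fit into a line lasting line_us microseconds. */
uint32_t cycles_per_line(uint32_t clock_hz, uint32_t line_us)
{
    /* 64-bit intermediate so that 32MHz * 52 cannot overflow. */
    return (uint32_t)(((uint64_t)clock_hz * line_us) / 1000000u);
}

/* Pixels per line when the SPI clock is clock_hz / divider. */
uint32_t pixels_per_line(uint32_t clock_hz, uint32_t line_us, uint32_t divider)
{
    assert(divider != 0);
    return cycles_per_line(clock_hz, line_us) / divider;
}
```

At 32MHz and 52 μs this gives 1664 cycles, so 416 pixels with a divider of 4 and 208 with a divider of 8.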

The CR1 register contains a setting for 8 or 16-bit operation.  This affects the size of the data being written.  Since I plan to use DMA I’ll leave it at 8 bits.

It turns out there are very few settings to get SPI working.  I had to stuff around a lot before I got it working though – eventually I copied the ChibiOS code, and set the SPI_CR1_CPOL, SPI_CR1_SSM, SPI_CR1_SSI and SPI_CR2_SSOE flags even though I wouldn’t have thought I need them, and it suddenly worked!

This was enough to get SPI working:

    rccEnableAPB1(RCC_APB1ENR_SPI2EN, 0); // Enable SPI2 clock, run at SYSCLK
    PAL_STM32_OSPEED_HIGHEST);           /* MOSI.    */
    SPI2->CR1 = //SPI_CR1_BR_0 // divide clock by 4
                SPI_CR1_CPOL | SPI_CR1_SSM | SPI_CR1_SSI |
                SPI_CR1_BR // divide clock by 256
                | SPI_CR1_MSTR;  // master mode
    SPI2->CR2 = SPI_CR2_SSOE;
    SPI2->CR1 |= SPI_CR1_SPE; // Enable SPI

To send data, write bytes to SPI2->DR.  The output appears on PB15.  I think in the future I’ll try using palSetPadMode for configuring the pins, since it’s better than the 8 lines of code I’ve been using previously to do this.  The above code divides the clock by 256 so I could see the output on my DSO Nano, but I’ll change this to 4 later.
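The BR setting is a 3-bit code rather than the divider itself: 0 means divide by 2, 1 means divide by 4, and so on up to 7 for divide by 256. Here is a sketch of the conversion, assuming the field sits at bits 3 to 5 of CR1 (which matches SPI_CR1_BR_0 being the divide-by-4 setting in the code above). The helper name is made up.

```c
#include <assert.h>
#include <stdint.h>

#define SPI_BR_SHIFT 3u /* assumed position of the BR field in SPI_CR1 */

/* CR1 bits for a clock divider of 2, 4, 8, ... 256. */
uint32_t spi_cr1_br_bits(unsigned divider)
{
    unsigned code = 0;

    /* Only powers of two from 2 to 256 are valid dividers. */
    assert(divider >= 2 && divider <= 256 && (divider & (divider - 1u)) == 0);

    /* code n selects a divider of 2^(n+1). */
    while ((2u << code) < divider) {
        code++;
    }
    return (uint32_t)code << SPI_BR_SHIFT;
}
```

Dividing by 4 comes out as 0x08 and dividing by 256 as 0x38, which sets all three BR bits.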

The next step will be using DMA to write the data to SPI instead.

## More timing video signals on the STM32L Discovery

I’ve looked at how ChibiOS does its timing, and worked out that it’s unsuitable for timing video signals.  Now I’ll look at using the timers directly.

The chip has a number of timers, but I can’t work out how many.  The ChibiOS HAL manual says it can use timers 2, 3 and 4, so let’s leave those alone for other uses.  That leaves timers 9, 10 and 11.

If the timers work like the AVR’s timers, they work by starting at 0 and counting up to some maximum value, where the counter is reset to 0.  There’s also a compare register, and when the timer matches the compare register, something can happen – we can trigger an interrupt, change the state of a pin and so on.  Being able to change a pin is how PWM works.

It would be nice to use PWM to produce the sync signals.  I’ve found an excellent description of these signals, and at the start of each line, there’s always a falling signal.  So if we set the timer’s maximum value to the end of the line, and have the signal go low when it overflows, that’s the signal start taken care of.

The signal end is a bit trickier.  The image shows that it varies depending on the line; further, on some of the vertical sync lines it happens twice!  This means that we might need to change the value at which the signal goes high as the timer is running.  The AVR can do this, but in some modes the compare register is double-buffered: if you write a new value to the timer compare register, it’s only applied the next time the timer resets.  They do this so you can’t set a compare value lower than the current counter value, which would make the timer miss the match and keep counting up until it overflows!

So can you adjust the compare registers on the fly on the STM32L, and is it double buffered?  It looks like the compare register is called TIMx_CCR1.  There doesn’t seem to be a CCR2, so maybe these timers only have one output.  In the reference manual, section 17.6.11 says:

It is loaded permanently if the preload feature is not selected in the TIMx_CCMR1 register (bit OC1PE). Else the preload value is copied in the active capture/compare 1 register when an update event occurs.

So if the preload feature is off, the compare register can be updated straight away!  But back in the PWM mode description (section 17.4.9), it says:

You must enable the corresponding preload register by setting the OCxPE bit in the TIMx_CCMRx register

So we’re not so lucky.  We need to use the preload register, and we know it updates on an “update event”.  What’s an update event? Back in section 17.4.1:

The update event is sent when the counter reaches the overflow and if the UDIS bit equals 0 in the TIMx_CR1 register.

So all we need to do is update the compare register one line early!
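To convince myself of the “one line early” idea, here is a toy model of a preloaded compare register that runs on a PC. It is a simplification, not the real timer: the struct, the function names and the output rule (low from the start of the line, high once the counter reaches the compare value) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a timer with a double-buffered compare register. */
typedef struct {
    uint16_t counter;     /* current count */
    uint16_t reload;      /* counter wraps to 0 after reaching this value */
    uint16_t ccr_preload; /* value written by software */
    uint16_t ccr_active;  /* value actually compared against */
    int      output;      /* 0 = sync low, 1 = sync high */
} timer_model;

/* Software write: only the preload register changes. */
void timer_write_compare(timer_model *t, uint16_t value)
{
    t->ccr_preload = value;
}

/* Advance the model by one timer clock. */
void timer_tick(timer_model *t)
{
    if (t->counter >= t->reload) {
        t->counter = 0;
        t->ccr_active = t->ccr_preload; /* update event: preload takes effect */
    } else {
        t->counter++;
    }
    t->output = (t->counter >= t->ccr_active);
}
```

A value written during one line has no effect until the counter wraps, so the handler running at the start of line N has to write the pulse length for line N+1.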

What about the vertical sync lines, where there are two pulses?  That shouldn’t be a problem; we simply consider them two separate lines in software, so the lines are numbered like this:

This also shows that the maximum value will need to be changed in the same way.  In the reference manual, they call this value the “auto-reload” value, and it’s kept in the TIMx_ARR register.  Section 17.4.1 suggests you can choose whether this is double-buffered or not.  We might as well use this feature since the compare register needs it.

There’s one more thing to look at.  At the start of each line, I’ll need to start transferring data from memory to an external port using DMA, and configure the compare and maybe the maximum value register for the next line.  I could either do both in a single interrupt, or set the registers on the reset interrupt, and start the DMA on the compare interrupt.

Do we have enough time from the start of the interrupt to do anything useful? The horizontal sync pulse on a PAL signal is 4.7μs.  If we assume the CPU runs at 16MHz, this is about 75 cycles.  Section 5.5.1 of the ARM manual suggests that it takes 12 cycles to enter an interrupt.  In the AVR, it’s up to the programmer to save the register state at the start of an interrupt.  This means it’s a good idea to do as little as possible in an interrupt, because the compiler inserts lots of “push” and “pop” instructions around the interrupt.  Since the ARM looks after this for you, and takes a fixed amount of time to enter an interrupt, this isn’t a problem.  If we assume it takes 12 cycles to leave an interrupt too, that leaves about 50 cycles to do stuff.  This stuff is working out how long it should take to raise the signal again, and setting the compare register and maybe the reset register.
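The cycle budget can be written down as a quick calculation. The helper names are hypothetical, and the 12-cycle exit cost is the same guess as in the text, not a figure from the manual.

```c
#include <assert.h>
#include <stdint.h>

/* Whole clock cycles that fit into a window of the given length. */
uint32_t cycles_in_ns(uint32_t clock_hz, uint32_t ns)
{
    /* 64-bit intermediate: clock_hz * ns overflows 32 bits. */
    return (uint32_t)(((uint64_t)clock_hz * ns) / 1000000000u);
}

/* Cycles left for the handler body after interrupt entry and exit. */
int32_t isr_budget(uint32_t clock_hz, uint32_t window_ns,
                   uint32_t entry_cycles, uint32_t exit_cycles)
{
    return (int32_t)cycles_in_ns(clock_hz, window_ns)
         - (int32_t)entry_cycles - (int32_t)exit_cycles;
}
```

At 16MHz the 4.7μs sync pulse is 75 cycles, leaving 51 for the handler. Running at 32MHz would double the window to 150 cycles and leave 126.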

That should be all we need to know about the timers – now I’ll try to use the timer to produce these horizontal sync pulses.

## Timing video signals from the STM32L Discovery

In my last post, I suggested using ChibiOS to produce video signals from the STM32L Discovery.

The configuration for ChibiOS is held in chconf.h.  An interesting section is this one:

/**
* @brief   System tick frequency.
* @details Frequency of the system timer that drives the system ticks. This
*          setting also defines the system tick time unit.
*/
#if !defined(CH_FREQUENCY) || defined(__DOXYGEN__)
#define CH_FREQUENCY                    1000
#endif

It looks like the operating system wakes up periodically, checking whether there’s anything to do.  It also means that this tick may not be fine-grained enough to produce video signals.

So what is this number used for?  The only interesting reference I can find is in os/hal/platforms/STM32L1xx/hal_lld.c:

SysTick->LOAD = STM32_HCLK / CH_FREQUENCY - 1;

and of course that’s where the system ticks are initialized.  STM32_HCLK looks interesting; there are plenty of references to this in os/hal/platforms/STM32F4xx/hal_lld.h:

/**
* @brief   AHB frequency.
*/
#if (STM32_HPRE == STM32_HPRE_DIV1) || defined(__DOXYGEN__)
#define STM32_HCLK                  (STM32_SYSCLK / 1)
#elif STM32_HPRE == STM32_HPRE_DIV2
#define STM32_HCLK                  (STM32_SYSCLK / 2)
...

I remember seeing the text AHB before: this is the bus that connects the CPU to the GPIO ports and other peripherals.  This code suggests that it’s related to the CPU clock via a prescaler, which the clock tree in the reference manual confirms.  This led me here:

/**
* @brief   System clock source.
*/
#if STM32_NO_INIT || defined(__DOXYGEN__)
#define STM32_SYSCLK                STM32_HSICLK
#elif (STM32_SW == STM32_SW_HSI)
#define STM32_SYSCLK                STM32_HSICLK
...

so STM32_SYSCLK is the system clock, and we can choose the source for this.  “HSI” would be the High Speed Internal clock, which is fixed at 16MHz.  It’s possible to use the PLL to run the CPU at 32MHz too.

So working backwards, with the default setting, STM32_HCLK is 16MHz, and ChibiOS’ default tick rate is 1000Hz, so the SysTick fires every 16000 cycles, which is 1ms.  For PAL, the sync pulse length is 4.7µs, and the front porch is much shorter than that, so the ChibiOS timer is far too coarse for that.  I could raise CH_FREQUENCY a long way, but there’s the risk that ChibiOS won’t have enough time to do its scheduling after it wakes up, and the tick still wouldn’t be accurate enough.
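Plugging numbers into the SysTick->LOAD formula from hal_lld.c makes the problem concrete. This is a sketch with made-up helper names; the 200kHz figure in the note below is just a hypothetical tick rate that would give 5µs resolution.

```c
#include <assert.h>
#include <stdint.h>

/* Value written to SysTick->LOAD: STM32_HCLK / CH_FREQUENCY - 1. */
uint32_t systick_reload(uint32_t hclk_hz, uint32_t tick_hz)
{
    assert(tick_hz != 0 && tick_hz <= hclk_hz);
    return hclk_hz / tick_hz - 1u;
}

/* Length of one system tick in microseconds. */
uint32_t tick_period_us(uint32_t tick_hz)
{
    assert(tick_hz != 0);
    return 1000000u / tick_hz;
}

/* CPU cycles available between two ticks. */
uint32_t cycles_per_tick(uint32_t hclk_hz, uint32_t tick_hz)
{
    return systick_reload(hclk_hz, tick_hz) + 1u;
}
```

With a 16MHz HCLK and CH_FREQUENCY at 1000, the reload value is 15999 and each tick lasts 1000µs. A 200kHz tick would bring that down to 5µs, but would leave only 80 cycles between ticks for the scheduler and everything else.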

While I’m rummaging around the ChibiOS code, what does it use to trigger its scheduler?  The SysTick variable never seems to be read anywhere, but looking at SysTick_Type in os/ports/common/ARMCMx/CMSIS/include/core_cm3.h it looks like some part of the address space, specifically at address 0xE000E010.  The datasheet says this is part of the “Cortex-M3 Internal Peripherals”, but that’s all it says.  The CPU manual might be more helpful here, and it says this is part of the “System Control Space”.  Section 3.1.1 says this address contains the “SysTick Control and Status Register”, and the following registers correspond to the SysTick variable in ChibiOS.

So what is the SysTick for?  Section 5.2 suggests that an interrupt can be triggered on the SysTick firing, which might be what ChibiOS uses for its scheduling.  (WordPress didn’t save my draft from here, so I might be missing a few steps.)  So where is the handler for this?

Earlier I found that each program starts with an interrupt table.  The example there has 4 entries, but it can be longer.  The linker script (os/ports/GCC/ARMCMx/STM32L1xx/ld/STM32L152xB.ld) contains a section called “vectors”, which is defined in os/ports/GCC/ARMCMx/STM32L1xx/vectors.c.  The SysTick handler is called SysTickVector, which looks like this (from os/ports/GCC/ARMCMx/chcore_v7m.c; I wasn’t sure whether this chip is an arm6 or an arm7, but the Cortex-M3 implements the ARMv7-M architecture, so the v7m file should be the right one):

CH_IRQ_HANDLER(SysTickVector) {

CH_IRQ_PROLOGUE();

chSysLockFromIsr();
chSysTimerHandlerI();
chSysUnlockFromIsr();

CH_IRQ_EPILOGUE();
}

So this is how the SysTick facility works.  Now this can’t be used for generating the video signals, since it’s not accurate enough – I’d need to use another timer for that.  The timer interrupt would need to be of higher priority than SysTick, otherwise the CPU might be doing something else which would make the image jump around.  The ChibiOS docs suggest that interrupt handlers are like a special thread with higher priority than everything else, which is what I want.

All of this suggests that ARMs are a lot trickier than 8-bit CPUs, because of all of the available features.  I don’t think I’ve even found all of the relevant documentation – with the AVR, one document contains everything you need to know.

Next I’ll look at how the timers work.