Thoughts on an assembly-coded emulator
======================================

This document consists of a series of extracts from email replies by Richard 
Talbot-Watkins to messages from David Sharp concerning writing an interpretive 
emulator for Acorn ARM-based computers. At the time, Richard was putting the 
finishing touches to his Sega Master System emulator, Miracle, written entirely 
in ARM assembly and David was starting an attempt at writing a NES emulator in 
ARM assembly. At the time of writing, Richard was about to graduate with a degree
in Computer Science from Cambridge and David was about to take his A levels.

This document has relevence for anyone emulating 6502 and Z80 processors, 
particularly for NES/SMS, authors looking at emulating mapped memory, and anyone 
writing an emulator in assembly, particularly ARM assembly. It is primarily 
concerned with emulation optimisation at a low level. Richard's Miracle emulator
was quite an achievement being just about being able to emulate the Sega Master
System at a playable speed on an Acorn A3010 with a CPU of just an ARM250 running
at 12MHz.

Please excuse any discontinuity in the text, it was never written to be read 
this way.

================================================================================

> I don't notice much difference between the normal and the faster code, in what 
> way is it cludged to run faster???

Basically, it emulates 240 scanlines per frame instead of 312 and also slightly 
slows the emulated clock speed down - the idea is that some games often just 
wait in a tight loop when they've finished drawing their frame, hanging on till 
the next VSync signal when they can then draw the next frame: by emulating only 
240 lines instead of 312, it should mean that games that use that method don't 
spend so long idling and hence they run faster.  Good idea in theory, apart from 
that lots of games use that 'idle' time to actually do other stuff like sprite 
manipulation or whatever, and so if you cut down the emulated frame time in this 
way, it just goes horribly, horribly wrong....  But Wonderboy 3 benefits from 
this speed cludge :)

The sprite problem you mention is to do with the fact that the SMS can only 
maintain 64 hardware sprites at once, but with some clever timing, more can be 
forced onto the screen at once by changing the location of the sprite info table 
halfway down the refreshing of the screen - so it's important that the timing is 
really exact.  At the moment, I don't actually know the clock speed of the SMS 
for sure, which is why I'm a bit suspicious of sprite flashing problems (as it 
would be caused by changing the sprite info table at the wrong point in the 
frame)

> how come you're forced to use the system clock for the 
> emu timing??

Getting there... I don't really want to add any 'frills' though until the actual 
emulation code itself is rock solid (i.e. runs stuff like Hook, SB2, Alex Kidd - 
Lost Stars, Zool, MK2 and a few others which notably crash).  The system clock 
is used to time 1/50th sec intervals to synchronise SMS VSyncs.  At the moment, 
it keeps a running record of Sega time against real time, so it can try to catch 
up if it slows down for a bit, which is why for the moment it has to quit.

> How are you doing the keypress detection BTW, since I 
> disassembled the code, but couldn't find any OS_Byte 121s so I assume you are 
> reading the keypresses straight out of the keyboard buffer using interrupts? 

Very simple - I just use a keyboard event handler (event 11) to signal when a 
key is pressed or released, and then get that to modify the actual bits of the 
joypad ports themselves.  That way, you never need to use OS_Byte &79 (which is 
v slow if you're doing it a lot) when the joypad port is read (as the right 
value's already magically there!), and at the same time it can be used to test 
for other special keys like quit, pause etc.  So all the keys you ever want to 
test for are all in the same word, ready for testing - funky!

> How will you implement the frameskip in the final version, how about an 
> optional test at the start of the program which guesses the optimum frameskip 
> setting at the machines speed.

Yes eventually.  Although absolutely ideally, I want to implement an automatic 
frameskip which can also change mid-emulation depending on the load on the 
machine at any particular moment.  Shouldn't be too much extra work....

At the moment, I'm trying to debug my Z80 emulation as I'm sure there are a few 
subtle problems which are stopping the Simpsons, Hook, Zool and a few others 
from working - but believe me, tracing opcode bugs like "ADC A doesn't set the 
zero flag if the result of the operation was zero, but only if it was entered 
with carry set" like I fixed the other day, take a lot of effort!

Oh yes, did you ever have Chase HQ or OutRun on your SMS?  Just because I 
thought I had an opcode problem as, on either of those, if you steer to an 
extreme of the track so that it goes off the screen, it seems to wrap around 
(badly) onto the other side - you'll see what I mean if you play it.  I thought 
it was a problem with my emulator, until Matt said that it happens on MasterGear 
too!  So either (a) it was a shittily written game or (b) MasterGear has the 
same bug as mine.  Can you shed any light?

(Editors note: it turned out that these games were just shoddily coded and
did this on the real SMS)

> I'm not quite sure how all these bugs in the Z80 have occurred - surely it's 
> just a straight forward matter of seeing what each instruction does and coding 
> it!?

Hahaha, if only it were that simple!  Problem is, there are about 1500
instructions, and most of them are pretty simple, but there's bound to be
scope for stupid mistakes when there's that much code!  The other thing is
I've coded loads of shortcuts where I've thought I could get away with it
- sadly stuff like INC (HL) wasn't very happy with my shortcut (it assumes
you'd never try to increment the memory mapped paging register... or any
part of ROM, but sadly Gauntlet tries both, arse).  Also, half the time my
optimisation relies on similarities between the way the Z80 and ARM set
their flags, e.g. taking your 6502 emu as an example:

ADC #num is a 6502 instruction which adds the constant value 'num', and
the value of the carry flag, to the accumulator.  It sets the C flag if
the result was more than 8 bits wide, it sets the Z flag if the result was
zero, it sets the N flag if the result was negative (bit 7 set)  and - the
tricky one - the V flag is set if there was an "overflow condition",
namely if two positive nos were added to give a seemingly negative result
(i.e. bit 7 became set) and other such cases where the sign is set
incorrectly.  Working out the algorithm for the V flag becomes a right
bastard (I assure you!). 

Anyway, coding all this explicitly (in a Basic-type pseudo code) would
give something like:

result = A + num + CFlag
CFlag = (result >= 256)  ; assume this returns 0 or 1
NFlag = (result >= 128)
ZFlag = (result = 0)
VFlag = ((NOT (A EOR num) AND (num EOR result)) AND 128) = 128)  ; yuk!
A = (result AND 255)

This is pretty grim, and particularly as ADC #num only takes two clock
cycles - so this means if you're emulating a 2MHz 6502 machine (like a
BBC) on say a 12MHz ARM machine (like mine), you only have 12 ARM clock
cycles to code a perfect ADC instruction before you're behind schedule!
But I believe it should be possible to get a perfect Z80 or 6502 emulation
at exact speed on an ARM250 machine if it's done the right way (my Z80 SMS
emulation runs at about right speed on mine if all the graphical
processing is dropped!).

So the approach I would take with the 6502 emulator (and which I've taken
with the Z80 emu as well, although this is less easy) is to make HUGE use
of the fact that most processors set their flags in similar ways.  By some
incredible coincidence (honest), the ARM sets flags in exactly the same
way as the 6502 in ADC and SBC instructions, so use the fact!  If you keep
your 6502 accumulator in R0, have it stored shifted up 24 bits so that
when it's stored into, setting its top bit will set the ARM N flag too.
That's one flag set for free!  The ADC #num code might look something
like:

where: P is a register which holds the 6502 PSR (flags)

LDRB num,[pc6502],#1 ; get the num from the ADC #num instruction
TST P,P,LSR #(bit position of CFlag)  ; set ARM C flag according to 6502
flags
MVNCS temp,#(&FF<<24) ; extend the C flag from bits 0-23
ORR temp,temp,num,LSL #24
ADCS A,A,temp ; do the ADC, set the new accum result
; and set flags!

And that's it!!  The ADCS sets all the ARM flags in exactly the same way
as the 6502 would.  Now it's just a case of copying them back into your P
register.  If you store the 6502 flags in the same order as the ARM ones,
rather than in the actual 6502 order, you can then just store the flags
with something like:

BIC P,P,#15
ORR P,P,PC,LSR #28

That's ADC in only about 8 clock cycles!!  Wicked.  Stuff like AND, ORA
and so on is much simpler, but can be made extremely quick by the same
method.  There are subtleties when implementing PHP and PLP (in that the
flag order has to be corrected to proper 6502 order when storing to
memory) but hope that's some use anyway.  If any of this makes any sense,
then let me know - I'm doing something right then at least!  

The other great thing about the 6502 of course is that there are only about 160 
instructions...

The way ADC was used on the 6502 was as follows.  If &70 and &71 are a
pair of memory locs containing a 16-bit word, and you want to add 135 to
them (for some weird reason), you'd use code like:

CLC    ; clear carry flag
LDA &70    ; Load low byte of word into A
ADC #135   ; Add 135+0 (because C=0)
STA &70    ; store. If ?&70+135 was > 255, carry will now be set
LDA &71    ; Load high byte of word
ADC #0     ; If C set because of carry from lo byte, this will add 1
             to the high byte, otherwise will just leave it!
STA &71    ; store back.
RTS        ; return

Of course the #135 and #0 could be replaced with some other 16-bit value,
e.g. to add the word at &70/&71 to the one at &72/&73, you could use:

CLC; LDA&70; ADC&72; STA&70; LDA&71; ADC&73; STA&71; RTS

That should give you some idea how ADC works.  SBC is a bit more odd
because it uses some bizarre idea called 'borrow' - I don't know why, but
you have to use the carry flag in reverse.  So, subtracting 135 back from
&70/&71 would be:

SEC    ; set carry flag ready for subtraction
LDA &70
SBC #135   ; subtract 135-(1-C), i.e. 135 (don't know why it was like this
             though!)
STA &70
LDA &71
SBC #0     ; If 'borrow' condition from the subtract (i.e. C=0), then this
             will subtract a further 1 from the high byte
STA &71
RTS

Does that make any sense!?

> I've noticed that some emus do some kind of cycle counting for each 
> instruction and yours seems to pass the cycle count to FNnext when returning 
> to the execute loop - what's the point in this, since this surely slows down 
> emulation?

Yes I guess it does, although not much!  The reason for counting clock cycles is 
for emulating video refresh accurately.  Many games change certain video 
characteristics, such as palette, midway down the screen refresh; this allows 
them to split their screen into two different palettes at a certain pixel line.  
Loads of machines did this: BBC, SMS, CPC and probably many others with three 
initials....  anyway, in order to decide when to update the next video line in 
the emulator, you need to keep track of clock cycles, because you know that one 
video line is drawn every 224 cycles (or whatever).  That's the idea anyway.  If 
this isn't emulated properly, loads of games will fail to work properly, 
particularly on the SMS which absolutely relies on it!

> Stayed up late last night, and now have a crude monitor, and the 
> TAX,TXA,TAY,TYA as well as CLC,SEC,CLD,SED,CLI,SEI and CLV instructions 
> implemented ok (all the tricky ones).

:)

Remember that on TAX,TXA,TAY and TYA, you need to set flags accordingly
too.  But you probably realised that anyway :)  Again the quickest way I
would find to do this, is to store A,X and Y in ARM registers all shifted
up 24 bits.  Then for TAX you could do:

MOVS X,A
BIC P,P,#(Z_Flag)+(N_Flag)
ORREQ P,P,#Z_Flag
ORRMI P,P,#N_Flag

Even better - since the ARM and the 6502 set their flags in exactly the
same way for virtually everything, you could use P just to store the 6502
registers C,V,Z and N in ARM flag order, and have D,I and B stored
somewhere else (you don't need to worry about them that often).  Then TAX
would simply become:

TEQP P,#0 ; set ARM flags to previously stored ones MOVS X,A ; transfer A to X 
MOV P,PC ; store new flags back: the above instruction does not affect C or V on 
the ARM

> I really don't get what you're trying to do here??
> >MVNCS temp,#(&FF<<24) ; extend the C flag from bits 0-23
> 
> or here?
> >ORR temp,temp,num,LSL #24

This is extending the carry bit, which is absolutely vital!  Suppose we
have (in binary) 00001111 in the accumulator, and we're doing ADC#0 where
C=1.  The result we therefore want is 00010000.  But remember we're
storing all the values shifted up 24 bits, so that the ARM flags respond,
so in our ARM register which contains the accumulator we have:

00001111 00000000 00000000 00000000

Now we want to add with carry zero, i.e.

00000000 00000000 00000000 00000000
                                + 1 <- carry

Without carry extension this is going to give:

00001111 00000000 00000000 00000001

Not much use!  But taking those two instructions (MVNCS temp,#&FF<<24 :
ORR temp,temp,num,LSL #24), this is taking the operand of the ADC (which
is in num), shifting it up to the same bit position as the accumulator,
and ORRing onto it lots of binary 1's, i.e.

for ADC#0, we get: 00000000 11111111 11111111 11111111
                   \_num__/ \_____carry extension____/

So now, when we ADC the accumulator to this, we get:

   00001111 00000000 00000000 00000000
+ 00000000 11111111 11111111 11111111
+                                   1 (carry)
   ====================================
   00010000 00000000 00000000 00000000 - correct!

Try it and you'll see how it works - the carry is effectively 'cascaded' all the 
way along to bit 24 which is kinda where you want it to enter the addition.  And 
that way, all the flags get set correctly for free!! (Which has got to be better 
than that dire algorithm for working out the V flag)

By the way, not to depress you or anything, but you realise that each time round 
the main loop you're probably going to have to test for interrupts too, and 
emualte their generation if necessary.  You'll probably find lots of carts 
hanging in infinite loops, waiting for an interrupt to get them out otherwise 
(most games do this to wait for a VSync or something).  Like I said, the fun's 
only just beginning!!!

My unfinished BBC emulator had the emulated PC stored as a real Acorn
address, and this causes no problems, as when you need to get it to a real
6502 address (when pushing onto stack during JSR, or calculating new
addresses in JMPs/JSRs), you can get it by simply adding or subtracting
the Acorn base address of the start of your 64k of 6502 memory.  By
storing the emulated PC (and SP) in terms of Acorn memory, it means you
can use stuff like LDRB r0,[emPC],#1.

The catch is if you have to worry about memory paging: if, when new memory is 
paged in, you just copy a huge chunk from where it's stored into the 64k memory 
map, then this method'll still work - but if a game pages memory around a lot 
(e.g. on entry to its IRQ routine), you don't want to keep block copying these 
chunks of memory around, and it's probably best to find a different way.  SMS 
uses a cunning way of accessing any memory location from the 64k memory map, 
regardless of what's paged where, and all this in three hideously optimised 
instructions (check out FNldmem/FNstmem in Z80Source.Library) - I'll explain it 
if you really want to know what it's doing, cos I really don't expect it to mean 
anything to you! :)

> I'm having a very boring day, so thought I'd take another quick look at the 
> 6502, I'm looking at getting the memory access up and running, and changing 
> the emPC to be an address in Acorn RAM rather than in NES ram so as memory 
> access is done using LDRB r0,[emPC],#1as we mentinoed before. I've been taking 
> a look at stmem and ldmem from SMS and am completely flummuxed. Can you 
> explain this as you mentioned before?
> Thanks

Eek!  Right, I'm doing this away from my source code so I'm trying to
remember what FNldmem does - is it something like:

  FNldmem(dest,addr):
    LDRB dest,[rdoffset,addr,LSR #13]
    ADD dest,addr,dest,LSL #13
    LDRB dest,[mem,dest]

That looks about right.  OK, the idea of FNldmem is that it loads a value
from Z80 memory space ($0..$FFFF) into register 'dest', taking into
account the current paging register settings.  First of all, you need to
know how the SMS divides up its memory pages:

   $0000-$3FFF  = Page 0, a 16k ROM page
   $4000-$7FFF  = Page 1, a 16k ROM page
   $8000-$BFFF  = Page 2, a 16k ROM page, or 16k of cartridge RAM
   $C000-$DFFF  = Page 3, 8k of machine RAM
   $E000-$FFFF  =         A copy of the above RAM

Now because I don't want to keep copying around memory to a 64k mirror of
the Z80's memory in Acorn RAM, I store the whole 256k (or however big)
ROM, and all the SMS RAM, in one contiguous block, and then use FNldmem
above to decode Z80 addresses into the positions in Acorn memory, taking
into account the current paging status.

You'll see that the smallest block of SMS memory which is always kept
together is 8k (for the on-board RAM).  So for a start, we need to split
up the 64k memory map into 8k chunks, and then we can say that every
address in that 8k chunk lives at the same offset within our big chunk of
Acorn memory.

That register 'rdoffsets' is an 8 byte table where each element contains
the 'offset' (will explain) into Arc RAM of every Z80 address in this 8k
block.  So the first instruction is reading the offset for the Z80 address
in question (LSR #13 in order to get the 8k chunk number).  The second
instruction is adding the offset*8k (LSL #13) to the Z80 address.  If the
value stored in the rdoffset table was an absolute 8k block number within
Acorn memory, I would first have to AND the Z80 address with &3FFF and
then add the offset*8k.  Instead, the value in the rdoffset table is
always a value relative to where it is in the Z80 memory map, so I can
just add it to the Z80 address (which saves one instruction per FNldmem). 
Finally, the third instruction fetches the byte from Arc memory, where mem
is a pointer to the base of where it's all stored (well not exactly, but
I'll return to this point too) and dest contains the offset within that
block, as just calculated.

God, this is difficult to explain.  Here's an example.

Suppose we're storing all the SMS ROM banks in one big block starting from
[mem], like this:

  8k chunk 0:  ROM Page 0, 1st half
           1:  ROM Page 0, 2nd half
           2:  ROM Page 1, 1st half
           3:  ROM Page 1, 2nd half
                  etc

Now, suppose we want to page ROM page 2 into $4000..$7FFF - this is
offsets 2 and 3 in rdoffset, and 8k chunks 4 and 5 in Acorn memory.  So
into rdoffset[2] we put 4-2=2, and into rdoffset[3] we put 5-3=2. (The
rdoffset table is modified by the paging register handling routine).  If
you put these values through FNldmem, you'll get some idea what's going on
(perhaps). 

NOW FOR THE SUBTLETIES.... what happens if we wanted to page ROM page 0
into $8000..$BFFF, i.e. offsets 4 and 5 in rdoffset, and 8k chunks 0 and 1
in Acorn memory.  This means we put 0-4 = -4 into rdoffset[4] and -4 into
rdoffset[5].  But, stored in a byte, this will be 252, and therefore the
wrong offset will be used, i.e. rdoffsets can only contain positive
values.  So all we do is we add 8 to every value ever stored in rdoffsets,
guaranteeing that negative numbers can never be stored (using the SMS
paging system at least), and then adjust for this by pointing mem to the
start of the Acorn block of memory, minus 8*8k.  PHEW!

Believe me, this is the only way to do this whole thing in three
instructions, and I imagine you're now thinking I have the most warped
mind on the planet.  And you're probably right :)

Bet you're sorry you asked now :)  FNstmem works identically, but uses a
table 'wroffsets' instead - this is because the ROM entries point their
offsets to a 'dummy' bit of Acorn memory, so if you try to write to ROM,
the actual ROM as stored in Acorn memory is unaffected.

Joe Kelleher has opted for a far more cunning approach, and actually uses
ARM hardware paging to emulate 16k paging.  Problem with that is it won't
work on 4Mb pre-Risc PC machines (because of hardware paging limitations)
and it's also a bugger to write :)  He's a braver man than I...

So if you have to emulate paging or memory mirroring (i.e. bits of RAM of
whatever being repeated in several locations), you're probably going to
have to do something similar.  The only other alternative is to copy 16k
chunks of memory around whenever the paging registers are accessed, but if
you're looking for a fast emulator, I wouldn't recommend it :)

Good Luck!

> I have pretty much assumed that all instructions set there flags in the same 
> way as the ARM (since I have no really good trustworthy docs on ARM or 6502 
> flag setting). You'd think I would with a BBC user manual, advanced BBC guide, 
> and learrn BBC assembly as well as Archimedes assembly language, Archimedes 
> Operating System AND the ROS2 PRMs :o)

Seems like it ought to be enough....  Anyway in summary:

ADD,ADC,SUB,SBC,RSB,RSC,CMP set NZCV exactly like the 6502 
AND,ORR,BIC,EOR,TST,TEQ,MOV,MVN set N and Z only and leave C and V alone (unless 
there was a shift at the end, in which case C is set to the value of the bit 
which gets shifted out last) - I use this to set C in my Z80 emulation loads.

> So some of my instructions are still  a little buggy (particularly 
> SBC still), but I guess flag errors can be corrected later.

Actually, you'd be surprised how important it is to make sure the flags are 
absolutely right - if for example you're not setting Z properly sometimes, it 
can cause loops to be wrongly executed, and then the 6502 code is probably going 
to end up doing something completely different!

> Also, do you have any better ideas as to how to code ROL other than this...
> 
> MovS A,A,LSL #1
> OrrCS A,A,#(1<<24)
> BicCC A,A,#(1<<24) <- wouldn't need this, as MOVS A,A,LSL #1 will always
                        put a 0 at bit 24.

That won't work, because ROL shifts the result left one place, puts the
OLD carry into bit 0 and puts the bit that was shifted out into the NEW
carry.  In your code, you're putting the carry that was shifted out back
into the front again - which is admittedly a proper rotate, but not the
way the 6502 did it :)

Your best bet for ROL A is probably something like:

TEQP P,#0          ; set ARM flags to 6502 flags
ORRCS A,A,#1<<23   ; about to be shifted into bit 24 if C set
MOVS A,A,LSL #1    ; do the shift, set flags (inc. carry)
MOV P,PC           ; store flags again

Again, it's fairly important to make sure it's v fast, as ROL A is only 2 clock 
cycles, and used quite a bit (it's the only way the 6502 can do multiplication)

> LDRB num,[emPC],#1
> 
> but this obviously won't work if emPC is a 6502 memory address instead of an 
> Acorn one. I guess I just have to FNincPC before LDRBing separately.

The above method is perfect if you have a simple 64k memory map with no paging 
though - I was using it in my (very) unfinished BBC emulator once.

The Z80 PC is 16 bits long which means it
can address from $0000..$FFFF.  So there's no possible way of accessing
memory outside this range as PC itself is constrained to contain a value
in this range.

So how's a 256k ROM accessed?  That's where the paging registers come in -
as I said, there are three 16k slots for ROM in the memory map at once.
All you do is write to the paging registers to tell it which portion of
the 256k ROM is going to appear in each 16k slot (just like paged Roms on
the BBC).  The downside of this is only 48k of the whole Rom is visible at
once.  

That's how any machine with a 16 bit PC handles a greater amount of memory than 
64k.  So every time a paging register is written to, the offset tables in 
Miracle are updated so that effectively a new chunk of the Rom appears in the 
appropriate page of memory.  Sounds to me as if the NES does an almost identical 
thing.  Hope that makes some sense!

I was only emulating a 32k BBC B, and what I did was used Acorn words to store 
6502 bytes, so that when you loaded from memory, the result was already shifted 
up 24 bits.  Also what it meant that was upon fetching the opcode, you could 
also load the byte afterwards with something like LDMIA (emPC)!,{R0,R1} (as the 
majority of opcodes are at least two bytes long) which saved yet another fetch 
(LDMIA with 2 or more regs is faster than two separate loads).  Having 256k 
ROMs, or larger, means you probably can't get away with this on anything less 
than a 4Mb machine....  The other more icky thing that it did was to try and 
keep the 6502 flags stored in the ARM flags all the time, rather than keeping 
them safe in some register called P, or whatever, although this does admittedly 
to lead to FAR less flexibility.  I'll think about any ways you could make this 
better, although I reckon you've probably got it about as good as it can get!

I think I effectively used a macro to code different addressing modes, cos you 
can definitely re-use half your code each time.  

The other slight problem is, that I'm 
> not sure how to handle instructions where the number of cycles can vary - the 
> instructions in question are marked by there FNnext(), since there cycle count 
> increases by 1 if a page boundary is crossed. Do you think it's best to set 
> them to their maximum always or try and calculate the page bounday crossing, 
> if so how? :o) I hope you don't feel like you're writing this emulator for me 
> ;o)

(Editors note: recent research from various other emulator authors seems to 
indicate that the number of cycles per instruction doesn't alter according to 
crossing page boundary's)

I was never even counting clock cycles on my old emulator - I guess at the
time, I wasn't really intending to get the timing right.  As for emulating
crossing a page taking an extra clock cycle - I can't think of an easy
quick way of testing this!  Apart from something like:

LDA nnnn,X:
ADD emPC,emPC,#1
FNldmem(lo,emPC)
ADD emPC,emPC,#1
FNldmem(hi,emPC)
ADD emPC,emPC,#1
ORR lo,lo,hi,LSL #8
ADD lo,lo,X,LSR #24
* TEQ hi,lo,LSR #8
* ADDNE timer,timer,#1
etc

It's another two checks each time 'mfraid.  It's a right bummer trying to
get the emulation really optimised, particularly when it's faster than a
2MHz clock - you've only got effectively 6 ARM clock cycles per 6502 clock
cycles to get it all done (on my machine at least).

Another way to speed it up, by the way (which my old 6502 emulator also
uses) is to have a base address (in a register) of all the opcode
routines, and then have your FNnext doing something like:

FNldmem(R0,emPC)
ADD PC,routine_base,R0,LSL #6

and then putting each routine at offsets of 64 bytes from the base address.  A 
bit wasteful, but it does save a branch to your .ExecuteLoop or equivalent, and 
then straight away, another branch to the opcode routine.  Things like this 
*really* do make a difference when you're doing them so much!

> ;LDA ind X (pre-indexed)
> 
> .op%(&A1)
> FNincPC
> FNLoadMem(temp,emPC)             ;zero page address of 
> constants address
> ADD      num,temp,X,LSR #24      ;align X and temp and sum
> 
> AND      num,num,#&FF            ;ensure only a zero page (1 byte) address

I wouldn't bother with that AND - I don't think any programmer would try to use 
&FF/&00 as an address pair, or try to wrap around an address pair table from &FF 
back round to &00 again, even though that's what a real 6502 would do.  If you 
can find these short cuts, then it's worth taking them (even though making 
assumptions like that *can* lead to bugs - don't I know it :( )

> I've just been implementing the stack for the 6502, and have a few questions 
> about it, since it is almost undocumented from a technical point of view. 
> Basically I've assugned the stack between &100 and &1FF as the docs say, and 
> assumed the SP (stack pointer) initialises to 0. When I push something to the 
> stack I am doing this
> 
> ADD temp,SP,#&100 ;add SP and stack offset
> FNStoreMem(number,temp)
> ADD SP,SP,#1

Actually, the pushing algorithm looks something like:

  [SP+&100] = val
  SP = (SP - 1) AND &FF

and so, fairly predictably, pulling goes:

  SP = (SP + 1) AND &FF
  val = [SP+&100]

Note it wraps around if it goes outside its allocated 256 bytes.  Most
software will usually execute LDX #&FF:TXS at the beginning to reset the
stack pointer to its most empty position. 

My 6502 emu was very minimal and assumed absolutely tons (though I was
confident on the BBC that I could get away with it) - it assumed that
every bit of software will always set SP to &FF at the start, and that you
would never use more than 256 bytes of stack, so it never actually
attempted to wrap around.  In fact, it used a real Acorn address for the
stack pointer, but I don't know if that's a particularly good assumption
to make on the NES, or any generic 6502 emulator.

I reckon it's probably worth storing your SP in the bottom 8 bits, but
also, making sure bit 8 is always 1, i.e. storing values from &100..&1ff.
Then PHA becomes:

  MOV temp,A,LSR #24
  FNstmem(temp,sp)
  SUB sp,sp,#1
  ORR sp,sp,#&100  ; you can leave this out if you're not testing
                     for stack wrap around  

PLA would be something like:

  SUB sp,sp,#255   ; this actually does add 1 to the lower 8 bits ORR 
  sp,sp,#&100  ; - it's the best way of wrapping in two instructions ; if you're 
  not wrapping just use ADD sp,sp,#1 instead FNldmem(A,sp) TEQP P,#0 MOVS 
  A,A,LSL #24 MOV P,PC
  
> I had a sod of a time trying to store the PSR to the stack, since I have NZCV 
> stored in b31-28 of the P register and UBDI stored in a memory location in 
> bits 5-2 (5-2 chosen for this instruction). and the 6502 order is way 
> different so I've had to do a lot of bit shifting :o( I could just fudge it, 
> and store the flags in the easiest order for speed but apparently a few games 
> manipulate the stack by hand at times rather than using the normal methods 
> (apparently). I guess it's something to optimised later.  

Yes, it's important to make sure you store the right thing to the stack in
a PHP, because virtually ANY 6502 interrupt handler manipulates the stack
directly to see if it's servicing an IRQ or a BRK.

I rekcon your best bet is to store your B,D and I in the same bit
positions as the actual 6502 ones, and then build a lookup table for
converting your 6502 to ARM and vice versa (!Miracle does this with two
tables: FlagsZ2A and FlagsA2Z).  So set up the table so that given flags
in ARM order in bits 0-3, it returns the equivalent 6502 byte, i.e.

  FOR armflags%=0 TO 15
   b%=0
   IF (armflags% AND 1) b%=b% OR (1<<6)  :REM overflow
   IF (armflags% AND 2) b%=b% OR (1<<0)  :REM carry
   IF (armflags% AND 4) b%=b% OR (1<<(6502zflag)) :REM zero
   IF (armflags% AND 8) b%=b% OR (1<<7)  :REM negative
   FlagsArmTo6502[armflags%]=b%
  NEXT

...or that sort of thing.  (I can't remember the bit position of the 6502
zero flag).  And setup a similar table for converting from 6502 flags
(only being interested in ZCVN) to Arm order.  Then PHP will be:

  LDR temp,[flagsAto6,P,LSR #28]
  LDR temp2,UBDI
  ORR temp,temp,temp2
  FNstmem(temp,sp)
  SUB sp,sp,#1
  ORR sp,sp,#&100

Should do the trick!

> I should think some cunning bastard will have used the looping stack in some 
> NES ROM somewhere so I guess I'd better emulate it, it's one of those things 
> I'll mark for possible optimisation.

I know that !65Host never bothers looping the stack - I think it's usually very 
unlikely that any 6502 prog would use more than 256 bytes of stack - and 
certainly any NES program will have to set the SP to &FF, otherwise it's 
undefined - so on the assumption that wrapping around will write completely over 
anything important that's already there, you can probably assume it won't happen 
:)

> Thanks very much for the 6502em, very interesting, though quite perculiar the 
> way everything uses words rather than byte access - makes the instruction code 
> SOOOooooo much tidier though.

I quite liked that solution at the time, I have to admit - it saves *so* much 
per opcode - but I guess it's only at all feasible if you have only 64k to deal 
with.

==============================================================================
compiled 25/5/99 by David Sharp