Thoughts on an assembly-coded emulator ====================================== This document consists of a series of extracts from email replies by Richard Talbot-Watkins to messages from David Sharp concerning writing an interpretive emulator for Acorn ARM-based computers. At the time, Richard was putting the finishing touches to his Sega Master System emulator, Miracle, written entirely in ARM assembly and David was starting an attempt at writing a NES emulator in ARM assembly. At the time of writing, Richard was about to graduate with a degree in Computer Science from Cambridge and David was about to take his A levels. This document has relevence for anyone emulating 6502 and Z80 processors, particularly for NES/SMS, authors looking at emulating mapped memory, and anyone writing an emulator in assembly, particularly ARM assembly. It is primarily concerned with emulation optimisation at a low level. Richard's Miracle emulator was quite an achievement being just about being able to emulate the Sega Master System at a playable speed on an Acorn A3010 with a CPU of just an ARM250 running at 12MHz. Please excuse any discontinuity in the text, it was never written to be read this way. ================================================================================ > I don't notice much difference between the normal and the faster code, in what > way is it cludged to run faster??? Basically, it emulates 240 scanlines per frame instead of 312 and also slightly slows the emulated clock speed down - the idea is that some games often just wait in a tight loop when they've finished drawing their frame, hanging on till the next VSync signal when they can then draw the next frame: by emulating only 240 lines instead of 312, it should mean that games that use that method don't spend so long idling and hence they run faster. Good idea in theory, apart from that lots of games use that 'idle' time to actually do other stuff like sprite manipulation or whatever, and so if you cut down the emulated frame time in this way, it just goes horribly, horribly wrong.... But Wonderboy 3 benefits from this speed cludge :) The sprite problem you mention is to do with the fact that the SMS can only maintain 64 hardware sprites at once, but with some clever timing, more can be forced onto the screen at once by changing the location of the sprite info table halfway down the refreshing of the screen - so it's important that the timing is really exact. At the moment, I don't actually know the clock speed of the SMS for sure, which is why I'm a bit suspicious of sprite flashing problems (as it would be caused by changing the sprite info table at the wrong point in the frame) > how come you're forced to use the system clock for the > emu timing?? Getting there... I don't really want to add any 'frills' though until the actual emulation code itself is rock solid (i.e. runs stuff like Hook, SB2, Alex Kidd - Lost Stars, Zool, MK2 and a few others which notably crash). The system clock is used to time 1/50th sec intervals to synchronise SMS VSyncs. At the moment, it keeps a running record of Sega time against real time, so it can try to catch up if it slows down for a bit, which is why for the moment it has to quit. > How are you doing the keypress detection BTW, since I > disassembled the code, but couldn't find any OS_Byte 121s so I assume you are > reading the keypresses straight out of the keyboard buffer using interrupts? Very simple - I just use a keyboard event handler (event 11) to signal when a key is pressed or released, and then get that to modify the actual bits of the joypad ports themselves. That way, you never need to use OS_Byte &79 (which is v slow if you're doing it a lot) when the joypad port is read (as the right value's already magically there!), and at the same time it can be used to test for other special keys like quit, pause etc. So all the keys you ever want to test for are all in the same word, ready for testing - funky! > How will you implement the frameskip in the final version, how about an > optional test at the start of the program which guesses the optimum frameskip > setting at the machines speed. Yes eventually. Although absolutely ideally, I want to implement an automatic frameskip which can also change mid-emulation depending on the load on the machine at any particular moment. Shouldn't be too much extra work.... At the moment, I'm trying to debug my Z80 emulation as I'm sure there are a few subtle problems which are stopping the Simpsons, Hook, Zool and a few others from working - but believe me, tracing opcode bugs like "ADC A doesn't set the zero flag if the result of the operation was zero, but only if it was entered with carry set" like I fixed the other day, take a lot of effort! Oh yes, did you ever have Chase HQ or OutRun on your SMS? Just because I thought I had an opcode problem as, on either of those, if you steer to an extreme of the track so that it goes off the screen, it seems to wrap around (badly) onto the other side - you'll see what I mean if you play it. I thought it was a problem with my emulator, until Matt said that it happens on MasterGear too! So either (a) it was a shittily written game or (b) MasterGear has the same bug as mine. Can you shed any light? (Editors note: it turned out that these games were just shoddily coded and did this on the real SMS) > I'm not quite sure how all these bugs in the Z80 have occurred - surely it's > just a straight forward matter of seeing what each instruction does and coding > it!? Hahaha, if only it were that simple! Problem is, there are about 1500 instructions, and most of them are pretty simple, but there's bound to be scope for stupid mistakes when there's that much code! The other thing is I've coded loads of shortcuts where I've thought I could get away with it - sadly stuff like INC (HL) wasn't very happy with my shortcut (it assumes you'd never try to increment the memory mapped paging register... or any part of ROM, but sadly Gauntlet tries both, arse). Also, half the time my optimisation relies on similarities between the way the Z80 and ARM set their flags, e.g. taking your 6502 emu as an example: ADC #num is a 6502 instruction which adds the constant value 'num', and the value of the carry flag, to the accumulator. It sets the C flag if the result was more than 8 bits wide, it sets the Z flag if the result was zero, it sets the N flag if the result was negative (bit 7 set) and - the tricky one - the V flag is set if there was an "overflow condition", namely if two positive nos were added to give a seemingly negative result (i.e. bit 7 became set) and other such cases where the sign is set incorrectly. Working out the algorithm for the V flag becomes a right bastard (I assure you!). Anyway, coding all this explicitly (in a Basic-type pseudo code) would give something like: result = A + num + CFlag CFlag = (result >= 256) ; assume this returns 0 or 1 NFlag = (result >= 128) ZFlag = (result = 0) VFlag = ((NOT (A EOR num) AND (num EOR result)) AND 128) = 128) ; yuk! A = (result AND 255) This is pretty grim, and particularly as ADC #num only takes two clock cycles - so this means if you're emulating a 2MHz 6502 machine (like a BBC) on say a 12MHz ARM machine (like mine), you only have 12 ARM clock cycles to code a perfect ADC instruction before you're behind schedule! But I believe it should be possible to get a perfect Z80 or 6502 emulation at exact speed on an ARM250 machine if it's done the right way (my Z80 SMS emulation runs at about right speed on mine if all the graphical processing is dropped!). So the approach I would take with the 6502 emulator (and which I've taken with the Z80 emu as well, although this is less easy) is to make HUGE use of the fact that most processors set their flags in similar ways. By some incredible coincidence (honest), the ARM sets flags in exactly the same way as the 6502 in ADC and SBC instructions, so use the fact! If you keep your 6502 accumulator in R0, have it stored shifted up 24 bits so that when it's stored into, setting its top bit will set the ARM N flag too. That's one flag set for free! The ADC #num code might look something like: where: P is a register which holds the 6502 PSR (flags) LDRB num,[pc6502],#1 ; get the num from the ADC #num instruction TST P,P,LSR #(bit position of CFlag) ; set ARM C flag according to 6502 flags MVNCS temp,#(&FF<<24) ; extend the C flag from bits 0-23 ORR temp,temp,num,LSL #24 ADCS A,A,temp ; do the ADC, set the new accum result ; and set flags! And that's it!! The ADCS sets all the ARM flags in exactly the same way as the 6502 would. Now it's just a case of copying them back into your P register. If you store the 6502 flags in the same order as the ARM ones, rather than in the actual 6502 order, you can then just store the flags with something like: BIC P,P,#15 ORR P,P,PC,LSR #28 That's ADC in only about 8 clock cycles!! Wicked. Stuff like AND, ORA and so on is much simpler, but can be made extremely quick by the same method. There are subtleties when implementing PHP and PLP (in that the flag order has to be corrected to proper 6502 order when storing to memory) but hope that's some use anyway. If any of this makes any sense, then let me know - I'm doing something right then at least! The other great thing about the 6502 of course is that there are only about 160 instructions... The way ADC was used on the 6502 was as follows. If &70 and &71 are a pair of memory locs containing a 16-bit word, and you want to add 135 to them (for some weird reason), you'd use code like: CLC ; clear carry flag LDA &70 ; Load low byte of word into A ADC #135 ; Add 135+0 (because C=0) STA &70 ; store. If ?&70+135 was > 255, carry will now be set LDA &71 ; Load high byte of word ADC #0 ; If C set because of carry from lo byte, this will add 1 to the high byte, otherwise will just leave it! STA &71 ; store back. RTS ; return Of course the #135 and #0 could be replaced with some other 16-bit value, e.g. to add the word at &70/&71 to the one at &72/&73, you could use: CLC; LDA&70; ADC&72; STA&70; LDA&71; ADC&73; STA&71; RTS That should give you some idea how ADC works. SBC is a bit more odd because it uses some bizarre idea called 'borrow' - I don't know why, but you have to use the carry flag in reverse. So, subtracting 135 back from &70/&71 would be: SEC ; set carry flag ready for subtraction LDA &70 SBC #135 ; subtract 135-(1-C), i.e. 135 (don't know why it was like this though!) STA &70 LDA &71 SBC #0 ; If 'borrow' condition from the subtract (i.e. C=0), then this will subtract a further 1 from the high byte STA &71 RTS Does that make any sense!? > I've noticed that some emus do some kind of cycle counting for each > instruction and yours seems to pass the cycle count to FNnext when returning > to the execute loop - what's the point in this, since this surely slows down > emulation? Yes I guess it does, although not much! The reason for counting clock cycles is for emulating video refresh accurately. Many games change certain video characteristics, such as palette, midway down the screen refresh; this allows them to split their screen into two different palettes at a certain pixel line. Loads of machines did this: BBC, SMS, CPC and probably many others with three initials.... anyway, in order to decide when to update the next video line in the emulator, you need to keep track of clock cycles, because you know that one video line is drawn every 224 cycles (or whatever). That's the idea anyway. If this isn't emulated properly, loads of games will fail to work properly, particularly on the SMS which absolutely relies on it! > Stayed up late last night, and now have a crude monitor, and the > TAX,TXA,TAY,TYA as well as CLC,SEC,CLD,SED,CLI,SEI and CLV instructions > implemented ok (all the tricky ones). :) Remember that on TAX,TXA,TAY and TYA, you need to set flags accordingly too. But you probably realised that anyway :) Again the quickest way I would find to do this, is to store A,X and Y in ARM registers all shifted up 24 bits. Then for TAX you could do: MOVS X,A BIC P,P,#(Z_Flag)+(N_Flag) ORREQ P,P,#Z_Flag ORRMI P,P,#N_Flag Even better - since the ARM and the 6502 set their flags in exactly the same way for virtually everything, you could use P just to store the 6502 registers C,V,Z and N in ARM flag order, and have D,I and B stored somewhere else (you don't need to worry about them that often). Then TAX would simply become: TEQP P,#0 ; set ARM flags to previously stored ones MOVS X,A ; transfer A to X MOV P,PC ; store new flags back: the above instruction does not affect C or V on the ARM > I really don't get what you're trying to do here?? > >MVNCS temp,#(&FF<<24) ; extend the C flag from bits 0-23 > > or here? > >ORR temp,temp,num,LSL #24 This is extending the carry bit, which is absolutely vital! Suppose we have (in binary) 00001111 in the accumulator, and we're doing ADC#0 where C=1. The result we therefore want is 00010000. But remember we're storing all the values shifted up 24 bits, so that the ARM flags respond, so in our ARM register which contains the accumulator we have: 00001111 00000000 00000000 00000000 Now we want to add with carry zero, i.e. 00000000 00000000 00000000 00000000 + 1 <- carry Without carry extension this is going to give: 00001111 00000000 00000000 00000001 Not much use! But taking those two instructions (MVNCS temp,#&FF<<24 : ORR temp,temp,num,LSL #24), this is taking the operand of the ADC (which is in num), shifting it up to the same bit position as the accumulator, and ORRing onto it lots of binary 1's, i.e. for ADC#0, we get: 00000000 11111111 11111111 11111111 \_num__/ \_____carry extension____/ So now, when we ADC the accumulator to this, we get: 00001111 00000000 00000000 00000000 + 00000000 11111111 11111111 11111111 + 1 (carry) ==================================== 00010000 00000000 00000000 00000000 - correct! Try it and you'll see how it works - the carry is effectively 'cascaded' all the way along to bit 24 which is kinda where you want it to enter the addition. And that way, all the flags get set correctly for free!! (Which has got to be better than that dire algorithm for working out the V flag) By the way, not to depress you or anything, but you realise that each time round the main loop you're probably going to have to test for interrupts too, and emualte their generation if necessary. You'll probably find lots of carts hanging in infinite loops, waiting for an interrupt to get them out otherwise (most games do this to wait for a VSync or something). Like I said, the fun's only just beginning!!! My unfinished BBC emulator had the emulated PC stored as a real Acorn address, and this causes no problems, as when you need to get it to a real 6502 address (when pushing onto stack during JSR, or calculating new addresses in JMPs/JSRs), you can get it by simply adding or subtracting the Acorn base address of the start of your 64k of 6502 memory. By storing the emulated PC (and SP) in terms of Acorn memory, it means you can use stuff like LDRB r0,[emPC],#1. The catch is if you have to worry about memory paging: if, when new memory is paged in, you just copy a huge chunk from where it's stored into the 64k memory map, then this method'll still work - but if a game pages memory around a lot (e.g. on entry to its IRQ routine), you don't want to keep block copying these chunks of memory around, and it's probably best to find a different way. SMS uses a cunning way of accessing any memory location from the 64k memory map, regardless of what's paged where, and all this in three hideously optimised instructions (check out FNldmem/FNstmem in Z80Source.Library) - I'll explain it if you really want to know what it's doing, cos I really don't expect it to mean anything to you! :) > I'm having a very boring day, so thought I'd take another quick look at the > 6502, I'm looking at getting the memory access up and running, and changing > the emPC to be an address in Acorn RAM rather than in NES ram so as memory > access is done using LDRB r0,[emPC],#1as we mentinoed before. I've been taking > a look at stmem and ldmem from SMS and am completely flummuxed. Can you > explain this as you mentioned before? > Thanks Eek! Right, I'm doing this away from my source code so I'm trying to remember what FNldmem does - is it something like: FNldmem(dest,addr): LDRB dest,[rdoffset,addr,LSR #13] ADD dest,addr,dest,LSL #13 LDRB dest,[mem,dest] That looks about right. OK, the idea of FNldmem is that it loads a value from Z80 memory space ($0..$FFFF) into register 'dest', taking into account the current paging register settings. First of all, you need to know how the SMS divides up its memory pages: $0000-$3FFF = Page 0, a 16k ROM page $4000-$7FFF = Page 1, a 16k ROM page $8000-$BFFF = Page 2, a 16k ROM page, or 16k of cartridge RAM $C000-$DFFF = Page 3, 8k of machine RAM $E000-$FFFF = A copy of the above RAM Now because I don't want to keep copying around memory to a 64k mirror of the Z80's memory in Acorn RAM, I store the whole 256k (or however big) ROM, and all the SMS RAM, in one contiguous block, and then use FNldmem above to decode Z80 addresses into the positions in Acorn memory, taking into account the current paging status. You'll see that the smallest block of SMS memory which is always kept together is 8k (for the on-board RAM). So for a start, we need to split up the 64k memory map into 8k chunks, and then we can say that every address in that 8k chunk lives at the same offset within our big chunk of Acorn memory. That register 'rdoffsets' is an 8 byte table where each element contains the 'offset' (will explain) into Arc RAM of every Z80 address in this 8k block. So the first instruction is reading the offset for the Z80 address in question (LSR #13 in order to get the 8k chunk number). The second instruction is adding the offset*8k (LSL #13) to the Z80 address. If the value stored in the rdoffset table was an absolute 8k block number within Acorn memory, I would first have to AND the Z80 address with &3FFF and then add the offset*8k. Instead, the value in the rdoffset table is always a value relative to where it is in the Z80 memory map, so I can just add it to the Z80 address (which saves one instruction per FNldmem). Finally, the third instruction fetches the byte from Arc memory, where mem is a pointer to the base of where it's all stored (well not exactly, but I'll return to this point too) and dest contains the offset within that block, as just calculated. God, this is difficult to explain. Here's an example. Suppose we're storing all the SMS ROM banks in one big block starting from [mem], like this: 8k chunk 0: ROM Page 0, 1st half 1: ROM Page 0, 2nd half 2: ROM Page 1, 1st half 3: ROM Page 1, 2nd half etc Now, suppose we want to page ROM page 2 into $4000..$7FFF - this is offsets 2 and 3 in rdoffset, and 8k chunks 4 and 5 in Acorn memory. So into rdoffset[2] we put 4-2=2, and into rdoffset[3] we put 5-3=2. (The rdoffset table is modified by the paging register handling routine). If you put these values through FNldmem, you'll get some idea what's going on (perhaps). NOW FOR THE SUBTLETIES.... what happens if we wanted to page ROM page 0 into $8000..$BFFF, i.e. offsets 4 and 5 in rdoffset, and 8k chunks 0 and 1 in Acorn memory. This means we put 0-4 = -4 into rdoffset[4] and -4 into rdoffset[5]. But, stored in a byte, this will be 252, and therefore the wrong offset will be used, i.e. rdoffsets can only contain positive values. So all we do is we add 8 to every value ever stored in rdoffsets, guaranteeing that negative numbers can never be stored (using the SMS paging system at least), and then adjust for this by pointing mem to the start of the Acorn block of memory, minus 8*8k. PHEW! Believe me, this is the only way to do this whole thing in three instructions, and I imagine you're now thinking I have the most warped mind on the planet. And you're probably right :) Bet you're sorry you asked now :) FNstmem works identically, but uses a table 'wroffsets' instead - this is because the ROM entries point their offsets to a 'dummy' bit of Acorn memory, so if you try to write to ROM, the actual ROM as stored in Acorn memory is unaffected. Joe Kelleher has opted for a far more cunning approach, and actually uses ARM hardware paging to emulate 16k paging. Problem with that is it won't work on 4Mb pre-Risc PC machines (because of hardware paging limitations) and it's also a bugger to write :) He's a braver man than I... So if you have to emulate paging or memory mirroring (i.e. bits of RAM of whatever being repeated in several locations), you're probably going to have to do something similar. The only other alternative is to copy 16k chunks of memory around whenever the paging registers are accessed, but if you're looking for a fast emulator, I wouldn't recommend it :) Good Luck! > I have pretty much assumed that all instructions set there flags in the same > way as the ARM (since I have no really good trustworthy docs on ARM or 6502 > flag setting). You'd think I would with a BBC user manual, advanced BBC guide, > and learrn BBC assembly as well as Archimedes assembly language, Archimedes > Operating System AND the ROS2 PRMs :o) Seems like it ought to be enough.... Anyway in summary: ADD,ADC,SUB,SBC,RSB,RSC,CMP set NZCV exactly like the 6502 AND,ORR,BIC,EOR,TST,TEQ,MOV,MVN set N and Z only and leave C and V alone (unless there was a shift at the end, in which case C is set to the value of the bit which gets shifted out last) - I use this to set C in my Z80 emulation loads. > So some of my instructions are still a little buggy (particularly > SBC still), but I guess flag errors can be corrected later. Actually, you'd be surprised how important it is to make sure the flags are absolutely right - if for example you're not setting Z properly sometimes, it can cause loops to be wrongly executed, and then the 6502 code is probably going to end up doing something completely different! > Also, do you have any better ideas as to how to code ROL other than this... > > MovS A,A,LSL #1 > OrrCS A,A,#(1<<24) > BicCC A,A,#(1<<24) <- wouldn't need this, as MOVS A,A,LSL #1 will always put a 0 at bit 24. That won't work, because ROL shifts the result left one place, puts the OLD carry into bit 0 and puts the bit that was shifted out into the NEW carry. In your code, you're putting the carry that was shifted out back into the front again - which is admittedly a proper rotate, but not the way the 6502 did it :) Your best bet for ROL A is probably something like: TEQP P,#0 ; set ARM flags to 6502 flags ORRCS A,A,#1<<23 ; about to be shifted into bit 24 if C set MOVS A,A,LSL #1 ; do the shift, set flags (inc. carry) MOV P,PC ; store flags again Again, it's fairly important to make sure it's v fast, as ROL A is only 2 clock cycles, and used quite a bit (it's the only way the 6502 can do multiplication) > LDRB num,[emPC],#1 > > but this obviously won't work if emPC is a 6502 memory address instead of an > Acorn one. I guess I just have to FNincPC before LDRBing separately. The above method is perfect if you have a simple 64k memory map with no paging though - I was using it in my (very) unfinished BBC emulator once. The Z80 PC is 16 bits long which means it can address from $0000..$FFFF. So there's no possible way of accessing memory outside this range as PC itself is constrained to contain a value in this range. So how's a 256k ROM accessed? That's where the paging registers come in - as I said, there are three 16k slots for ROM in the memory map at once. All you do is write to the paging registers to tell it which portion of the 256k ROM is going to appear in each 16k slot (just like paged Roms on the BBC). The downside of this is only 48k of the whole Rom is visible at once. That's how any machine with a 16 bit PC handles a greater amount of memory than 64k. So every time a paging register is written to, the offset tables in Miracle are updated so that effectively a new chunk of the Rom appears in the appropriate page of memory. Sounds to me as if the NES does an almost identical thing. Hope that makes some sense! I was only emulating a 32k BBC B, and what I did was used Acorn words to store 6502 bytes, so that when you loaded from memory, the result was already shifted up 24 bits. Also what it meant that was upon fetching the opcode, you could also load the byte afterwards with something like LDMIA (emPC)!,{R0,R1} (as the majority of opcodes are at least two bytes long) which saved yet another fetch (LDMIA with 2 or more regs is faster than two separate loads). Having 256k ROMs, or larger, means you probably can't get away with this on anything less than a 4Mb machine.... The other more icky thing that it did was to try and keep the 6502 flags stored in the ARM flags all the time, rather than keeping them safe in some register called P, or whatever, although this does admittedly to lead to FAR less flexibility. I'll think about any ways you could make this better, although I reckon you've probably got it about as good as it can get! I think I effectively used a macro to code different addressing modes, cos you can definitely re-use half your code each time. The other slight problem is, that I'm > not sure how to handle instructions where the number of cycles can vary - the > instructions in question are marked by there FNnext(), since there cycle count > increases by 1 if a page boundary is crossed. Do you think it's best to set > them to their maximum always or try and calculate the page bounday crossing, > if so how? :o) I hope you don't feel like you're writing this emulator for me > ;o) (Editors note: recent research from various other emulator authors seems to indicate that the number of cycles per instruction doesn't alter according to crossing page boundary's) I was never even counting clock cycles on my old emulator - I guess at the time, I wasn't really intending to get the timing right. As for emulating crossing a page taking an extra clock cycle - I can't think of an easy quick way of testing this! Apart from something like: LDA nnnn,X: ADD emPC,emPC,#1 FNldmem(lo,emPC) ADD emPC,emPC,#1 FNldmem(hi,emPC) ADD emPC,emPC,#1 ORR lo,lo,hi,LSL #8 ADD lo,lo,X,LSR #24 * TEQ hi,lo,LSR #8 * ADDNE timer,timer,#1 etc It's another two checks each time 'mfraid. It's a right bummer trying to get the emulation really optimised, particularly when it's faster than a 2MHz clock - you've only got effectively 6 ARM clock cycles per 6502 clock cycles to get it all done (on my machine at least). Another way to speed it up, by the way (which my old 6502 emulator also uses) is to have a base address (in a register) of all the opcode routines, and then have your FNnext doing something like: FNldmem(R0,emPC) ADD PC,routine_base,R0,LSL #6 and then putting each routine at offsets of 64 bytes from the base address. A bit wasteful, but it does save a branch to your .ExecuteLoop or equivalent, and then straight away, another branch to the opcode routine. Things like this *really* do make a difference when you're doing them so much! > ;LDA ind X (pre-indexed) > > .op%(&A1) > FNincPC > FNLoadMem(temp,emPC) ;zero page address of > constants address > ADD num,temp,X,LSR #24 ;align X and temp and sum > > AND num,num,#&FF ;ensure only a zero page (1 byte) address I wouldn't bother with that AND - I don't think any programmer would try to use &FF/&00 as an address pair, or try to wrap around an address pair table from &FF back round to &00 again, even though that's what a real 6502 would do. If you can find these short cuts, then it's worth taking them (even though making assumptions like that *can* lead to bugs - don't I know it :( ) > I've just been implementing the stack for the 6502, and have a few questions > about it, since it is almost undocumented from a technical point of view. > Basically I've assugned the stack between &100 and &1FF as the docs say, and > assumed the SP (stack pointer) initialises to 0. When I push something to the > stack I am doing this > > ADD temp,SP,#&100 ;add SP and stack offset > FNStoreMem(number,temp) > ADD SP,SP,#1 Actually, the pushing algorithm looks something like: [SP+&100] = val SP = (SP - 1) AND &FF and so, fairly predictably, pulling goes: SP = (SP + 1) AND &FF val = [SP+&100] Note it wraps around if it goes outside its allocated 256 bytes. Most software will usually execute LDX #&FF:TXS at the beginning to reset the stack pointer to its most empty position. My 6502 emu was very minimal and assumed absolutely tons (though I was confident on the BBC that I could get away with it) - it assumed that every bit of software will always set SP to &FF at the start, and that you would never use more than 256 bytes of stack, so it never actually attempted to wrap around. In fact, it used a real Acorn address for the stack pointer, but I don't know if that's a particularly good assumption to make on the NES, or any generic 6502 emulator. I reckon it's probably worth storing your SP in the bottom 8 bits, but also, making sure bit 8 is always 1, i.e. storing values from &100..&1ff. Then PHA becomes: MOV temp,A,LSR #24 FNstmem(temp,sp) SUB sp,sp,#1 ORR sp,sp,#&100 ; you can leave this out if you're not testing for stack wrap around PLA would be something like: SUB sp,sp,#255 ; this actually does add 1 to the lower 8 bits ORR sp,sp,#&100 ; - it's the best way of wrapping in two instructions ; if you're not wrapping just use ADD sp,sp,#1 instead FNldmem(A,sp) TEQP P,#0 MOVS A,A,LSL #24 MOV P,PC > I had a sod of a time trying to store the PSR to the stack, since I have NZCV > stored in b31-28 of the P register and UBDI stored in a memory location in > bits 5-2 (5-2 chosen for this instruction). and the 6502 order is way > different so I've had to do a lot of bit shifting :o( I could just fudge it, > and store the flags in the easiest order for speed but apparently a few games > manipulate the stack by hand at times rather than using the normal methods > (apparently). I guess it's something to optimised later. Yes, it's important to make sure you store the right thing to the stack in a PHP, because virtually ANY 6502 interrupt handler manipulates the stack directly to see if it's servicing an IRQ or a BRK. I rekcon your best bet is to store your B,D and I in the same bit positions as the actual 6502 ones, and then build a lookup table for converting your 6502 to ARM and vice versa (!Miracle does this with two tables: FlagsZ2A and FlagsA2Z). So set up the table so that given flags in ARM order in bits 0-3, it returns the equivalent 6502 byte, i.e. FOR armflags%=0 TO 15 b%=0 IF (armflags% AND 1) b%=b% OR (1<<6) :REM overflow IF (armflags% AND 2) b%=b% OR (1<<0) :REM carry IF (armflags% AND 4) b%=b% OR (1<<(6502zflag)) :REM zero IF (armflags% AND 8) b%=b% OR (1<<7) :REM negative FlagsArmTo6502[armflags%]=b% NEXT ...or that sort of thing. (I can't remember the bit position of the 6502 zero flag). And setup a similar table for converting from 6502 flags (only being interested in ZCVN) to Arm order. Then PHP will be: LDR temp,[flagsAto6,P,LSR #28] LDR temp2,UBDI ORR temp,temp,temp2 FNstmem(temp,sp) SUB sp,sp,#1 ORR sp,sp,#&100 Should do the trick! > I should think some cunning bastard will have used the looping stack in some > NES ROM somewhere so I guess I'd better emulate it, it's one of those things > I'll mark for possible optimisation. I know that !65Host never bothers looping the stack - I think it's usually very unlikely that any 6502 prog would use more than 256 bytes of stack - and certainly any NES program will have to set the SP to &FF, otherwise it's undefined - so on the assumption that wrapping around will write completely over anything important that's already there, you can probably assume it won't happen :) > Thanks very much for the 6502em, very interesting, though quite perculiar the > way everything uses words rather than byte access - makes the instruction code > SOOOooooo much tidier though. I quite liked that solution at the time, I have to admit - it saves *so* much per opcode - but I guess it's only at all feasible if you have only 64k to deal with. ============================================================================== compiled 25/5/99 by David Sharp