ctr wrote:Ada and Modula 2 are compiled languages that have coroutines and no garbage collection. (Then I thought, and surely Modula 3? But it doesn't.) If you're not fussed about garbage collection there are Go and Haskell. Iterators in C# also work as a poor man's version.
I guess real threads wouldn't be any good because there's far too much communication needed between the components.
I don't know if language-based co-routines do this but an important requirement is that the co-routines remain synchronised with each other. If they were to each proceed at their own pace some things would not work. The obvious case is where a game uses a timer to re-program the CRTC part-way down the frame. If the co-routine running the CRTC emulation were to gain on the one emulating the timer the point at which the display changed would move up the screen. If threads were synchronised once per frame it would then just be a case of having swapped drift for jitter.
In ElectrEm I used separate threads to achieve coroutines, with the caveat that from time n, whether any component will affect any other is completely knowable upfront except in the case of the CPU. So the process to run for q cycles was: calculate the largest number of cycles, up to q, for which I know for certain that no component will contact another. Ask the CPU to run for that many cycles. If it exits early and says it ran for only p cycles before being about to make contact, update all other components to p, then resume. I was serialising them all though; the thread side of things was just to gain an additional call stack so that my 6502 code could be read from top to bottom as if a normal opcode-level implementation, but actually be cycle correct.
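For anyone who'd rather see that loop than read it, here's a minimal C++ sketch of the scheduling idea only. It is not ElectrEm's code: Timer, Cpu, quiet_cycles and run_machine are names made up for illustration, the "CPU" just pokes the timer on every 7th cycle, and the extra-thread call stack is left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy predictable component (hypothetical): it can only affect anybody else
// at the moment its period expires.
struct Timer {
    int64_t now = 0;
    int64_t period = 10;
    int64_t fired = 0;

    // Cycles from now for which this component is known to stay quiet.
    int64_t quiet_cycles() const { return period - (now % period); }

    void run_for(int64_t cycles) {
        fired += (now % period + cycles) / period;
        now += cycles;
    }
};

// Toy unpredictable component (hypothetical): touches the timer on every 7th cycle.
struct Cpu {
    int64_t now = 0;
    int64_t accesses = 0;
    const Timer *timer = nullptr;

    // Runs for at most `cycles` and returns how many it actually ran. It stops
    // early just before a cycle that touches the timer, unless that cycle is
    // the first of this call, in which case the caller has already brought
    // everything else up to date.
    int64_t run_for(int64_t cycles) {
        int64_t ran = 0;
        while (ran < cycles) {
            const bool touches_timer = (now % 7) == 6;
            if (touches_timer && ran > 0) break;
            if (touches_timer) {
                assert(timer->now == now);  // the timer must be exactly in step
                ++accesses;
            }
            ++now;
            ++ran;
        }
        return ran;
    }
};

// Runs the whole machine for q cycles.
void run_machine(Cpu &cpu, Timer &timer, int64_t q) {
    while (q > 0) {
        // Largest window, up to q, in which no predictable component makes contact.
        const int64_t window = std::min(q, timer.quiet_cycles());
        // The CPU may exit early, having run only p cycles.
        const int64_t p = cpu.run_for(window);
        // Bring everything else up to p, then resume.
        timer.run_for(p);
        q -= p;
    }
}
```

The assert inside Cpu::run_for is the point of the exercise: whenever the CPU actually touches the timer, the timer has been run to exactly the same cycle.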
One of the Mega Drive emulators is even smarter. To run for q cycles:
- have all components store their state;
- ask all to run for q cycles, in parallel;
- ask whether any tried to access a shared resource during that period;
- if so, restore the stored states and try again with the now-known smaller window. Then continue from there.
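A rough C++ sketch of those four steps, under heavy assumptions: Actor, Shared and Machine are invented for illustration and have nothing to do with the Mega Drive emulator's real code. The two actors are run one after the other here rather than on separate threads, which keeps the sketch simple but also shows why the rollback matters, since the speculative pass applies their shared accesses in the wrong order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// The shared resource (hypothetical): one register whose value depends on access order.
struct Shared { int64_t value = 0; };

struct Actor {
    int64_t now = 0;
    int64_t stride = 1;        // toy rule: touch the register when now % stride == stride - 1
    int64_t first_touch = -1;  // offset of the first shared access in the last run, or -1

    void run_for(int64_t cycles, Shared &shared) {
        first_touch = -1;
        for (int64_t i = 0; i < cycles; ++i, ++now) {
            if (now % stride == stride - 1) {
                if (first_touch < 0) first_touch = i;
                shared.value = (shared.value * 31 + stride) % 1000003;
            }
        }
    }
};

// Earliest non-negative offset of the two, or -1 if neither actor touched anything.
static int64_t earliest(int64_t x, int64_t y) {
    if (x < 0) return y;
    if (y < 0) return x;
    return std::min(x, y);
}

struct Machine {
    Actor a{0, 5}, b{0, 7};
    Shared shared;
    int64_t rollbacks = 0;

    // These two calls are independent of each other when no shared access
    // occurs, so they are the part that could be handed to separate threads.
    void run_both(int64_t cycles) {
        a.run_for(cycles, shared);
        b.run_for(cycles, shared);
    }

    void run_for(int64_t q) {
        assert(q >= 0);
        while (q > 0) {
            const Machine saved = *this;  // have all components store their state
            run_both(q);                  // ask all to run for q cycles
            const int64_t first = earliest(a.first_touch, b.first_touch);
            if (first < 0) return;        // nobody touched the shared resource

            // Restore, then retry with the now-known smaller window.
            const int64_t count = rollbacks + 1;
            *this = saved;
            rollbacks = count;
            run_both(first);  // safe: no shared access happens in this window
            run_both(1);      // the access cycle itself, serialised in a fixed order
            q -= first + 1;
        }
    }
};
```

Speculating over the whole remaining q every time is wasteful; a real implementation would presumably pick its windows more cleverly.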
ElectrEm also did a thing whereby the 6502 knew how to obey and interleave a suitably specific list of memory fetches and chuck them into a buffer. So upon each state change, the video stuff just posted a new list to the CPU and then at end-of-frame it produced the final display, correlating to a timestamped list of palette and mode events. You'd obviously need to do something like that if you actually wanted to spread out across threads.
Clock Signal is a bit more ad hoc; all audio generation and video interpretation is trivially boxed off into separate threads but right now each machine is internally serialised. Or, at least, overwhelmingly so. I use just-in-time processing wherever possible, e.g. a count is kept of how long since the WD1770 was last asked to do anything and attempting to read any of its registers will suddenly make it run for that many cycles prior to being asked for its read.
A slightly softened version applies to user-visible outputs like video collection; that doesn't happen unless or until either the processor is about to write to RAM or to a video register, or the processor reaches the end of the amount of time it's currently supposed to run for, in which case video collection catches up.
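The just-in-time idea boils down to something like the following C++ sketch. Peripheral and JustInTime are hypothetical names, not Clock Signal's actual classes, and the peripheral is a stand-in that just counts cycles rather than anything WD1770-shaped.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a peripheral such as a disk controller (hypothetical).
struct Peripheral {
    int64_t cycles_run = 0;
    void run_for(int64_t cycles) { cycles_run += cycles; }
    uint8_t read_status() const { return static_cast<uint8_t>(cycles_run & 0xff); }
};

// Wraps a peripheral and records how far behind the rest of the machine it is.
class JustInTime {
public:
    // Called as time passes elsewhere; does no emulation work at all.
    void advance(int64_t cycles) {
        assert(cycles >= 0);
        owed_ += cycles;
    }

    // Any access goes through here, so the peripheral first runs for
    // everything it is owed and only then services the request.
    Peripheral &up_to_date() {
        if (owed_ > 0) {
            target_.run_for(owed_);
            owed_ = 0;
            ++flushes_;
        }
        return target_;
    }

    int64_t owed() const { return owed_; }
    int flushes() const { return flushes_; }

private:
    Peripheral target_;
    int64_t owed_ = 0;
    int flushes_ = 0;
};
```

So a thousand calls to advance(2) followed by one up_to_date().read_status() cost a single run_for(2000) rather than a thousand small ones.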
I've a mentally-scheduled task which is a pretty simple version of that: push the catch-ups off into asynchronous land. As long as I block until they're all completed before I start the next iteration of the processing loop, life is good. It's not as parallel as if all were running at the same time like real chips, but it should parallelise a bunch of subsystems.
I'm still weighing the Mega Drivey approach mentioned above for cases where there are two or more unpredictable actors with a shared resource. I guess that'd be what a BBC emulator should do to handle the Tube. Probably with a drop back to ordinary serialisation upon any communication, which reverts to reduced-length parallelisation and then increasingly more confident steps only after the two actors seem not to have talked for a certain threshold?
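To make that last thought concrete, one possible shape for the window policy is sketched below in C++. It's pure speculation: WindowPolicy, the doubling rule and the default numbers are all assumptions picked for illustration, not anything measured or implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Decides how many cycles to attempt in parallel on the next step (hypothetical).
struct WindowPolicy {
    int64_t max_window = 256;
    int64_t quiet_threshold = 64;  // quiet cycles required before leaving serial mode
    int64_t window = 1;            // 1 means ordinary serialisation
    int64_t quiet = 0;             // cycles since the actors last talked

    // Report the outcome of the step just completed.
    void report(int64_t ran, bool communicated) {
        assert(ran > 0);
        if (communicated) {
            // Any communication drops straight back to serialised stepping.
            window = 1;
            quiet = 0;
            return;
        }
        quiet += ran;
        // Only once the actors have been quiet for long enough do the steps
        // grow, doubling each time up to the cap.
        if (quiet >= quiet_threshold) window = std::min(window * 2, max_window);
    }
};
```

The caller would ask for `window` cycles each time round, then feed back how many cycles actually ran and whether the shared resource was touched.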
The most similar situation I currently model is the Vic-20 plus C1540, which amounts to two 6502s with a shared serial bus, so my thinking may be unduly boxed in by the specific.
Rich Talbot-Watkins wrote:I have the beginnings of an emulator framework (which could best be described as a 6502 + VIAs + CRTC simulator right now) which also goes for the 'tick each component in turn, cycle-by-cycle' approach. This is just using a fairly traditional state machine in C++ (the generated 6502 state machine ends up being a big switch with 560 cases!). But using a co-routines approach it'd be a bit neater; though, even with parallel stack frames as a language feature, I'm not sure if it'd actually be quicker.
Anyway, with co-routines you'd retain synchronisation by going for exactly the same kind of approach - run one cycle's worth of simulation, and then yield to the next co-routine. It's just cooperative threading, but with the readability advantage that you can write the logic linearly, yielding after each cycle's worth of simulation.
ElectrEm uses the coroutine approach: as above, the processor exists on its own thread, blocking itself in order to yield. Clock Signal uses the state machine approach, though there are only 117 things in its switch statement*. I don't think there's a substantial difference in performance from that angle other than that ElectrEm skips 90% of the overhead via its how-long-until-you-interfere-with-somebody-else scheduling of non-CPU components.
I'm pretty sure the main performance impediment in a modern 8-bit machine emulator is thrashing the instruction cache, branch prediction tables, etc. by constantly jumping all over the place. All those jarring transitions from the CPU code and data set to the CRTC code and data set, to the SN76489 code and data set, etc, etc. Especially if you do it strictly as perform a cycle here, perform a cycle there, etc.
* I think I generate mine in a very different sense: the things the switch can hit were selected and implemented manually; what's automatically generated is the table from opcodes to micro-ops, which is nothing beyond the abilities of the C preprocessor. I directly have a list of 256 entries that looks like ZeroXWrite(OperationSTY) or equivalent, but since the list is installed once at machine construction it'd be easy to automate that too. The Z80 analogue does so to an extent.
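As a cut-down illustration of that table-plus-switch arrangement, here's a C++ sketch with just two opcodes filled in. Only the names ZeroXWrite and OperationSTY come from the description above; the micro-op set, the Cpu struct and the treatment of unlisted opcodes as one-cycle no-ops are assumptions made to keep it short, and it shouldn't be read as Clock Signal's actual code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

enum MicroOp : uint8_t {
    CycleFetchOperand,   // read the zero-page address byte
    CycleAddXToAddress,  // the cycle spent adding X to the address
    OperationSTA,        // choose the value to store; costs no bus cycle
    OperationSTY,
    CycleWrite,
    EndOfInstruction
};

using Program = std::array<MicroOp, 5>;

// The preprocessor expands one addressing mode plus one operation into micro-ops.
#define ZeroXWrite(op) \
    Program{{CycleFetchOperand, CycleAddXToAddress, op, CycleWrite, EndOfInstruction}}

// Built once; opcodes not listed here simply end immediately in this sketch.
const std::array<Program, 256> &table() {
    static const std::array<Program, 256> programs = [] {
        std::array<Program, 256> built{};
        for (auto &program : built) program.fill(EndOfInstruction);
        built[0x94] = ZeroXWrite(OperationSTY);  // STY zp,X
        built[0x95] = ZeroXWrite(OperationSTA);  // STA zp,X
        return built;
    }();
    return programs;
}

struct Cpu {
    std::vector<uint8_t> ram = std::vector<uint8_t>(65536);
    uint16_t pc = 0, address = 0;
    uint8_t a = 0, x = 0, y = 0, value = 0;
    const MicroOp *next = nullptr;  // null means the next cycle is an opcode fetch

    // Advances exactly one bus cycle; every `return` marks the end of a cycle.
    void run_cycle() {
        while (true) {
            if (!next) { next = table()[ram[pc++]].data(); return; }
            switch (*next++) {
                case CycleFetchOperand:  address = ram[pc++]; return;
                case CycleAddXToAddress: address = (address + x) & 0xff; return;
                case OperationSTA:       value = a; continue;
                case OperationSTY:       value = y; continue;
                case CycleWrite:         ram[address] = value; return;
                case EndOfInstruction:   next = nullptr; continue;
            }
        }
    }

    void run_for(int cycles) {
        assert(cycles >= 0);
        while (cycles--) run_cycle();
    }
};
```

With 0x94 0x10 at address 0, X = 4 and Y = 0x42, the write to 0x14 lands on the fourth cycle and not before, which is the cycle-correct behaviour the switch is there to provide.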