6502 Code Profiling

for bbc micro/electron hardware, peripherals & programming issues (NOT emulators!)
User avatar
hoglet
Posts: 7068
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

6502 Code Profiling

Postby hoglet » Mon Apr 02, 2018 12:59 pm

Hi all,

As some of you will be aware, a few of us (myself, BigEd, myelin and cmorley) have been dabbling with using cheap (<£20) hardware to capture instruction traces from the 6502.

This kind of tool is incredibly versatile and can be used to:
- fault-find non-functional Beebs/Masters/Electrons/Atoms (pretty much any 6502 based system)
- diagnose intermittent hardware faults (it would be applicable to Daniel's Master glitch)
- debug crashing software (e.g. why did White Light crash with MMFS)
- help develop better software (e.g. where is the bottleneck and how can I make it go faster)

For more details on the hardware needed, refer to the other threads:
- Open Source Logic Analyzer Experiments
- Minimal Tube 6502decode bus snoop with FTDI USB
- Myelin's FX2 Tube/Cartridge Adapter - Development and Support

We're at the stage now where the hardware is pretty stable, will run on both Linux and Windows, and is capable of capturing several minutes of 6502 code execution.

On the software side, we've been working on a tool to analyze these instruction traces called "decode6502":
- https://github.com/hoglet67/6502Decoder

This will build/run on any platform where you can run gcc.

The decode6502 program reads a binary capture files(which minimally just contain samples of the 6502's data bus) and produces a full disassembly with register values:

Code: Select all

$ decode6502 --phi2= -sh test/beeb/reset.tmp | head -20
???? :          : RESET !!       : A=?? X=?? Y=?? SP=?? N=? V=? D=? I=1 Z=? C=?
D9CD : A9 40    : LDA #40        : A=40 X=?? Y=?? SP=?? N=0 V=? D=? I=1 Z=0 C=?
D9CF : 8D 00 0D : STA 0D00       : A=40 X=?? Y=?? SP=?? N=0 V=? D=? I=1 Z=0 C=?
D9D2 : 78       : SEI            : A=40 X=?? Y=?? SP=?? N=0 V=? D=? I=1 Z=0 C=?
D9D3 : D8       : CLD            : A=40 X=?? Y=?? SP=?? N=0 V=? D=0 I=1 Z=0 C=?
D9D4 : A2 FF    : LDX #FF        : A=40 X=FF Y=?? SP=?? N=1 V=? D=0 I=1 Z=0 C=?
D9D6 : 9A       : TXS            : A=40 X=FF Y=?? SP=FF N=1 V=? D=0 I=1 Z=0 C=?
D9D7 : AD 4E FE : LDA FE4E       : A=80 X=FF Y=?? SP=FF N=1 V=? D=0 I=1 Z=0 C=?
D9DA : 0A       : ASL A          : A=00 X=FF Y=?? SP=FF N=0 V=? D=0 I=1 Z=1 C=1
D9DB : 48       : PHA            : A=00 X=FF Y=?? SP=FE N=0 V=? D=0 I=1 Z=1 C=1
D9DC : F0 09    : BEQ D9E7       : A=00 X=FF Y=?? SP=FE N=0 V=? D=0 I=1 Z=1 C=1
D9E7 : A2 04    : LDX #04        : A=00 X=04 Y=?? SP=FE N=0 V=? D=0 I=1 Z=0 C=1
D9E9 : 86 01    : STX 01         : A=00 X=04 Y=?? SP=FE N=0 V=? D=0 I=1 Z=0 C=1
D9EB : 85 00    : STA 00         : A=00 X=04 Y=?? SP=FE N=0 V=? D=0 I=1 Z=0 C=1
D9ED : A8       : TAY            : A=00 X=04 Y=00 SP=FE N=0 V=? D=0 I=1 Z=1 C=1
D9EE : 91 00    : STA (00),Y     : A=00 X=04 Y=00 SP=FE N=0 V=? D=0 I=1 Z=1 C=1
D9F0 : C5 01    : CMP 01         : A=00 X=04 Y=00 SP=FE N=1 V=? D=0 I=1 Z=0 C=0
D9F2 : F0 09    : BEQ D9FD       : A=00 X=04 Y=00 SP=FE N=1 V=? D=0 I=1 Z=0 C=0
D9F4 : C8       : INY            : A=00 X=04 Y=01 SP=FE N=0 V=? D=0 I=1 Z=0 C=0
D9F5 : D0 F7    : BNE D9EE       : A=00 X=04 Y=01 SP=FE N=0 V=? D=0 I=1 Z=0 C=0

As the traces get longer, we need some higher level ways of viewing the results. With this in mind, I've started to add some code profiling capabilities to the decode6502 program:

The profiling that's just been added supports several modes of operation:
- instruction level profiling
- call level (exclusive of child subrotines)
- call level (inclusive of child subroutines)

(I'm using a 11.7 second capture from Kee's Atomic AGD as an example here.)

Instruction level profiling

This basically associates the cycles for each instruction with a bucket based on the address of the instruction:

Code: Select all

$ decode6502 --profile=instr,0000,ffff,100 --quiet --byte atic1.bin
0600 :     3114 (  0.026446%)
0e00 :     1566 (  0.013300%)
0f00 :    17410 (  0.147857%)
1000 :    90363 (  0.767422%) *
1100 :    89188 (  0.757443%) *
1200 :      112 (  0.000951%)
1300 :     2904 (  0.024663%)
1500 :    85184 (  0.723438%) *
1600 :   155398 (  1.319742%) **
1700 :    26335 (  0.223654%)
1800 :    25374 (  0.215493%)
1900 :    15042 (  0.127747%)
1a00 :   185808 (  1.578004%) **
1b00 :    24399 (  0.207212%)
1c00 :   656162 (  5.572559%) ********
1d00 :   436156 (  3.704123%) *****
1e00 :  3818481 ( 32.429049%) **************************************************
1f00 :    95190 (  0.808416%) *
2000 :      357 (  0.003032%)
2100 :    54439 (  0.462332%)
2200 :      685 (  0.005817%)
2300 :      189 (  0.001605%)
2c00 :    12389 (  0.105216%)
2d00 :    18294 (  0.155365%)
2e00 :     5160 (  0.043822%)
2f00 :    33797 (  0.287026%)
3000 :    12246 (  0.104001%)
3100 :    12795 (  0.108664%)
3200 :     9268 (  0.078710%)
3300 :     6431 (  0.054616%)
4900 :  2147652 ( 18.239272%) ****************************
7700 :        7 (  0.000059%)
fe00 :  3727578 ( 31.657041%) ************************************************
**** :     5405 (  0.045903%)
     : 11774878 (100.000000%)

The parameters to the profile option are:
- profile type (in this case instr to indicate instruction profiling)
- min address of interest (in hex)
- max address of interest (in hex)
- bucket size (in hex)

You can use these parameters to zoom in on a particular region and change the granularity:

Code: Select all

$ decode6502 --profile=instr,1e00,1eff,10 --quiet --byte atic1.bin
1e00 :    39015 (  0.331341%)
1e10 :    34660 (  0.294355%)
1e20 :    43105 (  0.366076%)
1e30 :    35980 (  0.305566%)
1e40 :    42067 (  0.357261%)
1e50 :    15679 (  0.133156%)
1e60 :     7166 (  0.060858%)
1e70 :   244380 (  2.075436%) *
1e80 :   206846 (  1.756672%) *
1e90 :  1080676 (  9.177811%) ******
1ea0 :   697310 (  5.922015%) ****
1eb0 :   831270 (  7.059691%) *****
1ec0 :   540327 (  4.588812%) ***
**** :  7956397 ( 67.570951%) **************************************************
     : 11774878 (100.000000%)

Instructions outside the region of interest end up in the **** bucket.

Call level profiling

I've also had a go at adding call level profiling. This tries to associate instructions cycle counts with subroutines. A subroutine is any code that is entered with JSR and exited with RTS, and may call further child subroutines. There are two flavours of this: exclusive and inclusive, which differ on whether the cycles consumed by a child subroutine are also included in the parent. None of this requires source code to work.

Exclusive call level profiling

Code: Select all

$ decode6502 --profile=callexc --quiet --byte atic1.bin
warning: call stack underflowed
warning: call stack underflowed
064d :  3730692 ( 31.683488%) **************************************************
0e3d :     1572 (  0.013350%)
1077 :       27 (  0.000229%)
10d4 :   176580 (  1.499633%) **
112f :      438 (  0.003720%)
12f5 :     2968 (  0.025206%)
1553 :    14520 (  0.123313%)
1562 :    54208 (  0.460370%)
15e3 :   174724 (  1.483871%) **
16e9 :      101 (  0.000858%)
1729 :     5049 (  0.042879%)
175f :    18321 (  0.155594%)
185c :     5120 (  0.043482%)
1870 :    12250 (  0.104035%)
18ac :     8004 (  0.067975%)
19c0 :    15870 (  0.134778%)
1a26 :    17936 (  0.152324%)
1a4f :   121226 (  1.029531%) *
1a73 :    51554 (  0.437830%)
1bed :   106066 (  0.900782%) *
1c41 :   397115 (  3.372561%) *****
1c88 :    79751 (  0.677298%) *
1cb7 :    38148 (  0.323978%)
1cd3 :   188735 (  1.602862%) **
1d2b :   314373 (  2.669862%) ****
1dc3 :   149058 (  1.265898%) *
1de4 :   147980 (  1.256743%) *
1e5c :      930 (  0.007898%)
1e68 :       96 (  0.000815%)
1e6e :   170406 (  1.447200%) **
1e8e :  3266255 ( 27.739183%) *******************************************
1f2d :    11929 (  0.101309%)
1f72 :    84130 (  0.714487%) *
2055 :       24 (  0.000204%)
2063 :       52 (  0.000442%)
208b :       42 (  0.000357%)
21c4 :    65994 (  0.560464%)
21d5 :      110 (  0.000934%)
220a :      567 (  0.004815%)
22b3 :      117 (  0.000994%)
2306 :      183 (  0.001554%)
32cf :      628 (  0.005333%)
3309 :     6283 (  0.053359%)
3322 :     2076 (  0.017631%)
49c6 :  2312856 ( 19.642293%) ******************************
**** :    19814 (  0.168274%)
     : 11774878 (100.000000%)


Inclusive call level profiling

Code: Select all

$ decode6502 --profile=callinc --quiet --byte atic1.bin
064d :  3903472 ( 33.150849%) ****************
0e3d :     1572 (  0.013350%)
1077 :       27 (  0.000229%)
10d4 :   176580 (  1.499633%)
112f :      438 (  0.003720%)
12f5 :    21504 (  0.182626%)
1553 :    14520 (  0.123313%)
1562 :    54208 (  0.460370%)
15e3 :   243452 (  2.067554%) *
16e9 :      101 (  0.000858%)
1729 :   266923 (  2.266885%) *
175f :    18321 (  0.155594%)
185c :    12040 (  0.102252%)
1870 :    29204 (  0.248020%)
18ac :     8004 (  0.067975%)
19c0 :    15870 (  0.134778%)
1a26 :    17936 (  0.152324%)
1a4f :   121226 (  1.029531%)
1a73 :   172780 (  1.467361%)
1bed :   503181 (  4.273344%) **
1c41 :   397115 (  3.372561%) *
1c88 :   667147 (  5.665851%) **
1cb7 :   587396 (  4.988553%) **
1cd3 :   549248 (  4.664575%) **
1d2b :  6402916 ( 54.377769%) ***************************
1dc3 :   178366 (  1.514801%)
1de4 :   177594 (  1.508245%)
1e5c :    22450 (  0.190660%)
1e68 :     2250 (  0.019108%)
1e6e :  6081803 ( 51.650667%) *************************
1e8e :  3266255 ( 27.739183%) *************
1f2d :    19001 (  0.161369%)
1f72 :    84130 (  0.714487%)
2055 :       50 (  0.000425%)
2063 :       52 (  0.000442%)
208b :      118 (  0.001002%)
21c4 :    65994 (  0.560464%)
21d5 :      110 (  0.000934%)
220a :      567 (  0.004815%)
22b3 :      352 (  0.002989%)
2306 :      235 (  0.001996%)
32cf :      628 (  0.005333%)
3309 :     6283 (  0.053359%)
3322 :     2076 (  0.017631%)
49c6 :  2312856 ( 19.642293%) *********
**** : 11774878 (100.000000%) **************************************************

All of this is now checked into github:
https://github.com/hoglet67/6502Decoder ... profiler.c

In case anyone wants to play, here's the capture file I've used for the above examples:
atic1.zip
(2.01 MiB) Downloaded 7 times

The zip file also includes the assembler listing, so you can see what the code at 49c6 is:

Code: Select all

0049C6  2               exx:
0049C6  2  A5 62              lda z80_c
0049C8  2  A4 6E              ldy z80_cp
0049CA  2  84 62              sty z80_c
0049CC  2  85 6E              sta z80_cp
0049CE  2  A5 63              lda z80_b
0049D0  2  A4 6F              ldy z80_bp
0049D2  2  84 63              sty z80_b
0049D4  2  85 6F              sta z80_bp
0049D6  2  A5 64              lda z80_e
0049D8  2  A4 70              ldy z80_ep
0049DA  2  84 64              sty z80_e
0049DC  2  85 70              sta z80_ep
0049DE  2  A5 65              lda z80_d
0049E0  2  A4 71              ldy z80_dp
0049E2  2  84 65              sty z80_d
0049E4  2  85 71              sta z80_dp      
0049E6  2  A5 80              lda scraddr
0049E8  2  A4 72              ldy z80_lp
0049EA  2  84 80              sty scraddr
0049EC  2  85 72              sta z80_lp
0049EE  2  A5 81              lda scraddr+1
0049F0  2  A4 73              ldy z80_hp
0049F2  2  84 81              sty scraddr+1
0049F4  2  85 73              sta z80_hp
0049F6  2  60                 rts


The call level profiling has a number of issues at the moment:
- the min/max addresses of interest don't work
- using RTS to implement a jump table will mess it up
- it doesn't really represent the call hierarchy (for example, from the above it's not obvious that 1d2b ==> 1e6e ==> 49c6.

Anyway, I'd really value any suggestions for taking this further.

Or if anyone wants to try it out (ideally on a real problem they care about) I'm very happy to help.

Dave
Last edited by hoglet on Mon Apr 02, 2018 1:12 pm, edited 1 time in total.

SteveF
Posts: 508
Joined: Fri Aug 28, 2015 8:34 pm
Contact:

Re: 6502 Code Profiling

Postby SteveF » Mon Apr 02, 2018 1:11 pm

This looks pretty cool. Am I right in thinking it would be possible (even easy?) to modify an emulator to output a similar trace, so that the analysis tools could be used with an emulator instead of this capture hardware? Up until now I've just done simple 'count[pc]++'-style profiling with a hacked emulator and looked for individual instructions with high counts, but your analysis tools seem much more sophisticated.

User avatar
hoglet
Posts: 7068
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 6502 Code Profiling

Postby hoglet » Mon Apr 02, 2018 1:16 pm

SteveF wrote:This looks pretty cool. Am I right in thinking it would be possible (even easy?) to modify an emulator to output a similar trace, so that the analysis tools could be used with an emulator instead of this capture hardware? Up until now I've just done simple 'count[pc]++'-style profiling with a hacked emulator and looked for individual instructions with high counts, but your analysis tools seem much more sophisticated.

It would be possible, but the trace files do get quite big quite quickly (~2MB/s for the Beeb), which would slow down the emulator. There are situations where I would have found this very useful.

An alternative would be to include just the profiling capabilities into the emulator. In fact, with B-Em all the hooks are already present in the debug interface.

cmorley
Posts: 426
Joined: Sat Jul 30, 2016 7:11 pm
Location: Oxford
Contact:

Re: 6502 Code Profiling

Postby cmorley » Mon Apr 02, 2018 1:38 pm

Good work Dave. I'm away from all my beebs this week so can't try it out but will download the repo and try it out later.

The decoded instruction traces get huge so it would be good to have some way of gating the decode areas - e.g. an option to bin all the brks before a power on/RESET - perhaps just a "1,000,000 BRKs seen" note or similar. Also maybe some hex memory windows so 'decode in this region' or 'ignore this region' which will allow much easier debugging of ISRs and function calls without having a 500MB text file of the program/BASIC/OS idle loop to scroll through.

Is the CPU stack reconstructed in the emulated CPU? This might be useful - option to print the stack neighbourhood around JSR/RTS/RTI etc. Should make it easier to debug stacking errors.

Just a few thoughts.

User avatar
hoglet
Posts: 7068
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 6502 Code Profiling

Postby hoglet » Mon Apr 02, 2018 2:22 pm

cmorley wrote:The decoded instruction traces get huge so it would be good to have some way of gating the decode areas - e.g. an option to bin all the brks before a power on/RESET - perhaps just a "1,000,000 BRKs seen" note or similar. Also maybe some hex memory windows so 'decode in this region' or 'ignore this region' which will allow much easier debugging of ISRs and function calls without having a 500MB text file of the program/BASIC/OS idle loop to scroll through.

Yes, I've also been thinking that some trigger and capture controls would be useful, independent of the code profiling.
cmorley wrote:Is the CPU stack reconstructed in the emulated CPU? This might be useful - option to print the stack neighbourhood around JSR/RTS/RTI etc. Should make it easier to debug stacking errors.

There is not currently any general attempt to emulate the stack, or for that matter emulate memory in general. I think this would be possible, and might allow for some additional error checking. Like with the registers, you would have to model the "don't know" state as well.

In the call based profiler module I do track just the call stack - i.e. JSR adds an address and RTS removes it. But I'm not made any attempt to make this exactly mirror the 6502's stack, and it's currently quite fragile. I'll have a think about whether that might be a better approach.

Thanks for the suggestions, do keep them coming.

Dave

User avatar
hoglet
Posts: 7068
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 6502 Code Profiling

Postby hoglet » Sat Apr 07, 2018 3:06 pm

The call level profiler now groups by call path.

Here's an example:

Code: Select all

$ decode6502 --profile=call --quiet --byte atic1.bin
warning: call stack underflowed, re-initialize call graph
warning: call stack underflowed, re-initialize call graph
    5343 (  0.118923%):
 1394477 ( 31.037838%): 064D
   18923 (  0.421182%): 064D->1A73
   44495 (  0.990356%): 064D->1A73->1A4F
    1572 (  0.034989%): 0E3D
      27 (  0.000601%): 1077
     438 (  0.009749%): 112F
    5049 (  0.112379%): 1729
  174724 (  3.888953%): 1729->15E3
   14520 (  0.323182%): 1729->15E3->1553
   54208 (  1.206545%): 1729->15E3->1562
     101 (  0.002248%): 1729->16E9
   18321 (  0.407783%): 1729->175F
   39095 (  0.870164%): 1BED
  137390 (  3.057984%): 1BED->1C41
   29503 (  0.656669%): 1C88
   14124 (  0.314368%): 1C88->1CB7
   68138 (  1.516595%): 1C88->1CB7->1CD3
   66920 (  1.489485%): 1C88->1CB7->1CD3->10D4
    1453 (  0.032340%): 1C88->1CB7->1CD3->12F5
     372 (  0.008280%): 1C88->1CB7->1CD3->12F5->1E5C
     676 (  0.015046%): 1C88->1CB7->1CD3->12F5->1E5C->1DE4
     136 (  0.003027%): 1C88->1CB7->1CD3->12F5->1E5C->1DE4->21C4
    7796 (  0.173521%): 1C88->1CB7->1CD3->12F5->1E5C->1E8E
     288 (  0.006410%): 1C88->1CB7->1CD3->12F5->32CF
     384 (  0.008547%): 1C88->1CB7->1CD3->185C
     174 (  0.003873%): 1C88->1CB7->1CD3->185C->18AC
     345 (  0.007679%): 1C88->1CB7->1CD3->185C->19C0
    2500 (  0.055644%): 1C88->1CB7->1CD3->1870
    1160 (  0.025819%): 1C88->1CB7->1CD3->1870->18AC
    2300 (  0.051193%): 1C88->1CB7->1CD3->1870->19C0
    6802 (  0.151397%): 1C88->1CB7->1CD3->1A26
    4244 (  0.094462%): 1C88->1CB7->1CD3->1F2D
    2516 (  0.056000%): 1C88->1CB7->1CD3->1F2D->21C4
   19279 (  0.429106%): 1C88->1CB7->1CD3->1F72
      42 (  0.000935%): 1C88->1CB7->1CD3->208B
      24 (  0.000534%): 1C88->1CB7->1CD3->208B->2055
      26 (  0.000579%): 1C88->1CB7->1CD3->208B->2055->2063
      26 (  0.000579%): 1C88->1CB7->1CD3->208B->2063
  116864 (  2.601123%): 1D2B
      93 (  0.002070%): 1D2B->1E5C
     169 (  0.003762%): 1D2B->1E5C->1DE4
      34 (  0.000757%): 1D2B->1E5C->1DE4->21C4
    1949 (  0.043380%): 1D2B->1E5C->1E8E
      96 (  0.002137%): 1D2B->1E68
     176 (  0.003917%): 1D2B->1E68->1DC3
      34 (  0.000757%): 1D2B->1E68->1DC3->21C4
    1944 (  0.043269%): 1D2B->1E68->1E8E
   62389 (  1.388635%): 1D2B->1E6E
   54671 (  1.216850%): 1D2B->1E6E->1DC3
   10744 (  0.239137%): 1D2B->1E6E->1DC3->21C4
   53716 (  1.195594%): 1D2B->1E6E->1DE4
   10744 (  0.239137%): 1D2B->1E6E->1DE4->21C4
 1189278 ( 26.470582%): 1D2B->1E6E->1E8E
  847896 ( 18.872207%): 1D2B->1E6E->49C6
     110 (  0.002448%): 21D5
     567 (  0.012620%): 220A
     117 (  0.002604%): 22B3
     183 (  0.004073%): 22B3->2306
      52 (  0.001157%): 22B3->2306->32CF
    2324 (  0.051727%): 3309
     768 (  0.017094%): 3322
 4492829 (100.000000%)

Coeus
Posts: 746
Joined: Mon Jul 25, 2016 11:05 am
Contact:

Re: 6502 Code Profiling

Postby Coeus » Sat Apr 07, 2018 10:08 pm

hoglet wrote:The call level profiler now groups by call path.


Excellent work.

User avatar
tricky
Posts: 2375
Joined: Tue Jun 21, 2011 8:25 am
Contact:

Re: 6502 Code Profiling

Postby tricky » Sun Apr 08, 2018 8:19 am

Do you think it is worth adding support for reading the label files that can be output from beebasm (and SWIFT I think)?
I added it for my local version of beebem and it is quite useful, where an address isn't an exact match, I just use the nearest label with an offset.
e.g. .draw_text+3
beebasm doesn't export local labels, but that should be easy to add, otherwise you get bne .fun2-1 instead of .fun1.end (I can't remember if that last syntax was added to beebasm or not!)

User avatar
hoglet
Posts: 7068
Joined: Sat Oct 13, 2012 6:21 pm
Location: Bristol
Contact:

Re: 6502 Code Profiling

Postby hoglet » Sun Apr 08, 2018 8:39 am

tricky wrote:Do you think it is worth adding support for reading the label files that can be output from beebasm (and SWIFT I think)?
I added it for my local version of beebem and it is quite useful, where an address isn't an exact match, I just use the nearest label with an offset.
e.g. .draw_text+3
beebasm doesn't export local labels, but that should be easy to add, otherwise you get bne .fun2-1 instead of .fun1.end (I can't remember if that last syntax was added to beebasm or not!)

Yes, I think this is an excellent idea.

I also noticed that ca65/ld65 (which Kees uses on for the Atom) has support for exporting symbols (the -Ln option can be used to output a vice label file). Their syntax is e.g. "labelname = $1234 ; Maybe a comment". That seems a reasonable format to me.

Dave