B0 Internals

Copyright © 2000-2008, Darran Kartaschew

All Items contained within this site are released under the BSD License!

Introduction

B0, the language, and this implementation are released under the BSD license. In doing so, I wanted to aid any developer who wishes to make their own implementation by giving a guided tour of the current implementation, as made by me, so that B0 can truly be given the freedom it deserves!

Why the BSD License?

Why any license? I guess each developer has their own preferred license under which they release their open-source applications. For a long time, I have used many free closed-source and open-source applications, and this is just my way of giving something back to the community.

Because I developed B0 for me, and only for me (well, it was made to help me learn more about compiler theory, and so I could have an easy-to-use assembler-like language during development of other software), I have no interest in making cash off it. So that others could enjoy it virtually unrestricted (while still maintaining credit), I released both the language and this implementation under the BSD License.

Design decisions

This is just a few paragraphs about some of the higher level decisions that went into developing B0.

History

I'm a typical assembler programmer: I like coding to the raw metal. However, the speed at which most assembler is written is cumbersome.

Also, I had never developed my own language (how many programmers can state that they have designed and *implemented* a working version of their own language?), and it was something that I wanted to try.

I started by having a quick read of the FAQ associated with news://comp.compilers, and following some of the links I found to the many online sources on compiler theory.

I read through about 4 or 5 texts before starting, just so I understood many of the techniques. I also made the decision not to use yacc/bison and friends, but to do it from scratch...

The texts I used in the beginning were by J. Crenshaw and N. Wirth, as models for the actual compiler. (Note: both of these use Pascal or similar, and NOT C, for code samples).

Why so low level?

That's the assembler programmer in me: short, sharp and to the point. I also wanted the language NOT to hide anything from the programmer, and to do that, you have to give the programmer the CPU registers to play with.

Why this style of notation?

Because it's more HLL-programmer friendly. If you showed a typical programmer what each of these did, which one is the most obvious: r0 = r1; or mov r0, r1? The first one, of course, even though both do the *exact* same thing.

64bits?

Well, the Opteron and Athlon 64 have been out for some time, Intel has released their implementation (aka EM64T), Linux/*BSD has been running on it for a while, and now Windows for x64 has been released, so why not? While the current market share of 64bit-enabled x86 chips is relatively low compared to the 32bit parts, it's still a void that needs to be filled.

Why generate assembler output, and not generate machine code?

It's easier to debug the compiler, and I can also let others worry about the final asm->ML conversion, as well as object file formats.

Why do it in C?

Seriously, to brush up on my C. I haven't coded in C for a number of years, and I thought I would brush up on it. The actual C implementation will show how newbish in C I am, but what the heck!

The flow of the Code!

Major Components

The following sections outline the major components of this implementation, and of most of the compilers out there. I will add the disclaimer that this is most likely not the best way to make a compiler, but it works.

Data Structures

There are 2 major data structures, and 3 minor structures within the B0 implementation.

The Hash Table

The Hash (or symbol) table is the first major data structure within the current implementation. The table holds all the symbols/labels in use during compilation, including information about each symbol/label. The current implementation uses a closed-hash table with up to 250,007 hashes or slots available, and utilises 66MB for those entries (roughly 280 bytes per entry). The hash table is by far the memory hog of the current implementation.
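As a rough illustration only (the field names, sizes and layout below are assumptions, not the actual definitions from the source), a closed-hash symbol table of this shape might look like:

    #include <stdint.h>

    #define HASH_TABLE_SIZE 250007      /* prime number of slots                 */
    #define MAX_SYMBOL_LEN  256         /* assumed maximum symbol/label length   */

    /* One slot in the closed-hash symbol table (illustrative layout only;
       the real entries work out to roughly 280 bytes each, ~66MB in total). */
    struct hash_entry {
        char     name[MAX_SYMBOL_LEN];  /* the symbol/label text                 */
        uint8_t  type;                  /* keyword, proc, variable, ...          */
        uint8_t  scope;                 /* global or local                       */
        uint8_t  size;                  /* byte/word/dword/qword for variables   */
        uint64_t value;                 /* address, offset or constant           */
    };

    static struct hash_entry hash_table[HASH_TABLE_SIZE];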

The token stack

The second major data structure is the Token stack, which holds the tokenised input stream. This is a simple array of up to 1024 elements (or tokens). Note: the size of this array is controlled by a single define at the start of the source code.
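A minimal sketch of such a stack (the define name and element type here are assumptions):

    #include <stdint.h>

    #define TOKEN_STACK_SIZE 1024       /* the single define that controls the size */

    static uint64_t token_stack[TOKEN_STACK_SIZE];  /* the tokenised input stream  */
    static int      token = 0;                      /* current number of tokens    */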

The file stack

The first of the minor structures, it holds the file handles and filenames of the files currently being processed. It can hold up to 32 file handles and names, which means that the current implementation can handle library inclusions up to 32 deep.
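Sketched out (file_stack_ptr is named later in this document; the entry layout and other names are assumptions):

    #include <stdio.h>

    #define FILE_STACK_SIZE 32          /* up to 32 deep library inclusions      */

    /* Illustrative entry: handle plus name of each file being processed. */
    struct file_entry {
        FILE *handle;
        char  name[260];
    };

    static struct file_entry file_stack[FILE_STACK_SIZE];
    static int               file_stack_ptr = 0;   /* index of the current file  */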

The include Path stack

The second minor structure holds all the include path names, and is used when searching for included files.

The If-While Stack

This stack holds the current conditions and state used by the IF-THEN-ELSE and WHILE-DO statements. This structure enables easy handling of those syntax constructs.
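As a sketch (the entry layout and names are assumptions; only the purpose is taken from the description above):

    #define BLOCK_STACK_SIZE 64         /* assumed maximum nesting depth          */

    /* Illustrative entry: what kind of block we are in, and the label number
       used for its jump targets when code is generated. */
    struct block_entry {
        int type;                       /* e.g. an IF, WHILE or PROC marker value */
        int label;                      /* numeric label for the block's jumps    */
    };

    static struct block_entry if_while_stack[BLOCK_STACK_SIZE];
    static int                block_ptr = 0;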

The Tokeniser

The tokeniser is rather simple in implementation. It's by no means the best, but it's simple. Its sole function is to convert your raw UTF8-encoded input into a tokenised form, which can easily be analysed and have the resulting code produced.

It converts symbols into their equivalent token values, strings into an internal UTF32 form, numbers as-is, and labels into a single token (irrespective of the number of characters a label has).

Numbers have a range of '0'-'9', labels/keywords start at 010000000h, and operator tokens are between 01000000h and 100100h within the token stack. Labels have their hash value added to 010000000h before being placed onto the stack.

Strings are marked with a string-start token and followed by the UTF32 encodings until the string-end token is reached. As can be seen, it's easy to tell what an item on the token stack is by its value (in the broad sense).

All multi-character operators (eg ==, >=, ~*) are turned into a single token, simplifying token->code generation.

The tokeniser also has another role: to call the token->code generator when an end-of-statement appears, and also to call the block function, which helps handle the IF-THEN and WHILE-DO stack.

Hashing

As already mentioned, the hash table is a closed-hash table, which means it's limited in the number of elements it can hold. A closed-hash table was chosen as it's easier to implement. The code has (hopefully) been written in such a way that it should be easy to change the hash table format, or even change to an open-hash table.

After some discussion of hash algorithms in news://alt.lang.asm, I decided to stick with the tried and true ELFHASH algorithm (the one used for calculating hashes in the UNIX ELF file format). The holy grail of hash algorithms is to find an algorithm that can produce a unique hash for any given input of a specified domain. Like most holy grails, it doesn't exist. The average programmer is better off using a common, high-performance algorithm and handling hash collisions gracefully, than trying to produce no collisions at all and then either assuming that collisions don't exist or handling them poorly.
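For reference, the classic ELFHASH algorithm looks like this (this is the well-known published form, not necessarily character-for-character what the B0 source uses):

    unsigned long ElfHash(const unsigned char *name)
    {
        unsigned long h = 0, g;

        while (*name) {
            h = (h << 4) + *name++;     /* fold in the next character             */
            if ((g = h & 0xF0000000))
                h ^= g >> 24;           /* mix the top nibble back down           */
            h &= ~g;                    /* and clear it                           */
        }
        return h;
    }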

In the event of a hash collision, that is, two non-equal symbols producing the same hash, I used linear rehashing. (A Dr Dobbs article showed that, in most cases, a linear rehash is quicker than a full rehash, due to cache/memory issues as well as the performance of the rehash itself). Plus, linear rehashing is simple: if a collision is found, increment the hash value by 1, and keep doing so until a free slot in the hash table is found. The problem is that linear rehashing isn't that great once the hash table gets over 80% full, as you tend to hit more non-free slots than free slots.
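A minimal sketch of lookup/insertion with linear rehashing along these lines, reusing the illustrative hash_table and ElfHash() sketches above (the helper and error names are placeholders, not the names used in the source):

    #include <string.h>

    extern void report_error(const char *msg);   /* placeholder for the compiler's own abort() */

    /* Returns the slot for 'name' (free, or already holding the same symbol),
       or reports an error if the table is full. */
    long find_slot(const char *name)
    {
        unsigned long slot = ElfHash((const unsigned char *)name) % HASH_TABLE_SIZE;
        long attempts;

        for (attempts = 0; attempts < HASH_TABLE_SIZE; attempts++) {
            if (hash_table[slot].name[0] == '\0')          /* free slot           */
                return (long)slot;
            if (strcmp(hash_table[slot].name, name) == 0)  /* same symbol         */
                return (long)slot;
            slot = (slot + 1) % HASH_TABLE_SIZE;           /* linear rehash       */
        }
        report_error("Too many symbols - hash table is full");
        return -1;                                         /* not reached         */
    }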

The current implementation will only allow a symbol to be rehashed a number of times equal to the size of the hash table, eg 250,007 times. If it can't find a free slot, it's assumed that the table is full, and we exit gracefully with an error that too many symbols have been used.

I would just like to add that local symbols/labels (eg those declared within a procedure) are removed from the hash table at the start of each procedure declaration, freeing room for more symbols. So while there may be over 250,000 unique symbols within an application, many of the hash positions get recycled, and in theory the number of symbols is effectively unlimited.

Also, why the odd size for the hash table? Why not make it 250,000 hashes instead of 250,007? Well, 250,007 is a prime number. The ELFHASH algorithm produces a 32bit hash for the given input, and dividing by a prime reduces the chance of a collision due to factorisation of the divisor. (The hash, modulo the table size, is used as the index into the hash table.)

Token Stack to Code Generation

This is probably the heart of the compiler. It's also one of the most complex parts; well, not so much complex as awkward.

The core of the function is just a very large switch() statement, which branches into other smaller IF-THEN-ELSE and switch() statements until the token stack has been turned into code, or, if something doesn't make sense, exits with an error.

The first switch() statement works on the very first token in the stack, and just runs that case.

The only symbols not to appear on the stack are { and ;. These are used by the tokeniser to call the token->code generator. } symbols are placed into the token stack, and handled in such a way that handling the ELSE keyword is easy.

Object Format Creation

The actual creation of source code for each object format is done right at the end of the whole compile process, and isn't done until all the code and data definitions are complete. All that really happens is that the appropriate headers are attached to the code/data output. This technique makes adding other output formats easy.

How it all fits together

Well, the following process is used:

  1. Read the command line parameters, and display help if asked or no source file name is provided.
  2. Insert all the keyword tokens into the hash table.
  3. Attempt to open the source file, and if successful, create 2 temp files (one for code, the other for data declarations).
  4. Start reading the file, by calling block() which calls nextToken() which is the tokeniser.
  5. The tokeniser keeps going, until either:
    1. The token stack is full, (in which case exit with error)
    2. Or, we reach an end-of-statement marker, eg ; or {, in which case call processTokenStack().
  6. processTokenStack() reads the tokenised input and converts it to code/data declarations. Once it has finished converting the tokenised input without error, we return to the tokeniser.
  7. Back in the tokeniser, we continue with step 5 above, unless we run out of source code file to read.
  8. If at the end of the source code, call processTokenStack() to clear out the token stack, else jump back up to step 5 above.
  9. Using the output format, create a new file and dump the appropriate header to the newly created file. Copy the data and code temp files to the new file, and close the new file.
  10. Delete the temp files
  11. And we are done!

The C Implementation

Introduction

This tutorial of the source only relates to version 0.0.6. (Download v0.0.6) Other versions may have grown and/or changed how things are handled. v0.0.6 is also simple enough that getting through the code is easy. (At just over 3000 lines of code, it's not that big for a compiler).

Rather than include the source here, just read through the source code in your favourite text editor.

The Source

The source is divided into sections, however the line numbers should be good enough to follow the context.

Lines 1 - 21

Source code license.

Lines 22 - 24

Standard C includes, and we define POSIX to tell VC++ that we want ANSI compatibility when calling functions.

Lines 26 - 42

This is a list of all the C functions we have used, just for reference when porting the C implementation to another system.

Lines 44 - 231

All our defines, including token values for operators, and precalculated hash values for keywords.

Lines 233 - 302

All our variables needed, including the various data structures defined above.

Lines 303 - 403

Our keyword strings.

Lines 405 - 413 - getChar()

This function simply reads the next character from the input file. Two variables are used, ch and look_ahead_ch, where ch is the current character and look_ahead_ch is the next character in the stream. This allows the tokeniser to look at the next character to determine what action to take when tokenising the input.
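A sketch of the one-character look-ahead (ch and look_ahead_ch are the names used in the source; the file handling shown here is simplified and assumed):

    #include <stdio.h>

    static FILE *src;                   /* the current source file                */
    static int   ch;                    /* the current character                  */
    static int   look_ahead_ch;         /* the next character in the stream       */

    /* Read one character, keeping one character of look-ahead. (The real
       function also deals with the file stack; that is omitted here.) */
    void getChar(void)
    {
        ch = look_ahead_ch;
        look_ahead_ch = fgetc(src);
    }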

Lines 415 - 421 - isAlpha()

Simply returns true if the character passed is a..z or _, else returns false. (Used by the tokeniser.)

Lines 422 - 432 - abort()

One of my favourites, this aborts the compiler with an error as specified in a string. It also prints some helpful information for the user, like which file, its line number, the token number, and the time the compiler had been running before the error.

Lines 433 - 440 - isSpace()

Returns true if the character is a space, tab, NULL or CR, else returns false.

Lines 442 - 457 - ElfHash()

The classic ELF HASH algorithm, used by the tokeniser to hash the input it sees as a label.

Lines 459 - 467 - insert_token()

This is one of the helper functions used during initialisation of the compiler; a string (containing a keyword) is passed to this function, and it inserts the keyword into the hash table.

Lines 469 - 474 - insert_token_stack()

This function takes the resultant token (as derived by the tokeniser) and inserts it into the token stack (which is later processed by process_token_stack()). It also checks for stack overflow, and aborts if required.
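A sketch of what such an insert-with-overflow-check might look like, reusing the token stack sketch from earlier (the error helper name is a placeholder for the compiler's own abort()):

    #include <stdint.h>

    extern void report_error(const char *msg);   /* placeholder error helper      */

    /* Push one token onto the token stack, aborting on overflow. */
    void insert_token_stack(uint64_t value)
    {
        if (token >= TOKEN_STACK_SIZE)
            report_error("Token stack overflow");
        token_stack[token++] = value;
    }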

Lines 476 - 481 - atStackStart()

This function is used by process_token_stack() to see if the current position in the stack is at the start, and if not, aborts processing.

Lines 483 - 487 - atStackEnd()

This function is similar to the one above, but checks to see if we are at the end of the token stack, while it is being processed.

Lines 489 - 515 - TokenIsLabelType()

This function is used by process_token_stack() to see if the current token is a label of a particular type. If not, it aborts processing.

Lines 517 - 523 - IsLabelAllocated()

This function is used by process_token_stack() to see if the current token is a label and has no type allocated (that is, it hasn't been given a type, whether that be a keyword, variable, proc, etc). If it has already been given a type, we abort with an error. This typically occurs when a proc or variable has been redeclared somewhere in the source code.

Lines 524 - 528 - isHash()

This function simply ensures that the current token being processed is a label (whether that be a proc, variable or keyword).

Lines 529 - 543 - outputString()

This function outputs, starting from the current token, a UTF16-encoded string to the data temp file. It finishes either when we hit the end of the token stack, or when we hit the TOKEN_STRING_END token.

Lines 545 - 551 - outputNumber()

This function is similar to the function above, except it outputs the tokens as is (to form a string which represents a number in the code output stream).

Lines 553 - 561 - setState()

This function determines the type of the current label, and sets the variable state to represent either a byte, word, dword or nothing (qword).

Lines 563 - 605 - TokenIs()

This function checks to see if the current token is of an expected value, and aborts if it is not. This is heavily used during math and bitwise operator code generation.

Lines 606 - 667 - callProc()

This function is used by process_token_stack() to output the required code when calling a procedure or function within the source code. It sets up a data frame and places the parameters, as passed in the source code, onto the data frame. At this stage, no checking is done, so a user has the ability to pass either too few or too many parameters. It then calls the procedure, and on return tears down the data frame.

Lines 669 - 679 - outputDynamicString()

This function handles outputting a string into the data stream during operations where a register is given the location of a string that is not predefined, eg r0 = &'my_string';. It is used primarily by process_token_stack().

Lines 681 - 692 - PrintHelp()

This function simply displays the help message when needed, eg when the -? or -h commandline options are used.

Lines 694 - 732 - dhtoi()

This function is similar to atoi(); however, it accepts either a decimal or hexadecimal value as a string and converts it to the correct value.
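A sketch of such a conversion; the trailing-'h' hexadecimal notation assumed here matches the token values quoted earlier in this document, but is an assumption about B0's number syntax rather than a statement of it:

    #include <ctype.h>
    #include <string.h>

    /* atoi()-like conversion accepting decimal ("1234") or hexadecimal ("0FFh")
       strings. Sketch only; assumes well-formed input and does no overflow checks. */
    long dhtoi(const char *s)
    {
        size_t len = strlen(s);
        long   value = 0;
        size_t i;

        if (len > 0 && (s[len - 1] == 'h' || s[len - 1] == 'H')) {
            for (i = 0; i < len - 1; i++) {                /* hexadecimal         */
                int c = toupper((unsigned char)s[i]);
                value = value * 16 + (c <= '9' ? c - '0' : c - 'A' + 10);
            }
        } else {
            for (i = 0; i < len; i++)                      /* decimal             */
                value = value * 10 + (s[i] - '0');
        }
        return value;
    }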

Lines 734 - 756 - scan_env()

This function is used during the initialisation of the compiler to break up the -i and %B0_INCLUDE% variables into separate units and place them into the PATH data structure.

Lines 756 - 2252 - process_token_stack()

This function is the primary token->code generator, and since B0 is a rather simple language, the resulting generator is rather simple itself.

We start by saving the number of tokens to process in the variable i, and then reset the token count (token). Now, using a simple switch() statement, we branch depending on the value of the first token.
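In outline, the dispatch looks something like the sketch below, reusing the token stack sketch from earlier (the case names and values are illustrative placeholders, not the actual defines from the source):

    #include <stdint.h>

    /* Dispatch skeleton only; the case names are illustrative placeholders. */
    enum { TOKEN_LIB = 0x100, TOKEN_SYSCALL, TOKEN_IF, TOKEN_WHILE };

    void process_token_stack(void)
    {
        int i = token;              /* save the number of tokens to process       */
        token = 0;                  /* and reset the token count                  */
        (void)i;                    /* the real cases use i; they are omitted here */

        switch (token_stack[0]) {   /* branch on the very first token             */
        case TOKEN_LIB:             /* open the named library file ...            */
            break;
        case TOKEN_SYSCALL:         /* output the syscall instruction ...         */
            break;
        case TOKEN_IF:              /* build an IF-THEN(-ELSE) block ...          */
            break;
        case TOKEN_WHILE:           /* build a WHILE-DO block ...                 */
            break;
        default:                    /* declarations, assignments, and so on       */
            break;
        }
    }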

The first case handles the lib keyword and attempts to open the file specified in the string (and if it's not in the current directory, tries appending our search paths to the filename and opening that). If a file was opened, the filename and handle are placed into the file stack, and the current file pointer (file_stack_ptr) is set to the current operating file. (lines 772 - 815).

The next 2 cases handle the syscall and sysret keywords. Each simply checks that the token is the only one on the stack, and if so, outputs the keyword. (lines 817 - 827).

The next 2 cases handle the push and pop keywords. They check to ensure that the parameter is a register, and if so, generate the correct code. (lines 829 - 843).

The next case handles the asm keyword. This should be the only keyword on the stack, and the line should be terminated with a {. If so, it simply continues to read the source file, outputting it verbatim, until a terminating } is found. It also does some block number fudging to keep things aligned. (lines 845 - 873).

The ELSE case simply terminates processing, as ELSE should never appear on its own. (lines 875 - 877).

The next case handles block terminations, and also handles the ELSE statement. If it sees input other than } and ELSE, it shifts the token stack to the right (items move down) and passes off the remainder for processing by the other cases. (lines 879 - 909).

The IF case handles the creation of an IF-THEN or IF-THEN-ELSE block. It checks all the parameters and the makeup of the statement, and if all is correct, outputs the correct code. ELSE termination is handled by the BLOCK_END case (item above), and not by the ELSE case. (lines 911 - 965).

The WHILE case, like the IF case above, handles the WHILE-DO statement. It simply checks for the correct statement makeup, stores the makeup in the IF-WHILE stack, and outputs a location marker into the code. (lines 967 - 1009).

The PROC case handles procedure/function declarations and also any inline parameter declarations (which are m64 by default). As part of the proc declaration, it clears the hash table of ALL labels with the type LOCAL. Once all the old local labels have been removed, it then runs through the parameters, declaring them as m64. (lines 1011 - 1068).

The next case handles variable declarations, and will preinitialise globals as required. The case is simply split into 2 parts: the first half (lines 1077 - 1178) handles variable declarations for GLOBALS and the second half (lines 1179 - 1230) handles variable declarations for LOCALS. (lines 1070 - 1230).

The next two cases handle the ! (NOT) and - (NEG) keywords. They both simply check for a register as the operand, and if found, output the required code. (lines 1232 - 1254).

The next 2 cases handle the exit() and return() keywords. Each simply checks for a register or immediate as an operand (if one is supplied), and sets r0 appropriately, before either jumping to the exit routine or returning to the previous procedure. (Remember, it's the caller that sets up and tears down the local variable frame.) (lines 1257 - 1315).

The final case handles the standard instruction makeup.

The final step resets the token pointer variable back to zero (indicating the end of processing).

Lines 2254 - 2650 - nextToken()

This is the core of the tokeniser, and contains the main components relating to hashing.

The tokeniser starts by ignoring all whitespace until the first non-whitespace character is reached. It will also skip any comments. (lines 2257 - 2269).

On encountering a non-whitespace character, we determine if it's an alpha, a number, or a symbol. (line 2270).

If it's a symbol, we check for the ' character, and if found we start processing the input as a string until a final ' character is found. During this process, the UTF8 input is converted to UTF32 for placement onto the token stack. (lines 2272 - 2387).

Next, based on the symbol, we insert the correct token into the token stack. (lines 2388 - 2552). If none of the symbols match, then we check to see if we have a BOM encoded as UTF8; if so, skip it and continue. If not (and it's not an EOF), then abort with an error saying we don't understand the symbol. (lines 2554 - 2567).

If the character is a number, we simply copy the number onto the token stack (as an ASCII-encoded string). (lines 2571 - 2585).

Otherwise we must have a label. We copy the label to another buffer, and then we hash the buffer (using ElfHash()). We test to see if the slot in the hash table is used; if not, we insert the label into the available slot in the hash table and insert the hash into the token stack. If the slot is used, then we first check to see if it's the same string! If so, we just continue on (by inserting the hash into the token stack). If the strings don't match, then we perform a linear rehash, check for an empty slot, and if not empty compare the labels, and so on until we either find an empty slot or we run out of slots to fill. If we run out, we abort processing with an error stating that the hash table is full. (lines 2587 - 2646).

Lines 2652 - 2673 - end_block_else()

This function handles the ELSE keyword, so that block termination is done correctly in the event of an ELSE keyword. It checks that we are not back at the global level, and that the current block has been initialised by an IF-THEN statement. If all is fine, then we handle the ELSE statement.

Lines 2675 - 2717 - end_block()

This function handles the various cases that may apply when ending a block, that is, handling WHILE-DO and IF-THEN statement terminations, and also the end of procedures.

Lines 2719 - 2741 - block()

This function handles the setting up of a block, and also handles EOF in the same manner.

Lines 2743 - 2782 - include_standard_output()

This function outputs the standard equates and macros needed by ALL B0 applications to assemble correctly when using FASM.

Lines 2784 - 3152 - main()

The start and end of the compiler.

main() starts by reading the commandline and sets the appropriate flags accordingly, eg for -i, -DEBUG, -?, -h, -v, -f and also looks for a filename. (lines 2792 - 2835).

If no filename is given, or help is requested, we display the help for B0; next, if the version is requested, we display the version string. (lines 2836 - 2843).

We now scan for the %B0_INCLUDE% environment variable, and set up our PATH data structure. (lines 2844 - 2850).

Next we clear the hash table (lines 2865-2867).

Now that the hash table has been cleared, we insert all our keywords and reserved labels into the table. (lines 2869 - 2969).

We now attempt to open our named file, and if we can't, we abort. The resulting file handle and filename are placed into the file stack data structure. We also now open the code and data temp files. (lines 2978 - 3000).

We now set some of the parameters to sane values, and while we have a file to read, call block(). (This is the entry point to actual processing). (lines 3001 - 3009).

Once the source file has been processed, we close it and reset the file pointers of the code and data temp files. We next take our original filename, remove the extension and give it a new .asm extension, and then create the new output file. (lines 3020 - 3038).

Next, based on the output format (set by the -f commandline switch), we set up the basic headers, output the standard includes (include_standard_output()), and start copying the data and code temp files to the new output file. (lines 3040 - 3142).

Lastly we close all our files, and delete the 2 temp files. (lines 3147 - 3151).

THE END.