Summary

SoftWire is a class library written in object-oriented C++ for compiling assembly code. It can be used in projects to generate x86 machine code at run-time as an alternative to self-modifying code. Scripting languages might also benefit by using SoftWire as a JIT-compiler back-end. It also makes it possible to eliminate jumps on variables that are temporarily constant during run-time, for example to construct an optimised pipeline for efficient graphics processing. Because it enables this kind of 'instruction rewiring' through run-time conditional compilation, I named it "SoftWire". It is targeted only at developers with a good knowledge of C++ and x86 assembly.

Demo

The demo application assembles seven test routines. HelloWorld.asm shows that it is possible to call external functions, like printf. SetBits.asm is a function to set a number of bits in a buffer, starting from a given bit. CrossProduct.asm shows the use of floating-point instructions, macros and inline functions. AlphaBlend.asm uses MMX instructions for blending 32-bit colors, and conditionally compiles for Katmai-compatible or older processors. It also shows how to define static data. Factorial.asm calculates a factorial by recursively calling itself. Mandelbrot.asm draws the Mandelbrot fractal in ASCII. The last test shows the use of run-time intrinsics.

(Execute SoftWire.exe at your own risk! It has been tested on many systems, but I make no guarantee that it will work on yours)

Compilation

This library was developed with Visual C++ 6.0 and has project and workspace files for this compiler included. Solution files for Visual C++ .NET and makefiles for Dev-C++ and GCC are also included. For GCC you will need a recent version which supports nameless structs. Should you have any problems compiling the code, please mail me at nicolas@capens.net.

There are known issues with the latest version of the library when compiled with GCC. They will be fixed as soon as possible.

Syntax

The source line layout is similar to that of NASM and the Visual C++ inline assembler:
label:    instruction operands        ; comment
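
To make the layout concrete, here is a minimal C++ sketch that splits one source line into its four parts. It is only an illustration and not SoftWire's own Scanner: the SourceLine struct and the splitLine function are hypothetical names, and the sketch assumes that ':' only ever ends a label and ';' only ever starts a comment.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type for this sketch; not part of SoftWire.
struct SourceLine {
    std::string label;
    std::string instruction;
    std::string operands;
    std::string comment;
};

// Removes leading and trailing spaces and tabs.
static std::string trim(const std::string& s) {
    const char* whitespace = " \t";
    std::size_t first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) return "";
    std::size_t last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Splits "label:    instruction operands        ; comment" into its parts.
SourceLine splitLine(const std::string& line) {
    SourceLine out;
    std::string rest = line;

    // Everything after the first ';' is treated as the comment.
    std::size_t semicolon = rest.find(';');
    if (semicolon != std::string::npos) {
        out.comment = trim(rest.substr(semicolon + 1));
        rest = rest.substr(0, semicolon);
    }

    // An optional label ends at the first ':'.
    std::size_t colon = rest.find(':');
    if (colon != std::string::npos) {
        out.label = trim(rest.substr(0, colon));
        rest = rest.substr(colon + 1);
    }

    // The first word is the instruction, the remainder the operands.
    rest = trim(rest);
    std::size_t space = rest.find_first_of(" \t");
    if (space == std::string::npos) {
        out.instruction = rest;
    } else {
        out.instruction = rest.substr(0, space);
        out.operands = trim(rest.substr(space + 1));
    }
    return out;
}
```

A real scanner also has to cope with string literals and C/C++ comments, which this sketch deliberately ignores.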

The assembler can accept a file with the .asm extension. This file is treated as one block of code, but subroutines and data can be created with labels. Execution will start at the label with the same name as the file, unless another entry point has been defined. The assembler generates only 32-bit code for processors compatible with the 386 or above.

C/C++ comments are also supported. The assembler is case-sensitive, except for instructions and registers. Labels follow the same rules as in the inline assembler and cannot contain special characters like $, #, @, ~, ?, etc.

Specifiers are always optional, but when the assembler has multiple possible instructions you can't predict the behaviour without using a specifier. For example, the assembler can't know if the code "fld [esi]" uses single or double precision floating-point numbers without a DWORD or QWORD specifier. The PTR keyword is optional. The NEAR or SHORT keywords can be used for jumps or calls and are equivalent.

The assembler supports the MMX, 3DNow!, Pentium Pro, SSE and SSE2 instruction sets. Some specific instructions have been removed because they are unsafe and/or not useful for 32-bit protected mode:

This is very similar to the Visual C++ inline assembler. It should not limit you for 'normal' use of this run-time assembler. For more details, take a look at the InstructionSet.cpp file. You can always write your own machine-code by defining it as static data.

You can define static data with the DB, DW and DD keywords. By using a label, the address of this data can be referenced. The data will be created at the location of the definition, so it's advisable to put it after the return or before the function label. Since there is no standard way to declare local data, you should put everything on the stack yourself. Local data is not that useful for a run-time assembler anyway. You could use the 'cdecl' calling convention (standard in Visual C++) to let the caller push the arguments on the stack and remove them after the function has been called. Subroutines can also be created by using labels. To create arrays of static data, you can use DB[#], DW[#] and DD[#]. All variables will be aligned on their natural boundaries.

To align data or code yourself, you can use the ALIGN keyword. For efficiency, jump labels should be 16-byte aligned, and for most SSE instructions the data also has to be 16-byte aligned. The assembler will use NOP instructions for padding for both data and code alignment.
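
The padding arithmetic behind alignment can be sketched in a few lines of C++. This is an illustration under my own assumptions and not the code SoftWire uses internally: paddingFor and alignWithNops are hypothetical names, the buffer is a plain std::vector, and the padding consists of single-byte NOPs (opcode 0x90).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed to advance 'offset' to the next multiple of 'alignment'.
// 'alignment' must be a power of two, such as 16.
std::size_t paddingFor(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (offset & (alignment - 1))) & (alignment - 1);
}

// Appends single-byte NOPs (0x90) until the buffer size is a multiple of 'alignment'.
void alignWithNops(std::vector<unsigned char>& code, std::size_t alignment) {
    code.insert(code.end(), paddingFor(code.size(), alignment),
                static_cast<unsigned char>(0x90));
}
```

For example, a buffer holding 37 bytes needs 11 bytes of padding to reach the next 16-byte boundary at offset 48.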

External data can be declared by using the Assembler::defineExternal method in your C++ code. A handy macro defined in Assembler.hpp makes it possible to export a function like printf with "ASM_EXPORT(printf)". Externals can be any kind of data defined in your C++ application, and are treated like void pointers. Externals should be declared before assembling the file. They do not have to be re-declared in the assembly code.

For constants, only numbers and character constants are supported. They can be in binary, octal, decimal or hexadecimal base, with the usual pre- or postfixes. All constant expressions are evaluated, including those of data definitions and memory references. String literals can be created by using DB and double quotation marks.
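
As an illustration of how such numeric constants can be recognized, here is a small C++ sketch. The exact set of prefixes and postfixes SoftWire accepts is not listed here, so the forms below are an assumption: a "0x" prefix or "h" postfix for hexadecimal, a "b" postfix for binary, an "o" or "q" postfix for octal, and plain digits for decimal. The parseConstant function is a hypothetical name; character constants and overflow checks are left out.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Value of one digit character, or -1 if it is not a digit in any base up to 16.
static int digitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Parses 'text' into 'value'. Returns false if 'text' is not a valid constant.
// Accepted forms are an assumption of this sketch: 255, 0xFF, 0FFh, 11111111b, 377o, 377q.
bool parseConstant(const std::string& text, unsigned long& value) {
    std::string digits = text;
    unsigned base = 10;

    if (digits.size() > 2 && digits[0] == '0' && (digits[1] == 'x' || digits[1] == 'X')) {
        base = 16;                      // "0x" prefix
        digits = digits.substr(2);
    } else if (digits.size() > 1) {
        char last = static_cast<char>(std::tolower(static_cast<unsigned char>(digits.back())));
        if (last == 'h') base = 16;     // postfix forms
        else if (last == 'b') base = 2;
        else if (last == 'o' || last == 'q') base = 8;
        if (base != 10) digits.pop_back();
    }
    if (digits.empty()) return false;

    value = 0;
    for (char c : digits) {
        int d = digitValue(c);
        if (d < 0 || static_cast<unsigned>(d) >= base) return false;
        value = value * base + static_cast<unsigned long>(d);
    }
    return true;
}
```

The prefix is checked before the postfix so that a hexadecimal number such as 0x1B is not mistaken for a binary constant.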

Conditional compilation can be controlled with the #if, #elif, #else and #endif preprocessor directives. The ASM_DEFINE macro can be used to send an integer to the assembler which can be used after the #if and #elif directives. Boolean expressions are evaluated as in C/C++.

This powerful feature can be used to generate many different specific functions without having to code completely new ones. It can eliminate jumps in the assembly code to generate exactly the optimized function you need. An example of this is a SIMD-optimized vertex pipeline for 3D graphics. Without conditional compilation, many comparisons and jumps would be needed per vertex to transform and light it correctly for the current settings. With run-time conditional compilation, these instructions can be eliminated, leaving only the wanted instructions, while still being able to handle thousands of setting combinations. It also makes it possible to write different code for other processors, without needing a control statement in your high-performance assembly code and without the need to write it as separate functions which are difficult to maintain.
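
The effect of conditional compilation on the generated code can be illustrated with a strongly simplified filter written in C++. This is a sketch and not SoftWire's preprocessor: filterLines is a hypothetical name, only the form "#if NAME" with a single integer name is handled, #elif and full boolean expressions are left out, and the label "light" in the usage example is made up.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// True if every currently open #if block has its active branch selected.
static bool allActive(const std::vector<bool>& branches) {
    for (bool taken : branches) {
        if (!taken) return false;
    }
    return true;
}

// Keeps only the source lines whose enclosing #if / #else blocks are active.
// 'defines' plays the role of the integers sent to the assembler.
std::vector<std::string> filterLines(const std::vector<std::string>& source,
                                     const std::map<std::string, int>& defines) {
    std::vector<std::string> output;
    std::vector<bool> branches;  // one entry per open #if

    for (const std::string& line : source) {
        std::istringstream words(line);
        std::string first, name;
        words >> first;

        if (first == "#if") {
            words >> name;
            std::map<std::string, int>::const_iterator found = defines.find(name);
            branches.push_back(found != defines.end() && found->second != 0);
        } else if (first == "#else") {
            if (!branches.empty()) branches.back() = !branches.back();
        } else if (first == "#endif") {
            if (!branches.empty()) branches.pop_back();
        } else if (allActive(branches)) {
            output.push_back(line);
        }
    }
    return output;
}
```

Given the lines "#if LIGHTING", "call light", "#else", "nop", "#endif", "ret", a non-zero LIGHTING leaves only "call light" and "ret". No comparison or jump on the setting remains in the code that gets assembled.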

The preprocessor also supports #include and #define. There is also an 'inline' keyword, which behaves like #define but produces less error-prone code (caused by nested macros) and has a nicer syntax for multiple lines:

inline macroName(argument1, argument2, ...)
{
   code block
}

It is a nice way of defining new instructions, and it even makes it possible to define 'instructions' to be emulated with x86 assembly, like DirectX shader instructions! With normal macros, the only problem for doing this would be that you need parentheses around the arguments. But even this is solved with the inline macros. You can simply use the above macro like this:

macroName argument1, argument2, ...

When you don't write an opening parenthesis, SoftWire will automatically assume that you want to use this 'implicit' argument list. The argument list stops at the end of the line, so it is not possible to nest multiple macros without using parentheses.

Intrinsics

SoftWire also supports another form of run-time code generation. For every assembly instruction there is a corresponding member function of Assembler with the same name. These functions encode the corresponding instruction and put it in the Loader so it is ready to execute. These run-time intrinsics are ideal for writing a compiler back-end. Because it is all written in C++, things like conditional compilation become trivial.

First you need to construct an Assembler without providing any arguments (unlike when assembling a file, where the file name has to be passed).

You can use the usual register names directly. For example "add(eax, ebx);" is a valid member function call of Assembler. For memory operands, you need to use "mem32[...]" and similar syntax. Note that all syntax checks should happen at compile-time. An exception is that it's impossible to check the scale factor in a memory reference.
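
To show the idea of 'one member function per instruction' in isolation, here is a tiny self-contained C++ sketch. It is a stand-in and not SoftWire's Assembler: the MiniEmitter class and the Reg32 enum are hypothetical, only two instructions are covered, and the bytes are simply collected in a vector instead of being made executable. The encodings themselves are the standard x86 ones (ADD r/m32, r32 is opcode 01 followed by a ModRM byte; RET is C3).

```cpp
#include <cassert>
#include <vector>

// Register numbers as used in the x86 ModRM byte.
enum Reg32 { eax = 0, ecx = 1, edx = 2, ebx = 3, esp = 4, ebp = 5, esi = 6, edi = 7 };

// Hypothetical stand-in for an assembler with intrinsics; not SoftWire's class.
class MiniEmitter {
public:
    // add dst, src  ->  01 /r with ModRM = 11 src dst (register-direct mode)
    void add(Reg32 dst, Reg32 src) {
        code.push_back(0x01);
        code.push_back(static_cast<unsigned char>(0xC0 | (src << 3) | dst));
    }

    // ret  ->  C3
    void ret() {
        code.push_back(0xC3);
    }

    const std::vector<unsigned char>& bytes() const { return code; }

private:
    std::vector<unsigned char> code;
};
```

With this sketch, calling add(eax, ebx) followed by ret() yields the bytes 01 D8 C3. Because the operands are typed C++ arguments, passing something that is not a register is rejected by the C++ compiler, which matches the point about syntax checks happening at compile-time.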

Take a look at Test.cpp for some example code.

Design

The whole library is encapsulated in a namespace called "SoftWire". This is to prevent name clashes with other projects.

The only class you'll need for assembling a file and getting a pointer to the callable code is 'Assembler'. It has to be constructed with the name of the .asm file which contains the assembly code. The assembler treats it as one block of code, and you can get a void pointer to the assembled code by calling the 'callable()' method. By default the entry point will be a label with the same name as the file. You can also pass the name of a label as the entry point if you want to start execution from another line. To actually call the function, you first need to cast the pointer to a function pointer. When the Assembler is destructed, it also deletes the assembled code.
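
The cast from a void pointer to a function pointer can be sketched as follows. To keep the example self-contained, an ordinary C++ function stands in for the assembled code; getCallable, addNumbers and BinaryFunction are hypothetical names for this sketch, and in real use the untyped pointer would come from the assembler instead.

```cpp
#include <cassert>

// Stand-in for an assembled routine that takes two integers and returns their sum.
static int addNumbers(int a, int b) {
    return a + b;
}

// Hypothetical stand-in for obtaining the untyped pointer to the code.
void* getCallable() {
    return reinterpret_cast<void*>(&addNumbers);
}

// The signature the caller knows the routine to have.
typedef int (*BinaryFunction)(int, int);

int callThroughPointer(int a, int b) {
    // Cast the void pointer to the expected function pointer type, then call it.
    BinaryFunction function = reinterpret_cast<BinaryFunction>(getCallable());
    return function(a, b);
}
```

The compiler cannot check that the chosen signature matches what the assembly code expects, so the argument types, return type and calling convention have to be kept in sync by hand.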

The first class the assembler will use for processing the assembly file is the 'Scanner'. This class has the task of breaking up the source code into words, called tokens. It is also responsible for the preprocessing tasks like file inclusion, conditional compilation and macro expansion. The 'Macro' class helps with this last task.

The tokens are stored in a 'Token' class. The scanner also recognizes tokens as being identifiers, constants or punctuators. The scanner does not recognize keywords (except preprocessor directives) and does no syntax checking. The whole file is scanned at once and the tokens are placed in a 'TokenList' class.
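
The classification into identifiers, constants and punctuators can be sketched in C++ as shown here. This is an illustration only and not the actual Scanner or Token code: SketchToken, TokenKind and scan are hypothetical names, and preprocessing, comments and string literals are left out.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Constant, Punctuator };

// Hypothetical token type for this sketch; not SoftWire's Token class.
struct SketchToken {
    TokenKind kind;
    std::string text;
};

static bool isWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Breaks the source text into tokens without recognizing keywords
// and without any syntax checking.
std::vector<SketchToken> scan(const std::string& source) {
    std::vector<SketchToken> tokens;
    std::size_t i = 0;

    while (i < source.size()) {
        unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }

        std::size_t start = i;
        TokenKind kind;
        if (std::isalpha(c) || c == '_') {
            kind = TokenKind::Identifier;   // letters, digits and underscores
            while (i < source.size() && isWordChar(source[i])) ++i;
        } else if (std::isdigit(c)) {
            kind = TokenKind::Constant;     // digits plus base letters, e.g. 0x1F or 1Fh
            while (i < source.size() && std::isalnum(static_cast<unsigned char>(source[i]))) ++i;
        } else {
            kind = TokenKind::Punctuator;   // any other single character
            ++i;
        }
        tokens.push_back(SketchToken{kind, source.substr(start, i - start)});
    }
    return tokens;
}
```

Scanning "mov eax, [esi+4]" this way gives eight tokens. Note that "mov" and "eax" are both plain identifiers at this stage; deciding that one is a mnemonic and the other a register is left to the parser.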

Every source line of tokens then goes to the 'Parser'. It will recognize the keywords, check the syntax and pass the information like mnemonic and registers to the code generator.

The code generation is done with the 'Synthesizer' class. It will put the information from the parsed instruction into bytes for the machine code.

The rules for the code generation are stored in the 'InstructionSet' class. The parser uses this class to select the matching instruction(s), and the synthesizer uses it to know how to encode the instruction.

The bytes from the synthesizer are stored into an 'Encoding' class. It also stores information about labels and references to labels to resolve jump addresses.

All encodings are stored in the 'Loader' class. After all instructions have been assembled, this class will resolve all the references and write the machine-code bytes into a buffer. Externally declared data will be resolved by the assembler's 'Linker' class. The Loader also searches for the code entry point. When the assembler is destructed, the assembled code is also destroyed. The linker data is also cleared.
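
Resolving references to labels is typically done in two passes, which the following C++ sketch illustrates. It is my own simplified model and not the Loader's actual code: EncodedEntry and resolveReferences are hypothetical names, every entry carries at most one label and one reference, and only relative displacements (as used by x86 relative jumps and calls) are computed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one encoded instruction; not SoftWire's Encoding class.
struct EncodedEntry {
    std::string label;      // label defined at this instruction, or empty
    std::string reference;  // label this instruction refers to, or empty
    std::size_t size;       // encoded length in bytes
};

// Returns one displacement per entry: the target offset minus the offset of the
// next instruction for entries with a reference, and 0 for all other entries.
std::vector<long> resolveReferences(const std::vector<EncodedEntry>& entries) {
    // Pass 1: record the byte offset of every label.
    std::map<std::string, long> labelOffset;
    long offset = 0;
    for (const EncodedEntry& entry : entries) {
        if (!entry.label.empty()) labelOffset[entry.label] = offset;
        offset += static_cast<long>(entry.size);
    }

    // Pass 2: compute the displacement of every reference.
    std::vector<long> displacement;
    offset = 0;
    for (const EncodedEntry& entry : entries) {
        offset += static_cast<long>(entry.size);  // relative jumps count from the next instruction
        if (entry.reference.empty()) {
            displacement.push_back(0);
            continue;
        }
        std::map<std::string, long>::const_iterator found = labelOffset.find(entry.reference);
        if (found == labelOffset.end()) {
            throw std::runtime_error("unresolved label: " + entry.reference);
        }
        displacement.push_back(found->second - offset);
    }
    return displacement;
}
```

Two passes are needed because a jump may refer to a label that is only defined further down, so its offset is not known yet when the jump itself is encoded.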

When a syntax error occurs, the assembler throws an 'Error' class. This class simply holds a string with the error description. This message will be printed to the console by using the DebugOutput::printf method. You can easily use your custom error output system by deriving from the DebugOutput class. Besides syntax errors, the assembler might also throw internal errors. This is an alternative to assert(), so it should not happen. If you get an internal error, or worse, an unhandled exception, please contact the author.

Run-time intrinsics are generated by the InstructionSet, and stored in Intrinsics.hpp. You need to uncomment the 'generateIntrinsics();' line when you've made changes to the instruction set and need new intrinsics. Do not attempt to modify the Intrinsics.hpp file manually. The arguments for the intrinsics are defined in Operand.hpp.

Room for Improvement

To make the parsing task easier and faster, the scanner should immediately recognize keywords. The usual method is to use a Deterministic Finite Automaton, but because of the complexity this is not implemented yet. Searching the instruction set for the correct mnemonic could also be done by the scanner.

Using a 'SymbolTable' for faster identifier resolving might be handy. Local variables and function arguments could be made easily accessible with macros or the symbol table.

A potential use of this library is to dynamically optimize code at run-time. Because performance depends a lot on the order of the instructions and the processor architecture, one could determine the dependencies between instructions, and try all permutations to find the fastest.
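
The enumeration step of this idea can be sketched in C++. This is a brute-force illustration of an unimplemented idea, not an existing feature: validOrders and Dependency are hypothetical names, the dependencies are assumed to be given as pairs of instruction indices, and timing each candidate order is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Each pair (first, second) means instruction 'first' must stay before 'second'.
typedef std::pair<std::size_t, std::size_t> Dependency;

// Returns every ordering of the instructions 0..count-1 that respects all dependencies.
std::vector<std::vector<std::size_t> > validOrders(std::size_t count,
                                                   const std::vector<Dependency>& dependencies) {
    std::vector<std::vector<std::size_t> > orders;
    std::vector<std::size_t> order(count);
    std::iota(order.begin(), order.end(), std::size_t(0));  // 0, 1, ..., count-1

    do {
        // position[i] is the slot instruction i occupies in this ordering.
        std::vector<std::size_t> position(count);
        for (std::size_t slot = 0; slot < count; ++slot) {
            position[order[slot]] = slot;
        }

        bool valid = true;
        for (const Dependency& d : dependencies) {
            if (position[d.first] >= position[d.second]) {
                valid = false;
                break;
            }
        }
        if (valid) orders.push_back(order);
    } while (std::next_permutation(order.begin(), order.end()));

    return orders;
}
```

With three instructions and the single dependency that instruction 0 must precede instruction 2, three of the six permutations remain. The number of permutations grows factorially, so this approach is only practical for short instruction sequences.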

License Conditions

The following files fall under the LGPL (License.txt) and are Copyright (C) 2002-2003 Nicolas Capens:

If you extend the possibilities of the classes in these files, please send your changes to the copyright holder(s). Do not change this file or License.txt, but use a change log. If you only derive from a class to write your own specific implementation, you don't have to release the source code of your whole project, just give credit where due. This can be done by mentioning my name in your credits list and/or providing a link to the original SoftWire source code (http://softwire.sourceforge.net).

Don't hesitate to contact me and show what you've created with SoftWire!

Contributions & Credits

If you feel like you should also have been mentioned on this list (or be removed or have something changed), please do not hesitate to contact me to correct this mistake.

Why are contributions, bug fixes and copyrights not indicated in the code? I do not like this because in my opinion source files should be kept as readable as possible. I think it is very annoying that you first have to scroll past a huge block of comments that don't have anything to do with the code itself. Source files are for code. Licence and readme files are for the things not directly related to the code but to the library as a whole. If you cannot agree with this point of view and have some strong arguments, please contact me to discuss it.

Bugs & Feature Requests

SoftWire is a work-in-progress, so every kind of feedback is welcome, good or bad. I'm also always willing to help you out if you don't get something working. If you're a C++ guru and you would have designed some parts differently, I'm all ears. Contact me via e-mail at nicolas@capens.net.

Acknowledgements

Special thanks to:

Kind regards,

Nicolas Capens

Copyright (C) 2002-2003 Nicolas Capens - nicolas@capens.net