JASM is a bytecode assembly backend for creating scripting languages that compile to JASM bytecode. The purpose of this project is to provide a backend that makes it easy to create scripting languages, whether they are embedded into C++ code or used as regular languages. This way, anyone who wants to create a scripting language is freed from performance and execution headaches, and can focus on language design and leave the rest to JASM. In addition, the future plans for this project include adding support for converting JASM and its bytecode to GNU or LLVM ILs. Maybe a utopian plan for now, but who knows?
JASM is a part of the CSLB project. The CSLB project (which stands for Common Scripting Language Backend, pronounced SeezleBee) consists of three parts: Assembler, Linker and Runtime. The assembler and linker parts are what this repo covers. JASM is the name of the IL, bytecode, assembler and linker altogether. The runtime is CSR (Common Script Runtime, pronounced Caesar, with the proper Latin pronunciation of C as K as in King, so Kaesar), and it has its own repo [Work in Progress].
You can either grab the compiled binaries from the release section (if any exist) or build JASM from source. See BUILD.md for building from source.
- JASM Documentation
The JASM project includes the assembler(s) and the linker(s) for the JASM IL.
./
|_ asm/ ------------------------> contains the test.jasm file which I used to test the assembler. I left it there for fun.
|_ docs/ -----------------------> contains documentation `.md`s, concept `.jasm`s and various descriptive `.txt` files.
|_ include/
| |_ bytemode/
| | |_ assembler/
| | |_ linker/
| |_ extensions/
|_ lib/ ------------------------> contains third-party libraries
|_ src/
  |_ bytemode/ -----------------> `assembler` and `linker` reside under `bytemode` because the project is designed to allow future additions, like an assembler that targets machine code instead of bytecode
  | |_ assembler/
  | |_ linker/
  |_ core/ ---------------------> the core functions that don't change depending on the target mode
  |_ extensions/ ---------------> various utility functions, like serialization of types to bytes.
The CLI is mostly handled by the CLI Parser, so if you want to know how it works you have to check its source. Basic CLI usage is covered in README.md. If you want to know more about the CLI parameters and flags, this section covers that.
jasm --silent or jasm -s (with a lowercase s)
The logging system in the JASM project has four modes: LOG, LOGD, LOGW and LOGE.
The modes LOG, LOGD and LOGW are disabled when the silent flag is set. LOGE is only active
when it's in Medium or High mode. Any unhandled exception is non-silent regardless of the silent flag.
jasm --single or jasm -S (with an uppercase S)
By default, JASM first assembles all the input files, then links them together and creates a final file depending on the
--lib-type flag. However, when the single flag is set, JASM stops after the assembling process and only creates "jo" (JASM object)
files. Each of these object files follows the static library structure, so it can later be used as a static library (.stc) and linked into other libraries and executables.
jasm --out <value> or jasm -o <value> (with a lowercase o)
By default, JASM will use the first entry from the input files to generate an output name based on it and the target output type. The output will be placed in the same directory as the first input file. Optionally, you can provide a name along with a path for the output file. Be aware that if the name you've provided contains an extension that doesn't match the target output type, JASM will simply append the correct extension to the end of the name you've provided.
For example if you provide out.jef for a target static library, the output will be out.jef.stc.
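The naming rule above can be sketched in C++ roughly as follows. This is a minimal sketch and not the actual JASM code: `OutputTarget`, `ExtensionFor` and `ResolveOutputName` are hypothetical names, and `.jef` as the executable extension is an assumption inferred from the example. `.stc` and `.shd` are the extensions used elsewhere in this document.

```cpp
#include <cassert>
#include <string>

// Hypothetical target kinds; the names are illustrative only.
enum class OutputTarget { Executable, StaticLibrary, SharedLibrary };

// Extension expected for each target type.
inline std::string ExtensionFor(OutputTarget target) {
    switch (target) {
        case OutputTarget::StaticLibrary: return ".stc";
        case OutputTarget::SharedLibrary: return ".shd";
        case OutputTarget::Executable:    break;
    }
    return ".jef"; // assumption: extension used for executables
}

// True when `text` ends with `suffix`.
inline bool EndsWith(const std::string& text, const std::string& suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Keeps the requested name if it already carries the right extension,
// otherwise appends the right extension to whatever was given.
inline std::string ResolveOutputName(const std::string& requested, OutputTarget target) {
    const std::string wanted = ExtensionFor(target);
    return EndsWith(requested, wanted) ? requested : requested + wanted;
}
```

With this sketch, `ResolveOutputName("out.jef", OutputTarget::StaticLibrary)` gives `out.jef.stc`, matching the example above.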
jasm --lib-type <value> (where value can be either stc or shd)
By default, JASM will assume that the target output format is executable unless you specify
otherwise via lib-type flag. The flag accepts two possible values, one being stc for STatiC
and the other shd for SHareD. If a lib-type is provided but it is not stc or shd, JASM
will proceed by assuming it as an executable.
jasm --in <..values..> or jasm -I <..values..>
Source files to assemble and (optionally) link. If the target output format is an executable, the first file must contain
the program entry point instruction org along with sts and sth.
jasm --libs <..values..> or jasm -L <..values..>
Libraries to be used and to be linked. ~~Must contain the shared libraries as well if any are used in the project.~~ Only libraries to be linked statically must be present; shared libraries are checked for at runtime.
jasm --pipelines or jasm -p
By default, JASM will use intermediary object files. If the flag pipelines is set, pipeline streams are used instead.
WARNING: The pipelines option is not currently available. If you try to use it, JASM will not complain, but will still use object files. You can check the availability of pipelines using jasm --help.
jasm --working-dir <value> or jasm -w
By default, JASM will use the directory that it's been executed from as working directory unless you specify otherwise via
the working-dir flag. The value of the flag working-dir must be an existing directory. Otherwise an error will be thrown.
jasm --redirect-stdout <..values..> or jasm -r <..values..>
By default, JASM will use the stdout/stderr streams for logging unless you specify otherwise via redirect-stdout flag.
The flag accepts at most two values; if only one is given, it'll be used as both stdout and stderr; if two are given, they'll be
used as stdout and stderr respectively. If more than two values are given, the program will silently log an error but
continue the process.
WARNING: Currently there is an issue with stdout redirection that causes a segmentation fault.
Although not yet implemented, JASM is designed to allow modular assemblers. Every assembler lies within an <identifier>mode
directory under src. Currently the only available assembler is the bytemode assembler. It assembles the JASM IL into JASM bytecode, which is meant to be executed by the Common Script Runtime (CSR) [Work in Progress].
The assembler follows the following procedure:
If target is an executable:
- Assemble the first file in executable format. Store the AssemblyInfo of the result in memory.
- Assemble the rest as object files and store their AssemblyInfo's in memory.
If target is a library:
- Assemble every file as object files and store their AssemblyInfo's in memory.
If the single mode is active:
- Assemble each file in library format.
Regardless of the target, the result AssemblyInfo's are then passed to the linker.
The way the assembler works is quite straightforward: the Stream::Tokenize function under extensions/streamextensions.cpp creates a token from the input stream that is passed to it. I say token, but it just returns the next "word" from the stream. The only exception is that if there is something between double quotes, it is returned as a single "word". For example, both main: and "simple assembler" are a single token.
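The tokenizing behaviour described above could look roughly like the following C++ sketch. It is not the real `Stream::Tokenize`: `NextToken` and `TokenizeAll` are hypothetical names, and keeping the double quotes inside the returned token is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns the next "word" from the stream, or an empty string at end of input.
// A double-quoted run is returned as one token, quotes included (assumption).
inline std::string NextToken(std::istream& in) {
    const int eof = std::char_traits<char>::eof();

    // Skip leading whitespace.
    while (in.peek() != eof && std::isspace(in.peek())) in.get();

    std::string token;
    if (in.peek() == '"') {
        token += static_cast<char>(in.get());          // opening quote
        for (int c = in.get(); c != eof; c = in.get()) {
            token += static_cast<char>(c);
            if (c == '"') break;                       // closing quote
        }
        return token;
    }

    // Plain word: read until the next whitespace or end of input.
    while (in.peek() != eof && !std::isspace(in.peek()))
        token += static_cast<char>(in.get());
    return token;
}

// Convenience helper that collects every token of a string.
inline std::vector<std::string> TokenizeAll(const std::string& source) {
    std::istringstream in(source);
    std::vector<std::string> tokens;
    for (std::string t = NextToken(in); !t.empty(); t = NextToken(in))
        tokens.push_back(t);
    return tokens;
}
```

For the input `main: rom "simple assembler"`, this sketch yields three tokens: `main:`, `rom` and `"simple assembler"`.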
Then there are three ways the assembler can go:
- The token is a label, i.e. ends with a colon -> a symbol at the current location is created
- The token is present in the instruction map -> the corresponding instruction function is called
- The token marks the end of the .body segment, i.e. is .end -> the assembler finishes assembling the current file.
If the token doesn't fall under these three categories, an exception occurs and the program exits with an error. Regardless of the target and the mode, the assembler creates intermediary object files from each file, and then passes the resulting AssemblyInfoCollection, which contains the symbol information and paths of said object files, to the linker. The linker handles the rest from here on.
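The three-way decision and the error case can be sketched like this in C++. It is a minimal sketch under assumptions: `TokenKind`, `InstructionMap` and `Classify` are hypothetical names, and the real instruction functions take whatever arguments the assembler needs rather than nothing.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>

enum class TokenKind { Label, Instruction, EndOfBody };

// Hypothetical instruction map: mnemonic -> function that assembles it.
using InstructionMap = std::unordered_map<std::string, std::function<void()>>;

// Decides which of the three paths a token takes; anything else is an error.
inline TokenKind Classify(const std::string& token, const InstructionMap& instructions) {
    if (!token.empty() && token.back() == ':') return TokenKind::Label;
    if (instructions.count(token) != 0)        return TokenKind::Instruction;
    if (token == ".end")                       return TokenKind::EndOfBody;
    throw std::runtime_error("unexpected token: " + token);
}
```

A caller would create a symbol for `Label`, invoke the mapped function for `Instruction`, and stop reading the file on `EndOfBody`.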
Just like the assembler, the linker side of JASM is designed to be modular. Every linker lies within an <identifier>mode directory under src. Currently the only available linker is the bytemode linker. It links the object files created by the assembler and checks for symbol errors. The serialization of AssemblyInfo's is done by the linker if the single mode is active or the target is a library.
The way the linker works is, again, quite simple. It accepts an AssemblyInfoCollection, such that the first AssemblyInfo is in the executable format if the target is an executable. The linker then iterates through each AssemblyInfo and merges them into a single file, that is if the single mode is not active. After the merging process, the linker checks if there are still unknown symbols present.
If there are, then it checks whether the program will use any runtime assemblies or not. If not then the linker exits with an error, otherwise continues silently and finishes its job. Before finishing, it serializes the resulting AssemblyInfo at the end of the final file if symbol info flag is marked in the current AssemblyContext.
If the single mode is active, then instead of merging each object file into one final file, it serializes the AssemblyInfo's of each file at the end of them and creates static libraries from each one of them.
The linker does a brief check after every AssemblyInfo has been linked into the final output; it gets the AssemblyInfo's of every runtime assembly that the current AssemblyContext marks as "will be used" and checks if the unknown symbols (if there exists such thing after linking) exist within them. If not then it exits without an error, but prints a warning message.
NOTE: Static libraries can have runtime assembly dependencies.
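The unknown-symbol check described above can be sketched in C++ as follows. This is a simplified sketch, not the linker's code: `SymbolInfo` is a hypothetical stand-in for AssemblyInfo that only tracks symbol names, and `LinkStatus`, `UnknownSymbols` and `CheckSymbols` are illustrative names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, stripped-down stand-in for AssemblyInfo.
struct SymbolInfo {
    std::set<std::string> defined;     // labels declared in the file
    std::set<std::string> referenced;  // symbols the file uses
};

enum class LinkStatus { Ok, Warning, Error };

// Symbols referenced somewhere in `infos` but defined nowhere in `infos`.
inline std::set<std::string> UnknownSymbols(const std::vector<SymbolInfo>& infos) {
    std::set<std::string> defined, unknown;
    for (const auto& info : infos)
        defined.insert(info.defined.begin(), info.defined.end());
    for (const auto& info : infos)
        for (const auto& name : info.referenced)
            if (defined.count(name) == 0) unknown.insert(name);
    return unknown;
}

// No unknown symbols -> Ok. Unknown symbols and no runtime assemblies -> Error.
// Unknown symbols missing from the runtime assemblies -> Warning only.
inline LinkStatus CheckSymbols(const std::vector<SymbolInfo>& linked,
                               const std::vector<SymbolInfo>& runtimeAssemblies) {
    const std::set<std::string> unknown = UnknownSymbols(linked);
    if (unknown.empty()) return LinkStatus::Ok;
    if (runtimeAssemblies.empty()) return LinkStatus::Error;

    std::set<std::string> provided;
    for (const auto& assembly : runtimeAssemblies)
        provided.insert(assembly.defined.begin(), assembly.defined.end());
    for (const auto& name : unknown)
        if (provided.count(name) == 0) return LinkStatus::Warning;
    return LinkStatus::Ok;
}
```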
Below are the various file formats used by JASM. They are, for some reason, defined with a strange syntax I used when I first started this project. The syntax description can be found in this file.
The executable format has two variants, one containing the AssemblyInfo and the other being just bare bytecode. Both are defined in a single style as follows:
u32; entry point; entry point of the program decided by the org instruction
u32; stack size; stack size of the program decided by the sts instruction
u32; max heap size; maximum heap size of the program decided by the sth instruction
???; bytecode; compiled bytecode. Its size can be at most what a u32 can hold
_???; AssemblyInfo; serialized assembly info. Its size can vary from a few bytes to a couple hundred
_u64; size; size of the AssemblyInfo
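To make the layout above concrete, here is a small C++ sketch that reads the three leading u32 fields of an executable. `ExecutableHeader`, `ReadU32LE` and `ParseExecutableHeader` are hypothetical names, and little-endian byte order is an assumption, since the format description doesn't state one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// The three fixed-size fields at the start of an executable.
struct ExecutableHeader {
    std::uint32_t entryPoint;   // decided by org
    std::uint32_t stackSize;    // decided by sts
    std::uint32_t maxHeapSize;  // decided by sth
};

// Reads 4 bytes starting at `offset` as a little-endian u32 (assumed order).
inline std::uint32_t ReadU32LE(const std::vector<std::uint8_t>& bytes, std::size_t offset) {
    return  static_cast<std::uint32_t>(bytes[offset])
         | (static_cast<std::uint32_t>(bytes[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(bytes[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(bytes[offset + 3]) << 24);
}

// Parses the header; the bytecode starts right after these 12 bytes.
inline ExecutableHeader ParseExecutableHeader(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() < 12) throw std::runtime_error("file too short for an executable header");
    return ExecutableHeader{ReadU32LE(bytes, 0), ReadU32LE(bytes, 4), ReadU32LE(bytes, 8)};
}
```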
Both the static and runtime libraries have the same format, defined as follows:
???; bytecode; compiled bytecode. Its size can be at most what a u32 can hold
???; AssemblyInfo; serialized assembly info. Its size can vary from 1 byte to a couple hundred
u64; size; size of the AssemblyInfo
The object file format has two variants, one for the main executable file that contains the entry point instruction, and one for the remaining files. They are both defined in a single style as follows:
if (library):
???; bytecode; compiled bytecode. Its size can be at most what a u32 can hold
else:
u32; entry point; entry point of the program decided by the org instruction
u32; stack size; stack size of the program decided by the sts instruction
u32; max heap size; maximum heap size of the program decided by the sth instruction
???; bytecode; compiled bytecode. Its size can be at most what a u32 can hold
The JASM IL is the "assembly language" that is used in CSLB project. It is a basic language
that doesn't have cool features like namespaces, functions, variables or even types.
Well, something like types exists but not quite like what you might think.
JASM IL uses symbols (which are generally called labels) for conditionals, functions
and compile-time constants. Although it utilizes a couple of registers for use, they are
primarily used for storing pointers, parameter sizes and a couple of flags. So JASM is
essentially a stack-based language.
The general structure of a JASM file is like this:
everything here is discarded unless it is ".prep" without quotes, which starts
the preparation section.
.prep
#here lies the preparation instructions#
.body
#here lies the actual code#
.end
everything here is discarded
Usually, you'd not want to write JASM IL by hand, and leave the generation to the frontend.
A JASM file consists of 4 sections. Some sections are actively used in every file, some are more useful for specific targets. But in any case, every JASM file must contain all of them.
init section is the section before the tokenizer reaches .prep. Everything in this
section is discarded unless it is .prep, which marks the start of the preparation section.
This section can be used for small notes, such as the author of the file or some information
created by the frontend that uses JASM IL.
Preparation section is the section residing between .prep and .body tokens. This section
is especially important for executable targets because the instructions org, sts and sth
are special to this section and the executable targets. The imp instruction that marks the
given library as a runtime library is also special to this section, but it can be present in
every file unlike org, sts and sth.
org <symbol_name>
Marks the given symbol as the entry point of the program. This instruction can only be used
once across every input file. The given symbol might not be present in the same file,
but sts and sth must be. The ordering of these three instructions is important and it must
be exactly as follows:
org
sts
sth
sts <size>
Sets the stack size of the program. The reason for this is to ensure that the script will not use more memory than the programmer indicates. This boosts the performance of the script since it'll never have to allocate and copy memory. It is also a security restriction to ensure that the script will execute in its own context.
Since the IL would usually be generated by the frontend, the stack size and heap size would also be calculated by the frontend.
sth <size>
Sets the heap size of the program. Be aware that unlike the stack, heap size must be a multiple of 8, due to the way allocation mapping works.
imp <path/to/shared/lib.shd>
Adds the given library file to the current context, indicating that it is a runtime assembly, so that the linker can check its symbols in case an unknown symbol is left after the linking process.
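The two rules stated for the preparation section (org, sts, sth in exactly that order, and a heap size that is a multiple of 8) could be checked by a frontend with something like the C++ sketch below. `IsValidHeapSize` and `HasRequiredOrder` are hypothetical helpers, not part of JASM, and ignoring imp when checking the order is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// sth requires a heap size that is a multiple of 8.
inline bool IsValidHeapSize(std::uint32_t size) {
    return size % 8 == 0;
}

// Checks that org, sts and sth each appear once, in exactly that order.
// Other preparation instructions (such as imp) are ignored here (assumption).
inline bool HasRequiredOrder(const std::vector<std::string>& prepInstructions) {
    const std::vector<std::string> expected{"org", "sts", "sth"};
    std::vector<std::string> seen;
    for (const auto& name : prepInstructions)
        if (name == "org" || name == "sts" || name == "sth") seen.push_back(name);
    return seen == expected;
}
```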
Body section is the section residing between .body and .end tokens. The actual code that gets
assembled is contained within this section. There isn't much to tell about this section since it
just contains code. See syntax.
End section is the section residing between the .end token and EOF. Everything here is ignored.
JASM utilizes a couple of registers, some being "system" registers and some being general purpose.
eax; u32; stores return values and returned pointers
ebx; u32; general purpose. main purpose is storing addresses
ecx; u32; general purpose.
edx; u32; general purpose.
esi; u32; general purpose.
edi; u32; general purpose.
al; ubyte; stores return values. might be used for temporary bytes and bools.
bl; ubyte; general purpose. might be used for temporary bytes and bools.
cl; ubyte; general purpose. might be used for temporary bytes and bools.
dl; ubyte; general purpose. might be used for temporary bytes and bools.
pc; u32; Program Counter.
sp; u32; Stack Pointer.
bp; u32; Base Pointer.
flg; ubyte; Program Flags.
Technically every single register is modifiable, and the programmer potentially can use this to their own benefits. I don't know how but I think such thing is possible.
Although the registers are mostly general purpose, I designed JASM to be a stack-based language, so the sole purpose of registers might be just storing pointers and arbitrary values in some cases (such as storing parameter size when creating a callstack). In the syntax section you'll learn more about their uses.
JASM syntax is pretty simple, even for an IL. It can broadly be divided into three subheaders: comments, symbols and instructions.
There are no single line comments in JASM. Every comment is multi line. A comment block starts with a hashtag (#) symbol and ends with, again, a hashtag (#) symbol.
.prep
.body
#A comment can be like this too#
#
or like
this
#
# or this
# !!!!BUT HERE IS NOT A COMMENT!!!
.end
Symbols are a crucial part of JASM. Through them we're able to have compile time constants, functions and conditionals.
.prep
org main
sts 512
sth 0
.body
helloWorld: rom "Hello World" ; #this acts like a compile time constant#
main:
#here, main is the entry symbol of the program#
someFunc:
#
some code here
#
.end
They actually just mark a specific position in ROM (the assembled bytecode itself), so that it can then be referenced from somewhere else.
As of now, there are 123 instructions present in JASM. However, many of these instructions are
actually variants of some parent instruction. So, when writing JASM IL, there are actually
40 instructions to work with, thanks to the use of modeflags. The assembler then turns
these instructions into their sub-instructions based on their modeflags.
The modeflags are not many, but they allow us to write the generic forms of instructions instead of their "type" dependent forms. There are 4 types of modeflags: numeric, memory, register and comparison.
Modeflags take a prefix symbol to form a token, those prefixes are % and &.
Below are all the modeflags, categorized.
NoMode = 0x00
NumericModeFlags:
Int (%i) = 0x01
Float (%f) = 0x02
Byte (%b) = 0x03
UInt (%ui) = 0x04
UByte (%ub) = 0x05
MemoryModeFlags:
Stack (%s) = 0x06
Heap (%h) = 0x07
RegisterModeFlags:
&eax = 0x08
&ebx = 0x09
&ecx = 0x0A
&edx = 0x0B
&esi = 0x0C
&edi = 0x0D
&pc = 0x0E
&sp = 0x0F
&bp = 0x10
&al = 0x11
&bl = 0x12
&cl = 0x13
&dl = 0x14
CompareModeFlags:
%les = 0x15
%gre = 0x16
%equ = 0x17
%leq = 0x18
%geq = 0x19
%neq = 0x1A
Some of these modeflags are just for pure visuals, such as %ui and %ub. Since JASM uses two's complement whenever you use %ui, you can also use %i and nothing will change. But nevertheless, it makes it easier to read the code.
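As a sketch of how the table above maps tokens to flag bytes, here is a small C++ lookup. The values are copied from the list above; `ParseModeFlag` and the `Is...Mode` helpers are hypothetical names, and returning NoMode for an unknown token is an assumption about error handling.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

constexpr std::uint8_t NoMode = 0x00;

// Maps a modeflag token (with its % or & prefix) to its byte value.
// Unknown tokens map to NoMode in this sketch.
inline std::uint8_t ParseModeFlag(const std::string& token) {
    static const std::unordered_map<std::string, std::uint8_t> flags = {
        // numeric
        {"%i", 0x01}, {"%f", 0x02}, {"%b", 0x03}, {"%ui", 0x04}, {"%ub", 0x05},
        // memory
        {"%s", 0x06}, {"%h", 0x07},
        // registers
        {"&eax", 0x08}, {"&ebx", 0x09}, {"&ecx", 0x0A}, {"&edx", 0x0B},
        {"&esi", 0x0C}, {"&edi", 0x0D}, {"&pc", 0x0E}, {"&sp", 0x0F},
        {"&bp", 0x10}, {"&al", 0x11}, {"&bl", 0x12}, {"&cl", 0x13}, {"&dl", 0x14},
        // comparison
        {"%les", 0x15}, {"%gre", 0x16}, {"%equ", 0x17},
        {"%leq", 0x18}, {"%geq", 0x19}, {"%neq", 0x1A},
    };
    const auto it = flags.find(token);
    return it == flags.end() ? NoMode : it->second;
}

// Category checks based on the value ranges in the table.
inline bool IsNumericMode(std::uint8_t f)  { return f >= 0x01 && f <= 0x05; }
inline bool IsMemoryMode(std::uint8_t f)   { return f >= 0x06 && f <= 0x07; }
inline bool IsRegisterMode(std::uint8_t f) { return f >= 0x08 && f <= 0x14; }
inline bool IsCompareMode(std::uint8_t f)  { return f >= 0x15 && f <= 0x1A; }
```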
IMPORTANT NOTE: I'm writing this part from time to time since it's way too boring to do. So if you can't find what you're looking for here, I apologize for the inconvenience.
nop -> 0x00
stc <numeric_mode> <value>
stc %i/%ui/%f <value> -> stt <4bytes>
stc %b/%ub <value> -> ste <byte>
Pushes the given constant to the stack. stt denotes STore Thirtytwo and ste denotes
STore Eight.
stc <mode> <symbol>
stc %i/%ui/%f <symbol> -> stts <4bytes> (marked for linker)
stc %b/%ub <symbol> -> stes <4bytes> (marked for linker)
Pushes a constant from the given symbol to the stack. stts denotes STore Thirtytwo from Symbol
and stes denotes STore Eight from Symbol. The given 4bytes are the positions that the given symbols
denote. If the given symbol doesn't exist, the 4bytes are set to zero and are marked for the linker.
stc <float> -> stt <4bytes> -> 0x01 <4bytes>
stc <integer> -> stt <4bytes> -> 0x01 <4bytes>
Pushes given constant to stack.
Example:
.prep
.body
constant: rom 21 ; #we'll see rom instruction later#
stc %i 0x15 #store the 32-bit integer 21 on stack#
stc 21 #same as above#
stc %i constant #same as above at runtime, since symbol constant marks the integer 21#
.end
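To illustrate what `stc %i 21` turns into, here is a C++ sketch that emits the stt opcode followed by the 4-byte constant. The opcode 0x01 is taken from the mapping above; `EncodeStt` is a hypothetical name, and little-endian byte order for the operand is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Emits "stt <4bytes>": opcode 0x01 followed by the 32-bit constant.
// The operand is written least significant byte first (assumed order).
inline std::vector<std::uint8_t> EncodeStt(std::int32_t value) {
    const std::uint32_t bits = static_cast<std::uint32_t>(value);
    return {
        0x01,
        static_cast<std::uint8_t>(bits & 0xFF),
        static_cast<std::uint8_t>((bits >> 8) & 0xFF),
        static_cast<std::uint8_t>((bits >> 16) & 0xFF),
        static_cast<std::uint8_t>((bits >> 24) & 0xFF),
    };
}
```

Under these assumptions, `EncodeStt(21)` produces the bytes `01 15 00 00 00`.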
ldc <numeric_mode>
ldc %i/%ui/%f -> ldt
ldc %b/%ub -> lde
Loads the value at the top of the stack to memory at the address stored in &ebx.
ldt denotes LoaD Thirtytwo and lde denotes LoaD Eight.
Example:
.prep
.body
#let &ebx contain some address#
stc %i 5 #we stored 5 on stack#
ldc %i #now we'll store 5, which is on top of the stack at the address on &ebx#
.end
rda <register_mode> -> rdr <register_mode>
Reads the given register, and stores its value on stack. rdr stands for ReaD Register
rda %i/%ui/%f -> rdt
rda %b/%ub -> rde
Reads the value from memory at the address stored in &ebx, and stores it on the stack.
rdt denotes ReaD Thirtytwo and rde denotes ReaD Eight.
Example:
.prep
.body
#let &ecx contain the value 15, in decimal#
#and &ebx contain an address, in which the value 21 is stored#
rda &ecx #now 15 is on top of the stack#
rda %i #now 21 is on top of the stack#
#so stack is now 15-21#
.end
mov <register_mode> -> movs <byte>
MOVes the top n bytes depending on the given register size from the top of the stack to given register.
mov <register_mode> <register_mode> -> movr <byte> <byte>
MOVes the value of the first register to second one.
mov <value> <register_mode>
mov %b/%ub <register_mode> -> movc <byte> <byte>
mov %i/%ui/%f <register_mode> -> movc <byte> <4bytes>
MOVes the given constant value to given register.
These arithmetic operations differ just in name, so I put them under the same title. Where [op] <- {add, sub, mul, div}
[op] <numeric_mode>
[op] %i/%ui -> [op]i
[op] %b/%ub -> [op]b
[op] %f -> [op]f
Pops and applies [op] to values from the top of the stack, then pushes the result.
[op] <numeric_mode> <register_mode> <register_mode>
[op] %i/%ui -> [op]ri <byte> <byte>
[op] %b/%ub -> [op]rb <byte> <byte>
[op] %f -> [op]rf <byte> <byte>
Applies [op] to the two given registers. For example, add adds the first register to the second.
These arithmetic operations differ just in name, so I put them under the same title too. Where [op] <- {adds, subs, muls, divs}
[op] <numeric_mode>
[op] %i/%ui -> [op]i
[op] %b/%ub -> [op]b
[op] %f -> [op]f
Applies [op] to values from the top of the stack without popping, then pushes the result.
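The difference between the popping form (add) and the non-popping form (adds) can be sketched on a plain vector used as a stack. This C++ sketch only models the `%i` case for add and adds; `AddInt` and `AddsInt` are hypothetical names and say nothing about how CSR actually implements the instructions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Wrapping 32-bit addition, done on unsigned values to avoid signed overflow.
inline std::int32_t WrapAdd(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(a) +
                                     static_cast<std::uint32_t>(b));
}

// add %i: pops the two topmost values and pushes their sum.
inline void AddInt(std::vector<std::int32_t>& stack) {
    assert(stack.size() >= 2);
    const std::int32_t rhs = stack.back(); stack.pop_back();
    const std::int32_t lhs = stack.back(); stack.pop_back();
    stack.push_back(WrapAdd(lhs, rhs));
}

// adds %i: leaves the two topmost values in place and pushes their sum.
inline void AddsInt(std::vector<std::int32_t>& stack) {
    assert(stack.size() >= 2);
    const std::int32_t rhs = stack[stack.size() - 1];
    const std::int32_t lhs = stack[stack.size() - 2];
    stack.push_back(WrapAdd(lhs, rhs));
}
```

Starting from a stack holding 15 and 21, `AddInt` leaves just 36, while `AddsInt` leaves 15, 21 and 36.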
These too differ just in name. Where [op] <- {inc, dcr}
[op] <numeric_mode> <value>
[op] %i/%ui <value> -> [op]i <4bytes>
[op] %b/%ub <value> -> [op]b <byte>
[op] %f <value> -> [op]f <4bytes>
Increments (or, for dcr, decrements) the value at the top of the stack <value> times.
[op] <numeric_mode> <register_mode> <value>
[op] %i/%ui <register_mode> <value> -> [op]ri <byte> <4bytes>
[op] %b/%ub <register_mode> <value> -> [op]rb <byte> <byte>
[op] %f <register_mode> <value> -> [op]rf <byte> <4bytes>
Increments (or, for dcr, decrements) the value in the given register <value> times. Note that if you specify a sizewise mismatching mode between <numeric_mode> and <register_mode>, the assembler will throw an error. You can use %b/%ub with 32-bit registers but you can't use %i/%ui/%f with 8-bit registers.
These too differ just in name. Where [op] <- {incs, dcrs}
[op] <numeric_mode> <value>
[op] %i/%ui <value> -> [op]i <4bytes>
[op] %b/%ub <value> -> [op]b <byte>
[op] %f <value> -> [op]f <4bytes>
Increments (or, for dcrs, decrements) the value at the top of the stack <value> times and pushes the result, without modifying the value on top of the stack.
mcp and or nor swp dup raw rom inv invs cmp pop jmp swr dur rep alc pow sqr cnd cal ret del
Atomic types are the types that are natively supported by JASM which can be manipulated using instructions. Arrays, lists, maps, strings, layouts are not atomic since JASM can't directly manipulate them. Below are the atomic types of JASM.
Int32
UInt32
Byte
UByte
Float (Single Precision)
And that's all. All JASM instructions are built around these types. Complex types besides these are compiler illusions, which are up to you to create since CSLB is only a backend.
Each JASM file is independent of the others. If you want to split code across files, you can directly use the symbols you wish to use: as long as the file that contains the symbol passes through the linker, the linker will handle the rest. So there is no need to mark the symbol as external, include headers or anything.
Although it looks like it can lead to terrible debugging sessions and confusion when reading the code, remember that JASM is not designed to be used directly. The compilers will do the checks anyway, long before sending the generated JASM code to the assembler, so why check it again?
main.jasm
.prep
org main
sts 0
sth 0
.body
main:
mov 0 &bl #bl is guaranteed to be initially zero but whatever#
cal test
.end
test.jasm
.prep
.body
test:
mov 5 &eax
ret
.end
This is perfectly fine as long as you include both files when assembling.
It is also possible to use unknown symbols which are present in the libraries you'll link against. Linker will check the symbols and resolve the issues.
JASM (this repo) is the assembler and linker part of the CSLB. So IL is nothing but a bunch of symbols to be converted to bytecode format at this point. The execution and how it works part is handled by CSR (link can be down due to me not making the repo public) so the execution part will be explained in CSR's documentation.
But if your question is "why" rather than "how", the answer is "I too don't know" 95% of the time. For the 5%, you can contact me and ask anything. Probably I'll learn more from you than you from me so a win/win situation.