- The types of all tokens produced by the lexer are defined in `token_type.hpp`.
- A `to_string` function converts token types to their string representations.
- The lexer uses `flex` for tokenization, and the rules are defined in `lex.l`.
- Line numbers come from flex's `yylineno` feature; column numbers are tracked manually.
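As a rough illustration of the points above, here is a minimal sketch of a token type enum with a `to_string` helper, plus manual column tracking. The enumerators and the `update_column` helper are hypothetical and are not taken from `token_type.hpp` or `lex.l`. In a flex lexer, a helper like this is often called from the `YY_USER_ACTION` hook with the matched text, but that is an assumption about this lexer.

```cpp
#include <string>

// Hypothetical subset of token types; the real list lives in token_type.hpp.
enum class TokenType { Identifier, IntLiteral, Plus, Unknown };

// Convert a token type to a printable name.
std::string to_string(TokenType type) {
    switch (type) {
    case TokenType::Identifier: return "Identifier";
    case TokenType::IntLiteral: return "IntLiteral";
    case TokenType::Plus:       return "Plus";
    case TokenType::Unknown:    return "Unknown";
    }
    return "Unknown";
}

// Advance a 1-based column counter over the text of one match.
// A newline resets the column; line counting is left to yylineno.
void update_column(int& column, const std::string& text) {
    for (char c : text) {
        if (c == '\n') {
            column = 1;
        } else {
            ++column;
        }
    }
}
```

A token's starting column is the counter's value before `update_column` runs for that token.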
- Comments are ignored during tokenization.
- Single-line comments start with `//` and continue until the end of the line.
- Multi-line comments start with `/*` and end with `*/`. They can span multiple lines; the lexer handles this with state management.
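The state management for comments can be pictured with a small hand-written state machine. This is only a sketch of the idea in plain C++, not the flex rules from `lex.l`. The function name `strip_comments` is hypothetical, and for brevity it does not treat `//` or `/*` inside string literals specially.

```cpp
#include <string>

// Remove // and /* */ comments from src.
// Newlines are preserved so that line numbers stay unchanged.
std::string strip_comments(const std::string& src) {
    enum class State { Code, LineComment, BlockComment };
    State state = State::Code;
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        switch (state) {
        case State::Code:
            if (c == '/' && next == '/') {
                state = State::LineComment;
                ++i;  // skip the second '/'
            } else if (c == '/' && next == '*') {
                state = State::BlockComment;
                ++i;  // skip the '*'
            } else {
                out += c;
            }
            break;
        case State::LineComment:
            if (c == '\n') {
                state = State::Code;
                out += c;
            }
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') {
                state = State::Code;
                ++i;  // skip the '/'
            } else if (c == '\n') {
                out += c;
            }
            break;
        }
    }
    return out;
}
```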
- Strings are enclosed in double quotes `"` and can contain escaped characters.
- The lexer handles string literals by entering a `STRING` state when it encounters a double quote.
- Inside the `STRING` state, it recognizes escaped characters like `\"`, `\\`, and `\n`.
- The lexer keeps reading characters until it finds a closing double quote, but throws an error if it encounters an unescaped newline.
- The lexer recognizes various escaped characters within string literals as well as character literals.
- It also supports octal and hexadecimal escapes.
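The escape handling described above could be sketched as follows. This is a standalone illustration, not the code in `lex.l`. The names `decode_string_body` and `hex_value` are hypothetical, and the limits of up to three octal digits and up to two hexadecimal digits are assumptions borrowed from common C-like conventions.

```cpp
#include <stdexcept>
#include <string>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decode the text between the quotes of a string literal.
std::string decode_string_body(const std::string& body) {
    std::string out;
    std::size_t i = 0;
    while (i < body.size()) {
        const char c = body[i++];
        if (c == '\n') throw std::runtime_error("unescaped newline in string literal");
        if (c != '\\') { out += c; continue; }
        if (i >= body.size()) throw std::runtime_error("dangling backslash");
        const char e = body[i++];
        if (e == 'n') { out += '\n'; }
        else if (e == 't') { out += '\t'; }
        else if (e == '\\' || e == '"' || e == '\'') { out += e; }
        else if (e == 'x') {
            // Assumed form: \x followed by one or two hex digits.
            int value = 0, digits = 0;
            while (i < body.size() && digits < 2 && hex_value(body[i]) >= 0) {
                value = value * 16 + hex_value(body[i++]);
                ++digits;
            }
            if (digits == 0) throw std::runtime_error("\\x needs at least one hex digit");
            out += static_cast<char>(value);
        } else if (e >= '0' && e <= '7') {
            // Assumed form: one to three octal digits, value at most 0377.
            int value = e - '0', digits = 1;
            while (i < body.size() && digits < 3 && body[i] >= '0' && body[i] <= '7') {
                value = value * 8 + (body[i++] - '0');
                ++digits;
            }
            if (value > 255) throw std::runtime_error("octal escape out of range");
            out += static_cast<char>(value);
        } else {
            throw std::runtime_error("unknown escape sequence");
        }
    }
    return out;
}
```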
- Int literals can be in decimal, octal, hexadecimal, or binary format.
  - Decimal: `123`
  - Octal: `0777`
  - Hexadecimal: `0xC0FFEE`
  - Binary: `0b1101`
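To show how the prefix selects the base, here is a small sketch that converts the matched text of an int literal to a value. The function `parse_int_literal` is hypothetical, and accepting only lowercase `0x` and `0b` prefixes is an assumption based on the examples above. It relies on `std::stoull`, which would also accept leading whitespace or a sign; a lexer pattern would normally rule those out before this point.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Parse an integer literal, choosing the base from its prefix.
std::uint64_t parse_int_literal(const std::string& text) {
    int base = 10;
    std::size_t start = 0;
    if (text.size() > 2 && text[0] == '0' && text[1] == 'x') {
        base = 16;
        start = 2;
    } else if (text.size() > 2 && text[0] == '0' && text[1] == 'b') {
        base = 2;
        start = 2;
    } else if (text.size() > 1 && text[0] == '0') {
        base = 8;
        start = 1;
    }
    std::size_t consumed = 0;
    // std::stoull throws std::invalid_argument if no digits can be converted.
    const std::uint64_t value = std::stoull(text.substr(start), &consumed, base);
    if (consumed != text.size() - start) {
        throw std::invalid_argument("invalid digit in integer literal: " + text);
    }
    return value;
}
```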
- Char literals are enclosed in single quotes `'` and can contain escaped characters like `\'`, `\\`, and `\n`.
- Float literals support several formats:
  - Standard decimal notation: `123.456`
  - Scientific notation:
    - `1.23e4` or `1.23E4` (equals 12300.0)
    - `1.23e-4` or `1.23E-4` (equals 0.000123)
Note
Float literals do not support patterns like `.123` or `123.`.
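The accepted and rejected float forms can be checked with a regular expression. The pattern below is a guess at the shape of the rule, not a copy of the one in `lex.l`. It assumes digits are required on both sides of the decimal point and that the exponent is optional; `is_float_literal` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// True if text has the form digits '.' digits with an optional exponent.
bool is_float_literal(const std::string& text) {
    static const std::regex pattern(R"([0-9]+\.[0-9]+([eE][+-]?[0-9]+)?)");
    return std::regex_match(text, pattern);
}
```

With this pattern, `123.456` and `1.23e-4` match, while `.123` and `123.` do not.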
- Token types are stored in `token_type.hpp`, and the patterns for reserved keywords are defined in `reserved_words` in `token_type.cpp`.
- First, a pattern containing only alphanumeric characters and underscores is matched, and the match is then checked against the reserved keywords.
- If it matches a reserved keyword, it is returned as that token type; otherwise, it is returned as an identifier token type.
- Thus, to add a new reserved keyword, you need to:
  - Add it to the `TokenType` enum in `token_type.hpp`.
  - Add its pattern to the `reserved_words` map in `token_type.cpp`.
  - Update the `token_to_string_map` in `token_type.cpp` to include the new keyword.
- Bool literals also fall in this category, and they are recognized as `true` or `false`.
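The match-then-look-up approach for keywords might look like the sketch below. The enumerators, the contents of `reserved_words`, and the `classify_word` helper are all hypothetical; only the name `reserved_words` and the idea of a map come from the description above.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical subset of token types.
enum class TokenType { Identifier, If, While, True, False };

// Hypothetical contents; the real map is defined in token_type.cpp.
const std::unordered_map<std::string, TokenType> reserved_words = {
    {"if", TokenType::If},
    {"while", TokenType::While},
    {"true", TokenType::True},
    {"false", TokenType::False},
};

// Classify a word that has already matched the identifier pattern.
TokenType classify_word(const std::string& word) {
    const auto it = reserved_words.find(word);
    return it != reserved_words.end() ? it->second : TokenType::Identifier;
}
```

Because the whole word is matched before the lookup, `whilex` is classified as an identifier and is not split into `while` followed by `x`.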
- The lexer recognizes various operators defined in `operators_map`. Each operator is mapped to a specific token type in the `TokenType` enum.
- To add a new operator, first define it in the enum class `TokenType` in `token_type.hpp`, then add the operator pattern to `operators_map`, and also update the `token_to_string_map` in `token_type.cpp`.
- Operators are matched with a manual maximal-munch loop (the longest matching operator wins) so that `operators_map` remains the single source of truth.
Note
The pattern matching for operators is complex but cannot be abstracted away into a function, because the functions it relies on exist only within the scope of flex's class and cannot be accessed from ordinary functions defined in the lexer file.
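Maximal munch over a map of operators can be illustrated on an in-memory string. This sketch deliberately avoids flex's input functions, so it shows the idea only and is not how the code in `lex.l` is organized. The enumerators, the map contents, and `match_operator` are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical subset of token types.
enum class TokenType { Plus, PlusPlus, PlusAssign, Less, ShiftLeft, ShiftLeftAssign, Unknown };

// Hypothetical contents of the operator table.
const std::unordered_map<std::string, TokenType> operators_map = {
    {"+", TokenType::Plus},       {"++", TokenType::PlusPlus},
    {"+=", TokenType::PlusAssign}, {"<", TokenType::Less},
    {"<<", TokenType::ShiftLeft},  {"<<=", TokenType::ShiftLeftAssign},
};

// Return the type and length of the longest operator starting at pos.
// The length is 0 if no operator starts there.
std::pair<TokenType, std::size_t> match_operator(const std::string& input, std::size_t pos) {
    if (pos >= input.size()) return {TokenType::Unknown, 0};
    std::size_t max_len = 0;
    for (const auto& entry : operators_map) {
        max_len = std::max(max_len, entry.first.size());
    }
    // Try the longest candidate first and shrink until something matches.
    for (std::size_t len = std::min(max_len, input.size() - pos); len > 0; --len) {
        const auto it = operators_map.find(input.substr(pos, len));
        if (it != operators_map.end()) return {it->second, len};
    }
    return {TokenType::Unknown, 0};
}
```

Trying the longest candidate first means `<<=` is reported as one token instead of `<<` followed by `=`, and the table stays the only place where operators are listed.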
- Any input not matched by the rules above is considered an error.
- The lexer uses a custom error handler function, `error_handler`, to report errors.
- The error handler prints the error message along with the line number and column number where the error occurred.
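An error handler of this kind could be as small as the sketch below. The signature, the message format, and the `format_error` helper are assumptions for illustration; the source only states that `error_handler` prints the message with its line and column.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Build the message text separately so it can be checked without capturing stderr.
std::string format_error(const std::string& message, int line, int column) {
    std::ostringstream out;
    out << "Error at line " << line << ", column " << column << ": " << message;
    return out.str();
}

// Hypothetical signature: report a lexing error on standard error.
void error_handler(const std::string& message, int line, int column) {
    std::cerr << format_error(message, line, column) << '\n';
}
```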
- The current lexer implementation prints out a table of tokens with their types and values.
- The lexer is designed to be extensible: adding a new token, operator, or keyword takes at most three updates.