Lexer

General

  • The types of all tokens produced by the lexer are defined in token_type.hpp.
  • A to_string function converts token types to their string representations.
  • The lexer uses flex for tokenization, and the rules are defined in lex.l.
  • Line numbers come from flex's yylineno feature; column numbers are tracked manually.
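
As a rough illustration of manual column tracking, the sketch below advances a position past each matched lexeme. The `Position` struct and `advance` function are hypothetical names chosen for this example and are not taken from lex.l.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical position tracker; names are illustrative only.
struct Position {
    std::size_t line = 1;    // flex's yylineno plays this role in the real lexer
    std::size_t column = 1;  // tracked by hand, 1-based
};

// Advance the position past one matched lexeme.
// A newline resets the column; any other character moves it one to the right.
inline void advance(Position& pos, std::string_view lexeme) {
    for (char c : lexeme) {
        if (c == '\n') {
            ++pos.line;
            pos.column = 1;
        } else {
            ++pos.column;
        }
    }
}
```

In a flex scanner, a helper like this is commonly invoked with the matched text from the YY_USER_ACTION macro; whether lex.l does so is an assumption.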

Comments

  • Comments are ignored during tokenization; they produce no tokens.
  • Single-line comments start with // and continue until the end of the line.
  • Multi-line comments start with /* and end with */. The lexer uses a dedicated state to let them span multiple lines.
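
The two comment forms can be sketched outside flex as a plain C++ function. This is only an approximation of what the flex rules and states do; `skip_comment` is a hypothetical helper, and treating an unclosed /* as a failure is an assumption.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the index just past a comment starting at `pos`,
// `pos` itself if no comment starts there,
// or std::nullopt if a /* ... */ comment is never closed.
// Precondition: pos <= src.size().
inline std::optional<std::size_t> skip_comment(std::string_view src, std::size_t pos) {
    if (src.substr(pos, 2) == "//") {
        std::size_t end = src.find('\n', pos);
        // Stop at the newline (left for the caller) or at end of input.
        return end == std::string_view::npos ? src.size() : end;
    }
    if (src.substr(pos, 2) == "/*") {
        // Search after the opener so that "/*/" is not treated as closed.
        std::size_t end = src.find("*/", pos + 2);
        if (end == std::string_view::npos) return std::nullopt;  // unterminated
        return end + 2;
    }
    return pos;
}
```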

Strings

  • Strings are enclosed in double quotes " and can contain escaped characters.
  • The lexer handles string literals by entering a STRING state when it encounters a double quote.
  • Inside the STRING state, it recognizes escaped characters like \", \\, and \n.
  • The lexer will continue to read characters until it finds a closing double quote but will throw an error if it encounters an unescaped newline.
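
A minimal sketch of the STRING-state behaviour described above, written as an ordinary C++ function. `scan_string` is a hypothetical name, only the three escapes named above are handled, and using exceptions for errors is an assumption made to keep the example short.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Scans a string literal whose opening quote is at src[start].
// Returns the decoded contents and sets `end` to the index after the closing quote.
// Throws std::runtime_error on an unescaped newline, an unknown escape,
// or a missing closing quote.
inline std::string scan_string(std::string_view src, std::size_t start, std::size_t& end) {
    std::string value;
    std::size_t i = start + 1;  // skip the opening quote
    while (i < src.size()) {
        char c = src[i];
        if (c == '"') {
            end = i + 1;
            return value;
        }
        if (c == '\n') throw std::runtime_error("unescaped newline in string literal");
        if (c == '\\') {
            if (i + 1 >= src.size()) break;  // backslash at end of input
            switch (src[i + 1]) {
                case 'n':  value += '\n'; break;
                case '"':  value += '"';  break;
                case '\\': value += '\\'; break;
                default:   throw std::runtime_error("unknown escape sequence");
            }
            i += 2;
            continue;
        }
        value += c;
        ++i;
    }
    throw std::runtime_error("unterminated string literal");
}
```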

Escaped Characters

  • The lexer recognizes various escaped characters within string literals as well as character literals.
  • It also supports octal and hexadecimal escapes.
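
The exact syntax of the numeric escapes is not spelled out here. The sketch below assumes C-like rules: \x followed by one or more hexadecimal digits, or one to three octal digits. `decode_numeric_escape` and `hex_value` are hypothetical helpers.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Value of a hexadecimal digit, or -1 if `c` is not one.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes the text that follows a backslash, e.g. "x41" or "101".
// On success returns the value and sets `consumed` to the number of characters used.
inline std::optional<unsigned> decode_numeric_escape(std::string_view body,
                                                     std::size_t& consumed) {
    unsigned value = 0;
    std::size_t i = 0;
    if (!body.empty() && body[0] == 'x') {
        i = 1;
        while (i < body.size() && hex_value(body[i]) >= 0) {
            value = value * 16 + static_cast<unsigned>(hex_value(body[i]));
            ++i;
        }
        if (i == 1) return std::nullopt;  // "\x" with no digits
    } else {
        while (i < 3 && i < body.size() && body[i] >= '0' && body[i] <= '7') {
            value = value * 8 + static_cast<unsigned>(body[i] - '0');
            ++i;
        }
        if (i == 0) return std::nullopt;  // not a numeric escape
    }
    consumed = i;
    return value;
}
```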

Int Literals

  • Int literals can be in decimal, octal, hexadecimal, or binary format.
    • Decimal: 123
    • Octal: 0777
    • Hexadecimal: 0xC0FFEE
    • Binary: 0b1101
  • Char literals are enclosed in single quotes ' and can contain escaped characters like \', \\, and \n.
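
To illustrate how the four int formats can be told apart by prefix, here is a sketch that converts a literal's text to a value. `parse_int_literal` is a hypothetical helper; whether the real lexer computes values or only records the matched text is not stated, so treat the conversion step as an assumption.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Picks the radix from the prefix: 0x/0X -> 16, 0b/0B -> 2, leading 0 -> 8, else 10.
// Returns std::nullopt for malformed or out-of-range literals.
inline std::optional<std::uint64_t> parse_int_literal(std::string_view text) {
    int base = 10;
    auto has_prefix = [&](char lower, char upper) {
        return text.size() > 2 && text[0] == '0' &&
               (text[1] == lower || text[1] == upper);
    };
    if (has_prefix('x', 'X')) {
        base = 16;
        text.remove_prefix(2);
    } else if (has_prefix('b', 'B')) {
        base = 2;
        text.remove_prefix(2);
    } else if (text.size() > 1 && text[0] == '0') {
        base = 8;
        text.remove_prefix(1);
    }

    std::uint64_t value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    auto [ptr, ec] = std::from_chars(first, last, value, base);
    // Reject overflow, empty digit strings, and trailing characters such as '8' in octal.
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}
```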

Float Literals

  • Float literals support several formats:
    • Standard decimal notation: 123.456
    • Scientific notation:
      • 1.23e4 or 1.23E4 (equals 12300.0)
      • 1.23e-4 or 1.23E-4 (equals 0.000123)

Note

Float literals do not support forms that omit the digits before or after the decimal point, such as .123 or 123. (with a trailing dot).
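
The accepted and rejected forms can be checked with a regular expression. The pattern below is an assumption reconstructed from the examples in this section (digits required on both sides of the dot, optional exponent); the actual rule in lex.l may differ, for instance in whether it accepts an exponent without a fractional part.

```cpp
#include <regex>
#include <string>

// Hypothetical checker mirroring the float formats listed above.
inline bool is_float_literal(const std::string& text) {
    // digits '.' digits, then an optional exponent such as e4, E-4, or e+10
    static const std::regex pattern(R"([0-9]+\.[0-9]+([eE][+-]?[0-9]+)?)");
    return std::regex_match(text, pattern);
}
```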

Reserved Keywords and Identifiers

  • Token types are defined in token_type.hpp, and the patterns for reserved keywords are defined in the reserved_words map in token_type.cpp.
  • The lexer first matches a word consisting only of alphanumeric characters and underscores, then looks it up in the reserved keywords.
  • If it matches a reserved keyword, it is returned as that token type; otherwise, it is returned as an identifier token type.
  • To add a new reserved keyword, you need to:
    • Add it to the TokenType enum in token_type.hpp.
    • Add its pattern to the reserved_words map in token_type.cpp.
    • Update the token_to_string_map in token_type.cpp to include the new keyword.
  • Bool literals (true and false) are also recognized this way.
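
The match-then-look-up approach, and the three places a new keyword touches, can be sketched as follows. The enum members and map contents are hypothetical stand-ins; the real definitions live in token_type.hpp and token_type.cpp.

```cpp
#include <map>
#include <string>
#include <unordered_map>

// (1) Hypothetical subset of the TokenType enum.
enum class TokenType { Identifier, KeywordIf, KeywordWhile, BoolLiteral };

// (2) Stand-in for the reserved_words map.
inline const std::unordered_map<std::string, TokenType>& reserved_words() {
    static const std::unordered_map<std::string, TokenType> table = {
        {"if", TokenType::KeywordIf},
        {"while", TokenType::KeywordWhile},
        {"true", TokenType::BoolLiteral},
        {"false", TokenType::BoolLiteral},
    };
    return table;
}

// (3) Stand-in for token_to_string_map and the to_string function.
inline std::string to_string(TokenType type) {
    static const std::map<TokenType, std::string> names = {
        {TokenType::Identifier, "IDENTIFIER"},
        {TokenType::KeywordIf, "IF"},
        {TokenType::KeywordWhile, "WHILE"},
        {TokenType::BoolLiteral, "BOOL_LITERAL"},
    };
    auto it = names.find(type);
    return it == names.end() ? "UNKNOWN" : it->second;
}

// A word that has already matched the identifier pattern is looked up in the
// keyword table; anything not found there is an ordinary identifier.
inline TokenType classify_word(const std::string& word) {
    auto it = reserved_words().find(word);
    return it == reserved_words().end() ? TokenType::Identifier : it->second;
}
```

Looking keywords up after matching keeps the flex rules small: one identifier rule covers every keyword, and the table decides the final token type.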

Operators

  • The lexer recognizes the operators defined in operators_map. Each operator is mapped to a specific token type in the TokenType enum.
  • To add a new operator, define its token type in enum class TokenType in token_type.hpp, add the operator pattern to operators_map, and update token_to_string_map in token_type.cpp.
  • Operators are matched with a manually implemented maximal munch (the longest matching operator wins), so that operators_map remains the single source of truth.

Note

The operator-matching logic is complex, but it cannot be factored out into a helper function: the functions it relies on exist only within the scope of flex's scanner class and cannot be called from ordinary functions defined in the lexer file.
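
Outside flex, the maximal-munch idea itself is small: among all operators that prefix the remaining input, take the longest. The sketch below shows only that idea. `OpKind`, the map contents, and `longest_operator` are hypothetical, and the real lexer performs the equivalent steps inline in its flex action.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the operator members of TokenType.
enum class OpKind { Plus, PlusPlus, PlusAssign, Less, ShiftLeft, ShiftLeftAssign };

// Stand-in for operators_map: the single table that lists every operator.
inline const std::map<std::string, OpKind>& operators_map() {
    static const std::map<std::string, OpKind> table = {
        {"+", OpKind::Plus},      {"++", OpKind::PlusPlus},
        {"+=", OpKind::PlusAssign}, {"<", OpKind::Less},
        {"<<", OpKind::ShiftLeft},  {"<<=", OpKind::ShiftLeftAssign},
    };
    return table;
}

// Returns the longest operator that is a prefix of `input`, with its length,
// or std::nullopt if no operator matches.
inline std::optional<std::pair<OpKind, std::size_t>> longest_operator(std::string_view input) {
    std::optional<std::pair<OpKind, std::size_t>> best;
    for (const auto& [text, kind] : operators_map()) {
        bool is_prefix = input.substr(0, text.size()) == std::string_view(text);
        if (is_prefix && (!best || text.size() > best->second)) {
            best = std::make_pair(kind, text.size());
        }
    }
    return best;
}
```

For the input `<<= 1`, the candidates `<`, `<<`, and `<<=` all match, and the three-character operator is chosen.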

Error Handling

  • Any input that matches none of the rules above is treated as an error.
  • The lexer uses a custom error handler function error_handler to report errors.
  • The error handler prints the error message along with the line number and column number where the error occurred.
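
A diagnostic of this kind might be assembled as in the sketch below. The message layout and the `format_error` name are assumptions; the real error_handler prints its message, whereas this version returns a string so that it is easy to inspect.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical formatter: combines the message with the position of the error.
inline std::string format_error(const std::string& message,
                                std::size_t line, std::size_t column) {
    std::ostringstream out;
    out << "Lexer error at line " << line << ", column " << column << ": " << message;
    return out.str();
}
```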

Conclusion

  • The current lexer implementation prints out a table of tokens with their types and values.
  • The lexer is designed to be extensible: adding a new token, operator, or keyword takes at most three updates.
