Token Groups
A TokenGroup is a logical grouping of tokens based primarily on what can be represented in a single line. Token groups are what python-minimizer uses as the basis for reasoning about the code it is given to minimize.
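To see what "a single line" means here, note that Python's tokenize module already distinguishes physical line breaks inside open brackets (NL tokens) from the end of a logical line (NEWLINE tokens). A quick sketch, using only the standard library (the src snippet is just an example):

```python
from io import StringIO
from tokenize import generate_tokens, NEWLINE, NL, tok_name

# A bracketed expression spanning three physical lines is still one logical line.
src = "x = [1,\n     2,\n     3]\ny = 0\n"

# tokenize marks line breaks inside open brackets as NL and the end of a
# logical line as NEWLINE, which is exactly what the grouping relies on.
kinds = [tok_name[tok[0]]
         for tok in generate_tokens(StringIO(src).readline)
         if tok[0] in (NEWLINE, NL)]
print(kinds)  # the two breaks inside the brackets are NL; the rest are NEWLINE
```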
The main function for creating token groups is group_tokens. Here, we can see the algorithm uses Python's tokenize module to generate individual tokens:
io_wrapper = StringIO(sbuf)
[...]
for tok in generate_tokens(io_wrapper.readline):

The tokens it returns are tuples containing information about the individual words in a Python statement. The algorithm accumulates tokens in a TokenGroup until it detects a newline and every bracket it has seen is closed. When that occurs, the TokenGroup is closed, and a new TokenGroup is created to store the next logical line:
if tok[0] == OP and tok[1] in ('(', '[', '{'):
    bracket_ctr += 1
elif tok[0] == OP and tok[1] in (')', ']', '}'):
    bracket_ctr -= 1

if tok[0] in (NEWLINE, NL, ENDMARKER, INDENT, DEDENT) and bracket_ctr == 0:
    group.append(tok)
    groups.append(group)
    [...]
    group = TokenGroup()
else:
    group.append(tok)

The really important work happens in TokenGroup's append function. This function decides what type of group the TokenGroup is as it learns more information about the tokens it contains. All TokenGroups start off with the UNKNOWN type, and as each new token is appended, the type can change based on what the group learns from that token. Here is the algorithm:
self._tokens.append(tok)
if self.type == TokenGroup.Type.UNKNOWN:
    self.type = get_type(tok)
elif self.type == TokenGroup.Type.CODE and tok[0] == COMMENT:
    self.type = TokenGroup.Type.CODE_INLINE_COMMENT
elif self.type == TokenGroup.Type.BLANK_LINE:
    self.type = get_type(tok)
elif self.type == TokenGroup.Type.DOCSTRING and tok[0] not in (STRING, NEWLINE):
    self.type = get_type(tok)

The helper function get_type uses the type information from each individual token to determine a type for the larger token group, but as you can see from the append algorithm, it isn't always right on the first try!
def get_type(t):
    if t[0] == NAME or t[0] == OP:
        return TokenGroup.Type.CODE
    elif t[0] == COMMENT:
        if t[1].startswith('#!'):
            return TokenGroup.Type.SHEBANG
        return TokenGroup.Type.COMMENT
    elif t[0] == STRING:
        return TokenGroup.Type.DOCSTRING
    elif t[0] == NL:
        return TokenGroup.Type.BLANK_LINE
    elif t[0] == INDENT:
        return TokenGroup.Type.INDENT
    elif t[0] == DEDENT:
        return TokenGroup.Type.DEDENT
    elif t[0] == ENDMARKER:
        return TokenGroup.Type.EOF

Once tokens are grouped and the group type is properly set, removing things like docstrings and blank lines is as simple as a list comprehension. Here is an example of removing docstrings:
def remove_docstrings(token_groups):
    return [grp for grp in token_groups if grp.type != TokenGroup.Type.DOCSTRING]

One important note is the difference between COMMENT and CODE_INLINE_COMMENT TokenGroups. When removing comments, TokenGroups of type COMMENT can be removed wholesale; CODE_INLINE_COMMENT groups, however, require a little more work to sift through the group, find the comments, and excise only that portion. This is done in the remove_comments function as follows:
def remove_comments(token_groups):
    tmp = []
    for grp in token_groups:
        if grp.type == TokenGroup.Type.CODE_INLINE_COMMENT:
            group = TokenGroup()
            for tok in grp._tokens:
                if tok[0] != COMMENT:
                    group.append(tok)
            tmp.append(group)
        else:
            tmp.append(grp)
    ret = [grp for grp in tmp if grp.type != TokenGroup.Type.COMMENT]
    return ret

Each TokenGroup knows how to produce a line containing its relevant tokens. Care is taken to separate word-style operators from symbol-style operators, but the original whitespace is ignored in favor of maintaining the minimum lexically required spacing. This is the main strength of python-minimizer: it uses the type information from Python's tokenize module to understand what the code looks like. That is to say, python-minimizer works regardless of the original source code's format (as long as it is legal Python), since it uses Python's own understanding of the language to make decisions about where tokens belong. Here is a snippet from the TokenGroup's untokenize function:
_WORD_OPS = ('and', 'or', 'not', 'is', 'in', 'for', 'while', 'return')

for tok in self._tokens:
    if tok[0] != NEWLINE and tok[0] != NL:
        if prev:
            [...]
            if (prev[0] == NAME and tok[0] == NAME) or \
               (prev[0] == OP and tok[1] in self._WORD_OPS) or \
               (tok[0] in (OP, STRING) and prev[1] in self._WORD_OPS):
                ret = ''.join([ret, wspace_char])

The module-level untokenize function itself looks for groups of type INDENT and DEDENT to keep track of the current indentation level, but otherwise leaves the work to the TokenGroup itself:
ret = []
indent_lvl = 0
for grp in tgroups:
    if grp.type == TokenGroup.Type.INDENT:
        indent_lvl += 1
        continue
    elif grp.type == TokenGroup.Type.DEDENT:
        indent_lvl -= 1
        continue
    elif grp.type == TokenGroup.Type.EOF:
        continue
    ret.append(
        ''.join([indent_char*indent_lvl, grp.untokenize(rmwspace, wspace_char)])
    )
return '\n'.join(ret)