Control characters that lxml rejects, of which there are many. A notable one is '\x0c' (form feed), which Google Docs generates as a page break. See a similar bug across the git pond: https://github.com/html5lib/html5lib-python/issues/96
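
For illustration, a minimal sketch of filtering such characters out of a string before it reaches lxml. The helper name `strip_control_chars` and the exact character range are assumptions, not taken from this change: the range covers the C0 control characters that fall outside the XML 1.0 character set (everything below 0x20 except tab, newline, and carriage return), and the set lxml actually rejects may differ.

```python
import re

# Assumption: C0 control characters outside the XML 1.0 character range,
# i.e. everything below 0x20 except tab (\x09), newline (\x0a), and
# carriage return (\x0d). This includes '\x0c' (form feed).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text: str, replacement: str = "") -> str:
    """Hypothetical helper: replace control characters with `replacement`."""
    return _CONTROL_CHARS.sub(replacement, text)
```

Passing a `replacement` such as `"\n"` keeps some trace of the page break rather than silently joining the surrounding text.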