16 changes: 9 additions & 7 deletions source/articles/example-CSS-selectors-easy-way.md
@@ -1,5 +1,7 @@
# Examples: CSS selectors, the easy way

For the full CSS and Selectors API reference, see the [CSS module](../modules/css.md) and [Selectors module](../modules/selectors.md) documentation.

Let's start with an easy example of using `lexbor` for parsing and serializing
CSS selectors. This example breaks down the major steps and elements, explaining
the overall purpose, requirements, and assumptions at each step.
@@ -30,7 +32,7 @@ real-world example will be provided later.
The code includes the necessary header files and defines a callback function
(`callback`) that prints the parsed data.

```c
```C
#include <lexbor/css/css.h>

lxb_status_t callback(const lxb_char_t *data, size_t len, void *ctx)
@@ -45,7 +47,7 @@ lxb_status_t callback(const lxb_char_t *data, size_t len, void *ctx)
The `main` function initializes the CSS parser, parses a CSS selector string,
and then serializes the resulting selector list.

```c
```C
int main(int argc, const char *argv[])
{
// ... (variable declarations)
@@ -81,7 +83,7 @@ int main(int argc, const char *argv[])
The code defines a CSS selector string (`slctrs`) and initializes the CSS
parser.

```c
```C
static const lxb_char_t slctrs[] = ":has(div, :not(as, 1%, .class), #hash)";

parser = lxb_css_parser_create();
@@ -94,7 +96,7 @@ status = lxb_css_parser_init(parser, NULL);
The code parses the CSS selector string, checks for parsing errors, and prints
the result.

```c
```C
list = lxb_css_selectors_parse(parser, slctrs,
sizeof(slctrs) / sizeof(lxb_char_t) - 1);

@@ -109,7 +111,7 @@ if (parser->status != LXB_STATUS_OK) {

The example serializes the parsed selector list and prints any parser logs.

```c
```C
printf("Result: ");
(void) lxb_css_selector_serialize_list(list, callback, NULL);
printf("\n");
@@ -118,7 +120,7 @@ printf("\n");
if (lxb_css_log_length(lxb_css_parser_log(parser)) != 0) {
printf("Log:\n");
// Serialize parser logs with proper indentation.
(void) lxb_css_log_serialize(parser->log, callback, NULL,
(void) lxb_css_log_serialize(lxb_css_parser_log(parser), callback, NULL,
indent, indent_length);
printf("\n");
}
@@ -130,7 +132,7 @@ if (lxb_css_log_length(lxb_css_parser_log(parser)) != 0) {
Finally, the code destroys resources for the parser and frees memory allocated
for the selector list.

```c
```C
(void) lxb_css_parser_destroy(parser, true);
lxb_css_selector_list_destroy_memory(list);
```
19 changes: 13 additions & 6 deletions source/articles/part-1-html.md
@@ -1,5 +1,7 @@
# Part one: HTML

**Note:** This article was written during the early development of the Lexbor HTML parser. Some code examples and internal values (such as token type flags) may differ from the current implementation. For up-to-date API reference, see the [HTML module documentation](../modules/html.md).

Hello, everyone!

In this article, I will explain how to create a superfast HTML parser that
@@ -61,6 +63,8 @@ windows-874, windows-1250, windows-1251, windows-1252, windows-1254,
windows-1255, windows-1256, windows-1257, windows-1258, gb18030, Big5,
ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE, UTF-16LE, and x-user-defined.

For details on Lexbor's encoding support, see the [Encoding module documentation](../modules/encoding.md).

## Preprocessing

Once we have decoded the bytes into Unicode characters, we need to perform a
@@ -446,7 +450,7 @@ be the corresponding ID from the enumeration.

Example:

```c
```C
typedef enum {
LXB_TAG__UNDEF = 0x0000,
LXB_TAG__END_OF_FILE = 0x0001,
@@ -469,15 +473,15 @@ the DOM (Document Object Model) will include a `Tag ID`. This approach avoids
the need for two comparisons: one for the node type and one for the element.
Instead, a single check can be performed:

```c
```C
if (node->tag_id == LXB_TAG_DIV) {
/* Optimal code */
}
```

Alternatively, you could use:

```c
```C
if (node->type == LXB_DOM_NODE_TYPE_ELEMENT && node->tag_id == LXB_TAG_DIV) {
/* Oh my code */
}
@@ -534,7 +538,7 @@ tree. This is achieved using the Flags bitmap field.

The field can contain the following values:

```c
```C
enum {
LXB_HTML_TOKEN_TYPE_OPEN = 0x0000,
LXB_HTML_TOKEN_TYPE_CLOSE = 0x0001,
@@ -548,6 +552,9 @@ enum {
LXB_HTML_TOKEN_TYPE_DONE = 0x0100
};
```

**Note:** This enum reflects an earlier version of the codebase. In the current implementation (see `source/lexbor/html/token.h`), the `TEXT`, `DATA`, `RCDATA`, `CDATA`, and `NULL` token types have been removed, and the remaining values have been renumbered.

Besides the opening/closing token type, there are additional values for the data
converter. Only the tokenizer knows how to correctly convert data, and it marks
the token to indicate how the data should be processed.
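
To make the bitmap idea concrete, here is a small self-contained sketch of setting and testing such flags. The `my_`-prefixed names and the `MY_TOKEN_TYPE_SELF_CLOSING` value are assumptions made for illustration and are not the Lexbor definitions; only the `OPEN`, `CLOSE`, and `DONE` values mirror the enum shown above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the token "Flags" bitmap field. */
typedef uint16_t my_token_type_t;

enum {
    MY_TOKEN_TYPE_OPEN         = 0x0000, /* no bit set: an opening token */
    MY_TOKEN_TYPE_CLOSE        = 0x0001,
    MY_TOKEN_TYPE_SELF_CLOSING = 0x0002, /* assumed value, illustration only */
    MY_TOKEN_TYPE_DONE         = 0x0100
};

/* Returns the bitmap with the given flag added. */
static my_token_type_t my_token_mark(my_token_type_t type, my_token_type_t flag)
{
    return (my_token_type_t) (type | flag);
}

/* True if every bit of a non-zero flag is set in the bitmap. */
static bool my_token_has(my_token_type_t type, my_token_type_t flag)
{
    return flag != 0 && (type & flag) == flag;
}

static bool my_token_is_close(my_token_type_t type)
{
    return my_token_has(type, MY_TOKEN_TYPE_CLOSE);
}

/* OPEN is the zero value, so "open" simply means "CLOSE bit not set". */
static bool my_token_is_open(my_token_type_t type)
{
    return !my_token_is_close(type);
}
```

Because `OPEN` is zero, it cannot be tested with a bitwise AND; the sketch treats it as the absence of the `CLOSE` bit, which is one plausible reading of the enum rather than a statement about the actual tokenizer.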
@@ -675,7 +682,7 @@ tree_build_in_body_character(token) {
```

In Lexbor HTML:
```c
```C
tree_build_in_body_character(token) {
lexbor_str_t str = {0};
lxb_html_parser_char_t pc = {0};
@@ -698,7 +705,7 @@ As illustrated by the example, we have removed all character-based conditions
and created a common function for text processing. This function takes an
argument with data transformation settings:

```c
```C
pc.replace_null /* Replace each '\0' with REPLACEMENT CHARACTER (U+FFFD) */
pc.drop_null /* Delete all '\0's */
pc.is_attribute /* If data is an attribute value, we need smarter parsing */
38 changes: 21 additions & 17 deletions source/articles/part-2-css.md
@@ -1,5 +1,7 @@
# Part Two: CSS

**Note:** This article was written during the early development of the Lexbor CSS parser. Some internal details may differ from the current implementation. For up-to-date API reference, see the [CSS module documentation](../modules/css.md). For current project status, see the [Roadmap](../roadmap.md).

Hello, everyone!

We continue our series on developing a browser engine. Better late than never!
@@ -90,7 +92,7 @@ div {width: 10px !important}
```

The tokenizer generates tokens:
```html
```
"div" — <ident-token>
" " — <whitespace-token>
"{" — <left-curly-bracket-token>
@@ -180,7 +182,7 @@ to these callbacks. It would look something like this:
div {width: 10px !important}
```

```html
```
"div" — callback_qualified_rule_prelude(<ident-token>)
" " — callback_qualified_rule_prelude(<whitespace-token>)
— callback_qualified_rule_prelude(<end-token>)
@@ -230,7 +232,7 @@ It might look something like this:
div {width: 10px !important}
```

```html
```
"div {" — Selectors parse
"width: 10px !important}" — Declarations parse
```
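
The split shown above can be sketched in a few lines of C. This is a deliberately naive illustration, assuming the rule contains no strings, comments, or nested blocks that could hide a `{`; the names `my_span_t` and `my_split_rule` are hypothetical and do not come from Lexbor. Unlike the listing above, the sketch leaves the `{` itself out of the selector part.

```c
#include <stddef.h>
#include <string.h>

/* A non-owning view into the rule text. */
typedef struct {
    const char *data;
    size_t      length;
} my_span_t;

/*
 * Splits a qualified rule at the first '{'.
 * prelude: everything before the brace (input for a selectors parser).
 * block:   everything after the brace (input for a declarations parser).
 * Returns 0 on success, -1 if the rule has no '{'.
 */
static int my_split_rule(const char *rule, my_span_t *prelude, my_span_t *block)
{
    const char *brace = strchr(rule, '{');

    if (brace == NULL) {
        return -1;
    }

    prelude->data = rule;
    prelude->length = (size_t) (brace - rule);

    block->data = brace + 1;
    block->length = strlen(brace + 1);

    return 0;
}
```

For `div {width: 10px !important}`, the prelude is `div ` and the block is `width: 10px !important}`. A real parser would locate the boundary from tokens rather than raw characters.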
@@ -353,6 +355,8 @@ like `Qualified Rule`, `At-Rule`, etc., as well as different system phases.
There is also a stack due to the recursive nature of CSS structures, which
avoids recursion directly.

**Note:** The `LXB_CSS_SYNTAX_TOKEN__TERMINATED` token and the `lxb_css_syntax_parser_token()` function described above reflect the internal parsing architecture. For the public API, see the [CSS module documentation](../modules/css.md).

**Pros:**
1. Complete control over the tokenizer.
2. Speed, as everything happens on the fly.
@@ -372,15 +376,15 @@ is structured. Values in grammars can include combinators and multipliers.

**Sequential Order**

```html
```
<my> = a b c
```

`<my>` can contain the following value:
- `<my> = a b c`

**One Value from the List**:
```html
```
<my> = a | b | c
```

@@ -390,7 +394,7 @@ is structured. Values in grammars can include combinators and multipliers.
- `<my> = c`

**One or All Values from the List in Any Order**:
```html
```
<my> = a || b || c
```

@@ -435,7 +439,7 @@ For those familiar with regular expressions, this concept will be immediately
clear.

**Zero or Infinite Number of Times**:
```html
```
<my> = a*
```

@@ -445,7 +449,7 @@ clear.
- `<my> = `

**One or Infinite Number of Times**:
```html
```
<my> = a+
```

@@ -454,7 +458,7 @@ clear.
- `<my> = a a a a a a a a a a a a a`

**May or May Not be Present**:
```html
```
<my> = a?
```

@@ -463,7 +467,7 @@ clear.
- `<my> = `

**May be Present from `A` to `B` Times, Period**:
```html
```
<my> = a{1,4}
```

@@ -474,7 +478,7 @@ clear.
- `<my> = a a a a`
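
As a rough sketch of how the `{A,B}` multiplier could be checked, assume the input has already been split into an array of string tokens. The function name `my_match_range` and this token representation are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Checks the grammar `value{min,max}`: the token list must consist of
 * between min and max tokens (inclusive), each equal to `value`.
 */
static bool my_match_range(const char *const *tokens, size_t count,
                           const char *value, size_t min, size_t max)
{
    if (count < min || count > max) {
        return false;
    }

    for (size_t i = 0; i < count; i++) {
        if (strcmp(tokens[i], value) != 0) {
            return false;
        }
    }

    return true;
}
```

With `min = 1` and `max = 4`, the lists `a` through `a a a a` match, while an empty list or five `a` tokens do not. The other multipliers are special cases of the same check: `?` is `{0,1}`, and `*` and `+` are `{0,∞}` and `{1,∞}`.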

**One or Infinite Number of Times Separated by Comma**:
```html
```
<my> = a#
```

@@ -485,7 +489,7 @@ clear.
- `<my> = a, a, a, a`
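
The `#` multiplier can be sketched the same way, this time working directly on the raw text. The function `my_match_comma_list` is a hypothetical helper, and it assumes that plain spaces are the only whitespace allowed around the commas.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Checks the grammar `value#`: one or more occurrences of `value`,
 * separated by commas, with optional spaces around each item.
 */
static bool my_match_comma_list(const char *input, const char *value)
{
    size_t vlen = strlen(value);
    const char *p = input;

    if (vlen == 0) {
        return false;
    }

    for (;;) {
        while (*p == ' ') {
            p++;                      /* skip spaces before the item */
        }

        if (strncmp(p, value, vlen) != 0) {
            return false;             /* an item is required here */
        }
        p += vlen;

        while (*p == ' ') {
            p++;                      /* skip spaces after the item */
        }

        if (*p == '\0') {
            return true;              /* ended right after an item */
        }
        if (*p != ',') {
            return false;             /* items must be comma-separated */
        }
        p++;                          /* consume the comma, expect another item */
    }
}
```

An empty input, a trailing comma (`a,`), and space-separated items (`a a`) are all rejected, since `#` requires at least one item and a comma between every pair.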

**Exactly One Value Must be Present**:
```html
```
<my> = [a? | b? | c?]!
```

@@ -495,7 +499,7 @@ error.

**Multipliers can be Combined**:

```html
```
<my> = a#{1,5}
```

@@ -545,7 +549,7 @@ The main problems I encountered:

For example, consider this grammar:

```html
```
<text-decoration-line> = none | [ underline || overline || line-through || blink ]
<text-decoration-style> = solid | double | dotted | dashed | wavy
<text-decoration-color> = <color>
@@ -561,7 +565,7 @@ To manage this, I implemented a limiter for group options using `/1`. This
notation indicates how many options should be selected from the group. As a
result, `<text-decoration>` was transformed into:

```html
```
<text-decoration> = <text-decoration-line> || <text-decoration-style> || <text-decoration-color>/1
```

@@ -572,7 +576,7 @@ spaces between them. This approach is insufficient; we need to address this
directly in the grammar. To handle this, the `^WS` modifier (Without Spaces) was
introduced:

```html
```
<frequency> = <number-token> <frequency-units>^WS
<

@@ -595,7 +599,7 @@ For example:

Tests would be generated as follows:

```html
```
<x> = a b c
<x> = a c b
<x> = b a c