Commit c535911

Refined the initial run.
1 parent 97ab111 commit c535911

12 files changed

Lines changed: 657 additions & 251 deletions

source/articles/example-CSS-selectors-easy-way.md

Lines changed: 7 additions & 7 deletions
@@ -32,7 +32,7 @@ real-world example will be provided later.
 The code includes the necessary header files and defines a callback function
 (`callback`) that prints the parsed data.
 
-```c
+```C
 #include <lexbor/css/css.h>
 
 lxb_status_t callback(const lxb_char_t *data, size_t len, void *ctx)
@@ -47,7 +47,7 @@ lxb_status_t callback(const lxb_char_t *data, size_t len, void *ctx)
 The `main` function initializes the CSS parser, parses a CSS selector string,
 and then serializes the resulting selector list.
 
-```c
+```C
 int main(int argc, const char *argv[])
 {
 // ... (variable declarations)
@@ -83,7 +83,7 @@ int main(int argc, const char *argv[])
 The code defines a CSS selector string (`slctrs`) and initializes the CSS
 parser.
 
-```c
+```C
 static const lxb_char_t slctrs[] = ":has(div, :not(as, 1%, .class), #hash)";
 
 parser = lxb_css_parser_create();
@@ -96,7 +96,7 @@ status = lxb_css_parser_init(parser, NULL);
 The code parses the CSS selector string, checks for parsing errors, and prints
 the result.
 
-```c
+```C
 list = lxb_css_selectors_parse(parser, slctrs,
 sizeof(slctrs) / sizeof(lxb_char_t) - 1);
 
@@ -111,7 +111,7 @@ if (parser->status != LXB_STATUS_OK) {
 
 The example serializes the parsed selector list and prints any parser logs.
 
-```c
+```C
 printf("Result: ");
 (void) lxb_css_selector_serialize_list(list, callback, NULL);
 printf("\n");
@@ -120,7 +120,7 @@ printf("\n");
 if (lxb_css_log_length(lxb_css_parser_log(parser)) != 0) {
 printf("Log:\n");
 // Serialize parser logs with proper indentation.
-(void) lxb_css_log_serialize(parser->log, callback, NULL,
+(void) lxb_css_log_serialize(lxb_css_parser_log(parser), callback, NULL,
 indent, indent_length);
 printf("\n");
 }
@@ -132,7 +132,7 @@ if (lxb_css_log_length(lxb_css_parser_log(parser)) != 0) {
 Finally, the code destroys resources for the parser and frees memory allocated
 for the selector list.
 
-```c
+```C
 (void) lxb_css_parser_destroy(parser, true);
 lxb_css_selector_list_destroy_memory(list);
 ```
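
The hunks above pass `NULL` as the callback's `ctx` argument and print each chunk as it arrives. As a minimal sketch of the alternative, the function below has the same parameter shape but appends each chunk to a caller-supplied buffer received through `ctx`. The `demo_*` types, macros, and function are stand-ins invented for this illustration; they are not part of Lexbor, and the sketch does not call the library.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in types for this sketch only; they are NOT Lexbor's definitions. */
typedef unsigned char demo_char_t;
typedef unsigned int  demo_status_t;

#define DEMO_STATUS_OK    0u
#define DEMO_STATUS_ERROR 1u

/* Fixed-size buffer handed to the callback through `ctx`. */
typedef struct {
    char   data[128];
    size_t length;
} demo_buffer_t;

/* Same parameter shape as the article's `callback`, but it appends the
 * chunk to the buffer instead of printing it. */
demo_status_t
demo_collect(const demo_char_t *data, size_t len, void *ctx)
{
    demo_buffer_t *buf = ctx;

    /* Refuse the chunk if it would not leave room for the final '\0'. */
    if (len >= sizeof(buf->data) - buf->length) {
        return DEMO_STATUS_ERROR;
    }

    memcpy(buf->data + buf->length, data, len);
    buf->length += len;
    buf->data[buf->length] = '\0';

    return DEMO_STATUS_OK;
}
```

With a zero-initialized `demo_buffer_t`, passing its address where the article passes `NULL` would leave the serialized text in `data`, assuming the serializer forwards `ctx` unchanged, as the callback signature suggests.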

source/articles/part-1-html.md

Lines changed: 11 additions & 6 deletions
@@ -63,6 +63,8 @@ windows-874, windows-1250, windows-1251, windows-1252, windows-1254,
 windows-1255, windows-1256, windows-1257, windows-1258, gb18030, Big5,
 ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE, UTF-16LE, and x-user-defined.
 
+For details on Lexbor's encoding support, see the [Encoding module documentation](../modules/encoding.md).
+
 ## Preprocessing
 
 Once we have decoded the bytes into Unicode characters, we need to perform a
@@ -448,7 +450,7 @@ be the corresponding ID from the enumeration.
 
 Example:
 
-```c
+```C
 typedef enum {
 LXB_TAG__UNDEF = 0x0000,
 LXB_TAG__END_OF_FILE = 0x0001,
@@ -471,15 +473,15 @@ the DOM (Document Object Model) will include a `Tag ID`. This approach avoids
 the need for two comparisons: one for the node type and one for the element.
 Instead, a single check can be performed:
 
-```c
+```C
 if (node->tag_id == LXB_TAG_DIV) {
 /* Optimal code */
 }
 ```
 
 Alternatively, you could use:
 
-```c
+```C
 if (node->type == LXB_DOM_NODE_TYPE_ELEMENT && node->tag_id == LXB_TAG_DIV) {
 /* Oh my code */
 }
@@ -536,7 +538,7 @@ tree. This is achieved using the Flags bitmap field.
 
 The field can contain the following values:
 
-```c
+```C
 enum {
 LXB_HTML_TOKEN_TYPE_OPEN = 0x0000,
 LXB_HTML_TOKEN_TYPE_CLOSE = 0x0001,
@@ -550,6 +552,9 @@ enum {
 LXB_HTML_TOKEN_TYPE_DONE = 0x0100
 };
 ```
+
+**Note:** This enum reflects an earlier version of the codebase. In the current implementation (see `source/lexbor/html/token.h`), the `TEXT`, `DATA`, `RCDATA`, `CDATA`, and `NULL` token types have been removed, and the remaining values have been renumbered.
+
 Besides the opening/closing token type, there are additional values for the data
 converter. Only the tokenizer knows how to correctly convert data, and it marks
 the token to indicate how the data should be processed.
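
Because the token type is a bitmap and the `OPEN` value shown in the hunks is `0x0000`, an open token cannot be detected with a bitwise AND against `OPEN`; it can only be inferred from the absence of the close bit. The sketch below illustrates that point. The `DEMO_TOKEN_*` names and values are hypothetical, chosen to resemble the enum in the diff; they are not Lexbor's current definitions, which the added note says have been renumbered.

```c
#include <stdbool.h>

/* Hypothetical flags for this sketch; names and values are NOT Lexbor's. */
enum {
    DEMO_TOKEN_OPEN       = 0x0000, /* no bit set: the default state */
    DEMO_TOKEN_CLOSE      = 0x0001,
    DEMO_TOKEN_CLOSE_SELF = 0x0002,
    DEMO_TOKEN_DONE       = 0x0100
};

typedef unsigned int demo_token_type_t;

/* A non-zero flag is tested with bitwise AND. */
bool
demo_token_has(demo_token_type_t type, demo_token_type_t flag)
{
    return (type & flag) != 0;
}

/* OPEN is zero, so "open" can only mean "the CLOSE bit is absent". */
bool
demo_token_is_open(demo_token_type_t type)
{
    return (type & DEMO_TOKEN_CLOSE) == 0;
}
```

Calling `demo_token_has` with `DEMO_TOKEN_OPEN` always yields `false`, which is why the separate `demo_token_is_open` helper exists in this sketch.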
@@ -677,7 +682,7 @@ tree_build_in_body_character(token) {
 ```
 
 In Lexbor HTML:
-```c
+```C
 tree_build_in_body_character(token) {
 lexbor_str_t str = {0};
 lxb_html_parser_char_t pc = {0};
@@ -700,7 +705,7 @@ As illustrated by the example, we have removed all character-based conditions
 and created a common function for text processing. This function takes an
 argument with data transformation settings:
 
-```c
+```C
 pc.replace_null /* Replace each '\0' with REPLACEMENT CHARACTER (U+FFFD) */
 pc.drop_null /* Delete all '\0's */
 pc.is_attribute /* If data is an attribute value, we need smarter parsing */

source/articles/part-2-css.md

Lines changed: 19 additions & 17 deletions
@@ -92,7 +92,7 @@ div {width: 10px !important}
 ```
 
 The tokenizer generates tokens:
-```html
+```
 "div" — <ident-token>
 " " — <whitespace-token>
 "{" — <left-curly-bracket-token>
@@ -182,7 +182,7 @@ to these callbacks. It would look something like this:
 div {width: 10px !important}
 ```
 
-```html
+```
 "div" — callback_qualified_rule_prelude(<ident-token>)
 " " — callback_qualified_rule_prelude(<whitespace-token>)
 — callback_qualified_rule_prelude(<end-token>)
@@ -232,7 +232,7 @@ It might look something like this:
 div {width: 10px !important}
 ```
 
-```html
+```
 "div {" — Selectors parse
 "width: 10px !important}" — Declarations parse
 ```
@@ -355,6 +355,8 @@ like `Qualified Rule`, `At-Rule`, etc., as well as different system phases.
 There is also a stack due to the recursive nature of CSS structures, which
 avoids recursion directly.
 
+**Note:** The `LXB_CSS_SYNTAX_TOKEN__TERMINATED` token and the `lxb_css_syntax_parser_token()` function described above reflect the internal parsing architecture. For the public API, see the [CSS module documentation](../modules/css.md).
+
 **Pros:**
 1. Complete control over the tokenizer.
 2. Speed, as everything happens on the fly.
@@ -374,15 +376,15 @@ is structured. Values in grammars can include combinators and multipliers.
 
 **Sequential Order**
 
-```html
+```
 <my> = a b c
 ```
 
 `<my>` can contain the following value:
 - `<my> = a b c`
 
 **One Value from the List**:
-```html
+```
 <my> = a | b | c
 ```
 
@@ -392,7 +394,7 @@ is structured. Values in grammars can include combinators and multipliers.
 - `<my> = c`
 
 **One or All Values from the List in Any Order**:
-```html
+```
 <my> = a || b || c
 ```
 
@@ -437,7 +439,7 @@ For those familiar with regular expressions, this concept will be immediately
 clear.
 
 **Zero or Infinite Number of Times**:
-```html
+```
 <my> = a*
 ```
 
@@ -447,7 +449,7 @@ clear.
 - `<my> = `
 
 **One or Infinite Number of Times**:
-```html
+```
 <my> = a+
 ```
 
@@ -456,7 +458,7 @@ clear.
 - `<my> = a a a a a a a a a a a a a`
 
 **May or May Not be Present**:
-```html
+```
 <my> = a?
 ```
 
@@ -465,7 +467,7 @@ clear.
 - `<my> = `
 
 **May be Present from `A` to `B` Times, Period**:
-```html
+```
 <my> = a{1,4}
 ```
 
@@ -476,7 +478,7 @@ clear.
 - `<my> = a a a a`
 
 **One or Infinite Number of Times Separated by Comma**:
-```html
+```
 <my> = a#
 ```
 
@@ -487,7 +489,7 @@ clear.
 - `<my> = a, a, a, a`
 
 **Exactly One Value Must be Present**:
-```html
+```
 <my> = [a? | b? | c?]!
 ```
 
@@ -497,7 +499,7 @@ error.
 
 **Multipliers can be Combined**:
 
-```html
+```
 <my> = a#{1,5}
 ```
 
@@ -547,7 +549,7 @@ The main problems I encountered:
 
 For example, consider this grammar:
 
-```html
+```
 <text-decoration-line> = none | [ underline || overline || line-through || blink ]
 <text-decoration-style> = solid | double | dotted | dashed | wavy
 <text-decoration-color> = <color>
@@ -563,7 +565,7 @@ To manage this, I implemented a limiter for group options using `/1`. This
 notation indicates how many options should be selected from the group. As a
 result, `<text-decoration>` was transformed into:
 
-```html
+```
 <text-decoration> = <text-decoration-line> || <text-decoration-style> || <text-decoration-color>/1
 ```
 
@@ -574,7 +576,7 @@ spaces between them. This approach is insufficient; we need to address this
 directly in the grammar. To handle this, the `^WS` modifier (Without Spaces) was
 introduced:
 
-```html
+```
 <frequency> = <number-token> <frequency-units>^WS
 <
 
@@ -597,7 +599,7 @@ For example:
 
 Tests would be generated as follows:
 
-```html
+```
 <x> = a b c
 <x> = a c b
 <x> = b a c
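
The last hunk is cut off after three generated tests, which look like orderings of `a`, `b`, and `c` in lexicographic sequence. Assuming that reading is right, the following is a minimal sketch of how such orderings could be enumerated. `demo_next_permutation` and `demo_swap` are hypothetical helpers written for this illustration; they are not taken from the article's test generator, whose implementation is not shown here.

```c
#include <stddef.h>

/* Swap two one-character option names. */
static void
demo_swap(char *a, char *b)
{
    char tmp = *a;
    *a = *b;
    *b = tmp;
}

/*
 * Rearrange `items` into the next lexicographic ordering.
 * Returns 1 on success, 0 if `items` was already the last ordering.
 */
int
demo_next_permutation(char *items, size_t n)
{
    size_t i, j, lo, hi;

    if (n < 2) {
        return 0;
    }

    /* Find the rightmost item that is smaller than its successor. */
    i = n - 1;
    while (i > 0 && items[i - 1] >= items[i]) {
        i--;
    }
    if (i == 0) {
        return 0;
    }

    /* Swap it with the rightmost item that is larger than it. */
    j = n - 1;
    while (items[j] <= items[i - 1]) {
        j--;
    }
    demo_swap(&items[i - 1], &items[j]);

    /* Reverse the tail so it is in ascending order again. */
    for (lo = i, hi = n - 1; lo < hi; lo++, hi--) {
        demo_swap(&items[lo], &items[hi]);
    }

    return 1;
}
```

Starting from the sorted options and calling the function until it returns 0 visits every ordering once; each ordering would then be emitted as one `<x> = ...` test line.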
