Commit 7cd66c0

Add JSON output
Add JSON output, improve verbosity, improve logging
1 parent 7f5a73c commit 7cd66c0

23 files changed

Lines changed: 26839 additions & 53 deletions

Readme.md

Lines changed: 171 additions & 4 deletions
@@ -30,6 +30,7 @@ A simple, efficient Python client for [GROBID](https://github.com/kermitt2/grobi
3030
- **Command Line & Library**: Use as a standalone CLI tool or import into your Python projects
3131
- **Coordinate Extraction**: Optional PDF coordinate extraction for precise element positioning
3232
- **Sentence Segmentation**: Layout-aware sentence segmentation capabilities
33+
- **JSON Output**: Convert TEI XML output to structured JSON format with CORD-19-like structure
3334

3435
## 📋 Prerequisites
3536

@@ -40,8 +41,10 @@ A simple, efficient Python client for [GROBID](https://github.com/kermitt2/grobi
4041
- Default server: `http://localhost:8070`
4142
- Online demo: https://lfoppiano-grobid.hf.space (usage limits apply), more details [here](https://grobid.readthedocs.io/en/latest/getting_started/#using-grobid-from-the-cloud).
4243

44+
4345
> [!IMPORTANT]
44-
> GROBID supports Windows only through Docker containers. See the [Docker documentation](https://grobid.readthedocs.io/en/latest/Grobid-docker/) for details.
46+
> GROBID supports Windows only through Docker containers. See
47+
> the [Docker documentation](https://grobid.readthedocs.io/en/latest/Grobid-docker/) for details.
4548
4649
## 🚀 Installation
4750

@@ -131,6 +134,8 @@ grobid_client [OPTIONS] SERVICE
131134
| `--teiCoordinates` | Add PDF coordinates to XML |
132135
| `--segmentSentences` | Segment sentences with coordinates |
133136
| `--flavor` | Processing flavor for fulltext extraction |
137+
| `--json` | Convert TEI output to JSON format |
138+
134139

135140
#### Examples
136141

@@ -141,11 +146,14 @@ grobid_client --input ~/documents --output ~/results processFulltextDocument
141146
# High concurrency with coordinates
142147
grobid_client --input ~/pdfs --output ~/tei --n 20 --teiCoordinates processFulltextDocument
143148

149+
# Process with JSON output
150+
grobid_client --input ~/pdfs --output ~/results --json processFulltextDocument
151+
144152
# Process citations with custom server
145153
grobid_client --server https://grobid.example.com --input ~/citations.txt processCitationList
146154

147-
# Force reprocessing with sentence segmentation
148-
grobid_client --input ~/docs --force --segmentSentences processFulltextDocument
155+
# Force reprocessing with sentence segmentation and JSON output
156+
grobid_client --input ~/docs --force --segmentSentences --json processFulltextDocument
149157
```
150158

151159
### Python Library
@@ -188,6 +196,14 @@ client.process(
188196
segmentSentences=True
189197
)
190198

199+
# Process with JSON output
200+
client.process(
201+
service="processFulltextDocument",
202+
input_path="/path/to/pdfs",
203+
output_path="/path/to/output",
204+
json_output=True
205+
)
206+
191207
# Process citation lists
192208
client.process(
193209
service="processCitationList",
@@ -221,9 +237,79 @@ Configuration can be provided via a JSON file. When using the CLI, the `--server
221237
| `sleep_time` | Wait time when server is busy (seconds) | 5 |
222238
| `timeout` | Client-side timeout (seconds) | 180 |
223239
| `coordinates` | XML elements for coordinate extraction | See above |
240+
| `logging` | Logging configuration (level, format, file output) | See Logging section |
224241

225242
> [!TIP]
226-
> Since version 0.0.12, the config file is optional. The client will use default localhost settings if no configuration is provided.
243+
> Since version 0.0.12, the config file is optional. The client will use default localhost settings if no configuration
244+
> is provided.
245+
246+
### Logging Configuration
247+
248+
The client provides configurable logging with different verbosity levels. By default, only essential statistics and warnings are shown.
249+
250+
#### Logging Behavior
251+
252+
- **Without `--verbose`**: Shows only essential information and warnings/errors
253+
- **With `--verbose`**: Shows detailed processing information at INFO level
254+
255+
#### Always Visible Output
256+
257+
The following information is always displayed regardless of the `--verbose` flag:
258+
259+
```bash
260+
Found 1000 file(s) to process
261+
Processing completed: 950 out of 1000 files processed
262+
Errors: 50 out of 1000 files processed
263+
Processing completed in 120.5 seconds
264+
```
265+
266+
#### Verbose Output (`--verbose`)
267+
268+
When the `--verbose` flag is used, additional detailed information is displayed:
269+
270+
- Server connection status
271+
- Individual file processing details
272+
- JSON conversion messages
273+
- Detailed error messages
274+
- Processing progress information
275+
276+
#### Examples
277+
278+
```bash
279+
# Clean output - only essential statistics
280+
grobid_client --input pdfs/ processFulltextDocument
281+
# Output:
282+
# Found 1000 file(s) to process
283+
# Processing completed: 950 out of 1000 files processed
284+
# Errors: 50 out of 1000 files processed
285+
# Processing completed in 120.5 seconds
286+
287+
# Verbose output - detailed processing information
288+
grobid_client --input pdfs/ --verbose processFulltextDocument
289+
# Output includes all essential stats PLUS:
290+
# GROBID server http://localhost:8070 is up and running
291+
# JSON file example.json does not exist, generating JSON from existing TEI...
292+
# Successfully created JSON file: example.json
293+
# ... and other detailed processing information
294+
```
295+
296+
#### Configuration File Logging
297+
298+
The config file can include logging settings:
299+
300+
```json
301+
{
302+
"grobid_server": "http://localhost:8070",
303+
"logging": {
304+
"level": "WARNING",
305+
"format": "%(asctime)s - %(levelname)s - %(message)s",
306+
"console": true,
307+
"file": null
308+
}
309+
}
310+
```
311+
312+
**Note**: The `--verbose` command line flag always takes precedence over configuration file logging settings.
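
As a rough illustration of how a `logging` block like the one above could be applied, here is a minimal sketch using Python's standard `logging` module. The function `configure_logging` and the logger name `grobid_client_sketch` are hypothetical and chosen only for this example; the client's own implementation may differ. The sketch assumes `--verbose` maps to INFO and overrides the configured level, as described in the note above.

```python
import logging

DEFAULT_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"


def configure_logging(config: dict, verbose: bool = False) -> logging.Logger:
    """Build a logger from a config dict shaped like the README example (sketch only)."""
    log_cfg = config.get("logging", {})

    # Assumption: --verbose forces INFO regardless of the configured level.
    level_name = "INFO" if verbose else str(log_cfg.get("level", "WARNING"))
    level = getattr(logging, level_name.upper(), None)
    if not isinstance(level, int):
        level = logging.WARNING  # fall back on unknown level names

    logger = logging.getLogger("grobid_client_sketch")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False

    formatter = logging.Formatter(log_cfg.get("format", DEFAULT_FORMAT))
    if log_cfg.get("console", True):
        console = logging.StreamHandler()
        console.setFormatter(formatter)
        logger.addHandler(console)
    if log_cfg.get("file"):
        file_handler = logging.FileHandler(log_cfg["file"])
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


config = {
    "grobid_server": "http://localhost:8070",
    "logging": {"level": "WARNING", "format": DEFAULT_FORMAT, "console": True, "file": None},
}
quiet_logger = configure_logging(config)
verbose_logger = configure_logging(config, verbose=True)
```

With this sketch, `"file": null` in the config simply skips the file handler, and an unknown level name falls back to WARNING.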
227313

228314
## 🔬 Services
229315

@@ -234,6 +320,87 @@ Extracts complete document structure including headers, body text, figures, tabl
234320
grobid_client --input pdfs/ --output results/ processFulltextDocument
235321
```
236322

323+
### JSON Output Format
324+
325+
When using the `--json` flag, the client converts TEI XML output to a structured JSON format similar to CORD-19. This provides:
326+
327+
- **Structured Bibliography**: Title, authors, DOI, publication date, journal information
328+
- **Body Text**: Paragraphs and sentences with metadata and reference annotations
329+
- **Figures and Tables**: Structured JSON format for tables with headers, rows, and metadata
330+
- **Reference Information**: In-text citations with offsets and targets
331+
332+
#### JSON Structure
333+
334+
```json
335+
{
336+
"level": "paragraph",
337+
"biblio": {
338+
"title": "Document Title",
339+
"authors": ["Author 1", "Author 2"],
340+
"doi": "10.1000/example",
341+
"publication_date": "2023-01-01",
342+
"journal": "Journal Name",
343+
"abstract": [...]
344+
},
345+
"body_text": [
346+
{
347+
"id": "p_12345",
348+
"text": "Paragraph text with citations [1].",
349+
"head_section": "Introduction",
350+
"refs": [
351+
{
352+
"type": "bibr",
353+
"target": "b1",
354+
"text": "[1]",
355+
"offset_start": 30,
355+
"offset_end": 33
357+
}
358+
]
359+
}
360+
],
361+
"figures_and_tables": [
362+
{
363+
"id": "table_1",
364+
"type": "table",
365+
"label": "Table 1",
366+
"head": "Sample Data",
367+
"content": {
368+
"headers": ["Header 1", "Header 2"],
369+
"rows": [["Value 1", "Value 2"]],
370+
"metadata": {
371+
"row_count": 1,
372+
"column_count": 2,
373+
"has_headers": true
374+
}
375+
}
376+
}
377+
]
378+
}
379+
```
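
To show how the `refs` entries in this structure might be consumed, the following minimal sketch walks `body_text` and checks each reference against its paragraph. It assumes, which the README does not state, that `offset_start` and `offset_end` are zero-based character positions into `text` with an exclusive end. The helper `iter_citations` and the sample document are illustrative and not part of the client.

```python
def iter_citations(document: dict):
    """Yield (section, target, ref_text, offsets_match) for every in-text reference."""
    for paragraph in document.get("body_text", []):
        text = paragraph.get("text", "")
        for ref in paragraph.get("refs", []):
            # Assumption: offsets index into the paragraph text, end exclusive.
            span = text[ref["offset_start"]:ref["offset_end"]]
            yield (
                paragraph.get("head_section"),
                ref.get("target"),
                ref.get("text"),
                span == ref.get("text"),
            )


# Sample document built for this sketch; offsets follow the assumption above.
sample = {
    "body_text": [
        {
            "id": "p_12345",
            "text": "Paragraph text with citations [1].",
            "head_section": "Introduction",
            "refs": [
                {"type": "bibr", "target": "b1", "text": "[1]",
                 "offset_start": 30, "offset_end": 33}
            ],
        }
    ]
}

for section, target, ref_text, ok in iter_citations(sample):
    print(section, target, ref_text, "offsets match" if ok else "offsets differ")
```

A check like `offsets_match` can be a quick way to confirm the offset convention on real output before relying on it.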
380+
381+
#### Usage Examples
382+
383+
```bash
384+
# Generate both TEI and JSON outputs
385+
grobid_client --input pdfs/ --output results/ --json processFulltextDocument
386+
387+
# JSON output with coordinates and sentence segmentation
388+
grobid_client --input pdfs/ --output results/ --json --teiCoordinates --segmentSentences processFulltextDocument
389+
```
390+
391+
```python
392+
# Python library usage
393+
client.process(
394+
service="processFulltextDocument",
395+
input_path="/path/to/pdfs",
396+
output_path="/path/to/output",
397+
json_output=True
398+
)
399+
```
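
Once JSON files have been written, the table entries under `figures_and_tables` can be flattened for downstream tools. The sketch below converts one entry to CSV with the standard `csv` module; it assumes the `content` layout shown in the JSON structure above, and the helper name `table_to_csv` is hypothetical.

```python
import csv
import io


def table_to_csv(entry: dict) -> str:
    """Render one `figures_and_tables` entry of type "table" as CSV text."""
    content = entry.get("content", {})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Only emit a header row when the metadata says the table has one.
    if content.get("metadata", {}).get("has_headers"):
        writer.writerow(content.get("headers", []))
    writer.writerows(content.get("rows", []))
    return buffer.getvalue()


table_entry = {
    "id": "table_1",
    "type": "table",
    "label": "Table 1",
    "head": "Sample Data",
    "content": {
        "headers": ["Header 1", "Header 2"],
        "rows": [["Value 1", "Value 2"]],
        "metadata": {"row_count": 1, "column_count": 2, "has_headers": True},
    },
}

print(table_to_csv(table_entry))
```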
400+
401+
> [!NOTE]
402+
> When using `--json`, the `--force` flag decides whether to reprocess a document based only on existing TEI files, not on existing JSON files. Whenever a TEI file is rewritten because of `--force`, the corresponding JSON file is automatically rewritten as well.
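
One way to read the rule in this note, together with the verbose message "JSON file ... does not exist, generating JSON from existing TEI", is the decision table sketched below. This is an interpretation of the documented behavior, not the client's actual code; `Plan` and `plan_outputs` are hypothetical names.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    write_tei: bool
    write_json: bool


def plan_outputs(tei_exists: bool, json_exists: bool, force: bool, json_output: bool) -> Plan:
    """Decide which outputs to (re)write for one input file (interpretive sketch)."""
    # Only the TEI file is consulted when deciding whether to reprocess.
    write_tei = force or not tei_exists
    # JSON follows the TEI: rewritten with it, or generated from it when missing.
    write_json = json_output and (write_tei or not json_exists)
    return Plan(write_tei, write_json)


print(plan_outputs(tei_exists=True, json_exists=True, force=True, json_output=True))
print(plan_outputs(tei_exists=True, json_exists=False, force=False, json_output=True))
```

Under this reading, an existing JSON file never triggers or blocks reprocessing on its own; it is refreshed only as a consequence of the TEI being written.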
403+
237404
### Header Document Processing
238405
Extracts only document metadata (title, authors, abstract, etc.).
239406
