@@ -30,6 +30,7 @@ A simple, efficient Python client for [GROBID](https://github.com/kermitt2/grobi
3030- **Command Line & Library**: Use as a standalone CLI tool or import into your Python projects
3131- **Coordinate Extraction**: Optional PDF coordinate extraction for precise element positioning
3232- **Sentence Segmentation**: Layout-aware sentence segmentation capabilities
33+ - **JSON Output**: Convert TEI XML output to structured JSON format with CORD-19-like structure
3334
3435## 📋 Prerequisites
3536
@@ -40,8 +41,10 @@ A simple, efficient Python client for [GROBID](https://github.com/kermitt2/grobi
4041 - Default server: `http://localhost:8070`
4142 - Online demo: https://lfoppiano-grobid.hf.space (usage limits apply), more details [here](https://grobid.readthedocs.io/en/latest/getting_started/#using-grobid-from-the-cloud).
4243
44+
4345> [!IMPORTANT]
44- > GROBID supports Windows only through Docker containers. See the [Docker documentation](https://grobid.readthedocs.io/en/latest/Grobid-docker/) for details.
46+ > GROBID supports Windows only through Docker containers. See
47+ > the [Docker documentation](https://grobid.readthedocs.io/en/latest/Grobid-docker/) for details.
4548
4649## 🚀 Installation
4750
@@ -131,6 +134,8 @@ grobid_client [OPTIONS] SERVICE
131134| `--teiCoordinates` | Add PDF coordinates to XML |
132135| `--segmentSentences` | Segment sentences with coordinates |
133136| `--flavor` | Processing flavor for fulltext extraction |
137+ | `--json` | Convert TEI output to JSON format |
138+
134139
135140#### Examples
136141
@@ -141,11 +146,14 @@ grobid_client --input ~/documents --output ~/results processFulltextDocument
141146# High concurrency with coordinates
142147grobid_client --input ~/pdfs --output ~/tei --n 20 --teiCoordinates processFulltextDocument
143148
149+ # Process with JSON output
150+ grobid_client --input ~/pdfs --output ~/results --json processFulltextDocument
151+
144152# Process citations with custom server
145153grobid_client --server https://grobid.example.com --input ~/citations.txt processCitationList
146154
147- # Force reprocessing with sentence segmentation
148- grobid_client --input ~/docs --force --segmentSentences processFulltextDocument
155+ # Force reprocessing with sentence segmentation and JSON output
156+ grobid_client --input ~/docs --force --segmentSentences --json processFulltextDocument
149157```
150158
151159### Python Library
@@ -188,6 +196,14 @@ client.process(
188196    segmentSentences=True
189197)
190198
199+ # Process with JSON output
200+ client.process(
201+     service="processFulltextDocument",
202+     input_path="/path/to/pdfs",
203+     output_path="/path/to/output",
204+     json_output=True
205+ )
206+
191207# Process citation lists
192208client.process(
193209    service="processCitationList",
@@ -221,9 +237,79 @@ Configuration can be provided via a JSON file. When using the CLI, the `--server
221237| `sleep_time` | Wait time when server is busy (seconds) | 5 |
222238| `timeout` | Client-side timeout (seconds) | 180 |
223239| `coordinates` | XML elements for coordinate extraction | See above |
240+ | `logging` | Logging configuration (level, format, file output) | See Logging section |
224241
225242> [!TIP]
226- > Since version 0.0.12, the config file is optional. The client will use default localhost settings if no configuration is provided.
243+ > Since version 0.0.12, the config file is optional. The client will use default localhost settings if no configuration
244+ > is provided.
245+
246+ ### Logging Configuration
247+
248+ The client provides configurable logging with different verbosity levels. By default, only essential statistics and warnings are shown.
249+
250+ #### Logging Behavior
251+
252+ - **Without `--verbose`**: Shows only essential information and warnings/errors
253+ - **With `--verbose`**: Shows detailed processing information at INFO level
254+
255+ #### Always Visible Output
256+
257+ The following information is always displayed regardless of the `--verbose` flag:
258+
259+ ``` bash
260+ Found 1000 file(s) to process
261+ Processing completed: 950 out of 1000 files processed
262+ Errors: 50 out of 1000 files processed
263+ Processing completed in 120.5 seconds
264+ ```
265+
266+ #### Verbose Output (`--verbose`)
267+
268+ When the `--verbose` flag is used, additional detailed information is displayed:
269+
270+ - Server connection status
271+ - Individual file processing details
272+ - JSON conversion messages
273+ - Detailed error messages
274+ - Processing progress information
275+
276+ #### Examples
277+
278+ ``` bash
279+ # Clean output - only essential statistics
280+ grobid_client --input pdfs/ processFulltextDocument
281+ # Output:
282+ # Found 1000 file(s) to process
283+ # Processing completed: 950 out of 1000 files processed
284+ # Errors: 50 out of 1000 files processed
285+ # Processing completed in 120.5 seconds
286+
287+ # Verbose output - detailed processing information
288+ grobid_client --input pdfs/ --verbose processFulltextDocument
289+ # Output includes all essential stats PLUS:
290+ # GROBID server http://localhost:8070 is up and running
291+ # JSON file example.json does not exist, generating JSON from existing TEI...
292+ # Successfully created JSON file: example.json
293+ # ... and other detailed processing information
294+ ```
295+
296+ #### Configuration File Logging
297+
298+ The config file can include logging settings:
299+
300+ ``` json
301+ {
302+   "grobid_server": "http://localhost:8070",
303+   "logging": {
304+     "level": "WARNING",
305+     "format": "%(asctime)s - %(levelname)s - %(message)s",
306+     "console": true,
307+     "file": null
308+   }
309+ }
310+ ```
311+
312+ **Note**: The `--verbose` command-line flag always takes precedence over the logging settings in the configuration file.
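
The precedence rule can be illustrated with the standard `logging` module. This is a minimal sketch, not the client's actual implementation: the function name `configure_logging`, the logger name, and the fallback defaults are assumptions made for illustration; only the config keys (`level`, `format`, `console`, `file`) come from the example above.

```python
import logging


def configure_logging(config, verbose=False):
    """Hypothetical helper: build a logger from a config-style dict.

    When verbose is True, the level is forced to INFO regardless of
    what the "logging" section of the config requests.
    """
    settings = config.get("logging", {})
    level_name = "INFO" if verbose else settings.get("level", "WARNING")
    level = getattr(logging, level_name.upper(), logging.WARNING)
    fmt = settings.get("format", "%(asctime)s - %(levelname)s - %(message)s")

    logger = logging.getLogger("grobid_client_sketch")  # assumed name
    logger.setLevel(level)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False
    formatter = logging.Formatter(fmt)

    if settings.get("console", True):
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    if settings.get("file"):
        file_handler = logging.FileHandler(settings["file"])
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


# Same values as the JSON config above (true -> True, null -> None).
config = {
    "grobid_server": "http://localhost:8070",
    "logging": {
        "level": "WARNING",
        "format": "%(asctime)s - %(levelname)s - %(message)s",
        "console": True,
        "file": None,
    },
}

logger = configure_logging(config, verbose=True)
logger.info("Shown, because verbose overrides the configured WARNING level")
```

Without `verbose=True`, the same call would leave the logger at `WARNING` and the `info` message would be suppressed.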
227313
228314## 🔬 Services
229315
@@ -234,6 +320,87 @@ Extracts complete document structure including headers, body text, figures, tabl
234320grobid_client --input pdfs/ --output results/ processFulltextDocument
235321```
236322
323+ ### JSON Output Format
324+
325+ When using the `--json` flag, the client converts TEI XML output to a structured JSON format similar to CORD-19. This provides:
326+
327+ - **Structured Bibliography**: Title, authors, DOI, publication date, journal information
328+ - **Body Text**: Paragraphs and sentences with metadata and reference annotations
329+ - **Figures and Tables**: Structured JSON format for tables with headers, rows, and metadata
330+ - **Reference Information**: In-text citations with offsets and targets
331+
332+ #### JSON Structure
333+
334+ ``` json
335+ {
336+   "level": "paragraph",
337+   "biblio": {
338+     "title": "Document Title",
339+     "authors": ["Author 1", "Author 2"],
340+     "doi": "10.1000/example",
341+     "publication_date": "2023-01-01",
342+     "journal": "Journal Name",
343+     "abstract": [...]
344+   },
345+   "body_text": [
346+     {
347+       "id": "p_12345",
348+       "text": "Paragraph text with citations [1].",
349+       "head_section": "Introduction",
350+       "refs": [
351+         {
352+           "type": "bibr",
353+           "target": "b1",
354+           "text": "[1]",
355+           "offset_start": 30,
356+           "offset_end": 33
357+         }
358+       ]
359+     }
360+   ],
361+   "figures_and_tables": [
362+     {
363+       "id": "table_1",
364+       "type": "table",
365+       "label": "Table 1",
366+       "head": "Sample Data",
367+       "content": {
368+         "headers": ["Header 1", "Header 2"],
369+         "rows": [["Value 1", "Value 2"]],
370+         "metadata": {
371+           "row_count": 1,
372+           "column_count": 2,
373+           "has_headers": true
374+         }
375+       }
376+     }
377+   ]
378+ }
379+ ```
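
As a rough illustration of how this structure might be consumed, the sketch below walks `body_text` and recovers each in-text citation from its offsets. It assumes the offsets are 0-based character positions with an exclusive end, which is not confirmed by the source. The helper `iter_citations` is hypothetical, and the sample offsets are computed from the sample string itself so that the slice lands on `[1]`.

```python
import json

# A trimmed-down document following the structure shown above.
sample = json.loads("""
{
  "level": "paragraph",
  "body_text": [
    {
      "id": "p_12345",
      "text": "Paragraph text with citations [1].",
      "head_section": "Introduction",
      "refs": [
        {"type": "bibr", "target": "b1", "text": "[1]",
         "offset_start": 30, "offset_end": 33}
      ]
    }
  ]
}
""")


def iter_citations(document):
    """Yield (section, target, cited_span) for every reference in body_text.

    Assumes 0-based, end-exclusive character offsets into the paragraph text.
    """
    for paragraph in document.get("body_text", []):
        text = paragraph["text"]
        for ref in paragraph.get("refs", []):
            span = text[ref["offset_start"]:ref["offset_end"]]
            yield paragraph.get("head_section"), ref["target"], span


citations = list(iter_citations(sample))
for section, target, span in citations:
    print(f"{section}: {span} -> {target}")
```

Comparing the sliced span against the `text` field of each reference is a cheap sanity check that the offset convention assumed here matches the files the client actually writes.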
380+
381+ #### Usage Examples
382+
383+ ``` bash
384+ # Generate both TEI and JSON outputs
385+ grobid_client --input pdfs/ --output results/ --json processFulltextDocument
386+
387+ # JSON output with coordinates and sentence segmentation
388+ grobid_client --input pdfs/ --output results/ --json --teiCoordinates --segmentSentences processFulltextDocument
389+ ```
390+
391+ ``` python
392+ # Python library usage
393+ client.process(
394+     service="processFulltextDocument",
395+     input_path="/path/to/pdfs",
396+     output_path="/path/to/output",
397+     json_output=True
398+ )
399+ ```
400+
401+ > [!NOTE]
402+ > When using `--json`, the existing-output check controlled by `--force` looks only at TEI files. If a TEI file is rewritten (because of `--force`), the corresponding JSON file is automatically rewritten as well.
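
The interaction between `--force` and `--json` can be summarized as a small decision function. This is a sketch of the behavior described in the note, not code taken from the client: `plan_outputs` is a hypothetical name, and regenerating a missing JSON file from an existing TEI file is inferred from the verbose log line shown in the Logging section.

```python
def plan_outputs(tei_exists, json_exists, force=False, json_output=False):
    """Hypothetical helper: decide which output files to (re)write.

    Returns a (write_tei, write_json) tuple.
    """
    # Only the TEI file is consulted when deciding whether to reprocess.
    write_tei = force or not tei_exists
    # JSON follows the TEI file, or is generated if it is missing
    # (assumption based on the verbose log message).
    write_json = json_output and (write_tei or not json_exists)
    return write_tei, write_json


scenarios = {
    "both exist, no --force": plan_outputs(True, True, json_output=True),
    "both exist, --force": plan_outputs(True, True, force=True, json_output=True),
    "TEI exists, JSON missing": plan_outputs(True, False, json_output=True),
    "nothing exists, no --json": plan_outputs(False, False),
}

for name, (write_tei, write_json) in scenarios.items():
    print(f"{name}: write_tei={write_tei}, write_json={write_json}")
```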
403+
237404### Header Document Processing
238405Extracts only document metadata (title, authors, abstract, etc.).
239406