xgen_doc2chunk Processing Flow


Main Flow

User calls: processor.extract_chunks(file_path)
                    │
                    ▼
         DocumentProcessor.extract_chunks()
                    │
                    ├── extract_text()
                    │       │
                    │       ├── _create_current_file(file_path)
                    │       ├── _get_handler(extension)
                    │       ├── handler.extract_text(current_file)
                    │       └── OCR processing (optional)
                    │
                    └── chunk_text()
                            │
                            └── create_chunks()
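
The flow above can be sketched as a small dispatcher. This is a minimal illustration under stated assumptions, not the library's implementation: `SketchProcessor`, the `HANDLERS` registry, and the fixed-size splitter are hypothetical stand-ins. Only the method names `extract_chunks`, `extract_text`, and `chunk_text` come from the diagram.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical registry: file extension -> function(bytes) -> extracted text.
HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "txt": lambda data: data.decode("utf-8"),
}


class SketchProcessor:
    """Illustrative stand-in for DocumentProcessor (not the real class)."""

    def extract_text(self, file_path: str) -> str:
        path = Path(file_path)
        extension = path.suffix.lstrip(".").lower()
        handler = HANDLERS.get(extension)  # plays the role of _get_handler()
        if handler is None:
            raise ValueError(f"unsupported extension: {extension!r}")
        return handler(path.read_bytes())

    def chunk_text(self, text: str, chunk_size: int = 1000) -> List[str]:
        # Naive fixed-size split; assume the real create_chunks() does more.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    def extract_chunks(self, file_path: str, chunk_size: int = 1000) -> List[str]:
        return self.chunk_text(self.extract_text(file_path), chunk_size)
```

Registering one function per extension keeps `extract_chunks()` independent of any particular format, which is the role the handler lookup plays in the diagram.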

PDF Handler Flow

PDFHandler.extract_text(current_file)
    │
    ├── file_converter.convert(file_data)               [INTERFACE: PDFFileConverter]
    │       └── Binary → fitz.Document
    │
    ├── preprocessor.preprocess(doc)                    [INTERFACE: PDFPreprocessor]
    │       └── Pass-through (returns PreprocessedData with doc unchanged)
    │
    ├── metadata_extractor.extract()                    [INTERFACE: PDFMetadataExtractor]
    │
    ├── _extract_all_tables(doc, file_path)             [INTERNAL]
    │
    └── For each page:
            │
            ├── ComplexityAnalyzer.analyze()            [CLASS: pdf_complexity_analyzer]
            │       └── Returns PageComplexity with recommended_strategy
            │
            ├── Branch by strategy:
            │       │
            │       ├── FULL_PAGE_OCR:
            │       │       └── _process_page_full_ocr()
            │       │
            │       ├── BLOCK_IMAGE_OCR:
            │       │       └── _process_page_block_ocr()
            │       │
            │       ├── HYBRID:
            │       │       └── _process_page_hybrid()
            │       │
            │       └── TEXT_EXTRACTION (default):
            │               └── _process_page_text_extraction()
            │                       │
            │                       ├── VectorTextOCREngine.detect_and_extract()
            │                       ├── extract_text_blocks()           [FUNCTION]
            │                       ├── format_image_processor methods  [INTERFACE: PDFImageProcessor]
            │                       └── merge_page_elements()           [FUNCTION]
            │
            └── page_tag_processor.create_page_tag()    [INTERFACE: PageTagProcessor]
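
The per-page branch can be pictured as a lookup with a default. The sketch below only maps each strategy to the routine named in the diagram. The enum name `Strategy` and the function `route_page` are hypothetical; the member and routine names are taken from the flow above.

```python
from enum import Enum, auto


class Strategy(Enum):
    """Hypothetical enum; member names follow the diagram."""
    FULL_PAGE_OCR = auto()
    BLOCK_IMAGE_OCR = auto()
    HYBRID = auto()
    TEXT_EXTRACTION = auto()


def route_page(strategy: Strategy) -> str:
    """Return the name of the page routine the diagram assigns to a strategy."""
    routes = {
        Strategy.FULL_PAGE_OCR: "_process_page_full_ocr",
        Strategy.BLOCK_IMAGE_OCR: "_process_page_block_ocr",
        Strategy.HYBRID: "_process_page_hybrid",
    }
    # TEXT_EXTRACTION is the default branch, so anything unmapped lands there.
    return routes.get(strategy, "_process_page_text_extraction")
```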

DOCX Handler Flow

DOCXHandler.extract_text(current_file)
    │
    ├── file_converter.validate(file_data)              [INTERFACE: DOCXFileConverter]
    │       └── Check if valid ZIP with [Content_Types].xml
    │
    ├── If not valid DOCX:
    │       └── _extract_with_doc_handler_fallback()    [INTERNAL]
    │               └── DOCHandler.extract_text()       [DELEGATION]
    │
    ├── file_converter.convert(file_data)               [INTERFACE: DOCXFileConverter]
    │       └── Binary → docx.Document
    │
    ├── preprocessor.preprocess(doc)                    [INTERFACE: DOCXPreprocessor]
    │       └── Returns PreprocessedData (doc in extracted_resources)
    │
    ├── chart_extractor.extract_all_from_file()         [INTERFACE: DOCXChartExtractor]
    │       └── Pre-extract all charts (callback pattern)
    │
    ├── metadata_extractor.extract()                    [INTERFACE: DOCXMetadataExtractor]
    │
    └── For each element in doc.element.body:
            │
            ├── If paragraph ('p'):
            │       └── process_paragraph_element()     [FUNCTION: docx_helper]
            │               ├── format_image_processor.process_drawing_element()
            │               ├── format_image_processor.extract_from_pict()
            │               └── get_next_chart() callback for charts
            │
            └── If table ('tbl'):
                    └── process_table_element()         [FUNCTION: docx_helper]
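
The validation step ("valid ZIP with [Content_Types].xml") can be approximated with the standard library alone. This is a sketch of the check described above, not the code of `DOCXFileConverter.validate()`; the function name `looks_like_docx` is hypothetical.

```python
import io
import zipfile


def looks_like_docx(file_data: bytes) -> bool:
    """True if the bytes form a ZIP archive that contains [Content_Types].xml."""
    stream = io.BytesIO(file_data)
    if not zipfile.is_zipfile(stream):
        return False
    stream.seek(0)
    with zipfile.ZipFile(stream) as zf:
        return "[Content_Types].xml" in zf.namelist()
```

A file that fails this check would take the fallback branch in the diagram and be handed to `DOCHandler`.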

DOC Handler Flow

DOCHandler.extract_text(current_file)
    │
    ├── file_converter.convert()                        [INTERFACE: DOCFileConverter]
    │       │
    │       ├── _detect_format() → DocFormat (RTF/OLE/HTML/DOCX)
    │       │
    │       ├── RTF: returns file_data (bytes)          [Pass-through]
    │       ├── OLE: _convert_ole() → olefile.OleFileIO
    │       ├── HTML: _convert_html() → BeautifulSoup
    │       └── DOCX: _convert_docx() → docx.Document
    │
    ├── preprocessor.preprocess(converted_obj)          [INTERFACE: DOCPreprocessor]
    │       └── Returns PreprocessedData (converted_obj in extracted_resources)
    │
    ├── RTF format detected:
    │       └── _delegate_to_rtf_handler()              [DELEGATION]
    │               └── RTFHandler.extract_text(current_file)
    │
    ├── OLE format detected:
    │       └── _extract_from_ole_obj()                 [INTERNAL]
    │               ├── _extract_ole_metadata()
    │               ├── _extract_ole_text()
    │               └── _extract_ole_images()
    │
    ├── HTML format detected:
    │       └── _extract_from_html_obj()                [INTERNAL]
    │               ├── _extract_html_metadata()
    │               └── BeautifulSoup parsing
    │
    └── DOCX format detected:
            └── _extract_from_docx_obj()                [INTERNAL]
                    └── docx.Document paragraph/table extraction
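
A `.doc` file can hold any of four container formats, so detection has to look at the bytes rather than the extension. The sketch below shows one plausible way to do that with well-known file signatures. The real `_detect_format()` may use different heuristics, and the `UNKNOWN` member and the function name `detect_format` are additions for illustration.

```python
from enum import Enum


class DocFormat(Enum):
    RTF = "rtf"
    OLE = "ole"
    HTML = "html"
    DOCX = "docx"
    UNKNOWN = "unknown"  # assumption: not listed in the diagram


OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header


def detect_format(file_data: bytes) -> DocFormat:
    """Guess the container format from leading bytes (heuristic sketch)."""
    head = file_data[:1024].lstrip()
    if head.startswith(b"{\\rtf"):
        return DocFormat.RTF
    if file_data.startswith(OLE_MAGIC):
        return DocFormat.OLE
    if file_data.startswith(ZIP_MAGIC):
        return DocFormat.DOCX
    if head.lower().startswith((b"<!doctype html", b"<html")):
        return DocFormat.HTML
    return DocFormat.UNKNOWN
```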

RTF Handler Flow

Structure: the Converter is a pass-through, the Preprocessor handles binary data, and the Handler processes the result sequentially.

RTFHandler.extract_text(current_file)
    │
    ├── file_converter.convert()                        [INTERFACE: RTFFileConverter]
    │       └── Pass-through (returns raw bytes)
    │
    ├── preprocessor.preprocess()                       [INTERFACE: RTFPreprocessor]
    │       │
    │       ├── \binN tag processing (skip binary data)
    │       ├── \pict group image extraction
    │       └── Returns PreprocessedData (clean_content, image_tags, encoding)
    │
    ├── decode_content()                                [FUNCTION: rtf_decoder]
    │       └── bytes → string with detected encoding
    │
    ├── Build RTFConvertedData                          [DATACLASS]
    │
    └── _extract_from_converted()                       [INTERNAL]
            │
            ├── metadata_extractor.extract()            [INTERFACE: RTFMetadataExtractor]
            ├── metadata_extractor.format()
            │
            ├── extract_tables_with_positions()         [FUNCTION: rtf_table_extractor]
            │
            ├── extract_inline_content()                [FUNCTION: rtf_content_extractor]
            │
            └── Build result string
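
The `\binN` step matters because the N bytes after the control word are raw binary and may contain braces or backslashes that would confuse a text-level parser. A minimal sketch of skipping them is shown here. It assumes the payload starts right after the single optional space that terminates the control word, and it ignores edge cases such as an escaped backslash before `bin`. The name `strip_bin_data` is hypothetical and is not the `RTFPreprocessor` code.

```python
import re

# \binN followed by the optional single-space delimiter of an RTF control word.
BIN_TAG = re.compile(rb"\\bin(\d+) ?")


def strip_bin_data(rtf: bytes) -> bytes:
    """Remove each \\binN control word together with its N raw payload bytes."""
    out = bytearray()
    pos = 0
    while True:
        match = BIN_TAG.search(rtf, pos)
        if match is None:
            out += rtf[pos:]
            return bytes(out)
        out += rtf[pos:match.start()]
        length = int(match.group(1))
        pos = match.end() + length  # jump over the payload without parsing it
```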

Excel Handler Flow (XLSX)

ExcelHandler.extract_text(current_file) [XLSX]
    │
    ├── file_converter.convert(file_data, extension='xlsx')  [INTERFACE: ExcelFileConverter]
    │       └── Binary → openpyxl.Workbook
    │
    ├── preprocessor.preprocess(wb)                     [INTERFACE: ExcelPreprocessor]
    │       └── Returns PreprocessedData (wb in extracted_resources)
    │
    ├── _preload_xlsx_data()                            [INTERNAL]
    │       ├── metadata_extractor.extract()            [INTERFACE: XLSXMetadataExtractor]
    │       ├── chart_extractor.extract_all_from_file() [INTERFACE: ExcelChartExtractor]
    │       └── format_image_processor.extract_images() [INTERFACE: ExcelImageProcessor]
    │
    └── For each sheet:
            │
            ├── _process_xlsx_sheet()                   [INTERNAL]
            │       ├── page_tag_processor.create_sheet_tag()  [INTERFACE: PageTagProcessor]
            │       ├── extract_textboxes_from_xlsx()   [FUNCTION]
            │       ├── convert_xlsx_sheet_to_table()   [FUNCTION]
            │       └── convert_xlsx_objects_to_tables() [FUNCTION]
            │
            └── format_image_processor.get_sheet_images()  [INTERFACE: ExcelImageProcessor]

Excel Handler Flow (XLS)

ExcelHandler.extract_text(current_file) [XLS]
    │
    ├── file_converter.convert(file_data, extension='xls')   [INTERFACE: ExcelFileConverter]
    │       └── Binary → xlrd.Book
    │
    ├── preprocessor.preprocess(wb)                     [INTERFACE: ExcelPreprocessor]
    │       └── Returns PreprocessedData (wb in extracted_resources)
    │
    ├── _get_xls_metadata_extractor().extract_and_format()   [INTERFACE: XLSMetadataExtractor]
    │
    └── For each sheet:
            │
            ├── page_tag_processor.create_sheet_tag()   [INTERFACE: PageTagProcessor]
            │
            ├── convert_xls_sheet_to_table()            [FUNCTION]
            │
            └── convert_xls_objects_to_tables()         [FUNCTION]

PPT Handler Flow

PPTHandler.extract_text(current_file)
    │
    ├── file_converter.convert(file_data, file_stream)  [INTERFACE: PPTFileConverter]
    │       └── Binary → pptx.Presentation
    │
    ├── preprocessor.preprocess(prs)                    [INTERFACE: PPTPreprocessor]
    │       └── Returns PreprocessedData (prs in extracted_resources)
    │
    ├── chart_extractor.extract_all_from_file()         [INTERFACE: PPTChartExtractor]
    │       └── Pre-extract all charts (callback pattern)
    │
    ├── metadata_extractor.extract()                    [INTERFACE: PPTMetadataExtractor]
    ├── metadata_extractor.format()                     [INTERFACE: PPTMetadataExtractor]
    │
    └── For each slide:
            │
            ├── page_tag_processor.create_slide_tag()   [INTERFACE: PageTagProcessor]
            │
            └── For each shape:
                    │
                    ├── If table: convert_table_to_html()       [FUNCTION]
                    ├── If chart: get_next_chart() callback     [Pre-extracted]
                    ├── If picture: process_image_shape()       [FUNCTION]
                    ├── If group: process_group_shape()         [FUNCTION]
                    └── If text: extract_text_with_bullets()    [FUNCTION]
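
The "callback pattern" used for charts here (and in the DOCX flow) can be read as: extract every chart up front, then hand the shape loop a function that yields them one at a time in document order. The sketch below illustrates that idea only. `make_chart_callback`, `render_shapes`, and the string placeholders are hypothetical; only the callback name `get_next_chart` comes from the diagram.

```python
from typing import Callable, Iterable, List, Optional


def make_chart_callback(charts: Iterable[str]) -> Callable[[], Optional[str]]:
    """Return a get_next_chart()-style callback over pre-extracted charts."""
    iterator = iter(charts)

    def get_next_chart() -> Optional[str]:
        # Charts are consumed in document order; None once they run out.
        return next(iterator, None)

    return get_next_chart


def render_shapes(shape_kinds: List[str],
                  get_next_chart: Callable[[], Optional[str]]) -> List[str]:
    """Walk shapes in order, pulling a pre-extracted chart for each chart shape."""
    rendered = []
    for kind in shape_kinds:
        if kind == "chart":
            rendered.append(get_next_chart() or "[chart unavailable]")
        else:
            rendered.append(f"[{kind}]")
    return rendered
```

Pre-extraction keeps the shape loop free of chart parsing, at the cost of relying on charts appearing in the same order in both passes.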

HWP Handler Flow

HWPHandler.extract_text(current_file)
    │
    ├── file_converter.validate(file_data)              [INTERFACE: HWPFileConverter]
    │       └── Check if OLE file (magic number check)
    │
    ├── If not OLE file:
    │       └── _handle_non_ole_file()                  [INTERNAL]
    │               ├── ZIP detected → HWPXHandler delegation
    │               └── HWP 3.0 → Not supported
    │
    ├── chart_extractor.extract_all_from_file()         [INTERFACE: HWPChartExtractor]
    │
    ├── file_converter.convert()                        [INTERFACE: HWPFileConverter]
    │       └── Binary → olefile.OleFileIO
    │
    ├── preprocessor.preprocess(ole)                    [INTERFACE: HWPPreprocessor]
    │       └── Returns PreprocessedData (ole in extracted_resources)
    │
    ├── metadata_extractor.extract()                    [INTERFACE: HWPMetadataExtractor]
    ├── metadata_extractor.format()                     [INTERFACE: HWPMetadataExtractor]
    │
    ├── _parse_docinfo(ole)                             [INTERNAL]
    │       └── parse_doc_info()                        [FUNCTION]
    │
    ├── _extract_body_text(ole)                         [INTERNAL]
    │       │
    │       └── For each section:
    │               ├── decompress_section()            [FUNCTION]
    │               └── _parse_section()                [INTERNAL]
    │                       └── _process_picture()      [INTERNAL - uses format_image_processor]
    │
    ├── format_image_processor.process_images_from_bindata()  [INTERFACE: HWPImageProcessor]
    │
    └── file_converter.close(ole)                       [INTERFACE: HWPFileConverter]

HWPX Handler Flow

HWPXHandler.extract_text(current_file)
    │
    ├── get_file_stream(current_file)                   [INHERITED: BaseHandler]
    │       └── BytesIO(file_data)
    │
    ├── _is_valid_zip(file_stream)                      [INTERNAL]
    │
    ├── chart_extractor.extract_all_from_file()         [INTERFACE: HWPXChartExtractor]
    │
    ├── zipfile.ZipFile(file_stream)                    [EXTERNAL LIBRARY]
    │
    ├── preprocessor.preprocess(zf)                     [INTERFACE: HWPXPreprocessor]
    │       └── Returns PreprocessedData (extracted_resources available)
    │
    ├── metadata_extractor.extract()                    [INTERFACE: HWPXMetadataExtractor]
    ├── metadata_extractor.format()                     [INTERFACE: HWPXMetadataExtractor]
    │
    ├── parse_bin_item_map(zf)                          [FUNCTION]
    │
    ├── For each section:
    │       │
    │       └── parse_hwpx_section()                    [FUNCTION]
    │               │
    │               ├── format_image_processor.process_images()  [INTERFACE: HWPXImageProcessor]
    │               │
    │               └── parse_hwpx_table()              [FUNCTION]
    │
    └── format_image_processor.get_remaining_images()   [INTERFACE: HWPXImageProcessor]
        format_image_processor.process_images()         [INTERFACE: HWPXImageProcessor]

CSV Handler Flow

CSVHandler.extract_text(current_file)
    │
    ├── file_converter.convert(file_data, encoding)     [INTERFACE: CSVFileConverter]
    │       └── Binary → Text (with encoding detection)
    │
    ├── preprocessor.preprocess(content)                [INTERFACE: CSVPreprocessor]
    │       └── Returns PreprocessedData (content in clean_content)
    │
    ├── detect_delimiter(content)                       [FUNCTION]
    │
    ├── parse_csv_content(content, delimiter)           [FUNCTION]
    │
    ├── detect_header(rows)                             [FUNCTION]
    │
    ├── metadata_extractor.extract(source_info)         [INTERFACE: CSVMetadataExtractor]
    │       └── CSVSourceInfo contains: file_path, encoding, delimiter, rows, has_header
    │
    └── convert_rows_to_table(rows, has_header)         [FUNCTION]
            └── Returns HTML table
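
The CSV steps above compose naturally: guess the delimiter, parse rows, then render an HTML table. The sketch below reuses the function names from the diagram, but the bodies are assumptions for illustration (for example, the delimiter is guessed by counting candidates in the first line), not the library's logic.

```python
import csv
import html
import io
from typing import List


def detect_delimiter(content: str, candidates: str = ",;\t|") -> str:
    """Pick the candidate that occurs most often in the first line."""
    first_line = content.splitlines()[0] if content else ""
    return max(candidates, key=first_line.count)


def parse_csv_content(content: str, delimiter: str) -> List[List[str]]:
    return list(csv.reader(io.StringIO(content), delimiter=delimiter))


def convert_rows_to_table(rows: List[List[str]], has_header: bool) -> str:
    """Render rows as an HTML table; the first row uses <th> if has_header."""
    parts = ["<table>"]
    for index, row in enumerate(rows):
        tag = "th" if has_header and index == 0 else "td"
        cells = "".join(f"<{tag}>{html.escape(cell)}</{tag}>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "".join(parts)
```

Cell text is passed through `html.escape` so that values containing `<` or `&` cannot break the generated markup.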

Text Handler Flow

TextHandler.extract_text(current_file)
    │
    ├── preprocessor.preprocess(file_data)              [INTERFACE: TextPreprocessor]
    │       └── Returns PreprocessedData (file_data in clean_content)
    │
    ├── file_data.decode(encoding)                      [DIRECT: No FileConverter used]
    │       └── Try encodings: utf-8, utf-8-sig, cp949, euc-kr, latin-1, ascii
    │
    └── clean_text() / clean_code_text()                [FUNCTION: utils.py]

Note: TextHandler does not use file_converter; it decodes the bytes directly.
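
The decoding step can be sketched as a loop over the listed encodings that stops at the first one that decodes without error. The encoding order is taken from the diagram; the function name `decode_with_fallback` and the returned tuple are assumptions.

```python
from typing import Tuple

# Order as listed in the flow above.
ENCODINGS = ("utf-8", "utf-8-sig", "cp949", "euc-kr", "latin-1", "ascii")


def decode_with_fallback(file_data: bytes) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for encoding in ENCODINGS:
        try:
            return file_data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Not reached with this list: latin-1 maps every byte value.
    return file_data.decode("utf-8", errors="replace"), "utf-8"
```

Two consequences of this order are worth knowing: `latin-1` accepts any byte sequence, so `ascii` is never reached, and a file with a UTF-8 BOM already decodes under `utf-8` (keeping the BOM character), so `utf-8-sig` never wins.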


HTML Handler Flow

HTMLReprocessor (Utility - NOT a BaseHandler subclass)
    │
    ├── clean_html_file(html_content)                   [FUNCTION]
    │       │
    │       ├── BeautifulSoup parsing
    │       ├── Remove unwanted tags (script, style, etc.)
    │       ├── Remove style attributes
    │       ├── _process_table_merged_cells()
    │       └── Return cleaned HTML string
    │
    └── Used by DOCHandler when HTML format detected

Note: HTML has no dedicated BaseHandler subclass. When DOCHandler detects HTML format, it processes the content internally with BeautifulSoup.


Image File Handler Flow

ImageFileHandler.extract_text(current_file)
    │
    ├── preprocessor.preprocess(file_data)              [INTERFACE: ImageFilePreprocessor]
    │       └── Returns PreprocessedData (file_data in clean_content)
    │
    ├── Validate file extension                         [INTERNAL]
    │       └── SUPPORTED_IMAGE_EXTENSIONS: jpg, jpeg, png, gif, bmp, webp
    │
    ├── If OCR engine is None:
    │       └── _build_image_tag(file_path)             [INTERNAL]
    │               └── Return [image:path] tag
    │
    └── If OCR engine available:
            └── _ocr_engine.extract_text()              [INTERFACE: BaseOCR]
                    └── Image → Text via OCR

Note: ImageFileHandler can extract actual text only when an OCR engine is configured.
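
The two branches (tag without OCR, text with OCR) can be sketched as below. The OCR engine is modelled as a plain callable, which is an assumption; the real interface is `BaseOCR`. The function name `extract_image_text` is also hypothetical. The extension list and the `[image:path]` tag format come from the flow above.

```python
from pathlib import Path
from typing import Callable, Optional

SUPPORTED_IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "bmp", "webp"}


def extract_image_text(file_path: str,
                       ocr: Optional[Callable[[str], str]] = None) -> str:
    """Return OCR text if an engine is given, else an [image:path] tag."""
    extension = Path(file_path).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_IMAGE_EXTENSIONS:
        raise ValueError(f"unsupported image extension: {extension!r}")
    if ocr is None:
        return f"[image:{file_path}]"  # placeholder tag, as in _build_image_tag()
    return ocr(file_path)
```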


Chunking Flow

chunk_text(text, chunk_size, chunk_overlap)
    │
    └── create_chunks()                                 [FUNCTION]
            │
            ├── _extract_document_metadata()            [FUNCTION]
            │
            ├── Detect file type:
            │       │
            │       ├── Table-based (xlsx, xls, csv):
            │       │       └── chunk_multi_sheet_content()  [FUNCTION]
            │       │
            │       ├── Text with page markers:
            │       │       └── chunk_by_pages()        [FUNCTION]
            │       │
            │       └── Plain text:
            │               └── chunk_plain_text()      [FUNCTION]
            │
            └── _prepend_metadata_to_chunks()           [FUNCTION]
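
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window version of the plain-text branch. It is a character-based sketch under the assumption that overlap means "repeat the last `chunk_overlap` characters at the start of the next chunk". The library's `chunk_plain_text()` may split on separators or handle tables differently.

```python
from typing import List


def chunk_plain_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

For example, `chunk_plain_text("abcdefghij", 4, 1)` advances three characters per step, so each chunk starts with the last character of the previous one.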

Interface Integration Summary

┌─────────────┬──────────────────────┬───────────────────┬───────────────────┬────────────────┬──────────────────────┐
│ Handler     │ FileConverter        │ Preprocessor      │ MetadataExtractor │ ChartExtractor │ FormatImageProcessor │
├─────────────┼──────────────────────┼───────────────────┼───────────────────┼────────────────┼──────────────────────┤
│ PDF         │ PDFFileConverter     │ PDFPreprocessor   │ PDFMetadata       │ NullChart      │ PDFImage             │
│ DOCX        │ DOCXFileConverter    │ DOCXPreprocessor  │ DOCXMetadata      │ DOCXChart      │ DOCXImage            │
│ DOC         │ DOCFileConverter     │ DOCPreprocessor   │ NullMetadata      │ NullChart      │ DOCImage             │
│ RTF         │ RTFFileConverter     │ RTFPreprocessor*  │ RTFMetadata       │ NullChart      │ Uses base            │
│ XLSX        │ ExcelFileConverter   │ ExcelPreprocessor │ XLSXMetadata      │ ExcelChart     │ ExcelImage           │
│ XLS         │ ExcelFileConverter   │ ExcelPreprocessor │ XLSMetadata       │ ExcelChart     │ ExcelImage           │
│ PPT/PPTX    │ PPTFileConverter     │ PPTPreprocessor   │ PPTMetadata       │ PPTChart       │ PPTImage             │
│ HWP         │ HWPFileConverter     │ HWPPreprocessor   │ HWPMetadata       │ HWPChart       │ HWPImage             │
│ HWPX        │ None (direct ZIP)    │ HWPXPreprocessor  │ HWPXMetadata      │ HWPXChart      │ HWPXImage            │
│ CSV         │ CSVFileConverter     │ CSVPreprocessor   │ CSVMetadata       │ NullChart      │ CSVImage             │
│ TXT/MD/JSON │ None (direct decode) │ TextPreprocessor  │ NullMetadata      │ NullChart      │ TextImage            │
│ HTML        │ N/A (utility)        │ N/A               │ N/A               │ N/A            │ N/A                  │
│ Image Files │ ImageFileConverter   │ ImagePreprocessor │ NullMetadata      │ NullChart      │ ImageFileImage       │
└─────────────┴──────────────────────┴───────────────────┴───────────────────┴────────────────┴──────────────────────┘

Named class = Interface implemented
Null-prefixed name, None, or N/A = Not applicable / NullExtractor / Not used
* = RTFPreprocessor has actual processing logic (image extraction, binary cleanup)

Handler Processing Pipeline

All handlers follow the same processing pipeline:

┌──────────────────────────────────────────────────────────────────────────────────┐
│                           Handler Processing Pipeline                            │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  1. FileConverter.convert()     Binary → Format-specific object                  │
│         │                       (fitz.Document, docx.Document, olefile, etc.)    │
│         ▼                                                                        │
│  2. Preprocessor.preprocess()   Process/clean the converted data                 │
│         │                       (image extraction, binary cleanup, encoding)     │
│         ▼                                                                        │
│  3. MetadataExtractor.extract() Extract document metadata                        │
│         │                       (title, author, created date, etc.)              │
│         ▼                                                                        │
│  4. Content Extraction          Format-specific content extraction               │
│         │                       (text, tables, images, charts)                   │
│         ▼                                                                        │
│  5. Result Assembly             Build final result string                        │
│                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────┘

Note: In most handlers the Preprocessor is a pass-through (NullPreprocessor).
      RTF is the exception: RTFPreprocessor performs the actual binary processing.
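
The five stages can be sketched as one function that threads data through pluggable steps, with a pass-through preprocessor as the default case. This is an illustration under assumptions: `run_pipeline`, the callable-based stages, and the `"document"` key are hypothetical. `PreprocessedData`, `clean_content`, `extracted_resources`, and `NullPreprocessor` are names that appear in this document, but their real definitions may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class PreprocessedData:
    # Field names follow the diagrams; the real dataclass may have more fields.
    clean_content: Any = None
    extracted_resources: Dict[str, Any] = field(default_factory=dict)


class NullPreprocessor:
    """Pass-through: stores the converted object without modifying it."""

    def preprocess(self, converted: Any) -> PreprocessedData:
        # "document" is a hypothetical key chosen for this sketch.
        return PreprocessedData(extracted_resources={"document": converted})


def run_pipeline(file_data: bytes,
                 convert: Callable[[bytes], Any],
                 preprocessor: NullPreprocessor,
                 extract_metadata: Callable[[Any], str],
                 extract_content: Callable[[PreprocessedData], str]) -> str:
    converted = convert(file_data)                 # 1. FileConverter.convert()
    prepared = preprocessor.preprocess(converted)  # 2. Preprocessor.preprocess()
    metadata = extract_metadata(converted)         # 3. MetadataExtractor.extract()
    content = extract_content(prepared)            # 4. Content extraction
    return f"{metadata}\n{content}"                # 5. Result assembly
```

A format with real preprocessing work, such as RTF, would swap in a preprocessor that fills `clean_content` instead of passing the object through.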

Remaining Function-Based Components

┌─────────────┬───────────────────────────────────────────────────────────────┐
│ Handler     │ Function-Based Components                                     │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ PDF         │ extract_text_blocks(), merge_page_elements(),                 │
│             │ ComplexityAnalyzer, VectorTextOCREngine,                      │
│             │ BlockImageEngine                                              │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ DOCX        │ process_paragraph_element(), process_table_element()          │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ DOC         │ Format detection, OLE/HTML/DOCX internal processing           │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ RTF         │ decode_content() (rtf_decoder.py)                             │
│             │ extract_tables_with_positions() (rtf_table_extractor.py)      │
│             │ extract_inline_content() (rtf_content_extractor.py)           │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ Excel       │ extract_textboxes_from_xlsx(), convert_xlsx_sheet_to_table(), │
│             │ convert_xls_sheet_to_table(), convert_*_objects_to_tables()   │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ PPT         │ extract_text_with_bullets(), convert_table_to_html(),         │
│             │ process_image_shape(), process_group_shape()                  │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ HWP         │ parse_doc_info(), decompress_section()                        │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ HWPX        │ parse_bin_item_map(), parse_hwpx_section()                    │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ CSV         │ detect_delimiter(), parse_csv_content(), detect_header(),     │
│             │ convert_rows_to_table()                                       │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ Text        │ clean_text(), clean_code_text() (utils.py)                    │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ HTML        │ clean_html_file(), _process_table_merged_cells()              │
│             │ (html_reprocessor.py - utility, not handler)                  │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ Image       │ OCR engine integration (BaseOCR subclass)                     │
├─────────────┼───────────────────────────────────────────────────────────────┤
│ Chunking    │ create_chunks(), chunk_by_pages(), chunk_plain_text(),        │
│             │ chunk_multi_sheet_content(), chunk_large_table()              │
└─────────────┴───────────────────────────────────────────────────────────────┘