User calls: processor.extract_chunks(file_path)
    │
    ▼
DocumentProcessor.extract_chunks()
    │
    ├── extract_text()
    │   │
    │   ├── _create_current_file(file_path)
    │   ├── _get_handler(extension)
    │   ├── handler.extract_text(current_file)
    │   └── OCR processing (optional)
    │
    └── chunk_text()
        │
        └── create_chunks()

PDFHandler.extract_text(current_file)
    │
    ├── file_converter.convert(file_data) [INTERFACE: PDFFileConverter]
    │   └── Binary → fitz.Document
    │
    ├── preprocessor.preprocess(doc) [INTERFACE: PDFPreprocessor]
    │   └── Pass-through (returns PreprocessedData with doc unchanged)
    │
    ├── metadata_extractor.extract() [INTERFACE: PDFMetadataExtractor]
    │
    ├── _extract_all_tables(doc, file_path) [INTERNAL]
    │
    └── For each page:
        │
        ├── ComplexityAnalyzer.analyze() [CLASS: pdf_complexity_analyzer]
        │   └── Returns PageComplexity with recommended_strategy
        │
        ├── Branch by strategy:
        │   │
        │   ├── FULL_PAGE_OCR:
        │   │   └── _process_page_full_ocr()
        │   │
        │   ├── BLOCK_IMAGE_OCR:
        │   │   └── _process_page_block_ocr()
        │   │
        │   ├── HYBRID:
        │   │   └── _process_page_hybrid()
        │   │
        │   └── TEXT_EXTRACTION (default):
        │       └── _process_page_text_extraction()
        │           │
        │           ├── VectorTextOCREngine.detect_and_extract()
        │           ├── extract_text_blocks() [FUNCTION]
        │           ├── format_image_processor methods [INTERFACE: PDFImageProcessor]
        │           └── merge_page_elements() [FUNCTION]
        │
        └── page_tag_processor.create_page_tag() [INTERFACE: PageTagProcessor]
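
The per-page branch above can be read as a dispatch table with TEXT_EXTRACTION as the fallback. The sketch below assumes a hypothetical `Strategy` enum and placeholder processors that only return a label; the real methods operate on page objects and are not reproduced here.

```python
from enum import Enum, auto
from typing import Callable, Dict


class Strategy(Enum):
    # Names taken from the flow above; the enum itself is an assumption.
    FULL_PAGE_OCR = auto()
    BLOCK_IMAGE_OCR = auto()
    HYBRID = auto()
    TEXT_EXTRACTION = auto()


# Placeholder processors: each returns a label instead of real page content.
def _full_ocr(page_no: int) -> str:
    return f"full-ocr:{page_no}"


def _block_ocr(page_no: int) -> str:
    return f"block-ocr:{page_no}"


def _hybrid(page_no: int) -> str:
    return f"hybrid:{page_no}"


def _text_extraction(page_no: int) -> str:
    return f"text:{page_no}"


_DISPATCH: Dict[Strategy, Callable[[int], str]] = {
    Strategy.FULL_PAGE_OCR: _full_ocr,
    Strategy.BLOCK_IMAGE_OCR: _block_ocr,
    Strategy.HYBRID: _hybrid,
}


def process_page(page_no: int, strategy: Strategy) -> str:
    """Run the processor for the strategy, defaulting to text extraction."""
    return _DISPATCH.get(strategy, _text_extraction)(page_no)
```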

DOCXHandler.extract_text(current_file)
    │
    ├── file_converter.validate(file_data) [INTERFACE: DOCXFileConverter]
    │   └── Check if valid ZIP with [Content_Types].xml
    │
    ├── If not valid DOCX:
    │   └── _extract_with_doc_handler_fallback() [INTERNAL]
    │       └── DOCHandler.extract_text() [DELEGATION]
    │
    ├── file_converter.convert(file_data) [INTERFACE: DOCXFileConverter]
    │   └── Binary → docx.Document
    │
    ├── preprocessor.preprocess(doc) [INTERFACE: DOCXPreprocessor]
    │   └── Returns PreprocessedData (doc in extracted_resources)
    │
    ├── chart_extractor.extract_all_from_file() [INTERFACE: DOCXChartExtractor]
    │   └── Pre-extract all charts (callback pattern)
    │
    ├── metadata_extractor.extract() [INTERFACE: DOCXMetadataExtractor]
    │
    └── For each element in doc.element.body:
        │
        ├── If paragraph ('p'):
        │   └── process_paragraph_element() [FUNCTION: docx_helper]
        │       ├── format_image_processor.process_drawing_element()
        │       ├── format_image_processor.extract_from_pict()
        │       └── get_next_chart() callback for charts
        │
        └── If table ('tbl'):
            └── process_table_element() [FUNCTION: docx_helper]
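
The validation step above (a valid ZIP that contains `[Content_Types].xml`) can be approximated with the standard library alone. This is a sketch of the check as described, not the project's `validate()` implementation; the function name `looks_like_docx` is made up for the example.

```python
import io
import zipfile


def looks_like_docx(file_data: bytes) -> bool:
    """Return True if the bytes are a ZIP archive with [Content_Types].xml."""
    stream = io.BytesIO(file_data)
    if not zipfile.is_zipfile(stream):
        return False
    with zipfile.ZipFile(stream) as zf:
        return "[Content_Types].xml" in zf.namelist()
```

A caller could use a `False` result to fall back to the DOC handler, mirroring the "If not valid DOCX" branch.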

DOCHandler.extract_text(current_file)
    │
    ├── file_converter.convert() [INTERFACE: DOCFileConverter]
    │   │
    │   ├── _detect_format() → DocFormat (RTF/OLE/HTML/DOCX)
    │   │
    │   ├── RTF: returns file_data (bytes) [Pass-through]
    │   ├── OLE: _convert_ole() → olefile.OleFileIO
    │   ├── HTML: _convert_html() → BeautifulSoup
    │   └── DOCX: _convert_docx() → docx.Document
    │
    ├── preprocessor.preprocess(converted_obj) [INTERFACE: DOCPreprocessor]
    │   └── Returns PreprocessedData (converted_obj in extracted_resources)
    │
    ├── RTF format detected:
    │   └── _delegate_to_rtf_handler() [DELEGATION]
    │       └── RTFHandler.extract_text(current_file)
    │
    ├── OLE format detected:
    │   └── _extract_from_ole_obj() [INTERNAL]
    │       ├── _extract_ole_metadata()
    │       ├── _extract_ole_text()
    │       └── _extract_ole_images()
    │
    ├── HTML format detected:
    │   └── _extract_from_html_obj() [INTERNAL]
    │       ├── _extract_html_metadata()
    │       └── BeautifulSoup parsing
    │
    └── DOCX format detected:
        └── _extract_from_docx_obj() [INTERNAL]
            └── docx.Document paragraph/table extraction
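
A `.doc` file can hold any of the four formats listed above, so detection has to look at the leading bytes rather than the extension. The sketch below shows one plausible way to do that with well-known signatures; the `UNKNOWN` member and the exact HTML heuristics are assumptions, not the behavior of `_detect_format()`.

```python
from enum import Enum


class DocFormat(Enum):
    RTF = "rtf"
    OLE = "ole"
    HTML = "html"
    DOCX = "docx"
    UNKNOWN = "unknown"  # assumption: not listed in the flow above


# Compound File Binary (OLE2) header signature.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature used by ZIP containers such as DOCX.
ZIP_MAGIC = b"PK\x03\x04"


def detect_format(file_data: bytes) -> DocFormat:
    """Classify the payload by its leading bytes."""
    if file_data.startswith(OLE_MAGIC):
        return DocFormat.OLE
    if file_data.startswith(ZIP_MAGIC):
        return DocFormat.DOCX
    head = file_data[:512].lstrip()
    if head.startswith(b"{\\rtf"):
        return DocFormat.RTF
    if head.lower().startswith((b"<!doctype html", b"<html")):
        return DocFormat.HTML
    return DocFormat.UNKNOWN
```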

Structure: the Converter is a pass-through, the Preprocessor handles binary data, and the Handler processes the result sequentially.
RTFHandler.extract_text(current_file)
    │
    ├── file_converter.convert() [INTERFACE: RTFFileConverter]
    │   └── Pass-through (returns raw bytes)
    │
    ├── preprocessor.preprocess() [INTERFACE: RTFPreprocessor]
    │   │
    │   ├── \binN tag processing (skip binary data)
    │   ├── \pict group image extraction
    │   └── Returns PreprocessedData (clean_content, image_tags, encoding)
    │
    ├── decode_content() [FUNCTION: rtf_decoder]
    │   └── bytes → string with detected encoding
    │
    ├── Build RTFConvertedData [DATACLASS]
    │
    └── _extract_from_converted() [INTERNAL]
        │
        ├── metadata_extractor.extract() [INTERFACE: RTFMetadataExtractor]
        ├── metadata_extractor.format()
        │
        ├── extract_tables_with_positions() [FUNCTION: rtf_table_extractor]
        │
        ├── extract_inline_content() [FUNCTION: rtf_content_extractor]
        │
        └── Build result string
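
The `\binN` step matters because the N bytes that follow the control word are raw binary and may contain braces or backslashes that would confuse a text-level parser. The following sketch shows the skipping idea only; it is an assumption about how such a step could look, not the RTFPreprocessor code, and it ignores `\pict` extraction.

```python
import re

# \binN, followed by at most one delimiter space, then exactly N raw bytes.
_BIN_TAG = re.compile(rb"\\bin(\d+) ?")


def strip_bin_payloads(raw: bytes) -> bytes:
    """Drop every \\binN control word together with its N-byte payload."""
    out = bytearray()
    pos = 0
    while True:
        match = _BIN_TAG.search(raw, pos)
        if match is None:
            out += raw[pos:]
            return bytes(out)
        out += raw[pos:match.start()]
        # Jump past the control word and the declared number of payload bytes.
        pos = match.end() + int(match.group(1))
```

Skipping by the declared length, rather than scanning for a closing brace, is what keeps a stray `}` inside the payload from ending the group early.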

ExcelHandler.extract_text(current_file) [XLSX]
    │
    ├── file_converter.convert(file_data, extension='xlsx') [INTERFACE: ExcelFileConverter]
    │   └── Binary → openpyxl.Workbook
    │
    ├── preprocessor.preprocess(wb) [INTERFACE: ExcelPreprocessor]
    │   └── Returns PreprocessedData (wb in extracted_resources)
    │
    ├── _preload_xlsx_data() [INTERNAL]
    │   ├── metadata_extractor.extract() [INTERFACE: XLSXMetadataExtractor]
    │   ├── chart_extractor.extract_all_from_file() [INTERFACE: ExcelChartExtractor]
    │   └── format_image_processor.extract_images() [INTERFACE: ExcelImageProcessor]
    │
    └── For each sheet:
        │
        ├── _process_xlsx_sheet() [INTERNAL]
        │   ├── page_tag_processor.create_sheet_tag() [INTERFACE: PageTagProcessor]
        │   ├── extract_textboxes_from_xlsx() [FUNCTION]
        │   ├── convert_xlsx_sheet_to_table() [FUNCTION]
        │   └── convert_xlsx_objects_to_tables() [FUNCTION]
        │
        └── format_image_processor.get_sheet_images() [INTERFACE: ExcelImageProcessor]

ExcelHandler.extract_text(current_file) [XLS]
    │
    ├── file_converter.convert(file_data, extension='xls') [INTERFACE: ExcelFileConverter]
    │   └── Binary → xlrd.Book
    │
    ├── preprocessor.preprocess(wb) [INTERFACE: ExcelPreprocessor]
    │   └── Returns PreprocessedData (wb in extracted_resources)
    │
    ├── _get_xls_metadata_extractor().extract_and_format() [INTERFACE: XLSMetadataExtractor]
    │
    └── For each sheet:
        │
        ├── page_tag_processor.create_sheet_tag() [INTERFACE: PageTagProcessor]
        │
        ├── convert_xls_sheet_to_table() [FUNCTION]
        │
        └── convert_xls_objects_to_tables() [FUNCTION]

PPTHandler.extract_text(current_file)
    │
    ├── file_converter.convert(file_data, file_stream) [INTERFACE: PPTFileConverter]
    │   └── Binary → pptx.Presentation
    │
    ├── preprocessor.preprocess(prs) [INTERFACE: PPTPreprocessor]
    │   └── Returns PreprocessedData (prs in extracted_resources)
    │
    ├── chart_extractor.extract_all_from_file() [INTERFACE: PPTChartExtractor]
    │   └── Pre-extract all charts (callback pattern)
    │
    ├── metadata_extractor.extract() [INTERFACE: PPTMetadataExtractor]
    ├── metadata_extractor.format() [INTERFACE: PPTMetadataExtractor]
    │
    └── For each slide:
        │
        ├── page_tag_processor.create_slide_tag() [INTERFACE: PageTagProcessor]
        │
        └── For each shape:
            │
            ├── If table: convert_table_to_html() [FUNCTION]
            ├── If chart: get_next_chart() callback [Pre-extracted]
            ├── If picture: process_image_shape() [FUNCTION]
            ├── If group: process_group_shape() [FUNCTION]
            └── If text: extract_text_with_bullets() [FUNCTION]
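
The "callback pattern" for charts means that all charts are extracted once up front, and the shape loop then pulls them in document order through a `get_next_chart()` callback. A minimal sketch of that arrangement, assuming charts are already rendered to strings and using a made-up `make_chart_callback` helper and shape representation:

```python
from typing import Callable, Iterable, List, Optional, Tuple


def make_chart_callback(charts: Iterable[str]) -> Callable[[], Optional[str]]:
    """Wrap pre-extracted charts in a callback that yields them in order."""
    iterator = iter(charts)

    def get_next_chart() -> Optional[str]:
        # None signals that more chart shapes were seen than charts extracted.
        return next(iterator, None)

    return get_next_chart


def render_shapes(shapes: List[Tuple[str, str]],
                  get_next_chart: Callable[[], Optional[str]]) -> List[str]:
    """Walk (kind, text) shapes, substituting the next chart for chart shapes."""
    output = []
    for kind, text in shapes:
        if kind == "chart":
            output.append(get_next_chart() or "[chart]")
        else:
            output.append(text)
    return output
```

The shape loop stays unaware of how charts are parsed; it only relies on the order of chart shapes matching the order of pre-extraction.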

HWPHandler.extract_text(current_file)
    │
    ├── file_converter.validate(file_data) [INTERFACE: HWPFileConverter]
    │   └── Check if OLE file (magic number check)
    │
    ├── If not OLE file:
    │   └── _handle_non_ole_file() [INTERNAL]
    │       ├── ZIP detected → HWPXHandler delegation
    │       └── HWP 3.0 → Not supported
    │
    ├── chart_extractor.extract_all_from_file() [INTERFACE: HWPChartExtractor]
    │
    ├── file_converter.convert() [INTERFACE: HWPFileConverter]
    │   └── Binary → olefile.OleFileIO
    │
    ├── preprocessor.preprocess(ole) [INTERFACE: HWPPreprocessor]
    │   └── Returns PreprocessedData (ole in extracted_resources)
    │
    ├── metadata_extractor.extract() [INTERFACE: HWPMetadataExtractor]
    ├── metadata_extractor.format() [INTERFACE: HWPMetadataExtractor]
    │
    ├── _parse_docinfo(ole) [INTERNAL]
    │   └── parse_doc_info() [FUNCTION]
    │
    ├── _extract_body_text(ole) [INTERNAL]
    │   │
    │   └── For each section:
    │       ├── decompress_section() [FUNCTION]
    │       └── _parse_section() [INTERNAL]
    │           └── _process_picture() [INTERNAL - uses format_image_processor]
    │
    ├── format_image_processor.process_images_from_bindata() [INTERFACE: HWPImageProcessor]
    │
    └── file_converter.close(ole) [INTERFACE: HWPFileConverter]

HWPXHandler.extract_text(current_file)
    │
    ├── get_file_stream(current_file) [INHERITED: BaseHandler]
    │   └── BytesIO(file_data)
    │
    ├── _is_valid_zip(file_stream) [INTERNAL]
    │
    ├── chart_extractor.extract_all_from_file() [INTERFACE: HWPXChartExtractor]
    │
    ├── zipfile.ZipFile(file_stream) [EXTERNAL LIBRARY]
    │
    ├── preprocessor.preprocess(zf) [INTERFACE: HWPXPreprocessor]
    │   └── Returns PreprocessedData (extracted_resources available)
    │
    ├── metadata_extractor.extract() [INTERFACE: HWPXMetadataExtractor]
    ├── metadata_extractor.format() [INTERFACE: HWPXMetadataExtractor]
    │
    ├── parse_bin_item_map(zf) [FUNCTION]
    │
    ├── For each section:
    │   │
    │   └── parse_hwpx_section() [FUNCTION]
    │       │
    │       ├── format_image_processor.process_images() [INTERFACE: HWPXImageProcessor]
    │       │
    │       └── parse_hwpx_table() [FUNCTION]
    │
    └── format_image_processor.get_remaining_images() [INTERFACE: HWPXImageProcessor]

CSVHandler.extract_text(current_file)
    │
    ├── file_converter.convert(file_data, encoding) [INTERFACE: CSVFileConverter]
    │   └── Binary → Text (with encoding detection)
    │
    ├── preprocessor.preprocess(content) [INTERFACE: CSVPreprocessor]
    │   └── Returns PreprocessedData (content in clean_content)
    │
    ├── detect_delimiter(content) [FUNCTION]
    │
    ├── parse_csv_content(content, delimiter) [FUNCTION]
    │
    ├── detect_header(rows) [FUNCTION]
    │
    ├── metadata_extractor.extract(source_info) [INTERFACE: CSVMetadataExtractor]
    │   └── CSVSourceInfo contains: file_path, encoding, delimiter, rows, has_header
    │
    └── convert_rows_to_table(rows, has_header) [FUNCTION]
        └── Returns HTML table
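
The CSV steps above (detect the delimiter, parse rows, render an HTML table) can be illustrated with the standard `csv` module. The heuristics below are assumptions chosen for the example; the project's `detect_delimiter()` and `convert_rows_to_table()` may work differently.

```python
import csv
import io
from html import escape
from typing import List, Sequence


def detect_delimiter(content: str,
                     candidates: Sequence[str] = (",", ";", "\t", "|")) -> str:
    """Pick the candidate that appears on every sampled line most consistently."""
    lines = [line for line in content.splitlines()[:20] if line]

    def score(delimiter: str):
        counts = [line.count(delimiter) for line in lines]
        # Prefer a high minimum count, then few distinct counts across lines.
        return (min(counts, default=0), -len(set(counts)))

    return max(candidates, key=score)


def parse_csv_content(content: str, delimiter: str) -> List[List[str]]:
    return list(csv.reader(io.StringIO(content), delimiter=delimiter))


def convert_rows_to_table(rows: List[List[str]], has_header: bool) -> str:
    """Render rows as an HTML table, using <th> for the header row."""
    parts = ["<table>"]
    for index, row in enumerate(rows):
        tag = "th" if has_header and index == 0 else "td"
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "".join(parts)
```

Escaping each cell keeps characters such as `<` in the data from being read as markup downstream.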

TextHandler.extract_text(current_file)
    │
    ├── preprocessor.preprocess(file_data) [INTERFACE: TextPreprocessor]
    │   └── Returns PreprocessedData (file_data in clean_content)
    │
    ├── file_data.decode(encoding) [DIRECT: No FileConverter used]
    │   └── Try encodings: utf-8, utf-8-sig, cp949, euc-kr, latin-1, ascii
    │
    └── clean_text() / clean_code_text() [FUNCTION: utils.py]
Note: TextHandler does not use file_converter; it decodes the bytes directly.
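
Trying encodings in sequence, as the Text flow describes, can be sketched as a loop that returns on the first successful decode. The function name and the returned `(text, encoding)` pair are assumptions for illustration.

```python
from typing import Sequence, Tuple

# Order taken from the flow above.
ENCODINGS = ("utf-8", "utf-8-sig", "cp949", "euc-kr", "latin-1", "ascii")


def decode_with_fallback(file_data: bytes,
                         encodings: Sequence[str] = ENCODINGS) -> Tuple[str, str]:
    """Return the decoded text and the first encoding that accepted the bytes."""
    for encoding in encodings:
        try:
            return file_data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passes a list without a catch-all encoding.
    return file_data.decode("utf-8", errors="replace"), "utf-8"
```

Because latin-1 maps every byte value to a character, it never raises, so any encoding listed after it is effectively unreachable with this ordering.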

HTMLReprocessor (Utility - NOT a BaseHandler subclass)
    │
    ├── clean_html_file(html_content) [FUNCTION]
    │   │
    │   ├── BeautifulSoup parsing
    │   ├── Remove unwanted tags (script, style, etc.)
    │   ├── Remove style attributes
    │   ├── _process_table_merged_cells()
    │   └── Return cleaned HTML string
    │
    └── Used by DOCHandler when HTML format detected
Note: HTML has no dedicated BaseHandler subclass. When DOCHandler detects HTML format, it processes the content internally with BeautifulSoup.

ImageFileHandler.extract_text(current_file)
    │
    ├── preprocessor.preprocess(file_data) [INTERFACE: ImageFilePreprocessor]
    │   └── Returns PreprocessedData (file_data in clean_content)
    │
    ├── Validate file extension [INTERNAL]
    │   └── SUPPORTED_IMAGE_EXTENSIONS: jpg, jpeg, png, gif, bmp, webp
    │
    ├── If OCR engine is None:
    │   └── _build_image_tag(file_path) [INTERNAL]
    │       └── Return [image:path] tag
    │
    └── If OCR engine available:
        └── _ocr_engine.extract_text() [INTERFACE: BaseOCR]
            └── Image → Text via OCR
Note: ImageFileHandler can extract actual text only when an OCR engine is configured.
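
The two branches of the image flow (placeholder tag without OCR, recognized text with OCR) can be sketched as below. The `OCREngine` protocol is a hypothetical stand-in for BaseOCR, and the `extract_text(path)` signature is an assumption.

```python
from typing import Optional, Protocol

SUPPORTED_IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "bmp", "webp"}


class OCREngine(Protocol):
    # Hypothetical interface; the real BaseOCR signature may differ.
    def extract_text(self, file_path: str) -> str: ...


def extract_image_text(file_path: str,
                       ocr_engine: Optional[OCREngine] = None) -> str:
    """Return OCR text when an engine is set, otherwise an [image:path] tag."""
    extension = file_path.rsplit(".", 1)[-1].lower() if "." in file_path else ""
    if extension not in SUPPORTED_IMAGE_EXTENSIONS:
        raise ValueError(f"Unsupported image extension: {extension!r}")
    if ocr_engine is None:
        return f"[image:{file_path}]"
    return ocr_engine.extract_text(file_path)
```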

chunk_text(text, chunk_size, chunk_overlap)
    │
    └── create_chunks() [FUNCTION]
        │
        ├── _extract_document_metadata() [FUNCTION]
        │
        ├── Detect file type:
        │   │
        │   ├── Table-based (xlsx, xls, csv):
        │   │   └── chunk_multi_sheet_content() [FUNCTION]
        │   │
        │   ├── Text with page markers:
        │   │   └── chunk_by_pages() [FUNCTION]
        │   │
        │   └── Plain text:
        │       └── chunk_plain_text() [FUNCTION]
        │
        └── _prepend_metadata_to_chunks() [FUNCTION]
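
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a character-based sliding-window sketch of the plain-text branch. It is an assumption about the general technique; the project's `chunk_plain_text()` may split on sentence or paragraph boundaries instead.

```python
from typing import List


def chunk_plain_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
```

With `chunk_size=4` and `chunk_overlap=1`, each chunk starts three characters after the previous one and repeats its final character.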

┌─────────────┬────────────────────────┬─────────────────────┬───────────────────┬────────────────┬──────────────────────┐
│ Handler     │ FileConverter          │ Preprocessor        │ MetadataExtractor │ ChartExtractor │ FormatImageProcessor │
├─────────────┼────────────────────────┼─────────────────────┼───────────────────┼────────────────┼──────────────────────┤
│ PDF         │ ✓ PDFFileConverter     │ ✓ PDFPreprocessor   │ ✓ PDFMetadata     │ ✗ NullChart    │ ✓ PDFImage           │
│ DOCX        │ ✓ DOCXFileConverter    │ ✓ DOCXPreprocessor  │ ✓ DOCXMetadata    │ ✓ DOCXChart    │ ✓ DOCXImage          │
│ DOC         │ ✓ DOCFileConverter     │ ✓ DOCPreprocessor   │ ✗ NullMetadata    │ ✗ NullChart    │ ✓ DOCImage           │
│ RTF         │ ✓ RTFFileConverter     │ ✓ RTFPreprocessor*  │ ✓ RTFMetadata     │ ✗ NullChart    │ ✗ Uses base          │
│ XLSX        │ ✓ ExcelFileConverter   │ ✓ ExcelPreprocessor │ ✓ XLSXMetadata    │ ✓ ExcelChart   │ ✓ ExcelImage         │
│ XLS         │ ✓ ExcelFileConverter   │ ✓ ExcelPreprocessor │ ✓ XLSMetadata     │ ✓ ExcelChart   │ ✓ ExcelImage         │
│ PPT/PPTX    │ ✓ PPTFileConverter     │ ✓ PPTPreprocessor   │ ✓ PPTMetadata     │ ✓ PPTChart     │ ✓ PPTImage           │
│ HWP         │ ✓ HWPFileConverter     │ ✓ HWPPreprocessor   │ ✓ HWPMetadata     │ ✓ HWPChart     │ ✓ HWPImage           │
│ HWPX        │ ✗ None (direct ZIP)    │ ✓ HWPXPreprocessor  │ ✓ HWPXMetadata    │ ✓ HWPXChart    │ ✓ HWPXImage          │
│ CSV         │ ✓ CSVFileConverter     │ ✓ CSVPreprocessor   │ ✓ CSVMetadata     │ ✗ NullChart    │ ✓ CSVImage           │
│ TXT/MD/JSON │ ✗ None (direct decode) │ ✓ TextPreprocessor  │ ✗ NullMetadata    │ ✗ NullChart    │ ✓ TextImage          │
│ HTML        │ ✗ N/A (utility)        │ ✗ N/A               │ ✗ N/A             │ ✗ N/A          │ ✗ N/A                │
│ Image Files │ ✓ ImageFileConverter   │ ✓ ImagePreprocessor │ ✗ NullMetadata    │ ✗ NullChart    │ ✓ ImageFileImage     │
└─────────────┴────────────────────────┴─────────────────────┴───────────────────┴────────────────┴──────────────────────┘
✓ = Interface implemented
✗ = Not applicable / NullExtractor / Not used
* = RTFPreprocessor has actual processing logic (image extraction, binary cleanup)

All handlers follow the same processing pipeline:
┌──────────────────────────────────────────────────────────────────────────────────┐
│                           Handler Processing Pipeline                            │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  1. FileConverter.convert()       Binary → Format-specific object                │
│     │                             (fitz.Document, docx.Document, olefile, etc.)  │
│     ▼                                                                            │
│  2. Preprocessor.preprocess()     Process/clean the converted data               │
│     │                             (image extraction, binary cleanup, encoding)   │
│     ▼                                                                            │
│  3. MetadataExtractor.extract()   Extract document metadata                      │
│     │                             (title, author, created date, etc.)            │
│     ▼                                                                            │
│  4. Content Extraction            Format-specific content extraction             │
│     │                             (text, tables, images, charts)                 │
│     ▼                                                                            │
│  5. Result Assembly               Build final result string                      │
│                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────┘
Note: In most handlers the Preprocessor is a pass-through (NullPreprocessor).
RTF is the exception: RTFPreprocessor performs the actual binary processing.
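
The five pipeline stages, with a pass-through preprocessor in stage 2, can be sketched as a single function. Everything here is illustrative: the `PreprocessedData` fields are reduced to two, the `"converted"` resource key is made up, and the stages are passed in as plain callables rather than the interface classes named above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class PreprocessedData:
    # Reduced, hypothetical version of the structure named in the flows above.
    clean_content: Any = None
    extracted_resources: Dict[str, Any] = field(default_factory=dict)


class NullPreprocessor:
    """Pass-through: stores the converted object without modifying it."""

    def preprocess(self, converted: Any) -> PreprocessedData:
        return PreprocessedData(extracted_resources={"converted": converted})


def run_pipeline(file_data: bytes,
                 convert: Callable[[bytes], Any],
                 preprocessor: NullPreprocessor,
                 extract_metadata: Callable[[Any], Dict[str, str]],
                 extract_content: Callable[[Any], str]) -> str:
    converted = convert(file_data)                       # 1. FileConverter
    prepared = preprocessor.preprocess(converted)        # 2. Preprocessor
    document = prepared.extracted_resources["converted"]
    metadata = extract_metadata(document)                # 3. MetadataExtractor
    body = extract_content(document)                     # 4. Content extraction
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{body}" if header else body     # 5. Result assembly
```

Keeping the preprocessor as a separate stage, even when it does nothing, lets a format such as RTF plug in real work without changing the surrounding steps.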

┌──────────┬────────────────────────────────────────────────────────────────┐
│ Handler  │ Function-Based Components                                      │
├──────────┼────────────────────────────────────────────────────────────────┤
│ PDF      │ extract_text_blocks(), merge_page_elements(),                  │
│          │ ComplexityAnalyzer, VectorTextOCREngine,                       │
│          │ BlockImageEngine                                               │
├──────────┼────────────────────────────────────────────────────────────────┤
│ DOCX     │ process_paragraph_element(), process_table_element()           │
├──────────┼────────────────────────────────────────────────────────────────┤
│ DOC      │ Format detection, OLE/HTML/DOCX internal processing            │
├──────────┼────────────────────────────────────────────────────────────────┤
│ RTF      │ decode_content() (rtf_decoder.py)                              │
│          │ extract_tables_with_positions() (rtf_table_extractor.py)       │
│          │ extract_inline_content() (rtf_content_extractor.py)            │
├──────────┼────────────────────────────────────────────────────────────────┤
│ Excel    │ extract_textboxes_from_xlsx(), convert_xlsx_sheet_to_table(),  │
│          │ convert_xls_sheet_to_table(), convert_*_objects_to_tables()    │
├──────────┼────────────────────────────────────────────────────────────────┤
│ PPT      │ extract_text_with_bullets(), convert_table_to_html(),          │
│          │ process_image_shape(), process_group_shape()                   │
├──────────┼────────────────────────────────────────────────────────────────┤
│ HWP      │ parse_doc_info(), decompress_section()                         │
├──────────┼────────────────────────────────────────────────────────────────┤
│ HWPX     │ parse_bin_item_map(), parse_hwpx_section()                     │
├──────────┼────────────────────────────────────────────────────────────────┤
│ CSV      │ detect_delimiter(), parse_csv_content(), detect_header(),      │
│          │ convert_rows_to_table()                                        │
├──────────┼────────────────────────────────────────────────────────────────┤
│ Text     │ clean_text(), clean_code_text() (utils.py)                     │
├──────────┼────────────────────────────────────────────────────────────────┤
│ HTML     │ clean_html_file(), _process_table_merged_cells()               │
│          │ (html_reprocessor.py - utility, not handler)                   │
├──────────┼────────────────────────────────────────────────────────────────┤
│ Image    │ OCR engine integration (BaseOCR subclass)                      │
├──────────┼────────────────────────────────────────────────────────────────┤
│ Chunking │ create_chunks(), chunk_by_pages(), chunk_plain_text(),         │
│          │ chunk_multi_sheet_content(), chunk_large_table()               │
└──────────┴────────────────────────────────────────────────────────────────┘