3 changes: 3 additions & 0 deletions .gitattributes
@@ -1,2 +1,5 @@
 packages/markitdown/tests/test_files/** linguist-vendored
 packages/markitdown-sample-plugin/tests/test_files/** linguist-vendored
+
+# Treat PDF files as binary to prevent line ending conversion
+*.pdf binary
2 changes: 1 addition & 1 deletion packages/markitdown/src/markitdown/__about__.py
@@ -1,4 +1,4 @@
 # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
 #
 # SPDX-License-Identifier: MIT
-__version__ = "0.1.5b1"
+__version__ = "0.1.5b2"
55 changes: 51 additions & 4 deletions packages/markitdown/src/markitdown/converters/_pdf_converter.py
@@ -198,15 +198,62 @@ def _extract_form_content_from_words(page: Any) -> str | None:
     if not all_table_x_positions:
         return None

-    # Compute global column boundaries
+    # Compute adaptive column clustering tolerance based on gap analysis
     all_table_x_positions.sort()
+
+    # Calculate gaps between consecutive x-positions
+    gaps = []
+    for i in range(len(all_table_x_positions) - 1):
+        gap = all_table_x_positions[i + 1] - all_table_x_positions[i]
+        if gap > 5:  # Only significant gaps
+            gaps.append(gap)
+
+    # Determine optimal tolerance using statistical analysis
+    if gaps and len(gaps) >= 3:
+        # Use 70th percentile of gaps as threshold (balances precision/recall)
+        sorted_gaps = sorted(gaps)
+        percentile_70_idx = int(len(sorted_gaps) * 0.70)
+        adaptive_tolerance = sorted_gaps[percentile_70_idx]
+
+        # Clamp tolerance to reasonable range [25, 50]
+        adaptive_tolerance = max(25, min(50, adaptive_tolerance))
+    else:
+        # Fallback to conservative value
+        adaptive_tolerance = 35
+
+    # Compute global column boundaries using adaptive tolerance
     global_columns: list[float] = []
     for x in all_table_x_positions:
-        if not global_columns or x - global_columns[-1] > 30:
+        if not global_columns or x - global_columns[-1] > adaptive_tolerance:
             global_columns.append(x)

-    # Too many columns suggests dense text, not a form
-    if len(global_columns) > 8:
+    # Adaptive max column check based on page characteristics
+    # Calculate average column width
+    if len(global_columns) > 1:
+        content_width = global_columns[-1] - global_columns[0]
+        avg_col_width = content_width / len(global_columns)
+
+        # Forms with very narrow columns (< 30px) are likely dense text
+        if avg_col_width < 30:
+            return None
+
+        # Compute adaptive max based on columns per inch
+        # Typical forms have 3-8 columns per inch
+        columns_per_inch = len(global_columns) / (content_width / 72)
+
+        # If density is too high (> 10 cols/inch), likely not a form
+        if columns_per_inch > 10:
+            return None
+
+        # Adaptive max: allow more columns for wider pages
+        # Standard letter is 612pt wide, so scale accordingly
+        adaptive_max_columns = int(20 * (page_width / 612))
+        adaptive_max_columns = max(15, adaptive_max_columns)  # At least 15
+
+        if len(global_columns) > adaptive_max_columns:
+            return None
+    else:
+        # Single column, not a form
         return None

     # Now classify each row as table row or not
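
The hunk above can be tried in isolation. The following is a minimal standalone sketch of the same gap-percentile clustering, not the converter's actual code: the function names (`adaptive_tolerance`, `cluster_columns`, `columns_per_inch`) and their parameters are hypothetical, and only the constants (5, 0.70, 25, 50, 35, 72) are taken from the diff.

```python
from typing import List


def adaptive_tolerance(xs: List[float], min_gap: float = 5.0,
                       lo: float = 25.0, hi: float = 50.0,
                       fallback: float = 35.0) -> float:
    """Pick a clustering tolerance from the gaps between sorted x-positions."""
    xs = sorted(xs)
    # Keep only gaps larger than min_gap, as the hunk does with `gap > 5`.
    gaps = sorted(b - a for a, b in zip(xs, xs[1:]) if b - a > min_gap)
    if len(gaps) < 3:
        # Too few significant gaps: use the fixed fallback value.
        return fallback
    # Index of the (approximate) 70th percentile, then clamp to [lo, hi].
    idx = int(len(gaps) * 0.70)
    return max(lo, min(hi, gaps[idx]))


def cluster_columns(xs: List[float], tolerance: float) -> List[float]:
    """Start a new column whenever x is more than `tolerance` past the last column start."""
    columns: List[float] = []
    for x in sorted(xs):
        if not columns or x - columns[-1] > tolerance:
            columns.append(x)
    return columns


def columns_per_inch(n_columns: int, content_width_pt: float) -> float:
    """Column density, with 72 points per inch as in the hunk."""
    return n_columns / (content_width_pt / 72)


xs = [50.0, 52.0, 120.0, 121.0, 200.0, 310.0]
tol = adaptive_tolerance(xs)
cols = cluster_columns(xs, tol)
```

One observation that follows from the formulas as written: `columns_per_inch` is algebraically `72 / avg_col_width`, so once the `avg_col_width < 30` check has passed, the density cannot exceed 2.4 and the `> 10` branch would not be reached. Whether that is intended is a question for the PR author.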
@@ -0,0 +1,81 @@
TECHMART ELECTRONICS
4567 Innovation Blvd
San Francisco, CA 94103
(415) 555-0199

===================================

Store #0342 - Downtown SF
11/23/2024 14:32:18 PST
TXN: TXN-98765-2024
Cashier: Emily Rodriguez
Register: POS-07

-----------------------------------

Wireless Noise-Cancelling
Headphones - Premium Black
AUDIO-5521 1 @ $349.99
Member Discount $-50.00
$299.99
USB-C Hub 7-in-1 Adapter
with HDMI & Ethernet
ACC-8834 2 @ $79.99
$159.98
Portable SSD 2TB
Thunderbolt 3 Compatible
STOR-2241 1 @ $289.00
Member Discount $-29.00
$260.00
Ergonomic Wireless Mouse
Rechargeable Battery
ACC-9012 1 @ $59.99
$59.99
Screen Cleaning Kit
Professional Grade
CARE-1156 3 @ $12.99
$38.97
HDMI 2.1 Cable 6ft
8K Resolution Support
CABLE-7789 2 @ $24.99
Member Discount $-5.00
$44.98
-----------------------------------

SUBTOTAL $863.91
Member Discount (15%)-$84.00
Sales Tax (8.5%) $66.23
Rewards Applied -$25.00
===================================
TOTAL $821.14
===================================

PAYMENT METHOD
Visa Card ending in 4782
Auth: 847392
Ref: REF-20241123-98765

-----------------------------------

REWARDS MEMBER
Sarah Mitchell
ID: TM-447821
Points Earned: 821
Total Points: 3,247
Next Reward: $50 gift card
at 5,000 pts (1,753 to go)

-----------------------------------

RETURN POLICY
Returns within 30 days
Receipt required
Electronics must be unopened

*TXN98765202411231432*

Thank you for shopping!
www.techmart.example.com

===================================

@@ -0,0 +1,76 @@
ZAVA AUTO REPAIR
Certified Collision Repair
123 Main Street, Redmond, WA 98052
Phone: (425) 000-0000
Preliminary Estimate (ID: EST-1008)
| Customer Information | | | Vehicle Information | |
| -------------------- | ------------------- | --- | ------------------- | ----------------- |
| Insured name | Gabriel Diaz | | Year | 2022 |
| Claim # | SF-1008 | | Make | Jeep |
| Policy # | POL-2022-555 | | Model | Grand Cherokee |
| Phone | (425) 111-1111 | | Trim | Limited |
| Email | gabriel@contoso.com | | VIN | 1C4RJFBG2NC123456 |
| | | | Color | White |
| | | | Odometer | 9,800 |
| Repair Order # | RO-20221108 | | Estimator | Ellis Turner |
Estimate Totals
| | | Hours | Rate | Cost |
| ---------------- | --- | ----- | ---- | ----- |
| Parts | | | | 2,100 |
| Body Labor | | 2 | 150 | 300 |
| Paint Labor | | 1.5 | 150 | 225 |
| Mechanical Labor | | - | - | - |
Supplies
| | Paint Supplies | | | 60 |
| ------------- | ------------------------ | --- | ------ | ------ |
| | Body Supplies | | | 30 |
| Other Charges | | | | 15 |
| Subtotal | | | | 2,730 |
| Sales Tax | | | 10.20% | 278.46 |
| GRAND TOTAL | | | | 5,738 |
| Note | Minor rear bumper repair | | | |
This is a preliminary estimate for the visible damage of the vehicle. Additional damage / repairs / parts may be found
after the vehicle has been disassembled and damaged parts have been removed. Suspension damages may be
present, but can not be determined until an alignment on the vehicle has been done. Parts Prices may vary due to
models and vehicle maker price updates. Please be advised if vehicle owner elects to have vehicle sent to service for
any mechanical concerns, ALL service departments charge a vehicle diagnostic charge. If the mechanical concern is
deemed not related to an insurance claim, vehicle owner will be reponsible for charges.

ZAVA AUTO REPAIR
Certified Collision Repair
123 Main Street, Redmond, WA 98052
Phone: (425) 000-0000
Preliminary Estimate (ID: EST-1008)
Customer Information Vehicle Information
| Insured name | Bruce Wayne | | Year | 2025 |
| -------------- | -------------------------- | --- | --------- | ------------ |
| Claim # | | 999 | Make | Batman |
| Policy # | IM-BATMAN | | Model | Batmobile |
| Phone | (416) 555-1234 | | Trim | Limited |
| Email | batman@wayneindustries.com | | VIN | XXX |
| | | | Color | Black |
| | | | Odometer | 1 |
| Repair Order # | RO-20221108 | | Estimator | Ellis Turner |
Estimate Totals
| | | Hours | Rate | Cost |
| ---------------- | --- | ----- | ---- | ------ |
| Parts | | | | 99,999 |
| Body Labor | | 2 | 150 | 300 |
| Paint Labor | | 1.5 | 150 | 225 |
| Mechanical Labor | | - | - | - |
Supplies
| | Paint Supplies | | | 60 |
| ------------- | ------------------------ | --- | ------ | --------- |
| | Body Supplies | | | 30 |
| Other Charges | | | | 15 |
| Subtotal | | | | 100,629 |
| Sales Tax | | | 10.20% | 10264.158 |
| GRAND TOTAL | | | | 211,522 |
| Note | Minor rear bumper repair | | | |

This is a preliminary estimate for the visible damage of the vehicle. Additional damage / repairs / parts may be found
after the vehicle has been disassembled and damaged parts have been removed. Suspension damages may be
present, but can not be determined until an alignment on the vehicle has been done. Parts Prices may vary due to
models and vehicle maker price updates. Please be advised if vehicle owner elects to have vehicle sent to service for
any mechanical concerns, ALL service departments charge a vehicle diagnostic charge. If the mechanical concern is
deemed not related to an insurance claim, vehicle owner will be reponsible for charges.
@@ -0,0 +1,44 @@
INVENTORY RECONCILIATION REPORT
Report ID: SPARSE-2024-INV-1234
Warehouse: Distribution Center East
Report Date: 2024-11-15
Prepared By: Sarah Martinez
| Product Code | Location | Expected | Actual | Variance | Status |
| ------------ | -------- | -------- | ------ | -------- | -------- |
| SKU-8847 | A-12 | 450 | | | |
| | B-07 | | 289 | -23 | |
| SKU-9201 | | 780 | 778 | | OK |
| | C-15 | | | +15 | |
| SKU-4563 | D-22 | | 156 | | CRITICAL |
| | | 180 | | -24 | |
| SKU-7728 | A-08 | 920 | | | |
| | | | 935 | +15 | OK |
Variance Analysis:
Summary Statistics:
Total Variance Cost: $4,287.50
Critical Items: 1
Overall Accuracy: 97.2%
Detailed Analysis by Category:
The inventory reconciliation reveals several key findings. The primary variance driver is SKU-4563,
which shows a -24 unit discrepancy requiring immediate investigation. Location B-07 handling of
SKU-8847 also demonstrates significant variance. Cross-location verification protocols should be

reviewed to prevent future discrepancies. The overall accuracy rate of 97.2% meets our target
threshold, but critical items require expedited resolution to maintain operational efficiency.
Extended Inventory Review:
| Product Code | Category | Unit Cost | Total Value | Last Audit | Notes |
| ------------ | ----------- | --------- | ----------- | ---------- | ---------- |
| SKU-8847 | Electronics | $45.00 | $13,005.00 | 2024-10-15 | |
| SKU-9201 | Hardware | $32.50 | $25,285.00 | 2024-10-22 | Verified |
| SKU-4563 | Software | $120.00 | $18,720.00 | | Critical |
| SKU-7728 | Accessories | $15.75 | $14,726.25 | 2024-11-01 | |
| SKU-3345 | Electronics | $67.00 | $22,445.00 | 2024-10-18 | |
| SKU-5512 | Hardware | $89.00 | $31,150.00 | | Pending |
| SKU-6678 | Software | $200.00 | $42,000.00 | 2024-10-25 | High Value |
| SKU-7789 | Accessories | $8.50 | $5,950.00 | 2024-11-05 | |
| SKU-2234 | Electronics | $125.00 | $35,000.00 | | |
| SKU-1123 | Hardware | $55.00 | $27,500.00 | 2024-10-30 | Verified |
Recommendations:
1. Immediate review of SKU-4563 handling procedures. 2. Implement additional verification for critical
items. 3. Schedule follow-up audit for high-value products (SKU-6678, SKU-2234).
Approval:
@@ -0,0 +1,62 @@
BOOKING ORDER
Print Date 12/15/2024 14:30:22
Page 1 of 1
STARLIGHT CINEMAS
Orders
| Order / Rev: | 2024-12-5678 | | | Cinema: | | Downtown Multiplex |
| ------------ | -------------- | --- | --- | ---------------- | --- | ------------------ |
| Alt Order #: | SC-WINTER-2024 | | | Primary Contact: | | Sarah Johnson |
Product Desc: Holiday Movie Marathon Package Location: NYC-01
| Estimate: | EST-456 | | | Region: | | NORTHEAST |
| -------------------- | ----------------------- | --- | --- | ------- | --- | --------- |
| Booking Dates: | 12/20/2024 - 12/31/2024 | | | | | |
| Original Date / Rev: | 12/01/24 / 12/10/24 | | | | | |
| Order Type: | Premium Package | | | | | |
Booking Agency
| Name: | Premier Entertainment Group | | | | | |
| ---------------- | --------------------------- | --- | --- | -------------- | --- | --------- |
| | | | | Billing Type: | | Net 30 |
| Contact: | Michael Chen | | | | | |
| | | | | Payment Terms: | | Corporate |
| Billing Contact: | accounting@premierent.com | | | | | |
| | | | | Commission: | | 10% |
555 Broadway Suite 1200
New York, NY 10012
Customer
| Name: | Universal Studios Distribution | | | | | |
| -------------- | ------------------------------ | --- | --- | --- | --- | --- |
| Category: | Film Distributor | | | | | |
| Contact Email: | bookings@universalstudios.com | | | | | |
| Customer ID: | CUST-98765 | | | | | |
| Revenue Code: | FILM-PREMIUM | | | | | |
Booking Summary
| Start Date | End Date | # Shows | Gross Amount | Net Amount | | |
| ---------- | -------- | ------- | ------------ | ---------- | --- | --- |
| 12/20/24 | 12/31/24 | 48 | $12,500.00 | $11,250.00 | | |
Totals
| Month | # Shows | Gross Amount | | Net Amount | | Occupancy |
| ------------- | ------- | ------------ | --- | ---------- | --- | --------- |
| December 2024 | 48 | $12,500.00 | | $11,250.00 | | 85% |
| Totals | 48 | $12,500.00 | | $11,250.00 | | 85% |
Account Representatives
Representative Territory Region Start Date / End Date Commission %
| Sarah Johnson | NYC Metro | NORTHEAST | 12/20/24 - 12/31/24 | | 100% | |
| ------------- | --------- | --------- | ------------------- | --- | ---- | --- |
Show Schedule Details
Ln Screen Start End Movie Title Format Showtime Days Shows Rate Type Total
1 SCR-1 12/20/24 12/25/24 Holiday Spectacular IMAX 3D 7:00 PM Daily 12 $250 PM $3,000
(Runtime: 142 min); Holiday Season Premium
2 SCR-2 12/20/24 12/31/24 Winter Wonderland Standard 4:30 PM Daily 24 $150 MT $3,600
(Runtime: 98 min); Matinee Special
3 SCR-1 12/26/24 12/31/24 New Year Mystery 4DX 9:30 PM Daily 12 $300 PM $3,600
(Runtime: 116 min); Premium Experience
Show Details
| Show Screen | Date Range | Title | Showtime | Days Type | Rate | Revenue |
| ----------- | ---------- | ----- | -------- | --------- | ---- | ------- |
1 SCR-1 12/20-12/25 Holiday Spectacular 7:00 PM Daily PM $250 $3,000
This booking order is subject to cinema availability and standard terms.
2 SCR-2 12/20-12/31 Winter Wonderland 4:30 PM Daily MT $150 $3,600
All showtimes are approximate and subject to change.
3 SCR-1 12/26-12/31 New Year Mystery 9:30 PM Daily PM $300 $3,600
| Total Revenue: | | | | | | $12,500.00 |
| -------------- | --- | --- | --- | --- | --- | ---------- |