Spaces: moazx/AI-PDF-Tool

moazx committed on Nov 16

Commit

443e99e

1 Parent(s): a485ca1

update

Files changed (13)

.gitignore +14 -0
.python-version +1 -0
README.md +125 -7
app.py +295 -0
main.py +1309 -0
modal_app.py +112 -0
pdf_extractor_gui.py +624 -0
pyproject.toml +20 -0
run_flask_gpu.py +48 -0
static/css/styles.css +310 -0
static/js/app.js +482 -0
templates/index.html +183 -0
uv.lock +0 -0

.gitignore ADDED

	@@ -0,0 +1,14 @@

+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+# Virtual environments
+.venv
+pdfs/*
+/output4
+/output
+/uploads

.python-version ADDED

	@@ -0,0 +1 @@


+3.12

README.md CHANGED

@@ -1,10 +1,128 @@
 ---
-title: AI PDF Tool
-emoji: 🌖
-colorFrom: gray
-colorTo: green
-sdk: docker
-pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# PDF Layout Extraction Companion
+A streamlined workflow for extracting figures, tables, annotated layouts, and markdown text from scientific PDFs using [DocLayout-YOLO](https://github.com/juliozhao/DocLayout-YOLO), PyMuPDF, and Flask. The project exposes a command-line pipeline (`main.py`) and a modern Flask web UI (`app.py`).
+---
+## Features
+- **Layout-aware extraction** of figures and tables with YOLO-based detection
+- **Cross-page stitching** for multi-page tables, captions, titles, and body text
+- **Annotated PDF output** with bounding boxes for detected regions
+- **Markdown export** powered by `pymupdf4llm` / `pymupdf-layout`
+- **Flask Web UI** with modern design, dark/light theme, GPU/CPU status, and individual PDF viewing
+- Unified `output/<PDF stem>/` directory structure for CLI + UI runs
 ---
+## Requirements
+- Python 3.12+
+- [uv](https://docs.astral.sh/uv/latest/) (recommended) or `pip`
+- GPU optional (DocLayout-YOLO runs on CPU as well)
+Install dependencies:
+```bash
+uv sync
+```
+> If you prefer a manually managed virtualenv, create and activate it first, then run `uv pip install -r pyproject.toml` inside it.
 ---
+## Quick Start
+### Command Line Pipeline
+Process all PDFs in `./pdfs` and write outputs to `./output/<PDF stem>/`:
+```bash
+uv run python main.py
+```
+Each subdirectory contains:
+- `*_content_list.json` – metadata for extracted figures/tables
+- `*_layout.pdf` – annotated PDF with layout boxes
+- `*.md` – markdown export (if `pymupdf4llm` is installed)
+- `figures/` & `tables/` – cropped PNGs with stitched captions/titles
+### Flask Web App (Recommended)
+Launch the modern Flask web interface locally:
+```bash
+python run_flask_gpu.py
+```
+Then open your browser to `http://localhost:5000`
+**Features:**
+- Clean, modern UI with dark/light theme support
+- Multiple PDF upload and processing
+- Individual PDF output viewing with sidebar navigation
+- Real-time GPU/CPU status display
+- Image gallery for figures and tables
+- Markdown preview and download
+- Responsive design for mobile and desktop
+All Flask app runs also write into `./output/<PDF stem>/` using the same structure as the CLI.
+### Deploy to Modal.com (Cloud with GPU)
+Deploy your Flask app online with GPU support using Modal:
+```bash
+# Install Modal CLI
+pip install modal
+# Authenticate with Modal
+modal token new
+# Deploy to Modal
+modal deploy modal_app.py
+```
+See [MODAL_DEPLOYMENT.md](MODAL_DEPLOYMENT.md) for detailed instructions.
+**Benefits:**
+- GPU support (T4, A10G, or A100)
+- Pay-per-use pricing
+- Automatic HTTPS
+- Auto-scaling
+- Global deployment
+---
+## Configuration Highlights
+- **Detection model:** DocLayout-YOLO (`doclayout_yolo_docstructbench_imgsz1024.pt`)
+- **Detection thresholds:** configurable in `main.py`
+- **Layout stitching:** tables, captions, titles, body text
+- **Markdown extraction:** defaults to enabled (`pymupdf4llm.to_markdown`); falls back gracefully if the package is missing
+- **Output directory:** `./output` (configurable near the bottom of `main.py`)
+---
+## File Overview
+| Path | Description |
+|------|-------------|
+| `main.py` | CLI pipeline for batch PDF processing |
+| `app.py` | Flask web application (recommended UI) |
+| `run_flask_gpu.py` | Local Flask runner with GPU support |
+| `modal_app.py` | Modal.com deployment configuration (cloud GPU) |
+| `MODAL_DEPLOYMENT.md` | Modal.com deployment guide |
+| `templates/` | Flask HTML templates |
+| `static/` | Flask static files (CSS, JS) |
+| `pdfs/` | Source PDFs (gitignored) |
+| `output/` | Generated outputs per PDF |
+| `pyproject.toml` | Project metadata & dependency list |
+| `uv.lock` | Locked dependency versions (auto-maintained by `uv`) |
+---
+## Troubleshooting
+- **`ModuleNotFoundError: pymupdf4llm`** – install it via `uv pip install pymupdf4llm` (already listed in `pyproject.toml`).
+- **Slow performance** – ensure GPU CUDA drivers are available or reduce concurrency by toggling `USE_MULTIPROCESSING` in `main.py`.
+- **Large outputs** – clean the `output/` directory before reruns to avoid confusing duplicates.
+For additional logging, set `LOG_LEVEL` or edit the `logger` configuration in `main.py`.
+---
+## Acknowledgements
+- [DocLayout-YOLO](https://github.com/juliozhao/DocLayout-YOLO)
+- [PyMuPDF](https://pymupdf.readthedocs.io/)
+- [PyMuPDF4LLM](https://github.com/pymupdf/RAG/blob/main/pymupdf4llm.md)
+- [Flask](https://flask.palletsprojects.com/)
+Happy extracting! 🎉

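Reviewer note: the README describes a fixed per-PDF layout under `output/<PDF stem>/`. A minimal sketch of reading that layout back is shown here, assuming each `*_content_list.json` holds a list of objects with a `type` field (`figure` or `table`), as `app.py` in this commit expects. The helper name `summarize_output` is hypothetical and not part of the commit.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_output(stem_dir: Path) -> dict:
    """Count extracted elements by type for one output/<PDF stem>/ directory."""
    counts: Counter = Counter()
    # Each content list is assumed to be a JSON array of element dicts.
    for json_path in sorted(stem_dir.glob("*_content_list.json")):
        elements = json.loads(json_path.read_text(encoding="utf-8"))
        counts.update(e.get("type", "unknown") for e in elements)
    return {
        "stem": stem_dir.name,
        "figures": counts.get("figure", 0),
        "tables": counts.get("table", 0),
        "has_markdown": any(stem_dir.glob("*.md")),
        "has_layout_pdf": any(stem_dir.glob("*_layout.pdf")),
    }
```

Calling `summarize_output(Path("output") / "my_paper")` after a CLI or web run should give a quick check that the expected artifacts were written.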
app.py ADDED

	@@ -0,0 +1,295 @@

+import json
+import os
+import shutil
+from pathlib import Path
+from typing import Dict, List, Optional
+from flask import Flask, render_template, request, jsonify, send_file, send_from_directory
+from werkzeug.utils import secure_filename
+import torch
+import main as extractor
+from loguru import logger
+app = Flask(__name__)
+app.config['MAX_CONTENT_LENGTH'] = 500 * 1024 * 1024  # 500MB max file size
+app.config['UPLOAD_FOLDER'] = './uploads'
+app.config['OUTPUT_FOLDER'] = './output'
+# Ensure directories exist
+os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
+os.makedirs(app.config['OUTPUT_FOLDER'], exist_ok=True)
+# Global model instance
+_model = None
+def get_device_info() -> Dict[str, object]:
+    """Get information about GPU/CPU availability."""
+    cuda_available = torch.cuda.is_available()
+    device = "cuda" if cuda_available else "cpu"
+    info = {
+        "device": device,
+        "cuda_available": cuda_available,
+        "device_name": None,
+        "device_count": 0,
+    }
+    if cuda_available:
+        info["device_name"] = torch.cuda.get_device_name(0)
+        info["device_count"] = torch.cuda.device_count()
+    return info
+def load_model_once():
+    """Load the model once and cache it."""
+    global _model
+    if _model is None:
+        logger.info("Loading DocLayout-YOLO model...")
+        _model = extractor.get_model()
+        logger.info("Model loaded successfully")
+    return _model
+@app.route('/')
+def index():
+    """Main page."""
+    device_info = get_device_info()
+    return render_template('index.html', device_info=device_info)
+@app.route('/api/device-info')
+def device_info():
+    """API endpoint to get device information."""
+    return jsonify(get_device_info())
+@app.route('/api/upload', methods=['POST'])
+def upload_files():
+    """Handle multiple PDF file uploads."""
+    if 'files[]' not in request.files:
+        return jsonify({'error': 'No files provided'}), 400
+    files = request.files.getlist('files[]')
+    extraction_mode = request.form.get('extraction_mode', 'images')
+    include_images = extraction_mode != 'markdown'
+    include_markdown = extraction_mode != 'images'
+    if not files or all(f.filename == '' for f in files):
+        return jsonify({'error': 'No files selected'}), 400
+    results = []
+    for file in files:
+        if file and file.filename.lower().endswith('.pdf'):
+            try:
+                # Save uploaded file
+                filename = secure_filename(file.filename)
+                stem = Path(filename).stem
+                upload_path = Path(app.config['UPLOAD_FOLDER']) / filename
+                file.save(str(upload_path))
+                # Prepare output directory
+                output_dir = Path(app.config['OUTPUT_FOLDER']) / stem
+                output_dir.mkdir(parents=True, exist_ok=True)
+                # Move the uploaded PDF into the output directory (overwrites a previous upload)
+                pdf_path = output_dir / filename
+                upload_path.replace(pdf_path)
+                # Process PDF
+                extractor.USE_MULTIPROCESSING = False
+                logger.info(f"Processing {filename} (images={include_images}, markdown={include_markdown})")
+                if include_images:
+                    load_model_once()
+                extractor.process_pdf_with_pool(
+                    pdf_path,
+                    output_dir,
+                    pool=None,
+                    extract_images=include_images,
+                    extract_markdown=include_markdown,
+                )
+                # Collect results
+                json_path = output_dir / f"{stem}_content_list.json"
+                elements = []
+                if include_images and json_path.exists():
+                    elements = json.loads(json_path.read_text(encoding='utf-8'))
+                annotated_pdf = None
+                if include_images:
+                    candidate_pdf = output_dir / f"{stem}_layout.pdf"
+                    if candidate_pdf.exists():
+                        annotated_pdf = str(candidate_pdf.relative_to(app.config['OUTPUT_FOLDER']))
+                markdown_path = None
+                if include_markdown:
+                    candidate_md = output_dir / f"{stem}.md"
+                    if candidate_md.exists():
+                        markdown_path = str(candidate_md.relative_to(app.config['OUTPUT_FOLDER']))
+                # Get figure and table counts
+                figures = [e for e in elements if e.get('type') == 'figure']
+                tables = [e for e in elements if e.get('type') == 'table']
+                results.append({
+                    'filename': filename,
+                    'stem': stem,
+                    'output_dir': str(output_dir.relative_to(app.config['OUTPUT_FOLDER'])),
+                    'figures_count': len(figures),
+                    'tables_count': len(tables),
+                    'elements_count': len(elements),
+                    'annotated_pdf': annotated_pdf,
+                    'markdown_path': markdown_path,
+                    'include_images': include_images,
+                    'include_markdown': include_markdown,
+                })
+            except Exception as e:
+                logger.error(f"Error processing {file.filename}: {e}")
+                results.append({
+                    'filename': file.filename,
+                    'error': str(e)
+                })
+    return jsonify({'results': results})
+@app.route('/api/pdf-list')
+def pdf_list():
+    """Get list of processed PDFs."""
+    output_dir = Path(app.config['OUTPUT_FOLDER'])
+    pdfs = []
+    for item in output_dir.iterdir():
+        if item.is_dir():
+            # Check if this directory has processed content
+            json_files = list(item.glob('*_content_list.json'))
+            md_files = list(item.glob('*.md'))
+            pdf_files = list(item.glob('*.pdf'))
+            if json_files or md_files or pdf_files:
+                stem = item.name
+                pdfs.append({
+                    'stem': stem,
+                    'output_dir': str(item.relative_to(app.config['OUTPUT_FOLDER'])),
+                })
+    return jsonify({'pdfs': pdfs})
+@app.route('/api/pdf-details/<path:pdf_stem>')
+def pdf_details(pdf_stem):
+    """Get detailed information about a processed PDF."""
+    output_dir = Path(app.config['OUTPUT_FOLDER']) / pdf_stem
+    if not output_dir.exists():
+        return jsonify({'error': 'PDF not found'}), 404
+    # Load content list
+    json_files = list(output_dir.glob('*_content_list.json'))
+    elements = []
+    if json_files:
+        elements = json.loads(json_files[0].read_text(encoding='utf-8'))
+    # Get figures and tables
+    figures = [e for e in elements if e.get('type') == 'figure']
+    tables = [e for e in elements if e.get('type') == 'table']
+    # Get file paths
+    annotated_pdf = None
+    pdf_files = list(output_dir.glob('*_layout.pdf'))
+    if pdf_files:
+        annotated_pdf = str(pdf_files[0].relative_to(app.config['OUTPUT_FOLDER']))
+    markdown_path = None
+    md_files = list(output_dir.glob('*.md'))
+    if md_files:
+        markdown_path = str(md_files[0].relative_to(app.config['OUTPUT_FOLDER']))
+    # Get figure and table images
+    figure_dir = output_dir / 'figures'
+    table_dir = output_dir / 'tables'
+    figure_images = []
+    if figure_dir.exists():
+        figure_images = [str(f.relative_to(app.config['OUTPUT_FOLDER']))
+                        for f in sorted(figure_dir.glob('*.png'))]
+    table_images = []
+    if table_dir.exists():
+        table_images = [str(t.relative_to(app.config['OUTPUT_FOLDER']))
+                       for t in sorted(table_dir.glob('*.png'))]
+    return jsonify({
+        'stem': pdf_stem,
+        'figures': figures,
+        'tables': tables,
+        'figures_count': len(figures),
+        'tables_count': len(tables),
+        'elements_count': len(elements),
+        'annotated_pdf': annotated_pdf,
+        'markdown_path': markdown_path,
+        'figure_images': figure_images,
+        'table_images': table_images,
+    })
+@app.route('/output/<path:filename>')
+def output_file(filename):
+    """Serve output files (PDFs, images, markdown)."""
+    return send_from_directory(app.config['OUTPUT_FOLDER'], filename)
+def _delete_by_stem(stem_raw: str):
+    stem = (stem_raw or "").strip()
+    if not stem:
+        return jsonify({'error': 'Missing stem'}), 400
+    # Resolve output directory safely
+    output_root = Path(app.config['OUTPUT_FOLDER']).resolve()
+    target_dir = (output_root / stem).resolve()
+    # Prevent path traversal - ensure target is within output_root
+    if output_root not in target_dir.parents:
+        return jsonify({'error': 'Invalid stem path'}), 400
+    if not target_dir.exists() or not target_dir.is_dir():
+        return jsonify({'error': 'Not found'}), 404
+    # Delete the directory
+    shutil.rmtree(target_dir, ignore_errors=False)
+    logger.info(f"Deleted processed output: {target_dir}")
+    return jsonify({'ok': True, 'deleted': stem})
+@app.route('/api/delete', methods=['POST'])
+def delete_pdf():
+    """Delete a processed PDF directory by stem (JSON or form body)."""
+    try:
+        data = request.get_json(silent=True) or {}
+        stem = (data.get('stem') or request.form.get('stem') or '').strip()
+        return _delete_by_stem(stem)
+    except Exception as e:
+        logger.error(f"Delete failed: {e}")
+        return jsonify({'error': str(e)}), 500
+@app.route('/api/delete/<path:stem>', methods=['POST', 'GET'])
+def delete_pdf_by_path(stem: str):
+    """Alternate endpoint to delete using URL path, for clients avoiding bodies."""
+    try:
+        return _delete_by_stem(stem)
+    except Exception as e:
+        logger.error(f"Delete failed: {e}")
+        return jsonify({'error': str(e)}), 500
+if __name__ == '__main__':
+    # Keep the Werkzeug debugger off when binding to all interfaces
+    app.run(debug=False, host='0.0.0.0', port=5000)

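Reviewer note: the delete endpoints in `app.py` depend on a containment check before `shutil.rmtree` runs. A minimal sketch of such a check follows, assuming Python 3.9+ for `Path.is_relative_to`. The helper name `is_safe_child` is hypothetical. The point it illustrates is that the resolved target must sit strictly inside the root, so a stem such as `.` must be rejected along with `..`.

```python
from pathlib import Path


def is_safe_child(root: Path, stem: str) -> bool:
    """Return True only if root/stem resolves to a path strictly inside root."""
    root = root.resolve()
    target = (root / stem).resolve()
    # Reject the root itself (stem "" or ".") and anything that escapes it.
    return target != root and target.is_relative_to(root)
```

With a check like this, a request for `stem="."` or `stem="../uploads"` is refused before any directory is removed.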
main.py ADDED

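Reviewer note: several stitching heuristics in this file compare boxes with a horizontal overlap ratio, defined as the overlap width divided by the combined horizontal extent of the two boxes. The standalone sketch below reimplements that ratio for illustration. The function name mirrors the file's `_horizontal_overlap_ratio` but is not imported from it, and the sample boxes are made up.

```python
from typing import Sequence


def horizontal_overlap_ratio(box1: Sequence[float], box2: Sequence[float]) -> float:
    """Overlap width over combined horizontal extent of two [x0, y0, x1, y1] boxes."""
    overlap = min(box1[2], box2[2]) - max(box1[0], box2[0])
    extent = max(box1[2], box2[2]) - min(box1[0], box2[0])
    if overlap <= 0 or extent <= 0:
        return 0.0
    return overlap / extent


# Hypothetical pixel boxes: a figure, a caption under it, and a note to its right.
figure = [100, 50, 500, 300]
caption = [120, 310, 480, 340]
side_note = [520, 310, 700, 340]

print(horizontal_overlap_ratio(figure, caption))    # 360 / 400 = 0.9
print(horizontal_overlap_ratio(figure, side_note))  # no horizontal overlap: 0.0
```

Against the default `min_overlap` of 0.25 in `collect_caption_elements`, the caption would be merged into the figure crop and the side note would be skipped.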
	@@ -0,0 +1,1309 @@

+import os
+import json
+import signal
+import sys
+from pathlib import Path
+from typing import List, Dict, Tuple, Optional, Sequence, Set, Any
+from multiprocessing import Pool, cpu_count
+from functools import partial
+import fitz  # PyMuPDF (Still needed for drawing output PDF)
+import pypdfium2 as pdfium
+import torch
+from doclayout_yolo import YOLOv10
+from huggingface_hub import hf_hub_download
+from loguru import logger
+from PIL import Image
+import numpy as np
+try:
+    import pymupdf4llm  # type: ignore
+except ImportError:  # pragma: no cover - optional dependency
+    pymupdf4llm = None  # type: ignore
+# ----------------------------------------------------------------------
+# CONFIGURATION
+# ----------------------------------------------------------------------
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+# Model options
+MODEL_SIZE = 1024
+REPO_ID = "juliozhao/DocLayout-YOLO-DocStructBench"
+WEIGHTS_FILE = f"doclayout_yolo_docstructbench_imgsz{MODEL_SIZE}.pt"
+# Detection settings
+CONF_THRESHOLD = 0.25
+# Multiprocessing settings
+NUM_WORKERS = None  # None = auto (cpu_count - 1), or set to specific number like 4
+USE_MULTIPROCESSING = True  # Set to False to disable parallel processing entirely
+# ----------------------------------------------------------------------
+# Color map for the layout classes
+# ----------------------------------------------------------------------
+CLASS_COLORS = {
+    "text": (0, 128, 0),          # Dark Green
+    "title": (192, 0, 0),        # Dark Red
+    "figure": (0, 0, 192),       # Dark Blue
+    "table": (218, 165, 32),     # Goldenrod (Dark Yellow)
+    "list": (128, 0, 128),       # Purple
+    "header": (0, 128, 128),     # Teal
+    "footer": (100, 100, 100),   # Dark Gray
+    "figure_caption": (0, 0, 128), # Navy
+    "table_caption": (139, 69, 19),  # Saddle Brown
+    "table_footnote": (128, 0, 128), # Purple
+}
+# Global model instance (will be None in worker processes until loaded)
+_model = None
+_shutdown_requested = False
+# ----------------------------------------------------------------------
+# Signal handler for graceful shutdown
+# ----------------------------------------------------------------------
+def signal_handler(signum, frame):
+    """Handle interrupt signals gracefully."""
+    global _shutdown_requested
+    if not _shutdown_requested:
+        _shutdown_requested = True
+        logger.warning("\n⚠️  Interrupt received! Finishing current page and shutting down gracefully...")
+        logger.warning("Press Ctrl+C again to force quit (may leave incomplete files)")
+    else:
+        logger.error("\n❌ Force quit requested. Exiting immediately.")
+        sys.exit(1)
+def setup_signal_handlers():
+    """Setup signal handlers for graceful shutdown."""
+    signal.signal(signal.SIGINT, signal_handler)
+    signal.signal(signal.SIGTERM, signal_handler)
+# ----------------------------------------------------------------------
+# Model loader function
+# ----------------------------------------------------------------------
+def get_model():
+    """Lazy load the model (only once per process)."""
+    global _model
+    if _model is None:
+        weights_path = hf_hub_download(repo_id=REPO_ID, filename=WEIGHTS_FILE)
+        _model = YOLOv10(weights_path)
+        logger.info(f"✓ Model loaded in worker process (PID: {os.getpid()})")
+    return _model
+# ----------------------------------------------------------------------
+# Worker initialization function
+# ----------------------------------------------------------------------
+def init_worker():
+    """Initialize worker process - loads model once at startup."""
+    try:
+        get_model()
+        logger.success(f"Worker {os.getpid()} ready")
+    except Exception as e:
+        logger.error(f"Failed to initialize worker {os.getpid()}: {e}")
+        raise
+# ----------------------------------------------------------------------
+# Run layout detection on a single page image (YOLO)
+# ----------------------------------------------------------------------
+def detect_page(pil_img: Image.Image) -> List[dict]:
+    """Detect layout elements using YOLO model."""
+    model = get_model()  # Will return already-loaded model in worker
+    img_cv = np.array(pil_img)
+    results = model.predict(
+        img_cv,
+        imgsz=MODEL_SIZE,
+        conf=CONF_THRESHOLD,
+        device=DEVICE,
+        verbose=False
+    )
+    dets = []
+    for i, box in enumerate(results[0].boxes):
+        cls_id = int(box.cls.item())
+        name = results[0].names[cls_id]
+        conf = float(box.conf.item())
+        x0, y0, x1, y1 = box.xyxy[0].cpu().numpy().tolist()
+        dets.append({
+            "name": name,
+            "bbox": [x0, y0, x1, y1],
+            "conf": conf,
+            "source": "yolo",
+            "index": i
+        })
+    return dets
+# ----------------------------------------------------------------------
+# Crop & save figure/table regions (with captions)
+# ----------------------------------------------------------------------
+def get_union_box(box1: List[float], box2: List[float]) -> List[float]:
+    """Get the bounding box enclosing two boxes."""
+    x0 = min(box1[0], box2[0])
+    y0 = min(box1[1], box2[1])
+    x1 = max(box1[2], box2[2])
+    y1 = max(box1[3], box2[3])
+    return [x0, y0, x1, y1]
+def collect_caption_elements(
+    element: Dict,
+    all_dets: List[Dict],
+    target_name: str,
+    max_vertical_gap: float = 60.0,
+    min_overlap: float = 0.25,
+) -> List[Dict]:
+    """
+    Collect contiguous caption detections directly below a figure/table.
+    """
+    base_box = element["bbox"]
+    base_bottom = base_box[3]
+    selected: List[Dict] = []
+    last_bottom = base_bottom
+    relevant = [
+        d for d in all_dets
+        if d["name"] == target_name and d["bbox"][1] >= base_bottom - 5
+    ]
+    relevant.sort(key=lambda d: d["bbox"][1])
+    for cand in relevant:
+        cand_box = cand["bbox"]
+        top = cand_box[1]
+        if selected and top - last_bottom > max_vertical_gap:
+            break
+        if selected:
+            overlap = _horizontal_overlap_ratio(selected[-1]["bbox"], cand_box)
+        else:
+            overlap = _horizontal_overlap_ratio(base_box, cand_box)
+        if overlap < min_overlap:
+            continue
+        selected.append(cand)
+        last_bottom = cand_box[3]
+    return selected
+def collect_title_and_text_segments(
+    element: Dict,
+    all_dets: List[Dict],
+    processed_indices: Set[int],
+    settings: Optional[Dict[str, float]] = None,
+) -> Tuple[List[Dict], List[Dict]]:
+    """
+    Locate a title below the element and any contiguous text blocks directly beneath it.
+    """
+    if settings is None:
+        settings = TITLE_TEXT_ASSOCIATION
+    if not element.get("bbox"):
+        return [], []
+    figure_box = element["bbox"]
+    figure_bottom = figure_box[3]
+    candidates = [
+        d for d in all_dets
+        if d.get("bbox") and d["index"] not in processed_indices
+    ]
+    candidates.sort(key=lambda d: d["bbox"][1])
+    titles: List[Dict] = []
+    texts: List[Dict] = []
+    for idx, det in enumerate(candidates):
+        if det["name"] != "title":
+            continue
+        title_box = det["bbox"]
+        if title_box[1] < figure_bottom - 5:
+            continue
+        vertical_gap = title_box[1] - figure_bottom
+        if vertical_gap > settings["max_title_gap"]:
+            break
+        overlap = _horizontal_overlap_ratio(figure_box, title_box)
+        if overlap < settings["min_overlap"]:
+            continue
+        titles.append(det)
+        last_bottom = title_box[3]
+        for follower in candidates[idx + 1 :]:
+            if follower["name"] == "title":
+                break
+            if follower["name"] != "text":
+                continue
+            text_box = follower["bbox"]
+            if text_box[1] < title_box[1]:
+                continue
+            gap = text_box[1] - last_bottom
+            if gap > settings["max_text_gap"]:
+                break
+            if _horizontal_overlap_ratio(title_box, text_box) < settings["min_overlap"]:
+                continue
+            texts.append(follower)
+            last_bottom = text_box[3]
+        break
+    return titles, texts
+def save_layout_elements(pil_img: Image.Image, page_num: int,
+                         dets: List[dict], out_dir: Path) -> List[dict]:
+    """Save figure and table crops, merging captions."""
+    fig_dir = out_dir / "figures"
+    tab_dir = out_dir / "tables"
+    os.makedirs(fig_dir, exist_ok=True)
+    os.makedirs(tab_dir, exist_ok=True)
+    infos = []
+    fig_count = 0
+    tab_count = 0
+    processed_indices = set()
+    for i, d in enumerate(dets):
+        if d["index"] in processed_indices:
+            continue
+        name = d["name"].lower()
+        final_box = d["bbox"]
+        caption_segments: List[Dict] = []
+        title_segments: List[Dict] = []
+        text_segments: List[Dict] = []
+        if name == "figure":
+            elem_type = "figure"
+            path_template = fig_dir / f"page_{page_num + 1}_fig_{fig_count}.png"
+            fig_count += 1
+            caption_segments = collect_caption_elements(d, dets, "figure_caption")
+            for cap in caption_segments:
+                final_box = get_union_box(final_box, cap["bbox"])
+                processed_indices.add(cap["index"])
+            title_segments, text_segments = collect_title_and_text_segments(
+                d, dets, processed_indices
+            )
+            for seg in title_segments + text_segments:
+                final_box = get_union_box(final_box, seg["bbox"])
+                processed_indices.add(seg["index"])
+        elif name == "table":
+            elem_type = "table"
+            path_template = tab_dir / f"page_{page_num + 1}_tab_{tab_count}.png"
+            tab_count += 1
+            caption_segments = collect_caption_elements(d, dets, "table_caption")
+            for cap in caption_segments:
+                final_box = get_union_box(final_box, cap["bbox"])
+                processed_indices.add(cap["index"])
+        else:
+            continue
+        x0, y0, x1, y1 = map(int, final_box)
+        crop = pil_img.crop((x0, y0, x1, y1))
+        if crop.mode == "CMYK":
+            crop = crop.convert("RGB")
+        crop.save(path_template)
+        info_data = {
+            "type": elem_type,
+            "page": page_num + 1,
+            "bbox_pixels": final_box,
+            "conf": d["conf"],
+            "source": d.get("source", "yolo"),
+            "image_path": str(path_template.relative_to(out_dir)),
+            "width": int(x1 - x0),
+            "height": int(y1 - y0),
+            "page_width": pil_img.width,
+            "page_height": pil_img.height,
+        }
+        if caption_segments:
+            info_data["captions"] = [
+                {
+                    "bbox": cap["bbox"],
+                    "conf": cap.get("conf"),
+                    "index": cap["index"],
+                    "source": cap.get("source"),
+                    "page": page_num + 1,
+                }
+                for cap in caption_segments
+            ]
+        if title_segments:
+            info_data["titles"] = [
+                {
+                    "bbox": seg["bbox"],
+                    "conf": seg.get("conf"),
+                    "index": seg["index"],
+                    "source": seg.get("source"),
+                    "page": page_num + 1,
+                }
+                for seg in title_segments
+            ]
+        if text_segments:
+            info_data["texts"] = [
+                {
+                    "bbox": seg["bbox"],
+                    "conf": seg.get("conf"),
+                    "index": seg["index"],
+                    "source": seg.get("source"),
+                    "page": page_num + 1,
+                }
+                for seg in text_segments
+            ]
+        infos.append(info_data)
+    return infos
+TABLE_STITCH_TOLERANCES = {
+    "x_tol": 60,
+    "y_tol": 60,
+    "width_tol": 120,
+    "height_tol": 120,
+}
+CROSS_PAGE_CAPTION_THRESHOLDS = {
+    "max_top_ratio": 0.35,
+    "max_top_pixels": 220,
+    "x_tol": 120,
+    "width_tol": 200,
+    "min_overlap": 0.05,
+}
+TITLE_TEXT_ASSOCIATION = {
+    "max_title_gap": 220,
+    "max_text_gap": 160,
+    "min_overlap": 0.2,
+}
+def _horizontal_overlap_ratio(box1: List[float], box2: List[float]) -> float:
+    """Compute horizontal overlap ratio between two bounding boxes."""
+    x_left = max(box1[0], box2[0])
+    x_right = min(box1[2], box2[2])
+    overlap = max(0.0, x_right - x_left)
+    if overlap <= 0:
+        return 0.0
+    width_union = max(box1[2], box2[2]) - min(box1[0], box2[0])
+    if width_union <= 0:
+        return 0.0
+    return overlap / width_union
+def _bbox_to_rect(bbox: List[float]) -> Tuple[int, int, int, int]:
+    """Convert [x0, y0, x1, y1] into (x, y, w, h)."""
+    x0, y0, x1, y1 = bbox
+    return int(x0), int(y0), int(x1 - x0), int(y1 - y0)
+def _open_table_image(elem: Dict, out_dir: Path) -> Optional[Image.Image]:
+    """Open a table image relative to the output directory."""
+    image_path = out_dir / elem["image_path"]
+    if not image_path.exists():
+        logger.warning(f"Missing table crop for stitching: {image_path}")
+        return None
+    img = Image.open(image_path)
+    if img.mode != "RGB":
+        img = img.convert("RGB")
+    return img
+def _pad_width(img: Image.Image, target_width: int) -> Image.Image:
+    if img.width >= target_width:
+        return img
+    canvas = Image.new("RGB", (target_width, img.height), color=(255, 255, 255))
+    canvas.paste(img, (0, 0))
+    return canvas
+def _pad_height(img: Image.Image, target_height: int) -> Image.Image:
+    if img.height >= target_height:
+        return img
+    canvas = Image.new("RGB", (img.width, target_height), color=(255, 255, 255))
+    canvas.paste(img, (0, 0))
+    return canvas
+def _append_segment_image(
+    base_img: Image.Image,
+    segment_img: Image.Image,
+    resize_to_base: bool = False,
+) -> Image.Image:
+    """Append segment image below base image with optional width alignment."""
+    if base_img.mode != "RGB":
+        base_img = base_img.convert("RGB")
+    if segment_img.mode != "RGB":
+        segment_img = segment_img.convert("RGB")
+    if resize_to_base and segment_img.width > 0 and base_img.width > 0:
+        segment_img = segment_img.resize(
+            (
+                base_img.width,
+                max(1, int(segment_img.height * (base_img.width / segment_img.width))),
+            ),
+            Image.Resampling.LANCZOS,
+        )
+    target_width = max(base_img.width, segment_img.width)
+    base_img = _pad_width(base_img, target_width)
+    segment_img = _pad_width(segment_img, target_width)
+    stitched = Image.new(
+        "RGB",
+        (target_width, base_img.height + segment_img.height),
+        color=(255, 255, 255),
+    )
+    stitched.paste(base_img, (0, 0))
+    stitched.paste(segment_img, (0, base_img.height))
+    return stitched
+def _render_pdf_page(
+    pdf_doc: pdfium.PdfDocument,
+    page_index: int,
+    scale: float,
+    cache: Dict[int, Image.Image],
+) -> Optional[Image.Image]:
+    """Render a PDF page to a PIL image with caching."""
+    if page_index in cache:
+        return cache[page_index]
+    try:
+        page = pdf_doc[page_index]
+        bitmap = page.render(scale=scale)
+        pil_img = bitmap.to_pil()
+        page.close()
+    except Exception as exc:
+        logger.error(f"Failed to render page {page_index + 1} for caption stitching: {exc}")
+        return None
+    cache[page_index] = pil_img
+    return pil_img
+def _crop_pdf_region(
+    page_img: Optional[Image.Image], bbox: List[float]
+) -> Optional[Image.Image]:
+    """Crop a region from a rendered PDF page."""
+    if page_img is None:
+        return None
+    x0, y0, x1, y1 = map(int, bbox)
+    x0 = max(0, x0)
+    y0 = max(0, y0)
+    x1 = min(page_img.width, max(x0 + 1, x1))
+    y1 = min(page_img.height, max(y0 + 1, y1))
+    if x0 >= x1 or y0 >= y1:
+        return None
+    crop = page_img.crop((x0, y0, x1, y1))
+    if crop.mode == "CMYK":
+        crop = crop.convert("RGB")
+    return crop
+def write_markdown_document(pdf_path: Path, out_dir: Path) -> Optional[Path]:
+    """
+    Extract markdown text from a PDF using PyMuPDF4LLM and write it to disk.
+    """
+    if pymupdf4llm is None:
+        logger.warning(
+            "Skipping markdown extraction for %s because pymupdf4llm is not installed.",
+            pdf_path.name,
+        )
+        return None
+    try:
+        markdown_content = pymupdf4llm.to_markdown(str(pdf_path))
+    except Exception as exc:
+        logger.error(f"  Failed to create markdown for {pdf_path.name}: {exc}")
+        return None
+    if isinstance(markdown_content, list):
+        markdown_content = "\n\n".join(
+            part for part in markdown_content if isinstance(part, str)
+        )
+    if not isinstance(markdown_content, str):
+        logger.error(
+            f"  Unexpected markdown output type {type(markdown_content)} for {pdf_path.name}"
+        )
+        return None
+    markdown_content = markdown_content.strip()
+    if not markdown_content:
+        logger.warning(f"  No textual content extracted from {pdf_path.name}")
+        return None
+    if not markdown_content.endswith("\n"):
+        markdown_content += "\n"
+    md_path = out_dir / f"{pdf_path.stem}.md"
+    md_path.write_text(markdown_content, encoding="utf-8")
+    logger.info(f"  Saved markdown to {md_path.name}")
+    return md_path
+def _collect_text_under_title_cross_page(
+    title_det: Dict,
+    sorted_dets: List[Dict],
+    start_idx: int,
+    page_idx: int,
+    used_indices: Set[Tuple[int, int]],
+    settings: Optional[Dict[str, float]] = None,
+) -> List[Dict]:
+    """Collect text elements directly below a title on the next page."""
+    if settings is None:
+        settings = TITLE_TEXT_ASSOCIATION
+    texts: List[Dict] = []
+    title_box = title_det["bbox"]
+    last_bottom = title_box[3]
+    for follower in sorted_dets[start_idx + 1 :]:
+        det_index = follower.get("index")
+        if det_index is None or (page_idx, det_index) in used_indices:
+            continue
+        if follower["name"] == "title":
+            break
+        if follower["name"] != "text":
+            continue
+        text_box = follower["bbox"]
+        if text_box[1] < title_box[1]:
+            continue
+        gap = text_box[1] - last_bottom
+        if gap > settings["max_text_gap"]:
+            break
+        if _horizontal_overlap_ratio(title_box, text_box) < settings["min_overlap"]:
+            continue
+        texts.append(follower)
+        last_bottom = text_box[3]
+    return texts
+def attach_cross_page_figure_captions(
+    elements: List[Dict],
+    all_dets: Sequence[Optional[List[Dict[str, Any]]]],
+    pdf_bytes: bytes,
+    out_dir: Path,
+    scale: float,
+) -> List[Dict]:
+    """
+    If a figure caption appears on the next page, stitch it to the prior figure.
+    """
+    figures = [elem for elem in elements if elem.get("type") == "figure"]
+    if not figures or not all_dets:
+        return elements
+    try:
+        pdf_doc = pdfium.PdfDocument(pdf_bytes)
+    except Exception as exc:
+        logger.error(f"Unable to reopen PDF for figure caption stitching: {exc}")
+        return elements
+    page_cache: Dict[int, Image.Image] = {}
+    used_following_ids: Set[Tuple[int, int]] = set()
+    # Mark existing caption/title/text detections as used
+    for elem in figures:
+        for key in ("captions", "titles", "texts"):
+            for seg in elem.get(key, []) or []:
+                idx = seg.get("index")
+                page_no = seg.get("page")
+                if idx is None or page_no is None:
+                    continue
+                used_following_ids.add((page_no - 1, idx))
+    for elem in figures:
+        page_no = elem.get("page")
+        bbox = elem.get("bbox_pixels")
+        if page_no is None or bbox is None:
+            continue
+        current_idx = page_no - 1
+        next_idx = current_idx + 1
+        if next_idx >= len(all_dets):
+            continue
+        next_dets = all_dets[next_idx]
+        if not next_dets:
+            continue
+        fig_width = bbox[2] - bbox[0]
+        page_img = _render_pdf_page(pdf_doc, next_idx, scale, page_cache)
+        if page_img is None:
+            continue
+        next_page_height = page_img.height
+        max_top_allowed = min(
+            CROSS_PAGE_CAPTION_THRESHOLDS["max_top_pixels"],
+            int(next_page_height * CROSS_PAGE_CAPTION_THRESHOLDS["max_top_ratio"]),
+        )
+        sorted_next = sorted(
+            [det for det in next_dets if det.get("bbox")],
+            key=lambda det: det["bbox"][1],
+        )
+        caption_candidate: Optional[Tuple[Dict, int]] = None
+        caption_candidates = []
+        for det in sorted_next:
+            if det.get("name") != "figure_caption":
+                continue
+            det_index = det.get("index")
+            if det_index is None or (next_idx, det_index) in used_following_ids:
+                continue
+            det_bbox = det.get("bbox")
+            if not det_bbox or det_bbox[1] > max_top_allowed:
+                continue
+            overlap = _horizontal_overlap_ratio(bbox, det_bbox)
+            x_diff = abs(bbox[0] - det_bbox[0])
+            width_diff = abs((bbox[2] - bbox[0]) - (det_bbox[2] - det_bbox[0]))
+            if overlap < CROSS_PAGE_CAPTION_THRESHOLDS["min_overlap"]:
+                if (
+                    x_diff > CROSS_PAGE_CAPTION_THRESHOLDS["x_tol"]
+                    or width_diff > CROSS_PAGE_CAPTION_THRESHOLDS["width_tol"]
+                ):
+                    continue
+            score = width_diff + 0.5 * x_diff
+            caption_candidates.append((score, det, det_index))
+        if caption_candidates:
+            caption_candidates.sort(key=lambda item: item[0])
+            _, best_det, best_index = caption_candidates[0]
+            caption_candidate = (best_det, best_index)
+        title_candidate: Optional[Tuple[Dict, int]] = None
+        title_texts: List[Dict] = []
+        for idx_sorted, det in enumerate(sorted_next):
+            if det.get("name") != "title":
+                continue
+            det_index = det.get("index")
+            if det_index is None or (next_idx, det_index) in used_following_ids:
+                continue
+            det_bbox = det.get("bbox")
+            if not det_bbox or det_bbox[1] > max_top_allowed:
+                continue
+            overlap = _horizontal_overlap_ratio(bbox, det_bbox)
+            x_diff = abs(bbox[0] - det_bbox[0])
+            if (
+                overlap < TITLE_TEXT_ASSOCIATION["min_overlap"]
+                and x_diff > CROSS_PAGE_CAPTION_THRESHOLDS["x_tol"]
+            ):
+                continue
+            title_candidate = (det, det_index)
+            title_texts = _collect_text_under_title_cross_page(
+                det, sorted_next, idx_sorted, next_idx, used_following_ids
+            )
+            break
+        if not caption_candidate and not title_candidate and not title_texts:
+            continue
+        figure_path = out_dir / elem["image_path"]
+        if not figure_path.exists():
+            continue
+        figure_img = Image.open(figure_path)
+        if figure_img.mode == "CMYK":
+            figure_img = figure_img.convert("RGB")
+        segments_added = False
+        if caption_candidate:
+            cap_det, cap_index = caption_candidate
+            caption_crop = _crop_pdf_region(page_img, cap_det["bbox"])
+            if caption_crop is not None:
+                figure_img = _append_segment_image(
+                    figure_img, caption_crop, resize_to_base=True
+                )
+                elem.setdefault("captions", [])
+                elem["captions"].append(
+                    {
+                        "bbox": cap_det["bbox"],
+                        "conf": cap_det.get("conf"),
+                        "index": cap_index,
+                        "source": cap_det.get("source"),
+                        "page": next_idx + 1,
+                    }
+                )
+                used_following_ids.add((next_idx, cap_index))
+                segments_added = True
+        if title_candidate:
+            title_det, title_index = title_candidate
+            title_crop = _crop_pdf_region(page_img, title_det["bbox"])
+            if title_crop is not None:
+                figure_img = _append_segment_image(figure_img, title_crop)
+                elem.setdefault("titles", [])
+                elem["titles"].append(
+                    {
+                        "bbox": title_det["bbox"],
+                        "conf": title_det.get("conf"),
+                        "index": title_index,
+                        "source": title_det.get("source"),
+                        "page": next_idx + 1,
+                    }
+                )
+                used_following_ids.add((next_idx, title_index))
+                segments_added = True
+            for text_det in title_texts:
+                text_index = text_det.get("index")
+                text_crop = _crop_pdf_region(page_img, text_det["bbox"])
+                if text_crop is None:
+                    continue
+                figure_img = _append_segment_image(figure_img, text_crop)
+                elem.setdefault("texts", [])
+                elem["texts"].append(
+                    {
+                        "bbox": text_det["bbox"],
+                        "conf": text_det.get("conf"),
+                        "index": text_index,
+                        "source": text_det.get("source"),
+                        "page": next_idx + 1,
+                    }
+                )
+                if text_index is not None:
+                    used_following_ids.add((next_idx, text_index))
+                segments_added = True
+        if not segments_added:
+            continue
+        figure_img.save(figure_path)
+        elem["width"] = figure_img.width
+        elem["height"] = figure_img.height
+        span = elem.get("page_span")
+        if span:
+            if next_idx + 1 not in span:
+                span.append(next_idx + 1)
+        else:
+            base_page = elem.get("page")
+            new_span = [page for page in (base_page, next_idx + 1) if page is not None]
+            elem["page_span"] = new_span
+    pdf_doc.close()
+    return elements
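Review note: the caption selection in `attach_cross_page_figure_captions` ranks each surviving candidate by `width_diff + 0.5 * x_diff` and keeps the lowest score. The sketch below isolates that ranking so it can be checked by hand. `caption_score`, `pick_caption`, and the sample boxes are hypothetical names and data, and the overlap and top-of-page filters from the real function are deliberately left out.

```python
from typing import List, Optional, Sequence

BBox = Sequence[float]  # (x0, y0, x1, y1) in rendered-page pixels


def caption_score(figure_bbox: BBox, caption_bbox: BBox) -> float:
    """Lower is better: width mismatch counts fully, left-edge offset at half weight."""
    x_diff = abs(figure_bbox[0] - caption_bbox[0])
    width_diff = abs(
        (figure_bbox[2] - figure_bbox[0]) - (caption_bbox[2] - caption_bbox[0])
    )
    return width_diff + 0.5 * x_diff


def pick_caption(figure_bbox: BBox, candidates: List[BBox]) -> Optional[BBox]:
    """Return the lowest-scoring candidate, or None when there are no candidates."""
    if not candidates:
        return None
    return min(candidates, key=lambda box: caption_score(figure_bbox, box))


# Hypothetical figure on one page and two caption boxes near the top of the next.
figure = (100, 900, 700, 1500)
candidates = [
    (400, 40, 700, 90),  # half the figure width and shifted right
    (110, 40, 690, 90),  # nearly the same left edge and width
]
best = pick_caption(figure, candidates)
```

With these sample boxes the second candidate wins, because both its width and its left edge are close to the figure's.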
+def _stitch_table_pair(
+    base_elem: Dict,
+    candidate_elem: Dict,
+    out_dir: Path,
+    merge_index: int,
+    stitch_type: str,
+) -> Optional[Dict]:
+    """Stitch two table crops either vertically or horizontally."""
+    base_img = _open_table_image(base_elem, out_dir)
+    candidate_img = _open_table_image(candidate_elem, out_dir)
+    if base_img is None or candidate_img is None:
+        return None
+    tables_dir = out_dir / "tables"
+    tables_dir.mkdir(parents=True, exist_ok=True)
+    if stitch_type == "vertical":
+        target_width = max(base_img.width, candidate_img.width)
+        base_img = _pad_width(base_img, target_width)
+        candidate_img = _pad_width(candidate_img, target_width)
+        merged_height = base_img.height + candidate_img.height
+        stitched = Image.new("RGB", (target_width, merged_height), color=(255, 255, 255))
+        stitched.paste(base_img, (0, 0))
+        stitched.paste(candidate_img, (0, base_img.height))
+    else:
+        target_height = max(base_img.height, candidate_img.height)
+        base_img = _pad_height(base_img, target_height)
+        candidate_img = _pad_height(candidate_img, target_height)
+        merged_width = base_img.width + candidate_img.width
+        stitched = Image.new("RGB", (merged_width, target_height), color=(255, 255, 255))
+        stitched.paste(base_img, (0, 0))
+        stitched.paste(candidate_img, (base_img.width, 0))
+    merged_name = (
+        f"page_{base_elem['page']}_to_{candidate_elem['page']}_"
+        f"table_merged_{merge_index}.png"
+    )
+    merged_path = tables_dir / merged_name
+    stitched.save(merged_path)
+    # Remove original partial crops to avoid duplicates
+    (out_dir / base_elem["image_path"]).unlink(missing_ok=True)
+    (out_dir / candidate_elem["image_path"]).unlink(missing_ok=True)
+    new_bbox = [
+        min(base_elem["bbox_pixels"][0], candidate_elem["bbox_pixels"][0]),
+        min(base_elem["bbox_pixels"][1], candidate_elem["bbox_pixels"][1]),
+        max(base_elem["bbox_pixels"][2], candidate_elem["bbox_pixels"][2]),
+        max(base_elem["bbox_pixels"][3], candidate_elem["bbox_pixels"][3]),
+    ]
+    merged_elem = base_elem.copy()
+    merged_elem["page_span"] = [base_elem["page"], candidate_elem["page"]]
+    merged_elem["box_refs"] = [
+        {"page": base_elem["page"], "image_path": base_elem["image_path"]},
+        {"page": candidate_elem["page"], "image_path": candidate_elem["image_path"]},
+    ]
+    merged_elem["bbox_pixels"] = new_bbox
+    merged_elem["image_path"] = str(merged_path.relative_to(out_dir))
+    merged_elem["width"] = stitched.width
+    merged_elem["height"] = stitched.height
+    merged_elem["page_height"] = stitched.height
+    merged_elem["conf"] = min(
+        base_elem.get("conf", 1.0), candidate_elem.get("conf", 1.0)
+    )
+    return merged_elem
+def merge_spanning_tables(elements: List[Dict], out_dir: Path) -> List[Dict]:
+    """
+    Stitch table crops that continue across adjacent pages using the heuristic
+    from the legacy OpenCV-based extractor.
+    """
+    if not elements:
+        return elements
+    tables_by_page: Dict[int, List[Dict]] = {}
+    non_tables: List[Dict] = []
+    for elem in elements:
+        if elem.get("type") != "table":
+            non_tables.append(elem)
+            continue
+        page = elem.get("page")
+        if not isinstance(page, int):
+            non_tables.append(elem)
+            continue
+        tables_by_page.setdefault(page, []).append(elem)
+    merged_results: List[Dict] = []
+    used_next: Dict[int, Set[int]] = {}
+    merge_counter = 0
+    for page in sorted(tables_by_page.keys()):
+        current_tables = tables_by_page.get(page, [])
+        next_page_tables = tables_by_page.get(page + 1, [])
+        next_used_indices = used_next.get(page + 1, set())
+        current_used_indices = used_next.get(page, set())
+        for idx_current, table_elem in enumerate(current_tables):
+            if idx_current in current_used_indices:
+                continue
+            if not next_page_tables:
+                merged_results.append(table_elem)
+                continue
+            x, y, w, h = _bbox_to_rect(table_elem["bbox_pixels"])
+            matched = False
+            for idx, candidate in enumerate(next_page_tables):
+                if idx in next_used_indices:
+                    continue
+                if candidate.get("type") != "table":
+                    continue
+                cx, cy, cw, ch = _bbox_to_rect(candidate["bbox_pixels"])
+                vertical_match = (
+                    abs(x - cx) <= TABLE_STITCH_TOLERANCES["x_tol"]
+                    and abs((x + w) - (cx + cw)) <= TABLE_STITCH_TOLERANCES["width_tol"]
+                )
+                horizontal_match = (
+                    abs(y - cy) <= TABLE_STITCH_TOLERANCES["y_tol"]
+                    and abs((y + h) - (cy + ch))
+                    <= TABLE_STITCH_TOLERANCES["height_tol"]
+                )
+                stitch_type = "vertical" if vertical_match else None
+                if not stitch_type and horizontal_match:
+                    stitch_type = "horizontal"
+                if not stitch_type:
+                    continue
+                merge_counter += 1
+                merged_elem = _stitch_table_pair(
+                    table_elem, candidate, out_dir, merge_counter, stitch_type
+                )
+                if merged_elem is None:
+                    continue
+                merged_results.append(merged_elem)
+                next_used_indices.add(idx)
+                matched = True
+                break
+            if not matched:
+                merged_results.append(table_elem)
+        used_next[page + 1] = next_used_indices
+    merged_results.extend(non_tables)
+    return merged_results
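Review note: the matching rule inside `merge_spanning_tables` compares the edges of a table with each table on the following page and prefers a vertical stitch over a horizontal one. A minimal sketch of that decision is shown below. The tolerance values are placeholders (the real `TABLE_STITCH_TOLERANCES` is defined elsewhere in main.py and may differ), `bbox_to_rect` reflects an assumption about what `_bbox_to_rect` returns, and `classify_stitch` is a hypothetical helper.

```python
from typing import Dict, Optional, Sequence, Tuple

# Placeholder tolerances in pixels; not the values used by main.py.
TOLERANCES: Dict[str, float] = {
    "x_tol": 60,
    "width_tol": 120,
    "y_tol": 60,
    "height_tol": 120,
}


def bbox_to_rect(bbox: Sequence[float]) -> Tuple[float, float, float, float]:
    """Assumed behaviour of _bbox_to_rect: (x0, y0, x1, y1) -> (x, y, w, h)."""
    x0, y0, x1, y1 = bbox
    return x0, y0, x1 - x0, y1 - y0


def classify_stitch(
    base_bbox: Sequence[float],
    cand_bbox: Sequence[float],
    tol: Dict[str, float] = TOLERANCES,
) -> Optional[str]:
    """Return "vertical", "horizontal", or None when the boxes do not line up."""
    x, y, w, h = bbox_to_rect(base_bbox)
    cx, cy, cw, ch = bbox_to_rect(cand_bbox)
    # Left and right edges agree: the table continues downward on the next page.
    if abs(x - cx) <= tol["x_tol"] and abs((x + w) - (cx + cw)) <= tol["width_tol"]:
        return "vertical"
    # Top and bottom edges agree: the table continues sideways.
    if abs(y - cy) <= tol["y_tol"] and abs((y + h) - (cy + ch)) <= tol["height_tol"]:
        return "horizontal"
    return None


down = classify_stitch((100, 1200, 900, 1500), (110, 80, 905, 600))
side = classify_stitch((100, 200, 500, 900), (700, 210, 1100, 880))
none = classify_stitch((100, 200, 500, 900), (700, 600, 1100, 1400))
```

Note that the rule only looks at edge positions, so two unrelated tables that happen to share margins on adjacent pages would also be stitched.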
+# ----------------------------------------------------------------------
+# Draw layout boxes on the original PDF
+# ----------------------------------------------------------------------
+def draw_layout_pdf(pdf_bytes: bytes, all_dets: List[List[dict]],
+                    scale: float, out_path: Path):
+    """Annotate PDF with semi-transparent bounding boxes and labels."""
+    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
+    for page_no, dets in enumerate(all_dets):
+        page = doc[page_no]
+        for d in dets:
+            rgb = CLASS_COLORS.get(d["name"], (0, 0, 0))
+            rect = fitz.Rect([c / scale for c in d["bbox"]])
+            border_color = [c / 255 for c in rgb]
+            fill_color = [c / 255 for c in rgb]
+            fill_opacity = 0.15
+            border_width = 1.5
+            page.draw_rect(
+                rect,
+                color=border_color,
+                fill=fill_color,
+                width=border_width,
+                overlay=True,
+                fill_opacity=fill_opacity
+            )
+            label = f"{d['name']} {d['conf']:.2f}"
+            if d.get("source"):
+                label += f" [{d['source'][0].upper()}]"
+            text_bg = fitz.Rect(rect.x0, rect.y0 - 10, rect.x0 + 60, rect.y0)
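+            # PyMuPDF reads a 4-component colour as CMYK, not RGBA, so pass white
+            # as RGB and set the transparency through fill_opacity.
+            page.draw_rect(
+                text_bg, color=None, fill=(1, 1, 1), fill_opacity=0.6, overlay=True
+            )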
+            page.insert_text(
+                # insert_text takes the baseline start, so keep it inside text_bg.
+                (rect.x0 + 2, rect.y0 - 2),
+                label,
+                fontsize=6.5,
+                color=border_color,
+                overlay=True
+            )
+    doc.save(str(out_path))
+    doc.close()
+# ----------------------------------------------------------------------
+# Process a single PDF Page (for parallel execution)
+# ----------------------------------------------------------------------
+def process_page(task_data: Tuple[int, bytes, float, Path, str]) -> Optional[Tuple[int, List[dict], List[dict]]]:
+    """
+    Process a single page of a PDF in a worker process.
+    Returns: (page_number, detections, elements) or None on failure
+    """
+    pno, pdf_bytes, scale, out_dir, pdf_name = task_data
+    if _shutdown_requested:
+        return None
+    pdf_pdfium = None
+    try:
+        pdf_pdfium = pdfium.PdfDocument(pdf_bytes)
+        page = pdf_pdfium[pno]
+        bitmap = page.render(scale=scale)
+        pil = bitmap.to_pil()
+        dets = detect_page(pil)
+        elements = save_layout_elements(pil, pno, dets, out_dir)
+        page_figures = len([d for d in dets if d['name'] == 'figure'])
+        page_tables = len([d for d in dets if d['name'] == 'table'])
+        logger.info(f"  [{pdf_name}] Page {pno + 1}: {page_figures} figs, {page_tables} tables")
+        page.close()
+        pdf_pdfium.close()
+        return (pno, dets, elements)
+    except Exception as e:
+        logger.error(f"Failed to process page {pno + 1} of {pdf_name}: {e}")
+        if pdf_pdfium:
+            pdf_pdfium.close()
+        return None
+# ----------------------------------------------------------------------
+# Process a full PDF using the persistent worker pool
+# ----------------------------------------------------------------------
+def process_pdf_with_pool(
+    pdf_path: Path,
+    out_dir: Path,
+    pool: Optional[Pool] = None,
+    *,
+    extract_images: bool = True,
+    extract_markdown: bool = True,
+):
+    """
+    Main processing pipeline for a PDF file.
+    If pool is provided, uses it. Otherwise processes serially.
+    """
+    if _shutdown_requested:
+        logger.warning(f"Skipping {pdf_path.name} due to shutdown request")
+        return
+    stem = pdf_path.stem
+    logger.info(f"Processing {pdf_path.name}")
+    pdf_bytes = pdf_path.read_bytes()
+    doc = None
+    try:
+        doc = pdfium.PdfDocument(pdf_bytes)
+        page_count = len(doc)
+    except Exception as e:
+        logger.error(f"Failed to open PDF {pdf_path.name}: {e}. Skipping.")
+        return
+    finally:
+        if doc is not None:
+            doc.close()
+    scale = 2.0
+    all_elements: List[Dict] = []
+    filtered_dets: List[List[dict]] = []
+    if extract_images:
+        all_dets: List[Optional[List[dict]]] = [None] * page_count
+        if pool is not None and USE_MULTIPROCESSING:
+            logger.info(f"  Using worker pool for {page_count} pages...")
+            tasks = [
+                (pno, pdf_bytes, scale, out_dir, pdf_path.name)
+                for pno in range(page_count)
+            ]
+            try:
+                results = pool.map(process_page, tasks)
+                for res in results:
+                    if res:
+                        pno, dets, elements = res
+                        all_dets[pno] = dets
+                        all_elements.extend(elements)
+            except KeyboardInterrupt:
+                logger.warning("Processing interrupted during parallel execution")
+                raise
+        else:
+            logger.info("Using serial processing...")
+            pdf_pdfium = None
+            try:
+                pdf_pdfium = pdfium.PdfDocument(pdf_bytes)
+                for pno in range(page_count):
+                    if _shutdown_requested:
+                        logger.warning(
+                            f"Stopping at page {pno + 1}/{page_count} due to shutdown request"
+                        )
+                        break
+                    try:
+                        logger.info(f"  Processing page {pno + 1}/{page_count}")
+                        page = pdf_pdfium[pno]
+                        bitmap = page.render(scale=scale)
+                        pil = bitmap.to_pil()
+                        dets = detect_page(pil)
+                        all_dets[pno] = dets
+                        elements = save_layout_elements(pil, pno, dets, out_dir)
+                        all_elements.extend(elements)
+                        page_figures = len([d for d in dets if d["name"] == "figure"])
+                        page_tables = len([d for d in dets if d["name"] == "table"])
+                        logger.info(
+                            f"    Found {page_figures} figures and {page_tables} tables"
+                        )
+                        page.close()
+                    except Exception as e:
+                        logger.error(f"Failed to process page {pno + 1}: {e}. Skipping page.")
+            except Exception as e:
+                logger.error(f"Fatal error processing {pdf_path.name}: {e}")
+                return
+            finally:
+                # Close the document on every path, including the early return.
+                if pdf_pdfium is not None:
+                    pdf_pdfium.close()
+        dets_per_page: List[Optional[List[Dict[str, Any]]]] = list(all_dets)
+        filtered_dets = [d for d in all_dets if d is not None]
+        if all_elements:
+            all_elements = merge_spanning_tables(all_elements, out_dir)
+            all_elements = attach_cross_page_figure_captions(
+                all_elements, dets_per_page, pdf_bytes, out_dir, scale
+            )
+        if all_elements:
+            content_list_path = out_dir / f"{stem}_content_list.json"
+            with open(content_list_path, "w", encoding="utf-8") as f:
+                json.dump(all_elements, f, ensure_ascii=False, indent=4)
+            logger.info(f"  Saved {len(all_elements)} elements to JSON")
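+        if filtered_dets:
+            # Pass one list per page (empty for failed pages) so each box is drawn
+            # on the page it was detected on; dropping None would shift later pages.
+            page_aligned_dets = [d if d is not None else [] for d in all_dets]
+            draw_layout_pdf(
+                pdf_bytes, page_aligned_dets, scale, out_dir / f"{stem}_layout.pdf"
+            )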
+            logger.info("  Generated annotated PDF")
+        else:
+            logger.warning(f"No detections found for {stem}. Skipping layout PDF.")
+    else:
+        logger.info("  Image extraction skipped per configuration.")
+    markdown_path = None
+    if extract_markdown:
+        markdown_path = write_markdown_document(pdf_path, out_dir)
+        if markdown_path is None:
+            logger.warning(f"  Markdown extraction yielded no content for {stem}.")
+    if _shutdown_requested:
+        logger.warning(f"⚠️  Partial results saved for {stem} → {out_dir}")
+    else:
+        if extract_images:
+            logger.success(
+                f"✓ {stem} → {out_dir} ({len(all_elements)} elements extracted)"
+            )
+        else:
+            logger.success(f"✓ {stem} → {out_dir} (image extraction skipped)")
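Review note: `draw_layout_pdf` pairs each detection list with a page purely by position, so the list handed to it needs exactly one entry per page. The sketch below, using made-up sample data and a hypothetical `page_of` helper, shows how removing `None` entries moves later detections onto the wrong page, while substituting empty lists keeps positions equal to page indices.

```python
from typing import List, Optional

# Hypothetical per-page results: page index 1 failed and produced None.
all_dets: List[Optional[List[dict]]] = [
    [{"name": "figure"}],
    None,
    [{"name": "table"}],
]

# Dropping None compacts the list, so later pages slide down by one position.
compacted = [d for d in all_dets if d is not None]

# Replacing None with an empty list keeps list position == page index.
aligned = [d if d is not None else [] for d in all_dets]


def page_of(kind: str, per_page: List[List[dict]]) -> int:
    """Index of the first page whose detections include `kind`."""
    return next(
        i for i, dets in enumerate(per_page) if any(d["name"] == kind for d in dets)
    )


table_page_compacted = page_of("table", compacted)
table_page_aligned = page_of("table", aligned)
```

In the compacted list the table appears at index 1 even though it was detected on page index 2; the aligned list preserves the original index.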
+# ----------------------------------------------------------------------
+# Main
+# ----------------------------------------------------------------------
+if __name__ == "__main__":
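+    # Force the 'spawn' start method on every platform. It is already the default
+    # on Windows and macOS; on Linux (Python 3.12) the default is 'fork', which is
+    # unsafe once CUDA has been initialized in the parent process.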
+    torch.multiprocessing.set_start_method('spawn', force=True)
+    # Setup signal handlers for graceful shutdown
+    setup_signal_handlers()
+    INPUT_DIR = Path("./pdfs")
+    OUTPUT_DIR = Path("./output")
+    os.makedirs(INPUT_DIR, exist_ok=True)
+    os.makedirs(OUTPUT_DIR, exist_ok=True)
+    pdf_files = list(INPUT_DIR.glob("*.pdf"))
+    if not pdf_files:
+        logger.warning("No PDF files found in ./pdfs")
+        logger.info("Please add PDF files to the ./pdfs directory")
+        logger.info("The script will exit gracefully. No errors occurred.")
+        sys.exit(0)
+    logger.info(f"Found {len(pdf_files)} PDF file(s) to process")
+    logger.info(f"Settings: MODEL_SIZE={MODEL_SIZE}, CONF={CONF_THRESHOLD}")
+    # Determine worker count
+    total_cpus = cpu_count()
+    if NUM_WORKERS is None:
+        num_workers = max(1, total_cpus - 1)
+    else:
+        num_workers = max(1, min(NUM_WORKERS, total_cpus))
+    # Decide whether to use multiprocessing
+    use_pool = USE_MULTIPROCESSING and DEVICE == "cpu" and total_cpus >= 4
+    if use_pool:
+        logger.info(f"🚀 Creating persistent worker pool with {num_workers} workers...")
+    else:
+        if not USE_MULTIPROCESSING:
+            logger.info("Multiprocessing disabled by configuration")
+        elif DEVICE != "cpu":
+            logger.info(f"Using serial GPU processing (device: {DEVICE})")
+        else:
+            logger.info(f"Using serial CPU processing (CPU count {total_cpus} too low)")
+    pool = None
+    try:
+        # Create persistent pool ONCE for all PDFs
+        if use_pool:
+            pool = Pool(processes=num_workers, initializer=init_worker)
+            logger.success(f"✓ Worker pool ready with {num_workers} workers\n")
+        else:
+            # Load model in main process for serial execution
+            logger.info("Initializing model in main process...")
+            get_model()
+            logger.success(f"✓ Model loaded (device: {DEVICE})\n")
+        # Process all PDFs using the same pool
+        for i, pdf_path in enumerate(pdf_files, 1):
+            if _shutdown_requested:
+                logger.warning(f"\nShutdown requested. Processed {i-1}/{len(pdf_files)} files.")
+                break
+            logger.info(f"\n{'='*60}")
+            logger.info(f"📄 File {i}/{len(pdf_files)}: {pdf_path.name}")
+            logger.info(f"{'='*60}")
+            sub_out = OUTPUT_DIR / pdf_path.stem
+            os.makedirs(sub_out, exist_ok=True)
+            try:
+                process_pdf_with_pool(pdf_path, sub_out, pool)
+            except KeyboardInterrupt:
+                logger.warning(f"\nInterrupted while processing {pdf_path.name}")
+                break
+            except Exception as e:
+                logger.error(f"Error processing {pdf_path.name}: {e}")
+                if _shutdown_requested:
+                    break
+                logger.info("Continuing with next file...")
+                continue
+        if _shutdown_requested:
+            logger.warning(f"\n⚠️  Processing interrupted. Partial results saved in {OUTPUT_DIR}")
+        else:
+            logger.success(f"\n✨ All done! Results are in {OUTPUT_DIR}")
+    except KeyboardInterrupt:
+        logger.error("\n❌ Processing interrupted by user")
+        sys.exit(1)
+    except Exception as e:
+        logger.error(f"\n❌ Fatal error: {e}")
+        sys.exit(1)
+    finally:
+        # Clean up pool if it exists
+        if pool is not None:
+            logger.info("\n🧹 Shutting down worker pool...")
+            pool.close()
+            pool.join()
+            logger.success("✓ Worker pool closed cleanly")
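Review note: the worker-count and pool decision in the `__main__` block of main.py can be read as a small pure function, which makes the three outcomes (pool, serial GPU, serial CPU) easy to test. This is a sketch of that reading only; `plan_workers` is a hypothetical helper that does not exist in the commit, and its parameters stand in for `NUM_WORKERS`, `USE_MULTIPROCESSING`, and `DEVICE`.

```python
from typing import Optional, Tuple


def plan_workers(
    total_cpus: int,
    requested: Optional[int],
    use_multiprocessing: bool,
    device: str,
) -> Tuple[int, bool]:
    """Return (num_workers, use_pool) following the rules in the __main__ block."""
    if requested is None:
        # No explicit setting: use all cores but one, and never fewer than one.
        num_workers = max(1, total_cpus - 1)
    else:
        # Explicit setting: clamp to the range [1, total_cpus].
        num_workers = max(1, min(requested, total_cpus))
    # A pool is only created for CPU inference on machines with at least 4 cores.
    use_pool = use_multiprocessing and device == "cpu" and total_cpus >= 4
    return num_workers, use_pool


eight_core_cpu = plan_workers(8, None, True, "cpu")
two_core_cpu = plan_workers(2, None, True, "cpu")
gpu_with_request = plan_workers(8, 16, True, "cuda")
```

On a GPU device the worker count is still computed but unused, because `use_pool` is false and pages are processed serially.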

modal_app.py ADDED

	@@ -0,0 +1,112 @@

+"""
+Modal deployment configuration for PDF Layout Extractor Flask app.
+Deploy with: modal deploy modal_app.py
+"""
+import modal
+# Create a Modal image with GPU support and all dependencies
+image = (
+    modal.Image.debian_slim(python_version="3.12")
+    .apt_install(
+        "build-essential",
+        "gcc",
+        "g++",
+        "libgl1",
+        "libglib2.0-0",  # Required for cv2 (provides libgthread-2.0.so.0)
+        "libsm6",        # Required for cv2
+        "libxext6",      # Required for cv2
+        "libxrender-dev",  # Required for cv2
+        "libgomp1",      # Required for cv2
+        "libffi-dev",
+        "libjpeg62-turbo-dev",
+        "zlib1g-dev",
+        "netcat-openbsd",
+    )
+    .pip_install(
+        "torch>=2.0.0",
+        "torchvision>=0.15.0",
+        "doclayout-yolo>=0.0.4",
+        "huggingface-hub>=1.1.2",
+        "loguru>=0.7.3",
+        "pillow>=12.0.0",
+        "pymupdf>=1.26.6",
+        "pymupdf-layout>=0.0.15",
+        "pypdfium2>=5.0.0",
+        "pymupdf4llm>=0.1.9",
+        "flask>=3.0.0",
+        "fastapi>=0.109.0",  # Required for Modal web endpoints
+        "werkzeug>=3.0.0",
+        "gunicorn>=21.2.0",
+        "asgiref>=3.7.0",  # For WSGI-to-ASGI conversion
+    )
+    .run_commands(
+        "mkdir -p /app/uploads /app/output /app/static /app/templates"
+    )
+    # Copy application files directly into the image
+    .add_local_dir("static", remote_path="/app/static")
+    .add_local_dir("templates", remote_path="/app/templates")
+    .add_local_file("app.py", remote_path="/app/app.py")
+    .add_local_file("main.py", remote_path="/app/main.py")
+)
+# Create the Modal app
+app = modal.App("pdf-layout-extractor", image=image)
+# GPU configuration - using T4 for cheapest option (~$0.50/hour while active)
+# For no GPU (CPU only), set gpu=None (much cheaper but slower)
+# Valid options: "T4", "A10G", "A100", or None
+@app.function(
+    image=image,
+    gpu="T4",  # Cheapest GPU option (~$0.50/hour while active)
+    secrets=[
+        # Add any secrets here if needed (e.g., HUGGINGFACE_TOKEN)
+        # modal.Secret.from_name("huggingface-secret"),
+    ],
+    timeout=3600,  # 1 hour timeout for long PDF processing
+    max_containers=10,  # Handle up to 10 concurrent requests
+)
+@modal.asgi_app()
+def flask_app():
+    """
+    Expose the Flask app as an ASGI application for Modal.
+    Flask is WSGI, so we convert it to ASGI using a wrapper.
+    """
+    import sys
+    import os
+    # Set working directory
+    os.chdir("/app")
+    sys.path.insert(0, "/app")
+    # Import Flask app
+    from app import app as flask_app_instance
+    # Convert Flask WSGI app to ASGI for Modal
+    # Using asgiref's WSGI-to-ASGI adapter
+    from asgiref.wsgi import WsgiToAsgi
+    asgi_app = WsgiToAsgi(flask_app_instance)
+    return asgi_app
+# Separate health-check endpoint, exposed as its own web endpoint with automatic HTTPS
+@app.function(
+    image=image,
+    gpu="T4",
+    timeout=3600,
+    max_containers=10,
+)
+@modal.fastapi_endpoint(method="GET", label="pdf-extractor")
+def health():
+    """Health check endpoint."""
+    return {"status": "ok", "service": "pdf-layout-extractor"}
+if __name__ == "__main__":
+    # For local testing with Modal dev server:
+    # Run: modal serve modal_app.py
+    pass
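Review note: `flask_app` wraps the Flask instance in `asgiref.wsgi.WsgiToAsgi` because Flask exposes the synchronous WSGI interface while `modal.asgi_app` expects an ASGI application. For readers unfamiliar with the WSGI side of that adapter, the stdlib-only sketch below shows the calling convention being wrapped. `toy_wsgi_app` and `call_wsgi` are hypothetical stand-ins; nothing here uses Flask, asgiref, or Modal.

```python
from wsgiref.util import setup_testing_defaults


def toy_wsgi_app(environ, start_response):
    """Stand-in for a Flask app: every WSGI app is a callable with this signature."""
    body = b'{"status": "ok"}'
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]


def call_wsgi(app, path="/"):
    """Invoke a WSGI app once, synchronously, and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        # The server side records the status line and headers chosen by the app.
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_wsgi(toy_wsgi_app, "/health")
```

An adapter such as `WsgiToAsgi` performs this kind of call on behalf of an ASGI server, translating the request and response between the two interfaces.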

pdf_extractor_gui.py ADDED

	@@ -0,0 +1,624 @@

+import streamlit as st
+import os
+import re
+import cv2
+import fitz  # PyMuPDF
+import pytesseract
+import numpy as np
+from typing import List, Dict, Tuple, Optional
+from concurrent.futures import ThreadPoolExecutor
+import sys
+from pathlib import Path
+import tempfile
+import zipfile
+import io
+class PDFExtractor:
+    def __init__(self):
+        # Configuration (same as original)
+        self.config = {
+            'dpi': 400,
+            'min_area_ratio': 0.02,
+            'max_area_ratio': 0.96,
+            'min_width_px': 200,
+            'min_height_px': 220,
+            'inset_px': 6,
+            'stitch': {
+                'y_tol': 60,
+                'h_tol': 120,
+                'x_tol': 60,
+                'w_tol': 120,
+            },
+            'caption_regex': r"^\s*(?:Figure|Fig\.?|Panel|Table)\s*[\dA-Za-z\-\.]*",
+            'ocr_lang': 'eng',
+            'rotate_on_demand': False,
+            'debug_mode': False,
+            'max_caption_search_pages_ahead': 1,
+        }
+        self.setup_tesseract()
+    def setup_tesseract(self):
+        """Try to find Tesseract executable"""
+        possible_paths = [
+            r'C:\Program Files\Tesseract-OCR\tesseract.exe',
+            r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe',
+            '/usr/bin/tesseract',
+            '/usr/local/bin/tesseract',
+            'tesseract'  # If in PATH
+        ]
+        for path in possible_paths:
+            try:
+                if os.path.exists(path) or path == 'tesseract':
+                    pytesseract.pytesseract.tesseract_cmd = path
+                    # Test if it works
+                    test_img = np.ones((50, 50, 3), dtype=np.uint8) * 255
+                    pytesseract.image_to_string(test_img)
+                    return True
+            except Exception:
+                continue
+        return False
+    def process_single_pdf(self, pdf_path: str, out_dir: str):
+        """Process a single PDF file (adapted from original code)"""
+        if not os.path.isfile(pdf_path):
+            raise FileNotFoundError(f"PDF not found: {pdf_path}")
+        os.makedirs(out_dir, exist_ok=True)
+        try:
+            doc = fitz.open(pdf_path)
+        except Exception as e:
+            raise RuntimeError(f"Error opening PDF: {e}") from e
+        detections_by_page = []
+        total_pages = len(doc)
+        # Progress tracking for Streamlit
+        if hasattr(self, 'progress_callback'):
+            self.progress_callback(f"Analyzing {total_pages} pages...")
+        for pno, page in enumerate(doc):
+            img = self.render_page_to_bgr(page, self.config['dpi'])
+            boxes, _ = self.detect_boxes_on_image(
+                img,
+                min_area_ratio=self.config['min_area_ratio'],
+                max_area_ratio=self.config['max_area_ratio'],
+                min_w=self.config['min_width_px'],
+                min_h=self.config['min_height_px'],
+                inset_px=self.config['inset_px'],
+                debug_overlay=self.config['debug_mode'],
+            )
+            for b in boxes:
+                b['page'] = pno
+            detections_by_page.append(boxes)
+            if hasattr(self, 'progress_callback'):
+                self.progress_callback(f"  - Page {pno+1}: {len(boxes)} region(s)")
+        doc.close()
+        self.classify_boxes_with_ocr(detections_by_page, self.config['ocr_lang'])
+        figures = self.stitch_split_figures(detections_by_page)
+        self.save_results(figures, detections_by_page, out_dir)
+    # Original algorithm methods (adapted for the class)
+    def render_page_to_bgr(self, page: fitz.Page, dpi: int) -> np.ndarray:
+        mat = fitz.Matrix(dpi / 72.0, dpi / 72.0)
+        pix = page.get_pixmap(matrix=mat, alpha=False)
+        img_bytes = pix.tobytes("png")
+        arr = np.frombuffer(img_bytes, np.uint8)
+        img = cv2.imdecode(arr, cv2.IMREAD_COLOR)
+        return img
+    def detect_boxes_on_image(self, img: np.ndarray, min_area_ratio: float, max_area_ratio: float,
+                            min_w: int, min_h: int, inset_px: int, debug_overlay: bool = False
+                            ) -> Tuple[List[Dict], Optional[np.ndarray]]:
+        H, W = img.shape[:2]
+        page_area = W * H
+        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
+        bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
+                                   cv2.THRESH_BINARY_INV, 21, 12)
+        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
+        closed = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel, iterations=2)
+        contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
+        boxes: List[Dict] = []
+        for cnt in contours:
+            peri = cv2.arcLength(cnt, True)
+            if peri < 80:
+                continue
+            approx = cv2.approxPolyDP(cnt, 0.02 * peri, True)
+            x, y, w, h = cv2.boundingRect(approx)
+            if w < min_w or h < min_h:
+                continue
+            area = w * h
+            area_ratio = area / page_area
+            if not (min_area_ratio <= area_ratio <= max_area_ratio):
+                continue
+            if (w / (h + 1e-6) > 12) or (h / (w + 1e-6) > 12):
+                continue
+            mask = np.zeros((H, W), dtype=np.uint8)
+            cv2.drawContours(mask, [approx], -1, 255, -1)
+            def edge_present(slice_arr: np.ndarray) -> bool:
+                if slice_arr.size == 0:
+                    return False
+                return (np.mean(slice_arr) > 20)
+            edge_thickness = 8
+            top_slice = mask[y:y+edge_thickness, x:x+w] if y+edge_thickness < H else mask[y:H, x:x+w]
+            bottom_slice = mask[max(0, y+h-edge_thickness):y+h, x:x+w]
+            left_slice = mask[y:y+h, x:x+edge_thickness] if x+edge_thickness < W else mask[y:y+h, x:W]
+            right_slice = mask[y:y+h, max(0, x+w-edge_thickness):x+w]
+            top_edge = edge_present(top_slice)
+            bottom_edge = edge_present(bottom_slice)
+            left_edge = edge_present(left_slice)
+            right_edge = edge_present(right_slice)
+            open_sides = []
+            if not top_edge: open_sides.append("top")
+            if not bottom_edge: open_sides.append("bottom")
+            if not left_edge: open_sides.append("left")
+            if not right_edge: open_sides.append("right")
+            x1 = max(0, x + inset_px)
+            y1 = max(0, y + inset_px)
+            x2 = min(W, x + w - inset_px)
+            y2 = min(H, y + h - inset_px)
+            if x2 <= x1 or y2 <= y1:
+                continue
+            crop = img[y1:y2, x1:x2].copy()
+            box = {
+                'coords': (x, y, w, h),
+                'image': crop,
+                'open_sides': open_sides,
+                'area_ratio': float(area_ratio),
+            }
+            boxes.append(box)
+        boxes.sort(key=lambda b: (b['coords'][1], b['coords'][0]))
+        return boxes, None
+    def ocr_text(self, image: np.ndarray, lang: str) -> str:
+        try:
+            txt = pytesseract.image_to_string(image, lang=lang)
+        except Exception:
+            txt = ""
+        return (txt or "").strip()
+    def classify_boxes_with_ocr(self, detections_by_page: List[List[Dict]], lang: str) -> None:
+        caption_re = re.compile(self.config['caption_regex'], re.IGNORECASE)
+        jobs = []
+        with ThreadPoolExecutor(max_workers=os.cpu_count() or 4) as ex:
+            for p_idx, page_boxes in enumerate(detections_by_page):
+                for b_idx, box in enumerate(page_boxes):
+                    jobs.append(((p_idx, b_idx), ex.submit(self.ocr_text, box['image'], lang)))
+            for (p_idx, b_idx), fut in jobs:
+                text = fut.result() or ""
+                box = detections_by_page[p_idx][b_idx]
+                if caption_re.match(text):
+                    box['type'] = 'caption'
+                    box['text'] = text
+                else:
+                    box['type'] = 'figure'
+                    box['text'] = text
+    def stitch_split_figures(self, detections_by_page: List[List[Dict]]) -> List[Dict]:
+        # Mark boxes with IDs and stitch flags
+        for p_idx, page_boxes in enumerate(detections_by_page):
+            for b_idx, box in enumerate(page_boxes):
+                box['id'] = f"p{p_idx+1}_b{b_idx+1}"
+                box['used_for_stitch'] = False
+        figures: List[Dict] = []
+        for p_idx, page_boxes in enumerate(detections_by_page):
+            for b_idx, box in enumerate(page_boxes):
+                if box.get('type') == 'caption':
+                    continue
+                if box['used_for_stitch']:
+                    continue
+                cur_img = box['image']
+                cur_coords = box['coords']
+                pages = [p_idx]
+                bbox_refs = [(p_idx, b_idx)]
+                box['used_for_stitch'] = True
+                np_idx = p_idx + 1
+                candidate = None
+                if np_idx < len(detections_by_page):
+                    for nb_idx, nb in enumerate(detections_by_page[np_idx]):
+                        if nb.get('type') == 'caption' or nb['used_for_stitch']:
+                            continue
+                        x, y, w, h = cur_coords
+                        nx, ny, nw, nh = nb['coords']
+                        if abs(x - nx) < 50 and abs((x+w) - (nx+nw)) < 50:
+                            candidate = (np_idx, nb_idx, nb, 'vertical')
+                            break
+                        if abs(y - ny) < 50 and abs((y+h) - (ny+nh)) < 50:
+                            candidate = (np_idx, nb_idx, nb, 'horizontal')
+                            break
+                if candidate:
+                    np_idx, nb_idx, nb, stitch_type = candidate
+                    nb['used_for_stitch'] = True
+                    pages.append(np_idx)
+                    bbox_refs.append((np_idx, nb_idx))
+                    if stitch_type == 'vertical':
+                        w_max = max(cur_img.shape[1], nb['image'].shape[1])
+                        def pad_to_width(img, target_w):
+                            pad_w = target_w - img.shape[1]
+                            if pad_w <= 0:
+                                return img
+                            return np.pad(img, ((0,0),(0,pad_w),(0,0)),
+                                          mode="constant", constant_values=255)
+                        cur_img = pad_to_width(cur_img, w_max)
+                        nb_img = pad_to_width(nb['image'], w_max)
+                        cur_img = np.vstack([cur_img, nb_img])
+                        x1 = min(cur_coords[0], nb['coords'][0])
+                        y1 = min(cur_coords[1], nb['coords'][1])
+                        x2 = max(cur_coords[0]+cur_coords[2], nb['coords'][0]+nb['coords'][2])
+                        y2 = max(cur_coords[1]+cur_coords[3], nb['coords'][1]+nb['coords'][3])
+                        cur_coords = (x1, y1, x2-x1, y2-y1)
+                    else:  # horizontal
+                        h_max = max(cur_img.shape[0], nb['image'].shape[0])
+                        def pad_to_height(img, target_h):
+                            pad_h = target_h - img.shape[0]
+                            if pad_h <= 0:
+                                return img
+                            return np.pad(img, ((0,pad_h),(0,0),(0,0)),
+                                          mode="constant", constant_values=255)
+                        cur_img = pad_to_height(cur_img, h_max)
+                        nb_img = pad_to_height(nb['image'], h_max)
+                        cur_img = np.hstack([cur_img, nb_img])
+                        x1 = min(cur_coords[0], nb['coords'][0])
+                        y1 = min(cur_coords[1], nb['coords'][1])
+                        x2 = max(cur_coords[0]+cur_coords[2], nb['coords'][0]+nb['coords'][2])
+                        y2 = max(cur_coords[1]+cur_coords[3], nb['coords'][1]+nb['coords'][3])
+                        cur_coords = (x1, y1, x2-x1, y2-y1)
+                figures.append({
+                    'id': f"f{len(figures)+1:03d}",
+                    'pages': pages,
+                    'image': cur_img,
+                    'bbox_refs': bbox_refs,
+                    'base_page': pages[0],
+                    'coords_hint': cur_coords,
+                })
+        return figures
+    def pick_best_caption_for_figure(self, fig: Dict, detections_by_page: List[List[Dict]],
+                                   used_caption_ids: set) -> Optional[Tuple[int, int, Dict]]:
+        base_p = fig['base_page']
+        x, y, w, h = fig['coords_hint']
+        max_ahead = self.config['max_caption_search_pages_ahead']
+        candidates = []
+        for p in range(base_p, min(base_p + 1 + max_ahead, len(detections_by_page))):
+            for b_idx, box in enumerate(detections_by_page[p]):
+                if box.get('type') != 'caption':
+                    continue
+                if box.get('caption_used_id'):
+                    continue
+                bx, by, bw, bh = box['coords']
+                same_page = (p == base_p)
+                after_figure = (not same_page) or (by >= y)
+                if not after_figure:
+                    continue
+                vdist = abs((by) - (y + h)) if same_page else 0
+                wdiff = abs(bw - w)
+                score = vdist + 0.5 * wdiff
+                candidates.append((score, p, b_idx, box))
+        if not candidates:
+            return None
+        candidates.sort(key=lambda t: t[0])
+        for _, p, b_idx, box in candidates:
+            box_id = (p, b_idx)
+            if box_id not in used_caption_ids:
+                return (p, b_idx, box)
+        return None
+    def rotate_if_needed(self, img: np.ndarray) -> np.ndarray:
+        if not self.config['rotate_on_demand']:
+            return img
+        h, w = img.shape[:2]
+        if h > w * 1.2:
+            return cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
+        return img
+    def save_results(self, figures: List[Dict], detections_by_page: List[List[Dict]], out_dir: str) -> None:
+        os.makedirs(out_dir, exist_ok=True)
+        used_captions = set()
+        saved = 0
+        for fig in figures:
+            cap = self.pick_best_caption_for_figure(fig, detections_by_page, used_captions)
+            if cap is not None:
+                p, b_idx, cap_box = cap
+                used_captions.add((p, b_idx))
+                fig_img = fig['image']
+                cap_img = cap_box['image']
+                if cap_img.shape[1] != fig_img.shape[1]:
+                    new_h = int(cap_img.shape[0] * (fig_img.shape[1] / cap_img.shape[1]))
+                    cap_img = cv2.resize(cap_img, (fig_img.shape[1], new_h))
+                stitched = cv2.vconcat([fig_img, cap_img])
+                stitched = self.rotate_if_needed(stitched)
+                fname = f"figure_with_caption_{fig['id']}.png"
+                cv2.imwrite(os.path.join(out_dir, fname), stitched)
+                saved += 1
+            else:
+                fig_img = self.rotate_if_needed(fig['image'])
+                fname = f"figure_{fig['id']}.png"
+                cv2.imwrite(os.path.join(out_dir, fname), fig_img)
+                saved += 1
+        cap_count = 0
+        for p_idx, page_boxes in enumerate(detections_by_page):
+            for b_idx, box in enumerate(page_boxes):
+                if box.get('type') == 'caption' and (p_idx, b_idx) not in used_captions:
+                    cap_count += 1
+                    cv2.imwrite(os.path.join(out_dir, f"standalone_caption_{cap_count:03d}.png"), box['image'])
+        if hasattr(self, 'progress_callback'):
+            self.progress_callback(f"Saved {saved} figure image(s) (+ any standalone captions) to: {out_dir}")
+def main():
+    st.set_page_config(
+        page_title="PDF Figure Extractor",
+        page_icon="📄",
+        layout="wide"
+    )
+    # Custom CSS for better styling
+    st.markdown("""
+        <style>
+        .main-title {
+            font-size: 2.5rem;
+            font-weight: bold;
+            color: #1f77b4;
+            text-align: center;
+            margin-bottom: 2rem;
+        }
+        .section-title {
+            font-size: 1.5rem;
+            font-weight: bold;
+            margin-top: 1.5rem;
+            margin-bottom: 1rem;
+        }
+        .success-box {
+            padding: 1rem;
+            background-color: #d4edda;
+            border-left: 5px solid #28a745;
+            margin: 1rem 0;
+        }
+        .error-box {
+            padding: 1rem;
+            background-color: #f8d7da;
+            border-left: 5px solid #dc3545;
+            margin: 1rem 0;
+        }
+        .info-box {
+            padding: 1rem;
+            background-color: #d1ecf1;
+            border-left: 5px solid #17a2b8;
+            margin: 1rem 0;
+        }
+        </style>
+    """, unsafe_allow_html=True)
+    # Title
+    st.markdown('<h1 class="main-title">📄 PDF Figure Extractor</h1>', unsafe_allow_html=True)
+    st.markdown("---")
+    # Initialize extractor in session state
+    if 'extractor' not in st.session_state:
+        st.session_state.extractor = PDFExtractor()
+        tesseract_found = st.session_state.extractor.setup_tesseract()
+        if not tesseract_found:
+            st.info("ℹ️ **Tesseract OCR not detected.** "
+                   "Caption detection will be limited. "
+                   "For local development, install Tesseract from: "
+                   "https://github.com/UB-Mannheim/tesseract/wiki")
+    # Sidebar for settings
+    with st.sidebar:
+        st.header("⚙️ Settings")
+        dpi = st.slider(
+            "Image Quality (DPI)",
+            min_value=150,
+            max_value=600,
+            value=400,
+            step=50,
+            help="Higher DPI means better quality but slower processing"
+        )
+        rotate_images = st.checkbox(
+            "Auto-rotate tall images",
+            value=False,
+            help="Automatically rotate images that are taller than they are wide"
+        )
+        st.markdown("---")
+        st.markdown("### About")
+        st.markdown("""
+        This tool extracts figures and captions from PDF files using:
+        - **Computer Vision** for figure detection
+        - **OCR** for caption recognition
+        - **Smart Stitching** for multi-page figures
+        """)
+    # Main content
+    col1, col2 = st.columns([2, 1])
+    with col1:
+        st.markdown('<h3 class="section-title">1️⃣ Upload PDF Files</h3>', unsafe_allow_html=True)
+        uploaded_files = st.file_uploader(
+            "Choose PDF files",
+            type=['pdf'],
+            accept_multiple_files=True,
+            help="Select one or more PDF files to extract figures from"
+        )
+        if uploaded_files:
+            st.success(f"✅ {len(uploaded_files)} PDF file(s) selected")
+            for i, file in enumerate(uploaded_files, 1):
+                st.text(f"  {i}. {file.name}")
+        else:
+            # Show welcome message when no files uploaded
+            st.info("""
+            👋 **Welcome!** Upload your PDF files to get started.
+            This tool will:
+            - 🔍 Detect figures, charts, and diagrams
+            - 📝 Extract and match captions
+            - 🔄 Stitch multi-page figures
+            - 💾 Package everything for easy download
+            """)
+    with col2:
+        st.markdown('<h3 class="section-title">2️⃣ Process</h3>', unsafe_allow_html=True)
+        process_button = st.button(
+            "🚀 Extract Figures",
+            type="primary",
+            disabled=not uploaded_files,
+            use_container_width=True
+        )
+    # Processing section
+    if process_button and uploaded_files:
+        st.markdown("---")
+        st.markdown('<h3 class="section-title">📊 Processing Status</h3>', unsafe_allow_html=True)
+        # Update config
+        st.session_state.extractor.config['dpi'] = dpi
+        st.session_state.extractor.config['rotate_on_demand'] = rotate_images
+        # Progress tracking
+        progress_bar = st.progress(0)
+        status_text = st.empty()
+        def log_callback(message):
+            pass  # Silent processing
+        st.session_state.extractor.progress_callback = log_callback
+        # Create temporary directory for output
+        with tempfile.TemporaryDirectory() as temp_dir:
+            total_files = len(uploaded_files)
+            all_results = []
+            for i, uploaded_file in enumerate(uploaded_files):
+                # Update progress
+                progress = i / total_files
+                progress_bar.progress(progress)
+                status_text.markdown(f"**Processing:** {uploaded_file.name} ({i+1}/{total_files})")
+                # Save uploaded file temporarily
+                temp_pdf_path = os.path.join(temp_dir, uploaded_file.name)
+                with open(temp_pdf_path, 'wb') as f:
+                    f.write(uploaded_file.getbuffer())
+                # Create output directory for this PDF
+                pdf_name = os.path.splitext(uploaded_file.name)[0]
+                out_dir = os.path.join(temp_dir, pdf_name)
+                try:
+                    st.session_state.extractor.process_single_pdf(temp_pdf_path, out_dir)
+                    # Collect results
+                    if os.path.exists(out_dir):
+                        for filename in os.listdir(out_dir):
+                            if filename.endswith('.png'):
+                                filepath = os.path.join(out_dir, filename)
+                                all_results.append((pdf_name, filename, filepath))
+                except Exception as e:
+                    st.error(f"Error processing {uploaded_file.name}: {str(e)}")
+            # Complete progress
+            progress_bar.progress(1.0)
+            status_text.markdown("**✅ Processing completed!**")
+            # Display results
+            if all_results:
+                st.markdown("---")
+                st.markdown('<h3 class="section-title">🎉 Extraction Results</h3>', unsafe_allow_html=True)
+                st.success(f"Successfully extracted {len(all_results)} figure(s) from {total_files} PDF(s)")
+                # Group by PDF
+                results_by_pdf = {}
+                for pdf_name, filename, filepath in all_results:
+                    if pdf_name not in results_by_pdf:
+                        results_by_pdf[pdf_name] = []
+                    results_by_pdf[pdf_name].append((filename, filepath))
+                # Display results by PDF with auto-expanded previews
+                for pdf_name, files in results_by_pdf.items():
+                    st.markdown(f"### 📄 {pdf_name} ({len(files)} figures)")
+                    # Display images in columns
+                    cols = st.columns(3)
+                    for idx, (filename, filepath) in enumerate(files):
+                        with cols[idx % 3]:
+                            st.image(filepath, caption=filename, use_container_width=True)
+                # Create download button for all results (placed after previews)
+                st.markdown("---")
+                zip_buffer = io.BytesIO()
+                with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zip_file:
+                    for pdf_name, filename, filepath in all_results:
+                        arcname = f"{pdf_name}/{filename}"
+                        zip_file.write(filepath, arcname)
+                zip_buffer.seek(0)
+                st.download_button(
+                    label="📥 Download All Figures (ZIP)",
+                    data=zip_buffer,
+                    file_name="extracted_figures.zip",
+                    mime="application/zip",
+                    use_container_width=True,
+                    type="primary"
+                )
+            else:
+                st.warning("No figures were extracted. The PDFs may not contain detectable figures.")
+    # Footer
+    st.markdown("---")
+    st.markdown(
+        """
+        <div style='text-align: center; color: #666; padding: 2rem 0;'>
+            <p>Made with ❤️ using Streamlit |
+            <a href='https://github.com' target='_blank'>GitHub</a> |
+            Need help? Check the processing log for details</p>
+        </div>
+        """,
+        unsafe_allow_html=True
+    )
+if __name__ == "__main__":
+    main()
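
`classify_boxes_with_ocr` labels a detected region as a caption when its OCR text matches `config['caption_regex']`. The sketch below reuses that exact pattern outside the class so its behavior can be checked in isolation. `looks_like_caption` is a hypothetical helper name, and the sample strings are made up for illustration.

```python
import re

# Same pattern as config['caption_regex'] in PDFExtractor, compiled the same way.
CAPTION_RE = re.compile(
    r"^\s*(?:Figure|Fig\.?|Panel|Table)\s*[\dA-Za-z\-\.]*",
    re.IGNORECASE,
)


def looks_like_caption(ocr_text: str) -> bool:
    """Hypothetical helper: True when the text starts like a caption label."""
    return CAPTION_RE.match(ocr_text) is not None


if __name__ == "__main__":
    for sample in ["Figure 3. Overview", "  fig. 2a", "Table S1",
                   "The figure shows", "Figment of imagination"]:
        print(repr(sample), looks_like_caption(sample))
```

Because the trailing character class is optional and the pattern has no word boundary, text that merely begins with `Fig` (such as "Figment") also matches. A stricter variant would be a design change, not something the original code does.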

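`stitch_split_figures` decides whether a box on the next page continues the current one by comparing edges: matching left and right edges mean a vertical stitch, matching top and bottom edges mean a horizontal one. The sketch below isolates that comparison as a pure function. `stitch_direction` and the `Box` alias are hypothetical names, and the default `tol=50` mirrors the hard-coded threshold in the method rather than the unused `config['stitch']` values.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels


def stitch_direction(cur: Box, nxt: Box, tol: int = 50) -> Optional[str]:
    """Return 'vertical', 'horizontal', or None using the same edge tests."""
    x, y, w, h = cur
    nx, ny, nw, nh = nxt
    # Left and right edges line up: the next box continues below.
    if abs(x - nx) < tol and abs((x + w) - (nx + nw)) < tol:
        return "vertical"
    # Top and bottom edges line up: the next box continues to the side.
    if abs(y - ny) < tol and abs((y + h) - (ny + nh)) < tol:
        return "horizontal"
    return None


if __name__ == "__main__":
    print(stitch_direction((100, 1200, 800, 400), (110, 50, 790, 500)))
    print(stitch_direction((100, 200, 800, 400), (1000, 210, 600, 380)))
    print(stitch_direction((100, 200, 800, 400), (500, 900, 300, 300)))
```

As in the original loop, the vertical test is tried first, so a box that satisfies both conditions is treated as a vertical continuation.
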
pyproject.toml ADDED Viewed

	@@ -0,0 +1,20 @@

+[project]
+name = "pdf-minor-allegations"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.12"
+dependencies = [
+    "doclayout-yolo>=0.0.4",
+    "huggingface-hub>=1.1.2",
+    "loguru>=0.7.3",
+    "pillow>=12.0.0",
+    "pymupdf>=1.26.6",
+    "pymupdf-layout>=0.0.15",
+    "pypdfium2>=5.0.0",
+    "pymupdf4llm>=0.1.9",
+    "flask>=3.0.0",
+    "werkzeug>=3.0.0",
+    "torch>=2.0.0",
+    "torchvision>=0.15.0",
+]

run_flask_gpu.py ADDED Viewed

	@@ -0,0 +1,48 @@

+#!/usr/bin/env python
+"""
+Startup script that ensures CUDA PyTorch is installed before running Flask app.
+"""
+import subprocess
+import sys
+from pathlib import Path
+def ensure_cuda_torch():
+    """Ensure CUDA-enabled PyTorch is installed."""
+    try:
+        import torch
+        if torch.cuda.is_available():
+            print(f"✓ CUDA available: {torch.cuda.get_device_name(0)}")
+            return True
+        else:
+            print("⚠ CUDA not available in current PyTorch installation")
+            print("Installing CUDA-enabled PyTorch...")
+            subprocess.run([
+                sys.executable, "-m", "pip", "install",
+                "torch", "torchvision",
+                "--index-url", "https://download.pytorch.org/whl/cu121",
+                "--upgrade"
+            ], check=True)
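+            # Verify in a fresh interpreter: torch is already imported in this
+            # process, so importlib.reload() would not reliably pick up the new build.
+            check = subprocess.run([
+                sys.executable, "-c",
+                "import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)"
+            ])
+            if check.returncode == 0:
+                print("✓ CUDA-enabled PyTorch installed. Restart this script to use the GPU.")
+                return True
+            else:
+                print("⚠ Still no CUDA after installation. Using CPU mode.")
+                return False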
+    except Exception as e:
+        print(f"Error checking CUDA: {e}")
+        return False
+if __name__ == '__main__':
+    print("Checking GPU availability...")
+    ensure_cuda_torch()
+    print("\nStarting PDF Layout Extractor Flask App...")
+    print("Open your browser to http://localhost:5000\n")
+    from app import app
+    # Disable reloader to avoid environment discrepancies in child process
+    app.run(debug=False, use_reloader=False, host='0.0.0.0', port=5000)
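
`ensure_cuda_torch` needs to know whether a freshly installed package works, and a process that has already imported the old build is a poor place to ask. One possible approach, sketched here with the standard library only, is to run the check in a new interpreter. `probe_fresh_interpreter` and `CUDA_PROBE` are hypothetical names; the CUDA snippet assumes `torch` is installed and is not executed by the sketch itself.

```python
import subprocess
import sys
from typing import Tuple


def probe_fresh_interpreter(snippet: str, timeout: float = 60.0) -> Tuple[bool, str]:
    """Run `snippet` in a new Python process; return (succeeded, stripped stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.stdout.strip()


# Hypothetical probe for this project; only meaningful where torch is installed.
CUDA_PROBE = "import torch; print(torch.cuda.is_available())"

if __name__ == "__main__":
    ok, out = probe_fresh_interpreter("print(1 + 1)")
    print(ok, out)
```

The child process imports everything from disk, so it sees whatever `pip` just installed, independent of modules cached in the parent.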

static/css/styles.css ADDED Viewed

	@@ -0,0 +1,310 @@

+/* CSS Variables for Professional Theme */
+:root {
+    --primary: #4A5568;
+    --primary-dark: #2D3748;
+    --accent: #3182CE;
+    --accent-dark: #2B6CB0;
+    --background-light: #F7FAFC;
+    --background-dark: #1A202C;
+    --card-bg: #FFFFFF;
+    --card-bg-dark: #2D3748;
+    --text-primary: #2D3748;
+    --text-secondary: #718096;
+    --text-invert: #F7FAFC;
+    --border-color: #E2E8F0;
+    --border-color-dark: #4A5568;
+    --shadow-sm: 0 2px 4px rgba(15, 23, 42, 0.05);
+    --shadow-md: 0 4px 6px -1px rgba(15, 23, 42, 0.06), 0 2px 4px -1px rgba(15, 23, 42, 0.04);
+    --shadow-lg: 0 10px 15px -3px rgba(15, 23, 42, 0.1), 0 4px 6px -2px rgba(15, 23, 42, 0.05);
+    --radius-sm: 6px;
+    --radius-md: 8px;
+    --radius-lg: 12px;
+    --transition-base: all 0.25s ease;
+    --status-green: #15803d;
+    --status-yellow: #a16207;
+    --status-red: #b91c1c;
+}
+[data-theme="dark"] {
+    --primary: #CBD5F5;
+    --primary-dark: #1A202C;
+    --accent: #63B3ED;
+    --accent-dark: #4299E1;
+    --background-light: var(--background-dark);
+    --card-bg: var(--card-bg-dark);
+    --text-primary: #EDF2F7;
+    --text-secondary: #A0AEC0;
+    --text-invert: #0F172A;
+    --border-color: #2D3748;
+    --status-green: #4ade80;
+    --status-yellow: #facc15;
+    --status-red: #f87171;
+}
+body {
+    background-color: var(--background-light);
+    color: var(--text-primary);
+    font-family: "Inter", "Segoe UI", -apple-system, BlinkMacSystemFont, sans-serif;
+    transition: background-color 0.3s ease, color 0.3s ease;
+}
+/* Navbar */
+.bg-primary-custom {
+    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%) !important;
+}
+.navbar {
+    box-shadow: var(--shadow-sm);
+}
+/* Cards */
+.card {
+    background: var(--card-bg);
+    border: 1px solid var(--border-color);
+    border-radius: var(--radius-lg);
+    box-shadow: var(--shadow-sm);
+    transition: var(--transition-base);
+}
+.card:hover {
+    box-shadow: var(--shadow-md);
+}
+.card-header {
+    border-bottom: 1px solid var(--border-color);
+    border-radius: var(--radius-lg) var(--radius-lg) 0 0 !important;
+}
+/* Buttons */
+.btn-primary {
+    background: var(--accent);
+    border-color: var(--accent);
+    color: white;
+    transition: var(--transition-base);
+}
+.btn-primary:hover {
+    background: var(--accent-dark);
+    border-color: var(--accent-dark);
+    box-shadow: var(--shadow-md);
+}
+.btn-outline-primary {
+    border-color: var(--accent);
+    color: var(--accent);
+    background: transparent;
+}
+.btn-outline-primary:hover,
+.btn-outline-primary:active,
+.btn-outline-primary:focus {
+    background: var(--accent);
+    border-color: var(--accent);
+    color: white;
+    box-shadow: var(--shadow-sm);
+}
+.btn-check:checked + .btn-outline-primary {
+    background: var(--accent);
+    border-color: var(--accent);
+    color: white;
+}
+/* Form Controls */
+.form-control,
+.form-select {
+    background: var(--card-bg);
+    color: var(--text-primary);
+    border-color: var(--border-color);
+    transition: var(--transition-base);
+}
+.form-control:focus,
+.form-select:focus {
+    background: var(--card-bg);
+    color: var(--text-primary);
+    border-color: var(--accent);
+    box-shadow: 0 0 0 0.2rem rgba(49, 130, 206, 0.15);
+}
+/* List Group */
+.list-group-item {
+    background: var(--card-bg);
+    color: var(--text-primary);
+    border-color: var(--border-color);
+    cursor: pointer;
+    transition: var(--transition-base);
+}
+.list-group-item:hover {
+    background: rgba(49, 130, 206, 0.1);
+    border-color: var(--accent);
+}
+.list-group-item.active {
+    background: var(--accent);
+    border-color: var(--accent);
+    color: white;
+}
+/* Badges */
+.badge {
+    padding: 0.5em 0.75em;
+    font-weight: 600;
+}
+.badge.bg-success {
+    background-color: var(--status-green) !important;
+}
+.badge.bg-warning {
+    background-color: var(--status-yellow) !important;
+}
+.badge.bg-danger {
+    background-color: var(--status-red) !important;
+}
+/* Device Status */
+#deviceBadge {
+    font-size: 0.9rem;
+    padding: 0.4em 0.8em;
+}
+.badge.bg-secondary {
+    background-color: var(--text-secondary) !important;
+}
+/* Image Gallery */
+.image-gallery {
+    display: grid;
+    grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
+    gap: 1.5rem;
+    margin-top: 1rem;
+}
+.image-item {
+    position: relative;
+    border-radius: var(--radius-md);
+    overflow: hidden;
+    box-shadow: var(--shadow-sm);
+    transition: var(--transition-base);
+    background: var(--card-bg);
+    border: 1px solid var(--border-color);
+}
+.image-item:hover {
+    box-shadow: var(--shadow-lg);
+    transform: translateY(-2px);
+}
+.image-item img {
+    width: 100%;
+    height: auto;
+    display: block;
+}
+.image-item .image-caption {
+    padding: 0.75rem;
+    background: var(--card-bg);
+    color: var(--text-primary);
+    font-size: 0.875rem;
+    border-top: 1px solid var(--border-color);
+}
+/* Stats Cards */
+.stat-card {
+    background: var(--card-bg);
+    border: 1px solid var(--border-color);
+    border-radius: var(--radius-md);
+    padding: 1.5rem;
+    text-align: center;
+    transition: var(--transition-base);
+}
+.stat-card:hover {
+    box-shadow: var(--shadow-md);
+    transform: translateY(-2px);
+}
+.stat-card .stat-value {
+    font-size: 2rem;
+    font-weight: 700;
+    color: var(--accent);
+    margin: 0.5rem 0;
+}
+.stat-card .stat-label {
+    color: var(--text-secondary);
+    font-size: 0.875rem;
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+}
+/* Loading Spinner */
+.spinner-border {
+    width: 2rem;
+    height: 2rem;
+}
+/* Empty State */
+.text-center {
+    color: var(--text-secondary);
+}
+/* Markdown Preview */
+.markdown-preview {
+    background: var(--card-bg);
+    border: 1px solid var(--border-color);
+    border-radius: var(--radius-md);
+    padding: 1.5rem;
+    max-height: 600px;
+    overflow-y: auto;
+    font-family: 'Courier New', monospace;
+    font-size: 0.9rem;
+    line-height: 1.6;
+}
+/* Download Buttons */
+.download-btn-group {
+    display: flex;
+    gap: 0.5rem;
+    flex-wrap: wrap;
+    margin-top: 1rem;
+}
+/* Responsive */
+@media (max-width: 768px) {
+    .image-gallery {
+        grid-template-columns: 1fr;
+    }
+    .sticky-top {
+        position: relative !important;
+    }
+}
+/* Scrollbar Styling */
+::-webkit-scrollbar {
+    width: 8px;
+    height: 8px;
+}
+::-webkit-scrollbar-track {
+    background: var(--background-light);
+}
+::-webkit-scrollbar-thumb {
+    background: var(--text-secondary);
+    border-radius: 4px;
+}
+::-webkit-scrollbar-thumb:hover {
+    background: var(--accent);
+}

static/js/app.js ADDED

	@@ -0,0 +1,482 @@

+// Application State
+const AppState = {
+    currentPdf: null,
+    pdfs: [],
+    deviceInfo: null,
+};
+// Initialize on page load
+document.addEventListener('DOMContentLoaded', function() {
+    initializeTheme();
+    loadDeviceInfo();
+    initializeEventListeners();
+    loadPdfList();
+});
+// Theme Toggle
+function initializeTheme() {
+    const savedTheme = localStorage.getItem('theme') || 'light';
+    document.body.setAttribute('data-theme', savedTheme);
+    updateThemeIcon(savedTheme);
+    document.getElementById('themeToggle').addEventListener('click', function() {
+        const currentTheme = document.body.getAttribute('data-theme');
+        const newTheme = currentTheme === 'dark' ? 'light' : 'dark';
+        document.body.setAttribute('data-theme', newTheme);
+        localStorage.setItem('theme', newTheme);
+        updateThemeIcon(newTheme);
+    });
+}
+function updateThemeIcon(theme) {
+    const icon = document.getElementById('themeIcon');
+    if (icon) {
+        icon.className = theme === 'dark' ? 'fas fa-sun' : 'fas fa-moon';
+    }
+}
+// Load Device Info
+async function loadDeviceInfo() {
+    try {
+        const response = await fetch('/api/device-info');
+        if (!response.ok) throw new Error(`HTTP ${response.status}`);
+        AppState.deviceInfo = await response.json();
+        updateDeviceStatus(AppState.deviceInfo);
+    } catch (error) {
+        console.error('Error loading device info:', error);
+        updateDeviceStatus({ device: 'unknown', cuda_available: false });
+    }
+}
+function updateDeviceStatus(info) {
+    const badge = document.getElementById('deviceBadge');
+    const deviceName = document.getElementById('deviceName');
+    if (info.cuda_available) {
+        badge.textContent = 'GPU';
+        badge.className = 'badge bg-success';
+        deviceName.textContent = info.device_name || 'CUDA Device';
+    } else {
+        badge.textContent = 'CPU';
+        badge.className = 'badge bg-secondary';
+        deviceName.textContent = 'CPU Processing';
+    }
+}
+// Event Listeners
+function initializeEventListeners() {
+    const uploadForm = document.getElementById('uploadForm');
+    uploadForm.addEventListener('submit', handleUpload);
+}
+// Handle File Upload
+async function handleUpload(e) {
+    e.preventDefault();
+    const fileInput = document.getElementById('fileInput');
+    const files = fileInput.files;
+    if (files.length === 0) {
+        alert('Please select at least one PDF file');
+        return;
+    }
+    const extractionMode = document.querySelector('input[name="extractionMode"]:checked').value;
+    // Show processing section
+    document.getElementById('processingSection').style.display = 'block';
+    document.getElementById('resultsSection').style.display = 'none';
+    document.getElementById('emptyState').style.display = 'none';
+    const formData = new FormData();
+    for (let i = 0; i < files.length; i++) {
+        formData.append('files[]', files[i]);
+    }
+    formData.append('extraction_mode', extractionMode);
+    try {
+        const response = await fetch('/api/upload', {
+            method: 'POST',
+            body: formData
+        });
+        const data = await response.json();
+        if (data.error) {
+            throw new Error(data.error);
+        }
+        // Hide processing section
+        document.getElementById('processingSection').style.display = 'none';
+        // Reload PDF list and show results
+        await loadPdfList();
+        // Show first PDF details if available
+        if (data.results && data.results.length > 0) {
+            const firstPdf = data.results[0];
+            if (!firstPdf.error) {
+                showPdfDetails(firstPdf.stem);
+            }
+        }
+        // Reset form
+        fileInput.value = '';
+    } catch (error) {
+        console.error('Upload error:', error);
+        alert('Error processing files: ' + error.message);
+        document.getElementById('processingSection').style.display = 'none';
+    }
+}
+// Load PDF List
+async function loadPdfList() {
+    try {
+        const response = await fetch('/api/pdf-list');
+        const data = await response.json();
+        AppState.pdfs = data.pdfs || [];
+        renderPdfList();
+        if (AppState.pdfs.length > 0) {
+            document.getElementById('resultsSection').style.display = 'block';
+            document.getElementById('emptyState').style.display = 'none';
+        } else {
+            document.getElementById('resultsSection').style.display = 'none';
+            document.getElementById('emptyState').style.display = 'block';
+        }
+    } catch (error) {
+        console.error('Error loading PDF list:', error);
+    }
+}
+// Render PDF List
+function renderPdfList() {
+    const pdfList = document.getElementById('pdfList');
+    pdfList.innerHTML = '';
+    if (AppState.pdfs.length === 0) {
+        pdfList.innerHTML = '<div class="text-center text-muted p-3">No PDFs processed yet</div>';
+        return;
+    }
+    AppState.pdfs.forEach((pdf, index) => {
+        const item = document.createElement('div');
+        item.className = `list-group-item d-flex align-items-center justify-content-between ${index === 0 && !AppState.currentPdf ? 'active' : ''} ${AppState.currentPdf === pdf.stem ? 'active' : ''}`;
+        const left = document.createElement('a');
+        left.href = '#';
+        left.className = 'flex-grow-1 text-decoration-none text-reset';
+        left.innerHTML = `
+            <div class="d-flex w-100 justify-content-between">
+                <h6 class="mb-0">
+                    <i class="fas fa-file-pdf me-2"></i>
+                    ${pdf.stem}
+                </h6>
+            </div>
+        `;
+        left.addEventListener('click', function(e) {
+            e.preventDefault();
+            // Update active state
+            document.querySelectorAll('#pdfList .list-group-item').forEach(i => i.classList.remove('active'));
+            item.classList.add('active');
+            showPdfDetails(pdf.stem);
+        });
+        const delBtn = document.createElement('button');
+        delBtn.className = 'btn btn-sm btn-outline-danger ms-3';
+        delBtn.innerHTML = '<i class="fas fa-trash-alt"></i>';
+        delBtn.title = `Delete "${pdf.stem}"`;
+        delBtn.addEventListener('click', async (e) => {
+            e.preventDefault();
+            e.stopPropagation();
+            const confirmed = confirm(`Delete processed outputs for "${pdf.stem}"? This cannot be undone.`);
+            if (!confirmed) return;
+            try {
+                // Use form-encoded POST to the body endpoint for widest compatibility
+                const resp = await fetch('/api/delete', {
+                    method: 'POST',
+                    headers: { 'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8' },
+                    body: new URLSearchParams({ stem: pdf.stem }).toString()
+                });
+                const raw = await resp.text();
+                let res;
+                try { res = JSON.parse(raw); } catch (_) { res = null; }
+                if (!resp.ok || res?.error) {
+                    throw new Error(res?.error || raw || 'Delete failed');
+                }
+                // Refresh list and clear details if we deleted the active item
+                if (AppState.currentPdf === pdf.stem) {
+                    AppState.currentPdf = null;
+                    const details = document.getElementById('pdfDetails');
+                    if (details) {
+                        details.innerHTML = `
+                            <div class="alert alert-success">
+                                <i class="fas fa-check-circle me-2"></i>
+                                Deleted "${pdf.stem}" successfully.
+                            </div>
+                        `;
+                    }
+                }
+                await loadPdfList();
+            } catch (err) {
+                console.error('Delete error:', err);
+                alert('Failed to delete: ' + (err?.message || err));
+            }
+        });
+        item.appendChild(left);
+        item.appendChild(delBtn);
+        pdfList.appendChild(item);
+    });
+}
+// Show PDF Details
+async function showPdfDetails(pdfStem) {
+    AppState.currentPdf = pdfStem;
+    // Update active state in list
+    document.querySelectorAll('#pdfList .list-group-item').forEach((item, index) => {
+        item.classList.remove('active');
+        const pdfStemFromItem = AppState.pdfs[index]?.stem;
+        if (pdfStemFromItem === pdfStem) {
+            item.classList.add('active');
+        }
+    });
+    try {
+        const response = await fetch(`/api/pdf-details/${encodeURIComponent(pdfStem)}`);
+        const data = await response.json();
+        if (data.error) {
+            throw new Error(data.error);
+        }
+        renderPdfDetails(data);
+    } catch (error) {
+        console.error('Error loading PDF details:', error);
+        document.getElementById('pdfDetails').innerHTML = `
+            <div class="alert alert-danger">
+                <i class="fas fa-exclamation-circle me-2"></i>
+                Error loading PDF details: ${error.message}
+            </div>
+        `;
+    }
+}
+// Render PDF Details
+function renderPdfDetails(data) {
+    const container = document.getElementById('pdfDetails');
+    let html = `
+        <div class="card shadow-sm mb-4">
+            <div class="card-header bg-primary-custom text-white">
+                <h5 class="mb-0">
+                    <i class="fas fa-file-pdf me-2"></i>
+                    ${data.stem}
+                </h5>
+                <button class="btn btn-sm btn-danger float-end" id="deletePdfBtn" title="Delete this processed PDF">
+                    <i class="fas fa-trash-alt me-1"></i> Delete
+                </button>
+            </div>
+            <div class="card-body">
+                <div class="row mb-4">
+                    <div class="col-md-3">
+                        <div class="stat-card">
+                            <i class="fas fa-images fa-2x text-primary mb-2"></i>
+                            <div class="stat-value">${data.figures_count || 0}</div>
+                            <div class="stat-label">Figures</div>
+                        </div>
+                    </div>
+                    <div class="col-md-3">
+                        <div class="stat-card">
+                            <i class="fas fa-table fa-2x text-primary mb-2"></i>
+                            <div class="stat-value">${data.tables_count || 0}</div>
+                            <div class="stat-label">Tables</div>
+                        </div>
+                    </div>
+                    <div class="col-md-3">
+                        <div class="stat-card">
+                            <i class="fas fa-list fa-2x text-primary mb-2"></i>
+                            <div class="stat-value">${data.elements_count || 0}</div>
+                            <div class="stat-label">Total Elements</div>
+                        </div>
+                    </div>
+                    <div class="col-md-3">
+                        <div class="stat-card">
+                            <i class="fas fa-microchip fa-2x text-primary mb-2"></i>
+                            <div class="stat-value">${AppState.deviceInfo?.cuda_available ? 'GPU' : 'CPU'}</div>
+                            <div class="stat-label">Device</div>
+                        </div>
+                    </div>
+                </div>
+                <div class="download-btn-group">
+    `;
+    if (data.annotated_pdf) {
+        html += `
+            <a href="/output/${data.annotated_pdf}" class="btn btn-primary" download>
+                <i class="fas fa-download me-2"></i>
+                Download Annotated PDF
+            </a>
+        `;
+    }
+    if (data.markdown_path) {
+        html += `
+            <a href="/output/${data.markdown_path}" class="btn btn-outline-primary" download>
+                <i class="fas fa-download me-2"></i>
+                Download Markdown
+            </a>
+        `;
+    }
+    html += `
+                </div>
+            </div>
+        </div>
+    `;
+    // Figures Section
+    if (data.figure_images && data.figure_images.length > 0) {
+        html += `
+            <div class="card shadow-sm mb-4">
+                <div class="card-header">
+                    <h5 class="mb-0">
+                        <i class="fas fa-images me-2"></i>
+                        Figures (${data.figure_images.length})
+                    </h5>
+                </div>
+                <div class="card-body">
+                    <div class="image-gallery">
+        `;
+        data.figure_images.forEach((imgPath, index) => {
+            const figure = data.figures?.[index] || {};
+            html += `
+                <div class="image-item">
+                    <img src="/output/${imgPath}" alt="Figure ${index + 1}" loading="lazy">
+                    <div class="image-caption">
+                        <strong>Figure ${index + 1}</strong>
+                        ${figure.page ? `<br><small class="text-muted">Page ${figure.page}</small>` : ''}
+                    </div>
+                </div>
+            `;
+        });
+        html += `
+                    </div>
+                </div>
+            </div>
+        `;
+    }
+    // Tables Section
+    if (data.table_images && data.table_images.length > 0) {
+        html += `
+            <div class="card shadow-sm mb-4">
+                <div class="card-header">
+                    <h5 class="mb-0">
+                        <i class="fas fa-table me-2"></i>
+                        Tables (${data.table_images.length})
+                    </h5>
+                </div>
+                <div class="card-body">
+                    <div class="image-gallery">
+        `;
+        data.table_images.forEach((imgPath, index) => {
+            const table = data.tables?.[index] || {};
+            html += `
+                <div class="image-item">
+                    <img src="/output/${imgPath}" alt="Table ${index + 1}" loading="lazy">
+                    <div class="image-caption">
+                        <strong>Table ${index + 1}</strong>
+                        ${table.page ? `<br><small class="text-muted">Page ${table.page}</small>` : ''}
+                    </div>
+                </div>
+            `;
+        });
+        html += `
+                    </div>
+                </div>
+            </div>
+        `;
+    }
+    // Markdown Preview
+    if (data.markdown_path) {
+        html += `
+            <div class="card shadow-sm">
+                <div class="card-header">
+                    <h5 class="mb-0">
+                        <i class="fas fa-file-code me-2"></i>
+                        Markdown Preview
+                    </h5>
+                </div>
+                <div class="card-body">
+                    <div class="markdown-preview" id="markdownPreview">
+                        Loading markdown...
+                    </div>
+                </div>
+            </div>
+        `;
+    }
+    container.innerHTML = html;
+    // Load markdown preview if available
+    if (data.markdown_path) {
+        loadMarkdownPreview(data.markdown_path);
+    }
+    // Wire delete button
+    const deleteBtn = document.getElementById('deletePdfBtn');
+    if (deleteBtn) {
+        deleteBtn.addEventListener('click', async () => {
+            const confirmed = confirm(`Delete processed outputs for "${data.stem}"? This cannot be undone.`);
+            if (!confirmed) return;
+            try {
+                // Use form-encoded POST to the body endpoint for widest compatibility
+                const resp = await fetch('/api/delete', {
+                    method: 'POST',
+                    headers: { 'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8' },
+                    body: new URLSearchParams({ stem: data.stem }).toString()
+                });
+                const raw = await resp.text();
+                let res;
+                try { res = JSON.parse(raw); } catch (_) { res = null; }
+                if (!resp.ok || (res && res.error)) {
+                    throw new Error((res && res.error) || raw || 'Delete failed');
+                }
+                // Refresh list and clear details
+                await loadPdfList();
+                document.getElementById('pdfDetails').innerHTML = `
+                    <div class="alert alert-success">
+                        <i class="fas fa-check-circle me-2"></i>
+                        Deleted "${data.stem}" successfully.
+                    </div>
+                `;
+                AppState.currentPdf = null;
+            } catch (err) {
+                console.error('Delete error:', err);
+                alert('Failed to delete: ' + (err?.message || err));
+            }
+        });
+    }
+}
+// Load Markdown Preview
+async function loadMarkdownPreview(markdownPath) {
+    try {
+        const response = await fetch(`/output/${markdownPath}`);
+        if (!response.ok) throw new Error(`HTTP ${response.status}`);
+        document.getElementById('markdownPreview').textContent = await response.text();
+    } catch (error) {
+        console.error('Error loading markdown:', error);
+        document.getElementById('markdownPreview').textContent = 'Error loading markdown content';
+    }
+}
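
Note on the template strings in static/js/app.js: values such as `pdf.stem`, `data.stem`, and `error.message` are interpolated into `innerHTML` without escaping. If the server does not already sanitize these values, a small escaping helper could be applied at each interpolation site. The sketch below is an illustration only; `escapeHtml` is a hypothetical name and is not part of this commit.

```javascript
// Hypothetical helper (not in the commit): escape a value before it is
// interpolated into an HTML template string assigned to innerHTML.
function escapeHtml(value) {
    const replacements = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;',
    };
    // String() lets the helper accept numbers and other non-string values.
    return String(value).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

// Example: a stem containing markup is turned into inert text.
const stem = '<img src=x onerror=alert(1)>';
const safe = escapeHtml(stem);
console.log(safe);
```

With such a helper, an interpolation like `${pdf.stem}` would become `${escapeHtml(pdf.stem)}`. Setting `textContent` on a dedicated element is an alternative that avoids building HTML strings for user-visible names at all.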

templates/index.html ADDED

	@@ -0,0 +1,183 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>PDF Layout Extractor</title>
+    <!-- Bootstrap 5 CSS -->
+    <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet">
+    <!-- Font Awesome -->
+    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
+    <!-- Custom CSS -->
+    <link href="{{ url_for('static', filename='css/styles.css') }}" rel="stylesheet">
+</head>
+<body data-theme="light">
+    <!-- Navigation -->
+    <nav class="navbar navbar-expand-lg navbar-dark bg-primary-custom">
+        <div class="container-fluid">
+            <a class="navbar-brand" href="#">
+                <i class="fas fa-file-pdf me-2"></i>
+                PDF Layout Extractor
+            </a>
+            <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav" aria-controls="navbarNav" aria-expanded="false" aria-label="Toggle navigation">
+                <span class="navbar-toggler-icon"></span>
+            </button>
+            <div class="collapse navbar-collapse" id="navbarNav">
+                <ul class="navbar-nav ms-auto">
+                    <li class="nav-item">
+                        <button type="button" class="btn btn-link nav-link text-white" id="themeToggle" aria-label="Toggle theme">
+                            <i class="fas fa-moon" id="themeIcon"></i>
+                        </button>
+                    </li>
+                </ul>
+            </div>
+        </div>
+    </nav>
+    <!-- Main Container -->
+    <div class="container-fluid mt-4">
+        <!-- Device Status Card -->
+        <div class="row mb-4">
+            <div class="col-12">
+                <div class="card shadow-sm">
+                    <div class="card-body">
+                        <div class="d-flex justify-content-between align-items-center">
+                            <div>
+                                <h5 class="card-title mb-1">
+                                    <i class="fas fa-microchip me-2"></i>
+                                    Processing Device
+                                </h5>
+                                <p class="text-muted mb-0" id="deviceStatus">
+                                    <span class="badge bg-secondary" id="deviceBadge">Checking...</span>
+                                </p>
+                            </div>
+                            <div class="text-end">
+                                <div id="deviceInfo" class="small text-muted">
+                                    <div id="deviceName">-</div>
+                                </div>
+                            </div>
+                        </div>
+                    </div>
+                </div>
+            </div>
+        </div>
+        <!-- Upload Section -->
+        <div class="row mb-4">
+            <div class="col-12">
+                <div class="card shadow-sm">
+                    <div class="card-header bg-primary-custom text-white">
+                        <h5 class="mb-0">
+                            <i class="fas fa-upload me-2"></i>
+                            Upload PDFs
+                        </h5>
+                    </div>
+                    <div class="card-body">
+                        <form id="uploadForm">
+                            <div class="mb-3">
+                                <label for="fileInput" class="form-label">Select PDF Files</label>
+                                <input type="file" class="form-control" id="fileInput"
+                                       accept=".pdf" multiple required>
+                                <div class="form-text">You can select multiple PDF files at once</div>
+                            </div>
+                            <div class="mb-3">
+                                <label class="form-label">Extraction Mode</label>
+                                <div class="btn-group w-100" role="group">
+                                    <input type="radio" class="btn-check" name="extractionMode"
+                                           id="modeImages" value="images" checked>
+                                    <label class="btn btn-outline-primary" for="modeImages">
+                                        <i class="fas fa-images me-2"></i>Images Only
+                                    </label>
+                                    <input type="radio" class="btn-check" name="extractionMode"
+                                           id="modeMarkdown" value="markdown">
+                                    <label class="btn btn-outline-primary" for="modeMarkdown">
+                                        <i class="fas fa-file-code me-2"></i>Markdown Only
+                                    </label>
+                                    <input type="radio" class="btn-check" name="extractionMode"
+                                           id="modeBoth" value="both">
+                                    <label class="btn btn-outline-primary" for="modeBoth">
+                                        <i class="fas fa-layer-group me-2"></i>Both
+                                    </label>
+                                </div>
+                            </div>
+                            <button type="submit" class="btn btn-primary w-100" id="uploadBtn">
+                                <i class="fas fa-upload me-2"></i>
+                                Upload and Process
+                            </button>
+                        </form>
+                    </div>
+                </div>
+            </div>
+        </div>
+        <!-- Processing Status -->
+        <div class="row mb-4" id="processingSection" style="display: none;">
+            <div class="col-12">
+                <div class="card shadow-sm">
+                    <div class="card-body">
+                        <div class="d-flex align-items-center">
+                            <div class="spinner-border text-primary me-3" role="status">
+                                <span class="visually-hidden">Loading...</span>
+                            </div>
+                            <div>
+                                <h6 class="mb-0">Processing PDFs...</h6>
+                                <small class="text-muted" id="processingStatus">Please wait</small>
+                            </div>
+                        </div>
+                    </div>
+                </div>
+            </div>
+        </div>
+        <!-- Results Section -->
+        <div class="row" id="resultsSection" style="display: none;">
+            <div class="col-md-4">
+                <div class="card shadow-sm sticky-top" style="top: 20px;">
+                    <div class="card-header bg-primary-custom text-white">
+                        <h5 class="mb-0">
+                            <i class="fas fa-list me-2"></i>
+                            Processed PDFs
+                        </h5>
+                    </div>
+                    <div class="card-body">
+                        <div class="list-group" id="pdfList">
+                            <!-- PDF list will be populated here -->
+                        </div>
+                    </div>
+                </div>
+            </div>
+            <div class="col-md-8">
+                <div id="pdfDetails">
+                    <!-- PDF details will be shown here -->
+                </div>
+            </div>
+        </div>
+        <!-- Empty State -->
+        <div class="row" id="emptyState">
+            <div class="col-12">
+                <div class="card shadow-sm">
+                    <div class="card-body text-center py-5">
+                        <i class="fas fa-file-pdf fa-3x text-muted mb-3"></i>
+                        <h5 class="text-muted">No PDFs processed yet</h5>
+                        <p class="text-muted">Upload PDF files above to get started</p>
+                    </div>
+                </div>
+            </div>
+        </div>
+    </div>
+    <!-- Bootstrap JS -->
+    <script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js"></script>
+    <!-- Custom JS -->
+    <script src="{{ url_for('static', filename='js/app.js') }}"></script>
+</body>
+</html>

uv.lock ADDED

The diff for this file is too large to render.