AI Document Translator & OCR Tool

A Python pipeline that translates PDF documents from English to Turkish using Hugging Face's MarianMT model, with Tesseract OCR for text embedded in images, and generates a PDF that shows the original and translated text side by side.

Python · Automation · Deep Learning · PyTorch · Computer Vision · OCR · Tesseract · Pillow · Natural Language Processing (NLP) · Hugging Face · Transformers · MarianMT

Overview

This tool automates the translation of PDF documents by combining Optical Character Recognition (OCR) with Neural Machine Translation (NMT). Text is extracted from each document, translated, and written to a reconstructed PDF that displays the original text alongside the translation, so the two can be checked against each other. It is intended for academic or professional use where the translation needs to be verified against the source.
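
The side-by-side output can be pictured as two equal columns on each page. The sketch below is one way to lay that out with FPDF, and not necessarily how this tool does it: the helper names (column_layout, write_side_by_side), the A4 page size, the margins, and the font_path argument are all assumptions made for illustration. A Unicode TrueType font is assumed because the built-in PDF core fonts cannot encode every Turkish letter (ğ, ş, ı, İ).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Columns:
    left_x: float   # x position of the source column (mm)
    right_x: float  # x position of the translation column (mm)
    width: float    # width of each column (mm)


def column_layout(page_width: float = 210.0, margin: float = 10.0,
                  gutter: float = 6.0) -> Columns:
    """Split the printable width into two equal columns separated by a gutter."""
    width = (page_width - 2 * margin - gutter) / 2
    return Columns(left_x=margin, right_x=margin + width + gutter, width=width)


def write_side_by_side(pairs, output_path, font_path):
    """Write (source, translation) pairs as aligned rows in a two-column PDF.

    font_path is a hypothetical path to a Unicode .ttf file.
    """
    from fpdf import FPDF  # imported lazily; column_layout works without it

    cols = column_layout()
    top, bottom, line_height, row_gap = 10.0, 270.0, 5.0, 4.0

    pdf = FPDF(format="A4")
    pdf.set_auto_page_break(auto=False)  # page breaks are handled per row below
    # fpdf2 call shape; the legacy PyFPDF package also needs uni=True here.
    pdf.add_font("Body", "", font_path)
    pdf.set_font("Body", size=10)
    pdf.add_page()

    y = top
    for source, translated in pairs:
        if y > bottom:  # start a new page once the current one is nearly full
            pdf.add_page()
            y = top
        pdf.set_xy(cols.left_x, y)
        pdf.multi_cell(cols.width, line_height, source)
        left_end = pdf.get_y()
        pdf.set_xy(cols.right_x, y)
        pdf.multi_cell(cols.width, line_height, translated)
        # The next row starts below whichever column ran longer.
        y = max(left_end, pdf.get_y()) + row_gap
    pdf.output(str(output_path))
```

This sketch only checks for a page break between rows, so a single very long paragraph could still run past the bottom margin; splitting long paragraphs before layout would avoid that.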

Key Features

  • Hybrid Text Extraction: Extracts the PDF's text layer with PyMuPDF and automatically falls back to Tesseract OCR for text embedded in images.
  • Neural Machine Translation: Uses the Helsinki-NLP/opus-mt-tc-big-en-tr MarianMT model from Hugging Face for English-to-Turkish translation.
  • Side-by-Side Layout: Places the source text and the translated text next to each other on the same page.
  • Optimization: Skips files that have already been processed and runs inference on a CUDA-enabled GPU when one is available.
  • Logging: Records progress and errors in translation_log.txt.
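
As a rough illustration of the extraction fallback, the skip logic, and the log file described above, here is a minimal sketch. It is not the tool's actual code: the function names, the 20-character threshold, the "_translated.pdf" naming scheme, the "input" and "output" directories, and the choice to OCR a rendering of the whole page (rather than individual embedded images) are all assumptions.

```python
import logging
from pathlib import Path

# Hypothetical threshold: pages with less text than this are treated as scanned.
MIN_NATIVE_CHARS = 20


def configure_logging(log_path: str = "translation_log.txt") -> None:
    """Send progress and error messages to the log file."""
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )


def needs_ocr(native_text: str, min_chars: int = MIN_NATIVE_CHARS) -> bool:
    """Return True when the PDF text layer is empty or too sparse to use."""
    return len(native_text.strip()) < min_chars


def extract_page_text(page, dpi: int = 300) -> str:
    """Use the page's text layer when present, otherwise OCR a rendering of it.

    `page` is expected to behave like a PyMuPDF page object.
    """
    native_text = page.get_text()
    if not needs_ocr(native_text):
        return native_text
    # Imported lazily so the helpers above work without the OCR stack installed.
    import io
    import pytesseract
    from PIL import Image

    pixmap = page.get_pixmap(dpi=dpi)
    image = Image.open(io.BytesIO(pixmap.tobytes("png")))
    return pytesseract.image_to_string(image, lang="eng")


def pending_pdfs(input_dir: Path, output_dir: Path,
                 suffix: str = "_translated.pdf") -> list[Path]:
    """List input PDFs that do not yet have a translated counterpart."""
    pending = []
    for pdf in sorted(input_dir.glob("*.pdf")):
        target = output_dir / f"{pdf.stem}{suffix}"
        if target.exists():
            logging.info("Skipping %s (already translated)", pdf.name)
            continue
        pending.append(pdf)
    return pending


if __name__ == "__main__":
    import fitz  # PyMuPDF

    configure_logging()
    for pdf_path in pending_pdfs(Path("input"), Path("output")):
        try:
            with fitz.open(pdf_path) as doc:
                pages = [extract_page_text(page) for page in doc]
            logging.info("Extracted %d pages from %s", len(pages), pdf_path.name)
        except Exception:
            logging.exception("Failed to process %s", pdf_path.name)
```

Checking for an existing output file makes reruns cheap: an interrupted batch can be restarted and only the unfinished documents are processed again.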

Tech Stack

  • Core Logic: Python
  • AI & ML: PyTorch, Transformers (Hugging Face)
  • PDF & Image Processing: PyMuPDF (imported as fitz), FPDF, Pillow
  • OCR Engine: Tesseract
  • Hardware Support: CUDA (GPU Acceleration)
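
To show how the PyTorch, Transformers, and CUDA pieces of this stack might fit together, the sketch below translates a list of text segments in batches with the MarianMT model named in the features list. The function names, the batch size of 8, and the use of truncation for over-long segments are assumptions for illustration, not a description of the tool's internals.

```python
from typing import Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate(segments: list[str],
              model_name: str = "Helsinki-NLP/opus-mt-tc-big-en-tr",
              batch_size: int = 8) -> list[str]:
    """Translate English text segments to Turkish with a MarianMT model."""
    # Imported lazily so `batched` can be used without the ML stack installed.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    # Use the GPU when PyTorch can see one, otherwise stay on the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()

    results: list[str] = []
    for batch in batched(segments, batch_size):
        # truncation=True cuts segments longer than the model's input limit;
        # splitting text into sentences beforehand avoids losing content.
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True).to(device)
        with torch.no_grad():
            generated = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

Batching keeps GPU memory use bounded on long documents; a smaller batch_size may be needed on cards with limited memory.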