
In Progress
Posted
Paid on delivery
I need a self-contained conversion engine that takes common office files and produces clean, faithful output in other formats without losing a single detail. The converter must reliably handle the following flows:

• PDF → Word
• PDF → XML
• Word → HTML
• Word → XML

For every run, the resulting file must keep all original elements intact: overall layout and styling, embedded images and graphics, live hyperlinks, tables, italics, and any special characters. An end user should be able to open the converted document and see no visual or structural difference from the source, aside from the new file type.

I am open to whichever stack or library you believe best meets these goals (e.g., Python with PDFPlumber + python-docx, Java with Apache POI or Aspose, C# with iText, or even a headless LibreOffice/Pandoc workflow). The key requirement is accuracy and speed under batch processing.

Deliverables
• Source code with clear build/run instructions
• Command-line tool or callable API that receives an input path, output path, and target format
• Brief read-me describing any third-party dependencies and their licences
• A small test suite proving conversions of at least five sample files per route, highlighting preservation of all critical aspects

Acceptance criteria
1. Pixel-level comparison of original vs. converted screenshots shows no misalignment.
2. Automated diff confirms all hyperlinks, image counts, and table structures are present.
3. All sample files pass without manual correction.

If you already have a similar solution, let me know; otherwise outline your proposed approach, main libraries, and estimated timeline so we can move forward quickly.
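To make the requested command-line interface concrete, here is a minimal Python sketch of a tool that receives an input path, an output path, and a target format, and rejects any route outside the four listed in the brief. The program name `docconvert`, the helper `resolve_route`, and the assumption that "Word" means `.docx` only are all illustrative and not part of the brief; the actual conversion step is left as a stub.

```python
import argparse
from pathlib import Path

# The four routes from the brief, keyed by (source extension, target format).
# Assumption: "Word" input means .docx; legacy .doc is not handled here.
SUPPORTED_ROUTES = {
    (".pdf", "word"),
    (".pdf", "xml"),
    (".docx", "html"),
    (".docx", "xml"),
}


def resolve_route(input_path: str, target: str) -> tuple[str, str]:
    """Return (extension, target) if the route is supported, else raise ValueError."""
    ext = Path(input_path).suffix.lower()
    route = (ext, target.lower())
    if route not in SUPPORTED_ROUTES:
        raise ValueError(f"unsupported route: {ext or '(no extension)'} -> {target}")
    return route


def build_parser() -> argparse.ArgumentParser:
    """Build the three-argument CLI: input path, output path, target format."""
    parser = argparse.ArgumentParser(
        prog="docconvert",  # hypothetical tool name
        description="Convert PDF/Word files to Word, XML, or HTML.",
    )
    parser.add_argument("input", help="path to the source file")
    parser.add_argument("output", help="path for the converted file")
    parser.add_argument("target", choices=["word", "xml", "html"])
    return parser


def main(argv=None) -> int:
    """Parse arguments, validate the route, and report what would be converted."""
    args = build_parser().parse_args(argv)
    try:
        route = resolve_route(args.input, args.target)
    except ValueError as exc:
        print(exc)
        return 2
    # Stub: a real tool would dispatch to the converter for this route here.
    print(f"would convert {args.input} -> {args.output} via {route}")
    return 0
```

Calling `main(["report.docx", "report.html", "html"])` returns 0, while an unsupported pairing such as a PDF with target `html` returns 2. Keeping route validation separate from argument parsing means the same check can back a callable API as well as the command line.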
Project ID: 40476194
1 proposal
Remote project
Active 6 days ago

Losing formatting, broken tables, and missing images in document conversion is almost always a library-choice problem, not a code problem. I've shipped production document-processing pipelines and know which tools hold up under batch loads.

For the United Nations, I built a production document and format-conversion system handling multiple document types across four languages. The system had to preserve full structure and formatting at the pace field operations demanded.

For this work, I'd use LibreOffice headless for Word-to-HTML and Word-to-XML (preserves styles natively), PDFPlumber plus python-docx for PDF-to-Word with image extraction, and lxml for clean XML serialization. The test suite would cover pixel-comparison screenshots and automated hyperlink plus table diffing across all four routes.

One question: are the source PDFs born-digital or scanned? Scanned PDFs need OCR in the pipeline, which changes the accuracy ceiling significantly.
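As a rough illustration of the automated hyperlink and table diffing mentioned above, the following Python sketch counts links, images, tables, and table rows in a converted HTML file and compares them with expected counts. It uses only the standard library; the names `StructureCounter`, `count_structure`, and `diff_structure` are hypothetical, and the expected counts are assumed to come from a separate inspection of the source document.

```python
from collections import Counter
from html.parser import HTMLParser

KEYS = ("hyperlinks", "images", "tables", "table_rows")


class StructureCounter(HTMLParser):
    """Tally the structural elements the acceptance criteria care about."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Only anchors with an href are live hyperlinks; named anchors are skipped.
        if tag == "a" and attr_map.get("href"):
            self.counts["hyperlinks"] += 1
        elif tag == "img":
            self.counts["images"] += 1
        elif tag == "table":
            self.counts["tables"] += 1
        elif tag == "tr":
            self.counts["table_rows"] += 1


def count_structure(html: str) -> dict:
    """Return counts of hyperlinks, images, tables, and table rows in the HTML."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    return {key: parser.counts[key] for key in KEYS}


def diff_structure(expected: dict, actual: dict) -> dict:
    """Map each mismatched key to (expected, actual); empty dict means a pass."""
    return {
        key: (expected[key], actual.get(key, 0))
        for key in expected
        if expected[key] != actual.get(key, 0)
    }
```

An empty result from `diff_structure` means the converted file carries the same number of each element as expected. Matching counts are a necessary check rather than a sufficient one: they do not confirm that link targets, image content, or cell contents are unchanged, so a fuller suite would compare those values as well.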
₹600 INR in 7 days
Hyderabad, India
Payment method verified
Member since May 29, 2026