
Closed
Posted
I have a single PDF that holds several hundred pages of ledger entries recorded over multiple years. Because of irregular spacing, merged columns, and stray narrative comments, the file can’t be queried, reconciled, or audited in its current form. Your task is to turn every line in that PDF into a structured dataset that I can drop straight into accounting software or run analysis on.

Scope of work
• Parse the PDF and capture every transaction (dates, descriptions, reference numbers, debit / credit amounts, and running balances) without loss of detail.
• Correct inconsistent number formats (e.g., minus signs, comma placement, mixed currencies) and standardise dates to ISO.
• Isolate any narrative comments so they appear in a separate “Notes” field rather than inside numeric columns.
• Flag and log any rows that fail numeric checks (unbalanced debits vs credits, non-numeric characters inside amount columns, etc.) so I can inspect them quickly.
• Deliver the cleaned output as a single, flat-file database. CSV is fine, but feel free to suggest a lightweight relational structure if you think it will add value. Include the transformation script (Python, R, or similar) so the process is fully reproducible.

Acceptance criteria
1. Row counts in the final dataset match the original ledger pages (no dropped or duplicated lines).
2. All numeric fields import into Excel or a SQL table as numbers, not text.
3. Your anomaly log lists every transaction you could not confidently parse and explains why.
4. The script runs end-to-end on my machine with only standard open-source libraries.

If you have experience wrangling messy PDFs with tools like Python (pandas, tabula-py, camelot) or R (tidyverse, tabulizer), that will be a plus, but feel free to use any stack you prefer as long as the deliverables meet the criteria above.
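The number and date clean-up listed in the scope of work could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a finished parser: the function names `to_iso_date` and `to_amount`, the list of date layouts, the day-first reading of ambiguous dates, and the assumption that "." is the decimal separator are all hypothetical and would need to be confirmed against the real ledger. The sketch also discards the currency marker, so a ledger with mixed currencies would need a separate currency column.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical date layouts (day-first); adjust after inspecting the ledger.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %b %Y", "%Y-%m-%d")


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def to_amount(raw):
    """Parse '1,234.50', '(1,234.50)', '1,234.50-', '-1,234.50' or '₹ 500'.

    Returns a Decimal, or None when the text is not a clean number, so the
    caller can send the row to the anomaly log instead of guessing.
    """
    text = raw.strip()
    negative = False
    # Accounting-style negatives: parentheses, trailing or leading minus.
    if text.startswith("(") and text.endswith(")"):
        negative, text = True, text[1:-1]
    if text.endswith("-"):
        negative, text = True, text[:-1]
    elif text.startswith("-"):
        negative, text = True, text[1:]
    # Drop currency symbols/codes, thousands separators and spaces.
    text = re.sub(r"[₹$€£]|INR|USD|EUR|GBP", "", text)
    text = text.replace(",", "").replace(" ", "")
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        return None
    value = Decimal(text)
    return -value if negative else value
```

Returning `None` rather than raising keeps the decision about what to do with an unparseable value in one place, which makes it easier to explain each failure in the anomaly log.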
Project ID: 40290568
14 proposals
Remote project
Active 27 days ago
14 freelancers are bidding an average of ₹966 INR/hour for this job

Hi,

As per my understanding: You have a large multi-year ledger stored in a single PDF with irregular formatting, merged columns, and narrative comments that make it unusable for analysis or accounting import. The objective is to extract every transaction line and convert it into a clean, structured dataset including date, description, reference number, debit, credit, balance, and notes. The final output must standardize number formats and dates, preserve all rows, separate narrative text into a Notes field, and provide an anomaly log for rows that cannot be reliably parsed.

Implementation approach: I will parse the PDF using a reproducible script (Python with libraries like pandas and PDF table extraction tools) to capture all ledger rows. Then I will normalize date formats to ISO, clean numeric values, and separate narrative text from financial fields. Validation checks will ensure debit/credit integrity and identify rows with parsing issues. Finally, I will deliver a clean CSV dataset, an anomaly log for review, and the full transformation script so the extraction can be rerun or extended later.

A few quick questions:
• Is the ledger mostly tabular pages or does the format change across sections?
• Are there multiple currencies that need to be preserved or normalized?
• Approximately how many total pages are in the PDF?
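One way the debit/credit integrity check mentioned in this bid might look is a running-balance reconciliation. The sketch below is only an illustration under stated assumptions: the function name `check_running_balance`, the row keys, and the sign convention (balance = previous balance + credit - debit) are hypothetical, and the real ledger may use the opposite convention.

```python
from decimal import Decimal


def check_running_balance(rows, opening_balance):
    """Return a list of (row_number, reason) for rows that do not reconcile.

    Each row is a dict with Decimal 'debit', 'credit' and 'balance' values.
    Assumed convention: balance = previous balance + credit - debit.
    """
    anomalies = []
    previous = opening_balance
    for number, row in enumerate(rows, start=1):
        expected = previous + row["credit"] - row["debit"]
        if row["balance"] != expected:
            reason = f"balance {row['balance']} != expected {expected}"
            anomalies.append((number, reason))
        # Continue from the stated balance so that a single bad row does
        # not flag every row after it.
        previous = row["balance"]
    return anomalies


rows = [
    {"debit": Decimal("0"), "credit": Decimal("100"), "balance": Decimal("100")},
    {"debit": Decimal("30"), "credit": Decimal("0"), "balance": Decimal("80")},
    {"debit": Decimal("10"), "credit": Decimal("0"), "balance": Decimal("70")},
]
anomalies = check_running_balance(rows, Decimal("0"))
```

In this example only the second row is reported, because its stated balance of 80 does not follow from 100 minus a debit of 30. Resynchronising on the stated balance keeps the anomaly log short enough to inspect quickly.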
₹750 INR in 40 days
5.0

Your project to convert a complex multi-year PDF ledger into a clean, queryable dataset caught my attention because of the detailed challenges you described. I understand you need every transaction line extracted accurately despite irregular spacing and merged columns, which makes the file currently unusable for analysis or accounting. You need the script to parse dates, descriptions, references, debit/credit amounts, and balances while standardizing inconsistent number formats and isolating narrative comments into a separate field. The requirement to flag anomalies and deliver a reproducible Python script with clean CSV or lightweight relational data output aligns well with your goals for auditability and ease of import into Excel or SQL. I recently completed a project where I extracted and cleaned financial data from complex PDFs using Python libraries like Camelot and Pandas, delivering a normalized CSV and a script for reproducibility. I handled similar issues with merged cells, inconsistent number formats, and anomaly logging, ensuring all numeric fields imported correctly into SQL tables without loss of detail. I can complete this task within 5 days, ensuring thorough testing and clear anomaly reporting. Let’s discuss how to get started on transforming your ledger into a clean, functional dataset.
₹825 INR in 7 days
2.9

Hi,

I read your project about converting a multi-year ledger PDF into a clean, structured dataset, and I’d be happy to help. Handling messy PDFs with irregular spacing, merged columns, and mixed formatting is something I’ve done before using Python data pipelines.

**Relevant Experience:**
• Built Python data processing workflows using **pandas, regex parsing, and PDF extraction tools** to convert unstructured documents into structured datasets.
• Developed automated pipelines that **clean numeric formats, normalize dates, and validate financial records** with anomaly logging.

For your ledger, I would:
• Extract every transaction from the PDF using tools like **Camelot/Tabula + pandas**.
• Standardize **dates (ISO), debit/credit formats, and currency values**.
• Separate narrative comments into a **Notes field**.
• Implement validation checks to **flag rows with parsing issues or numeric inconsistencies**.
• Deliver a **clean CSV dataset + anomaly log + reproducible Python script** so the process can run end-to-end on your machine with open-source libraries.

Accuracy and traceability are key for financial data, so I’ll also ensure the **final row counts match the original ledger pages** and all numeric columns import cleanly into Excel or SQL.

Would you be able to share **a sample page of the PDF** so I can confirm the best extraction approach before starting?

Best regards,
Mihir
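The step of moving narrative comments into a Notes field could be approached with a row pattern: lines that match the transaction layout become records, and lines that do not are treated as comments on the preceding transaction. The sketch below assumes a hypothetical layout (date, reference, description, then debit, credit and balance) and hypothetical names `ROW` and `split_rows_and_notes`; the actual column order and date style are unknown until a sample page is available.

```python
import re

# Hypothetical row layout: DD/MM/YYYY, reference, description, three amounts.
ROW = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<ref>\S+)\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<debit>[\d,]+\.\d{2})\s+"
    r"(?P<credit>[\d,]+\.\d{2})\s+"
    r"(?P<balance>-?[\d,]+\.\d{2})$"
)


def split_rows_and_notes(lines):
    """Separate transaction rows from free-text comment lines.

    Returns (records, orphans): records are dicts with a 'notes' field,
    orphans are comment lines that appear before any transaction.
    """
    records, orphans = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        match = ROW.match(text)
        if match:
            records.append({**match.groupdict(), "notes": ""})
        elif records:
            # Not a transaction: attach the text to the previous row's notes.
            last = records[-1]
            last["notes"] = (last["notes"] + " " + text).strip()
        else:
            orphans.append(text)
    return records, orphans


sample = [
    "01/04/2021 INV-001 Office supplies 250.00 0.00 1,750.00",
    "approved late, see email",
]
records, orphans = split_rows_and_notes(sample)
```

Keeping orphaned comment lines in their own list, rather than discarding them, supports the requirement that no line from the PDF is dropped.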
₹1,000 INR in 40 days
1.8

Hello, I can help convert your multi-year ledger PDF into a clean, structured dataset. I will extract all transactions, standardize dates and numeric formats, separate notes from numeric fields, and implement validation checks to flag any anomalies. You will receive a fully cleaned CSV file, a detailed anomaly log, and a reproducible Python script that runs end-to-end using open-source libraries. The final dataset will maintain complete row integrity and import correctly into Excel or SQL systems. I’m ready to begin once I review the PDF.
₹1,100 INR in 40 days
0.6

Hello,

I’ve reviewed the PDF you shared, and I can convert the entire ledger, hundreds of pages, into a clean, structured dataset ready for import into accounting software or SQL.

Here’s the workflow I will deliver:
• accurate extraction of every transaction (dates, descriptions, refs, debit/credit, balances)
• correction of inconsistent number formats and standardization to ISO dates
• separation of narrative comments into a clean “Notes” field
• anomaly log for rows that fail numeric or balance checks
• final output as a flat-file database (CSV or SQLite)
• a fully reproducible Python script using open-source libraries (pandas, tabula/camelot, etc.)

I’ve handled messy PDFs with irregular spacing and merged columns before, and I can ensure no dropped or duplicated rows, with all numeric fields importing correctly into Excel or SQL.

My rate is ₹1,000 INR/hour, and I estimate 3–5 days for full extraction, cleaning, validation, and delivery. If you’d like, I can process a small sample section first to confirm the structure and approach.
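For the SQLite option mentioned in this bid, a short sketch of loading the cleaned CSV into a typed table and confirming that the amount columns are stored as numbers might look like this. The table name `ledger`, the column names, and the helper names are hypothetical. Amounts are stored as REAL here for simplicity; storing integer minor units (paise, cents) would be an alternative that avoids floating-point rounding.

```python
import csv
import io
import sqlite3


def load_ledger(csv_text, connection):
    """Create a typed ledger table and load the cleaned CSV into it."""
    connection.execute(
        """CREATE TABLE ledger (
               txn_date    TEXT NOT NULL,  -- ISO 8601 date
               reference   TEXT,
               description TEXT,
               debit       REAL NOT NULL,
               credit      REAL NOT NULL,
               balance     REAL NOT NULL,
               notes       TEXT
           )"""
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (r["txn_date"], r["reference"], r["description"],
         float(r["debit"]), float(r["credit"]), float(r["balance"]),
         r["notes"])
        for r in reader
    ]
    connection.executemany(
        "INSERT INTO ledger VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    return len(rows)


def non_numeric_rows(connection):
    """Count rows where any amount column is not stored as a number."""
    query = """SELECT COUNT(*) FROM ledger
               WHERE typeof(debit) != 'real'
                  OR typeof(credit) != 'real'
                  OR typeof(balance) != 'real'"""
    return connection.execute(query).fetchone()[0]


csv_text = (
    "txn_date,reference,description,debit,credit,balance,notes\n"
    "2021-04-01,INV-001,Office supplies,250.00,0.00,1750.00,approved late\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_ledger(csv_text, conn)
```

Comparing the value returned by `load_ledger` with the number of lines counted in the source PDF gives a direct check against dropped or duplicated rows.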
₹1,000 INR in 40 days
0.0

I have strong experience working with Python for data analysis using pandas, NumPy, and data cleaning techniques. I have also worked on projects involving data extraction, preprocessing, and converting messy datasets into structured formats suitable for analysis. I focus on accuracy, reproducibility, and clear documentation so the client can easily reuse the process in the future.
₹750 INR in 40 days
0.0

Leveraging the power of Python and my deep experience with messy data, I believe I am the perfect fit for your project. I have successfully extracted, cleaned, and organized data from a wide variety of sources, including PDFs, using tools like pandas and camelot, delivering clean datasets that meet strict quality standards. I assure you that every single line of your ledger will be accurately reproduced in the structured dataset, free from any errors or omissions. Moreover, I can also implement consistent number formats, standardize dates to ISO, isolate narrative comments, and log any failed rows efficiently to allow quick yet comprehensive inspection. My promise is to deliver a clean output as a single flat-file database, which can easily be imported into accounting software to save you time and resources. With a focus on maintainable code and robust anomaly detection, rest assured that not a single transaction will bypass scrutiny or end up unaccounted for. Additionally, as an AI-assisted developer, my workflows are designed to ensure quick turnaround times without compromising on quality. Trusting me with your project means investing in someone who can expertly bridge the gap between complex backend logic and intuitive frontend experiences. So let's have a chat about how we can bring the best out of your data together!
₹750 INR in 40 days
0.0

Baran, India
Member since July 3, 2025