
Closed
Posted
Paid on delivery
Industrial Automation Product Data Extraction, Deduplication & Structured Image Collection

Project Overview
We are an industrial automation parts distributor building a structured product database to support inbound enquiries and SEO growth.

We require an experienced data extraction specialist to:
- Extract structured product data from major industrial / electronic component distributor websites
- Identify duplicate manufacturer part numbers across multiple sources
- Merge all unique information into a single consolidated dataset
- Extract and organise all available product images per part number
- Deliver a clean, deduplicated, production-ready dataset

This project includes:
- Data extraction
- Normalization
- Deduplication
- Intelligent merging
- Structured image collection and organisation

Accuracy and clean structure are critical.

Core Requirement – Deduplication & Data Merging
Many distributor websites list the same Manufacturer Part Number (MPN) but with different technical attributes.

We require:
- One single row per unique Manufacturer Part Number
- No duplicate MPNs in the final dataset
- All unique specifications merged into that single row
- Missing data from one source supplemented from another
- If two sources provide different values for the same spec, both values must be preserved
- No overwriting of valid data

This is not a simple scrape-and-export task. It requires structured consolidation.

Required Fields (Where Available)
For each unique MPN:
- Manufacturer Part Number (normalized format)
- Brand / Manufacturer
- Product Description
- Product Category
- Product Status (Active / Obsolete / Discontinued)
- Technical Specifications (structured)
- Electrical specs (voltage, current, range, etc.)
- Mechanical specs (if applicable)
- Input / Output type
- Datasheet link(s)
- Source URLs (all sources used for that part)

Specifications must be structured: no raw HTML dumps.
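As a rough illustration of the consolidation rule in the Core Requirement section (one row per MPN, gaps filled from other sources, conflicting values kept side by side), the following Python sketch may help. It is not a prescribed implementation: the normalization rule (uppercase, strip spaces and hyphens), the field names, the URLs and all spec values are assumptions made up for the example.

```python
import re
from collections import defaultdict


def normalize_mpn(raw: str) -> str:
    """Assumed rule: uppercase and drop spaces/hyphens to build a matching key."""
    return re.sub(r"[\s\-]+", "", raw).upper()


def merge_records(records):
    """Merge per-source records into one entry per normalized MPN.

    Each record is a dict with an 'mpn' key plus arbitrary fields.
    Every distinct value of a field is kept, in first-seen order,
    so nothing is overwritten and conflicts stay visible.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for rec in records:
        key = normalize_mpn(rec["mpn"])
        for field, value in rec.items():
            if field == "mpn" or value in (None, ""):
                continue  # skip the key itself and empty values
            if value not in merged[key][field]:
                merged[key][field].append(value)
    return {mpn: dict(fields) for mpn, fields in merged.items()}


# Hypothetical input: two sources listing the same part with different detail.
records = [
    {"mpn": "1746-OB16", "source_url": "https://example.com/a",
     "voltage": "24 V DC", "outputs": "16"},
    {"mpn": "1746 ob16", "source_url": "https://example.org/b",
     "voltage": "10-50 V DC", "brand": "ExampleBrand"},
]
master = merge_records(records)
```

Here `master` holds a single entry whose `voltage` field lists both reported values and whose `source_url` field lists both sources. The stripped key is meant only for matching; the display form of the MPN (used for folder names, for example) would need to be stored separately.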
Image Collection Requirements (Strict)
For each unique Manufacturer Part Number:
- Extract all available product images from all sources
- No watermarks
- Highest available resolution
- No thumbnails unless full resolution is unavailable

Folder Structure
Each part number must have its own folder.

Example:
Folder name: 1746-OB16
Inside folder:
1746-OB16 [login to view URL]
1746-OB16 [login to view URL]
1746-OB16 [login to view URL]
1746-OB16 [login to view URL]
1746-OB16 [login to view URL]
1746-OB16 [login to view URL]
1746-OB16 [login to view URL]

If 3 images exist on one source and 4 on another, all 7 must be included.

Naming convention: MPN + space + alphabetical suffix
- No duplicate image files.
- No random filenames.
- No inconsistent naming.

Output Requirements
1. Master Excel File (.xlsx)
- One row per unique MPN
- All specifications merged
- Clean structured columns
- No duplicate MPNs
- No HTML tags
- No formatting issues
2. Image Directory
- Root folder containing subfolders for each MPN
- Each subfolder named exactly as the MPN
- All images inside correctly named (A, B, C…)
- Clean file structure

Data Handling Expectations
The freelancer must:
- Normalize part numbers (remove spacing / hyphen inconsistencies)
- Identify duplicates across multiple sources
- Merge specifications intelligently
- Preserve all unique attributes
- Handle dynamic / JS-rendered sites
- Deliver clean, production-ready output

In your proposal, explain:
- Your deduplication logic
- Your merging methodology
- How you will handle conflicting specifications
- How you will prevent image duplication
- Tools you will use
- Estimated scale
- Timeline
- Total cost

Experience Required
Only apply if you have:
- Experience scraping large distributor / e-commerce platforms
- Experience handling structured product datasets
- Experience deduplicating and merging datasets
- Experience bulk-downloading and structuring images
- Ability to deliver clean, organised, ready-to-import data

Please include examples of similar projects.
Project Goal
We are building a large structured industrial component database. If this phase is successful, this will become an ongoing project.
Project ID: 40225512
75 proposals
Remote project
Active 24 days ago
75 freelancers are bidding an average of £479 GBP for this job

As the CEO and founder of Digital Screencast, I have amassed a wealth of experience and expertise over my 7-year career in web scraping, data analysis, and automation. Your project aligns closely with my niche, particularly my specialization in extracting and merging large, structured datasets. Having previously worked with top companies like Metlife GOSC and DXC technologies, I've developed solid deduplication logic and coherent methodologies for handling conflicting specifications. Driving value for your project is not just a claim but my core responsibility. My proficiencies stretch across many data-oriented areas including Excel formulas, VBA scripts, web research and copying. This gives me an added advantage as I use these skills daily to clean and structure large datasets just as your project requires. I will assure you of an effective and efficient delivery of a clean, consolidated dataset in the required format along with the strict management of the image directory.
£500 GBP in 7 days
8.6

Hello,

This is a structured data engineering task, not just scraping, and that’s exactly my expertise. I have 5+ years extracting and consolidating industrial/e-commerce datasets using Scrapy, Playwright, Selenium, Requests and Pandas.

Deduplication Logic:
- Normalize MPNs (uppercase, hyphen/spacing standardization)
- Use normalized MPN as primary key
- Cross-verify with brand/category where needed
- Ensure one single row per unique MPN

Merging Method:
- Aggregate all sources into structured dictionaries
- Preserve all unique specs
- If conflicting values exist, retain both (no overwriting)
- Supplement missing data across sources

Images:
- Extract highest resolution (no thumbnails)
- Remove duplicates using MD5 hash comparison
- Folder per MPN
- Naming: MPN + A/B/C format
- No random filenames, no duplicates

Tools: Python, Scrapy, Playwright, Pandas, Pillow, OpenPyXL

Estimated timeline depends on volume (e.g., 5–10k MPNs ≈ 2–3 weeks). I’ve handled large distributor datasets before and deliver clean, production-ready outputs. Happy to share sample structure.

Best regards,
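A minimal sketch of the image steps listed above (MD5 comparison, then MPN + letter naming), using only Python's standard `hashlib`. The function names, the file extension handling and the choice to continue with AA, AB after Z are assumptions for illustration; the brief does not say what happens past 26 images, and the example filenames in it are hidden.

```python
import hashlib


def suffix(index: int) -> str:
    """Map 0 -> 'A', 25 -> 'Z', 26 -> 'AA' (continuing past Z is an assumption)."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def plan_image_files(mpn: str, images):
    """Return {filename: bytes} for one MPN, skipping byte-identical images.

    `images` is an iterable of (raw_bytes, extension) pairs collected
    from all sources for this part number.
    """
    seen = set()
    planned = {}
    for data, ext in images:
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            continue  # same bytes already planned under an earlier letter
        seen.add(digest)
        planned[f"{mpn} {suffix(len(planned))}.{ext}"] = data
    return planned
```

Note that an MD5 comparison only catches byte-identical files. The same picture served at a different resolution or with different compression would hash differently and would need a separate check.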
£750 GBP in 7 days
8.0

Hello, I understand you need a robust, production-ready data engine for industrial parts: deduplicating MPNs across sources, merging attributes intelligently, and collecting high-res product images with strict folder and naming conventions. I will build a reproducible workflow that normalizes part numbers, crawls major distributors (handling JS-rendered pages), detects duplicates, preserves all valid attributes from each source, and fills gaps using best-fit values from other sources. The result is a clean master Excel with one row per unique MPN and a parallel image directory structure named exactly by the MPN. Output will be free of HTML, with structured specifications (electrical, mechanical, datasheets, sources). I will deliver a scalable process that can run again on new data with consistent results.
- What are the exact distributor sites to target first, and any preferred order of sources?
- How many unique MPNs do you expect in Phase 1?
- Do you require a specific normalization map for MPNs (spaces, hyphens)?
- Which fields are mandatory vs optional?
- How should conflicting specs be treated when neither value is clearly superior?
- Do you want a default value for missing specs or simply leave them blank with source citations?
- How will you handle dynamic/JS-rendered pages?
- Are there any restricted sources or robots rules?
- What is the preferred file naming convention for images beyond MPN + A–G?
- What is the expected cadence for updates and re-run cycles?
£750 GBP in 13 days
7.4

⭐⭐⭐⭐⭐ Extract and Organize Industrial Automation Product Data Efficiently ❇️ Hi My Friend, I hope you're doing well. I reviewed your project requirements and see you are looking for a data extraction specialist. Look no further; Zohaib is here to help you! My team has successfully completed 50+ similar projects for data extraction and organization. I will ensure accurate extraction, deduplication, and structured image collection to support your product database. ➡️ Why Me? I can easily handle your data extraction and deduplication project as I have 5 years of experience in data extraction, normalization, and merging datasets. My expertise includes web scraping, data cleaning, and image organization. Additionally, I have a strong grip on tools like Python and Excel, ensuring a smooth and efficient process for your project. ➡️ Let's have a quick chat to discuss your project in detail and let me show you some of my previous work. I look forward to discussing this with you in our chat. ➡️ Skills & Experience: ✅ Data Extraction ✅ Deduplication ✅ Data Normalization ✅ Structured Data Merging ✅ Image Collection ✅ Web Scraping ✅ Data Cleaning ✅ Excel Mastery ✅ Data Organization ✅ Conflict Resolution ✅ Technical Specification Handling ✅ Bulk Image Downloading Waiting for your response! Best Regards, Zohaib
£350 GBP in 2 days
7.5

I'm Youssef, a full-time freelancer with Python programming expertise. I understand your project requires industrial automation product data extraction, deduplication, and structured image collection. Your goal is clear: extract structured product data and images from major distributor websites, intelligently merge unique specifications, and handle duplicate manufacturer part numbers. I will use Playwright, Selenium, and Scrapy for robust data extraction from dynamic content, along with custom Python scripts for the deduplication logic, merging of all unique attributes while preserving differing values, and structured image collection that follows your naming and folder requirements. I have extensive experience delivering clean, production-ready datasets for similar data consolidation projects, ensuring the accuracy and structured output you need.
£750 GBP in 1 day
7.2

As an experienced web-scraping specialist, solving complex issues is my defining strength. I have extensive experience in extracting data from various websites, including those equipped with advanced anti-bot protection systems like Cloudflare and Incapsula, which are similar to the industrial distributor sites associated with your project. Over the years, I have developed skills with tools and libraries such as Selenium, BeautifulSoup, Scrapy, and Requests in Python to ensure the successful completion of my projects. This project aligns with my expertise in cleansing and transforming large datasets. My understanding of deduplication techniques for merging multiple sources will prove valuable for this project's depth and scale. Additionally, I am well prepared to tackle any specification conflict that arises by using merging methodologies that retain all unique attributes, which is vital for preserving valid data.
£500 GBP in 7 days
7.5

We can perform end-to-end structured data extraction, normalization, deduplication, and intelligent multi-source merging of industrial and electronic component datasets, ensuring one canonical row per MPN with all unique specifications, technical attributes, and source references preserved. We’ll also handle high-resolution image harvesting, hash-based duplicate detection, automated folder structuring, and organize assets into MPN-specific directories following strict naming conventions, making the dataset fully production-ready. Can you share the target distributor websites, expected dataset volume, and any preferred file formats so we can provide a detailed timeline and cost estimate?
£499 GBP in 7 days
7.1

Hello, I have carefully reviewed your project requirements and clearly understand that this is a structured industrial product data consolidation project requiring advanced extraction, normalization, deduplication, and intelligent merging. I can confidently deliver a clean, production-ready dataset built for long-term scalability and SEO performance.

My approach will begin with large-scale extraction using Python with Scrapy and Selenium to handle static and JS-rendered distributor platforms. All Manufacturer Part Numbers will be normalized by removing spacing and hyphen inconsistencies to ensure clean primary keys. Deduplication logic will treat normalized MPN as the master identifier, supported by validation rules to detect formatting variants.

Next, I will implement structured merging using pandas where each unique MPN becomes a single consolidated row. Missing specifications will be supplemented from alternate sources, and conflicting spec values will be preserved in structured multi-value fields rather than overwritten. All source URLs will be retained for traceability.

For image collection, I will download highest resolution files, apply hash checks to prevent duplication, and automatically structure folders per MPN with strict alphabetical naming conventions.

I am ready to define scale, timeline, and cost based on estimated MPN volume. How many distributor sources should we prioritize in phase one? Let's chat and discuss further! Best Regards, Aneesa.
£350 GBP in 1 day
6.8

Hello, I have over 7 years of experience in Data Processing, Excel, Web Scraping, Data Extraction, Data Scraping, and Data Management. I have carefully read through the project requirements and am confident in my ability to deliver the desired results. For the Industrial Automation Data Extraction & Image Collection project, I propose to use a combination of web scraping tools and custom scripts to extract structured product data from various industrial/electronic component distributor websites. I will then implement a robust deduplication process to merge all unique information into a clean, consolidated dataset. My deduplication logic involves identifying and removing duplicate manufacturer part numbers while intelligently merging all unique specifications into a single row. In cases of conflicting specifications, I will preserve both values and prevent data overwriting. For image collection, I will ensure all available product images are extracted, organized per part number, and stored in a structured folder format as per your requirements. I am ready to discuss my approach further in detail. Please connect with me via chat for a more in-depth conversation. You can visit my Profile at https://www.freelancer.com/u/HiraMahmood4072 Thank you.
£275 GBP in 7 days
6.4

Hello, I am a seasoned data extraction specialist with a focus on structured product databases. I have extensive experience in extracting and merging data from industrial/electronic component distributor websites, ensuring no duplicate manufacturer part numbers exist in the final dataset. My deduplication logic involves meticulous cross-referencing and intelligent merging techniques to maintain data integrity. I handle conflicting specifications by prioritizing data accuracy and completeness, supplementing missing information from alternate sources. To prevent image duplication, I implement a strict folder structure with clear naming conventions. For this project, I plan to utilize advanced scraping tools and deliver a clean, production-ready output within the specified timeline. I have successfully completed similar projects, and I am well-equipped to meet your requirements effectively. Let's discuss further to bring your industrial component database vision to fruition. Best regards,
£350 GBP in 3 days
5.8

Hi there,

★★★ Web Scraping / Data Scraping Expert ★★★ 5+ Years of Experience ★★★

To complete this project, I will employ a systematic approach to ensure accurate data extraction, deduplication, and structured image collection.
1. Analyze the target distributor websites for data extraction (20 hours)
2. Develop a deduplication algorithm to identify and merge duplicate MPNs (15 hours)
3. Extract and organize images according to the specified naming convention (10 hours)
4. Structure and compile the final dataset into a master Excel file (15 hours)
5. Review and validate the output to ensure accuracy and completeness (5 hours)

What I need from you:
1. Access to the target distributor websites or data sources
2. Examples of existing datasets for reference
3. Specific instructions or preferences for data structure and image organization

I look forward to connecting at your convenience to ensure the project's success.

Best Regards,
TechPlus Team
£800 GBP in 10 days
5.7

It sounds like you’re not asking for a scrape, you’re building a production product master where the same MPN appears across multiple distributors and must be consolidated into one truth row per MPN, with every unique spec preserved and every image collected and organized cleanly. Most attempts fail because they overwrite conflicting specs, dump unstructured HTML, or treat images as “one per product.” Without a strict merge policy and image dedupe rules, the dataset becomes unusable for SEO and inbound enquiries. My approach is to normalize MPNs first, then run a source-to-MPN staging table, and merge into a single master row per MPN using “append not overwrite” rules: if a spec conflicts, both values are stored (separate columns or a delimited multi-value field) with source attribution. For images, I download the highest-resolution asset available, de-duplicate by hash, and enforce your folder and naming convention (MPN + A/B/C…) so every part has a clean asset pack. I’ve delivered merged, deduplicated component datasets across multi-source distributor sites with structured specs and organized image libraries ready for import. Which distributor sites are in scope, and do any block crawling or require API access/whitelisting? Do you prefer conflicts stored as separate source-specific columns, or a single field with multi-values plus source list? Share the target sources and an example MPN set, and I’ll outline the schema, merge rules, and delivery milestones. Adnan
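The staging-table idea described above, with "append not overwrite" merging and source attribution, might look roughly like the sketch below. The tuple layout, the `" | "` delimiter, the `<spec>_sources` column naming and all sample values are assumptions for illustration, not the bidder's actual schema.

```python
# Staging rows: (source, normalized_mpn, spec_name, value). All values are made up.
staging = [
    ("distributor_a", "1746OB16", "voltage", "24 V DC"),
    ("distributor_b", "1746OB16", "voltage", "10-50 V DC"),
    ("distributor_b", "1746OB16", "outputs", "16"),
    ("distributor_a", "1746OB16", "outputs", "16"),
]


def build_master(rows, sep=" | "):
    """Collapse staging rows into one dict per MPN without overwriting.

    Distinct values for a spec are joined with `sep`; a parallel
    '<spec>_sources' field records which sources reported each value,
    in the same order as the values.
    """
    values = {}  # (mpn, spec) -> {value: [sources]}
    for source, mpn, spec, value in rows:
        sources = values.setdefault((mpn, spec), {}).setdefault(value, [])
        if source not in sources:
            sources.append(source)

    master = {}
    for (mpn, spec), by_value in values.items():
        row = master.setdefault(mpn, {})
        row[spec] = sep.join(by_value)
        row[f"{spec}_sources"] = sep.join(",".join(s) for s in by_value.values())
    return master
```

With the sample rows, the `voltage` field keeps both conflicting values and `voltage_sources` shows which distributor supplied each one, while the agreeing `outputs` value appears once with both sources listed. Storing conflicts in separate source-specific columns instead would be an equally valid layout.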
£500 GBP in 7 days
5.8

Hi I can extract, normalize and consolidate industrial automation product data into a clean, deduplicated and production-ready dataset by pulling structured specs and high-resolution images from multiple distributor sources. A major challenge in projects like this is that identical MPNs appear with inconsistent formatting and conflicting specifications, and I address this by using strict part-number normalization, multi-source attribute merging and conflict-preservation logic so no valid data is lost. I also implement intelligent deduplication where each unique MPN becomes a single row enriched with all unique specs from every source. For images, I ensure de-watermarked, high-resolution files are collected, deduplicated via hashing and saved into correctly named folders following your MPN + alphabetical suffix convention. With experience in large-scale scraping, structured dataset engineering and e-commerce catalog cleanup, I can deliver a fully organized Excel master file and image directory ready for import into your system. This structured approach supports both SEO expansion and long-term catalog growth. Thanks, Hercules
£500 GBP in 7 days
5.8

Hello, I trust you're doing well. I am well experienced in machine learning algorithms, with nearly a decade of hands-on practice. My expertise lies in developing various artificial intelligence algorithms, including the one you require, using Matlab, Python, and similar tools. I hold a doctorate from Tohoku University and have a number of publications in the same subject. My portfolio, which showcases my past work, is available for your review. Your project piqued my interest, and I would be delighted to be part of it. Let's connect to discuss in detail. Warm regards. please check my portfolio link: https://www.freelancer.com/u/sajjadtaghvaeifr
£700 GBP in 7 days
5.5

Hi there, I'm an experienced freelancer specializing in data extraction for industrial automation products. With a background in scraping large distributor websites and handling structured datasets, I am well-equipped to tackle your project.
✅ Leveraging my expertise, I will extract, deduplicate, and merge product data to create a clean, production-ready dataset.
✅ Using advanced normalization techniques, I will ensure each unique Manufacturer Part Number is represented accurately without duplicates.
✅ My strategy includes intelligent merging of specifications, handling conflicting data with precision, and preventing image duplication through meticulous organization.
✅ I will utilize cutting-edge tools to streamline the extraction and consolidation process, delivering accurate results efficiently.
✅ My timeline and cost estimate are tailored to the project's scale and complexity, ensuring timely delivery within budget.
£500 GBP in 5 days
5.2

I can extract & normalize industrial part data, deduplicate by MPN, intelligently merge all unique specs without overwriting, and collect full-resolution images in clean A/B/C naming per folder. I’ll use Python (Scrapy/Playwright), Excel structuring, conflict-preserving logic, and strict duplicate image checks for a production-ready dataset. Ready to start with a scalable workflow, fast delivery, and ongoing support.
£250 GBP in 1 day
4.9

Hi there, I’m Ahmed from Eastvale, California — a Senior Full-Stack Engineer with over 15 years of experience building high-quality web and mobile applications. After reviewing your job posting, I’m confident that my background and skill set make me an excellent fit for your project. I’ve successfully completed similar projects in the past, so you can expect reliable communication, clean and scalable code, and results delivered on time. I’m ready to get started right away and would love the opportunity to bring your vision to life. Looking forward to working with you. Best regards, Ahmed Hassan
£500 GBP in 7 days
4.8

Hello Dear! I am writing to introduce myself. I'm Engineer Toriqul Islam. I was born and grew up in Bangladesh. I speak and write English at a native level. I hold a B.Sc. in Computer Science & Engineering from Rajshahi University of Engineering & Technology (RUET). I love to work on Web Design & Development projects.

Web Design & Development: I am a full-stack web developer with more than 10 years of experience. My design approach is always modern and simple, which attracts people to it. I have built websites for a wide variety of industries. I have worked with a lot of companies and built astonishing websites. All clients have good reviews about me. Client satisfaction is my first priority.

Technologies We Use: Custom Websites Development Using Full Stack Development.
1. HTML5
2. CSS3
3. Bootstrap4
4. jQuery
5. JavaScript
6. Angular JS
7. React JS
8. Node JS
9. WordPress
10. PHP
11. Ruby on Rails
12. MYSQL
13. Laravel
14. .Net
15. CodeIgniter
16. React Native
17. SQL / MySQL
18. Mobile app development
19. Python
20. MongoDB

What you'll get?
• Fully Responsive Website on All Devices
• Reusable Components
• Quick response
• Clean, tested and documented code
• Completely met deadlines and requirements
• Clear communication

You are cordially welcome to discuss your project. Thank You! Best Regards, Toriqul Islam
£250 GBP in 4 days
4.5

Hello, I understand that you require structured extraction, deduplication, and image collection for industrial automation products to build a clean, production-ready database. My approach will begin with automated extraction from distributor websites using Python with Selenium and BeautifulSoup, handling dynamic content and JS-rendered pages. I will normalize Manufacturer Part Numbers, detect duplicates across sources, and intelligently merge specifications by preserving all unique values, supplementing missing data without overwriting valid entries. Conflicting specifications will be consolidated into structured columns, ensuring accuracy and completeness. For images, I will bulk-download all available high-resolution files, organize them into per-MPN folders, and follow the precise naming convention (MPN + alphabetical suffix), preventing duplication. The final output will include a master Excel file with one row per unique MPN, clean structured specifications, source URLs, and a fully organized image directory. Deduplication logic relies on exact and fuzzy matching of MPNs, with normalization rules to unify formats, while merging ensures all unique technical attributes are preserved. Thanks, Asif
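The combination of exact and fuzzy MPN matching mentioned above could be sketched with Python's standard `difflib`. The normalization rule, the 0.9 similarity cutoff and the function name are assumptions for illustration; the proposal does not specify them.

```python
import difflib
import re


def normalize_mpn(raw: str) -> str:
    """Assumed rule: uppercase and drop spaces/hyphens."""
    return re.sub(r"[\s\-]+", "", raw).upper()


def find_candidates(mpns, cutoff=0.9):
    """Group raw MPN strings by normalized key and list near-miss key pairs.

    Returns (groups, near):
      groups: {normalized_key: [raw strings]}  (exact matches after normalization)
      near:   [(key_a, key_b)] pairs whose similarity ratio is >= cutoff
    """
    groups = {}
    for raw in mpns:
        groups.setdefault(normalize_mpn(raw), []).append(raw)

    keys = sorted(groups)
    near = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if difflib.SequenceMatcher(None, a, b).ratio() >= cutoff:
                near.append((a, b))
    return groups, near
```

Fuzzy pairs are best treated as candidates for manual review rather than merged automatically: two part numbers that differ by a single trailing character may well be different products. The pairwise comparison is also quadratic in the number of keys, so a large catalogue would need blocking (for example by brand) before comparing.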
£750 GBP in 5 days
4.6

⭐Hello, I'm ready to assist you right away!⭐ I believe I'd be a great fit for your project since I have extensive experience in data extraction, deduplication, and structured image collection. My approach includes extracting product data from major industrial websites, deduplicating manufacturer part numbers, and merging unique information into a consolidated dataset. I ensure that all specifications are structured and accurately organized for easy access. This project will streamline your data management processes, enhance SEO growth, and support inbound inquiries effectively. If you have any questions, would like to discuss the project in more detail, or would like to know how I can help, we can schedule a meeting. Thank you. Maxim
£250 GBP in 4 days
3.8

Stafford, United Kingdom
Member since July 16, 2024