
Completed
Posted
Paid on delivery
Python Script: Extract & Balance Yelp Academic Dataset by Business Category

Description: Download the Yelp Academic Dataset ([login to view URL] and [login to view URL]). It's free and in the public domain. I need a single CSV or JSON file prepared from these two files according to the following exact specifications. The deliverable is one file, under 30MB, containing a balanced sample of Yelp reviews across specific business categories.

Requirements:

Join the review file to the business file on business_id so that each record contains both the review text and the business category data.

Filter to the following business categories only:
• Restaurants (all cuisine types)
• Hair Salons & Barber Shops
• Nail Salons
• Spas & Massage
• Auto Repair & Service
• Hotels & Lodging
• Dental & Medical Offices
• Gyms & Fitness
• Tours & Experiences
• Retail / Boutique

For each category, include 500 reviews rated 1–2 stars, 500 reviews rated 3 stars, and 500 reviews rated 4–5 stars (1,500 reviews per category, 15,000 reviews total). If a category has fewer than 500 reviews at a given star level, include all available reviews at that level. Select reviews randomly within each category/star stratum. Do not cherry-pick.

Each record in the output file must include exactly these fields:
• business_id
• business_name
• category (use the simplified category label from the list above, not the raw Yelp category string)
• stars (the review star rating, 1–5)
• review_text
• review_date

Strip all other fields.

Verify the final file is under 30MB. If it exceeds 30MB, reduce the per-stratum count proportionally across all categories to bring it under the limit.

Deliver the file as UTF-8 encoded CSV with a header row.
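The sampling rule in the brief (up to 500 random reviews per category and star group, or all of them when fewer exist) can be sketched as follows. This is a minimal illustration, not the client's or any bidder's script: the function names, the `per_stratum` and `seed` parameters, and the assumption that each record is a dict with `category` and `stars` keys are all hypothetical.

```python
import random
from collections import defaultdict


def star_bucket(stars):
    """Map a 1-5 star rating to one of the three strata named in the brief."""
    if stars <= 2:
        return "1-2"
    if stars == 3:
        return "3"
    return "4-5"


def stratified_sample(records, per_stratum=500, seed=0):
    """Pick up to `per_stratum` random records per (category, star bucket).

    Assumes each record is a dict with "category" and "stars" keys
    (hypothetical shape). A stratum with fewer records than `per_stratum`
    contributes everything it has.
    """
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["category"], star_bucket(rec["stars"]))].append(rec)

    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    sample = []
    for key in sorted(strata):  # stable stratum order keeps runs comparable
        pool = strata[key]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample
```

Seeding a private `random.Random` instance keeps the selection random but repeatable, which makes it possible to show afterwards that nothing was cherry-picked.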
Project ID: 40430256
52 proposals
Remote project
Active 9 days ago

Hi, I can prepare the Yelp Academic Dataset exactly to your specifications and deliver a clean, balanced UTF-8 CSV under 30MB with reproducible sampling logic.

✔ What I’ll implement:
Join [login to view URL] + [login to view URL] on business_id
Map raw Yelp categories → your simplified category labels
Stratified random sampling by category + star group
Exact required output fields only
Automatic size check + proportional reduction if file exceeds 30MB

✔ Preferred Stack:
Python (Pandas / Polars for efficiency)
Streaming JSON parsing for large-file handling
Deterministic random sampling for reproducibility

✔ Quality Controls:
No cherry-picking
Validation of category mapping and review counts
UTF-8 encoded CSV with clean escaping and header row
Final audit for duplicates, missing fields, and file size compliance

✔ Deliverables:
Final CSV dataset
Well-commented reusable Python script
Short README explaining workflow and rerun steps

Ready to start immediately. Best regards, Sumya
$100 USD in 1 day
4.7
52 freelancers are bidding on average $128 USD for this job
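The streaming JSON parsing and join on business_id mentioned in the proposal above could look roughly like the sketch below. It assumes both files are newline-delimited JSON and that the field names used (`business_id`, `name`, `categories`, `stars`, `text`, `date`) match the dataset. The function names and file paths are hypothetical, and this is not the bidder's actual code.

```python
import json


def load_businesses(business_path):
    """Read the business file line by line into {business_id: (name, categories)}.

    Assumes one JSON object per line (an assumption about the file layout).
    """
    businesses = {}
    with open(business_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            biz = json.loads(line)
            businesses[biz["business_id"]] = (
                biz.get("name", ""),
                biz.get("categories") or "",
            )
    return businesses


def iter_joined_reviews(review_path, businesses):
    """Yield one joined record per review without loading the whole review file."""
    with open(review_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            review = json.loads(line)
            match = businesses.get(review["business_id"])
            if match is None:
                continue  # review for a business we did not load
            name, categories = match
            yield {
                "business_id": review["business_id"],
                "business_name": name,
                "raw_categories": categories,
                "stars": review["stars"],
                "review_text": review["text"],
                "review_date": review["date"],
            }
```

Only the smaller business file is held in memory; the review file is read one line at a time, which is the point of streaming here.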

Hi there, I am a scraping expert. I am able to scrape data from Yelp, so please contact me and I will show you a sample. Thank you.
$100 USD in 1 day
8.8

I can build a Python pipeline to process the Yelp Academic Dataset, join reviews with business categories, stratify and randomly sample reviews by rating, then export a clean UTF-8 CSV under your 30MB limit. The script will be reproducible, memory-efficient, and strictly follow your schema and category rules.
$120 USD in 2 days
7.5

Hi, I can create the Python script and final balanced dataset exactly to your specifications. I’m experienced working with large JSON datasets, data joins, stratified random sampling, and CSV optimization for size constraints. I’ll properly map Yelp categories into your simplified labels, ensure balanced review selection by star group, and deliver a clean UTF-8 CSV under the 30MB limit. The script will be reproducible, well-structured, and validated for accuracy before delivery. Regards sujon
$200 USD in 3 days
7.5

I would be honored to tackle your project titled "Yelp Data Extraction and Balancing Script". As an experienced Python developer and data specialist, I've spent over 15 years perfecting my skills in tasks that align with your requirements. I have a keen eye for detail, ensuring that your final file will contain exactly the fields you need (business_id, business_name, category, stars, review_text, and review_date), stripped of all other information. I guarantee a balanced sample within the 30MB limit comprising 1,500 reviews per category: 500 reviews rated 1–2 stars, 500 reviews rated 3 stars, and 500 reviews rated 4–5 stars. The selection process will be random to avoid cherry-picking specific reviews. With my expertise in web scraping and automation using Selenium, BeautifulSoup, Scrapy, etc., I am confident in smoothly joining the review file with the business file on business_id as required. Additionally, to ensure maximum performance, I'll use protected websites and provide you with clean, organized data in CSV or JSON that fits into your database.
$100 USD in 1 day
7.0

As a seasoned Full-Stack Web Developer with extensive experience in data extraction and analysis, I'm proud to lead a skilled and committed team at BN-Droids Digital Services that can deliver nothing less than top-tier quality work on your project. Our expertise in Python, combined with our specialized skillset in Web Scraping and Data Extraction, makes us perfectly equipped for the task at hand. We understand the paramount importance of precision and adherence to specifications when working on Data Analysis projects. Through our proficient use of various tools including Python, we're confident in delivering the exact single CSV or JSON file you need, well within the required size limits. Having handled vast amounts of data in the past, we understand its diversity and how crucial it is to cleanse and study data accurately for optimal results – a core skill in this project.
$30 USD in 7 days
6.9

Hi! I specialize in Python data engineering and dataset processing with over 9 years of experience in data extraction, cleaning, and structured dataset creation using Pandas, JSON/CSV pipelines, and large-scale data handling.

Here's how I can help:
* Join Yelp review + business datasets using business_id
* Filter and map categories into required simplified labels
* Build balanced sampling logic (1–2, 3, 4–5 stars per category)
* Ensure random, unbiased selection across all strata
* Optimize output to stay under 30MB (auto scaling logic)
* Export clean UTF-8 CSV with required fields only

I can deliver a fully reproducible Python script + final dataset. Do you prefer a one-time script or a reusable pipeline for future dataset generation?
$140 USD in 7 days
7.0

Hello, I have carefully reviewed your project requirements for extracting and balancing the Yelp Academic Dataset by specific business categories. Let's chat and discuss it further. To handle your project, I will start by downloading the Yelp Academic Dataset and then join the review and business files on business_id. Next, I will filter the data to the specified business categories and create a balanced sample of 1,500 reviews per category, following the star-rating criteria provided. The deliverable will be a single CSV file under 30MB containing the required fields: business_id, business_name, category, stars, review_text, and review_date. Before finalizing my bid, I have one question: how would you like the file delivered? Best Regards, Aneesa.
$250 USD in 1 day
6.8

Matching the review text to the business categories in the Yelp Academic Dataset is a task I can handle quickly using Python and pandas. I have processed these specific JSON files before. They can be quite heavy on memory if not handled correctly. I will join the two files on the business_id as requested and apply the random sampling logic to ensure your 15,000 record output is perfectly balanced across those ten categories. I will make sure the script handles the mapping from the raw Yelp category strings to your simplified labels like Auto Repair or Spa and Massage. I will also keep an eye on the 30MB file size limit and adjust the sample counts down proportionally if the text content pushes the CSV over that limit. You can see my previous work with data extraction and Python scripts on my profile here: https://www.freelancer.com/u/microlent I can have this cleaned and ready for you today. Are you looking for the final CSV deliverable only or do you need the Python script as well so you can rerun it on future dataset versions? Let me know and we can get started. ~ Rajesh
$140 USD in 7 days
6.9
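The size check and proportional reduction described in the proposal above might be approached as in this sketch. It is an illustration under stated assumptions: `sample_fn` is a hypothetical callable that returns the sampled rows for a given per-stratum count, and "30MB" is read conservatively as 30,000,000 bytes, which the brief does not specify.

```python
import csv
import io

FIELDS = ["business_id", "business_name", "category",
          "stars", "review_text", "review_date"]


def csv_size_bytes(rows):
    """Return the size in bytes of the rows written as UTF-8 CSV with a header."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def fit_per_stratum(sample_fn, per_stratum=500, max_bytes=30_000_000):
    """Lower the per-stratum count until the encoded CSV is under max_bytes.

    `sample_fn(n)` is a hypothetical callable returning the rows sampled
    with n reviews per stratum. The same count applies to every category,
    so the reduction stays proportional across categories.
    """
    while per_stratum > 0:
        rows = sample_fn(per_stratum)
        size = csv_size_bytes(rows)
        if size < max_bytes:
            return per_stratum, rows
        # Scale by the overshoot ratio, but always drop by at least one.
        per_stratum = min(per_stratum - 1, int(per_stratum * max_bytes / size))
    return 0, []
```

Measuring the encoded bytes rather than the character count matters because review text containing non-ASCII characters takes more than one byte per character in UTF-8.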

Hi there, We’ve developed similar data extraction scripts for projects like a restaurant review aggregator, where we extracted data from Google and Zomato. We also built a product review aggregator that used web scraping to gather data from multiple sources. For your project, we’ll use Python with libraries like BeautifulSoup and Selenium to extract data from Yelp. We’ll also implement a robust review filtering system to ensure only high-quality reviews are included. Let’s schedule a 10-minute call to discuss your project in more detail and see if I’m the right fit. I usually respond within 10 minutes. I’m eager to learn more about your exciting project. Best, Adil
$121.65 USD in 7 days
6.0

Hello, I fully understand your requirements. I am ready to start. Thanks and Regards, Everest Technology.
$77 USD in 2 days
6.1

I can create a clean Python pipeline that joins the Yelp Academic Dataset, filters and balances reviews by your required business categories/star groups, then exports a UTF-8 CSV under the 30MB limit with fully randomized sampling and proper validation. I have strong experience with large-scale JSON processing, data balancing, pandas optimization, and reproducible dataset generation, so the final output will be accurate, lightweight, and ready for academic use.
$140 USD in 2 days
5.4

Hello, I can complete this Yelp Academic Dataset extraction and balancing task exactly according to your specifications using Python and reproducible data-processing workflows.

My approach will include:
• Parsing and joining the Yelp review and business JSON datasets on business_id
• Mapping raw Yelp categories into your simplified category groups
• Stratified random sampling by category and star-rating bucket
• Automatic handling of low-volume strata (including all available reviews where needed)
• UTF-8 CSV export with only the required fields
• Final size validation to ensure the dataset remains under 30MB

The final output will contain:
• business_id
• business_name
• category
• stars
• review_text
• review_date

I’ll also provide:
• Clean, documented Python script
• Reproducible workflow for rerunning or modifying the extraction later
• Verification of balanced sampling and file-size compliance

I have experience with large JSON datasets, pandas-based processing, sampling pipelines, and structured data extraction tasks requiring accuracy and consistency. Ready to start immediately and deliver quickly.
$30 USD in 1 day
4.2

Hi, I’ve thoroughly reviewed your project requirements for extracting and balancing the Yelp Academic Dataset by specific business categories. With extensive experience in Python data processing and JSON handling, I am confident in delivering a high-quality, balanced dataset as per your exact specifications. I will join the review and business files on business_id, filter the categories requested, and randomly sample reviews to meet your star-rating distribution. I’ll ensure the final CSV file is under 30MB, UTF-8 encoded, and contains only the required fields. I can start immediately and provide the deliverable within 5 days. Do you have any preference for how random sampling should be implemented to ensure reproducibility? Best regards,
$155 USD in 15 days
4.2

Hi, I’ve reviewed your project and can help you get it done quickly and accurately. I’m Sarim Ali Khan, a top-rated freelancer specializing in data automation, analysis, and workflow optimization. My clients hire me because I deliver on time, on budget, and with zero guesswork. I can start right away and share a quick sample if you’d like. Let’s chat about the details and get you results — fast. Best, Sarim
$30 USD in 1 day
4.4

⭐⭐⭐⭐⭐ ✅ Hi there, hope you are doing well! I have experience creating data extraction and transformation scripts for large datasets by joining multiple sources and filtering data, efficiently producing balanced samples for analysis. From my experience, the most important part of this project is ensuring accurate joins and stratified random sampling to maintain balance across the specified categories and star ratings.

Approach:
⭕ Download and parse Yelp JSON datasets;
⭕ Join reviews with business data on business_id;
⭕ Filter businesses to specified categories with simplified labels;
⭕ Stratify and randomly sample reviews by star rating per category;
⭕ Assemble final dataset with required fields;
⭕ Check output size and adjust sampling as needed to keep under 30MB;
⭕ Deliver UTF-8 encoded CSV with header.

❓ Could you please confirm if you have a preference for how to handle categories with fewer than 500 reviews in any star rating beyond including all available reviews?

I am confident that I can deliver a robust, well-structured script that meets your exact specifications efficiently. Best regards, Nam
$200 USD in 3 days
3.8

Hi, Creating a balanced sample of Yelp reviews across specified business categories is certainly a challenge, and I'm excited to help you tackle this! My approach will involve efficiently using Python to extract data from the provided Yelp Academic Dataset files, joining the review and business data seamlessly based on business_id. Given your requirements, I'll apply rigorous filtering to ensure that the categories you specified are accurately represented. I aim to sample reviews evenly across the defined star ratings while adhering to your file size constraints. This will ensure a well-distributed dataset that meets your criteria. I'm Muhammad Furqan, and I have extensive experience in data extraction, manipulation, and visualization, including projects where I've dealt with large datasets and complex filtering tasks. Would you like any additional specific statistics included in the output file, or are there any specific formats you prefer for the CSV? Looking forward to discussing this further! Thank you, Muhammad Furqan
$175 USD in 2 days
3.9

Hi, this is Kris from McKinney, Texas. I've reviewed your project requirements and understand that you need a Python script to extract and balance the Yelp Academic Dataset by specific business categories. The key challenge lies in creating a single CSV or JSON file that contains a balanced sample of Yelp reviews across the mentioned categories while ensuring the file size does not exceed 30MB. My approach involves joining the review file with the business file on business_id, filtering the data for the specified business categories, and selecting 500 reviews for each star group within each category to meet the total requirement of 15,000 reviews.

A few additional questions:
Q1: Are there any specific constraints or preferences regarding how the data should be sorted within each category?
Q2: Do you have a preferred method for handling cases where a category has fewer than 500 reviews at a certain star level?
Q3: Is there a preferred format for the final output file (CSV or JSON)?

Best regards, Kris Kramer
$30 USD in 1 day
4.3

Hi Sir, I have been working in the application development and AI industry for the past 16 years and have worked on 500+ applications using Python, Java, C++, Go, and Rust. I have developed automation scripts, scraping scripts, automation systems, and enterprise-grade financial applications. I understand your requirements and can develop this Python script and prepare the data according to your specifications, with the filters you described. Awaiting your response. Regards, Jawad.
$100 USD in 1 day
3.2

Hi, I’ll extract and balance the Yelp Academic Dataset as you specified. The process involves downloading the review and business files, joining them on business_id, and filtering for the selected categories. I will ensure each category includes 500 reviews at each star level where possible, maintaining randomness and avoiding cherry-picking. With extensive experience in Python data manipulation and working with large datasets, I will implement an efficient solution leveraging libraries like pandas for optimal performance and memory management. I’ll confirm that the final file remains under 30MB by adjusting the review counts proportionally if necessary. Are there any specific constraints or formats for the CSV beyond UTF-8 encoding? I’m ready to deliver a clean, structured dataset that meets your needs promptly. Thank you.
$156 USD in 7 days
3.1

Hi there, I can build this Yelp extraction script exactly to spec and keep the output clean, balanced, and under the 30MB limit. I have strong experience working with large JSON datasets, Python data pipelines, and category-based sampling. I will join the review and business files on business_id, map the Yelp categories into your simplified labels, and produce a UTF-8 CSV with only the required fields. The script will sample reviews randomly within each category and star stratum, enforce the 500 per group target where available, and automatically scale counts down if the file size grows too large. I will also validate the final CSV structure, field order, and size before delivery. I can communicate in real time in your time zone and provide a simple demo or part of the project within 12 hours of starting.

Q1: Do you want categories mapped by exact Yelp category matches only, or should I include a small keyword-based mapping for mixed business types?
Q2: Should the script output only the final CSV, or also include the Python code used to generate it?
Q3: Do you want me to preserve review dates exactly as stored in Yelp, or convert them to a standard YYYY-MM-DD format?

Best regards, Daniel
$30 USD in 2 days
2.9
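The exact-match option raised in Q1 above could be sketched as a lookup from raw category names to the simplified labels in the brief. Everything here is illustrative: the raw names in `CATEGORY_MAP` are assumptions rather than a verified list of Yelp categories, the function name is hypothetical, and giving a mixed business the first matching label is only one possible rule.

```python
# Simplified label -> raw category names treated as a match.
# The raw names are illustrative assumptions, not a verified Yelp list.
CATEGORY_MAP = {
    "Restaurants": {"Restaurants"},
    "Hair Salons & Barber Shops": {"Hair Salons", "Barbers"},
    "Nail Salons": {"Nail Salons"},
    "Spas & Massage": {"Day Spas", "Massage"},
    "Auto Repair & Service": {"Auto Repair"},
    "Hotels & Lodging": {"Hotels"},
    "Dental & Medical Offices": {"Dentists", "Doctors"},
    "Gyms & Fitness": {"Gyms"},
    "Tours & Experiences": {"Tours"},
    "Retail / Boutique": {"Fashion"},
}


def simplify_category(raw_categories):
    """Return the first simplified label whose raw names appear, else None.

    Assumes `raw_categories` is a comma-separated string (or None).
    Precedence for businesses matching several labels follows the order
    of CATEGORY_MAP, which is an arbitrary choice made for this sketch.
    """
    if not raw_categories:
        return None
    raw = {part.strip() for part in raw_categories.split(",")}
    for label, names in CATEGORY_MAP.items():
        if raw & names:
            return label
    return None
```

A business that matches no label returns None and would be filtered out. Whether a business tagged with two target categories should count once or under both labels is a decision the brief leaves open.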

Potomac, United States
Payment method verified
Member since May 9, 2026