Project ID:
1457542
Project Type:
Fixed
Budget:
$47-$97 USD
(Approx. €37-€78 EUR)
Project Description:
** Your knowledge/skills
Mandatory
- You are an experienced user of RapidMiner 5.2
- You already have experience of successful web scraping with RapidMiner 5.2
** Your work habits
Mandatory
- You respect deadlines (and will proactively report any hurdles)
- You will answer emails within 24 hours
- You will not outsource the job, in whole or in part
** Your personality
- You don’t hesitate to provide input/ideas that could bring added value to the project
- You are interested in a long-term collaboration on further web scraping projects.
** Your task will be
Your mission is to create a web scraping process in RapidMiner where the input is a set of keywords, and the output is a single Excel spreadsheet (.xls or .xlsx).
- As an example, take the set of keywords: US “trade balance” (trade balance is between quotes)
- The process will search the following 9 websites for these keywords:
http://www.reuters.com
http://www.bloomberg.com
http://www.businessweek.com
http://online.wsj.com
http://www.ft.com
http://www.nytimes.com
http://www.smh.com.au
http://www.guardian.co.uk
http://www.telegraph.co.uk
- For each website, the process will retrieve the 3 (default value) most recent articles. This number must be configurable per website, i.e. we may configure 5 articles for the NY Times but only 2 for the WSJ.
- The process will save the content of each article (only the article, not the full webpage) in an Excel spreadsheet where the columns are ordered as follows:
+ Column 1: publishing date of the article
The date format differs from one website to another. For example:
On Reuters: Tue Sep 20, 2011 11:40pm EDT
On Bloomberg: Sep 18, 2011 9:00 PM GMT+0200
On Businessweek: August 04, 2011, 4:45 PM EDT
On WSJ: September 27, 2011, 7:30 PM IST
On FT: September 11, 2011 4:24 pm
Etc.
+ Column 2: direct link to the article on the website (the source webpage that has been processed)
+ Column 3: title of the article (without html tags)
+ Column 4: content of the article (without html tags)
- The file “result.xls” will be saved under c:\rapidminer\
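To illustrate the logic only (the deliverable remains a RapidMiner process, not a script), here is a minimal Python sketch of two of the points above: the per-website article count with a default of 3, and reading the different date layouts listed for column 1. The names `ARTICLES_PER_SITE`, `article_limit` and `parse_article_date` are hypothetical. The sketch assumes English month and weekday names, and it simply drops the time zone suffix (EDT, IST, GMT+0200), which is a simplification rather than a requirement of this project.

```python
import re
from datetime import datetime

# Hypothetical per-website limits; any site not listed uses the default of 3.
DEFAULT_ARTICLES = 3
ARTICLES_PER_SITE = {
    "http://www.nytimes.com": 5,
    "http://online.wsj.com": 2,
}


def article_limit(site):
    """Return how many recent articles to retrieve for a website."""
    return ARTICLES_PER_SITE.get(site, DEFAULT_ARTICLES)


# One pattern per layout listed above, once the time zone has been dropped
# and a single space has been placed before am/pm.
DATE_FORMATS = [
    "%a %b %d, %Y %I:%M %p",  # Reuters
    "%b %d, %Y %I:%M %p",     # Bloomberg
    "%B %d, %Y, %I:%M %p",    # Businessweek, WSJ
    "%B %d, %Y %I:%M %p",     # FT
]

# Captures everything up to the clock time, then the am/pm marker.
# Anything after am/pm (the time zone, if present) is ignored.
_UP_TO_AMPM = re.compile(r"^(.*?\d{1,2}:\d{2})\s*([ap]m)\b", re.IGNORECASE)


def parse_article_date(raw):
    """Parse a publishing date string into a naive datetime (no time zone)."""
    match = _UP_TO_AMPM.match(raw.strip())
    if match is None:
        raise ValueError("No time with am/pm found in: %r" % raw)
    cleaned = "%s %s" % (match.group(1), match.group(2))
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next known layout
    raise ValueError("Unrecognised date layout: %r" % raw)
```

For instance, `article_limit("http://www.ft.com")` falls back to the default of 3, and `parse_article_date("Tue Sep 20, 2011 11:40pm EDT")` yields a datetime for 20 September 2011 at 23:40. A website whose layout is not in `DATE_FORMATS` raises an error instead of being guessed.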
** You will deliver
Mandatory
- You will test the process before delivery in order to ensure it works as described
- You will provide the .RMP file.
Skills required:
Web Scraping