Write a Python MapReduce job to find inbound links to a particular domain based on Common Crawl data -- 2
$10-30 CAD
In Progress
Posted about 8 years ago
Paid on delivery
Hi,
I'm looking for a developer/data scientist to write a MapReduce job in Python that lets me enter one or more domains and scans the Common Crawl public dataset hosted on Amazon Web Services. The job should return the pages that link to those domains (i.e. find inbound links).
The map function should identify which pages link to which domain, and the reduce function should summarize the count of links discovered.
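To make the requirement concrete, here is a minimal sketch of that map/reduce pair in plain Python. The target domains, the page URL, and the regex-based link extraction are all illustrative assumptions (a production job would parse WARC records and likely use a proper HTML parser), not the required implementation.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical target domains entered by the user.
TARGET_DOMAINS = {"example.com", "example.org"}

# Naive href extractor; a real job would use an HTML parser.
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)

def map_page(page_url, html):
    """Emit (target_domain, linking_page_url) for every outbound link
    whose host matches one of the target domains (or a subdomain)."""
    for link in HREF_RE.findall(html):
        host = urlparse(link).netloc.lower().split(":")[0]
        for domain in TARGET_DOMAINS:
            if host == domain or host.endswith("." + domain):
                yield domain, page_url

def reduce_links(pairs):
    """Summarize the count of inbound links found per target domain."""
    return dict(Counter(domain for domain, _ in pairs))
```

In a framework such as mrjob or Hadoop Streaming these two functions would become the job's mapper and reducer, with the framework handling the shuffle between them.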
I need the MapReduce job to process the entire Common Crawl archive and save the results as JSON output to S3, ideally one file for each domain entered; if a file grows too large, it should be split into multiple files.
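One way to handle the per-domain output splitting could look like the sketch below. The size limit, filename pattern, and JSON-lines layout are assumptions for illustration; the resulting payloads would then be uploaded to S3 (e.g. with boto3's `put_object`).

```python
import json

def split_domain_results(domain, records, max_bytes=100 * 2**20):
    """Group a domain's result records into JSON-lines chunks no larger
    than max_bytes each; returns (filename, payload) pairs ready for
    upload to S3. Filenames follow a hypothetical part-numbering scheme."""
    chunks, current, size = [], [], 0
    for rec in records:
        line = json.dumps(rec)
        # Start a new chunk when adding this line would exceed the limit.
        if current and size + len(line) + 1 > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append(current)
    return [
        (f"{domain}-part-{i:05d}.json", "\n".join(chunk))
        for i, chunk in enumerate(chunks)
    ]
```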
I should also be able to specify which Common Crawl archive to run against (there are separate archive snapshots taken on different dates).
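Selecting a snapshot typically means pointing the job at that crawl's WARC file manifest. A small helper along these lines could build the S3 key from a crawl ID, assuming the `crawl-data/<CRAWL-ID>/warc.paths.gz` layout of the public `commoncrawl` bucket; the exact layout should be verified against Common Crawl's documentation.

```python
def warc_manifest_key(crawl_id):
    """Return the S3 key of the warc.paths.gz manifest for a given crawl
    snapshot, e.g. 'CC-MAIN-2017-13' (snapshots follow the
    CC-MAIN-YYYY-WW naming scheme). The job would download this manifest
    from the commoncrawl bucket and fan its WARC paths out to mappers."""
    return f"crawl-data/{crawl_id}/warc.paths.gz"
```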