Overview
Crawl Engineer and Big Data Enthusiast
We're looking for someone enthusiastic about open source, net neutrality, open data and keeping the web truly open. Common Crawl is dedicated to building and maintaining an open repository of web crawl data in order to enable a new wave of innovation, education and research. If you're looking to do work that matters, come join us!
We're set to do amazing things this year, and there is no better place to hone your big data skills than helping us manage and process our 50 TB corpus. Plus, you'll be working within a passionate community and have the chance to interface with plenty of talented researchers, educators, startup folks, and an incredible advisory board.
Responsibilities
Improve the stability, scaling, and visibility of our distributed web crawler
Use, improve, and extend our post-crawl, Hadoop-based web data processing pipeline
Design and build an easy-to-use mechanism for specification and execution of custom crawls
Experience
You have the necessary background to architect and code for a system with tens of billions of documents
You have strong coding ability and experience with Java and at least one scripting language (e.g. Python, Ruby, Perl, Lua)
You have in-depth knowledge of HTTP and are familiar with web crawlers
You have development and administrative experience with Hadoop and HDFS
You have ops experience with Linux or another UNIX-like system
You have at least some familiarity with AWS, including one or more of EC2, S3, EBS, and EMR
You like to build useful, thorough documentation of code and systems
You're a self-starter willing to take ownership of projects