We are looking for a key engineer to keep BloomReach’s rapidly growing data processing and serving systems up and running in the cloud. In this role, you are responsible for the reliability and maintainability of BloomReach’s map-reduce and storage resources, as well as serving systems that support full traffic loads from some of the most-visited destinations on the web. You must be comfortable with significant responsibilities and possess a breadth of software, operations and reliability engineering skills.
Maintain reliability of large, load-balanced, high-traffic, multi-tier serving systems
Plan and architect production infrastructure for new and existing deployments
Streamline deployment processes and build monitoring infrastructure to ensure reliable site operations and high transparency
Build tools for monitoring and automation of deployments
Tune Hadoop clusters and Java-based compute and serving systems for performance, and troubleshoot incidents and performance issues
Respond rapidly and effectively to time-sensitive incidents and alerts
BS/MS degree in Computer Science preferred
Knowledge of high-traffic and high-availability architectures
A minimum of 3 years of experience in production software environments
Deep understanding of networking and network performance, DNS, HTTP, web performance, load balancing, and high-availability serving systems
Very strong Linux system administration and troubleshooting skills
Strong software engineering and debugging abilities, and fluency in at least two languages (such as Python, Ruby, C/C++, Java, bash)
Experience with server-side web development (such as Tomcat, Django, Apache) highly desired
Knowledge of Amazon Web Services (EC2, S3, EMR, ELB, etc.) a strong plus
Expertise in monitoring systems (such as Nagios), cluster management, database performance (MySQL), and large-scale compute clusters (Hadoop) a plus