Flite's Ad Platform powers the world's leading brands and agencies, such as Samsung, P&G, and SMG, and publishers such as LinkedIn, Forbes, and Condé Nast. If you want to make a difference at a Sequoia-funded startup and would rather solve a problem with a 10-line program than do it manually, we want to talk to you!
The Site Reliability Engineer will own the programmatic provisioning of Flite's cloud infrastructure: deploying, scaling, caching, load balancing, monitoring, and security. If customers see it, make it bulletproof. If existing open-source tools don't do the job, fix them or write your own.
Working with both engineering and operations, you will help us grow our existing infrastructure and tools to meet the ever-growing needs of our platform. You will ensure that products can be deployed quickly and easily, that the technical infrastructure is monitored and self-managing, and that clear recovery and business continuity plans are in effect.
RESPONSIBILITIES:
Manage the existing Amazon-hosted staging and production environments
Lead initiatives to improve the stability and scalability of the production environment
Build and maintain a robust deployment pipeline
Wrangle third-party systems management and monitoring tools, and write your own when necessary
Review and integrate third-party infrastructure product offerings as appropriate
Perform capacity and systems planning for all deployed applications
Share 24x7 on-call duties and after-hours responsibilities as required
REQUIREMENTS:
Relevant four-year degree or equivalent industry experience
Extensive experience with Amazon EC2, Elastic Load Balancing, AutoScaling, CloudWatch, CloudFormation
Minimum of 3 years of production Linux system administration experience
A solid background in at least one high-level programming language (e.g. Python, Ruby, Perl) and a willingness to learn others
Fluency in one or more of C, C++, or Java
Experience in a mission-critical 24x7 production environment required, ideally involving Tomcat
Experience scaling and managing systems in public cloud environments
Knowledge of high-availability strategies and technologies, and of site reliability engineering best practices
Experience with high-level systems automation tools (e.g. Puppet, Chef, Ansible, Salt)
An understanding of scalable monitoring and statistics gathering architectures and tools
Organized, self-managing, requiring little supervision
Excellent oral/written communication and documentation skills
BONUS POINTS FOR:
Experience managing and optimizing MySQL
Experience with Splunk/Loggly/Sumo Logic, collectd, StatsD
Experience with New Relic, AppDynamics, Stackdriver, Datadog