Metaweb has built a large database to support collaborative web applications. One of the issues with a large database is that there are lots of things that are the same but have slightly different names, and lots of things with the same name that are actually different. You'll recognize this problem as entity resolution, a.k.a. reconciliation.
We're looking for a principal engineer and architect to play a key role in what we call "Identity Services", a set of reusable capabilities related to reconciliation. You will devise novel methods for high-performance, distributed matching of records, which we will put to use in a number of applications. You'll create mechanisms for assigning confidence scores to potential matches.
Ideally, you'd have experience with graph-based entity resolution in either an academic or commercial setting. A background in clustering and classification algorithms (e.g., k-means, SVM, and Bayesian classifiers) would be a big plus.
We are passionate about making a large-scale, community-driven repository for structured information. We like to create tools that make crowd-sourced databases fun, interesting and easy to use. Freebase provides you with an enormous, real-world set of topics for developing innovations in similarity, clustering, and classification.
If this seems like your kind of scene, please submit a cover letter and resume in PDF, plain text or HTML, and include your answers to the following questions:
1. What is your favorite programming language? Why?
2. Why is cosine similarity considered a good similarity measure for use with tf-idf in information retrieval? Why not use Euclidean distance?
3. How does the use of kernels play into creating a good similarity measure?
4. Devise a similarity measure that compares descriptions of objects or topics. Consider the following pairs: 1) Mark Twain, Samuel Clemens, 2) Mark Twain, Kurt Vonnegut, and 3) Mark Twain, Shania Twain. Ideally, your similarity measure should predict that the first pair is identical or very similar, and that each successive pair is less so. If it does not, explain why. What kinds of descriptions of objects are better than others, and why?