Data Science 101: Stanford Mining Massive Datasets


A new data science learning resource is about to commence, brought to you by Stanford University via Coursera: Mining Massive Datasets. This class teaches algorithms for extracting models and other information from very large amounts of data, with an emphasis on techniques that are efficient and scale well. The seven-week course starts September 29, 2014 and is taught by three renowned Stanford computer science professors: Jure Leskovec, Anand Rajaraman, and Jeff Ullman. The summary of the course appearing on the class website is as follows:

We introduce the student to modern distributed file systems and MapReduce, including what distinguishes good MapReduce algorithms from good algorithms in general. The rest of the course is devoted to algorithms for extracting models and information from large datasets. Students will learn how Google’s PageRank algorithm models the importance of Web pages and some of the many extensions that have been used for a variety of purposes. We’ll cover locality-sensitive hashing, a bit of magic that allows you to find similar items in a set of items so large you cannot possibly compare each pair. When data is stored as a very large, sparse matrix, dimensionality reduction is often a good way to model the data, but standard approaches do not scale well; we’ll talk about efficient approaches. Many other large-scale algorithms are covered as well, as outlined in the course syllabus.
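To give a flavor of the Week 1 material, below is a minimal sketch of the PageRank power-iteration idea in plain Python. The tiny link graph, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions, not course code.

# A minimal, illustrative sketch of PageRank power iteration.
# The toy link graph and parameters below are assumptions for illustration only.

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate the importance of each page from its in-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                       # dangling page: spread its score evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:                                  # pass an equal share to each linked page
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    # Hypothetical four-page web: keys link to the pages in their lists.
    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 3))

Running the sketch shows page C accumulating the highest score, since every other page links to it; the course covers how this basic iteration is extended and scaled to real Web graphs.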

Week 1:
MapReduce, Link Analysis — PageRank

Week 2:
Locality-Sensitive Hashing — Basics + Applications, Distance Measures, Nearest Neighbors, Frequent Itemsets

Week 3:
Data Stream Mining, Analysis of Large Graphs

Week 4:
Recommender Systems, Dimensionality Reduction

Week 5:
Clustering, Computational Advertising

Week 6:
Support-Vector Machines, Decision Trees, MapReduce Algorithms

Week 7:
More About Link Analysis — Topic-specific PageRank, Link Spam, More About Locality-Sensitive Hashing

 

Sign up for the free insideBIGDATA newsletter.
