University of Pennsylvania War and Peace Word Count Dataset Assignment
APA format is required for in-text citations and references.
1) The text of the novel War and Peace can be downloaded from https://www.library.upenn.edu/ and used as the dataset for these exercises. However, other datasets can easily be substituted. Document all processing steps applied to the data.
i) Use MapReduce in Hadoop to perform a word count on the specified dataset.
ii) Use Pig to perform a word count on the specified dataset.
iii) Use Hive to perform a word count on the specified dataset.
2) Compare and contrast Hadoop, Pig, Hive, and HBase. List strengths and weaknesses of each tool set.
3) Research and summarize three published use cases for each tool set.
4) How does HBase differ from a traditional RDBMS with regard to file structure?
5) Explain what a window function is and how it is similar to, and different from, the type of calculation that can be done with an aggregate function.
6) Give regular expressions for the following:
i) A regex that, given a URL, captures the domain name
ii) A regex that captures PostgreSQL dollar-quoted string literals
7) Explain how you would use GROUPING SETS to produce the same results as the following GROUP BY CUBE query.
i) SELECT state, productID, SUM(volume) FROM sales GROUP BY CUBE (state, productID) ORDER BY state, productID
8) Identify an “embarrassingly parallel” situation from your current work.
9) Explain at least two benefits of YARN.
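Note on item 1: the MapReduce, Pig, and Hive word counts should agree with one another when the same tokenization is used, so a small local reference count can help when checking results. The following is a minimal sketch only, not a required method. It mimics the map, shuffle, and reduce phases in memory, and it assumes that a word is a lowercase run of letters and apostrophes; the assignment does not specify a tokenization rule, and the function names are illustrative.

```python
import re
from collections import defaultdict
from itertools import chain

# Assumption: a token is a run of letters or apostrophes, compared in lowercase.
TOKEN = re.compile(r"[a-z']+")


def map_phase(line):
    """Emit a (word, 1) pair for every token in one line of input."""
    return [(word, 1) for word in TOKEN.findall(line.lower())]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the list of values collected for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Made-up sample lines, not text taken from the novel.
    sample = ["the war and the peace", "The peace"]
    for word, total in sorted(word_count(sample).items()):
        print(word, total)
```

To use it on the real dataset, pass an open file object for the downloaded text to word_count. Whatever tokenization rule is chosen should be recorded as one of the documented processing steps and applied consistently across all three tools.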