When would you prefer a Naive Bayes model to a logistic regression model for classification?
A. When you are using several categorical input variables with over 1000 possible values each.
B. When you need to estimate the probability of an outcome, not just which class it is in.
C. When all the input variables are numerical.
D. When some of the input variables might be correlated.
In linear regression modeling, which action can be taken to improve the linearity of the relationship between the dependent and independent variables?
A. Apply a transformation to a variable
B. Use a different statistical package
C. Calculate the R-Squared value
D. Change the units of measurement on the independent variable
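As background on what a variable transformation does, here is a minimal sketch. The data is made up for illustration (an exponential curve, which is an assumption and not part of the question), and `pearson` is a hypothetical helper written for this example. It compares the linear correlation before and after taking the logarithm of the dependent variable.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical data: y grows exponentially with x, so the raw relationship is curved.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [math.exp(0.8 * x) for x in xs]

r_raw = pearson(xs, ys)                          # correlation on the raw scale
r_log = pearson(xs, [math.log(y) for y in ys])   # correlation after log-transforming y

print(f"raw: {r_raw:.3f}, log-transformed: {r_log:.3f}")
```

On this made-up data the log-transformed relationship is a straight line, so its correlation is higher than on the raw scale. Real data will rarely be this clean.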
You have been assigned to run a linear regression model for each of 5,000 distinct districts, and all the data is currently stored in a PostgreSQL database. Which tool/library would you use to produce these models with the least effort?
A. MADlib
B. Mahout
C. R
D. HBase
Which word or phrase completes the statement? Structured data is to OLAP data as quasi-structured data is to ____
A. Clickstream data
B. XML data
C. Text documents
D. Image files
When would you use a Wilcoxon Rank Sum test?
A. When you cannot make an assumption about the distribution of the populations
B. When the data can easily be sorted
C. When the populations represent the sums of other values
D. When the data cannot easily be sorted
Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has previously worked extensively with SQL and databases. Which query interface would you recommend?
A. Hive
B. Pig
C. Howl
D. HBase
Refer to the exhibit.
You are assigned to do an end-of-year sales analysis of 1,000 different products, based on the transaction table. Which column in the end-of-year report requires the use of a window function?
A. Total Sales to Date
B. Daily Sales
C. Average Daily Price
D. Maximum Price
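For readers unfamiliar with the term, a window function computes a value for each row from a set of related rows without collapsing those rows into one. The exhibit's transaction table is not reproduced here, so the rows and the `running_totals` helper below are hypothetical. The sketch imitates, in plain Python, what a SQL expression of the form `SUM(amount) OVER (PARTITION BY product ORDER BY day)` would return.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical transactions: (product, day, sales amount).
rows = [
    ("widget", "2023-01-01", 10.0),
    ("widget", "2023-01-02", 15.0),
    ("widget", "2023-01-03", 5.0),
    ("gadget", "2023-01-01", 7.0),
    ("gadget", "2023-01-02", 3.0),
]

def running_totals(rows):
    """Return each row with a per-product running total appended."""
    out = []
    # Sort by product, then day, so each partition is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _product, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        totals = accumulate(amount for _, _, amount in group)
        for (product, day, amount), total in zip(group, totals):
            out.append((product, day, amount, total))
    return out

result = running_totals(rows)
for row in result:
    print(row)
```

Every input row survives in the output, which is the difference from a plain `GROUP BY` aggregate such as a maximum or an average.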
Refer to the exhibit.
You are given a list of predefined association rules:
For your next analysis, you must limit your dataset based on rules with confidence greater than 60%. Which of the rules will be kept in the analysis?
A. RENTER => BAD CREDIT
B. RENTER => GOOD CREDIT
C. HOME OWNER => BAD CREDIT
D. HOME OWNER => GOOD CREDIT
E. FREE HOUSING => BAD CREDIT
F. FREE HOUSING => GOOD CREDIT
A. Rules B and D
B. Rules A and F
C. Rules C and E
D. Rules D and E
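The exhibit with the rule counts is not reproduced here, so as a reminder of the arithmetic involved: the confidence of a rule X => Y is the number of records containing both X and Y divided by the number containing X. The sketch below applies the 60% threshold to made-up counts. The rule names, the numbers, and the `confidence` helper are all hypothetical.

```python
def confidence(count_both, count_antecedent):
    """Confidence of X => Y: records with both X and Y over records with X."""
    if count_antecedent == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return count_both / count_antecedent

# Hypothetical counts: rule -> (records with X and Y, records with X).
counts = {
    ("bread", "butter"): (45, 60),
    ("bread", "jam"): (15, 60),
    ("tea", "milk"): (20, 50),
}

threshold = 0.60
kept = [
    rule
    for rule, (both, antecedent) in counts.items()
    if confidence(both, antecedent) > threshold
]
print(kept)
```

With these invented counts only the first rule exceeds the threshold, since 45/60 is 0.75 while the other two are 0.25 and 0.40.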
Refer to the exhibit.
Which type of data issue would you suspect based on the exhibit?
A. "Saturated" data, indicating potential issues with data definitions
B. Incomplete data, indicating potential issues with data transmission
C. Mis-scaled data, indicating potential issues with data entry
D. The exhibit does not raise any obvious concerns with the data.
Refer to the exhibit.
The exhibit shows four graphs labeled Fig A through Fig D. Which figure represents the entropy function relative to a Boolean classification, as given by the formula shown in the exhibit?
A. Fig-A
B. Fig-B
C. Fig-C
D. Fig-D
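The exhibit's formula is not reproduced here. Assuming it is the standard entropy of a Boolean classification, H(p) = -p log2(p) - (1 - p) log2(1 - p), where p is the proportion of positive examples, the function can be evaluated with the short sketch below. The name `boolean_entropy` is chosen for this example only.

```python
import math

def boolean_entropy(p):
    """Entropy in bits of a two-class split with positive proportion p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # By convention 0 * log2(0) is treated as 0, so pure splits have zero entropy.
    if p in (0.0, 1.0):
        return 0.0
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)

# Evaluate at a few proportions to see the shape of the curve.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p={p:.2f}  H={boolean_entropy(p):.3f}")
```

Under that assumption the values are symmetric around p = 0.5, where the function reaches its maximum of 1 bit, and fall to 0 at both ends.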