What is the default field delimiter for Hive tables?
A. ^A (Control-A)
B. , (comma)
C. \t (tab)
D. : (colon)
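As an illustration of how a non-printing delimiter such as ^A (byte 0x01) behaves in practice, here is a minimal Python sketch that splits one text row into fields. The `parse_row` helper and the sample row are hypothetical and are not taken from any Hive API.

```python
def parse_row(line: str, delimiter: str = "\x01") -> list[str]:
    """Split one text row into fields on the given delimiter.

    "\x01" is the Control-A character; pass "," or "\t" for other layouts.
    """
    return line.rstrip("\n").split(delimiter)


# Hypothetical row with three fields separated by Control-A.
row = "42\x01alice\x012024-01-01\n"
print(parse_row(row))
```

The same helper handles comma- or tab-delimited rows by passing a different `delimiter`.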
Under what two conditions does stochastic gradient descent outperform 2nd-order optimization techniques such as iteratively reweighted least squares?
A. When the volume of input data is so large and diverse that a 2nd-order optimization technique can be fit to a sample of the data
B. When the model's estimates must be updated in real time to account for new observations.
C. When the input data can easily fit into memory on a single machine, but we want to calculate confidence intervals for all of the parameters in the model.
D. When we are required to find the parameters that return the optimal value of the objective function.
There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML), both variants of a blood cancer.
The makeup of the groups is as follows:
Each individual has an expression value for each of 10,000 different genes. The expression value for each gene is a continuous value between -1 and 1.
With which type of plot can you encode the largest amount of the data visually?
Rather than use all 10,000 features to separate AML from ALL, you pick a small subset of features to separate them optimally. Your feature vectors have 10,000 dimensions while you have only 52 data points. You use cross-validation to test your chosen set of features. Which three methods will choose the features in an optimal way?
A. Singular Value Decomposition
B. Bootstrapping
C. Markov chain Monte Carlo
D. Hidden Markov model
E. Bayesian Information Criterion
F. Mutual Information
You have a directory containing a number of comma-separated files. Each file has three columns and each filename has a .csv extension. You want to have a single tab-separated file (all.tsv) that contains all the rows from all the files.
Which command is guaranteed to produce the desired output if you have more than 20,000 files to process?
A. find . -name '*.csv' -print0 | xargs -0 cat | tr ',' '\t' > all.tsv
B. find . -name '*.csv' | cat | awk 'BEGIN {FS = "," OFS = "\t"} {print $1, $2, $3}' > all.tsv
C. find . -name '*.csv' | tr ',' '\t' | cat > all.tsv
D. find . -name '*.csv' | cat > all.tsv
E. cat *.csv > all.tsv
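The concern with very large file counts is that a shell glob expanding to tens of thousands of filenames can exceed the operating system's argument-length limit, whereas processing filenames one at a time does not. The same task can be sketched in Python, which visits files one by one. The `merge_csv_to_tsv` function and the temporary demo files are hypothetical, and the sketch assumes simple CSV content without embedded tabs.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_to_tsv(src_dir: Path, out_path: Path) -> int:
    """Append every row of every *.csv under src_dir to one tab-separated file.

    Files are visited one at a time, so the number of files is not limited
    by any command-line length restriction. Returns the number of rows written.
    """
    rows_written = 0
    with out_path.open("w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for csv_path in sorted(src_dir.rglob("*.csv")):
            with csv_path.open(newline="") as src:
                for row in csv.reader(src):
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demo on two small throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "a.csv").write_text("1,2,3\n")
    (src / "b.csv").write_text("4,5,6\n7,8,9\n")
    out = src / "all.tsv"
    count = merge_csv_to_tsv(src, out)
    merged = out.read_text()

print(count)
print(merged)
```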
When optimizing a function using stochastic gradient descent, how frequently should you update your estimate of the gradient?
A. Once after every pass through the data set
B. Once per observation
C. For each observation with a probability that you choose ahead of time
D. After a random number of observations
E. Once every N observations, where you decide N ahead of time
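To make the update schedules in these options concrete, the following sketch runs stochastic gradient descent on a tiny noise-free linear regression. It refreshes the gradient estimate after every `batch_size` observations, so `batch_size=1` means one update per observation and a larger value means one update every N observations. The function name, learning rate, and toy data are assumptions chosen for illustration.

```python
import random


def sgd_linear(data, lr=0.05, epochs=200, batch_size=1, seed=0):
    """Fit y = w * x + b by SGD on squared error.

    The parameters are updated once per batch of `batch_size` observations.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit observations in a fresh random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data drawn exactly from y = 2x + 1 on [-1, 1].
points = [(k / 10, 2 * (k / 10) + 1) for k in range(-10, 11)]
w, b = sgd_linear(points)
print(round(w, 3), round(b, 3))
```

Calling `sgd_linear(points, batch_size=5)` switches to one update every five observations without changing anything else.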
You are about to sample a 100-dimensional unit cube. To adequately sample any single given dimension, you need only capture 10 points. How many points do you need in order to sample the complete 100-dimensional unit cube adequately?
A. 10010
B. 1010
C. Log2(100)
D. 100
E. 1000
F. 1010
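The arithmetic behind this question is that a regular grid with k points along each axis of a d-dimensional cube contains k^d points, so the required sample size grows exponentially with the number of dimensions. A short sketch, with the hypothetical helper `grid_points`, shows how quickly the count grows:

```python
def grid_points(points_per_dim: int, dims: int) -> int:
    """Number of points in a regular grid with points_per_dim points per axis."""
    return points_per_dim ** dims


for d in (1, 2, 3, 100):
    n = grid_points(10, d)
    # Print the digit count rather than the full number for large d.
    print(f"dims={d}: {len(str(n))} digits")
```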
Consider the following sample from a distribution that contains a continuous variable X and a label Y that is either A or B:
Which is the best cut point for X if you want to discretize these values into two buckets in a way that minimizes the sum of chi-square values?
A. X 8
B. X 6
C. X 5
D. X 4
E. X 2
Why is the naive Bayes classifier "naive"?
A. It generally performs worse than more complex methods
B. It is an unbiased estimator
C. It assumes independence between all features
D. It makes no assumptions on the underlying distributions (i.e., it is non-parametric)
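For reference, a naive Bayes classifier scores each class by multiplying the class prior with one conditional probability per feature, treating the features as conditionally independent given the class. The sketch below implements that scoring for categorical features with add-one smoothing. The function names, the smoothing choice, and the four-row data set are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Count labels and per-position feature values from (features, label) pairs."""
    label_counts = Counter(label for _, label in samples)
    feat_counts = defaultdict(Counter)  # (label, position) -> value counts
    for feats, label in samples:
        for i, value in enumerate(feats):
            feat_counts[(label, i)][value] += 1
    return label_counts, feat_counts


def predict_nb(model, feats, alpha=1.0, n_values=2):
    """Return the label with the highest log prior plus summed log likelihoods."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        # Each feature contributes its own factor, independent of the others.
        for i, value in enumerate(feats):
            count = feat_counts[(label, i)][value]
            score += math.log((count + alpha) / (n + alpha * n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical binary features, e.g. (contains_link, contains_offer).
samples = [((1, 1), "spam"), ((1, 0), "spam"), ((0, 1), "ham"), ((0, 0), "ham")]
model = train_nb(samples)
print(predict_nb(model, (1, 1)), predict_nb(model, (0, 0)))
```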
Which three metrics are useful in measuring the accuracy and quality of a recommender system?
A. Mutual Information
B. RMSE
C. Tanimoto coefficient
D. Pearson correlation
E. Precision
F. Recall
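As background for the metrics named in these options, the sketch below computes precision and recall for a list of recommended items against a set of relevant items, and root mean squared error (RMSE) for predicted versus actual ratings. The function names and the sample data are hypothetical.

```python
import math


def precision_recall(recommended, relevant):
    """Precision = hits / recommended items; recall = hits / relevant items."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: four recommendations, three truly relevant items.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
err = rmse([3.0, 4.0], [3.0, 2.0])
print(p, r, err)
```

Precision and recall judge the ranked list of recommendations, while RMSE judges how close predicted ratings are to observed ones.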