Why should you stop an iterative machine learning algorithm as soon as the performance of the model on a test set stops improving?
A. To avoid the need for cross-validating the model
B. To prevent overfitting
C. To increase the VC (Vapnik-Chervonenkis) dimension for the model
D. To keep the number of terms in the model as small as possible
E. To maintain the highest VC (Vapnik-Chervonenkis) dimension for the model
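The stopping rule described in the question can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `early_stopping_iteration`, the `patience` parameter, and the loss values are all hypothetical.

```python
def early_stopping_iteration(val_losses, patience=1):
    """Return the index of the iteration at which training would stop.

    val_losses: held-out loss recorded after each training iteration.
    patience:   how many iterations without improvement are tolerated.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # Held-out performance is still improving, so keep training.
            best_loss = loss
            best_iter = i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop here.
            return i
    # The loss improved until the end, so training ran to completion.
    return len(val_losses) - 1


# Hypothetical held-out losses: they fall, then rise as the model starts
# fitting noise in the training data.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
stop_at = early_stopping_iteration(losses)
print(stop_at)  # 4
```

With these made-up numbers the held-out loss bottoms out at iteration 3, and the loop halts one iteration later instead of continuing while the loss climbs.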
Under what two conditions does stochastic gradient descent outperform 2nd-order optimization techniques such as iteratively reweighted least squares?
A. When the volume of input data is so large and diverse that a 2nd-order optimization technique can be fit to a sample of the data
B. When the model's estimates must be updated in real-time in order to account for new observations.
C. When the input data can easily fit into memory on a single machine, but we want to calculate confidence intervals for all of the parameters in the model.
D. When we are required to find the parameters that return the optimal value of the objective function.
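To make the contrast with batch second-order methods concrete, here is a rough sketch of a stochastic gradient descent step for a one-feature linear model with squared error. The function `sgd_update`, the learning rate, and the synthetic data stream are assumptions made for illustration only.

```python
import random


def sgd_update(w, b, x, y, lr=0.05):
    """One SGD step for squared error on a single observation (x, y)."""
    error = (w * x + b) - y
    # Gradient of 0.5 * error**2 with respect to w is error * x, and
    # with respect to b is error.
    return w - lr * error * x, b - lr * error


random.seed(0)
w, b = 0.0, 0.0
# Hypothetical noiseless stream generated by y = 3x + 1. Each observation
# is used once and then discarded, so the full data set is never held in
# memory and the estimates are refreshed as every new point arrives.
for _ in range(5000):
    x = random.uniform(-1.0, 1.0)
    y = 3.0 * x + 1.0
    w, b = sgd_update(w, b, x, y)

print(round(w, 3), round(b, 3))
```

Each update costs a constant amount of work per observation, whereas iteratively reweighted least squares needs a pass over the whole data set per iteration.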
What is the result of the following command (the database username is foo and password is bar)?
$ sqoop list-tables --connect jdbc:mysql://localhost/databasename --table --username foo --password bar
A. sqoop lists only those tables in the specified MySQL database that have not already been imported into HDFS
B. sqoop returns an error
C. sqoop lists the available tables from the database
D. sqoop imports all the tables from the SQL database into HDFS
You have a large m x n data matrix M. You decide you want to perform dimension reduction/clustering on your data and have decided to use the singular value decomposition (SVD; also called principal components analysis, PCA).
Refer to the passage above.
Which of the following represents the SVD of the matrix M, given the following information:
U is m x m unitary
V is n x n unitary
S is m x n diagonal
Q is n x n invertible
D is n x n diagonal
L is m x m lower triangular
U is m x m upper triangular
A. M = U S V
B. M = U P
C. M = Q D Q^-1
D. M = L U
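The shapes listed above can be checked numerically. The sketch below uses NumPy's `np.linalg.svd` on a small random matrix. The sizes and the random data are arbitrary choices for illustration. NumPy returns the third factor already conjugate-transposed (named `Vh` here), so the product is written `U @ S @ Vh`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3  # arbitrary sizes with m >= n, chosen for the example
M = rng.normal(size=(m, n))

# full_matrices=True yields an m x m U and an n x n Vh, matching the
# unitary shapes listed in the question.
U, s, Vh = np.linalg.svd(M, full_matrices=True)

# s holds only the singular values, so embed them in an m x n
# "diagonal" matrix S.
S = np.zeros((m, n))
S[:n, :n] = np.diag(s)

M_rebuilt = U @ S @ Vh
print(np.allclose(M, M_rebuilt))  # True
```

The singular values in `s` come back sorted in descending order, which is what makes truncating to the leading components usable for dimension reduction.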
Which best describes the primary function of Flume?
A. Flume is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with an infrastructure consisting of sources and sinks for importing and evaluating large data sets
B. Flume acts as a Hadoop filesystem for log files
C. Flume imports data from a SQL/relational database into your Hadoop cluster
D. Flume provides a query language for Hadoop similar to SQL
E. Flume is a distributed service for collecting and moving large amounts of data into HDFS as it is produced from streaming data flows
What are three benefits of running feature selection analysis before fitting a classification model?
A. Eliminates the need to include a regularization term
B. Reduces the number of subjective decisions required to construct the model
C. Guarantees the optimality of the final model
D. Speeds up the model fitting process
E. Develops an understanding of the importance of different features
F. Improves the predictive performance of the model
You have a large file of N records (one per line), and want to randomly sample 10% of them. You have two functions that are perfect random number generators (though they are a bit slow):
random_uniform() generates a uniformly distributed number in the interval [0, 1].
random_permutation(M) generates a random permutation of the numbers 0 through M-1.
Below are three different functions that implement the sampling.
Method A
for line in file:
    if random_uniform() < 0.1:
        print(line)
Method B
i = 0
for line in file:
    if i % 10 == 0:
        print(line)
    i += 1
Method C
idxs = random_permutation(N)[:N // 10]
i = 0
for line in file:
    if i in idxs:
        print(line)
    i += 1
Which method will have the best runtime performance?
A. Method A
B. Method B
C. Method C
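For readers who want to experiment, the three methods can be written as runnable functions. This is a sketch under stated assumptions: the two generators are stand-ins built on Python's `random` module (the real ones are described only as "a bit slow"), and the file is replaced by an in-memory list of lines.

```python
import random


def random_uniform():
    """Stand-in for the question's uniform generator."""
    return random.random()


def random_permutation(m):
    """Stand-in: a random permutation of the numbers 0 through m - 1."""
    return random.sample(range(m), m)


def method_a(lines):
    # One call to the slow generator per line; the sample size is only
    # approximately 10% of the input.
    return [line for line in lines if random_uniform() < 0.1]


def method_b(lines):
    # No generator calls at all, but the result is every tenth line,
    # which is deterministic rather than random.
    return [line for i, line in enumerate(lines) if i % 10 == 0]


def method_c(lines):
    n = len(lines)
    # One permutation of all n indices. idxs is kept as a list, as in
    # the question, so each `i in idxs` test scans up to n // 10 entries.
    idxs = random_permutation(n)[: n // 10]
    return [line for i, line in enumerate(lines) if i in idxs]


lines = [str(i) for i in range(1000)]
print(len(method_a(lines)), len(method_b(lines)), len(method_c(lines)))
```

The comments note where each method spends its time: the number of calls to the slow generators, and for Method C the linear membership test repeated for every line.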
From historical data, you know that 50% of students who take Cloudera's Introduction to Data Science: Building Recommender Systems training course pass this exam, while only 25% of students who did not take the training course pass this exam. You also know that 50% of this exam's candidates also take Cloudera's Introduction to Data Science: Building Recommender Systems training course.
If we know that a person has passed this exam, what is the probability that they took Cloudera's Introduction to Data Science: Building Recommender Systems training course?
A. 2/3
B. 1/2
C. 3/4
D. 3/5
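One way to check the arithmetic is to apply Bayes' rule directly to the three percentages given in the question. The variable names below are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

p_course = Fraction(1, 2)                # P(took the course)
p_pass_given_course = Fraction(1, 2)     # P(pass | took the course)
p_pass_given_no_course = Fraction(1, 4)  # P(pass | did not take it)

# Law of total probability: overall chance of passing.
p_pass = (p_pass_given_course * p_course
          + p_pass_given_no_course * (1 - p_course))

# Bayes' rule: P(took the course | pass).
p_course_given_pass = p_pass_given_course * p_course / p_pass

print(p_pass)               # 3/8
print(p_course_given_pass)  # 2/3
```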