Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?
A. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata
B. pandas API on Spark DataFrames are more performant than Spark DataFrames
C. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata
D. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames
A machine learning engineer wants to parallelize the inference of group-specific models using the Pandas Function API. They have developed the apply_model function that will look up and load the correct model for each group, and they want to apply it to each group of DataFrame df.
They have written the following incomplete code block:
Which piece of code can be used to fill in the above blank to complete the task?
A. applyInPandas
B. groupedApplyInPandas
C. mapInPandas
D. predict
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
A. 3
B. 5
C. 6
D. 18
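The arithmetic behind the grid above can be sketched in a few lines of plain Python. This is only an illustration of how the counts arise: the variable names are hypothetical, and how many of these fits actually run at the same time depends on the tuning tool and its parallelism setting.

```python
from itertools import product

# Grid and fold count taken from the question above.
hyperparameter_1 = [2, 5, 10]
hyperparameter_2 = [50, 100]
num_folds = 3

# Every pairing of the two lists is one candidate configuration.
combinations = list(product(hyperparameter_1, hyperparameter_2))

# Each configuration is fitted once per cross-validation fold.
total_fits = len(combinations) * num_folds

print(len(combinations))  # 6 distinct configurations
print(total_fits)         # 18 model fits across all folds
```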
Which of the following approaches can be used to view the notebook that was run to create an MLflow run?
A. Open the MLmodel artifact in the MLflow run page
B. Click the "Models" link in the row corresponding to the run in the MLflow experiment page
C. Click the "Source" link in the row corresponding to the run in the MLflow experiment page
D. Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
A. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
B. One-hot encoding is dependent on the target variable's values which differ for each application.
C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
A data scientist is developing a single-node machine learning model. They have a large number of model configurations to test as a part of their experiment. As a result, the model tuning process takes too long to complete. Which of the following approaches can be used to speed up the model tuning process?
A. Implement MLflow Experiment Tracking
B. Scale up with Spark ML
C. Enable autoscaling clusters
D. Parallelize with Hyperopt
The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?
A. Logistic regression
B. Singular value decomposition
C. Iterative optimization
D. Least-squares method
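To make the contrast with a closed-form matrix solution concrete, here is a minimal sketch of fitting a one-variable linear regression by repeated gradient steps. It uses plain batch gradient descent on a tiny made-up dataset as a simple stand-in; it is not Spark ML's optimizer, and the function name fit_linear, the data, the learning rate, and the step count are all assumptions chosen for illustration.

```python
# Hypothetical toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move both parameters a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_linear(xs, ys)
print(round(w, 4), round(b, 4))
```

Each step only needs sums over the data, which is why this style of method can be split across partitions and aggregated, whereas a direct decomposition has to work with the full feature matrix.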
A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.
In which situation will the machine learning engineer be correct?
A. When the new solution requires if-else logic determining which model to use to compute each prediction
B. When the new solution's models have an average latency that is larger than the size of the original model
C. When the new solution requires the use of fewer feature variables than the original model
D. When the new solution requires that each model computes a prediction for every record
E. When the new solution's models have an average size that is larger than the size of the original model
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0.
Which of the following code blocks will accomplish this task?
A. spark_df.loc[:,spark_df["discount"] <= 0]
B. spark_df[spark_df["discount"] <= 0]
C. spark_df.filter(col("discount") <= 0)
D. spark_df.loc[spark_df["discount"] <= 0, :]
A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.
Which of the following suggestions should the team include in their guidelines?
A. The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.
B. The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.
C. The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.
D. The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.
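The difference between the two metrics is easiest to see on skewed data. The sketch below uses hypothetical labels (5 positives, 95 negatives) and a trivial model that always predicts the negative class; the helper functions are written from the standard definitions and are not taken from any particular library.

```python
# Hypothetical labels: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95
# A trivial model that always predicts the negative class.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # With no true positives, false positives, or false negatives, report 0.
    return 0.0 if denom == 0 else 2 * tp / denom


print(accuracy(y_true, y_pred))  # 0.95
print(f1_score(y_true, y_pred))  # 0.0
```

The model misses every positive case, yet accuracy still reports 0.95, while F1 drops to 0.0 because all five positives are false negatives.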