MLS-C01 Online Practice Questions and Answers

Questions 4

A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data?

A. Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.

B. Use AWS Glue to catalogue the data and Amazon Athena to run queries

C. Use AWS Batch to run ETL on the data and Amazon Aurora to run the quenes

D. Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries

Buy Now

Questions 5

A Data Scientist wants to gain real-time insights into a data stream of GZIP files. Which solution would allow the use of SQL to query the stream with the LEAST latency?

A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.

B. AWS Glue with a custom ETL script to transform the data.

C. An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.

D. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.

Buy Now

Questions 6

A Machine Learning Specialist receives customer data for an online shopping website. The data includes demographics, past visits, and locality information. The Specialist must develop a machine learning approach to identify the customer shopping patterns, preferences and trends to enhance the website for better service and smart recommendations.

Which solution should the Specialist recommend?

A. Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database.

B. A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database

C. Collaborative filtering based on user interactions and correlations to identify patterns in the customer database

D. Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database

Buy Now

Questions 7

A machine learning (ML) specialist needs to extract embedding vectors from a text series. The goal is to provide a ready-to-ingest feature space for a data scientist to develop downstream ML predictive models. The text consists of curated sentences in English. Many sentences use similar words but in different contexts. There are questions and answers among the sentences, and the embedding space must differentiate between them.

Which options can produce the required embedding vectors that capture word context and sequential QA information? (Choose two.)

A. Amazon SageMaker seq2seq algorithm

B. Amazon SageMaker BlazingText algorithm in Skip-gram mode

C. Amazon SageMaker Object2Vec algorithm

D. Amazon SageMaker BlazingText algorithm in continuous bag-of-words (CBOW) mode

E. Combination of the Amazon SageMaker BlazingText algorithm in Batch Skip-gram mode with a custom recurrent neural network (RNN)

Buy Now

Questions 8

An online store is predicting future book sales by using a linear regression model that is based on past sales data. The data includes duration, a numerical feature that represents the number of days that a book has been listed in the online store. A data scientist performs an exploratory data analysis and discovers that the relationship between book sales and duration is skewed and non-linear.

Which data transformation step should the data scientist take to improve the predictions of the model?

A. One-hot encoding

B. Cartesian product transformation

C. Quantile binning

D. Normalization

Buy Now

Questions 9

An automotive company uses computer vision in its autonomous cars. The company trained its object detection models successfully by using transfer learning from a convolutional neural network (CNN). The company trained the models by using PyTorch through the Amazon SageMaker SDK.

The vehicles have limited hardware and compute power. The company wants to optimize the model to reduce memory, battery, and hardware consumption without a significant sacrifice in accuracy.

Which solution will improve the computational efficiency of the models?

A. Use Amazon CloudWatch metrics to gain visibility into the SageMaker training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set new weights based on the pruned set of filters. Run a new training job with the pruned model.

B. Use Amazon SageMaker Ground Truth to build and run data labeling workflows. Collect a larger labeled dataset with the labelling workflows. Run a new training job that uses the new labeled data with previous training data.

C. Use Amazon SageMaker Debugger to gain visibility into the training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set the new weights based on the pruned set of filters. Run a new training job with the pruned model.

D. Use Amazon SageMaker Model Monitor to gain visibility into the ModelLatency metric and OverheadLatency metric of the model after the company deploys the model. Increase the model learning rate. Run a new training job.

Buy Now

Questions 10

A data scientist receives a collection of insurance claim records. Each record includes a claim ID. the final outcome of the insurance claim, and the date of the final outcome.

The final outcome of each claim is a selection from among 200 outcome categories. Some claim records include only partial information. However, incomplete claim records include only 3 or 4 outcome ...gones from among the 200 available outcome categories. The collection includes hundreds of records for each outcome category. The records are from the previous 3 years.

The data scientist must create a solution to predict the number of claims that will be in each outcome category every month, several months in advance.

Which solution will meet these requirements?

A. Perform classification every month by using supervised learning of the 20X3 outcome categories based on claim contents.

B. Perform reinforcement learning by using claim IDs and dates Instruct the insurance agents who submit the claim records to estimate the expected number of claims in each outcome category every month.

C. Perform forecasting by using claim IDs and dates to identify the expected number ot claims in each outcome category every month.

D. Perform classification by using supervised learning of the outcome categories for which partial information on claim contents is provided. Perform forecasting by using claim IDs and dates for all other outcome categories.

Buy Now

Correct Answer: C

The best solution for this scenario is to perform forecasting by using claim IDs and dates to identify the expected number of claims in each outcome category every month. This solution has the following advantages:

It leverages the historical data of claim outcomes and dates to capture the temporal patterns and trends of the claims in each category1. It does not require the claim contents or any other features to make predictions, which simplifies the data

preparation and reduces the impact of missing or incomplete data2. It can handle the high cardinality of the outcome categories, as forecasting models can output multiple values for each time point3.

It can provide predictions for several months in advance, which is useful for planning and budgeting purposes4.

The other solutions have the following drawbacks:

A: Performing classification every month by using supervised learning of the 200 outcome categories based on claim contents is not suitable, because it assumes that the claim contents are available and complete for all the records, which is

not the case in this scenario2. Moreover, classification models usually output a single label for each input, which is not adequate for predicting the number of claims in each category3. Additionally, classification models do not account for the

temporal aspect of the data, which is important for forecasting1.

B: Performing reinforcement learning by using claim IDs and dates and instructing the insurance agents who submit the claim records to estimate the expected number of claims in each outcome category every month is not feasible, because

it requires a feedback loop between the model and the agents, which might not be available or reliable in this scenario5. Furthermore, reinforcement learning is more suitable for sequential decision making problems, where the model learns

from its actions and rewards, rather than forecasting problems, where the model learns from historical data and outputs future values6.

D: Performing classification by using supervised learning of the outcome categories for which partial information on claim contents is provided and performing forecasting by using claim IDs and dates for all other outcome categories is not

optimal, because it combines two different methods that might not be consistent or compatible with each other7. Also, this solution suffers from the same limitations as solution A, such as the dependency on claim contents, the inability to

handle multiple outputs, and the ignorance of temporal patterns123.

References:

1: Time Series Forecasting - Amazon SageMaker

2: Handling Missing Data for Machine Learning | AWS Machine Learning Blog

3: Forecasting vs Classification: What's the Difference? | DataRobot

4: Amazon Forecast ?Time Series Forecasting Made Easy | AWS News Blog

5: Reinforcement Learning - Amazon SageMaker

6: What is Reinforcement Learning? The Complete Guide | Edureka

7: Combining Machine Learning Models | by Will Koehrsen | Towards Data Science

Questions 11

A machine learning (ML) specialist is using the Amazon SageMaker DeepAR forecasting algorithm to train a model on CPU-based Amazon EC2 On-Demand instances. The model currently takes multiple hours to train. The ML specialist wants to decrease the training time of the model.

Which approaches will meet this requirement? (SELECT TWO )

A. Replace On-Demand Instances with Spot Instances

B. Configure model auto scaling dynamically to adjust the number of instances automatically.

C. Replace CPU-based EC2 instances with GPU-based EC2 instances.

D. Use multiple training instances.

E. Use a pre-trained version of the model. Run incremental training.

Buy Now

Correct Answer: CD

The best approaches to decrease the training time of the model are C and D, because they can improve the computational efficiency and parallelization of the training process. These approaches have the following benefits:

C: Replacing CPU-based EC2 instances with GPU-based EC2 instances can speed up the training of the DeepAR algorithm, as it can leverage the parallel processing power of GPUs to perform matrix operations and gradient computations

faster than CPUs12. The DeepAR algorithm supports GPU-based EC2 instances such as ml.p2 and ml.p33.

D: Using multiple training instances can also reduce the training time of the DeepAR algorithm, as it can distribute the workload across multiple nodes and perform data parallelism4. The DeepAR algorithm supports distributed training with

multiple CPU-based or GPU-based EC2 instances3. The other options are not effective or relevant, because they have the following drawbacks:

A: Replacing On-Demand Instances with Spot Instances can reduce the cost of the training, but not necessarily the time, as Spot Instances are subject to interruption and availability5. Moreover, the DeepAR algorithm does not support

checkpointing, which means that the training cannot resume from the last saved state if the Spot Instance is terminated3.

B: Configuring model auto scaling dynamically to adjust the number of instances automatically is not applicable, as this feature is only available for inference endpoints, not for training jobs6.

E: Using a pre-trained version of the model and running incremental training is not possible, as the DeepAR algorithm does not support incremental training or transfer learning3. The DeepAR algorithm requires a full retraining of the model

whenever new data is added or the hyperparameters are changed7.

References:

1: GPU vs CPU: What Matters Most for Machine Learning? | by Louis (What's AI) Bouchard | Towards Data Science

2: How GPUs Accelerate Machine Learning Training | NVIDIA Developer Blog

3: DeepAR Forecasting Algorithm - Amazon SageMaker

4: Distributed Training - Amazon SageMaker

5: Managed Spot Training - Amazon SageMaker

6: Automatic Scaling - Amazon SageMaker

7: How the DeepAR Algorithm Works - Amazon SageMaker

Questions 12

A digital media company wants to build a customer churn prediction model by using tabular data. The model should clearly indicate whether a customer will stop using the company's services. The company wants to clean the data because the data contains some empty fields, duplicate values, and rare values.

Which solution will meet these requirements with the LEAST development effort?

A. Use SageMaker Canvas to automatically clean the data and to prepare a categorical model.

B. Use SageMaker Data Wrangler to clean the data. Use the built-in SageMaker XGBoost algorithm to train a classification model.

C. Use SageMaker Canvas automatic data cleaning and preparation tools. Use the built-in SageMaker XGBoost algorithm to train a regression model.

D. Use SageMaker Data Wrangler to clean the data. Use the SageMaker Autopilot to train a regression model

Buy Now

Questions 13

A company's machine learning (ML) team needs to build a system that can detect whether people in a collection of images are wearing the company's logo. The company has a set of labeled training data. Which algorithm should the ML team use to meet this requirement?

A. Principal component analysis (PCA)

B. Recurrent neural network (RNN)

C. -nearest neighbors (k-NN)

D. Convolutional neural network (CNN)

Buy Now