When training a machine learning model, it is standard practice to randomly split the dataset into training and testing subsets. The goal is to evaluate how well the model generalizes to unseen data. According to the AI-900 study guide and the Microsoft Learn module “Split data for training and evaluation”, the model is trained on one portion of the data (the training set) and evaluated on another (the test or validation set).
The correct answer is C. to test the model by using data that was not used to train the model.
Random splitting helps prevent data leakage and reveals overfitting, which occurs when a model memorizes patterns in the training data instead of learning generalizable relationships. By testing on data the model has never seen, developers can assess its true performance and gain confidence that predictions will remain accurate on future, real-world data.
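As a sketch of the idea in plain Python (this is illustrative only, not the Azure Machine Learning designer's Split Data module; the helper name and 70/30 ratio are assumptions):

```python
import random

def train_test_split(data, test_fraction=0.3, seed=42):
    """Randomly split a dataset into training and test subsets.

    A minimal illustration of the concept; real projects would use a
    library routine (e.g. scikit-learn's train_test_split or the
    Split Data module in Azure Machine Learning designer).
    """
    rng = random.Random(seed)            # seeded so the split is reproducible
    indices = list(range(len(data)))
    rng.shuffle(indices)                 # random assignment avoids ordering bias
    n_test = int(len(data) * test_fraction)
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test

rows = list(range(100))                  # toy dataset of 100 examples
train, test = train_test_split(rows)
print(len(train), len(test))             # 70 30
```

The model would then be fit only on `train`, and its accuracy reported only on `test`, so the score reflects performance on data it never saw during training.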
Options A and B are incorrect because:
A. Training the model twice does not by itself improve accuracy; accuracy depends on data quality, feature engineering, and algorithm choice.
B. Training multiple models simultaneously describes model comparison, not the purpose of splitting data.
Thus, the correct reasoning is that random splitting provides a reliable estimate of the model’s predictive power on new data.
[Reference: Microsoft Learn – Split data for training and evaluation in Azure Machine Learning designer]