Basic Concept: Before data from multiple sources, both structured and unstructured, is ingested into an AI-powered SIEM system, it must be prepared to ensure quality, consistency, and usability. Data from diverse sources often contains noise, errors, duplicates, and formatting inconsistencies that degrade AI performance if left unaddressed. CompTIA SecAI+ treats data preparation as a prerequisite for AI system effectiveness.
Why C is Correct: Data cleansing is the process of detecting and then correcting or removing corrupt, inaccurate, incomplete, or duplicate data. When preparing to bring data from multiple structured and unstructured sources into the SIEM, engineers must cleanse it: standardize formats, remove duplicates, fill missing values, and eliminate noise. Clean input data is fundamental to producing accurate AI-generated insights and reliable LLM interactions in the SIEM context.
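The cleansing steps named above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not part of any SIEM product: the record fields (timestamp, host, message, source), the function name cleanse, and the choice to fill a missing source with "unknown" while dropping records that lack a host, message, or timestamp.

```python
from datetime import datetime, timezone


def cleanse(events):
    """Standardize formats, fill or drop missing values, and deduplicate.

    `events` is a list of dicts with hypothetical keys:
    timestamp (epoch seconds or ISO 8601 string), host, message, source.
    """
    seen = set()
    cleaned = []
    for raw in events:
        # Standardize formats: trim and lowercase hostnames,
        # collapse repeated whitespace in messages.
        host = (raw.get("host") or "").strip().lower()
        message = " ".join((raw.get("message") or "").split())
        ts = raw.get("timestamp")

        # Eliminate noise: a record without these fields cannot be correlated.
        if not host or not message or ts is None:
            continue

        # Standardize timestamps to timezone-aware UTC.
        if isinstance(ts, (int, float)):
            when = datetime.fromtimestamp(ts, tz=timezone.utc)
        else:
            when = datetime.fromisoformat(ts)
            if when.tzinfo is None:
                # Assumption: naive timestamps are already in UTC.
                when = when.replace(tzinfo=timezone.utc)
            when = when.astimezone(timezone.utc)

        # Fill missing values with an explicit placeholder.
        source = raw.get("source") or "unknown"

        # Remove duplicates: same event reported in different formats.
        key = (when, host, message)
        if key in seen:
            continue
        seen.add(key)

        cleaned.append({
            "timestamp": when.isoformat(),
            "host": host,
            "message": message,
            "source": source,
        })
    return cleaned


sample = [
    {"timestamp": 1704067200, "host": " WEB01 ",
     "message": "Login  failed", "source": "syslog"},
    {"timestamp": "2024-01-01T00:00:00+00:00", "host": "web01",
     "message": "Login failed", "source": "agent"},
    {"timestamp": 1704067260, "host": "", "message": "orphan record"},
    {"timestamp": 1704067320, "host": "db01", "message": "Disk usage high"},
]
cleaned_sample = cleanse(sample)
```

In this sample, the first two records describe the same event in different formats and collapse into one, the third is dropped because it has no host, and the fourth receives the placeholder source. A real pipeline would likely log dropped records rather than discard them silently.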
Why A is Wrong: Balancing addresses class distribution imbalance in labeled training data for classification models. While relevant when training ML detection models, it is not the primary consideration before initial data collection from diverse SIEM sources.
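To make the contrast with cleansing concrete, here is a small Python sketch of what balancing involves. It assumes a labeled training set already exists, which is exactly why balancing belongs to model training rather than to initial data preparation. The function name class_weights and the labels are hypothetical, and inverse-frequency weighting is only one of several balancing approaches (resampling is another).

```python
from collections import Counter


def class_weights(labels):
    """Return inverse-frequency weights: total / (num_classes * class_count).

    Rare classes get weights above 1, common classes get weights below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Hypothetical labeled alerts: nine benign, one malicious.
labels = ["benign"] * 9 + ["malicious"]
weights = class_weights(labels)
```

With nine benign labels and one malicious label, the malicious class receives a weight of 5.0 and the benign class a weight of roughly 0.56, so a classifier trained with these weights is penalized more for missing the rare class.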
Why B is Wrong: Verification confirms that data meets expected quality standards and validates its accuracy against trusted sources. It is a post-collection quality check performed after cleansing, not the first step before data collection.
Why D is Wrong: Vector storage refers to databases that store embeddings for semantic search, relevant for RAG systems. It is a storage architecture decision made after data is collected, processed, and prepared, not a pre-collection technique.