King's College London

Data Preprocessing

Week 2 — Multiple Choice Quiz  ·  Dr Grigorios Loukides

Questions: 10
Type: Multiple select & single answer
Topic: Data Cleaning, Noise & Feature Selection
Question 01
What are some of the common issues that data cleaning addresses in a database?
⚑ Select all that apply
A. Incomplete data
B. Accurate data
C. Consistent data
D. Noisy data
E. Intentionally incorrect data
Data cleaning is about fixing problems in data, not preserving what's already good. Incomplete data (A) refers to missing values or gaps — a core cleaning target. Noisy data (D) refers to errors, outliers, and random variance that distort analysis. Intentionally incorrect data (E) covers deliberate errors or falsifications that still need to be detected and corrected. Options B and C ("accurate" and "consistent") describe desirable outcomes of cleaning — they are goals, not problems being addressed.
✓ Correct answers: A, D & E
Question 02
Which of the following is/are measure(s) of data quality?
⚑ Select all that apply
A. Accuracy
B. Redundancy
C. Completeness
D. Obsolescence
E. Timeliness
The standard data quality dimensions include: Accuracy (A) — does the data correctly reflect reality? Completeness (C) — is all required data present? Timeliness (E) — is the data up-to-date and available when needed? These are widely accepted ISO/industry quality measures. Redundancy (B) is a storage/design concern (duplicate data), not a quality dimension itself. Obsolescence (D) describes data becoming outdated, which is the absence of timeliness — the concept is captured by E rather than being a separate metric.
✓ Correct answers: A, C & E
Question 03
Which methods are commonly used for filling missing values automatically?
⚑ Select all that apply
A. Filling with a global constant
B. Filling with the mean or median value
C. Deleting the entire dataset
D. Filling based on attribute correlations
E. Predicting using decision tree induction
The standard automated imputation techniques are: Global constant (A) — replace all missing values with a placeholder like "Unknown" or 0. Mean/median (B) — substitute the attribute's central tendency (mean for continuous, median for skewed data). Decision tree induction (E) — train a model on non-missing rows to predict missing values. Option C (deleting the entire dataset) is never a valid imputation strategy — it destroys information. Option D (attribute correlations) sounds plausible but is not a standard named technique in this taxonomy; the correlation insight is embedded within prediction models like regression or decision trees.
✓ Correct answers: A, B & E
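The imputation strategies above can be sketched in a few lines of plain Python. The `impute` helper and its `strategy` parameter are hypothetical names for illustration, not a library API:

```python
from statistics import mean, median

def impute(values, strategy="mean", constant=0):
    """Fill None entries with a global constant, the mean, or the median
    (hypothetical helper sketching the strategies from Question 03)."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = mean(observed)       # suits roughly symmetric, continuous data
    else:
        fill = median(observed)     # more robust when the data are skewed
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None, 28]
print(impute(ages, "mean"))    # fills gaps with 31, the mean of the observed ages
print(impute(ages, "median"))  # fills gaps with 29.5
```

Decision-tree-based imputation (option E) follows the same pattern but replaces the `fill` value with a per-row prediction from a model trained on the complete rows.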
Question 04
What techniques are used to smooth out noise in data cleaning?
⚑ Select all that apply
A. Binning
B. Duplication
C. Regression
D. Aggregation
E. Clustering
The three canonical noise-smoothing techniques are: Binning (A) — groups nearby values into bins and replaces them with the bin mean, median, or boundary. Regression (C) — fits a mathematical function to data and uses predictions to replace noisy values. Clustering (E) — identifies outliers as points that don't fit any cluster and flags them for removal or replacement. Duplication (B) creates more copies of data — it has nothing to do with noise reduction. Aggregation (D) is a data reduction technique (e.g., rolling averages), but is not classified as a noise-smoothing technique in this course's framework.
✓ Correct answers: A, C & E
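Smoothing by bin means can be sketched directly: sort the values, partition them into equal-depth bins, and replace every value with its bin's mean. The data below are the classic sorted-prices example; the function name is illustrative:

```python
from statistics import mean

def smooth_by_bin_means(values, depth):
    """Equal-depth binning sketch: sort, split into bins of `depth` values,
    then replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), depth):
        bin_vals = ordered[i:i + depth]
        smoothed.extend([mean(bin_vals)] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Smoothing by bin boundaries works the same way, except each value is replaced by the closer of its bin's minimum or maximum instead of the mean.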
Question 05
What are some causes of missing data in a dataset?
⚑ Select all that apply
A. Equipment malfunction
B. Inconsistent data formats
C. Data entry mistakes
D. Deliberate omission for privacy
E. Timeouts in data transmission
This is a tricky question! The course specifically identifies two root causes: Equipment malfunction (A) — a sensor or recording device failing mid-collection leaves gaps. Data entry mistakes (C) — a human operator skips a field or enters an invalid value that gets removed during validation. Options D and E might seem reasonable in the real world, but in the context of this course they are not listed as canonical causes. Deliberate omissions (D) relate more to data suppression/anonymisation, and transmission timeouts (E) are an infrastructure issue, not a data collection/entry cause in this taxonomy.
✓ Correct answers: A & C
Question 06
Which noise reduction techniques use neighboring data points to smooth out noise, and how do they fundamentally operate?
⚑ Select all that apply
A. Binning – Divides data into intervals and smooths values within each bin
B. K-means Clustering – Groups data points into a hierarchy of clusters and removes outliers
C. Wavelet Transform – Adds data into a wavelet tree that is then searched for outliers
D. Linear Regression – Fits a line through data points and uses it to predict and smooth values
E. Decision Tree Induction – Combines predictions from multiple models to find outliers
The key phrase here is "neighboring data points." Binning (A) explicitly groups sorted values into local neighborhoods (bins) and smooths each value using its bin neighbors — this is the textbook definition of local smoothing. Linear Regression (D) fits a line through all the data points and replaces noisy values with predictions from that line, so every smoothed value is informed by the surrounding data. The other options are either incorrectly described or use different mechanisms: K-means produces a flat partition (not a hierarchy), and decision trees are for classification/prediction, not the ensemble outlier detection described in E.
✓ Correct answers: A & D
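Regression-based smoothing (option D) can be illustrated with a simple least-squares fit of y = a + bx over the sequence index, replacing each noisy value with its fitted value. This is a toy sketch, not a production implementation:

```python
def regression_smooth(y):
    """Fit y = a + b*x by ordinary least squares over x = 0..n-1,
    then replace each noisy value with the fitted value on the line."""
    n = len(y)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(y) / n
    # Slope: covariance of x and y divided by variance of x
    b = sum((x - x_mean) * (v - y_mean) for x, v in zip(xs, y)) \
        / sum((x - x_mean) ** 2 for x in xs)
    a = y_mean - b * x_mean
    return [a + b * x for x in xs]

noisy = [1.0, 2.9, 2.1, 4.2, 4.8]
print([round(v, 2) for v in regression_smooth(noisy)])
```

The small dips and spikes in the input are pulled onto the fitted trend line, which is exactly the smoothing behaviour option D describes.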
Question 07
Which statements accurately describe methods and challenges associated with handling noise and outliers in data cleaning?
⚑ Select all that apply
A. Binning methods perform local smoothing by consulting the neighborhood of sorted values and partitioning data into equal-depth or equal-width bins.
B. Linear regression can be used to predict missing values using features without missing values, but this method is less accurate if the missing value has high variance.
C. Clustering algorithms can detect outliers, which are then added into clusters containing similar points to them.
D. Equal-width binning is better at handling skewed data compared to equal-depth binning because it ensures all bins have the same range.
E. Decision tree induction can be used for feature selection by recursively partitioning data based on selected attributes using measures like information gain or Gini index.
A is correct — this is a precise description of binning: sort → partition → smooth locally. B is correct — linear regression imputation works well for low-variance targets but degrades when there is high unexplained variance (poor R²). E is correct — decision trees use information gain (for classification) or Gini impurity and can be applied to rank/select features. C is incorrect — outliers detected by clustering are removed or treated, not merged into nearby clusters; adding them in would defeat the purpose. D is incorrect — equal-width binning actually performs worse on skewed data because rare extreme values dominate entire bins; equal-depth (quantile) binning handles skew better by ensuring equal counts per bin.
✓ Correct answers: A, B & E
Question 08
Which of the following is a key characteristic of equal-width partitioning in the binning process?
A. Each bin contains an equal number of records
B. Each bin has the same range of values
C. Bins are created based on the data distribution
D. The number of bins is equal to the number of unique values
E. The bin boundaries are dynamically adjusted
The two types of binning are easily confused: Equal-width (B) divides the value range into k intervals of the same size — e.g., if values range from 0–100 and k=5, each bin covers 20 units regardless of how many data points fall in it. Equal-depth (equal-frequency) (described by option A) ensures each bin contains the same number of records. Options C, D, and E describe neither — adaptive/dynamic partitioning is a different concept entirely.
✓ Correct answer: B
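The contrast between the two schemes is easy to see in code. This sketch (function names are illustrative) computes equal-width boundaries and equal-depth bins for a dataset with one extreme value:

```python
def equal_width_edges(values, k):
    """k intervals of identical range, regardless of how many points fall in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_depth_bins(values, k):
    """k bins containing the same number of records each (assumes len divisible by k)."""
    ordered = sorted(values)
    depth = len(ordered) // k
    return [ordered[i:i + depth] for i in range(0, depth * k, depth)]

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]  # one extreme value skews the range
print(equal_width_edges(data, 5))  # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
print(equal_depth_bins(data, 5))   # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 100]]
```

Note how equal-width partitioning crams nine of the ten points into the first bin, while equal-depth keeps the counts balanced — the skew issue flagged in Question 07's option D.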
Question 09
In the context of feature selection, what is the main advantage of using wrapper methods?
A. They are computationally inexpensive
B. They evaluate feature subsets based on model performance
C. They require a prescriptive model
D. They can handle large datasets efficiently
E. They ignore the interaction between features and the model
Feature selection has three method families: Filter (statistical tests, fast, model-agnostic), Wrapper (use an actual model to score subsets), and Embedded (selection built into model training). The key advantage of wrapper methods (B) is that they account for feature interactions with a specific model — they search for the subset that literally makes the model perform best. This comes at a cost: they are computationally expensive (ruling out A) and struggle with large datasets (ruling out D). Option E is the opposite of what wrappers do, and C is nonsensical.
✓ Correct answer: B
Question 10
Which statements accurately describe the Recursive Feature Elimination (RFE) process?
⚑ Select all that apply
A. RFE starts with an empty feature set and adds features one by one based on their importance.
B. RFE iteratively removes the least important feature based on the model's performance.
C. RFE ranks features according to their importance to the model.
D. RFE uses a pre-determined statistical threshold to eliminate features.
E. RFE can be used with any machine learning model to evaluate feature importance.
RFE is a backward elimination approach: it starts with all features, trains the model, identifies the least important feature, removes it, and repeats until the target number of features is reached. So: B is correct (iterative removal of least important). C is correct (features get a ranking from most to least important as a by-product). E is correct (RFE is model-agnostic — it works with SVMs, random forests, logistic regression, etc., as long as the model can produce feature importance scores). A is wrong — that describes forward selection, not RFE. D is wrong — RFE uses a target number of features to stop, not a statistical threshold.
✓ Correct answers: B, C & E
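The backward-elimination loop described above can be sketched in a few lines. Here `importance_fn` is a hypothetical stand-in for retraining a model on the remaining features and reading off its importance scores (in practice you would use something like scikit-learn's `RFE` with a real estimator):

```python
def rfe(features, importance_fn, n_keep):
    """Minimal RFE sketch: start with all features, repeatedly drop the least
    important one, and record the elimination order as a ranking."""
    remaining = list(features)
    eliminated = []  # least important features first
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)          # "retrain" on current subset
        worst = min(remaining, key=scores.__getitem__)
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated

# Toy "model": fixed importance scores keyed by feature name
weights = {"age": 0.9, "height": 0.2, "zip": 0.05, "income": 0.7}
importance_fn = lambda subset: {f: weights[f] for f in subset}

kept, dropped = rfe(["age", "height", "zip", "income"], importance_fn, 2)
print(kept)     # ['age', 'income']
print(dropped)  # ['zip', 'height']
```

The stopping condition is a target count (`n_keep`), not a statistical threshold — which is exactly why option D is wrong — and the `dropped` list read in reverse gives the importance ranking described in option C.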