King's College London

Data Preprocessing

Week 2 — Multiple Choice Quiz  ·  Dr Grigorios Loukides

Questions: 10
Type: Multiple select & single answer
Topic: Data Cleaning, Noise & Feature Selection
Question 01
What are some of the common issues that data cleaning addresses in a database?
⚑ Select all that apply
A. Incomplete data
B. Accurate data
C. Consistent data
D. Noisy data
E. Intentionally incorrect data
Data cleaning is about fixing problems in data, not preserving what's already good. Incomplete data (A) refers to missing values or gaps — a core cleaning target. Noisy data (D) refers to errors, outliers, and random variance that distort analysis. Intentionally incorrect data (E) covers deliberate errors or falsifications that still need to be detected and corrected. Options B and C ("accurate" and "consistent") describe desirable outcomes of cleaning — they are goals, not problems being addressed.
✓ Correct answers: A, D & E
Question 02
Which of the following is/are measure(s) of data quality?
⚑ Select all that apply
A. Accuracy
B. Redundancy
C. Completeness
D. Obsolescence
E. Timeliness
The standard data quality dimensions include: Accuracy (A) — does the data correctly reflect reality? Completeness (C) — is all required data present? Timeliness (E) — is the data up-to-date and available when needed? These are widely accepted ISO/industry quality measures. Redundancy (B) is a storage/design concern (duplicate data), not a quality dimension itself. Obsolescence (D) describes data becoming outdated, which is the absence of timeliness — the concept is captured by E rather than being a separate metric.
✓ Correct answers: A, C & E
Question 03
Which methods are commonly used for filling missing values automatically?
⚑ Select all that apply
A. Filling with a global constant
B. Filling with the mean or median value
C. Deleting the entire dataset
D. Filling based on attribute correlations
E. Predicting using decision tree induction
The standard automated imputation techniques are: Global constant (A) — replace all missing values with a placeholder like "Unknown" or 0. Mean/median (B) — substitute the attribute's central tendency (mean for continuous, median for skewed data). Decision tree induction (E) — train a model on non-missing rows to predict missing values. Option C (deleting the entire dataset) is never a valid imputation strategy — it destroys information. Option D (attribute correlations) sounds plausible but is not a standard named technique in this taxonomy; the correlation insight is embedded within prediction models like regression or decision trees.
✓ Correct answers: A, B & E
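The imputation strategies above can be sketched in a few lines of plain Python. The `impute` helper and its `strategy` parameter are hypothetical names for illustration, not a library API:

```python
from statistics import mean, median

def impute(values, strategy="mean", constant=0):
    """Fill None entries with a global constant, the mean, or the median
    (hypothetical helper sketching the strategies from Question 03)."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = mean(observed)       # suits roughly symmetric, continuous data
    else:
        fill = median(observed)     # more robust when the data are skewed
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None, 28]
print(impute(ages, "mean"))    # fills gaps with 31, the mean of the observed ages
print(impute(ages, "median"))  # fills gaps with 29.5
```

Decision-tree-based imputation (option E) follows the same pattern but replaces the `fill` value with a per-row prediction from a model trained on the complete rows.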
Question 04
What techniques are used to smooth out noise in data cleaning?
⚑ Select all that apply
A. Binning
B. Duplication
C. Regression
D. Aggregation
E. Clustering
The three canonical noise-smoothing techniques are: Binning (A) — groups nearby values into bins and replaces them with the bin mean, median, or boundary. Regression (C) — fits a mathematical function to data and uses predictions to replace noisy values. Clustering (E) — identifies outliers as points that don't fit any cluster and flags them for removal or replacement. Duplication (B) creates more copies of data — it has nothing to do with noise reduction. Aggregation (D) is a data reduction technique (e.g., rolling averages), but is not classified as a noise-smoothing technique in this course's framework.
✓ Correct answers: A, C & E
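Smoothing by bin means can be sketched directly: sort the values, partition them into equal-depth bins, and replace every value with its bin's mean. The data below are the classic sorted-prices example; the function name is illustrative:

```python
from statistics import mean

def smooth_by_bin_means(values, depth):
    """Equal-depth binning sketch: sort, split into bins of `depth` values,
    then replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), depth):
        bin_vals = ordered[i:i + depth]
        smoothed.extend([mean(bin_vals)] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Smoothing by bin boundaries works the same way, except each value is replaced by the closer of its bin's minimum or maximum instead of the mean.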
Question 05
What are some causes of missing data in a dataset?
⚑ Select all that apply
A. Equipment malfunction
B. Inconsistent data formats
C. Data entry mistakes
D. Deliberate omission for privacy
E. Timeouts in data transmission
This is a tricky question! The course specifically identifies two root causes: Equipment malfunction (A) — a sensor or recording device failing mid-collection leaves gaps. Data entry mistakes (C) — a human operator skips a field or enters an invalid value that gets removed during validation. Options D and E might seem reasonable in the real world, but in the context of this course they are not listed as canonical causes. Deliberate omissions (D) relate more to data suppression/anonymisation, and transmission timeouts (E) are an infrastructure issue, not a data collection/entry cause in this taxonomy.
✓ Correct answers: A & C
Question 06
Which noise reduction techniques use neighboring data points to smooth out noise, and how do they fundamentally operate?
⚑ Select all that apply
A. Binning – Divides data into intervals and smooths values within each bin
B. K-means Clustering – Groups data points into a hierarchy of clusters and removes outliers
C. Wavelet Transform – Adds data into a wavelet tree that is then searched for outliers
D. Linear Regression – Fits a line through data points and uses it to predict and smooth values
E. Decision Tree Induction – Combines predictions from multiple models to find outliers
The key phrase here is "neighboring data points." Binning (A) explicitly groups sorted values into local neighborhoods (bins) and smooths each value using its bin neighbors — this is the textbook definition of local smoothing. Linear Regression (D) fits a line through all the data points and replaces noisy values with predictions from that line, so every smoothed value is informed by the surrounding data. The other options are either incorrectly described or use different mechanisms: K-means produces a flat partition (not a hierarchy), and decision trees are for classification/prediction, not the ensemble outlier detection described in E.
✓ Correct answers: A & D
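Regression-based smoothing (option D) can be illustrated with a simple least-squares fit of y = a + bx over the sequence index, replacing each noisy value with its fitted value. This is a toy sketch, not a production implementation:

```python
def regression_smooth(y):
    """Fit y = a + b*x by ordinary least squares over x = 0..n-1,
    then replace each noisy value with the fitted value on the line."""
    n = len(y)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(y) / n
    # Slope: covariance of x and y divided by variance of x
    b = sum((x - x_mean) * (v - y_mean) for x, v in zip(xs, y)) \
        / sum((x - x_mean) ** 2 for x in xs)
    a = y_mean - b * x_mean
    return [a + b * x for x in xs]

noisy = [1.0, 2.9, 2.1, 4.2, 4.8]
print([round(v, 2) for v in regression_smooth(noisy)])
```

The small dips and spikes in the input are pulled onto the fitted trend line, which is exactly the smoothing behaviour option D describes.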
Question 07
Which statements accurately describe methods and challenges associated with handling noise and outliers in data cleaning?
⚑ Select all that apply
A. Binning methods perform local smoothing by consulting the neighborhood of sorted values and partitioning data into equal-depth or equal-width bins.
B. Linear regression can be used to predict missing values using features without missing values, but this method is less accurate if the missing value has high variance.
C. Clustering algorithms can detect outliers, which are then added into clusters containing similar points to them.
D. Equal-width binning is better at handling skewed data compared to equal-depth binning because it ensures all bins have the same range.
E. Decision tree induction can be used for feature selection by recursively partitioning data based on selected attributes using measures like information gain or Gini index.
A is correct — this is a precise description of binning: sort → partition → smooth locally. B is correct — linear regression imputation works well for low-variance targets but degrades when there is high unexplained variance (poor R²). E is correct — decision trees use information gain (for classification) or Gini impurity and can be applied to rank/select features. C is incorrect — outliers detected by clustering are removed or treated, not merged into nearby clusters; adding them in would defeat the purpose. D is incorrect — equal-width binning actually performs worse on skewed data because rare extreme values dominate entire bins; equal-depth (quantile) binning handles skew better by ensuring equal counts per bin.
✓ Correct answers: A, B & E
Question 08
Which of the following is a key characteristic of equal-width partitioning in the binning process?
A. Each bin contains an equal number of records
B. Each bin has the same range of values
C. Bins are created based on the data distribution
D. The number of bins is equal to the number of unique values
E. The bin boundaries are dynamically adjusted
The two types of binning are easily confused: Equal-width (B) divides the value range into k intervals of the same size — e.g., if values range from 0–100 and k=5, each bin covers 20 units regardless of how many data points fall in it. Equal-depth (equal-frequency) (described by option A) ensures each bin contains the same number of records. Options C, D, and E describe neither — adaptive/dynamic partitioning is a different concept entirely.
✓ Correct answer: B
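The contrast between the two schemes is easy to see in code. This sketch (function names are illustrative) computes equal-width boundaries and equal-depth bins for a dataset with one extreme value:

```python
def equal_width_edges(values, k):
    """k intervals of identical range, regardless of how many points fall in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_depth_bins(values, k):
    """k bins containing the same number of records each (assumes len divisible by k)."""
    ordered = sorted(values)
    depth = len(ordered) // k
    return [ordered[i:i + depth] for i in range(0, depth * k, depth)]

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]  # one extreme value skews the range
print(equal_width_edges(data, 5))  # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
print(equal_depth_bins(data, 5))   # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 100]]
```

Note how equal-width partitioning crams nine of the ten points into the first bin, while equal-depth keeps the counts balanced — the skew issue flagged in Question 07's option D.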
Question 09
In the context of feature selection, what is the main advantage of using wrapper methods?
A. They are computationally inexpensive
B. They evaluate feature subsets based on model performance
C. They require a prescriptive model
D. They can handle large datasets efficiently
E. They ignore the interaction between features and the model
Feature selection has three method families: Filter (statistical tests, fast, model-agnostic), Wrapper (use an actual model to score subsets), and Embedded (selection built into model training). The key advantage of wrapper methods (B) is that they account for feature interactions with a specific model — they search for the subset that literally makes the model perform best. This comes at a cost: they are computationally expensive (ruling out A) and struggle with large datasets (ruling out D). Option E is the opposite of what wrappers do, and C is nonsensical.
✓ Correct answer: B
Question 10
Which statements accurately describe the Recursive Feature Elimination (RFE) process?
⚑ Select all that apply
A. RFE starts with an empty feature set and adds features one by one based on their importance.
B. RFE iteratively removes the least important feature based on the model's performance.
C. RFE ranks features according to their importance to the model.
D. RFE uses a pre-determined statistical threshold to eliminate features.
E. RFE can be used with any machine learning model to evaluate feature importance.
RFE is a backward elimination approach: it starts with all features, trains the model, identifies the least important feature, removes it, and repeats until the target number of features is reached. So: B is correct (iterative removal of least important). C is correct (features get a ranking from most to least important as a by-product). E is correct (RFE is model-agnostic — it works with SVMs, random forests, logistic regression, etc., as long as the model can produce feature importance scores). A is wrong — that describes forward selection, not RFE. D is wrong — RFE uses a target number of features to stop, not a statistical threshold.
✓ Correct answers: B, C & E
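The backward-elimination loop described above can be sketched in a few lines. Here `importance_fn` is a hypothetical stand-in for retraining a model on the remaining features and reading off its importance scores (in practice you would use something like scikit-learn's `RFE` with a real estimator):

```python
def rfe(features, importance_fn, n_keep):
    """Minimal RFE sketch: start with all features, repeatedly drop the least
    important one, and record the elimination order as a ranking."""
    remaining = list(features)
    eliminated = []  # least important features first
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)          # "retrain" on current subset
        worst = min(remaining, key=scores.__getitem__)
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated

# Toy "model": fixed importance scores keyed by feature name
weights = {"age": 0.9, "height": 0.2, "zip": 0.05, "income": 0.7}
importance_fn = lambda subset: {f: weights[f] for f in subset}

kept, dropped = rfe(["age", "height", "zip", "income"], importance_fn, 2)
print(kept)     # ['age', 'income']
print(dropped)  # ['zip', 'height']
```

The stopping condition is a target count (`n_keep`), not a statistical threshold — which is exactly why option D is wrong — and the `dropped` list read in reverse gives the importance ranking described in option C.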