King's College London

Text Data Analysis I: Pre-processing

Week 3 — Multiple Choice Quiz  ·  Dr Lin Gui

Questions: 11
Type: Single & multi-select
Topic: Tokenization, Stemming, VSM & Similarity
Question 01
Which of the following techniques are used in text preprocessing?
⚑ Select all that apply
A. Tokenization
B. Stemming
C. Lemmatization
D. Indexing
Text preprocessing is the pipeline that transforms raw text into a clean, structured form before analysis. The core techniques are: Tokenization (A) — splitting text into words or sub-word units; Stemming (B) — reducing words to their root form by chopping suffixes (e.g., "running" → "run"); and Lemmatization (C) — reducing words to their canonical dictionary form using vocabulary and grammar rules. Indexing (D) is a downstream step that happens after preprocessing — it builds a searchable structure (like an inverted index) from the already-cleaned tokens.
✓ Correct answers: A, B & C
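To make the pipeline concrete, here is a minimal word tokenizer sketched with a regular expression. This is a toy illustration, not a replacement for a library tokenizer (real tokenizers handle contractions, hyphens, and Unicode far more carefully):

```python
import re

def tokenize(text):
    # Lowercase the input, then pull out runs of letters/digits.
    # A deliberately crude word tokenizer for illustration only.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Running quickly, she studies NLP."))
# → ['running', 'quickly', 'she', 'studies', 'nlp']
```

Every downstream step in this quiz — stop word removal, stemming, vector construction — operates on a token list like the one returned here.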
Question 02
Stop words are:
A. Words removed during preprocessing to reduce noise
B. Rare words that are added to enhance text meaning
C. Words that indicate sentence boundaries
D. Essential for stemming
Stop words are extremely common, low-information words like "the," "is," "and," "of," "a" that appear in almost every document and carry little discriminative value for most NLP tasks. They are removed during preprocessing to reduce noise and vocabulary size (A). They are not rare (B) — they're the most frequent words in any language. They don't mark sentence boundaries (C) — that's punctuation or sentence-boundary detectors. And stemming operates on content words, not stop words (D).
✓ Correct answer: A
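Stop word removal is a simple membership test against a known list. A minimal sketch, using a tiny illustrative stop list (real lists, such as NLTK's English list, contain well over a hundred entries):

```python
STOP_WORDS = {"the", "is", "and", "of", "a"}  # tiny illustrative list, not a real one

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "is", "on", "the", "mat"]))
# → ['cat', 'on', 'mat']
```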
Question 03
Lemmatization differs from stemming in that:
⚑ Select all that apply
A. Lemmatization uses vocabulary and morphology rules
B. Stemming may produce intermediate forms of words
C. Both produce the base form of a word with the same accuracy
D. Lemmatization is language-independent
Lemmatization (A) uses a full vocabulary and applies proper morphological analysis — it knows that "better" → "good" (irregular) and "running" → "run." It always returns a valid dictionary word. Stemming (B) uses crude heuristic rules to strip suffixes, so it can produce non-words — e.g., "studies" → "studi" with the Porter stemmer. These are non-dictionary intermediate forms. Option C is false — lemmatization is generally more accurate but slower. Option D is backwards — lemmatization is actually language-dependent (it relies on a language-specific lexicon), whereas stemming algorithms are more easily adapted across languages.
✓ Correct answers: A & B
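The contrast can be sketched in a few lines: a crude suffix-stripping "stemmer" that may emit non-words, versus a lookup-table "lemmatizer" backed by a (here, hand-built and hypothetical) lexicon. Neither is a real algorithm — real stemmers like Porter and real lemmatizers like WordNet's are far more elaborate:

```python
def toy_stem(word):
    # Strip the first matching suffix; the result may not be a real word.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-built toy lexicon standing in for a full morphological dictionary.
LEMMAS = {"studies": "study", "better": "good", "running": "run"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

print(toy_stem("studies"))       # → 'stud'  (a non-word)
print(toy_lemmatize("studies"))  # → 'study' (a dictionary word)
print(toy_lemmatize("better"))   # → 'good'  (handles irregular forms via lookup)
```

The lookup table is exactly why lemmatization is language-dependent: it needs a lexicon for each language, while suffix rules are easier to port.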
Question 04
Why is tokenization important in text preprocessing?
A. It breaks text into meaningful units
B. It extracts n-grams for analysis
C. It enhances grammar checking
D. It speeds up lemmatization
Tokenization (A) is the foundational first step: it segments a continuous string of characters into discrete tokens — typically words, punctuation, or sub-word units — that all downstream processes can operate on. Without tokens, you have a raw character stream that algorithms cannot interpret. N-gram extraction (B) requires tokenization first and is a separate downstream step. Grammar checking (C) and lemmatization speed (D) are not primary purposes of tokenization.
✓ Correct answer: A
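The point that n-gram extraction presupposes tokenization can be seen directly: an n-gram generator takes an already-tokenized list as input. A minimal sketch:

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["text", "mining", "is", "fun"], 2))
# → [('text', 'mining'), ('mining', 'is'), ('is', 'fun')]
```

Without the token boundaries that tokenization provides, there is nothing for the window to slide over.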
Question 05
In stemming, which of the following suffixes might be removed?
⚑ Select all that apply
A. "-ing"
B. "-ly"
C. "-able"
D. "-est"
All four are valid suffix targets in standard stemmers. "-ing" (A) removes the present participle: running → run. "-ly" (B) removes adverb forms: quickly → quick. "-able" (C) removes adjective suffixes: comfortable → comfort. "-est" (D) removes superlative forms: greatest → great. Rule-based stemmers include rules for common English suffixes like these. Stemming strips any recognized suffix, regardless of whether the result is a real word.
✓ Correct answers: A, B, C & D
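A toy rule-based stripper covering exactly the four suffixes from the question, with one extra rule to undouble a trailing consonant ("runn" → "run") — a much-simplified version of what real stemmers do:

```python
SUFFIXES = ("ing", "ly", "able", "est")  # the four suffixes from the question

def strip_suffix(word):
    # Remove the first matching suffix; purely rule-based, so the
    # result need not be a dictionary word.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undouble a trailing consonant pair, e.g. "runn" -> "run".
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "quickly", "comfortable", "greatest"]:
    print(w, "->", strip_suffix(w))
# running -> run, quickly -> quick, comfortable -> comfort, greatest -> great
```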
Question 06
Which technique handles the removal of stop words?
A. Pre-trained models
B. Vocabulary filtering
C. N-gram generation
D. Feature selection
Stop word removal is implemented via vocabulary filtering (B): you maintain a list (vocabulary/lexicon) of known stop words and simply discard any token that matches an entry in that list. Pre-trained models (A) are used for embeddings or downstream fine-tuning, not for basic filtering. N-gram generation (C) is about creating multi-word sequences. Feature selection (D) is a broader ML concept that may or may not remove stop words, but it is not the specific technique for stop word removal itself.
✓ Correct answer: B
Question 07
Which of the following is an application of text mining?
⚑ Select all that apply
A. Information extraction
B. Machine translation
C. Information retrieval
D. Speech recognition
Text mining refers to the discovery of patterns and knowledge from text data. Its applications include: Information extraction (A) — automatically pulling structured facts (entities, relations) from unstructured text; Machine translation (B) — translating text between languages using learned linguistic patterns; Information retrieval (C) — finding relevant documents matching a query (e.g., search engines). Speech recognition (D) is an audio signal processing task — it converts spoken audio to text, operating on acoustic waveforms rather than text data, so it is not a text mining application.
✓ Correct answers: A, B & C
Question 08
Which of the following are challenges in text understanding for computers?
⚑ Select all that apply
A. Ambiguity
B. Scale
C. Sparsity
D. Variation
All four are genuine NLP challenges. Ambiguity (A) — words and sentences often have multiple meanings (e.g., "bank" can mean a financial institution or a riverbank). Scale (B) — the internet produces billions of text documents; processing them efficiently is a major engineering and computational challenge. Sparsity (C) — in bag-of-words models, most words don't appear in a given document, creating very sparse high-dimensional vectors that are hard to learn from. Variation (D) — the same concept is expressed in many ways (synonyms, dialects, typos, slang), making generalisation hard.
✓ Correct answers: A, B, C & D
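Sparsity is easy to see even on a toy corpus: build Boolean bag-of-words vectors over three made-up sentences and count how many entries are nonzero. The sentences are illustrative, not from the course material:

```python
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "text data analysis needs preprocessing",
]

# Vocabulary = union of all words; one Boolean vector per document.
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[1 if w in d.split() else 0 for w in vocab] for d in docs]

total = sum(len(v) for v in vectors)
nonzero = sum(sum(v) for v in vectors)
print(f"vocabulary size: {len(vocab)}")   # 13 distinct words
print(f"nonzero entries: {nonzero}/{total}")  # 13/39 — two thirds are zero
```

Even with three short sentences, most entries are zero; with a realistic vocabulary of tens of thousands of words, the fraction of zeros approaches one.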
Question 09.1
What is the Euclidean distance between A and B using Boolean feature-based representation in vector space model without stop word removal?
A – Scissors cuts paper and decapitates lizard
B – Lizard poisons Spock and eats paper
C – Spock smashes scissors and vaporizes rock
A. 2
B. 2.24
C. 2.45
D. None of the above
Since stop words are not removed, "and" stays in the vocabulary. The 12-word vocabulary is:
and, cuts, decapitates, eats, lizard, paper, poisons, rock, scissors, smashes, spock, vaporizes
A: 1 1 1 0 1 1 0 0 1 0 0 0
B: 1 0 0 1 1 1 1 0 0 0 1 0
A and B share: and, lizard, paper → 3 common terms. Differences: A has cuts, decapitates, scissors that B doesn't; B has eats, poisons, spock that A doesn't. That's 6 mismatches. Euclidean distance = √6 ≈ 2.449 ≈ 2.45.
✓ Correct answer: C (2.45)
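The worked computation can be checked directly, assuming the alphabetical ordering of the 12-term vocabulary above:

```python
import math

# Boolean vectors over the 12-term vocabulary, in alphabetical order:
# and, cuts, decapitates, eats, lizard, paper, poisons, rock,
# scissors, smashes, spock, vaporizes
A = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
B = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0]

# Euclidean distance: square root of the sum of squared differences.
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))
print(round(dist, 2))  # → 2.45
```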
Question 09.2
What is the cosine similarity between B and C using Boolean feature-based representation in vector space model without stop word removal?
A – Scissors cuts paper and decapitates lizard
B – Lizard poisons Spock and eats paper
C – Spock smashes scissors and vaporizes rock
A. 0.33
B. 0.50
C. 0.17
D. None of the above
Using the same 12-term vocabulary (stop words kept):
and, cuts, decapitates, eats, lizard, paper, poisons, rock, scissors, smashes, spock, vaporizes
B: 1 0 0 1 1 1 1 0 0 0 1 0
C: 1 0 0 0 0 0 0 1 1 1 1 1
B and C share: and, spock → dot product = 2. |B| = √6 ≈ 2.449, |C| = √6 ≈ 2.449.
Cosine = 2 / (2.449 × 2.449) = 2 / 6 ≈ 0.333 ≈ 0.33
✓ Correct answer: A (0.33)
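Again the arithmetic can be verified in a few lines, using the same 12-term vector ordering:

```python
import math

# Boolean vectors for B and C over the 12-term alphabetical vocabulary.
B = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0]
C = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Cosine similarity: dot product divided by the product of the norms.
dot = sum(b * c for b, c in zip(B, C))
cosine = dot / (math.sqrt(sum(b * b for b in B)) * math.sqrt(sum(c * c for c in C)))
print(round(cosine, 2))  # → 0.33
```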
Question 10
Suppose vectors A, B, and C show A closer to B than to C using Euclidean distance. Which statement is true regarding cosine similarity?
A. A and B are more similar than A and C when measured by cosine similarity
B. A and B are less similar than A and C when measured by cosine similarity
C. B and C are more similar than A and C when measured by cosine similarity
D. B and C are less similar than A and C when measured by cosine similarity
E. There is not enough information to determine the cosine similarity between the vectors
This is a conceptual trap. Euclidean distance measures absolute distance in space, accounting for both direction and magnitude. Cosine similarity measures only the angle between vectors, completely ignoring magnitude. The two can agree: for A = [1, 0], B = [2, 0], C = [0, 3], A is closer to B by Euclidean distance (1 vs ≈3.16) and also more similar to B by cosine (1.0 vs 0). But they can also disagree: for A = [1, 0], B = [1, 1], C = [100, 1], A is closer to B by Euclidean distance (1 vs ≈99), yet cosine rates A and C as more similar (≈0.99995 vs ≈0.707), because C points in almost the same direction as A despite being far away. Without knowing the actual vectors, you simply cannot infer cosine similarity from Euclidean distance. The correct answer is E.
✓ Correct answer: E