King's College London

Text Data Analysis I: Pre-processing

Week 3 — Multiple Choice Quiz  ·  Dr Lin Gui

Questions: 11
Type: Single & multi-select
Topic: Tokenization, Stemming, VSM & Similarity
Question 01
Which of the following techniques are used in text preprocessing?
⚑ Select all that apply
A. Tokenization
B. Stemming
C. Lemmatization
D. Indexing
Text preprocessing is the pipeline that transforms raw text into a clean, structured form before analysis. The core techniques are: Tokenization (A) — splitting text into words or sub-word units; Stemming (B) — reducing words to their root form by chopping suffixes (e.g., "running" → "run"); and Lemmatization (C) — reducing words to their canonical dictionary form using vocabulary and grammar rules. Indexing (D) is a downstream step that happens after preprocessing — it builds a searchable structure (like an inverted index) from the already-cleaned tokens.
✓ Correct answers: A, B & C
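To make the pipeline concrete, here is a minimal word tokenizer sketched with a regular expression. This is a toy illustration, not a replacement for a library tokenizer (real tokenizers handle contractions, hyphens, and Unicode far more carefully):

```python
import re

def tokenize(text):
    # Lowercase the input, then pull out runs of letters/digits.
    # A deliberately crude word tokenizer for illustration only.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Running quickly, she studies NLP."))
# → ['running', 'quickly', 'she', 'studies', 'nlp']
```

Every downstream step in this quiz — stop word removal, stemming, vector construction — operates on a token list like the one returned here.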
Question 02
Stop words are:
A. Words removed during preprocessing to reduce noise
B. Rare words that are added to enhance text meaning
C. Words that indicate sentence boundaries
D. Essential for stemming
Stop words are extremely common, low-information words like "the," "is," "and," "of," "a" that appear in almost every document and carry little discriminative value for most NLP tasks. They are removed during preprocessing to reduce noise and vocabulary size (A). They are not rare (B) — they're the most frequent words in any language. They don't mark sentence boundaries (C) — that's punctuation or sentence-boundary detectors. And stemming operates on content words, not stop words (D).
✓ Correct answer: A
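Stop word removal is a simple membership test against a known list. A minimal sketch, using a tiny illustrative stop list (real lists, such as NLTK's English list, contain well over a hundred entries):

```python
STOP_WORDS = {"the", "is", "and", "of", "a"}  # tiny illustrative list, not a real one

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "is", "on", "the", "mat"]))
# → ['cat', 'on', 'mat']
```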
Question 03
Lemmatization differs from stemming in that:
⚑ Select all that apply
A. Lemmatization uses vocabulary and morphology rules
B. Stemming may produce intermediate forms of words
C. Both produce the base form of a word with the same accuracy
D. Lemmatization is language-independent
Lemmatization (A) uses a full vocabulary and applies proper morphological analysis — it knows that "better" → "good" (irregular) and "running" → "run." It always returns a valid dictionary word. Stemming (B) uses crude heuristic rules to strip suffixes, so it can produce non-words — e.g., "studies" → "studi" with the Porter stemmer. These are non-dictionary intermediate forms. Option C is false — lemmatization is generally more accurate but slower. Option D is backwards — lemmatization is actually language-dependent (it relies on a language-specific lexicon), whereas stemming algorithms are more easily adapted across languages.
✓ Correct answers: A & B
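The contrast can be sketched in a few lines: a crude suffix-stripping "stemmer" that may emit non-words, versus a lookup-table "lemmatizer" backed by a (here, hand-built and hypothetical) lexicon. Neither is a real algorithm — real stemmers like Porter and real lemmatizers like WordNet's are far more elaborate:

```python
def toy_stem(word):
    # Strip the first matching suffix; the result may not be a real word.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-built toy lexicon standing in for a full morphological dictionary.
LEMMAS = {"studies": "study", "better": "good", "running": "run"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

print(toy_stem("studies"))       # → 'stud'  (a non-word)
print(toy_lemmatize("studies"))  # → 'study' (a dictionary word)
print(toy_lemmatize("better"))   # → 'good'  (handles irregular forms via lookup)
```

The lookup table is exactly why lemmatization is language-dependent: it needs a lexicon for each language, while suffix rules are easier to port.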
Question 04
Why is tokenization important in text preprocessing?
A. It breaks text into meaningful units
B. It extracts n-grams for analysis
C. It enhances grammar checking
D. It speeds up lemmatization
Tokenization (A) is the foundational first step: it segments a continuous string of characters into discrete tokens — typically words, punctuation, or sub-word units — that all downstream processes can operate on. Without tokens, you have a raw character stream that algorithms cannot interpret. N-gram extraction (B) requires tokenization first and is a separate downstream step. Grammar checking (C) and lemmatization speed (D) are not primary purposes of tokenization.
✓ Correct answer: A
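The point that n-gram extraction presupposes tokenization can be seen directly: an n-gram generator takes an already-tokenized list as input. A minimal sketch:

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["text", "mining", "is", "fun"], 2))
# → [('text', 'mining'), ('mining', 'is'), ('is', 'fun')]
```

Without the token boundaries that tokenization provides, there is nothing for the window to slide over.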
Question 05
In stemming, which of the following suffixes might be removed?
⚑ Select all that apply
A. "-ing"
B. "-ly"
C. "-able"
D. "-est"
All four are valid suffix targets in standard stemmers. "-ing" (A) removes the present participle: running → run. "-ly" (B) removes adverb forms: quickly → quick. "-able" (C) removes adjective suffixes: comfortable → comfort. "-est" (D) removes superlative forms: greatest → great. Rule-based stemmers include rules for common English suffixes like these. Stemming strips any recognized suffix, regardless of whether the result is a real word.
✓ Correct answers: A, B, C & D
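A toy rule-based stripper covering exactly the four suffixes from the question, with one extra rule to undouble a trailing consonant ("runn" → "run") — a much-simplified version of what real stemmers do:

```python
SUFFIXES = ("ing", "ly", "able", "est")  # the four suffixes from the question

def strip_suffix(word):
    # Remove the first matching suffix; purely rule-based, so the
    # result need not be a dictionary word.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undouble a trailing consonant pair, e.g. "runn" -> "run".
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "quickly", "comfortable", "greatest"]:
    print(w, "->", strip_suffix(w))
# running -> run, quickly -> quick, comfortable -> comfort, greatest -> great
```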
Question 06
Which technique handles the removal of stop words?
A. Pre-trained models
B. Vocabulary filtering
C. N-gram generation
D. Feature selection
Stop word removal is implemented via vocabulary filtering (B): you maintain a list (vocabulary/lexicon) of known stop words and simply discard any token that matches an entry in that list. Pre-trained models (A) are used for embeddings or downstream fine-tuning, not for basic filtering. N-gram generation (C) is about creating multi-word sequences. Feature selection (D) is a broader ML concept that may or may not remove stop words, but it is not the specific technique for stop word removal itself.
✓ Correct answer: B
Question 07
Which of the following is an application of text mining?
⚑ Select all that apply
A. Information extraction
B. Machine translation
C. Information retrieval
D. Speech recognition
Text mining refers to the discovery of patterns and knowledge from text data. Its applications include: Information extraction (A) — automatically pulling structured facts (entities, relations) from unstructured text; Machine translation (B) — translating text between languages using learned linguistic patterns; Information retrieval (C) — finding relevant documents matching a query (e.g., search engines). Speech recognition (D) is an audio signal processing task — it converts spoken audio to text, operating on acoustic waveforms rather than text data, so it is not a text mining application.
✓ Correct answers: A, B & C
Question 08
Which of the following are challenges in text understanding for computers?
⚑ Select all that apply
A. Ambiguity
B. Scale
C. Sparsity
D. Variation
All four are genuine NLP challenges. Ambiguity (A) — words and sentences often have multiple meanings (e.g., "bank" can mean a financial institution or a riverbank). Scale (B) — the internet produces billions of text documents; processing them efficiently is a major engineering and computational challenge. Sparsity (C) — in bag-of-words models, most words don't appear in a given document, creating very sparse high-dimensional vectors that are hard to learn from. Variation (D) — the same concept is expressed in many ways (synonyms, dialects, typos, slang), making generalisation hard.
✓ Correct answers: A, B, C & D
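Sparsity is easy to see even on a toy corpus: build Boolean bag-of-words vectors over three made-up sentences and count how many entries are nonzero. The sentences are illustrative, not from the course material:

```python
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "text data analysis needs preprocessing",
]

# Vocabulary = union of all words; one Boolean vector per document.
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[1 if w in d.split() else 0 for w in vocab] for d in docs]

total = sum(len(v) for v in vectors)
nonzero = sum(sum(v) for v in vectors)
print(f"vocabulary size: {len(vocab)}")   # 13 distinct words
print(f"nonzero entries: {nonzero}/{total}")  # 13/39 — two thirds are zero
```

Even with three short sentences, most entries are zero; with a realistic vocabulary of tens of thousands of words, the fraction of zeros approaches one.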
Question 09.1
What is the Euclidean distance between A and B using Boolean feature-based representation in vector space model without stop word removal?
A – Scissors cuts paper and decapitates lizard
B – Lizard poisons Spock and eats paper
C – Spock smashes scissors and vaporizes rock
A. 2
B. 2.24
C. 2.45
D. None of the above
Since stop words are not removed, "and" stays in the vocabulary. The 12-word vocabulary is:
and, cuts, decapitates, eats, lizard, paper, poisons, rock, scissors, smashes, spock, vaporizes
A: 1 1 1 0 1 1 0 0 1 0 0 0
B: 1 0 0 1 1 1 1 0 0 0 1 0
A and B share: and, lizard, paper → 3 common terms. Differences: A has cuts, decapitates, scissors that B doesn't; B has eats, poisons, spock that A doesn't. That's 6 mismatches. Euclidean distance = √6 ≈ 2.449 ≈ 2.45.
✓ Correct answer: C (2.45)
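The worked computation can be checked directly, assuming the alphabetical ordering of the 12-term vocabulary above:

```python
import math

# Boolean vectors over the 12-term vocabulary, in alphabetical order:
# and, cuts, decapitates, eats, lizard, paper, poisons, rock,
# scissors, smashes, spock, vaporizes
A = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
B = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0]

# Euclidean distance: square root of the sum of squared differences.
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))
print(round(dist, 2))  # → 2.45
```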
Question 09.2
What is the cosine similarity between B and C using Boolean feature-based representation in vector space model without stop word removal?
A – Scissors cuts paper and decapitates lizard
B – Lizard poisons Spock and eats paper
C – Spock smashes scissors and vaporizes rock
A. 0.33
B. 0.50
C. 0.17
D. None of the above
Using the same 12-term vocabulary (stop words kept):
and, cuts, decapitates, eats, lizard, paper, poisons, rock, scissors, smashes, spock, vaporizes
B: 1 0 0 1 1 1 1 0 0 0 1 0
C: 1 0 0 0 0 0 0 1 1 1 1 1
B and C share: and, spock → dot product = 2. |B| = √6 ≈ 2.449, |C| = √6 ≈ 2.449.
Cosine = 2 / (2.449 × 2.449) = 2 / 6 ≈ 0.333 ≈ 0.33
✓ Correct answer: A (0.33)
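Again the arithmetic can be verified in a few lines, using the same 12-term vector ordering:

```python
import math

# Boolean vectors for B and C over the 12-term alphabetical vocabulary.
B = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0]
C = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Cosine similarity: dot product divided by the product of the norms.
dot = sum(b * c for b, c in zip(B, C))
cosine = dot / (math.sqrt(sum(b * b for b in B)) * math.sqrt(sum(c * c for c in C)))
print(round(cosine, 2))  # → 0.33
```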
Question 10
Suppose vectors A, B, and C show A closer to B than to C using Euclidean distance. Which statement is true regarding cosine similarity?
A. A and B are more similar than A and C when measured by cosine similarity
B. A and B are less similar than A and C when measured by cosine similarity
C. B and C are more similar than A and C when measured by cosine similarity
D. B and C are less similar than A and C when measured by cosine similarity
E. There is not enough information to determine the cosine similarity between the vectors
This is a conceptual trap. Euclidean distance measures absolute distance in space, accounting for both direction and magnitude. Cosine similarity measures only the angle between vectors, completely ignoring magnitude. The two can agree: for A = [1, 0], B = [2, 0], C = [0, 3], A is closer to B by Euclidean distance (1 vs ≈3.16) and also more similar to B by cosine (1.0 vs 0). But they can also disagree: for A = [1, 0], B = [1, 1], C = [100, 1], A is closer to B by Euclidean distance (1 vs ≈99), yet cosine rates A and C as more similar (≈0.99995 vs ≈0.707), because C points in almost the same direction as A despite being far away. Without knowing the actual vectors, you simply cannot infer cosine similarity from Euclidean distance. The correct answer is E.
✓ Correct answer: E