Question 01
What does TF-IDF stand for?
TF-IDF stands for Term Frequency–Inverse Document Frequency. It is a numerical statistic used to reflect how important a word is to a document within a collection. The formula combines two components: TF (how often the term appears in a specific document) and IDF (the inverse of how many documents contain that term across the whole corpus). Words appearing in many documents get a lower IDF, naturally down-weighting common words like "the" and boosting distinctive ones.
✓ Correct answer: A
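The two components can be computed directly in a few lines. Below is a minimal sketch; the corpus, helper names, and use of the natural log are illustrative assumptions, not part of the question:

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of `term` divided by document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log of (corpus size / docs containing term).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell sharply".split(),
]

# "the" appears in every document -> IDF = log(3/3) = 0 -> TF-IDF = 0
print(tf("the", corpus[0]) * idf("the", corpus))     # 0.0
# "market" is distinctive to the third document -> positive TF-IDF
print(tf("market", corpus[2]) * idf("market", corpus))
```

Note how the common word is zeroed out automatically, while the distinctive word keeps a positive score.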
Question 02
What does a high TF-IDF score for a word indicate?
A high TF-IDF score means the word is important and distinctive to that particular document. This happens when TF is high (the word appears a lot in that document) and IDF is high (the word is rare across the corpus). A word common in all documents (B) would have IDF ≈ 0, killing its TF-IDF score. Option C would give TF = 0, also zeroing the score. Option D would likewise yield a low IDF. Think of a medical paper frequently using the word "myocardial" — rare in most texts, but prominent in that document → high TF-IDF.
✓ Correct answer: A
Question 03
Which of the following is a limitation of TF-IDF?
TF-IDF's biggest weakness is that it treats every word as an isolated symbol — it has no understanding of meaning. The words "car," "automobile," and "vehicle" would be completely unrelated in TF-IDF space even though they're semantically close. It cannot handle synonymy (same meaning, different words) or polysemy (same word, different meanings). Option A is wrong — TF-IDF naturally suppresses stop words because they appear in all documents (IDF ≈ 0). Option C is partially true but not the primary limitation. Option D is wrong — TF-IDF is entirely unsupervised.
✓ Correct answer: B
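The synonymy problem is easy to see if you treat each word as its own axis, which is exactly what TF-IDF's bag-of-words representation does. A toy sketch (the vocabulary and vectors are made up for illustration):

```python
# Toy term vectors over the vocabulary ["car", "automobile", "fast"].
# Each word is its own axis, so "car" and "automobile" are orthogonal
# even though they are synonyms.
car        = [1, 0, 0]
automobile = [0, 1, 0]

def cosine(u, v):
    # Cosine similarity: dot product over the product of vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

print(cosine(car, automobile))  # 0.0 — no similarity despite shared meaning
```

No amount of re-weighting by TF-IDF can fix this, because the axes themselves carry no notion of meaning.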
Question 04
Which of the following is an example of a bigram?
A bigram is a sequence of exactly two consecutive words (or tokens). "Machine learning" (A) is a two-word sequence — a textbook bigram. Option B, "Deep-learning," is a single hyphenated token, not two separate words. Option C, "bi-gram is," contains three tokens (bi, gram, is), making it a trigram. Option D, "Model," is a single word (a unigram). N-grams are fundamental building blocks of statistical language models, where N indicates the window size: unigram (1), bigram (2), trigram (3), etc.
✓ Correct answer: A
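Extracting n-grams is just a sliding window over the token list. A minimal sketch (the `ngrams` helper name and sample sentence are illustrative):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "machine learning is fun".split()
print(ngrams(tokens, 1))  # unigrams: ('machine',), ('learning',), ...
print(ngrams(tokens, 2))  # bigrams:  ('machine', 'learning'), ...
print(ngrams(tokens, 3))  # trigrams: ('machine', 'learning', 'is'), ...
```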
Question 05
What is the main limitation of N-gram language models?
N-gram models predict the next word based on a fixed window of N–1 preceding words. The fundamental limitation is that this fixed window cannot capture dependencies between words far apart in a sentence. Consider: "The cat that sat on the mat was …" — to correctly predict the next word, you might need context from much earlier. N-grams are inherently Markovian — they assume the current word only depends on the last N–1 words. Option B is wrong — N-grams explicitly model word order. Options A and C are real concerns but secondary to the core long-range dependency problem that motivates neural language models.
✓ Correct answer: D
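The fixed-window assumption is visible directly in code. Below is a minimal bigram (N = 2) counter, so each word is conditioned on exactly one predecessor; the corpus and names are illustrative:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # Condition each word only on its single predecessor (a 2-gram model):
    # everything earlier in the sentence is invisible to the model.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

tokens = "the cat sat on the mat".split()
model = train_bigram(tokens)

# After "the" the model sees only {"cat": 1, "mat": 1}; it cannot use any
# longer-range context (e.g. the sentence's subject) to pick between them.
print(model["the"])
```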
Question 06
What is a key advantage of continuous word representations (word embeddings)?
The defining advantage of word embeddings (e.g., Word2Vec, GloVe) is that semantically similar words end up close together in the dense vector space (A). Words that appear in similar contexts receive similar vectors, so synonyms cluster together. The classic demonstration:
king − man + woman ≈ queen.

Options B and C are wrong — embeddings typically require large corpora to learn good representations, and while they reduce dimensionality versus one-hot encodings, they don't "eliminate" the curse of dimensionality. Option D (positional encoding) is a feature of Transformers, not word embeddings themselves.
✓ Correct answer: A
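The analogy above can be reproduced with tiny hand-built vectors. Note these 2-d vectors on (royalty, gender) axes are an illustrative assumption, not embeddings learned by Word2Vec or GloVe:

```python
import math

# Hand-built 2-d "embeddings" on axes (royalty, gender) — purely illustrative.
emb = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max(emb, key=lambda word: cosine(emb[word], target))
print(nearest)  # queen
```

The arithmetic works here because the axes encode the relevant features; real embeddings learn analogous directions from context alone.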
Question 07
What does the dimension of a word embedding vector represent?
The dimensionality of a word embedding (e.g., 100, 300, 768) represents the number of latent semantic features the model uses to represent each word (A). These are not hand-crafted features — they're learned automatically during training. Higher dimensions can capture richer relationships but cost more compute. Contrast this with one-hot encoding, where the vector size equals the vocabulary size (B) — embeddings are far smaller and denser. The corpus size (C) and document count (D) are properties of the training data that influence embedding quality but don't define the vector dimension.
✓ Correct answer: A
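The contrast between embedding dimension and vocabulary size can be made concrete with a toy lookup table (the values are random and the tiny sizes are purely illustrative):

```python
import random

vocab = ["the", "cat", "sat", "mat"]
dim = 3  # number of latent features per word (a tiny illustrative value)

# An embedding layer is just a lookup table of shape (vocab_size, dim).
random.seed(0)
embedding = {word: [random.gauss(0, 1) for _ in range(dim)] for word in vocab}

# Every word gets a dense vector of the same length `dim`, regardless of
# corpus size or word frequency.
print(len(embedding["cat"]))  # 3

# A one-hot vector, by contrast, is as long as the vocabulary itself.
one_hot_cat = [1 if w == "cat" else 0 for w in vocab]
print(len(one_hot_cat))       # 4 == len(vocab)
```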
Question 08
What distinguishes large language models (LLMs) from traditional language models?
The fundamental architectural distinction is the use of deep learning — specifically the Transformer architecture (B). Transformers use self-attention mechanisms that can model relationships between any two positions in a sequence regardless of distance, overcoming the fixed-window limitation of N-grams. While LLMs do process large corpora (A), so did earlier neural models. LLMs are primarily self-supervised (trained on raw text without labels), making C incorrect. Option D (coherent text generation) is a capability that results from the architecture, not the defining distinction itself.
✓ Correct answer: B
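Self-attention's "any position to any position" property can be sketched in a few lines. This is scaled dot-product attention with the learned query/key/value projection matrices omitted for brevity — an illustrative simplification, not a full Transformer layer:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: every position attends to every other
    # position directly, regardless of how far apart they are.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # one weight per position, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three positions, 2-d vectors; position 0 can attend to position 2 as
# easily as to position 1 — there is no fixed N-gram window.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(x, x, x))
```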
Question 09
What is the TF-IDF for the term "lizard" in sentence B?
A – Scissors cuts paper and decapitates lizard
B – Lizard poisons Spock and eats paper
C – Rock crushes lizard and breaks scissors
TF-IDF = TF × IDF, where
IDF = log(N / df).
TF("lizard", B) = 1 / 6 ≈ 0.167 ← "lizard" appears once; B has 6 tokens
Document frequency (df) = 3 ← "lizard" is in A ✓, B ✓, C ✓
IDF = log(3 / 3) = log(1) = 0
TF-IDF = 0.167 × 0 = 0
Because "lizard" appears in all three documents, IDF collapses to zero — making TF-IDF = 0. This is exactly TF-IDF's design: words that appear everywhere carry no discriminative power. The distractor 0.477 = log₁₀(3) would apply if "lizard" appeared in only one document; 0.176 would result from a smoothed IDF formula.
✓ Correct answer: C (0)
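The calculation is easy to check in code (whitespace tokenization, lowercasing, and log base 10 as in the distractor analysis are the assumptions here):

```python
import math

docs = {
    "A": "scissors cuts paper and decapitates lizard".split(),
    "B": "lizard poisons spock and eats paper".split(),
    "C": "rock crushes lizard and breaks scissors".split(),
}

tf = docs["B"].count("lizard") / len(docs["B"])     # 1/6 ≈ 0.167
df = sum("lizard" in d for d in docs.values())      # 3 (in all documents)
idf = math.log10(len(docs) / df)                    # log10(3/3) = 0
print(tf * idf)  # 0.0
```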
Question 10
Suppose two words share very similar contexts in a corpus. Which statement is true?
Word embedding models (Word2Vec, GloVe) are built on the distributional hypothesis: "words that occur in similar contexts have similar meanings." So two words with similar contexts will produce similar embedding vectors — they'll be close together in the vector space (A). Option B is wrong: TF-IDF depends on per-document frequency distributions, so two semantically identical words appearing in different documents can have very different TF-IDF values. Options C and D confuse vector values with vector dimensions. In any given model, all words share the same fixed dimensionality (e.g., all vectors are 300-dimensional) — what differs are the actual values inside those vectors, not the size.
✓ Correct answer: A