Question 01
What does TF-IDF stand for?
TF-IDF stands for Term Frequency–Inverse Document Frequency. It is a numerical statistic used to reflect how important a word is to a document within a collection. The formula combines two components: TF (how often the term appears in a specific document) and IDF (the inverse of how many documents contain that term across the whole corpus). Words appearing in many documents get a lower IDF, naturally down-weighting common words like "the" and boosting distinctive ones.
✓ Correct answer: A
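The two components can be computed directly in a few lines. Below is a minimal sketch; the corpus, helper names, and use of the natural log are illustrative assumptions, not part of the question:

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of `term` divided by document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log of (corpus size / docs containing term).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell sharply".split(),
]

# "the" appears in every document -> IDF = log(3/3) = 0 -> TF-IDF = 0
print(tf("the", corpus[0]) * idf("the", corpus))     # 0.0
# "market" is distinctive to the third document -> positive TF-IDF
print(tf("market", corpus[2]) * idf("market", corpus))
```

Note how the common word is zeroed out automatically, while the distinctive word keeps a positive score.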
Question 02
What does a high TF-IDF score for a word indicate?
A high TF-IDF score means the word is important and distinctive to that particular document. This happens when TF is high (the word appears a lot in that document) and IDF is high (the word is rare across the corpus). A word common in all documents (B) would have IDF ≈ 0, killing its TF-IDF score. Option C would give TF = 0, also zeroing the score. Option D would likewise yield a low IDF. Think of a medical paper frequently using the word "myocardial" — rare in most texts, but prominent in that document → high TF-IDF.
✓ Correct answer: A
Question 03
Which of the following is a limitation of TF-IDF?
TF-IDF's biggest weakness is that it treats every word as an isolated symbol — it has no understanding of meaning. The words "car," "automobile," and "vehicle" would be completely unrelated in TF-IDF space even though they're semantically close. It cannot handle synonymy (same meaning, different words) or polysemy (same word, different meanings). Option A is wrong — TF-IDF naturally suppresses stop words because they appear in all documents (IDF ≈ 0). Option C is partially true but not the primary limitation. Option D is wrong — TF-IDF is entirely unsupervised.
✓ Correct answer: B
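The synonymy problem is easy to see if you treat each word as its own axis, which is exactly what TF-IDF's bag-of-words representation does. A toy sketch (the vocabulary and vectors are made up for illustration):

```python
# Toy term vectors over the vocabulary ["car", "automobile", "fast"].
# Each word is its own axis, so "car" and "automobile" are orthogonal
# even though they are synonyms.
car        = [1, 0, 0]
automobile = [0, 1, 0]

def cosine(u, v):
    # Cosine similarity: dot product over the product of vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

print(cosine(car, automobile))  # 0.0 — no similarity despite shared meaning
```

No amount of re-weighting by TF-IDF can fix this, because the axes themselves carry no notion of meaning.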
Question 04
Which of the following is an example of a bigram?
A bigram is a sequence of exactly two consecutive words (or tokens). "Machine learning" (A) is a two-word sequence — a textbook bigram. Option B, "Deep-learning," is a single hyphenated token, not two separate words. Option C, "bi-gram is," contains three tokens (bi, gram, is), making it a trigram. Option D, "Model," is a single word (a unigram). N-grams are fundamental building blocks of statistical language models, where N indicates the window size: unigram (1), bigram (2), trigram (3), etc.
✓ Correct answer: A
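Extracting n-grams is just a sliding window over the token list. A minimal sketch (the `ngrams` helper name and sample sentence are illustrative):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "machine learning is fun".split()
print(ngrams(tokens, 1))  # unigrams: ('machine',), ('learning',), ...
print(ngrams(tokens, 2))  # bigrams:  ('machine', 'learning'), ...
print(ngrams(tokens, 3))  # trigrams: ('machine', 'learning', 'is'), ...
```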
Question 05
What is the main limitation of N-gram language models?
N-gram models predict the next word based on a fixed window of N–1 preceding words. The fundamental limitation is that this fixed window cannot capture dependencies between words far apart in a sentence. Consider: "The cat that sat on the mat was …" — to correctly predict the next word, you might need context from much earlier. N-grams are inherently Markovian — they assume the current word only depends on the last N–1 words. Option B is wrong — N-grams explicitly model word order. Options A and C are real concerns but secondary to the core long-range dependency problem that motivates neural language models.
✓ Correct answer: D
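The fixed-window assumption is visible directly in code. Below is a minimal bigram (N = 2) counter, so each word is conditioned on exactly one predecessor; the corpus and names are illustrative:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # Condition each word only on its single predecessor (a 2-gram model):
    # everything earlier in the sentence is invisible to the model.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

tokens = "the cat sat on the mat".split()
model = train_bigram(tokens)

# After "the" the model sees only {"cat": 1, "mat": 1}; it cannot use any
# longer-range context (e.g. the sentence's subject) to pick between them.
print(model["the"])
```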
Question 06
What is a key advantage of continuous word representations (word embeddings)?
The defining advantage of word embeddings (e.g., Word2Vec, GloVe) is that semantically similar words end up close together in the dense vector space (A). Words that appear in similar contexts receive similar vectors, so synonyms cluster together. The classic demonstration:
king − man + woman ≈ queen.

Options B and C are wrong — embeddings typically require large corpora to learn good representations, and while they reduce dimensionality versus one-hot encodings, they don't "eliminate" the curse of dimensionality. Option D (positional encoding) is a feature of Transformers, not word embeddings themselves.
✓ Correct answer: A
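The analogy above can be reproduced with tiny hand-built vectors. Note these 2-d vectors on (royalty, gender) axes are an illustrative assumption, not embeddings learned by Word2Vec or GloVe:

```python
import math

# Hand-built 2-d "embeddings" on axes (royalty, gender) — purely illustrative.
emb = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max(emb, key=lambda word: cosine(emb[word], target))
print(nearest)  # queen
```

The arithmetic works here because the axes encode the relevant features; real embeddings learn analogous directions from context alone.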
Question 07
What does the dimension of a word embedding vector represent?
The dimensionality of a word embedding (e.g., 100, 300, 768) represents the number of latent semantic features the model uses to represent each word (A). These are not hand-crafted features — they're learned automatically during training. Higher dimensions can capture richer relationships but cost more compute. Contrast this with one-hot encoding, where the vector size equals the vocabulary size (B) — embeddings are far smaller and denser. The corpus size (C) and document count (D) are properties of the training data that influence embedding quality but don't define the vector dimension.
✓ Correct answer: A
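The contrast between embedding dimension and vocabulary size can be made concrete with a toy lookup table (the values are random and the tiny sizes are purely illustrative):

```python
import random

vocab = ["the", "cat", "sat", "mat"]
dim = 3  # number of latent features per word (a tiny illustrative value)

# An embedding layer is just a lookup table of shape (vocab_size, dim).
random.seed(0)
embedding = {word: [random.gauss(0, 1) for _ in range(dim)] for word in vocab}

# Every word gets a dense vector of the same length `dim`, regardless of
# corpus size or word frequency.
print(len(embedding["cat"]))  # 3

# A one-hot vector, by contrast, is as long as the vocabulary itself.
one_hot_cat = [1 if w == "cat" else 0 for w in vocab]
print(len(one_hot_cat))       # 4 == len(vocab)
```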
Question 08
What distinguishes large language models (LLMs) from traditional language models?
The fundamental architectural distinction is the use of deep learning — specifically the Transformer architecture (B). Transformers use self-attention mechanisms that can model relationships between any two positions in a sequence regardless of distance, overcoming the fixed-window limitation of N-grams. While LLMs do process large corpora (A), so did earlier neural models. LLMs are primarily self-supervised (trained on raw text without labels), making C incorrect. Option D (coherent text generation) is a capability that results from the architecture, not the defining distinction itself.
✓ Correct answer: B
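Self-attention's "any position to any position" property can be sketched in a few lines. This is scaled dot-product attention with the learned query/key/value projection matrices omitted for brevity — an illustrative simplification, not a full Transformer layer:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: every position attends to every other
    # position directly, regardless of how far apart they are.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # one weight per position, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three positions, 2-d vectors; position 0 can attend to position 2 as
# easily as to position 1 — there is no fixed N-gram window.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(x, x, x))
```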
Question 09
What is the TF-IDF for the term "lizard" in sentence B?
A – Scissors cuts paper and decapitates lizard
B – Lizard poisons Spock and eats paper
C – Rock crushes lizard and breaks scissors
TF-IDF = TF × IDF, where
IDF = log(N / df).
TF("lizard", B) = 1 / 6 ≈ 0.167 ← "lizard" appears once; B has 6 tokens
Document frequency (df) = 3 ← "lizard" is in A ✓, B ✓, C ✓
IDF = log(3 / 3) = log(1) = 0
TF-IDF = 0.167 × 0 = 0
Because "lizard" appears in all three documents, IDF collapses to zero — making TF-IDF = 0. This is exactly TF-IDF's design: words that appear everywhere carry no discriminative power. The distractor 0.477 = log₁₀(3) would apply if "lizard" appeared in only one document; 0.176 would result from a smoothed IDF formula.
✓ Correct answer: C (0)
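The calculation is easy to check in code (whitespace tokenization, lowercasing, and log base 10 as in the distractor analysis are the assumptions here):

```python
import math

docs = {
    "A": "scissors cuts paper and decapitates lizard".split(),
    "B": "lizard poisons spock and eats paper".split(),
    "C": "rock crushes lizard and breaks scissors".split(),
}

tf = docs["B"].count("lizard") / len(docs["B"])     # 1/6 ≈ 0.167
df = sum("lizard" in d for d in docs.values())      # 3 (in all documents)
idf = math.log10(len(docs) / df)                    # log10(3/3) = 0
print(tf * idf)  # 0.0
```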
Question 10
Suppose two words share very similar contexts in a corpus. Which statement is true?
Word embedding models (Word2Vec, GloVe) are built on the distributional hypothesis: "words that occur in similar contexts have similar meanings." So two words with similar contexts will produce similar embedding vectors — they'll be close together in the vector space (A). Option B is wrong: TF-IDF depends on per-document frequency distributions, so two semantically identical words appearing in different documents can have very different TF-IDF values. Options C and D confuse vector values with vector dimensions. In any given model, all words share the same fixed dimensionality (e.g., all vectors are 300-dimensional) — what differs are the actual values inside those vectors, not the size.
✓ Correct answer: A