King's College London

Data Streams & Streaming Algorithms

Week 10 — Quiz  ·  Dr Grigorios Loukides

Questions: 16
Type: Multi-select & Fill-in-the-blank
Topic: Streams, Sampling, DGIM & WRS
Part 1 — Multiple Choice Questions
Question 01
Which of the following are properties of data streams?
⚑ Select all that apply
A. Data streams are always finite.
B. Data streams have a controlled input rate internally.
C. Data streams can be infinite.
D. Data streams always consist of numerical data.
E. Data streams may have a changing data distribution over time.
C is correct — streams like network traffic or sensor readings are potentially unbounded and have no fixed end. E is correct — distributions shift over time (concept drift); e.g. shopping patterns change by season. A is wrong — streams can be unbounded, so they are not always finite. B is wrong — the rate is externally imposed and uncontrolled; the system must cope with whatever rate arrives. D is wrong — streams can carry text, images, events, or any data type.
✓ Correct answers: C & E
Question 02
Which examples best describe elements in a data stream?
⚑ Select all that apply
A. A complete database of customer records.
B. A single event like "page A visited" in web click-stream data.
C. A compiled report of stock prices over a month.
D. An archived log file of network activities.
E. A real-time sensor reading from an IoT device.
Stream elements are individual, atomic, real-time data items. B is correct — a single page-visit event is one discrete item arriving in a continuous click-stream. E is correct — a live sensor reading is a real-time atomic element arriving continuously. A is wrong — a full customer database is a static, batched dataset, not a stream element. C is wrong — a compiled monthly report is aggregated/historical data, the opposite of real-time stream elements. D is wrong — an archived log is stored batch data, not a live stream.
✓ Correct answers: B & E
Question 03
What are primary goals when processing data streams in real-time?
⚑ Select all that apply
A. Storing the entire data stream.
B. Ensuring the data stream values do not increase too much.
C. Computing functions of the stream, such as the median or the number of distinct elements.
D. Delaying processing to batch the data.
E. Making decisions or detecting patterns quickly.
C is correct — computing stream statistics (median, distinct count, frequency) with limited memory is the core algorithmic challenge of streaming. E is correct — the whole point of real-time processing is fast decisions; fraud detection, anomaly alerts, and recommendations all require low latency. A is wrong — storing the entire stream is exactly what is infeasible; streaming algorithms exist precisely to avoid this. B is wrong — controlling data values is not a streaming goal. D is wrong — batching contradicts real-time processing.
✓ Correct answers: C & E
Question 04
Which of the following are examples of data stream applications?
⚑ Select all that apply
A. Real-time analysis of user behavior on websites.
B. Quick detection and response to network issues.
C. Archiving historical weather data.
D. Fraud detection in telecommunication connections.
E. Continuous monitoring of the physical environment via sensor networks.
A, B, D, and E are all correct — they all involve continuous, real-time data flows requiring immediate processing. Web analytics (A) processes click-streams live; network monitoring (B) analyses packet flows in real-time; telecom fraud detection (D) flags suspicious call patterns as they happen; sensor networks (E) produce continuous environmental readings. C is wrong — archiving historical weather data is a batch storage task; data is collected and stored for later analysis, not processed as a live stream.
✓ Correct answers: A, B, D & E
Question 05
What are some challenges in managing data streams?
⚑ Select all that apply
A. Data streams always contain one data type.
B. Data streams arrive at a predictable and consistent rate.
C. Storing the entire data stream in memory is often impractical.
D. Data streams typically have small, finite data sets.
E. Ensuring all data is processed in real-time with limited memory.
C is correct — streams can be infinite or extremely high-volume, making full in-memory storage impossible; this is why approximate algorithms like DGIM exist. E is correct — processing everything in real-time with only a small working memory buffer is the central challenge of streaming. A is wrong — streams commonly mix data types (events, numbers, text). B is wrong — arrival rates are often bursty and unpredictable, which is precisely what makes stream management hard. D is wrong — the challenge is that datasets are large and potentially infinite, not small.
✓ Correct answers: C & E
Question 06
What are characteristics of a Data Stream Management System (DSMS)?
⚑ Select all that apply
A. It processes data in batches.
B. It can store the entire data stream for querying.
C. It archives streams but does not use them for real-time queries.
D. It uses limited working storage for real-time processing.
E. It can handle high-velocity and high-volume data streams.
D is correct — a DSMS operates with a small, fixed working memory; it cannot store everything, so it uses synopses and approximations. E is correct — handling high velocity and volume is the defining capability that distinguishes a DSMS from a traditional DBMS. A is wrong — batch processing is the opposite of streaming; a DSMS processes elements as they arrive, not in collected batches. B is wrong — storing the entire stream defeats the purpose; a DSMS is designed precisely for when full storage is impossible. C is wrong — a DSMS actively queries and processes data in real-time; archiving without querying describes a data lake, not a DSMS.
✓ Correct answers: D & E
Question 07
What types of queries are involved in data stream processing?
⚑ Select all that apply
A. Sampling data from a stream.
B. Queries over sliding windows.
C. Filtering a data stream.
D. Counting distinct elements.
E. None of the above.
All four non-"none" options are standard stream query types. Sampling (A) — drawing a representative subset from an ongoing stream. Sliding windows (B) — computing results over the most recent N elements, discarding older data (the basis of DGIM). Filtering (C) — retaining only elements matching a predicate (e.g., Bloom filters). Counting distinct elements (D) — estimating the number of unique items seen so far (e.g., Flajolet-Martin). E is wrong — all of the above are genuine stream query types taught in this module.
✓ Correct answers: A, B, C & D
Question 08
Which of the following are goals when sampling data streams?
⚑ Select all that apply
A. To store the entire stream data for future use.
B. To store a random sample of fixed size over a potentially infinite stream.
C. To process stream data in real-time with limited memory.
D. To compute accurate answers for all queries.
E. To ensure all elements in the stream have the same probability at any time step.
B is correct — reservoir sampling maintains a fixed-size sample of size m regardless of how long the stream grows, which is the key algorithmic goal. C is correct — sampling is the mechanism that allows real-time processing within a tight memory budget. A is wrong — storing everything is precisely what sampling avoids. D is wrong — sampling is an approximate technique; accuracy for all queries is impossible with a sample. E is wrong — in weighted random sampling (WRS), elements have different probabilities proportional to their weights; uniform probability applies only to simple random sampling, not WRS.
✓ Correct answers: B & C
Question 09
What are the characteristics of a naive solution for sampling a fixed proportion of elements in a data stream?
⚑ Select all that apply
A. Generate a random integer for each element.
B. Store each element with equal probability.
C. Include the element in the sample if the random integer is 0.
D. Discard the element if the random integer is 9.
E. Store all elements initially, then filter them later.
The naive 1-in-10 sampling approach: for each element, generate a random integer (A) in {0…9} and store the element only if the integer equals a fixed target value, giving every element an equal 1/10 probability (B) of being kept. C is not keyed as correct — 0 is merely one possible choice of target; the defining characteristic is equal probability, not the particular target value. D is wrong — discarding when the integer equals 9 keeps elements with probability 9/10, which is not sampling a small fixed proportion. E is wrong — storing all elements first defeats the memory constraint that makes streaming algorithms necessary.
✓ Correct answers: A & B
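The naive fixed-proportion sampler described above can be sketched in a few lines of Python (the function name and the demo stream are our own, not from the lecture):

```python
import random

def naive_sample(stream, k=10, target=0):
    """Keep each element with probability 1/k: draw a random
    integer in 0..k-1 and store the element only when it hits
    the fixed target value."""
    sample = []
    for element in stream:
        if random.randrange(k) == target:
            sample.append(element)
    return sample

# Over a long stream, roughly 1/10 of the elements are retained.
print(len(naive_sample(range(100_000))))
```

Note that any fixed target in 0..9 works equally well — what matters is that every element faces the same 1/10 chance.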
Question 10
What are the main points about Weighted Random Sampling (WRS)?
⚑ Select all that apply
A. WRS aims to sample elements with equal probability regardless of their weight.
B. The probability of an element being included in the sample is proportional to its weight.
C. The same WRS algorithm can be applied to both finite and infinite data streams.
D. The selection probability for each element is calculated only once at the beginning.
E. A simple algorithm can create a WRS of size m from a dataset of size n.
B is correct — the defining property of WRS is that heavier-weighted elements are more likely to appear in the sample; probability is proportional to weight. E is correct — a clean algorithm exists: assign each element a key u^(1/w) (where u is uniform random, w is weight) and keep the m elements with the largest keys. A is wrong — equal probability is simple random sampling, not WRS; the whole point of WRS is to differentiate by weight. C is wrong — the basic finite WRS algorithm requires knowing n upfront; a separate reservoir-based variant handles infinite streams. D is wrong — in a stream, probabilities are updated dynamically as new elements arrive.
✓ Correct answers: B & E
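The key-based algorithm mentioned above (due to Efraimidis and Spirakis) is short enough to sketch; the function name and demo data are our own:

```python
import heapq
import random

def weighted_sample(items, weights, m):
    """Weighted random sampling of size m: assign each element
    the key u**(1/w) with u uniform in (0, 1), then keep the m
    elements with the largest keys. Heavier elements tend to get
    larger keys, so inclusion probability grows with weight."""
    keyed = [(random.random() ** (1.0 / w), x) for x, w in zip(items, weights)]
    return [x for _, x in heapq.nlargest(m, keyed)]
```

Across repeated runs, the heaviest elements appear in the sample far more often than the lightest, though any element can still be selected.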
Question 11
Given a dataset of size 100 with weights w_i for each element i ranging from 1 to 100, calculate the probability of the element with weight 50 being included in a WRS of size 10.
Hint — the lecture's simplified WRS formula: P(eᵢ) = (wᵢ × m) / (n × n), with n = 100, m = 10, w₅₀ = 50. (The exact normaliser would be Σwⱼ = 1 + 2 + … + 100 = 5050, giving P ≈ 0.099; the lecture simplifies the denominator to n².)
A. 50/1000
B. 10/100
C. 50/100
D. (50 × 10) / (100 × 100)
E. (50 × 10) / 100
The WRS formula used in the lecture is P(eᵢ) = (wᵢ × m) / (n × n) where the denominator treats each element as having average weight equal to n (a simplification where weights range 1..n). With wᵢ = 50, m = 10, n = 100: P = (50 × 10) / (100 × 100) = 500/10000 = 0.05. This matches option D. A (50/1000) = 0.05 numerically but the formula structure is wrong. B (10/100) = 0.1, which is uniform sampling ignoring weights. C (50/100) = 0.5, ignores sample size m. E divides by just n once instead of n², giving 5 which is > 1, so it cannot be a probability.
✓ Correct answer: D — (50 × 10) / (100 × 100) = 0.05
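The arithmetic behind option D, and the exact alternative mentioned above, can be checked directly:

```python
n, m, w_i = 100, 10, 50

# Lecture's simplified denominator n × n:
p_simplified = (w_i * m) / (n * n)
print(p_simplified)        # 0.05 — option D

# Exact normaliser: sum of weights 1 + 2 + ... + 100 = 5050
total = sum(range(1, n + 1))
print((w_i * m) / total)   # ≈ 0.099
```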
Part 2 — Complete the Missing Words
Question 12
Complete the description of the Weighted Random Sampling (WRS) algorithm.
The probability of each element being included in the sample is _____ to its _____.
Every element eᵢ has a probability of wᵢ · m / Σwⱼ, where _____ is the weight of the element and the denominator is the _____ of all weights in the dataset.
The four answers are: proportional (heavier elements are more likely to be selected), weight (the attribute assigned to each element), wᵢ (the weight of the specific element eᵢ being evaluated), and sum (the denominator Σwⱼ is the total sum of all weights in the dataset, used to normalise the probability).
✓ Correct answers: proportional · weight · wᵢ · sum
Question 13
Complete the description of reservoir sampling for infinite streams.
To maintain sample size m for an infinite stream, the WRS algorithm uses a technique known as _____ sampling.
The algorithm begins by storing the first _____ elements.
For each subsequent element, it generates a _____ to decide if the new element should replace an existing element.
This decision is based on the element's _____. The final sample is _____ of the entire dataset and reflects the _____ of weights among the elements.
The answers are: reservoir (the technique that maintains a fixed-size sample over an infinite stream), m (the desired sample size — the first m elements fill the reservoir directly), random number (used to probabilistically decide replacement), weight (heavier elements are more likely to replace existing ones), representative (the sample reflects the overall population), and distribution (the weight distribution is preserved across the sample).
✓ Correct answers: reservoir · m · random number · weight · representative · distribution
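Combining the reservoir idea with the weight-based keys gives a one-pass sketch for unbounded streams. This is a hedged illustration, not the lecture's exact pseudocode; the names and the (element, weight) input format are our own:

```python
import heapq
import random

def weighted_reservoir(stream, m):
    """One-pass weighted reservoir sampling over a stream of
    (element, weight) pairs. Each arrival draws the key u**(1/w);
    the reservoir is a min-heap holding the m largest keys seen,
    so a new element replaces the current minimum only when its
    key is larger -- heavier elements win more often."""
    reservoir = []  # min-heap of (key, element)
    for element, w in stream:
        key = random.random() ** (1.0 / w)
        if len(reservoir) < m:
            heapq.heappush(reservoir, (key, element))
        elif key > reservoir[0][0]:
            heapq.heapreplace(reservoir, (key, element))
    return [element for _, element in reservoir]
```

The first m arrivals fill the reservoir directly, matching the description above; afterwards each arrival is admitted probabilistically according to its weight, and the sample size never exceeds m no matter how long the stream runs.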
Question 14
Complete the description of the DGIM method for counting 1s in a stream.
The DGIM method provides a way to estimate the _____ in the most recent N bits of a data stream.
It does this by _____ the sizes of all buckets but the last (leftmost), and adding _____ of the size of the last bucket.
This method ensures the error is at most _____%.
The DGIM (Datar-Gionis-Indyk-Motwani) answers: number of 1s (the count of set bits in the sliding window), summing (add up the bucket sizes for all buckets except the oldest/leftmost), half (the oldest bucket may extend beyond the window, so only half of it is counted), 50 (the guaranteed worst-case error bound is 50%, making it an approximate but memory-efficient method).
✓ Correct answers: number of 1s · summing · half · 50
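The estimate described above takes one line of code, assuming the bucket sizes are available as a list with the oldest (leftmost) bucket first; the function name is ours:

```python
def dgim_estimate(bucket_sizes):
    """Estimate the number of 1s in the window: sum the sizes of
    all buckets except the oldest, then add half the size of the
    oldest bucket, since it may straddle the window boundary."""
    if not bucket_sizes:
        return 0
    return sum(bucket_sizes[1:]) + bucket_sizes[0] // 2

# Buckets of sizes 8, 4, 4, 2, 1 (oldest first):
print(dgim_estimate([8, 4, 4, 2, 1]))  # 15  (4 + 4 + 2 + 1, plus 8 // 2)
```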
Question 15
Complete the description of DGIM bucket structure.
In the DGIM method, each _____ contains the timestamp of its most _____ element and the _____ in it.
The sizes of the buckets must be a power of _____, and there can be at most _____ bucket(s) of any given size up to the _____ size.
The answers are: bucket (the fundamental unit of DGIM), recent (each bucket records the timestamp of its newest bit, so old buckets can be expired when they fall outside the window), number of 1s (the count stored per bucket), two (bucket sizes are powers of 2: 1, 2, 4, 8…), two (at most 2 buckets of any given size are maintained — a third triggers a merge), maximum (there can be 1 or 2 buckets at the maximum size).
✓ Correct answers: bucket · recent · number of 1s · two · two · maximum
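The bucket rules above can be sketched as a single update step. Buckets are (timestamp of most recent 1, size) pairs kept newest-first; the representation and function name are our own hedged illustration, not the lecture's exact pseudocode:

```python
def dgim_add_bit(buckets, t, bit, N):
    """Process one stream bit at time t over a window of length N.
    Expire the oldest bucket once its most recent timestamp leaves
    the window; on a 1, add a size-1 bucket and, whenever three
    buckets share a size, merge the two oldest of them into one
    bucket of double size (keeping the newer of their timestamps)."""
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))
        i = 0
        while i + 2 < len(buckets):
            if buckets[i][1] == buckets[i + 1][1] == buckets[i + 2][1]:
                ts_newer, size = buckets[i + 1]
                buckets[i + 1:i + 3] = [(ts_newer, size * 2)]
            else:
                i += 1
    return buckets
```

After any number of updates, bucket sizes remain powers of two with at most two buckets per size, which is what bounds the memory at O(log² N).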
Question 16
Complete the description of DGIM memory efficiency and timestamps.
The DGIM method is designed to store only _____ bits to maintain stream statistics over sliding windows.
Each element in the stream is given a timestamp _____, where _____ represents the time steps and N is the stream length.
The answers are: O(log² N) (DGIM's key advantage — it uses only O(log² N) bits of memory instead of O(N), because bucket sizes grow exponentially and there are at most 2 buckets per size level), t mod N (timestamps are taken modulo N so they wrap around within the sliding window of length N, keeping them small), and t (the raw time step counter that is reduced modulo N).
✓ Correct answers: O(log² N) · t mod N · t