You have:
DataFrame A: 128 GB of transactions
DataFrame B: 1 GB user lookup table
Which strategy is correct for broadcasting?
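As a reference point, one common approach is to broadcast the small lookup table (DataFrame B) to every executor so that the 128 GB side does not need to be shuffled for the join. The sketch below assumes a join key named `user_id` and a helper named `join_with_broadcast`, neither of which appears in the question. An explicit hint is used because 1 GB is far above the default `spark.sql.autoBroadcastJoinThreshold` of 10 MB, and the approach only makes sense if B fits comfortably in driver and executor memory.

```python
def join_with_broadcast(transactions, users, key="user_id"):
    """Join the large DataFrame to the small one using a broadcast hint.

    `transactions` stands for DataFrame A (128 GB) and `users` for
    DataFrame B (1 GB). The key column name is an assumption.
    """
    # Ask Spark to ship `users` to every executor. The large side then
    # stays where it is and is joined partition by partition.
    small_side = users.hint("broadcast")
    return transactions.join(small_side, on=key, how="inner")
```

With real DataFrames this would be called as `join_with_broadcast(df_a, df_b)`. Broadcasting the 128 GB side instead would not be workable, since every executor would need a full copy of it.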
3 of 55. A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes.
To remove the duplicates, the engineer adds the code:
df = df.withWatermark("event_timestamp", "30 minutes")
What is the result?
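For context, `withWatermark` on its own only declares how late data may arrive and how long Spark should keep state; it does not drop any rows. A minimal sketch of watermark-bounded deduplication pairs it with a stateful dedup operator. The `event_id` key column and the `dedupe_events` helper are hypothetical, and `dropDuplicatesWithinWatermark` assumes PySpark 3.5 or later.

```python
def dedupe_events(df, key_cols=("event_id",)):
    """Deduplicate a streaming DataFrame within a 30-minute watermark.

    `key_cols` names the columns that identify a logical event; the
    default `event_id` is an assumption, not part of the question.
    """
    # Step 1: declare that duplicates arrive at most 30 minutes apart
    # in event time.
    watermarked = df.withWatermark("event_timestamp", "30 minutes")
    # Step 2: the watermark alone removes nothing, so a dedup operator
    # must follow it. dropDuplicatesWithinWatermark matches rows on the
    # key columns even when their event_timestamp values differ.
    return watermarked.dropDuplicatesWithinWatermark(list(key_cols))
```

The watermark lets Spark discard dedup state for keys older than the threshold, which keeps the state store from growing without bound.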
A data engineer is building an Apache Spark Structured Streaming application to process a stream of JSON events in real time. The engineer wants the application to be fault-tolerant and resume processing from the last successfully processed record in case of a failure. To achieve this, the data engineer decides to implement checkpoints.
Which code snippet should the data engineer use?
A.
query = streaming_df.writeStream \
    .format("console") \
    .option("checkpoint", "/path/to/checkpoint") \
    .outputMode("append") \
    .start()
B.
query = streaming_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start()
C.
query = streaming_df.writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()
D.
query = streaming_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()
A Spark application suffers from too many small tasks due to excessive partitioning. How can this be fixed without a full shuffle?
Options:
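As a reference point, reducing the partition count without a full shuffle is usually done with `coalesce`, which merges existing partitions, whereas `repartition` redistributes every row. The sketch below is a minimal illustration; the helper name `shrink_partitions` and the idea of passing in a target count are assumptions.

```python
def shrink_partitions(df, target):
    """Reduce the number of partitions of `df` without a full shuffle.

    `target` is the desired partition count, chosen by the caller.
    """
    current = df.rdd.getNumPartitions()
    # coalesce can only lower the partition count; asking for more than
    # the current number would leave the DataFrame unchanged anyway.
    if target >= current:
        return df
    # Merge existing partitions into `target` partitions.
    return df.coalesce(target)
```

For example, `shrink_partitions(df, 200)` on a DataFrame with 2000 partitions would yield 200 larger partitions and correspondingly fewer, larger tasks.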
38 of 55. A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json. The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:
Reads directly from /data/input.json.
Infers the schema automatically.
Merges differing schemas.
Which code snippet should the engineer use?
A.
CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeSchema 'true');
B.
CREATE TABLE users
USING json
OPTIONS (path '/data/input.json');
C.
CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', inferSchema 'true');
D.
CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeAll 'true');