Databricks Certified Data Engineer Associate Dumps & Practice Exams
- CertiMaan
- Oct 24, 2025
- 19 min read
Updated: Dec 22, 2025
Prepare effectively for the Databricks Certified Data Engineer Associate exam with these updated dumps and practice exams. Covering all core concepts like data ingestion, Spark SQL, Delta Lake, ETL pipelines, and optimization techniques, this prep material aligns with the latest Databricks certification syllabus. Whether you're reviewing exam questions or taking a timed practice test, our resources are designed to simulate the real exam environment. These dumps help you identify weak areas, improve speed and accuracy, and gain the confidence needed to pass the certification on your first attempt. Ideal for aspiring data engineers working with Apache Spark and Databricks Lakehouse Platform.
Databricks Certified Data Engineer Associate Dumps
1. Which tool is used by Auto Loader to process data incrementally?
Spark Structured Streaming
Databricks SQL
Checkpointing
Unity Catalog
2. Which two components function in the Databricks platform architecture’s control plane? (Choose two.)
Virtual Machines
Compute Orchestration
Compute
Unity Catalog
Serverless Compute
3. Which Git operation must be performed outside of Databricks Repos?
Merge
Pull
Commit
Clone
4. A dataset has been defined using Delta Live Tables and includes an expectations clause: CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE What is the expected behavior when a batch of data containing data that violates these constraints is processed?
Records that violate the expectation are added to the target dataset and recorded as invalid in the event log
Records that violate the expectation cause the job to fail
Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table
Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log
Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset
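For reference, the same ON VIOLATION FAIL UPDATE behavior can also be expressed with the Python Delta Live Tables API; a minimal sketch meant to run inside a DLT pipeline notebook (the dataset and column names are illustrative, not from the question):
import dlt

@dlt.table(name="validated_events")
@dlt.expect_or_fail("valid_timestamp", "timestamp > '2020-01-01'")  # any violating record causes the update to fail
def validated_events():
    # dlt.read() reads another dataset defined in the same pipeline
    return dlt.read("raw_events").select("id", "timestamp", "payload")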
5. A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level. Which of the following tools can the data engineer use to solve this problem?
Delta Lake
Delta Live Tables
Auto Loader
Unity Catalog
6. An engineering manager wants to monitor the performance of a recent project using a Databricks SQL query. For the first week following the project’s release, the manager wants the query results to be updated every minute. However, the manager is concerned that the compute resources used for the query will be left running and cost the organization a lot of money beyond the first week of the project’s release. Which approach can the engineering team use to ensure the query does not cost the organization any money beyond the first week of the project’s release?
They can set a limit to the number of DBUs that are consumed by the SQL Endpoint
They can set the query’s refresh schedule to end after a certain number of refreshes
They can set the query’s refresh schedule to end on a certain date in the query scheduler
They can set a limit to the number of individuals that are able to manage the query’s refresh schedule
7. A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The pipeline is configured to run in Development mode using Continuous Pipeline mode. Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing
All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing
All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated
All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused
All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down
8. A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task. Which approach can the data engineer use to set up the new task?
They can create a new task in the existing Job and then add the original task as a dependency of the new task
They can create a new job from scratch and add both tasks to run concurrently
They can create a new task in the existing Job and then add it as a dependency of the original task
They can clone the existing task in the existing Job and update it to run the new notebook
9. A data engineer has been given a new record of data: id STRING = 'a1' rank INTEGER = 6 rating FLOAT = 9.4 Which SQL command can be used to append the new record to an existing Delta table my_table?
UPDATE VALUES ('a1', 6, 9.4) my_table
UPDATE my_table VALUES ('a1', 6, 9.4)
INSERT INTO my_table VALUES ('a1', 6, 9.4)
INSERT VALUES ('a1', 6, 9.4) INTO my_table
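The standard way to append a single record to a Delta table is INSERT INTO; a minimal sketch, run from a Databricks notebook, using the table and values from the question:
spark.sql("INSERT INTO my_table VALUES ('a1', 6, 9.4)")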
10. A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start. Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?
They can configure the clusters to autoscale for larger data sizes
They can configure the clusters to be single-node
They can use jobs clusters instead of all-purpose clusters
They can use endpoints available in Databricks SQL
They can use clusters that are from a cluster pool
11. Which of the following commands will return the location of database customer360?
ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');
DESCRIBE LOCATION customer360;
DESCRIBE DATABASE customer360;
USE DATABASE customer360;
DROP DATABASE customer360;
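DESCRIBE DATABASE returns the database's metadata, including its storage location; a minimal sketch using the database from the question:
spark.sql("DESCRIBE DATABASE customer360").show(truncate=False)
# the output includes a "Location" row with the database's storage path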
12. What describes the relationship between Gold tables and Silver tables?
Gold tables are more likely to contain a less refined view of data than Silver tables
Gold tables are more likely to contain aggregations than Silver tables
Gold tables are more likely to contain truthful data than Silver tables
Gold tables are more likely to contain valuable data than Silver tables
13. A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task. Which of the following approaches can the data engineer use to set up the new task?
They can create a new job from scratch and add both tasks to run concurrently
They can create a new task in the existing Job and then add the original task as a dependency of the new task
They can create a new task in the existing Job and then add it as a dependency of the original task
They can clone the existing task to a new Job and then edit it to run the new notebook
They can clone the existing task in the existing Job and update it to run the new notebook
14. A data engineer is managing a data pipeline in Databricks, where multiple Delta tables are used for various transformations. The team wants to track how data flows through the pipeline, including identifying dependencies between Delta tables, notebooks, jobs, and dashboards. The data engineer is utilizing the Unity Catalog lineage feature to monitor this process. How does Unity Catalog’s data lineage feature support the visualization of relationships between Delta tables, notebooks, jobs, and dashboards?
Unity Catalog lineage provides an interactive graph that tracks dependencies between tables and notebooks but excludes any job-related dependencies or dashboard visualizations
Unity Catalog lineage only supports visualizing relationships at the table level and does not extend to notebooks, jobs, or dashboards
Unity Catalog provides an interactive graph that visualizes the dependencies between Delta tables, notebooks, jobs, and dashboards, while also supporting column-level tracking of data transformations
Unity Catalog lineage visualizes dependencies between Delta tables, notebooks, and jobs, but does not provide column-level tracing or relationships with dashboards
15. Which of the following commands will return the number of null values in the member_id column?
SELECT count(member_id) FROM my_table;
SELECT null(member_id) FROM my_table;
SELECT count_if(member_id IS NULL) FROM my_table;
SELECT count(member_id) - count_null(member_id) FROM my_table;
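Note the difference in NULL handling: count(*) counts every row, count(col) skips NULLs, and count_if(predicate) counts only rows where the predicate is true. A minimal comparison, using the table and column from the question:
spark.sql("""
    SELECT count(*)                        AS total_rows,
           count(member_id)                AS non_null_rows,
           count_if(member_id IS NULL)     AS null_rows
    FROM my_table
""").show()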
16. A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository. Which Git operation does the data engineer need to run to accomplish this task?
Push
Merge
Pull
Clone
17. A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells. Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?
They can add %sql to the first line of the cell
They can change the default language of the notebook to SQL
They can simply write SQL syntax in the cell
It is not possible to use SQL in a Python notebook
They can attach the cell to a SQL endpoint rather than a Databricks cluster
18. An engineering manager wants to monitor the performance of a recent project using a Databricks SQL query. For the first week following the project’s release, the manager wants the query results to be updated every minute. However, the manager is concerned that the compute resources used for the query will be left running and cost the organization a lot of money beyond the first week of the project’s release. Which of the following approaches can the engineering team use to ensure the query does not cost the organization any money beyond the first week of the project’s release?
They can set a limit to the number of individuals that are able to manage the query’s refresh schedule
They can set the query’s refresh schedule to end on a certain date in the query scheduler
They can set the query’s refresh schedule to end after a certain number of refreshes
They can set a limit to the number of DBUs that are consumed by the SQL Endpoint
They cannot ensure the query does not cost the organization money beyond the first week of the project’s release
19. A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data. Which of the following relational objects should the data engineer create?
Delta Table
View
Temporary view
Spark SQL Table
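A temporary view is scoped to the current session and stores no physical data; a minimal sketch combining two tables (the table, view, and column names are illustrative):
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW combined_view AS
    SELECT a.id, a.amount, b.region
    FROM table_a AS a
    JOIN table_b AS b ON a.id = b.id
""")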
20. Identify how the count_if function and count behave when a column contains NULL values. Consider a table random_values whose single column col1 contains the values 0, 1, 2, NULL, 2, 3. What would be the output of the following query? SELECT count_if(col1 > 1) AS count_a, count(*) AS count_b, count(col1) AS count_c FROM random_values
3 6 6
4 6 5
4 6 6
3 6 5
21. What describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?
CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally
CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static
CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static
CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations
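In the Python DLT API the same distinction appears as dlt.read_stream (incremental processing of new records) versus dlt.read (full recompute from the source); a minimal sketch with illustrative dataset names:
import dlt

@dlt.table(name="orders_silver")
def orders_silver():
    # read_stream processes only newly arrived records on each update (incremental)
    return dlt.read_stream("orders_bronze")

@dlt.table(name="orders_summary")
def orders_summary():
    # read recomputes the result from the full current contents of the source
    return dlt.read("orders_silver").groupBy("region").count()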
22. A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True. Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?
if day_of_week = 1 & review_period: = "True":
if day_of_week = 1 and review_period = "True":
if day_of_week = 1 and review_period:
if day_of_week == 1 and review_period:
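As a reminder of the Python syntax involved: == tests equality (a single = is assignment and is invalid inside an if), and a boolean variable can be used directly as a condition. A minimal sketch:
day_of_week = 1
review_period = True

if day_of_week == 1 and review_period:
    # the final block runs only when both conditions hold
    print("running end-of-period review")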
23. Which two conditions are applicable for governance in Databricks Unity Catalog? (Choose two.)
Both catalog and schema must have a managed location in Unity Catalog provided metastore is not associated with a location
You can have more than 1 metastore within a databricks account console but only 1 per region
If metastore is not associated with location, it’s mandatory to associate catalog with managed locations
You can have multiple catalogs within metastore and 1 catalog can be associated with multiple metastore
If catalog is not associated with location, it’s mandatory to associate schema with managed locations
24. A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location. Which of the following data entities should the data engineer create?
Function
Table
Database
View
Temporary view
25. A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary. Which approach can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?
They can ensure the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint
They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints
They can turn on the Auto Stop feature for the SQL endpoint
They can set up the dashboard’s SQL endpoint to be serverless
26. A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped. Which of the following approaches can the data engineer take to identify the table that is dropping the records?
They cannot determine which table is dropping the records
They can set up separate expectations for each table when developing their DLT pipeline
They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors
They can set up DLT to notify them via email when records are dropped
They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics
27. A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions. In which location can the data engineer review their permissions on the table?
Catalog Explorer
Dashboards
Repos
Jobs
28. A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository. Which of the following Git operations does the data engineer need to run to accomplish this task?
Pull
Merge
Push
Commit
Clone
29. An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results. Which of the following approaches can the manager use to ensure the results of the query are updated each day?
They can schedule the query to run every 12 hours from the Jobs UI
They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL
They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL
They can schedule the query to refresh every 1 day from the query's page in Databricks SQL
30. A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name. They have the following incomplete code block: ____(f"SELECT customer_id, spend FROM {table_name}") What can be used to fill in the blank to successfully complete the task?
spark.table
spark.sql
spark.delta.sql
dbutils.sql
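spark.sql runs a SQL string and returns a DataFrame, so a Python variable can be interpolated with an f-string; a minimal sketch (the example value of table_name is illustrative):
table_name = "customers"  # illustrative value
df = spark.sql(f"SELECT customer_id, spend FROM {table_name}")
df.show()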
31. Which of the following describes the storage organization of a Delta table?
Delta tables are stored in a collection of files that contain only the data stored within the table
Delta tables are stored in a single file that contains only the data stored within the table
Delta tables are stored in a single file that contains data, history, metadata, and other attributes
Delta tables store their data in a single file and all metadata in a collection of files in a separate location
Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes
32. Identify a scenario in which to use an external table. A data engineer needs to create a Parquet bronze table and wants to ensure that it gets stored in a specific path in an external location. Which type of table should be created in this scenario?
An external table where the location is pointing to specific path in external location
An external table where the schema has managed location pointing to specific path in external location
A managed table where the catalog has managed location pointing to specific path in external location
A managed table where the location is pointing to specific path in external location
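A table created with an explicit LOCATION clause is external (unmanaged), so its files live at the path you specify rather than in managed storage; a minimal sketch (the table name, schema, and path are illustrative):
spark.sql("""
    CREATE TABLE bronze_events (id STRING, ts TIMESTAMP, payload STRING)
    USING PARQUET
    LOCATION 'abfss://container@account.dfs.core.windows.net/bronze/events'
""")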
33. Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?
DROP
IGNORE
APPEND
MERGE
INSERT
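MERGE INTO lets a write skip records that already exist in the target table; a minimal sketch merging a staging table into my_table on its key (the staging table and key column are illustrative):
spark.sql("""
    MERGE INTO my_table AS t
    USING updates_staging AS s
    ON t.id = s.id
    WHEN NOT MATCHED THEN INSERT *
""")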
34. A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs. Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?
Cron syntax
pyspark.sql.types.DateType
datetime
pyspark.sql.types.TimestampType
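Databricks job schedules are defined with Quartz cron expressions, so a complex schedule can be copied between Jobs as a single string (for example through the Jobs API); a minimal sketch of a schedule payload, where the expression shown (6:00 AM daily) is purely illustrative:
schedule = {
    "quartz_cron_expression": "0 0 6 * * ?",  # second minute hour day-of-month month day-of-week
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}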
35. A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values. Which of the following describes why Auto Loader inferred all of the columns to be of the string type?
All of the fields had at least one null value
Auto Loader only works with string data
Auto Loader cannot infer the schema of ingested data
JSON data is a text-based format
There was a type mismatch between the specific schema and the inferred schema
36. What is the maximum notebook output size supported by a job cluster to ensure the notebook does not fail?
10 MB
25 MB
15 MB
30 MB
37. What is used by Spark to record the offset range of the data being processed in each trigger in order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing?
Replayable Sources and Idempotent Sinks
Checkpointing and Write-ahead Logs
Write-ahead Logs and Idempotent Sinks
Checkpointing and Idempotent Sinks
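In practice, this progress tracking is enabled by giving the stream's writer a checkpoint location; a minimal Structured Streaming sketch (the table names and path are illustrative):
(spark.readStream
    .table("events_bronze")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events_silver")  # offsets and write-ahead log are stored here
    .trigger(availableNow=True)
    .toTable("events_silver"))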
38. Which of the following must be specified when creating a new Delta Live Tables pipeline?
At least one notebook library to be executed
A path to cloud storage location for the written data
A location of a target database for the written data
A key-value pair configuration
39. Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?
Silver tables contain a less refined, less clean view of data than Bronze data
Silver tables contain aggregates while Bronze data is unaggregated
Silver tables contain less data than Bronze tables
Silver tables contain more data than Bronze tables
Silver tables contain a more refined and cleaner view of data than Bronze tables
40. In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?
When the location of the data needs to be changed
When the source is not a Delta table
When the target table is an external table
When the target table cannot contain duplicate records
41. A data engineer has left the organization. The data team needs to transfer ownership of the data engineer’s Delta tables to a new data engineer. The new data engineer is the lead engineer on the data team. Assuming the original data engineer no longer has access, which of the following individuals must be the one to transfer ownership of the Delta tables in Data Explorer?
Databricks account representative
New lead data engineer
Original data engineer
Workspace administrator
This transfer is not possible
42. A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job. Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?
They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints
They can ensure the dashboard's SQL endpoint is not one of the included query's SQL endpoint
They can set up the dashboard's SQL endpoint to be serverless
They can reduce the cluster size of the SQL endpoint
They can turn on the Auto Stop feature for the SQL endpoint
43. Which of the following describes the relationship between Bronze tables and raw data?
Bronze tables contain more truthful data than raw data
Bronze tables contain a less refined view of data than raw data
Bronze tables contain raw data with a schema applied
Bronze tables contain less data than raw data files
44. A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run. Which of the following tools can the data engineer use to solve this problem?
Delta Lake
Auto Loader
Databricks SQL
Unity Catalog
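Auto Loader keeps track of which files in the source directory have already been ingested, so each run picks up only new files; a minimal sketch (the paths and target table name are illustrative, and cloudFiles.inferColumnTypes addresses the all-string inference behavior described in question 35 above):
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/shared_dir")   # where the inferred schema is tracked
    .option("cloudFiles.inferColumnTypes", "true")                    # infer numeric/boolean types instead of strings
    .load("/mnt/shared_dir/"))

(df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/shared_dir")      # tracks which files have already been processed
    .trigger(availableNow=True)
    .toTable("ingested_files"))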
45. In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?
When another task needs to use as little compute resources as possible
When another task needs to successfully complete before the new task begins
When another task has the same dependency libraries as the new task
When another task needs to be replaced by the new task
46. A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level. Which of the following tools can the data engineer use to solve this problem?
Unity Catalog
Delta Lake
Data Explorer
Delta Live Tables
Auto Loader
47. Which file format is used for storing Delta Lake Table?
CSV
Parquet
Delta
JSON
48. Which of the following data workloads will utilize a Gold table as its source?
A job that cleans data by removing malformatted records
A job that enriches data by parsing its timestamps into a human-readable format
A job that ingests raw data from a streaming source into the Lakehouse
A job that queries aggregated data designed to feed into a dashboard
A job that aggregates uncleaned data to create standard summary statistics
49. A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL. Which of the following commands could the data engineering team use to access sales in PySpark?
spark.sql("sales")D. spark.delta.table("sales")
SELECT * FROM sales
There is no way to share data between PySpark and SQL
spark.table("sales")
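Accessing a SQL-created table from PySpark is a one-liner; a minimal sketch (the column used in the check is illustrative):
sales_df = spark.table("sales")
sales_df.filter("customer_id IS NULL").count()  # e.g., a simple null-count check on the shared table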
50. A data analysis team has noticed that their Databricks SQL queries are running too slowly when connected to their always-on SQL endpoint. They claim that this issue is present when many members of the team are running small queries simultaneously. They ask the data engineering team for help. The data engineering team notices that each of the team’s queries uses the same SQL endpoint. Which approach can the data engineering team use to improve the latency of the team’s queries?
They can increase the maximum bound of the SQL endpoint’s scaling range
They can increase the cluster size of the SQL endpoint
They can turn on the Auto Stop feature for the SQL endpoint
They can turn on the Serverless feature for the SQL endpoint
51. What is stored in the Databricks customer's cloud account?
Databricks web application
Cluster management metadata
Notebooks
Data
52. What is a benefit of the Databricks Lakehouse Architecture embracing open source technologies?
Cloud-specific integrations
Simplified governance
Ability to scale workloads
Avoiding vendor lock-in
53. A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location. Which of the following data entities should the data engineer create?
Temporary view
Table
View
Function
54. A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame. Which of the following describes how a data lakehouse could alleviate this issue?
Both teams would reorganize to report to the same department
Both teams would respond more quickly to ad-hoc requests
Both teams would be able to collaborate on projects in real-time
Both teams would use the same source of truth for their work
55. A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The pipeline is configured to run in Production mode using Continuous Pipeline mode. What is the expected outcome after clicking Start to update the pipeline, assuming previously unprocessed data exists and all definitions are valid?
All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped
All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing
All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated
All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing
56. Which SQL keyword can be used to convert a table from a long format to a wide format?
CONVERT
SUM
PIVOT
TRANSFORM
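A minimal PIVOT sketch turning one row per (id, quarter) into one column per quarter (the table and column names are illustrative):
spark.sql("""
    SELECT * FROM sales_long
    PIVOT (
        SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
    )
""").show()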
57. Which type of workloads are compatible with Auto Loader?
Machine learning workloads
Batch workloads
Serverless workloads
Streaming workloads
FAQs
1. What is the Databricks Certified Data Engineer Associate exam?
The Databricks Certified Data Engineer Associate exam validates your ability to build and manage data pipelines, use Databricks tools, and optimize data workflows on the Databricks Lakehouse Platform.
2. How do I become a Databricks Certified Data Engineer Associate?
You must pass the Databricks Certified Data Engineer Associate exam, which assesses your understanding of data ingestion, transformation, storage, and governance using Databricks.
3. What are the prerequisites for the Databricks Certified Data Engineer Associate exam?
There are no official prerequisites, but it’s recommended that you have basic knowledge of SQL, Python, and data engineering concepts.
4. How much does the Databricks Certified Data Engineer Associate certification cost?
The exam costs $200 USD, though the price may vary by region.
5. How many questions are in the Databricks Certified Data Engineer Associate exam?
The exam includes 45 multiple-choice and multiple-select questions to be completed within 90 minutes.
6. What topics are covered in the Databricks Certified Data Engineer Associate exam?
It covers Databricks workspace basics, Delta Lake, ETL processes, data transformation, and pipeline management.
7. How difficult is the Databricks Certified Data Engineer Associate exam?
It’s considered moderately challenging, requiring hands-on experience with Databricks and familiarity with data engineering practices.
8. How long does it take to prepare for the Databricks Certified Data Engineer Associate exam?
Most candidates prepare in 6–8 weeks, depending on prior experience with Databricks and data tools.
9. What jobs can I get after earning the Databricks Certified Data Engineer Associate certification?
You can work as a Data Engineer, ETL Developer, Big Data Engineer, or Cloud Data Specialist.
10. How much salary can I earn with a Databricks Certified Data Engineer Associate certification?
Certified professionals typically earn between $95,000–$130,000 per year, depending on their experience and job role.
