15 Databricks Best Practices for Seamless Implementation on Azure

With the growing number of enterprises adopting Azure Databricks, it is important to follow certain best practices and procedures to fully embrace the platform. Data teams can leverage Azure data platform accelerators and Azure Databricks security best practices to build and deploy a highly scalable and secure data platform. Enterprises can use Azure Databricks to develop data engineering workflows, integrate ML and AI, and create powerful and innovative dashboards.

Follow these 15 Databricks best practices for a seamless implementation on the Azure platform.

1. Databricks Workspaces

Assign workspaces based on a related group of people working together collaboratively. This helps streamline the access control matrix within the workspace (folders, notebooks, etc.) and across all resources that the workspace interacts with (ADLS, Azure SQL DB, Synapse, etc.).

2. Separate workspaces by environment and deploy them into separate subscriptions

There are various limits at the workspace level that may impact your environment. To get the most out of these limits, it is recommended to use separate workspaces for prod, test, and dev environments.

Key workspace limits are:

The maximum number of jobs that a workspace can create in an hour is 5000

At any time, you cannot have more than 1000 jobs simultaneously running in a workspace

There can be a maximum of 145 notebooks attached to a cluster

3. Isolate Each Workspace in its own VNet

Deploy only one workspace per VNet. This aligns with ADB’s workspace-level isolation model. Use VNet peering to extend the private IP space of the workspace VNet.

4. Do not store data in Default DBFS Folders

Every workspace comes with a default DBFS, primarily designed to store libraries and other system-level configuration artifacts such as init scripts. You should not store any production data in it, because the lifecycle of the default DBFS is tied to the workspace: deleting the workspace also deletes the default DBFS and permanently removes its contents. Write production data to an external location instead, as sketched below.
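A minimal sketch, assuming a Databricks notebook context, of persisting production data to an external ADLS Gen2 location rather than the default DBFS root; the storage account, container, secret scope, and paths below are placeholders, not values from this article.

storage_account = "mydatalake"   # placeholder storage account
container = "curated"            # placeholder container

# Authenticate with an account key pulled from a secret scope (see tip 7);
# the scope and key names are placeholders.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="kv-scope", key="datalake-key")
)

# df is an existing Spark DataFrame. Writing to an external ADLS path keeps
# the data independent of the workspace lifecycle, unlike the default DBFS.
df.write.format("delta").mode("overwrite").save(
    f"abfss://{container}@{storage_account}.dfs.core.windows.net/sales/daily"
)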

5. Use Log Analytics workspace to understand Databricks utilization

Monitoring Databricks resource utilization is useful in arriving at the correct cluster and VM sizes. Each VM has a set of limits which play an important role in determining the performance of an Azure Databricks job. To get utilization metrics for an Azure Databricks cluster, you can stream the VMs’ metrics to an Azure Log Analytics workspace by installing the Log Analytics agent on each cluster node.

6. Selecting Programming Language

Databricks offers Standard and High Concurrency mode clusters. A High Concurrency cluster supports R, Python, and SQL, whereas a Standard cluster supports Scala, Java, SQL, Python, and R.

Databricks uses Scala for its underlying processing engine, so Scala generally performs better than Python and SQL. Therefore, on a Standard cluster, Scala is the recommended language for developing Spark jobs.

7. Use Key Vault for Storing Access Keys

Avoid hardcoding sensitive information in your code. Store all sensitive information, such as storage account keys, database usernames, and database passwords, in a key vault, and access it in Databricks through a secret scope, as in the sketch below.
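A minimal sketch of reading credentials through a Key Vault-backed secret scope at runtime; the scope, key, server, and table names are placeholders. dbutils.secrets.get redacts the returned value in notebook output.

# Fetch secrets at runtime instead of hardcoding them in the notebook.
jdbc_user = dbutils.secrets.get(scope="kv-scope", key="sql-username")
jdbc_pass = dbutils.secrets.get(scope="kv-scope", key="sql-password")

orders_df = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.orders")
    .option("user", jdbc_user)
    .option("password", jdbc_pass)
    .load())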

8. Notebooks Organization

It is recommended to create separate folders for each group of users. Store the notebooks in group folders.

9. AutoComplete to Avoid Typographical Errors

Use the ‘Tab’ key for auto-complete suggestions to eliminate typographical errors.

10. Use the ‘Format SQL’ Option for Formatting the SQL Cells

Databricks offers a dedicated feature for formatting SQL cells. The “Format SQL code” option can be found in the “Edit” menu or invoked by pressing Ctrl+Shift+F.

11. Use ‘Advisor’ Option

The Advisor option analyses the entire run and suggests optimizations to increase the efficiency of the job.

12. Notebook Chaining

It is always a good practice to include all repeatedly used operations, such as read/write on the Data Lake, SQL Database, etc., in one generic notebook. The same notebook can be used to set Spark configurations, mount ADLS paths to DBFS, fetch secrets from the secret scope, etc.

To use the operations defined in the generic notebook from other notebooks, invoke it with the %run command, as sketched below.
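A minimal sketch of chaining a generic setup notebook with %run; the notebook path ./utils/common_setup and the helper read_from_lake are hypothetical names used only for illustration.

# Cell 1: %run must be the only command in its cell. It executes the referenced
# notebook in the current session, so its functions, variables, and mounts
# become available here.
%run ./utils/common_setup

# Cell 2: call a hypothetical helper defined in the generic notebook.
orders_df = read_from_lake("raw/orders")

When the called notebook should run as an independent job and return a value, dbutils.notebook.run() is the alternative to %run.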

13. Viewing the Content of a File

Use the dbutils.fs.head() command to inspect file content rather than loading the data into a DataFrame and then displaying it.
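A quick sketch; the file path is a placeholder, and the second argument caps how many bytes are returned.

# Peek at the first 1 KB of a file without creating a DataFrame.
print(dbutils.fs.head("dbfs:/mnt/raw/orders/2021-01.csv", 1024))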

14. Use ADF for orchestrating Databricks Notebooks

The ADF pipeline uses pipeline variables to store configuration details. When the Databricks notebook is invoked within the ADF pipeline, the configuration details are passed from the pipeline variables to Databricks widget variables, thereby eliminating hardcoding in the Databricks notebooks.

15. Widget Variables

The configuration details are made accessible to the Databricks code through widget variables. The configuration data is transferred from pipeline variables to widget variables when the notebook is invoked from the ADF pipeline, as sketched below.
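A minimal sketch of reading ADF-supplied parameters through widgets; the widget names and default values are placeholders. The base parameters of ADF’s Databricks Notebook activity override widgets with matching names.

# Declare widgets with defaults so the notebook also runs interactively.
dbutils.widgets.text("storage_account", "mydatalake")   # placeholder default
dbutils.widgets.text("input_path", "raw/orders")        # placeholder default

# Read the values supplied by the ADF pipeline (or the defaults when run manually).
storage_account = dbutils.widgets.get("storage_account")
input_path = dbutils.widgets.get("input_path")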

Tarun Agarwal, Team Motifworks

VP - Cloud Solutions | Motifworks

Known as a Data Analytics thought leader who fuels data-driven transformations for Fortune 500 firms, Tarun’s passion is to tell the “story” of the data that is hidden in an enterprise’s data assets. He does this flawlessly by leveraging Big Data, Machine Learning, AI, and cloud platforms. Tarun’s expertise lies in modernizing data platforms through cutting-edge technology solutions and at Motifworks, Tarun leads the Data & AI practice.

Data and Advanced Analytics Strategy Workshop

No Obligation. No Charge. Expert-led 3-hour session that will help you innovate with data.