30 Days DevOps Challenge - NBA Player Stats


3 min read

#Week1-Day3 #DevOpsAllStarsChallenge

Automating the deployment of Azure Data Factory with Python and creating its pipelines

In this blog post, we'll walk through an exciting DevOps challenge from Week 1 - Day 3, where we automate the creation and configuration of an Azure Storage Account and Blob Container. The setup enables public access and then runs an additional script (adf.py) to create an Azure Data Factory. The Data Factory pulls player information from sportsdata.io, transforms the data by removing unnecessary details, and stores it in the blob container.

Prerequisites

Before we dive in, ensure you have the following:

  • Python 3.x

  • Azure SDK for Python

  • dotenv package for loading environment variables

  • An Azure subscription with appropriate permissions

Installation

  1. Clone the repository:

     git clone https://github.com/annoyedalien/week1-day3.git
     cd week1-day3
    
  2. Create a virtual environment and activate it:

     python -m venv venv
     source venv/bin/activate
    
  3. Install the required Python packages:

     pip install -r requirements.txt
    
  4. Create a .env file in the root directory of the project and add the following environment variables:

     AZURE_SUBSCRIPTION_ID=your_subscription_id
     RESOURCE_GROUP_NAME=your_resource_group_name
     STORAGE_ACCOUNT_NAME=your_storage_account_name
     LOCATION=your_location
     CONTAINER_NAME=your_container_name
     DATA_FACTORY_NAME=your_datafactory_name
     REST_API_URL=https://api.sportsdata.io/v3/nba/scores/json/Players
     SUBSCRIPTION_KEY=your_api_key
     LS_REST_NAME=linked_service_rest_name
     LS_BLOB_NAME=linked_service_blob_name
    

Usage

Run the script:

python main.py

The script will:

  1. Check if the specified resource group exists and create it if it doesn't.

  2. Check if the specified storage account name is available and create the storage account if it doesn't exist.

  3. Enable public access on the storage account.

  4. Create a blob container with anonymous access if it doesn't exist.

  5. Run the adf.py script as a subprocess.
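
For orientation, here is a condensed sketch of that flow. It is an approximation rather than the exact main.py: the variable names come from the .env above, and DefaultAzureCredential is assumed for authentication.

import os
import subprocess
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient
from azure.storage.blob import BlobServiceClient

load_dotenv()
credential = DefaultAzureCredential()
subscription_id = os.getenv("AZURE_SUBSCRIPTION_ID")
rg = os.getenv("RESOURCE_GROUP_NAME")
account = os.getenv("STORAGE_ACCOUNT_NAME")
location = os.getenv("LOCATION")
container = os.getenv("CONTAINER_NAME")

# 1. Resource group: create it only if it does not exist yet
resource_client = ResourceManagementClient(credential, subscription_id)
if not resource_client.resource_groups.check_existence(rg):
    resource_client.resource_groups.create_or_update(rg, {"location": location})

# 2-3. Storage account: create it if the name is still available,
# with public blob access allowed at the account level
storage_client = StorageManagementClient(credential, subscription_id)
if storage_client.storage_accounts.check_name_availability({"name": account}).name_available:
    storage_client.storage_accounts.begin_create(
        rg,
        account,
        {
            "location": location,
            "kind": "StorageV2",
            "sku": {"name": "Standard_LRS"},
            "allow_blob_public_access": True,
        },
    ).result()

# 4. Blob container with anonymous read access, created only if missing
keys = storage_client.storage_accounts.list_keys(rg, account)
blob_service = BlobServiceClient(
    f"https://{account}.blob.core.windows.net",
    credential=keys.keys[0].value,
)
if not blob_service.get_container_client(container).exists():
    blob_service.create_container(container, public_access="blob")

# 5. Hand off to adf.py to build the Data Factory
subprocess.run(["python", "adf.py"], check=True)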

The adf.py script will:

  • Create an Azure Data Factory, linked services, datasets, and pipelines.

  • Use config.py to define the properties of the resources to be created.
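
To give a feel for what adf.py does in code, here is a heavily trimmed sketch that creates the factory and the REST linked service with azure-mgmt-datafactory. It is only an outline: the real adf.py also creates the blob linked service, both datasets, and the copy pipeline from the definitions in config.py.

import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    Factory,
    LinkedServiceResource,
    RestServiceLinkedService,
)

load_dotenv()
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, os.getenv("AZURE_SUBSCRIPTION_ID"))

rg = os.getenv("RESOURCE_GROUP_NAME")
factory_name = os.getenv("DATA_FACTORY_NAME")

# Create the Data Factory itself
adf_client.factories.create_or_update(
    rg, factory_name, Factory(location=os.getenv("LOCATION"))
)

# Linked service pointing at the sportsdata.io REST endpoint
rest_ls = LinkedServiceResource(
    properties=RestServiceLinkedService(
        url=os.getenv("REST_API_URL"),
        authentication_type="Anonymous",
    )
)
adf_client.linked_services.create_or_update(
    rg, factory_name, os.getenv("LS_REST_NAME"), rest_ls
)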

Script Details

  • Resource Management: Uses ResourceManagementClient to manage Azure resource groups.

  • Storage Management: Uses StorageManagementClient to manage Azure storage accounts.

  • Blob Service: Uses BlobServiceClient to manage blob containers and enable public access.

  • Environment Variables: Loads configuration from a .env file using the dotenv package.

  • Subprocess: Runs an additional script (adf.py) after setting up the storage account and container.

adf.py and config.py

  • adf.py: Contains the creation of Azure Data Factory along with its linked services, datasets, and pipelines.

  • config.py: Contains the properties of the linked services, datasets, and pipelines.

Launch Data Factory Studio

After running the script, launch Data Factory Studio in Azure and run the pipeline. Then navigate to the storage account container and check the blob received from the Data Factory.

  1. Click on Author

  2. Choose Pipeline

  3. Run Debug

Azure Data Factory

Let's analyze what happened after manually running the pipeline.

The REST dataset issues a GET request to sportsdata.io with the help of the linked service we created.

By clicking on Preview Data, we can see all the information requested.
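
If you want to sanity-check the same call outside Data Factory, a quick request from Python shows the raw payload. This assumes the key is sent as the Ocp-Apim-Subscription-Key header, which is one of the ways sportsdata.io accepts it.

import os
import requests
from dotenv import load_dotenv

load_dotenv()
resp = requests.get(
    os.getenv("REST_API_URL"),
    headers={"Ocp-Apim-Subscription-Key": os.getenv("SUBSCRIPTION_KEY")},
    timeout=30,
)
resp.raise_for_status()
players = resp.json()
print(len(players), "players returned")
print(players[0])  # first player profile, every field still included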

In a scenario where some of the information gathered is not required, we need to transform the data before it reaches the sink.

With the help of Mapping, we can import the schema and remove the unnecessary information.

The mapping schema is defined in config.py, which adf.py uses when creating the pipeline.
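
Under the hood this mapping is a TabularTranslator on the copy activity. A plausible shape for that entry in config.py is sketched below; the field names are only examples of what you might keep, not the author's exact schema.

# Hypothetical excerpt from config.py: keep only the fields we care about
MAPPING = {
    "type": "TabularTranslator",
    "mappings": [
        {"source": {"path": "$['PlayerID']"}, "sink": {"path": "PlayerID"}},
        {"source": {"path": "$['FirstName']"}, "sink": {"path": "FirstName"}},
        {"source": {"path": "$['LastName']"}, "sink": {"path": "LastName"}},
        {"source": {"path": "$['Team']"}, "sink": {"path": "Team"}},
        {"source": {"path": "$['Position']"}, "sink": {"path": "Position"}},
    ],
}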

The output is then written to the storage account container as a blob named player.json.
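
You can also confirm the blob landed without opening the portal. Here is a short check with the blob SDK, assuming the container allows anonymous read as configured earlier:

import os
from dotenv import load_dotenv
from azure.storage.blob import BlobClient

load_dotenv()
blob = BlobClient(
    account_url=f"https://{os.getenv('STORAGE_ACCOUNT_NAME')}.blob.core.windows.net",
    container_name=os.getenv("CONTAINER_NAME"),
    blob_name="player.json",
)
data = blob.download_blob().readall()
print(f"player.json is {len(data)} bytes")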

By following this guide, you'll be able to automate the process of pulling NBA player profile stats from sportsdata.io, transforming the data, and storing it in an Azure Blob Container. Happy coding! 🏀🚀

Clean Up

Clean up the resources by deleting the resource group that was created:

az group list
az group delete --name [Resource Group Name]

Special thanks to Alicia Ahl for the project. Check out her video here: https://www.youtube.com/watch?v=RAkMac2QgjM&t=0s
