Bart Simons

How I automate batch processing on Azure with PowerShell

Bart Simons

One of the Android apps that I have made uses a backend that needs to be fed with data on a weekly basis. In the past I ran my scrapers locally on my laptop, but manageability and scheduling became a problem. I have now set this process up to run semi-automatically on Azure, using the Blob Storage and VM services. In this article, I will walk through the steps I took to automate my batch processes on Azure.

Preparation

First of all, I needed to store my bootstrap script and scraper source files somewhere, and Azure Blob Storage seemed like a perfect fit. I started by creating a new resource group and a storage account within it, specifically a StorageV2 account, since the same account also needs to store the boot diagnostics.

I created two containers: one for the bootstrap script and one for the scraper source files. I chose shared access signature (SAS) links because they were the most straightforward way to create download links for all the files I needed.
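Every SAS link carries its own expiry time in the `se` query parameter, so a script can check a link before relying on it. The following is a minimal sketch and not part of the original setup: the function names `sas_expiry` and `sas_is_valid` are made up for illustration, and it assumes the timestamps use the UTC `YYYY-MM-DDThh:mm:ssZ` form seen in the links in this article.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the original scripts).

# Print the value of the "se" (signed expiry) query parameter of a SAS URL.
# Percent-encoded colons (%3A) are decoded so both URL styles give the same result.
sas_expiry() {
    printf '%s\n' "$1" |
        sed -n 's/.*[?&]se=\([^&]*\).*/\1/p' |
        sed 's/%3[Aa]/:/g'
}

# Succeed when the SAS URL in $1 expires after the UTC timestamp in $2.
# Both timestamps are reduced to digits (YYYYMMDDhhmmss) and compared as numbers.
sas_is_valid() {
    expiry=$(sas_expiry "$1" | tr -cd '0-9')
    now=$(printf '%s' "$2" | tr -cd '0-9')
    [ -n "$expiry" ] && [ "$expiry" -gt "$now" ]
}

url='https://bartstorageaccount.blob.core.windows.net/bootscripts/script.sh?sp=r&st=2018-06-12T11:03:37Z&se=2018-06-15T19:03:37Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxx&sr=b'

if sas_is_valid "$url" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"; then
    echo "SAS link is valid until $(sas_expiry "$url")"
else
    echo "SAS link has expired or has no expiry: $(sas_expiry "$url")"
fi
```

Comparing digits only keeps the check portable across `sh` implementations, at the cost of ignoring time zone offsets other than `Z`.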

The automation part

I use PowerShell to launch and destroy my VMs, so I wrapped each part in a function like this:

$Location                = "West Europe"
$ResourceGroup           = "BartResourceGroup"
$SettingsString          = '{"fileUris":["https://bartstorageaccount.blob.core.windows.net/bootscripts/script.sh?sp=r&st=2018-06-12T11:03:37Z&se=2018-06-15T19:03:37Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxx&sr=b"],"commandToExecute":"sh script.sh"}'

Connect-AzureRmAccount

Function Create-ScraperVM {
    # Create a subnet
    $SubnetConfig = New-AzureRmVirtualNetworkSubnetConfig -Name "PrimarySubnet" -AddressPrefix 192.168.1.0/24

    # Create a virtual network
    $VirtualNetwork = New-AzureRmVirtualNetwork -ResourceGroupName $ResourceGroup -Location $Location -Name "PrimaryVirtualNet" -AddressPrefix 192.168.0.0/16 -Subnet $SubnetConfig

    # Create a public IP address
    $PublicIP = New-AzureRmPublicIpAddress -ResourceGroupName $ResourceGroup -Location $Location -AllocationMethod Static -IdleTimeoutInMinutes 4 -Name "PublicIPScraper"

    # Create inbound NSG rule for SSH
    $NSGRuleSSH = New-AzureRmNetworkSecurityRuleConfig -Name "NSGRuleSSH" -Protocol "Tcp" -Direction "Inbound" -Priority 1000 -SourceAddressPrefix * -SourcePortRange * -DestinationAddressPrefix * -DestinationPortRange 22 -Access "Allow"

    # Create NSG
    $NSG = New-AzureRmNetworkSecurityGroup -ResourceGroupName $ResourceGroup -Location $Location -Name "PrimaryNetworkSecurityGroup" -SecurityRules $NSGRuleSSH

    # Create NIC
    $NIC = New-AzureRmNetworkInterface -Name "PrimaryNIC" -ResourceGroupName $ResourceGroup -Location $Location -SubnetId $VirtualNetwork.Subnets[0].Id -PublicIpAddressId $PublicIP.Id -NetworkSecurityGroupId $NSG.Id

    # Create credentials
    $SecurePassword = ConvertTo-SecureString 'MyPasswordGoesHere' -AsPlainText -Force
    $Credentials = New-Object System.Management.Automation.PSCredential ("bart", $SecurePassword)

    # Create virtual machine configuration
    $VMConfig = New-AzureRmVMConfig -VMName "ScraperVM" -VMSize "Standard_F1s" | Set-AzureRmVMOperatingSystem -Linux -ComputerName "ScraperVM" -Credential $Credentials | Set-AzureRmVMSourceImage -PublisherName "Canonical" -Offer "UbuntuServer" -Skus "18.04-LTS" -Version "latest" | Add-AzureRmVMNetworkInterface -Id $NIC.Id

    # Create the virtual machine
    New-AzureRmVM -ResourceGroupName $ResourceGroup -Location $Location -VM $VMConfig

    # Set extension for VM
    Set-AzureRmVMExtension -ResourceGroupName $ResourceGroup -Location $Location -VMName "ScraperVM" -Name "CustomScriptForLinux" -Publisher "Microsoft.OSTCExtensions" -Type "CustomScriptForLinux" -TypeHandlerVersion "1.4" -SettingString $SettingsString
}

Function Remove-ScraperVM {
    # Remove the VM
    Remove-AzureRmVM -Name "ScraperVM" -ResourceGroupName $ResourceGroup -Force

    # Get the disk in the resource group and delete it
    Get-AzureRmDisk -ResourceGroupName $ResourceGroup | Remove-AzureRmDisk -Force   

    # Get the primary NIC and remove it
    Get-AzureRmNetworkInterface -ResourceGroupName $ResourceGroup | Remove-AzureRmNetworkInterface -Force

    # Get the public IP and remove it
    Get-AzureRmPublicIpAddress -ResourceGroupName $ResourceGroup | Remove-AzureRmPublicIpAddress -Force

    # Get the virtual network and remove it
    Get-AzureRmVirtualNetwork -ResourceGroupName $ResourceGroup | Remove-AzureRmVirtualNetwork -Force

    # Get the NSG and remove it
    Get-AzureRmNetworkSecurityGroup -ResourceGroupName $ResourceGroup | Remove-AzureRmNetworkSecurityGroup -Force 
}

Create-ScraperVM
Remove-ScraperVM

As you can see, I call the Remove-ScraperVM function right after calling the Create-ScraperVM function. This works because I use the CustomScriptForLinux VM extension, which, simply put, runs the bootstrap script once the VM has fully started. The Set-AzureRmVMExtension cmdlet blocks until the script has finished executing.

Now let's take a look at my bootstrap script itself (script.sh):

#!/bin/sh

sudo apt update && sudo apt -y upgrade
sudo apt install -y firefox python3 python3-pymongo python3-selenium python3-retrying sshpass

wget "https://github.com/mozilla/geckodriver/releases/download/v0.20.1/geckodriver-v0.20.1-linux64.tar.gz" -O geckodriver.tar.gz
tar -xvzf geckodriver.tar.gz
chmod +x geckodriver
mv geckodriver /usr/bin/geckodriver

# Download all scrapers into the current working directory
wget "https://bartstorageaccount.blob.core.windows.net/scrapers/scraper1.py?sp=r&st=2018-06-12T11:04:53Z&se=2018-06-15T19:04:53Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&sr=b" -O scraper1.py
wget "https://bartstorageaccount.blob.core.windows.net/scrapers/scraper2.py?sp=r&st=2018-06-12T11:05:22Z&se=2018-06-15T19:05:22Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&sr=b" -O scraper2.py
wget "https://bartstorageaccount.blob.core.windows.net/scrapers/scraper3.py?sp=r&st=2018-06-12T11:05:52Z&se=2018-06-15T19:05:52Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&sr=b" -O scraper3.py

sshpass -p PasswordGoesHere ssh user@databaseserver.local -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -N -L 27017:172.17.0.1:27017 &

python3 scraper1.py
python3 scraper2.py
python3 scraper3.py

After the last scraper (scraper3.py) finishes, the Create-ScraperVM function returns, which is exactly when we want to call the Remove-ScraperVM function to remove all the resources that have been created.
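script.sh runs the three scrapers one after another without looking at their exit codes. If you would rather have the bootstrap script's own exit status reflect whether every scraper succeeded, a small wrapper is one option. This is only a sketch: `run_all` is a hypothetical helper that does not appear in the original script.

```shell
#!/bin/sh
# Hypothetical helper: run each argument as a command line, in order,
# keep going after a failure, and return non-zero if any command failed.
run_all() {
    failed=0
    for cmd in "$@"; do
        echo "Running: $cmd"
        if ! sh -c "$cmd"; then
            echo "Failed: $cmd" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# In script.sh this could replace the three python3 lines:
# run_all "python3 scraper1.py" "python3 scraper2.py" "python3 scraper3.py" || exit 1
```

Continuing after a failure keeps the behaviour of the original script, where one broken scraper does not stop the others from running.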

The routine that I created can be reused as often as needed, as long as everything it depends on still works: the signatures for the download links must still be valid, and the SSH tunnel connection to the database must still function properly.
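script.sh starts the SSH tunnel in the background and launches the first scraper straight away, so one thing that can go wrong is that the tunnel is not ready yet. A generic retry loop is one way to guard against that. In this sketch, `retry` is a hypothetical helper, and the `nc -z` port probe in the usage comment assumes that a netcat build with `-z` support is present on the VM, which the original script does not install explicitly.

```shell
#!/bin/sh
# Hypothetical helper: retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "Giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Possible use in script.sh, right after starting the tunnel
# (assumes netcat with -z support is available on the VM):
# retry 10 3 nc -z 127.0.0.1 27017 || exit 1
```

The same helper could wrap the `wget` downloads, although an expired SAS signature will keep failing no matter how often it is retried.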

I hope that this article gives you some insight into the possibilities for your workloads on Azure. My solution is not necessarily the best one (for example, Azure Batch jobs might be a better fit, but I have yet to get my hands dirty with that).
