Bart Simons


Bart Simons
Author


How I automate batch processing on Azure with PowerShell

One of the Android apps I have made uses a backend that needs to be fed with data on a weekly basis. In the past I kept my scrapers running locally on my laptop, but manageability and scheduling became a problem. I have now set this process up to run semi-automatically on Azure, using the blob storage and VM services. In this article, I will walk through the steps I took to automate my batch processes on Azure.

Preparation

First of all, I needed to store my bootstrap script and scraper source files somewhere, and Azure blob storage seemed like a perfect fit. I started by creating a new resource group and a storage account within it, specifically a StorageV2 account, since the same account also needs to store boot diagnostics.

I created two containers: one for the bootstrap script and one for the scraper source files. I chose shared access signature (SAS) links because they were the most straightforward way to create download links for all the files needed.
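Every SAS link carries its own expiry time in the se query parameter, so it can be worth checking that value before starting a run. The following is a minimal sketch and not part of the setup described here: the function names sas_expiry and sas_is_valid are hypothetical, the se value is assumed to use the YYYY-MM-DDTHH:MM:SSZ form, and a date command that supports -d (as on Ubuntu) is assumed.

```shell
#!/bin/sh

# Hypothetical helper: print the "se" (signed expiry) value of a SAS URL.
sas_expiry() {
    # Strip everything up to the first '?', split the query string on '&',
    # keep only the field that starts with "se=", and decode encoded colons.
    printf '%s\n' "${1#*\?}" | tr '&' '\n' | sed -n 's/^se=//p' | sed 's/%3[Aa]/:/g'
}

# Hypothetical helper: return 0 if the SAS URL has not expired yet, 1 otherwise.
# The optional second argument overrides "now" (epoch seconds) for testing.
sas_is_valid() {
    expiry=$(sas_expiry "$1")
    [ -n "$expiry" ] || return 1

    # Assumption: the expiry looks like 2018-06-15T19:03:37Z (UTC).
    normalized=$(printf '%s' "$expiry" | sed 's/T/ /; s/Z$//')
    expiry_epoch=$(date -u -d "$normalized" +%s) || return 1

    now_epoch=${2:-$(date -u +%s)}
    [ "$now_epoch" -lt "$expiry_epoch" ]
}

# Example use before a run (the URL variable is a placeholder):
#   sas_is_valid "$BOOTSTRAP_URL" || echo "SAS link has expired, regenerate it" >&2
```

A check like this only looks at the expiry time in the URL; it does not verify the signature or the permissions, so a link can still be rejected by the storage service for other reasons.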

The automation part

I use PowerShell to launch and destroy my VMs, so I wrapped each part in a function like this:

$Location                = "West Europe"
$ResourceGroup           = "BartResourceGroup"
$SettingsString          = '{"fileUris":["https://bartstorageaccount.blob.core.windows.net/bootscripts/script.sh?sp=r&st=2018-06-12T11:03:37Z&se=2018-06-15T19:03:37Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxx&sr=b"],"commandToExecute":"sh script.sh"}'

Connect-AzureRmAccount

Function Create-ScraperVM {
    # Create a subnet
    $SubnetConfig = New-AzureRmVirtualNetworkSubnetConfig -Name "PrimarySubnet" -AddressPrefix 192.168.1.0/24

    # Create a virtual network
    $VirtualNetwork = New-AzureRmVirtualNetwork -ResourceGroupName $ResourceGroup -Location $Location -Name "PrimaryVirtualNet" -AddressPrefix 192.168.0.0/16 -Subnet $SubnetConfig

    # Create a public IP address
    $PublicIP = New-AzureRmPublicIpAddress -ResourceGroupName $ResourceGroup -Location $Location -AllocationMethod Static -IdleTimeoutInMinutes 4 -Name "PublicIPScraper"

    # Create SSH inbound SG rule
    $NSGRuleSSH = New-AzureRmNetworkSecurityRuleConfig -Name "NSGRuleSSH"  -Protocol "Tcp" -Direction "Inbound" -Priority 1000 -SourceAddressPrefix * -SourcePortRange * -DestinationAddressPrefix * -DestinationPortRange 22 -Access "Allow"

    # Create NSG
    $NSG = New-AzureRmNetworkSecurityGroup -ResourceGroupName $ResourceGroup -Location $Location -Name "PrimaryNetworkSecurityGroup" -SecurityRules $NSGRuleSSH

    # Create NIC
    $NIC = New-AzureRmNetworkInterface -Name "PrimaryNIC" -ResourceGroupName $ResourceGroup -Location $Location -SubnetId $VirtualNetwork.Subnets[0].Id -PublicIpAddressId $PublicIP.Id -NetworkSecurityGroupId $NSG.Id

    # Create credentials
    $SecurePassword = ConvertTo-SecureString 'MyPasswordGoesHere' -AsPlainText -Force
    $Credentials = New-Object System.Management.Automation.PSCredential ("bart", $SecurePassword)

    # Create virtual machine configuration
    $VMConfig = New-AzureRmVMConfig -VMName "ScraperVM" -VMSize "Standard_F1s" | Set-AzureRmVMOperatingSystem -Linux -ComputerName "ScraperVM" -Credential $Credentials | Set-AzureRmVMSourceImage -PublisherName "Canonical" -Offer "UbuntuServer" -Skus "18.04-LTS" -Version "latest" | Add-AzureRmVMNetworkInterface -Id $NIC.Id

    # Create the virtual machine
    New-AzureRmVM -ResourceGroupName $ResourceGroup -Location $Location -VM $VMConfig

    # Set extension for VM
    Set-AzureRmVMExtension -ResourceGroupName $ResourceGroup -Location $Location -VMName "ScraperVM" -Name "CustomScriptForLinux" -Publisher "Microsoft.OSTCExtensions" -Type "CustomScriptForLinux" -TypeHandlerVersion "1.4" -SettingString $SettingsString
}

Function Remove-ScraperVM {
    # Remove the VM
    Remove-AzureRmVM -Name "ScraperVM" -ResourceGroupName $ResourceGroup -Force

    # Get the disk in the resource group and delete it
    Get-AzureRmDisk -ResourceGroupName $ResourceGroup | Remove-AzureRmDisk -Force   

    # Get the primary NIC and remove it
    Get-AzureRmNetworkInterface -ResourceGroupName $ResourceGroup | Remove-AzureRmNetworkInterface -Force

    # Get the public IP and remove it
    Get-AzureRmPublicIpAddress -ResourceGroupName $ResourceGroup | Remove-AzureRmPublicIpAddress -Force

    # Get the virtual network and remove it
    Get-AzureRmVirtualNetwork -ResourceGroupName $ResourceGroup | Remove-AzureRmVirtualNetwork -Force

    # Get the NSG and remove it
    Get-AzureRmNetworkSecurityGroup -ResourceGroupName $ResourceGroup | Remove-AzureRmNetworkSecurityGroup -Force 
}

Create-ScraperVM
Remove-ScraperVM

As you can see, I call the Remove-ScraperVM function right after calling the Create-ScraperVM function. This works because I use the CustomScriptForLinux VM extension, which, simply put, tells Azure to run the bootstrap script once the VM has fully started. The Set-AzureRmVMExtension cmdlet blocks until the script has finished executing.

Now let's take a look at my bootstrap script itself (script.sh):

#!/bin/sh

sudo apt update && sudo apt -y upgrade
sudo apt install -y firefox python3 python3-pymongo python3-selenium python3-retrying sshpass

wget "https://github.com/mozilla/geckodriver/releases/download/v0.20.1/geckodriver-v0.20.1-linux64.tar.gz" -O geckodriver.tar.gz
tar -xvzf geckodriver.tar.gz
chmod +x geckodriver
sudo mv geckodriver /usr/bin/geckodriver

# Download all scrapers into the current working directory
wget "https://bartstorageaccount.blob.core.windows.net/scrapers/scraper1.py?sp=r&st=2018-06-12T11:04:53Z&se=2018-06-15T19:04:53Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&sr=b" -O scraper1.py
wget "https://bartstorageaccount.blob.core.windows.net/scrapers/scraper2.py?sp=r&st=2018-06-12T11:05:22Z&se=2018-06-15T19:05:22Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&sr=b" -O scraper2.py
wget "https://bartstorageaccount.blob.core.windows.net/scrapers/scraper3.py?sp=r&st=2018-06-12T11:05:52Z&se=2018-06-15T19:05:52Z&spr=https&sv=2017-11-09&sig=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&sr=b" -O scraper3.py

sshpass -p PasswordGoesHere ssh user@databaseserver.local -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -N -L 27017:172.17.0.1:27017 &

python3 scraper1.py
python3 scraper2.py
python3 scraper3.py

After the last scraper (scraper #3) finishes, the Create-ScraperVM function returns, which is exactly when we want to call the Remove-ScraperVM function to remove all the resources that were created.

The routine I created can be reused as often as needed, as long as everything still works: are the signatures for the download links still valid? Does the SSH tunnel to the database still function properly?
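For the SSH tunnel in particular, note that script.sh starts it in the background and launches the first scraper straight away. One way to guard against the tunnel not being ready yet is to poll until the forwarded port accepts connections. The following is a minimal sketch under stated assumptions: the retry_until function is hypothetical, and the commented usage line assumes that nc (netcat) is installed on the VM.

```shell
#!/bin/sh

# Hypothetical helper: run a command repeatedly until it succeeds.
# $1 = maximum number of attempts, $2 = seconds to wait between attempts,
# remaining arguments = the command to run.
retry_until() {
    attempts=$1
    delay=$2
    shift 2

    i=1
    while [ "$i" -le "$attempts" ]; do
        # Stop as soon as the command succeeds.
        if "$@"; then
            return 0
        fi
        # Only wait when another attempt is still to come.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Possible use in script.sh, between starting the tunnel and the first scraper
# (assumes nc is available; 27017 is the forwarded MongoDB port):
#   retry_until 30 1 nc -z 127.0.0.1 27017 || exit 1
```

Exiting early when the tunnel never comes up means the scrapers do not run against a missing database, and the extension still finishes so that the VM can be removed afterwards.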

I hope this article gives you some insight into the possibilities for your workloads on Azure. My solution is not necessarily the best one: for example, Azure HPC batch jobs might be a better fit, but I have yet to get my hands dirty with that.
