Bart Simons


Bart Simons
Author


Scraping websites with LXML

The internet is a big place, and it keeps growing along with the volume of data traffic. Often, only a small part of a page is what actually matters to us. Links, paragraphs and keywords are three examples of the data we care about. LXML is a great library that makes parsing HTML documents from within Python pretty easy, so I decided to write a code example for those who are interested.

Scraping the Reddit front page as an example

Reddit's front page is easy to parse. In fact, it has a straightforward CSS structure that actually makes sense:

Each link to a post is contained inside a div tag that carries the thing class. Chromium - the internet browser in the screenshot above - actually supports searching by XPath from the developer console. Very neat, cheers to the developers who made this possible!

The same thing could be done programmatically, by using Python and LXML. Here's an example that should work:

#!/usr/bin/env python3

import lxml.html

from pycurl import Curl
from io import BytesIO

userAgent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0'
redditMainPage = 'https://www.reddit.com/new/'

def fetchUri(uriToFetch):
    buffer = BytesIO()
    c = Curl()
    c.setopt(c.URL, uriToFetch)
    c.setopt(c.WRITEDATA, buffer)
    c.setopt(c.USERAGENT, userAgent)
    c.perform()
    c.close()
    # Reddit serves UTF-8, so decode as UTF-8 rather than ISO-8859-1
    return buffer.getvalue().decode('utf-8', 'replace')

requestResult = fetchUri(redditMainPage)
requestLxmlDocument = lxml.html.document_fromstring(requestResult)

# Select the title link of every post: div.thing > div.entry > p.title > a.title
requestLxmlRoot = requestLxmlDocument.xpath("//div[contains(@class, 'thing')]//div[contains(@class, 'entry')]//p[contains(@class, 'title')]//a[contains(@class, 'title')]")

# Print each post title, followed by an empty line
for rootObject in requestLxmlRoot:
    print(rootObject.text_content() + "\n")

This code iterates over each Reddit post found on the page and prints its title, followed by a newline character. This code snippet should work fine on both Python 2 and 3, with PycURL and LXML installed. Good luck experimenting with LXML!
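If you want to experiment with the XPath without hitting Reddit on every run, here is a minimal sketch that works on a hand-written HTML fragment instead. The sample markup, the has_class helper and the extract_posts function are my own assumptions for illustration, and the fragment only imitates the div/entry/title nesting described above. One thing it tries to show: contains(@class, 'thing') is a substring match, so it would also pick up a class such as nothing. The helper uses a common XPath 1.0 idiom that matches a whole class name instead.

```python
import lxml.html

# Hand-written sample markup (an assumption, not real Reddit output) that
# imitates the div.thing > div.entry > p.title > a.title nesting.
SAMPLE_HTML = """
<html><body>
  <div class="thing link">
    <div class="entry">
      <p class="title"><a class="title" href="/r/python/1">First post</a></p>
    </div>
  </div>
  <div class="nothing">
    <div class="entry">
      <p class="title"><a class="title" href="/r/python/2">Not a post</a></p>
    </div>
  </div>
  <div class="thing">
    <div class="entry">
      <p class="title"><a class="title may-blank" href="https://example.com/">Second post</a></p>
    </div>
  </div>
</body></html>
"""


def has_class(name):
    # Hypothetical helper: builds an XPath 1.0 test that matches one whole
    # class token rather than any substring of the class attribute.
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name


def extract_posts(html):
    """Return (title, href) pairs for every post link found in the HTML."""
    document = lxml.html.document_fromstring(html)
    query = "//div[{thing}]//div[{entry}]//p[{title}]//a[{title}]".format(
        thing=has_class("thing"),
        entry=has_class("entry"),
        title=has_class("title"),
    )
    return [(a.text_content().strip(), a.get("href")) for a in document.xpath(query)]


# Print every post found in the sample, one per line
for title, href in extract_posts(SAMPLE_HTML):
    print(title + " -> " + href)
```

Since extract_posts only needs a string of HTML, the result of fetchUri from the script above could be passed in instead of SAMPLE_HTML, assuming the live page still follows the same class structure.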
