Data-mining is the computational analysis of information to uncover patterns that exist within large datasets. These patterns can be visualised with a variety of tools and techniques, such as line and pie charts, gauges and maps, and visualisation is an excellent way to expose anomalies and trends within the data.
However, before we can start the visualisation process we
must first analyse the data, and before we can analyse anything we need to
capture some data, either by gaining access to existing sources or by
uncovering new sources that are relevant to our needs.
Data-scraping
One technique is to convert an existing source of digital data
into another form that can then be processed further; this is called
data-scraping. Such sources of data may be found in public domain information
websites, publicly released Portable Document Format (PDF) files and various other
digital formats that are beyond the scope of this blog.
It is important to stress that obtaining copyrighted or
otherwise protected material without consent may result in legal action
against you. Always seek out permissions
and research thoroughly the author's terms and conditions for the use of the data
(whether the author is an individual or a governmental department).
For further discussion of the ethics surrounding data-scraping, a good article to read is:
Obtaining Open Datasets
Publicly available data and open datasets provide the legal
sources of data that we will use to demonstrate the technique over this
series, starting with Microsoft's Excel spreadsheet program and then,
in a later post, the web-based Google spreadsheet interface.
As mentioned previously, the two most common forms of
digitally published data are found inside web pages and in PDF files.
In this first tutorial we will look at how the
Microsoft Excel program can be used to access data found in public domain web pages.
We will focus only on getting the data; the further techniques of data cleaning and data
visualisation are necessary but complex, and need further discussion at a later date.
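If you would prefer to script this step rather than work through Excel's menus, the sketch below shows a rough Python equivalent. It is a minimal illustration only, not part of the tutorial proper: the URL is a placeholder, and it assumes the pandas library (plus an HTML parser such as lxml) is installed.

# Minimal sketch: pull every <TABLE> on a public page into DataFrames.
# Assumes pandas and lxml are installed; the URL below is a placeholder.
import pandas as pd

url = "http://example.com/some-public-table"  # hypothetical public page
tables = pd.read_html(url)   # returns one DataFrame per <TABLE> found
print(len(tables), "table(s) found")
print(tables[0].head())      # preview the first table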
Working on the HTTP layer
Web pages can be built using a variety of competing computer
languages, all of which rely on the Hypertext Transfer Protocol (HTTP). HTTP, or
"http://", should be familiar to most people as it appears in the URL
address bar of most modern browsers.
HTTP is the foundation that allows applications to communicate over
the Internet, and this interaction is what forms the World Wide Web.
When we view the World Wide Web through a browser we are viewing
that particular web browser's interpretation of
Hypertext Markup Language (HTML) and its supported technologies (CSS, Flash media,
etc.).
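To make this concrete, here is a minimal sketch, using only Python's standard library, of fetching the raw HTML of a page over HTTP, the same markup the browser then interprets and renders. The URL is a placeholder for any public page you have permission to read.

# Fetch a page over HTTP with the standard library; no extra packages needed.
from urllib.request import urlopen

url = "http://example.com/"  # placeholder public page
with urlopen(url) as response:
    html = response.read().decode("utf-8")

print(html[:500])  # the raw markup a browser would render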
Although there is a multitude of tags, and an almost infinite
number of ways to arrange them to create a web page, in our context
of data-scraping the two that we will focus on are:
<TABLE>
<DIV>
The Table and Div tags allow a web page to be laid out as a
structure (or grid), letting digital publishers and editors place
content elements (text, images, links, etc.) inside to enhance the visual arrangement.
This is much the same as using a text box inside a word-processing program such
as Microsoft Word or OpenOffice Writer to adjust the flow of text, giving
added meaning or emphasis for the reader.
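As a small illustration of how content sits inside that structure, the sketch below uses only Python's standard library to walk a fragment of HTML and pull the text out of the table cells. The sample markup is invented for the example.

# Extract the text held in <td> cells of a <table> nested in a <div>.
from html.parser import HTMLParser

SAMPLE = """
<div class="report">
  <table>
    <tr><td>City</td><td>Population</td></tr>
    <tr><td>Example Town</td><td>12345</td></tr>
  </table>
</div>
"""

class CellExtractor(HTMLParser):
    # Collects the text found inside <td> cells.
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed(SAMPLE)
print(parser.cells)  # ['City', 'Population', 'Example Town', '12345']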