Wednesday, 30 January 2013

Data-mining workshop No.1


Data-mining is the computational analysis of information to uncover patterns that exist within large datasets. These patterns can be visualised with tools and techniques such as line and pie charts, gauges and maps, which makes visualisation a fantastic way to uncover abnormalities and trends within a dataset.

However, before we can start the visualisation process we must first analyse the data, and before we can analyse it we need to capture it, either by gaining access to existing sources or by uncovering new sources that are relevant to our needs.

Data-scraping

One technique is to convert an existing source of digital data into another form that can then be processed further; this is called data-scraping. These sources may be public domain information websites, publicly released Portable Document Format (PDF) files and various other digital formats that are beyond the scope of this blog.

It is important to stress that obtaining copyrighted or protected material without consent may result in legal action against you. Always seek permission and thoroughly research the terms and conditions that the author (whether an individual or a governmental department) places on the use of the data.
For further discussion about the morals surrounding data-scraping a good article to read is:


Obtaining Open Datasets

Publicly available open datasets provide a legal source of data, and we will use them to demonstrate the technique over the course. We will start with Microsoft’s Excel spreadsheet program, and in a later post we will use the web-based Google spreadsheet interface.

As mentioned previously, the two most common forms of digitally published data are web pages and PDF files.

In this first tutorial we will look at how we can use Microsoft Excel to access data found in public domain web pages. We will focus only on getting the data; the further techniques of data cleaning and data visualisation are necessary, but they are complex and need separate discussion at a later date.

Working on the HTTP layer

Web pages can be built using a variety of competing computer languages, all of which rely on the Hypertext Transfer Protocol (HTTP). HTTP, or “http://”, should be familiar to most people as it appears in the address bar of most modern browsers.

HTTP is the protocol that web browsers and web servers use to communicate over the Internet, and this interaction is what forms the World Wide Web.
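To make that exchange a little more concrete, here is a minimal Python sketch of the text a browser might send when it asks a server for a page. The function name build_get_request and the example.com address are placeholders chosen for illustration, not part of this workshop's materials, and a real browser sends many more headers than this. Nothing is sent over the network; the code only builds the request text.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Return the text of a bare-bones HTTP/1.1 GET request for `url`.

    Illustrative only: the function name and the header choice are assumptions.
    """
    parts = urlsplit(url)
    # An empty path means the site's root page.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {parts.netloc}",  # which site on the server we want
        "Connection: close",      # ask the server to close the connection afterwards
        "",                       # blank line marks the end of the headers
        "",
    ]
    # HTTP separates lines with a carriage return followed by a line feed.
    return "\r\n".join(lines)


if __name__ == "__main__":
    # example.com is a placeholder address.
    print(build_get_request("http://example.com/data/table.html?year=2013"))
```

The server's reply follows the same plain-text pattern: a status line, some headers, a blank line, and then the HTML of the page itself.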

When we view the World Wide Web through a browser, we are viewing that particular browser's interpretation of Hypertext Markup Language (HTML) and its supporting technologies (CSS, Flash media, etc.).

During interpretation the web browser organises and displays the various elements of the Hypertext Markup Language, which are written as tags enclosed in "< >" symbols.
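As a rough illustration of how a program sees those tags, the following Python sketch uses the standard library's html.parser module to list the opening tags in a snippet of HTML. The class name TagLister, the helper list_tags and the sample markup are made up for this example; a real browser does far more than this.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects the name of every opening tag it meets (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case, so <DIV> arrives as "div".
        self.tags.append(tag)


def list_tags(html: str) -> list:
    """Return the opening tags found in `html`, in document order."""
    parser = TagLister()
    parser.feed(html)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    # Made-up sample markup.
    sample = "<html><body><DIV><p>Hello</p></DIV></body></html>"
    print(list_tags(sample))
```

Note that the parser treats <DIV> and <div> as the same tag, which is also how browsers read HTML.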

Although there is a multitude of tags to select from, and a practically unlimited number of ways to arrange them to create a web page, in our context of data-scraping the two that we will focus on are:

<TABLE>
<DIV>

The Table and Div tags allow the web page to be built as a structure (or grid) and allow digital publishers and editors to place content elements (text, images, links, etc.) inside it to improve the visual arrangement. This is much the same as using a text box in a word processing program such as Microsoft Word or Open Office Writer to adjust the flow of text, giving added meaning or emphasis for the reader.
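Because a table is a grid of rows and cells, it maps naturally onto the rows and columns of a spreadsheet. The Python sketch below shows one possible way to pull the cell text out of a simple HTML table using only the standard library. It is not the Excel method this tutorial goes on to use; the class name TableExtractor and the sample figures are invented for illustration, and it assumes a single, well-formed table with no nesting.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Gathers the text of each table cell into a list of rows (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # the row being built, or None when outside <tr>
        self._cell = None  # text pieces of the current cell, or None when outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html: str) -> list:
    """Return the table cells in `html` as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    # Made-up sample data, not taken from any real dataset.
    sample = """
    <TABLE>
      <tr><th>City</th><th>Population</th></tr>
      <tr><td>Springfield</td><td>30000</td></tr>
    </TABLE>
    """
    for row in extract_rows(sample):
        print(row)
```

Each inner list corresponds to one row of the table, so the result could be written out as one spreadsheet row per list.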
