Analyzing Large Data Collections with Apache Pig

Objective

The objective of this exercise is to demonstrate Pig Latin, the dataflow language of Apache Pig, for analyzing large data collections. You will work with a sample of the Neubot Data Collection, which comprises Neubot measurements (e.g., download/upload speed tests) performed by different users, in different places, and through different internet providers.

Requirements

Apache Pig 0.14.0 (or greater)
Java 7 (or greater)
Neubot Data Collection
NeubotTestsUDFs.jar (set of user-defined functions required for this exercise)

Neubot Data Collection

The Neubot Data Collection is a data set containing network tests (e.g., upload/download speed over HTTP or BitTorrent). Each test record contains the following fields:

Name	Description
client_address	User IP address (IPv4 or IPv6).
client_country	Country where the test was conducted.
client_provider	Name of the user's internet provider.
connect_time	Number of seconds elapsed between the reception of the first and last packets (round-trip time).
download_speed	Download speed (bytes/s).
neubot_version	Neubot version used for this test.
platform	User's operating system.
remote_address	IP address (IPv4 or IPv6) of the server used for this test.
test_name	Test type (e.g., speedtest, bittorrent, dash).
timestamp	Time at which the test was performed, measured as the number of seconds elapsed since 1 January 1970 (UNIX timestamp).
upload_speed	Upload speed (bytes/s).
latency	Delay between the sending and the reception of a control packet.
uuid	User ID (generated automatically by Neubot during installation).
asnum	Internet provider's ID.
region	Region of the country in which the test was performed (if known).
city	Name of the city.
hour	Hour of the test (derived from timestamp).
month	Month of the test (derived from timestamp).
year	Year of the test (derived from timestamp).

Running Apache Pig

Apache Pig consists of a dataflow language (Pig Latin), an interpreter, and a compiler that produces sequences of MapReduce programs for analyzing large data sets on parallel infrastructures (e.g., Hadoop). Some of the benefits of using Pig Latin are:

Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.

Although Pig Latin programs are intended to be executed on a cluster, you can run some tests on your local machine. To do so, open a terminal and type the following commands to start GRUNT (Pig's interactive interpreter) in local mode:

# Move to the folder containing the exercise material
cd ~/hands-on/pig

# Execute Pig in Local mode
pig -x local

Example of a Pig Program

The following program illustrates how to use Apache Pig for processing the Neubot data collection. In particular, it illustrates how to:

Define a schema that describes the structure of your data.
Filter the data based on some criteria (here, keep only the speedtest tests).
Project a subset of the attributes of the data collection (e.g., keep only the names of the cities where the tests were conducted).
Display results on the screen.
Store results on the filesystem.

You can run this program by copying and pasting it into GRUNT.

*** Note: Modify the paths to the NeubotTests data collection and NeubotTestsUDFs.jar if necessary.

REGISTER NeubotTestsUDFs.jar;
DEFINE   IPtoNumber convert.IpToNumber();
DEFINE   NumberToIP convert.NumberToIp();

NeubotTests = LOAD 'NeubotTests' using PigStorage(';') as (
                  client_address: chararray,
                  client_country: chararray,
                  lon: float,
                  lat: float,
                  client_provider: chararray,
                  mlabservername:  chararray,
                  connect_time:    float,
                  download_speed:  float,
                  neubot_version:  float,
                  platform:        chararray,
                  remote_address:  chararray,
                  test_name:       chararray,
                  timestamp:       long,
                  upload_speed:    float,
                  latency:  float,
                  uuid:     chararray,
                  asnum:    chararray,
                  region:   chararray,
                  city:     chararray,
                  hour:     int,
                  month:    int,
                  year:     int,
                  weekday:  int,
                  day:      int,
                  filedate: chararray
);

--
-- Keep only the 'speedtests'
--     @ means "use previous result" 
Tests = FILTER @ BY (test_name matches '.*speedtest.*');

--
-- Cities where the tests were conducted
--
Cities = FOREACH @ GENERATE city;
Cities = DISTINCT @;
Cities = ORDER @ BY city;

--
-- Display the results contained in 'Cities'
--
DUMP @;

--
-- Store the results in the folder 'SpeedTests'
--
STORE @ INTO 'SpeedTests';
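
The example above filters, projects, and sorts; aggregation follows the same pattern. As a minimal sketch, assuming the program above has already been run in the same GRUNT session (so that the relation Tests exists), the following counts the speedtest records per city. The aliases ByCity, TestsPerCity, and Busiest are illustrative names and are not part of the exercise material.

```pig
-- Assumption: the relation Tests (speedtest records only) was defined
-- by the example program earlier in this GRUNT session.

-- Collect all speedtest records that share the same city.
ByCity = GROUP Tests BY city;

-- For each city, emit the grouping key and the number of grouped records.
TestsPerCity = FOREACH ByCity GENERATE group AS city, COUNT(Tests) AS num_tests;

-- Print the schema of the aggregated relation.
DESCRIBE TestsPerCity;

-- Sort by number of tests (largest first) and keep the first 10 cities.
TestsPerCity = ORDER TestsPerCity BY num_tests DESC;
Busiest = LIMIT TestsPerCity 10;

DUMP Busiest;
```

Note that COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple in the bag.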

TO DO

Define a data flow using Pig that answers each of these queries:

Keep only the speedtest tests conducted in Barcelona or Madrid. Then list the internet providers operating in those cities.
List the names and IP ranges of the internet providers located in Barcelona. For this, you need the IPtoNumber user-defined function (cf. NeubotTestsUDFs.jar).
Group the speedtest tests by the user's network infrastructure (e.g., 3G/4G vs. ADSL). For this, you can assume a maximum bandwidth for each type (e.g., 21 Mb/s for ADSL).
Find the user who performed the largest number of tests. For this user, produce a table showing how her/his download and upload speeds evolved over time.
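
For the third query, the bandwidth threshold must be expressed in the same unit as download_speed (bytes per second). The following is only one possible starting point, not a reference solution. It assumes that the relation Tests from the example program is defined in the current GRUNT session and that "21 Mb/s" means 21 megabits per second with a decimal prefix. The aliases and the labels 'within_adsl_limit' and 'above_adsl_limit' are hypothetical.

```pig
-- Assumption: Tests (speedtest records only) is already defined.
-- Assumption: 21 Mb/s = 21 * 1,000,000 bits/s = 2,625,000 bytes/s (8 bits per byte).

-- Attach a hypothetical label to each test based on its download speed.
Labeled = FOREACH Tests GENERATE
              uuid,
              client_provider,
              download_speed,
              (download_speed <= 2625000.0 ? 'within_adsl_limit'
                                           : 'above_adsl_limit') AS speed_class;

-- Group the labeled tests by class and count the tests in each class.
BySpeedClass = GROUP Labeled BY speed_class;
ClassSizes = FOREACH BySpeedClass GENERATE group AS speed_class,
                                           COUNT(Labeled) AS num_tests;

DUMP ClassSizes;
```

A single threshold only separates two classes; distinguishing more infrastructure types would need further thresholds, each of which is an assumption to state explicitly.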
