Topics: Big data in networks; Traffic characterization and traffic models for networks
Authors: Andrea Morichetta, Enrico Bocchi, Hassan Metwalley and Marco Mellia (Politecnico di Torino, Italy)
Presenter bio: Andrea Morichetta is a Ph.D. candidate at the Electronics and Telecommunications
department of Politecnico di Torino. She is currently working on network
security and anomaly detection using clustering techniques and big data
platforms.
Abstract:
The Internet has witnessed the proliferation
of applications and services that rely on HTTP as their application protocol.
Users play games, read emails, watch videos, chat and access web pages
using their PC, which in turn downloads tens or hundreds of URLs to
fetch all objects that are needed to display the requested content. As a
result, billions of URLs are observed in the network.
When monitoring traffic, it is therefore increasingly important to have
methodologies that allow one to dig into this data and extract useful
information. In this paper, we present CLUE,
Clustering for URL Exploration, a methodology that leverages clustering
algorithms, i.e., classic unsupervised approaches developed in the data
mining field, to extract knowledge from passive observation of URLs
carried by the network. This is a challenging problem, given the
unstructured format of URLs, which, being strings, call for specialized
approaches. Inspired by text-mining algorithms, we tackle this problem by
introducing a notion of URL-distance and using it to extract URL
clusters with the classic DBSCAN algorithm.
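
As a rough illustration of the idea of pairing a string distance with DBSCAN, the following Python sketch clusters a handful of URLs. The abstract does not define the URL-distance, so a length-normalized Levenshtein (edit) distance is used here purely as an assumed stand-in; the sample URLs and the `eps` and `min_pts` values are likewise hypothetical and are not taken from the paper.

```python
from typing import List, Optional

NOISE = -1  # label for URLs that belong to no cluster


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def url_distance(u: str, v: str) -> float:
    """ASSUMPTION: edit distance normalized to [0, 1] by the longer length."""
    longest = max(len(u), len(v))
    return edit_distance(u, v) / longest if longest else 0.0


def dbscan(urls: List[str], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN over a pairwise URL-distance matrix."""
    n = len(urls)
    dist = [[url_distance(urls[i], urls[j]) for j in range(n)] for i in range(n)]
    # Each point counts as its own neighbor, as in the usual formulation.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return [NOISE if lab is None else lab for lab in labels]


# Hypothetical sample URLs, invented for this example.
urls = [
    "tracker.example/pixel?id=1001",
    "tracker.example/pixel?id=1002",
    "tracker.example/pixel?id=1003",
    "files.example/download/a.zip",
    "files.example/download/b.zip",
    "files.example/download/c.zip",
    "news.example/index.html",
]
labels = dbscan(urls, eps=0.2, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy run the three tracker-style URLs and the three file-hosting-style URLs each form a cluster, while the unrelated URL is left as noise. The pairwise matrix grows quadratically with the number of URLs, so a sketch like this only suits small samples.
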
Experiments on actual datasets show encouraging results: well-separated
and consistent clusters emerge and allow us to identify, e.g., malicious
traffic, file hosting services, and third-party tracking services. In a
nutshell, our clustering algorithm offers a means to gain insight into
the data carried by the network, with possible applications in the
security and privacy protection fields.