Topics: Big data in networks; Traffic characterization and traffic models for networks
 
 
	Authors: Andrea Morichetta, Enrico Bocchi, Hassan Metwalley and Marco Mellia (Politecnico di Torino, Italy)
Presenter bio: Andrea Morichetta is a Ph.D. candidate in the Department of 
Electronics and Telecommunications at Politecnico di Torino. She is 
currently working on network security and anomaly detection using 
clustering techniques and big data platforms.
 
 
	Abstract:
		The Internet has witnessed the proliferation of applications and 
services that rely on HTTP as their application protocol. Users play 
games, read emails, watch videos, chat and access web pages using their 
PC, which in turn downloads tens or hundreds of URLs to fetch all the 
objects needed to display the requested content. As a result, billions 
of URLs are observed in the network.
When monitoring traffic, it is therefore increasingly important to have 
methodologies that allow one to dig into this data and extract useful 
information. In this paper, we present CLUE (Clustering for URL 
Exploration), a methodology that leverages clustering algorithms, i.e., 
classic unsupervised approaches developed in the data mining field, to 
extract knowledge from passive observation of the URLs carried by the 
network. This is a challenging problem, given the unstructured format of 
URLs, which, being strings, call for specialized approaches. Inspired by 
text-mining algorithms, we tackle this problem by introducing a concept 
of URL-distance and using it to extract URL clusters with the classic 
DBSCAN algorithm.
Experiments on actual datasets show encouraging results: well-separated 
and consistent clusters emerge and allow us to identify, e.g., malicious 
traffic, file hosting services, and third-party tracking services. In a 
nutshell, our clustering algorithm offers the means to gain insight into 
the data carried by the network, with possible applications in the 
security and privacy protection fields.
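
To make the idea of clustering URLs by a string distance concrete, here is a minimal sketch in Python. The abstract does not define the URL-distance used by CLUE, so this example assumes a length-normalized Levenshtein (edit) distance as a stand-in. The function names, the sample URLs, and the `eps` and `min_pts` values are all hypothetical and chosen only for illustration. The DBSCAN routine is a small textbook-style version, not the authors' implementation.

```python
from typing import List, Optional

NOISE = -1  # label for points that belong to no cluster


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def url_distance(u: str, v: str) -> float:
    """Assumed URL-distance: edit distance normalized to [0, 1]."""
    longest = max(len(u), len(v))
    return edit_distance(u, v) / longest if longest else 0.0


def dbscan(urls: List[str], eps: float, min_pts: int) -> List[int]:
    """Basic DBSCAN over pairwise URL distances (O(n^2), toy sizes only)."""
    n = len(urls)
    dist = [[url_distance(urls[i], urls[j]) for j in range(n)]
            for i in range(n)]
    # A point's neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps]
                 for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return [NOISE if lab is None else lab for lab in labels]


# Made-up URLs: two families of similar strings plus one outlier.
urls = [
    "cdn.example.com/img/001.png",
    "cdn.example.com/img/002.png",
    "cdn.example.com/img/003.png",
    "tracker.example.net/pixel?id=aaa1",
    "tracker.example.net/pixel?id=bbb2",
    "tracker.example.net/pixel?id=ccc3",
    "unrelated.example.org/",
]
labels = dbscan(urls, eps=0.25, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy run, the two families of near-identical URLs form separate clusters and the unrelated URL is marked as noise. A plain edit distance treats a URL as a flat string; a distance tuned to URL structure (hostname, path, query parameters) could behave quite differently, and the quadratic pairwise computation would not scale to billions of URLs without further engineering.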