Area 7: Network Measurements and Analysis

Digital Connected World

CLUE: Clustering for Mining Web URLs

Topics: Big data in networks; Traffic characterization and traffic models for networks

Download: The paper is accessible via the i-teletraffic.org library or this direct link.

Authors: Andrea Morichetta, Enrico Bocchi, Hassan Metwalley and Marco Mellia (Politecnico di Torino, Italy)
Presenter bio: Andrea Morichetta is a Ph.D. candidate at the Department of Electronics and Telecommunications of Politecnico di Torino, currently working on network security and anomaly detection using clustering techniques and big data platforms.

Abstract: The Internet has witnessed the proliferation of applications and services that rely on HTTP as their application protocol. Users play games, read emails, watch videos, chat and access web pages using their PC, which in turn downloads tens or hundreds of URLs to fetch all objects that are needed to display the requested content. As a result, billions of URLs are observed in the network. When monitoring traffic, it is therefore increasingly important to have methodologies that allow one to dig into this data and extract useful information. In this paper, we present CLUE, Clustering for URL Exploration, a methodology that leverages clustering algorithms, i.e., classic unsupervised approaches developed in the data mining field, to extract knowledge from passive observation of URLs carried by the network. This is a challenging problem, given the unstructured format of URLs, which, being strings, call for specialized approaches. Inspired by text-mining algorithms, we tackle this problem by introducing a concept of URL-distance and using it to extract URL clusters with the classic DBSCAN algorithm. Experiments on actual datasets show encouraging results: well-separated and consistent clusters emerge and allow us to identify, e.g., malicious traffic, file hosting services, and third-party tracking services. In a nutshell, our clustering algorithm offers the means to gain insight into the data carried by the network, with possible applications in the security or privacy protection fields.
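
Illustrative sketch (not taken from the paper): the abstract describes a URL-distance fed into DBSCAN but does not define the distance here. The Python below assumes a length-normalized Levenshtein edit distance as a stand-in metric and includes a small, unoptimized DBSCAN so that it runs without external libraries. The function names, the example URLs, and the `eps` and `min_pts` values are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional, Sequence

NOISE = -1  # label for points that belong to no cluster


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def url_distance(u: str, v: str) -> float:
    """Edit distance normalized to [0, 1]. ASSUMPTION: a stand-in for the
    paper's URL-distance, which the abstract does not specify."""
    longest = max(len(u), len(v))
    return levenshtein(u, v) / longest if longest else 0.0


def dbscan(items: Sequence[str], dist: Callable[[str, str], float],
           eps: float, min_pts: int) -> List[int]:
    """Plain O(n^2) DBSCAN; returns one cluster label per item."""
    n = len(items)
    # Each point's eps-neighborhood (a point is its own neighbor).
    neighbors = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
                 for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return [NOISE if lab is None else lab for lab in labels]


# Hypothetical URLs, invented for this example only.
urls = [
    "tracker.example.com/pixel?id=1001",
    "tracker.example.com/pixel?id=1002",
    "tracker.example.com/pixel?id=1003",
    "files.example.org/download/a9f3c2.zip",
    "files.example.org/download/b7e1d4.zip",
    "files.example.org/download/c0d5e6.zip",
    "news.example.net/2016/09/itc-report.html",
]
labels = dbscan(urls, url_distance, eps=0.3, min_pts=2)
for label, url in zip(labels, urls):
    print(label, url)
```

With these assumed parameters, the tracker-style URLs and the download URLs should fall into two separate clusters, and the single news URL should be labeled as noise. A real deployment on billions of URLs would need a far more scalable neighbor search than the pairwise loop used here.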

Important Dates

Paper registration: March 11, 2016 (extended)
Full paper due: March 18, 2016 (extended)
Acceptance notification: May 13, 2016
Camera-ready paper due: June 15, 2016
Note: All dates are hard deadlines.