What's in a Domain? Analysis of URL Features

Authors

John Hawkins, Transitional AI Research Group, Australia

Abstract

Many data science problems require processing log data derived from web pages, APIs or other internet traffic sources. URLs are one of the few ubiquitous data fields that describe internet activity, so they require effective processing for a wide variety of machine learning applications. While URLs are structurally rich, the structure can be both domain specific and subject to change over time, making feature engineering for URLs an ongoing challenge. In this research we outline the key structural components of URLs and discuss the information available within each. We describe methods for generating features on these URL components and share an open source implementation of these ideas. In addition, we describe a method for exploring URL feature importance that allows for comparison and analysis of the information available inside URLs. We experiment with a collection of URL classification datasets and demonstrate the utility of these tools. The package and source code are available at https://pypi.org/project/url2features
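As a rough illustration of what generating features on URL components can look like, the following minimal sketch splits a URL into its structural parts and derives a few simple features. It uses only the Python standard library (`urllib.parse`) and does not reflect the url2features API; the function name `url_component_features` and the chosen feature names are assumptions made for this example.

```python
from urllib.parse import urlsplit, parse_qs


def url_component_features(url: str) -> dict:
    """Derive a few illustrative features from the structural parts of a URL.

    Hypothetical helper for illustration only; not part of url2features.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Non-empty path segments, e.g. "/a/b/" -> ["a", "b"]
    path_segments = [seg for seg in parts.path.split("/") if seg]
    return {
        "scheme": parts.scheme,
        "host": host,
        # Number of dot-separated labels in the host name
        "host_label_count": len(host.split(".")) if host else 0,
        "path_depth": len(path_segments),
        # Number of distinct query parameter names
        "query_param_count": len(parse_qs(parts.query)),
        "has_fragment": bool(parts.fragment),
        "url_length": len(url),
    }


if __name__ == "__main__":
    print(url_component_features("http://example.com/a/b/c?x=1&y=2#top"))
```

Features of this kind (counts, depths, flags) are only one possible family; which components carry useful signal depends on the dataset and task.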

Keywords

Machine Learning, Feature Engineering, Web Search, Semantic Web, Data Science

Full Text  Volume 13, Number 14