Open-Source or Public Datasets for Machine Learning Studies and Research



Machine learning (ML) techniques are now widely used, from academia to industry, and have started to influence our daily lives through, for example, social media applications and online shopping. As a result, many machine learning algorithms have been developed to improve the performance of these techniques.



Whether you are learning the basics of machine learning or developing new algorithms, it is essential to have large, reliable datasets that capture logical connections between data members and include labels. Especially in academia, well-known and extensively examined datasets are necessary for investigating the performance of newly developed machine learning algorithms and comparing them to existing ones.

A large number of publicly available datasets can be used with various machine learning techniques such as deep learning, classification, reinforcement learning, and clustering. Here are the ones I most like to use:

1. UC Irvine Machine Learning Repository

Nearly all datasets in this repository were published by university researchers. It covers a wide range of areas, from marketing to wireless communication systems, and most of the datasets are well documented. Link

2. Deeplearning.net Datasets

These datasets are mainly intended for benchmarking deep learning algorithms. Link

3. Wikipedia: List of Datasets


This Wikipedia page lists a large number of datasets along with details such as format, creators, reference study, and a description of each. Link
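
Many datasets from sources like the ones above come as plain comma-separated text files, often with the features first and the class label in the last column. As a rough illustration, here is a minimal sketch of reading that kind of file using only the Python standard library. The helper name `load_labeled_rows` is hypothetical, the layout (no header row, label last) is an assumption that will not hold for every dataset, and the inline sample stands in for a downloaded file.

```python
import csv
import io

# A few sample rows in the headerless, comma-separated layout assumed here
# (numeric features first, class label last). In practice this text would
# come from a downloaded file rather than an inline string.
SAMPLE_DATA = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def load_labeled_rows(text):
    """Split each row into a list of float features and a string label.

    Assumes no header row and the label in the last column; blank lines
    are skipped. (Hypothetical helper, for illustration only.)
    """
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        features.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return features, labels


features, labels = load_labeled_rows(SAMPLE_DATA)
print(len(features), "rows,", len(features[0]), "features per row")
print(sorted(set(labels)))
```

For a real file, the same function could be given the result of `open(path).read()`. Always check the dataset's documentation first, since some files put the label in the first column or include a header row.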




Extra: MIT Lectures on Machine Learning and Deep Learning