A Guide to Real-World Data Collection for Machine Learning | by Leah Berg and Ray McLendon | Sep, 2023

5 Actionable Strategies to Optimize Your Process

Leah Berg and Ray McLendon
Towards Data Science
Photo by Henrik Dønnestad on Unsplash

Whether you’re brand new to data science or the Chief Data Scientist at a large company, you’ve probably played with perfectly crafted data sets to solve toy problems. Maybe you’ve used K-Means clustering to predict flower species in the Iris data set. Or maybe you’ve tried out a logistic regression to predict which passengers survived the Titanic voyage.

While these data sets are great for practicing the basics of machine learning, they don’t mirror the real-world data you’ll come across on the job. In reality, your data can have issues, might not be perfect for the task at hand, or may not exist yet. This means data scientists often need to roll up their sleeves and gather data, a challenge rarely covered in today’s data science curriculum.

For new Data Scientists, collecting extensive amounts of data before diving into the problem at hand can feel extremely daunting, since this stage lays the foundation for the entire machine learning project. However, with the right strategies, this process can become much more manageable.

Throughout my 10+ years as a Data Scientist, I’ve encountered a wide variety of data collection strategies, and in this article, I’ll share five of my favorite tips to optimize your data collection process and set you on the path to creating a successful machine learning product.

A powerful starting point lies in offering tangible value right from the beginning. Let’s borrow an example from a major player in the automotive industry. Their quest for a fully autonomous vehicle is a substantial goal that’s taken years to develop and has required a massive amount of data collection.

So, what did they do while amassing all of this data?

Photo by Milan Csizmadia on Unsplash
