On any given day, a data scientist is a mathematician, a statistician, a computer programmer, and an analyst equipped with a diverse and wide-ranging skillset, balancing knowledge in different computer programming languages with advanced experience in data mining and visualization.
Machine Learning (ML) is an ever-changing, multidisciplinary field, so staying up to date on its constantly evolving algorithms, techniques and models is crucial. For this reason, the most important trait for any aspiring ML engineer is the ability to be a self-learner.
This roadmap provides guidance for anyone who wants to start a career in the field of ML, but we highly encourage readers to explore other courses and materials beyond those recommended below.
1. Learn a Programming Language
It's unnecessary to explain why learning a programming language is an absolute must for anyone wanting to start a career in ML. But while many programming languages provide frameworks to work with ML, Python and R have the richest and fastest-growing ecosystems. Python and R have been competing as the primary languages for ML for a while, but Python has been gaining momentum and is currently the most used language in ML. Since Python has a simpler syntax and a gentler learning curve, we recommend it as a starting language.
1. It's easy to find Python courses on the internet. If you're willing to pay, one of the best for beginners is the course offered by DataCamp.
2. Codecademy also offers a friendly introductory course, but again, you will have to pay if you want to access more advanced topics.
3. Learn Python offers a thorough course, and the best part is that it's free.
4. We also highly recommend the book Learn Python the Hard Way.
It's important to note that even though Python is the most broadly adopted language (and in our opinion should be the first language you learn), R is another very good asset to have. It's particularly powerful for exploratory analysis and data visualization, but it also has strong frameworks for ML and Statistics. DataCamp and Leada offer great courses on R.
2. Sharpen Your Basic Skills
Now that you know the basics of your programming language of choice, it's time to build your other basic skills:
Algorithms:
It's essential for a machine learning engineer to know the existing algorithm families and the trade-offs involved, from basic gradient descent and linear regression to the more advanced deep learning algorithms and reinforcement learning models. A great place to start is Andrew Ng’s machine learning course. Even though this course is based on MATLAB code, students are free to use any programming language of their choice.
The next step is Andrew Ng's deep learning specialization, an up-to-date course that covers the main deep learning models and their basic usage. The course uses Python and provides nice Jupyter Notebooks and quizzes to support your learning. If you are not willing to pay for the course (which, even though it is expensive, is highly recommended), the videos are also available on YouTube.
If you already have a DataCamp subscription, you'll have access to a few courses that cover some basic algorithms, which can be helpful in reinforcing basic skills and comprehension. Elements of AI also provides some free online courses that cover the basics of Machine Learning in a very approachable manner. And if you're interested in digging a little bit further into Natural Language Processing (NLP), there is a good NLP course provided by Dan Jurafsky. The class itself is no longer available via Coursera, but you can still find the lecture slides on his Stanford website, and the material is also covered in his excellent book Speech and Language Processing, co-authored with James H. Martin.
While we highly recommend that you take these particular courses, there are plenty of other useful classes that you can find in the most popular online learning sites, such as Coursera, Udacity and Udemy. Additionally, though we consider online courses the best way to get started in the ML field, there are also many good introductory books, such as Machine Learning for Hackers, Machine Learning: An Algorithmic Perspective and Data Mining: Practical Machine Learning Tools and Techniques, Third Edition.
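To make the simplest of the algorithm families mentioned above concrete, here is a minimal sketch of linear regression trained with batch gradient descent. It is not taken from any of the courses listed. The function name `fit_line`, the learning rate, the step count and the tiny dataset are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors of the current line on every training point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On this noise-free data the fitted slope and intercept should end up very close to 2 and 1. Real datasets are noisy, so the fit will only approximate the underlying trend.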
Mathematics:
Most Machine Learning models rely on some level of mathematics to accomplish training and optimization. Because of this, it's critical for ML engineers to have calculus skills and at least a basic understanding of how ML algorithms work and how their hyperparameters affect their behavior. It's also important to understand mathematical notation and to be able to understand, reproduce, improve and apply algorithms found in the literature.
So if your math skills are a little bit rusty, we highly encourage you to go back to basic calculus and linear algebra. Khan Academy provides some very good courses on both subjects for free. While math is an important tool to carry in your ML tool belt, don’t feel discouraged if you have a hard time with it. These struggles actually give you a great opportunity to improve, and Hackernoon provides an excellent post on this very concept.
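As a small illustration of how a hyperparameter changes an algorithm's behavior, the sketch below runs gradient descent on the function f(x) = x**2 with three different learning rates. The function name `minimize_quadratic`, the starting point and the specific rates are assumptions picked for this example.

```python
def minimize_quadratic(learning_rate, steps=50, start=10.0):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x  # derivative of x**2
        x -= learning_rate * gradient
    return x


# The minimum is at x = 0; compare how close each rate gets in 50 steps.
for rate in (0.01, 0.1, 1.1):
    print(rate, minimize_quadratic(rate))
```

With the smallest rate the result is still far from zero after 50 steps, the middle rate gets very close, and the largest rate overshoots on every step and moves further away. This is the kind of behavior that a little calculus lets you predict before running anything.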
Statistics:
The differences between Statistical Learning and Machine Learning are often debated, and the two share many similarities with only subtle differences. Either way, it is not possible to do Machine Learning without using statistics to some extent. It's critical for ML engineers to have a deep knowledge of probability. If you feel like your probability skills are lacking, Khan Academy provides good introductory courses on this subject as well.
It is also important to have a strong working knowledge of inferential statistics. You can find good courses on Coursera, such as the one created by Duke University.
Finally, knowing Bayesian Statistics proves helpful as well. Courses for this can again be found via Coursera, such as the one provided by UCSC. There are some good books on this subject to supplement your learning, among which is The Elements of Statistical Learning.
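For a first taste of the Bayesian way of thinking, the sketch below applies Bayes' rule to a made-up spam filtering question. The function name `posterior` and all of the probabilities are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(hypothesis | evidence) using Bayes' rule."""
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of mail is spam, and the word "offer"
# appears in 60% of spam but only 5% of legitimate mail.
p_spam_given_offer = posterior(prior=0.2, likelihood=0.6, false_positive_rate=0.05)
print(round(p_spam_given_offer, 3))
```

Under these assumed numbers, seeing the word raises the probability of spam from 0.2 to 0.75. The same update rule underlies much more elaborate Bayesian models.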
3. Learn and Apply the Basics
Now that you know basic theory, it's time to apply your knowledge. The best way to do this is to use existing datasets from well-known use cases. Python's scikit-learn (sklearn) library provides a datasets package with many “toy” datasets that can be used for training and learning purposes. R also provides a package with many famous datasets that can be used for training. Another good place to find datasets for many different problems and domains is Kaggle.
The main advantage of applying your knowledge in these training environments is that, while you're creating a machine learning tool from the ground up, you still don’t have to worry about data handling, which can be a very time-consuming step, letting you focus more on technique. Since most of these datasets are famous, it's also easy to find internet examples that can provide some guidance on how to get started.
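As one possible starting point, the sketch below loads the iris toy dataset bundled with scikit-learn and trains a simple classifier on it. It assumes scikit-learn is installed. The choice of logistic regression, the 25% test split and the random seed are illustrative assumptions, not recommendations from the courses above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a bundled "toy" dataset: 150 iris flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data so the score reflects unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train a simple baseline classifier and measure accuracy on the held-out part.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the data arrives already cleaned and in numeric form, the whole experiment fits in a few lines, which leaves room to try different models and settings.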
4. Network with People who Share Your Goals
Exchanging knowledge is a great way to broaden your own knowledge base. So now that you have enough knowledge to talk about basic concepts and to engage in ML conversations, do it. Talking about ML with other people can help you see things from new perspectives, raise questions that you haven’t thought about before and find new applications and approaches that you haven’t considered. Meetups and conferences are great networking venues since a lot of people with different knowledge bases and skills join these events. These are great ways to get in contact with other beginners who can share helpful materials and courses, as well as more experienced professionals who have already been where you are and can help you develop further.
5. Participate in Competitions
Joining competitions is a wonderful way to start getting serious. Competitions don't provide guidance on which models to use or how to train them. However, like the existing datasets mentioned in step 3, they do provide clean, ready-to-use data, which saves you the trouble of collecting data yourself and doing all the cleaning and handling needed to make it usable. Some competitions also provide forums where users can share their discoveries. These are excellent resources that can help you find your way to better performing models.
The most famous competition website is Kaggle. In addition to competitions, it also provides many useful materials that give aspiring data scientists an opportunity to test their skills. But even though Kaggle is one of the main resources for ML engineers, there are many other sites where you can join competitions, such as Driven Data, CrowdAnalytix and Data Science Challenge.
6. Work on Real Problems
Steps 1-5 help us get a feeling for what it's like to work in machine learning and data science, but working on a real problem is much more challenging than it seems at first sight. Working on a real-world problem is an incomparable experience. Through this experience, you will face problems you haven't encountered before, like dealing with dirty or missing data, having to handle huge, unstructured datasets or working with real-time data. Even if you don't have a job that gives you this kind of opportunity, you can try to repeat other people's work by searching for a blog post or academic paper that you find interesting and trying to reproduce it all by yourself. Don’t fool yourself: this isn’t an easy job, and it will require a lot of hard work, but it is worth your time.
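To give a flavor of the dirty and missing data mentioned above, here is a minimal sketch that parses a messy column of raw strings and fills the gaps with the median of the observed values. The function name `clean_column` and the sample values are hypothetical, and median imputation is only one of several possible strategies.

```python
import statistics


def clean_column(raw_values):
    """Parse raw values into floats, imputing unparseable entries with the median."""
    parsed = []
    for value in raw_values:
        try:
            parsed.append(float(value))
        except (TypeError, ValueError):
            parsed.append(None)  # mark anything unparseable as missing
    observed = [v for v in parsed if v is not None]
    # Raises statistics.StatisticsError if no value at all could be parsed.
    fill = statistics.median(observed)
    return [fill if v is None else v for v in parsed]


# Hypothetical column with an empty string, a placeholder and a null.
raw_ages = ["34", "", "29", "n/a", None, "41"]
cleaned = clean_column(raw_ages)
print(cleaned)
```

Real projects usually need more care than this, for example checking whether values are missing at random before choosing how to fill them.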
And it Doesn't End Here
Now you have an idea of what it's like to work on a real-world problem, and you know how frustrating it can be when you aren’t able to get the results you want. But don’t get flustered. You’re just getting started, and there is still a long road ahead. And don’t forget: your self-learning proclivity is your most valuable asset for this job. Machine learning and data science, like most technology-related fields, are constantly evolving. So keep studying, and always keep your skills sharp!
Sources:
"Roadmap: How to Learn Machine Learning in 6 Months" by Zachariah Miller via Medium
Author
Frederico Caram
Frederico Caram is a Data Architect at Avenue Code. He enjoys reading historical fantasy novels, ballroom dancing, and playing video games.