How to Become a Machine Learning Researcher in Cyber-Security (with No Computer Science Courses) — Part 2
It has been a few months since I finished writing Part 1 of this Medium series, so I thought it was about time to follow up for all of the readers expecting a Part 2 — especially when the first article is called Part 1 with no Part 2 to be found.
For this article I will spend less time on the content and focus more on best practices while learning to code. I will leave the more content-heavy explanations for future installments, since learning habits and documenting your progression are core to what future employers will look for in an applicant.
To start, I think this is a good time to go over some learning practices that will really help the concepts sink in. Learning to program is analogous to learning to become a better writer. You can read the work of professional journalists and essayists, but unless you put pen to paper (or fingers to keyboard), you will lack the ability to think creatively about concepts you have never read about before. Ultimately, writing is about putting your own new thoughts into words, not summarizing the thoughts of others (unless, of course, you are writing a survey).
In this way, programming is similar to writing. You should be able to take a problem you have never seen before and apply the bevy of knowledge and tools at your disposal to solving it. This is what programming interviews attempt to evaluate; in many cases you are not being judged on whether you can solve the problem in the allotted time (although that usually matters to an extent), but on your thinking process when presented with such a problem.
That is because programming is not about reciting lines of code on demand, but about taking the tools you have learned and applying them to a new circumstance with different constraints.
This type of thinking is what I call algorithmic thinking, and it is at the core of what makes a programmer more competent than someone who learns the theory without following it up with practice. While working on practical projects may seem boring (who doesn’t want to learn new things?), think of projects as a way of practicing the core coding skills you will inevitably need when applying for jobs.
Practice Makes Perfect
There has to be a follow-up to all of those lovely YouTube and Google Machine Learning courses that teach you the theory. Without applying the concepts to new problems, you will find it difficult to handle slightly different circumstances, such as a new dataset that is not as nicely cleaned and organized as the ones in the tutorial videos.
First, I want to go over the 80/20 rule. The 80/20 rule suggests that for every 10 hours spent programming, 2 hours should go towards theory and learning new tools, and 8 hours towards applying those concepts to new problems. While this 1:4 ratio may seem like a lot of time to spend on old concepts, the reality is that you are engaging with what you have learned in a new and unique way, which reinforces the concepts further. A few examples of practical problems to work on include:
1. Completing the Google Machine Learning Crash Course and then working with your own dataset. A great starting point is your monthly banking transactions, so that the problem feels closer to home. Most banks allow you to download your transaction record as a .csv file, which you can then load into a Jupyter notebook and play around with. Try some simple exercises to begin with. Beginner: what is my net balance at the end of the month? (Confirm against what the statement says.) Intermediate: what is my average weekly spending? How much am I spending on gas, entertainment, or food? Advanced: can I detect a large purchase mathematically? Think a simple 95% confidence interval (see below) or a Poisson model.
# determine the upper and lower limits of the 95% confidence interval
upper = x.mean() + (2 * x.std())
lower = ...  # how would you calculate the lower limit?
# how would you check whether a data point is within the limits?
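To make the banking exercise concrete, here is a minimal pandas sketch of the three difficulty levels. The column names (`date`, `amount`) and the inline sample data are illustrative assumptions; your bank's CSV will have its own schema, and you would replace the `io.StringIO` stand-in with `pd.read_csv("transactions.csv")`.

```python
import io
import pandas as pd

# stand-in for pd.read_csv("transactions.csv"); columns are hypothetical
csv = io.StringIO(
    "date,amount\n"
    "2021-01-04,-52.10\n"
    "2021-01-07,-13.45\n"
    "2021-01-15,1500.00\n"
    "2021-01-22,-880.00\n"
)
df = pd.read_csv(csv, parse_dates=["date"])

# Beginner: net balance at the end of the month
net_balance = df["amount"].sum()

# Intermediate: average weekly spending (withdrawals are negative)
spending = df[df["amount"] < 0]["amount"].abs()
weekly = spending.groupby(df["date"].dt.isocalendar().week).sum()
avg_weekly_spending = weekly.mean()

# Advanced: flag purchases above the upper 95% confidence limit
upper = spending.mean() + 2 * spending.std()
large_purchases = spending[spending > upper]
```

With only a handful of transactions the confidence interval is wide, so nothing gets flagged here; the exercise gets more interesting on a few months of real data.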
2. Learning how a neural network works and then creating one from scratch in NumPy. Learn how the forward pass is calculated, then calculate the backpropagation and update the weights. Start with a simple Artificial Neural Network (ANN) (1 hidden layer calculated in a loop), then work towards vectorizing the computation using matrix multiplication. Then expand the problem to multiple neurons, or even multiple hidden layers.
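As a rough illustration of what Exercise 2 ends up looking like once vectorized, here is a minimal NumPy sketch of a 1-hidden-layer ANN trained on XOR. The sigmoid activation, squared-error loss, layer sizes, and learning rate are all my own illustrative choices, not prescribed by any particular course.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy dataset: XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 2 inputs -> 3 hidden units -> 1 output
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

lr = 1.0
losses = []
for _ in range(5000):
    # forward pass, vectorized as matrix multiplications
    h = sigmoid(X @ W1 + b1)    # hidden activations
    out = sigmoid(h @ W2 + b2)  # network output
    losses.append(np.mean((out - y) ** 2))

    # backpropagation: error signals via the chain rule
    # (constant factors from the loss are folded into the learning rate)
    d_out = (out - y) * out * (1 - out)  # error at the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)   # error at the hidden pre-activation

    # gradient-descent weight updates
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)
```

The loop-based version the exercise starts with computes each hidden neuron's output one at a time; the `X @ W1` form does all of them in a single matrix multiplication, which is the vectorization step the exercise builds towards.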
3. Learning about different activation functions and then coding them from scratch. Rather than just assume different activation functions are good in different circumstances, you can code them in a few lines of code and view their activations/derivatives.
# vanilla python relu implementation
def relu(x):
    if x < 0:
        return 0
    else:
        return x

# vanilla python relu derivative implementation
def d_relu(x):
    if x < 0:
        return 0
    else:
        return 1

# do the relu activation and derivative resemble those from the literature? Plot them yourself!
Even better, if you are already working with an ANN, implement the activation functions yourself and track the performance. Do you see dead neurons? Saturated neurons? What does this mean for the performance of the ANN? (For those who followed the Google Machine Learning course, dead and saturated neurons both refer to neurons whose derivative values have gone to zero, so their weights stop updating.)
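You can see saturation numerically without a full network. Below is a small sketch comparing the sigmoid derivative (which vanishes for large inputs of either sign) against the relu derivative (which is zero only for negative inputs, the "dead" region). The function names mirror the relu snippet above; `sigmoid` and `d_sigmoid` are my own additions for the comparison.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)

def d_relu(x):
    # vectorized version of the relu derivative above
    return np.where(x < 0, 0.0, 1.0)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# sigmoid saturates: its derivative is tiny at both ends of the range,
# so weight updates through saturated sigmoid neurons are tiny too
print(d_sigmoid(x))

# relu does not saturate for positive inputs, but has zero gradient
# (a "dead" neuron) for negative inputs
print(d_relu(x))
```

Plotting both curves over, say, `np.linspace(-10, 10, 200)` makes the difference obvious at a glance.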
As a follow-up to Exercise 1, Kaggle is known as the go-to website for competitive programming in the data science community. Even better, you can download datasets from older competitions and work on them yourself. Always explore the data yourself first, then look at how others have used the data, along with the winning solutions. When it comes to data exploration there are no strict guidelines to follow, but seeing the workflow of others can help you incorporate that same line of thinking into your own future work.
Documenting Learnings and Progression
Following on from the previous section, it is very important to document your work in a way that shows you are progressing. In the programming world, Github is the most popular hosting service for code repositories, for both large development teams and hobbyists.
Having your portfolio of completed work in one place is important, as your Github is one of the first places employers look when you apply for a job. Github also has a dashboard showing your commits (code you have added or changed yourself), which provides a gauge of how active you are in the coding community.
Github is not solely for code commits on team projects and large packages; you can easily create your own repository and make your own daily contributions. That way your code base is constantly being updated in a single place, and there is a clear indication to anyone visiting your Github that you are an active contributor.
Github can also be your first chance to improve your ability to document and disseminate your findings. It is a skill in and of itself to work on a project and document your data sources, methodology, and findings coherently enough that others can replicate your results.
For most high-profile code libraries, the authors include a Getting Started section to bring developers up to speed on how the code is run. As an example, Numpy has a simple but complete Github page, with the vast majority of the documentation served from a dedicated site.
What we learned
In this article I went over two key strategies that will improve your ability to develop while serving as practice for future coding-related job interviews and applications.
- When learning, always consider the 80/20 rule: spend 80% of your time applying old concepts and 20% learning new ones. This will reinforce previously learned theory while giving you an excuse to work on projects you might actually find interesting.
- Document your progress through code commits or any other medium that demonstrates your work in progress. While Github is the industry standard for code commits, it is not necessarily the best place to document other kinds of results. There is a Medium publication called Towards Data Science where developers write articles on topics related to data science, machine learning, and mathematics. You can also document your results in Kaggle’s integrated notebooks, or even your own blog if you fancy that.
With that, I hope you learned a lot from this article, and I will see you all in Part 3, where we will get back into the content-heavy explanations on becoming a Machine Learning engineer. See you next time!