Big Data – How to Get Started with Data Science

This article represents my opinion on what it takes to get started with Data Science. As I started exploring Big Data, one thing became clear: I may not be successful with Big Data unless I learn and apply Data Science to make sense of it (Big Data being the data with 3Vs: Volume, Velocity, Variety). That is what led me to find out how to get started with Data Science. Please feel free to comment or suggest if I have missed one or more important points. Also, sorry for the typos.
The following key points are covered later in this article:

  • Data Science is NOT Easy & Quick
  • Any Pre-requisites to get started with Data Science
  • Key Topics/Subjects to learn in Data Science
  • Online courses on Data Science

 

Data Science is NOT Easy & Quick

In the last couple of years, every time I tried to get on board with Big Data, the one thing that daunted me and pushed me back was the data science side of Big Data. I always knew in my heart that I would be able to master the Hadoop technology stack. However, the question of what I would do with the data after processing it using the Hadoop stack (data science is not an easy nut to crack, unlike other IT skills) kept me wondering, confused and depressed, and hence away from Big Data. Yes, that is true. Data Science is NOT easy and quick. It requires a lot of commitment, patience and willingness to invest time in learning. So, if you have these “virtues” in you :-), you could certainly get on board with Data Science. The bottom line is that it comes with a cost (patience, commitment etc.), and if you are ready to pay for it, it will keep paying you back for a long time to come.

 

Any Pre-requisites to Get Started with Data Science

Following are some of the areas where it would be good to gain some experience:

  • Basic Knowledge of a Programming Language: It would be good to have a basic understanding of, and some experience with, one or more programming languages, along with an awareness of database fundamentals. If you want to start afresh with a programming language, it may be a good idea to learn Java or Python. Both have good support in the form of machine learning libraries that you could use to learn machine learning algorithms and also to design and apply new techniques.
  • Knowledge of the Hadoop Technology Stack: It may be a good idea to get a basic understanding of the Hadoop technology stack, including some of the following:
    • MapReduce, HDFS
    • NoSQL key-value datastores such as HBase
    • Pig, Hive

The primary reason to acquire some programming experience is that once you have started with Data Science and learnt some concepts, the time comes to experiment and explore, and that is when this experience will come in handy.
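
To give a feel for the MapReduce model listed above, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are my own, chosen only to illustrate how a word count is split into a map step, a grouping step and a reduce step on a single machine.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Grouping step: collect all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs data science", "data science is not easy"]
    print(word_count(sample))
```

In a real Hadoop job the map and reduce steps would run in parallel on many machines, with the framework doing the grouping in between; the sketch only shows the shape of the computation.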

 

Key Topics/Subjects to Learn in Data Science

So, let's look at some of the key topics/subjects to learn if you want to get going with data science:

  • Statistical Modeling: Statistical modeling is one of the key subjects of data science, and one needs to be good at it to excel in this area. It would be good to understand the basic concepts of statistical modeling.
  • Machine Learning Algorithms: The key is to understand the various machine learning algorithms. If you are a developer, you may want to learn them the way you learnt design patterns. With design patterns, the key is to understand the types of problems that a particular pattern solves, along with its implementation technique. Similarly, one needs to understand the problems that a specific algorithm can solve and how that algorithm can be implemented. Some of the high-level topics that one would want to cover in relation to learning algorithms are the following:
    • Supervised learning (rules, trees, forests, nearest neighbor, regression)
    • Optimization (gradient descent and variants)
    • Unsupervised learning
  • Machine-learning Libraries: Once you have an understanding of machine-learning algorithms, the next step would be to take on some of the libraries which provide implementations of these algorithms. Some of them are the following:
    • OpenNLP is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
    • Weka is a collection of machine learning algorithms for data mining tasks.
    • Vowpal Wabbit is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research.
    • Apache Mahout is a library of scalable machine learning algorithms for tasks such as recommendation, classification and clustering.
  • Modeling Tools: Another important aspect is to learn one or more modeling tools (primarily for statistical analysis & visualization) such as MATLAB, R, SAS etc.
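
As a small taste of the optimization topic listed above (gradient descent), here is a minimal sketch in plain Python that fits a straight line y = w*x + b to a handful of points by repeatedly stepping against the gradient of the mean squared error. The function name fit_line, the learning rate, the step count and the toy data are all my own assumptions for illustration; the libraries mentioned above use far more refined variants.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current w and b.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean((w*x + b - y)^2) w.r.t. w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Move a small step in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1, so w should approach 2 and b 1.
    xs = [0, 1, 2, 3, 4]
    ys = [1, 3, 5, 7, 9]
    w, b = fit_line(xs, ys)
    print(round(w, 3), round(b, 3))
```

If the learning rate is set too high for the data, the updates overshoot and the values diverge, which is one reason the "variants" of gradient descent are worth studying.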

 

The aspects of data science mentioned above, including programming (hacking) skills and knowledge of mathematics & statistics, conform to Drew Conway’s Data Science Venn Diagram shown below. Substantive expertise represents domain knowledge related to the data. A data scientist needs to dive deep into the domain knowledge associated with the data to make the best of Big Data.

[Figure: Drew Conway’s Data Science Venn Diagram]

 

Online Courses on Data Science

Searching on Google, I came across the following online courses, which could be a good place to start:

  • Introduction to Data Science: It’s a free course by Bill Howe. This is a direct link to the page consisting of all the videos of this course.
  • Data Science Course from JHU: It’s a free course, but you would be required to shell out $49 if you also want a certificate of course completion from JHU. You could find the course material on GitHub.

 

Ajitesh Kumar


I have recently been working in the area of Data analytics including Data Science and Machine Learning / Deep Learning. I am also passionate about different technologies including programming languages such as Java/JEE, JavaScript, Python, R, Julia, etc., and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data, etc. I would love to connect with you on LinkedIn. Check out my latest book titled First Principles Thinking: Building winning products using first principles thinking.