Why is SQL so important for your data science career?

Ask any data scientist, and they will probably tell you that about 90% of their time is spent on data processing and munging.

The quality of your analytics results, your insights, and your models depends on the quality of your data.

Take a machine learning modeling project, for example:

The overall process of turning raw data into clean, ready-to-use data usually involves the following steps:

  1. Data acquisition
    1. Talk to domain experts to identify the source of the data, understand how the data is generated, and judge whether it is of high quality (machine-generated vs. manually entered).
  2. Data preprocessing
    1. Remove or impute missing data, extract features from textual or categorical data, normalize some data, split the data into training and testing sets, downsample or upsample, etc.
  3. Data postprocessing
    1. Run sanity checks to make sure no apparent mistakes were introduced in the previous steps.
    2. Remove outliers or special cases.

And you will likely need to use SQL in every single step! 
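To make this concrete, here is a minimal sketch of how SQL can show up in the preprocessing and postprocessing steps. It uses Python's built-in `sqlite3` module only so that the SQL can be run end to end. The `readings` table, its columns, the sample rows, and the outlier threshold of 100 are all hypothetical and chosen purely for illustration; a real project would pick its own imputation rule and cutoff.

```python
import sqlite3

# Hypothetical in-memory table standing in for raw, acquired data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [("a", 10.0), ("a", None), ("a", 12.0), ("b", 11.0), ("b", 500.0)],
)

# Preprocessing: impute each missing value with the mean of the
# non-missing values from the same sensor.
conn.execute(
    """
    CREATE TABLE clean AS
    SELECT r.id AS id,
           r.sensor AS sensor,
           COALESCE(
               r.value,
               (SELECT AVG(value) FROM readings WHERE sensor = r.sensor)
           ) AS value
    FROM readings AS r
    """
)

# Postprocessing, sanity check: no missing values should remain.
null_count = conn.execute(
    "SELECT COUNT(*) FROM clean WHERE value IS NULL"
).fetchone()[0]

# Postprocessing, outlier removal: the threshold is an assumed example value.
conn.execute("DELETE FROM clean WHERE value > 100")

rows = conn.execute(
    "SELECT id, sensor, value FROM clean ORDER BY id"
).fetchall()
print(null_count)
print(rows)
```

The same `COALESCE`, `COUNT(*)`, and `DELETE` statements would look much the same in most SQL databases, although syntax details can differ between engines.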

