



5 things you should know to master SQL for Data Science

Data Science is widely seen as the next big thing in the industry, and tools and languages such as R, Python, Tableau and Power BI are all used in one way or another to draw insights from data. Whichever tool or language you use, they have one thing in common: the data first has to be extracted from a database. Most of the time, when companies look for meaningful insights, they work from past records stored in their databases. For example, a company with a presence in 6 countries wants to generate a sales report to understand which country provides the maximum profit. To get that information, the company must first extract the data from the database and then analyze it in one of the applications mentioned above. To perform this extraction we use SQL (Structured Query Language).

SQL is not just for typing in a query and extracting data. You can also manipulate the data and perform operations such as adding columns, sorting and applying conditions. The natural question is: we could perform all these operations in R, Python and similar tools, so why use SQL? We certainly could, but look at it from a corporate perspective. Continuing with the sales report example, the dataset for each country might have millions of rows. If we extract everything, we feed millions of rows into the next step, which increases processing time. By adding a few conditions instead, such as restricting the report to the last 5 years, we drop the unnecessary rows before they leave the database and, voila, reduce the time needed to process the data in later steps.
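As a rough sketch of that idea, the snippet below filters a tiny sales table down to the last 5 years before any analysis happens. The table name `sales`, its columns, the sample rows and the fixed `current_year` are all assumptions made up for illustration, and SQLite (through Python's built-in `sqlite3` module) is used only so the SQL can be run as-is.

```python
import sqlite3

# Hypothetical in-memory table standing in for a company's sales records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, sale_year INTEGER, profit REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("India", 2012, 50.0),
        ("India", 2021, 120.0),
        ("Germany", 2015, 80.0),
        ("Germany", 2022, 90.0),
        ("Brazil", 2023, 70.0),
    ],
)

current_year = 2024  # assumed and fixed so the example is reproducible

# The WHERE condition keeps only the last 5 years, so the older rows
# never leave the database and never reach the later analysis steps.
recent_rows = conn.execute(
    """
    SELECT country, sale_year, profit
    FROM sales
    WHERE sale_year > ?
    ORDER BY sale_year
    """,
    (current_year - 5,),
).fetchall()

print(recent_rows)
```

Only three of the five sample rows are returned here; on a table with millions of rows the same condition is what keeps the extract small.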

Now that you know the importance of SQL in the Data Science workflow, let us look at the top 5 things in SQL you should know to make the best use of it.

Let’s Start!

The first is the ‘Group by’ statement. Consider a company that wants to analyze the distribution of its employee salaries across departments. A single department can have multiple employees, so we use the ‘Group by’ statement to group the dataset by department, which gives a clearer picture for analyzing salaries.

Sample code and output for a GROUP BY query.
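A minimal sketch of such a query is shown below. The `employees` table, its columns and the five sample rows are hypothetical, and SQLite via Python's `sqlite3` module is used only to make the SQL runnable.

```python
import sqlite3

# Hypothetical employee data: several employees can share one department.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 40000.0),
        ("Ben", "Sales", 60000.0),
        ("Chen", "IT", 80000.0),
        ("Dina", "IT", 70000.0),
        ("Eli", "HR", 45000.0),
    ],
)

# GROUP BY collapses the rows of each department into one summary row.
salary_by_department = conn.execute(
    """
    SELECT department,
           COUNT(*)    AS employees,
           AVG(salary) AS avg_salary,
           MIN(salary) AS min_salary,
           MAX(salary) AS max_salary
    FROM employees
    GROUP BY department
    ORDER BY department
    """
).fetchall()

for row in salary_by_department:
    print(row)
# ('HR', 1, 45000.0, 45000.0, 45000.0)
# ('IT', 2, 75000.0, 70000.0, 80000.0)
# ('Sales', 2, 50000.0, 40000.0, 60000.0)
```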

A subquery is exactly what it sounds like: a query that exists within another query. A SELECT statement nested inside another SELECT statement is a subquery. Some important points to know about subqueries are:

Sample code:
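One possible illustration, assuming a hypothetical `employees` table: the inner query computes the overall average salary, and the outer query uses that single value to filter rows. SQLite via Python's `sqlite3` module is used here only so the SQL can be executed.

```python
import sqlite3

# Hypothetical employee data, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 40000.0),
        ("Ben", "Sales", 60000.0),
        ("Chen", "IT", 80000.0),
        ("Dina", "IT", 70000.0),
        ("Eli", "HR", 45000.0),
    ],
)

# The inner SELECT runs on its own and returns one value (the average
# salary, 59000.0 here); the outer SELECT compares each salary to it.
above_average = conn.execute(
    """
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY name
    """
).fetchall()

print(above_average)
# [('Ben', 60000.0), ('Chen', 80000.0), ('Dina', 70000.0)]
```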

The CASE clause works on the cases/conditions provided in the query. If a condition is true, CASE returns the corresponding result; otherwise it moves on to the next condition. One of the most common uses of CASE is to transpose data, i.e. to convert rows into columns. Let's take an example.

In the example below, we count the total number of employees working in each department.
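A sketch of that count, assuming a hypothetical `employees` table with made-up rows (run through Python's `sqlite3` module), could look like this. Each department comes back as its own row.

```python
import sqlite3

# Hypothetical employee data, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Asha", "Sales"),
        ("Ben", "Sales"),
        ("Chen", "IT"),
        ("Dina", "IT"),
        ("Eli", "HR"),
    ],
)

# One output row per department, with the number of employees in it.
employees_per_department = conn.execute(
    """
    SELECT department, COUNT(*) AS total_employees
    FROM employees
    GROUP BY department
    ORDER BY department
    """
).fetchall()

print(employees_per_department)
# [('HR', 1), ('IT', 2), ('Sales', 2)]
```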

But what if we wanted to see the same result with the departments as columns instead of rows?

That’s where the CASE clause makes the difference: just write the query below and voila!

Using the CASE clause to Transpose the Data.
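The following is one way such a transposing query could be written, not necessarily the exact query intended above. The `employees` table, its rows and the three department names are assumptions for illustration; the SQL is run through Python's `sqlite3` module.

```python
import sqlite3

# Hypothetical employee data, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Asha", "Sales"),
        ("Ben", "Sales"),
        ("Chen", "IT"),
        ("Dina", "IT"),
        ("Eli", "HR"),
    ],
)

# Each CASE yields 1 for rows of its department and 0 otherwise, so the
# SUM over all rows is that department's head count. The result is a
# single row with one column per department.
counts_as_columns = conn.execute(
    """
    SELECT
        SUM(CASE WHEN department = 'HR'    THEN 1 ELSE 0 END) AS hr,
        SUM(CASE WHEN department = 'IT'    THEN 1 ELSE 0 END) AS it,
        SUM(CASE WHEN department = 'Sales' THEN 1 ELSE 0 END) AS sales
    FROM employees
    """
).fetchone()

print(counts_as_columns)
# (1, 2, 2)
```

Note that this approach needs the department names to be known in advance, since each output column is spelled out in the query.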

Now that we have been introduced to the GROUP BY clause and subqueries, the correlated subquery should be fairly easy to understand. Both a subquery and a correlated subquery are a query within a query, so how do we tell that a query is a correlated subquery, and what does that mean?

A correlated subquery is one whose subquery portion is correlated with the outer query. Put another way, it is a query nested inside another query that uses values from the outer query, so conceptually it is evaluated once for each row the outer query considers. Let us look at an example.
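As an illustrative sketch, the query below finds employees who earn more than the average of their own department. The `employees` table, its rows and the aliases `e` and `dept` are assumptions made up for this example; SQLite via Python's `sqlite3` module is used only to make it runnable.

```python
import sqlite3

# Hypothetical employee data, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 40000.0),
        ("Ben", "Sales", 60000.0),
        ("Chen", "IT", 80000.0),
        ("Dina", "IT", 70000.0),
        ("Eli", "HR", 45000.0),
    ],
)

# The inner query references e.department from the OUTER query, which is
# what makes it correlated: the average is taken over the department of
# whichever employee row is currently being checked.
above_department_average = conn.execute(
    """
    SELECT e.name, e.department, e.salary
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(dept.salary)
        FROM employees AS dept
        WHERE dept.department = e.department
    )
    ORDER BY e.name
    """
).fetchall()

print(above_department_average)
# [('Ben', 'Sales', 60000.0), ('Chen', 'IT', 80000.0)]
```

Unlike a plain subquery, the inner SELECT here cannot be run on its own, because `e.department` only exists in the context of the outer query.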

Joins are among the most widely used and important operations in data analysis, because we often need to fetch data from multiple tables to gather consolidated information before performing the analysis.

I am going to show you how joins make our work easier. As discussed, we can use subqueries and correlated subqueries to nest one query within another and obtain useful information. However, this can make a query look more complicated. That’s where joins come in.

Consider the example above, where we wrote a correlated subquery.

We can simplify the same code by using joins, as below.

Using Joins
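To make the comparison concrete, here is a sketch that writes one question ("who earns more than their department's average?") twice: first as a correlated subquery, then as a join against a per-department summary. The `employees` table, its rows and the aliases are hypothetical, and this is only one of several ways to phrase the join.

```python
import sqlite3

# Hypothetical employee data, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 40000.0),
        ("Ben", "Sales", 60000.0),
        ("Chen", "IT", 80000.0),
        ("Dina", "IT", 70000.0),
        ("Eli", "HR", 45000.0),
    ],
)

# Version 1: correlated subquery, tied to each outer row via e.department.
correlated_rows = conn.execute(
    """
    SELECT e.name, e.department, e.salary
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(dept.salary)
        FROM employees AS dept
        WHERE dept.department = e.department
    )
    ORDER BY e.name
    """
).fetchall()

# Version 2: join each employee to their department's average, which is
# computed once per department in the FROM clause.
join_rows = conn.execute(
    """
    SELECT e.name, e.department, e.salary
    FROM employees AS e
    JOIN (
        SELECT department, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
    ) AS d ON d.department = e.department
    WHERE e.salary > d.avg_salary
    ORDER BY e.name
    """
).fetchall()

print(correlated_rows == join_rows)  # both versions return the same rows
```

In the join version the department averages are available as an ordinary column (`d.avg_salary`), so they can also be selected or reused in further conditions.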

This might not look like much of a difference, but consider working on a project where you have to analyze 1000 lines of code: which of the two options would you prefer? Or, put differently, which would you rather use when writing a query, so that when you come back to it in the future it is easy to understand what the query is trying to do?

A subquery is still helpful when you need a condition inside a clause such as SELECT, WHERE or HAVING, since joins can only be used in the FROM clause!
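For instance, a subquery can sit directly inside a HAVING clause. The sketch below, built on a hypothetical `employees` table with made-up rows, keeps only the departments whose average salary is above the company-wide average.

```python
import sqlite3

# Hypothetical employee data, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 40000.0),
        ("Ben", "Sales", 60000.0),
        ("Chen", "IT", 80000.0),
        ("Dina", "IT", 70000.0),
        ("Eli", "HR", 45000.0),
    ],
)

# HAVING filters the grouped rows; the subquery supplies the value to
# compare against (the overall average salary, 59000.0 here).
high_paying_departments = conn.execute(
    """
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > (SELECT AVG(salary) FROM employees)
    ORDER BY department
    """
).fetchall()

print(high_paying_departments)
# [('IT', 75000.0)]
```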

These were the 5 must-know things for using SQL in data science. I hope this blog has given you a better understanding. Thank you!


Who am I?

I am an aspiring Data Scientist, and I am open to collaborating and working on projects related to Data Manipulation and Machine Learning.

You can check out my blogs here.
