10 Powerful Python Data Science Tips

Python is not limited to Pandas, NumPy and scikit-learn (although they are absolutely essential in data science)! We can use a lot of Python techniques to improve code, accelerate data science tasks, and increase the efficiency of writing code.

When was the last time you learned a new Python technique? As data scientists, we are used to using familiar libraries and calling the same functions every time. It's time to break the old convention!

More importantly, learning the new things we can do in Python is really fun! I like to play with different packages and functions. Every once in a while, there will be a new trick that attracts me, and I integrate it into my daily work.

Therefore, I decided to organize my favorite Python tricks in one place: this article! The list ranges from speeding up basic data science tasks (such as preprocessing) to running R and Python code in the same Jupyter Notebook. There is a lot to learn, so let's get started!

1. zip: Combine multiple lists in Python

Usually we end up writing complex for loops to combine multiple lists. Sound familiar? Then you will love the zip function. As its documentation puts it, zip "makes an iterator that aggregates elements from each of the iterables".

Let's use a simple example to understand how to use the zip function and combine multiple lists:
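Here is a minimal sketch with some made-up lists (the names and ages are just for illustration):

```python
names = ['Ankit', 'Divya', 'Ram']
ages = [24, 27, 22]

# zip aggregates the i-th element of each iterable into one tuple
combined = list(zip(names, ages))
print(combined)  # [('Ankit', 24), ('Divya', 27), ('Ram', 22)]
```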


See how easy it is to merge multiple lists?

2. gmplot: Plot GPS coordinates from a dataset on Google Maps

I love working with Google Maps data. Think about it: it is one of the richest data applications out there. That is why I decided to start with this Python trick.

A scatter plot is great when we want to see the relationship between two variables. But if the variables are the latitude and longitude coordinates of a location, would a scatter plot still do? Probably not. It is better to plot these points on a real map so that we can easily visualize and solve a specific problem (such as optimizing a route).

gmplot provides an amazing interface that can generate HTML and JavaScript to present all the data we want on Google Maps. Let us see an example of how to use gmplot.

Install gmplot

!pip3 install gmplot

Draw location coordinates on Google Maps

You can download the dataset of this code here.

Let's import the library and read the data:

import pandas as pd
import gmplot

data = pd.read_csv('3D_spatial_network.csv')

# latitude and longitude lists
latitude_list = data['LATITUDE']
longitude_list = data['LONGITUDE']

# center coordinates of the map and the zoom level
gmap = gmplot.GoogleMapPlotter(56.730876, 9.349849, 9)

# plot the coordinates on the Google map
gmap.scatter(latitude_list, longitude_list, '#FF0000', size=40, marker=True)

# overlay a heatmap of the same points
gmap.heatmap(latitude_list, longitude_list)

# write the HTML file; open it in your web browser
gmap.draw("mymap.html")

The above code generates an HTML file in which the latitude and longitude coordinates are plotted on Google Maps. The heatmap shows areas with a high density of points in red. Cool, right?

3. category_encoders: Use 15 different encoding schemes to encode categorical variables

One of the biggest obstacles we face early in a data science project is how to deal with categorical variables. Our machines can process numbers in the blink of an eye, but handling categories is a different problem altogether.

Some machine learning algorithms can handle categorical variables by themselves, but most require us to convert them into numeric variables first. For this, category_encoders is an amazing library that provides 15 different encoding schemes.

Let's see how to use this library.

Install category-encoders

!pip3 install category-encoders

Convert categorical data to numeric data

import pandas as pd 
import category_encoders as ce 

# create a Dataframe 
data = pd.DataFrame({ 'gender' : ['Male', 'Female', 'Male', 'Female', 'Female'],
                      'class' : ['A','B','C','D','A'],
                      'city' : ['Delhi','Gurugram','Delhi','Delhi','Gurugram'] }) 

# One Hot Encoding 
# create an object of the One Hot Encoder 

ce_OHE = ce.OneHotEncoder(cols=['gender','city']) 

# transform the data 
data = ce_OHE.fit_transform(data) 

category_encoders supports about 15 different encoding methods, for example:

  • Hashing encoding
  • LeaveOneOut encoding
  • Ordinal encoding
  • Binary encoding
  • Target encoding

All the encoders are fully compatible with scikit-learn transformers, so you can easily use them in your existing scripts. In addition, category_encoders supports NumPy arrays and Pandas dataframes. You can read more about category_encoders here.

4. progress_apply: Monitor the progress of your data science tasks

How much time do you usually spend cleaning and preprocessing data? Data scientists are often said to spend 60 to 70% of their time cleaning data, so keeping track of that work matters, right?

We don't want to spend days cleaning the data while losing sight of the other data science steps. This is where the progress_apply function makes our lives easier. Let me demonstrate how it works.

Let's calculate the distance from all points to a specific point and view the progress of completing this task. You can download the dataset here.

import pandas as pd
from tqdm._tqdm_notebook import tqdm_notebook
from pysal.lib.cg import harcdist

# register tqdm with pandas so that progress_apply is available
tqdm_notebook.pandas()

data = pd.read_csv('3D_spatial_network.csv')

# calculate the distance of each data point from
# (Latitude, Longitude) = (58.4442, 9.3722)
def calculate_distance(x):
    return harcdist((x['LATITUDE'], x['LONGITUDE']), (58.4442, 9.3722))

data['DISTANCE'] = data.progress_apply(calculate_distance, axis=1)

You will see how easy it is to track the progress of our code. Simple and efficient.

5. pandas_profiling: Generate a detailed report of the data set

We spend a lot of time understanding the data we receive. And that is fair: we don't want to jump straight into model building without knowing the data we are working with. This is an essential step in any data science project.

pandas_profiling is a Python package that reduces the amount of work required to perform the initial data analysis steps. The package can generate detailed reports on our data with just one line of code!

import pandas as pd
import pandas_profiling

# read the dataset
data = pd.read_csv('add-your-data-here')

# generate the report with a single line of code
pandas_profiling.ProfileReport(data)

We can see that with just one line of code, we get a detailed report of the data set:

  • Warnings, for example: Item_Identifier has a high cardinality (1559 distinct values)
  • Frequency count of all categorical variables
  • Quantile and descriptive statistics of numeric variables
  • Correlation diagram

6. Grouper: Group time series data

Who is not familiar with Pandas now? It is one of the most popular Python libraries and is widely used for data manipulation and analysis. We know that Pandas has amazing capabilities to manipulate and summarize data.

I was working on a time series problem recently and noticed that Pandas has a Grouper function I had never used before, so I got curious about it.

It turns out that this Grouper function is a very important function for time series data analysis. Let's try this and see how it works. You can download the dataset of this code here.

import pandas as pd 

data = pd.read_excel('sales-data.xlsx') 

Now, the first step in processing any time series data is to convert the date column to DateTime format:

data['date'] = pd.to_datetime(data['date'])

Suppose our goal is to see the monthly sales of each customer. Most of us would try to write something complicated for this. But this is where Pandas comes in handy.

data.set_index('date').groupby('name')["ext price"].resample("M").sum()

With pd.Grouper, we can do the same thing through plain groupby syntax without having to set the index first. We just give it some extra information about how to group the data in the date column. It looks cleaner and works exactly the same:

data.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
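If you don't have the sales dataset handy, the same pattern can be tried on a small made-up frame (the names, dates, and prices below are invented, but the columns match the example above):

```python
import pandas as pd

# hypothetical mini sales table
data = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2024-01-05', '2024-01-20',
                            '2024-01-10', '2024-02-03']),
    'ext price': [100.0, 50.0, 75.0, 25.0],
})

# group by customer and by calendar month of the date column
monthly = data.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
print(monthly)
```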

7. unstack: Convert an index level into dataframe columns

We just saw how Grouper helps group time series data. Now, here is a challenge: what if we want to reshape that grouped result so that one of the index levels becomes the columns of a dataframe?

This is where the unstack function becomes crucial. Let's apply the unstack function to the code example above and see the result.

data.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum().unstack()

Very useful! By default, unstack pivots the innermost index level (the dates here) into columns. Note: if the index is not a MultiIndex, the output will be a Series.
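The same reshaping can be seen on a tiny made-up Series (the names and months below are invented for illustration):

```python
import pandas as pd

# a small Series with a two-level (name, month) index
s = pd.Series(
    [150.0, 75.0, 25.0],
    index=pd.MultiIndex.from_tuples(
        [('A', 'Jan'), ('B', 'Jan'), ('B', 'Feb')],
        names=['name', 'month'],
    ),
)

# unstack pivots the innermost index level (month) into columns;
# missing (name, month) combinations become NaN
wide = s.unstack()
print(wide)
```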

8. %matplotlib notebook: Interactive plotting in Jupyter Notebook

I am a big fan of the matplotlib library. It is the most common visualization library we use to generate various graphs in Jupyter Notebook.

To view these plots, we usually include the line %matplotlib inline while importing the matplotlib library. It works well and renders static images in the Jupyter Notebook.

Just replace %matplotlib inline with %matplotlib notebook and watch the magic. You will get resizable and zoomable plots right inside your Notebook!

%matplotlib notebook
import matplotlib.pyplot as plt

# scatter plot of some data
# try this on your dataset
plt.scatter(data['quantity'],data['unit price'])

By changing just one word, we get an interactive plot that we can resize and zoom in on.

9. %%time: Check the running time of a specific Python code block

There are many ways to solve a problem. As data scientists, we know this very well. Computational costs are critical in the industry, especially for small and medium-sized organizations. You may want to choose the best method to complete the task in the shortest time.

In fact, it is very easy to check the runtime of a specific code block in Jupyter Notebook.

Just add the %%time command at the top of a cell to check its running time:

%%time
def myfunction(x):
    for i in range(1, 100000, 1):
        x += i
    return x
myfunction(0)
Here we get both the CPU time and the wall time. CPU time is the total time the CPU dedicates to the process, while wall time is the real-world elapsed time between the start of the process and "now" on the clock.

10. rpy2: R and Python in the same Jupyter Notebook!

R and Python are two of the best and most popular open-source programming languages in the data science world. R is mainly used for statistical analysis, while Python provides a simple interface to translate mathematical solutions into code.

The great news is that we can use both in a single Jupyter Notebook! We can take advantage of both ecosystems, and all we need to do is install rpy2.

So let's put the R-versus-Python debate aside for now and draw ggplot2-quality charts in our Jupyter Notebook.

!pip3 install rpy2

We can use the two languages at the same time and even pass variables between them.

%load_ext rpy2.ipython

import pandas as pd

df = pd.DataFrame({
        'Class': ['A', 'A', 'A', 'V', 'V', 'A', 'A', 'A'],
        'X': [4, 3, 5, 2, 1, 7, 7, 5],
        'Y': [0, 4, 3, 6, 7, 10, 11, 9],
        'Z': [1, 2, 3, 1, 2, 3, 1, 2]
})

Then, in a new notebook cell, the -i option passes df from Python to R:

%%R -i df
require(ggplot2)
ggplot(data = df) + geom_point(aes(x = X, y = Y, color = Class, size = Z))

Here, we created a data frame df in Python, and used it to create a scatter plot using R's ggplot2 library (geom_point function).


That is my essential collection of Python tricks. I love using these packages and functions in my daily tasks. Honestly, my productivity has improved, and working in Python is more fun than ever.

In addition to these, do you have any Python tricks you want me to know? Tell me in the comments section below, we can exchange ideas!
