Using Python for data management and reporting


The blog introduces using Python® for data management and reporting. Python uses Jupyter® Notebook to access data sources to explore reading and writing data for generating reports from the source data.

Jupyter Notebook

In this post, I use the classic Jupyter Notebook running inside a browser from a server that can be local or remote. Jupyter Notebook supports over 40 programming languages, including Python, Java®, and JavaScript®.

Jupyter Notebook consists of the following components:

  • Jupyter Server
  • Notebook
  • Kernel
  • Reports and portfolios

Install Jupyter Notebook, open a Notebook, and set the Kernel

Note: I used Ubuntu® 18 on my server for this installation.

Perform the following steps:

  1. Install all security patches on the server using the following command:

     sudo apt upgrade -y
    
  2. Run the following command to install one or more packages for converting notebooks to PDF:

     sudo apt install -y texlive-xetex
    
  3. To use Jupyter notebook, install Anaconda, a data science and machine learning platform based on Python. Download the Anaconda3-2019.03-Linux-x86_64.sh package.

  4. Run the following command to ensure this download package is not corrupt:

     sha256sum Anaconda3-2019.03-Linux-x86_64.sh
    
  5. Run the following command to install Anaconda:

     bash Anaconda3-2019.03-Linux-x86_64.sh
    
  1. After the installation completes, restart the bash shell to include the changes by running the following command:

     source ~/.bashrc
    
  2. Update the conda® base code by running the following commands:

     conda config --set auto_activate_base false
     conda update -n base -c defaults conda
    
  3. Create the conda virtual environment (such as python_data) using the latest Python version by running the following command:

     conda create -n python_data python=3
    
  4. Activate this virtual environment y running the following command:

     conda activate python_data
    
  5. Run the following commands to make a directory to hold this GitHub repo:

    mkdir python_data
    cd python_data
    
  6. Run the following commands to install the packages:

    conda activate python_data
    conda install jupyter psutil
    

Connect to the Jupyter Notebook server

Perform the following steps to connect to the Juniper Notebook server:

  1. Activated the virtual environment using the following command:

     conda activate python_data
    
  2. Run the following script to start the server:

      python get_notebook_token.py 
    

    This script uses the following command to start the Jupyter Notebook server and continues to run it outside of the terminal window:

     nohup jupyter notebook --no-browser --port=8086
    

    This provides the token and searches the nohup output if you run the preceding command in the background by using get_notebook_token.py as highlighted in the following snapshot:

  1. Go to the terminal window of your local machine and enter the following command:

     ssh -N -L localhost:8087:localhost:8086 username@<the public IP address of the server>
    
  1. Leave this terminal open while using the Jupyter Notebook server in this session.

  2. In the browser, enter the following URL http://localhost:8087 and provide the token as input.

After logging, you can create a Jupyter Notebook by using Python code and Markdown cells.

Examining the data in a DataFrame

Use Pandas DataFrames inside of Jupyter Notebook for examining the data in a DataFrame. Run the following command to import the pandas package.

import pandas as pd

There are different ways to examine data using Pandas DataFrame:

  • .head(): Displays the first five rows in the DataFrame.
  • .tail(): Displays the last five rows in the DataFrame.
  • vet_records: Displays all the records of the DataFrame.
  • .dtypes: Shows statistical operations on columns where you can perform these operations.
  • .at: Allows the user to change the value of a specific cell.
  • f: .assign: Used to add another column of data.
  • vet_records.type.count(): Used for counting and grouping data.
  • loc: Allows you to use column names to slice data.
  • iloc: Requires the use of index numbers. For example: .iloc[row, column]. remember: python indexes starting at 0.
  • .isin: Used to gather data about a list of items.

Writing a report

For writing a report, you need to use LaTex®. LaTex, built into Jupyter Notebook, helps you create technical and scientific documents and is good at displaying mathematical formulas. You can write LaTex formulas in a Markdown cell.

For the Jupyter report, use a data set that allows you to consider some aspects of socioeconomic factors that may impact it.

To create the report and make sure to run the following commands to install supporting packages:

 import matplotlib.pyplot 
 import numpy 
 import pandas
 import random

Conclusion

You can create and share documents that contain live codes, equations, and visualizations that help in data cleaning, transformation, and statistical modeling in data science.

Learn more about our Data services.

Use the Feedback tab to make any comments or ask questions. You can also start a conversation with us.

post avatar
Dushyant Chauhan

As a Racker, I have lots of experience as Oracle APPS DBA and exploring new technology.

Share this information: