# PyParis logo
from IPython.display import Image
Image("PyParis.png")

Introduction to Data Analysis using Python - PyParis 2017¶

Francis Wolinski - Yotta Conseil¶

Expertise & IT Audit - Data Science - Python

twitter: https://twitter.com/@fran6wol
web: https://yotta-conseil.fr
github : https://github.com/fran6w/PyParis2017

0. Tutorial objectives and materials¶

0.1 Objectives¶

Exploring basic functionalities of Python Data Analysis Ecosystem: Numpy, Pandas, Matplotlib
Analysis two different datasets but conforming them so that the same analysis can be performed on both datasets
Perform few cross analysis on both datasets

0.2 Materials¶

Python 3.5+ : numpy, pandas, matplotlib

Datasets :

Names from US Social Security from 1880 to 2016 : https://www.ssa.gov/oact/babynames/limits.html (National data)
- 137 CSV files named yobYYYY.txt, without header
- Within each file: name, gender (F or M), births (int)
- No names when less that 5 occurences in a year
Names from French Insee from 1900 to 2015 : https://www.insee.fr/fr/statistiques/2540004 (Fichier France)
- 1 single TSV file: nat2015.txt, with header
- In the file: gender (1 or 2), name, year, births (float!)
- No names when less than 3 occurences in a year
- Rare names are labelled as _PRENOMS_RARES
- Years for rare names are labelled XXXX

References

Wes McKinney - Python for Data Analysis - O'Reilly (2013)
Jake VanderPlas - Python Data Science Handbook - O'Reilly (2016)

# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os

# magic command to display graphics within the notebook
%matplotlib inline

# display options
from pandas import set_option
set_option("display.max_rows", 16)
set_option("precision", 3)
plt.style.use('seaborn-notebook')

1. Loading Data in a DataFrame¶

1.1 Loading US Data¶

1.1.1 Loading a single file¶

We first load a single CSV file using Pandas' read_csv() function.

# loads US data from last year
path = "datasets/names/"
filename = "yob2016.txt"
us2016 = pd.read_csv(os.path.join(path, filename),
                           header=None,
                           names=['name', 'gender', 'births'])
us2016

We obtain a Pandas DataFrame object which represents 2D data. The info() method provides the main characteritics of the DataFrame.

# characteristics of a DataFrame
us2016.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32868 entries, 0 to 32867
Data columns (total 3 columns):
name      32868 non-null object
gender    32868 non-null object
births    32868 non-null int64
dtypes: int64(1), object(2)
memory usage: 770.4+ KB

# access to the index
us2016.index

RangeIndex(start=0, stop=32868, step=1)

# acces to columns
us2016.columns

Index(['name', 'gender', 'births'], dtype='object')

# acces to dimensions
us2016.shape

(32868, 3)

We can access to a specific column by using the [] operator and obtain a Pandas Series object which represents 1D data. It shares the original DataFrame index.

# access to a specific column
us2016['name']  # us2016.name

0             Emma
1           Olivia
2              Ava
3           Sophia
4         Isabella
5              Mia
6        Charlotte
7          Abigail
           ...    
32860         Ziya
32861       Ziyang
32862         Zoel
32863       Zolton
32864       Zurich
32865       Zyahir
32866         Zyel
32867        Zylyn
Name: name, dtype: object

# number of US unique names
us2016['name'].nunique()

30294

Few operations on DataFrame: transposition, sorting and grouping

Transposition reverse rows and columns and switch index and column labels : transpose() or T method.

# transposition T or transpose()
us2016.T

It is possible to sort a DataFrame by one or more columns and specifying the ascending : sort_values() method.

# top 10 US names in 2016
us2016.sort_values('births', ascending=False).head(10)

DataFrame are provided with a powerful groupby() method which splits it into chuncks - according to the values of one or more columns - and then apply a summarization function to each chunk and concatenates the whole in a new DataFrame.

This illustrates the split-apply-combine strategy : groupby() then sum() methods.

# US female and male births of year 2016
us2016.groupby('gender').sum()

1.1.2 Loading all files¶

We load and merge all CSV files within the folder names. The year is extracted from file names and added as a new column. We use the append() method which concatenates 2 DataFrames.

# load all data in a single DataFrame
us = pd.DataFrame()
for root, dirs, files in os.walk(path):
    for filename in files:
        if filename[0:3] != "yob":
            continue
        csv = pd.read_csv(os.path.join(root, filename),
                           header=None,
                           names=['name', 'gender', 'births'])
        year = int(filename[3:7])  # yobYYYY.txt
        csv['year'] = year
        us = us.append(csv, ignore_index=True)

us = us[['year', 'name', 'gender', 'births']]  # fancy indexing in order to reorder the columns

us

# characteristics of a DataFrame
us.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1891894 entries, 0 to 1891893
Data columns (total 4 columns):
year      int64
name      object
gender    object
births    int64
dtypes: int64(2), object(2)
memory usage: 57.7+ MB

# number of US unique names
us['name'].nunique()

96174

# top 10 global US names within a year
us.sort_values('births', ascending=False).head(10)

# global US female and male births
us.groupby('gender').sum()['births']  [['births']]

gender
births   NaN
Name: births, dtype: float64

What are the top 10 US names all years? We are going to implement a function so that it can be reused later.

# top (10) names all years
def top(df, size=10):
    return df.groupby(['name', 'gender']).sum()['births'].sort_values(ascending=False).head(size)

# global US top 10 names all years
top(us)

name     gender
James    M         5136026
John     M         5105834
Robert   M         4809176
Michael  M         4338144
Mary     F         4120692
William  M         4087525
David    M         3601032
Joseph   M         2592856
Richard  M         2560771
Charles  M         2378806
Name: births, dtype: int64

1.2 Loading French Data¶

Firstly, we load the French data in order to have a look at them.

# load 
fr = pd.read_csv("datasets/nat2015.txt",
                     sep='\t',
                     encoding='latin-1',
                     header=0,
                     names = ['gender', 'name', 'year', 'births'])
fr = fr[['year', 'name', 'gender', 'births']]
fr

Secondly, we reload the French data so that they have the same characteristics as the US ones :

columns :
- gender mapping '1' => 'M', '2' => 'F' and switching to caterogy
- name case including compound names with '-'
- year astype int
- births astype int
eliminate irrelevant data
sort the data
reset the index

# load French data
fr = pd.read_csv("datasets/nat2015.txt",
                     sep='\t',
                     encoding='latin-1',
                     header=0,
                     names = ['gender', 'name', 'year', 'births'],
                     converters = {'gender': lambda x: 'M' if x == '1' else 'F',
                                   'name': lambda x: '-'.join(map(lambda y: y.capitalize(), x.split('-'))),
                                   'births': lambda x: int(float(x))})
fr = fr[(fr['year'] != 'XXXX') &
              (fr['name'] != '_prenoms_rares') &
              (fr['name'].str.len() != 1)]
fr['year'] = fr['year'].astype(int)
fr = fr[['year', 'name', 'gender', 'births']]
fr.sort_values(['year', 'gender', 'births', 'name'], ascending=[True, True, False, True], inplace=True)
fr.reset_index(drop=True, inplace=True)
fr

# characteristics of a DataFrame
fr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556743 entries, 0 to 556742
Data columns (total 4 columns):
year      556743 non-null int32
name      556743 non-null object
gender    556743 non-null object
births    556743 non-null int64
dtypes: int32(1), int64(1), object(2)
memory usage: 14.9+ MB

# top 10 global French names within a year
fr.sort_values('births', ascending=False).head(10)

# global French female and male births
fr.groupby('gender').sum()['births']

gender
F    39990411
M    40680898
Name: births, dtype: int64

# global French top 10 names all years
top(fr)

name      gender
Marie     F         2234906
Jean      M         1919820
Pierre    M          889306
Michel    M          820080
André     M          711757
Jeanne    F          550384
Philippe  M          538133
René      M          516492
Alain     M          506750
Louis     M          504437
Name: births, dtype: int64

2. Searching in the Data¶

2.1 Two letters names¶

We are going to implement a function that looks for names with only 2 letters. We use the str operator which tells Python to process series data as strings. Then we use the [] operator for a logical selection.

# making a boolean mask
mask = us['name'].str.len() == 2
mask

0          False
1          False
2          False
3          False
4          False
5          False
6          False
7          False
           ...  
1891886    False
1891887    False
1891888    False
1891889    False
1891890    False
1891891    False
1891892    False
1891893    False
Name: name, dtype: bool

# using the mask to select data
us[mask]

We implement a function which perform all necessary operations and run US and French data using the unique() method.

# searching for names with 2 letters only
def names_of_length_2(df):
    mask = df['name'].str.len() == 2
    return df[mask]['name'].unique()

# US names with 2 letters
names_of_length_2(us)

array(['Ed', 'Wm', 'Al', 'Ab', 'Jo', 'Ah', 'Ng', 'Lu', 'Le', 'Ad', 'Cy',
       'Bo', 'Mc', 'La', 'Jc', 'Hy', 'Jb', 'De', 'Jd', 'Jt', 'El', 'Jr',
       'Ty', 'Vi', 'Ac', 'Sy', 'Lc', 'Od', 'Tj', 'Wc', 'Ld', 'Rc', 'Jw',
       'Be', 'Eb', 'Jl', 'St', 'Lo', 'Oc', 'Rb', 'Lj', 'Lb', 'Rl', 'Rd',
       'Hu', 'Lg', 'Av', 'Ma', 'Ej', 'Si', 'Su', 'Ok', 'An', 'Di', 'Ki',
       'Ky', 'Li', 'Ea', 'Mi', 'In', 'Ja', 'Zo', 'Fe', 'Aj', 'Kc', 'Ka',
       'Jj', 'Dj', 'Bj', 'Gy', 'Je', 'Ji', 'Cj', 'Ra', 'Yu', 'Pj', 'Ta',
       'My', 'Ry', 'Ho', 'Ly', 'Na', 'Rj', 'Ai', 'Tu', 'Vu', 'Ha', 'Ti',
       'Un', 'Jm', 'Pa', 'Ju', 'So', 'Vy', 'Jp', 'Ia', 'Yi', 'Du', 'Ge',
       'Va', 'Sa', 'Wa', 'Po', 'Tc', 'Ni', 'Da', 'Fu', 'Om', 'Ya', 'Me',
       'Ko', 'Ku', 'Se', 'Gi', 'Xu', 'Kt', 'Ye', 'Kd', 'Te', 'Do', 'Md',
       'Kj', 'Ii', 'Mr', 'Bg', 'No', 'Bb', 'Ke', 'Oz', 'Ze', 'Io', 'Qi',
       'Zi', 'Ny', 'Zy', 'Mj', 'Yo', 'Eh', 'Or', 'En', 'Za', 'Sj', 'Ar',
       'Ro', 'Mo', 'Oh', 'Xi', 'Tr'], dtype=object)

# French names with 2 letters
names_of_length_2(fr)

array(['St', 'Ah', 'Jo', 'El', 'Ab', 'Si', 'Le', 'Ha', 'My', 'Al', 'Da',
       'An', 'Ly', 'Pa', 'Va', 'Ai', 'Yi', 'By', 'Tu', 'Ka', 'Ma', 'Yu',
       'Na', 'So', 'Ya', 'Vu', 'Ky', 'Lu', 'Li', 'Vi', 'To', 'Ba', 'Sy',
       'Su', 'Do', 'Ea', 'De', 'Ny', 'Mc', 'Or', 'Ty', 'La', 'Bo', 'Zi',
       'Bi'], dtype=object)

2.2 Names for which births = year¶

Now, we implement a function which looks for names for which births within a year are equal to the year.

# looking for names for wich births within a year are equal to the year
def births_is_year(df):
    return df[df['births'] == df['year']]

# US names with births equals year
births_is_year(us)

# French names with briths equals year
births_is_year(fr)

2.3 Looking for prefix¶

Here, we implement a function which search for prefix names.

First, we select the names starting with a specific prefix using the str operator and the startswith() method.

# select US names starting with 'Fran'
selection = us[us['name'].str.startswith("Fran")]
selection

Second, we can reuse the top function and implement a new function.

# looking for prefix
def look(df, prefix, **kwargs):
    selection = df[df['name'].str.startswith(prefix)]
    return top(selection, **kwargs)

# US names starting with "Fran"
look(us, "Fran")

name       gender
Frank      M         909430
Frances    F         588601
Francis    M         284528
Franklin   M         134707
Francisco  M         131432
Frankie    M          40310
Francine   F          35747
Frankie    F          32913
Francis    F          29041
Francesca  F          28176
Name: births, dtype: int64

# French names starting with "Fran"
look(fr, "Fran")

name             gender
Françoise        F         401522
François         M         398241
Franck           M         177720
Francis          M         166060
Francine         F          92723
France           F          28073
Frank            M          11379
Francette        F           7777
Francisco        M           7236
François-Xavier  M           7125
Name: births, dtype: int64

We can see a French compound name. What are the top French compound name?

# global French top 10 compound names all years
selection = fr[fr['name'].str.contains('-')]
top(selection)

name           gender
Jean-Pierre    M         213143
Jean-Claude    M         172142
Jean-Luc       M         116647
Anne-Marie     F          99695
Jean-François  M          99024
Jean-Marc      M          90144
Marie-Thérèse  F          88345
Jean-Paul      M          86843
Jean-Marie     M          86223
Jean-Michel    M          85960
Name: births, dtype: int64

Which French names are involved in a compound name?

3. Putting Data into Graphs¶

3.1 Evolution of births for a single name (+ gender) over years¶

First, we select the data according to a name (Kevin) and a gender (M). Logical operators (&, | and ~) are provided for selecting data.

# selection of data related to Kevin (M)
selection = us[(us['name'] == "Kevin") & (us['gender'] == "M")]
selection

Second, we switch the year column to be the index of the DataFrame.

# same with year as index
selection_years = selection.set_index('year')
selection_years

There is a strong connexion between Pandas DataFrame and Matplotlib. We can directly use the Matplotlib plot() function which applied to DataFrame or Series as a method.

# plot the births column along with the years
selection_years.plot(title="Births of Kevin (M) over years");

We can put every thing together in a single function and run the French data.

# evolution of a name (+ gender)
def plot_name_gender(df, name, gender):
    selection = df[(df['name'] == name) & (df['gender'] == gender)]
    selection_years = selection.set_index('year')
    selection_years.plot(title="Births of {} ({}) over years".format(name, gender));

# evolution of French Kevin (M)
plot_name_gender(fr, 'Kevin', 'M')

We will try to find later correlations between US and French names over time.

3.2 Evolution of births by gender over years¶

DataFrame are provided with another powerful method pivot_table(). It produces a new DataFrame with as the index the different values of a first column, as the column index the different values of second column and the values of a third column are summarized into cells by using an aggregative function.

We use here this method to obtain the births by gender over years.

# pivoting US data by years and gender and computing the sum of births
genders = us.pivot_table(index='year',
                         columns='gender',
                         values='births',
                         aggfunc=np.sum)
genders

# plot number of births by gender over years
genders.plot(title="Total births by gender over years", color=['r', 'b']);

We can put every thing together in a single function and run the French data.

# plot number of births by gender over years
def plot_births(df):
    table = df.pivot_table(index='year',
                           columns='gender',
                           values='births',
                           aggfunc=np.sum)
    table.plot(title="Total births by gender over years", color=['r', 'b']);

# plot French births by gender over years
plot_births(fr)

4. Measuring the increase in naming diversity¶

We are using the crosstab() method in order to compute the number of different names by gender over years. In the 1880' parents had roughly 1000 names per gender to name their babies. Nowadays, parents have 14 to 19 thousands!

# number of different names by gender over years
diversity = pd.crosstab(us['year'], us['gender'])
diversity.head()

# nowadays
diversity.tail()

We can put every thing together in a single function.

# plot evolution of number of names over years
def plot_diversity(df):
    diversity = pd.crosstab(df['year'], df['gender'])
    diversity.plot(title="Evolution of names diversity by gender over years", color=['r', 'b']);

# US diversity
plot_diversity(us)

# French diversity
plot_diversity(fr)

5. Evolution of Last Letter¶

We will now add a column with the last letter of each name and compute the evolution of the most frequent last letters which appears nowadays. This is computed by using the apply() method and a lambda function which extracts the last letter and put it in uppercase for readability.

# add a column with the last later of names
us["last"] = us["name"].apply(lambda x: x[-1].upper())
us

Then we use the crosstab() function in order to get the number of births by last letters over years.

# diversity of names by last letter over years
last = pd.crosstab(us['year'], us['last'])
last

Then we sort the letters of the last available year. We use the .iloc[] operator which enables to select rows, columns and parts of a DataFrame by indices.

# sort most used last letters for US names in 2015
last.iloc[-1].sort_values(ascending=False)

last
N    7015
A    6944
E    4073
H    2518
I    2385
Y    2081
R    1391
L    1350
     ... 
G     122
C     113
V      80
F      77
J      72
W      50
P      31
Q      30
Name: 2016, dtype: int64

Here we put the 7 top letters in an array.

# put the 7 letters in an array
letters = last.iloc[-1].sort_values(ascending=False).index[:7]
letters

Index(['N', 'A', 'E', 'H', 'I', 'Y', 'R'], dtype='object', name='last')

We can perform a fancy indexing on the DataFrame with the 7 letters and perform a plot.

# last letters DataFrame limited to 7 top last letters of 2015
last[letters]

We can put every thing together in a single function and run the French data.

# plot the evolution of diversity of names according to their last letter
def plot_last_letters(df):
    df["last"] = df["name"].apply(lambda x: x[-1].upper())
    last = pd.crosstab(df['year'], df['last'])
    letters = last.iloc[-1].sort_values(ascending=False).head(7).index
    last[letters].plot(title="Evolution of diversity of names according to their last letter");

# plot the evolution of diversity of US names according to their last letter
plot_last_letters(us)

# plot the evolution of diversity of US names according to their last letter
plot_last_letters(fr)

# interactive scatter plot of last letters for US and French names across years
from ipywidgets import interact, IntSlider

# prepare and normalize data
last_us = pd.crosstab(us['year'], us['last'])
last_us = last_us.div(last_us.sum(axis=1), axis=0)
last_fr = pd.crosstab(fr['year'], fr['last'])
last_fr = last_fr.div(last_fr.sum(axis=1), axis=0)

def plot_letters(year):
    data = pd.concat([last_us.loc[year], last_fr.loc[year]], axis=1, join='inner')
    plt.scatter(data.iloc[:,0], data.iloc[:,1])
    for (x, y, letter) in zip(data.iloc[:,0], data.iloc[:,1], last.columns):
        plt.text(x + 0.003, y + 0.003, letter)
        
interact(plot_letters, year=IntSlider(min=1900, max=2015, step=1, value=1900));

6. Names which gender changed over time¶

6.1 Global gender repartition per name¶

First we restrict names which appear at least 10000 times as F or M over time.

# find top names
top_names = us.groupby('name').sum()['births']
top_names = top_names.sort_values(ascending=False)
top_names = top_names[top_names >= 10000]
top_names

name
James       5159306
John        5127501
Robert      4829274
Michael     4359820
Mary        4135851
William     4103456
David       3613916
Joseph      2603475
             ...   
Kenna         10079
Jovan         10068
Tatyana       10061
Karly         10054
Catrina       10045
Randell       10031
Gilda         10022
Marisela      10012
Name: births, dtype: int64

Select those names in the US DataFrame using the isin() method.

# data for top names
selection = us[us['name'].isin(top_names.index)]
selection

Secondly, we pivot the data so as to have both genders on each line sorted by names and the values are the sum of births per name and gender. We add a column with the ratio F vs M.

# pivot by name and gender with the sum of births
ratio_name = selection.pivot_table(index='name',
                                     columns='gender',
                                     values='births',
                                     aggfunc=np.sum)
ratio_name["ratio"] = ratio_name["F"] / (ratio_name["F"] + ratio_name["M"])
ratio_name

We can now plot the distribution of the ratio among names by using the plot() function with the option kind='hist'. We notice that most of names are mostly either female or male, but some names are possibily ambivalent.

# distribution of F vs M ratio for top 1000 US names
ratio_name['ratio'].plot(kind='hist');

We put everything in a function so that we can process French data.

# plot distribution of F vs M ratio for top 1000 names
def plot_distribution(df):
    top_names = df.groupby('name').sum()['births']
    top_names = top_names.sort_values(ascending=False)
    top_names = top_names[top_names >= 10000]
    selection = df[df['name'].isin(top_names.index)]
    ratio_name = selection.pivot_table(index=['name'],
                                         columns='gender',
                                         values='births',
                                         aggfunc=np.sum)
    ratio_name["ratio"] = ratio_name["F"] / (ratio_name["F"] + ratio_name["M"])
    ratio_name['ratio'].plot(kind='hist');

# distribution of F vs M ratio for top 1000 French names
plot_distribution(fr)

6.2 Plot evolution of names which gender changed over years¶

First we select the data related to a single name, for instance Leslie.

# select US data for Leslie
selection = us[us["name"] == "Leslie"]
selection

Then we compute the evolution of births for this name by gender over years. Note that we could have reuse the selection performed above in ratio_gender.

# evolution of births by gender
evolution_gender = selection.pivot_table(index='year',
                                         columns='gender',
                                         values='births')
evolution_gender

Then we compute the F and M ratio by dividing the births by the sum of each line (axis=1) along the index (axis=0).

# gender evolution over years
evolution_gender = evolution_gender.div(evolution_gender.sum(axis=1), axis=0)
evolution_gender

Put everything in a function and run over US and French data.

# plot gender ratio evolution of years
def plot_transname(df, name):
    selection = df[df["name"] == name]
    gender = selection.pivot_table(index='year',
                                   columns='gender',
                                   values='births')
    gender = gender.div(gender.sum(axis=1), axis=0)
    gender.plot(title="Evolution proportion of {} by gender over years".format(name), style={'M': 'b', 'F': 'r'});

# gender evolution of US Donnie
plot_transname(us, "Leslie")

# gender evolution of US Donnie
plot_transname(us, "Donnie")

plot_transname(us, 'Jean')

# gender evolution of French Camille
plot_transname(fr, "Camille")

# gender evolution of French Dominique
plot_transname(fr, "Dominique")

# gender evolution of French Alix
plot_transname(fr, "Alix")

7. Few cross analysis¶

7.1 Names with different genders in US and France¶

First, we partly reuse the plot_distribution() function to get the F vs M ratio for top 1000 names.

# add a ratio F vs M for top 1000 names
def ratio(df):
    top1000 = df.groupby('name').sum()['births']
    top1000 = top1000.sort_values(ascending=False).head(1000)
    selection = df[df['name'].isin(top1000.index)]
    ratio_name = selection.pivot_table(index=['name'],
                                         columns='gender',
                                         values='births',
                                         aggfunc=np.sum)
    ratio_name["ratio"] = ratio_name["F"] / (ratio_name["F"] + ratio_name["M"])
    return ratio_name

ratio(us)

ratio(fr)

Second, we concatenate the results to get them in a single DataFrame. We need to rename the columns.

# US and French top 1000 names with F vs M ratio
data = pd.concat([ratio(us), ratio(fr)], axis=1)
data.columns = ['us_F', 'us_M', 'us_ratio', 'fr_F', 'fr_M', 'fr_ratio']
data

Finally, we add a delta column with the absolute difference between US and French ratio and we sort the DataFrame.

# compute ratio absolute difference and sort by delta
data["delta"] = np.abs(data["us_ratio"] - data["fr_ratio"])
data.sort_values("delta", ascending=False).tail(10)

7.2 Correlation between US and French names over time¶

Here, we try to compute the correlation between US and French names (+ gender) over time. We build 2 Series with births and compute the correlation over 60 years by shifting one Series along the other.

def correlation(us, fr, name, gender):
    # select and pivot US data
    us_name = us[(us['name'] == name) & (us['gender'] == gender)]
    us_name = us_name.pivot_table(index='year', values='births')
    
    # select and pivot French data
    fr_name = fr[(fr['name'] == name) & (fr['gender'] == gender)]
    fr_name = fr_name.pivot_table(index='year', values='births')
    
    # get the best correlation over 60 years
    n = 60
    k = int(n / 2) - 1
    results = np.zeros(n)
    for i in range(-k, k + 1):
        results[i + k] = us_name.corr(fr_name.shift(-i))  # French data are shifted negatively
    best = results.argmax()
    print("Correlation over {} years : {:.2f}".format(best - k, results[best]))
    
    # concatenate and plot data
    data = pd.concat([us_name, fr_name], axis=1)
    data.columns=['us', 'fr']
    data.plot(title="Correlation between US and French {} ({}) over years".format(name, gender), color=['r', 'b'])

us_name = us[(us['name'] == "Alex") & (us['gender'] == "M")]
us_name = us_name.pivot_table(index='year', values='births')
us_name

year
1880     147
1881     114
1882     172
1883     120
1884     148
1885     159
1886     161
1887     136
        ... 
2009    5165
2010    4429
2011    4027
2012    3854
2013    3544
2014    3172
2015    3096
2016    2826
Name: births, dtype: int64

fr_name = fr[(fr['name'] == "Alex") & (fr['gender'] == "M")]
fr_name = fr_name.pivot_table(index='year', values='births')
fr_name

year
1900      5
1901      9
1902      6
1903      9
1904     12
1905     14
1906     12
1907      9
       ... 
2008    545
2009    581
2010    561
2011    501
2012    493
2013    445
2014    409
2015    405
Name: births, dtype: int64

correlation(us, fr, "Alex", "M")

Correlation over 5 years : 0.89

correlation(us, fr, "Marc", "M")

Correlation over -11 years : 0.91

correlation(us, fr, "Cindy", "F")

Correlation over 28 years : 0.98

8. Conclusion and summary¶

In this session, we have used some of the powerful Python Data Analysis Ecosystem features.

Links refer to the pandas documentation. Most of the functions and methods we have used provide a number of possible arguments.

Data representation

Series: 1D
DataFrame: 2D

Loading data with *read_csv()*

header
names
sep
encoding
converters

Getting information

index: Series and DataFrame index
columns: DataFrame column
shape: Series and DataFrame dimensions
info(): DataFrame informations
values: Index and Series values
unique(): Series unique values
nunique(): Series number of unique values

Selecting data

head() and tail()
Column accessing
Row accessing
Fancy indexing
Logical masking

Indexing and merging data

*set_index()* and *reset_index()*
*append()* and *concat()*

Computing over data

sum()
str.len(), str.startswith() and str.contains()
*apply()* and lambda
isin()
div()

Organizing data

Displaying data

*plot()*

	name	gender	births
0	Emma	F	19414
1	Olivia	F	19246
18757	Noah	M	19015
18758	Liam	M	18138
2	Ava	F	16237
3	Sophia	F	16070
18759	William	M	15668
18760	Mason	M	15192
18761	James	M	14776
4	Isabella	F	14722

	year	name	gender	births
431059	1947	Linda	F	99685
441430	1948	Linda	F	96210
437162	1947	James	M	94762
544612	1957	Michael	M	92716
437163	1947	Robert	M	91641
451669	1949	Linda	F	91013
533150	1956	Michael	M	90620
556187	1958	Michael	M	90512
447469	1948	James	M	88584
510799	1954	Michael	M	88525

	year	name	gender	births
0	1980	A	1	3.0
1	1998	A	1	3.0
2	XXXX	A	1	21.0
3	1976	AADEL	1	5.0
4	1978	AADEL	1	3.0
5	1980	AADEL	1	3.0
6	1981	AADEL	1	5.0
7	1982	AADEL	1	4.0
...	...	...	...	...
589403	1992	ÖZLEM	2	5.0
589404	1993	ÖZLEM	2	7.0
589405	2010	ÖZLEM	2	3.0
589406	2012	ÖZLEM	2	6.0
589407	2013	ÖZLEM	2	3.0
589408	2014	ÖZLEM	2	5.0
589409	2015	ÖZLEM	2	3.0
589410	XXXX	ÖZLEM	2	25.0

	year	name	gender	births
0	1900	Marie	F	48727
1	1900	Jeanne	F	13983
2	1900	Marguerite	F	8059
3	1900	Germaine	F	6981
4	1900	Louise	F	6698
5	1900	Yvonne	F	5099
6	1900	Madeleine	F	4960
7	1900	Suzanne	F	4954
...	...	...	...	...
556735	2015	Zayad	M	3
556736	2015	Zayane	M	3
556737	2015	Ziggy	M	3
556738	2015	Zoan	M	3
556739	2015	Zoubir	M	3
556740	2015	Zoé	M	3
556741	2015	Étan	M	3
556742	2015	Évann	M	3

	year	name	gender	births
95476	1946	Jean	M	53897
1705	1901	Marie	F	52167
3434	1902	Marie	F	51870
97892	1947	Jean	M	51665
5174	1903	Marie	F	50445
6931	1904	Marie	F	50153
8709	1905	Marie	F	48998
0	1900	Marie	F	48727
10545	1906	Marie	F	48465
14232	1908	Marie	F	47479

	year	name	gender	births
0	1880	Mary	F	7065
1	1880	Anna	F	2604
2	1880	Emma	F	2003
3	1880	Elizabeth	F	1939
4	1880	Minnie	F	1746
5	1880	Margaret	F	1578
6	1880	Ida	F	1472
7	1880	Alice	F	1414
...	...	...	...	...
1891886	2016	Ziya	M	5
1891887	2016	Ziyang	M	5
1891888	2016	Zoel	M	5
1891889	2016	Zolton	M	5
1891890	2016	Zurich	M	5
1891891	2016	Zyahir	M	5
1891892	2016	Zyel	M	5
1891893	2016	Zylyn	M	5

	year	name	gender	births
996	1880	Ed	M	310
1444	1880	Wm	M	14
1467	1880	Al	M	12
1850	1880	Ab	M	5
2880	1881	Jo	F	5
2998	1881	Ed	M	272
3326	1881	Al	M	19
3396	1881	Wm	M	15
...	...	...	...	...
1889035	2016	Jb	M	6
1889463	2016	Or	M	6
1889871	2016	Zi	M	6
1889875	2016	Zo	M	6
1890122	2016	Bb	M	5
1890482	2016	En	M	5
1891392	2016	Qi	M	5
1891817	2016	Ye	M	5

	year	name	gender	births
76786	1907	Minnie	F	1907
134324	1916	Oscar	M	1916
265125	1929	Velma	F	1929
274940	1930	Janice	F	1930
312410	1934	Agnes	F	1934
371429	1940	Vernon	M	1940
421493	1946	Margie	F	1946
544774	1957	Gilbert	M	1957
1249432	1997	Selena	F	1997
1293009	1998	Gage	M	1998
1692050	2011	Eliana	F	2011
1745461	2012	Avery	M	2012
1859178	2016	Norah	F	2016
1859179	2016	Rose	F	2016

	year	name	gender	births
3461	1902	Angèle	F	1902
23212	1912	Marius	M	1912
55913	1928	Gaston	M	1928
68050	1933	Fernand	M	1933
466147	2009	Lou	F	2009

	year	name	gender	births
41	1880	Frances	F	605
205	1880	Francis	F	55
410	1880	Frankie	F	17
478	1880	Frank	F	13
947	1880	Frank	M	3242
991	1880	Francis	M	344
1053	1880	Franklin	M	120
1336	1880	Francisco	M	19
...	...	...	...	...
1882584	2016	Franz	M	20
1882939	2016	Fransisco	M	18
1883613	2016	Francois	M	15
1886094	2016	Franciszek	M	9
1887714	2016	Fran	M	7
1890537	2016	Frances	M	5
1890538	2016	Franck	M	5
1890539	2016	Frandy	M	5

	year	name	gender	births
103699	1912	Kevin	M	9
110513	1913	Kevin	M	9
117875	1914	Kevin	M	12
126616	1915	Kevin	M	15
135970	1916	Kevin	M	18
145816	1917	Kevin	M	19
155956	1918	Kevin	M	21
166850	1919	Kevin	M	14
...	...	...	...	...
1643354	2009	Kevin	M	8767
1677700	2010	Kevin	M	7322
1711516	2011	Kevin	M	6558
1745343	2012	Kevin	M	6216
1778794	2013	Kevin	M	5929
1811995	2014	Kevin	M	5886
1845095	2015	Kevin	M	5164
1877871	2016	Kevin	M	4526

gender	F	M
year
1880	90992	110491
1881	91953	100743
1882	107847	113686
1883	112318	104627
1884	129020	114443
1885	133055	107799
1886	144533	110784
1887	145981	101415
...	...	...
2009	1834048	1980491
2010	1774186	1915331
2011	1755278	1895156
2012	1756347	1892094
2013	1749061	1885683
2014	1779496	1913434
2015	1776538	1907211
2016	1756647	1880674

gender	F	M
year
1880	942	1058
1881	938	997
1882	1028	1099
1883	1054	1030
1884	1172	1125

gender	F	M
year
2012	19486	14229
2013	19219	14034
2014	19169	14037
2015	19054	14009
2016	18757	14111

gender	F	M	ratio
name
Aaliyah	79367.0	89.0	0.999
Aaron	4269.0	568057.0	0.007
Abbey	17044.0	35.0	0.998
Abbie	21266.0	325.0	0.985
Abbigail	11286.0	5.0	1.000
Abby	56869.0	181.0	0.997
Abel	113.0	47412.0	0.002
Abigail	345728.0	621.0	0.998
...	...	...	...
Zane	490.0	43403.0	0.011
Zayden	129.0	14576.0	0.009
Zelda	12831.0	5.0	1.000
Zella	12056.0	NaN	NaN
Zelma	16705.0	112.0	0.993
Zion	5472.0	24388.0	0.183
Zoe	116591.0	448.0	0.996
Zoey	71128.0	194.0	0.997

last	A	B	C	D	E	F	G	H	I	J	...	Q	R	S	T	U	V	W	X	Y	Z
year
1880	450	7	7	93	490	7	9	60	7	0	...	0	89	125	76	5	1	7	4	148	4
1881	432	6	8	85	484	5	13	59	6	0	...	0	83	116	78	4	1	7	5	142	2
1882	478	7	8	96	538	8	10	70	6	0	...	0	87	124	83	3	1	8	5	167	4
1883	480	5	8	96	542	7	10	61	4	0	...	0	86	122	79	4	1	9	6	151	2
1884	543	10	7	101	584	6	10	72	4	0	...	0	100	133	83	4	1	7	4	179	4
1885	547	8	8	88	614	6	12	69	6	0	...	0	95	130	83	3	1	6	5	174	2
1886	583	7	7	103	633	4	12	72	5	0	...	0	96	127	87	4	1	7	4	188	3
1887	591	8	8	105	633	6	8	73	6	0	...	0	102	121	81	4	1	7	4	173	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2009	7975	124	117	496	4438	70	96	2610	2330	56	...	23	1250	1222	480	135	67	47	105	2210	188
2010	7659	135	123	508	4345	75	103	2621	2310	68	...	21	1244	1182	453	137	71	42	107	2192	178
2011	7399	127	105	501	4417	76	96	2613	2327	67	...	24	1249	1171	452	151	70	50	116	2155	170
2012	7293	117	118	476	4364	76	111	2560	2402	70	...	23	1282	1182	479	136	74	47	119	2168	172
2013	7150	124	110	502	4333	84	103	2550	2300	63	...	22	1256	1156	471	154	77	47	132	2147	161
2014	7050	124	105	517	4262	74	102	2523	2286	69	...	21	1302	1152	499	140	77	51	136	2130	170
2015	6982	114	114	513	4238	78	111	2513	2386	71	...	28	1338	1167	486	152	90	47	151	2149	157
2016	6944	126	113	525	4073	77	122	2518	2385	72	...	30	1391	1149	510	171	80	50	156	2081	153

last	N	A	E	H	I	Y	R
year
1880	223	450	490	60	7	148	89
1881	214	432	484	59	6	142	83
1882	224	478	538	70	6	167	87
1883	214	480	542	61	4	151	86
1884	244	543	584	72	4	179	100
1885	226	547	614	69	6	174	95
1886	240	583	633	72	5	188	96
1887	233	591	633	73	6	173	102
...	...	...	...	...	...	...	...
2009	7460	7975	4438	2610	2330	2210	1250
2010	7353	7659	4345	2621	2310	2192	1244
2011	7400	7399	4417	2613	2327	2155	1249
2012	7322	7293	4364	2560	2402	2168	1282
2013	7177	7150	4333	2550	2300	2147	1256
2014	7271	7050	4262	2523	2286	2130	1302
2015	7062	6982	4238	2513	2386	2149	1338
2016	7015	6944	4073	2518	2385	2081	1391

gender	F	M
year
1880	0.092	0.908
1881	0.107	0.893
1882	0.069	0.931
1883	0.055	0.945
1884	0.107	0.893
1885	0.076	0.924
1886	0.056	0.944
1887	0.067	0.933
...	...	...
2009	0.963	0.037
2010	0.971	0.029
2011	0.960	0.040
2012	0.961	0.039
2013	0.969	0.031
2014	0.943	0.057
2015	0.942	0.058
2016	0.947	0.053

gender	F	M	ratio
name
Aaliyah	79367	89	0.999
Aaron	4269	568057	0.007
Abby	56869	181	0.997
Abigail	345728	621	0.998
Abraham	203	86304	0.002
Ada	93464	277	0.997
Adam	2024	544822	0.004
Addison	110535	12623	0.898
...	...	...	...
Wyatt	387	137667	0.003
Xavier	826	125706	0.007
Yolanda	112251	537	0.995
Yvette	62953	186	0.997
Yvonne	155219	619	0.996
Zachary	1513	527858	0.003
Zoe	116591	448	0.996
Zoey	71128	194	0.997

	year	name	gender	births	last
654	1880	Leslie	F	8	E
1108	1880	Leslie	M	79	E
2522	1881	Leslie	F	11	E
3072	1881	Leslie	M	92	E
4593	1882	Leslie	F	9	E
5081	1882	Leslie	M	122	E
6850	1883	Leslie	F	7	E
7225	1883	Leslie	M	120	E
...	...	...	...	...	...
1759787	2013	Leslie	F	1111	E
1781819	2013	Leslie	M	36	E
1793093	2014	Leslie	F	1000	E
1814143	2014	Leslie	M	61	E
1826333	2015	Leslie	F	878	E
1847409	2015	Leslie	M	54	E
1859438	2016	Leslie	F	801	E
1880467	2016	Leslie	M	45	E

gender	F	M	ratio
name
Aaron	NaN	17337.0	NaN
Abdelkader	NaN	9805.0	NaN
Abel	141.0	23293.0	6.017e-03
Achille	142.0	13088.0	1.073e-02
Adam	45.0	50049.0	8.983e-04
Adel	NaN	7566.0	NaN
Adelaide	9801.0	243.0	9.758e-01
Adeline	54154.0	NaN	NaN
...	...	...	...
Yveline	9603.0	NaN	NaN
Yves	234.0	229361.0	1.019e-03
Yvette	203485.0	NaN	NaN
Yvon	NaN	40549.0	NaN
Yvonne	257058.0	3.0	1.000e+00
Zakaria	NaN	9572.0	NaN
Zelie	7147.0	NaN	NaN
Zoé	51220.0	311.0	9.940e-01