In [1]:
# PyParis logo
from IPython.display import Image
Image("PyParis.png")
Out[1]:

Dataviz with matplotlib and seaborn - PyParis 2018

Francis Wolinski - Yotta Conseil

Python & data Science

0. Tutorial objectives and materials

0.1 Objectives

  • Introduction to matplotlib.pyplot
  • Advanced graphics with matplotlib
  • Introduction to seaborn
  • Mixing seaborn and matplotlib

0.2 Documentation

0.3 Materials

In [2]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# display options
pd.set_option("display.max_rows", 16)
pd.set_option("display.max_columns", 30)

1. Matplotlib.pyplot

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

In matplotlib.pyplot the 3 main objects are:

Figure: The top-level container for all the plot elements.

Axes (or Subplots): An Axes contains most of the figure elements and sets the coordinate system.

Axis: The X or Y axis of a plot, not to be confused with Axes.

Nota bene: in a script or in a notebook cell, all instructions between the creation of a figure and its display accumulate in the same figure.
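
As an illustration of how the three objects relate, here is a minimal sketch (the data and labels are arbitrary, chosen only for the example): one Figure holds one Axes, which in turn owns an X Axis and a Y Axis.

```python
import matplotlib.pyplot as plt

# one Figure containing a single Axes
fig, ax = plt.subplots(figsize=(4, 3))

# plotting and titling happen at the Axes level
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("One Axes inside one Figure")
ax.set_xlabel("x")

# the Axes knows its parent Figure, and the Figure lists its Axes
print(ax.figure is fig, fig.axes == [ax])

# each Axes owns two Axis objects, which carry ticks and labels
print(type(ax.xaxis).__name__, type(ax.yaxis).__name__)
print(ax.xaxis.get_label_text())
```

Because everything is attached to `fig` and `ax`, later calls on these objects keep modifying the same graphics until it is displayed.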

In [3]:
# style
plt.style.use('seaborn-darkgrid')
plt.subplots(figsize=(5, 5));
In [4]:
# available styles in matplotlib.pyplot.style.available
print(*plt.style.available, sep=' ')
bmh classic dark_background fast fivethirtyeight ggplot grayscale seaborn-bright seaborn-colorblind seaborn-dark-palette seaborn-dark seaborn-darkgrid seaborn-deep seaborn-muted seaborn-notebook seaborn-paper seaborn-pastel seaborn-poster seaborn-talk seaborn-ticks seaborn-white seaborn-whitegrid seaborn Solarize_Light2 _classic_test
In [5]:
# styling with context manager
with plt.style.context('fivethirtyeight'):
    plt.subplots(figsize=(5, 5))

1.1 Introduction

1. Elementary graphics

In [6]:
# pseudo-random walk
#np.random.seed(0)
plt.plot((np.random.random(100) - 0.5).cumsum());

2. Simple graphics

In [7]:
# a figure with a unique subplot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)  # equivalent to ax = fig.add_subplot(1, 1, 1)
ax.set_title("Figure 1")
ax.plot((np.random.random(100) - 0.5).cumsum())
ax.axhline(y=0, color='k')
ax.legend(["Random walk"]);
Exercise 1
  • Implement a random walk with 2 curves in the same plot as a function.
  • Then add lines with the mean of each curve.
  • Watch out for the legends.
In [8]:
# %load exercises/ex1.py

def plot_random_walk2():
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111)  # equivalent to ax = fig.add_subplot(1, 1, 1)
    ax.set_title("Figure 1")
    a = (np.random.random(100) - 0.5).cumsum()
    b = (np.random.random(100) - 0.5).cumsum()
    ax.plot(a, c='g')
    ax.plot(b, c='r')
    ax.axhline(y=a.mean(), color='g', ls=':')
    ax.axhline(y=b.mean(), color='r', ls=':')
    ax.legend(["Random walk 1", "Random walk 2"]);

plot_random_walk2()

Position of legend

The position is set with the loc parameter, whose default value is 'best'.

It is also possible to set a relative position with the option bbox_to_anchor=(x, y):

  • x: 0.0 = left, 1.0 = right
  • y: 0.0 = bottom, 1.0 = top
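
A minimal sketch of how the two options combine, on arbitrary data: when bbox_to_anchor is given, loc names the corner of the legend box that is pinned to the point (x, y), expressed by default in Axes coordinates. The curves and labels below are assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))

# labels are attached to the curves themselves
ax.plot(np.arange(10), label="increasing")
ax.plot(np.arange(10)[::-1], label="decreasing")

# pin the lower right corner of the legend box
# to the bottom right corner of the Axes (x=1.0, y=0.0)
legend = ax.legend(loc="lower right", bbox_to_anchor=(1.0, 0.0))

print([text.get_text() for text in legend.get_texts()])
```

Passing label=... to each plot() call, rather than a list to legend(), keeps each label tied to its curve even if other lines are added in between.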
Exercise 2
  • Modify the function so that all keyword parameters are passed to the `legend()` method.
  • Try different positions for the legend.
  • Try the option `bbox_to_anchor` to set the legend in the middle on the right outside of the figure.
In [9]:
# %load exercises/ex2.py

def plot_random_walk2(**kwargs):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111)  # equivalent to ax = fig.add_subplot(1, 1, 1)
    ax.set_title("Figure 1")
    a = (np.random.random(100) - 0.5).cumsum()
    b = (np.random.random(100) - 0.5).cumsum()
    ax.plot(a, c='g')
    ax.plot(b, c='r')
    ax.axhline(y=a.mean(), color='g', ls=':')
    ax.axhline(y=b.mean(), color='r', ls=':')
    ax.legend(["Random walk 1", "Random walk 2"], **kwargs);

plot_random_walk2(loc='lower left')
# plot_random_walk2(bbox_to_anchor=(1.3, 0.6))

3. Compound graphics

1) With the add_subplot() method.

In [10]:
# compound graphics
fig = plt.figure(figsize=(8, 6))

ax1 = fig.add_subplot(221)
ax1.set_title("Figure 1")
ax1.plot(np.random.random(10))

ax2 = fig.add_subplot(222)
ax2.set_title("Figure 2")
ax2.plot(np.random.random(10), 'r--')

ax3 = fig.add_subplot(223)
ax3.set_title("Figure 3")
x = np.random.random(10)
ax3.plot(x, 'c:')
ax3.plot(x, '*', color='darkred')

ax4 = fig.add_subplot(224)
ax4.set_title("Figure 4")
x = np.random.random(10)
ax4.plot(x, '-.', color='0.3')
ax4.plot(x, '^', color='#ff0080');

2) With the subplots() function.

In [11]:
# compound graphics
fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(8, 6))

ax1.set_title("Figure 1")
ax1.plot(np.random.random(10))

ax2.set_title("Figure 2")
ax2.plot(np.random.random(10), 'r--')

ax3.set_title("Figure 3")
x = np.random.random(10)
ax3.plot(x, 'c:')
ax3.plot(x, '*', color='darkred')

ax4.set_title("Figure 4")
x = np.random.random(10)
ax4.plot(x, '-.', color='0.3')
ax4.plot(x, '^', color='#ff0080');

In matplotlib there are:

  • 4 types of lines: '-' (solid), '--' (dashed), ':' (dotted), '-.' (dashdotted)
  • several ways to specify colors:
    • 8 basic colors: 'b' (blue), 'g' (green), 'r' (red), 'c' (cyan), 'm' (magenta), 'y' (yellow), 'k' (black), 'w' (white)
    • grey levels: a number between '0.0' (black) and '1.0' (white), given as a string
    • 148 named colors: see variable matplotlib.colors.cnames
    • more than 16 million RGB colors in hexadecimal: '#rrggbb'
  • 41 markers: see variable matplotlib.lines.Line2D.markers
  • Line width can also be set with the lw keyword.
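
To make these conventions concrete, here is a short sketch (arbitrary data) showing that the different color notations resolve to the same RGBA value, and how a format string such as 'g--o' combines a color, a line type and a marker.

```python
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors

# four ways of writing black: basic color, grey level, named color, hexadecimal
specs = ['k', '0.0', 'black', '#000000']
rgba = [mcolors.to_rgba(spec) for spec in specs]
print(rgba)

# format string 'g--o': green color, dashed line, circle markers
fig, ax = plt.subplots(figsize=(4, 3))
line, = ax.plot([0, 1, 2], [1, 0, 1], 'g--o', lw=2)
print(line.get_color(), line.get_linestyle(), line.get_marker(), line.get_linewidth())
```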
In [12]:
print(*mpl.colors.cnames, sep=' ')
aliceblue antiquewhite aqua aquamarine azure beige bisque black blanchedalmond blue blueviolet brown burlywood cadetblue chartreuse chocolate coral cornflowerblue cornsilk crimson cyan darkblue darkcyan darkgoldenrod darkgray darkgreen darkgrey darkkhaki darkmagenta darkolivegreen darkorange darkorchid darkred darksalmon darkseagreen darkslateblue darkslategray darkslategrey darkturquoise darkviolet deeppink deepskyblue dimgray dimgrey dodgerblue firebrick floralwhite forestgreen fuchsia gainsboro ghostwhite gold goldenrod gray green greenyellow grey honeydew hotpink indianred indigo ivory khaki lavender lavenderblush lawngreen lemonchiffon lightblue lightcoral lightcyan lightgoldenrodyellow lightgray lightgreen lightgrey lightpink lightsalmon lightseagreen lightskyblue lightslategray lightslategrey lightsteelblue lightyellow lime limegreen linen magenta maroon mediumaquamarine mediumblue mediumorchid mediumpurple mediumseagreen mediumslateblue mediumspringgreen mediumturquoise mediumvioletred midnightblue mintcream mistyrose moccasin navajowhite navy oldlace olive olivedrab orange orangered orchid palegoldenrod palegreen paleturquoise palevioletred papayawhip peachpuff peru pink plum powderblue purple rebeccapurple red rosybrown royalblue saddlebrown salmon sandybrown seagreen seashell sienna silver skyblue slateblue slategray slategrey snow springgreen steelblue tan teal thistle tomato turquoise violet wheat white whitesmoke yellow yellowgreen
In [13]:
print(*mpl.lines.Line2D.markers, sep=' ')
. , o v ^ < > 1 2 3 4 8 s p * h H + x D d | _ P X 0 1 2 3 4 5 6 7 8 9 10 11 None None   
Exercise 3
  • Modify the graphics above by adding 1 column with 2 figures.
  • Figure 5: solid gold line of width 2 + black squares.
  • Figure 6: dashed light grey line + blue circles.
In [14]:
# %load exercises/ex3.py

# compound graphics
fig, [[ax1, ax2, ax3], [ax4, ax5, ax6]] = plt.subplots(2, 3, figsize=(12, 6))

ax1.set_title("Figure 1")
ax1.plot(np.random.random(10))

ax2.set_title("Figure 2")
ax2.plot(np.random.random(10), 'r--')

ax3.set_title("Figure 3")
x = np.random.random(10)
ax3.plot(x, 'c:')
ax3.plot(x, '*', color='darkred')

ax4.set_title("Figure 4")
x = np.random.random(10)
ax4.plot(x, '-.', color='0.3')
ax4.plot(x, '^', color='#ff0080')

ax5.set_title("Figure 5")
x = np.random.random(10)
ax5.plot(x, '-', color='gold', lw=2)
ax5.plot(x, 's', color='k')

ax6.set_title("Figure 6")
x = np.random.random(10)
ax6.plot(x, '--', color='0.7')
ax6.plot(x, 'o', color='b');

4. Graphics with pandas

Bar plot

In [15]:
# load a set of data
df = pd.read_table('Summer Olympic medallists 1896 to 2008 - ALL MEDALISTS.txt')
df.head()
Out[15]:
City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
0 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 100m freestyle M Gold
1 Athens 1896 Aquatics Swimming HERSCHMANN, Otto AUT Men 100m freestyle M Silver
2 Athens 1896 Aquatics Swimming DRIVAS, Dimitrios GRE Men 100m freestyle for sailors M Bronze
3 Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men 100m freestyle for sailors M Gold
4 Athens 1896 Aquatics Swimming CHASAPIS, Spiridon GRE Men 100m freestyle for sailors M Silver
In [16]:
# setting category on medals

medals = ['Bronze', 'Silver', 'Gold']

if pd.__version__ < '0.21.0':
    df['Medal'] = df['Medal'].astype('category', categories=medals, ordered=True)
else:
    from pandas.api.types import CategoricalDtype
    cat_medals = CategoricalDtype(categories=medals, ordered=True)
    df['Medal'] = df['Medal'].astype(cat_medals)
    
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City            29216 non-null object
Edition         29216 non-null int64
Sport           29216 non-null object
Discipline      29216 non-null object
Athlete         29216 non-null object
NOC             29216 non-null object
Gender          29216 non-null object
Event           29216 non-null object
Event_gender    29216 non-null object
Medal           29216 non-null category
dtypes: category(1), int64(1), object(8)
memory usage: 2.0+ MB
In [17]:
# cross table Edition x Medal
table = pd.crosstab(df['Edition'], df['Medal'])
table
Out[17]:
Medal Bronze Silver Gold
Edition
1896 40 47 64
1900 142 192 178
1904 123 159 188
1908 211 282 311
1912 284 300 301
1920 355 446 497
1924 285 298 301
1928 242 239 229
... ... ... ...
1980 472 455 460
1984 500 476 483
1988 535 505 506
1992 596 551 558
1996 634 610 615
2000 685 667 663
2004 679 660 659
2008 710 663 669

26 rows × 3 columns

In [18]:
# plot all medals in same graphics
ax = table.plot(kind='bar',
                figsize=(12, 4),
                title="Medals by edition and metal")
ax.set_xticks(range(len(table)))
ax.set_xlabel("Editions")
ax.set_xticklabels(table.index);
Exercise 4
  • Compute a cross table Edition by Gender and plot for each edition the ratio of gender.
  • Compute a cross table Sport by Medal and plot.
In [19]:
# %load exercises/ex4.py

# for each edition the ratio of medals by gender
table1 = pd.crosstab(df['Edition'], df['Gender'])
table1 = table1.div(table1.sum(axis=1), axis=0)
ax = table1.plot(kind='bar',
                 stacked=True,
                 figsize=(12, 4),
                 title="Medals by edition and gender")
ax.set_xticks(range(len(table1)))
ax.set_xlabel("Editions")
ax.set_xticklabels(table1.index);

# for each sport the number of medals by metal
table2 = pd.crosstab(df['Sport'], df['Medal'])
ax = table2.plot(kind='bar',
                 figsize=(12, 4),
                 title="Medals by sport and metal")
ax.set_xticks(range(len(table2)))
ax.set_xlabel("Sports")
ax.set_xticklabels(table2.index);
Exercise 5
  • Modify the graphics above so that:
    1. the graphics is made of different subplots
    2. the X tick labels are rotated by 60°
    3. the colors for the medals are darkorange, silver and gold
    4. no legend is displayed in the subplots
    5. the space between the subplots is increased (use the `subplots_adjust()` function with the `hspace=...` argument)
  • Modify the graphics above so that the editions are replaced by the cities.
    • you will need to adjust the xtick labels to the right (use the `pyplot.xticks()` function with the `ha=...` argument)
In [20]:
# %load exercises/ex5.py

# graphics with subplots
table1 = pd.crosstab(df['Edition'], df['Medal'])
axes = table1.plot(figsize=(9, 6),
               title="Medals by metal and edition",
               kind='bar',
               subplots=True,
               #sharey=True,
               color=['darkorange', 'silver', 'gold'],
               rot=60)
plt.subplots_adjust(hspace=0.4)
axes[-1].set_xticks(range(len(table1)))
axes[-1].set_xlabel("Editions")
axes[-1].set_xticklabels(table1.index)
for ax in axes:
    ax.legend().set_visible(False);
    
# graphics with subplots
table2 = pd.crosstab(df['Sport'], df['Medal'])
axes = table2.plot(figsize=(9, 6),
               title="Medals by metal and sport",
               kind='bar',
               subplots=True,
               #sharey=True,
               color=['darkorange', 'silver', 'gold'],
               rot=60)
plt.subplots_adjust(hspace=0.4)
plt.xticks(ha='right')
axes[-1].set_xticks(range(len(table2)))
axes[-1].set_xlabel("Sports")
axes[-1].set_xticklabels(table2.index)
for ax in axes:
    ax.legend().set_visible(False);

Default matplotlib parameters can be fine-tuned by modifying:

  • either the matplotlib.rcParams variable
  • or the matplotlibrc file, see matplotlib.matplotlib_fname().

Of course, modifying these settings requires expertise and care.
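
One cautious way to experiment with these settings is to override them only temporarily. The sketch below uses matplotlib.rc_context(), which restores the previous values when the block exits; the two parameters changed here are arbitrary choices for the example.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

default_lw = mpl.rcParams['lines.linewidth']

# temporary override: previous values are restored on exit
with mpl.rc_context({'lines.linewidth': 3, 'axes.grid': True}):
    fig, ax = plt.subplots(figsize=(4, 3))
    line, = ax.plot([0, 1, 2], [0, 1, 0])
    inside_lw = mpl.rcParams['lines.linewidth']

print(inside_lw, line.get_linewidth())
print(mpl.rcParams['lines.linewidth'] == default_lw)
```

Assigning directly to mpl.rcParams[...] has the same effect but lasts for the whole session.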

In [21]:
# path of matplotlibrc file
print(mpl.matplotlib_fname())
C:\Users\Francis\Anaconda3\lib\site-packages\matplotlib\mpl-data\matplotlibrc

Scatter plot

In [22]:
# load a set of data
geo = pd.read_csv('correspondance-code-insee-code-postal.csv', sep=';',
                 usecols=range(10),
                index_col='Code INSEE')
geo[['Latitude', 'Longitude']] = geo['geo_point_2d'].str.extract('(.+), (.+)', expand=True).astype(float)
geo.head()
Out[22]:
Code Postal Commune Département Région Statut Altitude Moyenne Superficie Population geo_point_2d Latitude Longitude
Code INSEE
31080 31350 BOULOGNE-SUR-GESSE HAUTE-GARONNE MIDI-PYRENEES Chef-lieu canton 301.0 2470.0 1.6 43.2904403081, 0.650641474176 43.290440 0.650641
11143 11510 FEUILLA AUDE LANGUEDOC-ROUSSILLON Commune simple 314.0 2426.0 0.1 42.9291375888, 2.90138923544 42.929138 2.901389
43028 43200 BESSAMOREL HAUTE-LOIRE AUVERGNE Commune simple 888.0 743.0 0.4 45.1306448726, 4.07952494849 45.130645 4.079525
78506 78660 PRUNAY-EN-YVELINES YVELINES ILE-DE-FRANCE Commune simple 155.0 2717.0 0.8 48.5267627187, 1.80513972814 48.526763 1.805140
84081 84310 MORIERES-LES-AVIGNON VAUCLUSE PROVENCE-ALPES-COTE D'AZUR Commune simple 49.0 1042.0 7.6 43.9337788848, 4.90875878315 43.933779 4.908759
In [23]:
# scatter plot provides naive maps
plt.scatter(geo['Longitude'],
            geo['Latitude'],
            s=3);
Exercise 6
  • Limit the geo DataFrame to Metropolitan France and scatter plot.
  • Provide a scatter plot where the color depends on the 'Altitude Moyenne' with the colormap 'coolwarm'.
In [24]:
# %load exercises/ex6.py

# Metropolitan France
metro = geo.loc[geo['Latitude'] > 40]
plt.scatter(metro['Longitude'],
            metro['Latitude'],
            s=3);

metro = metro.sort_values('Altitude Moyenne')
plt.scatter(metro['Longitude'],
            metro['Latitude'],
            c=metro['Altitude Moyenne'],
            s=3,
            cmap=plt.cm.coolwarm)
plt.colorbar();

Colormaps

The matplotlib and seaborn modules also manage colormaps, i.e. palettes of colors associated with discrete or continuous data.

These modules also manage other palettes:

In [25]:
print(*plt.cm.datad.keys(), sep=' ')
afmhot autumn bone binary bwr brg CMRmap cool copper cubehelix flag gnuplot gnuplot2 gray hot hsv jet ocean pink prism rainbow seismic spring summer terrain winter nipy_spectral spectral Blues BrBG BuGn BuPu GnBu Greens Greys Oranges OrRd PiYG PRGn PuBu PuBuGn PuOr PuRd Purples RdBu RdGy RdPu RdYlBu RdYlGn Reds Spectral YlGn YlGnBu YlOrBr YlOrRd gist_earth gist_gray gist_heat gist_ncar gist_rainbow gist_stern gist_yarg coolwarm Wistia Accent Dark2 Paired Pastel1 Pastel2 Set1 Set2 Set3 tab10 tab20 tab20b tab20c Vega10 Vega20 Vega20b Vega20c afmhot_r autumn_r bone_r binary_r bwr_r brg_r CMRmap_r cool_r copper_r cubehelix_r flag_r gnuplot_r gnuplot2_r gray_r hot_r hsv_r jet_r ocean_r pink_r prism_r rainbow_r seismic_r spring_r summer_r terrain_r winter_r nipy_spectral_r spectral_r Blues_r BrBG_r BuGn_r BuPu_r GnBu_r Greens_r Greys_r Oranges_r OrRd_r PiYG_r PRGn_r PuBu_r PuBuGn_r PuOr_r PuRd_r Purples_r RdBu_r RdGy_r RdPu_r RdYlBu_r RdYlGn_r Reds_r Spectral_r YlGn_r YlGnBu_r YlOrBr_r YlOrRd_r gist_earth_r gist_gray_r gist_heat_r gist_ncar_r gist_rainbow_r gist_stern_r gist_yarg_r coolwarm_r Wistia_r Accent_r Dark2_r Paired_r Pastel1_r Pastel2_r Set1_r Set2_r Set3_r tab10_r tab20_r tab20b_r tab20c_r Vega10_r Vega20_r Vega20b_r Vega20c_r
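
As a sketch of what a colormap does: it is a function from a number in [0, 1] to an RGBA color, so data are first normalized to that interval. The altitude values below are hypothetical, chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

cmap = plt.cm.coolwarm

# a colormap maps a number in [0, 1] to an RGBA tuple
low, mid, high = cmap(0.0), cmap(0.5), cmap(1.0)
print(low, high)

# hypothetical altitudes in metres, rescaled to [0, 1] before the lookup
altitudes = np.array([0.0, 250.0, 1000.0])
norm = Normalize(vmin=altitudes.min(), vmax=altitudes.max())
colors = cmap(norm(altitudes))  # one RGBA row per value
print(colors.shape)

# the "_r" suffix gives the same colormap in reverse order
reversed_cmap = plt.cm.coolwarm_r
```

With coolwarm, low values come out bluish and high values reddish, which matches the scatter plot of 'Altitude Moyenne' above.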
Exercise 7
  • Add the population 'Densité' to geo and convert the 'Statut' column to an ordered category.
  • Provide a scatter plot where:
    • All cities whose 'Statut' is lower than 'Préfecture' are plotted with a small point
    • All cities whose 'Statut' is 'Préfecture' or higher are plotted with a circle whose radius depends on 'Population' and whose color depends on 'Densité' with the colormap Reds
    • All cities whose 'Statut' is 'Préfecture de région' or higher (except those with Arrondissements: Paris, Lyon and Marseille) are plotted along with their names
In [26]:
# %load exercises/ex7.py

geo['Densité'] = geo['Population'] / geo['Superficie']

status = list(geo['Statut'].value_counts().index)
if pd.__version__ < '0.21.0':
    geo['Statut'] = geo['Statut'].astype('category', categories=status, ordered=True)
else:
    from pandas.api.types import CategoricalDtype
    cat_status = CategoricalDtype(categories=status, ordered=True)
    geo['Statut'] = geo['Statut'].astype(cat_status)

metro = geo.loc[geo['Latitude'] > 40]

# map of the prefectures, with the names of the regional prefectures
plt.figure(figsize=(7, 5))
metro_A = metro.loc[metro["Statut"] >= "Préfecture"]
metro_A = metro_A.sort_values("Population", ascending=False)
metro_B = metro.loc[metro["Statut"] < "Préfecture"]

# ordinary communes (below 'Préfecture')
plt.scatter(metro_B["Longitude"],
            metro_B["Latitude"],
            c='y',
            s=3,
            edgecolors='none')

# prefectures
ax = plt.scatter(metro_A["Longitude"],
                metro_A["Latitude"],
                c=metro_A["Densité"],
                s=metro_A["Population"],
                cmap=plt.cm.Reds,
                edgecolors='none')

# names of the regional prefectures, excluding Paris, Lyon and Marseille (PLM)
metro_C = metro.loc[(metro["Statut"] >= "Préfecture de région") & ~metro["Commune"].str.contains("ARRONDISSEMENT")]
for i, row in metro_C.iterrows():
    plt.text(row["Longitude"],
             row["Latitude"],
             row["Commune"].title(),
             fontsize=8)
    
plt.colorbar(ax);

2. seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

barplot

Show point estimates and confidence intervals as rectangular bars.
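
To see what "point estimate" means here, consider this minimal sketch on hypothetical toy data (the DataFrame and its column names are assumptions made for the example). By default the height of each bar is the mean of the observations in that group, and the error bar shows a bootstrapped confidence interval around it.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# hypothetical toy data: three observations per group
toy = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

fig, ax = plt.subplots(figsize=(4, 3))
sns.barplot(x='group', y='value', data=toy, ax=ax)

# the bar heights should match the group means computed with pandas
means = toy.groupby('group')['value'].mean()
heights = [patch.get_height() for patch in ax.patches]
print(means.tolist(), heights)
```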

In [27]:
# number of athletes for each sport, country, gender and medal
table = df.pivot_table(index=['Sport', 'NOC', 'Gender', 'Medal'], values='Athlete', aggfunc='count') #pd.Series.nunique)
table.reset_index(inplace=True)
sports = table['Sport'].value_counts().index[:10]
table = table.loc[table['Sport'].isin(sports)]
table
Out[27]:
Sport NOC Gender Medal Athlete
0 Aquatics ANZ Men Bronze 3.0
1 Aquatics ANZ Men Silver 2.0
2 Aquatics ANZ Men Gold 4.0
3 Aquatics ANZ Women Silver 1.0
4 Aquatics ANZ Women Gold 1.0
5 Aquatics ARG Men Gold 1.0
6 Aquatics ARG Women Bronze 1.0
7 Aquatics ARG Women Silver 1.0
... ... ... ... ... ...
2888 Wrestling USA Men Gold 50.0
2889 Wrestling USA Women Bronze 2.0
2890 Wrestling USA Women Silver 1.0
2891 Wrestling UZB Men Silver 3.0
2892 Wrestling UZB Men Gold 3.0
2893 Wrestling YUG Men Bronze 6.0
2894 Wrestling YUG Men Silver 6.0
2895 Wrestling YUG Men Gold 4.0

1755 rows × 5 columns

In [28]:
# barplot number of athletes by sport and medal
fig, ax = plt.subplots(figsize=(12, 8))
sns.barplot(y='Sport',
            x='Athlete',
            data=table,
            hue='Medal',
            palette=['darkorange', 'silver', 'gold'],
            #ci=0,
            ax=ax);