Plotting Visualizations

Created on Tue Aug 22 15:01:04 2023

Copyright 2023 Roy Ruddle

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Functions for data quality visualizations.

Internal functions are prefixed by ‘_’. The functions are grouped as follows:

  • General functions
  • Unused functions
  • Functions for summary plots
  • Functions for purity plots
  • Functions to plot sets and intersections
  • Explanation graph functions

vizdataquality.plot.apply_perceptual_discontinuity_individually(input_data, perceptual_threshold, axis_limits=None)

Apply a perceptual discontinuity threshold to each value individually. If axis_limits is None then values in the range 0 < x < perceptual_threshold are set equal to perceptual_threshold. Otherwise values in the range 0 < x < max are adjusted so that each is distinguishable from both 0 and max.

Parameters:
  • input_data (series or data frame) – The data.

  • perceptual_threshold (float) – An absolute value (axis_limits = None) or a fraction (0.0 - 1.0) of the axis limit max.

  • axis_limits (None or (min, max) tuple) – The axis limits. The default is None.

Returns:

The adjusted values and, if a series was input, then two extra columns for stacked bar chart plotting of values that have been adjusted and not adjusted, respectively.

Return type:

DataFrame
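The two adjustment rules described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names `lift_small_values` and `clamp_within_axis` are hypothetical, plain lists stand in for a series, and the margin used in the axis-limits case (threshold times the axis maximum, applied at both ends) is an assumption about what "distinguishable from 0 and max" means.

```python
def lift_small_values(values, perceptual_threshold):
    """axis_limits=None case: raise every 0 < x < threshold up to the threshold."""
    return [perceptual_threshold if 0 < x < perceptual_threshold else x
            for x in values]


def clamp_within_axis(values, perceptual_threshold, axis_max):
    """Axis-limits case (assumed rule): keep values strictly between 0 and
    axis_max at least one margin away from both ends of the axis."""
    margin = perceptual_threshold * axis_max
    adjusted = []
    for x in values:
        if 0 < x < axis_max:
            x = min(max(x, margin), axis_max - margin)
        adjusted.append(x)
    return adjusted


missing_counts = [0, 3, 250, 997, 1000]
print(lift_small_values(missing_counts, 50))          # [0, 50, 250, 997, 1000]
print(clamp_within_axis(missing_counts, 0.05, 1000))  # [0, 50.0, 250, 950.0, 1000]
```

In both cases exact zeros and exact maxima are left untouched, so a bar that is drawn at all always represents a non-zero value.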

vizdataquality.plot.apply_perceptual_discontinuity_to_group(input_data, perceptual_threshold, axis_limits=None)

Apply perceptual discontinuity threshold to a group of values. The values are adjusted so each is >= perceptual_threshold * sum, but the sum is unchanged. That is achieved by increasing values that are below the threshold and decreasing values that are above the threshold.

Parameters:
  • input_data (series or data frame) – The data.

  • perceptual_threshold (float) – An absolute value (axis_limits = None) or a fraction (0.0 - 1.0) of the axis limit max.

  • axis_limits (None or (min, max) tuple) – The axis limits. The default is None.

Returns:

A series or dataframe, with perceptual discontinuity applied to groups of values (e.g. in a stacked bar chart)

Return type:

Series or DataFrame
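One way to raise small values while keeping the sum unchanged is sketched below. This is an illustration under stated assumptions, not the library's code: the function name `apply_group_threshold` is hypothetical, zeros are assumed to stay at zero, and the reduction is shared among the larger values in proportion to how far each exceeds the floor.

```python
def apply_group_threshold(values, perceptual_threshold):
    """Raise positive values below threshold * sum up to that floor and shrink
    the values above it so that the total is preserved."""
    total = sum(values)
    floor = perceptual_threshold * total
    # Amount that must be added to the small values ...
    deficit = sum(floor - x for x in values if 0 < x < floor)
    # ... and the amount available to take from the large ones.
    excess = sum(x - floor for x in values if x > floor)
    if deficit == 0 or excess < deficit:
        return list(values)  # nothing to do, or the threshold cannot be met
    scale = 1 - deficit / excess
    adjusted = []
    for x in values:
        if 0 < x < floor:
            adjusted.append(floor)
        elif x > floor:
            adjusted.append(floor + (x - floor) * scale)
        else:
            adjusted.append(x)
    return adjusted


stack = [1, 9, 90]
adjusted = apply_group_threshold(stack, 0.05)
print(adjusted, sum(adjusted))
```

Because large values are only shrunk towards the floor and never below it, every non-zero segment of a stacked bar remains visible.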

vizdataquality.plot.boxplot(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Create a box plot using the supplied stats to show the distribution of numerical or date/time data.

Parameters:
  • data (series) – Series containing the variable names (index) and pre-computed boxplot stats (for each variable, a list defining [whislo, q1, med, q3, whishi]; see Matplotlib's Axes.bxp() for details)

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.

  • ax_input (axis or None) – Matplotlib axis

  • vert (boolean) – True (vertical bars; the default) or False (horizontal)

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bxp object

Return type:

None.
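The pre-computed stats that boxplot() expects could be prepared along the following lines. This is a sketch only: the helper `five_number_summary` is hypothetical, placing the whiskers at the minimum and maximum is an assumption (a 1.5 x IQR rule is an equally valid choice), and a plain dictionary stands in for the series of per-variable lists.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return [whislo, q1, med, q3, whishi] for one variable."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return [min(values), q1, med, q3, max(values)]


raw = {
    "age": [21, 25, 30, 41, 58],
    "height_cm": [150, 160, 170, 180],
}
# One list of five stats per variable name.
stats = {name: five_number_summary(vals) for name, vals in raw.items()}
print(stats["age"])  # [21, 25.0, 30.0, 41.0, 58]
```

Wrapping such a dictionary in a pandas Series would give variable names as the index and one list per variable, which matches the shape described for the data parameter.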

vizdataquality.plot.boxplot_raw(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Create a box plot from raw data to show the distribution of numerical data.

Parameters:
  • data (series or dataframe) – The values of a variable(s).

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.

  • ax_input (axis or None) – Matplotlib axis

  • vert (boolean) – True (vertical bars; the default) or False (horizontal)

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.boxplot object

Return type:

None.

vizdataquality.plot.datetime_counts(data, component='raw data', gap_threshold=None, show_gaps=True, ax_input=None, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Plot the overall distribution of a datetime variable, or the distribution of a specific component (e.g., month).

Parameters:
  • data (series) – Values of a variable.

  • component (string) – Component to plot (‘year’, ‘month’, ‘dayofweek’, ‘hour’, ‘minute’ or ‘second’; case independent) or ‘raw data’ (default)

  • gap_threshold (None, int or datetime) – None (threshold will be based on the component of the data; the default) or value (threshold to use). Only used if component is specified.

  • show_gaps (boolean) – True (show gaps; the default) or False (draw lines across gaps). Only used if component is specified.

  • ax_input (axis or None) – Matplotlib axis

  • xlabels_rotate (float) – Angle to rotate X axis labels by

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.plot object

Return type:

None.
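To make "the distribution of a specific component" concrete, the sketch below counts how often each value of one component occurs. It is an illustration using only the standard library; the helper `component_counts` is hypothetical and is not part of vizdataquality. The 'dayofweek' component is mapped to `datetime.weekday()` (Monday = 0), which is an assumption about the intended numbering.

```python
from collections import Counter
from datetime import datetime


def component_counts(values, component):
    """Count occurrences of one datetime component (case independent)."""
    component = component.lower()
    if component == "dayofweek":
        keys = [v.weekday() for v in values]  # Monday = 0 (assumed)
    else:
        # 'year', 'month', 'hour', 'minute' and 'second' are datetime attributes.
        keys = [getattr(v, component) for v in values]
    return dict(sorted(Counter(keys).items()))


timestamps = [
    datetime(2023, 1, 5, 9, 30),
    datetime(2023, 1, 20, 14, 0),
    datetime(2023, 3, 2, 9, 45),
]
print(component_counts(timestamps, "Month"))  # {1: 2, 3: 1}
print(component_counts(timestamps, "hour"))   # {9: 2, 14: 1}
```

In the month counts there is no entry for February. That absence is the kind of gap that show_gaps and gap_threshold are concerned with.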

vizdataquality.plot.dot_whisker(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Create a dot-or-whisker plot (e.g., to show value lengths for each variable).

Parameters:
  • data (series) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.

  • ax_input (axis or None) – Matplotlib axis

  • vert (boolean) – True (vertical bars; the default) or False (horizontal)

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.scatter or Axes.errorbar object

Return type:

None.

vizdataquality.plot.histogram(data, perceptual_threshold=0.05, ax_input=None, vert=True, xlabels_rotate=0.0, datalabels=False, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Create a histogram to show the distribution of numerical data.

Parameters:
  • data (series, list or numpy array) – The values to be plotted

  • perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None. The default is 0.05.

  • ax_input (axis or None) – Matplotlib axis

  • vert (boolean) – True (vertical bars; the default) or False (horizontal)

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)

  • datalabels (boolean) – Label each bin. The default is False.

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for a Matplotlib hist object

Return type:

None.

vizdataquality.plot.line(data, option='show gaps', gap_threshold=1, ax_input=None, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Create a line chart showing a data quality attribute (e.g., to show value counts for a discrete numerical variable). The length of each bar can be adjusted to ensure that important perceptual differences are visible.

Parameters:
  • data (series) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)

  • option (string) – Component to plot (‘show gaps’, ‘show missing’ or ‘interpolate’). The default is ‘show gaps’

  • gap_threshold (int) – Gaps larger than this threshold will be shown (only used if option is ‘show gaps’). The default is 1.

  • ax_input (axis or None) – Matplotlib axis. The default is None.

  • xlabels_rotate (float) – Angle to rotate X axis labels by. The default is 0.0.

  • filename (string) – None or a filename for the figure. The default is None.

  • overwrite (boolean) – False (do not overwrite file) or True (overwrite file if it exists). The default is False.

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object. The default is an empty dictionary.

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object. The default is an empty dictionary.

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.plot and scatter objects

Return type:

None.
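The effect of gap_threshold with the 'show gaps' option can be pictured as splitting the X values into separate line segments wherever neighbouring values are further apart than the threshold. The sketch below is an assumption about that behaviour rather than the library's code, and the helper `split_at_gaps` is hypothetical.

```python
def split_at_gaps(xs, gap_threshold=1):
    """Group sorted x values into runs whose neighbours differ by at most
    gap_threshold; each run would be drawn as its own line segment."""
    segments = []
    for x in sorted(xs):
        if segments and x - segments[-1][-1] <= gap_threshold:
            segments[-1].append(x)
        else:
            segments.append([x])
    return segments


print(split_at_gaps([1, 2, 3, 6, 7, 10]))  # [[1, 2, 3], [6, 7], [10]]
```

A run containing a single value, such as `[10]` here, cannot be drawn as a line, which may be why the keyword arguments are described as applying to both Axes.plot and scatter.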

vizdataquality.plot.lollipop(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, datalabels=False, continuous_value_axis=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Create a lollipop plot (e.g., to show value counts for a variable).

Parameters:
  • data (series) – Value counts for a variable.

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.

  • ax_input (axis or None) – Matplotlib axis. The default is None.

  • vert (boolean) – True (vertical bars) or False (horizontal). The default is True.

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert=True). The default is 0.0.

  • datalabels (boolean) – Label each data point. The default is False.

  • continuous_value_axis (boolean) – Plot numerical/datetime values on a continuous axis to show any gaps in values. The default is True.

  • filename (string) – Filename for the figure. The default is None.

  • overwrite (boolean) – False (do not overwrite file) or True (overwrite file if it exists). The default is False.

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object. The default is {}.

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object. The default is {}.

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.scatter and Axes.plot objects

Return type:

None.

vizdataquality.plot.multiplot(plottype, data, perceptual_threshold=0.05, number_of_variables_per_row=None, vert=True, xlabels_rotate=0.0, clist=[], datalabels=False, legend=True, continuous_value_axis=True, filename=None, overwrite=False, plt_kw={}, fig_kw={}, ax_kw={}, legend_kw={}, **kwargs)

Plot a data quality attribute (e.g., number of missing values in each variable). The variables can be plotted on multiple rows of bar charts. The length of each bar can be adjusted to ensure that important perceptual differences are visible.

Parameters:
  • plottype (string) – ‘bar’, ‘box’, ‘boxraw’, ‘dot-or-whisker’, ‘lollipop’, ‘stackedbar’ or ‘violin’

  • data (dataframe (stackedbar or violinplot) or series (all plot types except stackedbar)) – The expected content depends on plottype. ‘bar’, ‘box’, ‘dot-or-whisker’: Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable). ‘lollipop’: Series containing value counts. ‘stackedbar’: Dataframe where each column is a bar and the index/rows are the stacks. ‘violin’, ‘boxraw’: The values of one (series) or more columns (dataframe).

  • perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show in each row

  • vert (boolean) – True (vertical bars; the default) or False (horizontal)

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)

  • clist (list, optional) – A list of the colours to use (a different one for each stack in stacked bars). The default is an empty list (use the default colours).

  • datalabels (boolean) – Label each data point. The default is False.

  • legend (boolean) – True (add a legend) or False (no legend). The default is True.

  • continuous_value_axis (boolean) – Plot numerical/datetime values on a continuous axis to show any gaps in values. The default is True.

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • plt_kw (dictionary) – Keyword arguments for a Matplotlib pyplot object

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • legend_kw (dictionary) – Keyword arguments for a Matplotlib legend. Only used if plottype = ‘bar’

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bar (scalarbar), Axes.bxp (boxplot), Axes.boxplot (boxraw), Axes.scatter and Axes.errorbar (dot_whisker), Axes.scatter and Axes.plot (lollipop) or violinplot object (violin)

Return type:

None.
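Spreading variables over multiple rows amounts to splitting the list of variable names into consecutive chunks of number_of_variables_per_row. The following sketch shows that idea only; the helper `chunk_variables` is hypothetical and how multiplot() actually lays out its rows is not confirmed by this page.

```python
def chunk_variables(names, number_of_variables_per_row=None):
    """Split variable names into rows; None means a single row with everything."""
    names = list(names)
    if number_of_variables_per_row is None:
        return [names]
    n = number_of_variables_per_row
    return [names[i:i + n] for i in range(0, len(names), n)]


variables = ["id", "age", "height", "weight", "postcode"]
print(chunk_variables(variables, 2))
# [['id', 'age'], ['height', 'weight'], ['postcode']]
```

The last row may hold fewer variables than the others, so a layout built this way needs to cope with a partly filled final chart.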

vizdataquality.plot.plotgrid(tasktype, data, num_rows=None, num_cols=None, vert=True, xlabels_rotate=0.0, perceptual_threshold=0.05, legend=True, components='raw data', gap_threshold=None, show_gaps=True, datalabels=False, continuous_value_axis=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, legend_kw={}, **kwargs)

Create a grid of plots of a given type.

Parameters:
  • tasktype (string) – ‘boxraw’, ‘datetime distribution’, ‘histogram’, ‘scalars’ or ‘value counts’

  • data (dataframe or series) – The data.

  • num_rows (int, optional) – The number of rows in the grid. The default is None.

  • num_cols (int, optional) – The number of columns in the grid. The default is None.

  • vert (boolean) – True (vertical bars; the default) or False (horizontal)

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)

  • perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None

  • legend (boolean) – True (add a legend, if a stacked bar chart is plotted) or False (no legend)

  • components (list or string) – Only used if tasktype is ‘datetime distribution’: Component(s) to plot (‘year’, ‘month’, ‘dayofweek’, ‘hour’, ‘minute’ or ‘second’; case independent) or ‘raw data’ (default)

  • gap_threshold (None, int or datetime) – Only used if tasktype is ‘datetime distribution’: None (threshold will be based on the component of the data; the default) or value (threshold to use). Only used if component is specified.

  • show_gaps (boolean) – Only used if tasktype is ‘datetime distribution’: True (show gaps; the default) or False (draw lines across gaps). Only used if component is specified.

  • datalabels (boolean) – Label each data point (False (default) or True)

  • continuous_value_axis (boolean) – Plot numerical/datetime values on a continuous axis to show any gaps in values. The default is True.

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for the plotting function, e.g., scalar_bar().

Return type:

None.

vizdataquality.plot.scalar_bar(data, perceptual_threshold=0.05, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, datalabels=False, legend=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, legend_kw={}, **kwargs)

Create a bar chart showing a data quality attribute (e.g., number of missing values in each variable). The length of each bar can be adjusted to ensure that important perceptual differences are visible.

Parameters:
  • data (series) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)

  • perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None. The default is 0.05.

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.

  • ax_input (axis or None) – Matplotlib axis. The default is None.

  • vert (boolean) – True (vertical bars) or False (horizontal). The default is True.

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True). The default is 0.0.

  • datalabels (boolean) – Label each data point. The default is False.

  • legend (boolean) – NOT CURRENTLY USED. True (add a legend, if perceptual discontinuity is used) or False (no legend). The default is True.

  • filename (string) – None or a filename for the figure. The default is None.

  • overwrite (boolean) – False (do not overwrite file) or True (overwrite file if it exists). The default is False.

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object. The default is an empty dictionary.

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object. The default is an empty dictionary.

  • legend_kw (dictionary) – NOT CURRENTLY USED. Keyword arguments for a Matplotlib legend. The default is an empty dictionary.

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bar object

Return type:

None.

vizdataquality.plot.stacked_bar(data, perceptual_threshold=0.05, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, clist=[], elist=[], datalabels=False, legend=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, legend_kw={}, **kwargs)

Create a stacked bar chart showing a data quality attribute (e.g., number of missing values in each variable). The length of each bar can be adjusted to ensure that important perceptual differences are visible.

Parameters:
  • data (series or dataframe) – The data to be plotted (single bar for a series; one bar per column for a dataframe). The index contains the names of the stacks.

  • perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None. The default is 0.05.

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.

  • ax_input (axis or None) – Matplotlib axis. The default is None.

  • vert (boolean) – True (vertical bars) or False (horizontal). The default is True.

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True). The default is 0.0.

  • clist (list, optional) – The fill colours to use (a different one for each stack). The default is an empty list (use the default colours).

  • elist (list, optional) – The edge colours to use (a different one for each stack). The default is an empty list (use the default colours).

  • datalabels (boolean) – Label each data point. The default is False.

  • legend (boolean) – True (add a legend) or False (no legend). The default is True.

  • filename (string) – None or a filename for the figure. The default is None.

  • overwrite (boolean) – False (do not overwrite file) or True (overwrite file if it exists). The default is False.

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object. The default is an empty dictionary.

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object. The default is an empty dictionary.

  • legend_kw (dictionary) – Keyword arguments for a Matplotlib legend. The default is an empty dictionary.

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bar object

Return type:

None.

vizdataquality.plot.table(data, ax_input=None, include_index=False, auto_column_width=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Plot a table.

Parameters:
  • data (series or dataframe) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)

  • ax_input (axis or None) – Matplotlib axis

  • include_index (boolean) – Include the index in the table (default is False)

  • auto_column_width (boolean) – Automatically set the widths of the table columns (default is True)

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.table object. A useful one is loc=’center’

Return type:

None.

vizdataquality.plot.text(plotdata, number_of_variables_per_row=None, ax_input=None, legend=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Plot text for each variable.

Parameters:
  • plotdata (series) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.

  • ax_input (axis or None) – Matplotlib axis

  • legend (boolean) – True (add a legend, if a stacked bar chart is plotted) or False (no legend)

  • xlabels_rotate (float) – Angle to rotate X axis labels by

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bar object

Return type:

None.

vizdataquality.plot.violinplot(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)

Create a violin plot to show the distribution of numerical data.

Parameters:
  • data (series or dataframe) – The values to be plotted (each column is plotted separately)

  • number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.

  • ax_input (axis or None) – Matplotlib axis

  • vert (boolean) – True (vertical bars; the default) or False (horizontal)

  • xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)

  • filename (string) – None or a filename for the figure

  • overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)

  • fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object

  • ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object

  • kwargs (dictionary) – Keyword arguments for a Matplotlib violinplot object

Return type:

None.