Plotting Visualizations
Created on Tue Aug 22 15:01:04 2023
Copyright 2023 Roy Ruddle
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Functions for data quality visualizations.
Internal functions are prefixed by ‘_’. The functions are grouped as follows:
- General functions
- Unused functions
- Functions for summary plots
- Functions for purity plots
- Functions to plot sets and intersections
- Explanation graph functions
- vizdataquality.plot.apply_perceptual_discontinuity_individually(input_data, perceptual_threshold, axis_limits=None)
Apply a perceptual discontinuity threshold to each value individually. If axis_limits is None, values in the range 0 < x < perceptual_threshold are set equal to perceptual_threshold. Otherwise, values in the range 0 < x < max are adjusted so that each is distinguishable from both 0 and max.
- Parameters:
input_data (series or data frame) – The data.
perceptual_threshold (float) – An absolute value (if axis_limits is None) or a fraction (0.0 - 1.0) of the axis limit maximum.
axis_limits (None or (min, max) tuple) – The axis limits. The default is None.
- Returns:
The adjusted values and, if a series was input, two extra columns for plotting a stacked bar chart: one for the values that were adjusted and one for the values that were not.
- Return type:
DataFrame
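The thresholding rule described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the name `clamp_small_values` is hypothetical, it works on a list rather than a series or data frame, and the handling of the upper end when `axis_limits` is given (pulling values that sit just below the maximum down by the same margin) is an assumption based on the phrase "distinguishable from 0 and max".

```python
def clamp_small_values(values, perceptual_threshold, axis_limits=None):
    """Hypothetical sketch of per-value perceptual discontinuity."""
    if axis_limits is None:
        # Threshold is an absolute value: lift small non-zero values up to it.
        return [perceptual_threshold if 0 < v < perceptual_threshold else v
                for v in values]

    # Threshold is a fraction of the axis maximum (assumed interpretation).
    _, axis_max = axis_limits
    margin = perceptual_threshold * axis_max
    adjusted = []
    for v in values:
        if 0 < v < margin:
            v = margin                 # keep it visibly different from 0
        elif axis_max - margin < v < axis_max:
            v = axis_max - margin      # keep it visibly different from max
        adjusted.append(v)
    return adjusted


absolute = clamp_small_values([0, 1, 50, 100], perceptual_threshold=5)
relative = clamp_small_values([0, 1, 50, 99, 100], 0.05, axis_limits=(0, 100))
```

In both calls the exact zeros and the exact maximum are left untouched, so a bar of length 0 still reads as "nothing" and a full bar still reads as "everything".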
- vizdataquality.plot.apply_perceptual_discontinuity_to_group(input_data, perceptual_threshold, axis_limits=None)
Apply perceptual discontinuity threshold to a group of values. The values are adjusted so each is >= perceptual_threshold * sum, but the sum is unchanged. That is achieved by increasing values that are below the threshold and decreasing values that are above the threshold.
- Parameters:
input_data (series or data frame) – The data.
perceptual_threshold (float) – An absolute value (if axis_limits is None) or a fraction (0.0 - 1.0) of the axis limit maximum.
axis_limits (None or (min, max) tuple) – The axis limits. The default is None.
- Returns:
A series or dataframe with perceptual discontinuity applied to groups of values (e.g., in a stacked bar chart).
- Return type:
Series or DataFrame
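One way to raise small values while keeping the sum unchanged is to take the shortfall from the larger values in proportion to how far each exceeds the minimum. The sketch below shows that idea for a single group of positive values. It is not taken from the library: the name `apply_group_threshold` is hypothetical, and the proportional redistribution is an assumed strategy.

```python
def apply_group_threshold(values, perceptual_threshold):
    """Hypothetical sketch: enforce a minimum share of the total per value."""
    total = sum(values)
    floor = perceptual_threshold * total
    if floor * len(values) > total:
        raise ValueError("threshold too large for this many values")

    # How much the small values need, and how much the large values can give.
    deficit = sum(floor - v for v in values if v < floor)
    excess = sum(v - floor for v in values if v > floor)
    if deficit == 0:
        return list(values)

    # Shrink every value above the floor by the same proportion of its excess.
    scale = (excess - deficit) / excess
    return [floor if v <= floor else floor + (v - floor) * scale
            for v in values]


stack = apply_group_threshold([1, 1, 98], perceptual_threshold=0.05)
```

With a threshold of 0.05 and a total of 100, the two values of 1 are raised to 5 and the large value gives up the difference, so the stack still sums to 100.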
- vizdataquality.plot.boxplot(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Create a box plot using the supplied stats to show the distribution of numerical or date/time data.
- Parameters:
data (series) – Series containing the variable names (index) and pre-computed boxplot stats (for each variable, a list defining [whislo, q1, med, q3, whishi]; see Matplotlib's Axes.bxp() for details).
number_of_variables_per_row (int) – None (plot all variables in one plot) or the number of variables to show (used by multiplot()). The default is None.
ax_input (axis or None) – Matplotlib axis
vert (boolean) – True (vertical bars; the default) or False (horizontal)
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bxp object
- Return type:
None.
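The pre-computed stats that boxplot() expects could be produced in several ways. The following sketch uses only the standard library and places the whiskers at the minimum and maximum, which is an assumption; the helper name `five_number_summary` and the example variable names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return [whislo, q1, med, q3, whishi] with whiskers at the extremes."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return [ordered[0], q1, med, q3, ordered[-1]]


# One list of stats per variable, keyed by variable name.
stats_by_variable = {
    "age": five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]),
    "height": five_number_summary([150, 160, 170, 180, 190]),
}
```

A series built from a dictionary like this has the variable names as its index and one stats list per variable, which is the shape described for the data parameter.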
- vizdataquality.plot.boxplot_raw(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Create a box plot from raw data to show the distribution of numerical data.
- Parameters:
data (series or dataframe) – The values of a variable(s).
number_of_variables_per_row (int) – None (plot all variables in one plot) or the number of variables to show (used by multiplot()). The default is None.
ax_input (axis or None) – Matplotlib axis
vert (boolean) – True (vertical bars; the default) or False (horizontal)
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.boxplot object
- Return type:
None.
- vizdataquality.plot.datetime_counts(data, component='raw data', gap_threshold=None, show_gaps=True, ax_input=None, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Plot the overall distribution of a datetime variable, or the distribution of a specific component (e.g., month).
- Parameters:
data (series) – Values of a variable.
component (string) – Component to plot (‘year’, ‘month’, ‘dayofweek’, ‘hour’, ‘minute’ or ‘second’; case independent) or ‘raw data’ (default)
gap_threshold (None, int or datetime) – None (threshold will be based on the component of the data; the default) or value (threshold to use). Only used if component is specified.
show_gaps (boolean) – True (show gaps; the default) or False (draw lines across gaps). Only used if component is specified.
ax_input (axis or None) – Matplotlib axis
xlabels_rotate (float) – Angle to rotate X axis labels by. The default is 0.0.
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.plot object
- Return type:
None.
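To make the idea of plotting one component concrete, the sketch below counts datetimes by component using the standard library and then looks for component values with no data. It only illustrates the concept and is not how the library computes its counts; `component_counts` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime


def component_counts(values, component="month"):
    """Count datetimes by one component (hypothetical helper)."""
    name = component.lower()  # component names are case independent
    if name == "dayofweek":
        return Counter(v.weekday() for v in values)  # Monday == 0
    return Counter(getattr(v, name) for v in values)


timestamps = [
    datetime(2023, 1, 5, 9, 30),
    datetime(2023, 1, 20, 14, 0),
    datetime(2023, 3, 2, 9, 15),
]
month_counts = component_counts(timestamps, "Month")

# Months between January and March that have no records at all.
missing_months = [m for m in range(1, 4) if m not in month_counts]
```

Here February has no records, which is the kind of gap that the show_gaps option is meant to leave visible rather than bridge with a line.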
- vizdataquality.plot.dot_whisker(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Create a dot-or-whisker plot (e.g., to show value lengths for each variable).
- Parameters:
data (series) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)
number_of_variables_per_row (int) – None (plot all variables in one plot) or the number of variables to show (used by multiplot()). The default is None.
ax_input (axis or None) – Matplotlib axis
vert (boolean) – True (vertical bars; the default) or False (horizontal)
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.scatter or Axes.errorbar object
- Return type:
None.
- vizdataquality.plot.histogram(data, perceptual_threshold=0.05, ax_input=None, vert=True, xlabels_rotate=0.0, datalabels=False, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Create a histogram to show the distribution of numerical data.
- Parameters:
data (series, list or numpy array) – The values to be plotted
perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None. The default is 0.05.
ax_input (axis or None) – Matplotlib axis
vert (boolean) – True (vertical bars; the default) or False (horizontal)
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)
datalabels (boolean) – Label each bin. The default is False.
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for a Matplotlib hist object
- Return type:
None.
- vizdataquality.plot.line(data, option='show gaps', gap_threshold=1, ax_input=None, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Create a line chart showing a data quality attribute (e.g., to show value counts for a discrete numerical variable).
- Parameters:
data (series) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)
option (string) – How gaps in the data are handled (‘show gaps’, ‘show missing’ or ‘interpolate’). The default is ‘show gaps’.
gap_threshold (int) – Gaps larger than this threshold will be shown (only used if option is ‘show gaps’). The default is 1.
ax_input (axis or None) – Matplotlib axis. The default is None.
xlabels_rotate (float) – Angle to rotate X axis labels by. The default is 0.0.
filename (string) – None or a filename for the figure. The default is None.
overwrite (boolean) – False (do not overwrite file) or True (overwrite file if it exists). The default is False.
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object. The default is an empty dictionary.
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object. The default is an empty dictionary.
kwargs (dictionary) – Keyword arguments for Matplotlib Axes.plot and Axes.scatter objects
- Return type:
None.
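The effect of gap_threshold with the ‘show gaps’ option can be pictured as splitting the index into runs, each drawn as its own line segment. The sketch below is an assumed illustration of that splitting for a sorted numeric index, not the library's code; `split_at_gaps` is a hypothetical name.

```python
def split_at_gaps(index_values, gap_threshold=1):
    """Split sorted index values into runs separated by large gaps."""
    segments = []
    for value in index_values:
        if segments and value - segments[-1][-1] <= gap_threshold:
            segments[-1].append(value)   # close enough: continue the line
        else:
            segments.append([value])     # gap too large: start a new line
    return segments


segments = split_at_gaps([1, 2, 3, 6, 7, 10], gap_threshold=1)
```

With the default threshold of 1, consecutive integers stay in one segment, while the jumps from 3 to 6 and from 7 to 10 each start a new one.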
- vizdataquality.plot.lollipop(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, datalabels=False, continuous_value_axis=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Create a lollipop plot (e.g., to show value counts for a variable).
- Parameters:
data (series) – Value counts for a variable.
number_of_variables_per_row (int) – None (plot all variables in one plot) or the number of variables to show (used by multiplot()). The default is None.
ax_input (axis or None) – Matplotlib axis. The default is None.
vert (boolean) – True (vertical bars) or False (horizontal). The default is True.
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert=True). The default is 0.0.
datalabels (boolean) – Label each data point. The default is False.
continuous_value_axis (boolean) – Plot numerical/datetime values on a continuous axis to show any gaps in values. The default is True.
filename (string) – Filename for the figure. The default is None.
overwrite (boolean) – False (do not overwrite file) or True (overwrite file if it exists). The default is False.
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object. The default is {}.
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object. The default is {}.
kwargs (dictionary) – Keyword arguments for Matplotlib Axes.scatter and Axes.plot objects
- Return type:
None.
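The difference that continuous_value_axis makes can be shown by comparing where each lollipop would sit along the value axis. This is a conceptual sketch under the assumption that the alternative to a continuous axis is evenly spaced, category-style positions; `x_positions` is a hypothetical helper.

```python
def x_positions(values, continuous_value_axis=True):
    """Return the axis position of each value (hypothetical helper)."""
    if continuous_value_axis:
        return list(values)            # position is the value itself
    return list(range(len(values)))    # position is the rank, gaps are hidden


values = [1, 2, 3, 7]
continuous = x_positions(values, continuous_value_axis=True)
categorical = x_positions(values, continuous_value_axis=False)
```

On the continuous axis the empty stretch between 3 and 7 remains visible, whereas the evenly spaced positions make the four values look contiguous.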
- vizdataquality.plot.multiplot(plottype, data, perceptual_threshold=0.05, number_of_variables_per_row=None, vert=True, xlabels_rotate=0.0, clist=[], datalabels=False, legend=True, continuous_value_axis=True, filename=None, overwrite=False, plt_kw={}, fig_kw={}, ax_kw={}, legend_kw={}, **kwargs)
Plot a data quality attribute (e.g., number of missing values in each variable). The variables can be plotted on multiple rows of bar charts. The length of each bar can be adjusted to ensure that important perceptual differences are visible.
- Parameters:
plottype (string) – ‘bar’, ‘box’, ‘boxraw’, ‘dot-or-whisker’, ‘lollipop’, ‘stackedbar’ or ‘violin’
data (dataframe (stackedbar or violinplot) or series (all plot types except stackedbar)) – Depends on plottype. ‘bar’, ‘box’, ‘dot-or-whisker’: series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable). ‘lollipop’: series containing value counts. ‘stackedbar’: dataframe where each column is a bar and the index/rows are the stacks. ‘violin’, ‘boxraw’: the values of one (series) or more (dataframe) columns.
perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None. The default is 0.05.
number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show in each row
vert (boolean) – True (vertical bars; the default) or False (horizontal)
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)
clist (list, optional) – A list of the colours to use (a different one for each stack in stacked bars). The default is an empty list (use the default colours).
datalabels (boolean) – Label each data point. The default is False.
legend (boolean) – True (add a legend) or False (no legend). The default is True.
continuous_value_axis (boolean) – Plot numerical/datetime values on a continuous axis to show any gaps in values. The default is True.
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
plt_kw (dictionary) – Keyword arguments for a Matplotlib pyplot object
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
legend_kw (dictionary) – Keyword arguments for a Matplotlib legend. Only used if plottype = ‘bar’
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bar (scalarbar), Axes.bxp (boxplot), Axes.boxplot (boxraw), Axes.scatter and Axes.errorbar (dot_whisker), Axes.scatter and Axes.plot (lollipop) or violinplot object (violin)
- Return type:
None.
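Splitting variables across several rows of charts amounts to chunking the list of variable names. The sketch below shows that chunking on its own; it is an assumed illustration rather than the library's implementation, and `split_into_rows` is a hypothetical name.

```python
def split_into_rows(variable_names, number_of_variables_per_row=None):
    """Group variable names into rows of at most the requested size."""
    names = list(variable_names)
    if number_of_variables_per_row is None:
        return [names]  # everything goes in a single chart
    step = number_of_variables_per_row
    return [names[i:i + step] for i in range(0, len(names), step)]


rows = split_into_rows(["a", "b", "c", "d", "e"], number_of_variables_per_row=2)
single = split_into_rows(["a", "b", "c"])
```

The last row may hold fewer variables than the others, as happens here with five variables and two per row.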
- vizdataquality.plot.plotgrid(tasktype, data, num_rows=None, num_cols=None, vert=True, xlabels_rotate=0.0, perceptual_threshold=0.05, legend=True, components='raw data', gap_threshold=None, show_gaps=True, datalabels=False, continuous_value_axis=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, legend_kw={}, **kwargs)
Create a grid of plots of a given type.
- Parameters:
tasktype (string) – ‘boxraw’, ‘datetime distribution’, ‘histogram’, ‘scalars’ or ‘value counts’
data (dataframe or series) – The data.
num_rows (int, optional) – The number of rows in the grid. The default is None.
num_cols (int, optional) – The number of columns in the grid. The default is None.
vert (boolean) – True (vertical bars; the default) or False (horizontal)
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)
perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None. The default is 0.05.
legend (boolean) – True (add a legend, if a stacked bar chart is plotted) or False (no legend)
components (list or string) – Only used if tasktype is ‘datetime distribution’: Component(s) to plot (‘year’, ‘month’, ‘dayofweek’, ‘hour’, ‘minute’ or ‘second’; case independent) or ‘raw data’ (default)
gap_threshold (None, int or datetime) – Only used if tasktype is ‘datetime distribution’: None (threshold will be based on the component of the data; the default) or value (threshold to use). Only used if component is specified.
show_gaps (boolean) – Only used if tasktype is ‘datetime distribution’: True (the default) or False (draw lines across gaps). Only used if component is specified.
datalabels (boolean) – Label each data point (False (default) or True)
continuous_value_axis (boolean) – Plot numerical/datetime values on a continuous axis to show any gaps in values. The default is True.
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for the plotting function, e.g., scalar_bar().
- Return type:
None.
- vizdataquality.plot.scalar_bar(data, perceptual_threshold=0.05, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, datalabels=False, legend=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, legend_kw={}, **kwargs)
Create a bar chart showing a data quality attribute (e.g., number of missing values in each variable). The length of each bar can be adjusted to ensure that important perceptual differences are visible.
- Parameters:
data (series) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)
perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None. The default is 0.05.
number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.
ax_input (axis or None) – Matplotlib axis. The default is None.
vert (boolean) – True (vertical bars) or False (horizontal). The default is True.
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True). The default is 0.0.
datalabels (boolean) – Label each data point. The default is False.
legend (boolean) – NOT CURRENTLY USED. True (add a legend, if perceptual discontinuity is used) or False (no legend). The default is True.
filename (string) – None or a filename for the figure. The default is None.
overwrite (boolean) – False (do not overwrite file) or True (overwrite file if it exists). The default is False.
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object. The default is an empty dictionary.
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object. The default is an empty dictionary.
legend_kw (dictionary) – NOT CURRENTLY USED. Keyword arguments for a Matplotlib legend. The default is an empty dictionary.
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bar object
- Return type:
None.
- vizdataquality.plot.stacked_bar(data, perceptual_threshold=0.05, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, clist=[], elist=[], datalabels=False, legend=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, legend_kw={}, **kwargs)
Create a stacked bar chart showing a data quality attribute (e.g., number of missing values in each variable). The length of each bar can be adjusted to ensure that important perceptual differences are visible.
- Parameters:
data (series or dataframe) – The data to be plotted (single bar for a series; one bar per column for a dataframe). The index contains the names of the stacks.
perceptual_threshold (float) – Perceptual discontinuity threshold (0.0 - 1.0) or None. The default is 0.05.
number_of_variables_per_row (int) – None (plot all variables in one bar chart) or the number of variables to show (used by multiplot()). The default is None.
ax_input (axis or None) – Matplotlib axis. The default is None.
vert (boolean) – True (vertical bars) or False (horizontal). The default is True.
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True). The default is 0.0.
clist (list, optional) – The fill colours to use (a different one for each stack). The default is an empty list (use the default colours).
elist (list, optional) – The edge colours to use (a different one for each stack). The default is an empty list (use the default colours).
datalabels (boolean) – Label each data point. The default is False.
legend (boolean) – True (add a legend) or False (no legend). The default is True.
filename (string) – None or a filename for the figure. The default is None.
overwrite (boolean) – False (do not overwrite file) or True (overwrite file if it exists). The default is False.
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object. The default is an empty dictionary.
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object. The default is an empty dictionary.
legend_kw (dictionary) – Keyword arguments for a Matplotlib legend. The default is an empty dictionary.
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bar object
- Return type:
None.
- vizdataquality.plot.table(data, ax_input=None, include_index=False, auto_column_width=True, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Plot a table.
- Parameters:
data (series or dataframe) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)
ax_input (axis or None) – Matplotlib axis
include_index (boolean) – Include the index in the table (default is False)
auto_column_width (boolean) – Automatically set the widths of the table columns (default is True)
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.table object. A useful one is loc='center'.
- Return type:
None.
- vizdataquality.plot.text(plotdata, number_of_variables_per_row=None, ax_input=None, legend=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Plot text for each variable.
- Parameters:
plotdata (series) – Series containing the variable names (index) and data quality attribute to be plotted (e.g., number of missing values in each variable)
number_of_variables_per_row (int) – None (plot all variables in one plot) or the number of variables to show (used by multiplot()). The default is None.
ax_input (axis or None) – Matplotlib axis
legend (boolean) – True (add a legend, if a stacked bar chart is plotted) or False (no legend)
xlabels_rotate (float) – Angle to rotate X axis labels by. The default is 0.0.
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for a Matplotlib Axes.bar object
- Return type:
None.
- vizdataquality.plot.violinplot(data, number_of_variables_per_row=None, ax_input=None, vert=True, xlabels_rotate=0.0, filename=None, overwrite=False, fig_kw={}, ax_kw={}, **kwargs)
Create a violin plot to show the distribution of numerical data.
- Parameters:
data (series or dataframe) – The values to be plotted (each column is plotted separately)
number_of_variables_per_row (int) – None (plot all variables in one plot) or the number of variables to show (used by multiplot()). The default is None.
ax_input (axis or None) – Matplotlib axis
vert (boolean) – True (vertical bars; the default) or False (horizontal)
xlabels_rotate (float) – Angle to rotate X axis labels by (only used if vert = True)
filename (string) – None or a filename for the figure
overwrite (boolean) – False (do not overwrite file; the default) or True (overwrite file if it exists)
fig_kw (dictionary) – Keyword arguments for a Matplotlib Figure object
ax_kw (dictionary) – Keyword arguments for a Matplotlib Axes object
kwargs (dictionary) – Keyword arguments for a Matplotlib violinplot object
- Return type:
None.