Calculate statistics

Created on Wed Jul 19 04:43:43 2023

Copyright 2023 Roy Ruddle

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

vizdataquality.calculate.calc(df, options=None)

Profile a data frame or series to calculate aspects of data quality and descriptive statistics.

Parameters:

df (DataFrame or Series) – The data.
options (dict, optional) – The descriptive statistics to output. The default is None, which outputs everything.

Returns:

The descriptive statistics (a separate row for each variable; variable names are the index; columns are the different descriptive statistics).

Return type:

DataFrame
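To illustrate the shape of the returned table (one row per variable, one column per statistic), here is a minimal pandas-only sketch. The function name `profile_sketch` and the three statistic names are assumptions made for illustration. They are not taken from vizdataquality, whose `calc` may compute a different set of statistics.

```python
import pandas as pd

def profile_sketch(data):
    """Hypothetical stand-in for a profiling function: one row per variable."""
    # Promote a Series to a single-column DataFrame so both inputs are handled alike.
    df = data.to_frame() if isinstance(data, pd.Series) else data
    return pd.DataFrame({
        "Number of values": df.notna().sum(),
        "Number of missing values": df.isna().sum(),
        "Number of unique values": df.nunique(),  # missing values are not counted
    })

df = pd.DataFrame({"age": [31, 45, None, 45],
                   "city": ["Leeds", "York", "York", None]})
stats = profile_sketch(df)  # index: variable names; columns: statistics
```
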

vizdataquality.calculate.check_for_duplicate_header(df)

Check whether a dataframe contains any rows that are the same as the header, ignoring any empty columns.

Parameters: df (DataFrame) – The data.
Returns: True if the header row is duplicated, otherwise False.
Return type: bool
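One way such a check could work is sketched below using pandas only. The helper name `has_duplicate_header` is hypothetical, and the sketch assumes that "ignoring any empty columns" means dropping columns with no values before comparing each row with the column names. The library's own logic may differ.

```python
import pandas as pd

def has_duplicate_header(df):
    """Hypothetical check: does any row repeat the column names?"""
    # Assumption: columns that contain no values are left out of the comparison.
    non_empty = df.dropna(axis="columns", how="all")
    if non_empty.empty:
        return False
    header = [str(name) for name in non_empty.columns]
    # Compare every row, converted to strings, with the header.
    for row in non_empty.astype(str).itertuples(index=False, name=None):
        if list(row) == header:
            return True
    return False

df = pd.DataFrame({"id": ["1", "id", "2"],
                   "name": ["Ann", "name", "Bob"],
                   "notes": [None, None, None]})
duplicated = has_duplicate_header(df)  # the second row repeats the header
```

A repeated header typically appears when several files, each with its own header line, have been concatenated.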

vizdataquality.calculate.check_for_extra_column(df)

Check whether a dataframe ends with an extra (superfluous) column.

Parameters: df (DataFrame) – The data.
Returns: True if there is an extra column, otherwise False.
Return type: bool
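The documentation does not say how an extra column is recognised. The sketch below shows one plausible interpretation, stated as an assumption: the last column has a name that pandas generated (`Unnamed: N`) and holds no values, which is what `read_csv` produces when every line of a file ends with a trailing delimiter. The helper name `ends_with_extra_column` is hypothetical.

```python
import io
import pandas as pd

def ends_with_extra_column(df):
    """Hypothetical check: is the last column unnamed and completely empty?"""
    if df.shape[1] == 0:
        return False
    last_name = str(df.columns[-1])
    last_is_empty = bool(df.iloc[:, -1].isna().all())
    return last_name.startswith("Unnamed:") and last_is_empty

# Every line ends with a comma, so pandas reads a third, empty column.
csv_text = "a,b,\n1,2,\n3,4,\n"
df = pd.read_csv(io.StringIO(csv_text))
extra = ends_with_extra_column(df)
```
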

vizdataquality.calculate.get_column_names_to_trim(df)

Return a list of any column names that have leading and/or trailing spaces.

Parameters: df (DataFrame) – The data.
Returns: The names of any columns that should be trimmed.
Return type: list
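The same idea can be expressed in a few lines of plain pandas. This is a sketch under one assumption: `str.strip()` is used, so it also flags tabs and other whitespace, not only spaces. The helper name `column_names_to_trim` is hypothetical.

```python
import pandas as pd

def column_names_to_trim(df):
    """Hypothetical helper: column names with leading or trailing whitespace."""
    return [name for name in df.columns
            if isinstance(name, str) and name != name.strip()]

df = pd.DataFrame(columns=[" id", "name", "age "])
names = column_names_to_trim(df)

# The names could then be cleaned with a rename, for example:
cleaned = df.rename(columns={name: name.strip() for name in names})
```
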

vizdataquality.calculate.get_df_extra_values(df1, df2, convert_numbers=False)

Return the values that are not in both data frames. Optionally, numbers may be converted so that 1 and 1.0 are considered to be the same as ‘1’, and 1.1 is the same as ‘1.1’, etc.

Parameters:

df1 (DataFrame) – The first data frame.
df2 (DataFrame) – The second data frame.
convert_numbers (bool, optional) – Whether to convert numbers to strings. The default is False.

Returns:

extra_values – A set of the values that are not in corresponding columns of both data frames. None is returned if the data frames do not have the same column names.

Return type:

set
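The column-by-column comparison described above can be sketched with set operations. The function name `df_extra_values_sketch` is hypothetical and number conversion is left out for brevity. Two further assumptions: the column names must match in the same order, and missing values are ignored. The library may behave differently on both points.

```python
import pandas as pd

def df_extra_values_sketch(df1, df2):
    """Hypothetical helper: values found in only one of two matching columns."""
    # Assumption: the column names must be identical and in the same order.
    if list(df1.columns) != list(df2.columns):
        return None
    extra = set()
    for col in df1.columns:
        # Symmetric difference: values present in exactly one of the two columns.
        extra |= set(df1[col].dropna()) ^ set(df2[col].dropna())
    return extra

df1 = pd.DataFrame({"colour": ["red", "green"], "size": ["S", "M"]})
df2 = pd.DataFrame({"colour": ["red", "blue"], "size": ["S", "M"]})
extra = df_extra_values_sketch(df1, df2)
mismatch = df_extra_values_sketch(df1, df2.rename(columns={"size": "fit"}))
```
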

vizdataquality.calculate.get_missing_column_names(df)

Return a list of the names Pandas has created for columns that had no name.

Parameters: df (DataFrame) – The data.
Returns: The names for any missing columns (an empty list if no column names are missing).
Return type: list
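When `pandas.read_csv` meets a header field with no name, it generates a name of the form `Unnamed: N`, where N is the column position. The sketch below finds such names with a prefix test. Whether vizdataquality detects them in exactly this way is an assumption.

```python
import io
import pandas as pd

# The second header field is empty, so pandas has to invent a name for it.
csv_text = "id,,score\n1,a,0.5\n2,b,0.7\n"
df = pd.read_csv(io.StringIO(csv_text))

# Assumption: generated names are recognised by their 'Unnamed:' prefix.
missing = [name for name in df.columns if str(name).startswith("Unnamed:")]
```
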

vizdataquality.calculate.get_non_numeric_values(data, convert_numbers=False)

Return a list of the unique, non-numeric values in a dataframe or series.

Parameters:

data (DataFrame or Series) – The data.
convert_numbers (bool, optional) – Whether to exclude numbers stored as strings. The default is False.

Returns:

The unique, non-numeric values.

Return type:

list
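The effect of `convert_numbers` is easiest to see on a column that mixes numbers, numeric strings and text. The following is a rough sketch rather than the library's implementation. The names `non_numeric_values` and `_is_number` are hypothetical, missing values are skipped, and `float()` is used to recognise numeric strings (note that it also accepts strings such as 'nan' and 'inf').

```python
import numbers
import pandas as pd

def _is_number(value, convert_numbers):
    """Hypothetical test for 'numeric', optionally including numeric strings."""
    if isinstance(value, numbers.Number):
        return True
    if convert_numbers and isinstance(value, str):
        try:
            float(value)
            return True
        except ValueError:
            return False
    return False

def non_numeric_values(data, convert_numbers=False):
    """Unique non-numeric values, in order of first appearance."""
    columns = [data] if isinstance(data, pd.Series) else [s for _, s in data.items()]
    found = []
    for column in columns:
        for value in column.dropna():
            if not _is_number(value, convert_numbers) and value not in found:
                found.append(value)
    return found

s = pd.Series([1, "2", "n/a", 3.5, "n/a", None])
as_stored = non_numeric_values(s)                           # '2' counts as text
as_converted = non_numeric_values(s, convert_numbers=True)  # '2' counts as a number
```
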

vizdataquality.calculate.get_num_empty_cols(df)

Calculate the number of columns that do not contain any values.

Parameters: df (DataFrame) – The data.
Returns: The number of columns that do not contain any values.
Return type: int

vizdataquality.calculate.get_num_empty_rows(df)

Calculate the number of rows that do not contain any values.

Parameters: df (DataFrame) – The data.
Returns: The number of rows that do not contain any values.
Return type: int
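Counts of empty columns and empty rows can both be derived from `DataFrame.isna()`. The sketch below assumes that "does not contain any values" means every cell is missing (NaN or None). Cells holding empty strings are not treated as missing here, and the library may treat them differently.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3],
                   "b": [None, None, None],
                   "c": ["x", None, "z"]})

# A column is empty when every cell in it is missing.
num_empty_cols = int(df.isna().all(axis="index").sum())

# A row is empty when every cell in it is missing.
num_empty_rows = int(df.isna().all(axis="columns").sum())
```
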

vizdataquality.calculate.get_series_extra_values(series1, series2, convert_numbers=False)

Return the values that are not in both series. Optionally, numbers may be converted so that 1 and 1.0 are considered to be the same as ‘1’, and 1.1 is the same as ‘1.1’, etc.

Parameters:

series1 (Series) – The first series.
series2 (Series) – The second series.
convert_numbers (bool, optional) – Whether to convert numbers to strings. The default is False.

Returns:

extra_values – A set of the values that are not in both series.

Return type:

set

vizdataquality.calculate.get_value_lengths_examples(df)

Get examples of the shortest, median and longest values of each column in a dataframe.

Parameters: df (DataFrame) – The data.
Returns: A dataframe containing the examples. The first column (‘Examples’) specifies what each row contains (e.g., Shortest value).
Return type: DataFrame
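A comparable table can be built by sorting each column's values by string length. In this sketch the function name `value_length_examples` and the row labels other than 'Shortest value' are assumptions, as is the choice of the middle element of the sorted values as the "median" example.

```python
import pandas as pd

def value_length_examples(df):
    """Hypothetical helper: shortest, median and longest value of each column."""
    table = {"Examples": ["Shortest value", "Median value", "Longest value"]}
    for col in df.columns:
        # Compare values by the length of their string form; skip missing values.
        values = sorted(df[col].dropna().astype(str), key=len)
        if values:
            table[col] = [values[0], values[len(values) // 2], values[-1]]
        else:
            table[col] = [None, None, None]
    return pd.DataFrame(table)

df = pd.DataFrame({"town": ["Leeds", "Ely", "Manchester"],
                   "code": ["A1", "B22", None]})
examples = value_length_examples(df)
```
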

vizdataquality.calculate.step1_datafile_stats(encoding_results=None, filename=None, df=None)

Get general statistics and information about a datafile.

Parameters:

encoding_results (dict) – Dictionary containing a text file’s ‘encoding’ and ‘confidence’ (e.g., from utils.detect_file_encoding()).
filename (str) – Full pathname of datafile (default is None). Used to determine text file encoding.
df (DataFrame) – The data (default is None). Used for the other statistics.

Returns:

A DataFrame with the columns [‘Statistic’, ‘Value’].

Return type:

DataFrame
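The two-column layout of the result can be illustrated with a small sketch. The function name `datafile_stats_sketch` and the statistic labels are hypothetical, since the documentation does not list the statistics reported. Only the column names ‘Statistic’ and ‘Value’ and the dictionary keys ‘encoding’ and ‘confidence’ come from the description above.

```python
import pandas as pd

def datafile_stats_sketch(encoding_results=None, df=None):
    """Hypothetical helper: collect (statistic, value) pairs in a two-column table."""
    rows = []
    if encoding_results is not None:
        rows.append(("Encoding", encoding_results["encoding"]))
        rows.append(("Encoding confidence", encoding_results["confidence"]))
    if df is not None:
        rows.append(("Number of rows", df.shape[0]))
        rows.append(("Number of columns", df.shape[1]))
    return pd.DataFrame(rows, columns=["Statistic", "Value"])

data = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
stats = datafile_stats_sketch({"encoding": "utf-8", "confidence": 0.99}, data)
```
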

vizdataquality.calculate.step1_issues(df)

Get Step 1 data quality issues. If none are detected then an empty DataFrame is returned.

Parameters: df (DataFrame) – The data.
Returns: DataFrame with the columns [‘Data quality issue’, ‘Value or description’].
Return type: DataFrame
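The sketch below shows how individual checks might be gathered into such a table, with an empty DataFrame when nothing is found. The function name `issues_sketch`, the two checks and the issue labels are assumptions for illustration. The documentation does not state which issues the library reports or how it words them.

```python
import pandas as pd

def issues_sketch(df):
    """Hypothetical helper: one row per detected issue, none if the data look clean."""
    issues = []
    num_empty_cols = int(df.isna().all(axis="index").sum())
    if num_empty_cols > 0:
        issues.append(("Empty columns", num_empty_cols))
    to_trim = [c for c in df.columns if isinstance(c, str) and c != c.strip()]
    if to_trim:
        issues.append(("Column names to trim", ", ".join(repr(c) for c in to_trim)))
    return pd.DataFrame(issues, columns=["Data quality issue", "Value or description"])

clean = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
messy = pd.DataFrame({" id": [1, 2], "notes": [None, None]})
no_issues = issues_sketch(clean)
two_issues = issues_sketch(messy)
```
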