SciPy library in Python: The SciPy library provides a wide range of scientific computing functions; the aim of this section is to give an idea of how the library is used. The library is organised into submodules (toolboxes) that solve common scientific problems, for example image processing and statistics.
Some of the functions available in scipy.stats in Python are described below:-
scipy.stats.cumfreq(a, numbins, defaultreallimits, weights) uses a histogram function internally and calculates a cumulative frequency histogram.
The cumulative frequencies are returned together with the lower limit of the bins, the bin width and the number of extra points that fell outside the bins.
Parameters:-
a:-The input array.
numbins:-The number of bins to use for the histogram; the default value is 10.
defaultreallimits:-The lower and upper values of the range of the histogram; if not given, limits slightly wider than the data range are used.
weights:-The weights for each value in the array; by default each value has a weight of 1.0.
Returns:-The cumulative bin counts, the lower real limit, the bin width and the number of extra points.
Example:-
from scipy import stats
arr1 = [1, 3, 27, 2, 5, 13]
print("Array element:", arr1)
# cumfreq returns the cumulative counts, the lower limit, the bin size and the extra points
a, b, c, d = stats.cumfreq(arr1, numbins=4)
print("Cumulative frequency:", a)
print("Lower limit:", b)
print("Bin size:", c)
print("Extra points:", d)
Output:-
Array element: [1, 3, 27, 2, 5, 13]
Cumulative frequency: [4. 5. 5. 6.]
Lower limit: -3.33
Bin size: 8.66
Extra points: 0
scipy.stats.iqr(x, axis=None) computes the interquartile range of the data along the specified axis.
The IQR (interquartile range) is the difference between the 75th and the 25th percentile of the data. It is much more robust against outliers than the full range.
The rng parameter allows percentile ranges other than the default (25, 75) to be used, so the function can also compute ranges other than the actual IQR.
Example:-
import numpy as np
from scipy.stats import iqr
x = np.array([[10, 7, 4], [3, 2, 1]])
x
array ([[10, 7, 4], [3, 2, 1]])
iqr(x)
4.0
iqr(x, axis=0)
array([3.5, 2.5, 1.5])
iqr(x,axis=1)
array([3., 1.])
iqr(x, axis=1, keepdims=True)
array([[3.], [1.]])
The gmean function is given as follows: scipy.stats.gmean(array, axis=0, dtype=None) calculates the geometric mean of the array elements along the specified axis.
Parameters:-
The parameters of the gmean are as follows.
array:-The input array or object containing the elements whose geometric mean is to be calculated.
axis:-The axis along which the geometric mean is computed; the default axis is 0.
dtype:-The type of the returned array and of the accumulator in which the elements are summed.
Example:-
from scipy.stats.mstats import gmean
arr1 = gmean([1, 3, 27])
print("Geometric mean is:", arr1)
Output:-
Geometric mean is: 4.327
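The geometric mean can also be checked by hand as the exponential of the mean of the logarithms of the values; the lines below are a minimal sketch of that check (not part of the original example):
import numpy as np
from scipy.stats.mstats import gmean
arr1 = [1, 3, 27]
# For positive values, the geometric mean equals exp(mean(log(x)))
manual = np.exp(np.mean(np.log(arr1)))
print(gmean(arr1), manual)   # both print the same value, about 4.33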
Ranking refers to a data transformation in which the numerical values are replaced by their rank when the data is sorted.
scipy.stats.rankdata(a, method)
This assigns ranks to the data, beginning at 1, and the method argument controls how ranks are assigned to equal (tied) values.
The parameters are as follows:-
a:-The array of values to be ranked; it is first flattened.
method:-The method used to assign ranks to tied elements; the options are min, max, dense and ordinal (the default is average).
min:-The minimum of the ranks that would have been assigned to all the tied values is assigned to each tied value.
max:-The maximum of the ranks that would have been assigned to all the tied values is assigned to each tied value.
ordinal:-All values are given a distinct rank, in the order in which they appear in the array.
dense:-Like min, but the rank of the next highest element is the rank immediately after the tied elements.
Example:-
from scipy.stats import rankdata
rankdata([0, 2, 3, 2])
array([1. , 2.5, 4. , 2.5])
rankdata([0, 2, 3, 2], method='min')
array([1., 2., 4., 2.])
rankdata([0, 2, 3, 2], method='max')
array([1., 3., 4., 3.])
rankdata([0, 2, 3, 2], method='dense')
array([1., 2., 3., 2.])
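The ordinal method, described above but not shown in the example, gives every element a distinct rank in order of appearance; a minimal sketch:
from scipy.stats import rankdata
# The tied values (the two 2s) get distinct ranks in the order they appear
print(rankdata([0, 2, 3, 2], method='ordinal'))   # [1 2 4 3]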
The normal distribution is the most common distribution in statistics.
"Skewness" and "kurtosis" can be used to test the normality of a dataset.
Normality checking includes the skewness test, the combined normality test, the Shapiro-Wilk test, etc.
A normality test is used to decide whether a dataset follows a normal distribution; many statistical procedures require data that is nearly or exactly normal.
Skewness measures the asymmetry of the probability distribution of a random variable about its mean.
The Shapiro-Wilk statistic, for example, is the ratio of two estimates of the variance of a normal distribution based on a random sample of n observations: the numerator is proportional to the square of a linear estimator of the standard deviation, and the denominator is the sum of squared deviations of the observations from the sample mean.
Normality tests compare the observed distribution of a variable with what is expected under the normality assumption.
Note that for small samples these tests have relatively low statistical power for detecting departures from normality.
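A small illustration of the individual checks mentioned above is given below (a hedged sketch; the sample is simulated with NumPy, and stats.skew, stats.skewtest and stats.shapiro are standard scipy.stats calls):
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=500)   # simulated normal data
# Sample skewness (close to 0 for normal data)
print("skewness:", stats.skew(sample))
# Skewness test: does the skew differ from that of a normal distribution?
print("skewtest:", stats.skewtest(sample))
# Shapiro-Wilk test of normality
print("shapiro:", stats.shapiro(sample))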
The combined normality test is scipy.stats.normaltest(a, axis=0, nan_policy='propagate'), with the following parameters:-
a:-The array containing the sample to be tested.
axis:-The axis along which to compute the test; the default value is 0.
nan_policy:-Defines how to handle an input that contains NaN values: 'propagate' returns NaN, 'raise' throws an error, and 'omit' ignores the NaN values when performing the calculation.
Returns:-The test statistic and the two-sided chi-squared probability (p-value) for the hypothesis test.
Example:-
from scipy import stats
import numpy as np
pts = 1000
np.random.seed(28041990)
a = np.random.normal(0, 1, size=pts)
b = np.random.normal(2, 1, size=pts)
x = np.concatenate((a, b))
k2, p = stats.normaltest(x)
alpha = 1e-3
print("p = {:g}".format(p))
# output: p = 3.27e-11
if p < alpha:
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")
Statistical hypothesis testing is used in making decisions using data.
A hypothesis is basically an assumption made about a population parameter.
Homogeneity is an assumption behind both the t-test and the F-test: the population variances of the samples being compared are assumed to be equal.
Homogeneity is also called "homoscedasticity" and refers to the variance of Y being the same across the values of X.
Checking the homogeneity of variance tells us whether the variances are equal and whether tests that assume this are suited to the data.
Nowadays we informally test many things, for example finding the fastest route or the quickest way to finish our work.
The idea of stating an assumption and then checking it against data is what works behind this, and it is called hypothesis testing.
It is one of the basic tools a data scientist uses when evaluating new ideas; a short worked sketch is given below.
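The following is a minimal sketch of this workflow; the two samples are simulated, the 0.05 threshold is an illustrative assumption, and scipy.stats.levene and scipy.stats.ttest_ind are the standard SciPy calls for a homogeneity check and a two-sample t-test:
import numpy as np
from scipy import stats
rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)   # simulated sample A
group_b = rng.normal(loc=5.5, scale=1.0, size=50)   # simulated sample B
# Levene's test checks the homogeneity-of-variance assumption
stat, p_var = stats.levene(group_a, group_b)
print("Levene p-value:", p_var)
# Two-sample t-test; equal_var=True relies on the homogeneity assumption
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=(p_var > 0.05))
print("t-test p-value:", p_val)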
Determining correlation helps in understanding the relationship between two or more variables during data analysis or statistical analysis.
The most commonly used correlation measures are the Pearson and the Spearman correlation.
We will compute both the Pearson and the Spearman correlation coefficients; both can be computed in Python with pandas, NumPy, scikit-learn and SciPy.
As an example, we compute the correlation between the gdpPercap and lifeExp values collected from multiple countries over time (the gapminder dataset).
Example:-
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
After loading the gapminder data as follows,
gapminder = pd.read_csv(data_url)   # data_url points to a CSV copy of the gapminder data
gapminder = gapminder[['gdpPercap', 'lifeExp']]
The pandas corr() function can be used to see that gdpPercap and lifeExp are correlated: as GDP per capita increases, life expectancy increases.
Example:-
gapminder.gdpPercap.corr(gapminder.lifeExp, method="pearson")
Output:-
0.58
NumPy's corrcoef() function can also be used to compute the Pearson coefficient; it takes multiple variables as a 2D NumPy array and returns a correlation matrix.
Example:-
np.corrcoef(gapminder.gdpPercap, gapminder.lifeExp)
array([[1.        , 0.58370422], [0.58370422, 1.        ]])
from scipy import stats
gdpPercap = gapminder.gdpPercap.values
life_exp=gapminder.lifeExp.values
SciPy's pearsonr() function takes two NumPy arrays and returns a tuple containing the correlation coefficient and the p-value.
stats.pearsonr (gdpPercap, life_exp)
The first element of the returned tuple is the correlation coefficient (about 0.58); the second is the p-value.
The Spearman correlation can also be computed with the pandas corr() function as follows,
gapminder.gdpPercap.corr(gapminder.lifeExp, method="spearman")
Output:-
0.8
NumPy does not have a dedicated function for computing the Spearman correlation. However, the Spearman correlation is simply the Pearson correlation computed on the rank values of the variables, so we can compute the ranks of the two variables and then use the Pearson correlation function available in NumPy.
gapminder["gdpPercap_r"] = gapminder.gdpPercap.rank()
gapminder["lifeExp_r"] = gapminder.lifeExp.rank()
gapminder.head()
Next, using the NumPy corrcoef() function on the ranked columns,
np.corrcoef (gapminder.gdpPercap_r, gapminder.lifeExp_r)
array([[1.  , 0.82], [0.82, 1.  ]])
The Spearman correlation can also be obtained directly with the scipy.stats spearmanr() function, which returns the correlation coefficient and the p-value:
stats.spearmanr (gdpPercap, life_exp)
Comparing the Pearson and Spearman results for gdpPercap and lifeExp, Pearson gives about 0.58 while Spearman gives about 0.8.
The Pearson correlation assumes that the data is normally distributed (and measures a linear relationship), whereas the Spearman correlation makes no assumption about the distribution of the data and only requires the relationship to be monotonic, as illustrated in the sketch below.
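To illustrate the difference, the sketch below uses simulated data (not the gapminder example) with an exponential, strictly monotonic relationship: Spearman reports a perfect monotonic association while Pearson reports a weaker linear one:
import numpy as np
from scipy import stats
x = np.arange(1, 101)
y = np.exp(x / 20.0)             # monotonic but strongly non-linear
print(stats.pearsonr(x, y)[0])   # noticeably less than 1 (relationship is not linear)
print(stats.spearmanr(x, y)[0])  # exactly 1.0 (perfectly monotonic)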
The chi-squared test compares the observed frequencies for a categorical variable with the expected frequencies for that variable.
The chi-squared test checks whether a relationship is present between two categorical variables.
The data is displayed in tabular form (a contingency table), and each row represents the categories of one variable.
The chi-squared test evaluates the data as a whole, which means one is not able to tell which levels are responsible for the relationship unless the table is 2*2.
In machine learning we need to know which input features are relevant to the output to be predicted, and this is where the problem of feature selection arises.
Statistical tests can be used to determine which input features have the strongest relationship with the output variable.
The chi-squared test of independence is a good example of such a test.
The chi-squared test is a statistical hypothesis test, also written as the 'χ² test'.
The test statistic is constructed from squared errors and follows a chi-squared distribution.
This is asymptotically true, meaning that the sampling distribution of the statistic gets close to a chi-squared distribution as the sample size grows. A small worked sketch is given below.
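As a hedged sketch of how such a test is run in SciPy (the 2*2 contingency table below is made-up illustrative data; scipy.stats.chi2_contingency is the standard function for a test of independence):
import numpy as np
from scipy.stats import chi2_contingency
# Observed frequencies: rows are one categorical variable, columns the other
observed = np.array([[30, 10],
                     [20, 40]])
chi2, p, dof, expected = chi2_contingency(observed)
print("chi-squared statistic:", chi2)
print("p-value:", p)
print("degrees of freedom:", dof)
print("expected frequencies:", expected)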