Pearson Correlation
import necessary libraries
In [11]:
import pandas as pd
import numpy as np
import scipy.stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
read in the data
In [2]:data1 = pd.read_csv('gapminder2.csv', low_memory=False)
remove unnecessary columns and make a copy of the subdata
In [3]:
data2 = data1[["continent", "country", "breastcancerper100th", "urbanrate",
               "incomeperperson", "breastcancernbdeaths"]]
data = data2.copy()
remove missing values (in my case, '0' values)
In [5]:
data = data.replace(0, np.NaN)
data = data.dropna()
Change the data type for chosen variables
In [6]:
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'])
data['urbanrate'] = pd.to_numeric(data['urbanrate'])
Scatterplots to visualize the relationships
In [7]:
scat1 = sns.regplot(x="urbanrate", y="breastcancerper100th", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Breast Cancer Cases')
plt.title('Scatterplot for the Association Between Urban Rate and Number of new breast cancer cases')
In [8]:
scat2 = sns.regplot(x="incomeperperson", y="breastcancerper100th", fit_reg=True, data=data)
plt.xlabel('Income per Person')
plt.ylabel('Breast Cancer Cases')
plt.title('Scatterplot for the Association Between Income per Person and Number of new breast cancer cases')
Pearson Correlation
In [9]:
print('association between urbanrate and breast cancer cases')
print(scipy.stats.pearsonr(data['urbanrate'], data['breastcancerper100th']))

association between urbanrate and breast cancer cases
(0.57721793332990379, 4.8461295565483764e-16)
In [10]:
print('association between incomeperperson and breast cancer cases')
print(scipy.stats.pearsonr(data['incomeperperson'], data['breastcancerper100th']))

association between incomeperperson and breast cancer cases
(0.73139851823791835, 6.7680050678785747e-29)
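As a rough follow-up (my addition, not part of the original post), squaring the correlation coefficients gives the fraction of variance each pair of variables shares; a minimal sketch, assuming the cleaned data frame built above:

r_urban, p_urban = scipy.stats.pearsonr(data['urbanrate'], data['breastcancerper100th'])
r_income, p_income = scipy.stats.pearsonr(data['incomeperperson'], data['breastcancerper100th'])
# r-squared (coefficient of determination) for each pair
print('r-squared, urban rate vs breast cancer cases: %.3f' % r_urban ** 2)          # roughly 0.33
print('r-squared, income per person vs breast cancer cases: %.3f' % r_income ** 2)  # roughly 0.53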
Generating a Correlation Coefficient
import necessary libraries
In [17]:
import pandas as pd
import numpy as np
import scipy.stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
read in the data
In [2]:data1 = pd.read_csv('gapminder2.csv', low_memory=False)
remove unnecessary columns and make a copy of the subdata
In [4]:
data2 = data1[["continent", "country", "breastcancerper100th", "urbanrate",
               "incomeperperson", "breastcancernbdeaths"]]
data = data2.copy()
remove missing values
In [6]:
data = data.replace(0, np.NaN)
data = data.dropna()
Change the data type for chosen variables
In [7]:
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'])
data['urbanrate'] = pd.to_numeric(data['urbanrate'])
Request Pearson Correlation
In [9]:
print("association between urbanrate and breast cancer cases")
print(scipy.stats.pearsonr(data['urbanrate'], data['breastcancerper100th']))

association between urbanrate and breast cancer cases
(0.57721793332990379, 4.8461295565483764e-16)
Add a moderator variable to the data
In [10]:
def incomegrp(row):
    if row['incomeperperson'] <= 744.239:
        return 1
    elif row['incomeperperson'] <= 9425.326:
        return 2
    elif row['incomeperperson'] > 9425.326:
        return 3
In [11]:data['incomegrp'] = data.apply (lambda row: incomegrp (row),axis=1)
In [12]:
chk1 = data['incomegrp'].value_counts(sort=False, dropna=False)
print(chk1)

1    46
2    81
3    38
Name: incomegrp, dtype: int64
create sub-datasets for each value of incomegrp
In [13]:
sub1 = data[(data['incomegrp'] == 1)]
sub2 = data[(data['incomegrp'] == 2)]
sub3 = data[(data['incomegrp'] == 3)]
Run Pearson Correlation for every subset
In [14]:
print('association between urbanrate and breast cancer cases for LOW income countries')
print(scipy.stats.pearsonr(sub1['urbanrate'], sub1['breastcancerper100th']))

association between urbanrate and breast cancer cases for LOW income countries
(0.17887135328796352, 0.23428271424528749)
In [15]:
print('association between urbanrate and breast cancer cases for MIDDLE income countries')
print(scipy.stats.pearsonr(sub2['urbanrate'], sub2['breastcancerper100th']))

association between urbanrate and breast cancer cases for MIDDLE income countries
(0.30017579786534548, 0.0064753813912730154)
In [16]:
print('association between urbanrate and breast cancer cases for HIGH income countries')
print(scipy.stats.pearsonr(sub3['urbanrate'], sub3['breastcancerper100th']))

association between urbanrate and breast cancer cases for HIGH income countries
(0.19306445789710108, 0.24550411335117076)
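The three cells above repeat the same call; as an optional refactor (a sketch of mine, not from the original notebook), a short loop over the income groups produces the same three correlations:

# Hypothetical refactor: run the same correlation once per income group.
labels = {1: 'LOW', 2: 'MIDDLE', 3: 'HIGH'}
for grp, name in labels.items():
    sub = data[data['incomegrp'] == grp]
    r, p = scipy.stats.pearsonr(sub['urbanrate'], sub['breastcancerper100th'])
    print('%s income countries: r = %.3f, p = %.4f' % (name, r, p))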
Create plots to visualize the associations
In [18]:
scat1 = sns.regplot(x="urbanrate", y="breastcancerper100th", data=sub1)
plt.xlabel('Urban Rate')
plt.ylabel('Breast cancer cases')
plt.title('Scatterplot for the association between urban rate and breast cancer cases for LOW income countries')
print(scat1)

Axes(0.125,0.125;0.775x0.755)
In [19]:
scat2 = sns.regplot(x="urbanrate", y="breastcancerper100th", data=sub2)
plt.xlabel('Urban Rate')
plt.ylabel('Breast Cancer Cases')
plt.title('Scatterplot for the association between urban rate and breast cancer cases for MIDDLE income countries')
print(scat2)

Axes(0.125,0.125;0.775x0.755)
In [20]:
scat3 = sns.regplot(x="urbanrate", y="breastcancerper100th", data=sub3)
plt.xlabel('Urban Rate')
plt.ylabel('Breast Cancer Cases')
plt.title('Scatterplot for the association between urban rate and breast cancer cases for HIGH income countries')
print(scat3)

Axes(0.125,0.125;0.775x0.755)
Chi-Square Analysis
import necessary libraries
In [14]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import scipy.stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
read in the data
In [2]:data1 = pd.read_csv('gapminder2.csv', low_memory=False)
remove unnecessary columns and make a copy of the subdata
In [3]:
data2 = data1[["continent", "country", "breastcancerper100th", "urbanrate",
               "incomeperperson", "breastcancernbdeaths"]]
data = data2.copy()
remove missing values (in my case, '0' values)
In [4]:
data = data.replace(0, np.NaN)
data = data.dropna()
Change the data type for chosen variables
In [5]:
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['urbanrate'] = pd.to_numeric(data['urbanrate'])
Cut the variables into four equal-width bins and calculate the frequency in each bin
In [6]:
print('Categories of breast cancer cases per 100 000 females:')
data['cancercaseslabel'] = pd.cut(data.breastcancerper100th, 4,
                                  labels=['low', 'medium', 'high', 'very high'])
breastcan_freq = pd.concat(dict(
    counts=data["cancercaseslabel"].value_counts(sort=False, dropna=False),
    percentages=data["cancercaseslabel"].value_counts(sort=False, dropna=False, normalize=True)),
    axis=1)
print("Frequency distribution - breast cancer bins:\n", breastcan_freq)
print('\n')

Categories of breast cancer cases per 100 000 females:
Frequency distribution - breast cancer bins:
            counts  percentages
high            17     0.103030
low             77     0.466667
medium          54     0.327273
very high       17     0.103030
In [8]:
data['urbanratepercent'] = pd.cut(data.urbanrate, 4,
                                  labels=['0-25%', '26-50%', '51-74%', '75-100%'])
urban_freq = pd.concat(dict(
    counts=data["urbanratepercent"].value_counts(sort=False, dropna=False),
    percentages=data["urbanratepercent"].value_counts(sort=False, dropna=False, normalize=True)),
    axis=1)
print("Frequency distribution - urban rate:\n", urban_freq)

Frequency distribution - urban rate:
          counts  percentages
0-25%         32     0.193939
26-50%        42     0.254545
51-74%        61     0.369697
75-100%       30     0.181818
contingency table of observed counts
In [9]:
print('Contingency table of observed counts')
ct1 = pd.crosstab(data['cancercaseslabel'], data['urbanratepercent'])
print(ct1)
print('\n')

Contingency table of observed counts
urbanratepercent  0-25%  26-50%  51-74%  75-100%
cancercaseslabel
high                  0       4       9        4
low                  26      27      19        5
medium                6      11      29        8
very high             0       0       4       13
column percentages
In [11]:
print('Column percentages')
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
print('\n')

Column percentages
urbanratepercent   0-25%    26-50%    51-74%   75-100%
cancercaseslabel
high              0.0000  0.095238  0.147541  0.133333
low               0.8125  0.642857  0.311475  0.166667
medium            0.1875  0.261905  0.475410  0.266667
very high         0.0000  0.000000  0.065574  0.433333
Chi-square
In [12]:
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
print('\n')

chi-square value, p value, expected counts
(71.798834424480177, 6.7520667625654625e-12, 9,
 array([[  3.2969697 ,   4.32727273,   6.28484848,   3.09090909],
        [ 14.93333333,  19.6       ,  28.46666667,  14.        ],
        [ 10.47272727,  13.74545455,  19.96363636,   9.81818182],
        [  3.2969697 ,   4.32727273,   6.28484848,   3.09090909]]))
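For readability (my addition, not in the original notebook), the four values returned by chi2_contingency can be unpacked into named variables instead of printing the raw tuple:

# Hypothetical clearer version: unpack the chi2_contingency result.
chi2, p, dof, expected = scipy.stats.chi2_contingency(ct1)
print('chi-square: %.2f, p value: %.3g, degrees of freedom: %d' % (chi2, p, dof))
print('expected counts:\n', expected)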
graph the mean number of new breast cancer cases within each urbanisation rate group
In [15]:
plt.figure(figsize=(18, 10))
sns.factorplot(x="urbanratepercent", y="breastcancerper100th", data=data, kind="bar", ci=None)
plt.ylabel('Breast Cancer cases numbers')
plt.xlabel('Urbanisation rate groups')
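A side note of mine, not from the post: seaborn later renamed factorplot to catplot (and newer versions prefer errorbar=None over ci=None), so on a current install the equivalent call would be roughly:

# Rough modern equivalent; exact keyword names depend on the seaborn version installed.
sns.catplot(x="urbanratepercent", y="breastcancerper100th", data=data, kind="bar", ci=None)
plt.ylabel('Breast Cancer cases numbers')
plt.xlabel('Urbanisation rate groups')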
creating a subset to include the 2 variables we want to analyse
In [16]:sub2 = data[['cancercaseslabel', 'urbanratepercent']]
COMPARISON 1 Bonferroni Adjustment
In [28]:
print('COMPARISON 1 Bonferroni Adjustment')
recode1 = {'0-25%': '0-25%', '26-50%': '26-50%'}
sub2['COMP1vs2'] = sub2['urbanratepercent'].map(recode1)

COMPARISON 1 Bonferroni Adjustment
/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
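The warning appears because sub2 is a slice of data; a common fix (my sketch, assuming the same frames) is to take an explicit copy before adding the new column:

# Hypothetical fix for the SettingWithCopyWarning: work on an explicit copy.
sub2 = data[['cancercaseslabel', 'urbanratepercent']].copy()
sub2['COMP1vs2'] = sub2['urbanratepercent'].map(recode1)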
contingency table of observed counts
In [30]:
print('Contingency table of observed counts')
ct2 = pd.crosstab(sub2['cancercaseslabel'], sub2['COMP1vs2'])
print(ct2)

Contingency table of observed counts
COMP1vs2          0-25%  26-50%
cancercaseslabel
high                  0       4
low                  26      27
medium                6      11
very high             0       0
column percentages
In [31]:
print('Column percentages')
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

Column percentages
COMP1vs2           0-25%    26-50%
cancercaseslabel
high              0.0000  0.095238
low               0.8125  0.642857
medium            0.1875  0.261905
very high         0.0000  0.000000
Chi-square
In [32]:
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

chi-square value, p value, expected counts
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-a16475c49586> in <module>()
      1 print ('chi-square value, p value, expected counts')
----> 2 cs2= scipy.stats.chi2_contingency(ct2)
      3 print (cs2)

/anaconda/lib/python3.6/site-packages/scipy/stats/contingency.py in chi2_contingency(observed, correction, lambda_)
    251             zeropos = list(zip(*np.where(expected == 0)))[0]
    252             raise ValueError("The internally computed table of expected "
--> 253                              "frequencies has a zero element at %s." % (zeropos,))

ValueError: The internally computed table of expected frequencies has a zero element at (3, 0).
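The error occurs because the 'very high' row is all zeros in this comparison, which makes one of the expected frequencies zero. One way to proceed (my sketch, not part of the original post) is to drop empty rows and columns from the contingency table before re-running the test:

# Hypothetical workaround: keep only rows/columns with at least one observation.
ct2_nonzero = ct2.loc[ct2.sum(axis=1) > 0, ct2.sum(axis=0) > 0]
chi2, p, dof, expected = scipy.stats.chi2_contingency(ct2_nonzero)
print('chi-square: %.3f, p value: %.4f' % (chi2, p))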
Running an analysis of variance
import necessary libraries
In [17]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import seaborn as sns
import matplotlib.pyplot as plt
read in the data
In [2]:data1 = pd.read_csv('gapminder2.csv', low_memory=False)
remove unnecessary columns and make a copy of the subdata
In [3]:
data2 = data1[["continent", "country", "breastcancerper100th", "urbanrate",
               "incomeperperson", "breastcancernbdeaths"]]
data = data2.copy()
remove missing values (in my case, '0' values)
In [4]:
data = data.replace(0, np.NaN)
data = data.dropna()
In [5]:
print(len(data))
print(len(data.columns))

165
6
Change the data type for chosen variables
In [6]:
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['breastcancernbdeaths'] = pd.to_numeric(data['breastcancernbdeaths'])
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'])
data['urbanrate'] = pd.to_numeric(data['urbanrate'])
Cut the variables into four equal-width bins and calculate the frequency in each bin
In [7]:
data['urbanratepercent'] = pd.cut(data.urbanrate, 4,
                                  labels=['0-25%', '26-50%', '51-74%', '75-100%'])
urban_freq = pd.concat(dict(
    counts=data["urbanratepercent"].value_counts(sort=False, dropna=False),
    percentages=data["urbanratepercent"].value_counts(sort=False, dropna=False, normalize=True)),
    axis=1)
print("Frequency distribution - urban rate:\n", urban_freq)

Frequency distribution - urban rate:
          counts  percentages
0-25%         32     0.193939
26-50%        42     0.254545
51-74%        61     0.369697
75-100%       30     0.181818
In [8]:
print('Income per person in categories')
data['incomelabel'] = pd.cut(data.incomeperperson, 4,
                             labels=['low', 'medium', 'high', 'very high'])
income_freq = pd.concat(dict(
    counts=data["incomelabel"].value_counts(sort=False, dropna=False),
    percentages=data["incomelabel"].value_counts(sort=False, dropna=False, normalize=True)),
    axis=1)
print("Frequency distribution - income per person:\n", income_freq)

Income per person in categories
Frequency distribution - income per person:
            counts  percentages
high            12     0.072727
low            134     0.812121
medium          16     0.096970
very high        3     0.018182
In [9]:data.head()
Out[9]:
   continent      country    breastcancerper100th  urbanrate  incomeperperson  breastcancernbdeaths urbanratepercent incomelabel
1  Europe         Albania                    57.4      46.72      1914.996551                   300           26-50%         low
2  Africa         Algeria                    23.5      65.22      2231.993335                  2019           51-74%         low
3  Africa         Angola                     23.1      56.70      1381.004268                   654           51-74%         low
4  South America  Argentina                  73.9      92.00     10749.419238                  5362          75-100%         low
5  Asia           Armenia                    51.6      63.86      1326.741757                   561           51-74%         low
create a subset to include the 2 variables (1 categorical + 1 numerical) we want to analyse
In [10]:sub1 = data[['breastcancerper100th', 'urbanratepercent']]
using ols function for calculating the F-statistic and associated p value
In [11]:
model1 = smf.ols(formula='breastcancerper100th ~ C(urbanratepercent)', data=sub1)
results1 = model1.fit()
print(results1.summary())

                            OLS Regression Results
================================================================================
Dep. Variable:     breastcancerper100th   R-squared:                       0.328
Model:                              OLS   Adj. R-squared:                  0.316
Method:                   Least Squares   F-statistic:                     26.25
Date:                  Sun, 22 Oct 2017   Prob (F-statistic):           7.10e-14
Time:                          13:28:14   Log-Likelihood:                -718.30
No. Observations:                   165   AIC:                             1445.
Df Residuals:                       161   BIC:                             1457.
Df Model:                             3
Covariance Type:              nonrobust
==================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                       21.6437      3.366      6.430      0.000      14.996      28.292
C(urbanratepercent)[T.26-50%]    7.5348      4.468      1.686      0.094      -1.289      16.359
C(urbanratepercent)[T.51-74%]   18.4874      4.157      4.448      0.000      10.279      26.696
C(urbanratepercent)[T.75-100%]  39.7863      4.839      8.221      0.000      30.229      49.343
==============================================================================
Omnibus:                        3.014   Durbin-Watson:                   1.713
Prob(Omnibus):                  0.222   Jarque-Bera (JB):                3.041
Skew:                           0.322   Prob(JB):                        0.219
Kurtosis:                       2.834   Cond. No.                         5.46
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Since our p value is so small (0.0000000000000710, i.e. 7.10e-14), we can safely reject the null hypothesis and conclude that there is an association between the urbanisation rate of countries and the number of breast cancer cases.
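As an optional cross-check (my addition, assuming the same sub1 frame), scipy's one-way ANOVA should reproduce the F-statistic and p value reported by the OLS summary:

# Hypothetical cross-check of the OLS F-test with scipy's one-way ANOVA.
import scipy.stats
groups = [grp['breastcancerper100th'].values
          for _, grp in sub1.groupby('urbanratepercent')]
f_stat, p_val = scipy.stats.f_oneway(*groups)
print('F = %.2f, p = %.3g' % (f_stat, p_val))  # should match F-statistic = 26.25, Prob (F-statistic) = 7.10e-14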
In [14]:
print('means for breast cancer cases by urbanisation level')
m2 = sub1.groupby('urbanratepercent').mean()
print(m2)

means for breast cancer cases by urbanisation level
                  breastcancerper100th
urbanratepercent
0-25%                        21.643750
26-50%                       29.178571
51-74%                       40.131148
75-100%                      61.430000
In [15]:
print('standard deviations for breast cancer cases by urbanisation level')
sd2 = sub1.groupby('urbanratepercent').std()
print(sd2)

standard deviations for breast cancer cases by urbanisation level
                  breastcancerper100th
urbanratepercent
0-25%                         8.567491
26-50%                       14.904469
51-74%                       20.438334
75-100%                      27.502992
run post-hoc test for ANOVA since our categorical variable has more than 2 levels
In [16]:
mc1 = multi.MultiComparison(sub1['breastcancerper100th'], sub1['urbanratepercent'])
res1 = mc1.tukeyhsd()
print(res1.summary())

Multiple Comparison of Means - Tukey HSD, FWER=0.05
=================================================
group1   group2    meandiff   lower     upper   reject
-------------------------------------------------
0-25%    26-50%      7.5348   -4.066   19.1356   False
0-25%    51-74%     18.4874   7.6961   29.2787    True
0-25%    75-100%    39.7862  27.2221   52.3504    True
26-50%   51-74%     10.9526   1.0397   20.8655    True
26-50%   75-100%    32.2514  20.4332   44.0697    True
51-74%   75-100%    21.2989  10.2741   32.3236    True
-------------------------------------------------
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

# load gapminder dataset
data = pd.read_csv('gapminder.csv', low_memory=False)

# lower-case all DataFrame column names
data.columns = map(str.lower, data.columns)

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# setting variables to be numeric
# (note: convert_objects has since been removed from pandas;
#  pd.to_numeric(..., errors='coerce') is the current equivalent)
data['suicideper100th'] = data['suicideper100th'].convert_objects(convert_numeric=True)
data['breastcancerper100th'] = data['breastcancerper100th'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)

# display summary statistics about the data
print("Statistics for a Suicide Rate")
print(data['suicideper100th'].describe())

# subset data for a high suicide rate based on summary statistics
sub = data[(data['suicideper100th'] > 12)]

# make a copy of my new subsetted data
sub_copy = sub.copy()

# Univariate graph for breast cancer rate for people with a high suicide rate
plt.figure(1)
sb.distplot(sub_copy["breastcancerper100th"].dropna(), kde=False)
plt.xlabel('Breast Cancer Rate')
plt.ylabel('Frequency')
plt.title('Breast Cancer Rate for People with a High Suicide Rate')

# Univariate graph for hiv rate for people with a high suicide rate
plt.figure(2)
sb.distplot(sub_copy["hivrate"].dropna(), kde=False)
plt.xlabel('HIV Rate')
plt.ylabel('Frequency')
plt.title('HIV Rate for People with a High Suicide Rate')

# Univariate graph for employment rate for people with a high suicide rate
plt.figure(3)
sb.distplot(sub_copy["employrate"].dropna(), kde=False)
plt.xlabel('Employment Rate')
plt.ylabel('Frequency')
plt.title('Employment Rate for People with a High Suicide Rate')

# Bivariate graph for association of breast cancer rate with HIV rate
# for people with a high suicide rate
plt.figure(4)
sb.regplot(x="hivrate", y="breastcancerper100th", fit_reg=False, data=sub_copy)
plt.xlabel('HIV Rate')
plt.ylabel('Breast Cancer Rate')
plt.title('Breast Cancer Rate vs. HIV Rate for People with a High Suicide Rate')
Summary of Frequency Distributions

I grouped the breast cancer rate, HIV rate and employment rate variables to create three new variables: bcgroup4, hcgroup4 and ecgroup4, using three different methods in Python. The grouped data also includes the count of missing data.

1) For the breast cancer rate, I grouped the data into 4 groups by number of breast cancer cases (1-23, 24-46, 47-69, 70-92) using the pandas.cut function. People with a lower breast cancer rate experience a high suicide rate.

2) For the HIV rate, I grouped the data into 4 groups by quartile using the pandas.qcut function. People with a lower HIV rate experience a high suicide rate.

3) For the employment rate, I grouped the data into 5 categorical groups using def and apply functions (1: 32-50, 2: 51-58, 3: 59-64, 4: 65-83, 5: NaN). The employment rate is between 51% and 58% for people with a high suicide rate.
import pandas as pd
load gapminder dataset
data = pd.read_csv('gapminder.csv',low_memory=False)
lower-case all DataFrame column names
data.columns = map(str.lower, data.columns)
bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x:'%f'%x)
setting variables to be numeric
data['suicideper100th'] = data['suicideper100th'].apply(pd.to_numeric, errors='coerce')
data['breastcancerper100th'] = data['breastcancerper100th'].apply(pd.to_numeric, errors='coerce')
data['hivrate'] = data['hivrate'].apply(pd.to_numeric, errors='coerce')
data['employrate'] = data['employrate'].apply(pd.to_numeric, errors='coerce')
# earlier alternatives, kept for reference (convert_objects was removed in newer pandas):
# data['suicideper100th'] = pd.to_numeric(data['suicideper100th'])
# data['suicideper100th'] = data['suicideper100th'].convert_objects(convert_numeric=True)
# data['breastcancerper100th'] = data['breastcancerper100th'].convert_objects(convert_numeric=True)
# data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
# data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)
display summary statistics about the data
print("Statistics for a Suicide Rate")
print(data['suicideper100th'].describe())
subset data for a high suicide rate based on summary statistics
sub = data[(data['suicideper100th']>12)]
make a copy of my new subsetted data
sub_copy = sub.copy()
BREAST CANCER RATE
frequency and percentage distributions for the number of breast cancer cases with a high suicide rate
include the count of missing data and group the variables in 4 groups by number of
breast cancer cases (1-23, 24-46, 47-69, 70-92)
bc_max=sub_copy['breastcancerper100th'].max() # maximum of breast cancer cases
group the data in 4 groups by number of breast cancer cases and record it into new variable bcgroup4
sub_copy['bcgroup4'] = pd.cut(sub_copy.breastcancerper100th,
                              [0*bc_max, 0.25*bc_max, 0.5*bc_max, 0.75*bc_max, 1*bc_max])
frequency for 4 groups of breast cancer cases with a high suicide rate
bc=sub_copy['bcgroup4'].value_counts(sort=False,dropna=False)
percentage for 4 groups of breast cancer cases with a high suicide rate
pbc=sub_copy['bcgroup4'].value_counts(sort=False,dropna=False,normalize=True)*100
cumulative frequency and cumulative percentage for 4 groups of breast cancer cases with a high suicide rate
bc1 = []   # Cumulative Frequency
pbc1 = []  # Cumulative Percentage
cf = 0
cp = 0
for freq in bc:
    cf = cf + freq
    bc1.append(cf)
    pf = cf * 100 / len(sub_copy)
    pbc1.append(pf)
print('Number of Breast Cancer Cases with a High Suicide Rate')
fmt1 = '%10s %9s %9s %12s %13s'
fmt2 = '%9s %9.d %10.2f %9.d %13.2f'
print(fmt1 % ('# of Cases', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(bc.keys(), bc, pbc, bc1, pbc1)):
    print(fmt2 % (key, var1, var2, var3, var4))
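As a side note (my suggestion, not part of the original post), pandas can build the cumulative columns directly, which avoids the manual loop:

# Hypothetical shorter version using pandas' cumulative sum.
bc_cum = bc.cumsum()                     # cumulative frequency
pbc_cum = bc_cum * 100 / len(sub_copy)   # cumulative percentage
print(pd.concat(dict(freq=bc, percent=pbc, cum_freq=bc_cum, cum_percent=pbc_cum), axis=1))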
HIV RATE
frequency and percentage distributions for the HIV rate with a high suicide rate
include the count of missing data and group the variables in 4 groups by quartile function
group the data in 4 groups and record it into new variable hcgroup4
sub_copy['hcgroup4']=pd.qcut(sub_copy.hivrate,4,labels=["0% tile","25% tile","50% tile","75% tile"])
frequency for 4 groups of HIV rate with a high suicide rate
hc = sub_copy['hcgroup4'].value_counts(sort=False,dropna=False)
percentage for 4 groups of HIV rate with a high suicide rate
phc = sub_copy['hcgroup4'].value_counts(sort=False,dropna=False,normalize=True)*100
cumulative frequency and cumulative percentage for 4 groups of HIV rate with a high suicide rate
hc1 = []   # Cumulative Frequency
phc1 = []  # Cumulative Percentage
cf = 0
cp = 0
for freq in hc:
    cf = cf + freq
    hc1.append(cf)
    pf = cf * 100 / len(sub_copy)
    phc1.append(pf)
print('HIV Rate with a High Suicide Rate')
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(hc.keys(), hc, phc, hc1, phc1)):
    print(fmt2 % (key, var1, var2, var3, var4))
EMPLOYMENT RATE
frequency and percentage distributions for the employment rate with a high suicide rate
include the count of missing data and group the variables in 5 groups using def and apply functions
group the data in 5 groups and record it into new variable ecgroup4
def ecgroup4(row):
    if row['employrate'] >= 32 and row['employrate'] < 51:
        return 1
    elif row['employrate'] >= 51 and row['employrate'] < 59:
        return 2
    elif row['employrate'] >= 59 and row['employrate'] < 65:
        return 3
    elif row['employrate'] >= 65 and row['employrate'] < 84:
        return 4
    else:
        return 5  # record for NAN values
sub_copy['ecgroup4'] = sub_copy.apply(lambda row: ecgroup4 (row), axis=1)
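For comparison (a sketch of mine, not from the post), pd.cut with explicit bin edges gives an equivalent grouping, except that missing employment rates stay NaN instead of being coded as group 5:

# Hypothetical alternative grouping with pd.cut; intervals are [32,51), [51,59), [59,65), [65,84).
sub_copy['ecgroup4_cut'] = pd.cut(sub_copy['employrate'],
                                  bins=[32, 51, 59, 65, 84],
                                  labels=[1, 2, 3, 4],
                                  right=False)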
frequency for 5 groups of employment rate with a high suicide rate
ec = sub_copy['ecgroup4'].value_counts(sort=False,dropna=False)
percentage for 5 groups of employment rate with a high suicide rate
pec = sub_copy['ecgroup4'].value_counts(sort=False,dropna=False,normalize=True)*100
cumulative frequency and cumulative percentage for 5 groups of employment rate with a high suicide rate
ec1 = []   # Cumulative Frequency
pec1 = []  # Cumulative Percentage
cf = 0
cp = 0
for freq in ec:
    cf = cf + freq
    ec1.append(cf)
    pf = cf * 100 / len(sub_copy)
    pec1.append(pf)
print('Employment Rate with a High Suicide Rate')
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(ec.keys(), ec, pec, ec1, pec1)):
    print(fmt2 % (key, var1, var2, var3, var4))
An analysis between alcconsumption, urbanrate and polityscore

I have analysed the polityscore of countries with urbanrate > 50 and alcconsumption > 6.
Code:
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
sub1 = data[(data['urbanrate'] > 50) & (data['alcconsumption'] > 6)]
sub2 = sub1.copy()
sub3 = data[(data['polityscore'] > 0)]
sub4 = sub3.copy()
print('counts for polityscore')
c5 = sub1['polityscore'].value_counts(sort=False, normalize=True)
print(c5)
Output: (Image Attached)
Assessment of Week 1
Topic of Interest – GapMinder
I wanted to check data columns that sit in the same year timeframe. The associations of interest:
AlcoholConsumption – Urbanrate – relectricperperson [year 2008]
Incomeperperson – oilperperson – Internetuserate [year 2010]
Incomeperperson [year 2010] – lifeexpectancy [year 2011]
Questions:
Does the average litres of alcohol consumption depend on the urban rate or not?

What sort of growth or relation exists between urbanrate and electricity consumption?

Is there any pattern between alcohol consumption and electricity consumption?

What is the relation between income per person, oil consumption and internet usage?

Do countries with a low income per person (and perhaps less savings) in 2010 see an effect on life expectancy in the following year, 2011?
Literature Review:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7320389/
http://www.gw-unterricht.at/images/pdf/gwu_126_076_087_lang.pdf
https://link.springer.com/chapter/10.1007/978-3-540-87730-1_26
Hypothesis:
My assumption is that because some data labels are recorded in different years, they might skew conclusions if compared directly. To avoid this I have tried to choose data labels recorded in the same year, with the exception of life expectancy.
Hypothesis 1: Generally, the more affluent part of the population is associated with frequent alcohol consumption; the idea is to check whether the recorded alcohol consumption rate is related to the urban population rate.

Hypothesis 2: A hypothesis that residential electricity consumption (relectricperperson) is related directly to urban population (urbanrate).

Hypothesis 3: A hypothesis that income per person (Incomeperperson) is related directly to the number of Internet users (Internetuserate).

Hypothesis 4: A hypothesis that income per person (Incomeperperson) is related directly to oil consumption per capita (oilperperson).

Hypothesis 5: A hypothesis that life expectancy (lifeexpectancy) of 2011 is related directly to the previous year's income per person (incomeperperson).
Started course on Data management and Visualization.
Started blogging as a way to communicate to Coursera-Wesleyan :) Will blog more!