Pearson Correlation
import necessary libraries
In [11]:
import pandas as pd
import numpy as np
import scipy.stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
read in the data
In [2]:data1 = pd.read_csv('gapminder2.csv', low_memory=False)
remove unnecessary columns and make a copy of the subdata
In [3]:
data2 = data1[["continent", "country", "breastcancerper100th", "urbanrate",
               "incomeperperson", "breastcancernbdeaths"]]
data = data2.copy()
remove missing values (in my case, '0' values)
In [5]:
data = data.replace(0, np.NaN)
data = data.dropna()
Change the data type for chosen variables
In [6]:
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'])
data['urbanrate'] = pd.to_numeric(data['urbanrate'])
Scatterplots to visualize the relationships
In [7]:
scat1 = sns.regplot(x="urbanrate", y="breastcancerper100th", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Breast Cancer Cases')
plt.title('Scatterplot for the Association Between Urban Rate and Number of new breast cancer cases')
In [8]:
scat2 = sns.regplot(x="incomeperperson", y="breastcancerper100th", fit_reg=True, data=data)
plt.xlabel('Income per Person')
plt.ylabel('Breast Cancer Cases')
plt.title('Scatterplot for the Association Between Income per Person and Number of new breast cancer cases')
Pearson Correlation
In [9]:
print('association between urbanrate and breast cancer cases')
print(scipy.stats.pearsonr(data['urbanrate'], data['breastcancerper100th']))

association between urbanrate and breast cancer cases
(0.57721793332990379, 4.8461295565483764e-16)
In [10]:
print('association between incomeperperson and breast cancer cases')
print(scipy.stats.pearsonr(data['incomeperperson'], data['breastcancerper100th']))

association between incomeperperson and breast cancer cases
(0.73139851823791835, 6.7680050678785747e-29)
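As a rough follow-up (my addition, not part of the original post), squaring the correlation coefficients gives the fraction of variance each pair of variables shares; a minimal sketch, assuming the cleaned data frame built above:

r_urban, p_urban = scipy.stats.pearsonr(data['urbanrate'], data['breastcancerper100th'])
r_income, p_income = scipy.stats.pearsonr(data['incomeperperson'], data['breastcancerper100th'])
# r-squared (coefficient of determination) for each pair
print('r-squared, urban rate vs breast cancer cases: %.3f' % r_urban ** 2)          # roughly 0.33
print('r-squared, income per person vs breast cancer cases: %.3f' % r_income ** 2)  # roughly 0.53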
Generating a Correlation Coefficient
import necessary libraries
In [17]:
import pandas as pd
import numpy as np
import scipy.stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
read in the data
In [2]:data1 = pd.read_csv('gapminder2.csv', low_memory=False)
remove unnecessary columns and make a copy of the subdata
In [4]:
data2 = data1[["continent", "country", "breastcancerper100th", "urbanrate",
               "incomeperperson", "breastcancernbdeaths"]]
data = data2.copy()
remove missing values
In [6]:
data = data.replace(0, np.NaN)
data = data.dropna()
Change the data type for chosen variables
In [7]:
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'])
data['urbanrate'] = pd.to_numeric(data['urbanrate'])
Request Pearson Correlation
In [9]:
print("association between urbanrate and breast cancer cases")
print(scipy.stats.pearsonr(data['urbanrate'], data['breastcancerper100th']))

association between urbanrate and breast cancer cases
(0.57721793332990379, 4.8461295565483764e-16)
Add a moderator variable to the data
In [10]:
def incomegrp(row):
    if row['incomeperperson'] <= 744.239:
        return 1
    elif row['incomeperperson'] <= 9425.326:
        return 2
    elif row['incomeperperson'] > 9425.326:
        return 3
In [11]:data['incomegrp'] = data.apply (lambda row: incomegrp (row),axis=1)
In [12]:
chk1 = data['incomegrp'].value_counts(sort=False, dropna=False)
print(chk1)

1    46
2    81
3    38
Name: incomegrp, dtype: int64
create sub-datasets for each value of incomegrp
In [13]:
sub1 = data[(data['incomegrp'] == 1)]
sub2 = data[(data['incomegrp'] == 2)]
sub3 = data[(data['incomegrp'] == 3)]
Run Pearson Correlation for every subset
In [14]:
print('association between urbanrate and breast cancer cases for LOW income countries')
print(scipy.stats.pearsonr(sub1['urbanrate'], sub1['breastcancerper100th']))

association between urbanrate and breast cancer cases for LOW income countries
(0.17887135328796352, 0.23428271424528749)
In [15]:
print('association between urbanrate and breast cancer cases for MIDDLE income countries')
print(scipy.stats.pearsonr(sub2['urbanrate'], sub2['breastcancerper100th']))

association between urbanrate and breast cancer cases for MIDDLE income countries
(0.30017579786534548, 0.0064753813912730154)
In [16]:
print('association between urbanrate and breast cancer cases for HIGH income countries')
print(scipy.stats.pearsonr(sub3['urbanrate'], sub3['breastcancerper100th']))

association between urbanrate and breast cancer cases for HIGH income countries
(0.19306445789710108, 0.24550411335117076)
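The three cells above repeat the same call; as an optional refactor (a sketch of mine, not from the original notebook), a short loop over the income groups produces the same three correlations:

# Hypothetical refactor: run the same correlation once per income group.
labels = {1: 'LOW', 2: 'MIDDLE', 3: 'HIGH'}
for grp, name in labels.items():
    sub = data[data['incomegrp'] == grp]
    r, p = scipy.stats.pearsonr(sub['urbanrate'], sub['breastcancerper100th'])
    print('%s income countries: r = %.3f, p = %.4f' % (name, r, p))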
Create plots to visualize the associations
In [18]:
scat1 = sns.regplot(x="urbanrate", y="breastcancerper100th", data=sub1)
plt.xlabel('Urban Rate')
plt.ylabel('Breast cancer cases')
plt.title('Scatterplot for the association between urban rate and breast cancer cases for LOW income countries')
print(scat1)

Axes(0.125,0.125;0.775x0.755)
In [19]:
scat2 = sns.regplot(x="urbanrate", y="breastcancerper100th", data=sub2)
plt.xlabel('Urban Rate')
plt.ylabel('Breast Cancer Cases')
plt.title('Scatterplot for the association between urban rate and breast cancer cases for MIDDLE income countries')
print(scat2)

Axes(0.125,0.125;0.775x0.755)
In [20]:
scat3 = sns.regplot(x="urbanrate", y="breastcancerper100th", data=sub3)
plt.xlabel('Urban Rate')
plt.ylabel('Breast Cancer Cases')
plt.title('Scatterplot for the association between urban rate and breast cancer cases for HIGH income countries')
print(scat3)

Axes(0.125,0.125;0.775x0.755)
Chi-Square Analysis
import necessary libraries
In [14]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import scipy.stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
read in the data
In [2]:data1 = pd.read_csv('gapminder2.csv', low_memory=False)
remove unnecessary columns and make a copy of the subdata
In [3]:
data2 = data1[["continent", "country", "breastcancerper100th", "urbanrate",
               "incomeperperson", "breastcancernbdeaths"]]
data = data2.copy()
remove missing values (in my case, '0' values)
In [4]:
data = data.replace(0, np.NaN)
data = data.dropna()
Change the data type for chosen variables
In [5]:
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['urbanrate'] = pd.to_numeric(data['urbanrate'])
Cut the variables into four equal-width bins and calculate the frequency in each bin
In [6]:
print('Categories of breast cancer cases per 100 000 females:')
data['cancercaseslabel'] = pd.cut(data.breastcancerper100th, 4,
                                  labels=['low', 'medium', 'high', 'very high'])
breastcan_freq = pd.concat(dict(
    counts=data["cancercaseslabel"].value_counts(sort=False, dropna=False),
    percentages=data["cancercaseslabel"].value_counts(sort=False, dropna=False, normalize=True)),
    axis=1)
print("Frequency distribution - breast cancer bins:\n", breastcan_freq)
print('\n')

Categories of breast cancer cases per 100 000 females:
Frequency distribution - breast cancer bins:
            counts  percentages
high            17     0.103030
low             77     0.466667
medium          54     0.327273
very high       17     0.103030
In [8]:
data['urbanratepercent'] = pd.cut(data.urbanrate, 4,
                                  labels=['0-25%', '26-50%', '51-74%', '75-100%'])
urban_freq = pd.concat(dict(
    counts=data["urbanratepercent"].value_counts(sort=False, dropna=False),
    percentages=data["urbanratepercent"].value_counts(sort=False, dropna=False, normalize=True)),
    axis=1)
print("Frequency distribution - urban rate:\n", urban_freq)

Frequency distribution - urban rate:
          counts  percentages
0-25%         32     0.193939
26-50%        42     0.254545
51-74%        61     0.369697
75-100%       30     0.181818
contingency table of observed counts
In [9]:
print('Contingency table of observed counts')
ct1 = pd.crosstab(data['cancercaseslabel'], data['urbanratepercent'])
print(ct1)
print('\n')

Contingency table of observed counts
urbanratepercent  0-25%  26-50%  51-74%  75-100%
cancercaseslabel
high                  0       4       9        4
low                  26      27      19        5
medium                6      11      29        8
very high             0       0       4       13
column percentages
In [11]:
print('Column percentages')
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
print('\n')

Column percentages
urbanratepercent   0-25%    26-50%    51-74%   75-100%
cancercaseslabel
high              0.0000  0.095238  0.147541  0.133333
low               0.8125  0.642857  0.311475  0.166667
medium            0.1875  0.261905  0.475410  0.266667
very high         0.0000  0.000000  0.065574  0.433333
Chi-square
In [12]:
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
print('\n')

chi-square value, p value, expected counts
(71.798834424480177, 6.7520667625654625e-12, 9,
 array([[  3.2969697 ,   4.32727273,   6.28484848,   3.09090909],
        [ 14.93333333,  19.6       ,  28.46666667,  14.        ],
        [ 10.47272727,  13.74545455,  19.96363636,   9.81818182],
        [  3.2969697 ,   4.32727273,   6.28484848,   3.09090909]]))
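For readability (my addition, not in the original notebook), the four values returned by chi2_contingency can be unpacked into named variables instead of printing the raw tuple:

# Hypothetical clearer version: unpack the chi2_contingency result.
chi2, p, dof, expected = scipy.stats.chi2_contingency(ct1)
print('chi-square: %.2f, p value: %.3g, degrees of freedom: %d' % (chi2, p, dof))
print('expected counts:\n', expected)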
graph the mean number of new breast cancer cases within each urbanisation rate group
In [15]:
plt.figure(figsize=(18, 10))
sns.factorplot(x="urbanratepercent", y="breastcancerper100th", data=data, kind="bar", ci=None)
plt.ylabel('Breast Cancer cases numbers')
plt.xlabel('Urbanisation rate groups')
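A side note of mine, not from the post: seaborn later renamed factorplot to catplot (and newer versions prefer errorbar=None over ci=None), so on a current install the equivalent call would be roughly:

# Rough modern equivalent; exact keyword names depend on the seaborn version installed.
sns.catplot(x="urbanratepercent", y="breastcancerper100th", data=data, kind="bar", ci=None)
plt.ylabel('Breast Cancer cases numbers')
plt.xlabel('Urbanisation rate groups')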
creating a subset to include the 2 variables we want to analyse
In [16]:sub2 = data[['cancercaseslabel', 'urbanratepercent']]
COMPARISON 1 Bonferroni Adjustment
In [28]:
print('COMPARISON 1 Bonferroni Adjustment')
recode1 = {'0-25%': '0-25%', '26-50%': '26-50%'}
sub2['COMP1vs2'] = sub2['urbanratepercent'].map(recode1)

COMPARISON 1 Bonferroni Adjustment
/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
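The warning appears because sub2 is a slice of data; a common fix (my sketch, assuming the same frames) is to take an explicit copy before adding the new column:

# Hypothetical fix for the SettingWithCopyWarning: work on an explicit copy.
sub2 = data[['cancercaseslabel', 'urbanratepercent']].copy()
sub2['COMP1vs2'] = sub2['urbanratepercent'].map(recode1)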
contingency table of observed counts
In [30]:
print('Contingency table of observed counts')
ct2 = pd.crosstab(sub2['cancercaseslabel'], sub2['COMP1vs2'])
print(ct2)

Contingency table of observed counts
COMP1vs2          0-25%  26-50%
cancercaseslabel
high                  0       4
low                  26      27
medium                6      11
very high             0       0
column percentages
In [31]:
print('Column percentages')
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

Column percentages
COMP1vs2           0-25%    26-50%
cancercaseslabel
high              0.0000  0.095238
low               0.8125  0.642857
medium            0.1875  0.261905
very high         0.0000  0.000000
Chi-square
In [32]:
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

chi-square value, p value, expected counts
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-a16475c49586> in <module>()
      1 print ('chi-square value, p value, expected counts')
----> 2 cs2= scipy.stats.chi2_contingency(ct2)
      3 print (cs2)

/anaconda/lib/python3.6/site-packages/scipy/stats/contingency.py in chi2_contingency(observed, correction, lambda_)
    251             zeropos = list(zip(*np.where(expected == 0)))[0]
    252             raise ValueError("The internally computed table of expected "
--> 253                              "frequencies has a zero element at %s." % (zeropos,))

ValueError: The internally computed table of expected frequencies has a zero element at (3, 0).
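The error occurs because the 'very high' row is all zeros in this comparison, which makes one of the expected frequencies zero. One way to proceed (my sketch, not part of the original post) is to drop empty rows and columns from the contingency table before re-running the test:

# Hypothetical workaround: keep only rows/columns with at least one observation.
ct2_nonzero = ct2.loc[ct2.sum(axis=1) > 0, ct2.sum(axis=0) > 0]
chi2, p, dof, expected = scipy.stats.chi2_contingency(ct2_nonzero)
print('chi-square: %.3f, p value: %.4f' % (chi2, p))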
Running an analysis of variance
import necessary libraries
In [17]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import seaborn as sns
import matplotlib.pyplot as plt
read in the data
In [2]:data1 = pd.read_csv('gapminder2.csv', low_memory=False)
remove unnecessary columns and make a copy of the subdata
In [3]:
data2 = data1[["continent", "country", "breastcancerper100th", "urbanrate",
               "incomeperperson", "breastcancernbdeaths"]]
data = data2.copy()
remove missing values (in my case, '0' values)
In [4]:
data = data.replace(0, np.NaN)
data = data.dropna()
In [5]:
print(len(data))
print(len(data.columns))

165
6
Change the data type for chosen variables
In [6]:
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['breastcancernbdeaths'] = pd.to_numeric(data['breastcancernbdeaths'])
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'])
data['urbanrate'] = pd.to_numeric(data['urbanrate'])
Cut the variables into four equal-width bins and calculate the frequency in each bin
In [7]:
data['urbanratepercent'] = pd.cut(data.urbanrate, 4,
                                  labels=['0-25%', '26-50%', '51-74%', '75-100%'])
urban_freq = pd.concat(dict(
    counts=data["urbanratepercent"].value_counts(sort=False, dropna=False),
    percentages=data["urbanratepercent"].value_counts(sort=False, dropna=False, normalize=True)),
    axis=1)
print("Frequency distribution - urban rate:\n", urban_freq)

Frequency distribution - urban rate:
          counts  percentages
0-25%         32     0.193939
26-50%        42     0.254545
51-74%        61     0.369697
75-100%       30     0.181818
In [8]:
print('Income per person in categories')
data['incomelabel'] = pd.cut(data.incomeperperson, 4,
                             labels=['low', 'medium', 'high', 'very high'])
income_freq = pd.concat(dict(
    counts=data["incomelabel"].value_counts(sort=False, dropna=False),
    percentages=data["incomelabel"].value_counts(sort=False, dropna=False, normalize=True)),
    axis=1)
print("Frequency distribution - income per person:\n", income_freq)

Income per person in categories
Frequency distribution - income per person:
            counts  percentages
high            12     0.072727
low            134     0.812121
medium          16     0.096970
very high        3     0.018182
In [9]:data.head()
Out[9]:
   continent      country    breastcancerper100th  urbanrate  incomeperperson  breastcancernbdeaths urbanratepercent incomelabel
1  Europe         Albania                    57.4      46.72      1914.996551                   300           26-50%         low
2  Africa         Algeria                    23.5      65.22      2231.993335                  2019           51-74%         low
3  Africa         Angola                     23.1      56.70      1381.004268                   654           51-74%         low
4  South America  Argentina                  73.9      92.00     10749.419238                  5362          75-100%         low
5  Asia           Armenia                    51.6      63.86      1326.741757                   561           51-74%         low
create a subset to include the 2 variables (1 categorical + 1 numerical) we want to analyse
In [10]:sub1 = data[['breastcancerper100th', 'urbanratepercent']]
using ols function for calculating the F-statistic and associated p value
In [11]:
model1 = smf.ols(formula='breastcancerper100th ~ C(urbanratepercent)', data=sub1)
results1 = model1.fit()
print(results1.summary())

                            OLS Regression Results
================================================================================
Dep. Variable:     breastcancerper100th   R-squared:                       0.328
Model:                              OLS   Adj. R-squared:                  0.316
Method:                   Least Squares   F-statistic:                     26.25
Date:                  Sun, 22 Oct 2017   Prob (F-statistic):           7.10e-14
Time:                          13:28:14   Log-Likelihood:                -718.30
No. Observations:                   165   AIC:                             1445.
Df Residuals:                       161   BIC:                             1457.
Df Model:                             3
Covariance Type:              nonrobust
==================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                       21.6437      3.366      6.430      0.000      14.996      28.292
C(urbanratepercent)[T.26-50%]    7.5348      4.468      1.686      0.094      -1.289      16.359
C(urbanratepercent)[T.51-74%]   18.4874      4.157      4.448      0.000      10.279      26.696
C(urbanratepercent)[T.75-100%]  39.7863      4.839      8.221      0.000      30.229      49.343
==============================================================================
Omnibus:                        3.014   Durbin-Watson:                   1.713
Prob(Omnibus):                  0.222   Jarque-Bera (JB):                3.041
Skew:                           0.322   Prob(JB):                        0.219
Kurtosis:                       2.834   Cond. No.                         5.46
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Since our p value is so small (0.0000000000000710, i.e. 7.10e-14), we can safely reject the null hypothesis and conclude that there is an association between the urbanisation rate of countries and the number of breast cancer cases.
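As an optional cross-check (my addition, assuming the same sub1 frame), scipy's one-way ANOVA should reproduce the F-statistic and p value reported by the OLS summary:

# Hypothetical cross-check of the OLS F-test with scipy's one-way ANOVA.
import scipy.stats
groups = [grp['breastcancerper100th'].values
          for _, grp in sub1.groupby('urbanratepercent')]
f_stat, p_val = scipy.stats.f_oneway(*groups)
print('F = %.2f, p = %.3g' % (f_stat, p_val))  # should match F-statistic = 26.25, Prob (F-statistic) = 7.10e-14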
In [14]:
print('means for breast cancer cases by urbanisation level')
m2 = sub1.groupby('urbanratepercent').mean()
print(m2)

means for breast cancer cases by urbanisation level
                  breastcancerper100th
urbanratepercent
0-25%                        21.643750
26-50%                       29.178571
51-74%                       40.131148
75-100%                      61.430000
In [15]:
print('standard deviations for breast cancer cases by urbanisation level')
sd2 = sub1.groupby('urbanratepercent').std()
print(sd2)

standard deviations for breast cancer cases by urbanisation level
                  breastcancerper100th
urbanratepercent
0-25%                         8.567491
26-50%                       14.904469
51-74%                       20.438334
75-100%                      27.502992
run post-hoc test for ANOVA since our categorical variable has more than 2 levels
In [16]:
mc1 = multi.MultiComparison(sub1['breastcancerper100th'], sub1['urbanratepercent'])
res1 = mc1.tukeyhsd()
print(res1.summary())

Multiple Comparison of Means - Tukey HSD, FWER=0.05
=================================================
group1   group2    meandiff   lower     upper   reject
-------------------------------------------------
0-25%    26-50%      7.5348   -4.066   19.1356   False
0-25%    51-74%     18.4874   7.6961   29.2787    True
0-25%    75-100%    39.7862  27.2221   52.3504    True
26-50%   51-74%     10.9526   1.0397   20.8655    True
26-50%   75-100%    32.2514  20.4332   44.0697    True
51-74%   75-100%    21.2989  10.2741   32.3236    True
-------------------------------------------------
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

# load gapminder dataset
data = pd.read_csv('gapminder.csv', low_memory=False)

# lower-case all DataFrame column names
data.columns = map(str.lower, data.columns)

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# setting variables to be numeric
# (note: convert_objects has since been removed from pandas;
#  pd.to_numeric(..., errors='coerce') is the current equivalent)
data['suicideper100th'] = data['suicideper100th'].convert_objects(convert_numeric=True)
data['breastcancerper100th'] = data['breastcancerper100th'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)

# display summary statistics about the data
print("Statistics for a Suicide Rate")
print(data['suicideper100th'].describe())

# subset data for a high suicide rate based on summary statistics
sub = data[(data['suicideper100th'] > 12)]

# make a copy of my new subsetted data
sub_copy = sub.copy()

# Univariate graph for breast cancer rate for people with a high suicide rate
plt.figure(1)
sb.distplot(sub_copy["breastcancerper100th"].dropna(), kde=False)
plt.xlabel('Breast Cancer Rate')
plt.ylabel('Frequency')
plt.title('Breast Cancer Rate for People with a High Suicide Rate')

# Univariate graph for hiv rate for people with a high suicide rate
plt.figure(2)
sb.distplot(sub_copy["hivrate"].dropna(), kde=False)
plt.xlabel('HIV Rate')
plt.ylabel('Frequency')
plt.title('HIV Rate for People with a High Suicide Rate')

# Univariate graph for employment rate for people with a high suicide rate
plt.figure(3)
sb.distplot(sub_copy["employrate"].dropna(), kde=False)
plt.xlabel('Employment Rate')
plt.ylabel('Frequency')
plt.title('Employment Rate for People with a High Suicide Rate')

# Bivariate graph for association of breast cancer rate with HIV rate
# for people with a high suicide rate
plt.figure(4)
sb.regplot(x="hivrate", y="breastcancerper100th", fit_reg=False, data=sub_copy)
plt.xlabel('HIV Rate')
plt.ylabel('Breast Cancer Rate')
plt.title('Breast Cancer Rate vs. HIV Rate for People with a High Suicide Rate')
Summary of Frequency Distributions

I grouped the breast cancer rate, HIV rate and employment rate variables to create three new variables: bcgroup4, hcgroup4 and ecgroup4, using three different methods in Python. The grouped data also includes the count of missing data.

1) For the breast cancer rate, I grouped the data into 4 groups by number of breast cancer cases (1-23, 24-46, 47-69, 70-92) using the pandas.cut function. People with a lower breast cancer rate experience a high suicide rate.

2) For the HIV rate, I grouped the data into 4 groups by quartile using the pandas.qcut function. People with a lower HIV rate experience a high suicide rate.

3) For the employment rate, I grouped the data into 5 categorical groups using def and apply functions (1: 32-50, 2: 51-58, 3: 59-64, 4: 65-83, 5: NaN). The employment rate is between 51% and 58% for people with a high suicide rate.
import pandas as pd
load gapminder dataset
data = pd.read_csv('gapminder.csv',low_memory=False)
lower-case all DataFrame column names
data.columns = map(str.lower, data.columns)
bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x:'%f'%x)
setting variables to be numeric
data['suicideper100th'] = data['suicideper100th'].apply(pd.to_numeric, errors='coerce')
data['breastcancerper100th'] = data['breastcancerper100th'].apply(pd.to_numeric, errors='coerce')
data['hivrate'] = data['hivrate'].apply(pd.to_numeric, errors='coerce')
data['employrate'] = data['employrate'].apply(pd.to_numeric, errors='coerce')
# earlier alternatives, kept for reference (convert_objects was removed in newer pandas):
# data['suicideper100th'] = pd.to_numeric(data['suicideper100th'])
# data['suicideper100th'] = data['suicideper100th'].convert_objects(convert_numeric=True)
# data['breastcancerper100th'] = data['breastcancerper100th'].convert_objects(convert_numeric=True)
# data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
# data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)
display summary statistics about the data
print("Statistics for a Suicide Rate")
print(data['suicideper100th'].describe())
subset data for a high suicide rate based on summary statistics
sub = data[(data['suicideper100th']>12)]
make a copy of my new subsetted data
sub_copy = sub.copy()
BREAST CANCER RATE
frequency and percentage distributions for the number of breast cancer cases with a high suicide rate
include the count of missing data and group the variables in 4 groups by number of
breast cancer cases (1-23, 24-46, 47-69, 70-92)
bc_max=sub_copy['breastcancerper100th'].max() # maximum of breast cancer cases
group the data in 4 groups by number of breast cancer cases and record it into new variable bcgroup4
sub_copy['bcgroup4'] = pd.cut(sub_copy.breastcancerper100th,
                              [0*bc_max, 0.25*bc_max, 0.5*bc_max, 0.75*bc_max, 1*bc_max])
frequency for 4 groups of breast cancer cases with a high suicide rate
bc=sub_copy['bcgroup4'].value_counts(sort=False,dropna=False)
percentage for 4 groups of breast cancer cases with a high suicide rate
pbc=sub_copy['bcgroup4'].value_counts(sort=False,dropna=False,normalize=True)*100
cumulative frequency and cumulative percentage for 4 groups of breast cancer cases with a high suicide rate
bc1 = []   # Cumulative Frequency
pbc1 = []  # Cumulative Percentage
cf = 0
cp = 0
for freq in bc:
    cf = cf + freq
    bc1.append(cf)
    pf = cf * 100 / len(sub_copy)
    pbc1.append(pf)
print('Number of Breast Cancer Cases with a High Suicide Rate')
fmt1 = '%10s %9s %9s %12s %13s'
fmt2 = '%9s %9.d %10.2f %9.d %13.2f'
print(fmt1 % ('# of Cases', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(bc.keys(), bc, pbc, bc1, pbc1)):
    print(fmt2 % (key, var1, var2, var3, var4))
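As a side note (my suggestion, not part of the original post), pandas can build the cumulative columns directly, which avoids the manual loop:

# Hypothetical shorter version using pandas' cumulative sum.
bc_cum = bc.cumsum()                     # cumulative frequency
pbc_cum = bc_cum * 100 / len(sub_copy)   # cumulative percentage
print(pd.concat(dict(freq=bc, percent=pbc, cum_freq=bc_cum, cum_percent=pbc_cum), axis=1))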
HIV RATE
frequency and percentage distributions for the HIV rate with a high suicide rate
include the count of missing data and group the variables in 4 groups by quartile function
group the data in 4 groups and record it into new variable hcgroup4
sub_copy['hcgroup4']=pd.qcut(sub_copy.hivrate,4,labels=["0% tile","25% tile","50% tile","75% tile"])
frequency for 4 groups of HIV rate with a high suicide rate
hc = sub_copy['hcgroup4'].value_counts(sort=False,dropna=False)
percentage for 4 groups of HIV rate with a high suicide rate
phc = sub_copy['hcgroup4'].value_counts(sort=False,dropna=False,normalize=True)*100
cumulative frequency and cumulative percentage for 4 groups of HIV rate with a high suicide rate
hc1 = []   # Cumulative Frequency
phc1 = []  # Cumulative Percentage
cf = 0
cp = 0
for freq in hc:
    cf = cf + freq
    hc1.append(cf)
    pf = cf * 100 / len(sub_copy)
    phc1.append(pf)
print('HIV Rate with a High Suicide Rate')
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(hc.keys(), hc, phc, hc1, phc1)):
    print(fmt2 % (key, var1, var2, var3, var4))
EMPLOYMENT RATE
frequency and percentage distributions for the employment rate with a high suicide rate
include the count of missing data and group the variables in 5 groups using def and apply functions
group the data in 5 groups and record it into new variable ecgroup4
def ecgroup4(row):
    if row['employrate'] >= 32 and row['employrate'] < 51:
        return 1
    elif row['employrate'] >= 51 and row['employrate'] < 59:
        return 2
    elif row['employrate'] >= 59 and row['employrate'] < 65:
        return 3
    elif row['employrate'] >= 65 and row['employrate'] < 84:
        return 4
    else:
        return 5  # record for NAN values
sub_copy['ecgroup4'] = sub_copy.apply(lambda row: ecgroup4 (row), axis=1)
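For comparison (a sketch of mine, not from the post), pd.cut with explicit bin edges gives an equivalent grouping, except that missing employment rates stay NaN instead of being coded as group 5:

# Hypothetical alternative grouping with pd.cut; intervals are [32,51), [51,59), [59,65), [65,84).
sub_copy['ecgroup4_cut'] = pd.cut(sub_copy['employrate'],
                                  bins=[32, 51, 59, 65, 84],
                                  labels=[1, 2, 3, 4],
                                  right=False)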
frequency for 5 groups of employment rate with a high suicide rate
ec = sub_copy['ecgroup4'].value_counts(sort=False,dropna=False)
percentage for 5 groups of employment rate with a high suicide rate
pec = sub_copy['ecgroup4'].value_counts(sort=False,dropna=False,normalize=True)*100
cumulative frequency and cumulative percentage for 5 groups of employment rate with a high suicide rate
ec1 = []   # Cumulative Frequency
pec1 = []  # Cumulative Percentage
cf = 0
cp = 0
for freq in ec:
    cf = cf + freq
    ec1.append(cf)
    pf = cf * 100 / len(sub_copy)
    pec1.append(pf)
print('Employment Rate with a High Suicide Rate')
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(ec.keys(), ec, pec, ec1, pec1)):
    print(fmt2 % (key, var1, var2, var3, var4))
An analysis between alcconsumption, urbanrate and polityscore

I have analysed the polityscore of countries with urbanrate > 50 and alcconsumption > 6.
Code:
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
sub1 = data[(data['urbanrate'] > 50) & (data['alcconsumption'] > 6)]
sub2 = sub1.copy()
sub3 = data[(data['polityscore'] > 0)]
sub4 = sub3.copy()
print('counts for polityscore')
c5 = sub1['polityscore'].value_counts(sort=False, normalize=True)
print(c5)
Output: (Image Attached)
Assessment of Week 1
Topic of Interest – GapMinder
I wanted to check data columns that sit in the same year timeframe. The associations of interest:
AlcoholConsumption – Urbanrate – relectricperperson [year 2008]
Incomeperperson – oilperperson – Internetuserate [year 2010]
Incomeperperson [year 2010] – lifeexpectancy [year 2011]
Questions:
Does the average litres of alcohol consumption depend on the urban rate or not?

What sort of growth or relation exists between urbanrate and electricity consumption?

Is there any pattern between alcohol consumption and electricity consumption?

What is the relation between income per person, oil consumption and internet usage?

Do countries with a low income per person (and perhaps less savings) in 2010 see an effect on life expectancy in the following year, 2011?
Literature Review:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7320389/
http://www.gw-unterricht.at/images/pdf/gwu_126_076_087_lang.pdf
https://link.springer.com/chapter/10.1007/978-3-540-87730-1_26
Hypothesis:
My assumption is that because some data labels are recorded in different years, they might skew conclusions if compared directly. To avoid this I have tried to choose data labels recorded in the same year, with the exception of life expectancy.
Hypothesis 1: Generally, the more affluent part of the population is associated with frequent alcohol consumption; the idea is to check whether the recorded alcohol consumption rate is related to the urban population rate.

Hypothesis 2: A hypothesis that residential electricity consumption (relectricperperson) is related directly to urban population (urbanrate).

Hypothesis 3: A hypothesis that income per person (Incomeperperson) is related directly to the number of Internet users (Internetuserate).

Hypothesis 4: A hypothesis that income per person (Incomeperperson) is related directly to oil consumption per capita (oilperperson).

Hypothesis 5: A hypothesis that life expectancy (lifeexpectancy) of 2011 is related directly to the previous year's income per person (incomeperperson).
Started course on Data management and Visualization.
Started blogging as a way to communicate to Coursera-Wesleyan :) Will blog more!