Linear regression: lm function in R does not give coefficients for all factor levels in categorical data
I am trying out linear regression in R using categorical attributes, and I observe that I don't get a coefficient value for each of the different factor levels I have.
Please see my code below. I have 5 factor levels for states, but I see only 4 coefficient values.
> states = c("wa","te","ge","la","sf")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
  states population
1     wa        0.5
2     te        0.2
3     ge        0.6
4     la        0.7
5     sf        0.9
> states = NULL
> population = NULL
> lm(formula=population~states,data=df)

Call:
lm(formula = population ~ states, data = df)

Coefficients:
(Intercept)     statesla     statessf     stateste     stateswa
        0.6          0.1          0.3         -0.4         -0.1
I also tried a larger data set by doing the following, but I still see the same behavior:
for(i in 1:10) { df = rbind(df,df) }
Edit: Thanks to the responses from eipi10, MrFlick, and economy, I now understand that one of the levels is being used as the reference level. But when I get new test data whose state value is "ge", how do I substitute it into the equation y = m1x1 + m2x2 + ... + c?
I also tried flattening out the data so that each of these factor levels gets its own separate column, but again, for one of the columns, I get NA as the coefficient. If I have new test data whose state is 'wa', how can I get the 'population value'? What would I substitute as its coefficient?
> df1
  population ge mi te wa
1          1  0  0  0  1
2          2  1  0  0  0
3          2  0  0  1  0
4          1  0  1  0  0
> lm(formula = population ~ (ge+mi+te+wa),data=df1)

Call:
lm(formula = population ~ (ge + mi + te + wa), data = df1)

Coefficients:
(Intercept)           ge           mi           te           wa
          1            1            0            1           NA
ge is dropped, being first alphabetically, and is absorbed into the intercept term. As eipi10 stated, you can interpret the coefficients for the other levels in states relative to ge as the baseline (statesla = 0.1 meaning la is, on average, 0.1 higher than ge).
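As a rough illustration of the baseline idea, the sketch below refits the model from the question and checks which level R treats as the reference. The names `fit`, `fit_wa`, and `states_wa` are made up for this example, and it assumes `states` is stored as a factor with R's default treatment contrasts.

```r
# Rebuild the data frame from the question, with states as a factor.
df <- data.frame(
  states = factor(c("wa", "te", "ge", "la", "sf")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

fit <- lm(population ~ states, data = df)

# The first level (alphabetical by default) is the reference level.
levels(df$states)[1]

# The intercept is the fitted value for "ge";
# intercept + statesla is the fitted value for "la".
coef(fit)["(Intercept)"]
coef(fit)["(Intercept)"] + coef(fit)["statesla"]

# Optionally choose a different baseline with relevel() (hypothetical column name).
df$states_wa <- relevel(df$states, ref = "wa")
fit_wa <- lm(population ~ states_wa, data = df)
coef(fit_wa)
```

With `wa` as the reference, the intercept becomes the `wa` value and the remaining coefficients are differences from `wa`; the fitted values themselves do not change.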
Edit:
To respond to your updated question:
If you include all of the levels in a linear regression, you're going to have a situation called perfect collinearity, which is responsible for the strange results you're seeing when you force each category into its own variable. I won't get into the explanation of that, just find a wiki, and know that linear regression doesn't work if every level is represented by its own coefficient (and you're also expecting an intercept term), because the dummy columns then sum to the intercept column. If you want to see all of the levels in a regression, you can perform the regression without an intercept term, as suggested in the comments, but again, this is ill-advised unless you have a specific reason to.
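The collinearity can be seen directly in the design matrix. The following is a minimal sketch on the data from the question; `X`, `D`, and `fit0` are names invented here, and the no-intercept fit is shown only to illustrate the option mentioned above, not as a recommendation.

```r
df <- data.frame(
  states = factor(c("wa", "te", "ge", "la", "sf")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# Default design matrix: an intercept column plus 4 dummy columns
# ("ge" is the baseline and gets no column of its own).
X <- model.matrix(population ~ states, data = df)
colnames(X)

# Full set of 5 dummies (no intercept): every row sums to 1, so the
# dummies add up to the intercept column. That is the perfect collinearity.
D <- model.matrix(~ states - 1, data = df)
rowSums(D)

# Dropping the intercept yields one coefficient per level
# (here, simply the value observed for each state).
fit0 <- lm(population ~ states - 1, data = df)
coef(fit0)
```

Keeping both the intercept and all five dummies would leave one column redundant, which is why `lm` reports `NA` for one coefficient in the flattened `df1` example.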
As for the interpretation of ge in your y=mx+c equation, you can calculate the expected y by knowing that the indicator variables for the other states are binary (zero or one), and if the state is ge, they will all be zero.
E.g.
y = x1b1 + x2b2 + x3b3 + c
y = b1(0) + b2(0) + b3(0) + c
y = c
If you don't have any other variables, as in your first example, the expected value for ge will be equal to the intercept term (0.6).
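In practice you rarely need to substitute into the equation by hand, since `predict()` builds the zero/one columns for new data itself. Here is a small sketch on the first example; `new_data`, `pred`, and `by_hand_wa` are names made up for illustration.

```r
df <- data.frame(
  states = factor(c("wa", "te", "ge", "la", "sf")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)
fit <- lm(population ~ states, data = df)

# New test data; reuse the training levels so the dummy coding matches.
new_data <- data.frame(
  states = factor(c("ge", "wa"), levels = levels(df$states))
)

# For "ge" every dummy is 0, so the prediction is just the intercept.
# For "wa" the prediction is intercept + stateswa.
pred <- predict(fit, newdata = new_data)
pred

# The same "wa" value computed by hand from the coefficients.
by_hand_wa <- coef(fit)["(Intercept)"] + coef(fit)["stateswa"]
by_hand_wa
```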