Yes, I am still cleaning the data. It's unbelievable how much crap you can find in the public records. Unfortunately, I have a lot of duplicates.
Down the road, I'm planning on providing a couple of cool charts/maps about trends and how they evolve.
- Ratio of LCAs to total permanent full-time employees. This will show which companies are really leaning on the H-1B visa (I'd estimate my "household name" software company is at about 50%)
- Source of prevailing wage. Interestingly, employers don't have to use the BOL published data and can self-report. I'd be interested to see how many self-report.
- The average delta between prevailing wage and salary per employer/job title.
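As a rough sketch of the last metric, the average delta per employer/job title could be computed like this. The field names (`employer`, `job_title`, `prevailing_wage`, `salary`) and the sample rows are made up for illustration and are not the actual column names in the LCA files.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: field names and figures are invented for illustration.
records = [
    {"employer": "ACME CORP", "job_title": "SOFTWARE ENGINEER",
     "prevailing_wage": 90000, "salary": 100000},
    {"employer": "ACME CORP", "job_title": "SOFTWARE ENGINEER",
     "prevailing_wage": 95000, "salary": 99000},
    {"employer": "INITECH", "job_title": "ANALYST",
     "prevailing_wage": 70000, "salary": 70000},
]


def average_wage_delta(rows):
    """Return the mean of (salary - prevailing_wage) per (employer, job_title)."""
    deltas = defaultdict(list)
    for row in rows:
        key = (row["employer"], row["job_title"])
        deltas[key].append(row["salary"] - row["prevailing_wage"])
    return {key: mean(values) for key, values in deltas.items()}


for (employer, title), delta in average_wage_delta(records).items():
    print(employer, title, delta)
```

This only gives meaningful numbers once the employer names have been deduplicated, since every spelling variant otherwise ends up in its own group.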
Check out OpenRefine. It has a feature that clusters similar strings and unifies them. I remember the last time I looked at this data set... 4-letter acronyms spelled 12 different ways. It's unbelievably messy.
I did notice that searching for Google gives you 5 different entities that are all Google.
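For anyone who wants to try the clustering idea outside of OpenRefine, here is a minimal sketch of a "fingerprint" key-collision approach: normalize each name to a key and group names that share a key. It is a simplified approximation, not OpenRefine's actual implementation, and the sample names are invented.

```python
import string
from collections import defaultdict

# Translation table that deletes ASCII punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def fingerprint(name):
    """Lowercase, drop punctuation, then sort and dedupe the tokens."""
    cleaned = name.strip().lower().translate(_STRIP_PUNCT)
    return " ".join(sorted(set(cleaned.split())))


def cluster(names):
    """Group names sharing a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for name in names:
        groups[fingerprint(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical spellings, invented for illustration.
names = ["Google Inc.", "GOOGLE INC", "google, inc", "Inc. Google", "Microsoft Corp"]

for group in cluster(names):
    print(group)
```

This catches differences in case, punctuation, and word order. It will not merge "Google" with "Google Inc" on its own; that would need an extra step such as stripping legal suffixes, or a fuzzier matching method.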