Overview
I was recently discussing salary expectations for Software Development Engineers (SDEs) in India with junior engineers, especially how they change with years of experience. Inspired by The Pragmatic Engineer and @deedydas’ tweet, I decided to examine a dataset of Indian salaries.
My goal was to parse the data, clean it, cluster salary ranges by experience, and visualize how salaries are distributed across different “tiers” of companies.
Methodology
Data Collection
- Parsed the raw Excel sheet into a pandas DataFrame using BeautifulSoup. Each record contains:
  - Relevant Experience (years)
  - Base Salary
  - Variable Bonus
  - Stock Components
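
The parsing step might look roughly like the sketch below. It assumes the sheet was available as an HTML table (which is what BeautifulSoup parses); the sample markup, the exact column names, and the helper `parse_salary_table` are illustrative assumptions, not the original script.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML export of the sheet; real column names may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Relevant Experience</th><th>Base Salary</th>
      <th>Variable Bonus</th><th>Stock Components</th></tr>
  <tr><td>2</td><td>1200000</td><td>100000</td><td>0</td></tr>
  <tr><td>5.5</td><td>3000000</td><td>300000</td><td>500000</td></tr>
  <tr><td>NULL</td><td>900000</td><td>NULL</td><td>NULL</td></tr>
</table>
"""


def parse_salary_table(html: str) -> pd.DataFrame:
    """Read the first <table> in the markup into a numeric DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows[1:]
    ]
    df = pd.DataFrame(records, columns=header)
    # Convert every column to numbers; strings such as "NULL" become NaN.
    return df.apply(pd.to_numeric, errors="coerce")


raw_df = parse_salary_table(SAMPLE_HTML)
```

Coercing unparseable cells to NaN at this stage keeps the later cleaning step simple, since missing and malformed values can then be dropped in one pass.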
Data Cleaning
- Filtered out missing or nonsensical values (“NULL” or negative numbers).
- Grouped records by integer years of experience.
- Computed totalSalary as base + (bonus + stocks) for those with 4+ years of experience.
- Removed outliers within each experience group by cutting off the lower 4% and upper 4% of salaries.
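
A minimal pandas sketch of these cleaning steps follows. The column names (`experience`, `base`, `bonus`, `stocks`) and the function `clean_salaries` are hypothetical. Two readings are assumptions on my part: years are floored to get the integer group, and bonus plus stocks are added only for people with 4+ years, with everyone else keeping base alone.

```python
import numpy as np
import pandas as pd


def clean_salaries(df: pd.DataFrame, trim: float = 0.04) -> pd.DataFrame:
    cols = ["experience", "base", "bonus", "stocks"]
    # Drop rows with missing values, then rows with any negative value.
    df = df.dropna(subset=cols)
    df = df[(df[cols] >= 0).all(axis=1)].copy()

    # Assumption: group by whole years, rounding down.
    df["years"] = np.floor(df["experience"]).astype(int)

    # Assumption: bonus and stocks count only from 4 years of experience.
    extras = (df["bonus"] + df["stocks"]).where(df["years"] >= 4, 0)
    df["totalSalary"] = df["base"] + extras

    # Trim the lowest and highest 4% of salaries within each group.
    grouped = df.groupby("years")["totalSalary"]
    lower = grouped.transform(lambda s: s.quantile(trim))
    upper = grouped.transform(lambda s: s.quantile(1 - trim))
    keep = (df["totalSalary"] >= lower) & (df["totalSalary"] <= upper)
    return df[keep]


raw = pd.DataFrame({
    "experience": [2.5, 4.7, 6.0, None],
    "base": [10.0, 10.0, -1.0, 12.0],
    "bonus": [5.0, 5.0, 1.0, 2.0],
    "stocks": [5.0, 5.0, 1.0, 2.0],
})
cleaned = clean_salaries(raw)
```

Trimming per group rather than over the whole dataset matters here: a salary that is extreme for 1 year of experience can be ordinary at 10 years.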
Categorizing Companies
- For each experience bracket, fitted a three-component Gaussian Mixture Model (GMM) to cluster salaries into “Low”, “Medium”, and “High” categories.
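
One way to do this clustering is with scikit-learn's `GaussianMixture`, sketched below on synthetic data. The helper `label_tiers`, the seed, and the `n_init` setting are my assumptions. GMM component indices come back in arbitrary order, so the sketch sorts components by mean before naming them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

TIER_NAMES = np.array(["Low", "Medium", "High"])


def label_tiers(salaries, random_state: int = 0) -> np.ndarray:
    """Fit a 3-component 1-D GMM and name each point's component by mean."""
    X = np.asarray(salaries, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=3, n_init=5, random_state=random_state)
    gmm.fit(X)

    # Map each component index to its rank when sorted by mean salary.
    order = np.argsort(gmm.means_.ravel())
    rank = np.empty(3, dtype=int)
    rank[order] = np.arange(3)
    return TIER_NAMES[rank[gmm.predict(X)]]


# Synthetic salaries for a single experience bracket (illustrative only).
rng = np.random.default_rng(0)
demo_salaries = np.concatenate([
    rng.normal(10, 1, 100),
    rng.normal(30, 1, 100),
    rng.normal(60, 1, 100),
])
demo_labels = label_tiers(demo_salaries)
```

In the analysis described above, a function like this would be applied once per experience group, so each bracket gets its own three tiers.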
Plotting and Summaries
- Using plotnine (a Python port of ggplot2), plotted a histogram of salaries faceted by years of experience.
- Added vertical dashed lines showing mean salaries for each cluster within each experience group.
- Labeled each cluster with the sample size for clarity.