Welcome, my name is Katherine. I'm an aspiring data scientist and data analyst with a Master's in Data Science from CU Boulder, passionate about turning complex data into meaningful insights that inform smarter decisions. Through hands-on academic projects and internships, I've explored evidence-based policy, dashboard design, and data storytelling to make information more accessible. I'm excited to continue growing and contribute to teams where data makes a real impact.
Welcome! My name is Katherine but I also go by Kath. I am an aspiring data scientist and analyst with a passion for storytelling, learning, and purpose-driven work. I believe that data and empathy are two of the most powerful tools to create meaningful change, especially in fields like healthcare and public health.
Most days, you'll find me working in Jupyter or Colab, turning messy data into clear insights. I enjoy analyzing healthcare data, diving into racing stats, and exploring unfamiliar datasets that spark my curiosity. I use tools like Python, R, SQL, Power BI, and Tableau to clean data, run analyses, build dashboards, and apply machine learning.
I've worked on a range of academic and personal projects, from evidence-based policy analysis at the Colorado Governor's Office to independent explorations of health data and Formula 1 data (yes, I'm a big F1 fan! Stay tuned for when that project is complete!). Whether it's mapping patient trends, racing through driver stats, or diving into something completely new, I value discovering new insights and data-driven stories.
Feel free to poke around, and if you have any questions or just want to reach out, let's connect!
Explore my data science projects showcasing skills in machine learning, data analysis, and visualization. Each project demonstrates my approach to solving real-world problems with data.
I led the development of a deep learning solution to predict diabetes risk using uncorrelated health indicators. I evaluated both Feedforward Neural Networks (FNN) and Convolutional Neural Networks (CNN), utilizing Random Forest for feature selection. The FNN achieved 87.5% accuracy, outperforming CNN and demonstrating strong predictive power for independent health variables despite minor overfitting. This project showcases my skills in deep learning model evaluation, feature selection, and health data analysis.
Pima Indians Diabetes Predictive Analysis
Applications: XGBoost, SVM, MLPClassifier, Cross-Validation, Confusion Matrix
Pima Indians Diabetes Predictive Analysis
Overview: Conducted predictive analysis on the Pima Indians Diabetes dataset to detect early signs of diabetes using lab-derived indicators such as glucose, insulin, blood pressure, age, and BMI. Applied data mining pipeline steps including feature selection, correlation analysis, cross-validation, and multiple classification models to evaluate predictive performance. Results confirm that lab work and characteristic features can be used to identify predisposed individuals for early detection and intervention.
Results: Achieved 70.5% accuracy, 75.7% precision, and 77.2% F1-score using SVM, with Decision Tree and XGBoost further supporting that glucose, insulin, and blood pressure are key predictors of diabetes.
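For illustration, a minimal sketch of this kind of cross-validated SVM pipeline (not the project's exact code; the file path and the Kaggle column name "Outcome" are assumptions) might look like:

import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Load the Pima dataset (hypothetical local path) and split features from the label
df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="Outcome"), df["Outcome"]

# Scale features and fit an RBF-kernel SVM, scored with 5-fold cross-validation
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
pred = cross_val_predict(model, X, y, cv=5)

print("accuracy :", accuracy_score(y, pred))
print("precision:", precision_score(y, pred))
print("f1-score :", f1_score(y, pred))
print(confusion_matrix(y, pred))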
Monet GAN Image Generation
Applications: DCGAN, TPU, Image Augmentation
Monet GAN Image Generation
Overview: Trained a Deep Convolutional GAN (DCGAN) model using the "I'm Something of a Painter Myself" Kaggle dataset to generate Monet-style paintings from real-world photo inputs. Data preprocessing involved image augmentation, resizing, and scaling, with modeling supported by TPU acceleration for performance.
Results: Generated Monet-inspired image transformations, though full training was interrupted by TPU limitations. Despite this, the model demonstrated strong artistic potential.
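As a rough illustration of the architecture (a sketch only, not the trained project model; the image size, filter counts, and latent dimension are assumptions), a DCGAN generator and discriminator in Keras can be set up like this:

import tensorflow as tf
from tensorflow.keras import layers

def build_generator(latent_dim=100):
    # Upsample a latent vector into a 256x256 RGB image scaled to [-1, 1]
    return tf.keras.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(16 * 16 * 256, use_bias=False),
        layers.Reshape((16, 16, 256)),
        layers.Conv2DTranspose(128, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(), layers.LeakyReLU(),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(), layers.LeakyReLU(),
        layers.Conv2DTranspose(32, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(), layers.LeakyReLU(),
        layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh"),
    ])

def build_discriminator():
    # Downsample an image into a single real/fake logit
    return tf.keras.Sequential([
        layers.Input(shape=(256, 256, 3)),
        layers.Conv2D(64, 4, strides=2, padding="same"), layers.LeakyReLU(0.2),
        layers.Conv2D(128, 4, strides=2, padding="same"), layers.LeakyReLU(0.2),
        layers.Conv2D(256, 4, strides=2, padding="same"), layers.LeakyReLU(0.2),
        layers.Flatten(),
        layers.Dense(1),
    ])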
Predicting Diabetes Risk Through Deep Learning
Applications: FNN, CNN, Random Forest (feature selection)
Predicting Diabetes Risk Through Deep Learning
Overview: Evaluated two deep learning architectures, the Feedforward Neural Network (FNN) and the Convolutional Neural Network (CNN), to predict diabetes risk from uncorrelated health indicators. Feature selection was done via Random Forest, focusing on Polyuria, Polydipsia, Gender, Sudden Weight Loss, and Partial Paresis.
Results: Achieved 87.5% accuracy with FNN, which outperformed CNN, indicating it is a better fit for independent features. Despite signs of overfitting, FNN demonstrated strong potential in modeling diabetes risk from uncorrelated health variables.
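A condensed sketch of this two-step approach (not the project code; the file path, the UCI-style "class" column, and the layer sizes are assumptions) could look like:

import pandas as pd
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the symptom dataset (hypothetical path) and one-hot encode categorical features
df = pd.read_csv("diabetes_risk.csv")
X = pd.get_dummies(df.drop(columns="class"), drop_first=True)
y = (df["class"] == "Positive").astype(int)

# Rank features with a Random Forest and keep the top five
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
top5 = X.columns[rf.feature_importances_.argsort()[::-1][:5]]
X_sel = X[top5].astype("float32")

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.2, random_state=42)

# Small feedforward network on the selected features
fnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(5,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
fnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
fnn.fit(X_tr, y_tr, epochs=50, validation_data=(X_te, y_te), verbose=0)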
Predicting Diabetes Risk Through Unsupervised Learning
Applications: K-Means, Hierarchical Clustering, PCA, NMF, Silhouette Score
Predicting Diabetes Risk Through Unsupervised Learning
Overview: Applied K-Means and Hierarchical Clustering on patient data to identify individuals at risk for diabetes without using labeled outcomes. Reduced dimensionality with NMF before clustering and evaluated model performance using accuracy, precision, confusion matrices, and silhouette scores.
Results: Achieved 81.7% accuracy and 96.3% precision with Hierarchical Clustering, and 80.5% accuracy and 96.7% precision with K-Means, indicating both are high-potential methods for evaluating diabetes risk. However, low silhouette scores (<0.45) indicated weak intra-cluster similarity.
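For context, a minimal sketch of the NMF-plus-clustering workflow (the file path, component count, and preprocessing details are assumptions, not the project's exact settings):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Encode and scale features to [0, 1] so everything stays non-negative for NMF
df = pd.read_csv("diabetes_risk.csv")  # hypothetical path
X = MinMaxScaler().fit_transform(pd.get_dummies(df.drop(columns="class")))

# Compress the feature space with NMF before clustering
X_nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=42).fit_transform(X)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_nmf)
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X_nmf)

print("K-Means silhouette     :", silhouette_score(X_nmf, km_labels))
print("Hierarchical silhouette:", silhouette_score(X_nmf, hc_labels))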
Predicting Diabetes Risk Through Supervised Learning
Applications: Logistic Regression, Decision Tree, Learning Curves, Confusion Matrix
Predicting Diabetes Risk Through Supervised Learning
Overview: Built two predictive models, a Logistic Regression and a Decision Tree Classifier, to identify diabetes risk based on symptoms and health indicators such as age, gender, polyuria, and partial paresis. Evaluated performance using accuracy, precision, confusion matrices, and learning curves.
Results: Achieved 94.2% accuracy with the Decision Tree and 93.3% with Logistic Regression, demonstrating that both models can effectively predict diabetes risk.
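A short sketch of how those two models can be compared side by side (the file path, encoding, and hyperparameters are illustrative assumptions):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, learning_curve

df = pd.read_csv("diabetes_risk.csv")  # hypothetical path
X = pd.get_dummies(df.drop(columns="class"), drop_first=True)
y = (df["class"] == "Positive").astype(int)

models = [("Logistic Regression", LogisticRegression(max_iter=1000)),
          ("Decision Tree", DecisionTreeClassifier(max_depth=5, random_state=42))]

for name, model in models:
    # Cross-validated accuracy plus a learning curve to check for over/underfitting
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5)
    print(f"{name}: CV accuracy={acc:.3f}, "
          f"val score at largest training size={val_scores.mean(axis=1)[-1]:.3f}")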
Disaster Tweets Classification
Applications: Python, NumPy, pandas, Seaborn, Matplotlib, TensorFlow/Keras, CNN, RNN (LSTM), Kaggle
Disaster Tweets Classification
Overview: Used NLP and deep learning to classify tweets as real or fake disaster alerts, comparing CNN and RNN (LSTM) models on the task. Used Keras for model design and its Tokenizer for text preprocessing.
Results: Achieved a Kaggle score of 0.56 with CNN, which slightly outperformed RNN. However, both models underperformed due to CNN overfitting and RNN failing to learn effectively. Future improvements include better hyperparameter tuning and more complex layer configurations.
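As an illustration of the comparison (a sketch with placeholder tweets; the vocabulary size, sequence length, and layer sizes are assumptions):

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers

texts = ["Forest fire near La Ronge Sask. Canada", "I love this song"]  # placeholder tweets
labels = [1, 0]                                                          # 1 = real disaster

# Tokenize and pad tweets to a fixed length
tok = Tokenizer(num_words=10000, oov_token="<OOV>")
tok.fit_on_texts(texts)
X = pad_sequences(tok.texts_to_sequences(texts), maxlen=40)

def build(head_layers):
    # Shared embedding front end with either a CNN or an LSTM head on top
    return tf.keras.Sequential(
        [layers.Input(shape=(40,)), layers.Embedding(10000, 64)]
        + head_layers
        + [layers.Dense(1, activation="sigmoid")]
    )

cnn = build([layers.Conv1D(64, 5, activation="relu"), layers.GlobalMaxPooling1D()])
rnn = build([layers.LSTM(64)])
for model in (cnn, rnn):
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])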
NYPD Shooting Incident Data Analysis
Applications: R, tidyverse, lubridate, data wrangling, exploratory data visualization
NYPD Shooting Incident Data Analysis
Overview: Analyzed 15 years of NYPD shooting incident data to investigate how borough location, race, and sex affect the likelihood of becoming a shooting victim in NYC. Built visualizations and performed trend modeling to assess patterns over time and across demographics.
Results: Found that sex was the most influential factor, with males consistently at higher risk, while race and location showed minimal predictive power. Although Brooklyn and the Bronx had more shootings overall, the distribution remained proportional across boroughs and years.
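The analysis itself was written in R with tidyverse and lubridate; as a rough Python analogue of the core aggregation (column names taken from the NYC Open Data extract, treated here as assumptions):

import pandas as pd

# Hypothetical local copy of the NYC Open Data shooting incident extract
df = pd.read_csv("NYPD_Shooting_Incident_Data__Historic_.csv")
df["year"] = pd.to_datetime(df["OCCUR_DATE"]).dt.year

# Victim counts per year, broken out by borough and by victim sex
by_boro = df.groupby(["year", "BORO"]).size().unstack(fill_value=0)
by_sex = df.groupby(["year", "VIC_SEX"]).size().unstack(fill_value=0)

print(by_boro.tail())
print(by_sex.tail())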
Analysis of Covid-19 Death Rates by Continent
Applications: Exploratory Data Analysis, Data Wrangling
Analysis of Covid-19 Death Rates by Continent
Overview: Analyzed Covid-19 death rates across continents using April 2021 data to explore the relationship between death rates and various socioeconomic and health indicators: population density, extreme poverty, elderly population, hospital beds, life expectancy, cardiovascular death rates, and diabetes prevalence. Investigated trends through visualizations including univariate, bivariate, and multivariate analyses.
Results: Exploring seven variables across continents, found that age (especially the population aged 70 and over), diabetes prevalence, and extreme poverty were most correlated with Covid-19 death rates, with correlations differing by continent.
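A compact sketch of that correlation step (using Our World in Data column names as an assumption about the underlying file; the date filter and indicator list are illustrative):

import pandas as pd

covid = pd.read_csv("owid-covid-data.csv")  # hypothetical local copy
snapshot = covid[covid["date"] == "2021-04-30"].copy()
snapshot["death_rate"] = snapshot["total_deaths"] / snapshot["population"]

indicators = ["aged_70_older", "diabetes_prevalence", "extreme_poverty",
              "population_density", "hospital_beds_per_thousand",
              "life_expectancy", "cardiovasc_death_rate"]

# Pearson correlation of each indicator with the death rate, within each continent
for continent, group in snapshot.groupby("continent"):
    corr = group[indicators + ["death_rate"]].corr()["death_rate"].drop("death_rate")
    print(continent, corr.round(2).to_dict())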
Explore my technical skills in Python, R, SQL, and more. Each skill card shows my experience and key libraries I use for data analysis and machine learning.
Key Libraries: pandas, NumPy, matplotlib, Seaborn, scikit-learn
I used Python extensively for data analysis, machine learning, and statistical modeling. My projects included predictive analytics, data visualization, and automated reporting systems. Python's versatility allowed me to handle large datasets efficiently while creating interactive dashboards and implementing complex algorithms for pattern recognition and forecasting.
Key Libraries: dplyr, ggplot2, tidyr
I used R for advanced statistical analysis, data wrangling, and creating publication-quality visualizations. R was essential for hypothesis testing, regression modeling, and exploratory data analysis in academic research and coursework, especially when working with complex or messy datasets.
Databases: PostgreSQL
I used SQL to design, query, and manage relational databases. My experience includes writing complex queries for data extraction, transformation, and reporting, as well as optimizing database performance and ensuring data integrity for analytics projects and dashboards.
View my professional certifications and academic achievements in data science and analytics. These credentials validate my expertise and commitment to continuous learning.
Strengthening foundational data analysis and business insight skills.
Working with hands-on projects using Python, SQL, Excel, and Jupyter to clean, analyze, and visualize real-world datasets.
Establishing foundational skills in creating interactive dashboards for business insights and data visualization.
Working on hands-on projects using Power BI to build dashboards, transform data with Power Query, and write DAX formulas to derive key metrics and business KPIs.
Building skills in supervised and unsupervised learning, statistical modeling, and data visualization using Python and R.
Applying theory through hands-on projects focused on public and health data analysis.
Colorado Governor's Office of State Budgeting and Planning (OSPB) | Denver, CO