Custom Transformations at the UDRC

Vincent Brandon, Data Coordinator
April 22, 2020

Narrative

Feature engineering has become indispensable in transforming data. Combining and aggregating existing data allows us to extract more meaning and improve model performance. The UDRC links data on employment, education, and health to help researchers answer important policy questions. When data are collected at different intervals, at different scales, or with numerous collinear factors, feature engineering and custom binning can go a long way toward clarifying the relationships among them and producing novel, informative datasets.

Take this example:

A researcher wanted to investigate the impact of students' reported earnings while attending college. We faced two major problems. First, academic calendars straddle two calendar years: summer and fall terms land in one year, spring in the next. Second, academic data are reported against three equally spaced terms, whereas wages are reported quarterly. Working with the researcher, we implemented a set of transforms that shifted academic data onto the fiscal calendar, then mapped quarterly wage data to the offset trimesters. The result was a novel alignment of wage data with post-secondary outcomes.
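To make the quarter-to-trimester problem concrete, here is a minimal Python sketch of one possible mapping. It is not the transform the UDRC actually implemented. The term boundaries (summer May to August, fall September to December, spring January to April), the even split of each quarter's wages across its three months, and the names `TERM_BY_MONTH`, `academic_year`, and `wages_by_term` are all assumptions made for illustration.

```python
from collections import defaultdict

# ASSUMPTION: three four-month terms with these month boundaries.
# Real term definitions would come from the data owner.
TERM_BY_MONTH = {
    5: "summer", 6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall", 12: "fall",
    1: "spring", 2: "spring", 3: "spring", 4: "spring",
}


def academic_year(year, month):
    """Label an academic year by the calendar year in which it ends.

    Summer and fall (May-December) belong to the academic year ending the
    following spring; spring (January-April) belongs to the current one.
    """
    return year + 1 if month >= 5 else year


def quarter_months(quarter):
    """Return the three calendar months covered by quarter 1-4."""
    first = 3 * (quarter - 1) + 1
    return [first, first + 1, first + 2]


def wages_by_term(quarterly_wages):
    """Apportion quarterly wages to (academic_year, term) buckets.

    quarterly_wages maps (calendar_year, quarter) -> wage amount.
    ASSUMPTION: wages are earned evenly across the months of a quarter,
    so each month carries one third of the quarterly total.
    """
    totals = defaultdict(float)
    for (year, quarter), wage in quarterly_wages.items():
        for month in quarter_months(quarter):
            key = (academic_year(year, month), TERM_BY_MONTH[month])
            totals[key] += wage / 3
    return dict(totals)


# Made-up wages for one student: 3000 in each of four consecutive quarters.
example = {(2019, 2): 3000.0, (2019, 3): 3000.0,
           (2019, 4): 3000.0, (2020, 1): 3000.0}
term_wages = wages_by_term(example)
```

With these assumptions, the second quarter of 2019 is split between the spring term of one academic year (April) and the summer term of the next (May and June), which is exactly the kind of edge case worth documenting and validating together.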

In general, the UDRC can apply explicit transformations to our data before reporting and obfuscation. Researchers provide the formula, and we work together to create validation steps and to document how null data and edge cases are handled. UDRC staff can accept transformation requirements in three ways:

  1. Give us the formula. We can describe data types, distributions, and edge cases, while researchers specify the logic in LaTeX or whiteboard the equations with us.
  2. Do it in SQL yourself. We can generate dummy tables with data closely resembling the real thing and provide a SQLite database for you to develop against locally.
  3. Do it in R or Python with standard public libraries. We cannot accept arbitrary code or install custom libraries from unqualified repositories, but we can re-implement your process and send the code back for you to confirm against the dummy data.
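As a rough illustration of the second and third options, the sketch below uses Python's standard `sqlite3` module to build a small dummy table and run a transform that makes its null handling explicit. The `wages` table, its columns, and every value in it are hypothetical and do not reflect the UDRC's actual schema or data.

```python
import sqlite3

# In-memory database standing in for a dummy SQLite file.
conn = sqlite3.connect(":memory:")

# HYPOTHETICAL schema, for illustration only.
conn.execute(
    "CREATE TABLE wages ("
    "person_id INTEGER, year INTEGER, quarter INTEGER, wage REAL)"
)

# Made-up rows, including one NULL wage to exercise the edge case.
rows = [
    (1, 2019, 3, 2500.0),
    (1, 2019, 4, 3100.0),
    (2, 2019, 3, None),
    (2, 2019, 4, 1800.0),
]
conn.executemany("INSERT INTO wages VALUES (?, ?, ?, ?)", rows)

# The transform states its null rule explicitly: a missing wage counts as
# zero in the total, and the number of missing quarters is reported so the
# choice can be reviewed rather than hidden.
transform_sql = """
SELECT person_id,
       year,
       SUM(COALESCE(wage, 0)) AS total_wage,
       SUM(wage IS NULL)      AS missing_quarters
FROM wages
GROUP BY person_id, year
ORDER BY person_id
"""
result = conn.execute(transform_sql).fetchall()
conn.close()
```

Developing against a table like this lets the null rule and the expected output be agreed on before the same SQL is run on the real data.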

Collaboration improves process. Documentation creates a lesson book for future researchers. It is our mission to provide reliable cross-linked information for Utah. Our team of coordinators and research staff will work with you to produce de-identified datasets that benefit policy makers, researchers, and students for years to come. Don’t let obfuscation get in the way of your analysis. We are here to help.