Oct 26, 2022

Extract Transform Load – Natural Language Processing for the Data Set Initiative

Posted by Unison

The backstory

The Data Set Initiative, established to add value to the cost community through transparent, distributable, and credible data sets, is progressing nicely. We have been assembling a document that describes the history of Unison Cost Engineering's data collection and the processes used to gather, analyze, and utilize the collected data in support of the TruePlanning® application and the cost community. Thousands of data points have been amassed from external and internal sources, and we are now finding and implementing ways to create visualizations from what we have learned from them.

Let’s dig into Natural Language Processing (NLP) and how it contributes to Unison Cost Engineering’s Data Set Initiative, along with the Extract Transform Load (ETL) of data in general. NLP sits at the intersection of linguistics, computer science, and Artificial Intelligence (AI). As shown in the figure below, AI also encompasses Machine Learning (ML) and Deep Learning (DL).

Source: Data Science Foundation (original post, June 2020)

We are focusing on the green area of the figure. Categorizing old data is one of our challenges during our work on the Data Set Initiative. The hardware component categories are not formally documented in the data, and unfortunately there is no standard dictionary or taxonomy of hardware component terms used across the aerospace and defense cost engineering industry. Since NLP depends on the machine “understanding” the meaning of words (usually by being trained on other text documents), this made implementing most NLP algorithms impractical.

Another big issue is that the data set we are working with is very imbalanced: most hardware component categories were represented only once. This meant that implementing an ML technique for classification purposes would be extremely difficult.
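A quick frequency count makes that imbalance visible. The snippet below is a minimal, hypothetical sketch (the labels are made up for illustration), but it shows the kind of check that reveals the problem:

```python
from collections import Counter

# Hypothetical category labels standing in for the real data set.
labels = [
    "Antenna", "Antenna", "Transmitter", "Receiver",
    "Gyroscope", "Star Tracker", "Reaction Wheel",
]

counts = Counter(labels)

# Categories that appear only once -- the long tail that makes
# supervised classification so difficult here.
singletons = [category for category, n in counts.items() if n == 1]
print(f"{len(singletons)} of {len(counts)} categories appear only once")
```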

For now, we have written an algorithm that captures certain keywords for hardware component types. We have compiled a list of these keywords based on our experience. Some keywords capture component types (“Primary”), while others are more like subcategories (“Secondary”) that might be useful for additional analysis of the data. Below is a small subset from the beginning of the list:

For example, for the data point “Blade Antenna, Steel Edge”, we would want to extract “Antenna” and “Blade Antenna” as primary categories. In addition, we would extract the secondary term “Steel”, as that might be an essential characteristic for anyone analyzing the data set.
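As a rough sketch of this kind of keyword capture (the keyword lists and function below are hypothetical and far shorter than the real ones, not our production code), the idea looks something like this:

```python
# Hypothetical keyword lists -- the real lists are much longer and were
# compiled from cost engineering experience.
PRIMARY_KEYWORDS = ["antenna", "blade antenna", "transmitter", "receiver"]
SECONDARY_KEYWORDS = ["steel", "aluminum", "composite"]

def extract_categories(description: str) -> tuple[list[str], list[str]]:
    """Return (primary, secondary) keywords found in a component description."""
    text = description.lower()
    primary = [kw for kw in PRIMARY_KEYWORDS if kw in text]
    secondary = [kw for kw in SECONDARY_KEYWORDS if kw in text]
    return primary, secondary

primary, secondary = extract_categories("Blade Antenna, Steel Edge")
print(primary)    # ['antenna', 'blade antenna']
print(secondary)  # ['steel']
```

Simple substring matching like this is brittle; it will miss misspellings such as “antena”, which is part of why a user still needs to review the results.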

This algorithm is not cutting-edge NLP technology, but it is useful for pre-processing text data. It does not solve every foreseeable problem, however. A user would still have to check the data for errors (or for new terms that are not yet part of the algorithm), and they might have to normalize their data in specific ways, such as correcting spelling mistakes. But it could still save some time when labeling large amounts of data.
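A sketch of the kind of normalization a user might apply before the keyword capture runs (the function here is illustrative, not part of our algorithm):

```python
import re

def normalize(description: str) -> str:
    """Basic cleanup before keyword matching: lowercase, trim,
    collapse repeated whitespace, and strip stray punctuation."""
    text = description.lower().strip()
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    text = re.sub(r"[^\w\s,-]", "", text)   # drop stray punctuation
    return text

print(normalize("  Blade  Antenna,  Steel Edge!!  "))  # 'blade antenna, steel edge'
```

Spelling corrections are harder to automate safely and, for now, remain a manual check.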

Benefit from supervised, predictive models validated by a dedicated research team to give leadership more confidence in program decisions. Unique, powerful technology, built on decades of studying data, information, and knowledge, delivers faster decisions with less risk.
