5/26/21: ETL Prototype
by Eric Turner
The GitHub PR is here.
The data that was loaded is from the USA/CA/butte_county/college/chico scraper. I chose this as a starting point because it has two different types of data, with two differing formats. This allowed me to verify the library reads from the schema.json properly and can load and map no matter the data output. You can find that PR here. It also created 2 new datasets here.
We currently have it set to not auto-commit so someone reviews before committing each time. But it does load data from files, use the schema.json file and (mostly) works!
Current Process:
  1. 1.
    Call the /common/etl library from a specific scraped agency dir, pass the schema.json file to it.
  2. 2.
    The library will locally clone the pdap/datasets and pdap/data-intake repos
  3. 3.
    Then it will verify the agency_id exists and will grab the appropriate record from the database
  4. 4.
    Next it will enumerate over the data items in the schema.json. It will create any datasets where the id is null, otherwise it will find an existing dataset and sync the data from the table with the schema.json. Whichever source has the most recent last_modified time will be the “Source of Truth” and that’s the data that will be used to sync.
  5. 5.
    It will do the same for the data.mapping object, it will search for a table in pdap/data-intake with the exact name of the data_type and then sync the columns so they are there. If a new column is in the database but not the schema.json file, it will add the missing column with a __skip__ value.
  6. 6.
    Once everything has been synced, it will finally enumerate over the mapping object and insert the data into pdap/data-intake. Erroneous records are skipped a message displayed to the console.
  7. 7.
    The schema.json is overwritten in the directory with the synced changes from the database.
Copy link
Edit on GitHub