What is SeekConnect?
SeekConnect is a self-service data onboarding tool that will be made available for select pilot use near the end of Q1 2023.
The initial version of SeekConnect supports ingestion of CSV files via S3 and SFTP. Additional connectors will be added in the following months to enable users to extract data from other sources.
How do I access SeekConnect?
You can access SeekConnect directly in your Data Hub from the tools dropdown menu, depicted below:
Once you select the SeekConnect tool, you will be taken to a new page:
This is your SeekConnect Jobs page, where you can manage all source connections, data transformations, and publications. To create a new Job, click the Create New + button on the far right. You can choose to create a new source, transformation, or publication. We will go through each of these in this document.
Step 1: How do I create a new data pipeline?
First, we will cover the process of creating a new source:
Once you select the Source option, you will be navigated to the Source Job creation page:
To name your dataset, click on the Untitled Source box, type the new name you want the dataset to have, and press Enter.
To set up your file source, choose the connector you want from the dropdown menu and follow the onscreen instructions to fill out the fields required to set up the connection to the source data.
Once you enter the correct information for your S3 bucket, you have the option to adjust the file format:
Select the correct format for your data (CSV, Parquet, Avro, JSON) and adjust any parameters you would like. Next, click on the continue button at the bottom of the page to create your source.
Step 2: How do I review a sample of the source data before data upload?
Note: This step is a work in progress, with an estimated release in Q2.
Once you have completed the Select Source step and established a connection to the source, SeekConnect will redirect you to the Review files screen.
On this screen, you can select a source file asset and preview a sample of the source data.
Click on the Back button to return to the Create dataset screen and modify your source setup.
Click on Confirm to proceed with the data upload process.
SeekConnect automatically creates the necessary landing and ingestion sources and destinations to generate the pipeline to your S3 bucket. The Upload page will confirm that the Source Job has completed.
After each field above is completed, you will be navigated to a preview page where you can save this Source Job to be used in Transformation and Publication Jobs.
Once the connection is Active, you can click Save to return to the home page to create a new Data Transformation Job.
Step 3: How do I perform a Data Transformation?
To begin your Data Transformation Job, navigate to the Create New + button and select the Transformation option from the drop-down menu.
The next page will allow you to name your Data Transformation Job and select the source(s) you would like to perform the transformation on.
Select your source from the left-hand panel. You can click the preview button to see a sample of the data to help inform your SQL transformation.
To transform this dataset before publishing, you need to navigate to the untitled_transformation.sql tab. Here, you can write SQL code to select or subset the data you want.
To help with your SQL query, we have included a copy feature for each data source. Simply hover your cursor over the desired source and click the copy button:
Paste this snippet into the untitled_transformation.sql file to begin writing your transformation.
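For example, if the copied snippet references a hypothetical source named LISTINGS (the actual name SeekConnect copies will depend on your source), the starting query might look like this:

```sql
-- Starting point pasted from the copy button (the source name LISTINGS is hypothetical)
SELECT *
FROM LISTINGS;
```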
Now, you can add any additional SQL code to transform your dataset.
For this example, we are going to apply a simple transformation to subset the table on the IS_SUPERHOST column so that we only have Superhosts in the dataset.
To subset this table to only include Superhosts, we need to add WHERE IS_SUPERHOST = 't' to the code snippet.
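A minimal sketch of the full transformation, reusing the hypothetical LISTINGS source name from above:

```sql
-- Keep only the rows flagged as Superhosts (source name is hypothetical)
SELECT *
FROM LISTINGS
WHERE IS_SUPERHOST = 't';
```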
You can preview the results of this query by selecting the Preview button below.
The preview shows that only Superhost rows are included, which is the expected result of the SQL query above.
Before saving this Data Transformation Job, you can edit the Title, Rename the File, or choose to materialize this dataset as a table instead of a view:
Materializing the data as a table refreshes the stored results regularly, based on upstream source data updates. Materializing as a view generates results each time a user queries it. Each method has compute and performance implications to consider; views are usually the more cost-effective choice, but not always.
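SeekConnect manages the materialization for you, but conceptually the two options correspond roughly to the following standard SQL, shown here only to illustrate the trade-off (object names are hypothetical):

```sql
-- Materialized as a view: only the query is stored; results are computed on every read
CREATE VIEW SUPERHOST_LISTINGS AS
SELECT * FROM LISTINGS WHERE IS_SUPERHOST = 't';

-- Materialized as a table: the results are stored and refreshed from the upstream source
CREATE TABLE SUPERHOST_LISTINGS AS
SELECT * FROM LISTINGS WHERE IS_SUPERHOST = 't';
```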
Once you have finished making edits to the Data Transformation Job, click Save in the upper right corner. This will prompt you to name your transformation. Make sure to give this a unique name as this will be used in the Publication Job.
After you have saved your Transformation Job, navigate back to the home page to Publish this dataset to your Data Hub.
Step 4: How do I Publish my new dataset to my Data Hub?
To begin your Publication Job, navigate to the Create New + button and select the Publication option from the drop-down menu.
A pop-up window will appear asking you to select which Data Transformation Job you would like to Publish to your Data Hub. Select your desired transformation and click Confirm to continue.
Once you confirm which data transformation you would like to publish, you will be navigated to the next page to map your data.
Step 5: How do I map my data?
On the left-hand side of the page, you can view and edit the names of your columns. You can also edit the mapping parameters by clicking on the blue icon next to each column. Easily adjust your Data Type, Mapping Type, Units, and Scale.
To rename a column, find the column you want to rename in the left panel, click on its name, type the new name you want the column to have, and press Enter.
To assign mapping for a column, click on the blue filter icon next to its name. A pop-up window will open where you can select the appropriate formatting, mapping type, and additional refinement options.
First, select the appropriate Data Type for that column:
- Text format is used for textual data.
- Number format is used for numeric data.
- Date field is the calendar dimension of the resulting dataset. It stores the dates of various data points and is used to build time series from your data. Our system supports the following statistical date formats for different frequencies:
- Years: 2009, 2010, 2011
- Half years: 2009H1, 2013H2
- Quarters: 2009Q1, 2010Q3, 2012Q4
- Months: 2009M2, 2011M7
- Country format is used for columns that contain country information, represented by country names or ISO country codes.
- Company format is used for columns that contain company information.
- Latitude and Longitude fields can be used to read the latitude or longitude attribute information for location-based data.
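If your source stores dates as native DATE values, you can format them in your Transformation Job to match one of the supported statistical formats listed above. A minimal sketch in generic SQL, assuming a hypothetical LISTING_DATE column and standard EXTRACT/CAST support in the underlying engine:

```sql
-- Build a monthly period string such as 2011M7 from a DATE column (names are hypothetical)
SELECT
  CAST(EXTRACT(YEAR FROM LISTING_DATE) AS VARCHAR)
    || 'M' ||
  CAST(EXTRACT(MONTH FROM LISTING_DATE) AS VARCHAR) AS PERIOD
FROM LISTINGS;
```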
Next, select the appropriate Mapping Type for that column:
- Dimension is used to specify a data dimension for the resulting dataset. A Dimension also appears as a dropdown menu within the dataset, used to select the data categories to visualize in the DataViewer. Only columns in your source data that are complete (no data gaps) may be set as a Dimension.
- To visualize data in Knoema’s DataViewer, you will need to choose at least one column to be a Dimension; this is the column where you select your data. For example, the World Bank WDI dataset below has two Dimensions (Country and Series).
- Measure is used for numerical data that should be calculated or aggregated. Any number, currency, or other numerical value can be used as a Measure. For Measure columns, you can further refine the mapping in two additional drop-downs:
- Unit is used to specify the units of measurement.
- Scale is used to specify the scale of measurements.
- Date field is the calendar dimension of the resulting dataset. It stores the dates of various data points and is used to build time series from your data. Our system supports the following statistical date formats for different frequencies:
- Years: 2009, 2010, 2011
- Half years: 2009H1, 2013H2
- Quarters: 2009Q1, 2010Q3, 2012Q4
- Months: 2009M2, 2011M7
- Days: MM/DD/YYYY, DD.MM.YYYY
- Primary Date is a Knoema platform concept indicating that the column is used for the x-axis in standard Knoema time series visualizations and calculations.
- Detail is any additional information attached to a record. It is typically used for information that would work with the grid visualization tool. For Detail columns, you can further refine the mapping by choosing:
- Unit is used to specify the units of measurement.
- Frequency is used to specify the data frequency. The following frequencies are supported:
- A - annual
- H - semi-annual
- Q - quarterly
- M - monthly
- D - daily
- Scale is used to specify the scale of measurements.
Click the Apply button to apply your changes.
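As a concrete illustration of how these mapping types fit together, consider a transformation whose output has the shape below (all names are hypothetical): COUNTRY would typically be mapped as a Dimension, PERIOD as the Date field, VALUE as a Measure with its Unit and Scale set, and NOTES as Detail.

```sql
-- Hypothetical output shape that maps cleanly in Step 5
SELECT
  COUNTRY,   -- Dimension: a complete column with no data gaps
  PERIOD,    -- Date field, e.g. 2012Q4
  VALUE,     -- Measure: numeric values, with Unit and Scale set in the mapping
  NOTES      -- Detail: extra information shown in the grid visualization
FROM MY_TRANSFORMED_DATA;
```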
Step 6: How do I edit the Metadata?
Next, you can edit the Metadata associated with your dataset. Add Descriptions, Provider details, Reference URLs, and Dataset type fields here.
Scroll down to see the table with all of the mapping information displayed. If you notice any issues with the mappings, you can click on them to make adjustments.
Once you finish editing the mapping fields and metadata, click on the Save button in the upper right corner to upload this dataset to your Data Hub.
After saving this Publication Job, the page will reload with a new link to this dataset in your Data Hub:
The Source Job, Transformation Job, and Publication Job will all remain active, creating a live connection to your raw data source. These Jobs will remain on your Jobs homepage, where you can easily reference them and make any edits in the future.