Sherlock Implementation Plan
#
Task 1 - Initiation#
Project Kickoff Meeting
The project kickoff meeting will determine the scope and identify the specific user groups to target for input during the initiation phase. The implementation plan will be recorded in this document.
#
Project Tracking
Project progress will be tracked via Jira. Parent tasks have currently been defined with estimates. A refined, lower-level series of task estimates will be produced as part of the systems design phase.
#
Project Documentation
Where feasible, documentation will be hosted on the C-CORE docs site. This will make it easier to manage the project documentation as a living document written in simple markdown text.
#
Collaboration Engagement
- Arctic UAV may be able to provide some drone data; we will most likely engage with them during the user systems design phase.
#
Discussion Points#
Deliverable Items
- Open source library for the generation of STAC data from other metadata sources.
- Should include infrastructure setup Dockerfiles as well, so that in theory others could host their own version
- Workshop on usage of completed platform
- Best practices and tutorial content
#
User Needs
The user needs assessment is intended to engage our targeted user groups and find out what types of data we should consider including. It will do this by finding out what data is currently in use, what data is difficult to work with, and what data is ignored due to access difficulties.
#
Format
The user needs assessment will be done by selecting representatives of various engagement sectors and interviewing them using a shared questionnaire. This will help drill down on the most important data sources and on what type of interface experience to prioritize.
In order to focus these sessions, existing data portal UIs will be used to determine desired features. Chris Hardy and Chris Boyce (Spatial Integrity) will provide a list of contacts representing the targeted user groups. It is expected that the interviews themselves should take about an hour per user group.
Once interviews are complete, a list of targeted data sources can be generated in order to guide development and design tasks.
#
Initial format exploration (Internal Only)
Data Source | Organization | Source Type | Integration Method |
---|---|---|---|
Landsat | USGS | AWS S3 | Relative Catalog Link to KGD |
Sentinel | ESA | AWS s3 | Conversion to STAC |
Ice Charts | CIS | FTP | Manually register data for format conversion, ingestion, and addition to STAC |
Geobase/open.canada.ca | NRCAN | CKAN | Manually registered, indexed in place |
River Ice | C-CORE | STAC | Relative Catalog Link to KGD |
Web Data | University, Government, and Open Data Portals | Known Data Types | Curated results from a domain-restricted web crawl for spatial data types, to be registered in dynamic STAC |
RPA Data | Arctic UAV | Tif | Unknown |
GHG Sat | GHG SAT LTD | Tif | Unknown |
AIS Data | Unknown | Unknown | Unknown |
Some other sources were discussed but are not a primary focus:
- S2 (there is already a library to create STAC entries)
- RCM? It could be excellent to get ahead of this, but it would result in creating something new.
#
Examine Existing Standards
This work item covers taking the targeted data sources and determining how to make STAC entries for them. The aim is simple, one-and-done STAC generation from each native metadata format.
#
Discussion Points
- This will hopefully end up with us finally committing to a common STAC generation method.
- Part of the proposal discusses that our cataloging code will be open source.
#
Data Cataloging Requirements
Because the data exists in a structured, stable format, the requirement of building static STAC items can be omitted.
#
Cataloging Open Portal Data - Open.Canada.ca
The Canadian federal open data portal is hosted using CKAN, which provides a data access API as part of the platform. CKAN appears to be a moderately popular choice for hosting government open data, so any effort here may be easily transferable.
We can extract metadata using this API, and it should be possible to generate our search indices with a conversion function that iterates over the datasets in the open portal. For this project we will focus on SHP files. As of 2020-06-20 there are 4595 entries where the format is "SHP"; initial investigations show that 95% of these contain a "spatial" attribute, which stores a JSON dump of a polygon.
Loading the dataset into the index will consist of the following steps:
- query record sets
- for each record query the item reference
- convert from CKAN json result to STAC by remapping attributes
- push the modified record into the elastic instance.
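As a rough illustration of the remapping step, the sketch below converts one CKAN package record into a STAC-style item. It assumes the package carries `id`, `title`, `notes`, `metadata_modified`, `resources` and a `spatial` field holding a GeoJSON polygon as a JSON string. The function names, the asset keys and the default `stac_version` value are illustrative choices, not settled design. Querying CKAN and pushing to the elastic instance are left out.

```python
import json
from typing import Any, Dict, List, Optional


def bbox_of(geometry: Dict[str, Any]) -> List[float]:
    """Return [west, south, east, north] for a GeoJSON Polygon."""
    points = [pt for ring in geometry["coordinates"] for pt in ring]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs), max(ys)]


def ckan_to_stac_item(
    package: Dict[str, Any], stac_version: str = "1.0.0"
) -> Optional[Dict[str, Any]]:
    """Remap one CKAN package dict to a STAC-style item dict.

    Returns None when the package has no usable "spatial" polygon,
    which the notes above suggest is the case for a minority of entries.
    """
    spatial = package.get("spatial")
    if not spatial:
        return None
    geometry = json.loads(spatial)
    if geometry.get("type") != "Polygon":
        return None

    # Keep only the SHP resources as assets (hypothetical asset keys).
    assets = {}
    for i, res in enumerate(package.get("resources", [])):
        if (res.get("format") or "").upper() == "SHP":
            assets[f"shp-{i}"] = {"href": res["url"], "title": res.get("name", "")}

    return {
        "type": "Feature",
        "stac_version": stac_version,
        "id": package["id"],
        "geometry": geometry,
        "bbox": bbox_of(geometry),
        "properties": {
            # Passed through unchanged; real code would likely need to
            # normalise this timestamp to the form STAC expects.
            "datetime": package.get("metadata_modified"),
            "title": package.get("title"),
            "description": package.get("notes"),
        },
        "assets": assets,
        "links": [],
    }
```

Records that come back as `None` could be logged and skipped, or routed to manual registration.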
For the individual entries generated from the open data portal, it appears that the majority are singular datasets with no clearly identifiable grouping attribute. In this case, to combine what would be collection-level and item-level metadata, we can use the Single File STAC Extension.
#
Cataloging an existing STAC catalog - Landsat
The USGS has made a significant effort in creating and maintaining an enormous catalog of ARD (analysis-ready data) for the Landsat satellites. As part of this effort a STAC catalog was created. Because the data is in a deliverable format and the metadata is in our targeted format, the STAC entries can be loaded directly into the elastic instance.
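One way to load existing STAC items directly is through the Elasticsearch bulk API, which takes newline-delimited JSON: an action line followed by the document itself. The sketch below only builds that request body. The helper name and the `stac-items` index name in the usage note are hypothetical, and sending the body to the instance is left out.

```python
import json
from typing import Any, Dict, Iterable


def to_bulk_ndjson(items: Iterable[Dict[str, Any]], index: str) -> str:
    """Build an Elasticsearch bulk request body from STAC item dicts.

    Each item becomes two lines: an "index" action that reuses the STAC
    item id as the document id, then the item itself. The bulk format
    requires the body to end with a newline.
    """
    lines = []
    for item in items:
        lines.append(json.dumps({"index": {"_index": index, "_id": item["id"]}}))
        lines.append(json.dumps(item))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"
```

A call such as `to_bulk_ndjson(landsat_items, "stac-items")` would produce a body suitable for a POST to the `_bulk` endpoint with the `application/x-ndjson` content type. Reusing the STAC id as `_id` means reloading the same item overwrites the earlier copy instead of duplicating it.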
#
Creating a catalog from provided assets - GHG and Arctic UAV
Two of the named collaborators on the Sherlock project, Arctic UAV and GHGSat, are data providers that do not have a readily apparent metadata organization structure. Sherlock or a hosting organization will also have to manage the hosting of the data itself. In addition, these providers will likely require increased rigor in securing their data assets, which are held for commercial purposes. The primary usage of these datasets will likely be through recommendations, weighted heavily by AOI for Arctic UAV and by theme for GHGSat. Catalog entries can be generated from a base template that includes sensor, provider and keyword information, plus minimal additional metadata.
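A minimal sketch of the template approach, assuming one base template per provider and a handful of per-asset fields. The template contents, field values and function name are placeholders for illustration only. In particular, storing `keywords` under item `properties` is an assumption, since how STAC should carry keyword tags is still an open question in these notes.

```python
import copy
from typing import Any, Dict, List

# Hypothetical per-provider base template; real values would come from
# the provider during the user needs and systems design work.
ARCTIC_UAV_TEMPLATE: Dict[str, Any] = {
    "properties": {
        "providers": [{"name": "Arctic UAV", "roles": ["producer"]}],
        "platform": "rpa",
        "keywords": ["rpa", "arctic"],
    }
}


def item_from_template(
    template: Dict[str, Any],
    item_id: str,
    geometry: Dict[str, Any],
    bbox: List[float],
    acquired: str,
    href: str,
) -> Dict[str, Any]:
    """Fill a provider template with the minimal per-asset metadata."""
    item = copy.deepcopy(template)  # never mutate the shared template
    item.update({"type": "Feature", "id": item_id, "geometry": geometry, "bbox": bbox})
    item.setdefault("properties", {})["datetime"] = acquired
    # Both providers are listed with a Tif source type in the table above.
    item["assets"] = {"data": {"href": href, "type": "image/tiff"}}
    item.setdefault("links", [])
    return item
```

Keeping the sensor, provider and keyword details in the template means each new asset only needs an id, a footprint, an acquisition time and a location.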
#
System Design
This activity allows us to put together a full low-level implementation plan. Items to be created include the locations of all data stores, API endpoints, intermediate data locations, and the containerized applications that support the searchability, discoverability, and interoperability of datasets.
#
Considerations
- The ongoing cost of hosting the elastic instances should be considered; it is proportional to the amount of data in the search engine.
- Using event-driven communication between components can result in undesirable vendor lock-in. Where feasible we should communicate with HTTP calls rather than relying on platform-internal pub/sub, event triggers, etc.
Elastic Instance Costs
- Attempting a like-for-like cost comparison by pricing each appbase.io package as an equivalent AWS configuration with the AWS pricing calculator, using 1 master and n node servers.
Location | Package | Nodes | CPU | Memory | Storage | Cost |
---|---|---|---|---|---|---|
appbase.io | Sandbox | 1 | 2 | 4 GB | 30 GB | $49 |
appbase.io | Starter | 3 | 2 | 4 GB | 40 GB | $149 |
appbase.io | Production | 3 | 4 | 16 GB | 480 GB | $799 |
appbase.io | Production | 3 | 16 | 64 GB | 999 GB | $3199 |
AWS elastic search | t2.medium.elasticsearch | 1 | 2 | 4 GB | 30 GB | $110.30 |
AWS elastic search | t2.medium.elasticsearch x 3 | 3 | 2 | 4 GB | 40 GB | $233.30 |
AWS elastic search | m4.xlarge.elasticsearch x 3 | 3 | 4 | 16 GB | 480 GB | $865.10 |
AWS elastic search | m4.4xlarge.elasticsearch x 3 | 3 | 16 | 64 GB | 999 GB | $2,970.36 |
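To make the comparison easier to read, the figures in the table can be paired tier by tier. The short sketch below does only that arithmetic on the listed prices; the tier labels and function name are made up for the example.

```python
from typing import List, Tuple

# (tier label, appbase.io cost, AWS cost), copied from the table above.
TIERS: List[Tuple[str, float, float]] = [
    ("Sandbox", 49.00, 110.30),
    ("Starter", 149.00, 233.30),
    ("Production, 4 CPU", 799.00, 865.10),
    ("Production, 16 CPU", 3199.00, 2970.36),
]


def compare(tiers: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Return (label, AWS minus appbase.io, difference as % of appbase.io)."""
    rows = []
    for label, appbase, aws in tiers:
        diff = aws - appbase
        rows.append((label, round(diff, 2), round(100.0 * diff / appbase, 1)))
    return rows


if __name__ == "__main__":
    for label, diff, pct in compare(TIERS):
        print(f"{label}: AWS differs by {diff:+.2f} ({pct:+.1f}%)")
```

On these figures the AWS configuration costs more at the three smaller tiers and less only at the largest one.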
#
Searchability
- Dynamic catalog
- STAC API endpoint
- Document Based storage
- Authorization requirements
- Generic Metadata to STAC conversion library
- Metadata to STAC API: consider dynamic creation of STAC items instead of hosting static ones?
- Elasticsearch index
- Elasticsearch data stores
- Elasticsearch API
- If we focus on Landsat, we can load some of the catalog into the elastic index, reducing dependence on conversions during search API development.
- Web Client?
#
Discoverability
- Search Result UI
- AOI, date range, data tags, key themes?
- Faceted Search
- Display Tags
- Grouping Results
- Several UI components.
- Recommendations
- Related data api? (Would reduce amount of data to send to client)
- Related data algorithm
- We can look at enhanced tagging of various datasets by fusing them with larger coarse-grained classifications, e.g. OSM or AVHRR; this would be part of the generation phase
- Additionally, we can look at relation by geolocation, although that would most likely just use AOI proximity
- How does STAC support keyword tagging? Is there an extension?
- Premium data tie-ins (RPA, GHGSat, FloeEdge)
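As one possible shape for the AOI, date range and tag search with facets, the sketch below builds an Elasticsearch query body. It assumes STAC items are indexed with `geometry` mapped as a `geo_shape` field, `properties.datetime` as a date, and a keyword-typed `properties.keywords` field for tags. None of these mappings are decided yet, and the function name is hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_search(
    bbox: Sequence[float],
    start: str,
    end: str,
    tags: Optional[List[str]] = None,
    facet_field: str = "properties.keywords",
    size: int = 20,
) -> Dict[str, Any]:
    """Build a filtered, faceted search body for an AOI and date range.

    bbox is [west, south, east, north]. Elasticsearch envelopes are given
    as [[west, north], [east, south]].
    """
    west, south, east, north = bbox
    filters: List[Dict[str, Any]] = [
        {
            "geo_shape": {
                "geometry": {
                    "shape": {
                        "type": "envelope",
                        "coordinates": [[west, north], [east, south]],
                    },
                    "relation": "intersects",
                }
            }
        },
        {"range": {"properties.datetime": {"gte": start, "lte": end}}},
    ]
    if tags:
        filters.append({"terms": {facet_field: list(tags)}})
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        # Terms aggregation gives per-tag counts for the faceted search UI.
        "aggs": {"tags": {"terms": {"field": facet_field}}},
    }
```

Because everything is expressed as filters, the tag counts returned by the aggregation reflect only the results inside the chosen AOI and date range.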
#
Interoperability
- Container curation API
- Central data store of curated containers
- consume STAC items
- Curated container API
- Host on Docker Hub / public AWS ECR so they can be distributed
- Hosted APIs for instances would most likely not be part of the GeoConnections scope but fall under the C-CORE platform plan
- Are we providing an upload location for people's hosted containers?
- Transformers supporting S3, GCP, and local output locations
- What do we call these transformers?
- Methodology for adding external transformers into the C-CORE registry will be manual.
- Interoperability Web UI
- Transformer availability ties in to search results?
- Simple browser?