MIT lecture - Socrata Open Data Architecture
![Page 1: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/1.jpg)
Socrata and Open Data Architecture and Technology
Evan Chan
Principal Engineer
![Page 2: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/2.jpg)
Agenda
• Who is Socrata?
• What’s Open Data?
• The state of government IT
• How Socrata Enables Open Data
• The Socrata Architecture
• Scaling our Architecture
![Page 3: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/3.jpg)
Who is Socrata?
We are a Seattle-based software startup.
We make data useful to everyone.
Open, Public Data
Consumers
Apps
![Page 4: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/4.jpg)
Socrata is…
The most widely adopted Open Data platform
![Page 5: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/5.jpg)
What is Open Data?
![Page 6: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/6.jpg)
21st Century Government
Improved use of government data can:
• Lower the cost of healthcare
• Improve education systems
• Fight climate change
• Improve city safety
• Reduce the occurrences of crime
• Reduce bureaucratic inefficiencies
• Spur local innovation
![Page 7: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/7.jpg)
Governments Want Their Data to be Open
• Fundamental belief that transparent government is better
• Push to modernize government through APIs
• Belief that government data can be useful (think health inspection data and Yelp, or 911 data and Zillow)
![Page 8: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/8.jpg)
Most Compelling Datasets
1. Geospatial Data
2. Public Safety Data (Traffic, Crime, Environmental, Complaints)
3. Salary Data
4. Health Data
5. Expenditure Data
6. Education Data
7. Census Data
8. Parcel Property Data
9. Business Data
10. Locations of Government Services
![Page 9: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/9.jpg)
What is Socrata?
• Catalog to find datasets
• Tools for easily importing and updating datasets
• Simple data visualizations for exploring and showing data
• Reporting and application building environment
![Page 10: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/10.jpg)
Who uses Socrata?
External Data Consumers
Laura – Local Resident “How safe is my neighborhood?”
Aaron – Community Advocate “I want to see trends in social housing.”
Dave – App Developer “I need real-time API access to 911 data.”
Government Data Publishers
Dora – The Chief Data Officer “How do we connect our data to the web?”
Pam – Mayor’s Office “How do we share data to make better decisions?”
Sammy – Department Head “I need to shift to self-service digital channels.”
![Page 11: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/11.jpg)
Visualizations
Analysis
Discovery/SEO
Dashboards
Government Multilateral/NGO
Data Benchmarking / Prediction
Syndication to Consumer Web
Apps
![Page 12: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/12.jpg)
Our Architecture
![Page 13: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/13.jpg)
Information important to you
• Run our own datacenters (SEA/ORD)
• JavaScript/Ruby on the frontend
• Java/Scala on the backend
• Postgres, Cassandra, Kafka, Chef
• Hard and novel problems to solve and new backends to explore...
![Page 14: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/14.jpg)
Increase the flow of data
Drive mass consumption
![Page 15: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/15.jpg)
ingress
Diagram labels: browser, datasync client/API, customer data management system, file (CSV, .xls), API, dataset additions/updates, Socrata load balancer
1. Data is brought into the system. Data may be brought in via direct file upload or by using the datasync client or API, which maintains an efficient and robust transfer of the data.
2. Dataset additions, updates or deletes are communicated to the Socrata back end system across the Internet.
![Page 16: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/16.jpg)
ingress
1. Dataset update is routed to a request dispatcher.
2. The request dispatcher forwards the request to the data coordinator.
3. Data coordinator performs the dataset addition or update.
4. The truth service adds or alters the dataset. All primitive data types are addressed. The truth system gets annotations to the dataset from the annotation service and applies them as appropriate.
5. Data coordinator informs all appropriate query services, including information on the specific data needed.
6. Impacted query services retrieve the data.
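The six ingress steps above can be sketched in a few lines of Python. This is a minimal illustration, not Socrata's implementation: the class names `TruthStore`, `QueryService` and `DataCoordinator`, the `dispatch` function and the in-memory dictionaries are all hypothetical, and the annotation step is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class TruthStore:
    """Hypothetical stand-in for the truth service: holds the canonical rows."""
    datasets: dict = field(default_factory=dict)

    def upsert(self, dataset_id, rows):
        self.datasets.setdefault(dataset_id, {}).update(rows)


@dataclass
class QueryService:
    """Hypothetical secondary store that keeps its own copy for fast reads."""
    name: str
    truth: TruthStore
    copies: dict = field(default_factory=dict)

    def notify(self, dataset_id, changed_keys):
        # Step 6: pull only the rows the coordinator said had changed.
        source = self.truth.datasets[dataset_id]
        local = self.copies.setdefault(dataset_id, {})
        for key in changed_keys:
            local[key] = source[key]


@dataclass
class DataCoordinator:
    truth: TruthStore
    subscribers: list = field(default_factory=list)

    def apply_update(self, dataset_id, rows):
        # Steps 3-4: write to the truth store first.
        self.truth.upsert(dataset_id, rows)
        # Step 5: tell each query service which rows it needs.
        for service in self.subscribers:
            service.notify(dataset_id, list(rows))


def dispatch(coordinator, request):
    # Steps 1-2: the request dispatcher forwards the update to the coordinator.
    coordinator.apply_update(request["dataset_id"], request["rows"])


truth = TruthStore()
search = QueryService("search", truth)
coordinator = DataCoordinator(truth, [search])
dispatch(coordinator, {"dataset_id": "crimes", "rows": {1: {"crime_type": "assault"}}})
```

Writing to the truth store before notifying anyone means a query service never hears about rows it cannot yet fetch.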
![Page 17: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/17.jpg)
understanding
• dataset level
  – high level: health
  – more detailed:
    • health/restaurant/inspection
    • health/disease/infectious
• columnar data types
  – e.g. location/name/city, demographic/gender
• columnar schematic categories
  – e.g. crime_type, crimes from Boston and Chicago datasets
• columnar schematic category classifications, e.g. (assault, assault and battery, violent assault) > assault
• pivot points
  – e.g. neighborhood, city, business
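The category classification bullet, where several raw labels collapse into one canonical value, might look like the following in Python. The mapping table and the `classify` helper are hypothetical and only cover the assault example from the slide.

```python
# Hypothetical mapping from raw category labels to a canonical label.
CANONICAL_CRIME_TYPE = {
    "assault": "assault",
    "assault and battery": "assault",
    "violent assault": "assault",
}


def classify(raw_label, mapping=CANONICAL_CRIME_TYPE):
    """Return the canonical category for a raw label, or None if unmapped."""
    return mapping.get(raw_label.strip().lower())


# Labels as they might appear in two different cities' datasets.
boston = ["Assault and Battery", "violent assault"]
chicago = ["ASSAULT", "burglary"]
normalized = [classify(label) for label in boston + chicago]
```

Once both cities share a canonical `crime_type`, their rows can be compared or pivoted on the same value.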
![Page 18: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/18.jpg)
Diagram labels: curator, gold GT, crowdsourcer, GT, models, annotations, truth, search
1. A trusted curator prepares a gold ground truth (GT) by manually labeling datasets.
2. The Gold GT is used by a CrowdSourcer system, which coordinates jobs across untrusted, distributed Mechanical Turk workers to annotate the Gold GT set. Annotation quality is assessed via the Gold GT.
3. The CrowdSourcer leverages the distributed humans to annotate much larger sets of datasets.
4. Machine learning models are trained against the GT and applied against a larger set of datasets. In addition, trained models are applied in the synchronous workflow described above.
5. Model-based and crowdsourced annotations are stored in an annotation service.
6. The Truth system periodically queries the Knowledge system for the latest annotation mappings and applies them to its datasets.
7. Secondary services like Search are notified of changes. They pick up the changes, which are then available for query.
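One way to read "annotation quality is assessed via the Gold GT" is to score each worker on the gold items and keep only votes from workers who pass. The sketch below assumes exactly that; the `min_accuracy` threshold, the majority vote and all names are illustrative choices, not details given in the talk.

```python
from collections import Counter


def worker_accuracy(worker_labels, gold):
    """Fraction of gold items the worker labeled correctly (0.0 if none seen)."""
    scored = [item for item in worker_labels if item in gold]
    if not scored:
        return 0.0
    correct = sum(worker_labels[item] == gold[item] for item in scored)
    return correct / len(scored)


def aggregate(labels_by_worker, gold, min_accuracy=0.8):
    """Majority vote per non-gold item, counting only workers who pass the gold check."""
    trusted = [
        labels for labels in labels_by_worker.values()
        if worker_accuracy(labels, gold) >= min_accuracy
    ]
    votes = {}
    for labels in trusted:
        for item, label in labels.items():
            if item not in gold:
                votes.setdefault(item, Counter())[label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


gold = {"d1": "health", "d2": "crime"}
labels_by_worker = {
    "w1": {"d1": "health", "d2": "crime", "d3": "health"},
    "w2": {"d1": "health", "d2": "crime", "d3": "health"},
    "w3": {"d1": "crime", "d2": "health", "d3": "crime"},
}
annotations = aggregate(labels_by_worker, gold)
```

Here worker `w3` gets both gold items wrong, so its label for `d3` is ignored.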
![Page 19: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/19.jpg)
query
Diagram labels: browser, API app, queries, Socrata load balancer, app service platform (govstat, budget, core ux)
1. Citizens, reporters, and other users access our core ux or apps via a browser.
2. Apps run on our app service platform and generate queries to our back end services as needed.
3. 3rd party developers build apps using our API, which leverage back end services.
![Page 20: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/20.jpg)
1. The query is routed to a request dispatcher.
2. The request dispatcher’s query coordinator first checks if the query is cached; if so, the cached copy is returned.
3. The request dispatcher routes the query to the appropriate specialty subsystem to perform the query.
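The cache check followed by routing can be sketched as below. The `RequestDispatcher` class, the in-memory dictionary cache and the two backend names are hypothetical; the talk does not say how the cache is keyed or stored.

```python
class RequestDispatcher:
    """Hypothetical dispatcher: check a cache, then route to a backend."""

    def __init__(self, backends):
        self.backends = backends      # maps a query kind to a callable
        self.cache = {}
        self.backend_calls = 0

    def query(self, kind, text):
        key = (kind, text)
        if key in self.cache:         # step 2: return the cached copy
            return self.cache[key]
        self.backend_calls += 1
        result = self.backends[kind](text)   # step 3: route to the subsystem
        self.cache[key] = result
        return result


dispatcher = RequestDispatcher({
    "search": lambda text: f"search results for {text}",
    "geo": lambda text: f"geo results for {text}",
})
first = dispatcher.query("search", "restaurant inspections")
second = dispatcher.query("search", "restaurant inspections")
```

The second call is answered from the cache, so the search backend runs only once.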
![Page 21: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/21.jpg)
technologies
• db
  – postgres including postGIS
  – lucene/elastic search
  – spark: (sql, streaming)
  – cassandra (back end of govstat, dataset metrics)
• languages
  – scala, ruby, javascript, python
• platforms:
  – logging: sumo
  – build: jenkins
  – test: cucumber
  – cloud: (aws, azure, own) -> aws
  – machine learning: sklearn/scipy
![Page 22: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/22.jpg)
scaling strategy
1. Broad SODA v.x API is made more efficient via techniques like rollup tables.

1. Big gulp API is serviced through specialized secondary services. The API is highly controlled; additions are made with a clear understanding of scale cost.
2. Big gulp API query breadth is expanded over time, as SODA data and user throughput sizes are increased.
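A rollup table precomputes an aggregate once so that later queries read a small summary instead of scanning raw rows. The sketch below shows the general idea with Python's built-in `sqlite3` module; the table names, columns and sample rows are made up for illustration and say nothing about how Socrata builds its rollups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (city TEXT, crime_type TEXT)")
conn.executemany(
    "INSERT INTO crimes VALUES (?, ?)",
    [("Boston", "assault"), ("Boston", "theft"),
     ("Boston", "assault"), ("Chicago", "assault")],
)

# Build the rollup once instead of aggregating the raw rows on every query.
conn.execute(
    "CREATE TABLE crimes_rollup AS "
    "SELECT city, crime_type, COUNT(*) AS n "
    "FROM crimes GROUP BY city, crime_type"
)


def count_for(city, crime_type):
    """Answer an aggregate query from the rollup table, not the raw rows."""
    row = conn.execute(
        "SELECT n FROM crimes_rollup WHERE city = ? AND crime_type = ?",
        (city, crime_type),
    ).fetchone()
    return row[0] if row else 0
```

The trade-off is that the rollup must be rebuilt or updated whenever the underlying dataset changes.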
![Page 23: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/23.jpg)
![Page 24: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/24.jpg)
ingress tp
1e11~100G
![Page 25: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/25.jpg)
![Page 26: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/26.jpg)
![Page 27: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/27.jpg)
The SODA API
• Recently introduced a new version of our API
• Expressive, SQL-like language
• Provides the base for all other functionality
• Provides the base for 3rd parties to access data hosted by Socrata
![Page 28: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/28.jpg)
SoQL
A REST-like SQL inspired API for accessing and querying datasets
DEMO!
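In place of the live demo, here is a small sketch of how a SoQL query is expressed as URL parameters such as `$select`, `$where`, `$group` and `$limit` on a dataset's `/resource/` endpoint. The domain `data.example.gov`, the dataset identifier `abcd-1234` and the column names are placeholders, and the helper `soql_url` is hypothetical; no request is sent.

```python
from urllib.parse import urlencode


def soql_url(domain, dataset_id, **clauses):
    """Build a SODA resource URL; each keyword becomes a $-prefixed parameter."""
    params = {f"${name}": value for name, value in clauses.items()}
    # safe="$" keeps the leading dollar sign readable in the query string.
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params, safe='$')}"


url = soql_url(
    "data.example.gov",          # placeholder domain
    "abcd-1234",                 # placeholder dataset identifier
    select="crime_type, count(*)",
    where="city = 'Boston'",
    group="crime_type",
    limit=10,
)
```

Any HTTP client can then fetch `url`, which is what makes the API usable from a browser as well as from code.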
![Page 29: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/29.jpg)
How I feel using SoQL over HTTP
![Page 30: MIT lecture - Socrata Open Data Architecture](https://reader034.fdocuments.us/reader034/viewer/2022051400/55a68f9b1a28abda378b459e/html5/thumbnails/30.jpg)
The Scala SODA Client
• Abstract away JSON parsing and types
• Scala-like query syntax
• Returns results as a Future
• Internally uses Iteratees to stream and parse results
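The last two bullets describe a pattern that can be shown in any language: hand the caller a future straight away and parse the response incrementally rather than loading it whole. The Python analogue below is not the Scala client's API. It uses a generator in place of Iteratees, and it assumes newline-delimited JSON and an in-memory `StringIO` body purely to stay self-contained.

```python
import io
import json
from concurrent.futures import ThreadPoolExecutor


def parse_rows(stream):
    """Yield one parsed row per line, without loading the whole body."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by(stream, column):
    """Consume the row stream incrementally and tally one column."""
    counts = {}
    for row in parse_rows(stream):
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Stand-in for an HTTP response body; a real client would stream from the network.
body = io.StringIO(
    '{"crime_type": "assault"}\n{"crime_type": "theft"}\n{"crime_type": "assault"}\n'
)

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(count_by, body, "crime_type")   # returns immediately
    result = future.result()                              # block only when needed
```

The caller gets `future` back at once and can do other work until the tally is ready.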