What is Open Data?
From Wikipedia:
Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open-source data movement are similar to those of other “open(-source)” movements.
The biggest sources of open data, with specific examples, are:
- Governments and municipalities
  - Census data and national statistics
  - Data published under open government initiatives
- NGOs and non-profit organizations
  - Global development, immigration, etc.
  - United Nations
  - Eurostat
  - OECD
  - [WHO](https://www.who.int/gho/database/en/)
- Geographical data
  - Global development, immigration, etc.
- News and mass media
  - GDELT - an amazing database of world news sentiment analysis
  - Civil Unrest Events
- Science & Research
  - Health sciences provide a wealth of well-structured data
  - Growing number of machine learning datasets
  - Historical datasets
- For-profit organizations
  - Sometimes expose their data
  - Or allow their data to be scraped for research purposes
- Sports
Why Should You Care?
Open Data for the Good of Society
If you’re not frustrated with the current political situation, no matter which country you call home - you’re probably one of the people who stopped reading or watching the news altogether. Enjoy your bliss. But if you are frustrated, and are willing to do something about it - here’s an option to consider:
The open data and open government movements are at the top of my list of non-violent ways to deal with the present situation, where politicians and public figures largely manipulate people’s emotions and use highly divisive topics to push their agendas. I see them as a way to move society away from political arguments (aka “one who screams loudest - wins”) towards arguments grounded in factual data. This will not make the arguments disappear altogether, but would you rather argue about different ways of interpreting data, or perpetuate the popularity contests of focus-group-polished speeches?
On the government side, the hopeful trends are:
- Open Government (Canada)
- DigitalGov (US)
- Municipalities that go a few steps beyond to publish data in the open:
- 8 Open Government Data Principles
A great illustration of this potential of open data is the movement of data-driven journalism, see:
- International Consortium of Investigative Journalists, which investigated tax evasion schemes in the multi-terabyte Panama Papers and Paradise Papers leaks.
- Bellingcat
- Data Driven Journalism
- FlowingData
Open Data for Business
If fixing governments is not at the top of your priority list, consider that many successful businesses are built on open data.
In the report “Open data: Unlocking innovation and performance with liquid information”, McKinsey sought to quantify the potential value of open data, stating that:
Open data — public information and shared data from private sources — can help create $3 trillion a year of value in seven areas of the global economy.
Its uses can be categorized as:
- Business optimization
  - in areas such as: market analysis, targeted marketing, customer acquisition, and retention
  - e.g. using census data to identify the geographical areas and target demographics most receptive to the company’s products
- Enhancing the existing products
  - Google Maps uses GTFS data for transit schedules
- Business models centered around providing extra functionality on top of [what is or should be] open data
  - Yelp uses a database of businesses and municipal health inspections and augments it with search, ranking, and social features
  - WalkScore uses location data of shops, schools, and transit to compute a convenience rating for rental apartments
  - Mapbox uses open data to provide high-quality mapping solutions
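To make the census example above a little more concrete, here is a minimal Python sketch of ranking regions by the share of residents who fall into a target demographic. Everything in it is an assumption for illustration: the `Region` type, the `rank_regions` helper, the `min_population` cutoff, and the figures themselves are made up, and a real analysis would load published census tables instead.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    population: int         # total residents (made-up figure)
    target_population: int  # residents in the target demographic (made-up figure)


def rank_regions(regions, min_population=0):
    """Return regions sorted by share of the target demographic, highest first.

    Regions smaller than min_population (or with no residents) are skipped,
    so that tiny areas with a high share do not dominate the ranking.
    """
    eligible = [
        r for r in regions
        if r.population > 0 and r.population >= min_population
    ]
    return sorted(
        eligible,
        key=lambda r: r.target_population / r.population,
        reverse=True,
    )


# Illustrative rows only; not real census data.
regions = [
    Region("Region A", 12000, 1800),
    Region("Region B", 8000, 2400),
    Region("Region C", 500, 300),
]

ranked = rank_regions(regions, min_population=1000)
for r in ranked:
    share = r.target_population / r.population
    print(f"{r.name}: {share:.0%} of {r.population} residents")
```

The population cutoff is one possible design choice, not a rule: without it, a very small region with a high share would rank first even though it offers few potential customers.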
Open Data in Machine Learning
The big data and machine learning fields are at the peak of the hype cycle. Expo floors at data conferences are full of startups and mature companies tackling big data problems for enterprises, and of hundreds of AI startups that are always hungry for clean datasets. Very little of this hype has touched open data, because these companies are currently directing their efforts at data produced internally within businesses.
If a business needs a model built but doesn’t produce the data needed to build it internally (e.g. not all businesses that need a behavioral model of users are in the psychology field), this is when attention turns to open data. In this case, you’ll be looking at a very scarce collection of datasets, usually built and open-sourced by universities, for example:
- Cohn-Kanade facial expressions
- RAVDESS for audio emotions
- NRC Word-Emotion Association Lexicon
- ImageNet
Most open datasets are old, well known, and used by almost every scientist in the respective field, simply because there aren’t many alternatives.
Building a dataset is a very labor-intensive undertaking, typically done by universities, and unfortunately, they are not eager to open up the datasets they’ve built to the public. Even getting access to data for research purposes is often gated to the point of near undiscoverability.
I think the supply of and demand for data are already completely out of proportion, and with demand growing so rapidly, the gap continues to widen.
Conclusion
The open data space is still in its infancy. The amount of open data is constantly growing, but even the data that is already out there remains frustratingly underutilized and undervalued. There are many creative ways to use it waiting to be discovered by entrepreneurs.
The following is a list of problems that I believe need to be solved to unlock open data’s true potential:
- Discoverability
- Lack of cohesion
- Accessibility
- Monopolization of data
- Dementia-by-Design - a data design anti-pattern that
- Lineage / Provenance
I will dedicate my next few posts to these issues, their possible causes, and long-term perspectives.