Exploring the Scope and Severity of the World's Biggest Data Breaches and Hacks from 2004 to 2022
Purpose: To complement other research I conducted in fall 2022 on data privacy from both a historical and a policymaking standpoint, I created this data storytelling piece to illustrate the dangers that data breaches and hacks have posed to privacy worldwide. My hope is that this project conveys some key lessons for the global population as we continue to release, collect, analyze, and store more and more personal, financial, and biometric data online.
Research Question: I focused on answering one broad question throughout this series of data visualizations: What lessons can we learn from the past 15 years of global data breaches and hacks? 
My hypothesis was that data breaches and hacks would prove to have grown in almost every category, but specifically in size (the number of data records lost) and severity (the level of data sensitivity, the frequency of events, and the number of “interesting story” events). As part of this investigation, I hoped to highlight and discuss specific, concrete examples of data loss and to begin to illuminate the human toll of, and the greater meaning behind, what I suspected was a trend toward a general mass loss of data privacy over time.
Results: To illustrate key points in the data, I produced a Tableau Story visualization in six parts, each investigating a bite-size piece of the full dataset. My exploration found that data privacy violations have indeed increased and escalated in severity alongside the growth of the global datasphere, though that growth has been irregular. Data was lost across sectors, but no sector lost as much as the overarching “web” sector that emerges as time goes on. This makes sense: the dataset begins in 2004, and the biggest breaches of the early years included physical thefts (e.g. stolen hard drives or physical documents), whereas in later years breaches segued entirely into the digital, online space.
Screenshots from the project are below; link to the full visualization can be found here.
Methods: Downloaded data from Information is Beautiful; cleaned data in Excel/Google Sheets; connected the data to Tableau and visualized it via charts, graphs, and dashboards, then compiled a Tableau Story; drafted a report walkthrough of the project, included below.
Skills: Data management, Data storytelling, Explanatory post, Thought piece
Tools Used: Tableau Desktop/Public, Excel, Google Sheets
Learning Outcomes: Communication, Research, User-centered design, Critical perspectives
Story Panel #1
Story Panel #2
Story Panel #3
Story Panel #4
Story Panel #5
Story Panel #6
Research & Background
For this research and data visualization project from fall 2022, I used the original “World’s Biggest Data Breaches and Hacks” dataset, which authors David McCandless, Tom Evans, and Paul Barton make available online via Information is Beautiful. The dataset’s sources include information compiled from IdTheftCentre and DataBreaches.net, as well as news reports from New York Times, Forbes, The Guardian, Tech Radar, BBC, PC Mag, Tech Crunch and more.
My research focused on a broad investigative question: What lessons can we learn from the past 15 years of global data breaches and hacks? My hypothesis was that data breaches and hacks would prove to have grown in almost every category, but specifically in size (the number of data records lost) and severity (the level of data sensitivity, the frequency of events, and the number of “interesting story” events). As part of this investigation, I hoped to highlight and discuss specific, concrete examples of data loss and to begin to illuminate the human toll of, and the greater meaning behind, what I suspected was a trend toward a general mass loss of data privacy over time.
At first glance, the data appeared to show incidents of hacking, data breaches, data leaks, and accidental security lapses growing in size and scope since the timeline starts in 2004. As I read further into the data stories, I found some dramatic recent examples of data loss, including the loss of 900,000 records from a police database in China, a recent leak of Dubai property data illuminating illicit money and criminal investments, and a hack of streaming platform Twitch in 2021 that exposed salary and payouts alongside technical details of new products and platforms. 
Defining the multifaceted nature of data privacy
To trace the idea of data privacy is to trace the growth of the internet, and the growth of its use over time. Internet access has grown exponentially with the proliferation of smartphones and personal devices (Pew Research Center, 2019), which in tandem has increased the amount of data being collected, stored, and analyzed by essentially all companies, institutions, and governments with a web presence. At the same time, the price of surveillance technology has decreased, allowing businesses both large and small to engage in data collection (Heavin et al., 2020). The global “datasphere,” too, is growing exponentially: it is projected to surpass 175 zettabytes by 2025 (Kushmaro, 2021). This cascade of effects, coupled with the lack of regulation in the data privacy space, has created a veritable nightmare for individuals wishing to keep their sensitive data private.
A fundamental tension in the world of data collection and analysis contributes to this conflict. The more detailed the data provided to a researcher for analysis, the more useful it is in drawing conclusions. On one side of the spectrum lies privacy, and on the other, utility (Stewart, 2020). For instance, for medical researchers studying a rare disease’s progression through randomized controlled trials, the more information they have about a population that has the disease relative to one that does not, the more variables they can study for potential significance. Significance, in this case, might lead to better treatment and awareness of a disease. However, in the context of a for-profit company looking to generate advertising revenue through data collection, the greater the volume and detail of the data collected, the more it is generally worth to corporations, and the more it is used for arguably less altruistic purposes.
Data privacy violations on the rise
Publicly available data raises privacy concerns as well. The collection and release of mass amounts of personal data can still have far-reaching implications, even if the data was not necessarily kept private. For instance, in 2017, Strava, the running app, accidentally identified a secret military base by publishing its worldwide running routes, including regular laps taken by these particular armed forces (Stewart, 2020). There are also emerging concerns about the reidentification of what was thought to be anonymized data, such as when the Netflix prize competition ended up outing an LGBTQ individual (Singel, 2010).
Underscoring this conversation is the recognition that internet use is becoming ubiquitous for an increasingly younger population. Children’s internet use is higher than ever. One highly popular virtual world-building game, Roblox, “rose by over 20% in popularity [in 2021 alone], with 56% of kids playing the game worldwide” (Qustodio, 2021). In the same study, children’s time spent on IXL, a subscription learning service, rose by 46%. YouTube remained children’s top video streaming app despite recently having settled a major COPPA lawsuit about its illegal data collection practices (Federal Trade Commission, 2019). All of these tools have faced scrutiny due to lack of transparency regarding sharing or selling data, lack of attention to children’s safety on their platforms, and/or concerns over security of the data they store (Common Sense, 2021). Roblox, in particular, was the subject of a major hack in 2020 (Cox, 2020).
Returning to the research question: What lessons can we learn from the past 15 years of global data breaches and hacks?
Methodology
The Information is Beautiful dataset included 16 column variables with information on each event, including company name, year and date, sources that reported on the breach or hack, and one column containing a variable called “interesting story,” flagging examples such as the Twitch breach or Dubai real estate reporting. Individual events often were listed with both a main sector and a corresponding sub-sector, such as “government, health.”
After locating and downloading the dataset, I checked and cleaned it in Excel, then manually added one column for country data after cross-referencing the source materials for each hack or breach incident. I also transformed the “interesting story” variable into a boolean (true/false) variable. Then, for ease of reading, I grouped the sectors together by main sector only in Tableau Desktop.
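The cleaning steps described above were done by hand in Excel and Tableau; a minimal Python sketch of the same transformations might look like the following. The column names (`entity`, `sector`, `interesting_story`), the sample rows, and the `COUNTRY_LOOKUP` table are hypothetical stand-ins, not the dataset's actual headers or values.

```python
# Hypothetical sample rows; the real dataset's headers and values may differ.
rows = [
    {"entity": "Example Agency", "sector": "government, health", "interesting_story": "y"},
    {"entity": "Example Shop", "sector": "retail", "interesting_story": ""},
]

# Hypothetical lookup standing in for the manual country cross-referencing.
COUNTRY_LOOKUP = {"Example Agency": "United States"}


def main_sector(sector: str) -> str:
    """Keep only the main sector from a value such as 'government, health'."""
    return sector.split(",")[0].strip().lower()


def to_bool(flag: str) -> bool:
    """Treat any non-blank flag as True (an assumption about how the column is filled)."""
    return bool(flag and flag.strip())


def clean(row: dict) -> dict:
    """Return a cleaned copy of one row with the three transformations applied."""
    cleaned = dict(row)
    cleaned["sector"] = main_sector(row["sector"])
    cleaned["interesting_story"] = to_bool(row["interesting_story"])
    cleaned["country"] = COUNTRY_LOOKUP.get(row["entity"], "Unknown")
    return cleaned


cleaned_rows = [clean(r) for r in rows]
```

Defaulting to "Unknown" when an entity is missing from the lookup makes any rows that still need manual cross-referencing easy to spot.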
In each of the standalone visualizations, I adjusted the cut points and colors to make each as readable as possible and to reflect “warnings” in colors like reds and oranges, and more educational information in neutral colors. I also grouped related visualizations together in a dashboard to tell a more complete story about one subtopic, for instance, the nature and scope of leaks within the government and military sector. In the map visualization, I edited the color, the number ranges displayed in the legend, and the cluster sizes of the circles displaying the size of the records lost in order to make the results clearer.
A major limitation of this visualization is Tableau itself, as the software does not allow for much customization within the app. For instance, ideally I would have been able to label each number in thousands, millions, or billions as needed, but Tableau forces a single choice among these for labeling. The map is also shown only in the Mercator projection, which notoriously distorts the relative sizes of regions.
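For comparison, the per-value labeling described above is straightforward outside Tableau. The sketch below is only an illustration of the idea; `abbreviate` is a hypothetical helper, not a Tableau feature, and the one-decimal rounding is an arbitrary choice.

```python
def abbreviate(n: float) -> str:
    """Format a count with the largest suffix that fits: K, M, or B."""
    for threshold, suffix in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if abs(n) >= threshold:
            return f"{n / threshold:.1f}{suffix}"
    # Values under one thousand are shown as whole numbers.
    return f"{n:.0f}"


labels = [abbreviate(v) for v in (950, 900_000, 1_500_000, 3_000_000_000)]
```

Each value picks its own unit, so a chart mixing thousand-record and billion-record incidents would stay readable.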
Results
I uploaded the cleaned spreadsheet to Google Drive, connected Tableau Desktop to the Drive to query it, then created a Tableau Story as a series of six standalone visualizations and dashboards. I published the final product to Tableau Public.
The first part of the Tableau story includes an introductory line chart that summarizes the scope of the issue over time, highlighting the aggregated number of data records lost each year since the timeline begins in 2004.
Tableau Viz #1: Since 2004, the world's biggest data breaches and hacks have steadily, though irregularly, increased.
The first story tab shows a timeline of data loss by size (number of data records lost) from 2004 to 2022. 
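The aggregation behind a timeline like this one is a sum of records lost per year. A minimal sketch follows, assuming each incident reduces to a (year, records lost) pair; the numbers are made up for illustration and are not values from the dataset.

```python
from collections import defaultdict

# Hypothetical (year, records_lost) pairs; not values from the dataset.
incidents = [(2004, 1_000), (2004, 2_500), (2005, 4_000), (2007, 500)]

# Sum the records lost across all incidents reported in the same year.
totals = defaultdict(int)
for year, records_lost in incidents:
    totals[year] += records_lost

# Sorted (year, total) pairs are the points a line chart would plot.
series = sorted(totals.items())
```

Years with no incidents (2006 in this sample) simply do not appear in `series`; whether to plot them as zero is a presentation choice.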
Tableau Viz #2: Data was lost across sectors, through a variety of methods.
The second tab features a Tableau dashboard with a series of charts showing how the data was lost across sectors (web, healthcare, app, retail, gaming, transport, financial, tech, government, telecoms, legal, media, academic, energy, military) and by method (hacking, inside job, mistake – “oops!”, poor security, or lost device).
Tableau Viz #3: U.S.-based organizations are responsible for the overwhelming majority of data loss over time, though the largest individual incidents have occurred all over the globe.
The third tab is another dashboard that includes a global map of the locations of companies that experienced major breaches and hacks. The map can serve as a jumping-off point for potential discussion and investigation into how a country’s geographic location may impact data privacy and protections with regard to oversight or regulation of companies’ data management. This tab also includes the “top ten” list of the biggest hacks and breaches within the entire dataset, grouped by organization or entity.
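One way to read "grouped by organization or entity" is to total the records lost per organization and keep the ten largest totals. The sketch below assumes that reading; the organization names and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical (organization, records_lost) pairs; not values from the dataset.
incidents = [("Org A", 500), ("Org B", 300), ("Org A", 250), ("Org C", 100)]

# Total the records lost per organization across all of its incidents.
totals = Counter()
for organization, records_lost in incidents:
    totals[organization] += records_lost

# most_common(10) returns up to ten (organization, total) pairs, largest first.
top_ten = totals.most_common(10)
```

Ranking single incidents instead of per-organization totals would be an equally plausible reading and could produce a different list.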


Tableau Viz #4: Some of the most concerning leaks, to date, occurred across web, financial, health, and government sectors.
The fourth and fifth tabs show a selection of the most concerning data hacks and breaches by level of data sensitivity. 
Tableau Viz #5: What happened to cause some of the most notable spikes in sensitive data loss?
Tableau Viz #6: Now that you've learned more, go ahead and explore the full dataset and visit interesting story data highlighted alongside the incidents.
The sixth and final tab is a reimagined version of the original Information is Beautiful project, which encourages the user to explore the entire dataset as one interactive visualization. My edited version does not show all the available variables, but highlights both the company involved and the size of the incident (via number of records lost) alongside any “interesting story” information associated with it. I configured pop-ups to link to those stories with titles or topics, so that the reader can learn more about where and when major breaches and hacks were reported.


Conclusion & Areas for Further Study
As I investigated this dataset, I found that data privacy violations have indeed increased and escalated in severity alongside the growth of the global datasphere, though that growth has been irregular. I found that data was lost across sectors, but no sector lost as much as the overarching “web” sector. This makes sense: the dataset covers incidents from 2004 on, and the biggest breaches of the early years included physical thefts (e.g. stolen hard drives or physical documents), whereas in later years breaches segued entirely into the digital, online space.
This is an area for further analysis, but in drilling down into certain methods and sectors, I also found that the increase in data loss seemed skewed by certain gigantic breaches and hacks, such as the loss of police data in Shanghai or the breach at J.P. Morgan (both highlighted in the Tableau story data). Many of these were accompanied by a slew of media coverage. I tried to illuminate the story and the human toll of this data loss in the tab highlighting specific breaches alongside some of the most “concerning” data breaches and hacks across the health and government sectors.
Time and the Tableau platform were limitations of this project. Given more time and fewer constraints on customization, there is potential for growing and investigating this dataset further. More calculated fields and parameters could be included to highlight each country’s top ten data breaches and hacks by size and by sector. Further research could examine how the legal landscape and cultural context of each affected company’s geographic location may impact its vulnerability to hacking or breaches. Featuring texts, links to outside images, and a more illustrative timeline with interesting stories and images would all help underscore the complexity and urgency of addressing this issue.
As a Data Analytics and Visualization graduate student, I find issues surrounding data privacy particularly important. Only three months into this investigation, I became acutely aware of the vastness of the topic of data privacy and the corresponding urgent debate both within the United States and globally about how to address mass data loss. Researchers and leaders across fields of information sciences, law, health, education, politics, and more have spent and continue to spend their entire lives studying the concept of data privacy and protection. This visualization begins to discuss just a small part of that story.
Sources
Cox, J. (2020, May 4). Hacker bribed ‘Roblox’ insider to access user data. VICE. Retrieved from https://www.vice.com/en/article/qj4ddw/hacker-bribed-roblox-insider-accessed-user-dat-reset-passwords
Feldman, A. (2022, December 16). Whither Data Privacy? INFO 601-03: Foundations of Information.
Kushmaro, P. (2021, June 7). Why Data Privacy Is A Human Right (And What Businesses Should Do About It). Forbes. Retrieved from https://www.forbes.com/sites/forbescommunicationscouncil/2021/06/07/why-data-privacy-is-a-human-right-and-what-businesses-should-do-about-it/?sh=6fe75a4ec3ca
McCandless, D. (2022, June 1). World’s biggest data breaches & hacks. Information is Beautiful. Retrieved from https://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks
Privacy program. The Common Sense Privacy Program. (n.d.). Retrieved from https://privacy.commonsense.org
Singel, R. (2010, March 12). Netflix cancels recommendation contest after privacy lawsuit. Wired. Retrieved from https://www.wired.com/2010/03/netflix-cancels-contest/
Singer, N., & Krolik, A. (2020, January 13). The New York Times. Retrieved from https://www.nytimes.com/2020/01/13/technology/grindr-apps-dating-data-tracking.html
Warzel, C., & Ngu, A. (2019, July 10). Google’s 4,000-word privacy policy is a secret history of the internet. The New York Times. Retrieved December 14, 2022, from https://www.nytimes.com/interactive/2019/07/10/opinion/google-privacy-policy.html