Each investigation gives the ICIJ data team an opportunity to learn, and the Uber Files was no different.

This was not the first time we had worked with a large leaked dataset. But rather than mapping intricate networks of offshore companies or tracing the flow of dirty money from country to country, the Uber Files presented a new challenge: linking digital calendar events to real meetings and sketching a picture of relationships based on exchanges of messages between high-level, influential players in transport policy and industry.

The Uber Files are based on a collection of more than 124,000 records, including 83,000 emails and other files created between 2013 and 2017, a period when the American ride-hailing giant was expanding across the world. The files were leaked to The Guardian and shared with the ICIJ. After publication, former Uber lobbyist Mark MacGann came forward as the source of the leak.

The Uber Files included communications between business executives, as well as messages with key political figures and their representatives that showed Uber’s tactics in trying to gain access to markets around the world. The files contained details of meetings Uber lobbyists held with world leaders and other public officials to try to influence legislation and revealed the company’s use of stealth technologies and evasive tactics to thwart regulators and law enforcement in at least six countries.

After receiving the files from The Guardian, the ICIJ uploaded them to its bespoke search platform, Datashare, allowing journalists from more than 40 media partners in 29 countries to search and review the leaked documents.

Here are some questions and answers about ICIJ’s methods in processing and analyzing Uber Files data.

How did the Uber Files records differ from previous leaks to the ICIJ?

The Uber Files records were mostly emails, which accounted for 83,000 of the leak’s 124,000 files. Other ICIJ projects have involved a more varied mix of file types.

Some of the first questions the ICIJ had to answer were what information could be structured from the emails, texts and calendar items in the leak, and whether external databases could help with research and reporting.

And the subject was new. Previous leaks investigated by the ICIJ have focused primarily on the offshore financial system, and efforts to structure the data (i.e. organize information into standardized, searchable fields) have focused on highlighting information related to individuals and companies using entities registered in secrecy jurisdictions. In the case of the Uber files, the records centered on communication and lobbying efforts to try to influence key stakeholders in different countries and regulations.

How did the ICIJ identify the major personalities in the Uber files?

Using software tools and programming languages such as Apache Tika, Python and Pandas, the ICIJ extracted the email addresses and the names and domains associated with them. The information was organized in a spreadsheet to help identify the names of key people, including politicians and government officials, whom Uber executives had contacted.
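The extraction step described above can be illustrated with a minimal sketch. It is not the ICIJ's actual code: it uses only Python's standard library rather than Apache Tika or Pandas, and the sample message, names and addresses are invented for illustration.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical sample message; a real leak would hold thousands of these.
RAW_EMAIL = """\
From: Jane Doe <jane.doe@example.com>
To: "Minister Smith" <smith@ministry.example.org>, aide@ministry.example.org
Cc: John Roe <john.roe@example.com>
Subject: Meeting next week

Body text.
"""


def extract_contacts(raw_message):
    """Return (display name, address, domain) for each sender and recipient."""
    msg = message_from_string(raw_message)
    headers = []
    for field in ("From", "To", "Cc", "Bcc"):
        headers.extend(msg.get_all(field, []))
    contacts = []
    for name, address in getaddresses(headers):
        if "@" not in address:
            continue  # skip malformed or empty entries
        address = address.lower()
        domain = address.rsplit("@", 1)[1]
        contacts.append((name, address, domain))
    return contacts


contacts = extract_contacts(RAW_EMAIL)
# Counting domains helps surface government or institutional addresses.
domain_counts = Counter(domain for _, _, domain in contacts)
```

Each tuple in `contacts` could then be written out as one spreadsheet row, and `domain_counts` gives a quick view of which organizations appear most often.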

The ICIJ reviewed three types of group calendar files. The ICIJ extracted details of planned meetings between Uber representatives, politicians and officials, then reviewed thousands of emails and internal messages to confirm that the meetings had taken place. Additionally, the ICIJ explored public records in countries and institutions where officials are required to report their meetings and schedules.

The ICIJ found more than 100 meetings that took place from 2014 to 2016 between Uber executives and officials, including 12 with European Commission officials, that had not been made public. Company executives held private meetings with at least six world leaders, a vice president and three deputy prime ministers, according to the analysis.
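One way to picture the comparison between meetings found in the leak and those reported publicly is the sketch below. It is an assumption-laden illustration, not the ICIJ's method: the records, field names and the exact-match rule on date and official name are all hypothetical, and real data would need looser matching plus manual review by reporters.

```python
from datetime import date

# Hypothetical records; every name, date and institution here is invented.
leaked_meetings = [
    {"date": date(2015, 3, 4), "official": "A. Example", "institution": "Example Commission"},
    {"date": date(2015, 6, 18), "official": "B. Sample", "institution": "Example Ministry"},
]
public_register = [
    {"date": date(2015, 3, 4), "official": "a. example ", "institution": "Example Commission"},
]


def match_key(record):
    """Normalize case and whitespace so trivially different spellings still match."""
    return (record["date"], " ".join(record["official"].lower().split()))


def undisclosed(leaked, register):
    """Return leaked meetings with no matching entry in the public register."""
    disclosed = {match_key(r) for r in register}
    return [m for m in leaked if match_key(m) not in disclosed]


missing = undisclosed(leaked_meetings, public_register)
```

Here `missing` holds only candidates for follow-up; a meeting absent from a register may simply fall outside that register's reporting rules.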

The ICIJ also used public databases such as the European Union Transparency Register, the US Senate Disclosure Logs, and the French Lobbying Register.

The ICIJ and its partners were also able to review correspondence between the company and academics who published research favorable to the company, showing that the research was coordinated. The files revealed what data Uber provided to academics, what lobbying messages the research would support and in which countries, what media outlets academics would appear on to present research findings, and what messages academics would convey to the general public and politicians. The ICIJ also searched publicly available academic article databases to identify other publications funded or supported by Uber, often featuring current or former Uber employees as co-researchers. The findings on the academic research appeared in articles by The Guardian and other ICIJ media partners.

The ICIJ also identified several spreadsheets in the leaked files that contained information about what the company called potential “stakeholders” of interest to Uber. The ICIJ combined the information into one master spreadsheet and organized it by country to help partners with their reporting. The research showed that Uber, with the help of a consulting firm, had compiled a list of more than 1,850 stakeholders, including current and former public servants, think tanks and citizens’ groups, in 29 countries and in EU institutions.
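Combining several spreadsheets into one master list organized by country might look something like the following sketch. The column names, sample rows and de-duplication rule (same name and country) are assumptions made for illustration; the layout of the leaked spreadsheets is not described here.

```python
import csv
import io

# Hypothetical stand-ins for two stakeholder spreadsheets exported as CSV.
SHEET_A = "Name,Role,Country\nA. Example,Former minister,France\nB. Sample,Think tank,Belgium\n"
SHEET_B = "name,organisation,country\nC. Person,Citizens' group,France\nA. Example,,France\n"


def load(text, source):
    """Read one sheet, normalizing header case and trimming cell whitespace."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        clean = {k.strip().lower(): (v or "").strip() for k, v in row.items()}
        clean["source"] = source  # keep provenance for later fact-checking
        rows.append(clean)
    return rows


def build_master(sheets):
    """Merge sheets, drop repeated name/country pairs, sort by country then name."""
    seen = set()
    master = []
    for source, text in sheets.items():
        for row in load(text, source):
            ident = (row["name"].lower(), row["country"].lower())
            if ident in seen:
                continue
            seen.add(ident)
            master.append(row)
    return sorted(master, key=lambda r: (r["country"], r["name"]))


master = build_master({"sheet_a": SHEET_A, "sheet_b": SHEET_B})
```

Keeping a `source` column on every row makes it easier to trace each entry back to the file it came from during verification.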

How did you go about quantifying the scale of Uber’s operations over time?

The ICIJ used the Internet Archive’s Wayback Machine to analyze previous versions of Uber’s website and gather historical information on how the company’s operations changed and where it expanded over time.
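For readers who want to try a similar approach, the sketch below builds a query for the Wayback Machine's CDX search endpoint and turns a response into snapshot links. It is an illustration only: the page address `uber.com/cities` and the sample response are hypothetical, the code makes no network request, and nothing here is confirmed as the ICIJ's own workflow.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page, start_year, end_year):
    """Build a CDX query asking for roughly one successful capture per year."""
    params = {
        "url": page,
        "from": str(start_year),
        "to": str(end_year),
        "output": "json",
        "collapse": "timestamp:4",  # collapse captures sharing the same year
        "filter": "statuscode:200",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_json):
    """Convert a JSON CDX response (header row first) into snapshot links."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in captures]


# Hypothetical response body, shaped like the JSON output of the CDX server.
SAMPLE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,uber)/cities", "20140701000000", "https://www.uber.com/cities", "text/html", "200", "AAAA", "1234"],
])

query = cdx_query_url("uber.com/cities", 2013, 2017)
snapshots = snapshot_urls(SAMPLE)
```

Each link in `snapshots` points to an archived copy that can be read by hand to see which markets the site listed at the time.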

The ICIJ also used Uber’s financial statements filed with the U.S. Securities and Exchange Commission to track growth since 2019, when the company went public.

What were the biggest challenges in analyzing the Uber Files dataset?

Like all leaks, the documents represent only a small slice of a larger reality. Not all of the countries where Uber had operations were mentioned in the records. The records only covered up to 2017. As always, further research and reporting was required.

The ICIJ also encountered a discrepancy between what the company reported in its financial statements and what it told users on its website about the number of markets it was in. The team had to verify the methodological differences between the two figures. Uber finally told the ICIJ that “in terms of the number of cities, the way we calculated that number changed in 2020.”

The ICIJ had to analyze lobbying and meeting-disclosure records from different countries, taking into account variations in the type of data available and in its quality. Not all countries have lobbying regulations or public lobbying registries, and not all require politicians to report meetings they hold as part of their official responsibilities.

Any advice for dealing with this type of dataset?

Remember to link the disclosed data to public records. This aids validation efforts and provides additional information that may be useful for analysis.

Understand the limitations of each dataset, its structure and quality, and what questions can or cannot be answered through data analysis.

Review the regulations in each country and how they affect data collection and analysis. Talk to experts to get a sense of how regulations are enforced in the real world.

When structuring data and performing different types of analysis, document the process and allow ample time for fact-checking. All data collection and analysis performed by ICIJ for Uber Files has been verified.

Will the ICIJ release Uber Files data?

The ICIJ does not disclose personal data in bulk. It will continue to explore the datasets with media partners. More than 180 journalists spent months searching the data for stories of public interest. If you have any tips or information you would like to share with the team of journalists who worked on the Uber Files, you can email [email protected]