03/09/2019
New York City, a bustling metropolis, relies heavily on its vast network of taxis and for-hire vehicles to keep its residents and visitors moving. From the iconic yellow cabs to the ubiquitous ride-hailing services like Uber and Lyft, millions of journeys occur daily. But behind this seamless flow of transport lies an immense and meticulously collected dataset, chronicling over 3 billion trips since 2009. Understanding where this invaluable data comes from, how it's managed, and what insights it offers is crucial for anyone interested in urban planning, transport efficiency, or big data analytics.

The primary custodian and source of this monumental dataset is the NYC Taxi & Limousine Commission (TLC). Since its inception in 1971, the TLC has been responsible for overseeing the city's standard taxis and a growing fleet of for-hire vehicles. Over the years, the Commission has diligently aggregated millions of records, providing an unprecedented look into the pulse of New York's transport system. This publicly available data offers a unique lens through which to examine everything from passenger pickup and drop-off times to trip costs, payment types, and even driver earnings.
The Evolution of Data Formats: From CSV to Parquet
Data collection and storage methods are constantly evolving, and the NYC taxi data is no exception. Historically, the raw trip data provided by the TLC was in the familiar Comma Separated Values (CSV) format. This format, while simple and widely compatible, can become cumbersome with extremely large datasets due to its row-by-row processing nature and potential for file size inefficiencies.
A significant shift occurred in May 2022 when the TLC made a pivotal change, transitioning the raw data format from CSV to Apache Parquet. This wasn't merely a change for new files; it involved a full replacement of all historical trip data files, meaning records stretching back to 2009 were re-released in the Parquet format. Parquet is a columnar storage format, which offers several advantages for big data analytics, including more efficient data compression, faster query performance for specific columns, and improved interoperability with various data processing frameworks.
This transition necessitated updates to the scripts and processes used by data analysts and researchers to handle this information. For those still preferring a PostgreSQL database setup, an intermediate step was introduced where each Parquet file is first converted into a CSV before being loaded using PostgreSQL's COPY command. Alternatively, a separate set of scripts was developed to load the Parquet files directly into a ClickHouse database, a columnar database management system known for its high performance in analytical queries. This update also coincided with the addition of several new columns to the High Volume For-Hire Vehicle (Uber, Lyft) trip files, including detailed information about passenger fares, driver pay, and the time drivers spent waiting for passengers, with these new fields available retrospectively to February 2019.
It's important to note that this change means older scripts designed for the original CSV files are no longer compatible with the data available directly from the TLC's website. If one happens to possess the old CSV files, an older version of the data processing code would be required to work with them.
Types of Vehicles and Data Collection Timelines
The NYC taxi data encompasses several categories of vehicles, each with its own history of data aggregation:
- Yellow Taxis: These iconic medallion cabs have been collecting data since 2009, representing the longest continuous record of trips. They operate primarily in Manhattan and at airports, able to pick up street hails.
- Green Taxis (Boro Taxis): Introduced in mid-2013, green taxis are designed to serve passengers in the outer boroughs and northern Manhattan, areas where yellow cabs were less common. Their data collection began concurrently with their introduction.
- For-Hire Vehicles (FHVs): This category includes ride-hailing apps like Uber, Lyft, Via, and Juno, which have grown exponentially in popularity. Data for FHVs has been collected since 2015, providing crucial insights into the evolving landscape of urban transport. The more specific High Volume For-Hire Vehicles (FHVHV) data, which includes the detailed fare and driver pay information, became available from February 2019.
The comprehensive nature of this data allows for deep analysis into how each segment of the taxi and for-hire vehicle market operates, their geographical distribution, and their impact on urban mobility patterns.
The Richness of the Data Schema
The raw data, once processed and loaded into a database (whether PostgreSQL or ClickHouse), is structured into several interconnected tables, providing a holistic view of each trip and related contextual information. Understanding this schema is key to extracting meaningful insights.
| Table Name | Description | Key Data Points |
|---|---|---|
trips | Contains all yellow and green taxi trips. | Cab type ID, pickup/drop-off times, trip distance, fare amounts, payment type, tip amounts. |
fhv_trips | Contains all for-hire vehicle trip records (Uber, Lyft, etc.). | Pickup/drop-off times, base licence number, (since Feb 2019) passenger fare, driver pay, waiting time. |
fhv_bases | Maps FHV trip records to base names and 'doing business as' labels. | Base licence number, base name, doing business as (DBA) name (e.g., 'Uber', 'Lyft'). |
nyct2010 | NYC census tracts and neighbourhood tabulation areas. | Census tract ID, neighbourhood name, borough. |
taxi_zones | TLC's official taxi zone boundaries. | Zone ID, zone name, borough. (Crucial since July 2016, as direct GPS coordinates are no longer provided, replaced by zone IDs). |
central_park_weather_observations | Summary weather data by date. | Date, temperature, precipitation, snow depth. |
A notable change in the data collection methodology, particularly for yellow and green taxis, occurred in July 2016. Prior to this, trip records included precise pickup and drop-off GPS coordinates. However, since then, the TLC no longer provides these coordinates directly. Instead, each trip record now comes with a 'pickup location ID' and 'drop-off location ID', which correspond to the official TLC taxi zones. This anonymisation helps protect passenger privacy while still allowing for geographical analysis at a zone level.
Supplementary Data Sources and Analytical Applications
Beyond the core trip data, the repository for NYC taxi analytics often bundles other crucial datasets that enrich the analysis:
- Shapefiles for NYC Census Tracts and Neighbourhood Tabulation Areas: Sourced from 'Bytes of the Big Apple', these files provide the geographical boundaries needed to visualise data spatially and link trips to specific demographic or urban areas.
- Shapefiles for Taxi Zone Locations: Directly from the TLC, these define the precise geographical boundaries of each taxi zone, enabling accurate mapping of pickup and drop-off points based on the zone IDs.
- Mapping of FHV Base Numbers to Names: Also provided by the TLC, this allows analysts to attribute FHV trips to specific ride-hailing companies.
- Central Park Weather Data: Obtained from the National Climatic Data Center, this data allows for analysis of how weather conditions might influence taxi demand, trip duration, or driver behaviour.
The availability and richness of this data have powered numerous analytical projects and dashboards. For instance, the data has been used extensively for benchmarking the performance of various technology stacks (like PostgreSQL and ClickHouse) in handling large datasets. It also underpins the NYC Taxi & Ridehailing Stats dashboard, offering real-time insights into the city's transport dynamics.
Moreover, the data has facilitated fascinating comparative analyses, such as determining "When Are Citi Bikes Faster Than Taxis in New York City?", which highlights the interplay between different modes of urban transport. Researchers and urban planners utilise this data to understand traffic patterns, identify peak demand areas, optimise taxi allocation, and inform policy decisions regarding public transport and ride-hailing regulations. The sheer volume and detail allow for predictive modelling of demand, analysis of fare structures, and even insights into driver income trends.

Frequently Asked Questions About NYC Taxi Data
Given the complexity and scale of the NYC taxi dataset, several common questions often arise:
Q: Is the NYC taxi data publicly available?
A: Yes, the raw trip data, as well as summary statistics, are publicly available from the NYC Taxi & Limousine Commission (TLC) website.
Q: What is the earliest date for which data is available?
A: Yellow taxi data is available from 2009. Green taxi data began in mid-2013, and For-Hire Vehicle (FHV) data from 2015, with detailed FHV data (fares, driver pay) from February 2019.
Q: Why did the data format change from CSV to Parquet?
A: The change to Apache Parquet in May 2022 was driven by the need for more efficient storage and faster processing of increasingly large datasets. Parquet's columnar format offers better compression and query performance compared to row-based CSVs.
Q: Can I still use old CSV files if I have them?
A: While the TLC no longer provides CSV files for download, if you have existing CSVs, you would need to use an older version of the data processing scripts that were designed to handle that format. Current scripts are updated for Parquet.
Q: Does the data include GPS coordinates for pickups and drop-offs?
A: No, not directly for trips after July 2016. For privacy reasons, precise GPS coordinates were replaced with 'pickup location IDs' and 'drop-off location IDs', which correspond to predefined TLC taxi zones.
Q: What kind of insights can be gained from this data?
A: The data can reveal insights into trip patterns, peak demand hours and locations, average trip costs, tipping behaviour, the impact of weather on travel, and comparisons between different taxi types and ride-hailing services. It's invaluable for urban planning, traffic management, and transport policy analysis.
Q: What tools are commonly used to process and analyse this data?
A: Common tools include database systems like PostgreSQL and ClickHouse for storage, scripting languages like R and Python for data conversion and analysis, and various data visualisation tools to present insights. Command-line tools and shell scripts are also used for downloading and initial processing.
Q: How large is the dataset?
A: The dataset is immense, comprising over 3 billion taxi and for-hire vehicle trips since 2009. Processing and storing this volume of data requires significant computing power and storage capacity.
The Enduring Value of NYC's Transport Data
The collection and meticulous management of New York City's taxi and for-hire vehicle data represent a monumental effort to understand and improve urban mobility. From its origins with the NYC TLC, through the evolution of data formats, and into the hands of data scientists and urban planners, this rich dataset provides an unparalleled window into the dynamics of one of the world's busiest cities. It allows for the benchmarking of technology, the comparison of transport modes, and the continuous refinement of services that keep millions moving every day. As urban environments continue to grow and evolve, the insights gleaned from such comprehensive data will remain indispensable for fostering efficient, equitable, and sustainable transport systems.
If you want to read more articles similar to Unravelling NYC's Taxi Data Tapestry, you can visit the Transport category.
