In this special guest feature, Deep Varma, Vice President of Data Engineering at Trulia, describes his company’s data processes designed to bring unique value to its customers. Deep manages data engineering functions across the Trulia business. This includes the vital acquisition of listings and public records, the consumer search experience and API, email/push, efforts to enhance personalization, industry-leading location services such as geocoding, as well as data science, data warehousing, and reporting. During his 17 years of Silicon Valley experience, Deep has focused on building large-scale distributed data platforms with IBM, ABB, Yahoo! and two successful startups. Deep is a graduate of the Haas School of Business at the University of California, Berkeley. He lives in the San Francisco Bay Area with his wife and two boys and loves skiing, watching football, reading technical books, building prototypes, and learning new technologies.
Trulia is an online marketplace focused on giving homebuyers, sellers, and renters the information they need to make better decisions about where to live. Rather than having to spend days scouring listings in a newspaper like I did when I bought my first home years ago, consumers today can go to sites like Trulia, where data is stored in one place, making it easy to find and digest real estate information. In addition to listings, Trulia paints a picture of what it’s like to live in a neighborhood by providing information and unique insights about properties and neighborhoods, including school ratings, crime, commute times, local amenities, and more.
To ensure consumers are seeing the best and most up-to-date information possible, Trulia processes more than 1.5 terabytes of data every single day. The data falls into two core data sets, with two separate teams processing the information differently.
Dataset One: Listings and Public Records Data
The first data set makes up the “information and unique insights” referenced above, and is the content that drives Trulia. It comprises listings and public records data.
- Listings content is our commodity item and one of the primary reasons why consumers visit Trulia in the first place. There are several different types of listings, including For Sale, For Rent, Foreclosure, New Construction, and For Sale by Owner. This data is provided to us by MLSs, brokers, and even directly by owners and agents. Altogether, we process more than 4 million listings per day.
- Public records content is mostly comprised of deeds, taxes, and assessments data, and this information gives consumers historical details on a property. There are more than 3,000 counties across the U.S., and here we process tens of millions of records per day. Unfortunately for us, though, data schemas, formats, and accessibility differ from one county to the next.
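To make the county-to-county variability concrete, one common pattern is a per-county adapter that maps each source’s shape onto a shared schema. The following is a minimal sketch under invented assumptions: the county names, field names, and record layouts are hypothetical, not Trulia’s actual schemas.

```python
# Hypothetical illustration: two counties deliver the same deed record in
# different shapes; per-county adapters map both onto one common schema.

def adapt_county_a(row):
    # County A: flat dict with abbreviated keys, price as a plain string.
    return {
        "apn": row["parcel_no"],
        "sale_price": int(row["price"]),
        "sale_date": row["date"],  # already ISO formatted
    }

def adapt_county_b(row):
    # County B: nested dict, price formatted with a dollar sign and commas.
    return {
        "apn": row["parcel"]["id"],
        "sale_price": int(row["deed"]["amount"].strip("$").replace(",", "")),
        "sale_date": row["deed"]["recorded"],
    }

ADAPTERS = {"county_a": adapt_county_a, "county_b": adapt_county_b}

def normalize(county, row):
    # Dispatch to the adapter registered for this county.
    return ADAPTERS[county](row)

a = normalize("county_a", {"parcel_no": "123-45", "price": "550000",
                           "date": "2015-06-01"})
b = normalize("county_b", {"parcel": {"id": "987-65"},
                           "deed": {"amount": "$1,200,000",
                                    "recorded": "2015-07-15"}})
```

The adapter-registry pattern keeps the downstream pipeline oblivious to each county’s quirks: adding a county means adding one function, not touching the core flow.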
Processing Listings and Public Records Data
When you take a step back and combine listings data and public records data, Trulia’s data engineers are processing and digesting millions of pieces of data every single day.
There are roughly seven steps to processing this data set:
- Parsing. We receive data streams in different formats, so we use parsers to convert them to our JSON format, then define unique short codes for attributes for efficient storage of JSON blobs.
- Address Standardization. All addresses are standardized, which is one of the key ways to join the information gathered in the two buckets in this dataset. We do this using Trulia’s own address standardization and normalization technology.
- Picture Processing. All associated pictures are processed and resized for a better consumer experience.
- Location Data. Location-aware data is added to listings, such as local amenities, neighborhood crime scores, and stats and trends.
- Historical Data. All historical datasets are assigned to listings.
- Merging and Indexing. All datasets are merged and matched before being persisted in an index that makes the information searchable for consumers.
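A few of the steps above can be sketched as a toy pipeline. This is a simplified illustration, not Trulia’s implementation: the short codes, the stand-in address normalizer, and the merge key are all invented for the example.

```python
import json

# Hypothetical short codes for attributes, to keep stored JSON blobs compact.
SHORT_CODES = {"address": "a", "price": "p", "bedrooms": "b", "listing_type": "t"}

def parse_to_json(record):
    # Parsing step: convert an incoming record to JSON with short-coded keys.
    return {SHORT_CODES[k]: v for k, v in record.items() if k in SHORT_CODES}

def standardize_address(addr):
    # Address standardization stand-in; the real system is far more involved.
    return " ".join(addr.upper().replace(".", "").split())

def merge_and_index(listings, public_records, index):
    # Merging and indexing step: join listings with public records on the
    # standardized address, then persist the merged document for search.
    for doc in listings:
        key = doc["a"]
        merged = {**public_records.get(key, {}), **doc}
        index[key] = json.dumps(merged)
    return index

listing = parse_to_json({"address": "123 main st.", "price": 550000,
                         "bedrooms": 3, "listing_type": "For Sale"})
listing["a"] = standardize_address(listing["a"])
index = merge_and_index([listing], {"123 MAIN ST": {"tax_2015": 6200}}, {})
```

The key design point is that address standardization happens before merging, so listings and public records that spell the same address differently still land on the same key.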
Finally, after all that’s done, the data goes through our Data Service API. This abstraction API layer moves the data into a searchable index on Trulia’s platforms like www, mWeb, mobile apps, and email systems.
Dataset Two: Consumer Behavior Data
Millions of consumers engage with Trulia every day by searching listings through their desktop Web browsers, mWeb, native apps, and emails. They’re looking for information, which is what we call their “intent.” In order to serve consumers the right content, we work to capture and understand their intent, which ladders back into the idea of helping them make better decisions.
Within just a few minutes of engagement on our site, consumers generate an average of 18 to 20 events – or signals – about their intent. That’s great, because we can use that data to serve custom content and drive engagement, but you can imagine the challenges of collecting, persisting, and scaling this much data.
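To make the “signals” idea concrete, here is a hedged sketch of aggregating a session’s raw events into a simple intent summary. The event names and structure are invented for illustration; Trulia’s actual event taxonomy is not public in this article.

```python
from collections import Counter

# Hypothetical intent events a consumer might emit in one session.
session_events = [
    {"type": "search", "filters": {"beds": 3, "max_price": 700000}},
    {"type": "view_listing", "listing_id": "L1"},
    {"type": "save_listing", "listing_id": "L1"},
    {"type": "view_listing", "listing_id": "L2"},
    {"type": "search", "filters": {"beds": 3, "max_price": 650000}},
]

def summarize_intent(events):
    # Roll raw events up into per-session signals: how often each action
    # occurred, and which listings the consumer explicitly saved.
    counts = Counter(e["type"] for e in events)
    saved = [e["listing_id"] for e in events if e["type"] == "save_listing"]
    return {"event_counts": dict(counts), "saved_listings": saved}

summary = summarize_intent(session_events)
```

Even this toy summary hints at why scale is hard: 18–20 events per session, multiplied by millions of daily consumers, is a firehose that has to be captured and persisted before any summarization can happen.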
Processing Consumer Behavior Data
Each consumer engagement with Trulia, or event as we call them, is processed in real time, and we use data science and machine learning with predictive science to build digital signatures of our consumers based on their events, or intent. After some key learnings about capturing this data and building digital signatures, we recently decided to implement Lambda Architecture. As a result, we’re able to better capture and organize this data, and it’s been very effective in helping deliver near real-time personalization to our consumers, regardless of whether they’re registered, which is paramount.
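In outline, a Lambda Architecture keeps a batch layer (complete views recomputed from the full master dataset) alongside a speed layer (incremental views over events that arrived since the last batch run), and merges the two at query time. The following toy sketch shows that shape; it is an assumption-laden illustration, not Trulia’s system.

```python
# Toy Lambda Architecture: batch layer recomputes a full view from the
# master dataset; speed layer covers only events since the last batch run;
# the serving layer merges both views at query time.

master_events = [("user1", "view"), ("user1", "save"), ("user2", "view")]
recent_events = [("user1", "view"), ("user2", "search")]

def batch_view(events):
    # Batch layer: full recomputation over all historical events.
    view = {}
    for user, action in events:
        view.setdefault(user, []).append(action)
    return view

def speed_view(events):
    # Speed layer: same view shape, but built incrementally from recent
    # events only, so it stays cheap to update in near real time.
    return batch_view(events)

def query(user, batch, speed):
    # Serving layer: a user's digital signature is the merge of the
    # batch view and the real-time view.
    return batch.get(user, []) + speed.get(user, [])

signature = query("user1", batch_view(master_events), speed_view(recent_events))
```

The appeal of this split for personalization is that the speed layer keeps signatures fresh within seconds, while the periodic batch recomputation corrects any drift or loss in the incremental path.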
We also built a personalization hub in-house last year to help us better track consumers’ changing behavior and ensure we’re pushing relevant content to them at all times. Building this hub has been another exciting data engineering challenge and we’re continuously employing various big data technologies and sophisticated data science techniques to tackle it.
Following collection and processing, the data is put through our API layer, which, again, provides access to all of Trulia’s platforms.
Segmenting the data and having decentralized teams has allowed us to move fast and stay nimble, and it helps us provide amazing consumer experiences by building data-based products. As we move into 2016, our focus is on making our algorithms and machine learning systems even smarter and faster, which will empower us to build still better data products.