Using geotagged tweets, researchers created a tool that can predict where you live and work, as well as other sensitive information.
AN INTERNATIONAL GROUP OF RESEARCHERS has created an algorithmic tool that uses Twitter to predict where you live in a matter of minutes with greater than 90% accuracy. It can also predict where you work, where you pray, and other information you might prefer to keep private, such as whether you’ve visited a particular strip club or gone to rehab.
The LPAuditor (short for Location Privacy Auditor) tool takes advantage of what the researchers refer to as an “invasive policy” that Twitter deployed after introducing the ability to geotag tweets in 2009. For years, users who chose to geotag tweets with any location, even something as broad as “New York City,” automatically gave up their precise GPS coordinates as well. The coordinates would not be displayed on Twitter, and the users' followers would not see them. However, the GPS data would still be included in the tweet’s metadata and accessible via Twitter’s API.
Twitter’s app policy was not changed until April 2015. Users must now opt-in to share their precise location, which only a small percentage of people do, according to a Twitter spokesperson. However, the GPS data that people shared before the update is still accessible via the API.
LPAuditor was created by the researchers to analyze geotagged tweets and infer detailed information about people’s most sensitive locations. They describe this process in a new peer-reviewed paper, which will be presented next month at the Network and Distributed System Security Symposium. LPAuditor was able to determine where tens of thousands of people lived, worked, and spent their free time by analyzing clusters of coordinates as well as timestamps on tweets.
According to a member of Twitter’s site integrity team, sharing location data on Twitter has always been voluntary, and the company has always provided users with a way to delete that data in its help section. “We recognized in 2015 that we could be more clear about that, but our overarching perspective on location sharing has always been that it’s voluntary and that users can choose what they do and don’t want to share,” the Twitter employee explained.
True, it has always been up to users whether or not to geotag their tweets. However, there is a significant difference between choosing to share that you are in Paris and choosing to share where you live in Paris. Regardless of the square mileage of the locations users chose to share, Twitter chose to share their locations down to the GPS coordinates for years. The fact that these details were spelled out in Twitter’s help section wouldn’t help users who didn’t realize they needed assistance in the first place.
“If you’re not aware of the problem, you’re never going to go remove that data,” says Jason Polakis, a study co-author and assistant professor of computer science at the University of Illinois at Chicago who specializes in privacy and security. That data, according to the study, can reveal a lot.
Polakis and researchers at the Foundation for Research and Technology in Crete began pulling Twitter metadata from the company’s API in November 2016, well after Twitter changed its settings. They were expanding on previous research that demonstrated the ability to infer private information from geotagged tweets, but they wanted to see if they could do it at scale and with greater precision using automation.
The researchers examined a pool of approximately 15 million geotagged tweets from approximately 87,000 users. Some of the location data associated with those tweets could have come from users who wanted to share their exact location, such as a museum or music venue. However, many users shared nothing more than a city or general area, only to have their GPS location shared anyway.
LPAuditor then went to work, assigning each tweet to a physical location on a map and adjusting its timestamp to the local time zone. This resulted in clusters of tweets scattered across the map, some busier than others, indicating locations where a given user spends a lot of time, or, at the very least, a lot of time tweeting.
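As a rough illustration of the clustering step described above, the sketch below groups coordinates into grid cells and counts the tweets in each. It is not the researchers' algorithm: the grid approach, the function name `cluster_by_grid`, and the cell size are all assumptions made for this example.

```python
from collections import Counter

def cluster_by_grid(points, cell_deg=0.001):
    """Group (lat, lon) points into square grid cells and count points per cell.

    cell_deg is a hypothetical cell size in degrees; 0.001 degrees of
    latitude is roughly 111 meters. Returns (cell, count) pairs, busiest first.
    """
    counts = Counter()
    for lat, lon in points:
        # Snap each coordinate to the nearest grid cell index.
        cell = (round(lat / cell_deg), round(lon / cell_deg))
        counts[cell] += 1
    return counts.most_common()

# Two tweets a few meters apart in Chicago, one in Paris.
points = [(41.8701, -87.6501), (41.8702, -87.6502), (48.8566, 2.3522)]
clusters = cluster_by_grid(points)
print(clusters)
```

The busiest cells are the candidate locations that later steps would try to label as home, work, or something else.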
The researchers instructed LPAuditor to look for locations where people spent the most time tweeting over the weekend to predict which cluster might correspond to a user’s home. During the week, you might tweet in the morning, at night, and on your day off in an unpredictable pattern, but on weekends, most people spend the majority of their time at home.
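The weekend heuristic can be sketched in a few lines. This is a simplified illustration rather than LPAuditor's actual code; the function name `guess_home`, the input format, and the cluster labels are hypothetical, and the timestamps are assumed to be in the user's local time already.

```python
from collections import Counter
from datetime import datetime

def guess_home(tweets):
    """Return the cluster with the most weekend tweets, or None if there are none.

    tweets: iterable of (cluster_id, local_datetime) pairs (assumed format).
    """
    # datetime.weekday() returns 5 for Saturday and 6 for Sunday.
    weekend = Counter(c for c, ts in tweets if ts.weekday() >= 5)
    return weekend.most_common(1)[0][0] if weekend else None

tweets = [
    ("A", datetime(2016, 11, 5, 10)),  # Saturday
    ("A", datetime(2016, 11, 6, 21)),  # Sunday
    ("B", datetime(2016, 11, 7, 9)),   # Monday
    ("B", datetime(2016, 11, 7, 14)),  # Monday
    ("B", datetime(2016, 11, 8, 10)),  # Tuesday
    ("C", datetime(2016, 11, 5, 23)),  # Saturday
]
print(guess_home(tweets))  # A
```

Cluster B has the most tweets overall, but A wins because only weekend activity is counted.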
To find work locations, they did the opposite, analyzing tweet patterns throughout the week. LPAuditor examined the locations from which users tweeted the most (excluding home), as well as the periods during which those tweets were sent. This gave the researchers an idea of whether the tweets were sent during a typical eight-hour shift, even if that shift was overnight. Finally, the tool looked for the most frequent time frame during the week and determined that the location with the most tweets during that time frame was most likely the person’s workplace.
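One way to picture the work heuristic is to slide an eight-hour window around the clock, find the window containing the most weekday tweets, and pick the busiest non-home cluster inside it. The sketch below does that under stated assumptions: the name `guess_work`, the input format, and the hour-by-hour sliding window are illustrative choices, not details taken from the paper.

```python
from collections import Counter
from datetime import datetime

def guess_work(tweets, home, shift_hours=8):
    """Return the non-home cluster with the most tweets in the busiest weekday shift.

    tweets: iterable of (cluster_id, local_datetime) pairs (assumed format).
    home: cluster id already labeled as home, which is excluded.
    """
    weekday = [(c, ts.hour) for c, ts in tweets
               if c != home and ts.weekday() < 5]
    if not weekday:
        return None

    def in_window(hour, start):
        # Modular arithmetic lets a window wrap past midnight (overnight shifts).
        return (hour - start) % 24 < shift_hours

    # Try every possible start hour and keep the window with the most tweets.
    best_start = max(range(24),
                     key=lambda s: sum(in_window(h, s) for _, h in weekday))
    in_shift = Counter(c for c, h in weekday if in_window(h, best_start))
    return in_shift.most_common(1)[0][0]

tweets = [
    ("A", datetime(2016, 11, 7, 22)),  # home cluster, ignored
    ("B", datetime(2016, 11, 7, 9)),
    ("B", datetime(2016, 11, 7, 14)),
    ("B", datetime(2016, 11, 8, 10)),
    ("C", datetime(2016, 11, 9, 20)),
]
print(guess_work(tweets, home="A"))  # B
```

As the article notes, a location found this way may turn out to be a school or another place where someone spends what would otherwise be working hours.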
When it came time to double-check their answers, the researchers chose a group of about 2,000 users to serve as a sort of ground truth. Compiling this group was a manual process that required two graduate students to independently sift through all of the tweets in the collection to find key phrases that could confirm whether a person was at home or work at the time they sent it. Terms like “I’m at home” or “at the office,” for example, may provide a hint. They examined each tweet for context that could lead to additional information.
They then compared the locations of those tweets to the tool’s predictions and discovered that they were extremely accurate, correctly identifying people’s homes 92.5 percent of the time. The tool wasn’t as good at predicting where people worked, getting it right only 55.6 percent of the time. However, Polakis believes that this could simply mean that the location the tool identified as “work” is a school or another location where the person spends what would otherwise be working hours.
Finally, the researchers began looking for sensitive locations that a user may have visited. To do this, they compared the tweet locations to Foursquare’s directory of businesses and venues. They were looking for hospitals, urgent care centers, and places of worship, as well as strip clubs and gay bars. Any venue within 27 yards of the geotagged tweet would be considered a potential location. They then performed a similar keyword analysis, looking for words related to health, religion, sex, and nightlife to see whether the user was likely to have been at the venue the coordinates suggested. Using this method, the researchers discovered that LPAuditor was correct about sensitive locations roughly 80% of the time.
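The 27-yard radius check is a simple distance filter. The sketch below uses the haversine formula to find venues within that radius of a tweet's coordinates. The venue list, the names, and the function `nearby_venues` are made up for illustration, and nothing here calls Foursquare's actual API.

```python
import math

YARD_M = 0.9144               # meters per yard
RADIUS_M = 27 * YARD_M        # about 24.7 meters
EARTH_RADIUS_M = 6371000.0    # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearby_venues(tweet, venues, radius_m=RADIUS_M):
    """Return names of venues within radius_m of the tweet's (lat, lon).

    venues: iterable of (name, lat, lon) tuples (hypothetical format).
    """
    lat, lon = tweet
    return [name for name, vlat, vlon in venues
            if haversine_m(lat, lon, vlat, vlon) <= radius_m]

venues = [
    ("clinic", 41.8701, -87.6500),  # roughly 11 meters north of the tweet
    ("bar", 41.8710, -87.6500),     # roughly 111 meters north of the tweet
]
print(nearby_venues((41.8700, -87.6500), venues))  # ['clinic']
```

A match here only makes a venue a candidate; in the study, the keyword analysis was what confirmed whether the user was likely there.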
Of course, if a user tweets about being at the doctor while they’re there, one could argue that they’re not concerned with privacy. “The location might give away more information than the user wants to say,” Polakis says. In one case, the researchers discovered a user tweeting about a doctor from a location revealed by GPS coordinates to be a rehab facility. “That’s a much more sensitive context than they were willing to reveal,” he says.
Even when there were no context clues in the tweet, LPAuditor was able to predict whether a person had spent time at a sensitive location by studying how much time people spent there and how many times they returned. However, the researchers were unable to assess the accuracy of these specific predictions.
The majority of this study was based on tweets sent before Twitter’s policy change in April 2015. That change, according to Polakis, made a significant difference in the amount of precise location data available via the API. To quantify how significant, the researchers excluded all tweets posted before April 2015 and found that they could positively identify key locations for only about one-fifteenth of the users studied. In other words, “that kind of invasive Twitter behavior increased the number of people we could attack by 15 times,” Polakis says.
The fact that Twitter’s policies have changed is a good thing. The issue is that so much of that pre-2015 location data is still accessible via the API. When asked why Twitter didn’t scrub it after changing the policy, a Twitter site integrity employee said, “We didn’t feel it would be appropriate for us to go back and unilaterally change people’s tweets without their consent.”
This isn’t the first study to show what can be gleaned from location data or geotagged tweets. However, this paper makes significant contributions, according to Henry Kautz, a computer scientist at the University of Rochester who has conducted similar research. “The breakthrough here is that they studied two types of locations, work and home, rather than one, and they did a larger study with a more systematic evaluation and a more highly tuned algorithm, so it got the right answer a higher percentage of the time,” says Kautz. LPAuditor is not limited to Twitter data. It applies to any set of location data.
According to Kautz, Twitter is a minor concern when compared to other apps that continue to use invasive location data practices today. Los Angeles city officials recently filed a lawsuit against the IBM-owned Weather Channel app for allegedly collecting and selling users’ geolocation data under the guise of “personaliz[ing] local weather data, alerts, and forecasts.” Just this week, Motherboard reported that bounty hunters are tracking people using their phones by purchasing location data from T-Mobile, Sprint, and AT&T, despite the companies’ public promises to stop selling such information. Then there are apps that become infected with malware and siphon off location data.
“Today’s big problem isn’t malicious people looking at your geotagged tweets. The issue is that compromised cell phone apps steal your entire GPS history,” says Kautz. “Not only can one extract your home and work locations from that data, but also a large number of significant places in your life.”
Polakis, however, believes that the fact that Twitter no longer attaches GPS coordinates to all geotagged tweets is insufficient, given that developers still have access to data from before 2015. Yes, some of that information may now be out of date. People move. They switch jobs. However, even outdated information can be useful to an attacker, and sensitive information, such as a person’s sexual orientation, is unlikely to change. This study demonstrates that it is not only possible to infer this type of information from location data, but that a machine can do so almost instantly.
For the time being, the most people can do, according to Polakis, is delete their location data and think twice about sharing it in the future.