
If you are ready to pull data and want advice on building a strong Boolean search string, this blog post is for you. I don’t post to this blog often, but I have received several requests about this topic, so I decided to write it up. I’ll use the Charlottesville protest as an example because my work is in social media and activism.

Discover your keyword pool
Run an Advanced Search on Twitter for your event dates and try different keywords to see which tweets they return. Do the same with hashtags for the period of the event. Next, sort the keywords you found into an OR paragraph or an AND paragraph in a Word document.

Build a list of OR words
First, identify any anchor word that is undoubtedly about the event you want to capture. For example, HeatherHeyer would be an obvious anchor word choice for the Charlottesville data because you can reasonably expect that anyone talking about her on Twitter is going to be talking about her in the context of the Charlottesville protest. Continue making a list of any word that uniquely connects with your event. Those words will be words that you connect with OR in your Boolean search string.

Build a list of AND words
Next, make a list of AND words by identifying combinations of words that capture your event only when they appear together. In the context of Charlottesville, you would write something like “KKK AND cville” if you think that neither word will capture your event on its own. Put all of your word pairs together and connect them with OR.

After you have finished your OR and AND lists, connect the two lists in a giant search string with OR as a connection word.
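
If you would rather assemble the string with a script than by hand in Word, here is a minimal sketch of the steps so far. The function names are my own for illustration, the sample words come from this post, and wrapping multi-word phrases in parentheses simply follows the convention in my pulls below. The exact syntax your data provider expects may differ, so treat this as a starting point.

```python
def wrap_phrase(term):
    """Put multi-word phrases in parentheses; leave single words alone."""
    return f"({term})" if " " in term else term


def build_search_string(or_words, and_pairs):
    """Connect the OR list and the AND pairs into one OR-joined string."""
    or_part = [wrap_phrase(word) for word in or_words]
    and_part = [
        f"({wrap_phrase(first)} AND {wrap_phrase(second)})"
        for first, second in and_pairs
    ]
    return " OR ".join(or_part + and_part)


# Sample lists: anchor words that stand alone, and pairs that only work together.
or_words = ["HeatherHeyer", "DeAndre Harris"]
and_pairs = [("KKK", "cville")]

query = build_search_string(or_words, and_pairs)
print(query)
```

Keeping the two lists as separate variables makes it easy to add or drop a word after testing and then rebuild the whole string.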

Add NOT words if needed (make sure to test your words)
You can easily capture irrelevant data if any of your search terms refer to people, places, or events other than the ones you want to capture (such as a city that shares its name with the last name of a key person). Test your search terms by looking up each OR word and each AND pairing individually in Advanced Twitter Search to see if irrelevant results pop up, especially when you do not restrict the search to your event dates.

Finally, attach your NOT words to the combined OR and AND string with “NOT.” You can use parentheses for complex search strings (see my example below).
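
As a rough sketch of that last step, the snippet below wraps an existing search string in parentheses and appends the exclusion words. The function name and the two exclusion words are hypothetical placeholders, not terms from my actual pulls. Some search tools write negation with a minus sign instead of NOT, so check the syntax your provider uses before relying on this.

```python
def add_exclusions(query, not_words):
    """Append NOT words to a finished search string.

    The original query is wrapped in parentheses so that NOT applies
    to the whole string rather than only to its last term.
    """
    if not not_words:
        return query
    excluded = " OR ".join(not_words)
    return f"({query}) NOT ({excluded})"


# Hypothetical example: the exclusion words are placeholders.
base_query = "HeatherHeyer OR (KKK AND cville)"
final_query = add_exclusions(base_query, ["unrelatedword1", "unrelatedword2"])
print(final_query)
```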

My Charlottesville data pull from GNIP resulted in so many tweets that we had to break up the request into three pulls. You will see that I have some

Pull 1: (Charlottesville OR cville OR VA OR Virginia OR McAuliffe OR @CvilleCityHall OR @VSPPIO) AND (antifa OR Nazis OR Nazi OR neo-Nazi OR Nazi/KKK OR KKK OR (white supremacy) OR (white supremacists) OR (white activists) OR (white activist) OR (James Alex Fields))
Timeline: May 7 to Oct. 12, number of tweets: 3 million approximately
Pull 2: (Charlottesville OR cville OR VA OR Virginia OR McAuliffe OR @CvilleCityHall OR @VSPPIO) AND (antifa OR Nazis OR Nazi OR neo-Nazi OR Nazi/KKK OR KKK OR (white supremacy) OR (white supremacists) OR (white activists) OR (white activist) OR (James Alex Fields))
Timeline: Feb. 7 to May 7, number of tweets: 40,000
Pull 3: (Charlottesville OR cville OR VA OR Virginia OR McAuliffe OR @CvilleCityHall OR @VSPPIO) AND (statue OR memorial OR (Robert E Lee) OR (Lee Park) OR (General Lee) OR Confederate OR (Emancipation Park) OR (Stonewall Jackson) OR protest OR march OR marchers) OR cvilleaug12 OR #invisiblecville OR #HeatherHeyer OR #DeAndreHarris OR (DeAndre Harris) OR #unitycville OR #defendcville OR #cvillestrong OR #standwithcharlottesville
Timeline: Feb. 7 to Oct. 12, number of tweets: 2.8 million approximately

Remember to filter out bots
You’re not out of the woods yet. Once you have your data, make sure to use a method for filtering out bots if you’re doing any theory-building about people’s behavior. There were enough political bots in the Charlottesville data to affect our topic modeling, so this is a fundamental step. Look for identical tweets, nearly identical tweets (because bots can swap out adjectives to try to evade capture), and tweets that tag a bunch of people with the same link.
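
To make those three checks concrete, here is a small sketch using only Python’s standard library. It is not the method we used on the Charlottesville data; the function names, the 0.9 similarity cutoff, and the three-mention threshold are all assumptions you would need to tune. I am also reading “tag a bunch of people with the same link” as several tweets that each mention multiple accounts and share one URL.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize(text):
    """Lowercase the text and strip links, mentions, and extra spaces."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return " ".join(text.lower().split())


def flag_suspect_tweets(tweets, similarity=0.9, min_mentions=3):
    """Return the indices of tweets that match any of the three checks."""
    suspects = set()
    cleaned = [normalize(t) for t in tweets]

    # Checks 1 and 2: identical or nearly identical text.
    # This compares every pair, so it only suits small samples.
    for i in range(len(tweets)):
        for j in range(i + 1, len(tweets)):
            if not cleaned[i] or not cleaned[j]:
                continue  # nothing left to compare after stripping
            ratio = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
            if ratio >= similarity:
                suspects.update((i, j))

    # Check 3: the same link sent in several tweets that each tag many accounts.
    by_link = defaultdict(list)
    for i, tweet in enumerate(tweets):
        if len(MENTION_RE.findall(tweet)) >= min_mentions:
            for url in URL_RE.findall(tweet):
                by_link[url].append(i)
    for indices in by_link.values():
        if len(indices) > 1:
            suspects.update(indices)

    return sorted(suspects)


# Made-up sample tweets for illustration only.
sample = [
    "Terrible scenes in cville today",
    "Terrible scenes in cville today",
    "Horrible scenes in cville today",
    "Thinking of everyone in Charlottesville tonight",
    "@a @b @c read this http://example.com/x",
    "@d @e @f look here http://example.com/x",
]
flagged = flag_suspect_tweets(sample)
print(flagged)
```

The pairwise comparison is far too slow for millions of tweets, so on a full pull you would want to run it on a sample or group tweets by their cleaned text first. Flagged tweets are candidates to review, not proof that an account is a bot.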

Final thoughts for now
One reason I love being at UNC Charlotte is the access to big data and the institutional support for collaborating on interdisciplinary projects! My thanks go to Ryan Wesslen for training me (he is an incredible teacher of more advanced topics, as well). If you have additional tips for search strings or can improve my post, feel free to leave your feedback in the comments area.
