Scraping Twitter Data – Python


By Asmik Nalmpatian and Lisa Wimmer – for Intro to NLP


In this notebook we demonstrate scraping Twitter data in Python.

Potential reasons for preferring Python over R for general web scraping include the larger body of online support (Python enjoys a vibrant community in this field) and the easier handling of the web drivers required for actively navigating websites (in our personal experience, running drivers from R frequently led to session crashes).

0: Disclaimer

The content displayed here is to a very large extent the work of colleagues of ours.

We thank them for sharing their code and refer any complaints to those two gentlemen :)

0: Scope

Goal: collect tweets by German MPs (members of the Bundestag)

Steps:

  1. Get list of active MPs
  2. Get MPs' account names on Twitter
  3. Get Twitter data

0: Prep

Load libraries

If needed, first install libraries via !pip install <library>.
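For this notebook, that would be, e.g. (beautifulsoup4 provides the bs4 module; lxml is the parser used further below):

!pip install pandas numpy requests beautifulsoup4 lxml unidecode selenium tweepy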

In [33]:
# Load libraries

import pandas as pd # data wrangling
import numpy as np # math operations
import math # math operations
import os # directories
import time # system time
import random # random number generation
import pickle # object serialization
import re # regular expressions
import unidecode # transliterate accented characters to ASCII

import urllib.request # scraping
import requests # scraping
from bs4 import BeautifulSoup # scraping
import ctypes # Windows message boxes for progress reports
import tweepy # Twitter API

import sys # raise recursion limit (deep HTML parse trees can exceed the default)
sys.setrecursionlimit(100000)

import selenium # chrome driver
from selenium import webdriver # chrome driver
import selenium.common.exceptions as selexcept # exception handling

1: Get list of active MPs

Fire up selenium driver

For this you need to have downloaded chromedriver.exe from the official ChromeDriver download page and stored it in the file directory below.

Note that the version of the driver must match the Chrome version you use (mismatches will throw an error in the code cell below).
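Alternatively, the third-party webdriver_manager package (not used in this notebook) can download a matching driver for you; a minimal sketch, assuming the package is installed:

# Sketch: let webdriver_manager fetch a ChromeDriver matching the installed Chrome
# (assumes: pip install webdriver-manager)
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())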

In [2]:
# Specify location of chromedriver.exe

chdriver_path = r'C:\Users\wimme\Documents\1_uni\1_master\consulting\projects\consulting\1_scraping\input\chromedriver.exe'

# Set up selenium driver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
In [3]:
# Start driver (a Chrome browser window should open by itself; note that the
# options defined above are not passed in here, so --headless has no effect)

driver = webdriver.Chrome(chdriver_path)


This part requires some navigating around the website, which is why we use a selenium driver in the first place.

In [4]:
# Specify url and navigate to website

website = "https://www.bundestag.de/abgeordnete"
driver.get(website)

# Switch to list view - first, find "List" button

element = driver.find_element_by_class_name('icon-list-bullet') 

# Click button

webdriver.ActionChains(driver).move_to_element(element).click(element).perform()

# Wait for list view to load

time.sleep(random.randint(15, 20))


In [5]:
# Count how many MPs are listed on the website (includes dropouts and successors)

len(driver.find_element_by_class_name('bt-list-holder').find_elements_by_tag_name('li'))
Out[5]:
743

Get name and party

First, get name and party for each MP.

For much of the scraping we rely on the popular BeautifulSoup package, which makes parsing HTML documents easy.

Note that we still need to specify the HTML nodes / CSS selectors, found via, e.g., Chrome's developer tools or SelectorGadget.
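To illustrate the pattern on a toy snippet (self-contained example with made-up data, mirroring the structure of the MP list):

# Toy example: node-based access vs. CSS selectors in BeautifulSoup
from bs4 import BeautifulSoup

html = '<ul class="bt-list-holder"><li><h3>Musterfrau, Erika</h3><p>SPD</p></li></ul>'
toy_soup = BeautifulSoup(html, 'html.parser')

print(toy_soup.find('li').h3.text)                       # node-based: 'Musterfrau, Erika'
print(toy_soup.select_one('.bt-list-holder li p').text)  # CSS selector: 'SPD'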

In [9]:
# Set up empty list

abg_df = []

# Find names and party and append to list

for link in driver.find_element_by_class_name('bt-list-holder').find_elements_by_tag_name('li'):
    abg_df.append({
        'name': link.find_element_by_class_name('bt-teaser-person-text').\
        find_element_by_tag_name('h3').text,
        'party': link.find_element_by_class_name('bt-teaser-person-text').\
        find_element_by_tag_name('p').text,
    })
    
# Convert to pandas dataframe    

abg_df = pd.DataFrame(abg_df)
In [10]:
# Split names into first and last names

name_concat = abg_df['name'].str.split(", ", n = 1, expand = True) 
abg_df['last_name'] = name_concat[0] 
abg_df['first_name'] = name_concat[1] 
abg_df.drop(columns = ['name'], inplace = True) 
abg_df = abg_df.reindex(
    columns = ['last_name', 'first_name'] + list(abg_df.columns[:-2]))
In [11]:
# Add columns for info that will be scraped shortly

abg_df = abg_df.reindex(columns = abg_df.columns.tolist() + 
                        ['bundesland', 'wahlkreis_name', 'wahlkreis_nr', 'wahlkreis', 'username'])

# Inspect

abg_df.head()
Out[11]:
last_name first_name party bundesland wahlkreis_name wahlkreis_nr wahlkreis username
0 Abercron Dr. Michael von CDU/CSU NaN NaN NaN NaN NaN
1 Achelwilm Doris Die Linke NaN NaN NaN NaN NaN
2 Aggelidis Grigorios FDP NaN NaN NaN NaN NaN
3 Akbulut Gökay Die Linke NaN NaN NaN NaN NaN
4 Albani Stephan CDU/CSU NaN NaN NaN NaN NaN

Scrape info on MP level

Looking good! Now for the somewhat more complicated part.

The driver will now visit each MP's landing page and scrape their personal info, remote-controlling the browser window. You can watch it click and jump to pages :)

In [31]:
# Create range to loop over

abg_range = abg_df.index[abg_df['bundesland'].isnull()]
abg_range
Out[31]:
Int64Index([ 10,  11,  12,  13,  14,  15,  16,  17,  18,  19,
            ...
            733, 734, 735, 736, 737, 738, 739, 740, 741, 742],
           dtype='int64', length=733)
In [34]:
for abg in abg_range:
    
    try:
        
        # (Re-)load list view (for all iterations)
        
        driver.get(website)
        element = driver.find_element_by_class_name('icon-list-bullet')
        
        # Click to change to list view
        
        webdriver.ActionChains(driver).move_to_element(element).click(element).perform() 
        
        # Wait for list view to load
        
        time.sleep(random.randint(3, 5)) 
        
        # Click to open individual page
        
        driver.find_element_by_class_name('bt-list-holder').find_elements_by_tag_name('li')\
        [abg].click()
        
        # Wait for page to load
        
        time.sleep(random.randint(3, 5)) 
        
        # Convert page to soup
        
        soup = BeautifulSoup(driver.page_source, 'lxml') 

        # Extract state (bundesland) and electoral district (wahlkreis)
        
        bundesland = soup.find(
            'div', attrs = {'class': 'col-xs-12 col-sm-6 bt-standard-content'}).h5.text
        wahlkreis = soup.find(
            'div', attrs = {'class': 'col-xs-12 col-sm-6 bt-standard-content'}).a.text \
            if soup.find(
            'div', attrs = {'class': 'col-xs-12 col-sm-6 bt-standard-content'}).a is not None \
            else "n.a."

        # Split wahlkreis in name and ID
        
        wahlkreis_name = wahlkreis.split(':')[1].strip(' ') if wahlkreis \
        not in ["n.a.", None] else ""
        wahlkreis_nr = int(
            wahlkreis.split(':')[0].strip('Wahlkreis').strip(' ')) if wahlkreis \
            not in ["n.a.", None] else ""
        
        # Extract social media account
        
        social_media = {}
        
        if len(soup.find_all('h5', string = 'Profile im Internet')) == 1:
            for link in soup.find_all(class_ = 'bt-linkliste')[0].find_all('a'):
                social_media[link['title']] = link.get('href')
                
        abg_df.loc[abg, 'bundesland'] = bundesland
        abg_df.loc[abg, 'wahlkreis'] = wahlkreis
        abg_df.loc[abg, 'wahlkreis_name'] = wahlkreis_name
        abg_df.loc[abg, 'wahlkreis_nr'] = wahlkreis_nr
        abg_df.loc[abg, 'username'] = social_media['Twitter'] if 'Twitter' in social_media else ""
        
        if abg%20 == 0:
            print('Data for MP %s successfully retrieved' %abg)
        
    # In case of IndexError, AttributeError or NoSuchElementException (thrown when
    # a page fails to load), skip this MP; missing rows are retried on a re-run
    
    except (IndexError, AttributeError, selexcept.NoSuchElementException):
        abg_range = abg_df.index[abg_df['bundesland'].isnull()]
        
ctypes.windll.user32.MessageBoxW(0, "MP data successfully scraped", "Progress Report")  
Data for MP 20 successfully retrieved
Data for MP 60 successfully retrieved
Data for MP 100 successfully retrieved
Data for MP 120 successfully retrieved
Data for MP 140 successfully retrieved
Data for MP 160 successfully retrieved
Data for MP 180 successfully retrieved
Data for MP 220 successfully retrieved
Data for MP 240 successfully retrieved
Data for MP 260 successfully retrieved
Data for MP 280 successfully retrieved
Data for MP 300 successfully retrieved
Data for MP 320 successfully retrieved
Data for MP 340 successfully retrieved
Data for MP 360 successfully retrieved
Data for MP 400 successfully retrieved
Data for MP 420 successfully retrieved
Data for MP 480 successfully retrieved
Data for MP 500 successfully retrieved
Data for MP 520 successfully retrieved
Data for MP 560 successfully retrieved
Data for MP 580 successfully retrieved
Data for MP 600 successfully retrieved
Data for MP 620 successfully retrieved
Data for MP 640 successfully retrieved
Data for MP 660 successfully retrieved
Data for MP 680 successfully retrieved
Data for MP 700 successfully retrieved
Data for MP 720 successfully retrieved
Out[34]:
0
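Note that reassigning abg_range inside the except clause does not change what the already-running for loop iterates over; in practice, the two cells above are simply re-run until no bundesland value is missing. A sketch of how those re-runs could be automated (the max_rounds cap is our addition, not part of the original workflow):

# Sketch: retry MPs with missing data instead of re-running the cells by hand

max_rounds = 10 # safety cap to avoid looping forever

for _ in range(max_rounds):
    abg_range = abg_df.index[abg_df['bundesland'].isnull()]
    if len(abg_range) == 0:
        break
    for abg in abg_range:
        ... # body of the scraping loop above, unchanged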
In [35]:
abg_df.head(10)
Out[35]:
last_name first_name party bundesland wahlkreis_name wahlkreis_nr wahlkreis username
0 Abercron Dr. Michael von CDU/CSU Schleswig-Holstein Pinneberg 7 Wahlkreis 007: Pinneberg https://twitter.com/mvabercron/
1 Achelwilm Doris Die Linke Bremen n.a. https://twitter.com/DorisAchelwilm
2 Aggelidis Grigorios FDP Niedersachsen Hannover-Land I 43 Wahlkreis 043: Hannover-Land I https://twitter.com/aggelidis_fdp?lang=de
3 Akbulut Gökay Die Linke Baden-Württemberg Mannheim 275 Wahlkreis 275: Mannheim https://twitter.com/akbulutgokay?lang=de
4 Albani Stephan CDU/CSU Niedersachsen Oldenburg – Ammerland 27 Wahlkreis 027: Oldenburg – Ammerland
5 Alt Renata FDP Baden-Württemberg Nürtingen 262 Wahlkreis 262: Nürtingen
6 Altenkamp Norbert CDU/CSU Hessen Main-Taunus 181 Wahlkreis 181: Main-Taunus
7 Altmaier Peter CDU/CSU Saarland Saarlouis 297 Wahlkreis 297: Saarlouis https://twitter.com/peteraltmaier
8 Amthor Philipp CDU/CSU Mecklenburg-Vorpommern Mecklenburgische Seenplatte I – Vorpommern-Gre... 16 Wahlkreis 016: Mecklenburgische Seenplatte I –...
9 Amtsberg Luise Bündnis 90/Die Grünen Schleswig-Holstein Kiel 5 Wahlkreis 005: Kiel

2: Get MPs' account names on Twitter

For those MPs who have a Twitter account but do not list it on their bundestag.de landing page, we try to find this information on their respective party's website.

Luckily, these websites are static and do not require remote control via selenium.

Note: In case of the AfD party, Twitter accounts are not available on the official party website and must thus be gathered manually.

Access and parse party websites

In [36]:
# Access party websites and convert into soups 

fdp = "https://www.fdpbt.de/fraktion/abgeordnete"
source_fdp = requests.get(fdp).text
soup_fdp = BeautifulSoup(source_fdp, 'html.parser')

cdu = "https://www.cducsu.de/hier-stellt-die-cducsu-bundestagsfraktion-ihre-abgeordneten-vor"
source_cdu = requests.get(cdu).text
soup_cdu = BeautifulSoup(source_cdu, 'html.parser')

spd = "https://www.spdfraktion.de/abgeordnete/alle?wp=19&view=list&old=19"
source_spd = requests.get(spd).text
soup_spd = BeautifulSoup(source_spd, 'html.parser')

gruene = "https://www.gruene-bundestag.de/abgeordnete"
source_gruene = requests.get(gruene).text
soup_gruene = BeautifulSoup(source_gruene, 'html.parser')

# For Die Linke, one needs to extract Twitter accounts from each individual MP website

linke_base = "https://www.linksfraktion.de/fraktion/abgeordnete/"

# Website contains bins of MPs, according to last name

letters = [['a', 'e'], ['f', 'j'], ['k', 'o'], ['p', 't'], ['u', 'z']] 
linke_name_bins = []

for letter in letters:
    extension = f'{letter[0]}-bis-{letter[1]}/' 
    linke_name_bins.append(linke_base + extension)
In [37]:
# For each party, find appropriate parent node in soup

all_abg_cdu = soup_cdu.find_all(class_ = 'teaser delegates')

all_abg_spd = soup_spd.find_all(class_ = 'views-row')

extensions_gruene = soup_gruene.find_all('a', class_ = "abgeordneteTeaser__wrapper")
urlbase_gruene = 'https://www.gruene-bundestag.de'
all_abg_gruene = []

for a in extensions_gruene:
    
    extension = a['href']
    link = urlbase_gruene + str(extension)
    all_abg_gruene.append(link)
    
all_abg_linke = []

for name_bin in linke_name_bins:
    
    source = requests.get(name_bin).text
    soup = BeautifulSoup(source, 'html.parser')
    
    for abg in soup.find_all('div', attrs = {'class': 'col-xs-12 col-sm-12 col-md-6 col-lg-6'}):
        # str.lstrip() strips characters, not a prefix, and would eat leading
        # letters of some names; remove the path prefix explicitly instead
        extension = abg.find('h2').find('a')['href'].replace('/fraktion/abgeordnete/', '', 1)
        all_abg_linke.append(linke_base + extension)

Get accounts

In [38]:
# Scrape accounts from soups 

twitter_list = []

# CDU/CSU

for abg in all_abg_cdu:
    
    twitter = abg.find(class_ = 'twitter')
    twitter_list.append(
        {'party': "cdu_csu",
         'name': abg.find('h2').find('span').text.strip(' '),
         'twitter_ext': twitter.find('a', href = True)['href'] if twitter is not None else ""
        }
    )

# Gruene

for abg in all_abg_gruene:
    
    abg_source = requests.get(abg).text
    abg_soup = BeautifulSoup(abg_source, 'html.parser')
    twitter = ""
    
    # Collect all teaser links and keep the one pointing to Twitter
    
    for x in abg_soup.find_all(class_ = "weitereInfoTeaser"):
    
        for y in x.find_all('a', href = True):
            
            if "twitter" in y['href']:
                twitter = y['href']
                    
    twitter_list.append(           
        {'party': "gruene",
         'name': abg_soup.find('h1').text,
         'twitter_ext': twitter
        }
    )
    
# Linke

for abg in all_abg_linke:
    
    abg_source = requests.get(abg).text
    abg_soup = BeautifulSoup(abg_source, 'html.parser')
    twitter = abg_soup.find('a', text = re.compile('Twitter-Profil'))
    twitter_list.append(
        {'party': "linke",
         'name': abg_soup.find('h1').text.strip(' '),
         'twitter_ext': twitter['href'] if twitter is not None else ""
        }
    )

# SPD

for abg in all_abg_spd:
    
    twitter = abg.find(class_ = 'ico_twitter')
    twitter_list.append(
        {'party': "spd",
         'name': abg.find('h3').find('a').get_text().strip(' '),
         'twitter_ext': twitter['href'] if twitter is not None else ""
        }
    )
    
# Convert to data frame    
    
twitter_df = pd.DataFrame(twitter_list)    

ctypes.windll.user32.MessageBoxW(0, "Twitter accounts successfully scraped", "Progress Report")
Out[38]:
1
In [39]:
twitter_df.head()
Out[39]:
party name twitter_ext
0 gruene Luise Amtsberg
1 gruene Lisa Badum https://twitter.com/badulrichmartha
2 gruene Annalena Baerbock https://twitter.com/ABaerbock
3 gruene Margarete Bause https://twitter.com/MargareteBause
4 gruene Dr. Danyal Bayaz https://twitter.com/derdanyal

Merge data

First, we define a regex-based function that (repeatedly) splits names and discards unwanted sequences such as academic titles.

In [40]:
def name_prep(name, twitter = True):
    
    interim = re.sub("[\(\[].*?[\)\]]", "", name).strip(' ')
    
    # Strip up to five leading title tokens such as 'Prof.', 'Dr.', 'rer.', 'nat.'
    
    for _ in range(5):
        interim = re.sub(r'(^\w{1,6}\. ?)', r'', interim)
    
    interim = unidecode.unidecode(interim).strip(' ')
    interim = re.sub(' +', ' ', interim)
    
    if twitter:
    
        if len(interim.split()) > 2:

            if interim.split()[0].endswith(('.', 'med', 'forest')):
                first_name = interim.split()[1]
            else:
                first_name = interim.split()[0]   

            # Particles such as 'von' are dropped implicitly
            # (e.g., 'Michael von Abercron' -> 'Michael Abercron')
            last_name = interim.split()[-1]
            return (first_name + ' ' + last_name)

        else:
            return interim
        
    else:
        
        if len(interim.split()) > 1:     
            return(interim.split()[0])
        else:
            return interim
In [42]:
# Check whether it works

name_prep(name = 'Prof. Dr. Dr. rer. nat. Carl Friedrich Gauss', twitter = True)
Out[42]:
'Carl Gauss'

We apply the function to the name variables in both data frames:

In [44]:
# Prepare MP names from Twitter df for name-based matching

twitter_df['name_matching'] = twitter_df['name'].apply(name_prep, twitter = True)
In [45]:
# Prepare MP names from MP df for name-based matching

abg_df['name_matching'] = abg_df['first_name'].apply(name_prep, twitter = False) + ' ' + \
abg_df['last_name'].apply(name_prep, twitter = False)

Ready for merging! For now we keep the user names we found from both sources and set them to NaN where they are empty.

In [47]:
# Merge Twitter df and MP df

abg_twitter_df = pd.merge(
    abg_df, 
    twitter_df[['name_matching', 'twitter_ext']], 
    how = 'left', 
    left_on = 'name_matching', 
    right_on = 'name_matching'
)

abg_twitter_df['username'] = np.where(
    abg_twitter_df['username'] != '', 
    abg_twitter_df['username'], 
    np.nan
)

abg_twitter_df['twitter_ext'] = np.where(
    abg_twitter_df['twitter_ext'] != '', 
    abg_twitter_df['twitter_ext'], 
    np.nan
)
In [48]:
abg_twitter_df.head()
Out[48]:
last_name first_name party bundesland wahlkreis_name wahlkreis_nr wahlkreis username name_matching twitter_ext
0 Abercron Dr. Michael von CDU/CSU Schleswig-Holstein Pinneberg 7 Wahlkreis 007: Pinneberg https://twitter.com/mvabercron/ Michael Abercron NaN
1 Achelwilm Doris Die Linke Bremen n.a. https://twitter.com/DorisAchelwilm Doris Achelwilm https://twitter.com/doris_achelwilm
2 Aggelidis Grigorios FDP Niedersachsen Hannover-Land I 43 Wahlkreis 043: Hannover-Land I https://twitter.com/aggelidis_fdp?lang=de Grigorios Aggelidis NaN
3 Akbulut Gökay Die Linke Baden-Württemberg Mannheim 275 Wahlkreis 275: Mannheim https://twitter.com/akbulutgokay?lang=de Gokay Akbulut https://twitter.com/akbulutgokay
4 Albani Stephan CDU/CSU Niedersachsen Oldenburg – Ammerland 27 Wahlkreis 027: Oldenburg – Ammerland NaN Stephan Albani NaN

We then get rid of the redundancy, using the accounts found on the parties' websites where available, and the ones found on the Bundestag site otherwise.

In [49]:
# Impute account name from Bundestag website where necessary and available

abg_twitter_df['username'] = np.where(
    abg_twitter_df['twitter_ext'].notnull(), 
    abg_twitter_df['twitter_ext'], 
    abg_twitter_df['username'])

abg_twitter_df = abg_twitter_df.drop('twitter_ext', axis = 1)
In [50]:
abg_twitter_df.head()
Out[50]:
last_name first_name party bundesland wahlkreis_name wahlkreis_nr wahlkreis username name_matching
0 Abercron Dr. Michael von CDU/CSU Schleswig-Holstein Pinneberg 7 Wahlkreis 007: Pinneberg https://twitter.com/mvabercron/ Michael Abercron
1 Achelwilm Doris Die Linke Bremen n.a. https://twitter.com/doris_achelwilm Doris Achelwilm
2 Aggelidis Grigorios FDP Niedersachsen Hannover-Land I 43 Wahlkreis 043: Hannover-Land I https://twitter.com/aggelidis_fdp?lang=de Grigorios Aggelidis
3 Akbulut Gökay Die Linke Baden-Württemberg Mannheim 275 Wahlkreis 275: Mannheim https://twitter.com/akbulutgokay Gokay Akbulut
4 Albani Stephan CDU/CSU Niedersachsen Oldenburg – Ammerland 27 Wahlkreis 027: Oldenburg – Ammerland NaN Stephan Albani

Get usernames

Lastly, we extract the usernames from the account URLs.

In [51]:
# Define function to extract usernames

def get_username(url):
    
    if url.startswith('http'):
        return(url.split('/')[3].split('?')[0])
    else:
        return(url.split('?')[0])
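A quick sanity check against the URL formats we saw above:

get_username('https://twitter.com/aggelidis_fdp?lang=de') # -> 'aggelidis_fdp'
get_username('https://twitter.com/mvabercron/') # -> 'mvabercron'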
In [52]:
# Apply to all observations with existing account URL

mask = abg_twitter_df['username'].notnull()
abg_twitter_df['username'] = abg_twitter_df['username'][mask].apply(get_username)
In [53]:
abg_twitter_df.head()
Out[53]:
last_name first_name party bundesland wahlkreis_name wahlkreis_nr wahlkreis username name_matching
0 Abercron Dr. Michael von CDU/CSU Schleswig-Holstein Pinneberg 7 Wahlkreis 007: Pinneberg mvabercron Michael Abercron
1 Achelwilm Doris Die Linke Bremen n.a. doris_achelwilm Doris Achelwilm
2 Aggelidis Grigorios FDP Niedersachsen Hannover-Land I 43 Wahlkreis 043: Hannover-Land I aggelidis_fdp Grigorios Aggelidis
3 Akbulut Gökay Die Linke Baden-Württemberg Mannheim 275 Wahlkreis 275: Mannheim akbulutgokay Gokay Akbulut
4 Albani Stephan CDU/CSU Niedersachsen Oldenburg – Ammerland 27 Wahlkreis 027: Oldenburg – Ammerland NaN Stephan Albani

3: Get Twitter data

After quite a few lines of code, we are now ready to scrape the data we are actually after.

For this, we will use the tweepy library and define a function that retrieves the data we are looking for.

In [55]:
# Function to download tweets for a specific user with Tweepy

def download_tweets_tweepy_mod(username):
    
    # Helper function to check whether tweet is retweet
    
    def is_retweet(x):
        # x is NaN for original tweets and a Status object for retweets,
        # for which math.isnan() raises a TypeError
        try:
            res = not(math.isnan(x))
        except TypeError:
            res = True
        return(res)

    # Helper function to retrieve hashtags
    
    def get_hashtags(x):
        hashtags_dict = x['hashtags']
        hashtags_text = [x['text'] for x in hashtags_dict]
        return(hashtags_text)

    # Helper function to retrieve user mentions
    
    def get_mentions(x):
        mentions_dict = x['user_mentions']
        mentions_text = [x['screen_name'] for x in mentions_dict]
        return(mentions_text)
    
    # Initialize a list to hold all the tweepy Tweets
    
    alltweets = []
    
    # Specify relevant columns
    
    colnames = [
        'created_at', 
        'full_text', 
        'retweet_count', 
        'favorite_count', 
        'followers_count', 
        'location']
    
    try:
        
        # Make initial request for most recent tweets (200 is the maximum allowed count)
        
        new_tweets = api.user_timeline(screen_name = username, 
                                       count = 200, 
                                       tweet_mode = "extended")	
        
        # Save most recent tweets
        
        alltweets.extend(new_tweets)
        
        # Save the id of the oldest tweet less one
        
        oldest = alltweets[-1].id - 1
        
        # Keep grabbing tweets until there are no tweets left to grab
        
        while len(new_tweets) > 0:
            
            # All subsequent requests use the max_id param to prevent duplicates
            
            new_tweets = api.user_timeline(screen_name = username,
                                           count = 200,
                                           max_id = oldest,
                                           tweet_mode = 'extended')
            
            # Save most recent tweets
            
            alltweets.extend(new_tweets)
            oldest = alltweets[-1].id - 1
            
        # Convert output to pandas DataFrame
        
        outtweets = pd.DataFrame([tweet.__dict__ for tweet in alltweets])
        
        # Check whether tweet is retweet
        
        outtweets['is_retweet'] = outtweets['retweeted_status'].apply(is_retweet)
                
        # Retrieve other metrics
        
        outtweets['followers_count'] = [x.followers_count for x in outtweets['author']]
        outtweets['location'] = [x.location for x in outtweets['author']]
        outtweets = outtweets[~ outtweets['is_retweet']]
        outtweets = outtweets[colnames]
        
        # Add boolean column for availability
        
        outtweets.insert(0, 'available', True)
        
    except Exception:

        print('Data for user %s cannot be downloaded' %username)
        outtweets = pd.DataFrame(np.nan, index = [0], columns = colnames)
        outtweets.insert(0, 'available', False)
        
    # Add column with username
    
    outtweets.insert(0, 'username', username)
    return(outtweets)
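As an aside, tweepy also provides a Cursor helper that takes care of the max_id pagination internally; a minimal sketch of an equivalent download (using the api object set up below, and without the retweet filtering and error handling above):

# Sketch: let tweepy.Cursor handle the max_id pagination

def download_tweets_cursor(username):
    cursor = tweepy.Cursor(api.user_timeline,
                           screen_name = username,
                           count = 200,
                           tweet_mode = 'extended')
    return [status for status in cursor.items()]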

Again, we need to specify our API credentials, and again: PLEASE do not misuse them :)

In [56]:
my_keys = {
    'consumer_key': 'o0g3JVWSKzRYv9dQp2SEPdjXp',
    'consumer_secret': 'AyvUIFzB82w3ZetyTXf1PbHiSxK7CgdcJo0D5jfKAoFlUuP0iH',
    'access_token_key': '1302924762914660354-7ydX1jUVSnscL60hhl83biPGNVeQoH',
    'access_token_secret': '9NqtnWj2q8uLuQkLMWdamJyIEb56hlGJOVgrydzoakorT'}

# Set up access to API

auth = tweepy.OAuthHandler(my_keys['consumer_key'], my_keys['consumer_secret'])
auth.set_access_token(my_keys['access_token_key'], my_keys['access_token_secret'])
api = tweepy.API(auth, wait_on_rate_limit = True)
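Before launching the full download loop, it is worth checking that the credentials are accepted, e.g. via verify_credentials():

# Sanity check: fails if the credentials above are rejected

me = api.verify_credentials()
print('Authenticated as @%s' % me.screen_name)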

We create a list of all MPs with active accounts:

In [57]:
# Get names and usernames

names = abg_twitter_df['name_matching']
twitter_usernames = abg_twitter_df['username']
twitter_account = pd.concat([names, twitter_usernames], axis = 1)

mask = twitter_account.username.notnull()
twitter_account = twitter_account[mask]
twitter_account.reset_index(drop = True, inplace = True)
In [58]:
twitter_account.head()
Out[58]:
name_matching username
0 Michael Abercron mvabercron
1 Doris Achelwilm doris_achelwilm
2 Grigorios Aggelidis aggelidis_fdp
3 Gokay Akbulut akbulutgokay
4 Peter Altmaier peteraltmaier

Let the scraping begin:

In [59]:
# Download most recent tweets using tweepy (at most 3200 tweets per user)

tweepy_df = pd.DataFrame()

for username in twitter_account['username']:
    tweepy_df = pd.concat([tweepy_df, download_tweets_tweepy_mod(username)])
    
tweepy_df = twitter_account.merge(tweepy_df, on = 'username')

ctypes.windll.user32.MessageBoxW(0, "Twitter data successfully scraped", "Progress Report")
Data for user doris_achelwilm cannot be downloaded
Data for user EspendillerM cannot be downloaded
Data for user FechnerJohannes cannot be downloaded
Data for user friedhoff_afd cannot be downloaded
Data for user search cannot be downloaded
Data for user GabiHillerOhm cannot be downloaded
Data for user Karsten_Hilse cannot be downloaded
Data for user fjunge cannot be downloaded
Data for user AchimKessler cannot be downloaded
Data for user BabettesChefin cannot be downloaded
Data for user Team_GLoetzsch cannot be downloaded
Data for user HiltrudLotze cannot be downloaded
Data for user buerger2016 cannot be downloaded
Data for user thlutze cannot be downloaded
Data for user SiemtjeMoeller cannot be downloaded
Data for user joloulou cannot be downloaded
Data for user #! cannot be downloaded
Data for user martinrabanus cannot be downloaded
Data for user ManuelaRottmann cannot be downloaded
Data for user UdoSchiefner cannot be downloaded
Data for user #! cannot be downloaded
Data for user Wellenreuther cannot be downloaded
Data for user JensZimmermann1 cannot be downloaded
Out[59]:
1
In [75]:
# Inspect

tweepy_df.sample(n = 20)
Out[75]:
name_matching username available created_at full_text retweet_count favorite_count followers_count location
247609 Niema Movassat niemamovassat True 2021-01-18 08:45:39 @tfoederation 😪 0.0 2.0 25025.0
175494 Cansel Kiziltepe CanselK True 2020-09-25 14:30:36 @HelmutWehner84 @stgSPD @PhilippaSigl @michael... 0.0 6.0 7718.0 Berlin
202517 Renate Kunast renatekuenast True 2020-12-30 16:21:53 @alx_froehlich Die größte wirtschaftliche Bela... 0.0 5.0 68893.0 Berlin
353582 Dirk Spaniel dirkspaniel True 2019-06-22 06:54:39 @jreichelt @HeikoMaas @BILD Herr Reichelt, hab... 3.0 11.0 3997.0 Stuttgart, Deutschland
332649 Martin Schulz MartinSchulz True 2015-03-03 15:34:06 #Greece &amp; the #Eurozone have 4 months to b... 59.0 32.0 680913.0
296139 Bernd Riexinger b_riexinger True 2018-10-09 15:27:07 Noch nie wurden in #Bayern soviele Unterschrif... 34.0 92.0 39976.0 Stuttgart
168747 Uwe Kekeritz Uwekekeritz True 2013-10-01 12:08:58 @Osgyan Herzlichen Glückwunsch zur Wahl als st... 0.0 2.0 3167.0 Uffenheim / Berlin
159899 Oliver Kaczmarek KaczmarekOliver True 2013-05-10 16:42:01 The Broken Circle. (@ Camera) http://t.co/BXdb... 0.0 0.0 3526.0 Kamen
337441 Frank Schwabe FrankSchwabe True 2021-03-11 16:35:58 Wir reden morgen Klartext und zeigen Verbindun... 1.0 7.0 8074.0
80995 Marcus Faber marcusfaber True 2018-02-15 15:03:03 Next Stop: Buchhandel! 😂 https://t.co/USDWUqejHF 0.0 0.0 2685.0 Stendal, Sachsen-Anhalt
114740 Hermann Grohe groehe True 2013-01-20 09:40:18 Union und FDP legen zu, SPD und Grüne verliere... 12.0 2.0 49634.0 Berlin
139086 Torsten Herbst torstenherbst True 2012-07-12 10:10:14 Mehr Augenmaß und Rationalität - Wir brauchen ... 0.0 0.0 2461.0 Dresden
20779 Margarete Bause MargareteBause True 2017-09-11 21:15:34 @lenakoester Entscheidend ist für mich die Zwe... 0.0 3.0 7153.0 München & Berlin
71231 Katharina Droge katdro True 2020-09-27 17:04:41 So toll liebe Bondina Schulze! Herzlichen Glüc... 0.0 26.0 6645.0
287560 Stephan Protschka AfDProtschka True 2019-12-13 05:49:09 Guten Morgen #Deutschland ! Ist die #SPD noch ... 30.0 103.0 5930.0 Mamming, Deutschland
87769 Ulrich Freese ulifreese True 2018-06-14 07:59:27 Heute startet die Fußball WM 2018! Viel Glück ... 1.0 5.0 756.0 Wahlkreis Cottbus/Spree-Neiße
9533 Lisa Badum badulrichmartha True 2019-02-09 08:39:10 @DanielHofing Danke Euch!💪 0.0 2.0 4840.0 Forchheim
390951 Stephan Thomae stephanthomae True 2012-11-27 16:04:05 Glückwunsch an Wirtschaftsminister Martin Zeil... 0.0 0.0 3162.0 Kempten (Allgäu)
310765 Erwin Ruddel Erwin_Rueddel True 2020-10-21 12:28:55 Die 106. Sitzung des Gesundheitsausschusses fa... 0.0 5.0 4602.0 Windhagen, Kreis Neuwied
411024 Johann Wadephul jowadephul True 2020-09-05 10:47:25 Entscheidend ist, dass es eine geschlossene eu... 10.0 48.0 3772.0 Berlin, Deutschland

And that's it! We can now use our meta information and Twitter data as we please: merge them, analyze them, print them ... :)