Reverse Engineering Read Later Data from the Apple News App

As we navigate the digital world, we often come across articles we don't have time to read but still want to save for later. One way to do this is the Read Later feature in Apple News. But what if you want to access those articles outside the Apple News app, such as on a different device, or share them with someone who doesn't use Apple News? Or what if you want to automatically post links to those articles on your blog? That's where the nerd powers come in.


Reverse Engineering the Data

Initially, I reached out to Rhet Turnbull, the creator of the amazing osxphotos app/Python library that I use to extract the data from Apple Photos. I use that data to power the photo section of my site.

I asked Rhet if he had ever pulled this data from News. While I waited to hear back from him, I used lsof, which lists the files a process has open, to look for the file that Apple News uses to store Read Later articles. I discovered that Apple News uses a binary plist file located in a super obvious place:

/Users/eecue/Library/Containers/com.apple.news/Data/Library/Application Support/com.apple.news/com.apple.news.public-com.apple.news.private-production/reading-list

Simple and obvious, right?! After I found it, I noticed it was in a strange format that a normal binary plist parser couldn't understand. However, I was able to just run strings on the file and extract the Apple News article IDs. Each ID is the last part of an article's short link, which looks like this: https://apple.news/AbtWOAgVqToW62MeeZ1xkcQ.
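
The same idea can be sketched in pure Python, without shelling out to strings. This is only a sketch under assumptions: it assumes each saved article shows up in the file as a run of printable characters that starts with "rl-" and may end in "_". The function name extract_article_ids and the four-character minimum run length are illustrative choices, not anything defined by Apple News.

```python
import re

# Runs of four or more printable ASCII bytes, roughly what `strings` prints by default
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_article_ids(data):
    """Return unique article IDs found in raw reading-list bytes, in file order."""
    ids = []
    for match in PRINTABLE_RUN.finditer(data):
        text = match.group().decode("ascii")
        if not text.startswith("rl-"):
            continue
        article_id = text[3:]           # drop the "rl-" prefix
        if article_id.endswith("_"):    # drop a single trailing "_", if any
            article_id = article_id[:-1]
        if article_id and article_id not in ids:
            ids.append(article_id)
    return ids


# Made-up bytes standing in for the real file contents
sample = b"\x00\x01rl-AbtWOAgVqToW62MeeZ1xkcQ_\x00\x02junk\x03rl-AbtWOAgVqToW62MeeZ1xkcQ_\x00"
print(extract_article_ids(sample))  # ['AbtWOAgVqToW62MeeZ1xkcQ']
```

To try it on a real file, read the file with open(path, "rb").read() and pass the bytes in.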

I wrote a script that collects those IDs, fetches the apple.news page for each one, and then uses Beautiful Soup to extract the article data. It wasn’t perfect, but it did the job:

import subprocess
import requests
from bs4 import BeautifulSoup

# Run the `strings` command to extract the strings from the binary file
proc = subprocess.Popen(['strings', '/Users/eecue/Library/Containers/com.apple.news/Data/Library/Application Support/com.apple.news/com.apple.news.public-com.apple.news.private-production/reading-list'], stdout=subprocess.PIPE)

# Loop through the output and look for article IDs
article_ids = []
for line in proc.stdout:
    # Reading-list entries start with "rl-"; the article ID follows the prefix
    if line.startswith(b'rl-'):
        # Extract the article ID by removing the "rl-" prefix and, if present, a trailing "_"
        article_id = line.decode().strip()[3:]
        if article_id.endswith('_'):
            article_id = article_id[:-1]
        article_ids.append(article_id)

def extract_info_from_apple_news(news_id):
    # Construct the Apple News URL from the ID
    apple_news_url = f'https://apple.news/{news_id}'

    # Send a GET request to the Apple News URL and get the response HTML
    # (the timeout keeps one unresponsive request from hanging the whole run)
    response = requests.get(apple_news_url, timeout=10)
    html = response.text

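    # Use BeautifulSoup to extract the URL from the redirectToUrlAfterTimeout function
    soup = BeautifulSoup(html, 'html.parser')
    # `t` is None for <script> tags with no inline text, so guard before using `in`
    script_tag = soup.find('script', string=lambda t: t and 'redirectToUrlAfterTimeout' in t)
    url = None
    if script_tag:
        # Take the first double-quoted https:// string; find() returns -1 instead of raising
        script_text = script_tag.text
        url_start_index = script_text.find('"https://') + 1
        url_end_index = script_text.find('"', url_start_index)
        if url_start_index > 0 and url_end_index != -1:
            url = script_text[url_start_index:url_end_index]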

    # Extract the og:title, og:description, og:image, and author meta tags
    title_tag = soup.find('meta', property='og:title')
    if title_tag:
        title = title_tag['content']
    else:
        title = None

    description_tag = soup.find('meta', property='og:description')
    if description_tag:
        description = description_tag['content']
    else:
        description = None

    image_tag = soup.find('meta', property='og:image')
    if image_tag:
        image = image_tag['content']
    else:
        image = None

    author_tag = soup.find('meta', {'name': 'Author'})
    if author_tag:
        author = author_tag['content']
    else:
        author = None

    # Return the extracted information as a dictionary
    return {
        'url': url,
        'title': title,
        'description': description,
        'image': image,
        'author': author
    }


if __name__ == '__main__':

    for article_id in article_ids:

        # Call the extract_info_from_apple_news function with the provided ID
        extracted_info = extract_info_from_apple_news(article_id)

        # Print the extracted information
        print('Extracted information:')
        print('- URL:', extracted_info['url'])
        print('- Title:', extracted_info['title'])
        print('- Description:', extracted_info['description'])
        print('- Image:', extracted_info['image'])
        print('- Author:', extracted_info['author'])


I shared my findings with Rhet, and he came through big time.

First, he wrote a gist that hacked the binary plist out of the non-standard reading-list file. Not content to leave it as a gist, he then took it to the next level by expanding it into a Python CLI tool and library called apple-news-to-sqlite, which he posted on GitHub. In his own words, he:

"went hog wild and created an app to save the news articles to a sqlite database"
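
How Rhet's gist recovers the plist isn't shown in this post, so the following is not his code. It is a minimal sketch of one common way to read a binary plist that has extra bytes in front of it: search for the standard bplist00 header and hand everything from there on to Python's plistlib. It assumes the plist runs to the end of the file, which may not hold for the real reading-list file, and load_embedded_bplist is a hypothetical helper name.

```python
import plistlib

BPLIST_MAGIC = b"bplist00"  # header that starts every version-00 binary plist


def load_embedded_bplist(data):
    """Parse a binary plist that is preceded by unrelated bytes.

    Assumes the plist extends to the end of `data`, because plistlib
    locates the plist trailer relative to the end of the buffer.
    """
    start = data.find(BPLIST_MAGIC)
    if start == -1:
        raise ValueError("no binary plist header found")
    return plistlib.loads(data[start:], fmt=plistlib.FMT_BINARY)


# Demo on synthetic data: a real binary plist with a made-up wrapper in front
payload = plistlib.dumps({"example": ["a", "b"]}, fmt=plistlib.FMT_BINARY)
wrapped = b"\x0a\x10not-a-plist-header" + payload
print(load_embedded_bplist(wrapped))  # {'example': ['a', 'b']}
```

If the real file also has bytes after the plist, this approach would need to trim them first.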

Using the Python Tool

With the Python tool, you can easily extract the Read Later articles from Apple News and convert them into a SQLite database or just get a Python dictionary of the data. Then you can use the data however you like.
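
To make the "dictionary in, SQLite out" idea concrete, here is a minimal sketch using only Python's built-in sqlite3 module. It does not use apple-news-to-sqlite and makes no claim about that tool's API or schema: the save_articles function and the articles table are hypothetical, and the keys (url, title, description, image, author) are the ones returned by the scraping script earlier in this post.

```python
import sqlite3


def save_articles(db_path, articles):
    """Insert article dicts into a SQLite table and return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                """CREATE TABLE IF NOT EXISTS articles (
                       url TEXT PRIMARY KEY,
                       title TEXT, description TEXT, image TEXT, author TEXT
                   )"""
            )
            conn.executemany(
                """INSERT OR REPLACE INTO articles
                       (url, title, description, image, author)
                   VALUES (:url, :title, :description, :image, :author)""",
                # Skip entries whose original URL could not be resolved
                [a for a in articles if a.get("url")],
            )
        return conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    finally:
        conn.close()


# Made-up example data shaped like the scraping script's output
demo = [
    {"url": "https://example.com/story", "title": "A story",
     "description": None, "image": None, "author": "Someone"},
    {"url": "https://example.com/story", "title": "A story (updated)",
     "description": None, "image": None, "author": "Someone"},
    {"url": None, "title": "Unresolved", "description": None,
     "image": None, "author": None},
]
print(save_articles(":memory:", demo))  # 1
```

Using the URL as the primary key means saving the same article twice updates the row instead of duplicating it. Pass a file path instead of ":memory:" to keep the data between runs.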

I used the data to create the new Links of the Day sections that show up on my home page and in my RSS feed.

Conclusion

With nerd powers and some creativity, I was able to extract my Read Later articles from the Apple News app and reuse them elsewhere, including on this site. Rhet Turnbull expanded on my initial code and turned it into a Python CLI tool and library. With this tool, you can easily access your saved articles outside the Apple News app.

Have fun!

Tags

  • Apple News
  • SQLite
  • Python
  • Reverse Engineering
  • Technical Skills
  • Data Extraction

Metadata

Post date: Monday, March 13th, 2023 at 11:43:31 AM